Friday, December 19, 2008

Switching Gears

After years of suing thousands of people for allegedly stealing music via the Internet, the recording industry is set to drop its legal assault as it searches for more effective ways to combat online music piracy.
...
Instead, the Recording Industry Association of America said it plans to try an approach that relies on the cooperation of Internet-service providers. The trade group said it has hashed out preliminary agreements with major ISPs under which it will send an email to the provider when it finds a provider's customers making music available online for others to take.
Wall Street Journal

I definitely did not expect this so soon; progress in the courts toward stopping the RIAA's legal campaign is proceeding at a crawl, and the RIAA could probably have continued for a while longer before getting hit with serious legal penalties.

Wednesday, December 17, 2008

Graphics Programming Term Project

In case anybody is interested, here is the paper for my term project in graphics class. I was working on an implementation of it, but thanks to various brick-wall problems, that didn't end up getting completed in time (which is why the results section does not discuss the results of the actual implementation), although it did result in me going 38 hours without sleep. It's an interesting method, and at least a couple small parts of it are novel (I'm not aware of them having been proposed before), but it probably isn't practically useful, for the reasons explained in the paper.

Multi-Fragment Deferred Shading Using Index Rendering

& More Echoes

BahamutZero has informed me that Echoes of War is now available on iTunes+ (DRM-free 256 kbps AAC audio downloads) for $14.85 (the two CDs in Echoes of War are sold separately, totaling that price). I'm not aware of the actual CDs being available anywhere but Eminence, the creators. If you just want the audio files without the shipping (my Legendary Edition cost like $12 shipping), check out iTunes.

Tuesday, December 16, 2008

Intriguing

Well, just as school is almost over (finals are this week) and I don't have a job lined up yet, substantial amounts of amusement will be welcome in the near future (especially given how bleak the anime outlook is this season...). Well, as it turns out, I'm in luck! While banging my head against a wall (till I pass out) over a term project, something amusing happened. I don't have time to explain the details now (despite the fact that this is much more interesting than my school project), but here's a short headline of what's up and coming: Q vs. Scam Debt Collection Agency.

Look forward to it!

Sunday, November 23, 2008

& Other Things

I managed to forget something important in my last post, despite the fact that it's closely related to one of the things I did mention: Echoes of War is out. At least, it got shipped to me last week; few others seem to have gotten it yet, and as far as I know it hasn't even hit peer-to-peer networks.

Echoes of War is an orchestral remix/arrangement of music from all three of Blizzard's universes - Warcraft (III, World of Warcraft), Starcraft (I & II), and Diablo (I-III) - by Eminence. It's about one and a half hours of music, with several tracks from each game and each track being a medley of game pieces.

While some of them fairly closely follow the original sound, some are arranged in very novel and surprising ways. Two of the best examples of this are the big band jazz arrangement of the Starcraft I Terran music, and the crazy symphonic/operatic/Middle Eastern/The Rock arrangement of the Starcraft I Zerg music. (Other samples can be played from the Echoes of War media section.)

How much I like the tracks varies from track to track. Several of them I really like, although I'm noticeably less fond of the Diablo tracks than the Warcraft and Starcraft ones. But in any case, the album is awesome. If you like the music of Warcraft, Starcraft, and/or Diablo, buy it. I just wish the stupid thing were sold by stores that didn't charge you $14 for shipping...

Wednesday, November 19, 2008

Various Thingies

First of all, I should mention that my house is fine; the fire didn't get near it. The probability of it getting here was fairly low, but we did a bit of better-safe-than-sorry packing. Last night while driving home from school, though, I did drive past an (unrelated) fire that filled the entire intersection with smoke over about a 50-foot radius; I still don't know what was on fire (I couldn't see it), but the smoke was very obvious, and I heard fire trucks going by.

In other news, it's been relatively difficult to collect data on Firefox after reenabling the Feed Sidebar addon. Firefox crashed after three days of logging memory usage, and then a couple days later I needed to restart it because I needed the memory for WoW (Firefox was using about a gig). But the addon definitely seems like the cause of the memory leak. From the days I gathered data, it looks like it leaks about 40 megs/hour (although that's only over a couple days; the rate might decrease over time).

Finally, I just noticed something that happened last year: the Starcraft soundtrack, not previously available (the compressed audio shipped with the games is 22 kHz ADPCM, which is pretty poor quality), is on iTunes for $10; the other Blizzard OSTs that were included in the collectors' editions of Diablo 2, Warcraft 3, and World of Warcraft are also available there (though unfortunately all of them are single CDs, which means they are incomplete). The music is DRM-free (although I hear they encode personally identifying information in the audio files), 256 kbps AAC (good quality), though you will have to install the Apple iTunes crapware to buy it. I'm told the M4A files should play in all PC audio players that support AAC (I know they work in Winamp), though they are not MP3s, and will not work on MP3-only audio players. That's your public service announcement for today.

Saturday, November 15, 2008

Toasty

So, it's a blistering 91 degrees at 4% humidity, and southern California is burning once again. As has happened several times before, everything looks golden through the smoke filling the sky, and ash is accumulating on every outdoor surface. People working outside here are told to wear masks to cut down on the amount of smoke inhaled.

Currently several hundred homes have burned down and a few thousand have been evacuated. The fire isn't expected to get here (it's 10 miles away), but we're doing some preliminary packing in case things go badly and we have to evacuate. It's also possible that damage to distant power lines might cause us to lose power here (in a worst-case scenario), even if we don't have to evacuate.

Wednesday, November 12, 2008

& More Leakage

So, after writing that last post about the audio driver handle leak, I decided to log some data - specifically, the amount of memory Firefox allocates, and the number of handles in the Symantec Anti-Virus process smc.exe. It's now been about a week since I started gathering data (although unfortunately the power went out in the middle, so I ended up with two smaller replicates).
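
Just to make "logging data" concrete, here's a minimal sketch of the sort of thing I mean - not the script I actually used - which polls a process's handle count with GetProcessHandleCount. The target process name, sampling interval, and output format are arbitrary.

// Minimal sketch: periodically log the handle count of a named process.
// The target name and 15-minute interval are illustrative assumptions.
#include <windows.h>
#include <tlhelp32.h>
#include <wchar.h>
#include <cstdio>

static DWORD FindProcessId(const wchar_t* name) {
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE) return 0;
    PROCESSENTRY32W pe = { sizeof(pe) };
    DWORD pid = 0;
    if (Process32FirstW(snap, &pe)) {
        do {
            if (_wcsicmp(pe.szExeFile, name) == 0) { pid = pe.th32ProcessID; break; }
        } while (Process32NextW(snap, &pe));
    }
    CloseHandle(snap);
    return pid;
}

int main() {
    for (;;) {
        DWORD pid = FindProcessId(L"smc.exe");
        if (pid != 0) {
            HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
            if (proc) {
                DWORD handles = 0;
                if (GetProcessHandleCount(proc, &handles))
                    printf("%lu ms: %lu handles\n", GetTickCount(), handles);
                CloseHandle(proc);
            }
        }
        Sleep(15 * 60 * 1000);  // sample every 15 minutes
    }
}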

The data for smc.exe shows that it begins at approximately 450 handles on startup, and acquires an additional 3100 handles per day (although a 'day' here is about 14 hours, as I hibernate my computer at night; that works out to about 220 handles/hour). This definitely doesn't seem normal, and I'm going to venture a guess that it's a handle leak. I also noted that the increase seems to be linear over the course of the day, so it is unlikely to be related to something like automatic updates.

I already knew that Firefox was hemorrhaging memory. If I recall correctly, the amount of memory allocated by Firefox increased by 200-300 MB per day. This time, I tried using Firefox for several days without two of the three addons I normally use (the third was NoScript, so I didn't want to try without that unless I had to). While this test didn't last as long as I'd hoped (thanks to that power outage), after four days, Firefox had only increased from 125 MB (when I first started it, with a lot of saved tabs) to 205 MB (now). At the old rate, I would have predicted it to hit 600-900 megs in four days.

This strongly suggests that one of the two addons is responsible for the massive leakage, although I'll have to watch what happens after I reenable the one most likely to be causing the leak (as the other is newly installed, and this problem has been around for longer): Feed Sidebar (version 3.1.6). So, we'll see what happens with that. I might have an answer about that in another 4-7 days.

Tuesday, November 04, 2008

& the Audio Driver Incident

Several months ago, I (finally) upgraded my computer. My old one was a 1.8 GHz Athlon XP (single 32-bit core) with 1.25 gigs of RAM and a GeForce 3; in other words, it was 2002 or 2003 hardware. My new computer is a 2.4 GHz Core 2 (quad 64-bit cores) with 4 gigs of RAM and a Radeon 4850; depending on the benchmark, my new CPU is 10-18x as fast as my old one, if you count all 4 cores. After trying various voodoo to get my old XP installation to run on my new computer (despite the fact that it wouldn't have been able to use about a gig of my RAM), I ultimately gave up and installed Windows Server 2008 64-bit. After dealing with a whole bunch of problems getting various stuff working with 64-bit 2008, things ultimately ended up being acceptable, and I've used that ever since.

However, a couple relatively minor problems have been pretty long-standing, and continued until a few days ago. One was easy to diagnose: Firefox was leaking memory like heck. For every day I left my computer on, Firefox would grow in RAM usage by a couple hundred megs, getting up to a good 2 gigs on occasion (I usually kill it before it gets to that point). While this was certainly an annoyance, it wasn't much of a problem, as I have 4 gigs memory, and I can simply restart it to reclaim all the leaked memory whenever it gets so large it becomes a problem.

One was much harder to diagnose, however. Something else was leaking memory in addition to Firefox, and it was not clear what was causing it. Total system memory usage would increase over days, and, ignoring Firefox, would end up consuming all of my 4 gigs of memory within about 2 weeks of the last reboot. Unlike with Firefox, there was no apparent culprit - no single process was showing a significant accumulation of memory, nor were excess processes being created, leaving 1-2 gigs of memory I couldn't account for. So, I went several months without knowing what the problem was, usually handling it by restarting my computer every week or so.

Then, one day my dad called me from work to ask me why his computer at work was sometimes performing poorly. So I had him look through the process list and system statistics and look for memory leaks, excessive CPU usage, etc. As I don't have the exact terminology used on those pages memorized, I also opened up the listing on my computer to be sure I told him to look for the right things.

This brought something very curious to my attention: the total handle count for my computer was over 4 million. This is a VERY large number of handles; normally computers don't have more than 20-50k handles at a time - 2 orders of magnitude less than what my computer was experiencing. This was an almost certain indication that something was leaking handles on a massive scale. After adding the handles column to the process list, I found that audiodg.exe was the process with some 98% of those handles. Some looking online revealed that that process is a host for audio driver components and DRM. Some further looking for audiodg.exe and handle leaks found some reverse-engineering by one person that showed that this was due to the Sonic Focus audio driver for my Asus motherboard leaking registry handles.

Fortunately, there was an updated driver available by this time that addressed the issue. As my computer was at 96% RAM usage at the time (the worst it's ever been - usually I reboot it before it gets to that point), I immediately installed the driver and restarted the audio services (of which audiodg.exe is one). This resulted in a shocking, instant 1.3 gig drop in kernel memory usage, to less than 400 megs total. It's been one and a half days since then, and audiodg.exe is currently using 226 handles, suggesting that the problem is either dead or drastically reduced (it has increased by like 70 handles in those 1.5 days); and even if it is still leaking handles, 50 handles a day is a tolerable leakage, as that's only like 10 k/day.

So, this whole thing revealed that Windows is quite robust. Given that most computers never go above 50k handles, I was very surprised that Windows was able to handle 6.6 million handles (the highest I've ever seen it get to) without falling over and dying (although this wouldn't have been possible with a 32-bit version of Windows, as that 1.7 gigs of kernel memory wouldn't have fit in the 2 gig kernel address space after memory-mapped devices have memory space allocated). Traditionally, Unix has had a limit of 1024 file handles per process, though I don't know what's typical these days (I know at least some Unix OSes have that as a configurable option).

After pursuing that problem to its conclusion, I decided to do some more looking for handle leaks in other processes. While the average process used only 200-500 handles, a number of processes get as high as 2k handles (which is not abnormally high). However, one process - smc.exe, a part of Symantec Antivirus - has almost 50k handles allocated, making it a good candidate for a handle leak. Looking at the process in Process Explorer shows that a good 95% of these handles are of the same type - specifically, unnamed event handles - providing further evidence of handle leakage. That's as far as I've gotten so far; I haven't spent much time investigating the problem, or looking for an analysis online (though the brief searches I did didn't find anything related to this). So, that's work for the future.
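
If you wanted to do that kind of survey programmatically rather than eyeballing Process Explorer, a minimal sketch of the idea would look something like this (the names and the 2,000-handle cutoff are arbitrary illustrations, not what I actually ran):

// Minimal sketch: flag processes with unusually many handles.
// The 2,000-handle threshold is an arbitrary illustrative cutoff.
#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main() {
    DWORD pids[1024], bytes = 0;
    if (!EnumProcesses(pids, sizeof(pids), &bytes)) return 1;
    DWORD count = bytes / sizeof(DWORD);
    for (DWORD i = 0; i < count; ++i) {
        HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pids[i]);
        if (!proc) continue;                        // skip processes we can't open
        DWORD handles = 0;
        wchar_t name[MAX_PATH] = L"<unknown>";
        GetModuleBaseNameW(proc, NULL, name, MAX_PATH);
        if (GetProcessHandleCount(proc, &handles) && handles > 2000)
            wprintf(L"%ls (pid %lu): %lu handles\n", name, pids[i], handles);
        CloseHandle(proc);
    }
    return 0;
}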

Thursday, September 04, 2008

Final Lap

Well, it's currently the second week of the fall university semester. This semester is extra special because it's the last semester before I graduate, and anything that results in less school is always a good thing. Between my two majors - biology and computer science - I've been in college for entirely too long, and I'm hoping to move on to more enjoyable things after Christmas.

Going into this semester, I only have 9 units left for graduation, including two specific classes. First, I need a "modern high-level language" class. This comes in three flavors: Visual Basic (probably .NET), C#, and Java. As I don't see much of a use for VB, it's a toss-up between C# and Java. I already have a decent amount of experience in C#, which means that Java would be the course I could gain the most from (as it would add another entry to my resume). Unfortunately, the only Java class this semester is at 8 AM, which is a bit (*cough*) too early for me. Thus C# wins by default. While I probably won't learn a great deal, it has the advantage of requiring less effort, which is also always a good thing.

The other specifically required class is Programming Languages and Translation. This is a recent class which merges two previous classes, one on high-level languages (a survey of like a dozen languages, and the various ways high-level languages accomplish common tasks) and the other on compiler development. I had actually taken the former of the previous classes, but they merged the two before I could take the latter, forcing me to take this new one instead. On the plus side, this also means I'll have to put in less effort in this course as well, and I probably won't have to study (in my case this means 'read the textbook and come to class') for much of the first half of the semester.

One of the things we'll be doing in the class over the semester is writing our own compiler. I've already got some ideas for a high-level programming language which closely resembles natural (i.e. spoken) English, intended for use by people who are not computer science or math people. I ought to discuss some of the ideas for this on this blog; we'll see what my infamous laziness permits.

Unfortunately, the class I really wanted to take this semester isn't being offered - the Game Programming Development Project. It would have been awesome to have to spend a semester working on E Terra (Gord knows I'm too lazy to work on it when I don't have to) and get three units credit for it.

So, that left me needing to find another class. This semester is actually pretty bleak, as far as which courses are being offered and when. While there are maybe five other classes I wouldn't have minded taking if nothing else was available, essentially none of them are offered this semester (and those that are either have prerequisites I don't have or are at extremely inconvenient times). So, I was forced to improvise - by looking into the list of graduate classes. As it turns out, my school allows undergraduate students to take graduate classes with permission from the department, although you can get kicked out if there are more graduate students than spots in the class.

One class was at a convenient time, covered something useful to me, and only required courses I'd already taken: Advanced Graphics Programming. Unlike BahamutZero, I can't really say I especially like or get excited by graphics and graphics programming, but clearly a thorough knowledge of graphics is a big plus for game programming; as well, I hadn't had any trouble in the undergraduate graphics class, so I can at least get the job done. Unfortunately, the syllabus doesn't look as applicable to game programming as the course description suggests, but hopefully it'll end up being worthwhile (and hopefully graduate-level homework and projects won't be too painful).

One thing that may turn out to be fun is the term project. The teacher hasn't actually given out the assignment (which would have a few dozen example topics), but as I understand it, we can do just about anything, as long as it's related to graphics and is sufficiently ambitious for a graduate-level class. When I mentioned all this to BZ (who loves graphics stuff, and would probably take a graduate-level graphics course if he had the chance), he immediately asked if we could do a project together (although I'd thought the same thing even before he asked). As it turns out, we can (I talked to the teacher), provided the project is large enough for us both, and our work is sufficiently separated that the teacher can grade my part of the work on its own. So, this could turn out to be fun. I'll probably write about at least the topic (when we come up with something), if not details along the way (and if BZ is working with me, he may post about it on his blog as well).

Also, just to briefly mention a topic I should (as in ought to) write about in the near future: the first programming assignment in graphics class - that is, a rudimentary ray-tracer. This actually isn't very difficult. Writing a simple ray-tracer that can render simple things (e.g. plastic-looking spheres) is pretty easy; it's making it fast and photo-realistic that's hard - but neither of those are requirements for this project. I estimate it'll take two or three days of coding, and we have two weeks to do it (though I have a bunch of relatively easy optional features I want to add, so it will probably take me longer than the others in the class), which isn't bad - not unlike what I'd expected from a graduate-level class.
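
To illustrate why the basic version is easy, here's a minimal sketch of the single most important piece of a simple ray-tracer - ray-sphere intersection. This isn't code from the assignment; the types and names are just illustrative.

// Minimal sketch of the core of a simple ray tracer: ray-sphere intersection.
// Types and names here are illustrative, not from my actual assignment.
#include <cmath>

struct Vec3 {
    float x, y, z;
    Vec3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
    Vec3 operator-(const Vec3& o) const { return Vec3(x - o.x, y - o.y, z - o.z); }
    float dot(const Vec3& o) const { return x * o.x + y * o.y + z * o.z; }
};

// Returns true (and the nearest non-negative hit distance t) if a ray with
// origin o and unit direction d hits a sphere with center c and radius r.
bool IntersectSphere(const Vec3& o, const Vec3& d, const Vec3& c, float r, float& t) {
    Vec3 oc = o - c;
    float b = oc.dot(d);                     // half of the quadratic's 'b' term
    float disc = b * b - (oc.dot(oc) - r * r);
    if (disc < 0.0f) return false;           // ray misses the sphere entirely
    float s = std::sqrt(disc);
    t = -b - s;                              // near intersection
    if (t < 0.0f) t = -b + s;                // origin inside sphere; use far hit
    return t >= 0.0f;
}

Fire one of these rays per pixel, shade the nearest hit with a simple lighting model, and you've got your plastic-looking spheres; everything past that (speed, photo-realism) is where the real work is.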

Sunday, August 24, 2008

& More MPQDraft

Well, looking at the SourceForge statistics, a few interesting things became apparent. First, there's still a remarkable amount of interest in MPQDraft, even after all these years. MPQDraft, released some 7 years ago, was originally targeted at Starcraft, which was released in 1997 - 11 years ago; while this has remained the primary target of MPQDraft and modders, MPQDraft has also been used to a lesser extent with Diablo II (8 or 9 years old, I think) and Warcraft III (7 years old, I believe). Given the fact that the most recent of those is still 7 years old, it's pretty surprising that MPQDraft is still heavily used today. As the graph indicates, over the last year, MPQDraft has been getting an average of 600 downloads/month, and trending upward, with 750 downloads last month.

I'm a bit curious what happened in November that produced such a huge rise in download count - about a 6x increase from September to November. The first thing I thought of was the announcement of Starcraft II, last year; however, this was quickly found to be incorrect, as SC2 was announced 6 months prior to that. So I'm really not sure what caused that increase. I can only imagine some very large site related to Blizzard games (Blizzard themselves or one of the major modding sites) linked to MPQDraft's recent home on SourceForge at that time.

The other surprise is the seeming lack of interest in the source code. According to the SourceForge statistics, there have been fewer than 50 source gets to date. I'm curious whether that's at all related to the fact that I'd only made the source available through Subversion (the source version control system I use for MPQDraft). To test that hypothesis, I've posted a package containing all the source on the SourceForge download page. You'll need to take a look at the notes (separate from the ZIP file) for what you need to get the code to build.

Thursday, August 21, 2008

& MPQDraft

Well, it's been a while. After I registered the MPQDraft project on SourceForge last April 1, I procrastinated for so many months that many people suspected it was an April Fools' prank; however, the only prank intended was the fact that it wasn't a prank. A year after that, I finally got around to posting the complete patching code, although I'm only just now getting the last of the GUI code uploaded. This is to say that the code on SourceForge is now complete, and MPQDraft is now fully open-source.

Relevant links:
The MPQDraft project
Binary download page
Web-based source code browser
Instructions on downloading the source code
The OSS license it's licensed under

Sunday, August 10, 2008

Gah

Random fact of the day: the Microsoft Developer Network library no longer says which versions of Windows prior to 2000 support a given function. For example, MSDN does not list support for CreateFile in any version of Windows prior to 2000, despite it being in every single 32-bit version of Windows (Windows 95 and NT 3.1 onward).

Wednesday, July 30, 2008

Comcast Revisited

A bit ago I wrote a Mac Chaos-inspired parody of the Comcast board room, explaining how they'd gotten into the P2P bandwidth crisis they're in, and proposing that the true cause of the crisis is that they'd signed up massively more users than their old infrastructure could handle, while steadily increasing the speeds offered to subscribers. A post on P2PNet today lends strong support to that conclusion.
There really is a problem on (at least some) cable upstreams today, based on what I hear from people I respect who have the data. My hope - which won’t be tested until 2009 - is that the DOCSIS 3.0 upstream will resolve most or all of the problems for the next few years. Full DOCSIS 3.0 has a minimum of 120 megabits upstream (shared) among typically 300 homes, something like 400K per subscriber. Current cable modems typically have 8 to 30K per subscriber. This is a huge difference.

While those 'K's don't indicate whether they're kilobits or kilobytes, a bit of quick math (120 Mbps shared among 300 homes is 400 kbps each) tells us they're kilobits. In other words, Comcast currently allocates a minimum of 8 to 30 kbps - 1 to 4 KBps - per subscriber. As well, IIRC, Comcast sells 384 to 768 Kbps upstream connections. That puts the overselling ratio somewhere between 13 (384/30) and 100 (768/8).

Another section is also interesting, for comparison with DSL and FIOS:
Verizon, AT&T, and Free.fr are strongly on the record they do not have significant congestion problems. They do not have a shared local loop, and have allocated much more bandwidth per customer. I’ve sat at the network console of a large ISP and seen they had essentially no congestion amongst the millions of subscribers they served. They allocate 250K per subscriber, much more than current cable.

It's not clear who these figures are for. I believe AT&T DSL doesn't offer more than like 768 Kbps upstream, in which case this would be an overselling ratio of 3. If this is Verizon FIOS (let's say at 5 Mbps, which is their faster speed), that's an overselling ratio of 20. Suddenly it seems very unsurprising that Comcast is having problems and AT&T/Verizon are not. It also shows you who's been investing in their network over the last decade and who hasn't.

Tuesday, July 29, 2008

& Other Strange Occurrences

As it turns out, the earthquake this morning was actually the second highly unusual thing to happen today. The first occurred late last night, as I was going into the bathroom one last time before going to sleep. A couple steps into the bathroom and I stepped into something wet. While having the floor of your bathroom wet for no apparent reason is unusual enough, I was more concerned with the smell: a faint smell of ammonia, and another smell I knew I had encountered before, though I couldn't think exactly what it was.

After I confirmed that the liquid was what was producing the smell, I hobbled back to my room (trying to avoid getting whatever it was on the floor as much as possible) to grab my glasses, and had another look. A fair amount of the floor was wet with several ounces or so of a liquid that was in some places clear, in other places milky white.

After tracing it under the sink, I found what seemed to be the cause: a can of insecticide. The entire can was wet, though not much else was except the area right around it, so it didn't look like an explosion (though there sure was a lot of the stuff on the floor). I didn't try removing the cap (for obvious reasons), but I'm thinking the spray nozzle might have burst and the cap prevented the stuff from getting all over the cabinet under the sink. Ultimately, I wrapped it in a couple plastic bags and threw it in the trash, and wiped up all the stuff on the floor (it probably wouldn't hurt to mop the floor with soap and water, either).

& Shaking

We just had an earthquake here, about an hour ago. The epicenter was 15 or 20 miles from here, and it was a 5.8, which is a pretty good-sized earthquake. Having watched TV news for 20 mins or so, I haven't seen any reports of injuries, though cell phones and (less commonly) land lines are still out in some areas; I've heard some reports of damage to streets and one water line. Amusingly, lots of people e-mailed the news station to say that their phone lines or cell phones are out, so apparently internet connectivity wasn't much affected. I heard that there's a 5% chance of it being a foreshock to an even bigger earthquake.

More info on Yahoo and CNN.

Saturday, July 12, 2008

Epic Fail

So, on Friday I got a new computer. The computer consists of a quad-core Core 2 CPU, 4 gigs of memory, and a Radeon HD 4850 based video card. Although there are some known techniques for getting an existing Windows installation to work in a new computer, this install simply refused to work with the USB ports on this computer (the computer freezes up several seconds after Windows has booted; disabling the USB ports in the BIOS allows it to work, but is not an acceptable solution). So, I ultimately ended up reinstalling Windows.

I had quite a few options when it came to choosing a version of Windows. Thanks to my obsessive downloading of everything on MSDN Academic Alliance, I have legal copies of Windows 2000, Windows XP x86, Windows XP x64, Vista x86 & x64, two copies of Windows Server 2003, and Windows Server 2008 x86 & x64. For those not familiar with the Servers, 2003 is an updated server version of XP, and 2008 is an updated server version of Vista.

As Server 2008 is an updated version of Vista with additional features (and the newest of any version), I figured I'd use that, and that's what I'm writing on right now. However, this install may be short-lived. As it turns out, just about nothing works on Server 2008. In the last three hours I've encountered the following:
- The Asus motherboard driver installer for Vista x64 will not run. When run, it says "Does not support this Operating System: WNT_6.0I_64". If I understand this correctly, it's saying it doesn't support Windows NT 6.0 x64. This is curious, as that is exactly what Vista x64 is, suggesting that the installer does not run on the very system it was made for. Furthermore, several pieces of motherboard hardware do not have drivers included with Server 2008, and so appear as Unknown Devices and PCI Devices (and there are still a couple unknown devices left even after manually installing each driver). Epic Asus fail.
- The other major driver I needed was the 4850 driver. This was especially important because the 4850 has a known issue where the fan speed stays too low, resulting in high temperatures. So, I downloaded the latest version of the drivers and the ATI Catalyst programs from the video card manufacturer (as best I can tell, the ATI web site doesn't list drivers for the 4850) and installed the driver and program. Installation had no problems; running the Catalyst Control Center, however, resulted in the message "The Catalyst Control Center is not supported by the driver version of your enabled graphics adapter." Very curious, considering that the driver and the Control Center came bundled in the same ZIP file. Epic ATI fail.
- One of the programs I use most of all (by far) is Windows Live Messenger. Naturally I soon needed to install it on this computer. The Windows installation even helpfully created a Windows Live Messenger Download link in my start menu. Unfortunately, following the link, downloading the program, and double-clicking it (I'm not even mentioning the UAC and IE annoyances) brought up the error message "Sorry, Windows Live programs cannot be installed on Windows Server, Windows XP Professional x64 Edition, or Windows operating systems earlier than Windows XP Service Pack 2". By process of elimination, this appears to say that it only supports XP x86 SP2+, Vista x86, and Vista x64; curious, given the fact that Microsoft advertises support for Server 2008. Epic Microsoft fail.
- The other program I use most often is Firefox. So, that was next on the list. Download, install, so far so good. Launching Firefox, however, is a completely different story: instant crash. Epic Firefox fail.
- And just for good measure, this install has blue-screened once so far (in about 3 hours), with the PAGE_FAULT_IN_NONPAGED_AREA bugcheck. I'm not sure exactly whose failure this is, but the Asus driver problems seem the most likely suspect. Epic fail.

Wednesday, July 09, 2008

& Fun with Turkish

Just saw this amusing segment in the Wikipedia page for Turkish grammar:

Avrupa (Europe)
Avrupalı (European)
Avrupalılaş (become European)
Avrupalılaştır (Europeanize)
Avrupalılaştırama (cannot Europeanize)
Avrupalılaştıramadık (whom [someone] could not Europeanize)
Avrupalılaştıramadıklar (those whom [someone] could not Europeanize)
Avrupalılaştıramadıklarımız (those whom we could not Europeanize)
Avrupalılaştıramadıklarımızdan (one of those whom we could not Europeanize)
Avrupalılaştıramadıklarımızdan mı? (one of those whom we could not Europeanize?)
Avrupalılaştıramadıklarımızdan mısınız? (Are you one of those whom we could not Europeanize?)

You now know the meaning of 'highly agglutinative language'.

Friday, June 27, 2008

Sansas & Bugs

Given how into music I am (particularly game, anime, and movie soundtracks), it'll probably come as a complete shock to most people that I've never had a portable CD or MP3 player (other than the CD player in my car). Probably the biggest reason for this is that I'm cheap - I save most of the money I make, and spend very little of it, even on things you'd expect me to buy (like a computer that's less than 6 years old). Well, yesterday I finally bought a digital audio player: the 2 gig SanDisk Sansa c250, on sale at a price I couldn't refuse (cheaper than Amazon).

So, I spent some time playing with it yesterday, in preparation for today, when I drive my grandma to a doctor's appointment and various errands (she's had severe eye problems for the last couple months). Not a bad little sucker; though, just as you might guess from the price, it didn't take long to run into problems. Naturally, as I'm too impatient to call tech support, and too inquisitive to give up on a technical challenge, this meant I had to debug the thing.

After loading almost 2 gigs of music onto it and disconnecting from the computer, it proceeded to promptly lock up on database refresh (after you modify the contents of the flash memory it scans all the files and indexes them). Wonderful. I could turn it off and on, but every time it turned on it immediately performed a database refresh, and promptly locked up. Worse, it would no longer connect to the computer, as the database refresh preempted other things, like USB port communication, meaning I couldn't delete anything that might be causing it to freeze (specifically, if you plugged it into the USB port while it was performing the database refresh, Windows would say "unrecognized USB device" after a couple seconds).

A substantial amount of experimentation revealed that it was possible to override this. Specifically, you have to have the computer send a USB signal to the device BEFORE it starts its database refresh. As the database refresh is the first thing it does when you turn it on, and plugging the USB cable in automatically turns the device on, this takes rather precise timing, and more or less requires pressing the button that makes it connect in mass storage mode*, inserting the USB cable, and pressing "Scan for hardware changes" in Device Manager at essentially the same time (within about 1/3 of a second, I'd say). This will cause the USB signal from the computer to preempt the scheduled database refresh, and put the player into USB storage mode.

Now that I was able to access the contents again, I spent some time fumbling around with trial and error, trying to figure out what was causing it to break; as it was 1 AM by this point, my brain wasn't in peak working condition, and this took some time. Searches on Google revealed that quite a few people had this problem and there are quite a few hypotheses as to what causes it and how to fix it, but no definitive explanation or solution (nor has Sandisk addressed this problem, despite people asking for help on their forums). As well, many of the "solutions" involved wiping the memory of the thing, and sometimes bricking it.

Through trial and error, I managed to burn through a number of hypotheses (which were either incorrect or simply not applicable to me). It appeared to be false that spaces in directory and file names caused lockups (or that bug only occurred in older versions of the firmware). I also did not observe any instances of odd characters in song titles or artist names causing this problem; to my surprise, the device even correctly handled and displayed the Japanese characters in some song and artist names (when I first opened the package, I tried copying a single album onto it, which worked without incident; that album happened to have Japanese ID3 info). Lack of free space did not appear to cause it (I tried taking it down to 2 megs of free space with good files, and it still worked fine). ID3v1 tags seemed to work fine. Even one funky MP3 at "0 kbps" (what Explorer reports for it; I haven't looked at it with a hex editor to figure out why) did not cause the problem.

What ultimately ended up being the problem, at least in my case, was that one of my game soundtrack MP3s was mislabeled as 'hard rock'. The significance of this, according to one person, is that it has a space in the genre name. Changing it to the proper genre fixed the freeze. I can't say for certain that the space in the genre is what causes the bug, but it's true that the player works fine when none of my songs have one, and it froze in the one case where a song did.

*The Sansa has two USB connection modes: MTP and MSC. MTP mode interfaces with media players such as Windows Media Player. This mode allows you to store media library files on the player, and make use of various features like tagging and playlists. MSC mode causes the player to act like a vanilla memory stick, allowing you to directly access the flash file system. I'd imagine it's only necessary to refresh the database in MSC mode; that's the only mode I've ever used.

Judging from Google, there are two different methods of switching between modes, which depend on what firmware you have. One method is that a USB mode option appears in the settings menu on the device. The other method (what mine has) is that the player is always in MTP mode, but connects in MSC mode if you hold the rewind button when you plug it into the USB port.

UPDATE:

Found another bug while playing around with putting DRMed WMAs on the critter (my dad also got one, and he has a bunch of DRMed WMAs to put on it, unlike my MP3s). It's only possible to load DRMed files onto the device in MTP mode, so I had to learn how to use that. It appears that my assumption was correct, that database refreshes are only necessary after adding files in MSC mode; after files are added in MTP mode, they appear in the player immediately after the player is disconnected from the computer.

While the player automatically turns on and goes into USB storage mode when you plug the USB cable in, it's possible to turn off the player by holding the power button (the same way you turn it off when it's not connected to the computer) while in USB storage mode. This is not a good idea. If you add some files to the device and then turn it off before unplugging it, it will lose track of those files, and they will not show up in the list of songs on the player (though they will still show up in the file list when it's connected to the computer in MTP mode). Adding more files later will not correct the problem; it is necessary to delete the files from the player and then transfer them from the computer again.

Tuesday, June 24, 2008

Random Thought of the Day

Did you ever notice that, in English, the simple past (e.g. "he wrote") and past progressive (e.g. "he was writing") are both very common, yet in the present tense, the present progressive (e.g. "he is writing") is overwhelmingly more common than the simple present (e.g. "he writes")? This fact actually leads into an important linguistic principle, which I'll probably write a post about in the future. I'll just leave it as food for thought, for now.

Monday, June 16, 2008

Cases, Ergative, & Accusative

Something that I vaguely implied previously, but I don't think actually said, was that there is a difference between roles and cases (even worse, there are multiple things that "role" could refer to). Roles are, in theory, purely rational, language-independent categories which describe how nouns relate to their clause's verb. Cases, on the other hand, are language-dependent categories representing many things, and there is rarely (if ever) a 1:1 mapping of the two for a language.

The Grammar of Discourse hypothesizes at least ten universal roles, which I'll only briefly describe.
Experiencer: the person experiencing an emotion or sensation
Patient: the one an action acts on
Agent: the one willfully performing an action
Range: an extension of the verb, such as indicating how, e.g. "Your blood smells good"
Measure: an extension of the verb indicating how much, e.g. "I was only bitten a little bit" (these examples brought to you by Vampire Knight)
Instrument: something which is used to perform an action; this can also be used for animate entities who unintentionally perform an action
Locative: the location an action occurs at
Source: the starting point of some kind of movement or transfer
Goal: the ending point of movement or transfer
Path: the path taken during movement or transfer

If we were to compare this list of roles with typical use of the Latin cases, we would get the following. Note that this list is approximate, and some of the roles like measure and range I'm not even sure how to represent in Latin.
Nominative case: agent, patient, experiencer, instrument
Genitive: unrelated to role in the sentence (roles refer to relation with the verb, not with other nouns)
Dative: goal, patient
Accusative: patient, experiencer, goal, rarely source
Ablative: source, instrument, locative, goal, path, possibly range and measure (some of those requiring prepositions)
Locative (rare): locative
Vocative: not related to role

However, while case is language-specific, some themes (common cases) occur much more often than others. Of the Latin cases, the nominative, genitive, dative, and accusative occur very frequently in all languages; this is not surprising, as these seem the most essential to language in general (though note that they are not guaranteed to mean exactly the same thing in all languages).

The nominative case is roughly defined as the subject of the verb. For transitive verbs having a direct object, the subject is the one performing the action (e.g. "He poked her"); for intransitive verbs the subject is the single argument (e.g. "He was hit"). The accusative case is the object of transitive verbs. Any language having this structure is called a nominative-accusative (or sometimes just accusative) language (which we're going to call N/A in the rest of this post).

However, two others - the ergative and the absolutive - also occur very commonly in languages. The ergative case is defined as the subject of transitive verbs. The absolutive case, however, includes both the subject of intransitive verbs and the object of transitive verbs. Languages using this system are called ergative-absolutive (or sometimes just ergative; E/A, here).

At first this seems very strange and arbitrary - splitting the subject depending on whether the verb is transitive or intransitive. However, this is due to the fact that we don't speak an ergative language. In fact, even the word 'subject' reflects this bias in thinking. The N/A split carries the paradigm that all actions are done by somebody/something, regardless of whether the action is intentional or unintentional, or even whether there's anyone performing the action at all (e.g. in "He fell"). This is called the subject, and for transitive verbs, the one acted on is called the object; thus the N/A split actually corresponds to a subject/object division.

However, we get a different picture if we discard this assumption and look at things from the perspective of roles. In reality, with many intransitive verbs (such as the one shown above) the "subject" is not the one doing the action at all, but rather the one who is subjected to the action - the patient. Thus the E/A split is based on the paradigm that the ergative case is the doer (agent or instrument) of the action, while the absolutive case is the patient of the action - an agent/patient separation. Taking it one step further, some E/A languages even require that the ergative argument commit the action intentionally, and use a different sentence structure to indicate otherwise (e.g. split-intransitivity languages use either the ergative or absolutive case for the subject of intransitive verbs, depending on whether the action is intentional or not; others use the passive voice for unintentional actions; etc.).

Given this, both seem equally sensible, and the choice itself now seems arbitrary. It's worth noting, also, that most languages in the world are either N/A or E/A. Languages using other systems are rare, which might suggest that the N/A and E/A splits are more sensible and/or useful than other methods. But hold onto that thought.

Thursday, June 12, 2008

Case & Other Cases

One thing necessary in all languages is that the nouns in a sentence that play various roles/cases must be identifiable. While the exact amount of precision varies by language and by sentence structure (there may be more than one way to say something, or only certain structures may be used in certain cases), all languages have a way to indicate the subject, direct object, etc. (although of course the exact set of roles that exists varies by language, as well). As far as I'm aware, there are three methods of accomplishing this: dependent-marking, head-marking, and analysis (note that none of these terms refers exclusively to role; I'm merely discussing them in this one specific context).

Let's start with the easy one: analysis. This is the method English uses for its core roles: subject, direct object, and sometimes the indirect object. As I pointed out in The Decline of the English Language, Modern English has a fairly rigid word order for its core roles: Subject Verb [IndirectObject] [DirectObject], as in "The boy gave the dog a bone"; some other word orders are used by native speakers, but they're uncommon, and generally only used in certain specific contexts (e.g. the Verb Subject Complement order in "Are you an idiot?"). Thus analysis refers to the use of strict word ordering to determine what role each noun has.

As I mentioned in the same paper, English wasn't always this way: it belongs to the same language family as Latin, a family whose members traditionally use dependent-marking of case. Dependent marking refers to the fact that each word is marked to indicate its role. In the equivalent sentence "Puer [boy] cani [dog] os [bone] dabat [gave]", the four words may be placed in any order, and the meaning will still be clear, because the nouns carry the nominative, dative, and accusative cases, respectively (actually, that isn't 100% true; because some cases decline the same way, there can be some ambiguity here).

You might notice that English also does this for non-core roles, which corresponds to greater freedom as to word order. As dependent-marking does not require that the mark actually be attached to the word, English uses prepositions to mark non-core roles, rather than the traditional suffixes of Indo-European languages. This system is used for such roles as instrument in "The boy poked the dog with a bone" (the Latin version, "Puer canes osse pungebat", uses the ablative case, and the accusative case for the dog), the benefactor in "The boy bought a dog for her" (in the Latin version "Puer canes per ea emebat", a preposition is used with the ablative in this case), etc. The last example also illustrates that Latin uses prepositions as well, to mark roles outside the 6 core cases.

Both of those should be at least somewhat familiar to English speakers. Even case still (barely) exists in the pronouns and nouns of English (which have three and two cases, respectively); the third method, head-marking or agreement, is also not absolutely foreign, though it is uncommon in modern English. Verbs in Indo-European languages traditionally agree with the subject of the sentence - the verbs themselves indicate the grammatical person and number of the subject. While English has all but lost this form of agreement, you can still see vestiges of it. The verb 'am' uniquely identifies the subject as first person singular, while 'is' identifies the subject as third person singular ('are' is ambiguous, because it could refer either to second person singular or to any person plural); similarly, the -s form of all other verbs (e.g. 'gives') identifies the subject as third person singular. Romance languages like Spanish still have robust subject-verb agreement, such that it is possible to uniquely identify the subject as first, second, or third person (never mind the bad terminology for now) and singular or plural.

However, you might have noticed something: in languages like Latin that have subject agreement, marking nouns with the nominative case (used for the subject) can be redundant. Head-marking, or polysynthetic, languages do away with this use of case, and purely rely on verb agreement to indicate which nouns have each role. I can't find a good example of a sentence that would indicate how this would work without introducing other things I don't want to get into, so I'm gonna make one up:
In this example, theyare attachedit pronouns representingthem the subject, direct object, and indirect object the verb of each clause. As with English pronouns in general, theyagree the attached pronouns with number and gender of the nouns. For the verbs, iusedthem the subject-verb-object order and pronoun cases, to makethem the verbs easier to read for English speakers. However, iusedthem varying word orders for nouns in the clauses to illustrateit how itcan be used head-marking with different word orders. Typically theywould useit head-marking head-marking languages with other modifiers like possessives, as well.
Finally, the Totonac language takes polysynthesis to a ridiculous extreme. According to the examples in The Grammar of Discourse, Totonac merely lists all roles in the sentence, without using agreement to indicate which nouns have which roles. One example given (I'm kind of making up my own orthography, here) is "liiteemaktamaahua [literally 'with-passing by-from-buy'] tumin [money]", which means "As [he] passes by, [he] buys [it] from [him] with money". Amazingly (and completely against expectations), native speakers of Totonac can actually understand each other.

Friday, June 06, 2008

Beyond Godly

On Recording Industry vs. the People, in response to this story, somebody suggested:
It would be interesting to set up a 'honey-pot' node (using maybe a printer or a network monitoring box), wait for a takedown notice, and say "see you in court". It would be even more interesting to see the discovery request for the hard disk of a printer.
That idea is beyond godly - set up a honeypot network that isn't actually sharing copyrighted material, and file DMCA abuse suits for every DMCA takedown notice they receive. I suspect that would very rapidly lead to more thorough investigations before companies fire off bogus DMCA takedown notices.

Thursday, June 05, 2008

Empirical Data and the RIAA

A bit ago I wrote up a rather lengthy list of factors which could, in theory, produce false-positives in identifying users sharing copyrighted files via peer-to-peer programs. Most of these risks could be mitigated by thorough investigation, though I noted that as the RIAA clearly cuts every corner they can, it's likely that few if any of these mitigating measures are taken in actual investigations.

Now the University of Washington has demonstrated some of these risks actually occurring, in their project Tracking the Trackers: Investigating P2P Copyright Enforcement. While they've only looked at a couple of the risks I suggested, the results show quite a few false positives, indicating that my prediction - that measures to minimize these risks are not being applied - was accurate.

The research paper is here, if you don't want to go through the project's web site itself. The New York Times blog has also picked up this story. They also have a cute logo/illustration:


This was actually a study I've been wanting to see done for some time. The other study that I think is very important, but has not yet been done, is to determine empirically how, on a system like eDonkey (where users search all peers for a certain file), the number of requests a single computer gets for a single file varies with the popularity of the file. The basis of this investigation is the claim by the RIAA and others that users could be sharing thousands or millions of copies of each copyrighted work, and that therefore constitutional limitations on civil damage awards do not apply.

Clearly, files that are popular (e.g. the latest hit song) will be downloaded more (in total) than files which are unpopular. But does this mean any single computer will upload popular files significantly more often than unpopular files? I believe the answer is no, because as files become more popular, not only are they downloaded more, they are also available from more computers. In theory, the increase in demand is accompanied by a proportionate increase in supply, keeping the ratio invariant regardless of demand. Based on this reasoning, I have argued on forums (one example here) that most of the people the RIAA has sued have, by simple probability, not uploaded more than a single copy of each file, on average (so about $0.70 of damage per file, if you assume 1 download = 1 lost sale, which itself is highly suspect).
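
To make that argument concrete, here's a toy calculation (all the constants are made-up assumptions, not measurements): if both total downloads and the number of computers sharing a file scale with its popularity, the popularity cancels out of the expected-uploads-per-sharer ratio.

// Toy model of the supply/demand argument: expected uploads per sharing peer
// when both demand (downloads) and supply (sharing peers) scale with a file's
// popularity. All constants are illustrative assumptions.
#include <cstdio>

int main() {
    const double downloadsPerFan = 1.0;  // assumed downloads per interested user
    const double shareFraction   = 1.0;  // assumed fraction of downloaders who also share

    const double popularities[] = { 1e3, 1e5, 1e7 };  // interested users per file
    for (int i = 0; i < 3; ++i) {
        double totalDownloads = popularities[i] * downloadsPerFan;
        double sharingPeers   = popularities[i] * shareFraction;
        // The popularity cancels out: this ratio is downloadsPerFan / shareFraction.
        printf("popularity %.0f: %.2f expected uploads per sharing peer\n",
               popularities[i], totalDownloads / sharingPeers);
    }
    return 0;
}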

Thursday, May 29, 2008

Absolutely Amazing

Since it would take quite a patchwork of quotes to summarize this story, I'll just give a few bullet-points of my own as a summary.
- Revision3 uses BitTorrent to distribute its own content (legal distribution, in other words)
- Everybody's second-favorite company MediaDefender decided to play with R3's tracker. Once they found a hole that allowed the tracker to serve torrents not by R3, they began using the tracker to track their own files.
- R3 discovered that somebody was using their tracker for external content and banned MD's torrents
- MD's servers (the ones actually uploading the files that they were using R3's tracker to track) responded by DoSing R3's tracker (according to one person on Slashdot, MD has a 9 Gbps internet connection for this purpose), taking R3's tracker and other systems completely offline
- The FBI is currently investigating the incident. Some have suggested and are praying that the PATRIOT Act could be used to charge MD with cyber-terrorism, as defined by law.

Various coverage:
Inside the Attack that Crippled Revision3 (mirror)
MediaDefender, Revision3, screw-up
Revision3 Sends FBI after MediaDefender

Thursday, May 22, 2008

More on kd-tree Splitting

I was just reading the paper Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets, which, as the title indicates, analyzes the performance of 3 different kd-tree splitting policies. The policies used are the standard policy, the sliding-midpoint policy, and the minimum-ambiguity policy, which is new in this paper.
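
For those who haven't seen it, here's a minimal sketch of the sliding-midpoint rule as I understand it: split the cell's longest side at its midpoint, and if all the points end up on one side of that plane, slide the plane toward them until it touches the nearest point, so neither child is empty. The types and helper below are illustrative - not code from the paper or from E Terra.

// Minimal sketch of the sliding-midpoint splitting rule for one kd-tree node.
// Point/Box types and the 3-dimension assumption are illustrative.
#include <algorithm>
#include <vector>

struct Point { float x[3]; };
struct Box   { float lo[3], hi[3]; };

// Chooses the splitting dimension and value for one node's cell.
// Assumes pts is non-empty.
void SlidingMidpointSplit(const std::vector<Point>& pts, const Box& cell,
                          int& splitDim, float& splitVal) {
    // 1. Split the cell's longest side at its midpoint.
    splitDim = 0;
    for (int d = 1; d < 3; ++d)
        if (cell.hi[d] - cell.lo[d] > cell.hi[splitDim] - cell.lo[splitDim])
            splitDim = d;
    splitVal = 0.5f * (cell.lo[splitDim] + cell.hi[splitDim]);

    // 2. If all points lie on one side of the plane, slide the plane to the
    //    nearest point so the split actually separates the point set.
    float minC = pts[0].x[splitDim], maxC = minC;
    for (size_t i = 1; i < pts.size(); ++i) {
        minC = std::min(minC, pts[i].x[splitDim]);
        maxC = std::max(maxC, pts[i].x[splitDim]);
    }
    if (splitVal < minC) splitVal = minC;
    if (splitVal > maxC) splitVal = maxC;
}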

The minimum-ambiguity method takes into account not only the full set of data points, like the others, but also a set of training regions representing the distribution of regions that points will be searched for within - that is, the future searches themselves. As with the other methods, the goal of the algorithm is to minimize the average number of nodes in the tree that overlap each search region; when both the searches and data points are known, however, the minimum-ambiguity method can do this better than the others.

Two different scenarios analyzed are of particular interest. In all cases the data points were clustered; the two correspond to the distribution of the training regions: the same clustered distribution as the data points, or uniform distribution. In the case of both using the same clustered distribution, the minimum-ambiguity policy > the standard policy > sliding-midpoint policy (here using the internet use of '>' as "is superior to"). In the case of searches distributed uniformly, the sliding-midpoint policy > minimum-ambiguity policy, with both far superior to the standard policy.

So, what's it mean for writing a kd-tree for a game? Well, it provides some pretty interesting information, though it doesn't change the bottom line. As mentioned in my paper, more complex splitting policies like sliding-midpoint and minimum-ambiguity are only viable for data sets that are essentially fixed. In a game, this corresponds to immobile objects that are either unchanging (e.g. cannot die) or change extremely infrequently; in E Terra, this corresponds to doodads - objects which take up space but do not have any function - and immobile game objects such as grass (which is edible but not consumable).

As also mentioned previously, the distribution of points is not expected to be uniform - it's expected that there will be clusters of things at various focal points on the map. Furthermore, in the case of mobile objects, the search distribution will roughly equal the distribution of the data points themselves.

Unfortunately, neither of these facts is useful to us. Despite the mostly known distribution of searches, we cannot use the minimum-ambiguity policy in any of our trees because the set of search regions - corresponding mostly to the mobile game objects - is dynamic. Furthermore, it wouldn't be of any particular benefit to use the data points in the static trees as the search region distribution, as the majority of searches will be from the mobile objects, for things like collision detection and sight radii.

Thursday, May 15, 2008

Detox

Well, I've finally finished my last final of the semester, and the last homework assignment/term project was turned in last week. Now comes the much more pleasant task of purging myself of any knowledge acquired this semester. A list of some of the things I ought to work on this summer:
- Clean up and post kdTrieDemo
- Clean up and post TextBreaker
- Clean up and post the GUI code from MPQDraft
- Work on E Terra
- Do follow-up experiments to my AI term project
- Work on Secret Project V (no, that doesn't stand for "vendetta")
- Work more on my various languages

Though somehow I'm betting the only one I'll do in any great amount is:
- Play World of Warcraft

Saturday, May 10, 2008

The Story of Comcast

As companies like Comcast are increasingly in the news on technology-related news sources, some might wonder how the entire situation with Comcast came to be. Well, this abridged version of the actual shareholder delegate meetings in two acts might shed some light on the topic.

Act 1

Shareholder Delegate #1: Gentlemen, I propose that we advertise higher bandwidth, increase prices, and sign up more customers. Yay or nay?
Delegate #4: What is the business impact of this proposal?
Delegate #1: More money for us.
Delegate #4: Yay.
Delegate #3: Yay.
Delegate #5: Yay.
Delegate #2: Yay!
Delegate #5: I propose we increase network capacity to accommodate those additional customers.
Delegate #2: What is the business impact of this proposal?
Delegate #5: We'll need to spend some money in the short term to...
Delegate #2: Nay!
Delegate #4: Nay!
Delegate #3: Yay.
Delegate #1: Nay.

Repeat 37 times

Act 2

Delegate #2: Holy shit! We've got way more network traffic than the network can handle! It's strangling our network to death!
Delegate #4: This is a disaster! Quick, somebody find a scapegoat!
Network Technician: Well, it looks like BitTorrent is using up a fair amount of bandwidth.
Delegate #4: Kill it, quick!
Network Technician: BitTorrent traffic blocked. Network performance has returned to mediocre levels.
Delegate #2: Whew. Crisis averted.
Delegate #2: Now then, I propose we increase advertised bandwidth and sign up more users. Oh, and we can't forget to increase the price; we are selling a limited resource, after all.
Delegate #3: What is the business impact of this proposal?
Delegate #2: More money for us.
Delegate #2: Yay.
Delegate #5: Yay.
Delegate #4: Yay!
Delegate #3: Yay.
Delegate #5: I propose we upgrade our network to...
Delegate #2: I thought we voted you out months ago. Security!
*delegate #5 is dragged out of the room by security guards*

*cut to next business meeting*

Delegate #4: Gentlemen, I propose that we raise advertised bandwidth, increase prices, and sign up more customers.

*curtain*

This post inspired by Mac Chaos' various parodies over the years, which are probably better than mine.

Friday, May 09, 2008

Spatial Sorting with kd-trees - Part 2

Here's the paper itself. The teacher required that I use the ACM format, but I cut corners where possible (e.g. he only said we needed to have those four sections) :P I should note that the sections themselves come from specifications given by the teacher; I think if it were an actual ACM paper, or even if I were coming up with the structure myself, it would have been organized significantly differently.

Thursday, May 08, 2008

Spatial Sorting with kd-trees - Part 1

So, today I turned in my graphics programming term project... what's done of it, anyway. Spatial Sorting with kd-trees is the name of the paper/presentation; not quite as impressive sounding as my AI project, but then, this project as a whole is less impressive than my AI project (and my third term project even less) :P

Well, I was supposed to give the presentation today, on the last day of class. Unfortunately, due to gross lack of time, that didn't actually happen. If I recall correctly, 12 people were scheduled to go today, in 75 minutes of class time; that's about 6 minutes per person. Problem is, everybody made their presentations with at least 10 minutes in mind (not unreasonable), and it took everybody a couple minutes to get their Powerpoint presentation running on the overhead projector. By the end we'd gone from allowing the first 3 or 4 people to give their full presentations, to "7 minutes each", to "5 minutes each", to "3 minutes each", to "run your demo program and sit down"; and at least one person (me) never did get a chance to go (I'm not sure if there were others). So, I wrote the URL of my blog on the whiteboard and told people I'd post the presentation there.

The presentation itself is here. I'd recommend that anyone who hasn't seen it before also look at my previous description of kd-trees and kd-tries, as I'm not sure the slides alone, without the spoken part of the presentation, will give you a clear explanation.

The demo program is also here, compiled with XNA 1.0 Refresh and XNA 2.0. The space bar toggles pause (initially paused); the H key toggles display of the kd-tree divisions, the simulated view frustum, and statistics about the search for objects in the view frustum (the statistics are, in order: the number of objects in the view frustum, the number of objects evaluated, the percentage of evaluated objects that are in the view frustum, the number of tree nodes navigated, and the ratio of objects in the frustum to combined nodes and objects visited); the arrow keys move the view frustum. I'll probably post a rant about XNA in the near future; for now, just make sure you download the version of my demo for the version of XNA you have installed. I'll probably post the written portion of the project tomorrow, after I finish it.

I'm still hoping to post the source to TextBreaker and kdTrieDemo, although that's gonna be at least a week from now (at the earliest), as finals are next week.

Sunday, May 04, 2008

Orthographic Language Identification Using Artificial Neural Networks

Finally finished adding in the figures from the presentation that weren't in the original version of the paper. I have no idea why those three Visio figures are so ugly; they don't look that way in Visio, Word, or Powerpoint, but magically turn ugly when printed with PDF reDirect (which has worked fine before, just not with those three figures).

Orthographic Language Identification Using Artificial Neural Networks

I'd like to post the full source to TextBreaker, but a lot of it was written in a hurry and needs cleaning up and commenting. Combine my infamous laziness (and past experience with MPQDraft) and time will tell if I ever get around to it :P

Saturday, May 03, 2008

The Grand Unification

The assumptions you start with can often limit the set of conclusions you are able to arrive at through logical reasoning. Computer science people were having a heck of a time attempting to adapt the binary tree - the standard for in-memory search structures - to media with large seek time, particularly disk drives. Progress was slow and fairly unproductive while the basic definition of binary tree held. Finally, somebody questioned the assumption and thought: what if we built the tree from the bottom up? And so the B-tree was born, and remains the standard in index structures to this day, with incremental improvement.

Other times, when creating, it's fun to play with the assumptions themselves and observe the results. In Caia, I initially envisioned all of the major words - nouns, adjectives, and verbs - as nouns. Nouns became adjectives when used with an attributive particle analogous to "having" (e.g. "woman having beauty" vs. "beautiful woman"). Taking an idea from Japanese, nouns became verbs via an auxiliary verb meaning more or less "do" or "make" (e.g. "make cut" vs. "cut"). Unfortunately, I ultimately concluded that verbs had to be separate, due to both practical concerns (specifically, concerns about making thoughts too long) and theoretical concerns (different nouns require different semantics, which would produce inconsistent theoretical behavior).

With another one of my languages, I took a different route, with some very interesting results. As this language is synthetic (unlike Caia, which is strongly analytic and isolating), I had quite a bit more flexibility. This language was actually modeled on the Altaic languages - Japanese, Korean, Mongolian, Turkish, etc.; as such, I suppose I can't claim that I invented this (what I'm getting to), but merely took what existed in Altaic languages and perfected it to a degree that doesn't exist in nature - at least to my knowledge.

The result is a language in which nouns, adjectives, and verbs are, in fact, all verbs; they are conjugated exactly the same, and play the same grammatical role. Even though this particular language is fictional, and I'm not expecting anybody to actually speak it, this idea might show up in other languages of mine (possibly a real one), as I find it exceptionally elegant. However, I do like to refer to "nouns" as substantives, "adjectives" as attributes, and "verbs" as actions; this is because there are slight differences in meaning between the noun bases in the three cases.

I explained previously that Japanese verbs had several base forms - usually distinguished by one vowel - which were then agglutinated with other things to form complex verb forms. I won't describe them again, as I use different names for the ones in my language, which might result in confusion. In my language, there are six different base forms of each verb: neutral, conclusive, attributive, participial, instantiative, and conjunctive (note that these names are not final, and I'm open to suggestions).

The conclusive form is the same as the Japanese conclusive: it's the main verb of the last clause in a sentence; it indicates the end of the sentence, and often has a number of suffixes indicating various details about the sentence. The attributive form is the same as one of the two uses of the Japanese attributive: it's the verb of a relative clause; adjectives are attached to nouns by forming relative clauses (e.g. "cat that is fat" vs. "fat cat"). The participial form resembles the second use of the Japanese attributive form: it is a noun referring to the act named by the verb (essentially an English gerund or participle, as in "watching FLCL makes me want to kill people"). The instantiative is unique to my language, and refers to an instance of the verb's action; for example, "a run" would be an instance of the verb "run". The conjunctive is similar to the Japanese -te form (and also includes the Japanese conjunctive base); specifically, it is used in verbs not in the last clause of a sentence (I'll come back to that). Finally, the neutral form is used in agglutination, and has something of a flexible, context-dependent meaning.

So, how exactly does this framework allow the grand unification? Let's look at an example of the specific meanings of the different bases for each word type (substantive, attribute, and action), though I should note that not every base is necessary to unify the three; some are simply part of the bigger picture for the language.

For the substantive "human":
Conclusive form: "is [a] human"
Attributive form: "who is [a] human"
Participial form: "being human"
Instantiative form: "human"
Conjunctive form: "is [a] human, and..."

For the attribute "fat":
Conclusive form: "is fat"
Attributive form: "who is fat"/"fat"
Participial form: "being fat"
Instantiative form: "fat thing/person"
Conjunctive form: "is fat, and..."

For the action "travel":
Conclusive form: "travel"
Attributive form: "who travels"
Participial form: "traveling"
Instantiative form: "journey"
Conjunctive form: "travel, and..."

Thus we are able to use identical conjugation for each type of word, treating the first two as stative verbs and the last as an active verb, in an elegant unified system. The real key to this, I think, was the separation of participial and instantiative forms. Note that not all actions have an instantiative form; it only exists where it makes sense: where something is produced or performed.

Now that I've explained how the unification works, there's just one more loose end to tie up: the meaning of the conjunctive form. This form is somewhat foreign to English speakers, because English only works this way in one of the circumstances this language uses it for (specifically, conjunction - e.g. "Murasaki is seven years old and [is] surprisingly well-spoken").

In English, when we have multiple clauses in a sentence that are related in a particular way, they are generally joined by some linker word that carries information about the relationship between the clauses; furthermore, the verbs in all clauses are conjugated normally. In Japanese, all non-main clauses are simply joined, often without any indication of what the relationship is (for matters of time, this is not that unusual; many languages lack such words); as well, the verbs in all but the main clause are deficient - they lack various things like tense, mood, politeness auxiliaries, etc. This is a matter of economy; all that stuff they stick on the verb at the end of the sentence can be rather lengthy. Thus it uses a generic verb form which in some ways resembles the -ing form of English verbs; this form indicates a conjunctive relationship between sentences (note that the literal conjunctive, as in the example a bit above, is actually indicated with a separate form in Japanese, appropriately known as the "conjunctive form/base"; what I've done is merged the two uses).

Here are some examples of things that would use this conjunctive form in Japanese. The first version is how it would be said in Japanese (note that I'm conjugating all verbs here, even though only the last one would be conjugated in Japanese); the second version shows how we would typically say the same thing in English.

Simultaneity: "I looked at manga and she looked at novels"/"I looked at manga while/as she looked at novels"
Coincidence: "I went shopping and ran into a friend"/"I ran into a friend when I went shopping"
Sequence: "I got a haircut, [and] went to the bank, and went to the supermarket"/"I got a haircut, went to the bank, [and] then went to the supermarket"
Consequence: "I overslept and was late for class"/"I was late for class because I overslept"

So that's all the conjunctive form is. On an interesting random note, you might notice that in none of those examples does the first version sound unnatural, and might very well be used by native English speakers in addition to the more precise second versions (though of course it would have sounded extremely strange if I had only conjugated the last verb, like Japanese does). This indicates that even in English this kind of vagueness is used; and for that matter, there are ways of indicating some of those relationships explicitly in Japanese, as well - they just aren't always used.

Monday, April 28, 2008

Random Linguistic Fact of the Day

An affix is something in linguistics which attaches to the word it modifies. For example, the English plural suffix 's'/'es' is an affix, as shown in "books on the shelf". The suffix attaches to the word made plural - 'book'.

A clitic is something that attaches to something other than the word it modifies. An example of this in English is the possessive 's. Adding that to the previous example gives "books on the shelf's". Here, the possessive clitic is attached to 'shelf', even though the word it's actually modifying (the possessor) is 'books'.

This is the technical term for what Trique uses frequently (and I didn't know the name of until now). In Trique, personal and possessive pronouns often become clitics when following certain other types of words. Trique also uses clitic doubling, in some cases.

Friday, April 25, 2008

More Random Thoughts

Well, I've had two more random thoughts about English evolution (so far) today.

First, I realized that there is already a mechanism in common use by which English could lose the past tense entirely. Figure it out, yet? I just used it. English moves auxiliary verbs ("do"/"did") before the subject in questions, e.g. "Did you figure it out, yet?" However, it's becoming common in English to drop auxiliary verbs, and this is one such case. As the auxiliary verb, not the main verb, carries the tense, this change leads to a loss of explicitly stated tense.

In the case of questions, the main verb is kept in the infinitive, which is identical in form to the present tense. Languages have a way of evolving based on analogy. It's not impossible that this could be applied to verbs in general, losing the past tense entirely (or at least a distinct form for the past tense; it's possible a periphrastic form, like "do"/"did", would then take its place in all cases).

The third thought of the day came from me pondering the second one. This modern tendency to drop auxiliary verbs and sometimes the subject is not unique to questions. It's also applied to the perfect (e.g. "I've been" -> "Been") and the progressive (e.g. "I'm thinking" -> "Thinking"). If the latter case became the standard, the effect would basically be the same as with dropping "do"/"did". However, if the former occurred, it could result in the past participle replacing other forms, such as the perfect or past tense.

Here's where my random thought came in: the possibility of replacing the past tense is especially interesting, because it may have happened before in English. If you look back at Old English, you'll notice there are two distinct forms - past tense and past participle - both for strong verbs, which retain this distinction (verbs like "write"/"wrote"/"written"), and for weak verbs (what became our modern regular verbs like "poke"/"poked"). The past tense suffix for weak verbs, in Old English, was -de. Care to guess what the past participle was? It's -ed, the modern past tense suffix. In summary, the Old English past participle has become the modern English past tense form.

Random Thought of the Day

So, I was chatting with friends via instant messages, while doing some other things. I wrote the emote
*catches up on massive backlog of anime*
You might have noticed before how emotes tend to avoid the use of pronouns referring to the subject. If I had to take a guess, I'd say the reason for this is the fact that the sentence is actually first person (the speaker is also the subject), but the verb forms used refer to the third person; thus any pronoun referring to the speaker would seem out of place.

This got me thinking. While obviously omitting the subject for first-person sentences in spoken English would introduce ambiguity (as unlike in emotes, there isn't any indication that the speaker is referring to themself), in some cases, such as where the subject is the possessor of the direct object, this would not create an appreciable amount of ambiguity. Once convention takes over, you could say "He catches up on massive backlog of anime" (which is currently ungrammatical), and it would be understood that said backlog belonged to the subject. I wonder if we'll see this happen in the future.

For the skeptics, I note that this (not anything having to do with emotes, but rather the omission of the possessive pronoun) has already occurred for some nouns (especially body parts) in Spanish and Italian. For example, taken directly from one of my Spanish books, you would not say "El estudiante levantó su mano" (literally "The student raised their hand"), but simply "El estudiante levantó la mano" (literally "The student raised the hand", which is understood as belonging to the student).

Completely unrelated fact: In Indo-European languages that still have a male/female distinction for nouns, the word for "hand" is female.

Friday, April 18, 2008

Novel Method of Attack

Looks like the RIAA has just undertaken a novel campaign against P2P.
"Are MP3s doing permanent damage to your ears?"

- sound bite in a commercial for the news, on an upcoming story. I had to stop what I was doing for a moment to convince myself that I hadn't misheard it.

Here Comes the Clue Train!

Poached from Slashdot Firehose:
Speaking at a Westminster eForum on Web 2.0 this week in London, Jim Cicconi, vice president of legislative affairs for AT&T, warned that the current systems that constitute the Internet will not be able to cope with the increasing amounts of video and user-generated content being uploaded.

"The surge in online content is at the center of the most dramatic changes affecting the Internet today," he said. "In three years' time, 20 typical households will generate more traffic than the entire Internet today."
...
"We are going to be butting up against the physical capacity of the Internet by 2010," he said.
Clue train says: Change happens; growth happens; why the hell have you waited this long to upgrade your network infrastructure?

Of course he then goes on to straight-out lie:
...the Internet only exists thanks to the infrastructure provided by a group of mostly private companies. "There is nothing magic or ethereal about the Internet--it is no more ethereal than the highway system. It is not created by an act of God but upgraded and maintained by private investors," he said.
The internet is paid for and maintained by the money of customers of those companies, who are the real "private investors", not the shareholders or executives like him. In some cases the internet infrastructure was even paid for directly by the government with taxpayer funds, then entrusted to private companies, who then go on to offer minimal service at maximum price by claiming they paid for what they're selling.

Q's pet peeve #5 (approximate number; lower numbers indicate higher hate): Companies who think that they can cope with the inevitable rise in demand for the internet simply by increasing overselling ratios (Comcast, etc.), blocking some types of traffic (Comcast, etc.), or charging for traffic (Rogers, etc.).

Oh, and while we're on the subject of greedy ISPs, I should note that this previous story has been retracted. The problem was found to be with a router device, not Comcast itself (this time).

Wednesday, April 16, 2008

Synthesis

One way of summarizing the characteristics of a given language, as well as comparing one language to another, is the index of synthesis: a measure of the degree of synthesis - the process of combining multiple units of meaning into fewer words - in a language. In other words, the index of synthesis is a measure of how much information is contained in each word.

The index of synthesis ranges from isolating to synthetic. Isolating languages contain exactly one unit of meaning per word, and adding additional meaning to some word requires adding other words that modify it. On the other end, a purely synthetic language (not known to exist), would have one word for each entire sentence.

English has been changing in favor of isolation for the last couple thousand years, and Modern English is close to the isolating end of the spectrum, with many words containing single units of meaning. Off the top of my head, I'm going to say English words other than nouns, pronouns, and verbs are isolating; for example, adjectives and prepositions contain only a single unit of meaning in each word. Verbs are becoming increasingly isolating, adding auxiliary verbs to indicate additional meaning to the base verb (e.g. 'should not have seen'). Nouns are also becoming less synthetic, as there are now only two flavors of most nouns: singular objective (e.g. 'cat'), and plural objective/possessive ('cats', 'cat's', and 'cats'', which are all pronounced the same when spoken).*

The romance languages are a little more synthetic than English. Spanish, for example, maintains distinctions between singular and plural, male and female (four forms total) in both nouns and adjectives; for example, 'gordo', 'gorda', 'gordos', and 'gordas' all mean the adjective 'fat', but they indicate singular male, singular female, plural male, and plural female, respectively. Spanish also contains more information in its verbs, chiefly the animacy level and number (singular/plural) of the subject.

Japanese is much further toward the synthetic end. In Japanese, some word types commonly undergo agglutination (agglutination and fusion are the two types of synthesis): the combining of multiple "independent" words into single words, when the context calls for it. For example, the pronoun 'watashi' ('me') can be fused with the suffix '-tachi' to form 'watashitachi' ('us'). Verbs are also fused with auxiliary verbs, commonly meaning things like the passive voice, causation, negation, or politeness; for example, the nine-syllable monstrosity 'kokoruminiawasezu' is formed from either 'kokorumi' ('trial') or 'kokorumiru' ('to test'; I'm not sure if this is the noun or verb, as they're both identical when agglutinated), 'niau' ('to suit; to match; to become; to be like'), '[sa]seru' ('cause to'), and 'zu' ('not'), from the Lord's Prayer meaning 'lead us not into temptation' (that's the popular archaic version of that line, but a more accurate Modern English version would be 'do not test us/our faith', which is closer to the Japanese version). As well, multiple types of words may be fused together, as in 'shakugan', formed from 'shaku' (this appears to be an abbreviation of 'shakuretsu', meaning 'burning') and 'gan' ('eye[s]'), from an anime called Shakugan no Shana ("Shana of the Burning Eyes").

A Turkish example from Wikipedia illustrates extreme levels of synthesis: 'Çekoslovakyalılaştıramadıklarımızdanmışsınız' (18 syllables, if I counted correctly), meaning "You are said to be one of those that we couldn't manage to convert to a Czechoslovak". Some people believe that Turkish is related to Japanese (the Altaic hypothesis), but this has not been conclusively proven.

* While English may have [sometimes large] agglutinated words such as 'antidisestablishmentarianism' (an agglutination of 'anti-', 'dis-', 'establish', '-ment', '-ary', '-an', and '-ism'), these are generally considered separate words, and so are not considered to be a synthesis of smaller words; this classification is supported by the fact that such agglutinations cannot be made freely, but must be words established through common use. In contrast, Japanese verbs may be freely agglutinated with other verbs/words that make sense (e.g. auxiliary verbs), without creating entirely new words.

Sunday, April 13, 2008

Random Linguistic Fact of the Day

Indo-European languages characteristically have distinct verb forms that agree with some basic properties of the subject. English has almost entirely lost this property for most verbs (the third-person singular being the only form that still agrees with the subject), although 'be', the most irregular verb in English, gives you a little bit of an idea of how things used to work:

I am
You are
He/she/it is
We/y'all/they are

In Spanish, the original (more complicated) Indo-European agreement system still exists. As is tradition for Indo-European languages, Spanish verbs agree in number and person (first, second, third) with the subject. At least, that's what most speakers and Spanish teachers will tell you. Here's the full list of forms for the indicative present:

Yo [I] creo [believe]
Nosotros [we] creemos
Tú [you, familiar] crees
Vosotros [y'all, familiar] creéis
Él [he]/ella [she]/usted [you, polite] cree (Spanish does not have a neuter gender, and inanimate objects are either male or female)
Ellos [male they]/ellas [female they]/ustedes [you, polite plural] creen

Now, there's something really weird about that system; did you see it? Usted/ustedes are second person pronouns, but verb agreement indicates that they're third person. How do we explain that?

This was something that mystified me until a couple years ago, when I bought the linguistics book I use as a reference, learned about animacy/empathy hierarchies (which I've touched on before), and had one of those 'aha!' moments. The answer is that verbs don't actually agree with the person of the subject, but rather with the empathy level of the subject. It's very common in empathy hierarchies to see "first person > second person > third person" (which makes logical sense), so it isn't surprising that Indo-European verb agreement approximates the three persons.

What's out of place - that a second person pronoun is placed at the same level as third person pronouns - can be explained by noting that you would use 'tú' with friends, while you would typically use 'usted' with people you are less familiar with. You would clearly have greater empathy for your friends than for random people you meet on the street. Thus it is not surprising that the familiar pronoun is higher on the empathy hierarchy.

Thus the actual hierarchy looks like this. There are five logical divisions, and they're listed from highest empathy to lowest. These five are then grouped into three discrete empathy levels, indicated by the numbers:

1. First person
2. Second person familiar
3. Second person polite
3. Third person
3. 'Fourth person' (this is a generic pronoun where we would use 'they' or the passive voice in English, not referring to anybody specific; e.g. "Se habla Español" - "Spanish is spoken")

Friday, April 11, 2008

Exercise for the Reader

Is it wrong to laugh at your own jokes?

So, tonight a friend asked me some hackingish-related questions, and what he was trying to do reminded me of a blog post I'd written a ways back. Looking back on the blog post, I thought it was amusing how many biology references/puns I used. Can you catch them all?

Thursday, April 10, 2008

E Terra Tree / Term Project Roundup

I've already written a bit about my AI class term project: a language identifier. In this post I'll describe my other two term projects.

In game programming class, we are making a tower defense game. This was decided by vote. We're still pretty early into development (still discussing gameplay design), so there isn't much to tell right now. It will be 2.5D, meaning that the gameplay will be two-dimensional, but the graphics will be three-dimensional. We decided to write it in C#/XNA, because XNA is a very nice self-contained platform for amateur game development (if you're looking to make a small-scale game, I'd definitely recommend XNA), and most of us have used it before in the introduction to game programming class.

In graphics programming class, I'm going to be making a commercial-grade space-partitioning tree for E Terra. Games have to do a lot involving spatial searches. Collision detection, range-finding, path-finding, and view frustum culling (only drawing objects that are actually visible on screen) are some of the things E Terra needs to do.

In the very beginning, I used a simple set to hold all objects on the playing field. This was just a temporary method, to allow me to work on other stuff before creating a space-partitioning structure. This worked okay when there were only a couple dozen things on the map, but as spatial search for anything in the set is O(n), obviously this would become a bottleneck as the number of units on the map increases. That was expected, and happened after not too long.

As I still didn't want to take the time to make a commercial-grade structure, I came up with something else that was much faster, yet still didn't take very long to code: a spatial hash table. The X and Y coordinates of each object on the map were quantized, and the objects were put into buckets corresponding to regular, fixed-size squares of the map. While this was still O(n), the actual time was much smaller, as the space which was searched in each case was much smaller than the entire map (with the maps I was using, things like collision detection were quite a few orders of magnitude faster). The problem is that this structure is only optimal if objects on the map are evenly distributed, which is extremely unlikely in a game (or anything, for that matter).
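For the curious, the idea looks roughly like this (a minimal sketch, written in C# since that's what E Terra uses; the class and method names are made up for illustration and aren't E Terra's actual code):

// Minimal sketch of the spatial hash idea: quantize positions into fixed-size
// grid cells and bucket objects by cell, so queries only scan nearby buckets.
using System;
using System.Collections.Generic;
using Microsoft.Xna.Framework;

class SpatialHash<T>
{
    private readonly float cellSize;
    private readonly Dictionary<long, List<T>> buckets = new Dictionary<long, List<T>>();

    public SpatialHash(float cellSize) { this.cellSize = cellSize; }

    // Pack the two quantized cell coordinates into a single dictionary key.
    private static long PackKey(int cx, int cy)
    {
        return ((long)cx << 32) ^ ((uint)cy);
    }

    private int Cell(float coordinate)
    {
        return (int)Math.Floor(coordinate / cellSize);
    }

    public void Add(Vector2 position, T obj)
    {
        long key = PackKey(Cell(position.X), Cell(position.Y));
        List<T> bucket;
        if (!buckets.TryGetValue(key, out bucket))
            buckets[key] = bucket = new List<T>();
        bucket.Add(obj);
    }

    // Enumerate everything in the buckets overlapping a query rectangle;
    // the caller still performs the exact intersection/distance test.
    public IEnumerable<T> Query(Vector2 min, Vector2 max)
    {
        for (int cx = Cell(min.X); cx <= Cell(max.X); cx++)
            for (int cy = Cell(min.Y); cy <= Cell(max.Y); cy++)
            {
                List<T> bucket;
                if (buckets.TryGetValue(PackKey(cx, cy), out bucket))
                    foreach (T obj in bucket)
                        yield return obj;
            }
    }
}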

I'm still not certain which type of structure I'm going to use in the end, although I have it down to two: a kd-tree and a quadtree. Both are trees that partition space, but they do not partition space in fixed-size regions like my spatial hash table does. This allows them to maintain a roughly even distribution of objects per partition even when there isn't an even distribution of objects on the map, by creating more, smaller partitions in areas where there is a high object density. Searching for the object nearest a given point/object is O(log n) for both; other algorithms are harder to pin down to a definite complexity.

A quadtree (diagram) is the simpler of the two. It's a two-dimensional space partitioning structure that separates a given region into four equal-sized squares, which may further be split. Thus it partitions the space by recursively splitting it in both dimensions at once. Partitions that have few units can be left large, while partitions that contain many objects can be further split into smaller partitions. The three-dimensional version of a quadtree is an octree, which partitions each region into eight equal-sized cubes.
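A bare-bones version of that idea might look something like the following (purely illustrative - just the recursive four-way split and nothing else; not what E Terra will actually use, and edge cases like points exactly on a boundary are glossed over):

// Bare-bones quadtree illustrating the recursive four-way split described above.
using System.Collections.Generic;
using Microsoft.Xna.Framework;

class Quadtree
{
    private const int MaxObjectsPerNode = 8;       // arbitrary split threshold
    private readonly Rectangle bounds;             // this node's region of the map
    private readonly List<Point> objects = new List<Point>();
    private Quadtree[] children;                   // null until this node splits

    public Quadtree(Rectangle bounds) { this.bounds = bounds; }

    public void Insert(Point p)
    {
        if (children != null)
        {
            // Already split: hand the point to whichever quadrant contains it.
            foreach (Quadtree child in children)
                if (child.bounds.Contains(p)) { child.Insert(p); return; }
            return;                                // point lies outside this node
        }

        objects.Add(p);
        if (objects.Count > MaxObjectsPerNode && bounds.Width > 1)
            Split();
    }

    // Divide this node into four equal quadrants and push its objects down.
    private void Split()
    {
        int hw = bounds.Width / 2, hh = bounds.Height / 2;
        children = new[]
        {
            new Quadtree(new Rectangle(bounds.X,      bounds.Y,      hw, hh)),
            new Quadtree(new Rectangle(bounds.X + hw, bounds.Y,      hw, hh)),
            new Quadtree(new Rectangle(bounds.X,      bounds.Y + hh, hw, hh)),
            new Quadtree(new Rectangle(bounds.X + hw, bounds.Y + hh, hw, hh)),
        };
        foreach (Point p in objects) Insert(p);    // redistributes into children
        objects.Clear();
    }
}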

A kd-tree (diagram from here) is a binary space-partitioning (BSP) tree. It also recursively divides the space in a region (this time into two parts), but the two partitions need not be equal in size. Furthermore, each partition may be along any axis (a kd-tree is a general structure that can represent a space of any number of dimensions, and the logic is the same regardless of the number of dimensions). The standard way of building a kd-tree is to partition each region such that half of the objects in it fall in one sub-partition and half in the other, producing uniform object density.
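That median-split construction for a fixed set of points looks roughly like this (again a sketch; I've used a simple alternating axis choice here rather than anything clever, and the names are just for illustration):

// Sketch of the standard kd-tree construction: sort on the split axis and
// put the median point at the node, so each child gets half of the objects.
using System.Collections.Generic;
using Microsoft.Xna.Framework;

class KdNode
{
    public Vector2 Point;         // the median point stored at this node
    public int Axis;              // 0 = split on X, 1 = split on Y
    public KdNode Left, Right;    // points below / above the median on Axis

    public static KdNode Build(List<Vector2> points)
    {
        return Build(points, 0);
    }

    private static KdNode Build(List<Vector2> points, int depth)
    {
        if (points.Count == 0) return null;

        int axis = depth % 2;     // alternate X and Y splits by depth
        points.Sort((a, b) => axis == 0 ? a.X.CompareTo(b.X) : a.Y.CompareTo(b.Y));

        int median = points.Count / 2;
        return new KdNode
        {
            Point = points[median],
            Axis = axis,
            Left = Build(points.GetRange(0, median), depth + 1),
            Right = Build(points.GetRange(median + 1, points.Count - median - 1), depth + 1),
        };
    }
}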

A quadtree has the advantage of using fixed-size partitions, making it very fast to add or remove partitions on the fly; this is beneficial for a game, because objects move frequently. As well, to get the maximum benefit of a kd-tree, the entire set of objects must be known in advance, to produce optimal partitioning. However, there is a fairly easy solution to these issues: for kd-trees that hold a dynamic set of objects, the tree behaves similarly to a quad/octree, splitting each region in half as needed. Note also that because I require support for changes to the tree, I would use a kd-trie (diagram), rather than a true kd-tree (diagram; though I've been mostly using the terms interchangeably).
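Since a kd-trie keeps the objects in its leaves, the dynamic version ends up looking a lot like the quadtree sketch above, just splitting one axis at a time. Roughly (again purely illustrative, with degenerate cases like many identical points ignored):

// Sketch of a dynamic kd-trie node: objects live only in leaves, and a leaf
// that gets too full is cut in half along its region's longest axis.
using System.Collections.Generic;
using Microsoft.Xna.Framework;

class KdTrieNode
{
    private const int MaxObjectsPerLeaf = 8;               // arbitrary split threshold
    private readonly Vector2 min, max;                     // this node's region
    private List<Vector2> objects = new List<Vector2>();   // null for interior nodes
    private KdTrieNode low, high;                          // children after a split
    private int axis;                                      // split axis (0 = X, 1 = Y)
    private float split;                                   // split position on that axis

    public KdTrieNode(Vector2 min, Vector2 max) { this.min = min; this.max = max; }

    public void Insert(Vector2 p)
    {
        if (objects == null)                                // interior node: recurse
        {
            ((axis == 0 ? p.X : p.Y) < split ? low : high).Insert(p);
            return;
        }

        objects.Add(p);
        if (objects.Count <= MaxObjectsPerLeaf) return;

        // Leaf is too full: split its region in half along the longest axis.
        Vector2 size = max - min;
        axis = size.X >= size.Y ? 0 : 1;
        split = axis == 0 ? (min.X + max.X) * 0.5f : (min.Y + max.Y) * 0.5f;
        low = new KdTrieNode(min, axis == 0 ? new Vector2(split, max.Y) : new Vector2(max.X, split));
        high = new KdTrieNode(axis == 0 ? new Vector2(split, min.Y) : new Vector2(min.X, split), max);

        List<Vector2> old = objects;
        objects = null;                                     // this node is now interior
        foreach (Vector2 p2 in old) Insert(p2);             // push objects down to leaves
    }
}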

I'm actually leaning toward the kd-tree for several reasons. First, the same code can be used in any number of dimensions, making it highly reusable. Second, only minor modifications are necessary to make the kd-tree implementation work optimally for a non-changing set of objects yet still perform well for sets that change frequently (about as well as a quad/octree). Of course, a kd-tree will be more complicated to code, but I don't see that as a major deterrent.