Spell Checker

08 May, 2015

College Days

In college, the primary instructor for the Data Structures class is known for his frankness with students who need to drop his class. Either they're not prepared (despite having met the prerequisites), have too heavy a course load (the overachievers), or are not cut out for "serious" programming (weed-out). In a typical semester his roster would drop by 50% between the first day of class and the "drop date" (last day to drop a class without impacting your GPA). So, it was without great embarrassment that I dropped his class a few days in (see reason #2) and retried the following semester.

At this time, my school was just a few years into teaching their curriculum in Java as their primary language instead of C++. My professor wrote the book on data structures in Java, which was still in the early version 1.3 days. This book (actually, the 2nd edition) was our textbook as well, naturally. The meat-and-potatoes of our class was basically re-implementing everything in the Collections API, then using this knowledge to solve problems. These were fairly straight-forward tasks (in retrospect) with enough variety that it wasn't a drag. One week would be parsing a dump from IMDB to play the Kevin Bacon Game. Another was to implement a tree-backed double-ended priority queue. Ok, maybe that last one was fairly boring.

As a testament to your understanding, all projects had constraints. Getting the correct solution was not enough, it must work within a particular memory budget or time constraints on his reference machine. Grades were dependent upon staying within the constraints as well as getting the correct result, along with coding style. To this day I'll never forget that this is unacceptable and shall be punished:

if (condition){
  return true;
} else {
  return false;
}

He would expect the source sent to him electronically AND a hard-copy turned in so he could bleed red ink all over your code. The students would lament and remain astonished at his "mental compiler" abilities in being able to parse the code of so many students at a glance. I assume it's a matter of practice, though we did turn in some pretty lousy code.

I write all this to say that he was tough, but fair, and made us all better developers. If you received a high mark, you EARNED it and really got the most out of the class. Which I of course did...the second time.

Constraints

Looking back at the contemporary hardware, my laptop was an iBook with an 800Mhz G3 processor with RAM upgraded to the maximum 640MB. If memory serves, this was a bit more horsepower than my professor's Pentium III laptop, but not my a large margin. I definitely don't want to get into a PowerPC/x86 fight, but at least my machine was newer and prettier. Anyway, the point is that most assignments were based around machines of this era, even though I know that my professor likely cut his teeth on a PDP-11 or VAX machine. Some of the assignments which he gave us literally could not have been performed on the hardware he learned to code upon. Or at least not in the same way as we did them.

For example, the Kevin Bacon game had 2 parts, the first of which was to run within a 240MB JVM to parse the input files and produce a serialized form of the data. The second was to run within a 160MB JVM for producing the results. Considering that a PDP-11 only had 64KB of addressable memory initially, which was later expanded to 4MB with virtual addressing...this was simply not going to work. To be honest, I don't know how much storage would have cost at the time, but even storing the compressed 45.5MB of data listing actors and actresses would have been expensive at the time.

The Discussion

One of the things which we talked about outside of class was how you might implement a Spell Check algorithm. I believe this was actually an assignment for another professor's class. We decided that the easiest approach would be to have a dictionary file containing all the words of the English language, read that into a hashmap, and every word in a document could he looked up with O(1) efficiency. This would allow O(n) times where n was the length of the document (in words). It seemed perfectly reasonable. Given that word list files are in the few MB range, a hashmap to hold them all would be in the 10s of MBs to hold a few hundred thousand words. Lookups would be cheap, so in the case of misspellings we could introduce extra letters, perform inversions, etc for potential suggestions, see if any of those are valid words, and suggest them as corrections.

None of this is wrong. In fact, I wouldn't be surprised if the algorithm used by most spell checkers are a variation of this idea. Perhaps there is a ranking system so that more common words are suggested, or they have some weighted check of the Edit Distance. It was a valid approach at the time, and as hardware limitations are continually less constrained there probably won't be a reason to change.

AppleWorks 3.0

Also known as ClarisWorks, this was my first word processor. It ran on the family Apple IIe, a faithful companion of my childhood. AppleWorks was the program I lived in for writing school papers, exciting adventures, or just as an excuse to use the computer for something which wasn't a game. The keyboard shortcuts have long been forgotten from my muscle-memory, but I knew this application back-and-forwards. One thing I do remember is that it had a spell checker...in 1989. OpenApple-V (for Verify) was the keyboard command to access this function. Note that it wasn't a "shortcut" as there wasn't some other way to give commands, as this was standard 80 column text at its finest.

The 3.5 diskettes at the time only held 800KB (up from 400KB). And this program ran in 128KB of RAM. The spell check function was only a small part of the entire suite of applications, which included a word processor, database, and spreadsheet. According to the document above, the main dictionary took up 402 blocks. The rest of the data on the disk, including the operating system, was 467 blocks. The entire dictionary was 201KB. This literally would not fit in RAM. How did they do it?

I have no idea.

That's not entirely true, I do have some ideas, but they're challenging and interesting to think about. And if I had the time I might try some of them. 😀

The Interview

I don't remember where I was interviewing, but the question was asked: "How would you implement a spell checker?" I hadn't reflected upon the relic of my childhood, and so I gave the "right" answer involving a hashmap of the entire corpus of English words in RAM. I got an approving nod as I regurgitated the discussion from my not-so-distant college discussion. After the interview I thought about how things went and how my answers were received. Somehow the Apple IIe crept into my mind.

Had I been interviewed 20 years earlier, that answer wouldn't have been acceptable.

This isn't about nostalgia. Again, this answer I gave isn't wrong, it merely expects a certain set of constraints. I'm much happier using today's tools instead of AppleWorks 3. Most people are. Today, as always, there are constraints, but so much less limiting than in the decades past. I'm not hitting those limits in my day-to-day work, and I wonder if there's something I'm missing, something that I'm supposed to have learned but never even had the opportunity to forget.

Conclusion

Computer Science is a young discipline by any measure. And yet I can't read paper tape or punch cards, or translate between binary and octal and hex and decimal in my head. The previous generation could. They had to. I didn't/don't because the hardware could be abstracted so much. Compared to my seniors, I've lived in a fairy tale because of the current "speed" of hardware. That's hard to be proud of.

The truth is, Moore's Law is going to slow down. Transistors can be packed only so densely. There are limits to the speed of light and the size of atoms. The problems, however, will continue to grow in size. And maybe we're going to need the lessons of the past to get the most of what we have when the limit is finally reached. But there's no apprenticeship model in our industry. We'll have to learn them all over again.