Posts Tagged ‘Prime Number’

Comments Off

Rewriting the Anagram program


2008
04.12

As Calvin pointed out, its a pain to do the string length and other calculations, so I went back and rewrote the anagram calculation using the prime number trick. That is, assign a prime to each letter from ‘a’ to ‘z’, then you can easily tell if two words are anagrams by multiplying all the prime factors. This will give a unique number.

There are two tricks. First, if you assign too big a prime then you can get an overflow. On a Mac running gcc, the largest integers are 64 bits long. C represents these as “unsigned long long int” which is quite a mouthful. Given there are 26 letters in the alphabet, the worst case would be the 26th prime 103 times 10 before you over flow (103^10 > 2^64). This is pretty unlikely in a real world dictionary.

To make it less likely, you can pick an encoding where the most “frequent”:http://scottbryce.com/cryptograms/stats.htm letters have the smallest prime. That is E is the most likely in english, so assign it a prime of 2, T is next, so assign it prime of 3 and so forth (for your reference, the letter order by decreasing frequency is ETAOIN SRHLDC UMFPGW YGVKXJ QZ and using the Encore scrabble dictionary, the word with the highest prime is adrenocorticotrophin with a canonical number of 18,438,608,663,595,509,046 or 1.8 x 10^19 which is just shy of 2^64 which is 1.8×10^19, but you should have a check for integer overflow.

The code is really simple then, you just multiple the encoding prime for each letter and then match it against the canonical “number” for each entry in the dictionary.

Anagram Solver


2008
04.10

Calvin wants to solve anagrams. There are lots of great programs out there, but not much source. “Gtoal.com”:http://www.gtoal.com/wordgames/anagrams.html has a good list of source code that is out there, but its hard to find a simple program.

“Gtanag.mai”:http://www.gtoal.com/wordgames/anagrams/gtanag.mai is perhaps the simplest program that is a C program. I’m sure you could use it in any language. What it does is pretty brute force. It takes any dictionary like say a Google search for “unix dictionary list”:google which gets you to “words.txt”:http://linkage.rockefeller.edu/chaynes/words.txt that has a 25,000 word dictionary or the Unix V7 “/usr/dict/words”:http://unix-tree.huihoo.org/V7/usr/dict/words.html and finds all anagrams by first making each entry canonical that is, it takes all the letters and alpha sorts them, so “rich” becomes “chir” and “calvin” becomes “acilnv” and then it can compare them by sorting all the words. If “acilnv” matches, then you have all the words that are anagrams. That is, use the same letters. Quite clever really.

Then there is one that is quite elegant using primes. Basically instead of the canonical form being the sorted order, it assigns a prime number to each letter. Then when you multiply, that number will be unique since it is a product of primes, if the numbers are the same then the word has to be the same. Wow, people like “falbdablet”:http://www.gtoal.com/wordgames/anagrams/flabdablet/anagram.c are sure smart!

So the usage if you like Unix (or Mac OS X which is the same is):

bq. canonize < dictionary | sort | gather

The first command, takes an entire text file and turns into the canonical form that looks like “rich=chir” and then sort sorts the canonical alphabetically. Then gather just finds all the identical words. You could also use fgrep to find for instance all the canonical matches.

We worked on a new program “search.c” that opens a dictionary that is canonical and sorted and looks for all anagrams. The Unix V7 English list is 24,000 words, but to find some really great word lists, “net-comber.com”:http://www.net-comber.com/wordurls.html has a good word list. The main problem is that we really just want an uncluttered list. So “outpost9.com”:http://www.outpost9.com/files/WordLists.html has a dictionary and also things that include proper names and abbreviations. The “crosswordman.com”:http://www.crosswordman.com/wordlist.html is what is allowed by Scrabble or simple “zip”:http://personal.riverusers.com/~thegrendel/ format.

The only small problem is that this is a DOS file format, so you need a small tool like “tofrodos”:http://www.thefreecountry.com/tofrodos/ to get rid of the CRLF that is in DOS files vs. just the CR that is Unix files.