Robert -

Okay. Here is the LM build script, lm2a.bat, which uses the CMU-Cambridge LM toolkit (v2) to build a language model. The script assumes that the LM toolkit binaries are on your path.

The script invokes these Python scripts (attached): refreq.py (filters the word list by the CMU dictionary) and redict.py (creates a dict file from the vocab file and the CMU dictionary). Both scripts use (different versions of) the CMU dictionary, which they find in c:/work/cmudict/c06d and c:/work/cmudict/cmudict.06d, respectively. cmudict.06d is the version of the dictionary "post stress reduction", which is (more) appropriate for continuous speech recognition. They could probably both use the latter file.

You will also need the file context.ccs, although I'm no longer sure it is doing anything useful.

Watch out for the "echo" commands at lines 45 and 47 of the script - win32 cmd echo is slightly different from /bin/csh's.

The output files are lm2a.vocab, lm2a.dict0, and lm2a60k.5.5.arpa (which can be gzipped). Intermediate files (which are not deleted) include lm2a.wfreq, lm2a2.wfreq, lm2a.id3gram.ascii, lm2a.eval-small, and lm2a.eval-heldback. The last two files contain perplexity measurements on the files small.out and heldback.out.

The perplexity is the "average branching factor" of the language model, when measured on test data that was "held back" (i.e. not included in the training data); higher is worse. 1-gram LMs typically measure a perplexity of 1000-2000. 2-gram LMs typically get a perplexity upwards of 250. 3-gram LMs have perplexity between 150 and 250. NOTE, however, that a domain-specific LM built from training data taken exclusively from, say, the medical domain might have a very high perplexity when tested on text taken from a different domain (say, legal text).

The perplexity on training data should be less than the perplexity on held-back data. The difference gives you an estimate of how much you have overfit the LM to your training data - typically, a large gap means that you don't have enough LM training data. A "good fit" is an advantage (on training data) of 10-20 in (absolute) perplexity.

Perplexity is a statistical measure. A good experiment is to pick 10 different texts (or groups of texts) and measure the perplexity on each. Or choose some elementary texts, some high-school texts, some college-level texts, and some graduate-level texts, and compare the perplexities.

--- Jonathan

FYI, I just built this using my instructions, but I didn't prepare any held-back data. Testing on Shakespeare (etext00/00ws110.txt) gives me a perplexity of 656, while testing on The Zeppelin's Passenger, by Oppenheim (text99/zplnp10.txt) gives a perplexity of 160. These are both "cheating" measurements (measuring the performance of any machine learning system on its own training data is generally referred to as "cheating"). The lesson I learn from this is: DON'T BUILD AN LM USING SHAKESPEARE!

Also, I am attaching a new version of guttok with the debugging code conditionalized away.
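For concreteness, here is what the "average branching factor" reading of perplexity amounts to - a minimal sketch I'm adding for illustration, not part of the toolkit or the attached scripts. It assumes you already have the per-word conditional probabilities p(w_i | history) that the LM assigned to a test text:

import math

def perplexity(probs):
    # perplexity = 2 ** ( -(1/N) * sum_i log2 p(w_i | history) )
    logsum = 0.0
    for p in probs:
        logsum = logsum + math.log(p, 2)
    return 2.0 ** (-logsum / len(probs))

# A model that always faces a uniform 10-way choice among next words
# measures a perplexity of exactly 10 (modulo float rounding):
print(perplexity([0.1] * 100))   # -> 10.0

In practice evallm (below) does this computation for you; the point of the sketch is just that "perplexity 160" means the model is, on average, as uncertain as a uniform 160-way choice at each word.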
[Attachment: refreq.py]

# 18-Dec-02
# IF it's in the CMU Dictionary, then output it (in all upper case)
# ELSE output it as normal?
############################################################
import sys
import string
import xreadlines
############################################################
#
# Read the CMU dictionary
cmudictwords = {}
for l in open("c:/work/cmudict/c06d").readlines():
    if l[:3] == "## ": continue   # Skip comment lines
    this_word = string.split(l, " ", 2)[0]
    cmudictwords[this_word] = this_word
############################################################
#
# read words
for l in xreadlines.xreadlines(sys.stdin):
    l = string.strip(l)
    # l should contain: TOUCHE 1
    (word, freq) = string.split(l)
    if cmudictwords.get(word):
        print word, freq

[Attachment: redict.py]

# 18-Dec-02
# Filter the CMU dictionary (after removing stress) to just have the prons we want for our app
# TODO: Better handling of multiple prons !?!?!?
#       The bug is that because the dict is sorted ALPHA, we can't cut off multiple prons as well as we'd like...
# TODO: Write error messages to STDERR !!!???
# 21-Dec-02 added hack including multiple prons for this_word[:3] == "THE"
# 21-Dec-02 added error messages...
# 21-Dec-02 fixed split logic - long words like COUNTERREVOLUTIONARY don't have 2 spaces after them.
############################################################
import re
import sys
import string
import xreadlines
############################################################
#
# Read the vocab file
words = {}
def main():
    # ASSUMES argv[1] is a filename <foo>.vocab containing a wordlist
    for l in open(sys.argv[1]).readlines():
        if l[:2] == "##": continue   # Skip comment lines
        this_word = string.strip(l)
        words[this_word] = this_word
main()
############################################################
#
# Read the CMU dictionary (POST STRESS REDUCTION!)
altpron = re.compile( "(.*)\(([0-9]*)\)" )
nwords = 0
wordindex = {}   # number of prons seen for this word
for l in open("c:/work/cmudict/cmudict.06d").readlines():
    if l[:2] == "##": continue   # Skip comment lines
    this_word = string.split(l, " ", 2)[0]
    # Yes, *MOST* words have 2 spaces after them BUT NOT ALL!!!
    # Ah. Could be FOO or FOO(2)
    m = altpron.match(this_word)
    if m and (nwords < 50000 or this_word[:3] == "THE"):
        this_word = m.group(1)
        # print this_word
        this_index = int( m.group(2) )
        if this_index != wordindex.get(this_word, 0) + 1:
            print "ERROR: %s has index %d, expected %d" % ( string.strip(l), this_index, wordindex.get(this_word, 0) + 1 )
    if words.get(this_word):
        print string.strip(l)
        nwords = nwords + 1
        wordindex[this_word] = wordindex.get(this_word, 0) + 1
############################################################
#
# Final consistency check:
for word in words.keys():
    if not wordindex.has_key(word):
        print "ERROR: %s not in dictionary" % word
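A note on the two dictionary versions the scripts read: in the standard cmudict line format ("WORD  PH1 PH2 ...", with alternate pronunciations written WORD(2), WORD(3), ...), the pre-reduction file carries stress digits on the vowels (EY1, AH0, ...), and "stress reduction" just drops those digits. A rough sketch of that transformation - reduce_stress here is illustrative, not one of the attached scripts, and it assumes that line format:

import re

def reduce_stress(line):
    # "TOUCHE  T UW1 SH EY1"  ->  "TOUCHE  T UW SH EY"
    word, pron = line.split(None, 1)
    return word + "  " + re.sub(r"([A-Z]+)[0-2]", r"\1", pron)

print(reduce_stress("TOUCHE  T UW1 SH EY1"))

[Attachment: context.ccs - application/octet-stream; raw content not preserved in this archive]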
[Attachment: guttok.py]

############################################################
#
# IMPORTS
import os
import re
from xreadlines import xreadlines
import string
import sys
import tokenize2

TRUE = 1
FALSE = 0
debug = FALSE

############################################################
#
# Regular expressions
DOCTYPE = re.compile( '<!DOCTYPE html PUBLIC "-//IETF//DTD.*' )
HTML = re.compile( '<HTML>' )
MIDRE = re.compile( 'further information is included below. We need your donations.' )
ENDMID = re.compile( '(.*) \[Etext #([1-9][0-9A-Za-z]*)\]' )
ENDMID2 = re.compile( '(.*) \[Etext #(.*)' )
ENDSMALLPRINT = re.compile( '\*END\*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )
ENDSMALLPRINT2 = re.compile( '\*END THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )
END2 = re.compile( '\*END\*' )
AFTERHEADEREND = re.compile( '[=-]*' )
# Note that (?: ... ) just serves to group the enclosed regexp
# GOODEND = "End of [the] [this] [Project Gutenberg['s]] [etext] [of] ..."
# BUT we really want at least Project Gutenberg or etext..
# SO we separate it out into 2 !
GOODEND = re.compile( 'End of (?:[Tt]he )?(?:this )?Project Gutenberg(?:\'s)? (?:[Ee]text(?:,)? )?(?:of )?(.*)' )
GOOD2END = re.compile( 'End of (?:[Tt]he )?(?:this )?(?:Project Gutenberg(?:\'s)? )?[Ee]text(?:,)? (?:of )?(.*)' )
BADEND = re.compile( 'End of (.*)' )

INHEADERSTATE = 1
INMIDSTATE = 2
POSTMIDSTATE = 3
INDOCSTATE = 4
ENDSTATE = 5

pgidhash = {}

def scanfile(filename):
    if debug: print "Scanning", filename
    f = open(filename)
    tokenize2.begin_document(filename)
    state = INHEADERSTATE
    gotDOCTYPE = 0
    gotHTML = 0
    nMidLines = 0
    bestTitle = ""
    nBLANKLINES = 0
    for l in xreadlines(f):
        l = string.strip(l)
        if l == '':
            nBLANKLINES = nBLANKLINES + 1
            continue
        # print l
        if DOCTYPE.match(l):
            gotDOCTYPE = 1
            if debug: print filename, "DOCTYPE"
        if HTML.match(l):
            gotHTML = 1
            if debug: print filename, "HTML"
        #
        if state == INHEADERSTATE:
            if MIDRE.match(l):
                state = INMIDSTATE
                nMidLines = 0
        elif state == INMIDSTATE:
            if ENDMID.match(l):
                m = ENDMID.match(l)
                if debug: print filename, "END MID", m.group(1), "#", m.group(2)
                if pgidhash.get( m.group(2) ):
                    if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
                else:
                    pgidhash[ m.group(2) ] = filename
                state = POSTMIDSTATE
            elif ENDMID2.match(l):
                m = ENDMID2.match(l)
                if debug: print filename, "END MID", m.group(1), "#", m.group(2)
                if pgidhash.get( m.group(2) ):
                    if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
                else:
                    pgidhash[ m.group(2) ] = filename
                state = POSTMIDSTATE
            else:
                if debug: print filename, "MID...", l
                if nMidLines == 0:
                    bestTitle = l
                nMidLines = nMidLines + 1
                if nMidLines > 20:
                    if debug: print filename, "END MID ???"
                    state = POSTMIDSTATE
        #
        elif state == INDOCSTATE:
            if GOODEND.match(l):
                if debug: print filename, "good end match", GOODEND.match(l).group(1)
                state = ENDSTATE
            elif GOOD2END.match(l):
                if debug: print filename, "good2 end match", GOOD2END.match(l).group(1)
                state = ENDSTATE
            elif BADEND.match(l) and nBLANKLINES > 2:
                if len(bestTitle) > 10 and string.upper( BADEND.match(l).group(1)[:10] ) == string.upper( bestTitle[:10] ):
                    if debug: print filename, "good3 end match", BADEND.match(l).group(1)
                else:
                    if debug: print filename, "bad end match", BADEND.match(l).group(1)
                state = ENDSTATE
            # the trick is: if we are in INDOCSTATE
            # AND this line doesn't trigger ENDSTATE
            # THEN we tokenize it!!!
            if state == INDOCSTATE:
                tokenize2.tokenize(l)
        #
        elif state == POSTMIDSTATE:
            if ENDSMALLPRINT.match(l):
                if debug: print filename, "ENDSMALLPRINT"
                state = INDOCSTATE
            elif ENDSMALLPRINT2.match(l):
                if debug: print filename, "ENDSMALLPRINT"
                state = INDOCSTATE
            elif END2.match(l):
                if debug: print filename, "END2"
                state = INDOCSTATE
        nBLANKLINES = 0
    tokenize2.end_document(filename)

############################################################
#
# Main code
for l in open( 'c:/work/gutenberg/texts.txt' ).readlines():
    l = string.strip(l)
    if l == '' or l[0] == '#':   # skip blank lines and comment lines
        continue
    else:
        scanfile(l)
# sys.exit(1)
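guttok.py imports tokenize2, which isn't attached to this message. Judging from the calls above, it exposes begin_document(), tokenize(), and end_document(). A minimal stand-in with that interface - my placeholder, not the real module, which presumably does proper sentence splitting and punctuation handling - would be something like:

# tokenize2.py -- placeholder with the interface guttok.py expects.
# Writes each document line to stdout as uppercased, whitespace-split
# tokens, which is roughly the form text2wfreq wants to count.

def begin_document(filename):
    pass

def tokenize(line):
    print(" ".join(line.upper().split()))

def end_document(filename):
    pass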
[Attachment: lm2a.bat]

REM e.g. python tokenize2.py some other files >HELDBACK.OUT
REM NOTE: it is crucial that <some other files> be "held-back data" not included in guttok.out.
tail -10000 c:\work\gutenberg\guttok.out >heldback.out
cat c:\work\gutenberg\guttok.out | text2wfreq >lm2a.wfreq
REM filter lm2a.wfreq by cmu dict...
python refreq.py <lm2a.wfreq >lm2a2.wfreq
wfreq2vocab -top 60000 <lm2a2.wfreq >lm2a.vocab
cat c:\work\gutenberg\guttok.out | text2idngram -n 3 -vocab lm2a.vocab -temp c:\temp -write_ascii >lm2a.id3gram.ascii
idngram2lm -idngram lm2a.id3gram.ascii -ascii_input -context context.ccs -vocab lm2a.vocab -arpa lm2a60k.5.5.arpa -n 5 -good_turing
python redict.py lm2a.vocab > lm2a.dict0
REM Compute perplexity on training data
echo perplexity -text small.out | evallm -arpa lm2a60k.5.5.arpa >> lm2a.eval-small
REM Compute perplexity on previously unseen test data
echo perplexity -text heldback.out | evallm -arpa lm2a60k.5.5.arpa >> lm2a.eval-heldback
:END
BEEP
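Once lm2a60k.5.5.arpa is built, a quick sanity check is to read the n-gram counts out of its \data\ header; "ngram N=count" lines there are part of the standard ARPA format. A small sketch, assuming the file hasn't been gzipped yet (ngram_counts is my illustration, not a toolkit function):

def ngram_counts(path):
    # Parse the "ngram N=count" lines from the \data\ section.
    counts = {}
    for line in open(path):
        line = line.strip()
        if line.startswith("ngram "):
            order, count = line[6:].split("=")
            counts[int(order)] = int(count)
        elif line.startswith("\\1-grams:"):
            break   # past the \data\ header; stop reading
    return counts

print(ngram_counts("lm2a60k.5.5.arpa"))   # e.g. {1: 60000, 2: ..., 3: ...}

The 1-gram count should come out at (or just under) the 60000 passed to wfreq2vocab above; if it doesn't, the vocab filtering step probably went wrong.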