Robert -

Okay. Here is the LM build script, lm2a.bat, which uses the CMU-Cambridge LM toolkit (v2) to build a language model. The script assumes that the LM toolkit binaries are on your path.

The script invokes two Python scripts (attached): one filters the word list by the CMU dictionary, and the other creates a dict file from the vocab file and the CMU dictionary. The two scripts use different versions of the CMU dictionary, which they find in c:/work/cmudict/c06d and c:/work/cmudict/cmudict.06d, respectively. cmudict.06d is the version of the dictionary "post stress reduction", which is (more) appropriate for continuous speech recognition. They could probably both use the latter file.

You will also need the file context.ccs, although I'm no longer sure it was doing anything useful. Watch out for the "echo" commands at lines 45 and 47 - win32 cmd echo is slightly different from /bin/csh's.

The output files are lm2a.vocab, lm2a.dict0, and the ARPA-format LM file written by idngram2lm (which can be gzipped). Intermediate files (which are not deleted) include lm2a.wfreq, lm2a2.wfreq, lm2a.id3gram.ascii, lm2a.eval-small, and lm2a.eval-heldback. These last two files contain perplexity measurements on the files small.out and heldback.out.

The perplexity is the "average branching factor" of the language model (when measured on test data which was "held back", i.e. not included in the training data); higher is worse. 1-gram LMs typically measure a perplexity of 1000-2000. 2-gram LMs typically get a perplexity between 250 and . 3-gram LMs have perplexity between 150 and 250.

NOTE, however, that a domain-specific LM built from training data taken exclusively from, say, the medical domain might have a very high perplexity when tested on text taken from a different domain (say, legal text).

The perplexity on training data should be less than the perplexity on held-back data.
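
To make the "average branching factor" reading concrete, here is a minimal Python 3 sketch. It is not part of the toolkit, and the function name `perplexity` is just an illustrative choice. It assumes you already have the probability the model assigned to each word of the test text, and computes the perplexity as the exponential of the average negative log probability. A model that spreads its probability evenly over N words at every position comes out at a perplexity of N.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test text, given the model's probability for each word.

    Hypothetical helper for illustration: word_probs holds one value
    P(w_i | history) per word of the test text, each strictly greater than 0.
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that is equally unsure among 1000 words at every position.
uniform = perplexity([1.0 / 1000] * 50)

# A model that is usually confident but occasionally badly surprised.
mixed = perplexity([0.5, 0.5, 0.5, 0.001])
```

One badly predicted word pulls the second figure well above 2, which is one reason text from an unfamiliar domain can measure so poorly.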
The difference between the two gives you an estimate of how much you have overfit the LM to your training data. A large gap typically means that you don't have enough LM training data. A "good fit" is an advantage (on training data) of 10-20 in (absolute) perplexity.

Perplexity is a statistical measure. A good experiment is to pick 10 different texts (or groups of texts) and measure the perplexity on each. Or choose some elementary texts, some high-school texts, some college-level texts, and some graduate-level texts, and compare the perplexity.

--- Jonathan

FYI, I just built this using my instructions, but I didn't prepare any held-back data. Testing on Shakespeare (etext00/00ws110.txt) gives me a perplexity of 656, while testing on The Zeppelin's Passenger, by Oppenheim (text99/zplnp10.txt) gives a perplexity of 160. These are both "cheating" measurements (measuring the performance of any machine learning system on its training data is generally referred to as "cheating"). The lesson I learn from this is DON'T BUILD A LM USING SHAKESPEARE!

Also, I am attaching a new version of guttok with the debugging code conditionalized away.

# 18-Dec-02
# IF it's in the CMU Dictionary, then output it (in all upper case)
# ELSE output it as normal?
############################################################
import sys
import string
import xreadlines
############################################################
# Read the CMU dictionary
cmudictwords = {}
for l in open("c:/work/cmudict/c06d").readlines():
    if l[:3] == "## ": continue # Skip comment lines
    this_word = string.split(l," ",2)[0]
    cmudictwords[this_word] = this_word
############################################################
#
# read words
for l in xreadlines.xreadlines(sys.stdin):
    l = string.strip(l)
    # l should contain: TOUCHE 1
    (word, freq) = string.split(l)
    if cmudictwords.get(word):
        print word, freq
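
The filter above relies on `string.split` and the `xreadlines` module, neither of which is available in Python 3. As a rough sketch only, the same filtering step might look like this in Python 3. The function names and the tiny in-memory sample are assumptions made for illustration; the real script reads the dictionary from c:/work/cmudict/c06d and the word frequencies from stdin.

```python
def load_dict_words(lines):
    """Collect the headword (first space-separated field) of each dictionary line."""
    words = set()
    for line in lines:
        if line.startswith("## "):      # skip comment lines, as the original does
            continue
        headword = line.split(" ", 2)[0].strip()
        if headword:
            words.add(headword)
    return words

def filter_wfreq(wfreq_lines, dict_words):
    """Keep only 'WORD FREQ' lines whose WORD is in the dictionary."""
    kept = []
    for line in wfreq_lines:
        parts = line.split()
        if len(parts) != 2:             # skip blank or malformed lines
            continue
        word, freq = parts
        if word in dict_words:
            kept.append("%s %s" % (word, freq))
    return kept

# Made-up sample standing in for the dictionary and the .wfreq file.
sample_dict = ["## comment line", "TOUCHE  T UW SH EY", "ZEBRA  Z IY B R AH"]
sample_wfreq = ["TOUCHE 1", "XYZZY 7", "ZEBRA 3"]
filtered = filter_wfreq(sample_wfreq, load_dict_words(sample_dict))
```

Unlike the original, this sketch silently skips lines that do not have exactly two fields instead of raising an error on them.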

# 18-Dec-02
# Filter the CMU dictionary (after removing stress) to just have the prons we want for our app
# TODO: Better multiple prons handling !?!?!?
# The bug is that because the dict is sorted ALPHA, we can't cut off multiple prons as well as we'd like...
# TODO: Write error messages to STDERR !!!???
# 21-Dec-02 added hack including multiple prons for this_word[:3] == "THE"
# 21-Dec-02 added error messages...
# 21-Dec-02 fixed split logic - long words like COUNTERREVOLUTIONARY don't have 2 spaces after them.
############################################################
import re
import sys
import string
import xreadlines
############################################################
# Read the vocab file
words = {}
def main():
    # ASSUMES argv[1] is filename <foo>.vocab containing a wordlist
    for l in open(sys.argv[1]).readlines():
        if l[:2] == "##": continue # Skip comment lines
        this_word = string.strip(l)
        words[this_word] = this_word
main()
############################################################
# Read the CMU dictionary (POST STRESS REDUCTION!)
altpron = re.compile( "(.*)\(([0-9]*)\)" )
nwords = 0
wordindex = {} # number of prons seen so far for this word
for l in open("c:/work/cmudict/cmudict.06d").readlines():
    if l[:2] == "##": continue # Skip comment lines
    this_word = string.split(l," ",2)[0] # Yes, *MOST* words have 2 spaces after them BUT NOT ALL!!!
    # Ah. Could be FOO or FOO(2)
    m = altpron.match(this_word)
    if m and (nwords < 50000 or this_word[:3] == "THE"):
        this_word = m.group(1)
        # print this_word
        this_index = int( m.group(2) )
        if this_index != wordindex.get(this_word,0) + 1:
            print "ERROR: %s has index %d, expected %d" % ( string.strip(l), this_index, wordindex.get(this_word,0) + 1 )
    if words.get(this_word):
        print string.strip(l)
        nwords = nwords + 1
        wordindex[this_word] = wordindex.get(this_word,0) + 1
############################################################
# Final consistency check:
for word in words.keys():
    if not wordindex.has_key(word):
        print "ERROR: %s not in dictionary" % word

[...] We need your donations.' )
ENDMID = re.compile( '(.*) \[Etext #([1-9][0-9A-Za-z]*)\]' )
ENDMID2 = re.compile( '(.*) \[Etext #(.*)' )
ENDSMALLPRINT = re.compile( '\*END\*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )
ENDSMALLPRINT2 = re.compile( '\*END THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )
END2 = re.compile( '\*END\*' )
AFTERHEADEREND = re.compile( '[=-]*' )
# Note that (?: ... ) just serves to group the enclosed regexp
# GOODEND = "End of [the] [this] [Project Gutenberg['s]] [etext] [of] ..."
# BUT we really want at least Project Gutenberg or etext..
# SO we separate it out into 2 !
GOODEND = re.compile( 'End of (?:[Tt]he )?(?:this )?Project Gutenberg(?:\'s)? (?:[Ee]text(?:,)? )?(?:of )?(.*)' )
GOOD2END = re.compile( 'End of (?:[Tt]he )?(?:this )?(?:Project Gutenberg(?:\'s)? )?[Ee]text(?:,)? (?:of )?(.*)' )
BADEND = re.compile( 'End of (.*)' )

INHEADERSTATE = 1
INMIDSTATE = 2
POSTMIDSTATE = 3
INDOCSTATE = 4
ENDSTATE = 5

pgidhash = {}

def scanfile(filename):
    if debug: print "Scanning", filename
    f = open(filename)
    tokenize2.begin_document(filename)
    state = INHEADERSTATE
    gotDOCTYPE = 0
    gotHTML = 0
    nMidLines = 0
    bestTitle = ""
    nBLANKLINES = 0
    for l in xreadlines(f):
        l = string.strip(l)
        if l == '':
            nBLANKLINES = nBLANKLINES + 1
            continue
        # print l
        if DOCTYPE.match(l):
            gotDOCTYPE = 1
            if debug: print filename, "DOCTYPE"
        if HTML.match(l):
            gotHTML = 1
            if debug: print filename, "HTML"
        #
        if state == INHEADERSTATE:
            if MIDRE.match(l):
                state = INMIDSTATE
                nMidLines = 0
        elif state == INMIDSTATE:
            if ENDMID.match(l):
                m = ENDMID.match(l)
                if debug: print filename, "END MID", m.group(1), "#", m.group(2)
                if pgidhash.get( m.group(2) ):
                    if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
                else:
                    pgidhash[ m.group(2) ] = filename
                state = POSTMIDSTATE
            elif ENDMID2.match(l):
                m = ENDMID2.match(l)
                if debug: print filename, "END MID", m.group(1), "#", m.group(2)
                if pgidhash.get( m.group(2) ):
                    if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
                else:
                    pgidhash[ m.group(2) ] = filename
                state = POSTMIDSTATE
            else:
                if debug: print filename, "MID...", l
                if nMidLines == 0:
                    bestTitle = l
                nMidLines = nMidLines + 1
                if nMidLines > 20:
                    if debug: print filename, "END MID ???"
                    state = POSTMIDSTATE
        #
        elif state == INDOCSTATE:
            if GOODEND.match(l):
                if debug: print filename, "good end match", GOODEND.match(l).group(1)
                state = ENDSTATE
            elif GOOD2END.match(l):
                if debug: print filename, "good2 end match", GOOD2END.match(l).group(1)
                state = ENDSTATE
            elif BADEND.match(l) and nBLANKLINES > 2:
                if len(bestTitle) > 10 and string.upper( BADEND.match(l).group(1)[:10] ) == string.upper( bestTitle[:10] ):
                    if debug: print filename, "good3 end match", BADEND.match(l).group(1)
                else:
                    if debug: print filename, "bad end match", BADEND.match(l).group(1)
                state = ENDSTATE
            # the trick is: if we are in INDOCSTATE
            # AND this line doesn't trigger ENDSTATE
            # THEN we tokenize it!!!
            if state == INDOCSTATE:
                tokenize2.tokenize(l)
        #
        elif state == POSTMIDSTATE:
            if ENDSMALLPRINT.match(l):
                if debug: print filename, "ENDSMALLPRINT"
                state = INDOCSTATE
            elif ENDSMALLPRINT2.match(l):
                if debug: print filename, "ENDSMALLPRINT"
                state = INDOCSTATE
            elif END2.match(l):
                if debug: print filename, "END2"
                state = INDOCSTATE
        nBLANKLINES = 0
    tokenize2.end_document(filename)

############################################################
#
# Main code
for l in open( 'c:/work/gutenberg/texts.txt' ).readlines():
    l = string.strip(l)
    if l == '' or l[0] == '#': continue  # skip blank lines and comment lines
    else:
        scanfile(l)
# sys.exit(1)
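
The core idea of the scanner above is a small state machine: skip everything up to the end of the "small print", keep lines while inside the document, and stop at the "End of ... Project Gutenberg ..." line. Here is a simplified Python 3 sketch of that idea with three states instead of five. The names `body_lines`, `END_SMALL_PRINT` and `GOOD_END` and the sample text are assumptions for illustration, and the end-of-text pattern is adapted from GOODEND with the spacing I believe was intended.

```python
import re

# Simplified stand-ins for the END2 and GOODEND patterns.
END_SMALL_PRINT = re.compile(r"\*END\*")
GOOD_END = re.compile(
    r"End of (?:[Tt]he )?(?:this )?Project Gutenberg(?:'s)? "
    r"(?:[Ee]text(?:,)? )?(?:of )?(.*)"
)

IN_HEADER, IN_DOC, ENDED = 1, 2, 3

def body_lines(lines):
    """Return the non-blank lines between the small print and the end marker."""
    state = IN_HEADER
    body = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if state == IN_HEADER:
            if END_SMALL_PRINT.match(line):
                state = IN_DOC
        elif state == IN_DOC:
            if GOOD_END.match(line):
                state = ENDED
            else:
                body.append(line)   # the real script tokenizes the line here
    return body

# Made-up sample document.
sample = [
    "The Project Gutenberg Etext of Hamlet",
    "*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*",
    "",
    "ACT I",
    "Who's there?",
    "End of the Project Gutenberg Etext of Hamlet",
    "trailing junk",
]
body = body_lines(sample)
```

The sketch leaves out the title tracking, the duplicate-etext check and the fallback "End of ..." match that the full script uses.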

REM e.g. python some other files >HELDBACK.OUT
REM NOTE: it is crucial that <some other files> be "held-back data" not included in guttok.out.
tail -10000 c:\work\gutenberg\guttok.out >heldback.out
cat c:\work\gutenberg\guttok.out | text2wfreq >lm2a.wfreq
REM filter lm2a.wfreq by cmu dict...
python <lm2a.wfreq >lm2a2.wfreq
wfreq2vocab -top 60000 <lm2a2.wfreq >lm2a.vocab
cat c:\work\gutenberg\guttok.out | text2idngram -n 3 -vocab lm2a.vocab -temp c:\temp -write_ascii >lm2a.id3gram.ascii
idngram2lm -idngram lm2a.id3gram.ascii -ascii_input -context context.ccs -vocab lm2a.vocab -arpa -n 5 -good_turing
python lm2a.vocab > lm2a.dict0
REM Compute perplexity on training data
echo perplexity -text small.out | evallm -arpa >> lm2a.eval-small
REM Compute perplexity on previously unseen test data
echo perplexity -text heldback.out | evallm -arpa >> lm2a.eval-heldback
:END
BEEP
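
As written, the script takes heldback.out from the last 10000 lines of guttok.out but still trains on all of guttok.out, so the "held-back" perplexity appears to be measured on text the model has already seen. A minimal Python 3 sketch of splitting the corpus first is shown here. It assumes one unit of text per line, and the function name `split_heldback` and the sample corpus are hypothetical.

```python
def split_heldback(lines, n_heldback):
    """Split lines into (training, held_back), holding back the last n_heldback lines."""
    if n_heldback <= 0 or n_heldback >= len(lines):
        raise ValueError("n_heldback must be between 1 and len(lines) - 1")
    return lines[:-n_heldback], lines[-n_heldback:]

# Made-up stand-in for the tokenized corpus (one sentence per line).
corpus = ["sentence %d\n" % i for i in range(100)]
train, heldback = split_heldback(corpus, 10)

# In a real run you might write `train` to a separate training file and
# `heldback` to heldback.out, then point text2wfreq and text2idngram at the
# training file rather than at the full corpus.
```

Holding back the tail is the simplest split; if the corpus is ordered by author or topic, the tail may not be representative of the rest.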