
Re: Python tokenizing scripts (was Building LMs on our own)




Robert -

Okay.  Here is the LM build script, lm2a.bat, which uses the CMU-Cambridge
LM toolkit (v2) to build a language model.  The script assumes that the LM
toolkit binaries are on your path.

The script invokes these Python scripts (attached): refreq.py (filter the
word list by the CMU dictionary) and redict.py (create a dict file from the
vocab file and the CMU dictionary).  The scripts use different versions of
the CMU dictionary, which they find in c:/work/cmudict/c06d and
c:/work/cmudict/cmudict.06d, respectively.  cmudict.06d is the version of
the dictionary "post stress reduction", which is more appropriate for
continuous speech recognition.  Both scripts could probably use the latter
file.
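For illustration, the filtering step in refreq.py boils down to a
dictionary-membership test on word/frequency lines; here is a minimal
Python 3 sketch of that idea (the three-word dictionary is a made-up
stand-in for the real CMU dictionary):

```python
# Minimal sketch of refreq.py's filtering step: keep only the
# word-frequency lines whose word appears in the pronunciation
# dictionary.  The tiny word set below is a made-up stand-in for
# the real CMU dictionary.
cmudictwords = {"THE", "CAT", "TOUCHE"}

def filter_wfreq(lines, dictwords):
    """Yield 'WORD FREQ' lines whose WORD is in dictwords."""
    for line in lines:
        word, freq = line.split()
        if word in dictwords:
            yield "%s %s" % (word, freq)

wfreq = ["THE 10", "DOG 2", "CAT 5"]
print(list(filter_wfreq(wfreq, cmudictwords)))
```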

You will also need the file context.ccs, although I'm no longer sure it was
doing anything useful.

Watch out for the "echo" commands at lines 45 and 47 - win32 cmd's echo is
slightly different from /bin/csh's.

This script builds a language model.  The output files are lm2a.vocab,
lm2a.dict0, and lm2a60k.5.5.arpa (which can be gzipped).  Intermediate files
(which are not deleted) include lm2a.wfreq, lm2a2.wfreq, lm2a.id3gram.ascii,
and lm2a.eval-small and lm2a.eval-heldback.  These last two files contain
perplexity measurements on the files small.out and heldback.out.

The perplexity is the "average branching factor" of the language model (when
measured on test data which was "held back", i.e. not included in the
training data); higher is worse.  1-gram LMs typically measure a perplexity
of 1000-2000.  2-gram LMs typically get a perplexity of 250 or more;
3-gram LMs have perplexity between 150 and 250.  NOTE, however, that a
domain-specific LM built from training data taken exclusively from, say, the
medical domain might have a very high perplexity when tested on text taken
from a different domain (say legal text).
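Concretely, perplexity is the exponential of the average negative
log-probability the model assigns to the test words; a minimal sketch
(the toy probabilities are made up):

```python
import math

def perplexity(probs):
    """Perplexity of a word sequence given each word's model probability.

    Defined as exp(-(1/N) * sum(log p_i)).  A uniform model over a
    V-word vocabulary has perplexity exactly V, which is why perplexity
    is read as an "average branching factor".
    """
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A uniform model over 1000 words has perplexity 1000,
# regardless of how many test words we score.
print(round(perplexity([1.0 / 1000] * 5)))
```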

The perplexity on training data should be less than the perplexity on
held-back data.  The difference gives you an estimate of how much you have
overfit the LM to your training data; a large gap typically means that you
don't have enough LM training data...  A "good fit" is an advantage (on
training data) of 10-20 in (absolute) perplexity.
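To see the train/held-back gap on a toy scale, here is a sketch that fits
an add-one-smoothed unigram model (a stand-in for the toolkit's real
n-gram estimators) and measures perplexity on both sets:

```python
import math
from collections import Counter

def unigram_perplexity(train, test):
    """Perplexity of `test` under an add-one-smoothed unigram model
    estimated from `train`.  A toy stand-in for the toolkit's n-gram
    models -- just enough to show the train/held-back gap."""
    counts = Counter(train)
    vocab = set(train) | set(test)
    total = len(train) + len(vocab)          # add-one denominator
    logp = sum(math.log((counts[w] + 1) / total) for w in test)
    return math.exp(-logp / len(test))

train = ["a", "a", "a", "b"]
heldback = ["a", "b", "c"]      # contains a word never seen in training
# Perplexity on training data comes out lower than on held-back data.
print(unigram_perplexity(train, train) < unigram_perplexity(train, heldback))
```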

Perplexity is a statistical measure.  A good experiment is to pick 10
different texts (or groups of texts) and measure the perplexity on each.  Or
choose some elementary texts, some high-school texts, some college-level
texts, and some graduate-level texts, and compare the perplexity.

--- Jonathan

FYI, I just built this using my instructions, but I didn't prepare any
held-back data.  Testing on Shakespeare (etext00/00ws110.txt) gives me a
perplexity of 656, while testing on The Zeppelin's Passenger, by Oppenheim
(text99/zplnp10.txt) gives a perplexity of 160.  These are both "cheating"
measurements (measuring performance of any machine learning system on
training data is generally referred to as "cheating").  The lesson I learn
from this is DON'T BUILD A LM USING SHAKESPEARE!

Also, I am attaching a new version of guttok with the debugging code
conditionalized away.

Content-Type: text/plain;  name="refreq.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;  filename="refreq.py"

# 18-Dec-02

# IF it's in the CMU Dictionary, then output it (in all upper case)
# ELSE drop it

############################################################

import sys
import string
import xreadlines

############################################################

# Read the CMU dictionary

cmudictwords = {}

for l in open("c:/work/cmudict/c06d").readlines():

	if l[:3] == "## ":
		continue	# Skip comment lines

	this_word = string.split(l,"  ",2)[0]

	cmudictwords[this_word] = this_word

############################################################
#

# read words

for l in xreadlines.xreadlines(sys.stdin):
	
	l = string.strip(l)

	# l should contain: TOUCHE 1

	(word, freq) = string.split(l)

	if cmudictwords.get(word):

		print word, freq

Content-Type: text/plain;  name="redict.py"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;  filename="redict.py"

# 18-Dec-02

# Filter the CMU dictionary (after removing stress) to just have the prons we want for our app

# TODO: Better multiple prons handling !?!?!?
# 	The bug is that because the dict is sorted ALPHA, we can't cut off multiple prons as well as we'd like...
# TODO: Write error messages to STDERR !!!???
# 21-Dec-02 added hack including multiple prons for this_word[:3] == "THE"
# 21-Dec-02 added error messages...
# 21-Dec-02 fixed split logic - long words like COUNTERREVOLUTIONARY don't have 2 spaces after them.

############################################################

import re
import sys
import string
import xreadlines

############################################################

# Read the vocab file

words = {}

def main():

	# ASSUMES argv[1] is filename <foo>.vocab containing a wordlist

	for l in open(sys.argv[1]).readlines():

		if l[:2] == "##":
			continue	# Skip comment lines

		this_word = string.strip(l)

		words[this_word] = this_word

main()

############################################################

# Read the CMU dictionary (POST STRESS REDUCTION!)

altpron = re.compile( "(.*)\(([0-9]*)\)" )

nwords = 0

wordindex = {}	# number of prons seen for this word

for l in open("c:/work/cmudict/cmudict.06d").readlines():

	if l[:2] == "##":
		continue	# Skip comment lines

	this_word = string.split(l," ",2)[0]	# Yes, *MOST* words have 2 spaces after them BUT NOT ALL!!!

	# Ah.  Could be FOO or FOO(2)
	m = altpron.match(this_word)

	if m and (nwords < 50000 or this_word[:3] == "THE"):
		this_word = m.group(1)
		# print this_word

		this_index = int( m.group(2) )

		if this_index != wordindex.get(this_word,0) + 1:
			print "ERROR: %s has index %d, expected %d" % ( string.strip(l), this_index, wordindex.get(this_word,0) + 1 )

	if words.get(this_word):
		print string.strip(l)
		nwords = nwords + 1

	wordindex[this_word] = wordindex.get(this_word,0) + 1

############################################################

# Final consistency check:

for word in words.keys():

	if not wordindex.has_key(word): 

		print "ERROR: %s not in dictionary" % word

Content-Type: application/octet-stream;  name="context.ccs"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;  filename="context.ccs"

Content-Type: application/octet-stream;  name="lm2a.bat"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;  filename="lm2a.bat"

Content-Type: text/plain;  name="guttok.py"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;  filename="guttok.py"

############################################################
#
# IMPORTS

import os
import re

from xreadlines import xreadlines

import string
import sys

import tokenize2

TRUE = 1
FALSE = 0

debug = FALSE

############################################################
#
# Regular expressions

DOCTYPE = re.compile( '<!DOCTYPE html PUBLIC "-//IETF//DTD.*' )
HTML = re.compile( '<HTML>' )

MIDRE = re.compile( 'further information is included below.  We need your donations.' )

ENDMID = re.compile( '(.*) \[Etext #([1-9][0-9A-Za-z]*)\]' )
ENDMID2 = re.compile( '(.*) \[Etext #(.*)')

ENDSMALLPRINT = re.compile( '\*END\*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )
ENDSMALLPRINT2 = re.compile( '\*END THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS\*Ver\.04\.29\.93\*END\*' )

END2 = re.compile( '\*END\*' )

AFTERHEADEREND = re.compile( '[=-]*' )

# Note that (?: ... ) just serves to group the enclosed regexp
# GOODEND = "End of [the] [this] [Project Gutenberg['s]] [etext] [of] ..."
# BUT we really want at least Project Gutenberg or etext..
# 	SO we separate it out into 2 !
GOODEND = re.compile( 'End of (?:[Tt]he )?(?:this )?Project Gutenberg(?:\'s)? (?: [Ee]text(?:,)? )?(?:of )?(.*)' )
GOOD2END = re.compile( 'End of (?:[Tt]he )?(?:this )?(?:Project Gutenberg(?:\'s)? )?[Ee]text(?:,)? (?:of )?(.*)' )
BADEND = re.compile( 'End of (.*)' )

INHEADERSTATE = 1
INMIDSTATE = 2
POSTMIDSTATE = 3
INDOCSTATE = 4
ENDSTATE = 5

pgidhash = {}

def scanfile(filename):
	if debug: print "Scanning", filename
	f = open(filename)

	tokenize2.begin_document(filename)

	state = INHEADERSTATE
	gotDOCTYPE = 0
	gotHTML = 0
	nMidLines = 0

	bestTitle = ""

	nBLANKLINES = 0

	for l in xreadlines(f):
		l = string.strip(l)
		if l == '':
			nBLANKLINES = nBLANKLINES + 1
			continue
		# print l

		if DOCTYPE.match(l):
			gotDOCTYPE = 1
			if debug: print filename, "DOCTYPE"
		if HTML.match(l):
			gotHTML = 1
			if debug: print filename, "HTML"

		# if state == INHEADERSTATE:
		if MIDRE.match(l):
			state = INMIDSTATE
			nMidLines = 0

		elif state == INMIDSTATE:
			if ENDMID.match(l):
				m = ENDMID.match(l)
				if debug: print filename, "END MID", m.group(1), "#", m.group(2)
				if pgidhash.get( m.group(2) ):
					if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
				else:
					pgidhash[ m.group(2) ] = filename

				state = POSTMIDSTATE
			elif ENDMID2.match(l):
				m = ENDMID2.match(l)
				if debug: print filename, "END MID", m.group(1), "#", m.group(2)
				if pgidhash.get( m.group(2) ):
					if debug: print filename, "DUP =>", pgidhash.get( m.group(2) )
				else:
					pgidhash[ m.group(2) ] = filename

				state = POSTMIDSTATE

			else:
				if debug: print filename, "MID...", l

				if nMidLines == 0:
					bestTitle = l

				nMidLines = nMidLines + 1

				if nMidLines > 20:
					if debug: print filename, "END MID ???"
					state = POSTMIDSTATE

		# elif state == INDOCSTATE:
		if GOODEND.match(l):
				if debug: print filename, "good end match", GOODEND.match(l).group(1)
				state = ENDSTATE		
		elif GOOD2END.match(l):
				if debug: print filename, "good2 end match", GOOD2END.match(l).group(1)
				state = ENDSTATE		
		elif BADEND.match(l) and nBLANKLINES > 2:
				if len(bestTitle) > 10 and string.upper( BADEND.match(l).group(1)[:10] ) == string.upper( bestTitle[:10] ):
					if debug: print filename, "good3 end match", BADEND.match(l).group(1)
				else:
					if debug: print filename, "bad end match", BADEND.match(l).group(1)
				state = ENDSTATE

		# the trick is: if we are in INDOCSTATE 
		# AND this line doesn't trigger ENDSTATE
		# THEN we tokenize it!!!

		if state == INDOCSTATE:
			tokenize2.tokenize(l)

		# elif state == POSTMIDSTATE:
		if ENDSMALLPRINT.match(l):
				if debug: print filename, "ENDSMALLPRINT"
				state = INDOCSTATE
		elif ENDSMALLPRINT2.match(l):
				if debug: print filename, "ENDSMALLPRINT"
				state = INDOCSTATE
		elif END2.match(l):
				if debug: print filename, "END2"
				state = INDOCSTATE

		nBLANKLINES = 0

	tokenize2.end_document(filename)

############################################################
#
# Main code

for l in open( 'c:/work/gutenberg/texts.txt' ).readlines():
	l = string.strip(l)
	if l == '' or l[0] == '#':	# skip blank lines and comments
		continue
	else:
		scanfile(l)
		# sys.exit(1)
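The scanner above is a small state machine over the Gutenberg file layout;
as a simplified sketch of just the state transitions (the marker strings
here are shortened stand-ins for the real regexes in the script):

```python
# Simplified sketch of guttok.py's state machine: each Gutenberg file
# moves header -> mid -> postmid -> doc -> end, and only lines seen in
# the "doc" state get tokenized.  The marker strings are shortened
# stand-ins for the real regexes.
INHEADERSTATE, INMIDSTATE, POSTMIDSTATE, INDOCSTATE, ENDSTATE = range(1, 6)

def scan(lines):
    state = INHEADERSTATE
    tokens = []
    for l in lines:
        if "We need your donations" in l:
            state = INMIDSTATE
        elif state == INMIDSTATE and "[Etext #" in l:
            state = POSTMIDSTATE
        elif state == POSTMIDSTATE and "*END*" in l:
            state = INDOCSTATE
        elif l.startswith("End of the Project Gutenberg"):
            state = ENDSTATE
        elif state == INDOCSTATE:
            tokens.append(l)          # body text: tokenize this line
    return state, tokens

state, tokens = scan([
    "Header junk",
    "We need your donations.",
    "Moby Dick [Etext #15]",
    "*END*",
    "Call me Ishmael.",
    "End of the Project Gutenberg Etext of Moby Dick",
])
print(state == ENDSTATE, tokens)
```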

REM e.g. python tokenize2.py some other files >HELDBACK.OUT
REM NOTE: it is crucial that <some other files> be "held-back data" not included in guttok.out.
tail -10000 c:\work\gutenberg\guttok.out >heldback.out
cat c:\work\gutenberg\guttok.out | text2wfreq >lm2a.wfreq
REM filter lm2a.wfreq by cmu dict...
python refreq.py <lm2a.wfreq >lm2a2.wfreq
wfreq2vocab -top 60000 <lm2a2.wfreq >lm2a.vocab
cat c:\work\gutenberg\guttok.out | text2idngram -n 3 -vocab lm2a.vocab -temp c:\temp -write_ascii >lm2a.id3gram.ascii
idngram2lm -idngram lm2a.id3gram.ascii -ascii_input -context context.ccs -vocab lm2a.vocab -arpa lm2a60k.5.5.arpa -n 5 -good_turing
python redict.py lm2a.vocab > lm2a.dict0
REM Compute perplexity on training data
echo perplexity -text small.out | evallm -arpa lm2a60k.5.5.arpa >>lm2a.eval-small
REM Compute perplexity on previously unseen test data
echo perplexity -text heldback.out | evallm -arpa lm2a60k.5.5.arpa >>lm2a.eval-heldback
:END
BEEP



