> From: "rnull2" <http://www.verizon.net/~rnu>
> Date: Fri, 27 Feb 2009 19:59:27 -0500
>
> Hi Robert,
>
> I was wondering if you have any documentation on your phrases.cc? It seems
> useful and instructions on the command line parameters could be helpful.

Here's a part from my .procmailrc file:

    :0w
    *PHRASES??^^^^
    { PHRASES="phrases $HOME/.idata.phrases" }

    :0c:$HOME/.idata.lock
    |head -c 100000 | formail -I "From " $FILTER \
      | $PHRASES | ifile -k -S -w -m 100000 -i $IFILE_FOLDER \
      && cat >/dev/null

This trains for ham. (I have other code in my junkicide procmail script
which trains for spam.)

> And do you describe anywhere how your algorithm works to find phrases?

It's very simple. All 2 word pairs are considered a "phrase". This includes
combinations

    (A B) C
    A (B C)

Because of this, pairs can be linked together, thus:

    (A B C) because (A B) (B C)

Counts are kept for each word pair. A maximum of 100000 phrases are kept
and ones with the lowest counts fall off the bottom.

Some words are not considered. Words < 3 letters are not considered.
And, recently, words not found in a dictionary are the first to get cleaned
when purging the database.

Phone numbers are also stored as phrases.

When filtering, '_' are stuck between each word, unless there are
non-alphanum chars. Thus, in the above example, the phrase becomes

    A_B_C

You can see in my .idata ifile database things like:

    message_content 164170 2:62
    quotes_for_free 121393 2:20
    register_new_account 278943 0:88

The first two seem to be marked as spam and the last as ham.

Note that bogofilter and spambayes have bigram functionality which has
similar effect.

Please email back if you have any more specific questions.

> Thanks.
>
> Larry Smith
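
For illustration, the pair counting and '_' joining described above could be sketched roughly as follows. This is a minimal sketch in Python and not the actual phrases.cc code: the function names (words, train, mark_phrases), the tokenizing regex, and the choice to drop short words before pairing are all assumptions. It also leaves out the dictionary-based purging, the phone number handling, and the special case for non-alphanumeric characters.

```python
import re
from collections import Counter

MIN_WORD_LEN = 3      # from the description: words < 3 letters are not considered
MAX_PHRASES = 100000  # from the description: at most 100000 phrases are kept


def words(text):
    """Split text into lowercase alphanumeric words (assumed tokenizer),
    dropping words shorter than MIN_WORD_LEN."""
    found = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [w for w in found if len(w) >= MIN_WORD_LEN]


def train(counts, text, max_phrases=MAX_PHRASES):
    """Count every adjacent word pair, then drop the lowest-count pairs
    once the table grows past max_phrases."""
    ws = words(text)
    counts.update(zip(ws, ws[1:]))
    if len(counts) > max_phrases:
        keep = counts.most_common(max_phrases)
        counts.clear()
        counts.update(dict(keep))


def mark_phrases(counts, text):
    """Join runs of words whose adjacent pairs are all known, using '_'.
    Known pairs (A B) and (B C) link into the single token A_B_C."""
    out = []
    prev = None
    for w in words(text):
        if prev is not None and (prev, w) in counts:
            out[-1] += "_" + w  # extend the phrase being built
        else:
            out.append(w)       # start a new token
        prev = w
    return out


counts = Counter()
train(counts, "register new account")
train(counts, "register new account today")
print(mark_phrases(counts, "Please register new account now"))
# ['please', 'register_new_account', 'now']
```

In this sketch the joined tokens are what would be handed on to ifile in place of the separate words, which is how entries such as register_new_account would end up in the .idata database.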