> From: Alex <http://www.gmail.com/~alex.>
> Date: Mon, 28 Aug 2017 09:39:48 -0400
>
> If there's only thousands of features why do you want it to scale to
> millions?

SVM only has a few thousand features. The Bayesian model has 1.6
million features.

> Also when I was making the correlation prog I had an idea that was kinda
> interesting. Basically it was to cluster features, by finding features
> that were mostly correlated, but didn't occur together all the time. So
> the idea would be, you find a pair (and later on, maybe even a group) of
> features that are very similar in idea, then treat them as the same
> feature in the model. So for example maybe you'd group WORD_hat and
> NGRAM_baseball_cap together such that if either occurs, then the
> hat+baseballcap feature in the feature vector gets incremented. You could
> just continue ANDing features to make a group.

Hmm, that's interesting. I had floated the idea of creating phrases
(for cases where WORD_hat occurs only once in the message and
NGRAM_baseball_hat occurs only once), but not creating semantic groups.
That seems like a good idea, and might be a good alternative to using
WordNet. (I put a rough sketch of what I think you mean at the end of
this message.)

> The only problem with this is if spammers only say a certain type of
> hat, you'll be losing information. It's only good when you want to
> capture a topic.

Right. I expect that there will be a lot of errors with this approach
in general: words that shouldn't be associated with each other will end
up getting associated. (The system I worked on in the late 1990s used
the text of HTML hyperlinks (usually bookmarks) to create keywords
associated with topics, when the topic of the destination web page was
known.)

BTW, Igor has been investigating topic creation using latent semantic
analysis. That's far more compute-intensive than the approach you
suggest, and it's unclear whether there is any correlation between
topics and spam -- or, at least, between topics and what our users
consider to be spam. (There's a sketch of the general shape of LSA at
the end of this message, too.)

On another topic: where do things stand with graduate school at this
point?

> On Aug 27, 2017 11:24 AM, "Robert" <http://dummy.us.eu.org/robert> wrote:
>
> > > From: Alex <http://www.gmail.com/~alex.>
> > > Date: Sat, 26 Aug 2017 21:07:27 -0400
> > >
> > > Ah, I get it now. I looked up TLSH, that helped. Are you just
> > > thinking of using it for finding if an email has an approximate
> > > match with a known spam/ham (in a new way)?
> >
> > No. I do use TLSH and Nilsimsa for deduplicating the data during
> > training. It has made a dramatic improvement in the quality of the
> > Bayesian model.
> >
> > I'm thinking I should try to get Jing to remove co-occurrent (is that
> > a word??) features, since SVM has to have such a limited set of
> > features due to memory constraints during training. (There are only a
> > few thousand features -- even your existing Python code would work
> > for this purpose.)
> >
> > > What was the solution to the to-do list problem? Something with
> > > Markov chains, I think?
> >
> > Yes. That was your suggestion, and I began doing some coding using
> > that. Of course, I don't have time to finish it.
> >
> > > On Aug 26, 2017 20:55, "Robert" <http://dummy.us.eu.org/robert> wrote:
> > > > BTW, I only thought to look at your code after I was growing
> > > > frustrated at having to rearrange my big todo list and wishing
> > > > that I had a program which rearranged my todo list automatically
> > > > and wishing that you would write it for me :-).
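
P.S. Some rough sketches of the ideas above, in Python since your
existing code is Python anyway. Everything in them (feature names,
thresholds, toy data) is invented for illustration.

First, my reading of your grouping idea: treat two binary features as
one feature when they fire in mostly, but not exactly, the same
messages. A union-find over high-similarity pairs gives you the "keep
ANDing features into a group" behavior:

    # Group binary features whose document sets mostly overlap.  doc_sets
    # maps feature name -> set of message ids in which the feature fires.
    from collections import defaultdict

    def jaccard(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    def group_features(doc_sets, lo=0.6, hi=0.95):
        # Merge pairs whose similarity is high enough to mean "same
        # idea" (>= lo) but not so high that merging adds nothing (< hi).
        names = list(doc_sets)
        parent = {n: n for n in names}   # trivial union-find

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for i, a in enumerate(names):
            for b in names[i + 1:]:      # O(n^2); fine for a few thousand
                if lo <= jaccard(doc_sets[a], doc_sets[b]) < hi:
                    parent[find(a)] = find(b)

        groups = defaultdict(list)
        for n in names:
            groups[find(n)].append(n)
        return list(groups.values())

    # Toy data: WORD_hat and NGRAM_baseball_cap mostly co-occur, so they
    # land in one group; either one firing would bump the shared slot.
    doc_sets = {
        "WORD_hat": {1, 2, 3, 5, 8},
        "NGRAM_baseball_cap": {1, 2, 3, 5, 9},
        "WORD_viagra": {4, 6, 7},
    }
    print(group_features(doc_sets))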
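
Second, what I gather Igor is doing with latent semantic analysis,
modulo details I don't know: TF-IDF followed by a truncated SVD (this
is the scikit-learn version; the corpus and topic count are
placeholders). You can see why it's expensive next to your
pairwise-correlation idea -- the SVD has to chew on the whole matrix:

    # LSA in miniature: project TF-IDF vectors onto a few "topic" axes.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "cheap watches free shipping",
        "meeting rescheduled to friday",
        "free cheap pills no prescription",
        "friday meeting agenda attached",
    ]

    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)                  # docs x terms
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_topics = svd.fit_transform(X)              # docs x topics

    # Top terms per topic; whether these topics track what users call
    # spam is exactly the open question.
    terms = vec.get_feature_names_out()
    for i, comp in enumerate(svd.components_):
        top = comp.argsort()[-3:][::-1]
        print("topic", i, [terms[j] for j in top])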
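
Third, the TLSH dedup from my earlier message, reduced to its skeleton
(the real threshold is tuned; 40 here is arbitrary, and the all-pairs
scan is too slow for a real corpus):

    # Keep a message only if it isn't TLSH-close to one already kept.
    import tlsh

    def dedupe(messages, max_distance=40):
        kept, digests = [], []
        for text in messages:
            h = tlsh.hash(text.encode("utf-8"))
            if not h or h == "TNULL":    # input too short/uniform to hash
                kept.append(text)
                continue
            if all(tlsh.diff(h, d) > max_distance for d in digests):
                kept.append(text)
                digests.append(h)
        return kept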
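
Last, the co-occurrent feature removal I want from Jing, in the
degenerate but easy case: if two features fire in exactly the same
messages, the SVM gains nothing from the second one, so keep one
representative per distinct column. Near-duplicates (similarity just
under 1.0) would need pairwise checks like the grouping sketch above:

    # Drop features whose document sets are identical to a feature
    # already seen; doc_sets maps feature name -> set of message ids.
    def drop_cooccurrent(doc_sets):
        seen = {}
        for name, docs in doc_sets.items():
            seen.setdefault(frozenset(docs), name)   # first one wins
        keep = set(seen.values())
        return {n: d for n, d in doc_sets.items() if n in keep}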