> From: Alex <http://www.gmail.com/~alex.>
> Date: Mon, 28 Aug 2017 09:39:48 -0400
>
> If there's only thousands of features why do you want it to scale to
> millions?

SVM only has a few thousand features. The Bayesian model has 1.6
million features.

> Also when I was making the correlation prog I had an idea that was kinda
> interesting. Basically it was to cluster features, by finding features
> that were mostly correlated, but didn't occur together all the time. So
> the idea would be, you find a pair (and later on, maybe even a group) of
> features that are very similar in idea, then treat them as the same
> feature in the model. So for example maybe you'd group WORD_hat and
> NGRAM_baseball_cap together such that if either occurs, then the
> hat+baseballcap feature in the feature vector gets incremented. You could
> just continue ANDing features to make a group.

Hmm, that's interesting. I had floated the idea of creating phrases
(for cases where WORD_hat occurs only once in the message and
NGRAM_baseball_hat occurs only once), but not creating semantic groups.
That seems like a good idea, and might be a good alternative to using
WordNet. (I put a rough sketch of what I think you mean at the end of
this message.)

> The only problem with this is if spammers only say a certain type of
> hat, you'll be losing information. It's only good when you want to
> capture a topic.

Right. I expect that there will be a lot of errors with this approach
in general: words that shouldn't be associated with each other will end
up getting associated. (The system I worked on in the late 1990s used
the text of HTML hyperlinks (usually bookmarks) to create keywords
associated with topics, when the topic of the destination web page was
known.)

BTW, Igor has been investigating topic creation using latent semantic
analysis. That's far more compute-intensive than the approach you
suggest, and it's unclear whether there is any correlation between
topics and spam -- or, at least, between topics and what our users
consider to be spam. (There's a sketch of the general shape of LSA at
the end of this message, too.)

On another topic: where do things stand with graduate school at this
point?

> On Aug 27, 2017 11:24 AM, "Robert" <http://dummy.us.eu.org/robert> wrote:
>
> > > From: Alex <http://www.gmail.com/~alex.>
> > > Date: Sat, 26 Aug 2017 21:07:27 -0400
> > >
> > > Ah, I get it now. I looked up TLSH, that helped. Are you just
> > > thinking of using it for finding if an email has an approximate
> > > match with a known spam/ham (in a new way)?
> >
> > No. I do use TLSH and Nilsimsa for deduplicating the data during
> > training. It has made a dramatic improvement in the quality of the
> > Bayesian model.
> >
> > I'm thinking I should try to get Jing to remove co-occurrent (is that
> > a word??) features, since SVM has to have such a limited set of
> > features due to memory constraints during training. (There are only a
> > few thousand features -- even your existing Python code would work
> > for this purpose.)
> >
> > > What was the solution to the to-do list problem? Something with
> > > Markov chains, I think?
> >
> > Yes. That was your suggestion, and I began doing some coding using
> > that. Of course, I don't have time to finish it.
> >
> > > On Aug 26, 2017 20:55, "Robert" <http://dummy.us.eu.org/robert> wrote:
> > > > BTW, I only thought to look at your code after I was growing
> > > > frustrated at having to rearrange my big todo list and wishing
> > > > that I had a program which rearranged my todo list automatically
> > > > and wishing that you would write it for me :-).
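
P.S. Some rough sketches of the ideas above, in Python since your
existing code is Python anyway. Everything in them (feature names,
thresholds, toy data) is invented for illustration.

First, my reading of your grouping idea: treat two binary features as
one feature when they fire in mostly, but not exactly, the same
messages. A union-find over high-similarity pairs gives you the "keep
ANDing features into a group" behavior:

    # Group binary features whose document sets mostly overlap.  doc_sets
    # maps feature name -> set of message ids in which the feature fires.
    from collections import defaultdict

    def jaccard(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    def group_features(doc_sets, lo=0.6, hi=0.95):
        # Merge pairs whose similarity is high enough to mean "same
        # idea" (>= lo) but not so high that merging adds nothing (< hi).
        names = list(doc_sets)
        parent = {n: n for n in names}   # trivial union-find

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for i, a in enumerate(names):
            for b in names[i + 1:]:      # O(n^2); fine for a few thousand
                if lo <= jaccard(doc_sets[a], doc_sets[b]) < hi:
                    parent[find(a)] = find(b)

        groups = defaultdict(list)
        for n in names:
            groups[find(n)].append(n)
        return list(groups.values())

    # Toy data: WORD_hat and NGRAM_baseball_cap mostly co-occur, so they
    # land in one group; either one firing would bump the shared slot.
    doc_sets = {
        "WORD_hat": {1, 2, 3, 5, 8},
        "NGRAM_baseball_cap": {1, 2, 3, 5, 9},
        "WORD_viagra": {4, 6, 7},
    }
    print(group_features(doc_sets))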
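
Second, what I gather Igor is doing with latent semantic analysis,
modulo details I don't know: TF-IDF followed by a truncated SVD (this
is the scikit-learn version; the corpus and topic count are
placeholders). You can see why it's expensive next to your
pairwise-correlation idea -- the SVD has to chew on the whole matrix:

    # LSA in miniature: project TF-IDF vectors onto a few "topic" axes.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "cheap watches free shipping",
        "meeting rescheduled to friday",
        "free cheap pills no prescription",
        "friday meeting agenda attached",
    ]

    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)                  # docs x terms
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_topics = svd.fit_transform(X)              # docs x topics

    # Top terms per topic; whether these topics track what users call
    # spam is exactly the open question.
    terms = vec.get_feature_names_out()
    for i, comp in enumerate(svd.components_):
        top = comp.argsort()[-3:][::-1]
        print("topic", i, [terms[j] for j in top])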
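
Third, the TLSH dedup from my earlier message, reduced to its skeleton
(the real threshold is tuned; 40 here is arbitrary, and the all-pairs
scan is too slow for a real corpus):

    # Keep a message only if it isn't TLSH-close to one already kept.
    import tlsh

    def dedupe(messages, max_distance=40):
        kept, digests = [], []
        for text in messages:
            h = tlsh.hash(text.encode("utf-8"))
            if not h or h == "TNULL":    # input too short/uniform to hash
                kept.append(text)
                continue
            if all(tlsh.diff(h, d) > max_distance for d in digests):
                kept.append(text)
                digests.append(h)
        return kept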
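
Last, the co-occurrent feature removal I want from Jing, in the
degenerate but easy case: if two features fire in exactly the same
messages, the SVM gains nothing from the second one, so keep one
representative per distinct column. Near-duplicates (similarity just
under 1.0) would need pairwise checks like the grouping sketch above:

    # Drop features whose document sets are identical to a feature
    # already seen; doc_sets maps feature name -> set of message ids.
    def drop_cooccurrent(doc_sets):
        seen = {}
        for name, docs in doc_sets.items():
            seen.setdefault(frozenset(docs), name)   # first one wins
        keep = set(seen.values())
        return {n: d for n, d in doc_sets.items() if n in keep}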