
Re: tag/feature/attribute/dimension correlation



 > From: Alex  <http://www.gmail.com/~alex.>
 > Date: Sat, 26 Aug 2017 19:21:50 -0400
 >
 > Won't bitvector just become the entire dataset formatted differently?

Yes.  That's why compressing it would be useful.

Also: if you trust that the hash function will not create collisions, then
you could bail on the bitvector entirely.

 > Are you trying to get hashval to map tags to points where the points that
 > are correlated are closer? Regardless, I don't understand how hashfunc
 > accumulates the values.

The particulars of how hashfunc works don't matter.  All that matters is
that any new val[n,d] is folded into the hash value somehow.  There are a
variety of hash functions available.  Google 'em.
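For instance, here is a minimal sketch of an accumulating hash in Python
(the FNV-1a constants are standard; the only property the pseudo code
below relies on is that hashfunc(old, new) folds the new value into the
running hash):

```python
# Minimal sketch of an accumulating hash function (FNV-1a style).
# hashfunc(hashval, value) folds each new value into the running hash.

FNV_PRIME = 0x100000001b3
FNV_OFFSET = 0xcbf29ce484222325
MASK = (1 << 64) - 1          # keep the hash at 64 bits

def hashfunc(hashval, value):
    # XOR each byte of the new value into the hash, then multiply
    for byte in repr(value).encode():
        hashval = ((hashval ^ byte) * FNV_PRIME) & MASK
    return hashval

h = FNV_OFFSET
for v in [3, 1, 4, 1, 5]:
    h = hashfunc(h, v)        # accumulate one value at a time
```

Two columns fed the same sequence of values end with the same h.  Any
cryptographic hash object (e.g. hashlib.sha256 via update()) has the same
accumulate-as-you-go shape.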

Of course, this is *only* for the exact match case.  For the fuzzy case,
there are only a handful of fuzzy (locality sensitive) hash functions.
But, like all good hash functions, they accumulate values as new stuff is
added.
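TLSH itself needs a library, but the locality-sensitive idea can be
illustrated with a toy SimHash (my stand-in here, not something from the
thread): sequences that agree on most values get hash values that differ
in only a few bits, so "mostly matches" becomes a small Hamming distance:

```python
import hashlib

def simhash(values, bits=64):
    # Toy locality-sensitive hash (SimHash-style): each value votes on
    # every output bit; similar inputs end up with nearby hash values.
    counts = [0] * bits
    for v in values:
        h = int.from_bytes(hashlib.md5(repr(v).encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # each output bit is the majority vote of the per-value hash bits
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    # number of differing bits between two hash values
    return bin(a ^ b).count("1")
```

With a cryptographic hash, changing one value scrambles every bit; here,
two columns that agree on most values keep most bits in common, which is
what makes comparing bits and pieces of the hash values feasible.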

 > Why do we need hashval and bitvector?

Again, you could get rid of bitvector if you know that your hash function
will almost never create collisions.  Bitvector only exists as a back-up
in case there are collisions and you need to "go to the source" and make
sure the underlying data really are identical (or, in the case of a fuzzy
hash, that the data are mostly the same).

 > Are you only looking for exact correlations? The 2nd to last line leads me
 > to think that.

Yes.

The paragraph below about "fuzzy hash" describes a way to make inexact
correlations.

 > On Aug 26, 2017 6:41 PM, "Robert" <http://dummy.us.eu.org/robert> wrote:
 > Below is the pseudo code.
 > 
 > Although hashfunc() could be a "strong" (e.g., cryptographic) hash, it
 > could just as easily be a fuzzy hash (i.e., locality sensitive hash, such
 > as TLSH).  This would allow inexact dimension correlations.  In that case,
 >  1) you could do a first sweep of the "search for matches" below and then
 >     compare the bitvectors and see if most of the points match (maybe
 >     using a raw count would be sufficient)
 >  2) a second sweep would be needed where bits and pieces of the fuzzy hash
 >     are matched; that could get very hairy -- I would probably break
 >     pieces of the hash values into a tree and traverse the tree,
 >     determining whether most of each hash value matches; this is left as
 >     an exercise for the reader (i.e., you!)
 > 
 > // initialize
 > for d in D
 >  hashval[d] = 0
 > for n in rows
 >  for d in D
 >   // accumulate the new value into the hash value
 >   hashval[d] = hashfunc(hashval[d], val[n, d])
 >   bitvector[d].append(val[n, d])
 > // bitvector will obviously get large; using sparse bit vector/array
 > // compression would likely make sense
 > // search for matches
 > for d1 in D
 >  for d2 in D
 >   if hashval[d1] == hashval[d2] then
 >    if bitvector[d1] == bitvector[d2] then
 >     print "{} and {} correlate" % (d1, d2)
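Untangling the quoting, the pseudo code runs almost as-is in Python.  A
minimal sketch (the sample data, column names, and the choice of sha256
as hashfunc are mine; a plain value list stands in for the compressed
bitvector):

```python
import hashlib
from collections import defaultdict

# Made-up sample data: each dimension in D is a column of values.
# Columns "a" and "c" are identical, so they should correlate.
data = {
    "a": [1, 2, 3, 4],
    "b": [9, 8, 7, 6],
    "c": [1, 2, 3, 4],
}

# initialize, then accumulate each value into its dimension's hash;
# the raw value sequence is kept as the collision back-up
hashval = {}
bitvector = defaultdict(list)
for d, column in data.items():
    h = hashlib.sha256()
    for v in column:
        h.update(repr(v).encode())   # accumulate the new value
        bitvector[d].append(v)
    hashval[d] = h.hexdigest()

# search for matches: compare hashes first, and "go to the source"
# only to rule out collisions (d1 < d2 skips self- and mirror-pairs)
matches = []
for d1 in data:
    for d2 in data:
        if d1 < d2 and hashval[d1] == hashval[d2]:
            if bitvector[d1] == bitvector[d2]:
                matches.append((d1, d2))
                print("{} and {} correlate".format(d1, d2))
```

If you trust the hash, the inner bitvector comparison can be dropped, as
discussed above.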



