
tag/feature/attribute/dimension correlation



Below is the pseudocode.

Although hashfunc() could be a "strong" (e.g., cryptographic) hash, it
could just as easily be a fuzzy hash (i.e., a locality-sensitive hash such
as TLSH).  That would allow inexact dimension correlations.  In that case:
 1) you could do a first sweep of the "search for matches" loop below and
    then compare the bitvectors to see whether most of the points match
    (a raw count of matching points would probably be sufficient)
 2) a second sweep would be needed in which pieces of the fuzzy hash
    values are matched; that could get very hairy -- I would probably
    break the hash values into pieces, store them in a tree, and traverse
    the tree to determine whether most of each hash value matches; this
    is left as an exercise for the reader (i.e., you!)
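
The raw-count comparison in item 1 might look something like the Python
sketch below.  The function names and the 0.9 threshold are my own
assumptions for illustration; pick whatever cutoff "most of the points"
means for your data.

```python
def match_fraction(col_a, col_b):
    """Fraction of rows in which two equal-length columns hold the same value."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    if not col_a:
        return 1.0  # two empty columns trivially agree
    same = sum(1 for a, b in zip(col_a, col_b) if a == b)
    return same / len(col_a)


def mostly_matches(col_a, col_b, threshold=0.9):
    """True if at least `threshold` of the points match (threshold is an assumption)."""
    return match_fraction(col_a, col_b) >= threshold
```

For example, match_fraction([1, 0, 1, 1], [1, 0, 0, 1]) is 0.75, so that
pair would fail the default 0.9 threshold but pass a threshold of 0.7.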

// initialize
for d in D
 hashval[d] = 0
 bitvector[d] = empty
for n in rows
 for d in D
  // accumulate the new value into the hash value
  hashval[d] = hashfunc(hashval[d], val[n, d])
  bitvector[d].append(val[n, d])
// bitvector will obviously get large; using sparse bit vector/array
// compression would likely make sense
// search for matches
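for d1 in D
 for d2 in D where d2 > d1   // skip self-matches and duplicate pairs
  if hashval[d1] == hashval[d2] then
   // hashes agree; compare the full vectors to rule out a collision
   if bitvector[d1] == bitvector[d2] then
    print "{} and {} correlate".format(d1, d2)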
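
For anyone who wants something runnable, here is a minimal Python sketch
of the exact-match case.  It assumes SHA-256 as the "strong" hashfunc()
and assumes each row is a dict mapping dimension name to value; the
function name and that row layout are hypothetical, not part of the
pseudocode above.

```python
import hashlib
from itertools import combinations


def correlated_dimensions(rows, dims):
    """Return pairs of dimensions whose columns are identical in every row."""
    hashers = {d: hashlib.sha256() for d in dims}  # one running hash per dimension
    columns = {d: [] for d in dims}                # plays the role of bitvector[d]
    for row in rows:
        for d in dims:
            value = row[d]
            # Accumulate the new value into the running hash.  The separator
            # byte keeps the values ("1", "23") distinct from ("12", "3").
            hashers[d].update(repr(value).encode("utf-8") + b"\x00")
            columns[d].append(value)
    digests = {d: h.digest() for d, h in hashers.items()}

    matches = []
    # combinations() visits each unordered pair once and never pairs a
    # dimension with itself.
    for d1, d2 in combinations(dims, 2):
        # Cheap digest comparison first; full column comparison confirms it.
        if digests[d1] == digests[d2] and columns[d1] == columns[d2]:
            matches.append((d1, d2))
    return matches


rows = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 1},
]
for d1, d2 in correlated_dimensions(rows, ["a", "b", "c"]):
    print("{} and {} correlate".format(d1, d2))
```

With many dimensions, grouping them by digest in a dict and comparing
only within each group would avoid the all-pairs loop.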



