tag/feature/attribute/dimension correlation
- To: http://www.gmail.com/~alex. (Alexander ), http://www.umass.edu/~a (Alexander )
- Subject: tag/feature/attribute/dimension correlation
- From: http://dummy.us.eu.org/robert (Robert)
- Date: Sat, 26 Aug 2017 15:39:09 -0700
Below is the pseudocode.
Although hashfunc() could be a "strong" (e.g., cryptographic) hash, it
could just as easily be a fuzzy hash (i.e., locality sensitive hash, such
as TLSH). This would allow inexact dimension correlations. In that case,
1) you could do a first sweep of the "search for matches" below, then
   compare the bitvectors and see whether most of the points match (maybe
   a raw count of matching points would be sufficient)
2) a second sweep would be needed where bits and pieces of the fuzzy hash
   are matched; that could get very hairy -- I would probably break
   pieces of the hash values into a tree and traverse the tree,
   determining whether most of each hash value matches; this is left as
   an exercise for the reader (i.e., you!)
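
The raw count mentioned in 1) could look roughly like the Python sketch
below. The function names and the 0.9 threshold are made up for
illustration; the right cutoff would depend on the data.

```python
def match_fraction(col1, col2):
    """Fraction of row positions at which two columns hold equal values."""
    if len(col1) != len(col2):
        raise ValueError("columns must cover the same rows")
    if not col1:
        return 1.0  # two empty columns match trivially
    same = sum(1 for x, y in zip(col1, col2) if x == y)
    return same / len(col1)


def roughly_correlated(col1, col2, threshold=0.9):
    """True if at least `threshold` of the points match (threshold is an assumption)."""
    return match_fraction(col1, col2) >= threshold
```

For example, match_fraction([1, 0, 1, 1], [1, 0, 0, 1]) is 0.75, so those
two columns would not pass the 0.9 cutoff.
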
// initialize
for d in D
    hashval[d] = 0
    bitvector[d] = []

for n in rows
    for d in D
        // accumulate the new value into the hash value
        hashval[d] = hashfunc(hashval[d], val[n, d])
        bitvector[d].append(val[n, d])
        // bitvector will obviously get large; using sparse bit vector/array
        // compression would likely make sense

// search for matches (each unordered pair once, skipping d1 == d2)
for d1 in D
    for d2 in D where d2 > d1
        if hashval[d1] == hashval[d2] then
            // confirm against the full data in case of a hash collision
            if bitvector[d1] == bitvector[d2] then
                print "{} and {} correlate".format(d1, d2)
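
For reference, here is one way the exact-match version of the pseudocode
might be written in runnable Python. It is a sketch under assumptions:
SHA-256 from hashlib stands in for hashfunc(), rows are assumed to be
dicts keyed by dimension name, and correlated_columns is a made-up name.

```python
import hashlib
from itertools import combinations


def correlated_columns(rows, dims):
    """Return pairs of dimensions whose value sequences are identical."""
    # initialize: one running hash and one stored column per dimension
    hashers = {d: hashlib.sha256() for d in dims}
    columns = {d: [] for d in dims}

    for row in rows:
        for d in dims:
            value = row[d]
            # accumulate the new value into the hash value; the separator
            # byte keeps adjacent values from running together
            hashers[d].update(repr(value).encode("utf-8") + b"\x00")
            columns[d].append(value)

    # search for matches, visiting each unordered pair once
    matches = []
    for d1, d2 in combinations(dims, 2):
        if hashers[d1].digest() == hashers[d2].digest():
            # confirm against the stored values in case of a collision
            if columns[d1] == columns[d2]:
                matches.append((d1, d2))
    return matches


rows = [
    {"a": 1, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 1},
]
print(correlated_columns(rows, ["a", "b", "c"]))  # [('a', 'c')]
```

Because the hash is fed repr(value), values that compare equal but print
differently (1 and 1.0, say) would hash differently; normalizing values
first would avoid that.
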