How we do this? Hash all the items in a set and pick the min (or max, doesn’t matter). See Jaccard similarity for what’s that Jaccard.
def minhash(S: set[str], seed: int):
return min(mmh3.hash(x, seed) for x in S)Why? Sorted hash function over a heap is basically random permutation, or shuffle.
Say and set both have 5 elements, 3 of them are shared. , If one of the three produce the smallest hash among the 7 unique elements, then hash collides, .