This note is generated by Claude Sonnet 4.6 based off a conversation.


The goal: given a large corpus of documents, find all near-duplicate pairs efficiently — without O(n²) comparisons.

Estimating Jaccard from Signatures

Run independent MinHash functions. Each document gets a signature vector .

For a pair of documents with agreeing positions and disagreeing positions ():

This is the sample mean of i.i.d. Bernoulli() trials. By the law of large numbers, as , with confidence interval:

Bayesian interpretation

With a uniform prior on , the posterior given observations is:

So is just the Beta CDF tail. The MAP estimate recovers .

In practice this isn’t used — practitioners just threshold on directly — but it’s the principled answer to “how uncertain am I?”

The Search Problem

Even with compact signatures, comparing all pairs is still O(n²). For a million documents that’s comparisons.

MinHash signatures solve the estimation problem. LSH solves the search problem.

LSH via Banding

Take the length- signature and divide it into bands of rows each ().

For each band, hash that band’s chunk into a bucket. Documents sharing a bucket in any band become candidate pairs.

Collision probability

For two documents with true Jaccard similarity :

EventProbability
Agree on all rows in one band
Collide in at least one of bands

This is the S-curve. By tuning and , you control the threshold (knee of the curve):

graph LR
    A[Low similarity] -->|"(1-(s^r)^b) ≈ 0"| B[Rarely collide]
    C[High similarity] -->|"(1-(s^r)^b) ≈ 1"| D[Almost always collide]

What LSH specifically contributes

The bucket trick — “only compare pairs sharing a bucket” — is generic and gives you O(n) bucketing + O(candidate pairs) complexity for any hashing scheme.

What’s specific to LSH is the design of the hash family that produces the S-curve. A single minHash bucket gives a linear collision probability in — no sharp threshold. Banding introduces the threshold by:

  • The exponent (AND over rows) sharpens the lower cutoff
  • The union (OR over bands) softens the upper end Together they carve out the S-shape.

The Full Pipeline

Documents
  → MinHash signatures (k hashes)     # O(n·k), reduces each doc to a vector
  → LSH banding into buckets          # O(n·b), finds candidate pairs
  → Exact Jaccard on candidates only  # O(|candidates|), verification step

The final verification step is where you get certainty. LSH+banding is a probabilistic filter — it trades away some recall (missed pairs) and precision (false candidate pairs) in exchange for massive speedup.

Parameter Tuning Intuition

GoalAdjustment
Higher recall (catch more true duplicates)Increase (more bands)
Higher precision (fewer false candidates)Increase (larger bands)
Sharper threshold curveIncrease overall