This note is generated by Claude Sonnet 4.6 based off a conversation.
The goal: given a large corpus of documents, find all near-duplicate pairs efficiently — without O(n²) comparisons.
Estimating Jaccard from Signatures
Run independent MinHash functions. Each document gets a signature vector .
For a pair of documents with agreeing positions and disagreeing positions ():
This is the sample mean of i.i.d. Bernoulli() trials. By the law of large numbers, as , with confidence interval:
Bayesian interpretation
With a uniform prior on , the posterior given observations is:
So is just the Beta CDF tail. The MAP estimate recovers .
In practice this isn’t used — practitioners just threshold on directly — but it’s the principled answer to “how uncertain am I?”
The Search Problem
Even with compact signatures, comparing all pairs is still O(n²). For a million documents that’s comparisons.
MinHash signatures solve the estimation problem. LSH solves the search problem.
LSH via Banding
Take the length- signature and divide it into bands of rows each ().
For each band, hash that band’s chunk into a bucket. Documents sharing a bucket in any band become candidate pairs.
Collision probability
For two documents with true Jaccard similarity :
| Event | Probability |
|---|---|
| Agree on all rows in one band | |
| Collide in at least one of bands |
This is the S-curve. By tuning and , you control the threshold (knee of the curve):
graph LR A[Low similarity] -->|"(1-(s^r)^b) ≈ 0"| B[Rarely collide] C[High similarity] -->|"(1-(s^r)^b) ≈ 1"| D[Almost always collide]
What LSH specifically contributes
The bucket trick — “only compare pairs sharing a bucket” — is generic and gives you O(n) bucketing + O(candidate pairs) complexity for any hashing scheme.
What’s specific to LSH is the design of the hash family that produces the S-curve. A single minHash bucket gives a linear collision probability in — no sharp threshold. Banding introduces the threshold by:
- The exponent (AND over rows) sharpens the lower cutoff
- The union (OR over bands) softens the upper end Together they carve out the S-shape.
The Full Pipeline
Documents
→ MinHash signatures (k hashes) # O(n·k), reduces each doc to a vector
→ LSH banding into buckets # O(n·b), finds candidate pairs
→ Exact Jaccard on candidates only # O(|candidates|), verification step
The final verification step is where you get certainty. LSH+banding is a probabilistic filter — it trades away some recall (missed pairs) and precision (false candidate pairs) in exchange for massive speedup.
Parameter Tuning Intuition
| Goal | Adjustment |
|---|---|
| Higher recall (catch more true duplicates) | Increase (more bands) |
| Higher precision (fewer false candidates) | Increase (larger bands) |
| Sharper threshold curve | Increase overall |