# Canonical Multimodal Workload Sandbox for MinHash Deduplication on Common Crawl HTML Documents Using daft DataFrames
`minhash_dedupe.py` implements a scalable deduplication pipeline using MinHash and Locality-Sensitive Hashing (LSH) for large text datasets such as Common Crawl HTML extracts. It leverages daft (a distributed DataFrame library) for efficient computation, including text normalization, MinHash signature generation, LSH banding, and connected-components clustering of duplicates. The pipeline is designed for high-throughput deduplication while minimizing false positives and false negatives via optimized banding parameters.

Key goal: identify and remove near-duplicate text blocks (e.g., from web crawls) based on a Jaccard similarity threshold, outputting unique representatives.
## Pipeline Overview

- **Preprocessing** (see the extraction sketch below):
  - Extracts text blocks from HTML using Selectolax (removes scripts/styles).
  - Filters for non-empty, valid UTF-8 content.
  - Adds unique block IDs.
- **Text Normalization**:
  - Optional: remove punctuation, lowercase, NFD Unicode normalization, whitespace cleanup.
  - Applied via daft's string functions for consistency.
- **MinHash & LSH** (see the banding-parameter sketch below):
  - Computes MinHash signatures (e.g., 64 permutations, 5-grams) using XXHash.
  - Bands signatures into buckets (optimal B/R derived from the threshold) for candidate pair generation.
  - Builds edges between similar nodes.
- **Connected Components** (see the validation sketch below):
  - Uses the alternating Large-Star/Small-Star algorithm (or a two-phase variant) for union-find-style clustering.
  - Includes global min-label propagation for convergence.
  - Optional igraph validation for correctness.
- **Output**:
  - Merges results to keep only unique representatives per component.
  - Partitioned Parquet saving with Snappy compression.
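The preprocessing stage's HTML-to-text extraction is built on Selectolax. Below is a minimal sketch of that kind of extraction; the `extract_text_blocks` helper and its block-splitting heuristic are illustrative assumptions, not the repository's exact implementation.

```python
from selectolax.parser import HTMLParser


def extract_text_blocks(html: str) -> list[str]:
    """Illustrative helper: pull visible text blocks out of an HTML document."""
    tree = HTMLParser(html)
    # Drop non-content elements before extracting text.
    for node in tree.css("script, style"):
        node.decompose()
    if tree.body is None:
        return []
    # Split the remaining text into candidate blocks and keep the non-empty ones.
    text = tree.body.text(separator="\n")
    return [block.strip() for block in text.split("\n") if block.strip()]
```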
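The "optimal B/R from threshold" step is the standard LSH tuning problem: choose a number of bands `b` and rows per band `r` (with `b * r <= num_perm`) that minimizes a weighted combination of false-positive and false-negative probability around the Jaccard threshold. The sketch below shows that search in the style popularized by `datasketch`, using SciPy for the integrals; the function name and the equal error weights are assumptions.

```python
from scipy.integrate import quad


def optimal_band_rows(threshold: float, num_perm: int,
                      fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    """Pick (bands, rows) minimizing a weighted false-positive/false-negative area."""

    def false_positive_area(b: int, r: int) -> float:
        # Probability that a pair BELOW the threshold still collides in some band.
        collide = lambda s: 1.0 - (1.0 - s ** float(r)) ** float(b)
        area, _ = quad(collide, 0.0, threshold)
        return area

    def false_negative_area(b: int, r: int) -> float:
        # Probability that a pair ABOVE the threshold never collides in any band.
        miss = lambda s: 1.0 - (1.0 - (1.0 - s ** float(r)) ** float(b))
        area, _ = quad(miss, threshold, 1.0)
        return area

    best, best_err = (1, num_perm), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):  # keep b * r within the signature length
            err = fp_weight * false_positive_area(b, r) + fn_weight * false_negative_area(b, r)
            if err < best_err:
                best, best_err = (b, r), err
    return best
```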
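When `igraph_validate` is enabled, the component labels produced by the star algorithms can be cross-checked against igraph's own connected components. A minimal sketch of such a check; the edge-list and label-list inputs are assumptions about how the pipeline exposes its results.

```python
import igraph as ig


def validate_components(edges: list[tuple[int, int]], labels: list[int]) -> bool:
    """Cross-check pipeline component labels against igraph's connected components.

    Assumes `labels[i]` is the component label the pipeline assigned to node i and
    `edges` are the candidate-pair edges fed into the clustering step.
    """
    g = ig.Graph(n=len(labels), edges=edges)
    membership = g.connected_components().membership
    # The two labelings agree iff they induce the same partition of the nodes,
    # i.e. there is a one-to-one mapping between igraph components and labels.
    forward, backward = {}, {}
    for comp, label in zip(membership, labels):
        if forward.setdefault(comp, label) != label:
            return False
        if backward.setdefault(label, comp) != comp:
            return False
    return True
```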
## Installation

Clone this repository and then run:

```bash
cd daft-minhash-dedupe && uv venv && uv sync
```

If you don't have `uv` installed:

```bash
pip install uv
```

To access the Common Crawl dataset from S3, set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as environment variables so the pipeline can authenticate.
## Usage

- **Instantiation**: Create a `MinHashDedupePipeline` object with parameters (e.g., `num_perm=64`, `threshold=0.7`). Note that not every threshold works with every number of permutations; the pipeline raises an assertion error at instantiation if the combination is invalid.
- **Running**: Call `pipeline(df)` on a preprocessed daft DataFrame (e.g., from `preprocess_common_crawl_html`).
- **Main script**: Handles S3 I/O, environment variables, and full pipeline execution for Common Crawl segments.
- **Example** (from main):

  ```python
  pipeline = MinHashDedupePipeline(output_uri="s3://bucket/output", ...)
  df_prepped = preprocess_common_crawl_html("s3://commoncrawl/...")
  results = pipeline(df_prepped)
  partitioned_save("s3://bucket/output", results, chunk_size=200000)
  ```
## Configuration

- **Core**: `num_perm` (number of MinHash permutations), `ngram_size` (shingle size), `threshold` (Jaccard similarity), `seed`, `hash_function`.
- **Normalization**: Booleans for punctuation removal, lowercasing, Unicode normalization, and whitespace cleanup.
- **Algorithm**: `algorithm` (`"alternating"` or `"two_phase"`), `max_loops`, `igraph_validate`.
- **I/O**: S3 configuration via `IOConfig`; supports Ray for partitioning (see the sketch below).
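For S3 access, credentials can be threaded into daft through its I/O configuration. Below is a minimal sketch of building an `IOConfig` from the environment variables mentioned in the installation section; the exact `S3Config` field names and the read path are my assumptions, so confirm them against the daft documentation.

```python
import os

import daft
from daft.io import IOConfig, S3Config

# Assumes credentials are exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
io_config = IOConfig(
    s3=S3Config(
        key_id=os.environ["AWS_ACCESS_KEY_ID"],
        access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
)

# Illustrative read of already-prepped data with the configured credentials.
df = daft.read_parquet("s3://bucket/path/to/prepped/", io_config=io_config)
```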
## Notes

- **Dependencies**: daft, Selectolax, SciPy, igraph (optional), Ray (for large-scale runs).
- **Testing**: Use a small `ROW_LIMIT` for local runs; check the `friction/` directory for prototypes.
- **Extensions**: The pipeline is modular; normalization steps and custom hash functions are easy to add.
- **Performance**: Scales to millions of rows; tune partition counts to manage memory.