Mithril Dedup Agent

Build data deduplication for ML training datasets, targeting a throughput of at least 100K docs/sec.

Status

Read crates/mithril-dedup/STATUS.md for current progress.

Reference Documentation

dedup/SPEC.md - Full product specification
RESEARCH.md - Papers and prior work (Google 2021 dedup paper, LSHBloom, text-dedup)

Module Responsibilities

minhash

MinHash signature generation:

pub struct MinHasher {
    num_permutations: usize,  // 128 default
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self;
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature;
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64;
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

Use mithril_core::hashing::hash_with_seed() for hashing.
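As a rough illustration of the interface above, the following sketch fills in the three methods. It assumes `hash_with_seed` takes a `u64` value and a `u64` seed and returns a `u64`; the real signature in `mithril_core` may differ, so a local SplitMix64-style mixer stands in for it here. The seed derivation and the `num_permutations()` accessor are likewise illustrative choices, not part of the spec.

```rust
use std::collections::HashSet;

/// SplitMix64-style finalizer, used only as a local stand-in hash.
fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical stand-in for `mithril_core::hashing::hash_with_seed`.
fn hash_with_seed(value: u64, seed: u64) -> u64 {
    mix64(value ^ seed)
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

pub struct MinHasher {
    num_permutations: usize,
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self {
        // One distinct seed per simulated permutation.
        let seeds = (0..num_permutations as u64).map(mix64).collect();
        Self { num_permutations, seeds }
    }

    pub fn num_permutations(&self) -> usize {
        self.num_permutations
    }

    /// For each seed, keep the minimum hash over all tokens.
    /// An empty token set yields u64::MAX in every slot.
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature {
        let values = self
            .seeds
            .iter()
            .map(|&seed| {
                tokens
                    .iter()
                    .map(|&t| hash_with_seed(t, seed))
                    .min()
                    .unwrap_or(u64::MAX)
            })
            .collect();
        MinHashSignature { values }
    }

    /// Fraction of matching slots, an estimate of Jaccard similarity.
    /// Returns 0.0 for empty or mismatched-length signatures.
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64 {
        if sig1.values.is_empty() || sig1.values.len() != sig2.values.len() {
            return 0.0;
        }
        let matches = sig1
            .values
            .iter()
            .zip(&sig2.values)
            .filter(|(a, b)| a == b)
            .count();
        matches as f64 / sig1.values.len() as f64
    }
}
```

With 128 permutations the estimate has a standard error of roughly 0.03 to 0.04 near a true similarity of 0.85, which is worth keeping in mind when comparing it against the threshold.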

lsh

Locality-Sensitive Hashing for candidate pair generation:

pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
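    /// Create an index tuned to a target Jaccard similarity threshold.
    /// Bands b and rows r must satisfy b * r <= num_permutations;
    /// the effective threshold is approximately (1/b)^(1/r).
    /// For 0.85 with 128 permutations: e.g. b=9, r=14 (threshold ≈ 0.85)
    /// or b=8, r=16 (threshold ≈ 0.88).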
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self;
    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature);
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)>;
}
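The banding scheme can be sketched as follows. This is a minimal illustration, not the project's implementation: `DocId` is assumed to be a `u64` alias, band keys are hashed with the standard library's `DefaultHasher` rather than a project hash, and `with_threshold` simply picks the (b, r) pair whose approximate S-curve threshold (1/b)^(1/r) is closest to the requested value. The `params()` accessor is hypothetical and exists only to inspect the result.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

pub type DocId = u64; // assumption: the real DocId type is not shown in this document

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
    /// Choose b and r with b * r <= num_permutations so that
    /// (1/b)^(1/r) is as close as possible to `threshold`.
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self {
        let mut best = (1usize, num_permutations.max(1), f64::INFINITY);
        for r in 1..=num_permutations {
            let b = num_permutations / r;
            let approx = (1.0 / b as f64).powf(1.0 / r as f64);
            let err = (approx - threshold).abs();
            if err < best.2 {
                best = (b, r, err);
            }
        }
        Self {
            num_bands: best.0,
            rows_per_band: best.1,
            buckets: vec![HashMap::new(); best.0],
        }
    }

    /// Hypothetical accessor returning (num_bands, rows_per_band).
    pub fn params(&self) -> (usize, usize) {
        (self.num_bands, self.rows_per_band)
    }

    /// Hash each band of the signature and file the doc under that key.
    /// Signature slots beyond b * r are ignored.
    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature) {
        let bands = signature.values.chunks_exact(self.rows_per_band);
        for (bucket_map, band) in self.buckets.iter_mut().zip(bands) {
            let mut hasher = DefaultHasher::new();
            band.hash(&mut hasher);
            bucket_map.entry(hasher.finish()).or_default().push(doc_id);
        }
    }

    /// Every pair of docs sharing a bucket in at least one band,
    /// deduplicated and ordered as (smaller id, larger id).
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)> {
        let mut pairs = HashSet::new();
        for bucket_map in &self.buckets {
            for ids in bucket_map.values() {
                for (i, &a) in ids.iter().enumerate() {
                    for &b in &ids[i + 1..] {
                        if a != b {
                            pairs.insert((a.min(b), a.max(b)));
                        }
                    }
                }
            }
        }
        pairs.into_iter()
    }
}
```

Pair enumeration here is quadratic in bucket size and materializes all pairs in memory, so a production version would likely stream pairs or union bucket members directly. Candidates are only likely duplicates; comparing their signatures (or exact Jaccard) is still needed to reach the precision target.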

cluster

Union-Find for grouping duplicates:

pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self;
    pub fn find(&mut self, x: usize) -> usize;  // with path compression
    pub fn union(&mut self, x: usize, y: usize);  // by rank
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>>;
}
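One way to fill in this interface is shown below, as a sketch. It uses an iterative `find` (to avoid deep recursion on long chains) with full path compression, and union by rank as the comments above indicate. Indices are assumed to be dense positions `0..n`, so mapping `DocId` values to indices is left to the caller.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    /// Start with n singleton sets, each element its own root.
    pub fn new(n: usize) -> Self {
        Self {
            parent: (0..n).collect(),
            rank: vec![0; n],
        }
    }

    /// Find the root of x, then point every node on the path at it.
    pub fn find(&mut self, x: usize) -> usize {
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        let mut cur = x;
        while self.parent[cur] != root {
            let next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        root
    }

    /// Merge the sets containing x and y, attaching the shallower
    /// tree under the deeper one.
    pub fn union(&mut self, x: usize, y: usize) {
        let (rx, ry) = (self.find(x), self.find(y));
        if rx == ry {
            return;
        }
        match self.rank[rx].cmp(&self.rank[ry]) {
            Ordering::Less => self.parent[rx] = ry,
            Ordering::Greater => self.parent[ry] = rx,
            Ordering::Equal => {
                self.parent[ry] = rx;
                self.rank[rx] += 1;
            }
        }
    }

    /// Group all elements by root; singletons are included.
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>> {
        let mut out: HashMap<usize, Vec<usize>> = HashMap::new();
        for i in 0..self.parent.len() {
            let root = self.find(i);
            out.entry(root).or_default().push(i);
        }
        out
    }
}
```

Because `clusters` visits indices in ascending order, the first entry of each cluster is its smallest index, which gives a simple deterministic choice of which document to keep.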

io

File I/O for JSONL and Parquet:

pub fn read_jsonl(path: &Path, text_field: &str) -> Result<Vec<Document>>;
pub fn read_parquet(path: &Path, text_column: &str) -> Result<Vec<Document>>;
pub fn write_jsonl(path: &Path, docs: &[Document]) -> Result<()>;

cli (main.rs)

Command-line interface:

mithril-dedup input.jsonl -o output.jsonl --field text --threshold 0.85

Target Metrics

| Metric | Target |
|--------|--------|
| Throughput | ≥100K docs/sec |
| Precision | ≥0.95 |
| Recall | ≥0.90 |
| Memory (LSH) | <16GB for 1B docs |
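The precision and recall targets need a concrete definition to be checked. This document does not say how they are measured, so the sketch below assumes a pairwise definition: both the predicted and the ground-truth duplicates are sets of document-id pairs stored as (smaller id, larger id). The `Pair` alias and the function name are illustrative.

```rust
use std::collections::HashSet;

pub type Pair = (u64, u64); // assumption: ids ordered as (smaller, larger)

/// Pairwise precision and recall of predicted duplicate pairs
/// against known duplicate pairs. Empty inputs score 1.0 by convention.
pub fn precision_recall(predicted: &HashSet<Pair>, truth: &HashSet<Pair>) -> (f64, f64) {
    let true_positives = predicted.intersection(truth).count() as f64;
    let precision = if predicted.is_empty() {
        1.0
    } else {
        true_positives / predicted.len() as f64
    };
    let recall = if truth.is_empty() {
        1.0
    } else {
        true_positives / truth.len() as f64
    };
    (precision, recall)
}
```

A cluster-level definition (did each document land in the right group) would give different numbers, so whichever definition SPEC.md uses should take precedence.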

Key Dependencies

mithril-core = { workspace = true }
xxhash-rust = { workspace = true }
rayon = { workspace = true }
arrow = { workspace = true }
parquet = { workspace = true }
clap = { workspace = true }

Test Fixtures

fixtures/datasets/duplicates.jsonl - 1000 docs with 30% known duplicates

Testing

cargo test -p mithril-dedup
cargo bench -p mithril-dedup

Implementation Order

1. Implement minhash module with tests
2. Implement lsh module
3. Implement cluster (UnionFind)
4. Implement io for JSONL
5. Wire up CLI
6. Add Parquet support
7. Run benchmarks
8. Update STATUS.md

Completion Criteria

[ ] Detects near-duplicate pairs with Jaccard similarity ≥0.85
[ ] ≥100K docs/sec throughput
[ ] CLI works: mithril-dedup input.jsonl -o output.jsonl
[ ] Unit tests pass
[ ] STATUS.md updated to COMPLETE
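To check the Jaccard criterion against a fixture, an exact Jaccard computation is a useful reference for the MinHash estimate. The sketch below assumes documents are tokenized into word k-shingles hashed to `u64`; this document does not specify the tokenizer, so `shingle_hashes`, its use of `DefaultHasher`, and the shingle size are all illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash every window of k consecutive whitespace-separated words.
/// Texts shorter than k words produce an empty set.
pub fn shingle_hashes(text: &str, k: usize) -> HashSet<u64> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .windows(k.max(1))
        .map(|window| {
            let mut hasher = DefaultHasher::new();
            window.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

/// Exact Jaccard similarity |A ∩ B| / |A ∪ B|.
/// Two empty sets are treated as identical (1.0).
pub fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    let intersection = a.intersection(b).count();
    let union = a.len() + b.len() - intersection;
    intersection as f64 / union as f64
}
```

For example, two nine-word sentences that differ only in the last word share 6 of their 8 distinct 3-word shingles, giving a Jaccard similarity of 0.75, below the 0.85 threshold.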