Dark Data Curator Swarm (暗数据整理师 Swarm)

A 5-stage specialization pipeline (Pattern C) that addresses the dark data problem, in which an estimated 80% of RAG effort goes to data preparation rather than retrieval. This team automates the chaotic first mile (scanning scattered files, denoising, deduplicating, categorizing, and discovering cross-file associations) and produces structured knowledge units ready for any knowledge base.

Workflow

- Pre-flight: check dependencies. Read dependencies.yaml and verify availability. Report missing items: required: true → likely fails; required: false → degraded but functional. The user decides whether to proceed. The team can run in pure inline-persona mode with no external dependencies. Note: dependencies were populated via Stage 2 auto-matching; python is optional for PDF extraction, and no local skills matched.
- Stage 1: File Discovery. Dispatch file-scanner with target directory, depth limit, file cap, and skip patterns. Produces a complete file inventory grouped by format category. Gate: ≥ 1 file found. See bind.md § Failure Handling for input over-scale triggers (>500 files).
- Stage 2: Content Cleaning & Deduplication. Dispatch content-cleaner with the file inventory and similarity threshold (default 0.85). Extracts text, removes noise, deduplicates, and scores quality 1–5. Gate: ≥ 1 item extracted. If 0 items → HALT pipeline. See bind.md § Failure Handling.
- Stage 3: Semantic Classification & Clustering. Dispatch semantic-categorizer with the cleaned corpus. Assigns domain/type/depth/freshness labels and discovers emergent topic clusters. Gate: ≥ 1 item classified. If all unclassified → continue with "0 clusters" note.
- Stage 4: Cross-File Association & Gap Discovery. Dispatch association-mapper with categorized items. Finds thematic, referential, complementary, and contradictory associations; identifies knowledge gaps. Gate: always passes (0 associations is valid).
- Stage 5: Final Synthesis & Output. Dispatch knowledge-synthesizer with categorized items + association map. Produces Markdown knowledge units (one per cluster), full JSON, and RAG chunks (JSONL) with metadata and provenance.
- Final: emit Dark Data Curator Report. The Leader integrates all stage outputs into the Final Report (Summary, Output Manifest, Knowledge Unit Index, Knowledge Gap Summary, Pipeline Statistics). The Leader never mediates or rewrites stage outputs; it only composes them.
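
The gate behavior described in the stages above can be sketched roughly as follows. This is a minimal illustration, not the team's actual orchestration code: the names `run_pipeline`, `check_gate`, `RunLog`, `PipelineHalt`, and the `HALT`/`NOTE`/`PASS` policy labels are all hypothetical, and each stage is assumed to return its output together with the count its gate checks. Treating a failed Stage 1 gate as a halt is also an assumption; the workflow only states the HALT outcome explicitly for Stage 2.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Hypothetical gate policies for this sketch.
HALT = "halt"   # a count of 0 stops the pipeline (Stage 2; assumed for Stage 1)
NOTE = "note"   # a count of 0 continues with a "0 clusters" note (Stage 3)
PASS = "pass"   # the gate always passes (Stage 4)


class PipelineHalt(Exception):
    """Raised when a hard gate fails and the pipeline must stop."""


@dataclass
class RunLog:
    notes: List[str] = field(default_factory=list)


# A stage is (name, gate policy, callable returning (output, gate_count)).
Stage = Tuple[str, str, Callable[[Any], Tuple[Any, int]]]


def check_gate(stage: str, policy: str, count: int, log: RunLog) -> None:
    """Apply one gate: pass, record a note, or halt."""
    if policy == PASS or count >= 1:
        return
    if policy == NOTE:
        log.notes.append(f"{stage}: 0 clusters")
        return
    raise PipelineHalt(f"{stage}: gate failed, 0 items")


def run_pipeline(stages: List[Stage], payload: Any, log: RunLog) -> Any:
    """Run stages in order, feeding each output to the next stage."""
    for name, policy, run_stage in stages:
        payload, count = run_stage(payload)
        check_gate(name, policy, count, log)
    return payload
```

Keeping the policy as data next to each stage makes the differences between gates (halt, note, always pass) visible in one place instead of being scattered across stage implementations.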

Roles

| id | Purpose | When dispatched | Input | Key dependencies | Role file |
|---|---|---|---|---|---|
| file-scanner | Exhaustive file discovery with metadata, grouped by format | Every run, first | Target directory, depth, file cap, skip patterns | none | roles/file-scanner.md |
| content-cleaner | Text extraction, noise removal, deduplication, quality scoring | After scanner succeeds (gate pass) | File inventory from scanner, similarity threshold | python (optional, for PDF extraction) | roles/content-cleaner.md |
| semantic-categorizer | Per-item 4-dimension classification + emergent cluster discovery | After cleaner succeeds (gate pass) | Cleaned corpus from cleaner, min cluster size | none | roles/semantic-categorizer.md |
| association-mapper | Cross-file relationship discovery + knowledge gap identification | After categorizer completes | Categorized items from categorizer | none | roles/association-mapper.md |
| knowledge-synthesizer | Final three-format output (Markdown/JSON/RAG) with provenance | After associator completes | Categorized items + association map, output dir, chunk size | none | roles/knowledge-synthesizer.md |
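
To make the content-cleaner's similarity threshold concrete, here is a small sketch of threshold-based near-duplicate removal. The source does not say which similarity measure the role uses; `difflib.SequenceMatcher` from the Python standard library is a stand-in chosen for illustration, and the function name `deduplicate` and the "greater than or equal to" comparison are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def deduplicate(texts: List[str], threshold: float = 0.85) -> List[str]:
    """Keep the first occurrence of each text; drop later near-duplicates.

    Two texts count as duplicates when their SequenceMatcher ratio is at
    or above the threshold (an assumed comparison rule for this sketch).
    """
    kept: List[str] = []
    for text in texts:
        is_duplicate = any(
            SequenceMatcher(None, text, earlier).ratio() >= threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


docs = [
    "Quarterly revenue grew 12% in Q3.",
    "Quarterly revenue grew 12% in Q3!",
    "Install the driver before rebooting.",
]
unique_docs = deduplicate(docs)  # the second entry is dropped as a near-duplicate
```

This pairwise comparison is quadratic in the number of items, so it only suits small corpora; a real run near the 500-file scale would likely need a cheaper candidate filter first.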

Before dispatching each teammate, read the corresponding role file and extract the ## Inline Persona for Teammate section — paste it directly into the dispatch prompt. Most adopting agents do NOT auto-load role files for teammates.
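
The extraction step above could look something like the following sketch. It assumes the persona section runs from the `## Inline Persona for Teammate` heading to the next level-1 or level-2 heading (or end of file); the function name `extract_inline_persona` is hypothetical, and the actual layout of the role files may differ.

```python
import re

PERSONA_HEADING = "## Inline Persona for Teammate"


def extract_inline_persona(role_markdown: str, heading: str = PERSONA_HEADING) -> str:
    """Return the body of the persona section from a role file's text.

    Assumption: the section ends at the next '# ' or '## ' heading.
    Deeper headings ('### ...') are kept as part of the section. A line
    starting with '# ' inside a code fence would also end the section,
    which this simple sketch does not guard against.
    """
    lines = role_markdown.splitlines()
    start = None
    for index, line in enumerate(lines):
        if line.strip() == heading:
            start = index
            break
    if start is None:
        raise ValueError(f"section not found: {heading}")

    body = []
    for line in lines[start + 1:]:
        if re.match(r"#{1,2} ", line):
            break
        body.append(line)
    return "\n".join(body).strip()
```

The returned text can then be pasted into the dispatch prompt; raising on a missing section keeps a teammate from being dispatched without its persona.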

Files

| File | What it contains | When to read |
|---|---|---|
| workflow.md | Mermaid diagram, step-by-step protocol, quality gates, Final Report format | Before first dispatch; this is the complete playbook |
| bind.md | Resource limits, pipeline discipline rules, failure handling and degraded modes | When hitting limits, handling failures, or needing degraded-mode rules |
| roles/*.md | Per-role identity, success criteria, boundary, output schema, Inline Persona for Teammate | Before dispatching each teammate, to extract the Inline Persona |
| dependencies.yaml | External skills and tools required to run | At startup: verify deps, report missing items, user decides go/no-go |
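
As an illustration of the startup dependency check, the sketch below classifies missing dependencies the way the pre-flight step describes: a missing required item means the run likely fails, and a missing optional item means a degraded but functional run. The schema of dependencies.yaml is not shown in this document, so the entry shape (`name` and `required` keys), the function name `preflight`, and the use of `shutil.which` as the availability probe are all assumptions. Entries are passed in already parsed, so that no YAML library is needed here.

```python
import shutil
from typing import Callable, Dict, List


def on_path(name: str) -> bool:
    """Assumed availability probe: is an executable of this name on PATH?"""
    return shutil.which(name) is not None


def preflight(
    deps: List[dict],
    is_available: Callable[[str], bool] = on_path,
) -> Dict[str, List[str]]:
    """Sort dependencies into ok / likely_fails / degraded buckets."""
    report: Dict[str, List[str]] = {"ok": [], "likely_fails": [], "degraded": []}
    for dep in deps:
        name = dep["name"]
        if is_available(name):
            report["ok"].append(name)
        elif dep.get("required", False):
            report["likely_fails"].append(name)   # required: true and missing
        else:
            report["degraded"].append(name)       # required: false and missing
    return report
```

The report is only informational: per the workflow, the user still makes the go/no-go decision, and an empty dependency list corresponds to the pure inline-persona mode.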