Category: Other · Requires API Key

Data Cleaner AI

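Uses AI field detection, format standardization, and multi-source merging to clean and deduplicate multi-format data, with output to Excel, CSV, or Feishu Bitable.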

Author: billjamno58hubclawhub

Upload messy data — get clean, structured output. Supports multi-format parsing, AI field identification, intelligent dedup/fill/formatting, multi-source join, and Feishu-native output (Bitable + quality report doc).

Use cases: E-commerce order cleanup, CRM customer data cleansing, bank statement reconciliation, roster cleanup, multi-system data merge.


Capabilities

F1 · Multi-Format Parsing

  • Excel (.xlsx / .xls)
  • CSV / TSV
  • JSON (semi-structured)
  • Clipboard paste text

F2 · Smart Field Identification

  • AI auto-detects: name, phone, email, address, amount, date, SKU, order ID, ID number, gender, etc.
  • Supports user-defined field mapping override

F3 · Data Cleaning

  • Deduplication: Exact match + fuzzy dedup (FuzzyWuzzy, threshold 88%)
  • Missing value fill: Mean / mode / semantic inference / leave blank
  • Format standardization:
    • Phone → 1xx-xxxx-xxxx
    • Date → YYYY-MM-DD
    • Amount → 2 decimal places
    • Address → Province/City/District/Street standardization
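
The format rules above can be sketched with the Python standard library. This is a minimal illustration, not the code in cleaner.py: the function names, the list of accepted date layouts, the handling of a leading 86 country code, and the half-up rounding mode are all assumptions. Address standardization is left out because it needs a region dictionary.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from typing import Optional

# Assumed input date layouts; the skill's real list is not documented.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y%m%d")


def normalize_phone(raw: str) -> Optional[str]:
    """Return an 11-digit mobile number as 1xx-xxxx-xxxx, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)
    # Drop a leading 86 / 0086 country code when an 11-digit number follows.
    digits = re.sub(r"^(?:00)?86(?=1\d{10}$)", "", digits)
    if not re.fullmatch(r"1\d{10}", digits):
        return None
    return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no assumed layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def normalize_amount(raw: str) -> Optional[str]:
    """Strip currency symbols and separators, then round to 2 decimal places."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return str(value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

Returning None for values that cannot be parsed keeps the decision about what to do with them (leave blank, flag in the report) outside the formatting helpers.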

F4 · Data Classification / Tagging (PRO)

  • 8 built-in business rules (high-value customer, dormant user, VIP, enterprise, etc.)
  • Supports custom JSON rules
  • AI auto-tagging (requires PRO + AI API Key)
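
The custom rule format is not documented here, so the following is only a sketch of how JSON rules could drive tagging. The schema (`tag`, `field`, `op`, `value`), the field names, and the thresholds are hypothetical and chosen purely for illustration.

```python
import json
import operator

# Comparison operators a rule may name (hypothetical schema).
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

# Illustrative rules only; not the skill's built-in business rules.
RULES_JSON = """
[
  {"tag": "high_value", "field": "total_spend", "op": ">=", "value": 10000},
  {"tag": "dormant", "field": "days_since_last_order", "op": ">", "value": 180}
]
"""


def apply_rules(record: dict, rules: list) -> list:
    """Return the tags of every rule whose condition holds for the record."""
    tags = []
    for rule in rules:
        value = record.get(rule["field"])
        if value is None:
            continue  # A missing field never matches.
        if OPS[rule["op"]](value, rule["value"]):
            tags.append(rule["tag"])
    return tags


rules = json.loads(RULES_JSON)
print(apply_rules({"total_spend": 12000, "days_since_last_order": 10}, rules))
```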

F5 · Multi-Source Join / Merge (PRO)

  • Cross-file relational join on key fields
  • Fuzzy join when an exact key is not available (FuzzyWuzzy)
  • Conflict resolution for overlapping fields: priority by source order or latest timestamp
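
A fuzzy join with source-order conflict resolution might look like the sketch below. To stay self-contained it uses `difflib.SequenceMatcher` from the standard library as a stand-in for FuzzyWuzzy, so the scores are not identical to FuzzyWuzzy's. The threshold of 88 is borrowed from the dedup setting in F3 as an assumption, and the function names are hypothetical, not the API of merger.py. The latest-timestamp strategy is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def resolve(first: dict, second: dict) -> dict:
    """Source-order priority: non-empty values from the first source win."""
    merged = dict(second)
    merged.update({k: v for k, v in first.items() if v not in (None, "")})
    return merged


def fuzzy_join(left: list, right: list, key: str, threshold: float = 88.0) -> list:
    """Left join: pair each left row with its best-scoring right row, if any."""
    result = []
    for lrow in left:
        best, best_score = None, threshold
        for rrow in right:
            score = similarity(str(lrow[key]), str(rrow[key]))
            if score >= best_score:
                best, best_score = rrow, score
        # Unmatched left rows are kept unchanged.
        result.append(resolve(lrow, best) if best is not None else dict(lrow))
    return result


customers = [{"name": "Acme Trading Co.", "phone": "138-0013-8000"}]
orders = [
    {"name": "ACME Trading Co", "phone": "", "city": "Shanghai"},
    {"name": "Globex", "city": "Beijing"},
]
merged = fuzzy_join(customers, orders, key="name")
```

This compares every left row with every right row, which is fine for small files; larger inputs would typically need blocking on a cheaper key first.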

F6 · Feishu Native Output

  • Excel / CSV export
  • Feishu Bitable (multi-dimensional table) write-back
  • Data quality report auto-generated as Feishu Doc (Markdown)

Tier Feature Matrix

| Feature | FREE | PRO |
|---------|:----:|:---:|
| Multi-format parsing | ✅ | ✅ |
| Basic dedup | ✅ | ✅ |
| Smart fill | ❌ | ✅ |
| Format standardization | ❌ | ✅ |
| Fuzzy dedup | ❌ | ✅ |
| Multi-source merge | ❌ | ✅ |
| AI classification | ❌ | ✅ |
| Data quality report | ❌ | ✅ |
| Feishu Bitable output | ❌ | ✅ |
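
One way to encode this matrix is a per-tier set of feature names, with unavailable features skipped rather than treated as errors. The feature identifiers and the `plan_features` helper below are hypothetical and do not reflect the contents of tier_limits.py.

```python
# Feature names are illustrative slugs for the rows of the matrix.
FREE_FEATURES = {"multi_format_parsing", "basic_dedup"}

TIER_FEATURES = {
    "FREE": FREE_FEATURES,
    "PRO": FREE_FEATURES | {
        "smart_fill",
        "format_standardization",
        "fuzzy_dedup",
        "multi_source_merge",
        "ai_classification",
        "data_quality_report",
        "feishu_bitable_output",
    },
}


def plan_features(tier: str, requested: list) -> tuple:
    """Split requested features into (enabled, skipped) for the given tier."""
    allowed = TIER_FEATURES[tier]
    enabled = [name for name in requested if name in allowed]
    skipped = [name for name in requested if name not in allowed]
    return enabled, skipped


enabled, skipped = plan_features("FREE", ["basic_dedup", "fuzzy_dedup"])
```

Returning the skipped list lets the caller report which steps were left out instead of failing the whole run.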


Pricing

Per-call billing (no monthly fee):

| Tier | Price per Call |
|------|---------------|
| FREE | $0.00 USDT |
| PRO | $0.01 USDT |

Each cleaning pipeline execution (clean or merge) = one billable call.


Usage

Feishu Trigger

data cleaning
deduplication
spreadsheet cleanup
CRM data cleanup
Excel cleaning

CLI

python scripts/main.py clean -i data.xlsx -o cleaned.xlsx
python scripts/main.py clean -t "name,phone\nJohn,13800138000" -f csv -o cleaned.csv
python scripts/main.py merge --sources customers.xlsx orders.csv --on phone -o merged.xlsx

Python API

from main import run_clean_pipeline

result = run_clean_pipeline(
    sources=["orders.xlsx"],
    output_format="xlsx",
    output_path="/tmp/cleaned.xlsx",
    dedup_strategy="auto",
    fill_strategy="auto",
    classify=True,
    ai_model="deepseek",
    generate_report=True,
)

Directory Structure

data-cleaner-ai/
├── SKILL.md
└── scripts/
    ├── main.py              # Entry: run_clean_pipeline / run_merge_pipeline
    ├── parser.py            # F1: Multi-format parsing
    ├── field_identifier.py  # F2: AI field identification
    ├── cleaner.py           # F3: Cleaning engine
    ├── classifier.py        # F4: Classification / tagging
    ├── merger.py            # F5: Multi-source join
    ├── reporter.py          # F6: Quality report generation
    ├── output.py            # F6: Output (Excel/CSV/Bitable/Feishu Doc)
    ├── tier_limits.py       # Tier access control
    └── billing.py           # SkillPay billing integration

Billing

This skill uses SkillPay (skillpay.me) for per-call billing.

  • Fee: $0.0100 USDT per execution (all paid tiers)
  • External API: https://skillpay.me/api/v1/billing
  • Data transmitted: User identifier (FEISHU_USER_ID environment variable)

Billing occurs at the start of each cleaning or merge execution. If balance is insufficient, the tool returns a payment_url where the user can recharge.


Required Environment Variables

| Variable | Description |
|----------|-------------|
| FEISHU_USER_ID | Feishu user open_id for billing identification |
| OPENAI_API_KEY | AI model API key (OpenAI, MiniMax, or OpenAI-compatible endpoint) |
| OPENAI_API_BASE | Base URL for AI API (optional, defaults to MiniMax endpoint) |
| SKILL_BILLING_API_KEY | Builder API Key from skillpay.me (required for paid calls) |
| SKILL_BILLING_SKILL_ID | Skill slug on SkillPay (defaults to data-cleaner-ai) |
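
A startup check for these variables could be sketched as follows. Treating FEISHU_USER_ID and OPENAI_API_KEY as always required is an assumption, and the `SkillConfig` and `load_config` names are hypothetical. The default for OPENAI_API_BASE is left as None because the MiniMax endpoint URL is not given here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed to be mandatory for every run.
REQUIRED = ("FEISHU_USER_ID", "OPENAI_API_KEY")


@dataclass
class SkillConfig:
    feishu_user_id: str
    openai_api_key: str
    openai_api_base: Optional[str]   # None means "use the skill's default endpoint"
    billing_api_key: Optional[str]   # Only needed for paid calls
    billing_skill_id: str


def load_config(env: Mapping[str, str] = os.environ) -> SkillConfig:
    """Read settings from the environment and fail early on missing values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return SkillConfig(
        feishu_user_id=env["FEISHU_USER_ID"],
        openai_api_key=env["OPENAI_API_KEY"],
        openai_api_base=env.get("OPENAI_API_BASE"),
        billing_api_key=env.get("SKILL_BILLING_API_KEY"),
        billing_skill_id=env.get("SKILL_BILLING_SKILL_ID", "data-cleaner-ai"),
    )
```

Passing the environment in as a mapping makes the check easy to exercise in tests without touching real variables.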


Error Handling

| Error | Handling |
|-------|----------|
| Balance insufficient | Return payment_url for recharge |
| Network error on billing | Allow call through in dev mode (no charge) |
| Tier feature not available | Skip feature gracefully, continue with available features |
| No data source provided | Raise error requesting input |
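
The billing rows of this table can be sketched as control flow around an injected charge function, which avoids guessing at the SkillPay request format. Everything here is hypothetical: the `InsufficientBalance` exception, the `run_with_billing` wrapper, the shape of the returned dict, and the choice to re-raise network errors outside dev mode (the table only describes the dev-mode case).

```python
from typing import Callable


class InsufficientBalance(Exception):
    """Raised by the charge step when the user must recharge first."""

    def __init__(self, payment_url: str):
        super().__init__("insufficient balance")
        self.payment_url = payment_url


def run_with_billing(
    charge: Callable[[], None],
    pipeline: Callable[[], dict],
    dev_mode: bool = False,
) -> dict:
    """Charge first, then run the pipeline, following the table's rules."""
    try:
        charge()  # Billing happens before any cleaning work starts.
    except InsufficientBalance as exc:
        # Hand the recharge link back instead of running the pipeline.
        return {"ok": False, "payment_url": exc.payment_url}
    except ConnectionError:
        if not dev_mode:
            raise  # Assumed behavior; not specified in the table.
        # Dev mode: let the call through without charging.
    return {"ok": True, **pipeline()}
```

For example, a charge function that raises `InsufficientBalance("https://example.invalid/pay")` makes the wrapper return that URL under `payment_url` without running the pipeline.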


License

MIT