返回 Skill 列表
extension
分类: 数据与分析无需 API Key

S_Excel数据架构师

通过可复用的 DuckDB 优先 ETL 执行器运行 Excel/CSV/XLSX 分析,同时保留脚本优先架构。适用于系统需要通用可导出执行器来完成标准表电子表格分析、安全只读 SQL 执行、多表注册、查询模板、验证产物、script_usage.md 使用说明、运行清单,以及调度友好的可重复输出等场景。DuckDB 为首选执行引擎;其他工具仅用于文件探测、Excel 标准化或有明确文档记录的降级场景。

person作者: user_af28addahubcommunity

Excel DuckDB ETL Runner

Purpose

Use this skill when a standard-table or normalized Excel/CSV analysis can be executed by a reusable generic runner rather than a fully custom Python script. The bundled scripts/duckdb_etl_runner.py is both the skill helper and the default exportable reuse script for scheduler use.

This skill must stay aligned with the previous two skills:

  • preserve script-first formal execution;
  • prefer DuckDB for processing and analysis;
  • include a reusable script and script_usage.md in formal deliverables;
  • avoid loading raw spreadsheets into LLM context;
  • preserve validation, run manifests, field maps, SQL, and fallback logs.

When To Use

Use scripts/duckdb_etl_runner.py when:

  • the data is a standard table or has already been normalized;
  • the analysis can be expressed as DuckDB SQL;
  • independent queries should run without requiring an ETL dependency chain;
  • later calculations depend on earlier SQL results and can use intermediate CSV outputs registered as temporary tables;
  • multiple source tables need explicit SQL joins or relationship checks;
  • the task needs repeatable execution with stable inputs and output artifacts;
  • the user needs a script that can be scheduled later;
  • multi-table joins can be represented by explicit SQL and profiled relationships.

Use the broader scaffold skill instead when custom transform logic, complex Excel normalization, multi-stage dataframe processing, or report-specific output is required.

Required Deliverables

Every formal run must include:

  • duckdb_etl_runner.py or a copied/adapted equivalent reusable script;
  • script_usage.md;
  • request.json;
  • table_profile.json;
  • queries.sql;
  • validation.json;
  • at least one business-named result artifact, such as customer_sales.xlsx, category_metrics.csv, or named query outputs;
  • run_manifest.json.

Do not mark a formal analysis complete without the reusable script and usage document.

Runner Pattern

Prefer configuration over rewriting Python:

duckdb_etl_runner.py
recipe.json       # optional, captures table files, SQL, engine policy, expected outputs
query.sql         # SQL or multi-query SQL file
intermediate/*.csv # optional step outputs registered for later steps
script_usage.md

The runner supports direct SQL arguments, SQL files, and recipe queries. These are independent query paths and must keep working even when no recipe.steps dependency chain is used. For repeatable jobs, prefer --sql-file or --recipe so the business logic is versioned outside the command line.

For ETL-style dependent steps, use recipe.steps. Each step executes read-only SQL, writes the requested result formats, and may specify register_as so that a CSV copy is registered as a table for later steps. This keeps the runner safe without allowing arbitrary DDL.

For multi-table work, prefer a recipe with inputs.tables when per-table sheet names or clearer lineage are needed. Use --table-files name=path,name2=path2 for simple multi-file CLI runs. Joins must be explicit in SQL with ON or USING.

Result exports must use business-specific names. Use --output-name or recipe.outputs.name for explicit naming. If omitted, the runner derives the name from the single query/step output, or from the output directory for multi-result jobs. Use --formats xlsx,csv,json or recipe.outputs.formats when CSV and JSON result files are also required. Execution metadata JSON files are always produced with stable technical names for validation and scheduling handoff.

DuckDB-First Policy

DuckDB is the default processing and SQL engine. Use it for registration, joins, aggregation, windows, filters, and local reproducible computation. Fallback is allowed only when:

  • DuckDB is unavailable;
  • Excel loading or normalization must happen before DuckDB registration;
  • resource limits make a dataframe/chunked path safer;
  • the task is not SQL-shaped.

Record fallback reasons in fallback_log.json or validation.json.

Source Reference

The user-provided duckdb_analyzer.py is a reference implementation, not a file to copy wholesale. Borrow its useful ideas: safe identifier quoting, read-only SQL checks, query risk estimates, compressed JSON output, multi-table registration, join-aware SQL guidance, and diagnostic payloads.

Read references/runner-design.md for implementation notes and command examples.