Excel ETL Script Scaffold
Purpose
Use this skill when spreadsheet analysis should be executed through reusable ETL scripts instead of ad hoc notebook-style code. Preserve the architecture from excel-intelligent-processing-architecture: inspect, classify, normalize to standard tables, profile relationships, generate a script before formal execution, validate the script output, and explain only computed results.
DuckDB is the preferred SQL and join engine because it is fast, local, reproducible, and good at multi-table analysis. It is not a hard boundary, but it is the default first choice for execution and analysis. Use pandas or polars only when DuckDB is unavailable, unsuitable for the step, or clearly less efficient/safe for the task.
Script Routing
- Use
scripts/inspect_table_profile.pyfirst when the task starts with unknown Excel/CSV structure. It writes compact profiles that can be shown to an LLM without loading raw spreadsheets. - Use
scripts/etl_analysis_template.pyas the starting point for a task-specific formal analysis script. Copy it into the task run directory or output folder, then fill in task-specific normalization and analysis logic. - Use
scripts/join_profiler.pywhen multiple files or sheets need relationship discovery before generating joins. It currently uses DuckDB internally for overlap checks, but its output is engine-neutral. - Use
scripts/safe_duckdb_query.pyonly when SQL execution through DuckDB is appropriate. Treat it as the preferred SQL runner, not the only analysis path.
Do not paste large raw spreadsheet content into the conversation. Run scripts and pass compact JSON/Markdown summaries back into reasoning.
Formal Analysis Workflow
For formal analysis tasks, follow this sequence:
- Inspect sources with
inspect_table_profile.py. - Resolve blocking ambiguity about headers, metrics, dates, filters, joins, or output format.
- Generate or adapt a formal script from
etl_analysis_template.py. - Choose an execution engine: try DuckDB first for SQL/multi-table work; use pandas for moderate cleaning or simple analysis when DuckDB is not a good fit; use polars for larger columnar work when available and justified.
- Execute the formal script for the first user-facing result.
- Validate the run directory with produced artifacts, especially
validation.json. - Explain the result using only executed outputs.
Exploration can use temporary SQL or dataframe code, but a formal result should come from a script run.
ETL Quality Rules
Generated or adapted scripts must:
- accept runtime inputs through CLI parameters;
- avoid hard-coded user-local paths, sheet names, or business dates unless documented as defaults;
- separate extract, transform, validate, analyze, and output steps;
- create a run directory with machine-readable artifacts;
- write validation checks to
validation.json; - write final data outputs as CSV/JSON/XLSX or Markdown as appropriate;
- fail with a non-zero exit code for missing inputs, unsafe SQL, missing required columns, ambiguous joins, or failed strict validation;
- leave source files unchanged;
- check optional dependencies before using them.
Hard requirement: every formal execution deliverable must include both a reusable executable script and a script usage document. The usage document should usually be named script_usage.md and must document purpose, inputs, parameters, example commands, dependencies, outputs, validation behavior, exit codes, and known limitations.
Expected Run Artifacts
A good formal run directory usually contains:
runs/<task_id>/
request.json
file_profile.json
table_profile.json
field_map.json
cleaning_log.json
queries.sql # when SQL is used
join_profile.json # when joins are profiled
validation.json
result.csv
summary.md
analyze.py # or another reusable task script
script_usage.md
run_manifest.json
Not every task needs every optional file, but request.json, validation.json, a result artifact, a reusable script, script_usage.md, and a run manifest are mandatory for formal analysis.
Conformance To The Architecture Skill
This skill must stay compatible with the pure architecture version:
- It must not require DuckDB for every task.
- It must prefer DuckDB for execution and analysis, and record the reason for any fallback.
- It must keep standard-table normalization as the analysis boundary.
- It must keep Recipe/script-first execution for formal analysis.
- It must require a reusable script and a script usage document in formal execution deliverables.
- It must prefer compact profiles over raw data in the LLM context.
- It must preserve field maps, cleaning logs, validation checks, and run manifests.
- It must allow engine choice based on data size, dependency availability, and task shape.
For implementation details and command examples, read references/script-guide.md.
微信扫一扫