Excel ETL Script Scaffold

Purpose

Use this skill when spreadsheet analysis should be executed through reusable ETL scripts instead of ad hoc notebook-style code. Preserve the architecture from excel-intelligent-processing-architecture: inspect, classify, normalize to standard tables, profile relationships, generate a script before formal execution, validate the script output, and explain only computed results.

DuckDB is the preferred SQL and join engine because it is fast, local, reproducible, and good at multi-table analysis. It is not a hard boundary, but it is the default first choice for execution and analysis. Use pandas or polars only when DuckDB is unavailable, unsuitable for the step, or clearly less efficient/safe for the task.

Script Routing

Use scripts/inspect_table_profile.py first when the task starts with unknown Excel/CSV structure. It writes compact profiles that can be shown to an LLM without loading raw spreadsheets.
Use scripts/etl_analysis_template.py as the starting point for a task-specific formal analysis script. Copy it into the task run directory or output folder, then fill in task-specific normalization and analysis logic.
Use scripts/join_profiler.py when multiple files or sheets need relationship discovery before generating joins. It currently uses DuckDB internally for overlap checks, but its output is engine-neutral.
Use scripts/safe_duckdb_query.py only when SQL execution through DuckDB is appropriate. Treat it as the preferred SQL runner, not the only analysis path.

Do not paste large raw spreadsheet content into the conversation. Run scripts and pass compact JSON/Markdown summaries back into reasoning.

Formal Analysis Workflow

For formal analysis tasks, follow this sequence:

Inspect sources with inspect_table_profile.py.
Resolve blocking ambiguity about headers, metrics, dates, filters, joins, or output format.
Generate or adapt a formal script from etl_analysis_template.py.
Choose an execution engine: try DuckDB first for SQL/multi-table work; use pandas for moderate cleaning or simple analysis when DuckDB is not a good fit; use polars for larger columnar work when available and justified.
Execute the formal script for the first user-facing result.
Validate the run directory with produced artifacts, especially validation.json.
Explain the result using only executed outputs.

Exploration can use temporary SQL or dataframe code, but a formal result should come from a script run.

ETL Quality Rules

Generated or adapted scripts must:

accept runtime inputs through CLI parameters;
avoid hard-coded user-local paths, sheet names, or business dates unless documented as defaults;
separate extract, transform, validate, analyze, and output steps;
create a run directory with machine-readable artifacts;
write validation checks to validation.json;
write final data outputs as CSV/JSON/XLSX or Markdown as appropriate;
fail with a non-zero exit code for missing inputs, unsafe SQL, missing required columns, ambiguous joins, or failed strict validation;
leave source files unchanged;
check optional dependencies before using them.

Hard requirement: every formal execution deliverable must include both a reusable executable script and a script usage document. The usage document should usually be named script_usage.md and must document purpose, inputs, parameters, example commands, dependencies, outputs, validation behavior, exit codes, and known limitations.

Expected Run Artifacts

A good formal run directory usually contains:

runs/<task_id>/
    request.json
    file_profile.json
    table_profile.json
    field_map.json
    cleaning_log.json
    queries.sql              # when SQL is used
    join_profile.json        # when joins are profiled
    validation.json
    result.csv
    summary.md
    analyze.py               # or another reusable task script
    script_usage.md
    run_manifest.json

Not every task needs every optional file, but request.json, validation.json, a result artifact, a reusable script, script_usage.md, and a run manifest are mandatory for formal analysis.

Conformance To The Architecture Skill

This skill must stay compatible with the pure architecture version:

It must not require DuckDB for every task.
It must prefer DuckDB for execution and analysis, and record the reason for any fallback.
It must keep standard-table normalization as the analysis boundary.
It must keep Recipe/script-first execution for formal analysis.
It must require a reusable script and a script usage document in formal execution deliverables.
It must prefer compact profiles over raw data in the LLM context.
It must preserve field maps, cleaning logs, validation checks, and run manifests.
It must allow engine choice based on data size, dependency availability, and task shape.

For implementation details and command examples, read references/script-guide.md.