Excel DuckDB ETL Runner

Purpose

Use this skill when a standard-table or normalized Excel/CSV analysis can be executed by a reusable generic runner rather than a fully custom Python script. The bundled scripts/duckdb_etl_runner.py is both the skill helper and the default exportable reuse script for scheduler use.

This skill must stay aligned with the previous two skills:

preserve script-first formal execution;
prefer DuckDB for processing and analysis;
include a reusable script and script_usage.md in formal deliverables;
avoid loading raw spreadsheets into LLM context;
preserve validation, run manifests, field maps, SQL, and fallback logs.

When To Use

Use scripts/duckdb_etl_runner.py when:

the data is a standard table or has already been normalized;
the analysis can be expressed as DuckDB SQL;
independent queries should run without requiring an ETL dependency chain;
later calculations depend on earlier SQL results and can use intermediate CSV outputs registered as temporary tables;
multiple source tables need explicit SQL joins or relationship checks;
the task needs repeatable execution with stable inputs and output artifacts;
the user needs a script that can be scheduled later;
multi-table joins can be represented by explicit SQL and profiled relationships.

Use the broader scaffold skill instead when custom transform logic, complex Excel normalization, multi-stage dataframe processing, or report-specific output is required.

Required Deliverables

Every formal run must include:

duckdb_etl_runner.py or a copied/adapted equivalent reusable script;
script_usage.md;
request.json;
table_profile.json;
queries.sql;
validation.json;
at least one business-named result artifact, such as customer_sales.xlsx, category_metrics.csv, or named query outputs;
run_manifest.json.

Do not mark a formal analysis complete without the reusable script and usage document.

Runner Pattern

Prefer configuration over rewriting Python:

duckdb_etl_runner.py
recipe.json       # optional, captures table files, SQL, engine policy, expected outputs
query.sql         # SQL or multi-query SQL file
intermediate/*.csv # optional step outputs registered for later steps
script_usage.md

The runner supports direct SQL arguments, SQL files, and recipe queries. These are independent query paths and must keep working even when no recipe.steps dependency chain is used. For repeatable jobs, prefer --sql-file or --recipe so the business logic is versioned outside the command line.

For ETL-style dependent steps, use recipe.steps. Each step executes read-only SQL, writes the requested result formats, and may specify register_as so that a CSV copy is registered as a table for later steps. This keeps the runner safe without allowing arbitrary DDL.

For multi-table work, prefer a recipe with inputs.tables when per-table sheet names or clearer lineage are needed. Use --table-files name=path,name2=path2 for simple multi-file CLI runs. Joins must be explicit in SQL with ON or USING.

Result exports must use business-specific names. Use --output-name or recipe.outputs.name for explicit naming. If omitted, the runner derives the name from the single query/step output, or from the output directory for multi-result jobs. Use --formats xlsx,csv,json or recipe.outputs.formats when CSV and JSON result files are also required. Execution metadata JSON files are always produced with stable technical names for validation and scheduling handoff.

DuckDB-First Policy

DuckDB is the default processing and SQL engine. Use it for registration, joins, aggregation, windows, filters, and local reproducible computation. Fallback is allowed only when:

DuckDB is unavailable;
Excel loading or normalization must happen before DuckDB registration;
resource limits make a dataframe/chunked path safer;
the task is not SQL-shaped.

Record fallback reasons in fallback_log.json or validation.json.

Source Reference

The user-provided duckdb_analyzer.py is a reference implementation, not a file to copy wholesale. Borrow its useful ideas: safe identifier quoting, read-only SQL checks, query risk estimates, compressed JSON output, multi-table registration, join-aware SQL guidance, and diagnostic payloads.

Read references/runner-design.md for implementation notes and command examples.