Data Analysis Toolkit

A set of composable Python modules plus a CLI for the full analysis lifecycle: load → profile → validate → clean → transform → analyze (stats / ML / time series) → visualize → report → export.

When to use which entry point

"Just analyze this file and give me a report" → run the full pipeline via cli.py analyze (or analyze.run_analysis). This is the default for open-ended requests. It cleans, profiles, charts, optionally models a target, and writes an HTML report.
A specific step (only clean, only profile, only forecast, etc.) → call that module directly or use the matching CLI subcommand. Don't run the whole pipeline when the user asked for one thing.
Programmatic / multi-step work → import the modules in Python. Every function returns plain dicts/DataFrames so results compose cleanly.

Setup

The scripts depend on the scientific Python stack. Install once:

pip install -r requirements.txt

Run from inside scripts/ (the modules import each other by bare name), or add that directory to PYTHONPATH. All charting uses a headless Matplotlib backend, so everything works without a display (servers, CI, agents).
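
If neither option is convenient (in a notebook, for example), the directory can also be added at runtime. A minimal sketch, assuming the toolkit sits in a scripts/ folder relative to the working directory; that path is an assumption, so adjust it to your checkout:

```python
import os
import sys

# Assumption: the toolkit's modules live in ./scripts relative to the
# current working directory. Adjust the path for your layout.
SCRIPTS_DIR = os.path.abspath("scripts")

# Insert at the front so the toolkit's bare-name modules are found first.
# This matters for statistics.py, which shares its name with a standard
# library module; appending to the end would resolve to the stdlib one.
if SCRIPTS_DIR not in sys.path:
    sys.path.insert(0, SCRIPTS_DIR)

print(sys.path[0])
```

Do this before anything else imports the standard library's statistics module, since Python caches imports by name.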

CLI quick reference

cd scripts

# Full pipeline -> writes out/report.html + out/charts/*.png
python cli.py analyze ../data/sales.csv -o out/ --target churn

# Individual steps
python cli.py profile data.xlsx --correlations      # shape, types, missing, issues
python cli.py clean raw.csv cleaned.csv             # imputes, dedupes, fixes types
python cli.py stats data.csv --value revenue --group region   # significance test
python cli.py viz data.csv -o charts/              # auto-picks default charts
python cli.py model data.csv --target churn        # baseline classify/regress
python cli.py model data.csv --task cluster        # k-means, auto-picks k
python cli.py forecast sales.csv --time date --value revenue --horizon 12
python cli.py validate data.csv --schema schema.json   # exit code 1 if it fails
python cli.py validate data.csv --infer            # generate a starter schema

JSON-producing subcommands print to stdout so you can pipe them (| jq, > out.json).
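
The same output can be consumed from Python instead of a shell pipe. A minimal sketch using only the standard library: run_json is a hypothetical helper, and the demo command is a stand-in that mimics a subcommand printing JSON and exiting with code 1, so the example runs without the toolkit. The "valid" key is invented for the demo and is not the toolkit's output format.

```python
import json
import subprocess
import sys


def run_json(cmd):
    """Run a command and return (exit_code, parsed stdout JSON or None)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        # The command printed nothing, or something that is not JSON.
        payload = None
    return proc.returncode, payload


# Stand-in for a real subcommand: prints a JSON object, then exits with 1.
# The "valid" key is hypothetical, chosen only for this demo.
demo_cmd = [
    sys.executable,
    "-c",
    "import json, sys; print(json.dumps({'valid': False})); sys.exit(1)",
]

code, payload = run_json(demo_cmd)
print(code, payload)
```

With the toolkit, the command would be something like [sys.executable, "cli.py", "profile", "data.xlsx", "--correlations"]. The shape of the JSON it prints is not specified here, so inspect it before relying on particular keys.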

Module map

Read a module's docstring before using it; each one explains the module's role and the "why" behind its defaults. The modules live in scripts/:

| Module | Responsibility |
|---|---|
| utils.py | Shared foundation: load/save any format, type inference, logging, numpy-aware JSON. Everything imports from here. |
| profiler.py | Dataset profile (shape, dtypes, missingness, per-column stats), correlation matrix, data-quality issue detection. |
| validator.py | Validate a DataFrame against a JSON schema (dtypes, ranges, uniqueness, allowed values, regex); can also infer a schema from trusted data. |
| cleaner.py | Standardize names, coerce types, drop empties/dupes, impute missing, treat outliers. Every action is recorded in a report. |
| transformer.py | Feature engineering: scaling, categorical encoding, binning, datetime expansion, log transform. Returns refit params. |
| statistics.py | Descriptive + inferential stats: normality, t-test/ANOVA/Mann-Whitney/Kruskal (auto-chosen), chi-square, correlation tests, CIs. |
| visualization.py | Charts (histogram, box, scatter, correlation heatmap, bar, line) saved as PNG. auto_visualize picks sensible defaults. |
| ml.py | Baseline models: RandomForest classify/regress with metrics + feature importance, KMeans clustering with auto-k. auto_model picks the task. |
| timeseries.py | Datetime indexing, resampling, rolling stats, seasonal decomposition, ADF stationarity test, Holt-Winters forecast. |
| excel.py | Multi-sheet Excel read/write with autofit and frozen headers; append sheets to existing workbooks. |
| report.py | Assemble profile + stats + charts into Markdown or a standalone HTML page (charts embedded as data URIs). |
| export.py | Deliverables: single-table export (csv/json/parquet/xlsx/html), multi-sheet workbooks, results JSON, full bundle. |
| analyze.py | Orchestrator. run_analysis() wires the steps together end-to-end. |
| cli.py | Argparse front end exposing every module as a subcommand. |

Typical programmatic flow

import analyze
result = analyze.run_analysis("data.csv", outdir="out", target="churn")
# result has: profile, cleaning, correlations, statistics, model,
#             charts, cleaned_data (DataFrame), report_path

Or compose steps yourself when you need control:

from utils import load_data
import cleaner, statistics as st, visualization as viz

df = load_data("data.csv")
df = cleaner.clean(df)["df"]
test = st.compare_groups(df, value_col="revenue", group_col="region")
viz.auto_visualize(df, "charts/")
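
The compare_groups call above is described as choosing between parametric and non-parametric tests from a normality check. A minimal sketch of that kind of selection in plain SciPy, for the two-group case only. This is not the toolkit's implementation: pick_two_group_test, the 0.05 threshold, and the sample data are all assumptions made for illustration.

```python
from scipy import stats


def pick_two_group_test(a, b, alpha=0.05):
    """Use Welch's t-test if both samples look normal, else Mann-Whitney U."""
    # Shapiro-Wilk: a small p-value is evidence against normality.
    normality_p = [stats.shapiro(x)[1] for x in (a, b)]
    if all(p > alpha for p in normality_p):
        name = "welch-t"
        result = stats.ttest_ind(a, b, equal_var=False)
    else:
        name = "mann-whitney"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {
        "test": name,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Made-up samples: two evenly spread groups and one heavily skewed group.
even = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

print(pick_two_group_test(even, shifted)["test"])
print(pick_two_group_test(even, skewed)["test"])
```

For three or more groups the analogous pair would be ANOVA and Kruskal-Wallis, both of which statistics.py is listed as supporting.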

Guidance for good analysis

Profile before you clean, and clean before you model. The pipeline does this for you; preserve that order if composing by hand.
Don't hide what cleaning did. The result of cleaner.clean includes a report list alongside the cleaned DataFrame; surface it to the user so dropped rows and imputations are visible, not silent.
Let the stats module pick the test. compare_groups auto-selects parametric vs non-parametric from a normality check; only override when the user has a specific test in mind.
Treat ML output as a baseline, not a verdict. ml.py answers "is there signal here?" Report metrics honestly (including a weak R²/accuracy) rather than overselling.
For time series, set a frequency. Decomposition and forecasting need a regular index; pass freq= (D/W/M/…) so gaps are resampled and interpolated.
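
The last point can be illustrated in plain pandas. A minimal sketch, not the implementation in timeseries.py: regularize is a hypothetical helper, the data is made up, and linear interpolation is an assumed choice of gap-filling method.

```python
import pandas as pd

# Made-up daily series with two missing days (Jan 3 and Jan 4).
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
        "revenue": [100.0, 110.0, 140.0],
    }
)


def regularize(df, time_col, value_col, freq="D"):
    """Put a series on a regular index and fill gaps by linear interpolation."""
    series = df.set_index(time_col)[value_col].sort_index()
    # resample() creates one bucket per period; empty buckets become NaN,
    # and mean() collapses any duplicate timestamps within a bucket.
    regular = series.resample(freq).mean()
    return regular.interpolate(method="linear")


regular = regularize(raw, "date", "revenue", freq="D")
print(regular)
```

Without the resampling step the index has an irregular gap between Jan 2 and Jan 5, which is the situation decomposition and forecasting routines typically cannot handle.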