Data Analysis Toolkit
A set of composable Python modules plus a CLI for the full analysis lifecycle: load → profile → validate → clean → transform → analyze (stats / ML / time series) → visualize → report → export.
When to use which entry point
- "Just analyze this file and give me a report" → run the full pipeline via cli.py analyze (or analyze.run_analysis). This is the default for open-ended requests. It cleans, profiles, charts, optionally models a target, and writes an HTML report.
- A specific step (only clean, only profile, only forecast, etc.) → call that module directly or use the matching CLI subcommand. Don't run the whole pipeline when the user asked for one thing.
- Programmatic / multi-step work → import the modules in Python. Every function returns plain dicts/DataFrames so results compose cleanly.
Setup
The scripts depend on the scientific Python stack. Install once:
```bash
pip install -r requirements.txt
```
Run from inside scripts/ (the modules import each other by bare name), or add
that directory to PYTHONPATH. All charting uses a headless Matplotlib backend,
so everything works without a display (servers, CI, agents).
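If you would rather not change directories, the path setup can also be done from inside Python. The following is a minimal sketch, not part of the toolkit: the helper name enable_toolkit is hypothetical. It puts scripts/ at the front of sys.path, which matters because the toolkit's statistics.py shares its name with a standard-library module and would otherwise lose the lookup.

```python
import sys
from pathlib import Path


def enable_toolkit(scripts_dir: str) -> Path:
    """Put the toolkit's scripts/ directory first on sys.path (hypothetical helper)."""
    path = Path(scripts_dir).resolve()
    if str(path) not in sys.path:
        # Insert at position 0 so scripts/statistics.py is found before
        # the standard-library module of the same name.
        sys.path.insert(0, str(path))
    return path


# Example (path is an assumption about your checkout layout):
# enable_toolkit("scripts")
# import statistics as st   # now resolves to the toolkit module
```

Call it once, before the first toolkit import. Setting PYTHONPATH has the same effect, since those entries also precede the standard library.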
CLI quick reference
```bash
cd scripts

# Full pipeline -> writes out/report.html + out/charts/*.png
python cli.py analyze ../data/sales.csv -o out/ --target churn

# Individual steps
python cli.py profile data.xlsx --correlations               # shape, types, missing, issues
python cli.py clean raw.csv cleaned.csv                      # imputes, dedupes, fixes types
python cli.py stats data.csv --value revenue --group region  # significance test
python cli.py viz data.csv -o charts/                        # auto-pick default charts
python cli.py model data.csv --target churn                  # baseline classify/regress
python cli.py model data.csv --task cluster                  # k-means, auto-picks k
python cli.py forecast sales.csv --time date --value revenue --horizon 12
python cli.py validate data.csv --schema schema.json         # exit code 1 if it fails
python cli.py validate data.csv --infer                      # generate a starter schema
```
JSON-producing subcommands print to stdout so you can pipe them (| jq, > out.json).
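The same stdout output can be consumed from Python instead of a shell pipe. This is a small sketch under assumptions: run_json is a hypothetical helper, and which subcommands actually emit JSON is not specified here, so check before relying on it. It returns the exit code alongside the parsed payload, which is useful for commands such as validate that signal failure through the exit status.

```python
import json
import subprocess
import sys


def run_json(cmd: list[str]) -> tuple[int, object]:
    """Run a command and return (exit_code, parsed_stdout_json_or_None)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Empty stdout (for example, on an early failure) yields None instead of raising.
    payload = json.loads(proc.stdout) if proc.stdout.strip() else None
    return proc.returncode, payload


# Hypothetical usage, run from inside scripts/:
# code, profile = run_json([sys.executable, "cli.py", "profile", "data.csv"])
# if code != 0:
#     raise SystemExit(f"profile failed with exit code {code}")
```

Log messages are assumed to go to stderr; if a subcommand mixes them into stdout, json.loads will raise and the output needs filtering first.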
Module map
Read the module's docstring before using it — each explains its role and the
"why" behind its defaults. They live in scripts/:
| Module | Responsibility |
|---|---|
| utils.py | Shared foundation: load/save any format, type inference, logging, numpy-aware JSON. Everything imports from here. |
| profiler.py | Dataset profile (shape, dtypes, missingness, per-column stats), correlation matrix, data-quality issue detection. |
| validator.py | Validate a DataFrame against a JSON schema (dtypes, ranges, uniqueness, allowed values, regex); can also infer a schema from trusted data. |
| cleaner.py | Standardize names, coerce types, drop empties/dupes, impute missing, treat outliers. Every action is recorded in a report. |
| transformer.py | Feature engineering: scaling, categorical encoding, binning, datetime expansion, log transform. Returns the fitted parameters needed to reapply the same transform to new data. |
| statistics.py | Descriptive + inferential stats: normality, t-test/ANOVA/Mann-Whitney/Kruskal (auto-chosen), chi-square, correlation tests, CIs. |
| visualization.py | Charts (histogram, box, scatter, correlation heatmap, bar, line) saved as PNG. auto_visualize picks sensible defaults. |
| ml.py | Baseline models: RandomForest classify/regress with metrics + feature importance, KMeans clustering with auto-k. auto_model picks the task. |
| timeseries.py | Datetime indexing, resampling, rolling stats, seasonal decomposition, ADF stationarity test, Holt-Winters forecast. |
| excel.py | Multi-sheet Excel read/write with autofit and frozen headers; append sheets to existing workbooks. |
| report.py | Assemble profile + stats + charts into Markdown or a standalone HTML page (charts embedded as data URIs). |
| export.py | Deliverables: single-table export (csv/json/parquet/xlsx/html), multi-sheet workbooks, results JSON, full bundle. |
| analyze.py | Orchestrator. run_analysis() wires the steps together end-to-end. |
| cli.py | Argparse front end exposing every module as a subcommand. |
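The statistics.py row says the group-comparison test is auto-chosen. To make that idea concrete, here is a rough two-group sketch using SciPy. It is not the module's actual code: the function name, the 0.05 threshold, and the specific choices of Shapiro-Wilk for normality and Welch's variant of the t-test are all assumptions made for illustration.

```python
from scipy import stats


def choose_two_group_test(a, b, alpha=0.05):
    """Pick Welch's t-test if both samples pass Shapiro-Wilk, else Mann-Whitney U."""
    # A sample "looks normal" here if Shapiro-Wilk does not reject at alpha.
    normal = all(stats.shapiro(x).pvalue > alpha for x in (a, b))
    if normal:
        result = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
        name = "welch_t"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "mann_whitney_u"
    return {
        "test": name,
        "normal": normal,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }
```

With three or more groups the same pattern would switch between ANOVA and Kruskal-Wallis. Normality tests lose power on small samples, so treat the automatic choice as a sensible default rather than proof that the assumptions hold.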
Typical programmatic flow
```python
import analyze

result = analyze.run_analysis("data.csv", outdir="out", target="churn")
# result has: profile, cleaning, correlations, statistics, model,
#             charts, cleaned_data (DataFrame), report_path
```
Or compose steps yourself when you need control:
```python
from utils import load_data
import cleaner, statistics as st, visualization as viz

df = load_data("data.csv")
df = cleaner.clean(df)["df"]
test = st.compare_groups(df, value_col="revenue", group_col="region")
viz.auto_visualize(df, "charts/")
```
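The composed flow above keeps only the cleaned DataFrame and discards whatever else cleaner.clean returned. If you want to show the user what cleaning did, something like the sketch below may help. It assumes the result exposes its action list under a "report" key and that each entry can be rendered with str(); neither is confirmed here, so check the cleaner.py docstring. The formatter itself is a hypothetical helper.

```python
def format_cleaning_report(report):
    """Render a list of cleaning actions as numbered lines for display."""
    if not report:
        return "No cleaning actions were recorded."
    return "\n".join(f"{i}. {entry}" for i, entry in enumerate(report, start=1))


# Hypothetical usage; the "report" key is an assumption:
# result = cleaner.clean(df)
# df = result["df"]
# print(format_cleaning_report(result.get("report", [])))
```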
Guidance for good analysis
- Profile before you clean, and clean before you model. The pipeline does this for you; preserve that order if composing by hand.
- Don't hide what cleaning did. cleaner.clean returns a report list; surface it to the user so dropped rows and imputations are visible, not silent.
- Let the stats module pick the test. compare_groups auto-selects parametric vs non-parametric from a normality check; only override when the user has a specific test in mind.
- Treat ML output as a baseline, not a verdict. ml.py answers "is there signal here?" Report metrics honestly (including a weak R²/accuracy) rather than overselling.
- For time series, set a frequency. Decomposition and forecasting need a regular index; pass freq= (D/W/M/…) so gaps are resampled and interpolated.
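To illustrate the last point, here is what putting a series on a regular index can look like in plain pandas. This is a sketch only, not the timeseries.py implementation: the function name regularize and the choice of mean aggregation plus time-based interpolation are assumptions.

```python
import pandas as pd


def regularize(df, time_col, value_col, freq="D"):
    """Return value_col on a regular `freq` grid, with gaps interpolated."""
    s = (
        df.assign(**{time_col: pd.to_datetime(df[time_col])})
        .set_index(time_col)[value_col]
        .sort_index()
    )
    # resample() creates one bucket per period; periods with no rows become NaN.
    regular = s.resample(freq).mean()
    # Fill those NaN gaps linearly, weighted by elapsed time.
    return regular.interpolate(method="time")
```

For example, daily data with rows for Jan 1, 2 and 4 comes back with four rows, and Jan 3 is filled halfway between its neighbours. Interpolation suits short gaps; long runs of missing periods are usually better reported than filled.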