AutoML: Automated Machine Learning

Overview

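An AutoML skill for tabular data, inspired by Karpathy's autoresearch concept. An AI agent runs ML experiments autonomously, iteratively optimizing model parameters and structure to find the best configuration in a CPU-only environment. Supports 7 traditional ML models, early stopping, cross-validation, and ensemble strategies.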

Workflow

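User requests data modeling
    ↓
1. Collect parameters (data path / target column / task type / metric / budget)
    ↓
2. Analyze data (read the meta-features from prepare.py, determine recommended models and strategy)
    ↓
3. Initialize workspace (copy scripts and data, git init, establish baseline)
    ↓
4. Experiment loop (modify train.py → run → evaluate → keep/discard → iterate)
    ↓
5. Output the best model and configuration report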

Step 1: Collect parameters

Ask user for:

data_path: CSV file path (required)
target_col: target column name (required)
task_type: classification / regression / multiclass (auto-infer if not specified)
metric: evaluation metric (default by task type: accuracy for classification, rmse for regression, f1_macro for multiclass)
time_budget: seconds per experiment (default 120)
max_experiments: max rounds (default 50)
models: limit model list, e.g. "lgb,xgb" (default: auto-select based on data scale)
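
The parameters above could be held in a small container like the one below. This is a minimal sketch: `ExperimentConfig`, `resolved_metric`, and `model_list` are hypothetical names, not part of the skill's scripts, and the metric defaults assume the three defaults are listed in the same order as the three task types.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed mapping: defaults paired with task types in the order they are listed.
DEFAULT_METRIC = {
    "classification": "accuracy",
    "regression": "rmse",
    "multiclass": "f1_macro",
}


@dataclass
class ExperimentConfig:  # hypothetical container for illustration only
    data_path: str                    # required
    target_col: str                   # required
    task_type: Optional[str] = None   # None means "infer from the data later"
    metric: Optional[str] = None      # None means "use the task-type default"
    time_budget: int = 120            # seconds per experiment
    max_experiments: int = 50         # maximum number of rounds
    models: Optional[str] = None      # e.g. "lgb,xgb"; None means auto-select

    def resolved_metric(self) -> Optional[str]:
        """Return the explicit metric, else the default for the task type."""
        if self.metric is not None:
            return self.metric
        if self.task_type is None:
            return None  # no default can be chosen before the task is known
        return DEFAULT_METRIC[self.task_type]

    def model_list(self) -> Optional[List[str]]:
        """Split the comma-separated model limit into a clean list."""
        if self.models is None:
            return None
        return [name.strip() for name in self.models.split(",") if name.strip()]
```

For the quick start request later in this document, `ExperimentConfig("./customer_churn.csv", "churn", task_type="classification", metric="roc_auc", time_budget=180, max_experiments=30)` would carry every value the later steps need.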

Step 2: Analyze data

Run data check:

python scripts/prepare.py --check-data

The command prints the sample count, feature count, target distribution, missing values, and recommended models. When planning experiments, consult references/optimization_strategies.md for tuning strategy.
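
To make the data check concrete, the sketch below computes the same kind of summary with the standard library only. It is not the code in scripts/prepare.py; `check_data` and the keys of its result are hypothetical, and a cell counts as missing here only when it is empty or whitespace.

```python
import csv
from collections import Counter


def check_data(csv_path, target_col):
    """Summarize a CSV: samples, features, missing cells per column, target counts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        if target_col not in columns:
            raise ValueError(f"target column {target_col!r} not found")
        n_samples = 0
        missing = Counter()
        target_counts = Counter()
        for row in reader:
            n_samples += 1
            for col in columns:
                value = row[col]
                # Short rows yield None; blank cells yield an empty string.
                if value is None or value.strip() == "":
                    missing[col] += 1
            target_counts[row[target_col]] += 1
    return {
        "n_samples": n_samples,
        "n_features": len(columns) - 1,  # every column except the target
        "missing": dict(missing),
        "target_distribution": dict(target_counts),
    }
```

The sample count is what drives the model recommendation, and the target distribution helps when choosing between accuracy and a class-balanced metric such as f1_macro.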

Step 3: Initialize workspace

python scripts/auto_research.py --data <path> --target <col> --task <type> --setup-only

This creates:

./automl_exp/
├── prepare.py       # Fixed: data loading, preprocessing, metrics
├── train.py         # Modifiable: model, params, feature engineering
├── data.csv         # User data
├── results.tsv      # Experiment log
└── .automl_cache/   # Cache + config

Then run the baseline to establish a reference metric.

Step 4: Experiment loop (core)

For each round:

1. Analyze: read results.tsv for the current best, the recent trend, and what has worked
2. Design experiment: follow the priority hyperparameter tuning > model switch > feature engineering > ensemble
3. Modify train.py: update MODEL_NAME, MODEL_PARAMS, FEATURE_ENGINEERING
4. Run: python train.py > run.log 2>&1, with a timeout
5. Evaluate: parse the output for val_metric, memory, train_time
6. Decide: keep (improvement) or discard (no gain), then git commit or revert accordingly
7. Loop: return to item 1 of this list
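
The run and evaluate steps of the loop can be sketched as follows. This is an illustration under assumptions: the log line format `val_metric: <number>` is a guess at what train.py prints, and `run_experiment` and `parse_val_metric` are hypothetical helpers rather than functions from the skill's scripts.

```python
import re
import subprocess
import sys

# Assumed log format, e.g. "val_metric: 0.8734"; the real train.py output may differ.
VAL_METRIC_RE = re.compile(r"val_metric\s*[:=]\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_val_metric(log_text):
    """Return the last val_metric value found in the log text, or None."""
    matches = VAL_METRIC_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def run_experiment(script="train.py", log_path="run.log", timeout_s=120):
    """Run the script with stdout and stderr sent to log_path.

    Returns (status, val_metric) where status is "ok", "crashed", or "timeout".
    """
    try:
        with open(log_path, "w", encoding="utf-8") as log:
            proc = subprocess.run(
                [sys.executable, script],
                stdout=log,
                stderr=subprocess.STDOUT,  # same effect as "> run.log 2>&1"
                timeout=timeout_s,         # the child is killed once this expires
            )
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        return "crashed", None
    with open(log_path, encoding="utf-8") as log:
        value = parse_val_metric(log.read())
    if value is None:
        return "crashed", None  # finished, but reported no metric
    return "ok", value
```

Treating a run that exits cleanly without a metric as crashed keeps a silent failure from being mistaken for a result.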

For experiment design rules and tuning strategies, read references/program.md and references/optimization_strategies.md.
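
The keep/discard decision depends on metric direction: rmse, mae, and mape improve when they fall, the other supported metrics when they rise. The sketch below shows one way to encode that. The authoritative criteria live in references/program.md; `is_improvement`, `decide`, and the `min_delta` threshold are hypothetical.

```python
# Error metrics where a lower value is better; the other supported metrics
# (accuracy, f1 variants, precision, recall, roc_auc, r2) are higher-is-better.
LOWER_IS_BETTER = {"rmse", "mae", "mape"}


def is_improvement(metric, candidate, best, min_delta=0.0):
    """True if candidate beats best by more than min_delta in the metric's direction."""
    if best is None:
        return True  # the first valid result always becomes the reference
    if metric in LOWER_IS_BETTER:
        return candidate < best - min_delta
    return candidate > best + min_delta


def decide(metric, candidate, best, min_delta=0.0):
    """Return ("keep", new_best) or ("discard", best).

    candidate is None when the run crashed or timed out.
    """
    if candidate is not None and is_improvement(metric, candidate, best, min_delta):
        return "keep", candidate      # caller would git commit train.py
    return "discard", best            # caller would revert train.py
```

A small positive `min_delta` guards against keeping changes whose gain is within run-to-run noise.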

Key tuning priority (GBT models):

1. learning_rate + n_estimators (paired, coarse-to-fine)
2. num_leaves / max_depth (complexity control)
3. subsample + colsample_bytree (regularization via sampling)
4. reg_alpha + reg_lambda (L1/L2)
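
As an illustration of the paired, coarse-to-fine search over learning_rate and n_estimators, the sketch below uses one common heuristic that is an assumption here, not something the skill prescribes: keep the product learning_rate × n_estimators roughly constant, so a smaller step size gets more trees. The `budget` value of 30 and both function names are hypothetical.

```python
def paired_candidates(learning_rates, budget=30.0, max_estimators=3000):
    """Pair each learning rate with n_estimators so lr * n_estimators stays near budget."""
    pairs = []
    for lr in learning_rates:
        n = min(max_estimators, max(1, round(budget / lr)))
        pairs.append({"learning_rate": lr, "n_estimators": n})
    return pairs


def refine_around(best_lr, factor=1.5):
    """Fine stage: geometric neighbours of the best coarse learning rate."""
    return [round(best_lr / factor, 4), best_lr, round(best_lr * factor, 4)]


# Coarse stage: widely spaced learning rates.
coarse = paired_candidates([0.3, 0.1, 0.03])
# Fine stage: suppose 0.1 won the coarse stage, then search close to it.
fine = paired_candidates(refine_around(0.1))
```

With early stopping enabled, n_estimators acts as an upper bound, so the paired value only needs to be large enough for the chosen learning rate.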

Do NOT ask the user "should I continue?" during the loop. Run autonomously until max_experiments is reached or the user stops the run.

Step 5: Output results

Report:

Total experiments / kept / discarded / crashed
Best metric and improvement over baseline
Best model configuration
Feature importance (if available)
Overfitting diagnostic (train-val gap)
Next step recommendations
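
Two of the report items reduce to simple arithmetic: the experiment counts and the train-val gap. The sketch below is illustrative only. The status strings "keep", "discard", and "crash" and both function names are assumptions, and the real results.tsv layout is not shown in this document.

```python
from collections import Counter

# Error metrics where a lower value is better.
LOWER_IS_BETTER = {"rmse", "mae", "mape"}


def train_val_gap(metric, train_score, val_score):
    """Signed gap; a positive value means training looks better than validation."""
    if metric in LOWER_IS_BETTER:
        return val_score - train_score
    return train_score - val_score


def summarize(statuses):
    """Count experiment outcomes from a list of status strings (assumed labels)."""
    counts = Counter(statuses)
    return {
        "total": len(statuses),
        "kept": counts["keep"],
        "discarded": counts["discard"],
        "crashed": counts["crash"],
    }
```

A large positive gap suggests overfitting, which points toward stronger regularization or a simpler model in the next step recommendations.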

Model selection by data scale

| Scale | Samples | Priority Models |
|-------|---------|----------------|
| Tiny | <1K | rf, extra |
| Small | 1K-10K | gbdt, catboost |
| Medium | 10K-100K | lgb, xgb |
| Large | >100K | lgb |
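
The table above maps directly onto a lookup by sample count. In this sketch `priority_models` is a hypothetical helper, and the handling of the exact boundaries is an assumption: 1,000 counts as Small, 10,000 as Medium, and 100,000 as Medium, since Large is listed as strictly greater than 100K.

```python
def priority_models(n_samples):
    """Return (scale, priority model list) for a given number of samples."""
    if n_samples < 1_000:
        return "Tiny", ["rf", "extra"]
    if n_samples < 10_000:        # assumed: exactly 10K falls into Medium
        return "Small", ["gbdt", "catboost"]
    if n_samples <= 100_000:      # Large is ">100K", so 100K stays Medium
        return "Medium", ["lgb", "xgb"]
    return "Large", ["lgb"]
```

When the user passes an explicit models limit, that list would take precedence over this recommendation.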

Supported metrics

Classification: accuracy, f1, f1_macro, f1_weighted, precision, recall, roc_auc
Regression: rmse, mae, r2, mape

Dependencies

scikit-learn, pandas, numpy, xgboost, lightgbm, catboost, psutil

Install if missing:

pip install scikit-learn pandas numpy xgboost lightgbm catboost psutil

Quick start example

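Please use the automl skill to optimize my customer churn prediction model:
- Data: ./customer_churn.csv
- Target column: churn
- Task: classification
- Metric: roc_auc
- Time budget: 180 seconds per round
- At most 30 rounds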

Resources

scripts/prepare.py: Fixed data pipeline (do NOT modify during experiments)
scripts/train.py: Training script (modify MODEL_NAME, MODEL_PARAMS, feature engineering)
scripts/auto_research.py: Automated experiment runner with baseline and loop
references/optimization_strategies.md: Hyperparameter tuning priority, model selection, ensemble strategies
references/program.md: Experiment behavior rules, keep/discard criteria, phase strategies