返回 Skill 列表
extension
分类: 内容与媒体无需 API Key

finetune-generate

在为多轮对话微调生成合成训练数据时使用。触发条件 - 已准备好设计文档,需要生成对话,准备评估质量。需要先完成finetune-design。

person作者: jakexiaohubgithub

Fine-tune Generate

Iteratively generate and filter training data until quality stabilizes.

Prerequisites

Complete finetune-design first. You need:

  • [ ] Model choice and token constraints
  • [ ] Input taxonomy
  • [ ] Evaluation rubric with calibration examples
  • [ ] Persona template
  • [ ] User simulator, assistant, and system prompts

Outputs

By the end of this phase, you will have:

  • [ ] training_data.jsonl — Filtered, sliced training examples
  • [ ] generation_stats.md — Pass rates, criterion breakdown, iterations
  • [ ] prompt_versions/ — History of prompt iterations

The Core Loop

This is the most important part of the entire pipeline.

┌─────────────────────────────────────────────────────────────┐
│  TIGHT LOOP (5 transcripts per iteration)                   │
│                                                             │
│  1. Generate 5 transcripts                                  │
│  2. Assess with rubric (all backends)                       │
│  3. HUMAN REVIEWS both transcripts AND assessments          │
│  4. Iterate based on human judgment                         │
│  5. Repeat until ≥70% pass rate AND human satisfied         │
│                                                             │
│  Then: Scale to full volume                                 │
└─────────────────────────────────────────────────────────────┘

Why 5 Transcripts?

  • Small enough for human to actually READ each one carefully
  • Fast feedback (minutes, not hours)
  • See patterns without wasting compute
  • Iterate while context is fresh

Why Human-in-the-Loop? (Non-Negotiable)

Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:

| Human reviews... | Looking for... | |------------------|----------------| | Transcripts | Quality issues the rubric might miss | | Assessment results | False positives (passed but shouldn't have) | | Assessment results | False negatives (failed but seems fine) | | Both together | Gaps in what the rubric even checks |

Without human review:

  • You're optimizing against a potentially broken metric
  • False positives silently corrupt training data
  • Rubric blind spots never get discovered

Red Flags: Rationalizations to Resist

| Rationalization | Reality | |-----------------|---------| | "Human review slows us down" | Skipping review = optimizing against broken metric. 1 hour of review saves days of bad data. | | "Pass rate is high, must be fine" | High pass rate with single backend misses 20-30% of issues. Multi-backend + human review required. | | "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. | | "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. | | "One assessor backend is enough" | Single backend gave transcript 1000 perfect 1.0; other backends caught 4 failures. | | "Let's just scale and filter later" | Scaling before 70% pass rate wastes compute. Fix prompts first. |

If you catch yourself using any of these rationalizations: STOP. Follow the gates.

Dual Iteration

You iterate on TWO things, not one:

| When you see... | Iterate on... | |-----------------|---------------| | Transcript quality issues | Generation prompts (user-sim, assistant) | | Assessment seems wrong | Assessor prompt, criteria wording | | Backend disagreement | Calibration examples for that criterion | | Missing failure mode | Add new criterion to rubric | | Pass rates high but something feels off | Run expert role-play critique |

The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.

Expert role-play critique is required — periodically have Claude role-play domain experts to critique your rubric and small transcript batch directly. This catches blind spots invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.


Workflow

Step 1: Tight Iteration Loop

For each batch of 5 transcripts:

  1. Generate 5 transcripts using two-agent simulation
  2. Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
  3. Human reviews both transcripts and assessments:
    • Read each transcript: Is this actually good?
    • Read each assessment: Did the rubric catch what matters?
    • Note: false positives, false negatives, missing criteria
  4. Iterate based on human judgment:
    • Fix generation prompts (if transcript quality issues)
    • Fix assessor prompt/criteria (if assessment issues)
    • Add calibration examples (if edge cases found)
  5. Repeat until quality stabilizes

Gate (before scaling):

| Condition | Action | |-----------|--------| | ≥70% pass rate AND human satisfied | Proceed to scale | | 50-70% OR human sees issues | Continue iterating | | <50% | Major revision needed |

Reference: generation-guide.md, assessment-guide.md


Step 2: Scale Generation

Once the tight loop stabilizes:

  1. Generate target volume (100+ transcripts)
  2. Continue assessment with same multi-backend approach
  3. Periodic human spot-checks (every 20-50 transcripts)
  4. Track statistics (pass rate, criterion breakdown)

Warning signs during scale:

  • Pass rate drifting down → Revisit prompts
  • New failure patterns emerging → Add criteria
  • Perfect scores (1.0) → Suspiciously high, investigate

Step 3: Audit Patterns

Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:

| Check | Red Flag | Action | |-------|----------|--------| | Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate | | Structural rigidity | 100% same format | Vary response structure | | Response length ratio | Avg >2x user length | Tighten length constraints | | Praise distribution | Late responses 2x more praise | Adjust tone consistency |

Gate: No audit red flags

Reference: assessment-guide.md#audit-patterns


Step 4: Fixup or Reject

For failing transcripts, decide whether to fix or reject:

| Failure Type | Action | |--------------|--------| | Soft failures (language, tone) | Attempt fixup with entailment constraint | | Safety gate failures | Truncate at failure point or reject entirely | | Structural issues | Usually reject |

Entailment constraint: Fixed response must naturally lead to user's next message. If fix breaks continuity → truncate instead.

If >30% need fixup: Generation prompts need revision.

Reference: assessment-guide.md#fixup-strategy


Step 5: Slice for Training

Create training examples from full transcripts:

50-turn transcript → ~8-10 training examples via slicing

Slicing strategy:

  • Random slice points (seeded by transcript ID for reproducibility)
  • Minimum 3 exchanges before first slice
  • 2-5 exchange gaps between slices
  • Always include final turn

Token validation:

  • Each slice must be under your token limit (e.g., 16K)
  • Long transcripts may need truncation

Leakage prevention:

  • Split by transcript/persona FIRST
  • Then slice within each split
  • Never let slices from same transcript in both train and validation

Reference: assessment-guide.md#slicing-strategy

Optional: Use hugging-face-dataset-creator skill when ready to push training_data.jsonl to HF Hub.


Infrastructure

Checkpointing

Write progress after each transcript, not at the end:

for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch

Retry with Backoff

API failures will happen. Use exponential backoff:

  • Claude: 7 attempts, 1-hour max wait
  • Google: Extract retry delay from error message
  • OpenAI: Standard exponential backoff

Progress Tracking

Track throughout generation:

  • Transcripts generated / target
  • Transcripts assessed / generated
  • Pass rate (rolling and cumulative)
  • Criterion failure breakdown

Reference: assessment-guide.md#infrastructure


Resources

| Resource | What It Contains | |----------|------------------| | code/SETUP-REFERENCE.md | Script templates: generate.py, assess.py, slice.py | | code/infrastructure.py | Copy-paste ready: LLM backend, retry strategies, checkpointing | | examples/therapy-domain.md | Complete therapy example: prompts, flaw patterns, criteria |


Done When

  • [ ] Target training example count reached
  • [ ] Pass rate stable across last 2-3 batches (≥70%)
  • [ ] Human satisfied with transcript quality
  • [ ] Audit patterns within thresholds
  • [ ] training_data.jsonl validated

Next Phase

finetune-train