Fine-tune Generate

Iteratively generate and filter training data until quality stabilizes.

Prerequisites

Complete finetune-design first. You need:

[ ] Model choice and token constraints
[ ] Input taxonomy
[ ] Evaluation rubric with calibration examples
[ ] Persona template
[ ] User simulator, assistant, and system prompts

Outputs

By the end of this phase, you will have:

[ ] training_data.jsonl — Filtered, sliced training examples
[ ] generation_stats.md — Pass rates, criterion breakdown, iterations
[ ] prompt_versions/ — History of prompt iterations

The Core Loop

This is the most important part of the entire pipeline.

┌─────────────────────────────────────────────────────────────┐
│  TIGHT LOOP (5 transcripts per iteration)                   │
│                                                             │
│  1. Generate 5 transcripts                                  │
│  2. Assess with rubric (all backends)                       │
│  3. HUMAN REVIEWS both transcripts AND assessments          │
│  4. Iterate based on human judgment                         │
│  5. Repeat until ≥70% pass rate AND human satisfied         │
│                                                             │
│  Then: Scale to full volume                                 │
└─────────────────────────────────────────────────────────────┘

Why 5 Transcripts?

Small enough for human to actually READ each one carefully
Fast feedback (minutes, not hours)
See patterns without wasting compute
Iterate while context is fresh

Why Human-in-the-Loop? (Non-Negotiable)

Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:

| Human reviews... | Looking for... | |------------------|----------------| | Transcripts | Quality issues the rubric might miss | | Assessment results | False positives (passed but shouldn't have) | | Assessment results | False negatives (failed but seems fine) | | Both together | Gaps in what the rubric even checks |

Without human review:

You're optimizing against a potentially broken metric
False positives silently corrupt training data
Rubric blind spots never get discovered

Red Flags: Rationalizations to Resist

| Rationalization | Reality | |-----------------|---------| | "Human review slows us down" | Skipping review = optimizing against broken metric. 1 hour of review saves days of bad data. | | "Pass rate is high, must be fine" | High pass rate with single backend misses 20-30% of issues. Multi-backend + human review required. | | "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. | | "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. | | "One assessor backend is enough" | Single backend gave transcript 1000 perfect 1.0; other backends caught 4 failures. | | "Let's just scale and filter later" | Scaling before 70% pass rate wastes compute. Fix prompts first. |

If you catch yourself using any of these rationalizations: STOP. Follow the gates.

Dual Iteration

You iterate on TWO things, not one:

| When you see... | Iterate on... | |-----------------|---------------| | Transcript quality issues | Generation prompts (user-sim, assistant) | | Assessment seems wrong | Assessor prompt, criteria wording | | Backend disagreement | Calibration examples for that criterion | | Missing failure mode | Add new criterion to rubric | | Pass rates high but something feels off | Run expert role-play critique |

The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.

Expert role-play critique is required — periodically have Claude role-play domain experts to critique your rubric and small transcript batch directly. This catches blind spots invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.

Workflow

Step 1: Tight Iteration Loop

For each batch of 5 transcripts:

Generate 5 transcripts using two-agent simulation
Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
Human reviews both transcripts and assessments:
- Read each transcript: Is this actually good?
- Read each assessment: Did the rubric catch what matters?
- Note: false positives, false negatives, missing criteria
Iterate based on human judgment:
- Fix generation prompts (if transcript quality issues)
- Fix assessor prompt/criteria (if assessment issues)
- Add calibration examples (if edge cases found)
Repeat until quality stabilizes

Gate (before scaling):

| Condition | Action | |-----------|--------| | ≥70% pass rate AND human satisfied | Proceed to scale | | 50-70% OR human sees issues | Continue iterating | | <50% | Major revision needed |

Reference: generation-guide.md, assessment-guide.md

Step 2: Scale Generation

Once the tight loop stabilizes:

Generate target volume (100+ transcripts)
Continue assessment with same multi-backend approach
Periodic human spot-checks (every 20-50 transcripts)
Track statistics (pass rate, criterion breakdown)

Warning signs during scale:

Pass rate drifting down → Revisit prompts
New failure patterns emerging → Add criteria
Perfect scores (1.0) → Suspiciously high, investigate

Step 3: Audit Patterns

Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:

| Check | Red Flag | Action | |-------|----------|--------| | Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate | | Structural rigidity | 100% same format | Vary response structure | | Response length ratio | Avg >2x user length | Tighten length constraints | | Praise distribution | Late responses 2x more praise | Adjust tone consistency |

Gate: No audit red flags

Reference: assessment-guide.md#audit-patterns

Step 4: Fixup or Reject

For failing transcripts, decide whether to fix or reject:

| Failure Type | Action | |--------------|--------| | Soft failures (language, tone) | Attempt fixup with entailment constraint | | Safety gate failures | Truncate at failure point or reject entirely | | Structural issues | Usually reject |

Entailment constraint: Fixed response must naturally lead to user's next message. If fix breaks continuity → truncate instead.

If >30% need fixup: Generation prompts need revision.

Reference: assessment-guide.md#fixup-strategy

Step 5: Slice for Training

Create training examples from full transcripts:

50-turn transcript → ~8-10 training examples via slicing

Slicing strategy:

Random slice points (seeded by transcript ID for reproducibility)
Minimum 3 exchanges before first slice
2-5 exchange gaps between slices
Always include final turn

Token validation:

Each slice must be under your token limit (e.g., 16K)
Long transcripts may need truncation

Leakage prevention:

Split by transcript/persona FIRST
Then slice within each split
Never let slices from same transcript in both train and validation

Reference: assessment-guide.md#slicing-strategy

Optional: Use hugging-face-dataset-creator skill when ready to push training_data.jsonl to HF Hub.

Infrastructure

Checkpointing

Write progress after each transcript, not at the end:

for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch

Retry with Backoff

API failures will happen. Use exponential backoff:

Claude: 7 attempts, 1-hour max wait
Google: Extract retry delay from error message
OpenAI: Standard exponential backoff

Progress Tracking

Track throughout generation:

Transcripts generated / target
Transcripts assessed / generated
Pass rate (rolling and cumulative)
Criterion failure breakdown

Reference: assessment-guide.md#infrastructure

Resources

| Resource | What It Contains | |----------|------------------| | code/SETUP-REFERENCE.md | Script templates: generate.py, assess.py, slice.py | | code/infrastructure.py | Copy-paste ready: LLM backend, retry strategies, checkpointing | | examples/therapy-domain.md | Complete therapy example: prompts, flaw patterns, criteria |

Done When

[ ] Target training example count reached
[ ] Pass rate stable across last 2-3 batches (≥70%)
[ ] Human satisfied with transcript quality
[ ] Audit patterns within thresholds
[ ] training_data.jsonl validated

Next Phase

→ finetune-train