PaperBanana: Academic Illustration Pipeline

Automates publication-ready academic illustrations with 5 specialized agents: Retriever (categorize & select references) -> Planner (multimodal description) -> Stylist (polish) -> Visualizer (render) -> Critic (evaluate & refine). In diagram mode, each agent is implemented as a separate Gemini API call; in plot mode, the host agent performs the same roles and emits Python code.

Two output modes:

DIAGRAM MODE: Each agent is a Python script calling Gemini VLM/image APIs. Run scripts/orchestrate.py for end-to-end execution.
PLOT MODE: Statistical plots generated as executable Python matplotlib/seaborn code (code-based to eliminate data hallucination).

Requirements: GOOGLE_API_KEY env var (used for VLM calls in retriever/planner/stylist/critic AND image generation in visualizer), Python 3.10+ with google-genai, matplotlib, seaborn, numpy, pillow.

Paper: PaperBanana: Automating Academic Illustrations with Multi-Agent Systems (arXiv:2601.23265, Google/PKU)

Step 1: Determine Output Mode

Decide which track to follow:

| Signal | Mode |
|--------|------|
| User provides raw data, table, CSV + visual intent (bar chart, scatter, etc.) | PLOT MODE |
| User provides methodology text, description, or figure caption | DIAGRAM MODE |
| User provides existing figure to improve | Match original type |

Critical rule: PLOT MODE always generates Python code (never image generation for data visualizations). Code-based generation eliminates data hallucination errors that corrupt numerical accuracy in image-based approaches.
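
The routing rules above can be sketched as a small function. This is a minimal illustration, not code from the repository: `choose_mode` and its parameters are hypothetical names, and giving raw data priority when several signals are present is an assumption drawn from the critical rule.

```python
from typing import Optional


def choose_mode(has_raw_data: bool,
                has_methodology_text: bool,
                existing_figure_type: Optional[str] = None) -> str:
    """Return "plot" or "diagram" following the signal table (hypothetical helper)."""
    # An existing figure to improve keeps its original type.
    if existing_figure_type is not None:
        if existing_figure_type not in ("plot", "diagram"):
            raise ValueError(f"unknown figure type: {existing_figure_type!r}")
        return existing_figure_type
    # Data visualizations are never image-generated, so raw data wins
    # (assumed tie-break when both data and methodology text are supplied).
    if has_raw_data:
        return "plot"
    if has_methodology_text:
        return "diagram"
    raise ValueError("need raw data, methodology text, or an existing figure")
```

For example, `choose_mode(True, False)` selects the plot track, while a request that only carries a methodology section selects the diagram track.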

Step 2: Execute Pipeline

DIAGRAM MODE — Automated Pipeline

Primary entry point: Run the end-to-end orchestrator:

python scripts/orchestrate.py \
  --methodology-file methodology.txt \
  --caption "Figure 1: Overview of proposed framework" \
  --mode diagram \
  --output output/diagram.png

Or with inline text:

python scripts/orchestrate.py \
  --methodology "Our framework consists of three modules..." \
  --caption "Figure 1: System overview" \
  --mode diagram \
  --output output/diagram.png

The orchestrator chains all 5 agents automatically and handles the Critic's refinement loop (up to 3 iterations). Intermediate outputs are saved to output/work/ for inspection.

Pipeline Details

Read references/DIAGRAM-PROMPTS.md for the actual Gemini prompt templates used by each agent.

Phase 1: RETRIEVER (scripts/retriever.py) — Gemini VLM call

Classifies methodology into 1 of 4 categories from references/DIAGRAM-CATEGORIES.md
Selects 2 most relevant reference diagrams from the 13 curated examples in assets/references/
Identifies visual intent: Framework Overview, Pipeline/Flow, Detailed Module, Architecture Diagram

Phase 2: PLANNER (scripts/planner.py) — Multimodal Gemini VLM call

Sends the 2 selected reference images + methodology text to Gemini as a multimodal prompt
The VLM "sees" what good methodology diagrams look like (in-context learning from images)
Generates an extremely detailed textual description of the target diagram
Critical: Describe all visual attributes in natural language only. NEVER use hex codes or pixel dimensions

Phase 3: STYLIST (scripts/stylist.py) — Gemini VLM call

Takes the Planner's description + full NeurIPS 2025 style guide
Applies domain-specific styling based on the category from Phase 1
Follows 5 critical rules: preserve aesthetics, intervene minimally, respect domain, enrich details, preserve content
Outputs the polished description only

Phase 4: VISUALIZER (scripts/generate_image.py) — Gemini Image API call

Uses gemini-3-pro-image-preview to generate the diagram image from the styled description
Prepends quality prefix (high-res, legible text, clean background, no watermarks)
Aspect ratio selected based on visual intent (16:9 for pipelines, 3:2 for modules)
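
As a rough sketch of how the Visualizer input might be assembled: the function below prepends a quality prefix and picks an aspect ratio from the visual intent. The prefix wording, the helper name `build_visualizer_request`, and the fallback ratio for intents the text does not cover are all assumptions; the actual logic lives in scripts/generate_image.py and may differ.

```python
# Assumed wording; the text only lists the qualities the prefix should request.
QUALITY_PREFIX = (
    "High-resolution academic diagram with legible text, "
    "a clean background, and no watermarks. "
)

# Ratios for the two intents named in the text.
ASPECT_RATIOS = {
    "Pipeline/Flow": "16:9",
    "Detailed Module": "3:2",
}
DEFAULT_ASPECT_RATIO = "16:9"  # assumption: other intents are not specified


def build_visualizer_request(styled_description: str, visual_intent: str) -> dict:
    """Combine the quality prefix, styled description, and aspect ratio."""
    return {
        "prompt": QUALITY_PREFIX + styled_description.strip(),
        "aspect_ratio": ASPECT_RATIOS.get(visual_intent, DEFAULT_ASPECT_RATIO),
    }
```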

Phase 5: CRITIC (scripts/critic.py) — Multimodal Gemini VLM call

Sends the generated image + methodology text to Gemini for multimodal evaluation
Scores on 4 dimensions (faithfulness, readability, conciseness, aesthetics)
If faithfulness < 7 OR readability < 7: generates revised description → loops to Phase 4
Maximum 3 refinement iterations
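
The refinement rule can be sketched as a loop over two injected callables. This is an illustration under stated assumptions, not the orchestrator's code: `refine`, `render`, and `critique` are hypothetical names, and "maximum 3 refinement iterations" is read here as at most three re-renders after the initial image.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 7        # faithfulness/readability cutoff from the rule above
MAX_REFINEMENTS = 3  # refinement budget from the rule above


def needs_revision(scores: Dict[str, float]) -> bool:
    """True if either gating dimension falls below the threshold."""
    return scores["faithfulness"] < THRESHOLD or scores["readability"] < THRESHOLD


def refine(description: str,
           render: Callable[[str], str],
           critique: Callable[[str, str], Tuple[Dict[str, float], str]]
           ) -> Tuple[str, int]:
    """Render, critique, and re-render until scores pass or the budget is spent.

    render(description) -> image path
    critique(image_path, description) -> (scores, revised_description)
    Returns the final image path and the number of refinements used.
    """
    image = render(description)
    refinements = 0
    while refinements < MAX_REFINEMENTS:
        scores, revised = critique(image, description)
        if not needs_revision(scores):
            break
        description = revised          # Critic's revised description
        image = render(description)    # loop back to Phase 4
        refinements += 1
    return image, refinements
```

Conciseness and aesthetics are still scored, but in this reading only faithfulness and readability decide whether another iteration runs.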

DIAGRAM MODE — Manual Execution

You can also run each agent individually for more control:

# Phase 1: Retriever
python scripts/retriever.py --methodology-file text.txt --output work/retriever.json

# Phase 2: Planner
python scripts/planner.py --methodology-file text.txt --caption "Figure 1: ..." \
  --references work/retriever.json --output work/planner.json

# Phase 3: Stylist
python scripts/stylist.py --description work/planner.json --output work/stylist.json

# Phase 4: Visualizer (first write the styled_description field of work/stylist.json to work/styled_desc.txt)
python scripts/generate_image.py --prompt-file work/styled_desc.txt --output output/diagram.png

# Phase 5: Critic
python scripts/critic.py --image output/diagram.png --methodology-file text.txt \
  --description work/stylist.json --output work/critic.json
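
The Phase 4 command reads work/styled_desc.txt, which none of the commands above create. A minimal sketch of the extraction step follows. It assumes work/stylist.json is a JSON object with a top-level `styled_description` key; the key name comes from the Phase 4 comment, but the file layout is an assumption, and `extract_styled_description` is a hypothetical helper.

```python
import json
from pathlib import Path


def extract_styled_description(stylist_json: str, out_txt: str,
                               key: str = "styled_description") -> str:
    """Copy the styled description from the Stylist JSON into a text file."""
    data = json.loads(Path(stylist_json).read_text(encoding="utf-8"))
    if key not in data:
        # Fail loudly if the assumed layout does not match the real file.
        raise KeyError(f"{key!r} not found in {stylist_json}; keys: {sorted(data)}")
    out = Path(out_txt)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(data[key], encoding="utf-8")
    return data[key]
```

Between Phase 3 and Phase 4 this would be called as `extract_styled_description("work/stylist.json", "work/styled_desc.txt")`.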

PLOT MODE

Read references/PLOT-PROMPTS.md for detailed agent prompts. Read references/PLOT-STYLE-GUIDE.md for aesthetic rules.

Plot mode uses Claude (or the host agent) for reasoning and code generation — no Gemini API calls needed for plot generation itself.

Phase 1: CATEGORIZE (Retriever)

Match data characteristics and visual intent:

| Data Type | Plot Types |
|-----------|------------|
| Categorical comparison | Bar chart, grouped bar, stacked bar |
| Continuous trends | Line chart, area chart |
| Correlation/distribution | Scatter plot, histogram, box plot, violin |
| Matrix/similarity | Heatmap, confusion matrix |
| Multi-dimensional | Radar/spider chart |
| Proportional | Pie/donut chart, treemap |

Phase 2: PLAN (Planner)

Create a detailed specification that explicitly enumerates:

Every raw data point with exact coordinates/values
Axis ranges, labels, tick marks, scales (linear/log)
Color assignments for each series/category
Font sizes for title, axis labels, tick labels, legend
Line widths, marker sizes, marker shapes
Legend placement and formatting
Grid style (major/minor, dashed/solid)
Figure dimensions and DPI
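
One way to keep such a specification explicit is a pair of dataclasses. The sketch below is illustrative only: `PlotSpec` and `SeriesSpec` are hypothetical names, the default values are placeholders, and this is not the schema of the JSON config that scripts/plot_generator.py accepts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SeriesSpec:
    label: str
    points: List[Tuple[float, float]]  # every raw (x, y) value, written out
    color: str                         # color assigned to this series
    marker: str = "o"
    line_width: float = 1.5
    marker_size: float = 5.0


@dataclass
class PlotSpec:
    x_label: str
    y_label: str
    x_range: Tuple[float, float]
    y_range: Tuple[float, float]
    series: List[SeriesSpec]
    x_scale: str = "linear"            # "linear" or "log"
    y_scale: str = "linear"
    font_sizes: Dict[str, int] = field(
        default_factory=lambda: {"title": 12, "axis_label": 10,
                                 "tick_label": 8, "legend": 8}
    )
    legend_loc: str = "upper left"
    grid: str = "major, dashed"
    figure_size_in: Tuple[float, float] = (6.0, 4.0)
    dpi: int = 300

    def points_outside_axes(self) -> List[Tuple[str, Tuple[float, float]]]:
        """List (series label, point) pairs that fall outside the axis ranges."""
        (x0, x1), (y0, y1) = self.x_range, self.y_range
        return [
            (s.label, p)
            for s in self.series
            for p in s.points
            if not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1)
        ]
```

Checking `points_outside_axes()` before code generation catches a plan whose axis ranges would silently clip data.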

Phase 3: STYLE (Stylist)

Read references/PLOT-STYLE-GUIDE.md for NeurIPS 2025 plot aesthetics.

Key styling rules:

White backgrounds only
Colorblind-friendly palettes (see assets/palettes/colorblind_safe.json)
Sans-serif fonts (Helvetica, Arial, or DejaVu Sans)
Markers on line charts for print readability
Inward-facing tick marks
Subtle grid lines (light gray, dashed)

Phase 4: VISUALIZE (Visualizer — Code Generation)

Generate complete, self-contained Python matplotlib/seaborn code. Use scripts/plot_generator.py as a reference implementation or run it directly with a JSON config:

python scripts/plot_generator.py --config plot_config.json --output figure.pdf

Code requirements:

Self-contained: all data defined inline, no external file dependencies
Apply .mplstyle from assets/matplotlib_styles/academic_default.mplstyle
Set OUTPUT_PATH variable for output file location
300 DPI, bbox_inches='tight'
No plt.show() — save only
Support both PDF and PNG output

After generating the code, execute it to produce the plot image.
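
A script meeting the requirements above might look like the following sketch. The data is the bar chart example from the quick-start prompt. The fallback rcParams (used only when the style sheet is not found) and the Okabe-Ito colors are assumptions standing in for the repository's own style sheet and palette files, whose contents are not shown here.

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend: save only, never show
import matplotlib.pyplot as plt

OUTPUT_PATH = "figure.pdf"  # a .png path works the same way
STYLE_PATH = "assets/matplotlib_styles/academic_default.mplstyle"

# All data defined inline (values from the quick-start bar chart prompt).
MODELS = ["BERT", "GPT-4", "Claude", "Gemini"]
F1_SCORES = [92.3, 88.1, 95.7, 91.2]

# Okabe-Ito colorblind-safe colors (assumed; not read from the palette file).
COLORS = ["#0072B2", "#E69F00", "#009E73", "#D55E00"]

# Assumed fallback styling, applied only if the style sheet is missing.
FALLBACK_STYLE = {
    "figure.facecolor": "white",
    "axes.facecolor": "white",
    "font.family": "sans-serif",
    "font.sans-serif": ["Helvetica", "Arial", "DejaVu Sans"],
    "xtick.direction": "in",
    "ytick.direction": "in",
    "grid.color": "lightgray",
    "grid.linestyle": "--",
}
plt.style.use(STYLE_PATH if os.path.exists(STYLE_PATH) else FALLBACK_STYLE)

fig, ax = plt.subplots(figsize=(5, 3.5))
bars = ax.bar(MODELS, F1_SCORES, color=COLORS)
ax.bar_label(bars, fmt="%.1f", padding=2)  # print each exact value
ax.set_ylabel("F1 score")
ax.set_ylim(0, 105)
ax.set_title("F1 score comparison across language models")
ax.yaxis.grid(True)
ax.set_axisbelow(True)  # keep grid lines behind the bars

fig.savefig(OUTPUT_PATH, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Because every value is a literal in the script, the plotted heights can be checked against the source data by reading the code.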

Phase 5: CRITIQUE (Critic)

Same rubric as diagram mode, plus plot-specific checks:

Data fidelity: Every data point correctly plotted
Axis accuracy: Ranges, labels, scales match specification
Layout: No overlapping labels, legends, or data points
Code correctness: Syntax valid, imports available, output saved

If code execution fails, analyze the error, simplify the approach, and regenerate the code.
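
Part of the code-correctness check can be done statically before running anything. The sketch below uses Python's `ast` module to confirm that the source parses, that `OUTPUT_PATH` is assigned, that `savefig` is called, and that `show` is not. `check_plot_code` is a hypothetical helper, not part of scripts/validate_output.py, and it does not replace actually executing the script.

```python
import ast
from typing import List


def check_plot_code(source: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []

    # Names bound by plain assignments such as OUTPUT_PATH = "figure.pdf".
    assigned = {
        target.id
        for node in ast.walk(tree) if isinstance(node, ast.Assign)
        for target in node.targets if isinstance(target, ast.Name)
    }
    if "OUTPUT_PATH" not in assigned:
        problems.append("OUTPUT_PATH is never assigned")

    # Attribute-style calls such as plt.show() or fig.savefig(...).
    called = {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }
    if "show" in called:
        problems.append("calls .show(); the script should only save")
    if "savefig" not in called:
        problems.append("never calls savefig()")
    return problems
```

Any reported problem can be fed back into the regeneration step alongside the runtime error, if there was one.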

Quick Start Examples

Diagram (automated): Run scripts/orchestrate.py with your methodology text file and caption.

Diagram (via agent): "Generate a methodology diagram for my transformer architecture. Here is the methodology section: [paste text]. Caption: Overview of our proposed multi-head attention framework."

Plot: "Create a bar chart comparing model performance. Data: {BERT: 92.3, GPT-4: 88.1, Claude: 95.7, Gemini: 91.2}. Intent: F1 score comparison across language models."

Improve: "Improve the aesthetics of this diagram: [paste existing description or attach current figure]"

File Reference

| File | Purpose | When to Read |
|------|---------|-------------|
| scripts/orchestrate.py | End-to-end pipeline runner | Diagram mode primary entry point |
| scripts/retriever.py | VLM-based reference selection | Phase 1 (diagram mode) |
| scripts/planner.py | Multimodal description generation | Phase 2 (diagram mode) |
| scripts/stylist.py | VLM-based style application | Phase 3 (diagram mode) |
| scripts/generate_image.py | Gemini Image API call | Phase 4 (diagram mode) |
| scripts/critic.py | VLM-based image evaluation | Phase 5 (diagram mode) |
| scripts/plot_generator.py | Template-based matplotlib generator | Phase 4 (plot mode) |
| scripts/validate_output.py | Output validation and dependency check | Post-generation validation |
| references/DIAGRAM-PROMPTS.md | Actual Gemini prompt templates for diagrams | All diagram phases |
| references/PLOT-PROMPTS.md | Agent prompts for plots | All plot phases |
| references/DIAGRAM-STYLE-GUIDE.md | NeurIPS 2025 diagram aesthetics | Phase 3 (Style) |
| references/PLOT-STYLE-GUIDE.md | NeurIPS 2025 plot aesthetics | Phase 3 (Style) |
| references/EVALUATION-RUBRIC.md | Critic scoring criteria (4 dimensions) | Phase 5 (Critique) |
| references/DIAGRAM-CATEGORIES.md | 4 diagram categories with keywords | Phase 1 (Categorize) |
| assets/references/index.json | 13 curated reference diagram metadata | Phase 1 (Retriever) |
| assets/references/*.jpg | 13 curated reference diagram images | Phase 2 (Planner multimodal input) |
| assets/palettes/*.json | Color palette definitions | Phase 3 (Style) |
| assets/matplotlib_styles/*.mplstyle | Matplotlib style sheets | Phase 4 (plot mode) |

Environment Setup

# Required for all Gemini API calls (VLM reasoning + image generation)
export GOOGLE_API_KEY="your-api-key-here"

# Install dependencies
pip install google-genai matplotlib seaborn numpy pillow

Verify setup: python scripts/validate_output.py --check-deps
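
For a quick manual check that does not depend on the repository scripts, the setup can also be probed from Python. This is an independent sketch, not how `validate_output.py --check-deps` works internally; `check_environment` and the package-to-module mapping are written here for illustration.

```python
import importlib.util
import os

# pip package name -> module name used at import time
REQUIRED = {
    "google-genai": "google.genai",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "pillow": "PIL",
}


def is_importable(module_name: str) -> bool:
    """True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example "google") is missing.
        return False


def check_environment() -> dict:
    """Report which dependencies are present and whether the API key is set."""
    report = {pkg: is_importable(mod) for pkg, mod in REQUIRED.items()}
    report["GOOGLE_API_KEY"] = bool(os.environ.get("GOOGLE_API_KEY"))
    return report
```

Printing the entries of `check_environment()` whose value is `False` gives the list of items still to install or export.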