Category: Content & Media

paper-banana

Author: jakexiaohubgithub

PaperBanana: Academic Illustration Pipeline

Automates publication-ready academic illustrations via 5 specialized agents: Retriever (categorize & select references) -> Planner (multimodal description) -> Stylist (polish) -> Visualizer (render) -> Critic (evaluate & refine). In diagram mode, each agent is implemented as a separate Gemini API call; in plot mode, the host agent performs the same five roles without calling Gemini.

Two output modes:

  • DIAGRAM MODE: Each agent is a Python script calling Gemini VLM/image APIs. Run scripts/orchestrate.py for end-to-end execution.
  • PLOT MODE: Statistical plots generated as executable Python matplotlib/seaborn code (code-based to eliminate data hallucination).

Requirements: GOOGLE_API_KEY env var (used for VLM calls in retriever/planner/stylist/critic AND image generation in visualizer), Python 3.10+ with google-genai, matplotlib, seaborn, numpy, pillow.

Paper: PaperBanana: Automating Academic Illustrations with Multi-Agent Systems (arXiv:2601.23265, Google/PKU)


Step 1: Determine Output Mode

Decide which track to follow:

| Signal | Mode |
|--------|------|
| User provides raw data, table, CSV + visual intent (bar chart, scatter, etc.) | PLOT MODE |
| User provides methodology text, description, or figure caption | DIAGRAM MODE |
| User provides existing figure to improve | Match original type |

Critical rule: PLOT MODE always generates Python code; never use image generation for data visualizations. Code plots the exact input values, which avoids the data hallucination errors that corrupt numerical accuracy in image-based approaches.
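The decision table above can be sketched as a small function. This is only an illustration: `Mode`, `choose_mode`, and its parameters are hypothetical names, and letting raw data take priority over methodology text when both are present is an assumption made here, not something the skill specifies.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    PLOT = "plot"
    DIAGRAM = "diagram"


def choose_mode(
    has_raw_data: bool,
    has_methodology_text: bool,
    existing_figure_mode: Optional[Mode] = None,
) -> Mode:
    """Pick an output mode from the signals in the decision table."""
    # An existing figure to improve keeps its original type.
    if existing_figure_mode is not None:
        return existing_figure_mode
    # Raw data, a table, or a CSV goes to code-based plotting
    # (assumed to win when methodology text is also present).
    if has_raw_data:
        return Mode.PLOT
    if has_methodology_text:
        return Mode.DIAGRAM
    raise ValueError("No usable signal: provide data, methodology text, or a figure")
```

For example, `choose_mode(has_raw_data=True, has_methodology_text=False)` returns `Mode.PLOT`.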


Step 2: Execute Pipeline

DIAGRAM MODE — Automated Pipeline

Primary entry point: Run the end-to-end orchestrator:

python scripts/orchestrate.py \
  --methodology-file methodology.txt \
  --caption "Figure 1: Overview of proposed framework" \
  --mode diagram \
  --output output/diagram.png

Or with inline text:

python scripts/orchestrate.py \
  --methodology "Our framework consists of three modules..." \
  --caption "Figure 1: System overview" \
  --mode diagram \
  --output output/diagram.png

The orchestrator chains all 5 agents automatically and handles the Critic's refinement loop (up to 3 iterations). Intermediate outputs are saved to output/work/ for inspection.
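If the orchestrator needs to be launched from another Python program, the two command lines above might be assembled as follows. The flags are the ones shown in the commands; the helper names are hypothetical, and requiring exactly one of the two methodology inputs is an assumption rather than documented behavior of scripts/orchestrate.py.

```python
import subprocess
import sys
from typing import List, Optional


def build_orchestrate_cmd(
    caption: str,
    output: str,
    methodology_file: Optional[str] = None,
    methodology: Optional[str] = None,
    mode: str = "diagram",
) -> List[str]:
    """Assemble the argv list for scripts/orchestrate.py."""
    # Assumption: the script expects a file or inline text, not both.
    if (methodology_file is None) == (methodology is None):
        raise ValueError("Pass exactly one of methodology_file or methodology")
    cmd = [sys.executable, "scripts/orchestrate.py"]
    if methodology_file is not None:
        cmd += ["--methodology-file", methodology_file]
    else:
        cmd += ["--methodology", methodology]
    cmd += ["--caption", caption, "--mode", mode, "--output", output]
    return cmd


def run_orchestrator(**kwargs) -> int:
    """Run the orchestrator and return its exit code."""
    return subprocess.run(build_orchestrate_cmd(**kwargs), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with captions that contain spaces or colons.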

Pipeline Details

Read references/DIAGRAM-PROMPTS.md for the actual Gemini prompt templates used by each agent.

Phase 1: RETRIEVER (scripts/retriever.py) — Gemini VLM call

  • Classifies methodology into 1 of 4 categories from references/DIAGRAM-CATEGORIES.md
  • Selects 2 most relevant reference diagrams from the 13 curated examples in assets/references/
  • Identifies visual intent: Framework Overview, Pipeline/Flow, Detailed Module, Architecture Diagram

Phase 2: PLANNER (scripts/planner.py) — Multimodal Gemini VLM call

  • Sends the 2 selected reference images + methodology text to Gemini as a multimodal prompt
  • The VLM "sees" what good methodology diagrams look like (in-context learning from images)
  • Generates an extremely detailed textual description of the target diagram
  • Critical: Describe all visual attributes in natural language only. NEVER use hex codes or pixel dimensions
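One way to enforce the natural-language rule is to scan the Planner's output before it reaches the Stylist. The sketch below is not part of scripts/planner.py; `find_forbidden_tokens` and both patterns are hypothetical, and the patterns only catch the most common spellings of hex colors and pixel sizes.

```python
import re
from typing import List

# Hex colors such as #fff or #1A2B3C.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")
# Pixel sizes such as "12px", "300 px", or "24 pixels".
PIXEL_DIM = re.compile(r"\b\d+\s*(?:px|pixels?)\b", re.IGNORECASE)


def find_forbidden_tokens(description: str) -> List[str]:
    """Return hex codes, then pixel sizes, found in a description."""
    return HEX_COLOR.findall(description) + PIXEL_DIM.findall(description)
```

An empty list means the description passed; a non-empty list could be fed back to the Planner as a request to rephrase those attributes in words.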

Phase 3: STYLIST (scripts/stylist.py) — Gemini VLM call

  • Takes the Planner's description + full NeurIPS 2025 style guide
  • Applies domain-specific styling based on the category from Phase 1
  • Follows 5 critical rules: preserve aesthetics, intervene minimally, respect domain, enrich details, preserve content
  • Outputs the polished description only

Phase 4: VISUALIZER (scripts/generate_image.py) — Gemini Image API call

  • Uses gemini-3-pro-image-preview to generate the diagram image from the styled description
  • Prepends quality prefix (high-res, legible text, clean background, no watermarks)
  • Aspect ratio selected based on visual intent (16:9 for pipelines, 3:2 for modules)
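The prompt assembly in this phase could look roughly like the sketch below. It builds the request data only and does not call the Gemini API. The prefix wording, the dictionary layout, and the fallback ratio for intents other than pipelines and modules are all assumptions; only the model name and the two listed ratios come from the description above.

```python
# Assumed wording; the real prefix lives in scripts/generate_image.py.
QUALITY_PREFIX = (
    "High-resolution academic diagram with legible text, "
    "a clean background, and no watermarks. "
)

# Ratios stated above for two of the four visual intents.
ASPECT_RATIOS = {
    "Pipeline/Flow": "16:9",
    "Detailed Module": "3:2",
}
# Assumption: fall back to 16:9 for intents with no stated ratio.
DEFAULT_ASPECT_RATIO = "16:9"


def build_visualizer_request(styled_description: str, visual_intent: str) -> dict:
    """Combine the quality prefix, styled description, and aspect ratio."""
    return {
        "model": "gemini-3-pro-image-preview",
        "prompt": QUALITY_PREFIX + styled_description.strip(),
        "aspect_ratio": ASPECT_RATIOS.get(visual_intent, DEFAULT_ASPECT_RATIO),
    }
```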

Phase 5: CRITIC (scripts/critic.py) — Multimodal Gemini VLM call

  • Sends the generated image + methodology text to Gemini for multimodal evaluation
  • Scores on 4 dimensions (faithfulness, readability, conciseness, aesthetics)
  • If faithfulness < 7 OR readability < 7: generates revised description → loops to Phase 4
  • Maximum 3 refinement iterations
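The refinement rule can be sketched as a loop with the render and critique steps injected as callables, so nothing here depends on the Gemini API. The function names are hypothetical, and reading "maximum 3 refinement iterations" as "at most 3 re-renders after the first image" is an assumption about how the orchestrator counts.

```python
from typing import Any, Callable, Dict, Tuple

THRESHOLD = 7
MAX_ITERATIONS = 3


def needs_revision(scores: Dict[str, float]) -> bool:
    """Revise when faithfulness OR readability falls below the threshold."""
    return scores["faithfulness"] < THRESHOLD or scores["readability"] < THRESHOLD


def refine(
    description: str,
    render: Callable[[str], Any],
    critique: Callable[[Any, str], Tuple[Dict[str, float], str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[Any, str]:
    """Render once, then re-render while the critic asks for revisions."""
    image = render(description)
    for _ in range(max_iterations):
        scores, revised_description = critique(image, description)
        if not needs_revision(scores):
            break
        # Loop back to Phase 4 with the critic's revised description.
        description = revised_description
        image = render(description)
    return image, description
```

Conciseness and aesthetics are still scored, but in this sketch, as in the rule above, only faithfulness and readability trigger another pass.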

DIAGRAM MODE — Manual Execution

You can also run each agent individually for more control:

# Phase 1: Retriever
python scripts/retriever.py --methodology-file text.txt --output work/retriever.json

# Phase 2: Planner
python scripts/planner.py --methodology-file text.txt --caption "Figure 1: ..." \
  --references work/retriever.json --output work/planner.json

# Phase 3: Stylist
python scripts/stylist.py --description work/planner.json --output work/stylist.json

# Phase 4: Visualizer (extract styled_description from JSON, pass to generate_image.py)
python scripts/generate_image.py --prompt-file work/styled_desc.txt --output output/diagram.png

# Phase 5: Critic
python scripts/critic.py --image output/diagram.png --methodology-file text.txt \
  --description work/stylist.json --output work/critic.json
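The Phase 4 comment mentions extracting `styled_description` from the Stylist's JSON, but no command for that step is shown. A minimal sketch follows, assuming `styled_description` is a top-level string key in work/stylist.json; the helper name is hypothetical and the real file layout may differ.

```python
import json
from pathlib import Path


def extract_styled_description(stylist_json: str, prompt_file: str) -> str:
    """Copy the styled description out of the Stylist JSON into a text file."""
    data = json.loads(Path(stylist_json).read_text(encoding="utf-8"))
    # Assumption: the description is stored under this top-level key.
    text = data["styled_description"]
    out = Path(prompt_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text, encoding="utf-8")
    return text
```

Calling `extract_styled_description("work/stylist.json", "work/styled_desc.txt")` between Phase 3 and Phase 4 would produce the prompt file that the Phase 4 command reads.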

PLOT MODE

Read references/PLOT-PROMPTS.md for detailed agent prompts. Read references/PLOT-STYLE-GUIDE.md for aesthetic rules.

Plot mode uses Claude (or the host agent) for reasoning and code generation — no Gemini API calls needed for plot generation itself.

Phase 1: CATEGORIZE (Retriever)

Match data characteristics and visual intent:

| Data Type | Plot Types |
|-----------|------------|
| Categorical comparison | Bar chart, grouped bar, stacked bar |
| Continuous trends | Line chart, area chart |
| Correlation/distribution | Scatter plot, histogram, box plot, violin |
| Matrix/similarity | Heatmap, confusion matrix |
| Multi-dimensional | Radar/spider chart |
| Proportional | Pie/donut chart, treemap |

Phase 2: PLAN (Planner)

Create a detailed specification that explicitly enumerates:

  • Every raw data point with exact coordinates/values
  • Axis ranges, labels, tick marks, scales (linear/log)
  • Color assignments for each series/category
  • Font sizes for title, axis labels, tick labels, legend
  • Line widths, marker sizes, marker shapes
  • Legend placement and formatting
  • Grid style (major/minor, dashed/solid)
  • Figure dimensions and DPI
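A specification covering the items above might be captured in a pair of dataclasses, which makes omissions obvious before any plotting code is written. All class names, field names, and default values below are hypothetical, and the sketch assumes numeric x values, so categorical plots such as bar charts would need a variant.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class SeriesSpec:
    label: str
    points: List[Point]  # every raw data point, exact values
    color: str
    line_width: float = 1.5
    marker: str = "o"
    marker_size: float = 5.0


@dataclass
class PlotSpec:
    title: str
    x_label: str
    y_label: str
    x_range: Tuple[float, float]
    y_range: Tuple[float, float]
    series: List[SeriesSpec]
    x_scale: str = "linear"  # or "log"
    y_scale: str = "linear"
    font_sizes: Dict[str, int] = field(
        default_factory=lambda: {"title": 12, "axis_label": 10, "tick_label": 8, "legend": 8}
    )
    legend_loc: str = "upper right"
    grid_style: str = "dashed"
    figsize: Tuple[float, float] = (6.0, 4.0)
    dpi: int = 300

    def out_of_range_points(self) -> List[Tuple[str, Point]]:
        """List (series label, point) pairs that fall outside the axis ranges."""
        bad = []
        for s in self.series:
            for x, y in s.points:
                in_x = self.x_range[0] <= x <= self.x_range[1]
                in_y = self.y_range[0] <= y <= self.y_range[1]
                if not (in_x and in_y):
                    bad.append((s.label, (x, y)))
        return bad
```

Checking `out_of_range_points()` before rendering catches axis ranges that would silently clip data.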

Phase 3: STYLE (Stylist)

Read references/PLOT-STYLE-GUIDE.md for NeurIPS 2025 plot aesthetics.

Key styling rules:

  • White backgrounds only
  • Colorblind-friendly palettes (see assets/palettes/colorblind_safe.json)
  • Sans-serif fonts (Helvetica, Arial, or DejaVu Sans)
  • Markers on line charts for print readability
  • Inward-facing tick marks
  • Subtle grid lines (light gray, dashed)
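The styling rules above map fairly directly onto matplotlib rcParams. The dictionary below is a sketch, not the contents of academic_default.mplstyle, and the Okabe-Ito colors are a well-known colorblind-safe palette used here as a stand-in for assets/palettes/colorblind_safe.json, whose contents are not shown in this document.

```python
# rcParams approximating the key styling rules.
STYLE_RC = {
    "figure.facecolor": "white",
    "axes.facecolor": "white",
    "savefig.facecolor": "white",
    "font.family": "sans-serif",
    "font.sans-serif": ["Helvetica", "Arial", "DejaVu Sans"],
    "lines.marker": "o",  # markers on line charts for print readability
    "xtick.direction": "in",
    "ytick.direction": "in",
    "axes.grid": True,
    "grid.color": "lightgray",
    "grid.linestyle": "--",
    "grid.linewidth": 0.5,
}

# Okabe-Ito palette (stand-in; the skill's own palette file may differ).
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def apply_style() -> None:
    """Apply STYLE_RC globally; matplotlib is imported only when called."""
    import matplotlib as mpl

    mpl.rcParams.update(STYLE_RC)
```

Call `apply_style()` once before creating figures, and pass colors from `OKABE_ITO` explicitly to each series.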

Phase 4: VISUALIZE (Visualizer — Code Generation)

Generate complete, self-contained Python matplotlib/seaborn code. Use scripts/plot_generator.py as a reference implementation or run it directly with a JSON config:

python scripts/plot_generator.py --config plot_config.json --output figure.pdf

Code requirements:

  • Self-contained: all data defined inline, no external file dependencies
  • Apply .mplstyle from assets/matplotlib_styles/academic_default.mplstyle
  • Set OUTPUT_PATH variable for output file location
  • 300 DPI, bbox_inches='tight'
  • No plt.show() — save only
  • Support both PDF and PNG output

After generating the code, execute it to produce the plot image.
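As an illustration of the code requirements, here is a minimal self-contained script for the bar chart prompt in the Quick Start section. It is a sketch rather than output of scripts/plot_generator.py: the figure size, bar color, and axis limits are assumptions, and the style sheet is applied only if the file is present so the script also runs outside the skill directory.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend: save only, never show
import matplotlib.pyplot as plt

OUTPUT_PATH = "figure.png"  # use a .pdf suffix for vector output
STYLE_PATH = Path("assets/matplotlib_styles/academic_default.mplstyle")

# All data defined inline (values from the Quick Start plot prompt).
models = ["BERT", "GPT-4", "Claude", "Gemini"]
f1_scores = [92.3, 88.1, 95.7, 91.2]

# Apply the academic style sheet when it is available.
if STYLE_PATH.exists():
    plt.style.use(str(STYLE_PATH))

fig, ax = plt.subplots(figsize=(5, 3.5))
bars = ax.bar(models, f1_scores, color="#0072B2")
ax.bar_label(bars, fmt="%.1f", padding=2)  # print exact values above bars
ax.set_ylabel("F1 score")
ax.set_ylim(0, 100)
ax.set_title("F1 score comparison across language models")

# 300 DPI with a tight bounding box; the format follows the file suffix.
fig.savefig(OUTPUT_PATH, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Because `savefig` infers the format from the suffix, changing `OUTPUT_PATH` to `figure.pdf` is enough to support both PNG and PDF output.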

Phase 5: CRITIQUE (Critic)

Same rubric as diagram mode, plus plot-specific checks:

  • Data fidelity: Every data point correctly plotted
  • Axis accuracy: Ranges, labels, scales match specification
  • Layout: No overlapping labels, legends, or data points
  • Code correctness: Syntax valid, imports available, output saved

If code execution fails, analyze the error, simplify the approach, and regenerate.
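To analyze a failure, the error output has to be captured first. The sketch below runs a generated plot script in a subprocess and returns its stderr; `run_plot_script` and the 120-second timeout are hypothetical choices, not part of the skill's scripts.

```python
import subprocess
import sys
from typing import Tuple


def run_plot_script(script_path: str, timeout_s: float = 120.0) -> Tuple[bool, str]:
    """Execute a generated plot script and return (succeeded, stderr text)."""
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout_s} seconds"
    return result.returncode == 0, result.stderr.strip()
```

On failure, the returned traceback can be passed back to the code-generation step together with the original specification.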


Quick Start Examples

Diagram (automated): Run scripts/orchestrate.py with your methodology text file and caption.

Diagram (via agent): "Generate a methodology diagram for my transformer architecture. Here is the methodology section: [paste text]. Caption: Overview of our proposed multi-head attention framework."

Plot: "Create a bar chart comparing model performance. Data: {BERT: 92.3, GPT-4: 88.1, Claude: 95.7, Gemini: 91.2}. Intent: F1 score comparison across language models."

Improve: "Improve the aesthetics of this diagram: [paste existing description or attach current figure]"


File Reference

| File | Purpose | When to Read |
|------|---------|-------------|
| scripts/orchestrate.py | End-to-end pipeline runner | Diagram mode primary entry point |
| scripts/retriever.py | VLM-based reference selection | Phase 1 (diagram mode) |
| scripts/planner.py | Multimodal description generation | Phase 2 (diagram mode) |
| scripts/stylist.py | VLM-based style application | Phase 3 (diagram mode) |
| scripts/generate_image.py | Gemini Image API call | Phase 4 (diagram mode) |
| scripts/critic.py | VLM-based image evaluation | Phase 5 (diagram mode) |
| scripts/plot_generator.py | Template-based matplotlib generator | Phase 4 (plot mode) |
| scripts/validate_output.py | Output validation and dependency check | Post-generation validation |
| references/DIAGRAM-PROMPTS.md | Actual Gemini prompt templates for diagrams | All diagram phases |
| references/PLOT-PROMPTS.md | Agent prompts for plots | All plot phases |
| references/DIAGRAM-STYLE-GUIDE.md | NeurIPS 2025 diagram aesthetics | Phase 3 (Style) |
| references/PLOT-STYLE-GUIDE.md | NeurIPS 2025 plot aesthetics | Phase 3 (Style) |
| references/EVALUATION-RUBRIC.md | Critic scoring criteria (4 dimensions) | Phase 5 (Critique) |
| references/DIAGRAM-CATEGORIES.md | 4 diagram categories with keywords | Phase 1 (Categorize) |
| assets/references/index.json | 13 curated reference diagram metadata | Phase 1 (Retriever) |
| assets/references/*.jpg | 13 curated reference diagram images | Phase 2 (Planner multimodal input) |
| assets/palettes/*.json | Color palette definitions | Phase 3 (Style) |
| assets/matplotlib_styles/*.mplstyle | Matplotlib style sheets | Phase 4 (plot mode) |

Environment Setup

# Required for all Gemini API calls (VLM reasoning + image generation)
export GOOGLE_API_KEY="your-api-key-here"

# Install dependencies
pip install google-genai matplotlib seaborn numpy pillow

Verify setup: python scripts/validate_output.py --check-deps
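For environments where the skill's scripts are not yet available, a comparable check can be sketched in a few lines. This is not the implementation of scripts/validate_output.py; the function names are hypothetical, and the import names `google.genai` and `PIL` are the modules that the google-genai and pillow packages are expected to install.

```python
import importlib.util
import os
import sys
from typing import List

# pip package name -> importable module name
REQUIRED_MODULES = {
    "google-genai": "google.genai",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "pillow": "PIL",
}


def module_available(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package such as "google" is missing.
        return False


def check_environment() -> List[str]:
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if not os.environ.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    for package, module in REQUIRED_MODULES.items():
        if not module_available(module):
            problems.append(f"Missing package: {package}")
    return problems
```

Printing the result of `check_environment()` before running the pipeline gives a quick list of what still needs to be installed or exported.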