Vision Skill

Overview

This skill provides visual recognition and image generation using Doubao AI models. It stores images in Tencent Cloud COS and runs tasks asynchronously.

Capabilities

1. Vision Recognition

Analyze images to describe content, extract text (OCR), or answer questions about the image.

Input: Local image path or URL, optional prompt.
Process: Uploads local images to COS, then calls the Doubao Vision API.
Output: Text description or answer.

2. Image Generation

Generate images from text prompts, optionally using reference images.

Text-to-Image: Generate images from a text description.
Image-to-Image: Generate images based on a reference image and text prompt.
Sequential Generation: Generate a series of consistent images (e.g., storyboards).

Usage

The skill is exposed through the CLI script scripts/vision_cli.py.

Prerequisites

Environment variables must be set in .env or the system environment:

COS_SECRET_ID, COS_SECRET_KEY, COS_REGION, COS_BUCKET_NAME
DOUBAO_API_KEY, DOUBAO_VISION_MODEL, DOUBAO_IMAGE_MODEL
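Before running any command, it can help to confirm that the variables listed above are actually present. The following is a minimal sketch, not part of the skill itself: the helper name missing_vars is hypothetical, and it inspects only the process environment, so values that live solely in a .env file will be reported as missing unless something has already loaded them.

```python
import os

# Variable names are copied from the prerequisites list above.
REQUIRED_VARS = [
    "COS_SECRET_ID", "COS_SECRET_KEY", "COS_REGION", "COS_BUCKET_NAME",
    "DOUBAO_API_KEY", "DOUBAO_VISION_MODEL", "DOUBAO_IMAGE_MODEL",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to the process environment and
    can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```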

Commands

Vision Recognition

# Basic Usage
python3 scripts/vision_cli.py recognize <image_path> --prompt "Describe this image"

# Using Presets (--format)
# Available formats: invoice, contract, form, slide, whiteboard, table, json, key_value, markdown_note, qa_pairs, code, ocr, analysis
python3 scripts/vision_cli.py recognize ./invoice.jpg --format json
python3 scripts/vision_cli.py recognize ./screenshot.png --format code

# Batch recognition
python3 scripts/vision_cli.py recognize ./a.jpg ./b.jpg ./c.jpg --format table --wait --output ./batch_result.json

# Quality mode and retry
python3 scripts/vision_cli.py recognize ./contract.png --format contract --quality high --retry 3 --wait

# Wait for result and save to file
python3 scripts/vision_cli.py recognize ./doc.jpg --format ocr --wait --output ./result.txt
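When the recognize command is driven from another Python program, building the argument list in one place makes it easy to reject a misspelled preset before a task is submitted. The sketch below is illustrative only: build_recognize_cmd is a hypothetical helper, it uses only the flags shown in the commands above, and it assumes the script is run from the directory that contains scripts/.

```python
# Preset names copied from the --format list above.
FORMATS = {
    "invoice", "contract", "form", "slide", "whiteboard", "table", "json",
    "key_value", "markdown_note", "qa_pairs", "code", "ocr", "analysis",
}


def build_recognize_cmd(images, fmt=None, wait=False, output=None):
    """Build an argument list for `vision_cli.py recognize`.

    Hypothetical helper; it only assembles arguments and does not run
    the CLI.
    """
    if not images:
        raise ValueError("at least one image path is required")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format preset: {fmt}")

    cmd = ["python3", "scripts/vision_cli.py", "recognize", *images]
    if fmt is not None:
        cmd += ["--format", fmt]
    if wait:
        cmd.append("--wait")
    if output is not None:
        cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    # Mirrors the batch recognition example above.
    print(build_recognize_cmd(
        ["./a.jpg", "./b.jpg", "./c.jpg"],
        fmt="table", wait=True, output="./batch_result.json",
    ))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.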

Image Generation

# Text to Image with Style Presets (--style)
# Available styles: ppt, business_flat, cartoon, tech_isometric, hand_drawn, icon, photo, anime, sketch
python3 scripts/vision_cli.py generate "A cyberpunk city" --style anime

# Image to Image
python3 scripts/vision_cli.py generate "Make it snowy" --ref <image_path>

# Sequential Generation
python3 scripts/vision_cli.py generate "A story about a cat" --seq 4 --style cartoon

# Wait for result and save image
python3 scripts/vision_cli.py generate "App icon for a camera" --style icon --wait --output ./icon.png

# Quality mode and retry
python3 scripts/vision_cli.py generate "A SaaS architecture illustration" --style tech_isometric --quality high --retry 3 --wait

Check Status

python3 scripts/vision_cli.py status <task_id>
# Or save the result if the task has completed
python3 scripts/vision_cli.py status <task_id> --output ./final_result.png

Task Management

All tasks are executed asynchronously by default.

Use the --wait flag to block until the task completes (useful for agent workflows).
Use the --output flag to save text results or download generated images automatically.
Task data is stored in the .tasks/ directory.
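For an agent that needs the result before continuing, the blocking pattern combines --wait and --output and then reads the saved file. The following is a rough sketch under stated assumptions: recognize_blocking is a hypothetical helper, the saved OCR result is assumed to be UTF-8 text, and the CLI is assumed to exit with a nonzero status on failure (the documentation above does not confirm either point).

```python
import subprocess
from pathlib import Path


def recognize_blocking(image, output_path, runner=subprocess.run):
    """Run an OCR recognition task synchronously and return the saved text.

    Hypothetical helper. `runner` defaults to subprocess.run and can be
    swapped for a stub in tests.
    """
    # Same flags as the "Wait for result and save to file" example.
    cmd = [
        "python3", "scripts/vision_cli.py", "recognize", str(image),
        "--format", "ocr", "--wait", "--output", str(output_path),
    ]
    # check=True raises CalledProcessError on a nonzero exit status.
    # Assumption: the CLI reports failure through its exit status.
    runner(cmd, check=True)
    # Assumption: the result file is UTF-8 encoded text.
    return Path(output_path).read_text(encoding="utf-8")
```

Without --wait, the same command returns as soon as the task is submitted, and the result has to be fetched later with the status command.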