Back to skills
extension
Category: Content & MediaNo API key required

comfyui-video-pipeline

Generate videos using ComfyUI with Wan 2.2, FramePack, or AnimateDiff. Handles image-to-video, text-to-video, talking heads, and motion-controlled animation. Use when creating any video content from character images or text descriptions.

personAuthor: jakexiaohubgithub

ComfyUI Video Pipeline

Orchestrates video generation across three engines, selecting the best one based on requirements and available resources.

Engine Selection

VIDEO REQUEST
    |
    |-- Need film-level quality?
    |   |-- Yes + 24GB+ VRAM → Wan 2.2 MoE 14B
    |   |-- Yes + 8GB VRAM → Wan 2.2 1.3B
    |
    |-- Need long video (>10 seconds)?
    |   |-- Yes → FramePack (60 seconds on 6GB)
    |
    |-- Need fast iteration?
    |   |-- Yes → AnimateDiff Lightning (4-8 steps)
    |
    |-- Need camera/motion control?
    |   |-- Yes → AnimateDiff V3 + Motion LoRAs
    |
    |-- Need first+last frame control?
    |   |-- Yes → Wan 2.2 MoE (exclusive feature)
    |
    |-- Default → Wan 2.2 (best general quality)

Pipeline 1: Wan 2.2 MoE (Highest Quality)

Image-to-Video

Prerequisites:

  • wan2.1_i2v_720p_14b_bf16.safetensors in models/diffusion_models/
  • umt5_xxl_fp8_e4m3fn_scaled.safetensors in models/clip/
  • open_clip_vit_h_14.safetensors in models/clip_vision/
  • wan_2.1_vae.safetensors in models/vae/

Settings: | Parameter | Value | Notes | |-----------|-------|-------| | Resolution | 1280x720 (landscape) or 720x1280 (portrait) | Native training resolution | | Frames | 81 (~5 seconds at 16fps) | Multiples of 4 + 1 | | Steps | 30-50 | Higher = better quality | | CFG | 5-7 | | | Sampler | uni_pc | Recommended for Wan | | Scheduler | normal | |

Frame count guide: | Duration | Frames (16fps) | |----------|----------------| | 1 second | 17 | | 3 seconds | 49 | | 5 seconds | 81 | | 10 seconds | 161 |

VRAM optimization:

  • FP8 quantization: halves VRAM with minimal quality loss
  • SageAttention: faster attention computation
  • Reduce frames if OOM

Text-to-Video

Same as I2V but uses wan2.1_t2v_14b_bf16.safetensors and EmptySD3LatentImage instead of image conditioning.

First+Last Frame Control (Wan 2.2 Exclusive)

Wan 2.2 MoE allows specifying both the first and last frame, enabling precise video planning:

  1. Generate two hero images with consistent character
  2. Use first as start frame, second as end frame
  3. Wan interpolates the motion between them

Pipeline 2: FramePack (Long Videos, Low VRAM)

Key Innovation

VRAM usage is invariant to video length - generates 60-second videos at 30fps on just 6GB VRAM.

How it works:

  • Dynamic context compression: 1536 markers for key frames, 192 for transitions
  • Bidirectional memory with reverse generation prevents drift
  • Frame-by-frame generation with context window

Settings

| Parameter | Value | Notes | |-----------|-------|-------| | Resolution | 640x384 to 1280x720 | Depends on VRAM | | Duration | Up to 60 seconds | VRAM-invariant | | Quality | High (comparable to Wan) | Uses same base models |

When to Use

  • Videos longer than 10 seconds
  • Limited VRAM systems (but RTX 5090 doesn't need this)
  • When VRAM is needed for parallel operations
  • Batch video generation

Pipeline 3: AnimateDiff V3 (Fast, Controllable)

Strengths

  • Motion LoRAs for camera control (pan, zoom, tilt, roll)
  • Effect LoRAs (shatter, smoke, explosion, liquid)
  • Sliding context window for infinite length
  • Very fast with Lightning model (4-8 steps)

Settings

| Parameter | Value (Standard) | Value (Lightning) | |-----------|-----------------|-------------------| | Motion Module | v3_sd15_mm.ckpt | animatediff_lightning_4step.safetensors | | Steps | 20-25 | 4-8 | | CFG | 7-8 | 1.5-2.0 | | Sampler | euler_ancestral | lcm | | Resolution | 512x512 | 512x512 | | Context Length | 16 | 16 | | Context Overlap | 4 | 4 |

Camera Motion LoRAs

| LoRA | Motion | |------|--------| | v2_lora_ZoomIn | Camera zooms in | | v2_lora_ZoomOut | Camera zooms out | | v2_lora_PanLeft | Camera pans left | | v2_lora_PanRight | Camera pans right | | v2_lora_TiltUp | Camera tilts up | | v2_lora_TiltDown | Camera tilts down | | v2_lora_RollingClockwise | Camera rolls clockwise |

Post-Processing Pipeline

After any video generation:

1. Frame Interpolation (RIFE)

Doubles or quadruples frame count for smoother motion:

Input (16fps) → RIFE 2x → Output (32fps)
Input (16fps) → RIFE 4x → Output (64fps)

Use rife47 or rife49 model.

2. Face Enhancement (if character video)

Apply FaceDetailer to each frame:

  • denoise: 0.3-0.4 (lower than image - preserves temporal consistency)
  • guide_size: 384 (speed optimization for video)
  • detection_model: face_yolov8m.pt

3. Deflicker (if needed)

Reduces temporal inconsistencies between frames.

4. Color Correction

Maintain consistent color grading across frames.

5. Video Combine

Final output via VHS Video Combine:

frame_rate: 16 (native) or 24/30 (after interpolation)
format: "video/h264-mp4"
crf: 19 (high quality) to 23 (smaller file)

Talking Head Pipeline

Complete pipeline for character dialogue:

1. Generate audio → comfyui-voice-pipeline
2. Generate base video → This skill (Wan I2V or AnimateDiff)
   - Prompt: "{character}, talking naturally, slight head movement"
   - Duration: match audio length
3. Apply lip-sync → Wav2Lip or LatentSync
4. Enhance faces → FaceDetailer + CodeFormer
5. Final output → video-assembly

Quality Checklist

Before marking video as complete:

  • [ ] Character identity consistent across frames
  • [ ] No flickering or temporal artifacts
  • [ ] Motion looks natural (not jerky or frozen)
  • [ ] Face enhancement applied if character video
  • [ ] Frame rate is smooth (24+ fps for delivery)
  • [ ] Audio synced (if talking head)
  • [ ] Resolution matches delivery target

Reference

  • references/workflows.md - Workflow templates for Wan and AnimateDiff
  • references/models.md - Video model download links
  • references/research-2025.md - Latest video generation advances
  • state/inventory.json - Available video models