Fireworks AI Skill

Fast, cost-effective access to 100+ open-source models with OpenAI-compatible APIs, LoRA fine-tuning, and serverless, on-demand, or reserved deployments.

When to Use This Skill

| Scenario | Example | Relevant Section |
|----------|---------|------------------|
| Query text models | "Chat completion with Llama" | Quick Reference → Chat Completion |
| Fine-tune a model | "Train model on my data" | Fine-Tuning Overview |
| Deploy custom model | "On-demand GPU deployment" | Deployments |
| Migrate from OpenAI | "Use OpenAI SDK with Fireworks" | OpenAI Compatibility |
| Batch processing | "Process 10K prompts offline" | Batch Inference |
| Image generation | "FLUX Kontext image editing" | Image Generation |
| Embeddings/RAG | "Generate embeddings for search" | Embeddings & Reranking |
| CLI operations | "firectl commands" | firectl Reference |

Quick Reference

Chat Completion (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<YOUR_FIREWORKS_API_KEY>",
)

chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(chat_completion.choices[0].message.content)

Chat Completion (curl)

curl --request POST \
     --url https://api.fireworks.ai/inference/v1/chat/completions \
     --header "accept: application/json" \
     --header "authorization: Bearer $FIREWORKS_API_KEY" \
     --header "content-type: application/json" \
     --data '{
       "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello!"}]
     }'

Supervised Fine-Tuning Job

firectl supervised-fine-tuning-job create \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset my-training-dataset \
  --output-model my-fine-tuned-model \
  --epochs 3 \
  --learning-rate 1e-4 \
  --lora-rank 8

Create Dataset for Fine-Tuning

from fireworks import Dataset

dataset = Dataset.from_file(
    "path/to/training_data.jsonl",
    name="my-training-dataset"
)
# Dataset is now available on Fireworks for fine-tuning

Monitor Training Progress

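import time

# `job` is the fine-tuning job object obtained when the job was created
while not job.is_completed:
    job.raise_if_bad_state()  # Raises if the job entered a bad state
    print(f"Training state: {job.state}")
    time.sleep(10)  # Poll every 10 seconds
    job = job.get()  # Refresh the job's status

print(f"Training completed! New model: {job.output_model}")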

Deploy Fine-Tuned Model (Multi-LoRA)

from fireworks import LLM

# Shared base deployment; enable_addons=True allows LoRA add-ons to be served on it
base_model = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    id="shared-base-deployment",
    enable_addons=True
)

Generate Embeddings

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<YOUR_FIREWORKS_API_KEY>",
)

response = client.embeddings.create(
    model="fireworks/qwen3-embedding-8b",
    input="Your text to embed"
)
embeddings = response.data[0].embedding

Export Billing Metrics

firectl billing export-metrics \
  --start-time "2025-01-01" \
  --end-time "2025-01-31" \
  --filename january_metrics.csv

Create Deployment

firectl deployment create accounts/fireworks/models/deepseek-v3 \
  --deployment-shape throughput

Key Concepts

Fine-Tuning Methods

| Method | Use Case | When to Use |
|--------|----------|-------------|
| SFT (Supervised) | Classification, extraction | Large labeled dataset (~1000+ examples) |
| RFT (Reinforcement) | Complex reasoning, agents | Small dataset, verifiable outputs, multi-step tasks |
| DPO (Preference) | Alignment, style | Pairwise preference comparisons |

Decision Tree:

Have 1000+ labeled examples? → SFT
Task is verifiable but lacks golden outputs? → RFT
Want to align with preferences? → DPO
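
The decision tree can be written as a small function, which makes the order of the checks explicit. This is only a sketch: `choose_fine_tuning_method` and its parameters are hypothetical names, and the 1000-example threshold is the rough guideline from the table, not a hard limit.

```python
def choose_fine_tuning_method(
    labeled_examples: int,
    verifiable_outputs: bool = False,
    has_preference_pairs: bool = False,
) -> str:
    """Return "SFT", "RFT", or "DPO" by checking the decision tree in order."""
    if labeled_examples >= 1000:
        return "SFT"
    if verifiable_outputs:
        return "RFT"
    if has_preference_pairs:
        return "DPO"
    # No branch matched; the decision tree gives no recommendation here.
    return "undetermined"


print(choose_fine_tuning_method(labeled_examples=5000))
print(choose_fine_tuning_method(labeled_examples=50, verifiable_outputs=True))
print(choose_fine_tuning_method(labeled_examples=50, has_preference_pairs=True))
```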

LoRA (Low-Rank Adaptation)

Fireworks uses LoRA for efficient fine-tuning:

Faster & cheaper - Train in hours, not days
Easy to deploy - Instant deployment on Fireworks
Flexible - Run multiple LoRAs on single base deployment
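
To see why LoRA training is cheap, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors instead, so the trainable count is rank * (d_in + d_out). The sketch below works through one matrix. The 4096 hidden size is an illustrative assumption, not a figure for any particular Fireworks model, and which matrices receive adapters depends on the training setup.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen full weight matrix (d_out x d_in)."""
    return d_in * d_out


hidden = 4096  # Illustrative hidden size (assumption)
lora = lora_trainable_params(hidden, hidden, rank=8)
full = full_matrix_params(hidden, hidden)
print(f"LoRA: {lora:,} trainable vs full: {full:,} ({lora / full:.2%})")
```

With rank 8, the adapter for this one matrix holds well under 1% of the parameters of the matrix it adapts, which is also why several adapters can share one base deployment.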

Deployment Types

| Type | Use Case | Scaling |
|------|----------|---------|
| Serverless | Variable traffic, cost optimization | Auto-scale to zero |
| On-Demand | Consistent performance, high throughput | Dedicated GPUs |
| Reserved | Predictable workloads, discounts | Pre-purchased capacity |

Agent Tracing (RFT)

For reinforcement fine-tuning with agents:

Use model_base_url from trainer (points to tracing.fireworks.ai)
Attach FireworksTracingHttpHandler for structured logging
Log Status.rollout_finished() or Status.rollout_error() on completion
Trainer joins traces + logs via rollout_id
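
The last step, joining traces and logs on rollout_id, can be pictured with plain dictionaries. This is a conceptual sketch only: the record shapes and the `join_by_rollout_id` helper are hypothetical and are not the trainer's actual implementation or the Fireworks tracing API.

```python
from collections import defaultdict


def join_by_rollout_id(traces, logs):
    """Group trace and log records that share a rollout_id."""
    joined = defaultdict(lambda: {"traces": [], "logs": []})
    for trace in traces:
        joined[trace["rollout_id"]]["traces"].append(trace)
    for log in logs:
        joined[log["rollout_id"]]["logs"].append(log)
    return dict(joined)


# Hypothetical records: two rollouts, one finished and one failed
traces = [
    {"rollout_id": "r1", "prompt_tokens": 42},
    {"rollout_id": "r2", "prompt_tokens": 17},
    {"rollout_id": "r1", "prompt_tokens": 58},
]
logs = [
    {"rollout_id": "r1", "status": "rollout_finished"},
    {"rollout_id": "r2", "status": "rollout_error"},
]

rollouts = join_by_rollout_id(traces, logs)
print(len(rollouts["r1"]["traces"]), rollouts["r2"]["logs"][0]["status"])
```

The point of the shared key is that model calls (traces) and the agent's own status messages (logs) are recorded separately, so every log record needs to carry the rollout_id for the join to work.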

API Compatibility

Fireworks is OpenAI-compatible. Key differences:

| Feature | OpenAI | Fireworks |
|---------|--------|-----------|
| max_tokens overflow | Error | Auto-truncate (configurable) |
| Streaming usage stats | Only with stream_options.include_usage | Returned in final chunk |
| Model names | gpt-4 | accounts/fireworks/models/llama-v3p1-8b-instruct |

Set context_length_exceeded_behavior: "error" for OpenAI-like behavior.
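
As a sketch of where that setting goes, the helper below builds a chat completion request body and adds the field only when strict behavior is requested. `build_chat_request` and its `error_on_overflow` flag are hypothetical names used for illustration; the field name and the "error" value are the ones given above.

```python
import json


def build_chat_request(model, messages, max_tokens, error_on_overflow=False):
    """Build a chat completion request body as a plain dict."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    if error_on_overflow:
        # Ask for an error instead of silent truncation, as OpenAI does
        payload["context_length_exceeded_behavior"] = "error"
    return payload


payload = build_chat_request(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    error_on_overflow=True,
)
print(json.dumps(payload, indent=2))
```

When calling through the OpenAI SDK, fields outside the OpenAI schema such as this one can be passed with the `extra_body` argument of `client.chat.completions.create()`.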

firectl CLI Quick Reference

# Authentication
firectl login

# Account operations
firectl account list

# Dataset operations
firectl dataset download <dataset-id>
firectl dataset list

# Fine-tuning jobs
firectl supervised-fine-tuning-job create --help
firectl supervised-fine-tuning-job list
firectl dpo-job resume <job-id>

# Deployments
firectl deployment create <model> --deployment-shape <shape>
firectl deployment scale <deployment-id> --replicas <n>

# Evaluators
firectl evaluator-revision get <evaluator-id>

# Billing
firectl billing export-metrics

Available Models (Highlights)

Text Models:

DeepSeek V3, DeepSeek R1
Llama 3.1/3.2/3.3 (8B, 70B, 405B)
Qwen 2.5 family
Kimi K2

Embedding Models:

fireworks/qwen3-embedding-8b (serverless)
fireworks/qwen3-embedding-4b
nomic-ai/nomic-embed-text-v1.5

Reranking Models:

fireworks/qwen3-reranker-8b (serverless)

Image Models:

FLUX Kontext Pro/Max
SDXL ControlNet

Browse all: https://fireworks.ai/models

Reference Files

| File | Content | Use For |
|------|---------|---------|
| references/llms-txt.md | Complete API reference (410 pages) | Detailed API docs, all CLI commands, parameters |

Navigation tips:

Search for specific CLI commands: firectl <command>
API endpoints follow pattern: /v1/accounts/{account_id}/<resource>
Fine-tuning docs under #fine-tuning-* sections
Deployment docs under #deployment-* sections
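
The endpoint pattern can be turned into a small URL builder. This is a sketch under assumptions: the `https://api.fireworks.ai` host, the `account_resource_url` helper, and the resource names in the usage lines are illustrative, so check the reference file for the exact paths.

```python
from urllib.parse import quote

API_HOST = "https://api.fireworks.ai"  # Assumed host for account-level endpoints


def account_resource_url(account_id, resource, resource_id=None):
    """Build a URL following /v1/accounts/{account_id}/<resource>[/<id>]."""
    url = f"{API_HOST}/v1/accounts/{quote(account_id, safe='')}/{resource.strip('/')}"
    if resource_id is not None:
        url += f"/{quote(resource_id, safe='')}"
    return url


# "datasets" and "deployments" are illustrative resource names
print(account_resource_url("my-account", "datasets"))
print(account_resource_url("my-account", "deployments", "my-deployment"))
```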

Working with This Skill

For Beginners

Start with the Chat Completion example above
Get an API key from https://app.fireworks.ai
Use OpenAI SDK (familiar interface)
Try serverless models first (no deployment needed)

For Fine-Tuning

Prepare JSONL dataset with messages format
Upload with Dataset.from_file() or firectl
Choose fine-tuning method (SFT/RFT/DPO)
Monitor with firectl supervised-fine-tuning-job list
Deploy LoRA or merge into base model
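
The first step, preparing a JSONL dataset in messages format, can be sketched as follows: one JSON object per line, each with a `messages` list of role/content pairs. `write_chat_jsonl`, the sample conversations, and the set of allowed roles are assumptions made for illustration; consult the fine-tuning docs for the full schema.

```python
import json
import tempfile
from pathlib import Path

# Assumption: roles accepted in training conversations
ALLOWED_ROLES = {"system", "user", "assistant"}


def write_chat_jsonl(conversations, path):
    """Write one {"messages": [...]} JSON object per line; return the line count."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for index, messages in enumerate(conversations):
            for message in messages:
                role_ok = message.get("role") in ALLOWED_ROLES
                content_ok = isinstance(message.get("content"), str)
                if not (role_ok and content_ok):
                    raise ValueError(f"conversation {index}: invalid message {message!r}")
            handle.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
            count += 1
    return count


conversations = [
    [
        {"role": "user", "content": "Classify: 'The battery died after a day.'"},
        {"role": "assistant", "content": "negative"},
    ],
    [
        {"role": "user", "content": "Classify: 'Setup took two minutes.'"},
        {"role": "assistant", "content": "positive"},
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "training_data.jsonl"
    written = write_chat_jsonl(conversations, out_path)
    lines = out_path.read_text(encoding="utf-8").splitlines()

print(written, "examples written")
```

Validating each message before writing catches malformed rows locally, before the file is uploaded.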

For Production

Consider on-demand deployments for consistent performance
Enable prompt caching for repeated prefixes
Use batch inference for offline processing
Monitor usage via billing export or dashboard
Set up service accounts for CI/CD

Common Patterns

Streaming with Usage Stats

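stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}], stream=True)
for chunk in stream:
    if chunk.usage:  # Populated only in the final chunk
        print(f"Tokens: {chunk.usage.total_tokens}")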
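
In practice the streamed text and the usage stats are usually collected together. The sketch below does that for any iterable of chunks shaped like the OpenAI SDK's streaming chunks (`choices[0].delta.content` and `usage`). `collect_stream` and `_fake_chunk` are hypothetical helpers, and the fake stream stands in for a real API response so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Return (full_text, usage) from an iterable of chat completion chunks."""
    parts = []
    usage = None
    for chunk in chunks:
        # Guard against chunks that carry no choices
        if chunk.choices:
            text = chunk.choices[0].delta.content
            if text:
                parts.append(text)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Build a stand-in chunk with the attributes collect_stream reads."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(", world"),
    _fake_chunk(usage=SimpleNamespace(total_tokens=12)),
]
text, usage = collect_stream(fake_stream)
print(text, usage.total_tokens)
```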

Variable-Length Embeddings

response = client.embeddings.create(
    model="fireworks/qwen3-embedding-8b",
    input="Your text",
    dimensions=128  # Reduce from default for faster similarity
)
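
Shorter vectors speed up similarity search because each comparison touches fewer numbers. A minimal cosine similarity in plain Python shows the per-comparison work; `cosine_similarity` is a helper written for this sketch, and the short vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [0.1, 0.3, 0.5]  # Stand-in for response.data[0].embedding
doc_a = [0.2, 0.6, 1.0]  # Same direction as the query
doc_b = [0.5, -0.1, -0.04]  # Orthogonal to the query
print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Embeddings being compared must come from the same model and use the same `dimensions` value, since vectors of different lengths cannot be compared.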

Reranking Documents

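import httpx  # Provides the raw response type for client.post()
# The OpenAI SDK has no rerank helper, so call the /rerank endpoint via client.post()
response = client.post(
    "/rerank",
    cast_to=httpx.Response,
    body={
        "model": "fireworks/qwen3-reranker-8b",
        "query": "search query",
        "documents": ["doc1", "doc2", "doc3"],
    },
)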

Resources

Model Library: https://fireworks.ai/models
Playground: https://app.fireworks.ai/playground
Usage Dashboard: https://app.fireworks.ai/account/usage
API Reference: https://docs.fireworks.ai/api-reference
firectl Docs: https://docs.fireworks.ai/tools-sdks/firectl

Notes

Generated from official Fireworks AI documentation (410 pages)
OpenAI SDK examples work directly with Fireworks
Model names use accounts/fireworks/models/<model-name> format
Fine-tuning uses LoRA by default (set --lora-rank 0 for full parameter)