Model Quantization

Overview

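Quantization stores model weights at lower numeric precision to save memory and, on hardware that supports it, speed up inference. A 7B-parameter model needs ~28 GB for its weights at FP32, but only ~3.5 GB at 4-bit (closer to 4 GB once quantization constants and other overhead are counted).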
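To make the idea concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. It is an illustration only, not the NF4 or LLM.int8() scheme that bitsandbytes applies; the function names and sample weights are made up for this example.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integer codes using one shared scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -1.27, 0.04]  # hypothetical weight block
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} max_error={max_err:.4f}")
```

Fewer bits mean fewer representable levels, so the round-trip error grows as precision drops; that is the memory/quality trade-off summarized in the table below.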

Quick Reference

| Precision | Bits | Memory (relative to INT8) | Quality | Speed |
|-----------|------|---------------------------|---------|-------|
| FP32 | 32 | 4x | Best | Slowest |
| FP16 | 16 | 2x | Excellent | Fast |
| BF16 | 16 | 2x | Excellent | Fast |
| INT8 | 8 | 1x | Good | Faster |
| INT4 | 4 | 0.5x | Acceptable | Fastest |

Memory Estimation

def estimate_memory(params_billions, precision_bits):
    """Estimate weight memory in GB (1 GB = 10**9 bytes)."""
    bytes_per_param = precision_bits / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param

# Example: 7B model
model_size = 7  # billion parameters

print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28.0 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14.0 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7.0 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB

Measure Model Size

def get_model_size(model):
    """Get in-memory model size in GB (1024**3 bytes), parameters plus buffers."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total = (param_size + buffer_size) / 1024**3
    return total

# `model` is any loaded torch.nn.Module, e.g. one of the models loaded below
print(f"Model size: {get_model_size(model):.2f} GB")

Load Model at Different Precisions

FP32 (Default)

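import torch
from transformers import AutoModelForCausalLM

model_32bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float32,  # explicit, since the default dtype can differ across transformers versions
    device_map="auto"
)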

print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")

FP16 / BF16

import torch

model_16bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto"
)

print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")

8-bit Quantization

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

print(f"8-bit size: {get_model_size(model_8bit):.2f} GB")

4-bit Quantization (Recommended)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

print(f"4-bit size: {get_model_size(model_4bit):.2f} GB")

BitsAndBytesConfig Options

4-bit Configuration

from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_4bit=True,

    # Quantization type
    bnb_4bit_quant_type="nf4",  # "nf4" or "fp4"

    # Compute dtype for dequantized weights
    bnb_4bit_compute_dtype=torch.bfloat16,

    # Double quantization (saves more memory)
    bnb_4bit_use_double_quant=True,
)

Options Explained

| Option | Values | Effect |
|--------|--------|--------|
| load_in_4bit | True/False | Enable 4-bit loading |
| bnb_4bit_quant_type | "nf4", "fp4" | Quantization data type; nf4 is better for LLMs |
| bnb_4bit_compute_dtype | float16, bfloat16 | Computation precision |
| bnb_4bit_use_double_quant | True/False | Also quantize the quantization constants |
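The memory effect of double quantization can be estimated with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights; with double quantization, 8-bit constants plus one 32-bit constant per 256 of them). These sizes are assumptions for illustration and may not match the defaults of your bitsandbytes version; the function name is hypothetical.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Bits of quantization-constant overhead per model parameter (assumed block sizes)."""
    if not double_quant:
        # One FP32 scale per block of weights
        return 32 / block_size
    # 8-bit scales, plus one FP32 scale per group of 8-bit scales
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(double_quant=True)

params = 7e9  # 7B-parameter model
saved_gb = params * (plain - nested) / 8 / 1e9

print(f"Overhead without double quantization: {plain:.3f} bits/param")
print(f"Overhead with double quantization:    {nested:.3f} bits/param")
print(f"Approximate saving for a 7B model:    {saved_gb:.2f} GB")
```

Under these assumptions the saving is a few hundred megabytes for a 7B model, which is small next to the weights themselves but can matter on a GPU that is nearly full.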

Compare Precision Performance

from transformers import AutoTokenizer, pipeline
import time

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # shared by all precisions

# Test message
messages = [{"role": "user", "content": "Explain quantum computing."}]

def benchmark(model, tokenizer, name):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    start = time.time()
    output = pipe(messages, max_new_tokens=100, return_full_text=False)
    elapsed = time.time() - start

    print(f"{name}:")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Size: {get_model_size(model):.2f} GB")
    print(f"  Output: {output[0]['generated_text'][:50]}...")
    print()

# Benchmark each precision
benchmark(model_32bit, tokenizer, "FP32")
benchmark(model_16bit, tokenizer, "FP16")
benchmark(model_8bit, tokenizer, "8-bit")
benchmark(model_4bit, tokenizer, "4-bit")
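A single timed call, as in the benchmark above, also measures one-time costs such as kernel initialization. One possible refinement is to run a warm-up call first and report the median of several runs. The helper below is a hypothetical sketch using only the standard library; the lambda is a stand-in workload, not a model call.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=3):
    """Return the median wall-clock time of fn() after warm-up runs."""
    for _ in range(warmup):
        fn()  # discard warm-up timings
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with a generation call when benchmarking a model
median_s = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Median time: {median_s:.4f}s")
```

To use it with the benchmark above, pass something like `lambda: pipe(messages, max_new_tokens=100, return_full_text=False)` as `fn`.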

Quantization for Training

QLoRA Setup

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit base model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
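For a rough sense of how few parameters this configuration trains: a LoRA adapter on a linear layer with shape (d_in, d_out) adds r * (d_in + d_out) weights. The sketch below applies that formula with layer shapes assumed for TinyLlama-1.1B (hidden size 2048, a 256-wide v_proj, 22 layers). The shapes are assumptions and the helper name is hypothetical; check the real figure with `model.print_trainable_parameters()`.

```python
def lora_param_count(shapes, r, num_layers):
    """Count LoRA weights: each adapted layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes for the targeted projections
tinyllama_shapes = {
    "q_proj": (2048, 2048),
    "v_proj": (2048, 256),
}

trainable = lora_param_count(tinyllama_shapes, r=8, num_layers=22)
print(f"Estimated LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the adapters come to roughly 1.1 million weights, about 0.1% of the 1.1B base model, which is why the frozen base can stay in 4-bit while only the adapters are trained in higher precision.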

Precision Comparison

| Precision | Memory (relative to INT8) | Quality | Training | Best For |
|-----------|---------------------------|---------|----------|----------|
| FP32 | 4x | Best (reference) | Yes | Research, baselines |
| FP16 | 2x | Excellent | Yes | Standard training |
| BF16 | 2x | Excellent | Yes | Large models |
| INT8 | 1x | Good | Limited | Inference |
| INT4 | 0.5x | Acceptable | QLoRA | Memory-constrained |

FP16 vs BF16

| Aspect | FP16 | BF16 |
|--------|------|------|
| Range | Smaller (5 exponent bits, max 65,504) | Larger (8 exponent bits, like FP32) |
| Precision | Higher (10 mantissa bits) | Lower (7 mantissa bits) |
| Overflow risk | Higher | Lower |
| Hardware | Widely supported | NVIDIA Ampere or newer |
| Best for | Inference | Training |
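The range and precision differences can be seen without a GPU. The sketch below uses Python's `struct` module: the `e` format is IEEE half precision (FP16), and BF16 is simulated by keeping only the top 16 bits of the FP32 encoding. That simulation truncates rather than rounds, which is an assumption made for simplicity; real hardware typically rounds, and saturates to infinity on overflow instead of raising an error.

```python
import struct


def to_fp16(x):
    """Round-trip x through FP16; raises OverflowError if x is out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Simulate BF16 by truncating the FP32 encoding to its top 16 bits."""
    high_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", high_bytes + b"\x00\x00")[0]


# Precision: FP16 keeps more mantissa bits than BF16
print("1.001 as FP16:", to_fp16(1.001))
print("1.001 as BF16:", to_bf16(1.001))

# Range: 70000 exceeds the FP16 maximum but fits in BF16
try:
    to_fp16(70000.0)
except OverflowError:
    print("70000.0 overflows FP16")
print("70000.0 as BF16:", to_bf16(70000.0))
```

This is the trade-off behind the "Best for" row: BF16 tolerates the large activations and gradients seen in training, while FP16 resolves small differences more finely.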

4-bit NF4 vs BF16 Comparison (Tested)

Based on experiments with Qwen3-4B-Thinking models:

Comparison Results

| Method | Peak Memory | Final Loss | Quality |
|--------|-------------|------------|---------|
| 4-bit NF4 | ~5.7GB | 3.0742 | Excellent |
| BF16 | ~6.5GB | 3.0742 | Reference |

Key Finding: In these runs, 4-bit NF4 reached the same final loss as BF16 while using 11-15% less peak memory (for the figures in the table above, (6.5 - 5.7) / 6.5 is about 12%).

Pre-Quantized Models (Recommended)

Use pre-quantized models for faster loading:

from unsloth import FastLanguageModel

# Pre-quantized (fast loading)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",  # -bnb-4bit suffix
    max_seq_length=1024,
    load_in_4bit=True,
)

# vs. On-demand quantization (slower)
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",  # Full precision
    max_seq_length=1024,
    load_in_4bit=True,  # Quantize during load
)

GPU Memory Recommendations

| GPU VRAM | Recommended | Notes |
|----------|-------------|-------|
| <12GB | 4-bit NF4 | Required for training |
| 12-16GB | 4-bit NF4 | Allows larger batches |
| >16GB | BF16 or 4-bit | Choose based on batch needs |
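If you want to apply this table in a script, it can be expressed as a small lookup. The function below is a hypothetical helper that only restates the recommendations above; the thresholds come from the table, and treating exactly 16 GB as part of the middle tier is an assumption.

```python
def recommend_precision(vram_gb):
    """Return (recommendation, note) for a GPU with vram_gb of memory."""
    if vram_gb < 12:
        return "4-bit NF4", "Required for training"
    if vram_gb <= 16:
        return "4-bit NF4", "Allows larger batches"
    return "BF16 or 4-bit", "Choose based on batch needs"


for vram in (8, 16, 24):
    precision, note = recommend_precision(vram)
    print(f"{vram} GB VRAM: {precision} ({note})")
```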

Quality Preservation

4-bit NF4 preserves:

Training convergence (identical final loss)
Thinking tag structure (<think>...</think>)
Response quality and coherence
Model reasoning capabilities

Troubleshooting

Out of Memory

Symptom: CUDA OOM error

Fix:

# Use 4-bit quantization
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # nf4 is the recommended type for LLMs
    bnb_4bit_use_double_quant=True
)

Quality Degradation

Symptom: Poor model outputs after quantization

Fix:

Use nf4 instead of fp4
Try 8-bit instead of 4-bit
Increase LoRA rank if fine-tuning

Slow Loading

Symptom: Model takes long to load

Fix:

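Quantization happens at load time, so loading a full-precision checkpoint with load_in_4bit=True is slower; use a pre-quantized checkpoint (for example one with the -bnb-4bit suffix) to skip that step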
Use device_map="auto" for multi-GPU

When to Use This Skill

Use when:

Model doesn't fit in GPU memory
Need faster inference
Training with limited resources (QLoRA)
Deploying to edge devices

Cross-References

bazzite-ai-jupyter:qlora - Advanced QLoRA experiments
bazzite-ai-jupyter:peft - LoRA with quantization (QLoRA)
bazzite-ai-jupyter:finetuning - Full fine-tuning
bazzite-ai-jupyter:sft - SFT training with quantization
bazzite-ai-jupyter:inference - Fast inference patterns
bazzite-ai-jupyter:transformers - Model architecture