返回 Skill 列表
extension
分类: 数据与分析无需 API Key

Self-Hosting Mastery

完整的自托管与家庭实验室操作系统。可在生产级可靠性下部署、保护、监控和维护自托管服务。适用于...

person作者: 1kalinhubclawhub

Self-Hosting Mastery

Complete system for building and operating reliable self-hosted infrastructure — from first server to multi-node homelab.

Phase 1: Infrastructure Assessment

Server Profile YAML

server_profile:
  name: ""
  hardware:
    cpu: ""              # e.g., "Intel i5-12400" or "Raspberry Pi 5"
    ram_gb: 0
    storage:
      - device: ""       # e.g., "/dev/sda"
        type: ""         # ssd | hdd | nvme
        size_gb: 0
        role: ""         # boot | data | backup
    network: ""          # 1gbe | 2.5gbe | 10gbe
  os: ""                 # debian | ubuntu | proxmox | unraid | truenas
  location: ""           # home | closet | rack | colo | vps
  power:
    ups: false
    wattage_idle: 0
    wattage_load: 0
    monthly_cost_estimate: ""  # electricity
  network:
    public_ip: ""        # static | dynamic | cgnat
    domain: ""
    dns_provider: ""     # cloudflare | duckdns | custom
    isp_ports_open: true # some ISPs block 80/443
  goals:
    - ""                 # media server, smart home, dev environment, etc.
  budget_monthly: ""     # electricity + domain + any VPS

Hardware Decision Matrix

| Budget | RAM | Storage | Good For | Example Hardware | |--------|-----|---------|----------|-----------------| | $0 | 4-8GB | 64GB+ | Pi-hole, AdGuard, small tools | Raspberry Pi 4/5 | | $50-150 | 8-16GB | 256GB+ | Docker host, 5-10 services | Used SFF PC (Dell Optiplex, Lenovo Tiny) | | $150-400 | 16-32GB | 1TB+ | NAS + services, media server | Mini PC (Intel NUC, Beelink) | | $400-800 | 32-64GB | 4TB+ | Full homelab, VMs + containers | Used enterprise (Dell R720, HP DL380) | | $800+ | 64GB+ | 10TB+ | Multi-node, Proxmox cluster | Multiple nodes, dedicated NAS |

Self-Host vs SaaS Decision

Ask before self-hosting anything:

  1. Data sensitivity — Does keeping data local matter? (passwords, health, finance = yes)
  2. Reliability need — Can you tolerate occasional downtime? (email = risky, media = fine)
  3. Maintenance budget — Do you have 2-4 hours/month for updates?
  4. Skill level — Can you debug Docker/networking issues?
  5. Cost comparison — Is the SaaS < $10/mo? Often not worth self-hosting for trivial savings.

Always self-host: Password manager, DNS/ad-blocking, VPN, bookmarks, notes Usually self-host: Media server, file sync, photo backup, monitoring, git Think twice: Email (deliverability hell), calendar (sync complexity), chat (uptime expectations) Rarely worth it: Search engine (resource hungry), social media (no network effect)


Phase 2: OS & Virtualization

OS Selection Guide

| OS | Best For | Learning Curve | Notes | |----|----------|---------------|-------| | Debian 12 | Docker-only host | Low | Stable, minimal, just works | | Ubuntu Server 24.04 | Beginners, wide docs | Low | More packages, snap controversy | | Proxmox VE | VMs + containers | Medium | Free, enterprise features, ZFS | | Unraid | NAS + Docker + VMs | Medium | $59-129, great UI, parity array | | TrueNAS Scale | ZFS NAS + Docker | Medium | Free, ZFS-first, apps improving | | NixOS | Reproducible configs | High | Declarative, steep learning curve |

Proxmox Quick Setup

# Post-install essentials
# 1. Remove enterprise repo (if no subscription)
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt upgrade -y

# 2. Create a Docker LXC (lightweight container)
# Download template: Datacenter → Storage → CT Templates → Download → debian-12
# Create CT: 2 cores, 2GB RAM, 32GB disk, bridge vmbr0
# Inside CT: install Docker
apt install -y curl
curl -fsSL https://get.docker.com | sh

# 3. Enable IOMMU for GPU passthrough (if needed)
# Edit /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# update-grub && reboot

VM vs LXC vs Docker Decision

| Factor | VM | LXC | Docker | |--------|----|-----|--------| | Isolation | Full (own kernel) | Partial (shared kernel) | Process-level | | Overhead | High (1-2GB base) | Low (50-200MB) | Minimal | | Use when | Different OS, GPU passthrough, untrusted workloads | Dedicated service host, ZFS datasets | Most services | | Avoid when | RAM-constrained | Need Windows, custom kernel | Stateful databases (use LXC/VM) |

Rule: Docker for 90% of services. LXC for Docker hosts or isolated environments. VM for Windows, different kernel needs, or GPU passthrough.


Phase 3: Docker Infrastructure

Docker Compose Project Structure

/opt/stacks/           # or ~/docker/
├── traefik/
│   ├── docker-compose.yml
│   ├── .env
│   ├── config/
│   │   └── traefik.yml
│   └── data/
│       ├── acme.json          # chmod 600
│       └── dynamic/
├── monitoring/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── media/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
├── productivity/
│   ├── docker-compose.yml
│   ├── .env
│   └── config/
└── scripts/
    ├── backup.sh
    ├── update-all.sh
    └── health-check.sh

Docker Compose Best Practices

# Template: production-grade service
services:
  app:
    image: vendor/app:1.2.3           # ALWAYS pin version
    container_name: app               # Explicit name
    restart: unless-stopped           # Auto-restart
    networks:
      - proxy                         # Traefik network
      - internal                      # Backend network
    volumes:
      - ./config:/config              # Bind mount for config
      - app-data:/data                # Named volume for data
    environment:
      - TZ=Europe/London              # Always set timezone
      - PUID=1000                     # Match host user
      - PGID=1000
    env_file:
      - .env                          # Secrets in .env (gitignored)
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.rule=Host(`app.example.com`)"
      - "traefik.http.routers.app.tls.certresolver=letsencrypt"
      - "traefik.http.services.app.loadbalancer.server.port=8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          memory: 512M               # Prevent OOM cascades
    security_opt:
      - no-new-privileges:true        # Security hardening
    read_only: true                   # Where possible
    tmpfs:
      - /tmp

volumes:
  app-data:

networks:
  proxy:
    external: true
  internal:

Docker Security Checklist

  • [ ] Pin all image versions (never :latest in production)
  • [ ] Set restart: unless-stopped on all services
  • [ ] Use .env files for secrets (never hardcode in compose)
  • [ ] Set memory limits on all containers
  • [ ] Use security_opt: no-new-privileges:true
  • [ ] Use read_only: true where possible + tmpfs for /tmp
  • [ ] Create separate Docker networks per stack
  • [ ] Never expose database ports to 0.0.0.0
  • [ ] Run containers as non-root (PUID/PGID or user:)
  • [ ] Enable Docker content trust: export DOCKER_CONTENT_TRUST=1
  • [ ] Prune unused images/volumes monthly: docker system prune -af
  • [ ] Use named volumes (not anonymous) for all persistent data
  • [ ] Set TZ environment variable on every container

Phase 4: Reverse Proxy & SSL

Reverse Proxy Selection

| Proxy | Best For | SSL | Config Style | Learning Curve | |-------|----------|-----|-------------|---------------| | Traefik | Docker-native, auto-discovery | Auto (ACME) | Labels + YAML | Medium | | Caddy | Simplicity, auto-SSL | Auto (built-in) | Caddyfile | Low | | Nginx Proxy Manager | GUI preference | Auto (UI) | Web UI | Very Low | | Nginx (manual) | Maximum control | Manual/certbot | Config files | High |

Recommendation: Traefik for Docker power users. Caddy for simplicity. NPM for beginners.

Traefik Production Config

# traefik/config/traefik.yml
api:
  dashboard: true
  insecure: false

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

certificatesResolvers:
  letsencrypt:
    acme:
      email: you@example.com
      storage: /data/acme.json
      # Use DNS challenge if ISP blocks port 80
      # dnsChallenge:
      #   provider: cloudflare
      httpChallenge:
        entryPoint: web

providers:
  docker:
    exposedByDefault: false    # Explicit opt-in per service
    network: proxy
  file:
    directory: /data/dynamic
    watch: true

log:
  level: WARN

accessLog:
  filePath: /data/access.log
  bufferingSize: 100

Cloudflare Tunnel (Zero Port Forwarding)

For CGNAT or ISPs blocking ports — expose services without opening firewall:

# cloudflared/docker-compose.yml
services:
  cloudflared:
    image: cloudflare/cloudflared:2024.1.0
    container_name: cloudflared
    restart: unless-stopped
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}
    networks:
      - proxy

When to use Cloudflare Tunnel vs port forwarding:

  • CGNAT (no public IP) → Tunnel (only option)
  • ISP blocks 80/443 → Tunnel or DNS challenge + non-standard ports
  • Security-first → Tunnel (no open ports)
  • Performance-first → Direct (lower latency)
  • LAN-only access → Neither (use Tailscale/WireGuard)

Phase 5: Essential Services Stack

Tier 1 — Deploy First (Foundation)

| Service | Purpose | Image | RAM | Notes | |---------|---------|-------|-----|-------| | Traefik/Caddy | Reverse proxy + SSL | traefik:v3.0 | 64MB | Gateway to everything | | Pi-hole/AdGuard | DNS + ad blocking | pihole/pihole | 128MB | Network-wide ad blocking | | Authelia/Authentik | SSO + 2FA | authelia/authelia | 128MB | Protect services without built-in auth | | Uptime Kuma | Monitoring | louislam/uptime-kuma | 128MB | Know when things break | | Watchtower | Auto-updates | containrrr/watchtower | 32MB | Optional — some prefer manual |

Tier 2 — Core Services

| Service | Purpose | Alt | RAM | |---------|---------|-----|-----| | Vaultwarden | Password manager | Bitwarden | 64MB | | Nextcloud | File sync + office | Seafile (lighter) | 512MB | | Immich | Photo backup | PhotoPrism | 1-4GB | | Jellyfin | Media server | Plex (less free) | 512MB-2GB | | Paperless-ngx | Document management | - | 256MB | | Home Assistant | Smart home | - | 512MB |

Tier 3 — Power User

| Service | Purpose | RAM | |---------|---------|-----| | Gitea/Forgejo | Git hosting | 256MB | | n8n | Workflow automation | 256MB | | Grafana + Prometheus | Metrics & dashboards | 512MB | | Tandoor | Recipe management | 256MB | | Mealie | Meal planning | 128MB | | Linkwarden/Hoarder | Bookmark manager | 256MB | | Stirling PDF | PDF tools | 512MB | | IT-Tools | Developer utilities | 64MB |

RAM Planning

Total RAM needed ≈ OS base (1-2GB) + sum of service RAM + 20% headroom
Example 16GB server:
  OS + Docker:     2 GB
  Traefik:         0.1 GB
  Pi-hole:         0.1 GB
  Authelia:        0.1 GB
  Uptime Kuma:     0.1 GB
  Vaultwarden:     0.1 GB
  Nextcloud:       0.5 GB
  Immich:          2.0 GB
  Jellyfin:        1.0 GB
  Paperless:       0.3 GB
  Home Assistant:  0.5 GB
  ──────────────────────
  Total:           6.8 GB → 8.2 GB with headroom
  Available:       ~7.8 GB free for more services

Phase 6: Networking & DNS

DNS Architecture

Internet → Cloudflare DNS → Your Public IP → Router → Server
                                                        ↓
                                             Reverse Proxy (Traefik)
                                                        ↓
                                     ┌──────────────────┼──────────────────┐
                                     ↓                  ↓                  ↓
                                app.domain.com   files.domain.com   media.domain.com

Split DNS (Access Services Locally Without Hairpin NAT)

# Pi-hole/AdGuard: Local DNS rewrites
# Point *.home.example.com → 192.168.1.100 (server LAN IP)
# External: Cloudflare points to public IP
# Result: LAN traffic stays local, external goes through internet

VPN for Remote Access

| Solution | Type | Best For | Complexity | |----------|------|----------|-----------| | Tailscale | Mesh VPN | Easiest setup, multi-device | Very Low | | WireGuard | Point-to-point | Performance, full control | Medium | | Headscale | Self-hosted Tailscale | Privacy, no vendor lock | Medium-High |

Recommendation: Start with Tailscale (free for 3 users). Move to Headscale when you want full control.

Firewall Rules (UFW)

# Default deny incoming
ufw default deny incoming
ufw default allow outgoing

# Allow SSH (change port from 22!)
ufw allow 2222/tcp comment 'SSH'

# Allow HTTP/HTTPS for reverse proxy
ufw allow 80/tcp comment 'HTTP redirect'
ufw allow 443/tcp comment 'HTTPS'

# Allow local network for discovery
ufw allow from 192.168.1.0/24 comment 'LAN'

# Enable
ufw enable

Phase 7: Backup Strategy

3-2-1 Rule Implementation

3 copies:  Live data + Local backup + Remote backup
2 media:   SSD/HDD (server) + External drive or NAS
1 offsite: Cloud (Backblaze B2, Wasabi) or second location

Backup Script Template

#!/bin/bash
# /opt/stacks/scripts/backup.sh
set -euo pipefail

BACKUP_DIR="/mnt/backup/docker"
STACKS_DIR="/opt/stacks"
DATE=$(date +%Y-%m-%d_%H%M)
RETENTION_DAYS=30

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

# 1. Stop services that need consistent backups
log "Stopping database services..."
cd "$STACKS_DIR/productivity" && docker compose stop db

# 2. Backup Docker volumes
log "Backing up volumes..."
for vol in $(docker volume ls -q); do
    docker run --rm \
        -v "$vol":/source:ro \
        -v "$BACKUP_DIR/volumes":/backup \
        alpine tar czf "/backup/${vol}_${DATE}.tar.gz" -C /source .
done

# 3. Backup compose files and configs
log "Backing up configs..."
tar czf "$BACKUP_DIR/configs/stacks_${DATE}.tar.gz" \
    --exclude='*.log' \
    --exclude='node_modules' \
    "$STACKS_DIR"

# 4. Restart services
log "Restarting services..."
cd "$STACKS_DIR/productivity" && docker compose start db

# 5. Cleanup old backups
log "Cleaning up backups older than ${RETENTION_DAYS} days..."
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete

# 6. Sync to remote (Backblaze B2 example)
# rclone sync "$BACKUP_DIR" b2:my-backups/docker/ --transfers 4

# 7. Verify
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
log "Backup complete. Total size: $BACKUP_SIZE"

# 8. Send notification (optional)
# curl -s "https://ntfy.sh/my-backups" -d "Backup complete: $BACKUP_SIZE"

Backup Schedule

| What | Frequency | Retention | Method | |------|-----------|-----------|--------| | Docker volumes | Daily 3 AM | 30 days | Script + cron | | Compose files + configs | Daily 3 AM | 90 days | Script + cron | | Database dumps | Every 6 hours | 7 days | pg_dump/mysqldump | | Full disk image | Monthly | 3 months | Clonezilla/dd | | Offsite sync | Daily 5 AM | 60 days | rclone to B2/Wasabi |

Backup Verification (Monthly)

  • [ ] Pick a random backup from last week
  • [ ] Restore to a test VM/container
  • [ ] Verify data integrity (check file counts, DB row counts)
  • [ ] Time the restore process (document RTO)
  • [ ] Log results in backup-verification.md

Phase 8: Monitoring & Alerting

Monitoring Stack (Docker Compose)

# monitoring/docker-compose.yml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: unless-stopped
    volumes:
      - uptime-data:/app/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.uptime.rule=Host(`status.example.com`)"

  prometheus:
    image: prom/prometheus:v2.49.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.0
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

volumes:
  uptime-data:
  prometheus-data:
  grafana-data:

Alert Rules

| Metric | Warning | Critical | Action | |--------|---------|----------|--------| | Disk usage | >80% | >90% | Cleanup or expand | | RAM usage | >85% | >95% | Identify memory leak, add RAM | | CPU sustained | >80% 5min | >95% 5min | Check runaway process | | Container restart | >2/hour | >5/hour | Check logs, fix root cause | | SSL cert expiry | <14 days | <3 days | Renew cert | | Backup age | >26 hours | >48 hours | Check backup script/cron | | Service down | >2 min | >10 min | Investigate, restart |

Notification Channels

| Channel | Service | Best For | |---------|---------|----------| | Push notification | ntfy.sh (self-hosted) | Mobile alerts | | Chat | Discord/Slack webhook | Team alerts | | Email | Uptime Kuma built-in | Formal notifications | | Dashboard | Grafana + Uptime Kuma | Visual monitoring |


Phase 9: Security Hardening

Server Hardening Checklist

# 1. SSH hardening
# /etc/ssh/sshd_config
Port 2222                          # Change default port
PermitRootLogin no                 # No root SSH
PasswordAuthentication no          # Key-only
MaxAuthTries 3
AllowUsers yourusername

# 2. Install fail2ban
apt install fail2ban -y
systemctl enable fail2ban

# 3. Automatic security updates
apt install unattended-upgrades -y
dpkg-reconfigure -plow unattended-upgrades

# 4. Disable unused services
systemctl list-unit-files --state=enabled
# Disable anything you don't need

Authentication Architecture

Internet → Traefik → Authelia/Authentik → Service
                         ↓
                    Check: authenticated?
                    Yes → Forward to service
                    No → Redirect to login page + 2FA

Authelia (lightweight, YAML config) — good for smaller setups Authentik (full IdP, web UI) — good for many users/services, SAML/OIDC

Security Scoring (0-100)

| Dimension | Weight | Score Guide | |-----------|--------|-------------| | SSH hardened (keys, non-root, non-22) | 15 | 0=default, 15=fully hardened | | Firewall active (deny-by-default) | 15 | 0=none, 15=UFW/iptables configured | | Reverse proxy (no direct port exposure) | 15 | 0=ports exposed, 15=all behind proxy | | SSL/TLS on all services | 10 | 0=HTTP, 10=HTTPS everywhere | | Auth on all public services | 15 | 0=open, 15=SSO/2FA on everything | | Container security (non-root, limits) | 10 | 0=default, 10=hardened | | Auto-updates enabled | 10 | 0=manual, 10=automated | | Secrets management (.env, not hardcoded) | 10 | 0=in compose, 10=.env + restricted perms |

Score: 0-40 = Vulnerable, 41-70 = Acceptable, 71-90 = Good, 91-100 = Hardened


Phase 10: Maintenance & Updates

Update Strategy

Option A: Manual (Recommended for critical services)

# Update script: /opt/stacks/scripts/update-all.sh
#!/bin/bash
set -euo pipefail

STACKS_DIR="/opt/stacks"
LOG="/var/log/docker-updates.log"

for stack in "$STACKS_DIR"/*/; do
    if [ -f "$stack/docker-compose.yml" ]; then
        echo "[$(date)] Updating $(basename $stack)..." | tee -a "$LOG"
        cd "$stack"
        docker compose pull 2>&1 | tee -a "$LOG"
        docker compose up -d 2>&1 | tee -a "$LOG"
    fi
done

docker image prune -f | tee -a "$LOG"
echo "[$(date)] Update complete" | tee -a "$LOG"

Option B: Watchtower (Automated — use with caution)

services:
  watchtower:
    image: containrrr/watchtower:1.7.1
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WATCHTOWER_SCHEDULE=0 0 4 * * MON  # Monday 4 AM
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_NOTIFICATIONS=shoutrrr
      - WATCHTOWER_NOTIFICATION_URL=discord://webhook
      - WATCHTOWER_LABEL_ENABLE=true    # Only update labeled containers
    # Add label to containers: com.centurylinklabs.watchtower.enable=true

Weekly Maintenance Checklist

  • [ ] Check Uptime Kuma for any downtime events
  • [ ] Review disk usage (df -h)
  • [ ] Check container health (docker ps --filter health=unhealthy)
  • [ ] Review fail2ban bans (fail2ban-client status)
  • [ ] Check backup logs (last successful backup)
  • [ ] Review Docker logs for errors (docker logs --since 7d <container>)
  • [ ] Prune unused resources (docker system prune -f)

Monthly Maintenance

  • [ ] Update all container images (read changelogs first!)
  • [ ] Update host OS (apt update && apt upgrade)
  • [ ] Test a backup restore
  • [ ] Review and rotate secrets/passwords
  • [ ] Check SSL certificate expiry dates
  • [ ] Review Grafana dashboards for trends
  • [ ] Clean up unused Docker networks/volumes

Phase 11: Advanced Patterns

Multi-Node Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Node 1    │     │   Node 2    │     │   Node 3    │
│ (Proxy/DNS) │────│ (Services)  │────│   (NAS)     │
│ Traefik     │     │ Apps        │     │ TrueNAS     │
│ Pi-hole     │     │ Databases   │     │ NFS/SMB     │
│ Authelia    │     │ Media       │     │ Backup      │
└─────────────┘     └─────────────┘     └─────────────┘
       ↑                   ↑                   ↑
       └───────── Tailscale Mesh ──────────────┘

Docker Compose Includes (Compose v2.20+)

# Shared fragments
include:
  - path: ../common/traefik-labels.yml
  - path: ../common/logging.yml

services:
  app:
    # inherits common configs

GitOps for Homelab

homelab-configs/           # Git repo
├── .github/
│   └── workflows/
│       └── deploy.yml     # CI: lint + push to server
├── stacks/
│   ├── traefik/
│   ├── monitoring/
│   └── media/
├── scripts/
└── README.md

Workflow: Edit compose locally → commit → push → CI deploys to server Tools: Flux/ArgoCD (overkill), or simple git pull && docker compose up -d via webhook

Hardware Redundancy

| Component | Solution | Cost | |-----------|----------|------| | Power | UPS (APC Back-UPS 600VA+) | $60-150 | | Storage | RAID1/ZFS mirror (not RAID0!) | 2x disk cost | | Network | Dual NIC, managed switch | $30-100 | | Server | Second node (cold spare or active) | $100-400 |

Rule: RAID is NOT backup. It protects against disk failure only, not ransomware/deletion/corruption.


Phase 12: Troubleshooting

Common Issues Decision Tree

Service not accessible?
├── Can you ping the server? → No → Network/firewall issue
├── Is the container running? (`docker ps`) → No → Check logs: `docker logs <name>`
├── Is the port exposed? (`docker port <name>`) → No → Check compose ports/networks
├── Is Traefik routing? (Check Traefik dashboard) → No → Check labels, network
├── Is DNS resolving? (`dig app.example.com`) → No → Check DNS provider
└── SSL error? → Check acme.json permissions (chmod 600), cert resolver logs

Docker Debug Commands

# Container not starting
docker logs <name> --tail 50
docker inspect <name> | jq '.[0].State'

# Network issues
docker network ls
docker network inspect <network>
docker exec <name> ping other-container

# Resource issues
docker stats                          # Live resource usage
docker system df                      # Disk usage
docker volume ls -f dangling=true     # Orphaned volumes

# Nuclear options (use carefully)
docker compose down && docker compose up -d    # Full restart
docker system prune -af --volumes              # Clean EVERYTHING

Performance Optimization

| Symptom | Likely Cause | Fix | |---------|-------------|-----| | Slow file access | HDD for database | Move DB to SSD | | High CPU idle | Monitoring too frequent | Increase scrape intervals | | OOM kills | No memory limits | Set deploy.resources.limits.memory | | Slow Nextcloud | Missing Redis cache | Add Redis container | | Jellyfin buffering | No hardware transcoding | Enable GPU passthrough | | Slow Docker builds | No layer caching | Use multi-stage + .dockerignore |


Service Configuration Quick Reference

Vaultwarden (Password Manager)

services:
  vaultwarden:
    image: vaultwarden/server:1.30.5
    container_name: vaultwarden
    restart: unless-stopped
    volumes:
      - vaultwarden-data:/data
    environment:
      - SIGNUPS_ALLOWED=false       # Disable after creating your account
      - WEBSOCKET_ENABLED=true
      - ADMIN_TOKEN=${ADMIN_TOKEN}  # Generate: openssl rand -base64 48
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.vault.rule=Host(`vault.example.com`)"

Immich (Photo Backup)

# Use their official docker-compose.yml from:
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
# Key settings:
# - Set UPLOAD_LOCATION to a large storage mount
# - Enable hardware transcoding if GPU available
# - Set IMMICH_MACHINE_LEARNING_URL for face detection

Paperless-ngx (Document Management)

services:
  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.4
    container_name: paperless
    restart: unless-stopped
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume  # Drop PDFs here
      - ./export:/usr/src/paperless/export
    environment:
      - PAPERLESS_OCR_LANGUAGE=eng
      - PAPERLESS_TIME_ZONE=Europe/London
      - PAPERLESS_ADMIN_USER=${ADMIN_USER}
      - PAPERLESS_ADMIN_PASSWORD=${ADMIN_PASS}

Homelab Quality Rubric (0-100)

| Dimension | Weight | 0 (Poor) | 50 (Decent) | 100 (Excellent) | |-----------|--------|----------|-------------|-----------------| | Security | 20% | Default passwords, open ports | Firewall + SSL | Hardened SSH, SSO/2FA, no-new-privileges | | Backups | 20% | None | Local only, untested | 3-2-1, automated, verified monthly | | Monitoring | 15% | None | Uptime Kuma only | Full stack: metrics + logs + alerts | | Documentation | 10% | Nothing written | README per stack | GitOps, full runbook, diagrams | | Updates | 10% | Never updated | Manual quarterly | Scheduled weekly, changelogs reviewed | | Reliability | 10% | Frequent crashes | Mostly stable | UPS, auto-restart, health checks | | Performance | 10% | Slow, OOM kills | Adequate | Resource limits, SSD, HW transcoding | | Scalability | 5% | Single machine, no plan | Compose organized | Multi-node ready, IaC |


10 Self-Hosting Mistakes

| # | Mistake | Fix | |---|---------|-----| | 1 | Using :latest tag | Pin versions: image:1.2.3 | | 2 | No backups | 3-2-1 backup rule, test restores | | 3 | Exposing ports directly | Everything behind reverse proxy | | 4 | Default passwords | Change immediately, use password manager | | 5 | No monitoring | Uptime Kuma minimum, Grafana for depth | | 6 | RAID = backup mentality | RAID protects disks, not data | | 7 | Over-engineering day 1 | Start small, add complexity as needed | | 8 | No documentation | Document every service, every port, every cron | | 9 | Ignoring updates | Security patches matter, schedule updates | | 10 | Running as root | Non-root containers, restricted SSH |


Natural Language Commands

| Say | Agent Does | |-----|-----------| | "Set up a new service" | Guide through compose file creation with security best practices | | "Audit my homelab security" | Run through security scoring checklist | | "Plan my backup strategy" | Design 3-2-1 backup plan for your setup | | "What should I self-host?" | Assess needs and recommend services by tier | | "My container keeps crashing" | Walk through troubleshooting decision tree | | "Help me set up Traefik" | Generate production Traefik config with SSL | | "Compare NAS options" | Compare TrueNAS vs Unraid vs DIY for your needs | | "Optimize my Docker setup" | Review compose files for security and performance | | "Set up monitoring" | Deploy Uptime Kuma + Prometheus + Grafana stack | | "Plan a hardware upgrade" | Assess current usage, recommend hardware by budget | | "Migrate from cloud to self-hosted" | Plan migration with data export and service mapping | | "Set up remote access" | Compare and deploy VPN/Tailscale for secure remote access |