File Summarization

Apply this methodology when summarizing files of any type. This skill provides the routing logic and type-specific strategies for faithful file summarization.

Pre-Summarization Assessment

Before summarizing any file, the model MUST:

Read the file - Use the Read tool to access the actual content. Never guess from the filename.
Assess size - Run $CLAUDE_PLUGIN_ROOT/scripts/file_metrics.py to determine word count and file type. If the script is unavailable, use the Read tool and manually estimate word count from line count.
Select strategy - Choose a strategy using the size thresholds in the table below.
Verify file type - Use file extension and content inspection to determine which type-specific strategy to apply.

Size-Based Strategy Selection

| File Size | Strategy | Approach |
|-----------|----------|----------|
| Small (< 2,000 words) | Full read with extractive summarization | Read entire file, extract key passages, summarize from extracts |
| Medium (2,000-10,000 words) | Section-based extraction | Read full file, identify sections/modules, extract from each section, synthesize |
| Large (> 10,000 words) | Chunk and map-reduce | Split into chunks, summarize each chunk, synthesize chunk summaries |

SOURCE: Size thresholds adapted from Anthropic knowledge-synthesis skill (knowledge-work-plugins repository, accessed 2026-02-06). Strategy patterns informed by Map-Reduce Summarization methodology.
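
The routing in the table can be sketched as a small function. This is a minimal illustration, not the contents of file_metrics.py: the function names and the strategy labels are hypothetical, and the word count is a plain whitespace split.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited words (a rough stand-in for a metrics script)."""
    return len(text.split())


def select_strategy(word_count: int) -> str:
    """Map a word count onto the three size bands from the table.

    The returned labels are hypothetical identifiers chosen for this sketch.
    """
    if word_count < 2_000:
        return "full-read-extractive"
    if word_count <= 10_000:
        return "section-based-extraction"
    return "chunk-and-map-reduce"
```

Treating 2,000 and 10,000 words as part of the medium band follows the table's "2,000-10,000" row.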

File Type Strategies

Code Files

File extensions: .py, .js, .ts, .jsx, .tsx, .rs, .go, .java, .c, .cpp, .h, .rb, .php, .swift, .kt, .scala, .sh, .bash, .zsh

The model MUST extract:

Imports/dependencies - List external modules and standard library imports
Structure - Classes, functions, methods with signatures
Purpose - Inferred from docstrings, comments, function names
Key logic - Core algorithms, state machines, data transformations
Entry points - main(), CLI argument parsing, exported functions
Configuration - Environment variables, config file references

Extraction method: Read sequentially. Capture top-level definitions with their line numbers. Extract docstrings verbatim. Quote complex logic rather than paraphrasing.
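
For Python sources specifically, the capture of top-level definitions with line numbers could look like the sketch below. It assumes the standard library `ast` module and Python 3.8+ (for `end_lineno`); the function name `outline_python_source` and the shape of the returned dictionary are hypothetical. Other languages need their own parser or a manual read.

```python
import ast


def outline_python_source(source: str) -> dict:
    """Collect top-level imports and definitions with their line numbers."""
    tree = ast.parse(source)
    imports, definitions = [], []
    for node in tree.body:  # top-level statements only, nested items are skipped
        if isinstance(node, ast.Import):
            imports.extend((alias.name, node.lineno) for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports carry a level (number of leading dots).
            module = "." * node.level + (node.module or "")
            imports.append((module, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "lines": (node.lineno, node.end_lineno),
                # clean=False keeps the docstring text as written in the file.
                "docstring": ast.get_docstring(node, clean=False),
            })
    return {"imports": imports, "definitions": definitions}
```

The `lines` tuples map directly onto references such as "(lines 15-87)" in the summary.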

Example summary structure:

## Summary

Python module for HTTP client authentication. Implements JWT token refresh flow with retry logic. Exports `AuthClient` class and `refresh_token()` function.

## What Was Found

- Class `AuthClient` (lines 15-87): JWT-based HTTP client with automatic token refresh
- Function `refresh_token()` (lines 92-105): Retries up to 3 times on 401 errors
- Dependencies: `httpx`, `jwt`, `tenacity` (lines 1-3)
- Environment variables: `AUTH_BASE_URL`, `AUTH_CLIENT_ID` (lines 10-11)

## What Was NOT Found

- No test coverage information in this file
- No error handling for network failures
- Configuration schema not documented

Configuration Files

File extensions: .json, .yaml, .yml, .toml, .ini, .env, .conf, .cfg, .properties

The model MUST extract:

Top-level keys - All root keys with their value types
Nested structure - Hierarchy depth and organization
Settings categories - Group keys by purpose if clear
Notable values - Endpoints, file paths, feature flags, credentials (note presence, do not expose values)
Validation constraints - Type requirements, enums, ranges if documented

Extraction method: Parse structure. For small files, include all keys. For large files, sample representative sections and note structure patterns.
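
As an illustration for JSON configuration, the sketch below lists every key path with its value type and withholds values whose key name suggests a credential. The hint list `SENSITIVE_HINTS` and both function names are assumptions made for this example, not a complete secret detector.

```python
import json

# Hypothetical substrings that mark a key as credential-like.
SENSITIVE_HINTS = ("password", "secret", "token", "api_key", "credential")


def describe_config(data: dict, prefix: str = "") -> list[str]:
    """List dotted key paths with value types, never echoing credential values."""
    lines = []
    for key, value in data.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            lines.extend(describe_config(value, prefix=f"{path}."))
        elif any(hint in key.lower() for hint in SENSITIVE_HINTS):
            lines.append(f"{path}: <{type(value).__name__}, value withheld>")
        else:
            lines.append(f"{path}: {type(value).__name__}")
    return lines


def describe_json_config(text: str) -> list[str]:
    """Parse JSON text and describe its keys; assumes a top-level object."""
    return describe_config(json.loads(text))
```

Reporting the type and presence of a credential key, but not its value, matches the "note presence, do not expose values" requirement.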

Example summary structure:

## Summary

Application configuration in YAML format. Defines database connection, API endpoints, feature flags, and logging settings. 47 configuration keys across 5 top-level sections.

## What Was Found

- `database.host`, `database.port`, `database.name` (lines 2-4): PostgreSQL connection settings
- `api.base_url`, `api.timeout` (lines 7-8): External API configuration
- `features.experimental_mode: false` (line 12): Feature flag for beta features
- `logging.level: INFO`, `logging.format` (lines 15-16): Logging configuration

## What Was NOT Found

- No schema validation rules present
- No environment-specific overrides documented
- API authentication credentials not in this file

Data Files

File extensions: .csv, .tsv, .parquet, .json (when data-structured), .jsonl, .ndjson

The model MUST extract:

Row count - Exact number of records
Column names - All column headers
Data types - Inferred from the sampled rows (the first 10 data rows for CSV/TSV)
Sample values - Representative examples from each column
Missing data - Columns with null/empty values
Unique identifiers - Primary key columns if evident

Extraction method: For CSV/TSV, read the header row and the first 10 data rows. For Parquet, state that the file is binary and that text-based inspection is limited. For JSON, inspect the structure of the top-level array.
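
A minimal sketch of the CSV case, using only the standard library `csv` module. The names `profile_csv` and `infer_type` are hypothetical, and type inference is deliberately crude: it only distinguishes integer, float, and string from the sampled rows.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Guess a column type from sampled string values."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for value in non_empty:
                cast(value)
            return label
        except ValueError:
            continue
    return "string"


def profile_csv(text: str, sample_size: int = 10) -> dict:
    """Report row count, column names, inferred types, and missing values."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return {"row_count": 0, "sampled_rows": 0, "columns": {}}
    rows = [row for row in reader if row]  # skip completely blank lines
    sample = rows[:sample_size]
    columns = {}
    for index, name in enumerate(header):
        values = [row[index] for row in sample if index < len(row)]
        columns[name] = {
            "type": infer_type(values),
            "has_missing": any(value == "" for value in values),
        }
    return {"row_count": len(rows), "sampled_rows": len(sample), "columns": columns}
```

The row count covers the whole input, while types and missing values describe only the sample, so a summary built from this output should say how many rows were sampled.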

Example summary structure:

## Summary

CSV file containing user activity logs. 1,247 rows with 8 columns. Timestamps range from 2025-01-01 to 2026-02-06. No missing values detected.

## What Was Found

- Column `user_id` (integer): User identifiers, range 1001-5432
- Column `timestamp` (ISO 8601): Activity timestamps
- Column `action` (string): Values include "login", "logout", "view_page", "click_button"
- Column `duration_ms` (integer): Range 0-45000
- 1,247 total records (line count: 1,248 including header)

## What Was NOT Found

- No schema documentation in file
- Column `referrer` is present but not documented
- No indication of data collection methodology

Documentation Files

File extensions: .md, .rst, .txt, .adoc, .org

The model MUST extract:

Topic hierarchy - Top-level headings and structure
Key sections - Main topics covered
Commands/examples - Code blocks, shell commands, API calls
Links - External references and internal cross-references
Definitions - Technical terms defined in the text

Extraction method: Read sequentially. Extract headings to build table of contents. Quote key passages that define core concepts. Note code examples.
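
For Markdown files, the heading pass might be sketched as follows. It recognizes only `#`-style headings and skips lines inside fenced code blocks; reStructuredText, AsciiDoc, and Org files use different heading syntax and are not handled. The function name `extract_headings` is hypothetical.

```python
import re

# One to six "#" characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)")


def extract_headings(text: str) -> list[tuple[int, int, str]]:
    """Return (line number, heading level, title) for each Markdown heading."""
    headings = []
    in_fence = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence  # entering or leaving a fenced code block
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((number, len(match.group(1)), match.group(2)))
    return headings
```

Skipping fenced blocks matters because shell comments and embedded example documents often start with `#`.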

Example summary structure:

## Summary

User guide for deploying containerized applications. Covers Docker setup, image building, registry configuration, and troubleshooting. 5 main sections with 23 subsections. Includes 12 shell command examples.

## What Was Found

- Section "Getting Started" (lines 10-45): Docker installation on Linux and macOS
- Section "Building Images" (lines 47-89): Dockerfile syntax and multi-stage builds
- Section "Troubleshooting" (lines 200-245): Common errors with solutions
- 12 shell command examples throughout document

## What Was NOT Found

- No Windows deployment instructions
- Security best practices not covered
- Performance tuning section mentioned but not written (line 15: "TODO")

Binary and Unknown Files

File extensions: .pdf, .zip, .tar, .gz, .bin, .exe, .so, .dylib, .dll, or unrecognized extensions

The model MUST:

Attempt to read - Use the Read tool. If the tool returns binary content or an error, note this.
State limitation - Do NOT guess contents. State: "Binary file, cannot extract text content."
Provide file metadata - File size, extension, location.
For PDFs - Use the Read tool with the pages parameter to extract text from specific page ranges. If extraction succeeds, summarize the text content.
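
Outside the Read tool, the metadata step could be approximated as in the sketch below. The NUL-byte check is a common heuristic rather than a definitive test (UTF-16 text would be flagged as binary), and the function name `describe_file` and the `probe_bytes` default are assumptions.

```python
from pathlib import Path


def describe_file(path: Path, probe_bytes: int = 8192) -> dict:
    """Gather path, extension, and size, plus a rough binary-content check."""
    with path.open("rb") as handle:
        head = handle.read(probe_bytes)  # inspect only the start of the file
    return {
        "path": str(path),
        "extension": path.suffix,
        "size_bytes": path.stat().st_size,
        # Heuristic: a NUL byte near the start usually means non-text content.
        "looks_binary": b"\x00" in head,
    }
```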

Example for unreadable binary:

## Summary

Binary file, cannot extract text content.

## What Was Found

- File path: ./build/output.bin
- File size: 2.3 MB
- Extension: .bin

## What Was NOT Found

Unable to determine contents without binary inspection tools.

## Uncertain

File may be compiled binary, compressed archive, or proprietary format.

Quote-Grounding Technique

For all text-based files, the model MUST apply the quote-grounding technique:

First pass - Read file, identify key passages
Extract - Copy exact quotes with line numbers
Organize extracts - Group by theme or importance
Summarize from extracts - Write summary grounded in the extracted quotes
Verify - Ensure every claim in summary traces to an extract

SOURCE: Technique adapted from Fidelity Rules Rule 2 (lines 27-41).
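
The verification step can be partly mechanized. The sketch below checks that each extracted quote really appears on the line it cites. It assumes single-line quotes stored as (line number, quote) pairs; the function name `verify_quotes` is hypothetical. Tracing summary claims back to extracts still requires judgment.

```python
def verify_quotes(file_text: str, extracts: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Return the extracts whose quote is not found on the cited line."""
    lines = file_text.splitlines()
    failures = []
    for line_number, quote in extracts:
        in_range = 1 <= line_number <= len(lines)
        # Line numbers are 1-indexed, matching the references in summaries.
        if not in_range or quote not in lines[line_number - 1]:
            failures.append((line_number, quote))
    return failures
```

An empty result means every extract is grounded at its cited line; any returned pair should be corrected or dropped before summarizing.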

Output Format

All file summaries MUST use the structured output format defined in Structured Summary.

Required sections:

YAML frontmatter - Include source_type: file, source_path, method, confidence, word counts
Summary - Condensed content (BLUF style)
What Was Found - Items discovered with line number references
What Was NOT Found - Expected items that were absent
Uncertain - Ambiguous items requiring interpretation
Sources - Full file path, access date

Fidelity Rules

The model MUST follow all fidelity rules defined in Fidelity Rules.

Critical rules for file summarization:

Rule 1: Read the file before summarizing. Never guess from filename.
Rule 2: Extract before abstracting. Identify key passages first.
Rule 3: Preserve counts and specifics. "7 functions" not "several functions."
Rule 4: Distinguish absence from nonexistence. "Not in file" not "doesn't exist."
Rule 6: State confidence explicitly. Full read of small file = high confidence. Truncated large file = medium/low confidence.

Multi-File Summarization

When the user requests summarization of multiple files:

Summarize each file individually using this methodology
Write each summary to a separate output file or section
Do NOT merge file summaries into a single combined summary without explicit user request
If synthesis across files is requested, load the multi-source-synthesis skill after completing individual summaries

SOURCE: Multi-source synthesis approach from Summarizer lines 33-37.

Error Handling

If a file cannot be read:

Attempt to read with the Read tool
If read fails, report the error: "Unable to read [file path]: [error message]"
Do NOT speculate about file contents
Do NOT proceed with summarization
Ask user if they want to try alternative access methods

Output Rendering

Read template - Load the template file at ../summarizer/templates/{format_id}.md (default: structured). The template defines the schema, required sections, and fidelity constraints for the selected format.
Render - Produce output following the template's Schema section. Use the template's Example as a reference for structure and style.
Verify fidelity - Confirm the output satisfies the template's Fidelity Constraints and all applicable Fidelity Rules.

Anti-Patterns

The model MUST NOT:

Summarize a file based on its name without reading it
Guess file contents from directory structure or naming conventions
Assume file type from extension without verifying contents
Summarize from partial reads (head/tail/grep) without disclosing the limitation
Upgrade "not found in file" to "doesn't exist", which claims certainty beyond what the file itself shows
Present interpretation as observation
Skip the "What Was NOT Found" section
Omit line number references for key findings
