File Summarization
Apply this methodology when summarizing files of any type. This skill provides the routing logic and type-specific strategies for faithful file summarization.
Pre-Summarization Assessment
Before summarizing any file, the model MUST:
- Read the file - Use the Read tool to access the actual content. Never guess from the filename.
- Assess size - Run $CLAUDE_PLUGIN_ROOT/scripts/file_metrics.py to determine word count and file type. If the script is unavailable, use the Read tool and manually estimate word count from line count.
- Select strategy - Based on size thresholds from the table below.
- Verify file type - Use file extension and content inspection to determine which type-specific strategy to apply.
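The type check in the last step could be sketched as follows. This is a minimal Python illustration, not the plugin's actual routing code: the function name classify, the NUL-byte heuristic, and the rule for telling data-style JSON from config-style JSON are all assumptions. Only the extension lists are taken from the type sections of this document.

```python
from pathlib import Path

# Extension lists copied from the type sections of this document.
CODE = {".py", ".js", ".ts", ".jsx", ".tsx", ".rs", ".go", ".java", ".c", ".cpp",
        ".h", ".rb", ".php", ".swift", ".kt", ".scala", ".sh", ".bash", ".zsh"}
CONFIG = {".json", ".yaml", ".yml", ".toml", ".ini", ".env", ".conf", ".cfg",
          ".properties"}
DATA = {".csv", ".tsv", ".parquet", ".jsonl", ".ndjson"}
DOCS = {".md", ".rst", ".txt", ".adoc", ".org"}


def classify(path, head):
    """Pick a strategy name from the file name and its first bytes."""
    p = Path(path)
    # Dotfiles such as ".env" have no suffix, so fall back to the full name.
    ext = p.suffix.lower() or p.name.lower()
    if ext == ".parquet":
        return "data"  # binary by design, but still a data file
    # Assumed heuristic: a NUL byte means the content is not plain text.
    if b"\x00" in head:
        return "binary"
    if ext == ".json":
        # Assumption: a top-level array suggests records, an object suggests config.
        return "data" if head.lstrip().startswith(b"[") else "config"
    for name, exts in (("code", CODE), ("config", CONFIG), ("data", DATA), ("docs", DOCS)):
        if ext in exts:
            return name
    return "binary"  # unrecognized extensions use the binary/unknown strategy


print(classify("settings.json", b'{"debug": false}'))
```

Checking content as well as the extension matters because a mislabeled file (for example, a compiled artifact saved as .txt) would otherwise be routed to a text strategy.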
Size-Based Strategy Selection
| File Size | Strategy | Approach |
|-----------|----------|----------|
| Small (< 2,000 words) | Full read with extractive summarization | Read entire file, extract key passages, summarize from extracts |
| Medium (2,000-10,000 words) | Section-based extraction | Read full file, identify sections/modules, extract from each section, synthesize |
| Large (> 10,000 words) | Chunk and map-reduce | Split into chunks, summarize each chunk, synthesize chunk summaries |
SOURCE: Size thresholds adapted from Anthropic knowledge-synthesis skill (knowledge-work-plugins repository, accessed 2026-02-06). Strategy patterns informed by Map-Reduce Summarization methodology.
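The thresholds in the table map onto a small selection function. The sketch below is illustrative only: the names select_strategy and estimate_words are hypothetical, and the 10-words-per-line default used for the fallback estimate is an assumption, not a value taken from file_metrics.py.

```python
def estimate_words(line_count, words_per_line=10):
    """Fallback estimate when the metrics script is unavailable (ratio is assumed)."""
    return line_count * words_per_line


def select_strategy(word_count):
    """Apply the size thresholds from the table above."""
    if word_count < 2_000:
        return "full_read_extractive"
    if word_count <= 10_000:
        return "section_based_extraction"
    return "chunk_map_reduce"


# A 300-line file with no measured word count falls back to the estimate.
print(select_strategy(estimate_words(300)))
```

Because an estimate from line count is rough, a file that lands near a boundary may be safer to treat with the strategy for the larger size class.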
File Type Strategies
Code Files
File extensions: .py, .js, .ts, .jsx, .tsx, .rs, .go, .java, .c, .cpp, .h, .rb, .php, .swift, .kt, .scala, .sh, .bash, .zsh
The model MUST extract:
- Imports/dependencies - List external modules and standard library imports
- Structure - Classes, functions, methods with signatures
- Purpose - Inferred from docstrings, comments, function names
- Key logic - Core algorithms, state machines, data transformations
- Entry points - main(), CLI argument parsing, exported functions
- Configuration - Environment variables, config file references
Extraction method: Read sequentially. Capture top-level definitions with their line numbers. Extract docstrings verbatim. Quote complex logic rather than paraphrasing.
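For Python sources, the extraction method above can be approximated with the standard ast module. This is a sketch under stated assumptions: the helper names outline and list_imports are hypothetical, only Python files are covered (other languages would need their own parser), and end_lineno requires Python 3.8 or later.

```python
import ast


def outline(source):
    """List top-level classes and functions with line ranges and verbatim docstrings."""
    found = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            found.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "lines": (node.lineno, node.end_lineno),
                "docstring": ast.get_docstring(node),
            })
    return found


def list_imports(source):
    """Return the module names imported at the top level, in file order."""
    modules = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            modules.append(node.module or ".")  # "." marks a bare relative import
    return modules


SAMPLE = '''import os
from pathlib import Path


class AuthClient:
    """HTTP client with token refresh."""

    def get(self, url):
        return url


def refresh_token():
    """Refresh the token."""
    return os.environ.get("TOKEN")
'''

for item in outline(SAMPLE):
    print(item["kind"], item["name"], item["lines"])
```

Parsing the file gives exact line ranges for each definition, which is what the "What Was Found" entries cite.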
Example summary structure:
## Summary
Python module for HTTP client authentication. Implements JWT token refresh flow with retry logic. Exports `AuthClient` class and `refresh_token()` function.
## What Was Found
- Class `AuthClient` (lines 15-87): JWT-based HTTP client with automatic token refresh
- Function `refresh_token()` (lines 92-105): Retries up to 3 times on 401 errors
- Dependencies: `httpx`, `jwt`, `tenacity` (lines 1-3)
- Environment variables: `AUTH_BASE_URL`, `AUTH_CLIENT_ID` (lines 10-11)
## What Was NOT Found
- No test coverage information in this file
- No error handling for network failures
- Configuration schema not documented
Configuration Files
File extensions: .json, .yaml, .yml, .toml, .ini, .env, .conf, .cfg, .properties
The model MUST extract:
- Top-level keys - All root keys with their value types
- Nested structure - Hierarchy depth and organization
- Settings categories - Group keys by purpose if clear
- Notable values - Endpoints, file paths, feature flags, credentials (note presence, do not expose values)
- Validation constraints - Type requirements, enums, ranges if documented
Extraction method: Parse structure. For small files, include all keys. For large files, sample representative sections and note structure patterns.
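A minimal sketch of the key listing, including the rule that credentials are noted but never exposed. It uses JSON only because the parser ships with Python; the function name describe_keys and the pattern used to spot sensitive keys are assumptions and would need tuning for a real project.

```python
import json
import re

# Assumed pattern for key names that likely hold credentials.
SENSITIVE = re.compile(r"password|secret|token|api_?key|credential", re.IGNORECASE)


def describe_keys(node, prefix=""):
    """Yield (dotted_key, type_name, shown_value) for every leaf of a parsed config."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from describe_keys(value, path)  # descend into nested sections
        elif SENSITIVE.search(key):
            # Note the presence of the credential without exposing its value.
            yield path, type(value).__name__, "<present, value withheld>"
        else:
            yield path, type(value).__name__, value


raw = """
{
  "database": {"host": "db.local", "port": 5432, "password": "hunter2"},
  "features": {"experimental_mode": false}
}
"""
for row in describe_keys(json.loads(raw)):
    print(row)
```

Redacting by key name is a heuristic: a secret stored under an innocuous key would slip through, so values that look like tokens still deserve a manual check before they are quoted.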
Example summary structure:
## Summary
Application configuration in YAML format. Defines database connection, API endpoints, feature flags, and logging settings. 47 configuration keys across 5 top-level sections.
## What Was Found
- `database.host`, `database.port`, `database.name` (lines 2-4): PostgreSQL connection settings
- `api.base_url`, `api.timeout` (lines 7-8): External API configuration
- `features.experimental_mode: false` (line 12): Feature flag for beta features
- `logging.level: INFO`, `logging.format` (lines 15-16): Logging configuration
## What Was NOT Found
- No schema validation rules present
- No environment-specific overrides documented
- API authentication credentials not in this file
Data Files
File extensions: .csv, .tsv, .parquet, .json (when data-structured), .jsonl, .ndjson
The model MUST extract:
- Row count - Exact number of records
- Column names - All column headers
- Data types - Inferred from first N rows
- Sample values - Representative examples from each column
- Missing data - Columns with null/empty values
- Unique identifiers - Primary key columns if evident
Extraction method: For CSV/TSV, read the header row and the first 10 data rows to infer types and sample values, and count all lines to get the exact row count. For Parquet, note that binary inspection is limited. For JSON, inspect array structure.
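For CSV input, the extraction items above could be gathered in one pass with the standard csv module. This is a sketch, not a prescribed implementation: profile_csv and infer_type are hypothetical names, and the type inference only distinguishes integer, float, and string from the sampled rows.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from sample strings; empty cells are ignored."""
    present = [v for v in values if v.strip()]
    if not present:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text, sample_size=10):
    """Count every row, but infer types from the first sample_size rows only."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    sample, missing, row_count = [], set(), 0
    for row in reader:
        row_count += 1
        if len(sample) < sample_size:
            sample.append(row)
        for name, cell in zip(header, row):
            if not cell.strip():
                missing.add(name)  # column has at least one empty value
    types = {
        name: infer_type([r[i] for r in sample if i < len(r)])
        for i, name in enumerate(header)
    }
    return {"row_count": row_count, "columns": header,
            "types": types, "missing": sorted(missing)}


CSV_TEXT = "user_id,action,duration_ms\n1001,login,350\n1002,logout,\n1003,view_page,12.5\n"
print(profile_csv(CSV_TEXT))
```

Since types come from a sample, the summary should say so; a column that looks like integers in the first 10 rows may hold other values further down.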
Example summary structure:
## Summary
CSV file containing user activity logs. 1,247 rows with 8 columns. Timestamps range from 2025-01-01 to 2026-02-06. No missing values detected.
## What Was Found
- Column `user_id` (integer): User identifiers, range 1001-5432
- Column `timestamp` (ISO 8601): Activity timestamps
- Column `action` (string): Values include "login", "logout", "view_page", "click_button"
- Column `duration_ms` (integer): Range 0-45000
- 1,247 total records (line count: 1,248 including header)
## What Was NOT Found
- No schema documentation in file
- Column `referrer` is present but not documented
- No indication of data collection methodology
Documentation Files
File extensions: .md, .rst, .txt, .adoc, .org
The model MUST extract:
- Topic hierarchy - Top-level headings and structure
- Key sections - Main topics covered
- Commands/examples - Code blocks, shell commands, API calls
- Links - External references and internal cross-references
- Definitions - Technical terms defined in the text
Extraction method: Read sequentially. Extract headings to build table of contents. Quote key passages that define core concepts. Note code examples.
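For Markdown documents, the heading and code-example pass might look like the sketch below. It assumes ATX-style headings (lines starting with #) and backtick fences; outline_markdown is a hypothetical name, and other formats such as .rst or .org would need different patterns.

```python
import re

FENCE = "`" * 3  # fence marker, built this way to keep the example easy to embed
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def outline_markdown(text):
    """Return ([(line, level, title), ...], number_of_fenced_code_blocks)."""
    headings, code_blocks, in_fence = [], 0, False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count each block once, at its opening fence
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a "#" inside a code block is a comment, not a heading
        match = HEADING.match(line)
        if match:
            headings.append((lineno, len(match.group(1)), match.group(2)))
    return headings, code_blocks


DOC = "\n".join([
    "# Guide", "", "## Setup", FENCE + "sh", "# not a heading",
    "docker build .", FENCE, "", "## Troubleshooting",
])
toc, blocks = outline_markdown(DOC)
print(toc, blocks)
```

Tracking fence state is the important detail: shell comments inside code blocks would otherwise be miscounted as headings and inflate the section count.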
Example summary structure:
## Summary
User guide for deploying containerized applications. Covers Docker setup, image building, registry configuration, and troubleshooting. 5 main sections with 23 subsections. Includes 12 shell command examples.
## What Was Found
- Section "Getting Started" (lines 10-45): Docker installation on Linux and macOS
- Section "Building Images" (lines 47-89): Dockerfile syntax and multi-stage builds
- Section "Troubleshooting" (lines 200-245): Common errors with solutions
- 12 shell command examples throughout document
## What Was NOT Found
- No Windows deployment instructions
- Security best practices not covered
- Performance tuning section mentioned but not written (line 15: "TODO")
Binary and Unknown Files
File extensions: .pdf, .zip, .tar, .gz, .bin, .exe, .so, .dylib, .dll, or unrecognized extensions
The model MUST:
- Attempt to read - Use the Read tool. If the tool returns binary content or an error, note this.
- State limitation - Do NOT guess contents. State: "Binary file, cannot extract text content."
- Provide file metadata - File size, extension, location.
- For PDFs: Use the Read tool with the pages parameter to extract text from specific page ranges. Summarize text content if extraction succeeds.
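Outside the Read tool, the first three steps could be approximated as in the sketch below. The name describe_binary is hypothetical, and treating "does not decode as UTF-8" as the test for binary content is an assumption: it is a heuristic that will misjudge text in other encodings.

```python
from pathlib import Path


def describe_binary(path):
    """Return metadata for a file that is not valid UTF-8 text, else None."""
    p = Path(path)
    data = p.read_bytes()  # whole-file read keeps the sketch simple
    try:
        data.decode("utf-8")
        return None  # readable text: use one of the text strategies instead
    except UnicodeDecodeError:
        pass
    # Report only what can be observed; do not guess at the contents.
    return {
        "summary": "Binary file, cannot extract text content.",
        "path": str(p),
        "size_bytes": len(data),
        "extension": p.suffix or "(none)",
    }
```

The returned dictionary carries only the limitation statement plus size, extension, and location, so nothing in the summary goes beyond what was observed.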
Example for unreadable binary:
## Summary
Binary file, cannot extract text content.
## What Was Found
- File path: ./build/output.bin
- File size: 2.3 MB
- Extension: .bin
## What Was NOT Found
Unable to determine contents without binary inspection tools.
## Uncertain
File may be compiled binary, compressed archive, or proprietary format.
Quote-Grounding Technique
For all text-based files, the model MUST apply the quote-grounding technique:
- First pass - Read file, identify key passages
- Extract - Copy exact quotes with line numbers
- Organize extracts - Group by theme or importance
- Summarize from extracts - Write summary grounded in the extracted quotes
- Verify - Ensure every claim in summary traces to an extract
SOURCE: Technique adapted from Fidelity Rules Rule 2 (lines 27-41).
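The Verify step can be partly mechanized by checking that each extract really appears on the line it cites. A minimal sketch, assuming extracts are stored as (line number, quote) pairs; the Extract class and the unverified function are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Extract:
    line: int   # 1-based line number cited in the summary
    quote: str  # text copied from that line


def unverified(extracts, file_text):
    """Return the extracts whose quote is not found verbatim on the cited line."""
    lines = file_text.splitlines()
    failed = []
    for extract in extracts:
        in_range = 1 <= extract.line <= len(lines)
        if not in_range or extract.quote not in lines[extract.line - 1]:
            failed.append(extract)
    return failed


SOURCE_TEXT = "import httpx\nMAX_RETRIES = 3\n"
EXTRACTS = [
    Extract(2, "MAX_RETRIES = 3"),  # matches line 2
    Extract(1, "import requests"),  # wrong quote
    Extract(9, "anything"),         # line number out of range
]
print(unverified(EXTRACTS, SOURCE_TEXT))
```

This only confirms that quotes are genuine; whether each claim in the summary actually follows from its quote still needs a read-through.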
Output Format
All file summaries MUST use the structured output format defined in Structured Summary.
Required sections:
- YAML frontmatter - Include source_type: file, source_path, method, confidence, word counts
- Summary - Condensed content (BLUF style)
- What Was Found - Items discovered with line number references
- What Was NOT Found - Expected items that were absent
- Uncertain - Ambiguous items requiring interpretation
- Sources - Full file path, access date
Fidelity Rules
The model MUST follow all fidelity rules defined in Fidelity Rules.
Critical rules for file summarization:
- Rule 1: Read the file before summarizing. Never guess from filename.
- Rule 2: Extract before abstracting. Identify key passages first.
- Rule 3: Preserve counts and specifics. "7 functions" not "several functions."
- Rule 4: Distinguish absence from nonexistence. "Not in file" not "doesn't exist."
- Rule 6: State confidence explicitly. Full read of small file = high confidence. Truncated large file = medium/low confidence.
Multi-File Summarization
When the user requests summarization of multiple files:
- Summarize each file individually using this methodology
- Write each summary to a separate output file or section
- Do NOT merge file summaries into a single combined summary without explicit user request
- If synthesis across files is requested, load the multi-source-synthesis skill after completing individual summaries
SOURCE: Multi-source synthesis approach from Summarizer lines 33-37.
Error Handling
If a file cannot be read:
- Attempt to read with the Read tool
- If read fails, report the error: "Unable to read [file path]: [error message]"
- Do NOT speculate about file contents
- Do NOT proceed with summarization
- Ask user if they want to try alternative access methods
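In script form, the same report-and-stop behavior might look like the sketch below. The name read_or_report is hypothetical, and reading as UTF-8 is an assumption; the point is that a failure yields the error message and no content, so nothing downstream can summarize a file that was never read.

```python
def read_or_report(path):
    """Return (content, None) on success or (None, error_message) on failure."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read(), None
    except (OSError, UnicodeDecodeError) as exc:
        # Report the failure verbatim; never substitute guessed content.
        return None, f"Unable to read {path}: {exc}"


content, error = read_or_report("/nonexistent/dir/file.txt")
if error:
    print(error)
```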
Output Rendering
- Read template - Load the template file at ../summarizer/templates/{format_id}.md (default: structured). The template defines the schema, required sections, and fidelity constraints for the selected format.
- Render - Produce output following the template's Schema section. Use the template's Example as a reference for structure and style.
- Verify fidelity - Confirm the output satisfies the template's Fidelity Constraints and all applicable Fidelity Rules.
Anti-Patterns
The model MUST NOT:
- Summarize a file based on its name without reading it
- Guess file contents from directory structure or naming conventions
- Assume file type from extension without verifying contents
- Summarize from partial reads (head/tail/grep) without disclosing the limitation
- Upgrade "not found in file" to "file doesn't contain" in a way that implies certainty about what the file should contain
- Present interpretation as observation
- Skip the "What Was NOT Found" section
- Omit line number references for key findings