PWC Audit Intelligence Skill
Purpose
This skill packages audit domain expertise for the ground-truth project, enabling Claude to understand audit concepts, compliance requirements, and quality validation without re-explanation in each session.
When to Use This Skill
- Extracting ground truth data from SEC filings (10-K, 10-Q, DEF 14A)
- Analyzing Critical Audit Matters (CAMs)
- Validating Internal Controls over Financial Reporting (ICFR)
- Ensuring PCAOB independence compliance (Rule 3520)
- Building audit intelligence systems (RAG, ML models, partner dashboards)
Core Audit Concepts
1. Critical Audit Matters (CAMs)
Definition: Matters arising from the current period audit of financial statements that were communicated (or required to be communicated) to the audit committee and that:
- Relate to accounts/disclosures material to the financial statements
- Involved especially challenging, subjective, or complex auditor judgment
Why They Matter:
- High audit risk: Areas requiring significant professional judgment
- Complexity indicator: More CAMs = more complex client
- Partner resource allocation: CAM-heavy clients require senior staff
- Churn risk: Clients with increasing CAM count may be at risk
Examples (from ground-truth extractions):
- PFSI (PennyMac): Mortgage Servicing Rights (MSRs) valuation
- Why critical: "Involved especially challenging, subjective, or complex judgments"
- Sector-specific: Unique to mortgage banking
- ILMN (Illumina): Revenue recognition for complex multi-element arrangements
- Banking sector: Loan loss reserves (CECL calculations)
Typical CAM Counts:
- 0-1 CAMs: Low complexity (straightforward audits)
- 2-3 CAMs: Moderate complexity (industry-standard)
- 4+ CAMs: High complexity (partner-intensive)
Red Flags:
- Increasing CAM count year-over-year
- Same CAM repeated for 3+ years (unresolved issues)
- CAMs related to management estimates (subjectivity risk)
2. Internal Controls over Financial Reporting (ICFR)
Definition (SOX 404): Process designed to provide reasonable assurance regarding reliability of financial reporting and preparation of financial statements.
Effectiveness Conclusion: Binary (Effective / Ineffective)
- Effective: No material weaknesses identified
- Ineffective: One or more material weaknesses exist
Why It Matters:
- Audit quality: Clean ICFR = lower detection risk
- Client health: ICFR failures signal governance issues
- Churn risk: Material weaknesses increase partner workload, may lead to resignation
- Regulatory risk: ICFR failures attract SEC scrutiny
Material Weakness Examples:
- Inadequate segregation of duties
- Ineffective IT general controls (ITGC)
- Lack of evidence for key control execution
- Management override of controls
Ground Truth Extraction (sec_10k.controls):
{
"icfr_effective": true,
"auditor": "Deloitte & Touche LLP",
"opinion_date": "2025-02-19",
"opinion": "In our opinion, the Company maintained, in all material respects, effective internal control over financial reporting..."
}
ICFR Status Distribution (across 13 companies):
- Effective: 13/13 (100%) — all test companies have clean controls
- Ineffective: 0/13 — no material weaknesses found (expected, these are mature public companies)
3. PCAOB Independence Compliance (Rule 3520)
Rule: Auditors must be independent in fact and appearance
Key Prohibitions:
- Cannot audit own firm's clients (independence conflict)
- Cannot perform certain non-audit services (e.g., bookkeeping)
- Cannot have financial interest in audit client
- Rotation requirements (lead partner: 5 years, reviewing partner: 5 years)
Why It Matters for ground-truth:
- Testing restriction: Can ONLY use non-PWC clients for development
- Deployment restriction: Must get Independence Office approval before using PWC client data
- Provenance requirement: Must document that all test data is independence-compliant
Permitted Test Companies (127 total):
- Audited by: EY, Deloitte, KPMG (NOT PWC)
- Primary test company: PFSI (PennyMac Financial Services) — Deloitte client
- Full list:
config/test_companies_permitted.csv
Verification Process:
- Check PCAOB Form AP (auditor registration)
- Cross-reference company CIK with Form AP client list
- If PWC is auditor → FAIL (cannot use)
- If EY/Deloitte/KPMG → PASS (permitted)
Deployment Checklist (before using PWC clients):
- [ ] Independence Office consultation
- [ ] Legal/Compliance sign-off
- [ ] Client consent (where required)
- [ ] Bias audit completed
- [ ] PCAOB consultation (recommended)
4. SEC Filing Types
10-K (Annual Report):
- Item 1A: Risk Factors
- Item 7: MD&A (Management's Discussion & Analysis)
- Item 9A: Controls and Procedures (ICFR opinion)
- Critical Audit Matters: In auditor's report (near end of 10-K)
10-Q (Quarterly Report):
- Condensed financials (no full CAM disclosure)
- ICFR disclosure only if material change
DEF 14A (Proxy Statement):
- Auditor information (fees, tenure)
- Executive compensation
- Board composition
- Related party transactions
8-K (Current Report):
- Material events (M&A, executive changes, restatements)
Ground Truth Extraction Workflow
Phase 1: Company Resolution
Input: Ticker (e.g., "PFSI") or CIK (e.g., "0001745916")
Process:
- Resolve ticker → CIK via SEC API
- Fetch company profile (includes SIC code, auditor, fiscal year end)
- Check independence: Is company a PWC client? (FAIL if yes)
Output: Company metadata
{
"ticker": "PFSI",
"cik": "0001745916",
"company_name": "PENNYMAC FINANCIAL SERVICES INC",
"sic_code": "6162",
"sector": "mortgage",
"auditor": "Deloitte & Touche LLP"
}
Phase 2: Data Extraction
Sources:
- SEC XBRL: Financial statements (Assets, Liabilities, Equity, Net Income, EPS)
- SEC 10-K: CAMs, ICFR, Risk Factors
- Market data: Current price, market cap
- Sector-specific: HMDA (mortgage), FDIC (banking), EIA (utilities)
Provenance Requirements:
- source_url: Direct link to SEC filing
- file_sha256: Cryptographic hash of source document
- extraction_method: How fact was extracted (XBRL API, regex, table parser)
- confidence: Score 0.0-1.0 (1.0 = deterministic, <1.0 = heuristic)
Phase 3: Validation
Balance Sheet Reconciliation:
Assets = Liabilities + Stockholders' Equity
- PASS: Difference < $1M or < 0.1% of assets
- FAIL: Significant imbalance (check for off-balance-sheet items, extraction errors)
EPS Consistency:
Calculated EPS = Net Income / Shares Outstanding
- PASS: Within $0.01 of reported diluted EPS
- FAIL: Material difference (check for share count errors, preferred dividends)
Data Quality:
- Required facts present (Assets, Liabilities, Equity, Net Income)
- Numeric plausibility (no negative equity for going concerns)
- Date validity (ISO 8601 format)
Provenance Completeness:
- All facts have source_url
- All facts have SHA-256 checksums
- Evidence files archived locally
Phase 4: Sector Routing
SIC Code Classification:
- Banking (6000-6099): FDIC call reports, summary of deposits
- Mortgage (6100-6199): HMDA originations, PMMS rates, FHFA HPI
- Utilities (4900-4999): EIA data, EPA CAMD emissions
- Airlines (4500-4599): BTS Form 41, on-time performance
- Tech/Other: Base extractors only (XBRL, 10-K, market data)
Routing Command:
python -m ground_truth.cli classify PFSI
# Output: Sector: mortgage, Extractors: sec_xbrl, sec_10k, market_data, hmda, pmms, fhfa_hpi
RAG Integration (Phase 2)
Chunking Strategy
10-K Sections to Embed:
- Critical Audit Matters (keep each CAM as separate chunk)
- ICFR controls (single chunk)
- Risk Factors (split if >1000 tokens)
- MD&A (split by subsection)
Chunk Metadata (required):
{
"chunk_id": "PFSI_2024_CAM_01",
"ticker": "PFSI",
"cik": "0001745916",
"section": "Critical Audit Matters",
"filing_date": "2024-12-31",
"source_sha256": "f52e532ba...",
"chunk_index": 0,
"total_chunks": 2
}
Query Patterns
Factual Retrieval:
- "What are PFSI's Critical Audit Matters?"
- "Is PFSI's ICFR effective?"
- "What is EWBC's total assets?"
Comparative:
- "Compare PFSI and RKT risk factors"
- "Which has more CAMs: PFSI or RKT?"
Analytical:
- "Which mortgage companies have ICFR failures?"
- "Which companies have 3+ CAMs?"
- "Show all companies audited by Deloitte"
Temporal (requires multi-year data):
- "Did PFSI's net income increase in 2024?"
- "What CAMs did PFSI have in 2023 vs 2024?"
LTV Prediction Model (Phase 3)
Features (Churn Risk Indicators)
Complexity Indicators:
- CAM count (0-1 = low, 2-3 = medium, 4+ = high)
- ICFR effectiveness (effective = -20 pts, ineffective = +50 pts)
- Restatement history (each restatement = +30 pts)
Financial Health:
- Profitability (net income < 0 = +20 pts)
- Leverage (debt/equity > 3.0 = +15 pts)
- Liquidity (current ratio < 1.0 = +10 pts)
Sector Risk:
- Mortgage (cyclical, sensitive to rates) = +10 pts
- Banking (regulatory-heavy) = +5 pts
- Tech (fast-changing, valuation risk) = +5 pts
Heuristic Score (0-100):
def calculate_client_risk_score(ground_truth):
score = 50 # Base score
# CAM complexity
cam_count = len(ground_truth['critical_audit_matters'])
if cam_count >= 3:
score += 30
elif cam_count == 0:
score -= 30
# ICFR status
if not ground_truth['controls']['icfr_effective']:
score += 50
else:
score -= 20
# Profitability
if ground_truth['xbrl']['NetIncomeLoss'] < 0:
score += 20
return max(0, min(100, score)) # Clamp to 0-100
Risk Bands:
- 0-30: Low risk (retain, minimal partner time)
- 31-60: Medium risk (monitor, standard engagement)
- 61-100: High risk (churn candidate, consider resignation)
Quality Standards
Audit-Grade Provenance
Every fact must include:
- Source URL: Direct link to SEC filing
- SHA-256: Cryptographic hash for verification
- Extraction method: How it was obtained
- Confidence score: Reliability estimate
Example:
{
"matter": "Mortgage Servicing Rights (MSRs)",
"why_critical": "Involved especially challenging, subjective, or complex judgments",
"provenance": {
"source_url": "https://www.sec.gov/Archives/edgar/data/1745916/000155837025001148/pfsi-20241231x10k.htm",
"file_sha256": "f52e532ba113920525cf682c979c52e107904c847d6532ef0d922b71ba684e6a",
"extraction_method": "section_locator",
"confidence": 0.7
}
}
Verification Process
Partners must be able to:
- Download original 10-K from SEC
- Compute SHA-256 hash
- Verify it matches provenance metadata
- Manually locate cited text in filing
If mismatch:
- Flag for investigation (corruption or fabrication)
- Re-run extraction
- Update evidence files
PWC Pitch Framework
Problem
Partners manually review 5,000+ client 10-Ks annually:
- 200 pages per 10-K × 2 hours = 10,000 hours/year
- No institutional memory (knowledge locked in partner minds)
- Reactive (find issues after they occur) vs proactive
Solution
Automated ground truth extraction + RAG query interface:
- Extract CAMs, ICFR, financials in 3 minutes
- Natural language queries: "Which clients have material weaknesses?"
- Full provenance (SHA-256 verification)
- Proactive alerts (CAM count increasing, ICFR failures)
Proof
- 13 companies extracted (100% success rate)
- 43.8 MB evidence archived with SHA-256 checksums
- All 127 test companies are independence-compliant
- RAG POC: Partners query in 2 seconds vs 2 hours
Business Impact
Efficiency:
- 200 hours → 2 hours per partner per year (100x gain)
Risk Detection:
- Automated alerts for high-risk clients
- Early warning for churn candidates
- Proactive partner-client matching
Compliance:
- Audit trail maintained (SHA-256 provenance)
- Independence safeguards built-in
- PCAOB-compliant validation
Ask
6 weeks + AWS/Azure credits for 50-company pilot
Common Failure Modes
Extraction Errors
CAM extraction returns 0 CAMs (but company likely has them):
- Cause: Section locator regex doesn't match 10-K format
- Fix: Debug section headers, test on 3-5 known CAM companies
Balance sheet doesn't balance:
- Cause: XBRL period inconsistency (mixing quarterly/annual)
- Fix: Consistent period selection logic (see PFSI bug fix)
"Revenues" fact missing for banks:
- Cause: Banks use different XBRL tags (InterestAndDividendIncomeOperating)
- Fix: Add sector-specific tag mappings
Validation Failures
EPS consistency check fails:
- Possible causes: Preferred dividends, share count timing, rounding
- Action: Check 10-K footnotes, verify share count source
Data quality FAIL:
- Cause: Required fact missing (Assets, Liabilities, etc.)
- Action: Check XBRL API response, add fallback tag mappings
Independence Violations
Extracted JPM data (PWC client):
- Risk: PCAOB Rule 3520 violation
- Action: Delete immediately, add to blocked list, re-check permitted companies
Reference Files
In ground-truth repo:
config/test_companies_permitted.csv— 127 independence-compliant companiesINDEPENDENCE.md— Full PCAOB compliance frameworkRAG_ARCHITECTURE.md— Technical deep dive on RAG systemMILESTONE_01_PFSI_SUCCESS.md— PFSI case study (first successful extraction)
Vocabulary
Terms to use consistently:
- Ground truth: Authoritative data from primary sources (SEC filings, regulatory databases)
- Provenance: Metadata chain (source URL → SHA-256 → extraction method)
- CAM: Critical Audit Matter (NOT "significant audit matter")
- ICFR: Internal Control over Financial Reporting (NOT "internal controls")
- Extraction: Automated data retrieval (NOT "scraping")
- Validation gates: Quality checks (balance sheet, EPS, provenance)
Avoid:
- "Scraping" (implies unstructured/aggressive data collection)
- "AI-generated" (use "LLM-assisted" or "RAG-powered")
- "Black box" (emphasize explainability, provenance, human-in-the-loop)
Notes
Baseline date: October 25, 2025
Current status:
- Phase 1 (Extraction): ✅ COMPLETE (13 companies)
- Phase 2 (RAG): 🔴 IN PROGRESS (3-week timeline)
- Phase 3 (ML): ⚪ NOT STARTED (4-6 weeks after RAG)
Key decisions:
- ChromaDB selected over Pinecone (cost, simplicity)
- OpenAI embeddings selected over open-source (quality)
- RAG-first strategy (not scale-first)
- Supervised ML (not RL)
Audit context:
- Owner: Nirvan Chitnis (PWC Audit Associate, started Oct 3, 2025)
- Public repo: https://github.com/nirvanchitnis-cmyk/ground-truth
- Professional reputation protection: No inappropriate content, audit-grade standards
微信扫一扫