RTL Debugging Methodology
Systematic approach for debugging RTL from verification results and test scenarios.
When to Use This Skill
- Analyzing UVM test failures to identify RTL bugs
- Investigating assertion violations in simulation
- Debugging scoreboard mismatches between expected and actual behavior
- Triaging multiple test failures to find common root causes
- Understanding why specific test scenarios fail while others pass
- Analyzing coverage holes related to bugs
Debugging Workflow (Reasoning Process)
1. Analyze Test Failure Pattern
Objective: Understand which tests fail and why
Questions to answer:
- Which tests pass and which fail? (Pattern analysis)
- Do failures occur in specific test scenarios only?
- Is the failure deterministic or random? (Check with different seeds)
- At what phase does the test fail? (Build, run, scoreboard check)
Evidence sources:
- Test execution logs (sim/logs/)
- Regression test results (sim/reports/)
- UVM report summary (UVM_ERROR, UVM_FATAL locations)
- DSIM
2. Collect Verification Evidence
Objective: Gather all available evidence from verification components
Verification evidence sources:
From Assertions:
UVM_ERROR @ 1250ns: Assertion 'a_axi_wdata_stable' failed
Location: sim/assertions/axi4_protocol_checker.sv:45
Property: wdata must remain stable when wvalid=1 and wready=0
→ RTL violates AXI4 protocol specification
From Scoreboard:
UVM_ERROR: [SCOREBOARD] Data mismatch detected
Expected: 0xDEADBEEF
Actual: 0xDEADBEE0
Address: 0x1000
Time: 1250ns
→ LSB nibble corrupted, check datapath width or masking logic
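The "which bits are wrong" observation can be made mechanical by XOR-ing the expected and actual values. The following is a minimal sketch rather than part of the original flow; the module name `mismatch_diff_demo` and the function `diff_mask` are hypothetical, and the two values are the ones from the scoreboard message above.

```systemverilog
module mismatch_diff_demo;
  // Hypothetical helper: returns a mask with a 1 at every differing bit position.
  function automatic logic [31:0] diff_mask(input logic [31:0] expected,
                                            input logic [31:0] actual);
    return expected ^ actual;
  endfunction

  initial begin
    logic [31:0] mask;
    // Expected/actual values taken from the scoreboard message above.
    mask = diff_mask(32'hDEADBEEF, 32'hDEADBEE0);
    $display("diff mask = 0x%h, differing bits = %0d", mask, $countones(mask));
    // Prints: diff mask = 0x0000000f, differing bits = 4
  end
endmodule
```

A mask confined to one nibble, as here, points at width or masking logic rather than at a wrong source register.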
From Monitor:
UVM_WARNING: [MONITOR] Unexpected transaction observed
Type: WRITE
Address: 0x1004 (expected: 0x1000)
3. Map Evidence to RTL Problem Domain
**Objective**: Translate verification failures to RTL problem categories
**Evidence-to-Problem mapping**:
| Verification Evidence | RTL Problem Domain | Investigation Focus |
|----------------------|-------------------|---------------------|
| **Assertion: Protocol violation** | Interface logic | Check handshake FSM, signal timing |
| **Scoreboard: Data mismatch** | Datapath logic | Check ALU, mux select, forwarding |
| **Scoreboard: Missing transaction** | Control logic | Check enable signals, FSM transitions |
| **Scoreboard: Extra transaction** | Control logic | Check termination conditions, counters |
| **Monitor: Wrong address** | Address generation | Check increment/decrement logic, offset calculation |
| **Monitor: Wrong timing** | Pipeline control | Check stall logic, valid/ready propagation |
| **Assertion: X-propagation** | Reset/initialization | Check reset assignments, case completeness |
**Test scenario analysis**:
- Failing scenario: Back-to-back writes with no idle cycles
- Passing scenario: Writes with 2-cycle gaps
Hypothesis generation:
- Pipeline hazard when no bubble between transactions
- Backpressure handling assumes idle cycles
- State machine doesn't handle consecutive valid inputs
- Register forwarding path missing for zero-latency case
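Before testing these hypotheses, it can help to confirm that the simulation really exercises both scenarios. The sketch below uses two cover properties for that purpose. It is an illustration under assumptions: the module name `b2b_write_probe` and the `clk`/`rst_n` ports are hypothetical, and only `wvalid`/`wready` come from the examples in this document.

```systemverilog
module b2b_write_probe (
  input logic clk,
  input logic rst_n,   // assumed active-low reset
  input logic wvalid,
  input logic wready
);
  // A write beat is accepted when both handshake signals are high.
  logic beat;
  assign beat = wvalid && wready;

  // Hit when two beats are accepted in consecutive cycles (no bubble).
  c_back_to_back: cover property (
    @(posedge clk) disable iff (!rst_n) beat ##1 beat
  );

  // Hit when two beats are separated by exactly two idle cycles.
  c_two_cycle_gap: cover property (
    @(posedge clk) disable iff (!rst_n) beat ##1 (!beat)[*2] ##1 beat
  );
endmodule
```

If `c_back_to_back` is never hit in a passing test, that test says nothing about the zero-gap case.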
4. Design Targeted Experiments
**Objective**: Create minimal test to isolate root cause
**Experiment design strategies**:
**Modify existing failing test**:
```systemverilog
// Note: add_transaction() and add_idle_cycles() are assumed helper methods of the
// test sequence. The handle is named seq because 'sequence' is a reserved word.

// Original failing test: Back-to-back writes
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_transaction(WRITE, .addr(32'h1004), .data(32'hBB)); // ← FAILS

// Experiment 1: Add gap between transactions
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_idle_cycles(2);
seq.add_transaction(WRITE, .addr(32'h1004), .data(32'hBB)); // ← PASS?
// If passes: Confirms pipeline hazard hypothesis

// Experiment 2: Same address back-to-back
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hBB)); // ← PASS/FAIL?
// If passes: Problem is address-generation specific
```

**Create minimal directed test**:
```systemverilog
// Hypothesis: Burst counter overflows at length=16
class minimal_burst_test extends base_test;
  `uvm_component_utils(minimal_burst_test)

  function new(string name = "minimal_burst_test", uvm_component parent = null);
    super.new(name, parent);
  endfunction

  virtual task run_phase(uvm_phase phase);
    phase.raise_objection(this);
    // Test exactly at boundary (send_burst is assumed to be provided by base_test)
    send_burst(.addr(32'h0), .length(15)); // Should work
    send_burst(.addr(32'h0), .length(16)); // Should fail
    send_burst(.addr(32'h0), .length(17)); // Should fail
    phase.drop_objection(this);
  endtask
endclass
```

**Add debug assertions**:
```systemverilog
// Insert temporary assertion at suspected problem point.
// A bind needs an instance name for the bound module (u_debug_assertions here).
bind axi_slave_fsm debug_assertions u_debug_assertions (
  .clk(clk),
  .state(current_state),
  .wvalid(wvalid),
  .wready(wready)
);
```
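The bind above refers to a `debug_assertions` module that this document does not show. One possible shape is sketched below; the port list mirrors the bind, while the `STATE_W` parameter, the assertion names, and the specific checks are assumptions chosen for illustration.

```systemverilog
module debug_assertions #(
  parameter int STATE_W = 2  // assumed FSM state width
) (
  input logic               clk,
  input logic [STATE_W-1:0] state,
  input logic               wvalid,
  input logic               wready
);
  // Temporary check: the slave's wready must be a known value while a beat is offered.
  a_dbg_wready_known: assert property (
    @(posedge clk) wvalid |-> !$isunknown(wready)
  ) else $error("wready is X/Z while wvalid=1, state=%0d", $sampled(state));

  // Temporary check: the FSM state must be a known value while a beat is offered.
  a_dbg_state_known: assert property (
    @(posedge clk) wvalid |-> !$isunknown(state)
  ) else $error("FSM state is X/Z while wvalid=1");
endmodule
```

Because the checks live in a separate bound module, they can be removed after debugging without touching the RTL source.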
5. Trace from Verification to RTL Root Cause
**Objective**: Navigate from high-level test failure to specific RTL bug
**Top-down tracing workflow**:
1. Test Failure
   - axiuart_burst_test fails with scoreboard mismatch
2. Scoreboard Analysis
   - Expected data: 0xBB, Actual: 0xAA
   - Second write returned first write's data
3. Monitor Analysis (check transactions observed)
   - WRITE(addr=0x1000, data=0xAA) @ 1000ns - acknowledged
   - WRITE(addr=0x1004, data=0xBB) @ 1002ns - acknowledged
   - READ(addr=0x1004) @ 1010ns - returned 0xAA (wrong!)
4. Waveform Analysis at 1002ns (second write)
   - axi_wdata = 0xBB ✓
   - axi_waddr = 0x1004 ✓
   - write_enable = 1'b1 ✓
   - But: register_select still points to 0x1000 ✗
5. RTL Module Analysis

6. Verify with Test Suite
Objective: Confirm fix resolves issue without breaking other tests
Verification workflow:
Step 1: Re-run failing test
# Run specific test that previously failed
run_uvm_simulation --test axiuart_burst_test --seed 12345
# Expected: PASS
Step 2: Run related tests (test suite partitioning)
# Run all tests that exercise same RTL module
run_uvm_simulation --regression smoke_suite
# Focus: Tests with write transactions, address decoding
Step 3: Full regression
### By Test Failure Type
| Failure Type | Root Cause Category | Investigation Focus |
|-------------|---------------------|---------------------|
| **Scoreboard mismatch: wrong data** | Datapath error | Trace data from source to sink, check mux selects, forwarding |
| **Scoreboard mismatch: missing transaction** | Control flow error | Check FSM transitions, enable signals, counter termination |
| **Scoreboard mismatch: extra transaction** | Control flow error | Check counter overflow, FSM looping, duplicate strobes |
| **Assertion: Protocol violation** | Interface timing | Check handshake sequences, stability requirements, backpressure |
| **Assertion: Stability violation** | Combinational logic | Check for unintended signal changes, glitches, race conditions |
| **Assertion: X-propagation** | Initialization error | Check reset coverage, case statement completeness, undriven signals |
| **Timeout: No response** | Deadlock or FSM stuck | Check FSM for unreachable transitions, missing conditions |
| **UVM_FATAL: Null object** | Verification code bug | Not RTL issue - check testbench configuration |
### By Test Pass/Fail Pattern
**Pattern: Only random tests fail, directed tests pass**
- **Hypothesis**: Corner case not covered by directed tests
- **Action**: Analyze failing random test stimulus for common characteristics
- **Example**: Random test hits burst length=256, directed tests only ≤16
**Pattern: All tests with feature X fail, others pass**
- **Hypothesis**: Feature X has RTL bug
- **Action**: Focus debug on RTL module implementing feature X
- **Example**: All interrupt tests fail → debug interrupt controller
**Pattern: Intermittent failures with different seeds**
- **Hypothesis**: Race condition or initialization dependency
From Verification Evidence to RTL Root Cause
### Scoreboard-Driven Investigation
**Scoreboard reports data mismatch**:
Step 1: Identify transaction with mismatch
- Monitor: WRITE(addr=0x1000, data=0xDEADBEEF) @ 1000ns
- Scoreboard: Expected 0xDEADBEEF at 0x1000
- Monitor: READ(addr=0x1000) → 0xDEADBEE0 @ 1100ns
- Mismatch: LSB nibble changed 0xF → 0x0

Step 2: Hypothesize based on bit pattern
- All bits except LSB nibble correct → byte masking issue
- LSB nibble zeroed → possible width/alignment problem

Step 3: Check waveform at write cycle (1000ns)
- axi_wdata[31:0] = 0xDEADBEEF ✓
- write_strobe[3:0] = 4'b1111 ✓
- register_wdata[31:0] = 0xDEADBEE0 ✗ ← BUG IS HERE

Step 4: Trace write path
- axi_wdata → data_align_unit → register_wdata
- Check data_align_unit for LSB nibble handling

Step 5: Find root cause in RTL
- Bug found in data_align_unit: `assign register_wdata = {axi_wdata[31:4], 4'b0000};` ← Hardcoded zero!
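Once the root cause is located, a small self-checking bench can confirm a candidate fix before the full test is re-run. The sketch below assumes the intended behavior of `data_align_unit` is a plain pass-through, which the document does not state; the port list and the bench `data_align_unit_tb` are likewise hypothetical.

```systemverilog
// Assumed intent: data_align_unit forwards write data unchanged.
module data_align_unit (
  input  logic [31:0] axi_wdata,
  output logic [31:0] register_wdata
);
  // Buggy version found in the trace:
  //   assign register_wdata = {axi_wdata[31:4], 4'b0000};
  assign register_wdata = axi_wdata;
endmodule

module data_align_unit_tb;
  logic [31:0] axi_wdata;
  logic [31:0] register_wdata;

  data_align_unit dut (
    .axi_wdata(axi_wdata),
    .register_wdata(register_wdata)
  );

  initial begin
    axi_wdata = 32'hDEADBEEF;  // same stimulus as the failing transaction
    #1;                        // let the continuous assignment settle
    if (register_wdata !== 32'hDEADBEEF)
      $error("LSB nibble still corrupted: 0x%h", register_wdata);
    else
      $display("PASS: register_wdata = 0x%h", register_wdata);
    $finish;
  end
endmodule
```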
### Assertion-Driven Investigation
**Assertion reports protocol violation**:
- Assertion 'a_axi_wdata_stable' failed @ 1250ns
- Property: (wvalid && !wready) |=> $stable(wdata)

Step 1: Understand assertion semantics
- wdata must not change when wvalid=1 and wready=0
- This is an AXI4 protocol requirement

Step 2: Check waveform at violation timestamp
- @1249ns: wvalid=1, wready=0, wdata=0xAAAA
- @1250ns: wvalid=1, wready=0, wdata=0xBBBB ← Changed illegally

Step 3: Find source of wdata in RTL
- `assign wdata = write_fifo_dout;`

Step 4: Check FIFO read logic
- `assign fifo_read_en = wvalid && wready;` ✓ Correct condition

Step 5: Check for other paths affecting wdata
- Found: Debug logic bypassing FIFO!
- `assign wdata = debug_mode ? debug_data : write_fifo_dout;`
- debug_mode changed during backpressure → violation
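To pin the violation on the mux select rather than on the FIFO, a temporary assertion on `debug_mode` can sit next to the protocol assertion. The following is a sketch under assumptions: the checker module name, the `rst_n` port, and the 32-bit `wdata` width are hypothetical.

```systemverilog
module wdata_stability_checker (
  input logic        clk,
  input logic        rst_n,       // assumed active-low reset
  input logic        wvalid,
  input logic        wready,
  input logic [31:0] wdata,       // assumed 32-bit data bus
  input logic        debug_mode
);
  // Protocol check: a stalled beat must keep its data.
  a_axi_wdata_stable: assert property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && !wready) |=> $stable(wdata)
  ) else $error("wdata changed while the beat was stalled");

  // Temporary debug check: the mux select must not switch during a stall.
  a_dbg_mode_stable: assert property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && !wready) |=> $stable(debug_mode)
  ) else $error("debug_mode toggled during backpressure");
endmodule
```

If both assertions fire in the same cycle, the select change is the likely cause of the data change; if only the first fires, the FIFO path deserves another look.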
### Test Suite Differential Analysis
**Multiple tests analysis**:
| Test Name | Scenario | Result | Common Attribute |
|-----------|----------|--------|------------------|
| basic_write | Single write | ✓ PASS | Burst length = 1 |
| burst4_write | 4-beat burst | ✓ PASS | Burst length = 4 |
Debugging Techniques from Test Results
### Regression Test Triage
**Analyze multiple test results to find common root cause**:
Regression suite: 42 tests total
- 38 PASS
- 4 FAIL: axiuart_burst16, axiuart_burst32, axiuart_wrap16, axiuart_wrap32
Pattern recognition:
- All failures involve burst length ≥ 16
- Both INCR and WRAP burst types affected
- Burst length ≤ 8 always passes
Common root cause hypothesis:
- Burst counter width insufficient for length ≥ 16
- Not specific to burst type (INCR vs WRAP)
- Not data-pattern dependent
Single fix expected to resolve all 4 failures.
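The counter-width hypothesis is easy to sanity-check in isolation. The sketch below assumes the counter is loaded with the burst length itself and is 4 bits wide; both are illustrative assumptions, not facts from the design. `MAX_BURST = 256` matches the longest burst mentioned in this document.

```systemverilog
module burst_counter_width_demo;
  localparam int MAX_BURST = 256;                    // longest burst length considered
  localparam int SAFE_W    = $clog2(MAX_BURST + 1);  // 9 bits hold 0..256

  logic [3:0]        narrow_cnt;  // assumed buggy width: holds 0..15 only
  logic [SAFE_W-1:0] wide_cnt;    // wide enough for every legal length

  initial begin
    narrow_cnt = 4'(16);       // 16 does not fit in 4 bits and truncates to 0
    wide_cnt   = SAFE_W'(16);  // preserved
    $display("narrow_cnt=%0d wide_cnt=%0d SAFE_W=%0d", narrow_cnt, wide_cnt, SAFE_W);
    // Prints: narrow_cnt=0 wide_cnt=16 SAFE_W=9
  end
endmodule
```

A truncation to zero at exactly 16 would explain why lengths up to 8 pass and both INCR and WRAP bursts fail.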
### Minimal Reproducing Test
**Create simplest test that triggers bug**:
```systemverilog
// Original failing test: 200 lines, 10 minutes runtime
class axiuart_burst16_test extends base_test;
  // Complex randomization, multiple sequences, ...
endclass

// Minimal reproducer: 15 lines, 10 seconds runtime
class minimal_burst16_test extends base_test;
  `uvm_component_utils(minimal_burst16_test)

  function new(string name = "minimal_burst16_test", uvm_component parent = null);
    super.new(name, parent);
  endfunction

  virtual task run_phase(uvm_phase phase);
    axi_seq seq = axi_seq::type_id::create("seq");
    phase.raise_objection(this);
    // Single burst-16 transaction
    seq.addr = 32'h1000;
    seq.burst_length = 16; // Minimal case that fails
    seq.start(env.agent.sequencer);
    phase.drop_objection(this);
  endtask
endclass

// Run: Still fails with same root cause
// Benefit: Faster debug iteration (10s vs 10min)
```
### Test Modification Experiments
Systematically modify test to isolate variable:
- Final conclusion: Pure burst length issue, check counter width

Debugging Pitfalls
Don't Debug Without Test Evidence
- ❌ Wrong: "I think the problem is in module X, let me check the code"
- ✅ Right: "Test Y failed with scoreboard mismatch at time T, let me analyze the evidence"

Don't Ignore Test Pass/Fail Patterns
- ❌ Wrong: Debug first failure in isolation, ignore other tests
- ✅ Right: Analyze which tests pass/fail to identify common characteristics

Don't Trust Single Test Result
- ❌ Wrong: Test passed once → bug is fixed
- ✅ Right: Run regression suite (multiple seeds, scenarios) to confirm fix

Don't Modify RTL Without Evidence
- ❌ Wrong: Change RTL based on intuition, hope test passes
- ✅ Right: Trace from test failure → scoreboard → monitor → waveform → RTL

Don't Create Tests Without Purpose
- ❌ Wrong: Write random tests hoping to find bugs
- ✅ Right: Analyze coverage holes, create tests targeting untested scenarios

Don't Skip Regression After Fix
- ❌ Wrong: Failing test now passes → Done
- ✅ Right: Run full regression to ensure fix doesn't break other tests
### Coverage-Guided Root Cause Analysis
**Use coverage to identify untested paths related to bug**:
```systemverilog
// Coverage report after test failures
covergroup cg_burst_length;
  cp_length: coverpoint burst_length {
    bins short[]  = {[1:8]};    // 100% hit
    bins boundary = {15, 16};   // 16 causes failures
    bins long[]   = {[17:256]}; // 0% hit ← Never tested!
  }
endgroup

// Analysis:
// - Tests never tried burst_length > 16
// - Bug might affect all values ≥ 16, not just 16
// - After fix, add test for burst_length=256 to verify
```
Waveform Analysis from Test Failures
### From Scoreboard Timestamp to Waveform
**Workflow**:
1. Test log shows scoreboard error at simulation time 1250ns:
   UVM_ERROR: [SCOREBOARD] Data mismatch at addr=0x1000
2. Set waveform viewer to time 1250ns
3. Identify relevant signals from monitor transaction:
   - axi_awaddr (write address channel)
   - axi_wdata (write data channel)
   - Internal register_file signals
4. Check transaction timing:
   - @1240ns: awvalid=1, awaddr=0x1000, awready=1 (address accepted)
   - @1242ns: wvalid=1, wdata=0xBEEF, wready=1 (data accepted)
   - @1250ns: register_file[0] = 0xBEE0 ← Should be 0xBEEF
5. Trace internal path:
   axi_wdata (0xBEEF) → write_data_reg (0xBEEF) → data_align (0xBEE0) ← BUG HERE
### Backward Tracing from Assertion
**Assertion fires, trace backward to root cause**:
Assertion violation @ 1250ns:
- a_valid_stable: (valid && !ready) |=> $stable(data)

Waveform analysis:
- @1249ns: valid=1, ready=0, data=0xAAAA
- @1250ns: valid=1, ready=0, data=0xBBBB ← Violated $stable()

Trace data signal backward:
- data ← output_mux
- output_mux ← select between fifo_out and bypass_data
- mux_select changed at 1250ns ← WHY?

Trace mux_select backward to find why it changed.

RTL debugging from verification results is evidence-driven investigation:
- Analyze test failure patterns - Which tests fail? What do they have in common?
- Collect verification evidence - Scoreboard, assertions, monitors, logs
- Map evidence to RTL problem domain - Translate test failure to RTL category
- Design targeted experiments - Create minimal tests to isolate root cause
- Trace from verification to RTL - Navigate from test → scoreboard → waveform → RTL
- Verify with test suite - Confirm fix with regression, add prevention tests
Key principle: Test results guide investigation. Start from verification evidence (test failures, assertion violations, scoreboard mismatches), not from reading the RTL code.
By Affected Component
Datapath issues:
- Check operand widths, sign extension, overflow handling
- Verify bypass/forwarding conditions
- Trace data flow from source to destination
Control logic issues:
- Draw state transition diagram from code
- Verify all states are reachable
- Check for conflicting control signals
Interface issues:
- Review protocol timing diagrams
- Check handshake signal relationships (valid before ready, stable until accepted)
- Verify backpressure handling
Hypothesis Generation Strategies
Backwards Tracing
Start at the failure point and work backwards:
- Identify the first wrong signal at failure timestamp
- Find all signals that directly drive it (combinational or registered)
- Check if those signals are correct one cycle earlier
- Repeat until you find where correct values become incorrect
Dependency Analysis
Map signal dependencies:
output_wrong [time=1250ns]
├─ driven by: alu_result (combinational)
│ ├─ operand_a (registered at 1249ns) ✓ correct
│ ├─ operand_b (registered at 1249ns) ✗ INCORRECT
│ └─ operation (registered at 1249ns) ✓ correct
└─ operand_b driven by: bypass_mux
├─ mem_result (registered at 1248ns) ✓ correct
├─ ex_result (registered at 1249ns) ✗ INCORRECT
└─ bypass_select ✗ WRONG MUX SELECT ← ROOT CAUSE
Differential Diagnosis
Compare working vs failing cases:
| Aspect | Working Case | Failing Case | Insight |
|--------|-------------|--------------|---------|
| Input pattern | 0x00000001 | 0x80000000 | MSB triggers bug |
| Execution path | State A→B→C | State A→B→D | Transition B→D buggy |
| Timing | No stalls | Pipeline stall | Stall logic incorrect |
Verification Techniques
Assertion-Based Isolation
Insert temporary assertions to partition the design:
// Check: Does problem occur before or after this pipeline stage?
property p_debug_stage2_input;
@(posedge clk) stage2_valid |-> stage2_input inside {[0:1000]};
endproperty
assert property (p_debug_stage2_input)
else $error("Problem exists at stage2 input");
Minimal Reproducer
Reduce test case to absolute minimum:
- Start with failing test
- Remove stimulus that doesn't affect failure
- Shorten simulation time to just before failure
- Remove unrelated RTL modules
- Result: ~20 line testbench, ~50 line RTL
Benefits: Faster iteration, easier to share, clearer root cause
Force/Release Experiments
Test hypotheses by overriding signals:
// Hypothesis: Bug disappears if bypass is disabled
initial begin
#100ns;
force top.cpu.bypass_enable = 1'b0;
// Observe if problem still occurs
end
Caution: Only for debugging, never in production code
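The snippet above forces the signal but never releases it. A fuller round trip is sketched below on a stand-alone net so it can be run in isolation; the module `force_release_demo` and its local `bypass_enable` net are hypothetical stand-ins for `top.cpu.bypass_enable`.

```systemverilog
module force_release_demo;
  timeunit 1ns;
  timeprecision 1ps;

  // Stand-in for the real signal: a net with a normal continuous driver.
  wire bypass_enable;
  assign bypass_enable = 1'b1;

  initial begin
    #100ns;
    force bypass_enable = 1'b0;   // override the driver for the experiment
    #50ns;
    $display("[%0t] forced:   bypass_enable=%b", $time, bypass_enable);  // shows 0
    release bypass_enable;        // a net returns to its driven value
    #1ns;
    $display("[%0t] released: bypass_enable=%b", $time, bypass_enable);  // shows 1
    $finish;
  end
endmodule
```

Note that a released variable (as opposed to a net) keeps the forced value until its next procedural assignment, so the observation window after `release` can differ between the two.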
Coverage-Guided Debugging
Use coverage holes to identify untested scenarios:
covergroup cg_state_transitions @(posedge clk);
cp_current: coverpoint state;
cp_next: coverpoint state_next;
cross cp_current, cp_next; // Are all transitions covered?
endgroup
If bug occurs: Check if failing scenario corresponds to coverage hole
Common Pitfalls
Don't Trust Assumptions
- ❌ Wrong: "Signal X is always stable, so I won't check it"
- ✅ Right: Add assertion to verify assumption, then proceed

Don't Skip Symptom Observation
- ❌ Wrong: Jump straight to suspected module and start modifying
- ✅ Right: Observe exact failure in waveform, then form hypothesis

Don't Fix Symptoms
- ❌ Wrong: Add logic to mask the symptom without understanding root cause
- ✅ Right: Trace to root cause, fix it, verify symptom disappears

Don't Test Multiple Changes
- ❌ Wrong: Make 3 changes simultaneously, rerun simulation
- ✅ Right: Change one thing at a time, verify effect
Waveform Analysis Patterns
Cause → Effect Tracing
- Find the symptom signal at failure timestamp
- Look 1-2 cycles back for potential causes
- Check if cause signals deviated from expected
- Repeat backwards until finding the origin
Critical Path Analysis
Identify longest combinational path:
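// RTL simulation without delay annotations runs in zero time, so $time cannot reveal long paths.
// Count logic levels by inspection and confirm with synthesis/STA timing reports.
always_comb begin
logic [31:0] temp1, temp2, temp3;
temp1 = input_a & input_b; // 1 logic level
temp2 = temp1 | input_c; // 1 logic level
temp3 = temp2 ^ input_d; // 1 logic level
output_z = temp3 + input_e; // 32-bit adder: several logic levels (carry chain)
// Total: 3 levels plus the adder - may violate timing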
end
Clock Domain Crossing Detection
Look for signals crossing without proper synchronization:
Clock A domain: signal_a toggles at time 1250ns
Clock B domain: signal_b samples signal_a at 1251ns
↑ METASTABILITY RISK if clocks unrelated
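For a single-bit signal, the usual remedy is a two-flop synchronizer in the receiving domain. The sketch below is a generic illustration rather than code from the design under debug; the module name `sync_2ff` and the `clk_b`/`rst_b_n` ports are assumptions, and `signal_a` is the name used in the example above.

```systemverilog
module sync_2ff (
  input  logic clk_b,          // receiving clock (clock B domain)
  input  logic rst_b_n,        // assumed active-low reset for domain B
  input  logic signal_a,       // launched in clock A domain
  output logic signal_a_sync   // safe to use in clock B domain
);
  logic meta;  // first stage: may go metastable, never used directly

  always_ff @(posedge clk_b or negedge rst_b_n) begin
    if (!rst_b_n) begin
      meta          <= 1'b0;
      signal_a_sync <= 1'b0;
    end else begin
      meta          <= signal_a;  // capture the asynchronous input
      signal_a_sync <= meta;      // second stage gives it a cycle to settle
    end
  end
endmodule
```

This covers only single-bit level signals; multi-bit buses and pulses need other schemes such as gray coding or handshakes.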
Integration with Other Skills
- dsim-debugging: Use when DSIM tool itself has issues (environment, waves, logs)
- rtl-coding-standards: Apply when fixing identified bugs to maintain code quality
- assertion-design: Create permanent assertions for bugs found during debugging
- mcp-workflow: Use MCP commands to compile/run debug experiments quickly
Summary
RTL debugging is systematic reasoning:
- Reproduce the problem reliably
- Observe symptoms without assumptions
- Generate hypotheses based on evidence
- Test each hypothesis independently
- Narrow down to single root cause
- Verify fix and prevent regression
Key principle: Evidence over intuition. Always trace from observed symptoms to root cause using waveforms and assertions.