HSI Debug Agent: An AI Regression Analyzer for Chip Verification

As a verification engineer, I spend a significant chunk of my time analyzing regression results: grepping through massive log files, cross-referencing register values against an Excel spec database with 10,000+ rows across multiple sheets, and trying to figure out whether a mismatch is a real bug or just a spec change someone forgot to mention. It’s the kind of work that eats hours and demands attention, but doesn’t really require engineering judgment until the very end.

So I built an AI agent to do the mechanical part for me (the lookup, the triage, and the first-pass conclusion), and it is faster than the manual workflow at each of those steps.

The Problem

Hardware-Software Integration (HSI) testing verifies that hardware registers behave as expected:

Default values match the spec
Registers can be written and read back correctly
Software can properly configure hardware components

When a regression run completes, it produces log files full of error lines like this:

ERROR: test_hsi_reg_db["test_name"].regs["FaultControl_3"].flds["CORECHECKSUM"]:
expected_field_value = 0x1 actual_field_value = 0x0

Now multiply that by hundreds or thousands of mismatches across dozens of tests. The manual workflow looks something like this:

Extract errors from log files (grep, custom scripts)
Look up each register and field in the spec database — an Excel file with 10,000+ rows
Cross-reference test configuration to understand the context
Decide whether each mismatch is a real bug or an expected change:
- Did someone update the spec?
- Has the test setup changed?
- Was the register deliberately altered during boot?
Categorize and prioritize the real issues

Each Excel lookup takes about 10 seconds manually, even after you already know the recurring registers and field names. Multiply that by ~10 errors per log and 10–20 logs per regression, and the lookups alone come to 1,000–2,000 seconds (roughly 17–33 minutes) of a busy afternoon, all before the real debugging even begins.

The agent collapses all three of the human steps — lookup, triage, conclusion — into a single natural-language request, and it’s faster at each.

The Approach: LLM as Orchestrator, Tools as Executors

I used a GPT-4-class model as an orchestrator and gave it a set of purpose-built tools that encode the domain knowledge. The model decides when and how to use each tool based on natural-language queries; the tools do the deterministic, testable work.

The model isn’t hardcoded to a vendor. It’s wired through an OpenAI-compatible interface, so the model id, base URL, temperature, and timeout all come from config rather than the source:

agent = AuthRecoveringAgent(
    name="hsi_agent",
    session_id=session_id,
    model=OpenAILike(                 # any OpenAI-compatible endpoint
        id=cfg.llm.model,             # model, base_url, temperature — all config-driven
        base_url=cfg.llm.base_url,
        temperature=cfg.llm.temperature,
        timeout=float(cfg.llm.timeout_seconds),
    ),
    tools=[
        parse_hsi_list_file, extract_hsi_errors, lookup_field_in_hsi_db,
        create_hsi_db_snapshot, list_hsi_db_snapshots, compare_hsi_db_snapshots,
        list_regression_runs, find_test_across_regression_history,
        list_failed_tests_in_regression_run, list_failed_tests_latest_run,
        compare_hsi_log_extractions,
    ],
    db=SqliteDb(db_file=cfg.sqlite_db_file),   # persists session history
    system_message=get_system_message(),
    add_history_to_context=True,
)

The framework is Agno — I’ve used it before and it has the fast response and loading times I need for a tool engineers interact with throughout the day. Its tool model means adding a capability is trivial: write a function, list it, and the model discovers it from the signature and docstring. Two production details worth calling out:

AuthRecoveringAgent is a thin subclass that transparently recovers when the model endpoint’s auth token expires mid-session — important for a long-running internal tool that an engineer might leave open all day.
The agent is not stateless. Session history is persisted to SQLite (db=SqliteDb(...), add_history_to_context=True), so a conversation can build on its earlier findings instead of re-deriving them.
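The recover-on-expiry behaviour can be illustrated with a small retry wrapper. This is only a sketch of the general pattern, not the actual AuthRecoveringAgent code: AuthExpiredError, call_with_auth_recovery, and the refresh_token callback are hypothetical names, and the real subclass presumably hooks into Agno's model calls rather than a bare callable.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthExpiredError(Exception):
    """Hypothetical error for an endpoint rejecting an expired token."""


def call_with_auth_recovery(
    request: Callable[[], T],
    refresh_token: Callable[[], None],
    max_retries: int = 1,
) -> T:
    """Run request(); on an auth failure, refresh the token and retry."""
    attempt = 0
    while True:
        try:
            return request()
        except AuthExpiredError:
            if attempt >= max_retries:
                raise  # refreshing did not help, so surface the failure
            refresh_token()
            attempt += 1
```

The retry count is bounded so that a permanently revoked credential fails fast instead of looping.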

The Evolution: From In-RAM Cache to Snapshot-Only

The most interesting part of this project isn’t the agent — it’s how the data layer evolved through three phases, each one removing a weakness the previous one couldn’t.

The root tension: Excel is the team’s source of truth, but it’s a terrible runtime dependency. Loading the spec workbook into pandas takes ~30 seconds, and the file is ephemeral — it gets overwritten in place, so there’s no history to compare against.

Phase 01: Singleton RAM cache. Sub-second lookups via an in-memory (register, field) hash index. Weakness: the cache dies with each session and holds only “now”, with no history.
Phase 02: SQLite snapshots. Spec history on disk; diff any two dates to spot changes. Weakness: two lookup paths doing essentially one job.
Phase 03: Snapshot-only. One indexed store serves current values and history, and Excel falls out of the runtime path entirely. The arc resolves here.

Phase 1 — Singleton cache + hash index

The first version loaded Excel once per session into a singleton, then built an in-memory (register, field) hash index so every lookup was O(1) instead of an O(n) row scan:

def _build_index(self) -> None:
    """Build {(register_name, field_name): [matching_rows]} for O(1) lookups."""
    self._index = {}
    for sheet_name in ["Sheet_HW", "Sheet_SW", "Sheet_Core", "Sheet_IPXact"]:
        df = pd.read_excel(self.db_file_path, sheet_name=sheet_name)
        current_register = None
        for idx, row in df.iterrows():
            reg, field = row.get("Register Name"), row.get("Field Name")
            if pd.notna(reg):
                current_register = str(reg)
            if pd.notna(field) and current_register:
                key = (current_register, str(field))
                self._index.setdefault(key, []).append({
                    "worksheet": sheet_name, "row_idx": idx,
                    "data": self._row_to_dict(row),
                })

This made per-lookup time go from ~10 seconds of manual Excel scrolling to ~0.001 seconds — roughly 10,000x faster per lookup. Weakness: the cache dies with the session, and RAM only ever holds now. There’s no way to answer “did this register change yesterday?”

Phase 2 — SQLite snapshots for history

To answer change-over-time questions, I added a snapshot system: convert the Excel workbook into a timestamped, indexed SQLite file on disk. A cron job creates daily snapshots, the agent can trigger an on-demand snapshot when Excel changes mid-day, and old snapshots roll off after 14 days.

from typing import Any, Dict, List

def compare_hsi_db_snapshots(registers: List[str], date1: str, date2: str) -> Dict[str, Any]:
    """Compare register field values between two snapshot dates."""
    # Query both snapshots, build lookup dicts, diff field values
    # Returns structured diff: added, removed, modified fields
    ...
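To make the "query both, build lookup dicts, diff" outline concrete, here is a minimal sketch. It assumes each snapshot exposes a table named fields with register, field, and value columns; that schema, the helper _load_fields, and the function name diff_snapshots are illustrative assumptions rather than the project's real layout. It takes open connections instead of dates so it stays self-contained.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (register, field)


def _load_fields(conn: sqlite3.Connection, registers: List[str]) -> Dict[Key, str]:
    """Return {(register, field): value} for the requested registers."""
    if not registers:
        return {}
    placeholders = ",".join("?" for _ in registers)
    rows = conn.execute(
        "SELECT register, field, value FROM fields "
        f"WHERE register IN ({placeholders})",
        registers,
    )
    return {(reg, fld): val for reg, fld, val in rows}


def diff_snapshots(
    old: sqlite3.Connection, new: sqlite3.Connection, registers: List[str]
) -> Dict[str, Any]:
    """Diff two snapshots into added, removed, and modified fields."""
    before = _load_fields(old, registers)
    after = _load_fields(new, registers)
    return {
        "status": "ok",
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": {
            key: {"old": before[key], "new": after[key]}
            for key in before.keys() & after.keys()
            if before[key] != after[key]
        },
    }
```

Keying both sides by (register, field) turns the diff into plain set arithmetic on the dictionary keys.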

Weakness: now there were two lookup paths — current values from the RAM cache, history from SQLite — doing essentially the same job twice.

Phase 3 — Snapshots for everything

The realization that collapsed the design: if every spec value already lives in a fast, indexed, persisted snapshot, why load Excel at runtime at all? So current-value lookups were pointed at the latest snapshot too. Now one store serves both questions — “what is this value now?” and “what was it last Tuesday?” — and Excel is fully out of the runtime path. It’s read exactly once, at snapshot-creation time.

Excel workbook (source of truth, slow to load, ephemeral)
        │  read once, at snapshot creation (~30s)
        ▼
SQLite snapshots on disk  ──►  current lookups  (latest snapshot)
(indexed, 14-day retention) ─►  history diffs    (any two dates)
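The flow in the diagram can be sketched in a few functions. This is an illustration under stated assumptions, not the project's code: the file naming scheme hsi_<date>.sqlite, the fields table and its columns, and the function names are all hypothetical, and the rows are passed in directly instead of being read from Excel.

```python
import sqlite3
from datetime import date
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str, str, str]  # (worksheet, register, field, default_value)


def create_snapshot(rows: Iterable[Row], out_dir: Path, day: date) -> Path:
    """Write rows to hsi_<YYYY-MM-DD>.sqlite, indexed on (register, field)."""
    path = out_dir / f"hsi_{day.isoformat()}.sqlite"
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS fields")
            conn.execute(
                "CREATE TABLE fields "
                "(worksheet TEXT, register TEXT, field TEXT, default_value TEXT)"
            )
            conn.executemany("INSERT INTO fields VALUES (?, ?, ?, ?)", rows)
            conn.execute("CREATE INDEX idx_reg_field ON fields (register, field)")
    finally:
        conn.close()
    return path


def latest_snapshot(out_dir: Path) -> Optional[Path]:
    """ISO dates sort lexicographically, so the last filename is the newest."""
    snapshots = sorted(out_dir.glob("hsi_*.sqlite"))
    return snapshots[-1] if snapshots else None


def lookup_latest(out_dir: Path, register: str, field: str) -> List[Dict[str, str]]:
    """Serve a current-value lookup from the newest snapshot on disk."""
    path = latest_snapshot(out_dir)
    if path is None:
        return []
    conn = sqlite3.connect(path)
    try:
        cur = conn.execute(
            "SELECT worksheet, default_value FROM fields "
            "WHERE register = ? AND field = ?",
            (register, field),
        )
        return [{"worksheet": ws, "default_value": val} for ws, val in cur]
    finally:
        conn.close()
```

Because the same files answer both "latest" and "any date" queries, the only difference between a current lookup and a history lookup is which file gets opened.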

What happens when a snapshot is missing? The lookup tools target the latest snapshot; if none exists at all, the agent calls create_hsi_db_snapshot (a one-time ~30s cold start) and then queries it. If a specific historical date is requested but unavailable — beyond the 14-day window, or a day with no run — the past can’t be reconstructed, so the agent calls list_hsi_db_snapshots, reports what’s actually on disk, and works from the nearest available date instead of guessing.
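The fallback rule described above can be written down as a small pure function. This is a sketch of the decision logic only; resolve_snapshot_date and the shape of its return value are assumptions made for illustration, and the tie-break toward the earlier date is an arbitrary choice.

```python
from datetime import date
from typing import Any, Dict, List


def resolve_snapshot_date(requested: date, available: List[date]) -> Dict[str, Any]:
    """Pick the snapshot to use for a requested date, never inventing one."""
    if not available:
        # Nothing on disk at all: the caller should create a snapshot first.
        return {"status": "missing", "action": "create_snapshot", "date": None}
    if requested in available:
        return {"status": "ok", "date": requested, "exact": True}
    # Nearest by absolute day distance; ties resolve to the earlier date.
    nearest = min(available, key=lambda d: (abs((d - requested).days), d))
    return {
        "status": "ok",
        "date": nearest,
        "exact": False,
        "available": sorted(available),
    }
```

Returning the full list of available dates alongside the substitute lets the agent tell the user exactly what was requested, what was used, and what else exists.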

The Tools

Eleven tools, organized into ingest, spec lookup + spec history, and regression-run history — plus a boot-sequence checker. The two history groups are orthogonal time axes: one tracks how the spec changed, the other how the tests behaved across runs.

Ingest & test context

parse_hsi_list_file reads the test-list file to extract the context that defines what “correct” even means for a test: which defines are set, which libraries are included, and the test type. A test compiled with one set of defines compares against entirely different expected values than another — without this, the agent would flag legitimate values as mismatches.
The boot-sequence checker (CHECK_BOOT) searches the boot-sequence files that run before any test .cpp to determine whether a register was deliberately altered during boot. So when a post-boot value doesn’t match the hardware default, the agent knows it’s expected — not a bug.
extract_hsi_errors parses logs into structured mismatches. It also includes a watchdog for the trickiest failures: timeouts. When a read operation hangs and never emits an ERROR line — the log just stops progressing — the extractor tracks the expected progression of read phases and flags a silent timeout that would otherwise slip through entirely.
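For the extraction step, a regular expression over the error format shown earlier is enough to produce structured mismatches. The sketch below assumes every mismatch follows that exact two-line shape with hex values; the pattern, the function name extract_errors, and the output keys are illustrative, and the silent-timeout watchdog is not covered here.

```python
import re
from typing import Dict, List

# Matches the ERROR line and the expected/actual line that follows it.
ERROR_RE = re.compile(
    r'ERROR:\s*test_hsi_reg_db\["(?P<test>[^"]+)"\]'
    r'\.regs\["(?P<register>[^"]+)"\]'
    r'\.flds\["(?P<field>[^"]+)"\]:\s*'
    r"expected_field_value\s*=\s*(?P<expected>0x[0-9a-fA-F]+)\s+"
    r"actual_field_value\s*=\s*(?P<actual>0x[0-9a-fA-F]+)"
)


def extract_errors(log_text: str) -> List[Dict[str, object]]:
    """Turn raw log text into a list of structured mismatch records."""
    errors = []
    for match in ERROR_RE.finditer(log_text):
        errors.append({
            "test": match["test"],
            "register": match["register"],
            "field": match["field"],
            "expected": int(match["expected"], 16),  # hex string to int
            "actual": int(match["actual"], 16),
        })
    return errors
```

Converting the values to integers at parse time means later comparisons are not thrown off by formatting differences such as 0x1 versus 0x01.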

Spec lookup + spec history (SQLite)

lookup_field_in_hsi_db — expected values for a register field, served from the latest snapshot.
create_hsi_db_snapshot — the only place Excel is ever touched; converts the workbook into a timestamped, indexed SQLite file.
list_hsi_db_snapshots — enumerates what’s on disk (and powers the graceful missing-snapshot path above).
compare_hsi_db_snapshots — diffs any two dates to answer “did this spec change, and when?”

Regression-run history

list_regression_runs / list_failed_tests_latest_run / list_failed_tests_in_regression_run — navigate runs and their failures.
find_test_across_regression_history — track a single test’s fate across many runs.
compare_hsi_log_extractions — diff the extracted errors of two runs to answer “what broke or got fixed since last regression?”
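Diffing two runs reduces to set operations once each mismatch has a stable identity. The sketch below assumes a mismatch is identified by its (test, register, field) triple; that key, the function name compare_extractions, and the three result buckets are illustrative choices, not the tool's actual output format.

```python
from typing import Dict, Iterable, List, Tuple

ErrorKey = Tuple[str, str, str]  # (test, register, field)


def compare_extractions(
    previous: Iterable[ErrorKey], current: Iterable[ErrorKey]
) -> Dict[str, List[ErrorKey]]:
    """Classify mismatches as new, fixed, or persisting between two runs."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),         # broke since the previous run
        "fixed": sorted(prev - curr),       # no longer failing
        "persisting": sorted(prev & curr),  # failing in both runs
    }
```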

Tool Design: Structured Returns, Never Exceptions

Tools never raise exceptions to the agent. Every tool returns a structured dictionary with a status field, so the model can handle errors gracefully, explain what went wrong in plain English, and keep going even after partial failures:

from agno.tools import tool  # Agno's tool decorator

@tool
def lookup_field_in_hsi_db(register_name: str, field_name: str, instance_path: str) -> dict:
    """Look up register field information in the latest HSI spec snapshot."""
    try:
        # query_latest_snapshot is the project's own snapshot query helper
        result = query_latest_snapshot(register_name, field_name, instance_path)
        return result  # {"status": "ok", "candidates": [...], "ambiguous": False}
    except Exception as e:
        # Never let the exception reach the model; report it as data instead
        return {"status": "error", "error": f"Lookup failed: {e}"}

This was the difference between a fragile demo and a tool the team actually trusts. The rich docstring matters too — the model reads it to decide when to call the tool, so clear type hints and parameter descriptions translate directly into correct calls.
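One way to apply the structured-return convention uniformly is a small wrapper applied to every tool function. This is a sketch of the pattern rather than the project's code: structured_tool and the parse_hex example are hypothetical, and it assumes each tool returns a dictionary on success.

```python
import functools
from typing import Any, Callable, Dict


def structured_tool(func: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so failures come back as data instead of exceptions."""

    @functools.wraps(func)  # keeps the original name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            return {"status": "error", "error": f"{func.__name__} failed: {exc}"}
        result.setdefault("status", "ok")  # success unless the tool said otherwise
        return result

    return wrapper


@structured_tool
def parse_hex(text: str) -> Dict[str, Any]:
    """Parse a hex literal such as '0x1F' into an integer."""
    return {"value": int(text, 16)}
```

Calling parse_hex("0x1F") yields a dictionary with status "ok", while parse_hex("zz") yields status "error" with a readable message, so the caller handles both paths the same way.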

Teaching the Agent New Tricks: A Cursor/Claude Skill

As new error types and edge cases emerge, the agent needs to evolve. Rather than hand-editing tools every time, I built a skill file that works with AI coding assistants like Cursor and Claude Code. It encodes the project’s architecture, coding patterns, and tool conventions, so when I need to add support for a new error type, I describe what I need in natural language and the assistant produces code — or recommends a system-prompt edit — that fits the existing patterns.

This creates a tight feedback loop: the agent finds errors → I discover a pattern it can’t handle → I use the skill to teach it the new pattern → it handles that pattern next time. It significantly shortens the iteration cycle for extending the agent.

Real-World Usage

The team uses the HSI Debug Agent through a CLI. A typical interaction looks like this:

User: "Analyze the regression logs in /path/to/logs"

Agent: "Found 247 mismatches across 18 tests.
       Let me check the spec snapshot and recent changes...

       Analysis complete:
       • 189 mismatches match recent spec changes (expected) —
         FaultControl_3 defaults changed 0x0 → 0x1 yesterday
       • 58 require attention:
         - 23 in network controller (Read Default phase)
         - 35 in DMA channels (Read After Write phase)

       The network controller errors look concerning —
       FaultControl_3.CORECHECKSUM shows unexpected value changes..."

The agent understands the difference between Read Default phase errors (hardware initialization issues) and Read After Write phase errors (functional issues), and gives phase-specific debugging guidance. Because it can reach across both history axes, it answers questions a single log can’t: “Did this register change in the spec this week?” (snapshot diff) and “When did this test start failing?” (regression history), each in well under a second.
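The triage in the transcript can be approximated with one rule plus a phase hint. The sketch below is an assumption about how such a rule could look, not the agent's actual reasoning: a mismatch counts as expected when the test expects the new spec value and the hardware still returns the old one. The names triage and PHASE_HINTS and the record layout are hypothetical.

```python
from typing import Any, Dict, Tuple

# Phase meanings as described in the text above.
PHASE_HINTS = {
    "Read Default": "hardware initialization issue",
    "Read After Write": "functional issue",
}

# {(register, field): (old_value, new_value)} taken from a spec snapshot diff.
SpecChanges = Dict[Tuple[str, str], Tuple[int, int]]


def triage(mismatch: Dict[str, Any], spec_changes: SpecChanges) -> Dict[str, str]:
    """Label one mismatch as an expected spec change or as needing attention."""
    change = spec_changes.get((mismatch["register"], mismatch["field"]))
    if change is not None:
        old, new = change
        # Test expects the new spec value; hardware still reports the old one.
        if mismatch["expected"] == new and mismatch["actual"] == old:
            return {"verdict": "expected", "reason": "matches recent spec change"}
    hint = PHASE_HINTS.get(mismatch.get("phase"), "unknown phase")
    return {"verdict": "needs_attention", "reason": hint}
```

Anything the rule cannot explain falls through to "needs_attention", which keeps the final judgment with the engineer.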

Performance Summary

Single spec lookup: ~10,000× faster (human in Excel: 10 s; agent, cached: 0.001 s)
Full regression triage: hours → minutes (manual: 180 min; agent: 5 min)

Metric	Before (manual)	After (HSI Debug Agent)	Notes
Full regression triage	Hours	Minutes	Lookup + triage + conclusion all automated
Single spec lookup	~10s (human in Excel)	~0.001s (indexed snapshot)	~10,000x faster per lookup
Spec DB → queryable form	~30s (pandas load)	~30s, one-time at snapshot	Excel never loaded at runtime again
Snapshot storage	~13 MB (Excel)	~100 MB (indexed SQLite)	Bigger on disk, but persisted and queryable
Historical comparison	Manual Excel archaeology	Sub-second SQLite diff	Two axes: spec history + regression history

The SQLite snapshots are larger than the source Excel files — the indexing and structured schema add overhead. But that’s the right trade: what you lose in disk space you gain in query speed, persistence, and the ability to compare across dates programmatically.

Lessons Learned

LLMs + tools beat custom models for structured workflows. I didn’t train anything. A capable model with good tools and clear docstrings handles the orchestration remarkably well. The domain knowledge lives in the tools, not in model weights.

Build reliable tools first, add the agent second. Every tool runs standalone and is tested independently of the model. The agent is just orchestration on top of a solid foundation — which keeps iteration fast and failures localized.

Structured error handling makes agents robust. When tools return structured dicts instead of raising, the agent can explain failures in plain English and continue. That robustness, rather than any claim of zero hallucination, is what earned the team’s trust.

Requirements emerge from usage — design for that. Phase 1 solved “analyze current errors fast.” Real use surfaced “when did this change?”, which became the snapshot system, which in turn made the original cache redundant. Each phase replaced the previous approach’s weakness instead of bolting on alongside it — and the payoff was Excel falling out of the runtime path entirely.