The Verification Gap: Why AI Still Struggles with Hardware Testing

Every verification engineer I’ve talked with since this AI thing started has told me the same thing: “It’s far from excellent.” And honestly, it seems that unless you write fullstack JavaScript or Python, ChatGPT’s or Claude’s code generation will be pretty awful.

Even when you’re being super helpful—providing the model with the actual spec sheet, the RTL code, timing diagrams, and basically everything short of your coffee order—there are more holes in the generated testbench than in a block of Swiss cheese. You find yourself going back and forth with Claude, and it keeps telling you “Good catch! Nice gotcha! Cool bug!” while praising you all along the road to verification doom.

So what’s actually causing this? Is there even a benchmark for verification-related tasks? How about agentic-style tasks where the AI can actually use tools and iterate? We want that too!

Well, NVIDIA just answered our prayers (and confirmed our worst fears) with the Comprehensive Verilog Design Problems (CVDP) benchmark. And spoiler alert: the numbers are…


The Numbers Don’t Lie (Unfortunately)

Here’s what happens when you actually test state-of-the-art models on verification tasks:

  • RTL code generation: Claude 3.7 Sonnet gets 34% right. Not amazing, but workable.
  • Testbench stimulus generation: Drops to 25%. Okay, getting concerning.
  • Testbench checker generation: 6%. Yes, you read that right. Six percent.
  • Assertion generation: 19%. Better than checkers, still terrible.

For context, these are the same models that can write you a React app or debug your Python script like they’ve been doing it for years. But ask them to write a proper SystemVerilog testbench? Suddenly they’re as functional as a waterproof towel.

Why Verification Makes AI Cry

After digging into NVIDIA’s failure analysis (yes, they actually analyzed why these models fail), some patterns emerge that’ll make you nod your head if you’ve been in the verification trenches:

The “Looks Right, Works Wrong” Problem

You know that feeling when you get code back from an AI that looks syntactically correct, compiles fine, but when you actually run it, weird stuff happens? (I know you are nodding your head right now) The NVIDIA team found exactly this pattern. Models would generate testbenches that:

  • Mixed blocking and non-blocking assignments (from the paper’s brick sort case study) – because apparently timing is just a suggestion to these models
  • Put assertions in completely wrong places (the paper identifies “Misplaced SVA” as a major failure category)
  • Had zero bounds checking (the paper’s example: “fails to perform bounds checking before accessing data_array[2*pair_idx+2]”) – array access? What could go wrong?
  • Generated “insufficient coverage” (a direct quote from the paper’s failure analysis) that would make any coverage-driven verification engineer weep
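To make the first two failure modes concrete, here’s a minimal sketch of the blocking/non-blocking mix-up and the missing bounds check. The signal and array names are hypothetical, not taken from the paper:

```systemverilog
// WRONG: blocking assignment in a clocked process creates a race
// between this always block and anything else sampling 'count'.
always @(posedge clk) begin
    count = count + 1;        // blocking: updates immediately
end

// RIGHT: non-blocking assignment for sequential logic.
always @(posedge clk) begin
    count <= count + 1;       // non-blocking: updates after the clock edge
end

// WRONG: indexing with no bounds check -- the paper's
// data_array[2*pair_idx+2] example fails exactly this way.
value = data_array[2*pair_idx + 2];

// RIGHT: guard the index against the array size first.
if (2*pair_idx + 2 < $size(data_array))
    value = data_array[2*pair_idx + 2];
```

The blocking-assignment bug is especially nasty because the simulation often still “works” until some other process samples the signal in the same timestep.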

The Procedural vs. Declarative Brain Melt

Here’s something interesting: the same model that fails at testbench code can sometimes handle RTL reasonably well. Why? The researchers think it’s because:

  • RTL is declarative: “Here’s what this hardware does”
  • Testbenches are procedural: “Here’s how to poke this hardware and check if it’s lying to you”

Since LLMs are trained mostly on software (which is procedural), you’d expect them to be better at testbenches. Instead, they handle RTL’s declarative nature better than verification’s procedural “do this, then that, then check this other thing” flow. It’s like they’re better at describing a car than actually driving one.

Here’s the key difference:
RTL is declarative (describes WHAT hardware IS):

always_comb begin
    output_valid = input_valid && ready;  // "output IS this condition"
    output_data = input_data + offset;    // "output IS this calculation"
end

Testbenches are procedural (step-by-step instructions like software):

initial begin
    reset = 1;           // Step 1: Assert reset
    #10;                 // Step 2: Wait 10 time units
    reset = 0;           // Step 3: Deassert reset  
    data_in = 8'h55;     // Step 4: Apply test data
    check_output();      // Step 5: Verify result
end

RTL’s declarative nature is different enough from software that LLMs don’t try to apply their software training directly. But testbenches look like software, so LLMs try to apply their procedural knowledge while missing hardware-specific timing and verification nuances.


The “What Did I Forget to Test?” Blindness

This one hits home for anyone who’s done verification. AI doesn’t seem to understand the concept of “coverage.” It’ll test the happy path, maybe a few obvious error cases, but completely miss:

  • Edge cases (because who cares about corner conditions, right?)
  • Timing relationships (clock domain crossings are apparently optional)
  • Reset sequences (reset? What reset?)
  • The 47 other things you didn’t think to explicitly tell it to test
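For anyone who hasn’t had to stare it down, coverage in this sense is explicit and measurable, which is exactly why the gaps are so visible. A minimal sketch (the covergroup and signal names are made up for illustration):

```systemverilog
// Hypothetical covergroup: once instantiated and sampled, the simulator
// reports which bins were never hit -- the blind spot described above.
covergroup cg_input @(posedge clk);
    cp_data : coverpoint data_in {
        bins zero    = {0};
        bins max_val = {8'hFF};       // edge cases the AI tends to skip
        bins mid[4]  = {[1:8'hFE]};   // the "happy path" it does cover
    }
    cp_reset : coverpoint reset;      // was reset ever exercised at all?
    cross cp_data, cp_reset;          // data arriving around reset
endgroup
```

When an AI-generated testbench only drives the happy path, the `zero`, `max_val`, and cross bins stay empty, and the coverage report says so in black and white.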

Real Talk: What This Means for Us

If You’re Using AI for Verification Today

Stop. Just… stop. Or at least treat anything it gives you like radioactive material that needs extensive decontamination. The 6% pass rate on testbench checkers isn’t a typo: it’s a warning.

If You’re a Verification Engineer Worried About Job Security

Relax. At this rate, you’ll be retirement age before AI can reliably generate a proper UVM testbench. Your expertise in thinking about what could go wrong is still very much needed.

If You’re Managing a Verification Team

Maybe don’t bank your project timeline on AI-generated verification just yet. Use it for initial scaffolding if you want, but budget for human review of… well, everything.

The Silver Lining (There Actually Is One)

Before you throw your laptop out the window, there’s some good news buried in this research:

We finally have data. Instead of just complaining that “AI verification sucks” (which we all knew), we now have specific, quantified ways it sucks. The NVIDIA team identified exact failure patterns:

  • Timing violations
  • Coverage gaps
  • Assertion misplacement
  • Synchronization issues
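“Assertion misplacement” deserves a concrete picture. A sketch of the difference, with hypothetical signals:

```systemverilog
// WRONG: an immediate assertion buried in stimulus code only checks
// at the single instant this statement executes.
initial begin
    req = 1;
    assert (gnt) else $error("no grant");  // checks once, probably too early
end

// RIGHT: a concurrent assertion at module scope, clocked and
// reset-aware, checks the protocol on every cycle.
assert property (@(posedge clk) disable iff (!rst_n)
    req |-> ##[1:3] gnt)
else $error("gnt did not follow req within 3 cycles");
```

The misplaced version compiles and even passes sometimes, which is the “looks right, works wrong” problem all over again.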

This isn’t just academic hand-waving; it’s a roadmap. We know what’s broken, which means we (or the AI companies) can actually fix it.

Agentic approaches exist. The benchmark includes “agentic” tasks where AI can use tools, iterate, and basically act more like a human engineer than a one-shot code generator. The results are still not great, but they’re testing the right things.

The Bottom Line

Look, I wanted AI verification to work as much as anyone. The idea of having Claude whip up a comprehensive testbench while I grab coffee was appealing. But the CVDP benchmark confirms what we’ve all been experiencing: we’re not there yet.

The good news? At least now we know why we’re (probably) not there yet, and that’s the first step to getting there eventually.

Until then, keep your verification skills sharp. AI might help you write that first testbench faster, but you’re still going to be the one figuring out what you forgot to test. And in verification, what you don’t test will bite you.


Have your own AI verification horror stories? I’d love to hear them. Because honestly, we’re all in this together.
