Draft done with AI during demo

Your “30% AI Visibility Score” Is Statistically Meaningless

I did the math on agent SEO benchmarks

May 2026

A bunch of companies are pitching “agent SEO” and benchmarking tools right now. Profound, Peec AI, AIclicks, Adobe's LLM Optimizer, LLMrefs, Semrush... the list keeps growing. They all promise to tell you how often Claude, ChatGPT, Perplexity, or Gemini recommend your dev tool instead of a competitor.

They give you a percentage. A “visibility score.” A “brand mention rate.”

And I keep seeing founders and marketers treat that number like gospel, making strategy decisions off it, shifting budgets, panicking because they went from 28% to 19% between two reports.

So I sat down and did the math. And the math says: most of these numbers are noise.

How the benchmarking works

The idea is simple. You pick a set of prompts that a potential user might ask an LLM... “what's the best serverless database,” “recommend a tool for X,” that kind of thing. You run each prompt some number of times, count how often your product gets mentioned, and calculate a percentage.

If you test a prompt 10 times and see your tool mentioned 3 times, the intuition says 30% coverage. Your tool shows up in roughly a third of AI answers. Sounds reasonable.

The problem is, 3 out of 10 is just one roll of the dice... the error margin is enormous and almost nobody talks about it.

The binomial distribution tells you why

This is a classic binomial probability setup. You have n trials (the number of times you run the prompt), each trial has two outcomes (your tool is mentioned or it isn't), and you want to know how likely a specific result is given some “true” underlying probability p.

The formula:

P(k) = C(n,k) · p^k · (1−p)^(n−k)

Where C(n,k) is “n choose k,” the number of ways to pick k successes from n trials.

Let's plug in the numbers. If the real coverage is exactly 30% (p = 0.3), and you run 10 trials (n = 10), what's the probability of seeing exactly 3 mentions (k = 3)?

P(3) = C(10,3) · 0.3³ · 0.7⁷ ≈ 120 · 0.027 · 0.0824 ≈ 0.267

About 26.7%. So even if your tool truly is a “30% tool,” you'll only see that exact result roughly one in four times. The other three times out of four, you'll see 2 mentions, or 4, or 1, or 5... and each of those gives you a wildly different “visibility score.”
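
If you want to see the whole spread rather than just the k = 3 case, here is a minimal Python sketch of the formula above. The function name binomial_pmf is mine, not something from a vendor tool, and it assumes every run of the prompt is independent with the same true rate p.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k mentions in n runs when the true rate is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Every possible outcome of a 10-run benchmark for a true "30% tool".
n, p = 10, 0.3
for k in range(n + 1):
    prob = binomial_pmf(k, n, p)
    print(f"{k:2d} mentions -> reported score {k / n:4.0%}, probability {prob:.3f}")
```

The row for 3 mentions should match the 0.267 worked out by hand, and the neighbouring rows show how much probability sits on scores of 20% and 40%.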

The confidence interval is where it gets ugly

Let's calculate a 95% confidence interval for our 3/10 result.

The standard normal-approximation formula for a 95% confidence interval of a proportion, where p is the mention rate you observed, is:

p ± 1.96 · √(p · (1−p) / n)

Plugging in p = 0.3 and n = 10:

0.3 ± 1.96 · √(0.3 · 0.7 / 10) = 0.3 ± 0.284

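So the 95% interval runs from about 2% to 58%. (The normal approximation is rough at n = 10, but methods built for small samples give an interval that is just as wide.)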

You ran 10 tests, got 30%, and the actual coverage could be anywhere from nearly zero to nearly sixty percent. A sample size of 10 tells you almost nothing.
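
Here is a small Python sketch of that calculation. wald_interval is the formula used above. wilson_interval is my addition, not part of the original math: it is a common alternative that tends to behave better at small n, included only to show that the interval stays wide whichever method you pick. Both function names are illustrative.

```python
from math import sqrt

Z_95 = 1.96  # z-score for a two-sided 95% interval


def wald_interval(k: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Normal-approximation interval, the formula used in the text."""
    p_hat = k / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1], since a proportion cannot leave that range.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


def wilson_interval(k: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval, usually better behaved for small n."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 3 mentions out of 10 runs, under both methods.
for name, fn in [("wald", wald_interval), ("wilson", wilson_interval)]:
    low, high = fn(3, 10)
    print(f"{name:>6}: {low:.1%} to {high:.1%}")
```

Swap in the n and k from your own report to see how wide your interval really is.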

What this means in plain English

A confidence interval is not “the range where the answer probably is.” It means: if you repeated this entire experiment 100 times, about 95 of those experiments would produce an interval that contains the true value. It's a statement about the reliability of the method, not a prediction about where the truth sits.

But practically, what it means for your agent SEO report is this: the number they gave you is compatible with a huge range of actual values. You cannot tell the difference between a tool that gets recommended 5% of the time and one that gets recommended 55% of the time... not from 10 samples.

How many samples do you actually need?

This is the question nobody asks the vendor. Let's figure it out.

If you want a confidence interval that's +/- 5 percentage points (so if the true value is 30%, your interval would be 25-35%), you can rearrange the formula:

n = (1.96² · p · (1−p)) / margin²

For p = 0.3 and a margin of 5 percentage points (0.05):

n = (3.84 · 0.21) / 0.0025 ≈ 322.6, which rounds up to 323 samples

You need about 323 runs of each prompt to get a margin of +/- 5 percentage points. For +/- 3 points, you need around 897 samples. Per prompt.
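
The rearranged formula is easy to turn into a helper. This is a sketch under the same normal-approximation assumption as above; required_samples is a name I made up for illustration.

```python
from math import ceil


def required_samples(p: float, margin: float, z: float = 1.96) -> int:
    """Runs needed for a +/- margin interval at the given z, rounded up."""
    return ceil(z**2 * p * (1 - p) / margin**2)


# Expected rate of 30%, at three different target margins.
for margin in (0.10, 0.05, 0.03):
    print(f"+/- {margin:.0%}: {required_samples(0.3, margin)} runs per prompt")
```

If you have no idea what p is going to be, plugging in p = 0.5 gives the worst case, because p · (1−p) is largest there.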

Now think about what that costs. Each API call to Claude or GPT-4 costs money. If you're testing 50 prompts at 323 runs each, that's 16,150 API calls per benchmark round. At, say, $0.03 per call on average, that's about $485 per report. Not insane, but not cheap either, especially if you're running weekly benchmarks.

Most tools I've seen run somewhere between 5 and 20 repetitions per prompt. At 10 reps and a score around 30%, your margin of error is +/- 28 percentage points. The number on the dashboard might as well be random.
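
To see the trade-off between precision and spend in one place, here is a rough Python sketch. The 50 prompts and the $0.03 per call are the illustrative figures from the cost example, not real vendor pricing, and margin_of_error is a hypothetical helper built on the same normal approximation.

```python
from math import sqrt

PROMPTS = 50          # prompt count from the cost example
COST_PER_CALL = 0.03  # assumed average cost per API call in USD


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval."""
    return z * sqrt(p * (1 - p) / n)


# Margin and cost per benchmark round for a score near 30%.
for reps in (5, 10, 20, 100, 323, 1000):
    margin = margin_of_error(0.3, reps)
    cost = PROMPTS * reps * COST_PER_CALL
    print(f"{reps:5d} reps: +/- {margin * 100:4.1f} points, ${cost:,.2f} per round")
```

The margin shrinks with the square root of the rep count, so halving it means roughly quadrupling your API bill.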

The intuition

Think of it like polling. When Gallup runs a presidential poll, they don't ask 10 people. They ask 1,000+ people and still report a margin of error of +/- 3%. That's because statistical precision doesn't come cheap... you need a lot of data points to narrow the range.

Agent SEO benchmarks are doing the polling equivalent of asking your family of four who they're voting for and then publishing the results as a national forecast.

Who's building these tools

The space is growing fast. Here's who I've seen so far:

Profound (tryprofound.com)... probably the most visible one, focused on “AI search optimization.” They track brand mentions across ChatGPT, Perplexity, Claude, and Gemini.
AIclicks (aiclicks.io)... prompt analytics and LLM visibility tracking.
Peec AI... similar positioning, benchmarking your brand against competitors in AI answers.
Adobe LLM Optimizer... Adobe entered the game with an enterprise-grade tool. That tells you the market is real, even if the methodology needs work.
LLMrefs (llmrefs.com)... tracks citations and rank across generative search engines.
Semrush... added AI visibility features to their existing SEO suite.
Otterly, Scrunch AI, and several others I keep seeing at conferences.

The tools and the market are real, and visibility tracking for AI answers is a genuinely important thing to measure... the problem is that most of these tools present results with sample sizes that make the data unreliable, and almost none of them show you the confidence interval alongside the score.

What should you actually do?

I'm not saying ignore agent SEO. If LLMs are becoming a discovery channel for your product (and for dev tools, they definitely are), you should care about how you show up. But:

Ask your vendor about sample sizes. If they can't tell you how many times they run each prompt, that's a red flag. If the answer is fewer than 100, treat the results as directional at best.
Look at trends over time, not point estimates. A single report saying “32% visibility” is meaningless. Twenty reports over five months showing a steady climb from the low 20s to the high 30s... that's a signal, even with noisy individual data points.
Don't panic over week-to-week changes. If your score drops from 30% to 22%, that might just be sampling noise. At small sample sizes the confidence intervals around those two numbers overlap so heavily that they are statistically indistinguishable.
Do the math yourself. The binomial distribution isn't hard. If someone gives you a visibility score, ask for n (sample size) and k (mentions), then calculate the confidence interval. If it's wider than +/- 10 percentage points, the score is decoration.
Focus on what you can control. Good documentation, clear product positioning, active presence in the places LLM training data gets scraped from... these things compound over time regardless of what any benchmark says right now.
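
For the "don't panic" and "do the math yourself" points, one way to check whether a week-to-week drop is more than noise is a pooled two-proportion z-test. This is a sketch, not a vendor feature: two_proportion_z is a hypothetical helper, the report numbers are made up to mirror the 30% to 22% example, and it assumes the two reports are independent samples.

```python
from math import erfc, sqrt


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both reports are all mentions or no mentions: no evidence of change.
        return 0.0, 1.0
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))


# Hypothetical reports: 30% then 22%, at two different sample sizes.
for k1, k2, n in [(15, 11, 50), (300, 220, 1000)]:
    z, p_value = two_proportion_z(k1, n, k2, n)
    verdict = "likely real" if p_value < 0.05 else "could be noise"
    print(f"n={n}: z={z:.2f}, p={p_value:.3f} -> {verdict}")
```

The same eight-point drop reads as noise at 50 runs and as a real change at 1,000, which is why the n behind the score matters more than the score.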

The agent SEO category is going to mature. The tools will get better, sample sizes will increase, and we'll eventually have reliable benchmarks. But right now, in mid-2026, most of these numbers deserve a squint, not a strategy pivot.