Measurement Methodology

How AITWIRE measures the way AI systems represent you — what the scores mean, how we calculate confidence, and what the numbers can and cannot tell you. We would rather you trust fewer numbers more.

How we measure

AITWIRE sends controlled questions ("polls," also called probes) to the major AI answer engines and scores each response across quality dimensions (accuracy, citation, sentiment, quality, recommendation) and signal categories.

  • Polls are run "cold" — no personalization, account history, or AITWIRE context is injected — at low randomness, and repeated across phrasings, engines, and cycles.
  • This measures the default answer an un-personalized user is likely to get, as a sample over many queries — not a single check.
  • Each response is graded, and results are aggregated per dimension with a sample size and an interval (below).
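
As a rough illustration of the sampling design above, the sketch below enumerates one poll per engine, phrasing, and cycle. The `Poll` record, the `build_poll_plan` helper, the engine names, and the temperature value of 0.0 are hypothetical, chosen only to make the idea concrete; they are not AITWIRE's actual implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Poll:
    """One cold poll (hypothetical record, for illustration only)."""
    engine: str
    phrasing: str
    cycle: int
    temperature: float = 0.0       # "low randomness"; the exact value is an assumption
    inject_context: bool = False   # "cold": no personalization or account history


def build_poll_plan(engines, phrasings, cycles):
    """Repeat every phrasing on every engine for each cycle."""
    return [
        Poll(engine=e, phrasing=p, cycle=c)
        for e, p, c in product(engines, phrasings, range(1, cycles + 1))
    ]


plan = build_poll_plan(
    engines=["engine_a", "engine_b"],               # placeholder engine names
    phrasings=["Who makes X?", "Best X provider?"],  # placeholder phrasings
    cycles=3,
)
print(len(plan))  # 2 engines x 2 phrasings x 3 cycles = 12 polls
```

The point of the grid is that a score is a sample over many polls rather than the result of a single query.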

Precision — what "high confidence" means

Every dimension score is reported with its sample size (n) and a 95% confidence interval. We label a dimension "high confidence" only when n ≥ 30 and the 95% interval is within ±7 points.

Because the margin narrows with the square root of n, a noisy (near 50/50) metric typically needs roughly 150–200 polls to reach ±7 points, so a "high" label usually reflects far more than 30 polls. We always show the actual n and interval, not just the label. Confidence here is the statistical precision of the sample, never certainty.
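
The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming a dimension score behaves like a pass rate on a 0–100 scale and that the interval uses the normal approximation; the exact interval method is not stated here, and the function names are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal


def margin_points(p: float, n: int) -> float:
    """Half-width of the 95% interval, in points on a 0-100 scale.

    Assumes a normal-approximation interval for a proportion p.
    """
    return 100 * Z_95 * math.sqrt(p * (1 - p) / n)


def is_high_confidence(p: float, n: int,
                       min_n: int = 30, max_margin: float = 7.0) -> bool:
    """Apply the stated rule: n >= 30 and the interval within +/-7 points."""
    return n >= min_n and margin_points(p, n) <= max_margin


def polls_needed(p: float, max_margin: float = 7.0) -> int:
    """Smallest n whose margin is at most max_margin points."""
    return math.ceil((100 * Z_95) ** 2 * p * (1 - p) / max_margin ** 2)


print(round(margin_points(0.5, 30), 1))  # margin at n = 30 for a 50/50 metric
print(polls_needed(0.5), polls_needed(0.3))
```

Under these assumptions a 50/50 metric needs about 196 polls to reach ±7 points and a 30/70 metric about 165, which is why n = 30 alone rarely earns the "high" label.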

Validity — the limits the interval does not capture

A tight interval tells you the sample is precise. It does not, by itself, tell you the measured value is true. The honest caveats:

  • Scoring can err. Response grading uses automated and machine-learning classifiers, which can mis-grade. We mitigate with independent judge cross-checks and inter-rater agreement — but do not eliminate error. A tight interval around a mis-scored value is "confidently wrong."
  • Samples are not fully independent. Repeated, similar prompts to the same model are correlated, so the effective sample is smaller than the raw count and true intervals can be modestly wider than computed.
  • Scope. We measure un-personalized, point-in-time responses across a selected set of engines — not every personalized answer, every phrasing, every surface, or future model states. AI systems also drift as their models change.
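
To make the "samples are not fully independent" caveat concrete, the sketch below uses the standard design-effect formula for clustered samples, deff = 1 + (m − 1)·ρ. It is an illustration, not a description of how AITWIRE adjusts its intervals: the cluster size `m` (polls sharing a prompt family and engine) and the correlation `rho` are hypothetical inputs.

```python
import math


def design_effect(cluster_size: float, rho: float) -> float:
    """Design effect: deff = 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * rho


def effective_n(n: int, cluster_size: float, rho: float) -> float:
    """Number of independent polls the correlated sample is worth."""
    return n / design_effect(cluster_size, rho)


def widened_margin(margin: float, cluster_size: float, rho: float) -> float:
    """The interval half-width grows by sqrt(deff)."""
    return margin * math.sqrt(design_effect(cluster_size, rho))


# 200 polls in clusters of 5 similar prompts, with an assumed correlation of 0.1
print(round(effective_n(200, 5, 0.1), 1))
print(round(widened_margin(7.0, 5, 0.1), 2))
```

With these illustrative inputs, 200 raw polls behave like roughly 143 independent ones and a computed ±7 margin widens to about ±8.3 points.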

How we keep the numbers honest

Beyond a precise interval, we apply controls so a number is only reported as meaningful when it earns it:

  • Statistical power. When a sample is too small to detect a meaningful change, a non-result is reported as "inconclusive," not "flat" — absence of evidence is not evidence of absence.
  • Multiple-comparison control. We test several dimensions at once, so we apply a correction that controls the false-discovery rate: a movement is "confirmed" only if it survives correction, otherwise it is flagged "exploratory."
  • Representativeness. We stratify polls by query intent (branded, category, competitor, local, buyer-intent) and equal-weight the covered strata, so the headline is not dominated by whichever intent was polled most — and we show the gap versus a naive average.
  • Citation integrity. A brand mention is not proof. Only a cited source that actually substantiates the answer counts toward the evidence rate; stale, third-party, competitor, or unresolved citations are labelled, not counted.
  • Change vs. model drift. AI engines move on their own. We measure and control for model-wide volatility and report your lift net of it — downgrading attribution to inconclusive when the model itself is too volatile.
  • Auditability. Every poll stores a reproducible evidence record — the scorer and rubric versions, the resolved model, and the market — so a historical score can be re-checked under today's methodology.
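
Two of the controls above lend themselves to a short sketch. The correction procedure is not named here, so the example assumes Benjamini-Hochberg, one common way to control the false-discovery rate. The equal-weighting helper shows how a headline score can differ from a naive poll-weighted average. All function names, p-values, and scores are hypothetical.

```python
def benjamini_hochberg(p_values: dict[str, float], q: float = 0.05) -> dict[str, str]:
    """Label each dimension 'confirmed' or 'exploratory' at FDR level q."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0  # largest rank k with p_(k) <= k * q / m
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank * q / m:
            cutoff = rank
    return {
        name: "confirmed" if rank <= cutoff else "exploratory"
        for rank, (name, _) in enumerate(ranked, start=1)
    }


def equal_weight_headline(strata: dict[str, tuple[float, int]]):
    """strata maps intent -> (score, n). Returns (headline, naive, gap)."""
    covered = [(score, n) for score, n in strata.values() if n > 0]
    headline = sum(score for score, _ in covered) / len(covered)
    naive = sum(score * n for score, n in covered) / sum(n for _, n in covered)
    return headline, naive, headline - naive


labels = benjamini_hochberg({
    "accuracy": 0.001, "citation": 0.012, "recommendation": 0.035,
    "sentiment": 0.045, "quality": 0.20,
})
print(labels)

headline, naive, gap = equal_weight_headline({
    "branded": (80.0, 100), "category": (40.0, 20), "local": (0.0, 0),
})
print(headline, round(naive, 1), round(gap, 1))
```

In this made-up example only accuracy and citation survive correction, and the heavily polled branded stratum would pull a naive average about 13 points above the equal-weighted headline.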

How to read the numbers

  • Trust the trend and relative comparison more than any single absolute number.
  • Use the confidence label and interval to decide which dimensions are reliable enough to act on; treat "insufficient" or "low" dimensions as directional only.
  • The strongest evidence is measured lift — the same metric before and after a change — together with citation delivery, i.e. when an engine actually quotes your published source.
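
Measured lift, net of model drift, can be sketched as a simple difference-in-differences. This is an assumption-laden illustration: the control metrics, the `max_volatility` threshold of 5 points, and the function name are hypothetical, and the actual drift control may work differently.

```python
from statistics import mean, stdev


def net_lift(before: float, after: float,
             control_before: list[float], control_after: list[float],
             max_volatility: float = 5.0):
    """Return (net_lift, inconclusive).

    control_* hold the same metric for reference queries that were not
    changed, measured in the same two periods (needs at least two values).
    """
    drifts = [a - b for b, a in zip(control_before, control_after)]
    drift = mean(drifts)        # model-wide movement
    volatility = stdev(drifts)  # how erratic that movement is
    net = (after - before) - drift
    return net, volatility > max_volatility


# Stable engine: the controls all moved by +2, so 12 raw points become 10 net.
print(net_lift(50.0, 62.0, [40.0, 55.0, 60.0], [42.0, 57.0, 62.0]))

# Volatile engine: the controls moved erratically, so attribution is inconclusive.
print(net_lift(50.0, 62.0, [40.0, 55.0, 60.0], [30.0, 70.0, 62.0]))
```

Subtracting the control drift keeps a model-wide shift from being credited to your change, and the volatility check avoids reporting a net figure when the baseline itself is too noisy to trust.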

What we do not claim

  • We do not claim to read every user's personalized answer.
  • We do not guarantee that any AI system will adopt, cite, or correctly interpret your information.
  • Analytics are informational and should not be the sole basis for legal, financial, medical, or compliance decisions.

Versioning

This methodology is versioned. When we change how scores are computed, we restate affected figures and note the change, so period-over-period comparisons stay honest. See the AITWIRE Terms of Service for the legal terms that govern measurement and estimates.