LAIBench

A governance-oriented benchmark for turning an exam descriptor and concise findings into a faithful radiology report — scored where it matters clinically, not on prose. A missed or fabricated critical finding is a hard veto, never a soft deduction.

Designed so form never rescues substance

01 — HOW IT SCORES

Hard critical-finding veto

A missed or fabricated critical finding caps the score and forces FAIL — regardless of how polished the rest of the report reads.
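As a rough illustration of the veto, the sketch below caps the score and forces FAIL whenever a critical finding is missed or fabricated. The cap value `VETO_CAP`, the `CriticalCheck` type, and the function name are hypothetical; the text above states only that a cap exists, not what it is.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical cap value: the benchmark description says the score is capped
# but does not state the number.
VETO_CAP = 0.4


@dataclass
class CriticalCheck:
    missed: int      # critical findings in the input that the report omits
    fabricated: int  # critical findings in the report with no support in the input


def apply_critical_veto(raw_score: float, check: CriticalCheck) -> Tuple[float, str]:
    """Return (score, verdict). Any critical miss or fabrication forces FAIL."""
    if check.missed > 0 or check.fabricated > 0:
        # The cap is applied with min(), so a polished report cannot buy the score back.
        return min(raw_score, VETO_CAP), "FAIL"
    # No veto triggered; other gates (not modelled here) may still apply.
    return raw_score, "NOT_VETOED"
```

The veto is a branch rather than a weighted penalty, so no amount of credit elsewhere in the report can offset it.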

No prose / aesthetic axis

There is no standalone style, fluency, or "communication quality" dimension. The only discourse signals are minor, non-gating, and used only as a fallback.

Conservative combination

The combined dimension score is MIN(deterministic, judge): an optional LLM judge can lower a score but never inflate it past the gate.
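A minimal sketch of the MIN rule, assuming both scores are on the same scale and that a missing judge leaves the deterministic score unchanged. The function and parameter names are illustrative, not LAIBench's own.

```python
from typing import Optional


def combine_dimension(deterministic: float, judge: Optional[float] = None) -> float:
    """Combine a deterministic score with an optional LLM-judge score.

    The judge is optional; when present it can only pull the result down,
    because the combination is min(deterministic, judge).
    """
    if judge is None:
        return deterministic
    return min(deterministic, judge)
```

For example, `combine_dimension(0.7, 0.9)` stays at `0.7`, while `combine_dimension(0.7, 0.5)` drops to `0.5`.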

Tamper-resistant numbers

Every run is re-scored through the gated combiner, so a relabeled critical miss is rejected before it reaches the leaderboard. Results carry scoringHash and suiteHash provenance.
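One way such a check could work is sketched below. Everything beyond the names `scoringHash` and `suiteHash` is an assumption: the SHA-256-over-canonical-JSON scheme, the record fields, and the `gated_verdict` stand-in for the gated combiner are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """SHA-256 over canonical JSON (sorted keys), so equal content hashes equally."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def gated_verdict(case: dict) -> str:
    """Hypothetical stand-in for the gated combiner: any critical error means FAIL."""
    return "FAIL" if case["critical_missed"] or case["critical_fabricated"] else "PASS"


def accept(submission: dict, scoring_config: dict, suite: dict) -> bool:
    """Accept a run only if its provenance matches and its verdicts survive re-scoring."""
    if submission["scoringHash"] != stable_hash(scoring_config):
        return False  # scored with a different scoring configuration
    if submission["suiteHash"] != stable_hash(suite):
        return False  # scored against a different case suite
    # Recompute every verdict instead of trusting the submitted label.
    return all(gated_verdict(c) == c["claimed_verdict"] for c in submission["cases"])
```

Under these assumptions, a case with `critical_missed` set but a claimed verdict of `PASS` fails the final check, because the verdict is recomputed rather than read from the submission.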

Dimensions: CRIT 30% · QUAL 25% · TERM 20% · GUIDE 15% · RAG 10%, with hard failure gates. LAIBench is a technical benchmark framework — not a medical device, not regulatory approval, not clinical validation.
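The dimension weights can be read as a weighted average, sketched below. The weights come from the text; the 0 to 100 score scale and the function name are assumptions, and the hard failure gates are applied separately and are not modelled here.

```python
# Weights in percent, as listed above; they sum to 100.
WEIGHTS = {"CRIT": 30, "QUAL": 25, "TERM": 20, "GUIDE": 15, "RAG": 10}


def weighted_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (assumed to be on a 0-100 scale)."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("dimension weights must sum to 100")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS) / 100


example = {"CRIT": 80, "QUAL": 60, "TERM": 100, "GUIDE": 40, "RAG": 50}
print(weighted_score(example))  # 70.0
```

Keeping the weights as integer percentages avoids floating-point drift when checking that they sum to 100.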

Leaderboard

02 — RESULTS

Two governance tiers, never mixed: controlled-eval (gated, first-party, aggregate-only) and public-smoke (synthetic, contaminable, diagnostic baselines — not a ranking). See the benchmark cards for case source, leakage risk, and adjudication status.

Research

Beyond Templates: A Compositional Model and Lower Bound for Radiology Report Variability

Radiologist adjudication & inter-rater agreement for automated report scoring

External validation of finding-to-report faithfulness across sites and equipment