VOL II · ISSUE 1 · MAY 2026 · SAO PAULO · ESTABLISHED 2025 · BY LAUDOS.AI
A public register
of evaluation gates.

laibench

Public technical preview -
radiology finding ⤳ report evaluation.

In official evaluation, the primary verdict is Strict PASS - a case passes only when every clinically decisive gate holds.

The public demo proves the harness. The controlled benchmark scores the model.
PUBLIC: synthetic demo only
LOCAL SUITE: lite-public.pt-BR
OFFICIAL: controlled / hosted evaluation
DATA: clinical corpus controlled
HIDDEN TEST: not distributed
UPDATED: 11.v.2026
Standing

Standing - public technical preview

Synthetic harness demo, not the official benchmark.

The current public package exposes the harness, scoring contract, synthetic demo suite, preprint materials, and submission rules.

The synthetic demo is not used to judge clinical model performance. No official clinical leaderboard is published in this technical preview.

Official rows require controlled or hosted evaluation with frozen outputs, suite hashes, disclosure metadata, and eligibility review.

System | Evaluation | Status
No official public leaderboard yet. | Official scoring requires controlled or hosted evaluation. | not ranked
Synthetic harness demo (lite-public.pt-BR) | 4 synthetic cases · smoke test and contract inspection | demo · not ranked
Your agent here | controlled / hosted evaluation · frozen outputs · suite hash · disclosure | official · request evaluation

Method

A locked finding-to-report protocol.

The public demo verifies the harness; controlled evaluation scores the model.

Visible task input only.

The evaluated system receives only visible task input: exam descriptor, provided findings, locale, and allowed public context.

Hidden material stays gated.

It never receives gold labels, hidden criteria, reference answers, judge prompts, private scoring rules, leakage markers, or hidden test metadata.
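
The visible/hidden split above can be pictured as an allow-list applied before a case reaches the evaluated system. This is a minimal sketch, not the harness's actual code: the field names (`exam_descriptor`, `findings`, `locale`, `public_context`, `gold_labels`, `judge_prompt`) are assumptions chosen to mirror the lists above.

```python
from typing import Any

# Assumed field names, mirroring the visible inputs listed above.
VISIBLE_FIELDS = ("exam_descriptor", "findings", "locale", "public_context")


def visible_view(case: dict[str, Any]) -> dict[str, Any]:
    """Return only the allow-listed fields of a case record."""
    return {key: case[key] for key in VISIBLE_FIELDS if key in case}


case = {
    "exam_descriptor": "CT chest without contrast",
    "findings": ["6 mm nodule in the right upper lobe"],
    "locale": "pt-BR",
    "gold_labels": ["pulmonary_nodule"],  # hidden: must never reach the agent
    "judge_prompt": "...",                # hidden: must never reach the agent
}

agent_input = visible_view(case)
```

An allow-list is used here rather than a deny-list so that any newly added hidden field is excluded by default.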

Aggregation never overrides the verdict.

Strict PASS is binary per case. Per-dimension means explain failures; they do not erase clinically decisive gates.
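
A small sketch of how a binary per-case verdict can sit alongside diagnostic means. The `CaseResult` structure and the gate names are hypothetical; only the rule that every decisive gate must hold comes from the text above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    gates: dict[str, bool]    # clinically decisive gates (names are hypothetical)
    scores: dict[str, float]  # per-dimension diagnostic scores in [0, 1]


def strict_pass(result: CaseResult) -> bool:
    """A case passes only when every decisive gate holds."""
    return all(result.gates.values())


def dimension_means(results: list[CaseResult]) -> dict[str, float]:
    """Diagnostic only: average each dimension over the cases that report it."""
    dims = sorted({d for r in results for d in r.scores})
    return {d: mean(r.scores[d] for r in results if d in r.scores) for d in dims}


results = [
    CaseResult({"CRIT": True, "QUAL": True}, {"CRIT": 1.0, "QUAL": 0.9}),
    CaseResult({"CRIT": False, "QUAL": True}, {"CRIT": 0.0, "QUAL": 1.0}),
]
verdicts = [strict_pass(r) for r in results]
means = dimension_means(results)
```

In this toy data the second case fails on its `CRIT` gate even though the `QUAL` mean across cases is high; the means describe where things went wrong without changing either verdict.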

Dimensions

Five diagnostic axes, one primary verdict.

Scores explain failure modes. Strict PASS decides whether a case survives.

Strict PASS: binary · per case
Per-dimension mean: diagnostic only
Bootstrap intervals: official only · when applicable
Aggregation: never overrides the verdict
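
For the bootstrap intervals mentioned above, one common choice is a percentile bootstrap over per-case verdicts. The sketch below assumes that method; the resample count, the `alpha` level, and the data are illustrative and not taken from the official protocol.

```python
import random


def bootstrap_pass_rate_ci(verdicts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Strict PASS rate."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(verdicts)
    # Resample cases with replacement and record each resample's pass rate.
    rates = sorted(
        sum(rng.choices(verdicts, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


verdicts = [True] * 14 + [False] * 6  # illustrative data, not real results
low, high = bootstrap_pass_rate_ci(verdicts)
```

The interval describes uncertainty in the aggregate pass rate only; each case's verdict stays binary.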
CRIT

Decisive findings

Decisive imaging findings are checked with negation handling where implemented. In controlled evaluation, hidden labels may be used to verify preservation of urgent or clinically decisive findings. The public mirror does not ship the full clinical category set.
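
To illustrate what negation handling means in practice, here is a deliberately simple rule-based sketch. The cue list and the sentence-splitting rule are assumptions for illustration (and in English for readability); the benchmark's own checker is not described at this level of detail.

```python
import re

# Hypothetical negation cues; a real checker would use a locale-specific list.
NEGATION_CUES = ("no", "without", "absence of", "negative for")


def asserts_finding(report: str, term: str) -> bool:
    """True if some sentence mentions `term` with no negation cue before it."""
    for sentence in re.split(r"[.;\n]+", report.lower()):
        idx = sentence.find(term.lower())
        if idx == -1:
            continue
        prefix = sentence[:idx]  # only text before the first mention is checked
        negated = any(
            re.search(rf"\b{re.escape(cue)}\b", prefix) for cue in NEGATION_CUES
        )
        if not negated:
            return True
    return False


report = "Right apical pneumothorax. No pleural effusion."
```

With this report, `asserts_finding(report, "pneumothorax")` is true while `asserts_finding(report, "pleural effusion")` is false, so a report that only negates a decisive finding would not count as preserving it.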

QUAL

Clinical quality

Severity-aware matching checks whether clinically relevant findings are preserved without rewarding unsupported normality. In controlled evaluation, comparison may use hidden labels, adjudicated references, or controlled reference reports - never plain string overlap alone.

TERM

Terminology

Terminology checks detect modality drift, section drift, forbidden openers, local-style violations, and modality-specific vocabulary errors where implemented.
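
Two of the checks named above, forbidden openers and modality drift, can be sketched with plain lookups. The rule tables below are hypothetical placeholders; the actual lists are not part of the public text.

```python
import re

# Hypothetical rule tables, for illustration only.
FORBIDDEN_OPENERS = ("as an ai", "here is the report", "sure,")
MODALITY_VOCAB = {
    "CT": {"hypodense", "hyperdense", "attenuation"},
    "MRI": {"hypointense", "hyperintense", "signal"},
    "US": {"hypoechoic", "hyperechoic", "echogenicity"},
}


def terminology_issues(report: str, modality: str) -> list[str]:
    """Flag forbidden openers and vocabulary that belongs to another modality."""
    text = report.strip().lower()
    issues = []
    if text.startswith(FORBIDDEN_OPENERS):
        issues.append("forbidden opener")
    words = set(re.findall(r"[a-z]+", text))
    for other, vocab in MODALITY_VOCAB.items():
        if other != modality and words & vocab:
            issues.append(f"modality drift: {other} vocabulary in {modality} report")
    return issues


issues = terminology_issues(
    "Here is the report: hyperintense lesion in the liver.", "CT"
)
```

Here a CT report that opens with chat-style filler and borrows MRI vocabulary yields two issues; a CT report using only CT terms yields none.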

GUIDE

Guidelines

Guideline modules run only when the rule is applicable to the case.

RAG

Retrieval fidelity

Retrieval fidelity is evaluated only for retrieval-enabled agents.
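
The applicability rule shared by the guideline and retrieval axes can be expressed as a guard that returns no score at all when a module does not apply. Reporting "not applicable" as `None`, the `retrieval_enabled` flag, and the toy scorer are all assumptions made for this sketch.

```python
from typing import Callable, Optional


def run_if_applicable(
    applies: Callable[[dict], bool],
    scorer: Callable[[dict], float],
    case: dict,
) -> Optional[float]:
    """Return a score when the module applies, otherwise None (not applicable)."""
    return scorer(case) if applies(case) else None


def rag_applies(case: dict) -> bool:
    # Assumed flag marking a retrieval-enabled agent run.
    return bool(case.get("retrieval_enabled"))


def rag_score(case: dict) -> float:
    # Toy proxy: share of cited passages that were actually retrieved.
    cited = case.get("cited", [])
    retrieved = set(case.get("retrieved", []))
    if not cited:
        return 1.0
    return sum(p in retrieved for p in cited) / len(cited)


with_rag = {"retrieval_enabled": True, "cited": ["p1", "p2"], "retrieved": ["p1"]}
without_rag = {"retrieval_enabled": False}
```

Returning `None` instead of `0.0` keeps a non-applicable axis from dragging down a diagnostic mean for agents that never used retrieval.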

Integrity

Benchmark integrity depends on what remains controlled.

The method is public. The official test set is not.

The method can be public while the official test set remains controlled. What is published must be reproducible; what protects privacy, prevents contamination, and preserves benchmark integrity stays gated.

Submission

Use the public demo to inspect the contract.

Request controlled evaluation for official scoring.

Run the public suite.

Use lite-public.pt-BR for local smoke testing and contract inspection. Every public case is marked synthetic:true.
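
A local smoke test might start by confirming that every case in the public suite carries the synthetic flag. The sketch assumes the suite is stored as JSON Lines with `id` and `synthetic` keys; the real file layout may differ.

```python
import json


def non_synthetic_ids(jsonl_text: str) -> list[str]:
    """Return ids of cases that are not explicitly flagged synthetic: true."""
    bad = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        if case.get("synthetic") is not True:
            bad.append(str(case.get("id", f"line {line_no}")))
    return bad


# Inline stand-in for the suite file (assumed JSON Lines layout).
suite = (
    '{"id": "demo-001", "synthetic": true}\n'
    '{"id": "demo-002", "synthetic": true}\n'
)
problems = non_synthetic_ids(suite)
```

An empty `problems` list means every case is flagged; a missing or false flag is reported by case id.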

Freeze outputs.

Official rows require frozen outputs, suite hash, disclosure metadata, and eligibility review.
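
One way to produce a suite hash for a frozen submission is a SHA-256 digest over file names and contents in a fixed order. The hashing scheme and file names below are assumptions for illustration; the official contract may specify a different procedure.

```python
import hashlib


def suite_hash(files: dict[str, bytes]) -> str:
    """SHA-256 over file names and contents, visited in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        data = files[name]
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
        digest.update(data)
    return digest.hexdigest()


# Hypothetical file names and contents.
suite_files = {
    "cases.jsonl": b'{"id": "demo-001", "synthetic": true}\n',
    "contract.json": b'{"version": 1}\n',
}
frozen = suite_hash(suite_files)
```

The same function can be applied to the frozen output files, so that a reviewer can confirm both the suite and the outputs are byte-identical to what was submitted.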

Preprint

Beyond Templates: A Compositional Model and Lower Bound for Radiology Report Variability

Formal and architectural contribution. Not a clinical-use claim.

Scope

This preprint is a formal and architectural contribution. It does not claim clinical validation, regulatory approval, autonomous diagnosis, product clearance, or replacement of radiologist oversight.

Artifacts

The public site contains the method, paper materials, public-safe synthetic demo, and submission contract.

Boundary

The clinical corpus, raw reports, hidden test set, answer keys, and private scoring criteria are not distributed.