How ARIA evaluates its own AI
ARIA uses Claude for narration and ingestion in its compliance
work. A nightly evaluation harness asserts invariants on each
LLM-touching surface: quote integrity, derivability, citation
resolution, and PII pre-flight. The live posture below is
the same data our ops team watches. ADR-014 governs the
no-invention rule the harness enforces.
What we test for
Quote integrity
Every extracted quote appears verbatim in the source document.
Fabricated evidence cannot pass this check. If a quote isn't
in the source, the harness flags it and the extraction is
rejected before it reaches a reviewer.
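As a rough illustration of the check described above, the sketch below tests each quote for an exact substring match and rejects the whole extraction if any quote is missing. The names QuoteCheck, check_quotes and extraction_passes are hypothetical, and exact matching without whitespace or Unicode normalization is an assumption; the harness may define "verbatim" differently.

```python
from dataclasses import dataclass


@dataclass
class QuoteCheck:
    quote: str
    found: bool


def check_quotes(source_text: str, quotes: list[str]) -> list[QuoteCheck]:
    """Report, for each quote, whether it occurs verbatim in source_text.

    Assumption: "verbatim" means an exact, case-sensitive substring match.
    Empty quotes are treated as not found, since they carry no evidence.
    """
    return [QuoteCheck(quote=q, found=bool(q) and q in source_text) for q in quotes]


def extraction_passes(source_text: str, quotes: list[str]) -> bool:
    """Reject the whole extraction if any single quote is missing."""
    return all(check.found for check in check_quotes(source_text, quotes))
```

Rejecting the whole extraction, rather than silently dropping the bad quote, follows the wording above: an extraction with a missing quote does not reach a reviewer.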
Derivability gate
Narrations are split into clauses; unsupported clauses are dropped.
Every clause in a Pattern 1 narration has to map back to an
input observation. Clauses without support never ship.
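The gate can be pictured roughly as follows. This is a minimal sketch, not the harness itself: the clause splitter (a split on periods and semicolons) and the toy_support check (an observation must appear inside the clause) are stand-in assumptions, and every function name here is hypothetical. The support check is passed in as a parameter because the source does not say how support is decided.

```python
import re
from typing import Callable

SupportFn = Callable[[str, list[str]], bool]


def split_clauses(narration: str) -> list[str]:
    """Naive splitter (assumption): break on periods and semicolons."""
    parts = re.split(r"[.;]\s*", narration)
    return [p.strip() for p in parts if p.strip()]


def toy_support(clause: str, observations: list[str]) -> bool:
    """Toy check: some observation text appears inside the clause."""
    lowered = clause.lower()
    return any(obs.lower() in lowered for obs in observations)


def derivability_gate(
    narration: str,
    observations: list[str],
    is_supported: SupportFn = toy_support,
) -> tuple[list[str], list[str]]:
    """Return (kept, dropped) clauses; only kept clauses would ship."""
    kept: list[str] = []
    dropped: list[str] = []
    for clause in split_clauses(narration):
        if is_supported(clause, observations):
            kept.append(clause)
        else:
            dropped.append(clause)
    return kept, dropped
```

Returning the dropped clauses alongside the kept ones makes it possible to log what was removed, which is useful when the gate runs as a nightly assertion.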
Citation resolution
Pattern 2 narrations cite observations via inline markers.
Each citation must resolve to a real observation we fed the
model. Invented references are rejected by the parser.
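A sketch of how such a parser check might look is below. The marker syntax [obs:ID] is an assumption made for illustration; the source does not specify the inline marker format. The function names are likewise hypothetical.

```python
import re

# Assumed marker format: [obs:ID], where ID is letters, digits, "_" or "-".
MARKER = re.compile(r"\[obs:([A-Za-z0-9_-]+)\]")


def unresolved_citations(narration: str, observation_ids: set[str]) -> list[str]:
    """Return cited IDs that do not match any observation fed to the model."""
    cited = MARKER.findall(narration)
    return [cid for cid in cited if cid not in observation_ids]


def citations_resolve(narration: str, observation_ids: set[str]) -> bool:
    """True only if every inline citation points at a known observation."""
    return not unresolved_citations(narration, observation_ids)
```

For example, with observations a1 and b2 supplied to the model, a narration citing [obs:z9] would fail the check because z9 was never part of the input.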
PII pre-flight
Documents with HIPAA-class PII never reach the LLM.
A pre-flight scan blocks the upstream API call entirely.
Blocked documents never leave the customer's tenancy.
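The ordering matters here: the scan runs first, and the upstream call is only made if the scan comes back clean. The sketch below shows that gate shape. The two regular expressions are illustrative placeholders only and are far narrower than a real scan for HIPAA-class PII; PII_PATTERNS, scan_for_pii, send_if_clean and the call_llm parameter are all hypothetical names.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only (assumption); a real scan would cover far more.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def scan_for_pii(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def send_if_clean(text: str, call_llm: Callable[[str], str]) -> Optional[str]:
    """Call the LLM only when the scan finds nothing; otherwise return None.

    The early return means call_llm is never invoked for a blocked document.
    """
    if scan_for_pii(text):
        return None
    return call_llm(text)
```

Passing the LLM call in as a function keeps the gate testable without network access: a test can hand in a stub and assert it was never called for a blocked document.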