Hallucination diagnosis for RAG
Wrap one line around your retrieval pipeline and get a structured report on what failed and how to fix it — not just a single yes/no hallucination flag. Veralith decomposes every (query, context, response) trace, runs three LLM-as-judge metrics over it, and classifies it into one of six diagnostic cells with a concrete remediation.
Overview
A monolithic “is this response hallucinated?” judge is a smoke alarm — it tells you something is wrong, but not what or where. Veralith is the diagnostic dashboard behind the alarm. For each trace it answers three independent questions:
- Sufficiency — was the retrieval good enough to answer each part of the query?
- Faithfulness — is each claim in the response grounded in the retrieved context?
- Completeness — does the response actually answer every part of the query (and stay on topic)?
Cross-tabulating these gives a named failure mode (retrieval gap, intrinsic hallucination, padded answer, …) plus actionable fixes (lower temperature, bump retrieval‑K, tighten the generator prompt, …) for every trace.
Veralith evaluates traces you already have — it does not sit in your request path or change your responses. It only needs the query, the retrieved chunks, and the response your system produced.
Installation
Veralith targets Python 3.10+ and uses the OpenAI API for its judges.
pip install veralith
Optional extras:
pip install "veralith[langchain]" # LangChain auto-tracing adapter
pip install "veralith[sample]" # chromadb, for the sample RAG app
pip install "veralith[dev]" # pytest, ruff, build, twine (contributors)
Set your OpenAI key (read once at import via python-dotenv, so a .env file works too):
export OPENAI_API_KEY=sk-...
30-second quickstart
Run one synchronous evaluation and read the diagnosis straight off the typed result — no database, no polling:
import veralith
result = veralith.evaluate(
    query="What is a P/E ratio and what was Apple's P/E in 2023?",
    context=[
        "The price-to-earnings (P/E) ratio is a company's share price "
        "divided by its earnings per share."
    ],
    response=(
        "A P/E ratio divides share price by earnings per share. "
        "Apple's P/E in 2023 was 42.7."
    ),
    persist=False,  # run entirely in memory
)
print(result.failure_cell.value)  # 'incomplete_ungrounded'
print(result.suggestion.title)  # 'Worst-case failure'
for action in result.suggestion.actions:
    print(" -", action)
Here the response invents a number (42.7) that the context never grounds, and the context didn't cover Apple's P/E in the first place, so Veralith lands the trace in the worst-case cell and returns concrete next steps. You get back a typed EvaluationResult with per-claim verdicts, per-sub-question sufficiency, a failure-cell diagnosis, and a suggestion.
How it works
Every evaluation runs the same deterministic pipeline over one trace — the triple (query Q, context C, response R):
Decomposition is deliberately conservative: the splitter only acts on content literally present in the text, resolves pronouns so each piece is self-contained, and never invents sub-topics. A single-purpose query stays one sub-question; a refusal (“I can't answer that from the context”) yields zero claims — a case the pipeline handles explicitly.
Each of the three judges is isolated: if one fails (an API error, a malformed verdict), the others still complete and the failure is recorded in result.errors rather than aborting the whole evaluation. Per-phase wall-clock timings land in result.latency_ms.
Cost. A typical evaluation is ~5 LLM calls — 2 decomposition calls on the cheaper decomposer model plus 3 batched judges — roughly $0.005 / trace on the default models. The per-item judges batch in groups of 5, so 15 sub-questions cost 3 calls, not 15.
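The call-count arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model rather than Veralith's own code: the function name `estimate_llm_calls` is hypothetical, and the sketch assumes two decomposition calls (one for the query, one for the response), one batched pass each for Sufficiency and Faithfulness, and a single Completeness call.

```python
import math


def estimate_llm_calls(n_sub_questions: int, n_claims: int, batch_size: int = 5) -> int:
    """Rough LLM-call count for one trace (assumed model, not the library's code)."""
    decomposition = 2  # assumption: one call splits the query, one splits the response
    sufficiency = math.ceil(n_sub_questions / batch_size)  # per-item judge, batched
    faithfulness = math.ceil(n_claims / batch_size)  # per-item judge, batched
    completeness = 1  # single whole-trace alignment call
    return decomposition + sufficiency + faithfulness + completeness


print(estimate_llm_calls(n_sub_questions=2, n_claims=3))  # 5, the "typical" case
print(math.ceil(15 / 5))  # 3 sufficiency calls for 15 sub-questions
```

Under this model the call count grows with the number of batches rather than the number of items, which is why a long multi-part query stays cheap.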
The three metrics
The metrics measure independent things. Two are per-item binary judges; Completeness is a single whole-trace alignment call.
- Sufficiency. Per sub-question Qᵢ: do the retrieved chunks contain enough information to answer it, using only the chunks? A retrieval-quality signal.
- Faithfulness. Per claim Rᵢ: is it grounded in the chunks? This checks grounding, not correctness: a plausible inference beyond the context still counts as ungrounded, and the judge is strict on numbers, dates, and entities.
- Completeness. Whole trace: does R cover every Qᵢ and stay on topic? Definitions, formulas, examples, caveats, and history of the asked topic are on-topic; only a claim about a different topic is “extra”.
Sufficiency is a deliberate sidecar: it does not decide which failure cell you land in. It is collapsed to a HIGH / LOW level that only refines which remediation text you get — see calibration.
The six failure cells
Each trace lands in exactly one cell — a strict cross-tab of Completeness (3 rows: does R cover Q?) × Faithfulness (2 columns: is every claim grounded?). The cell name follows the pattern <completeness>_<faithfulness>, so you can decode any cell without a lookup chart: read it as “the response is <X> and the claims are <Y>.”
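| Completeness (does R cover Q?) | every claim supported | some claim invented |
|---|---|---|
| complete | complete_grounded | complete_ungrounded |
| incomplete | incomplete_grounded | incomplete_ungrounded |
| extra | extra_grounded | extra_ungrounded |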
The Completeness axis follows a fixed precedence — incomplete > extra > complete: an uncovered sub-question always makes a trace incomplete, even if it also has extra claims. A response with zero claims (a refusal) is treated as grounded — there is nothing to be ungrounded about — so a refusal lands in incomplete_grounded.
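The precedence and refusal rules above can be written down as a small classifier. This is a minimal sketch under stated assumptions, not the library's implementation: `classify_cell` is a hypothetical name, the inputs are plain counts and booleans rather than Veralith's judgment models, and the cell strings are built from the `<completeness>_<faithfulness>` pattern.

```python
def classify_cell(n_uncovered: int, n_extra: int, claim_grounded: list[bool]) -> str:
    """Map completeness counts and per-claim verdicts to a cell name."""
    # Completeness axis: fixed precedence incomplete > extra > complete.
    if n_uncovered > 0:
        completeness = "incomplete"
    elif n_extra > 0:
        completeness = "extra"
    else:
        completeness = "complete"
    # Faithfulness axis: all() of an empty list is True, so a response with
    # zero claims (a refusal) counts as grounded.
    faithfulness = "grounded" if all(claim_grounded) else "ungrounded"
    return f"{completeness}_{faithfulness}"


# A refusal: one sub-question left uncovered, no claims at all.
print(classify_cell(n_uncovered=1, n_extra=0, claim_grounded=[]))
# An uncovered sub-question wins even when extra claims are present.
print(classify_cell(n_uncovered=1, n_extra=2, claim_grounded=[True, False]))
```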
Sufficiency calibration
“Was retrieval good enough?” depends on your corpus. Rather than a fixed bar, Veralith can learn the Sufficiency HIGH/LOW threshold per knowledge base from your own trace history.
The idea: if a trace reached the healthy outcome — fully grounded and complete — despite imperfect retrieval, then that retrieval was “good enough” for this corpus. Veralith takes the 10th percentile of the sufficiency fraction across those successful traces and uses it as the threshold.
- A fresh project starts strict (a fallback threshold), so HIGH requires strong retrieval.
- Calibration only kicks in after ≥ 20 successful traces exist in the database; until then the fallback is used.
- The level never moves a trace between cells; it only selects between the HIGH and LOW phrasing of the same cell's remediation.
Calibration reads from the SQLite database, so it only applies on the persisted path (persist=True). In-memory evaluations always use the fallback threshold.
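The calibration rule can be illustrated with a short sketch. Several details here are assumptions, not documented behavior: the function names are hypothetical, the percentile uses the nearest-rank method (the interpolation Veralith uses is not stated), the fallback value of 1.0 is a placeholder, and a fraction equal to the threshold is treated as HIGH.

```python
import math

MIN_SUCCESSFUL_TRACES = 20  # from the text: calibration needs >= 20 successes


def calibrated_threshold(success_fractions: list[float], fallback: float) -> float:
    """10th percentile of sufficiency fractions over successful traces."""
    if len(success_fractions) < MIN_SUCCESSFUL_TRACES:
        return fallback  # not enough history yet, stay strict
    ordered = sorted(success_fractions)
    rank = math.ceil(0.10 * len(ordered))  # nearest-rank percentile (assumption)
    return ordered[rank - 1]


def sufficiency_level(fraction: float, threshold: float) -> str:
    """Collapse a sufficiency fraction to 'high' or 'low' (>= is an assumption)."""
    return "high" if fraction >= threshold else "low"


history = [0.6, 0.6] + [0.8] * 18  # 20 successful traces
threshold = calibrated_threshold(history, fallback=1.0)  # hypothetical fallback
print(threshold, sufficiency_level(0.7, threshold))
```

With fewer than 20 successful traces the same call simply returns the fallback, which matches the "starts strict" behavior described above.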
Integration · log()
The minimum-friction entry point. It persists the trace immediately and, by default, runs the evaluation on a background worker thread — so it adds almost nothing to your request latency. It returns the integer trace_id right away.
import veralith
def answer(query: str) -> str:
    chunks = my_retriever(query)  # list[str] | list[dict] | list[ContextChunk]
    response = my_generator(query, chunks)
    veralith.log(query=query, context=chunks, response=response)  # background eval
    return response
Signature
log(query: str, context, response: str, *, sync: bool | None = None) -> int | EvaluationResult
| Param | Type | Notes |
|---|---|---|
| query | str | The user query. Must be non-empty. |
| context | list[str] \| list[dict] \| list[ContextChunk] | Retrieved chunks. Strings and dicts are normalized to ContextChunk; dict keys read are text, rank, source, score. |
| response | str | The generated response. |
| sync | bool \| None | None → uses VERALITH_DEFAULT_SYNC (default False). False → background eval, returns trace_id. True → inline eval, returns EvaluationResult. |
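To make the context normalization concrete, here is a minimal sketch of how mixed inputs could be coerced into one chunk type. It is an illustration only: `Chunk` is a hypothetical stand-in for ContextChunk, `normalize_context` is an invented name, and falling back to the list position when no rank is given is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Chunk:
    """Hypothetical stand-in for ContextChunk."""
    text: str
    rank: int  # 0 = top
    source: Optional[str] = None
    score: Optional[float] = None


def normalize_context(context: list[Union[str, dict, Chunk]]) -> list[Chunk]:
    """Coerce strings and dicts into Chunk objects; pass Chunk objects through."""
    chunks = []
    for position, item in enumerate(context):
        if isinstance(item, Chunk):
            chunks.append(item)
        elif isinstance(item, str):
            chunks.append(Chunk(text=item, rank=position))  # rank from position (assumption)
        elif isinstance(item, dict):
            chunks.append(Chunk(
                text=item["text"],
                rank=item.get("rank", position),
                source=item.get("source"),
                score=item.get("score"),
            ))
        else:
            raise TypeError(f"unsupported context item: {type(item).__name__}")
    return chunks


mixed = ["plain string chunk", {"text": "dict chunk", "source": "faq.md", "score": 0.82}]
for chunk in normalize_context(mixed):
    print(chunk)
```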
A pre-flight budget guard runs before anything is written, so an over-budget trace raises BudgetExceeded without persisting or calling any model. Background evals never crash your process. To avoid losing in-flight evaluations when the process exits, drain them first:
veralith.shutdown(wait=True)  # block until in-flight background evals finish (also runs at exit)
Integration · @trace
Wrap a RAG function and Veralith captures its (response, context) automatically — zero code reshape. The wrapped function still returns just the response to your callers.
import veralith
@veralith.trace
def my_rag(query: str):
    chunks = my_retriever(query)
    response = my_generator(query, chunks)
    return response, chunks  # return (response, context); the decorator captures both
answer = my_rag("How do I reset my password?")  # -> the response string; eval runs in the background
Return either a (response, context) tuple or a TraceReturn(response=..., context=...) when a bare tuple is awkward. The decorator supports bare @trace, parameterized @trace(...), and async functions.
from veralith import trace, TraceReturn
@trace(query_arg="user_question", sync=False, on_error="warn")
def my_rag(user_question: str):
    ...
    return TraceReturn(response=answer, context=chunks)
| Option | Default | Notes |
|---|---|---|
| query_arg | None | Name of the parameter holding the query. Defaults to the first positional arg (or kwargs["query"]). |
| sync | None | Forwarded to log(). |
| on_error | "warn" | "warn" emits a warning if capture/log fails; "silent" swallows it. A BudgetExceeded always propagates. |
Telemetry is best-effort: if Veralith can't extract a clean (response, context) it warns and passes your return value through untouched — it never disturbs your pipeline.
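For readers curious how a capture decorator of this shape can work, the sketch below shows the general pattern: run the function, split the (response, context) pair, record the trace, and hand only the response back. It is not Veralith's implementation. `trace_sketch` and the `captured` list are hypothetical, async functions and TraceReturn are left out, and the query is taken from the first bound parameter unless `query_arg` names another one.

```python
import functools
import inspect
import warnings

captured = []  # stand-in sink; the real decorator forwards traces to log()


def trace_sketch(func=None, *, query_arg=None):
    """Illustrative capture decorator; usable bare or with arguments."""
    def decorate(fn):
        signature = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not (isinstance(result, tuple) and len(result) == 2):
                warnings.warn("could not extract (response, context); passing result through")
                return result  # caller's return value is left untouched
            response, context = result
            try:
                bound = signature.bind(*args, **kwargs)
                name = query_arg or next(iter(bound.arguments))
                captured.append((bound.arguments[name], list(context), response))
            except Exception as exc:  # telemetry is best-effort
                warnings.warn(f"trace capture failed: {exc}")
            return response  # callers only ever see the response
        return wrapper

    # Bare @trace_sketch receives the function directly; @trace_sketch(...) does not.
    return decorate(func) if func is not None else decorate


@trace_sketch
def my_rag(query: str):
    return f"answer to {query!r}", ["chunk one", "chunk two"]


print(my_rag("How do I reset my password?"))
print(captured[-1])
```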
Integration · evaluate()
The low-level orchestrator. It always runs synchronously and returns the full typed EvaluationResult inline — ideal for tests, gating, or pulling a verdict into your own control flow.
evaluate(query: str, context, response: str, *, persist: bool = True, trace_id: int | None = None) -> EvaluationResult
result = veralith.evaluate(query, context, response, persist=False)
if result.failure_cell and result.failure_cell.value.endswith("ungrounded"):
    handle_hallucination(result.faithfulness)  # per-claim verdicts + grounding chunks
| Param | Default | Notes |
|---|---|---|
| persist | True | True writes the trace + all artifacts to SQLite (and enables per-KB calibration). False runs entirely in memory and skips all DB writes. |
| trace_id | None | Reuse an existing trace row (e.g. one log() already persisted) instead of inserting a new one. |
A reserved sync keyword exists for API symmetry with log() but has no effect here — evaluate() is always synchronous.
Integration · LangChain
Zero-code auto-tracing. One install() patches LangChain's retrieval chains so every .invoke() also logs (query, source documents, answer) to Veralith.
import veralith.adapters.langchain as adapter
adapter.install() # patches RetrievalQA + RetrievalQAWithSourcesChain .invoke()
# ... every existing chain.invoke() now auto-traces to Veralith ...
adapter.is_installed() # True
adapter.uninstall()  # restore the originals
install() returns the number of chain classes patched, is idempotent, and raises ImportError only if LangChain isn't installed at all. As with @trace, extraction failures warn and pass the chain result through unchanged, so the adapter never breaks your chain.
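The install / uninstall behavior described above follows a common monkey-patching pattern, sketched below against a fake class so it runs without LangChain. Everything here is illustrative: `FakeChain`, the `logged` list, and the module-level `_originals` registry are hypothetical, and the `query`, `result`, and `source_documents` keys are assumed input/output names rather than something this document specifies.

```python
import warnings


class FakeChain:
    """Stand-in for a retrieval chain class; LangChain itself is not imported."""
    def invoke(self, inputs: dict) -> dict:
        return {"result": f"answer to {inputs['query']}", "source_documents": ["doc-a"]}


_originals: dict = {}  # class -> original invoke, kept so uninstall() can restore it
logged: list = []  # stand-in sink for (query, documents, answer)


def _wrap(original):
    def patched_invoke(self, inputs, *args, **kwargs):
        output = original(self, inputs, *args, **kwargs)
        try:
            logged.append((inputs["query"], output["source_documents"], output["result"]))
        except Exception as exc:  # never break the chain over telemetry
            warnings.warn(f"trace extraction failed: {exc}")
        return output  # chain result passes through unchanged
    return patched_invoke


def install(classes=(FakeChain,)) -> int:
    for cls in classes:
        if cls not in _originals:  # idempotent: skip classes already patched
            _originals[cls] = cls.invoke
            cls.invoke = _wrap(cls.invoke)
    return len(_originals)


def is_installed() -> bool:
    return bool(_originals)


def uninstall() -> None:
    for cls, original in _originals.items():
        cls.invoke = original
    _originals.clear()


install()
FakeChain().invoke({"query": "What is a P/E ratio?"})
print(logged[-1])
uninstall()
```

Keeping the originals in a registry is what makes both idempotence and a clean uninstall possible: a class is wrapped at most once, and the exact original method is put back.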
The result object
evaluate() (and log(sync=True)) return an EvaluationResult — a typed Pydantic model. Every field is structured and inspectable.
| Field | Type | Meaning |
|---|---|---|
| trace_id | int | DB id of the trace (negative when persist=False). |
| query | str | The original query. |
| sub_questions | list[SubQuestion] | The decomposed query {Qᵢ}. |
| claims | list[Claim] | The decomposed response {Rᵢ}. |
| sufficiency | list[SufficiencyJudgment] | Per-Qᵢ verdicts; empty if the sufficiency judge failed. |
| faithfulness | list[FaithfulnessJudgment] | Per-Rᵢ verdicts + grounding chunk ranks; empty if it failed. |
| completeness | CompletenessJudgment \| None | Qᵢ↔Rᵢ alignment; None if it failed. |
| diagnosis | Diagnosis \| None | Failure cell + sufficiency level + counts; None if the cell couldn't be determined. |
| suggestion | Suggestion | Remediation (always present). |
| created_at | datetime | UTC timestamp. |
| errors | dict[str, str] | metric → error message, for any judge that failed. |
| latency_ms | dict[str, float] | phase → elapsed milliseconds. |
The top-level result.failure_cell property is a shortcut for result.diagnosis.failure_cell (or None). A Diagnosis also carries the supporting signals: sufficiency_level (HIGH/LOW), sufficiency_fraction, faithfulness_fraction, and counts — n_sub_questions, n_claims, n_uncovered_sub_questions, n_extra_claims.
Nested judgment models
| Model | Key fields |
|---|---|
| SubQuestion / Claim | id, text, order_idx |
| SufficiencyJudgment | sub_question_id, verdict (Y/N), reasoning, supporting_chunk_ranks |
| FaithfulnessJudgment | claim_id, verdict (Y/N), reasoning, grounding_chunk_ranks |
| CompletenessJudgment | overall (complete/incomplete/extra), mappings (Qᵢ → covering Rᵢ or None), extra_claim_ids, reasoning |
| Suggestion | title, body, actions: list[str] |
| ContextChunk | text, rank (0 = top), source?, score? |
Reading a verdict
r = veralith.evaluate(query, context, response, persist=False)
print(r.failure_cell.value) # e.g. 'complete_ungrounded'
print(r.diagnosis.sufficiency_level.value) # 'high' | 'low'
for claim, judgment in zip(r.claims, r.faithfulness):
    if judgment.verdict.value == "N":
        print("UNGROUNDED:", claim.text, "-", judgment.reasoning)
print(r.suggestion.title)
for step in r.suggestion.actions:
    print(" -", step)
Configuration
Defaults work out of the box. Everything is tunable via environment variables (read once at import) or the veralith.config.settings singleton.
| Environment variable | Default | Purpose |
|---|---|---|
| OPENAI_API_KEY | (none) | Required. |
| VERALITH_JUDGE_MODEL | gpt-4o | Model for the S / F / C judges. |
| VERALITH_DECOMPOSER_MODEL | gpt-4o-mini | Model for query / response decomposition. |
| VERALITH_EMBED_MODEL | text-embedding-3-small | Embedding model (cost tracking). |
| VERALITH_DB_PATH | veralith.db | SQLite persistence path. |
| VERALITH_BATCH_SIZE | 5 | Per-item judge batch size. |
| VERALITH_DEFAULT_SYNC | False | Default sync for log() / @trace. |
| VERALITH_PER_TRACE_BUDGET_USD | 0.50 | Pre-flight budget ceiling per trace. |
| VERALITH_WORKER_CONCURRENCY | 4 | Background eval thread-pool size. |
| VERALITH_CACHE_ENABLED | True | Enable the LLM-result cache. |
Booleans accept 1, true, yes, on (case-insensitive). For tests and scoped tweaks, settings.override(...) is a context manager that validates keys and restores them on exit:
from veralith.config import settings
settings.judge_model # 'gpt-4o'
with settings.override(batch_size=1):
    ...  # scoped change, auto-restored on exit
Command-line interface
Installing the package adds a veralith console script with four subcommands.
# Batch-evaluate a JSONL file (each line: {"query", "context", "response"})
veralith eval traces.jsonl
veralith eval traces.jsonl --concurrency 8
veralith eval traces.jsonl --no-persist # dry run, do not write to the DB
# Inspect one trace by id (read-only, colorized)
veralith inspect 42
# List recent traces
veralith list --limit 50
# Failure-cell distribution + totals across the DB
veralith stats
| Command | Args | Does |
|---|---|---|
| veralith eval | <file.jsonl> · --concurrency (4) · --no-persist | Evaluates each record concurrently and prints per-cell counts + total cost. |
| veralith inspect | <trace_id> | Pretty-prints one trace: claims, verdicts, completeness, suggestion. |
| veralith list | --limit (20) | Most recent traces with their failure cell. |
| veralith stats | (none) | Totals + an ASCII bar chart of the failure-cell distribution. |
veralith eval exits 0 on a clean run, 1 if the input is missing/empty, and 2 if it finished with one or more per-record errors.
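If your traces live in application logs rather than a file, a few lines of Python are enough to produce the JSONL input that `veralith eval` expects. This is a sketch under assumptions: `write_traces` is a hypothetical helper, the sample record is made up, and `context` is written as a list of strings (one of the accepted context shapes).

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"query", "context", "response"}

records = [
    {
        "query": "What is a P/E ratio?",
        "context": ["The P/E ratio is share price divided by earnings per share."],
        "response": "A P/E ratio divides share price by earnings per share.",
    },
]


def write_traces(path: Path, traces: list[dict]) -> int:
    """Write one JSON object per line; reject records missing a required key."""
    with path.open("w", encoding="utf-8") as handle:
        for index, trace in enumerate(traces):
            missing = REQUIRED_KEYS - trace.keys()
            if missing:
                raise ValueError(f"record {index} is missing {sorted(missing)}")
            handle.write(json.dumps(trace, ensure_ascii=False) + "\n")
    return len(traces)


# Written to a temporary directory here; in practice use a path such as traces.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "traces.jsonl"
    print(write_traces(target, records), "record(s) written")
    print(target.read_text(encoding="utf-8").strip())
```

Validating keys before writing keeps a malformed record from surfacing later as a per-record error (exit code 2) in the middle of a batch run.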
Cost & budget
Veralith tracks token usage and USD per call, and guards every evaluation with a pre-flight budget estimate. Instrument your client once, then attribute cost per trace with a scope:
import veralith
from veralith.observability.cost import (
    instrument, get_tracker, CostScope, enforce_budget, BudgetExceeded,
)
from veralith.llm import get_client
instrument(get_client())  # record every LLM call this client makes
tracker = get_tracker()
with CostScope() as scope:  # per-trace attribution
    result = veralith.evaluate(query, context, response)
print(scope.usd, scope.tokens)  # read AFTER the block exits
print(tracker.total_usd)
try:  # pre-flight guard (raises before any model call)
    enforce_budget(query, context, response)  # uses VERALITH_PER_TRACE_BUDGET_USD
except BudgetExceeded as e:
    print(e.estimated_usd, e.budget_usd)
instrument() is idempotent and records per-model usage by diffing the client's token counters around each chat / structured / embed call. The budget guard runs automatically inside log() and the eval CLI, so an over-budget trace fails fast and cheap.
The built-in pricing table is a convenience estimate — verify it against current OpenAI pricing before relying on it for production budgeting.
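To show what a pre-flight guard of this kind involves, here is a deliberately crude sketch. None of it reflects Veralith's actual estimator: the names ending in `_sketch`, the four-characters-per-token heuristic, the assumption that each of roughly five calls re-sends the whole trace, and the price passed in by the caller are all illustrative. Only the 0.50 default ceiling and the `estimated_usd` / `budget_usd` attribute names come from the text above.

```python
class BudgetExceededSketch(Exception):
    """Local stand-in exposing the estimated_usd / budget_usd attributes."""
    def __init__(self, estimated_usd: float, budget_usd: float):
        super().__init__(f"estimated ${estimated_usd:.4f} exceeds budget ${budget_usd:.2f}")
        self.estimated_usd = estimated_usd
        self.budget_usd = budget_usd


def estimate_usd(query: str, context: list[str], response: str, *,
                 usd_per_1k_tokens: float, n_calls: int = 5) -> float:
    """Estimate cost before any model call, from text length alone."""
    chars = len(query) + sum(len(chunk) for chunk in context) + len(response)
    tokens = chars / 4  # crude heuristic: roughly 4 characters per token
    return n_calls * (tokens / 1000) * usd_per_1k_tokens


def enforce_budget_sketch(query: str, context: list[str], response: str, *,
                          usd_per_1k_tokens: float, budget_usd: float = 0.50) -> float:
    """Raise before spending anything if the estimate is over the ceiling."""
    estimated = estimate_usd(query, context, response, usd_per_1k_tokens=usd_per_1k_tokens)
    if estimated > budget_usd:
        raise BudgetExceededSketch(estimated, budget_usd)
    return estimated


# The price below is a placeholder, not a real model price.
try:
    enforce_budget_sketch("q", ["x" * 400_000], "r", usd_per_1k_tokens=0.01)
except BudgetExceededSketch as exc:
    print(exc.estimated_usd > exc.budget_usd, exc)
```

Because the check uses only string lengths, it costs nothing to run and can sit in front of every evaluation.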
Persistence
When persist=True (the default for log() / evaluate()), Veralith writes to a local SQLite database — veralith.db by default, or VERALITH_DB_PATH. The schema mirrors the result models one-to-one: traces and their context_chunks, sub_questions, claims, the three judgment tables, completeness with its mappings/extras, plus a grounding join table.
The database is self-bootstrapping: the first connection creates the schema and runs any pending migrations, so library callers never have to initialize it explicitly. An LLM-result cache (in-process LRU in front of a SQLite cache table) keeps repeated decomposition / judge calls cheap.
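The "LRU in front of a SQLite table" layout can be sketched with the standard library alone. This is a generic two-tier cache, not Veralith's code: the class name, the `llm_cache` table, its columns, and the choice of hashing model plus prompt into a key are all assumptions made for the example.

```python
import hashlib
import sqlite3
from collections import OrderedDict
from typing import Optional


def cache_key(model: str, prompt: str) -> str:
    """Stable key for a (model, prompt) pair (key design is an assumption)."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class TwoTierCache:
    """Small in-process LRU backed by a durable SQLite table."""

    def __init__(self, conn: sqlite3.Connection, max_items: int = 128):
        self._conn = conn
        self._max = max_items
        self._lru: OrderedDict = OrderedDict()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def _remember(self, key: str, value: str) -> None:
        self._lru[key] = value
        self._lru.move_to_end(key)  # mark as most recently used
        while len(self._lru) > self._max:
            self._lru.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> Optional[str]:
        if key in self._lru:  # tier 1: memory
            self._lru.move_to_end(key)
            return self._lru[key]
        row = self._conn.execute(
            "SELECT value FROM llm_cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None  # miss in both tiers
        self._remember(key, row[0])  # promote a tier-2 hit into memory
        return row[0]

    def put(self, key: str, value: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO llm_cache (key, value) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()
        self._remember(key, value)


cache = TwoTierCache(sqlite3.connect(":memory:"), max_items=1)
first = cache_key("gpt-4o", "Is claim 1 grounded?")
second = cache_key("gpt-4o", "Is claim 2 grounded?")
cache.put(first, "Y")
cache.put(second, "N")  # evicts `first` from memory, but not from SQLite
print(cache.get(first))
```

The memory tier absorbs repeated lookups within a process, while the SQLite tier lets results survive restarts.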
Read a verdict back later with the CLI (veralith inspect <id>) or query the tables directly — the traces row carries the rolled-up failure_cell, overall_verdict, and the sufficiency / faithfulness fractions for fast aggregation.
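As an example of querying the tables directly, the sketch below aggregates traces by failure cell with the standard `sqlite3` module. To stay runnable it builds a tiny in-memory table that has only the `traces.failure_cell` column named above; the real schema has more columns and may differ in detail, and the sample rows are invented.

```python
import sqlite3


def cell_distribution(conn: sqlite3.Connection) -> dict[str, int]:
    """Count traces per failure cell, most frequent first."""
    rows = conn.execute(
        "SELECT failure_cell, COUNT(*) AS n FROM traces "
        "GROUP BY failure_cell ORDER BY n DESC, failure_cell"
    ).fetchall()
    return dict(rows)


# Stand-in database; against a real project, connect to veralith.db
# (or the path in VERALITH_DB_PATH) instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, failure_cell TEXT)")
conn.executemany(
    "INSERT INTO traces (failure_cell) VALUES (?)",
    [("complete_grounded",), ("complete_grounded",), ("incomplete_grounded",)],
)
print(cell_distribution(conn))
```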
Status & roadmap
alpha · 0.1.x
The public API is stable: expect additions, not breaking changes.
In 0.1
- Three judges (Sufficiency, Faithfulness, Completeness) with batched LLM calls.
- Six-cell diagnostic classifier and rule-based suggester.
- Outcome-based sufficiency-threshold calibration per knowledge base.
- SDK: log(), @trace, the LangChain adapter, and a background eval worker.
- SQLite persistence with self-healing migrations, an LLM-result cache, and a CLI.
- Cost tracking with a per-trace budget guard.
On the roadmap
- LLM-enriched, trace-specific suggestions.
- Cross-trace pattern detection (“you keep hallucinating on time-sensitive queries”).
- Additional judges (reasoning validity, temporal validity) and framework adapters (LlamaIndex, raw OpenAI tools).
- A hosted dashboard with multi-tenant projects.