Veralith · v0.1.x (alpha)

Hallucination diagnosis for RAG

Wrap one line around your retrieval pipeline and get a structured report on what failed and how to fix it — not just a single yes/no hallucination flag. Veralith decomposes every (query, context, response) trace, runs three LLM-as-judge metrics over it, and classifies it into one of six diagnostic cells with a concrete remediation.

Overview

A monolithic “is this response hallucinated?” judge is a smoke alarm — it tells you something is wrong, but not what or where. Veralith is the diagnostic dashboard behind the alarm. For each trace it answers three independent questions:

  • Sufficiency — was the retrieval good enough to answer each part of the query?
  • Faithfulness — is each claim in the response grounded in the retrieved context?
  • Completeness — does the response actually answer every part of the query (and stay on topic)?

Cross-tabulating these gives a named failure mode (retrieval gap, intrinsic hallucination, padded answer, …) plus actionable fixes (lower temperature, bump retrieval‑K, tighten the generator prompt, …) for every trace.

Veralith evaluates traces you already have — it does not sit in your request path or change your responses. It only needs the query, the retrieved chunks, and the response your system produced.

Installation

Veralith targets Python 3.10+ and uses the OpenAI API for its judges.

bash
pip install veralith

Optional extras:

bash
pip install "veralith[langchain]"   # LangChain auto-tracing adapter
pip install "veralith[sample]"      # chromadb, for the sample RAG app
pip install "veralith[dev]"         # pytest, ruff, build, twine (contributors)

Set your OpenAI key (read once at import via python-dotenv, so a .env file works too):

bash
export OPENAI_API_KEY=sk-...

30-second quickstart

Run one synchronous evaluation and read the diagnosis straight off the typed result — no database, no polling:

python
import veralith

result = veralith.evaluate(
    query="What is a P/E ratio and what was Apple's P/E in 2023?",
    context=[
        "The price-to-earnings (P/E) ratio is a company's share price "
        "divided by its earnings per share."
    ],
    response=(
        "A P/E ratio divides share price by earnings per share. "
        "Apple's P/E in 2023 was 42.7."
    ),
    persist=False,                       # run entirely in memory
)

print(result.failure_cell.value)        # 'incomplete_ungrounded'
print(result.suggestion.title)          # 'Worst-case failure'
for action in result.suggestion.actions:
    print(" -", action)

Here the retrieved context never covers Apple's P/E, so the 42.7 in the response is an invented, ungrounded number. Veralith lands the trace in the worst-case cell and returns concrete next steps. You get back a typed EvaluationResult with per-claim verdicts, per-sub-question sufficiency, a failure-cell diagnosis, and a suggestion.

How it works

Every evaluation runs the same fixed four-stage pipeline over one trace, the triple (query Q, context C, response R):

  1. Decompose: split Q into atomic sub-questions {Qᵢ} and R into atomic claims {Rᵢ}.
  2. Judge ×3: Sufficiency (per Qᵢ), Faithfulness (per Rᵢ), Completeness (Rᵢ ↔ Qᵢ alignment).
  3. Classify: cross-tab Completeness × Faithfulness into one of six failure cells.
  4. Suggest: map the diagnosis to a concrete, rule-based remediation.

Decomposition is deliberately conservative: the splitter only acts on content literally present in the text, resolves pronouns so each piece is self-contained, and never invents sub-topics. A single-purpose query stays one sub-question; a refusal (“I can't answer that from the context”) yields zero claims — a case the pipeline handles explicitly.

Each of the three judges is isolated: if one fails (an API error, a malformed verdict), the others still complete and the failure is recorded in result.errors rather than aborting the whole evaluation. Per-phase wall-clock timings land in result.latency_ms.
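A minimal sketch of that isolation pattern, assuming each judge is a plain callable. `run_isolated` and the stand-in judges are hypothetical names and not part of Veralith's API; only the idea of collecting per-judge `errors` and `latency_ms` comes from the text above.

```python
import time
from typing import Any, Callable


def run_isolated(judges: dict[str, Callable[[], Any]]):
    """Run each judge independently; one failure never aborts the others."""
    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    latency_ms: dict[str, float] = {}
    for name, judge in judges.items():
        start = time.perf_counter()
        try:
            results[name] = judge()
        except Exception as exc:  # e.g. an API error or a malformed verdict
            errors[name] = f"{type(exc).__name__}: {exc}"
        finally:
            # Timing is recorded whether the judge succeeded or failed.
            latency_ms[name] = (time.perf_counter() - start) * 1000.0
    return results, errors, latency_ms


def broken_faithfulness():
    # Hypothetical judge that fails, standing in for a real API error.
    raise RuntimeError("malformed verdict")


results, errors, latency_ms = run_isolated({
    "sufficiency": lambda: ["Y", "N"],
    "faithfulness": broken_faithfulness,
    "completeness": lambda: "incomplete",
})
print(sorted(results))   # ['completeness', 'sufficiency']
print(errors)            # {'faithfulness': 'RuntimeError: malformed verdict'}
```

The two healthy judges still return their verdicts, and the failed one shows up only as an entry in `errors`.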

Cost. A typical evaluation is ~5 LLM calls — 2 decomposition calls on the cheaper decomposer model plus 3 batched judges — roughly $0.005 / trace on the default models. The per-item judges batch in groups of 5, so 15 sub-questions cost 3 calls, not 15.
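The call arithmetic above can be checked with a small sketch. It assumes one decomposition call each for the query and the response, ceiling-division batching for the two per-item judges, and a single Completeness call. `estimated_llm_calls` is a hypothetical helper, not a Veralith function.

```python
import math


def estimated_llm_calls(n_sub_questions: int, n_claims: int, batch_size: int = 5) -> int:
    """Rough LLM call count for one trace under the assumptions stated above."""
    decomposition = 2                                       # assumed: one for Q, one for R
    sufficiency = math.ceil(n_sub_questions / batch_size)   # batched per sub-question
    faithfulness = math.ceil(n_claims / batch_size)         # batched per claim
    completeness = 1                                        # single whole-trace call
    return decomposition + sufficiency + faithfulness + completeness


print(estimated_llm_calls(2, 3))     # 5  (the "typical" case)
print(math.ceil(15 / 5))             # 3  (15 sub-questions in batches of 5)
print(estimated_llm_calls(15, 3))    # 7
```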

The three metrics

The metrics measure independent things. Two are per-item binary judges; Completeness is a single whole-trace alignment call.

Sufficiency

Per sub-question Qᵢ: do the retrieved chunks contain enough information to answer it — using only the chunks? A retrieval-quality signal.

verdict Y / N · per Qᵢ · batched

Faithfulness

Per claim Rᵢ: is it grounded in the chunks? Checks grounding, not correctness — a plausible inference beyond the context still counts as ungrounded; strict on numbers, dates, and entities.

verdict Y / N · per Rᵢ · batched

Completeness

Whole trace: does R cover every Qᵢ and stay on topic? Definitions, formulas, examples, caveats, and history of the asked topic are on-topic — only a claim about a different topic is “extra”.

verdict complete / incomplete / extra · whole trace · single call

Sufficiency is a deliberate sidecar: it does not decide which failure cell you land in. It is collapsed to a HIGH / LOW level that only refines which remediation text you get (see Sufficiency calibration below).
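One way to picture that collapse, as a sketch only: it assumes the sufficiency fraction is the share of "Y" verdicts and that HIGH means "at or above the threshold". The function name, the 0.8 threshold, and the handling of an empty verdict list are illustrative assumptions, not documented Veralith behavior.

```python
def sufficiency_level(verdicts: list[str], threshold: float) -> tuple[float, str]:
    """Collapse per-sub-question Y/N verdicts into (fraction, 'high' | 'low')."""
    if not verdicts:
        return 0.0, "low"            # assumption: no verdicts is treated as LOW
    fraction = sum(v == "Y" for v in verdicts) / len(verdicts)
    level = "high" if fraction >= threshold else "low"
    return fraction, level


print(sufficiency_level(["Y", "Y", "N", "Y"], threshold=0.8))   # (0.75, 'low')
print(sufficiency_level(["Y", "Y"], threshold=0.8))             # (1.0, 'high')
```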

The six failure cells

Each trace lands in exactly one cell — a strict cross-tab of Completeness (3 rows: does R cover Q?) × Faithfulness (2 columns: is every claim grounded?). The cell name follows the pattern <completeness>_<faithfulness>, so you can decode any cell without a lookup chart: read it as “the response is <X> and the claims are <Y>.”

Grounded means every claim is supported; Ungrounded means some claim is invented.

Cell                   Completeness  Faithfulness  Label              Meaning
---------------------  ------------  ------------  -----------------  -------
complete_grounded      Complete      Grounded      healthy            Answers everything; every claim is grounded.
complete_ungrounded    Complete      Ungrounded    hallucination      Answers everything, but at least one claim is fabricated.
incomplete_grounded    Incomplete    Grounded      gap                Misses part of the query; what is there is grounded.
incomplete_ungrounded  Incomplete    Ungrounded    worst case         Misses parts and fabricates within what it did answer.
extra_grounded         Extra         Grounded      padded             Adds unrequested content; everything is still grounded.
extra_ungrounded       Extra         Ungrounded    padded + invented  Adds unrequested content and fabricates some of it.

The Completeness axis follows a fixed precedence — incomplete > extra > complete: an uncovered sub-question always makes a trace incomplete, even if it also has extra claims. A response with zero claims (a refusal) is treated as grounded — there is nothing to be ungrounded about — so a refusal lands in incomplete_grounded.
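The precedence and refusal rules above can be written as a small rule-based function. This is a sketch under stated assumptions, not Veralith's classifier: `classify` is a hypothetical name, and its inputs are simplified to two counts plus a list of "Y" / "N" faithfulness verdicts.

```python
def classify(n_uncovered_sub_questions: int, n_extra_claims: int,
             faithfulness_verdicts: list[str]) -> str:
    """Return a '<completeness>_<faithfulness>' cell name."""
    # Completeness axis, fixed precedence: incomplete > extra > complete.
    if n_uncovered_sub_questions > 0:
        completeness = "incomplete"
    elif n_extra_claims > 0:
        completeness = "extra"
    else:
        completeness = "complete"
    # Faithfulness axis: a single "N" makes the trace ungrounded.
    # all() over an empty list is True, so a zero-claim refusal counts as grounded.
    grounded = all(v == "Y" for v in faithfulness_verdicts)
    return f"{completeness}_{'grounded' if grounded else 'ungrounded'}"


print(classify(1, 2, ["Y", "N"]))   # incomplete_ungrounded (incomplete wins over extra)
print(classify(1, 0, []))           # incomplete_grounded   (a refusal)
print(classify(0, 1, ["Y"]))        # extra_grounded
```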

Sufficiency calibration

“Was retrieval good enough?” depends on your corpus. Rather than a fixed bar, Veralith can learn the Sufficiency HIGH/LOW threshold per knowledge base from your own trace history.

The idea: if a trace reached the healthy outcome — fully grounded and complete — despite imperfect retrieval, then that retrieval was “good enough” for this corpus. Veralith takes the 10th percentile of the sufficiency fraction across those successful traces and uses it as the threshold.

  • A fresh project starts strict (a fallback threshold), so HIGH requires strong retrieval.
  • Calibration only kicks in after ≥ 20 successful traces exist in the database; until then the fallback is used.
  • The level never moves a trace between cells — it only selects between the HIGH and LOW phrasing of the same cell's remediation.

Calibration reads from the SQLite database, so it only applies on the persisted path (persist=True). In-memory evaluations always use the fallback threshold.
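A sketch of the thresholding rule, assuming the sufficiency fractions of successful traces are already collected in a list. The fallback value of 0.9 and the interpolation method are placeholders chosen for illustration; the text above fixes only the 10th percentile and the 20-trace minimum.

```python
import statistics

MIN_SUCCESSFUL_TRACES = 20     # from the text above
FALLBACK_THRESHOLD = 0.9       # hypothetical value; the real fallback is not stated here


def calibrated_threshold(successful_fractions: list[float]) -> float:
    """10th percentile of sufficiency fractions across healthy traces, else the fallback."""
    if len(successful_fractions) < MIN_SUCCESSFUL_TRACES:
        return FALLBACK_THRESHOLD
    # quantiles(n=10) returns the nine decile cut points; the first is the 10th percentile.
    return statistics.quantiles(successful_fractions, n=10, method="inclusive")[0]


history = [k / 20 for k in range(21)]            # 21 made-up fractions from 0.0 to 1.0
print(calibrated_threshold([1.0] * 5))           # 0.9  (too few traces, fallback used)
print(round(calibrated_threshold(history), 3))   # 0.1
```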


Integration · log()

The minimum-friction entry point. It persists the trace immediately and, by default, runs the evaluation on a background worker thread — so it adds almost nothing to your request latency. It returns the integer trace_id right away.

python
import veralith

def answer(query: str) -> str:
    chunks = my_retriever(query)            # list[str] | list[dict] | list[ContextChunk]
    response = my_generator(query, chunks)

    veralith.log(query=query, context=chunks, response=response)   # background eval
    return response

Signature

python
log(query: str, context, response: str, *, sync: bool | None = None) -> int | EvaluationResult

Param     Type                                         Notes
--------  -------------------------------------------  -----
query     str                                          The user query. Must be non-empty.
context   list[str] | list[dict] | list[ContextChunk]  Retrieved chunks. Strings and dicts are normalized to ContextChunk; dict keys read are text, rank, source, score.
response  str                                          The generated response.
sync      bool | None                                  None → uses VERALITH_DEFAULT_SYNC (default False). False → background eval, returns trace_id. True → inline eval, returns EvaluationResult.

A pre-flight budget guard runs before anything is written, so an over-budget trace raises BudgetExceeded without persisting or calling any model. Background evals never crash your process. If your process may exit while evaluations are still in flight, drain them first:

python
veralith.shutdown(wait=True)   # block until in-flight background evals finish (also runs at exit)
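
The mixed shapes that log() accepts for context could be normalized roughly as follows. This is a sketch, not the library's code: `Chunk` is a hypothetical stand-in for ContextChunk, and falling back to the list position when a dict has no rank is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Chunk:                       # hypothetical stand-in for veralith's ContextChunk
    text: str
    rank: int                      # 0 = top
    source: Optional[str] = None
    score: Optional[float] = None


def normalize(context: list[Union[str, dict, Chunk]]) -> list[Chunk]:
    """Turn strings and dicts into Chunk objects; pass existing Chunks through."""
    chunks = []
    for position, item in enumerate(context):
        if isinstance(item, Chunk):
            chunks.append(item)
        elif isinstance(item, str):
            chunks.append(Chunk(text=item, rank=position))
        elif isinstance(item, dict):
            chunks.append(Chunk(
                text=item["text"],
                rank=item.get("rank", position),   # assumption: list position if rank is absent
                source=item.get("source"),
                score=item.get("score"),
            ))
        else:
            raise TypeError(f"unsupported context item: {type(item).__name__}")
    return chunks


print(normalize(["first chunk", {"text": "second chunk", "score": 0.7}]))
```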

Integration · @trace

Wrap a RAG function and Veralith captures its (response, context) automatically — zero code reshape. The wrapped function still returns just the response to your callers.

python
import veralith

@veralith.trace
def my_rag(query: str):
    chunks = my_retriever(query)
    response = my_generator(query, chunks)
    return response, chunks            # return (response, context) — the decorator captures both

answer = my_rag("How do I reset my password?")   # -> the response string; eval runs in the background

Return either a (response, context) tuple or a TraceReturn(response=..., context=...) when a bare tuple is awkward. The decorator supports bare @trace, parameterized @trace(...), and async functions.

python
from veralith import trace, TraceReturn

@trace(query_arg="user_question", sync=False, on_error="warn")
def my_rag(user_question: str):
    ...
    return TraceReturn(response=answer, context=chunks)

Option     Default  Notes
---------  -------  -----
query_arg  None     Name of the parameter holding the query. Defaults to the first positional arg (or kwargs["query"]).
sync       None     Forwarded to log().
on_error   "warn"   "warn" emits a warning if capture/log fails; "silent" swallows it. A BudgetExceeded always propagates.

Telemetry is best-effort: if Veralith can't extract a clean (response, context) it warns and passes your return value through untouched — it never disturbs your pipeline.
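To make the pass-through behavior concrete, here is a generic decorator in the same spirit. It is an illustration of the pattern only, not Veralith's implementation: `trace_sketch` and the `captured` list are hypothetical, and a real decorator would hand the captured triple to log() rather than a list.

```python
import functools
import warnings

captured = []   # stand-in sink for (query, context, response) triples


def trace_sketch(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Best-effort capture: anything that is not a 2-tuple is passed through untouched.
        if not (isinstance(result, tuple) and len(result) == 2):
            warnings.warn("could not extract (response, context); passing result through")
            return result
        response, context = result
        query = args[0] if args else kwargs.get("query")   # first positional arg, else "query"
        captured.append((query, context, response))
        return response                                     # callers only see the response
    return wrapper


@trace_sketch
def my_rag(query):
    return f"answer to {query}", ["chunk one"]


@trace_sketch
def plain(query):
    return "just a string"      # not a (response, context) tuple


print(my_rag("reset password"))            # answer to reset password
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(plain("hello"))                  # just a string
print(len(captured))                       # 1
```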

Integration · evaluate()

The low-level orchestrator. It always runs synchronously and returns the full typed EvaluationResult inline — ideal for tests, gating, or pulling a verdict into your own control flow.

python
evaluate(query: str, context, response: str, *, persist: bool = True, trace_id: int | None = None) -> EvaluationResult

python
result = veralith.evaluate(query, context, response, persist=False)

if result.failure_cell and result.failure_cell.value.endswith("ungrounded"):
    handle_hallucination(result.faithfulness)   # per-claim verdicts + grounding chunks

Param     Default  Notes
--------  -------  -----
persist   True     True writes the trace + all artifacts to SQLite (and enables per-KB calibration). False runs entirely in memory and skips all DB writes.
trace_id  None     Reuse an existing trace row (e.g. one log() already persisted) instead of inserting a new one.

A reserved sync keyword exists for API symmetry with log() but has no effect here — evaluate() is always synchronous.
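For gating, the cell value can be decoded without a lookup table because it follows the <completeness>_<faithfulness> pattern. The helpers below are hypothetical and operate on the plain string (what result.failure_cell.value would hold, or None). Blocking when no cell could be determined is a policy choice made for this sketch, not something Veralith prescribes.

```python
from typing import Optional


def decode_cell(cell: Optional[str]) -> Optional[tuple[str, str]]:
    """Split '<completeness>_<faithfulness>' into its two axes; None stays None."""
    if cell is None:
        return None
    completeness, faithfulness = cell.rsplit("_", 1)
    return completeness, faithfulness


def should_block(cell: Optional[str]) -> bool:
    """Hypothetical gate: block ungrounded traces and traces with no diagnosis."""
    decoded = decode_cell(cell)
    return decoded is None or decoded[1] == "ungrounded"


print(decode_cell("extra_ungrounded"))      # ('extra', 'ungrounded')
print(should_block("complete_grounded"))    # False
print(should_block(None))                   # True
```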

Integration · LangChain

Zero-code auto-tracing. One install() patches LangChain's retrieval chains so every .invoke() also logs (query, source documents, answer) to Veralith.

python
import veralith.adapters.langchain as adapter

adapter.install()           # patches RetrievalQA + RetrievalQAWithSourcesChain .invoke()
# ... every existing chain.invoke() now auto-traces to Veralith ...

adapter.is_installed()      # True
adapter.uninstall()         # restore the originals

install() returns the number of chain classes patched, is idempotent, and raises ImportError only if LangChain isn't installed at all. As with @trace, extraction failures warn and pass the chain result through unchanged — the adapter never breaks your chain.


The result object

evaluate() (and log(sync=True)) return an EvaluationResult — a typed Pydantic model. Every field is structured and inspectable.

Field          Type                         Meaning
-------------  ---------------------------  -------
trace_id       int                          DB id of the trace (negative when persist=False).
query          str                          The original query.
sub_questions  list[SubQuestion]            The decomposed query {Qᵢ}.
claims         list[Claim]                  The decomposed response {Rᵢ}.
sufficiency    list[SufficiencyJudgment]    Per-Qᵢ verdicts; empty if the sufficiency judge failed.
faithfulness   list[FaithfulnessJudgment]   Per-Rᵢ verdicts + grounding chunk ranks; empty if it failed.
completeness   CompletenessJudgment | None  Qᵢ↔Rᵢ alignment; None if it failed.
diagnosis      Diagnosis | None             Failure cell + sufficiency level + counts; None if the cell couldn't be determined.
suggestion     Suggestion                   Remediation (always present).
created_at     datetime                     UTC timestamp.
errors         dict[str, str]               metric → error message, for any judge that failed.
latency_ms     dict[str, float]             phase → elapsed milliseconds.

The top-level result.failure_cell property is a shortcut for result.diagnosis.failure_cell (or None). A Diagnosis also carries the supporting signals: sufficiency_level (HIGH/LOW), sufficiency_fraction, faithfulness_fraction, and counts — n_sub_questions, n_claims, n_uncovered_sub_questions, n_extra_claims.

Nested judgment models

Model                 Key fields
--------------------  ----------
SubQuestion / Claim   id, text, order_idx
SufficiencyJudgment   sub_question_id, verdict (Y/N), reasoning, supporting_chunk_ranks
FaithfulnessJudgment  claim_id, verdict (Y/N), reasoning, grounding_chunk_ranks
CompletenessJudgment  overall (complete/incomplete/extra), mappings (Qᵢ → covering Rᵢ or None), extra_claim_ids, reasoning
Suggestion            title, body, actions: list[str]
ContextChunk          text, rank (0 = top), source?, score?

Reading a verdict

python
r = veralith.evaluate(query, context, response, persist=False)

if r.diagnosis is not None:                          # None if the cell couldn't be determined
    print(r.failure_cell.value)                      # e.g. 'complete_ungrounded'
    print(r.diagnosis.sufficiency_level.value)       # 'high' | 'low'

for claim, judgment in zip(r.claims, r.faithfulness):
    if judgment.verdict.value == "N":
        print("UNGROUNDED:", claim.text, "—", judgment.reasoning)

print(r.suggestion.title)
for step in r.suggestion.actions:
    print(" -", step)
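
Pairing claims with judgments by position works when both lists are complete and in the same order. A slightly more defensive variant joins them on claim_id, as sketched below. The two dataclasses are simplified stand-ins for the real models (the verdict is a plain "Y" / "N" string here rather than an enum, and integer ids are an assumption).

```python
from dataclasses import dataclass


@dataclass
class ClaimStub:            # stand-in for Claim: only id and text
    id: int
    text: str


@dataclass
class JudgmentStub:         # stand-in for FaithfulnessJudgment
    claim_id: int
    verdict: str            # "Y" or "N"
    reasoning: str


def ungrounded_claims(claims, judgments):
    """Return (claim text, reasoning) for every claim judged 'N', joined on claim_id."""
    by_claim = {j.claim_id: j for j in judgments}
    return [
        (c.text, by_claim[c.id].reasoning)
        for c in claims
        if c.id in by_claim and by_claim[c.id].verdict == "N"
    ]


claims = [ClaimStub(1, "A P/E ratio divides price by EPS."),
          ClaimStub(2, "Apple's P/E in 2023 was 42.7.")]
judgments = [JudgmentStub(2, "N", "No chunk mentions Apple."),
             JudgmentStub(1, "Y", "Stated in chunk 0.")]
print(ungrounded_claims(claims, judgments))
# [("Apple's P/E in 2023 was 42.7.", 'No chunk mentions Apple.')]
```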

Configuration

Defaults work out of the box. Everything is tunable via environment variables (read once at import) or the veralith.config.settings singleton.

Environment variable           Default                 Purpose
-----------------------------  ----------------------  -------
OPENAI_API_KEY                 (none)                  Required.
VERALITH_JUDGE_MODEL           gpt-4o                  Model for the S / F / C judges.
VERALITH_DECOMPOSER_MODEL      gpt-4o-mini             Model for query / response decomposition.
VERALITH_EMBED_MODEL           text-embedding-3-small  Embedding model (cost tracking).
VERALITH_DB_PATH               veralith.db             SQLite persistence path.
VERALITH_BATCH_SIZE            5                       Per-item judge batch size.
VERALITH_DEFAULT_SYNC          False                   Default sync for log() / @trace.
VERALITH_PER_TRACE_BUDGET_USD  0.50                    Pre-flight budget ceiling per trace.
VERALITH_WORKER_CONCURRENCY    4                       Background eval thread-pool size.
VERALITH_CACHE_ENABLED         True                    Enable the LLM-result cache.

Booleans accept 1, true, yes, on (case-insensitive). For tests and scoped tweaks, settings.override(...) is a context manager that validates keys and restores them on exit:

python
from veralith.config import settings

settings.judge_model            # 'gpt-4o'
with settings.override(batch_size=1):
    ...                         # scoped change, auto-restored on exit
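
The boolean rule (1, true, yes, on, case-insensitive) is easy to mirror in your own tooling. The helper below is a hypothetical sketch: stripping whitespace and treating every other value as False are assumptions, not documented behavior.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean environment variable using the accepted truthy spellings."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY   # assumption: anything else counts as False


os.environ["VERALITH_CACHE_ENABLED"] = "Yes"
print(env_bool("VERALITH_CACHE_ENABLED"))    # True
os.environ["VERALITH_CACHE_ENABLED"] = "0"
print(env_bool("VERALITH_CACHE_ENABLED"))    # False
```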

Command-line interface

Installing the package adds a veralith console script with four subcommands.

bash
# Batch-evaluate a JSONL file (each line: {"query", "context", "response"})
veralith eval traces.jsonl
veralith eval traces.jsonl --concurrency 8
veralith eval traces.jsonl --no-persist     # dry run, do not write to the DB

# Inspect one trace by id (read-only, colorized)
veralith inspect 42

# List recent traces
veralith list --limit 50

# Failure-cell distribution + totals across the DB
veralith stats

Command           Args                                                     Does
----------------  -------------------------------------------------------  ----
veralith eval     <file.jsonl> · --concurrency (default 4) · --no-persist  Evaluates each record concurrently and prints per-cell counts + total cost.
veralith inspect  <trace_id>                                               Pretty-prints one trace: claims, verdicts, completeness, suggestion.
veralith list     --limit (default 20)                                     Most recent traces with their failure cell.
veralith stats    (none)                                                   Totals + an ASCII bar chart of the failure-cell distribution.

veralith eval exits 0 on a clean run, 1 if the input is missing/empty, and 2 if it finished with one or more per-record errors.
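A short sketch for producing an input file in the shape veralith eval expects, one JSON object per line with query, context, and response keys. The records are made up, the file is written to a temporary directory, and passing context as a list of strings is assumed to work for the CLI as it does for log().

```python
import json
import tempfile
from pathlib import Path

records = [
    {
        "query": "What is a P/E ratio?",
        "context": ["The P/E ratio is share price divided by earnings per share."],
        "response": "A P/E ratio divides share price by earnings per share.",
    },
    {
        "query": "How do I reset my password?",
        "context": ["Use the 'Forgot password' link on the sign-in page."],
        "response": "Click 'Forgot password' on the sign-in page.",
    },
]

path = Path(tempfile.mkdtemp()) / "traces.jsonl"
with path.open("w", encoding="utf-8") as fh:
    for record in records:
        # One compact JSON object per line, no trailing commas or wrapping array.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

print(path.name)    # traces.jsonl
```

The resulting file can then be passed to veralith eval as shown above.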

Cost & budget

Veralith tracks token usage and USD per call, and guards every evaluation with a pre-flight budget estimate. Instrument your client once, then attribute cost per trace with a scope:

python
import veralith
from veralith.observability.cost import (
    instrument, get_tracker, CostScope, enforce_budget, BudgetExceeded,
)
from veralith.llm import get_client

instrument(get_client())            # record every LLM call this client makes
tracker = get_tracker()

with CostScope() as scope:          # per-trace attribution
    result = veralith.evaluate(query, context, response)
print(scope.usd, scope.tokens)      # read AFTER the block exits
print(tracker.total_usd)

try:                                # pre-flight guard (raises before any model call)
    enforce_budget(query, context, response)     # uses VERALITH_PER_TRACE_BUDGET_USD
except BudgetExceeded as e:
    print(e.estimated_usd, e.budget_usd)

instrument() is idempotent and records per-model usage by diffing the client's token counters around each chat / structured / embed call. The budget guard runs automatically inside log() and the eval CLI, so an over-budget trace fails fast and cheap.

The built-in pricing table is a convenience estimate — verify it against current OpenAI pricing before relying on it for production budgeting.

Persistence

When persist=True (the default for log() / evaluate()), Veralith writes to a local SQLite database — veralith.db by default, or VERALITH_DB_PATH. The schema mirrors the result models one-to-one: traces and their context_chunks, sub_questions, claims, the three judgment tables, completeness with its mappings/extras, plus a grounding join table.

The database is self-bootstrapping: the first connection creates the schema and runs any pending migrations, so library callers never have to initialize it explicitly. An LLM-result cache (in-process LRU in front of a SQLite cache table) keeps repeated decomposition / judge calls cheap.

Read a verdict back later with the CLI (veralith inspect <id>) or query the tables directly — the traces row carries the rolled-up failure_cell, overall_verdict, and the sufficiency / faithfulness fractions for fast aggregation.
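A sketch of the kind of aggregation the rolled-up failure_cell column makes cheap. To stay self-contained it builds an in-memory database with a deliberately simplified stand-in for the traces table (the real table has more columns, and the rows here are made up); against real data you would connect to veralith.db instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in; use your VERALITH_DB_PATH for real data
# Simplified stand-in schema: only the column this query needs.
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, failure_cell TEXT)")
conn.executemany(
    "INSERT INTO traces (failure_cell) VALUES (?)",
    [("complete_grounded",)] * 3
    + [("incomplete_grounded",)] * 2
    + [("complete_ungrounded",)],
)

rows = conn.execute(
    """
    SELECT failure_cell, COUNT(*) AS n
    FROM traces
    GROUP BY failure_cell
    ORDER BY n DESC, failure_cell
    """
).fetchall()

for cell, n in rows:
    print(f"{cell:<22} {n}")
```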


Status & roadmap

alpha · 0.1.x   The public API is stable — expect additions, not breaking changes.

In 0.1

  • Three judges (Sufficiency, Faithfulness, Completeness) with batched LLM calls.
  • Six-cell diagnostic classifier and rule-based suggester.
  • Outcome-based sufficiency-threshold calibration per knowledge base.
  • SDK: log(), @trace, the LangChain adapter, and a background eval worker.
  • SQLite persistence with self-healing migrations, an LLM-result cache, and a CLI.
  • Cost tracking with a per-trace budget guard.

On the roadmap

  • LLM-enriched, trace-specific suggestions.
  • Cross-trace pattern detection (“you keep hallucinating on time-sensitive queries”).
  • Additional judges (reasoning validity, temporal validity) and framework adapters (LlamaIndex, raw OpenAI tools).
  • A hosted dashboard with multi-tenant projects.

MIT licensed · Veralith 0.1.x