Veralith · v0.1.x (alpha)

Hallucination diagnosis for RAG

Wrap one line around your retrieval pipeline and get a structured report on what failed and how to fix it — not just a single yes/no hallucination flag. Veralith decomposes every (query, context, response) trace, runs three LLM-as-judge metrics over it, and classifies it into one of six diagnostic cells with a concrete remediation.

Overview

A monolithic “is this response hallucinated?” judge is a smoke alarm — it tells you something is wrong, but not what or where. Veralith is the diagnostic dashboard behind the alarm. For each trace it answers three independent questions:

Sufficiency — was the retrieval good enough to answer each part of the query?
Faithfulness — is each claim in the response grounded in the retrieved context?
Completeness — does the response actually answer every part of the query (and stay on topic)?

Cross-tabulating these gives a named failure mode (retrieval gap, intrinsic hallucination, padded answer, …) plus actionable fixes (lower temperature, bump retrieval‑K, tighten the generator prompt, …) for every trace.

Veralith evaluates traces you already have — it does not sit in your request path or change your responses. It only needs the query, the retrieved chunks, and the response your system produced.

Installation

Veralith targets Python 3.10+ and uses the OpenAI API for its judges.

bash

pip install veralith

Optional extras:

bash

pip install "veralith[langchain]"   # LangChain auto-tracing adapter
pip install "veralith[sample]"      # chromadb, for the sample RAG app
pip install "veralith[dev]"         # pytest, ruff, build, twine (contributors)

Set your OpenAI key (read once at import via python-dotenv, so a .env file works too):

bash

export OPENAI_API_KEY=sk-...

30-second quickstart

Run one synchronous evaluation and read the diagnosis straight off the typed result — no database, no polling:

python

import veralith

result = veralith.evaluate(
    query="What is a P/E ratio and what was Apple's P/E in 2023?",
    context=[
        "The price-to-earnings (P/E) ratio is a company's share price "
        "divided by its earnings per share."
    ],
    response=(
        "A P/E ratio divides share price by earnings per share. "
        "Apple's P/E in 2023 was 42.7."
    ),
    persist=False,                       # run entirely in memory
)

print(result.failure_cell.value)        # 'incomplete_ungrounded'
print(result.suggestion.title)          # 'Worst-case failure'
for action in result.suggestion.actions:
    print(" -", action)

Here the retrieved context never covers Apple's P/E, and the response invents a number (42.7) that nothing in the context grounds, so Veralith lands the trace in the worst-case cell and returns concrete next steps. You get back a typed EvaluationResult with per-claim verdicts, per-sub-question sufficiency, a failure-cell diagnosis, and a suggestion.

How it works

Every evaluation runs the same fixed four-stage pipeline, in order, over one trace: the triple (query Q, context C, response R).

Decompose: split Q into atomic sub-questions {Qᵢ} and R into atomic claims {Rᵢ}.
Judge ×3: Sufficiency (per Qᵢ), Faithfulness (per Rᵢ), Completeness (Rᵢ ↔ Qᵢ alignment).
Classify: cross-tab Completeness × Faithfulness into one of six failure cells.
Suggest: map the diagnosis to a concrete, rule-based remediation.

Decomposition is deliberately conservative: the splitter only acts on content literally present in the text, resolves pronouns so each piece is self-contained, and never invents sub-topics. A single-purpose query stays one sub-question; a refusal (“I can't answer that from the context”) yields zero claims — a case the pipeline handles explicitly.

Each of the three judges is isolated: if one fails (an API error, a malformed verdict), the others still complete and the failure is recorded in result.errors rather than aborting the whole evaluation. Per-phase wall-clock timings land in result.latency_ms.
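Because one judge can fail while the others complete, a result may be only partially filled in. The following is a minimal sketch of a helper that reports which judges failed and which phase was slowest. It relies only on the `errors` and `latency_ms` fields described in this document; the `SimpleNamespace` object and the phase names in it are hypothetical stand-ins for a real `EvaluationResult`.

```python
from types import SimpleNamespace


def summarize_run(result) -> str:
    """Summarize judge failures and the slowest phase of one evaluation."""
    lines = []
    # errors maps metric name -> error message for any judge that failed.
    for metric, message in sorted(result.errors.items()):
        lines.append(f"judge failed: {metric}: {message}")
    # latency_ms maps phase name -> elapsed milliseconds.
    if result.latency_ms:
        phase, ms = max(result.latency_ms.items(), key=lambda kv: kv[1])
        lines.append(f"slowest phase: {phase} ({ms:.0f} ms)")
    return "\n".join(lines) if lines else "no errors, no timings recorded"


# Stand-in object; the metric and phase names here are hypothetical.
fake_result = SimpleNamespace(
    errors={"sufficiency": "rate limit"},
    latency_ms={"decompose": 420.0, "faithfulness": 912.0},
)
summary = summarize_run(fake_result)
print(summary)
```

With a real result, the same helper could be called as `summarize_run(veralith.evaluate(...))`, assuming the field names above.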

Cost. A typical evaluation is ~5 LLM calls — 2 decomposition calls on the cheaper decomposer model plus 3 batched judges — roughly $0.005 / trace on the default models. The per-item judges batch in groups of 5, so 15 sub-questions cost 3 calls, not 15.
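The call count above can be reproduced with a little arithmetic. This is a rough sketch, not Veralith's own accounting: it assumes one decomposition call each for the query and the response, one batched call per group of `batch_size` items for the two per-item judges, and a single Completeness call. It ignores the LLM-result cache. The function name `estimate_llm_calls` is hypothetical.

```python
import math


def estimate_llm_calls(n_sub_questions: int, n_claims: int, batch_size: int = 5) -> int:
    """Estimate LLM calls for one trace under the batching described above."""
    decomposition = 2  # assumed: one call for the query, one for the response
    sufficiency = math.ceil(n_sub_questions / batch_size)  # per-sub-question judge, batched
    faithfulness = math.ceil(n_claims / batch_size)        # per-claim judge, batched
    completeness = 1  # single whole-trace alignment call
    return decomposition + sufficiency + faithfulness + completeness


typical = estimate_llm_calls(n_sub_questions=3, n_claims=4)
large = estimate_llm_calls(n_sub_questions=15, n_claims=5)
print(typical, large)
```

A trace with up to five sub-questions and five claims stays at the typical five calls; fifteen sub-questions add only two more Sufficiency batches.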

The three metrics

The metrics measure independent things. Two are per-item binary judges; Completeness is a single whole-trace alignment call.

Sufficiency

Per sub-question Qᵢ: do the retrieved chunks contain enough information to answer it — using only the chunks? A retrieval-quality signal.

verdict Y / N · per Qᵢ · batched

Faithfulness

Per claim Rᵢ: is it grounded in the chunks? Checks grounding, not correctness — a plausible inference beyond the context still counts as ungrounded; strict on numbers, dates, and entities.

verdict Y / N · per Rᵢ · batched

Completeness

Whole trace: does R cover every Qᵢ and stay on topic? Definitions, formulas, examples, caveats, and history of the asked topic are on-topic — only a claim about a different topic is “extra”.

complete / incomplete / extra

Sufficiency is a deliberate sidecar: it does not decide which failure cell you land in. It is collapsed to a HIGH / LOW level that only refines which remediation text you get — see calibration.

The six failure cells

Each trace lands in exactly one cell — a strict cross-tab of Completeness (3 rows: does R cover Q?) × Faithfulness (2 columns: is every claim grounded?). The cell name follows the pattern <completeness>_<faithfulness>, so you can decode any cell without a lookup chart: read it as “the response is <X> and the claims are <Y>.”

Completeness	Grounded (every claim supported)	Ungrounded (some claim invented)
Complete	`complete_grounded` (healthy): Answers everything; every claim is grounded.	`complete_ungrounded` (hallucination): Answers everything, but at least one claim is fabricated.
Incomplete	`incomplete_grounded` (gap): Misses part of the query; what is there is grounded.	`incomplete_ungrounded` (worst case): Misses parts and fabricates within what it did answer.
Extra	`extra_grounded` (padded): Adds unrequested content; everything is still grounded.	`extra_ungrounded` (padded + invented): Adds unrequested content and fabricates some of it.

The Completeness axis follows a fixed precedence — incomplete > extra > complete: an uncovered sub-question always makes a trace incomplete, even if it also has extra claims. A response with zero claims (a refusal) is treated as grounded — there is nothing to be ungrounded about — so a refusal lands in incomplete_grounded.
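The cross-tab and its precedence rules can be sketched in a few lines. This is an illustrative reimplementation under the rules stated above, not Veralith's classifier; the function name `classify_cell` and its inputs (counts of uncovered sub-questions and extra claims, plus one boolean per claim) are assumptions chosen for the example.

```python
def classify_cell(n_uncovered: int, n_extra: int, claim_grounded: list[bool]) -> str:
    """Map counts and per-claim verdicts to a cell name, per the rules above."""
    # Completeness axis, with precedence incomplete > extra > complete.
    if n_uncovered > 0:
        completeness = "incomplete"
    elif n_extra > 0:
        completeness = "extra"
    else:
        completeness = "complete"
    # Faithfulness axis: a single ungrounded claim makes the trace ungrounded.
    # all([]) is True, so a refusal with zero claims counts as grounded.
    faithfulness = "grounded" if all(claim_grounded) else "ungrounded"
    return f"{completeness}_{faithfulness}"


refusal = classify_cell(n_uncovered=1, n_extra=0, claim_grounded=[])
mixed = classify_cell(n_uncovered=1, n_extra=2, claim_grounded=[True, False])
print(refusal, mixed)
```

The second call shows the precedence rule: the trace has extra claims, but the uncovered sub-question wins and the row is `incomplete`.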

Sufficiency calibration

“Was retrieval good enough?” depends on your corpus. Rather than a fixed bar, Veralith can learn the Sufficiency HIGH/LOW threshold per knowledge base from your own trace history.

The idea: if a trace reached the healthy outcome — fully grounded and complete — despite imperfect retrieval, then that retrieval was “good enough” for this corpus. Veralith takes the 10th percentile of the sufficiency fraction across those successful traces and uses it as the threshold.

A fresh project starts strict (a fallback threshold), so HIGH requires strong retrieval.
Calibration only kicks in after ≥ 20 successful traces exist in the database; until then the fallback is used.
The level never moves a trace between cells — it only selects between the HIGH and LOW phrasing of the same cell's remediation.

Calibration reads from the SQLite database, so it only applies on the persisted path (persist=True). In-memory evaluations use the default threshold.
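The calibration rule can be sketched as follows. Several details are assumptions, since this document does not specify them: the fallback value (`0.8` here is made up), the percentile method (nearest-rank is used), and whether a fraction equal to the threshold counts as HIGH. Treat it as an illustration of the idea rather than Veralith's implementation.

```python
import math

FALLBACK_THRESHOLD = 0.8     # hypothetical; the real fallback value is not documented here
MIN_SUCCESSFUL_TRACES = 20   # calibration needs at least this many healthy traces


def calibrated_threshold(healthy_fractions: list[float]) -> float:
    """Return the HIGH/LOW sufficiency threshold for one knowledge base.

    healthy_fractions holds the sufficiency fraction of each trace that
    reached the healthy (complete and grounded) outcome.
    """
    if len(healthy_fractions) < MIN_SUCCESSFUL_TRACES:
        return FALLBACK_THRESHOLD
    ordered = sorted(healthy_fractions)
    # Nearest-rank 10th percentile (assumed method).
    rank = math.ceil(len(ordered) / 10)
    return ordered[rank - 1]


def sufficiency_level(fraction: float, threshold: float) -> str:
    # Assumption: a fraction equal to the threshold counts as HIGH.
    return "HIGH" if fraction >= threshold else "LOW"


history = [0.5, 0.6] + [1.0] * 18   # 20 healthy traces, two with weak retrieval
threshold = calibrated_threshold(history)
print(threshold, sufficiency_level(0.75, threshold))
```

With this history the learned threshold is lower than the strict fallback, so a trace with 75% of its sub-questions sufficiently covered is rated HIGH for this corpus.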

Integration · `log()`

The minimum-friction entry point. It persists the trace immediately and, by default, runs the evaluation on a background worker thread — so it adds almost nothing to your request latency. It returns the integer trace_id right away.

python

import veralith

def answer(query: str) -> str:
    chunks = my_retriever(query)            # list[str] | list[dict] | list[ContextChunk]
    response = my_generator(query, chunks)

    veralith.log(query=query, context=chunks, response=response)   # background eval
    return response

Signature

python

log(query: str, context, response: str, *, sync: bool | None = None) -> int | EvaluationResult

Param	Type	Notes
query	`str`	The user query. Must be non-empty.
context	`list[str] \| list[dict] \| list[ContextChunk]`	Retrieved chunks. Strings and dicts are normalized to `ContextChunk`; dict keys read are `text`, `rank`, `source`, `score`.
response	`str`	The generated response.
sync	`bool \| None`	`None` → uses `VERALITH_DEFAULT_SYNC` (default `False`). `False` → background eval, returns `trace_id`. `True` → inline eval, returns `EvaluationResult`.
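To make the `context` normalization concrete, here is a small sketch of how strings and dicts might be coerced into chunk objects. The `Chunk` dataclass is a stand-in for `ContextChunk` with the fields listed in this document. Defaulting `rank` to the list position is an assumption; Veralith's actual defaults may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Chunk:  # stand-in for veralith's ContextChunk
    text: str
    rank: int
    source: Optional[str] = None
    score: Optional[float] = None


def normalize_context(context: list[Union[str, dict, Chunk]]) -> list[Chunk]:
    """Coerce strings and dicts into Chunk objects, keeping list order."""
    chunks = []
    for position, item in enumerate(context):
        if isinstance(item, Chunk):
            chunks.append(item)
        elif isinstance(item, str):
            # Assumption: a bare string takes its list position as rank (0 = top).
            chunks.append(Chunk(text=item, rank=position))
        elif isinstance(item, dict):
            chunks.append(Chunk(
                text=item["text"],
                rank=item.get("rank", position),  # assumption: default to position
                source=item.get("source"),
                score=item.get("score"),
            ))
        else:
            raise TypeError(f"unsupported context item: {type(item).__name__}")
    return chunks


normalized = normalize_context([
    "Plain string chunk.",
    {"text": "Dict chunk.", "source": "kb.md", "score": 0.9},
])
print(normalized)
```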

A pre-flight budget guard runs before anything is written, so an over-budget trace raises BudgetExceeded without persisting or calling any model. Background evals never crash your process. Before the process exits, drain any that are still in flight:

python

veralith.shutdown(wait=True)   # block until in-flight background evals finish (also runs at exit)

Integration · `@trace`

Wrap a RAG function and Veralith captures its (response, context) automatically — zero code reshape. The wrapped function still returns just the response to your callers.

python

import veralith

@veralith.trace
def my_rag(query: str):
    chunks = my_retriever(query)
    response = my_generator(query, chunks)
    return response, chunks            # return (response, context) — the decorator captures both

answer = my_rag("How do I reset my password?")   # -> the response string; eval runs in the background

Return either a (response, context) tuple or a TraceReturn(response=..., context=...) when a bare tuple is awkward. The decorator supports bare @trace, parameterized @trace(...), and async functions.

python

from veralith import trace, TraceReturn

@trace(query_arg="user_question", sync=False, on_error="warn")
def my_rag(user_question: str):
    ...                                # your retrieval + generation; sets `answer` and `chunks`
    return TraceReturn(response=answer, context=chunks)

Option	Default	Notes
query_arg	`None`	Name of the parameter holding the query. Defaults to the first positional arg (or `kwargs["query"]`).
sync	`None`	Forwarded to `log()`.
on_error	`"warn"`	`"warn"` emits a warning if capture/log fails; `"silent"` swallows it. A `BudgetExceeded` always propagates.

Telemetry is best-effort: if Veralith can't extract a clean (response, context) it warns and passes your return value through untouched — it never disturbs your pipeline.
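The capture-and-pass-through behaviour can be illustrated with a toy decorator. This is a simplified sketch, not Veralith's `@trace`: it handles only the bare, synchronous form, appends to a local `captured` list where the real decorator would call `log()`, and the name `trace_sketch` is hypothetical.

```python
import functools
import warnings

captured = []  # stand-in for veralith.log(); collects (query, context, response)


def trace_sketch(func):
    """Record (query, context, response) and hand only the response to callers."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if not (isinstance(result, tuple) and len(result) == 2):
            # Best-effort: warn and pass the return value through untouched.
            warnings.warn("trace_sketch: expected a (response, context) tuple")
            return result
        response, context = result
        # Query is the first positional argument, else kwargs["query"].
        query = args[0] if args else kwargs.get("query")
        captured.append((query, context, response))
        return response
    return wrapper


@trace_sketch
def my_rag(query: str):
    chunks = ["Use the 'Forgot password' link on the sign-in page."]
    return "Click 'Forgot password' on the sign-in page.", chunks


answer = my_rag("How do I reset my password?")
print(answer)
```

Callers see only the response string, while the (query, context, response) triple is recorded on the side.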

Integration · `evaluate()`

The low-level orchestrator. It always runs synchronously and returns the full typed EvaluationResult inline — ideal for tests, gating, or pulling a verdict into your own control flow.

python

evaluate(query: str, context, response: str, *, persist: bool = True, trace_id: int | None = None) -> EvaluationResult

python

result = veralith.evaluate(query, context, response, persist=False)

if result.failure_cell and result.failure_cell.value.endswith("ungrounded"):
    handle_hallucination(result.faithfulness)   # per-claim verdicts + grounding chunks

Param	Default	Notes
persist	`True`	`True` writes the trace + all artifacts to SQLite (and enables per-KB calibration). `False` runs entirely in memory and skips all DB writes.
trace_id	`None`	Reuse an existing trace row (e.g. one `log()` already persisted) instead of inserting a new one.

A reserved sync keyword exists for API symmetry with log() but has no effect here — evaluate() is always synchronous.

Integration · LangChain

Zero-code auto-tracing. One install() patches LangChain's retrieval chains so every .invoke() also logs (query, source documents, answer) to Veralith.

python

import veralith.adapters.langchain as adapter

adapter.install()           # patches RetrievalQA + RetrievalQAWithSourcesChain .invoke()
# ... every existing chain.invoke() now auto-traces to Veralith ...

adapter.is_installed()      # True
adapter.uninstall()         # restore the originals

install() returns the number of chain classes patched, is idempotent, and raises ImportError only if LangChain isn't installed at all. As with @trace, extraction failures warn and pass the chain result through unchanged — the adapter never breaks your chain.

The result object

evaluate() (and log(sync=True)) return an EvaluationResult — a typed Pydantic model. Every field is structured and inspectable.

Field	Type	Meaning
trace_id	`int`	DB id of the trace (negative when `persist=False`).
query	`str`	The original query.
sub_questions	`list[SubQuestion]`	The decomposed query {Qᵢ}.
claims	`list[Claim]`	The decomposed response {Rᵢ}.
sufficiency	`list[SufficiencyJudgment]`	Per-Qᵢ verdicts; empty if the sufficiency judge failed.
faithfulness	`list[FaithfulnessJudgment]`	Per-Rᵢ verdicts + grounding chunk ranks; empty if it failed.
completeness	`CompletenessJudgment \| None`	Qᵢ↔Rᵢ alignment; `None` if it failed.
diagnosis	`Diagnosis \| None`	Failure cell + sufficiency level + counts; `None` if the cell couldn't be determined.
suggestion	`Suggestion`	Remediation (always present).
created_at	`datetime`	UTC timestamp.
errors	`dict[str, str]`	metric → error message, for any judge that failed.
latency_ms	`dict[str, float]`	phase → elapsed milliseconds.

The top-level result.failure_cell property is a shortcut for result.diagnosis.failure_cell (or None). A Diagnosis also carries the supporting signals: sufficiency_level (HIGH/LOW), sufficiency_fraction, faithfulness_fraction, and counts — n_sub_questions, n_claims, n_uncovered_sub_questions, n_extra_claims.

Nested judgment models

Model	Key fields
SubQuestion / Claim	`id`, `text`, `order_idx`
SufficiencyJudgment	`sub_question_id`, `verdict` (Y/N), `reasoning`, `supporting_chunk_ranks`
FaithfulnessJudgment	`claim_id`, `verdict` (Y/N), `reasoning`, `grounding_chunk_ranks`
CompletenessJudgment	`overall` (complete/incomplete/extra), `mappings` (Qᵢ → covering Rᵢ or `None`), `extra_claim_ids`, `reasoning`
Suggestion	`title`, `body`, `actions: list[str]`
ContextChunk	`text`, `rank` (0 = top), `source?`, `score?`

Reading a verdict

python

r = veralith.evaluate(query, context, response, persist=False)

if r.diagnosis is not None:                              # None if the cell couldn't be determined
    print(r.failure_cell.value)                          # e.g. 'complete_ungrounded'
    print(r.diagnosis.sufficiency_level.value)           # 'high' | 'low'

for claim, judgment in zip(r.claims, r.faithfulness):
    if judgment.verdict.value == "N":
        print("UNGROUNDED:", claim.text, "—", judgment.reasoning)

print(r.suggestion.title)
for step in r.suggestion.actions:
    print(" -", step)

Configuration

Defaults work out of the box. Everything is tunable via environment variables (read once at import) or the veralith.config.settings singleton.

Environment variable	Default	Purpose
`OPENAI_API_KEY`	—	Required.
`VERALITH_JUDGE_MODEL`	`gpt-4o`	Model for the S / F / C judges.
`VERALITH_DECOMPOSER_MODEL`	`gpt-4o-mini`	Model for query / response decomposition.
`VERALITH_EMBED_MODEL`	`text-embedding-3-small`	Embedding model (cost tracking).
`VERALITH_DB_PATH`	`veralith.db`	SQLite persistence path.
`VERALITH_BATCH_SIZE`	`5`	Per-item judge batch size.
`VERALITH_DEFAULT_SYNC`	`False`	Default `sync` for `log()` / `@trace`.
`VERALITH_PER_TRACE_BUDGET_USD`	`0.50`	Pre-flight budget ceiling per trace.
`VERALITH_WORKER_CONCURRENCY`	`4`	Background eval thread-pool size.
`VERALITH_CACHE_ENABLED`	`True`	Enable the LLM-result cache.

Booleans accept 1, true, yes, on (case-insensitive). For tests and scoped tweaks, settings.override(...) is a context manager that validates keys and restores them on exit:

python

from veralith.config import settings

settings.judge_model            # 'gpt-4o'
with settings.override(batch_size=1):
    ...                         # scoped change, auto-restored on exit
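The boolean parsing rule described above can be sketched as a small helper. The function name `env_bool` is hypothetical, and treating every value outside the listed set as `False` (after trimming whitespace) is an assumption about behaviour this document does not spell out.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean environment variable using the truthy set above."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: anything outside the truthy set is treated as False.
    return raw.strip().lower() in TRUTHY


os.environ["VERALITH_DEFAULT_SYNC"] = "YES"
sync_default = env_bool("VERALITH_DEFAULT_SYNC")
print(sync_default)
```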

Command-line interface

Installing the package adds a veralith console script with four subcommands.

bash

# Batch-evaluate a JSONL file (each line: {"query", "context", "response"})
veralith eval traces.jsonl
veralith eval traces.jsonl --concurrency 8
veralith eval traces.jsonl --no-persist     # dry run, do not write to the DB

# Inspect one trace by id (read-only, colorized)
veralith inspect 42

# List recent traces
veralith list --limit 50

# Failure-cell distribution + totals across the DB
veralith stats

Command	Args	Does
`veralith eval`	<file.jsonl> · `--concurrency` (4) · `--no-persist`	Evaluates each record concurrently and prints per-cell counts + total cost.
`veralith inspect`	<trace_id>	Pretty-prints one trace: claims, verdicts, completeness, suggestion.
`veralith list`	`--limit` (20)	Most recent traces with their failure cell.
`veralith stats`	—	Totals + an ASCII bar chart of the failure-cell distribution.

veralith eval exits 0 on a clean run, 1 if the input is missing/empty, and 2 if it finished with one or more per-record errors.
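A short sketch of preparing input for `veralith eval` and interpreting its exit code. It assumes `context` is a list of strings, as in the `log()` examples; the helper names `write_jsonl` and `describe_exit` are hypothetical. It writes to a temporary directory and does not invoke the CLI itself.

```python
import json
import tempfile
from pathlib import Path

# Exit codes as documented for `veralith eval`.
EXIT_MEANINGS = {
    0: "clean run",
    1: "input missing or empty",
    2: "finished with one or more per-record errors",
}


def write_jsonl(path: Path, rows: list[dict]) -> int:
    """Write one JSON object per line and return the number of rows written."""
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def describe_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


records = [{
    "query": "What is a P/E ratio?",
    "context": ["The P/E ratio is a company's share price divided by its earnings per share."],
    "response": "A P/E ratio divides share price by earnings per share.",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "traces.jsonl"
    written = write_jsonl(path, records)
    round_trip = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

print(written, describe_exit(2))
```

In a script, the return code of `subprocess.run(["veralith", "eval", "traces.jsonl"])` could be passed to `describe_exit` to decide whether to fail a CI job.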

Cost & budget

Veralith tracks token usage and USD per call, and guards every evaluation with a pre-flight budget estimate. Instrument your client once, then attribute cost per trace with a scope:

python

import veralith
from veralith.observability.cost import (
    instrument, get_tracker, CostScope, enforce_budget, BudgetExceeded,
)
from veralith.llm import get_client

instrument(get_client())            # record every LLM call this client makes
tracker = get_tracker()

with CostScope() as scope:          # per-trace attribution
    result = veralith.evaluate(query, context, response)
print(scope.usd, scope.tokens)      # read AFTER the block exits
print(tracker.total_usd)

try:                                # pre-flight guard (raises before any model call)
    enforce_budget(query, context, response)     # uses VERALITH_PER_TRACE_BUDGET_USD
except BudgetExceeded as e:
    print(e.estimated_usd, e.budget_usd)

instrument() is idempotent and records per-model usage by diffing the client's token counters around each chat / structured / embed call. The budget guard runs automatically inside log() and the eval CLI, so an over-budget trace fails fast and cheap.

The built-in pricing table is a convenience estimate — verify it against current OpenAI pricing before relying on it for production budgeting.

Persistence

When persistence is on (always for log(), and by default for evaluate() via persist=True), Veralith writes to a local SQLite database: veralith.db by default, or the path in VERALITH_DB_PATH. The schema mirrors the result models one-to-one: traces and their context_chunks, sub_questions, claims, the three judgment tables, completeness with its mappings/extras, plus a grounding join table.

The database is self-bootstrapping: the first connection creates the schema and runs any pending migrations, so library callers never have to initialize it explicitly. An LLM-result cache (in-process LRU in front of a SQLite cache table) keeps repeated decomposition / judge calls cheap.

Read a verdict back later with the CLI (veralith inspect <id>) or query the tables directly — the traces row carries the rolled-up failure_cell, overall_verdict, and the sufficiency / faithfulness fractions for fast aggregation.
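Querying the tables directly might look like the sketch below, which counts traces per failure cell. It relies only on the `traces` table and its `failure_cell` column named above; the in-memory database and its two-column schema are stand-ins so the example runs on its own, and the real table has more columns.

```python
import sqlite3


def failure_cell_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Count traces per failure cell, skipping rows with no diagnosis."""
    rows = conn.execute(
        "SELECT failure_cell, COUNT(*) FROM traces "
        "WHERE failure_cell IS NOT NULL "
        "GROUP BY failure_cell ORDER BY COUNT(*) DESC, failure_cell"
    ).fetchall()
    return dict(rows)


# Stand-in database with a simplified schema (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, failure_cell TEXT)")
conn.executemany(
    "INSERT INTO traces (failure_cell) VALUES (?)",
    [("complete_grounded",), ("complete_grounded",), ("incomplete_ungrounded",), (None,)],
)
counts = failure_cell_counts(conn)
conn.close()
print(counts)
```

Against a real database, the connection would instead be opened with `sqlite3.connect("veralith.db")` or the path set in `VERALITH_DB_PATH`.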

Status & roadmap

alpha · 0.1.x The public API is stable — expect additions, not breaking changes.

In 0.1

Three judges (Sufficiency, Faithfulness, Completeness) with batched LLM calls.
Six-cell diagnostic classifier and rule-based suggester.
Outcome-based sufficiency-threshold calibration per knowledge base.
SDK: log(), @trace, the LangChain adapter, and a background eval worker.
SQLite persistence with self-healing migrations, an LLM-result cache, and a CLI.
Cost tracking with a per-trace budget guard.

On the roadmap

LLM-enriched, trace-specific suggestions.
Cross-trace pattern detection (“you keep hallucinating on time-sensitive queries”).
Additional judges (reasoning validity, temporal validity) and framework adapters (LlamaIndex, raw OpenAI tools).
A hosted dashboard with multi-tenant projects.


MIT licensed · Veralith 0.1.x
