Detect, diagnose, and correct hallucinations in your RAG pipeline — trace by trace, in real time.
VERALITH diagnoses every answer your RAG ships — claim by claim.
Break the answer into self-contained claims and label each one as supported, unsupported, or contradicted by the retrieved context.
See why it failed: missing evidence, weak retrieval, a contradicted source, or parametric drift — down to the exact chunks.
Each flagged claim comes with a recommended fix — re-cite, regenerate, or lower confidence — to apply on your terms. Diagnosed, never blocked.
Claims, grounding, latency, failure cells, volume — resolved the moment a response ships.
Wire it in once and every answer gets the same scrutiny — before your user ever reads it.
Veralith diagnoses every RAG answer at the claim level, measures the health of your whole system, tells you what to do about it — and closes the loop inside your own codebase.
Explain compounding frequency tradeoffs for monthly vs annual, and how this changes the doubling time under the Rule of 72.
Judge: Directly defined in chunk #0 (similarity 0.82). Sufficient context to answer.
Supporting: #0, #2
Judge: No retrieved chunk discusses compounding frequency tradeoffs. All chunks cover the Rule of 72 itself, not periodic compounding.
Supporting: none
The Rule of 72 estimates doubling time by dividing 72 by the annual interest rate. [R0] At a 6% return, the formula gives roughly 12 years to double. [R1] Monthly compounding produces a doubling time of about 11.6 years at the same nominal rate. [R2] Banks generally prefer annual compounding because it reduces operational overhead. [R3] Daily compounding offers diminishing returns above 12 periods per year. [R4]
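The two numeric claims in the sample answer (roughly 12 years under the Rule of 72, about 11.6 years with monthly compounding at 6%) can be checked with a few lines of arithmetic. This is a minimal sketch using the standard compound-interest formula. The function names are illustrative and are not part of Veralith.

```python
import math


def rule_of_72_years(rate_percent: float) -> float:
    """Rule of 72 approximation: years to double = 72 / annual rate in percent."""
    return 72.0 / rate_percent


def exact_doubling_years(nominal_rate: float, periods_per_year: int) -> float:
    """Exact doubling time for a nominal annual rate compounded n times a year.

    Solves (1 + r/n) ** (n * t) == 2 for t.
    """
    periodic_rate = nominal_rate / periods_per_year
    return math.log(2.0) / (periods_per_year * math.log1p(periodic_rate))


approx = rule_of_72_years(6.0)            # Rule of 72 at a 6% return
annual = exact_doubling_years(0.06, 1)    # 6% nominal, compounded once a year
monthly = exact_doubling_years(0.06, 12)  # 6% nominal, compounded monthly

print(f"Rule of 72: {approx:.1f} years")
print(f"Annual compounding: {annual:.2f} years")
print(f"Monthly compounding: {monthly:.2f} years")
```

Under these assumptions the monthly figure rounds to 11.6 years, which matches claim R2. The arithmetic being right does not make the claim grounded: the judge still flags it if no retrieved chunk covers compounding frequency.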
Every answer is split into atomic claims, then scored by three LLM judges — Sufficiency, Faithfulness, Completeness — and routed into one of six failure cells. You see which sentence broke, and why.
One composite index — the mean of your three judges — tracked over time. Slice by route, model, or document set and watch the line climb as you tune.
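As a rough illustration of the composite index, the sketch below averages three judge scores per trace and then averages those per slice. It assumes each judge emits a score between 0 and 1. The `JudgeScores` class, the field names, and the route labels are hypothetical and do not come from the Veralith API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeScores:
    """Hypothetical container for one trace's judge scores, each assumed in [0, 1]."""
    sufficiency: float
    faithfulness: float
    completeness: float


def composite_index(scores: JudgeScores) -> float:
    """Composite index for one trace: the plain mean of the three judges."""
    return mean([scores.sufficiency, scores.faithfulness, scores.completeness])


def index_by_slice(traces: list[tuple[str, JudgeScores]]) -> dict[str, float]:
    """Average the composite index per slice label (route, model, document set)."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for label, scores in traces:
        grouped[label].append(composite_index(scores))
    return {label: mean(values) for label, values in grouped.items()}


# Hypothetical traces tagged with a route label
traces = [
    ("billing", JudgeScores(0.5, 0.75, 1.0)),
    ("billing", JudgeScores(1.0, 1.0, 1.0)),
    ("support", JudgeScores(0.25, 0.5, 0.75)),
]
per_route = index_by_slice(traces)
print(per_route)
```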
A prescriptive advisor over all your traces. It reads the week's failures and tells you where the next fix pays off most — ranked, with the expected lift.
Hand the diagnosis to your own Claude Code over MCP. It reads your repo, makes the edit, and opens a PR — you review and merge. Failures cluster by root cause, so one fix clears many.
Every supported claim links back to the exact chunk it leans on. Every unsupported one is flagged with the evidence it's missing.
Failing queries roll up into the topics your corpus can't answer — ranked by volume and trend. You learn which docs to write next, not just that something broke.
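One way to picture the roll-up: count failing queries per topic in the current window, compare against the previous window, and sort by volume and then by trend. The sketch assumes each failing query already carries a topic label. How Veralith assigns topics is not described here, so the labels and the function name below are hypothetical.

```python
from collections import Counter


def rank_gap_topics(current: list[str], previous: list[str]) -> list[tuple[str, int, int]]:
    """Return (topic, volume, trend) rows, highest volume first.

    volume is the number of failing queries in the current window.
    trend is the change in volume against the previous window.
    """
    now = Counter(current)
    before = Counter(previous)  # a Counter returns 0 for topics it has not seen
    rows = [(topic, count, count - before[topic]) for topic, count in now.items()]
    # Sort by volume, then by trend, then by name for a stable order
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


# Hypothetical topic labels for failing queries in two consecutive weeks
this_week = ["refunds", "refunds", "shipping", "refunds", "warranty", "shipping"]
last_week = ["refunds", "shipping", "shipping"]

ranking = rank_gap_topics(this_week, last_week)
for topic, volume, trend in ranking:
    print(f"{topic}: {volume} failing queries ({trend:+d} vs. last week)")
```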
pip install veralith — the judges, classifier, and suggester are open source. Bring your own LLM keys; your traces never leave your boundary.
Pass the user query, your retrieved context, and the LLM response — in a single call.
Veralith splits the response into atomic claims and cross-checks each against the context.
Read the failure cell, per-claim verdicts, and suggested fix — then route, gate, or heal.
import veralith

# the (query, context, response) your RAG stack already produces
result = veralith.evaluate(
    query="What is the refund policy?",  # user's question
    context=knowledge_base,              # your retrieved chunks
    response=llm_output,                 # the answer your LLM gave
    persist=False,
)

# a named failure cell, not a yes/no flag
print(result.diagnosis.failure_cell.value)  # → 'incomplete_ungrounded'

# per-claim faithfulness, plus the one fix to apply next
print(result.faithfulness[0].verdict.value)  # → 'N'
print(result.suggestion.actions[0])
{
  "diagnosis": {
    "failure_cell": "incomplete_ungrounded",
    "sufficiency_level": "low"
  },
  "faithfulness": [
    { "claim_id": 1, "verdict": "N", "grounding_chunk_ranks": [] },
    { "claim_id": 2, "verdict": "Y", "grounding_chunk_ranks": [0] }
  ],
  "suggestion": {
    "title": "Worst-case failure",
    "actions": [
      "Fix retrieval first: bump K, audit the corpus, re-chunk, re-embed.",
      "Add an abort path: if Sufficiency is low at eval time, return a refusal instead of the generated answer.",
      "Tighten the generator prompt to refuse when context is thin."
    ]
  }
}
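A payload shaped like the one above could drive a simple gate in application code. The sketch below uses only the standard library and the field names visible in the sample. It assumes that a verdict of "Y" means supported and anything else does not, and that refusing on low sufficiency is the policy you want. Both are assumptions for illustration, not documented Veralith behavior.

```python
import json

# Trimmed copy of the sample payload shown above
payload = json.loads("""
{
  "diagnosis": {
    "failure_cell": "incomplete_ungrounded",
    "sufficiency_level": "low"
  },
  "faithfulness": [
    { "claim_id": 1, "verdict": "N", "grounding_chunk_ranks": [] },
    { "claim_id": 2, "verdict": "Y", "grounding_chunk_ranks": [0] }
  ],
  "suggestion": {
    "title": "Worst-case failure",
    "actions": ["Fix retrieval first: bump K, audit the corpus, re-chunk, re-embed."]
  }
}
""")


def unsupported_claims(report: dict) -> list[int]:
    """IDs of claims whose verdict is not "Y" (assumed to mean supported)."""
    return [c["claim_id"] for c in report["faithfulness"] if c["verdict"] != "Y"]


def should_refuse(report: dict) -> bool:
    """Illustrative gate: refuse when the retrieved context was judged too thin."""
    return report["diagnosis"]["sufficiency_level"] == "low"


flagged = unsupported_claims(payload)
if should_refuse(payload):
    print(f"Refusing: sufficiency is low, unsupported claims: {flagged}")
else:
    print(f"Shipping answer, claims to re-cite: {flagged}")
```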
MIT-licensed core · bring your own keys · also via the @veralith.trace decorator, the LangChain adapter, or the hosted REST API.
A question, its retrieved context, and a generated answer. Run the check and Veralith decomposes the answer into claims, grounds each one against the context, and tells you exactly what failed — live, no signup.
For prototypes and side projects finding their footing.
For teams running RAG in production and tuning it weekly.
For regulated, high-volume, or self-hosted deployments.
Example pricing for this mockup — not VERALITH's real plans.