“Hallucinated: yes” tells you nothing. We tell you everything.

Detect, diagnose, and correct hallucinations in your RAG pipeline — trace by trace, in real time.

How it works

Three steps, every answer.

Veralith diagnoses every answer your RAG pipeline ships, claim by claim.

Detect

Find every claim

Break the answer into self-contained claims and label each one against the retrieved context: supported, unsupported, or contradicted.

Diagnose

Understand why

See why it failed: missing evidence, weak retrieval, a contradicted source, or parametric drift — down to the exact chunks.

Correct

Ground it

Each flagged claim comes with a recommended fix — re-cite, regenerate, or lower confidence — to apply on your terms. Diagnosed, never blocked.

A sky full of answers

Thousands of query · context · response (Q · C · R) triples a minute.

That's the whole picture

Every answer, pulled apart and checked.

Claims, grounding, latency, failure cells, volume — resolved the moment a response ships.

That's the loop

Detect. Diagnose. Correct.

Wire it in once and every answer gets the same scrutiny — before your user ever reads it.

Built for high-stakes RAG, where a wrong claim is expensive
Legal · Healthcare · Finance · Support · Research
Capabilities

From a failing answer to a merged fix.

Veralith diagnoses every RAG answer at the claim level, measures the health of your whole system, tells you what to do about it — and closes the loop inside your own codebase.

Query & Response
Claims highlighted by judge verdict · hover any claim for reasoning
Legend: grounded claim · ungrounded claim
Q · Query · 2 sub-questions · 123 chars

Explain compounding frequency tradeoffs for monthly vs annual, and how this changes the doubling time under the Rule of 72.

Sub-questions (decomposed by judge)
Q0 · How does the Rule of 72 estimate doubling time? · PASS

Judge: Directly defined in chunk #0 (similarity 0.82). Sufficient context to answer.

Supporting: #0, #2

Q1 · What are the tradeoffs of monthly vs annual compounding? · UNCOVERED

Judge: No retrieved chunk discusses compounding frequency tradeoffs. All chunks cover the Rule of 72 itself, not periodic compounding.

Supporting: none

R · Response · 5 claims · 2 grounded · 3 ungrounded

The Rule of 72 estimates doubling time by dividing 72 by the annual interest rate. [R0] At a 6% return, the formula gives roughly 12 years to double. [R1] Monthly compounding produces a doubling time of about 11.6 years at the same nominal rate. [R2] Banks generally prefer annual compounding because it reduces operational overhead. [R3] Daily compounding offers diminishing returns above 12 periods per year. [R4]

The diagnosis

Trace-level analysis

Every answer is split into atomic claims, then scored by three LLM judges — Sufficiency, Faithfulness, Completeness — and routed into one of six failure cells. You see which sentence broke, and why.

Sufficiency 0.83
Faithfulness 0.20
Completeness
Failure cell: complete · ungrounded
Observability

RAG health

One composite index — the mean of your three judges — tracked over time. Slice by route, model, or document set and watch the line climb as you tune.

0.80
health · 7d
Suff 0.81
Faith 0.88
Comp 0.72
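
As a rough illustration of the composite index on this card, the sketch below takes the unweighted mean of the three judge scores shown. The function name `rag_health` is hypothetical, and equal weighting of the judges is an assumption rather than a confirmed detail of Veralith.

```python
from statistics import mean


def rag_health(sufficiency: float, faithfulness: float, completeness: float) -> float:
    """Unweighted mean of the three judge scores (equal weighting is assumed)."""
    return mean([sufficiency, faithfulness, completeness])


# Scores taken from the card: Suff 0.81, Faith 0.88, Comp 0.72
score = rag_health(0.81, 0.88, 0.72)
print(f"{score:.2f}")  # prints 0.80
```

Because the index is a plain average, a drop in any single judge pulls the headline number down by a third of that drop, which is why the per-judge breakdown sits next to it.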
Guidance

Ver-Advise

A prescriptive advisor over all your traces. It reads the week's failures and tells you where the next fix pays off most — ranked, with the expected lift.

1. Close the billing-refund gap · +3.1pp
2. Audit completeness on long queries · med
Close the loop

Heal

Hand the diagnosis to your own Claude Code over MCP. It reads your repo, makes the edit, and opens a PR — you review and merge. Failures cluster by root cause, so one fix clears many.

diagnose → edit → open PR
Evidence

Source attribution

Every supported claim links back to the exact chunk it leans on. Every unsupported one is flagged with the evidence it's missing.

supported · unsupported · chunk #3 · p.12
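
A minimal sketch of what that link could look like in code, using the per-claim `grounding_chunk_ranks` field that appears in the example result further down this page. It assumes each rank is an index into the list of retrieved chunks; the `attribute` helper and the sample chunk texts are hypothetical.

```python
# Illustrative retrieved chunks, in retrieval order (made-up text)
chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Shipping takes 3-5 business days.",
]

# Per-claim verdicts in the shape of the example result on this page
claims = [
    {"claim_id": 1, "verdict": "N", "grounding_chunk_ranks": []},
    {"claim_id": 2, "verdict": "Y", "grounding_chunk_ranks": [0]},
]


def attribute(claims: list[dict], chunks: list[str]) -> dict[int, list[str]]:
    """Map each claim id to the text of the chunks it is grounded in."""
    return {
        c["claim_id"]: [chunks[rank] for rank in c["grounding_chunk_ranks"]]
        for c in claims
    }


for claim_id, evidence in attribute(claims, chunks).items():
    print(claim_id, evidence or "no supporting chunk")
```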
Patterns

Knowledge-gap clusters

Failing queries roll up into the topics your corpus can't answer — ranked by volume and trend. You learn which docs to write next, not just that something broke.

Billing & refunds · 42
SSO / SAML setup · 34
API rate limits · 30
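
One way to picture the roll-up is to count failing queries per topic and sort by volume. This is a minimal sketch, not Veralith's clustering method: the `FailingQuery` type, the `rank_gaps` helper, and the idea that a topic label is already attached to each query are assumptions for illustration, and the sample queries are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailingQuery:
    text: str
    topic: str  # hypothetical: assumes a topic label was assigned upstream


def rank_gaps(queries: list[FailingQuery]) -> list[tuple[str, int]]:
    """Return (topic, count) pairs, highest volume first."""
    return Counter(q.topic for q in queries).most_common()


# Illustrative data only
failing = [
    FailingQuery("How do I get a refund?", "Billing & refunds"),
    FailingQuery("Where is my invoice?", "Billing & refunds"),
    FailingQuery("How do I configure SAML?", "SSO / SAML setup"),
]

for topic, count in rank_gaps(failing):
    print(f"{topic}: {count}")
```

Ranking by trend as well as volume would need timestamps on each query, which this sketch leaves out.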
Yours to run

Open-source core

pip install veralith — the judges, classifier, and suggester are open source. Bring your own LLM keys; your traces never leave your boundary.

LangChain · LlamaIndex · Haystack · raw API
Integrate in minutes

One call around your answer.

01

Send

Pass the user query, your retrieved context, and the LLM response — in a single call.

02

Verify

Veralith splits the response into atomic claims and cross-checks each against the context.

03

Act

Read the failure cell, per-claim verdicts, and suggested fix — then route, gate, or heal.

Python
import veralith

# the (query, context, response) your RAG stack already produces
result = veralith.evaluate(
    query="What is the refund policy?",     # user's question
    context=knowledge_base,                 # your retrieved chunks
    response=llm_output,                    # the answer your LLM gave
    persist=False,
)

# a named failure cell — not a yes/no flag
print(result.diagnosis.failure_cell.value)     # → 'incomplete_ungrounded'

# per-claim faithfulness, plus the one fix to apply next
print(result.faithfulness[0].verdict.value)    # → 'N'
print(result.suggestion.actions[0])

Example result, as JSON:
{
  "diagnosis": {
    "failure_cell": "incomplete_ungrounded",
    "sufficiency_level": "low"
  },
  "faithfulness": [
    { "claim_id": 1, "verdict": "N", "grounding_chunk_ranks": [] },
    { "claim_id": 2, "verdict": "Y", "grounding_chunk_ranks": [0] }
  ],
  "suggestion": {
    "title": "Worst-case failure",
    "actions": [
      "Fix retrieval first: bump K, audit the corpus, re-chunk, re-embed.",
      "Add an abort path: if Sufficiency is low at eval time, return a refusal instead of the generated answer.",
      "Tighten the generator prompt to refuse when context is thin."
    ]
  }
}
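
To make the "route, gate, or heal" step concrete, here is a minimal sketch that works on a trimmed copy of the JSON payload above rather than on the live result object. The field names come from that payload. The `decide` function and its policy (refuse when sufficiency is low, flag when any claim is ungrounded) are assumptions for illustration, as is the reading of verdict "N" as "not grounded".

```python
import json

# Trimmed copy of the example payload (suggestion block omitted)
payload = json.loads("""
{
  "diagnosis": {"failure_cell": "incomplete_ungrounded", "sufficiency_level": "low"},
  "faithfulness": [
    {"claim_id": 1, "verdict": "N", "grounding_chunk_ranks": []},
    {"claim_id": 2, "verdict": "Y", "grounding_chunk_ranks": [0]}
  ]
}
""")


def decide(result: dict) -> tuple[str, list[int]]:
    """Return an action plus the ids of claims judged ungrounded (hypothetical policy)."""
    ungrounded = [c["claim_id"] for c in result["faithfulness"] if c["verdict"] == "N"]
    if result["diagnosis"]["sufficiency_level"] == "low":
        return "refuse", ungrounded  # context too thin to answer safely
    if ungrounded:
        return "flag", ungrounded    # answer ships, with weak claims marked
    return "pass", ungrounded


print(decide(payload))  # prints ('refuse', [1])
```

Keeping the policy in your own code, rather than inside the evaluator, matches the "diagnosed, never blocked" stance: the diagnosis is data, and what happens next is your call.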

MIT-licensed core · bring your own keys · also via the @veralith.trace decorator, the LangChain adapter, or the hosted REST API.

Playground

Pick an answer. Watch it get checked.

A question, its retrieved context, and a generated answer. Run the check and Veralith decomposes the answer into claims, grounds each one against the context, and tells you exactly what failed — live, no signup.

Start free · Read the docs

Illustrative: verdicts shown are pre-computed.
Pricing

Start free. Scale when it ships.

Hobby
$0

For prototypes and side projects finding their footing.

  • 10,000 claims / month
  • Claim-level verdicts
  • Playground access
  • Community support
Start free
Team
$79 / mo

For teams running RAG in production and tuning it weekly.

  • 1M claims / month
  • Streaming + auto-correction
  • Observability dashboards
  • Email & Slack support
Start 14-day trial
Enterprise
Custom

For regulated, high-volume, or self-hosted deployments.

  • Unlimited claims
  • In-VPC / on-prem verifier
  • SSO, audit logs, SLAs
  • Dedicated solutions engineer
Talk to us

Example pricing for this mockup, not Veralith's real plans.

Stop shipping hallucinations.

Make ’em grounded!