LettuceDetect

Hallucination detection for RAG systems, using token-level classification over context, question, and answer.

What it does

LettuceDetect checks whether a generated answer is supported by the context it was given.

It reads the same triple a RAG system already has: retrieved context, user question, and candidate answer. Instead of asking another language model to judge the answer, LettuceDetect uses an encoder classifier to label answer tokens as supported or unsupported.

That makes the detector small enough to run as a normal production component. The English model is built on ModernBERT, the multilingual models are built on multilingual encoders, and the public family also includes smaller variants for constrained deployments.

The useful output is not a score alone. It is the location of the unsupported text, so the system can show a reviewer what failed, regenerate the answer, or block the result before it reaches a user.

score: 79.22% F1 on RAGTruth example-level hallucination detection
speed: 30 to 60 examples per second on a single GPU in the published evaluation
context: 8,192-token context window for ModernBERT and multilingual variants
size: 150M parameters in the base English model, about 30 times smaller than LLM-judge detectors

How it works

The detector sits after generation, where it can compare every answer token against the retrieved context before the answer is accepted.

support check

LettuceDetect turns hallucination detection into token classification. The model reads the full RAG triple and labels each part of the answer as supported or unsupported, so unsupported spans can be routed to review.

rag triple

Context, question, answer

context: The retention period under Article 6 is six years.

question: What is the retention period?

answer: The retention period is six years and applies to all customers.

classification boundary

Encoder token classifier

input: context + question + answer
decision: supported and unsupported token spans
runtime: single-GPU batch inference or local CPU checks
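
To make the classification boundary concrete, here is a minimal sketch of how per-token labels could be merged into unsupported character spans. The `TokenLabel` type, the whitespace tokenization, and the hand-written labels are illustrative assumptions standing in for the model; they are not LettuceDetect's internal representation.

```python
from dataclasses import dataclass


@dataclass
class TokenLabel:
    """Hypothetical per-token output: character offsets plus a binary label."""
    start: int
    end: int
    unsupported: bool


def whitespace_offsets(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-separated tokens."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        offsets.append((start, pos))
    return offsets


def merge_unsupported(answer: str, tokens: list[TokenLabel]) -> list[dict]:
    """Merge runs of consecutive unsupported tokens into character spans."""
    runs, current = [], None
    for tok in tokens:
        if tok.unsupported:
            if current is None:
                current = [tok.start, tok.end]  # open a new run
            else:
                current[1] = tok.end            # extend the open run
        elif current is not None:
            runs.append(current)                # a supported token closes the run
            current = None
    if current is not None:
        runs.append(current)
    return [{"start": s, "end": e, "text": answer[s:e]} for s, e in runs]


answer = "The retention period is six years and applies to all customers."
# Hand-written labels: the first six tokens are supported by the context.
tokens = [
    TokenLabel(start, end, unsupported=(i >= 6))
    for i, (start, end) in enumerate(whitespace_offsets(answer))
]
spans = merge_unsupported(answer, tokens)
print(spans)
```

Returning offsets alongside the text lets a caller point at the exact location in the answer instead of searching for the phrase again.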

action surface

Verified or flagged output

The supported claim passes with its source context. The unsupported phrase is returned as a span, so the system can send it to review, regenerate the answer, or block the response.

supported_score: Aggregate confidence that the answer is grounded in the retrieved context.
unsupported_spans[]: Text spans that the model cannot support from the supplied context.
action: Pass, review, regenerate, or block according to the deployment policy.

LettuceDetect does not replace retrieval or generation. It checks the answer after the LLM has written it, using the context the RAG system retrieved. That placement makes the failure mode visible: if a phrase is not supported by the context, the detector returns the phrase rather than hiding the issue behind a single judge score.
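
The action surface described above can be sketched as a small policy function. Everything here is an assumption for illustration: the `CheckResult` container mirrors the listed fields, and the `block_below` threshold and retry budget are hypothetical values a deployment would choose for itself.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Hypothetical container mirroring the fields listed above."""
    supported_score: float
    unsupported_spans: list[str] = field(default_factory=list)


def choose_action(result: CheckResult, attempts_left: int, block_below: float = 0.5) -> str:
    """Map a check result to pass, regenerate, block, or review.

    The ordering and the 0.5 threshold are illustrative policy choices.
    """
    if not result.unsupported_spans:
        return "pass"            # nothing flagged: release the answer
    if attempts_left > 0:
        return "regenerate"      # try again while the retry budget lasts
    if result.supported_score < block_below:
        return "block"           # low aggregate support and no retries left
    return "review"              # flagged but borderline: send to a human


clean = CheckResult(supported_score=0.98)
flagged = CheckResult(supported_score=0.70, unsupported_spans=["and applies to all customers"])
weak = CheckResult(supported_score=0.20, unsupported_spans=["and applies to all customers"])

print(choose_action(clean, attempts_left=1))
print(choose_action(flagged, attempts_left=1))
print(choose_action(flagged, attempts_left=0))
print(choose_action(weak, attempts_left=0))
```

Keeping the policy separate from the detector means the thresholds can change per product without retraining or redeploying the model.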

Results

The public evaluation focuses on the production tradeoff: enough accuracy to be useful, with inference cost closer to an encoder than an LLM judge.

RAGTruth F1: 79.22%. Example-level detection score for the English base model on RAGTruth.
Encoder gain: +14.8%. Improvement over the previous best encoder-based hallucination detector reported in the paper.
Throughput: 30 to 60 / sec. Single-GPU inference range for checking context, question, and answer triples.
Deployment size: about 30x smaller. Compared with the strongest LLM-based detector class discussed in the paper.

The headline score is the RAGTruth example-level F1 reported in the paper. Throughput and size matter because hallucination detection usually sits in the request path: if the checker is as expensive as another LLM call, many teams skip it. For the full setup, category breakdown, and baseline comparison, read the arXiv paper.
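
For readers who want to reproduce an example-level score on their own logs, here is a minimal sketch of the arithmetic. It assumes an answer counts as a positive prediction when the detector returns at least one unsupported span, and that "contains a hallucination" is the positive class. The labels below are toy data, not RAGTruth results.

```python
def example_level_scores(gold: list[bool], predicted: list[bool]) -> dict:
    """Precision, recall, and F1 with 'hallucinated' as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: True means the answer really contains a hallucination.
gold = [True, True, False, False, True]
# Unsupported spans returned per answer (hand-written, not model output).
span_lists = [["six months"], [], [], ["in 2019"], ["and US"]]
# An answer is predicted positive when any unsupported span is returned.
predicted = [len(spans) > 0 for spans in span_lists]

scores = example_level_scores(gold, predicted)
print(scores)
```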

Models on Hugging Face

The model family separates language coverage and deployment size, so teams can choose the detector that matches their corpus and latency budget.

Language   Model                               Context        Role
English    lettucedect-base-modernbert-en-v1   8,192 tokens   Base English detector
English    lettucedect-large-modernbert-en-v1  8,192 tokens   Larger English detector
Hungarian  lettucedect-mmbert-base-hu-v1       4,096 tokens   Base Hungarian detector
Hungarian  lettucedect-mmbert-small-hu-v1      4,096 tokens   Smaller Hungarian detector
English    tinylettuce-ettin-68m-en            2,048 tokens   Low-resource deployment variant
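
One way to apply the list above is to filter it by language and by the longest input a pipeline produces. The sketch below copies the model names and context windows from this page; the `tier` labels, the `candidates` helper, and the idea that the caller supplies a token count are illustrative assumptions.

```python
# Model names and context windows as listed on this page; "tier" is a hypothetical label.
MODELS = [
    {"language": "en", "name": "lettucedect-base-modernbert-en-v1", "max_tokens": 8192, "tier": "base"},
    {"language": "en", "name": "lettucedect-large-modernbert-en-v1", "max_tokens": 8192, "tier": "large"},
    {"language": "hu", "name": "lettucedect-mmbert-base-hu-v1", "max_tokens": 4096, "tier": "base"},
    {"language": "hu", "name": "lettucedect-mmbert-small-hu-v1", "max_tokens": 4096, "tier": "small"},
    {"language": "en", "name": "tinylettuce-ettin-68m-en", "max_tokens": 2048, "tier": "tiny"},
]


def candidates(language: str, needed_tokens: int) -> list[str]:
    """Names of models for `language` whose window fits `needed_tokens`.

    `needed_tokens` is the length of context + question + answer, which the
    caller is assumed to have measured with the target model's tokenizer.
    """
    return [
        m["name"]
        for m in MODELS
        if m["language"] == language and m["max_tokens"] >= needed_tokens
    ]


print(candidates("en", 3000))
print(candidates("en", 1500))
print(candidates("hu", 5000))
```

An empty result signals that the input has to be shortened or split before checking, rather than silently truncated.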

For developers

Use LettuceDetect when your pipeline already produces context, question, and answer triples and you need a support check before returning the answer.

inspect

Inspect the moving parts

The repo, paper, and model cards are public so teams can test detection quality against their own RAG logs before adoption.

GitHub: Library, training scripts, examples, model usage, and evaluation code.
Hugging Face collection: Published English, multilingual, Hungarian, and TinyLettuce model family.
Model (base ModernBERT EN): English base model with an 8,192-token context window.
Model (large ModernBERT EN): Larger English detector for accuracy-sensitive evaluation runs.
Model (base MMBERT HU): Hungarian detector built on a multilingual encoder.
Paper (arXiv 2502.17125): Paper covering architecture, benchmark setup, and comparison to LLM-judge baselines.

run

Start with a support check

The detector takes the RAG triple directly. Use the single-record call for application paths and batch calls for evaluation jobs.

Install

Add the package to the service that receives generated answers and retrieved context.

bash

pip install lettucedetect

Detect

Return unsupported spans so the application can review, regenerate, or block the answer.

python

from lettucedetect.models.inference import HallucinationDetector

# Load the English base detector from the Hugging Face Hub.
detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

# predict() takes the retrieved passages as a list of strings.
spans = detector.predict(
    context=["The retention period under Article 6 is six years."],
    question="What is the retention period?",
    answer="The retention period is six years and applies to all customers.",
    output_format="spans",
)

# Each returned span is a dict holding the unsupported text and its offsets.
for span in spans:
    print("unsupported:", repr(span["text"]))
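
Once unsupported spans are available, a reviewer-facing view can mark them inside the answer. The sketch below assumes each span is a dict with `start` and `end` character offsets into the answer; that shape, the `highlight` helper, and the bracket markers are illustrative, and the span shown is hand-written rather than model output.

```python
def highlight(answer: str, spans: list[dict], open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap each unsupported span in markers. Spans are assumed not to overlap."""
    parts, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        parts.append(answer[cursor:span["start"]])  # supported text before the span
        parts.append(open_mark + answer[span["start"]:span["end"]] + close_mark)
        cursor = span["end"]
    parts.append(answer[cursor:])                   # supported tail after the last span
    return "".join(parts)


answer = "The retention period is six years and applies to all customers."
# Hand-written span covering "and applies to all customers".
example_spans = [{"start": 34, "end": 62}]

marked = highlight(answer, example_spans)
print(marked)
```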

Batch

Run the same detector over saved RAG logs to measure how often answers contain unsupported spans.

python

records = [
    {
        "context": "The contract renews every 12 months.",
        "question": "How often does the contract renew?",
        "answer": "The contract renews every 12 months.",
    },
    {
        "context": "The policy applies to EU customers.",
        "question": "Who does the policy apply to?",
        "answer": "The policy applies to EU and US customers.",
    },
]

# Check each saved record and print the unsupported spans it returns.
for record in records:
    spans = detector.predict(
        context=[record["context"]],
        question=record["question"],
        answer=record["answer"],
        output_format="spans",
    )
    print(len(spans), [span["text"] for span in spans])
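
To turn a run over saved logs into a single number, and to budget time for it, something like the following sketch may help. It assumes the per-record results have been reduced to lists of unsupported span texts. Both helpers are hypothetical, and the runtime bounds simply divide by the published 30 to 60 examples per second, which will differ on other hardware and input lengths.

```python
def flagged_rate(span_lists: list[list[str]]) -> float:
    """Fraction of answers that contain at least one unsupported span."""
    if not span_lists:
        return 0.0
    return sum(1 for spans in span_lists if spans) / len(span_lists)


def runtime_range_seconds(n_records: int, slow_rate: float = 30.0, fast_rate: float = 60.0) -> tuple[float, float]:
    """Rough (best, worst) wall-clock bounds for n_records at the given rates."""
    return n_records / fast_rate, n_records / slow_rate


# Hand-written results for two records: the first clean, the second flagged.
span_lists = [[], ["and US"]]
rate = flagged_rate(span_lists)
best, worst = runtime_range_seconds(90_000)

print(f"flagged answers: {rate:.0%}")
print(f"estimated runtime for 90,000 records: {best:.0f}s to {worst:.0f}s")
```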

cite

Paper reference

bibtex

@article{kovacs-recski-2025-lettucedetect,
  title = {LettuceDetect: a hallucination-detection framework for RAG applications},
  author = {Kovacs, Adam and Recski, Gabor},
  year = {2025},
  journal = {arXiv preprint arXiv:2502.17125},
  url = {https://arxiv.org/abs/2502.17125}
}

Compatibility and licensing

The adoption boundary is simple: the code runs in Python, the models load through the standard Hugging Face and PyTorch stack, and the detector needs only the RAG triple.

code: github.com/KRLabsOrg/LettuceDetect, MIT licence.
models: English ModernBERT, Hungarian multilingual-encoder variants, and TinyLettuce models under the KR Labs Hugging Face organisation.
runtime: PyTorch, CPU or CUDA, with batch inference for offline evaluation and request-path checks for production systems.
integration: Works with any RAG pipeline that can provide context, question, and answer.
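
Because the only contract is the RAG triple, the check can be wired in as a plain function around any generator. The sketch below is one possible shape under stated assumptions: `guarded_answer`, the callable signatures, and the stub generator and checker are all hypothetical, and a real deployment would pass in its own LLM call and a wrapper around the detector.

```python
from typing import Callable

# Hypothetical signatures: the pipeline supplies both callables.
Generate = Callable[[str, str], str]          # (context, question) -> answer
Check = Callable[[str, str, str], list[str]]  # (context, question, answer) -> unsupported texts


def guarded_answer(context: str, question: str, generate: Generate, check: Check, max_attempts: int = 2) -> dict:
    """Generate, check, and regenerate until no unsupported spans remain."""
    unsupported: list[str] = []
    for attempt in range(1, max_attempts + 1):
        answer = generate(context, question)
        unsupported = check(context, question, answer)
        if not unsupported:
            return {"status": "pass", "answer": answer, "attempts": attempt, "unsupported": []}
    # Retry budget exhausted: withhold the answer and report what failed.
    return {"status": "blocked", "answer": None, "attempts": max_attempts, "unsupported": unsupported}


# Stubs standing in for the LLM and the detector.
drafts = iter([
    "The retention period is six years and applies to all customers.",
    "The retention period is six years.",
])


def fake_generate(context: str, question: str) -> str:
    return next(drafts)


def fake_check(context: str, question: str, answer: str) -> list[str]:
    phrase = "applies to all customers"
    return [phrase] if phrase in answer and phrase not in context else []


result = guarded_answer(
    "The retention period under Article 6 is six years.",
    "What is the retention period?",
    fake_generate,
    fake_check,
)
print(result)
```

Injecting the two callables keeps the guard independent of any particular LLM client or detector version, which also makes it easy to test with stubs like the ones above.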

Combine with the rest of the stack

Each product can run on its own. Together, they turn an LLM answer into something a team can inspect, reject, or enforce.

LettuceDetect is the support-check layer: it asks whether the answer is actually grounded in the retrieved context. Pair it with an evidence extraction layer to make the answer easier to verify, and a rules layer to decide what should happen after the check. The result is not just a pass or fail, but a record of which text was supported, which text was not, and what action followed.

Check the answer before it leaves the system

Start with the evaluation details, or inspect the repository for the detector, model scripts, examples, and deployment path.
