LettuceDetect
Hallucination detection for RAG systems, using token-level classification over context, question, and answer.
What it does
LettuceDetect checks whether a generated answer is supported by the context it was given.
It reads the same triple a RAG system already has: retrieved context, user question, and candidate answer. Instead of asking another language model to judge the answer, LettuceDetect uses an encoder classifier to label answer tokens as supported or unsupported.
That makes the detector small enough to run as a normal production component. The English model is built on ModernBERT, the multilingual line uses multilingual encoder variants, and the public family includes smaller variants for constrained deployments.
The useful output is not a score alone. It is the location of the unsupported text, so the system can show a reviewer what failed, regenerate the answer, or block the result before it reaches a user.
- Score: 79.22% F1 on RAGTruth example-level hallucination detection
- Speed: 30 to 60 examples per second on a single GPU in the published evaluation
- Context: 8,192-token context window for ModernBERT and multilingual variants
- Size: 150M parameters for the base English model, about 30 times smaller than LLM-judge detectors
How it works
The detector sits after generation, where it can compare every answer token against the retrieved context before the answer is accepted.
LettuceDetect turns hallucination detection into token classification. The model reads the full RAG triple and labels answer tokens as supported or unsupported, so the application can pass the answer, flag spans, or route it to review.
Context, question, answer
- context: The retention period under Article 6 is six years.
- question: What is the retention period?
- answer: The retention period is six years and applies to all customers.
Encoder token classifier
- input: context + question + answer
- decision: supported and unsupported token spans
- runtime: single-GPU batch inference or local CPU checks
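The step from per-token labels to readable spans can be sketched as follows. This is a minimal illustration, not the library's internal code: the helper name `merge_token_labels`, the whitespace tokenization, and the hand-written labels are all assumptions standing in for the encoder's tokenizer and classifier output.

```python
from typing import Dict, List, Tuple


def merge_token_labels(
    answer: str,
    offsets: List[Tuple[int, int]],
    labels: List[int],
) -> List[Dict]:
    """Merge runs of consecutive unsupported tokens (label 1) into character spans."""
    spans: List[List[int]] = []
    current = None  # [start, end] of the span currently being built
    for (start, end), label in zip(offsets, labels):
        if label == 1:
            if current is None:
                current = [start, end]
            else:
                current[1] = end  # extend the open span
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [{"start": s, "end": e, "text": answer[s:e]} for s, e in spans]


answer = "The retention period is six years and applies to all customers."

# Hypothetical whitespace tokenization; a real encoder uses subword offsets.
offsets = []
pos = 0
for word in answer.split(" "):
    offsets.append((pos, pos + len(word)))
    pos += len(word) + 1

# Hypothetical classifier output: the last five tokens are unsupported.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

spans = merge_token_labels(answer, offsets, labels)
print(spans)
```

Merging adjacent flagged tokens keeps the output at phrase granularity, which is what a reviewer or a regeneration prompt can actually use.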
Verified or flagged output
The supported claim passes with its source context. The unsupported phrase is returned as a span, so the system can send it to review, regenerate the answer, or block the response.
- supported_score: Aggregate confidence that the answer is grounded in the retrieved context.
- unsupported_spans[]: Text spans that the model cannot support from the supplied context.
- action: Pass, review, regenerate, or block according to the deployment policy.
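A deployment policy over these fields might look like the sketch below. The `CheckResult` class, the `choose_action` function, and both thresholds are hypothetical; they only mirror the field names listed above and are not part of the library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckResult:
    # Hypothetical container mirroring the output fields described above.
    supported_score: float
    unsupported_spans: List[str] = field(default_factory=list)


def choose_action(
    result: CheckResult,
    pass_threshold: float = 0.9,   # assumed value, tune per deployment
    block_threshold: float = 0.5,  # assumed value, tune per deployment
) -> str:
    """Map a detector result to pass, review, regenerate, or block."""
    if not result.unsupported_spans and result.supported_score >= pass_threshold:
        return "pass"
    if result.supported_score < block_threshold:
        return "block"
    if result.unsupported_spans:
        return "regenerate"
    # No flagged span, but confidence is below the pass threshold.
    return "review"


print(choose_action(CheckResult(0.97)))
print(choose_action(CheckResult(0.70, ["and applies to all customers"])))
```

Keeping the policy outside the detector means the thresholds can change per product surface without retraining or redeploying the model.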
LettuceDetect does not replace retrieval or generation. It checks the answer after the LLM has written it, using the context the RAG system retrieved. That placement makes the failure mode visible: if a phrase is not supported by the context, the detector returns the phrase rather than hiding the issue behind a single judge score.
Results
The public evaluation focuses on the production tradeoff: enough accuracy to be useful, with inference cost closer to an encoder than an LLM judge.
- RAGTruth F1: 79.22%. Example-level detection score for the English base model on RAGTruth.
- Encoder gain: +14.8%. Improvement over the previous best encoder-based hallucination detector reported in the paper.
- Throughput: 30 to 60 / sec. Single-GPU inference range for checking context, question, and answer triples.
- Deployment size: about 30x smaller. Compared with the strongest LLM-based detector class discussed in the paper.
The headline score is the RAGTruth example-level F1 reported in the paper. Throughput and size matter because hallucination detection usually sits in the request path: if the checker is as expensive as another LLM call, many teams skip it. For the full setup, category breakdown, and baseline comparison, read the arXiv paper.
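To reproduce an example-level score on your own logs, one reading of the metric is that an answer counts as hallucinated when the detector flags at least one span. The sketch below assumes that reading; the function name and the toy data are illustrative, and the paper remains the reference for the exact evaluation protocol.

```python
from typing import List, Tuple


def example_level_f1(gold: List[bool], predicted: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where True means 'answer contains a hallucination'."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data: human labels per answer, and the number of spans the detector flagged.
gold = [True, True, False, False, True]
span_counts = [1, 0, 0, 2, 3]

# An answer is predicted as hallucinated if any span was flagged.
predicted = [count > 0 for count in span_counts]

precision, recall, f1 = example_level_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```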
Models on Hugging Face
The model family separates language coverage and deployment size, so teams can choose the detector that matches their corpus and latency budget.
- lettucedect-base-modernbert-en-v1: English, 8,192 tokens. Base English detector.
- lettucedect-large-modernbert-en-v1: English, 8,192 tokens. Larger English detector.
- lettucedect-mmbert-base-hu-v1: Hungarian, 4,096 tokens. Base Hungarian detector.
- lettucedect-mmbert-small-hu-v1: Hungarian, 4,096 tokens. Smaller Hungarian detector.
- tinylettuce-ettin-68m-en: English, 2,048 tokens. Low-resource deployment variant.
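Choosing a model from this list can be reduced to a small lookup. The sketch below copies the names, languages, and context windows from the list above; the `candidates` helper and the `tier` labels are assumptions added for illustration.

```python
from typing import Dict, List

# Values copied from the model list above; "tier" is an illustrative label.
MODELS: List[Dict] = [
    {"name": "lettucedect-base-modernbert-en-v1", "language": "en", "max_tokens": 8192, "tier": "base"},
    {"name": "lettucedect-large-modernbert-en-v1", "language": "en", "max_tokens": 8192, "tier": "large"},
    {"name": "lettucedect-mmbert-base-hu-v1", "language": "hu", "max_tokens": 4096, "tier": "base"},
    {"name": "lettucedect-mmbert-small-hu-v1", "language": "hu", "max_tokens": 4096, "tier": "small"},
    {"name": "tinylettuce-ettin-68m-en", "language": "en", "max_tokens": 2048, "tier": "tiny"},
]


def candidates(language: str, needed_tokens: int) -> List[str]:
    """Return models for the language whose context window fits the input length."""
    return [
        m["name"]
        for m in MODELS
        if m["language"] == language and m["max_tokens"] >= needed_tokens
    ]


# A 3,000-token English triple rules out the 2,048-token variant.
print(candidates("en", 3000))
# A 5,000-token Hungarian triple exceeds every listed Hungarian model.
print(candidates("hu", 5000))
```

An empty result is a signal to shorten the retrieved context or split the check, not to truncate silently.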
For developers
Use LettuceDetect when your pipeline already produces context, question, and answer triples and you need a support check before returning the answer.
Inspect the moving parts
The repo, paper, and model cards are public so teams can test detection quality against their own RAG logs before adoption.
- GitHub: Library, training scripts, examples, model usage, and evaluation code.
- Hugging Face collection: Published English, multilingual, Hungarian, and TinyLettuce model family.
- Model (base ModernBERT EN): English base model with an 8,192-token context window.
- Model (large ModernBERT EN): Larger English detector for accuracy-sensitive evaluation runs.
- Model (base MMBERT HU): Hungarian detector built on a multilingual encoder.
- Paper (arXiv 2502.17125): Covers architecture, benchmark setup, and comparison to LLM-judge baselines.
Start with a support check
The detector takes the RAG triple directly. Use the single-record call for application paths and batch calls for evaluation jobs.
Install
Add the package to the service that receives generated answers and retrieved context.
pip install lettucedetect
Detect
Return unsupported spans so the application can review, regenerate, or block the answer.
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)
result = detector.detect(
    context="The retention period under Article 6 is six years.",
    question="What is the retention period?",
    answer="The retention period is six years and applies to all customers.",
)
for span in result.unsupported_spans:
    print("unsupported:", repr(span.text))
Batch
Run the same detector over saved RAG logs to measure how often answers contain unsupported spans.
records = [
    {
        "context": "The contract renews every 12 months.",
        "question": "How often does the contract renew?",
        "answer": "The contract renews every 12 months.",
    },
    {
        "context": "The policy applies to EU customers.",
        "question": "Who does the policy apply to?",
        "answer": "The policy applies to EU and US customers.",
    },
]
results = detector.detect_batch(records)
for item in results:
    print(item.supported_score, item.unsupported_spans)
Paper reference
@article{kovacs-recski-2025-lettucedetect,
  title = {LettuceDetect: a hallucination-detection framework for RAG applications},
  author = {Kovacs, Adam and Recski, Gabor},
  year = {2025},
  journal = {arXiv preprint arXiv:2502.17125},
  url = {https://arxiv.org/abs/2502.17125}
}
Compatibility and licensing
The adoption boundary is simple: code runs in Python, models run through the usual transformer stack, and the detector only needs the RAG triple.
- code: github.com/KRLabsOrg/LettuceDetect, MIT licence.
- models: English ModernBERT, Hungarian multilingual-encoder variants, and TinyLettuce models under the KR Labs Hugging Face organisation.
- runtime: PyTorch, CPU or CUDA, with batch inference for offline evaluation and request-path checks for production systems.
- integration: Works with any RAG pipeline that can provide context, question, and answer.
Combine with the rest of the stack
Each product can run on its own. Together, they turn an LLM answer into something a team can inspect, reject, or enforce.
LettuceDetect is the support-check layer: it asks whether the answer is actually grounded in the retrieved context. Pair it with an evidence extraction layer to make the answer easier to verify, and a rules layer to decide what should happen after the check. The result is not just a pass or fail, but a record of which text was supported, which text was not, and what action followed.
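The record described here can be sketched as a plain data structure. The `build_record` helper, its field names, and the character-offset span format are assumptions for illustration; they show one way to store which text was supported, which was not, and what action followed.

```python
import json
from typing import Dict, List, Tuple


def build_record(answer: str, unsupported: List[Tuple[int, int]], action: str) -> Dict:
    """Split an answer into supported and unsupported segments and attach the action.

    `unsupported` holds non-overlapping (start, end) character offsets.
    """
    segments = []
    cursor = 0
    for start, end in sorted(unsupported):
        if start > cursor:
            segments.append({"text": answer[cursor:start], "supported": True})
        segments.append({"text": answer[start:end], "supported": False})
        cursor = end
    if cursor < len(answer):
        segments.append({"text": answer[cursor:], "supported": True})
    return {"answer": answer, "segments": segments, "action": action}


answer = "The policy applies to EU and US customers."
start = answer.index("and US")

# Hypothetical detector output: one unsupported span, routed to review.
record = build_record(answer, [(start, start + len("and US"))], "review")
print(json.dumps(record, indent=2))
```

Because the segments concatenate back to the original answer, the record can be rendered as highlighted text for a reviewer or stored as an audit entry.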
Check the answer before it leaves the system
Start with the evaluation details, or inspect the repository for the detector, model scripts, examples, and deployment path.