LettuceDetect

Hallucination detection for RAG systems, using token-level classification over context, question, and answer.

What it does

LettuceDetect checks whether a generated answer is supported by the context it was given.

It reads the same triple a RAG system already has: retrieved context, user question, and candidate answer. Instead of asking another language model to judge the answer, LettuceDetect uses an encoder classifier to label answer tokens as supported or unsupported.

That makes the detector small enough to run as a normal production component. The English model is built on ModernBERT, the multilingual models use multilingual encoders, and the public family includes smaller variants for constrained deployments.

The useful output is not a score alone. It is the location of the unsupported text, so the system can show a reviewer what failed, regenerate the answer, or block the result before it reaches a user.

score
79.22% F1 on RAGTruth example-level hallucination detection
speed
30 to 60 examples per second on a single GPU in the published evaluation
context
8,192-token context window for ModernBERT and multilingual variants
size
150M parameters in the base English model, about 30 times smaller than LLM-judge detectors

How it works

The detector sits after generation, where it can compare every answer token against the retrieved context before the answer is accepted.

support check

LettuceDetect turns hallucination detection into token classification. The model reads the full RAG triple, labels each answer token as supported or unsupported, and returns the unsupported tokens as spans that the application can flag or send to review.

rag triple

Context, question, answer

context: The retention period under Article 6 is six years.

question: What is the retention period?

answer: The retention period is six years and applies to all customers.

classification boundary

Encoder token classifier

input
context + question + answer
decision
supported and unsupported token spans
runtime
single-GPU batch inference or local CPU checks
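
To make the classification boundary concrete, here is a minimal sketch of how per-token labels might be merged into character spans. The `(start, end, label)` token tuples and the `merge_unsupported_tokens` helper are hypothetical illustrations, not LettuceDetect's internal code, and the labels are hand-written toy data rather than model output.

```python
from typing import List, Tuple

# Hypothetical token record: (char_start, char_end, label), where label 1 = unsupported.
Token = Tuple[int, int, int]


def merge_unsupported_tokens(answer: str, tokens: List[Token]) -> List[dict]:
    """Merge runs of consecutive unsupported tokens into character spans."""
    spans = []
    current = None  # [start, end] of the span currently being built
    for start, end, label in tokens:
        if label == 1:
            if current is None:
                current = [start, end]
            else:
                current[1] = end  # extend the open span
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [{"start": s, "end": e, "text": answer[s:e]} for s, e in spans]


answer = "The retention period is six years and applies to all customers."

# Word-level tokens for illustration; a real tokenizer would produce subwords.
tokens = []
pos = 0
cutoff = answer.index("and applies")  # toy labelling: everything from here is unsupported
for word in answer.split(" "):
    start = answer.index(word, pos)
    end = start + len(word)
    tokens.append((start, end, 1 if start >= cutoff else 0))
    pos = end

spans = merge_unsupported_tokens(answer, tokens)
print(spans)
```

Merging at the character level keeps the output independent of the tokenizer, so the same span can be shown to a reviewer or stored in a log.
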
action surface

Verified or flagged output

The supported claim passes with its source context. The unsupported phrase is returned as a span, so the system can send it to review, regenerate the answer, or block the response.

supported_score
Aggregate confidence that the answer is grounded in the retrieved context.
unsupported_spans[]
Text spans that the model cannot support from the supplied context.
action
Pass, review, regenerate, or block according to the deployment policy.
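
The mapping from detector output to an action is a deployment decision, so the following is only a sketch of one possible policy. The `Span` dataclass, the `choose_action` function, and both thresholds are assumptions made for illustration; none of them come from the LettuceDetect package.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical container for one unsupported span."""
    start: int
    end: int
    confidence: float
    text: str


def choose_action(answer: str, spans: List[Span],
                  regenerate_confidence: float = 0.5,
                  block_fraction: float = 0.5) -> str:
    """Map unsupported spans to pass, review, regenerate, or block.

    Both thresholds are illustrative defaults, not recommended values.
    """
    if not spans:
        return "pass"
    flagged_chars = sum(s.end - s.start for s in spans)
    fraction = flagged_chars / max(len(answer), 1)
    if fraction >= block_fraction:
        return "block"  # most of the answer is unsupported
    if max(s.confidence for s in spans) >= regenerate_confidence:
        return "regenerate"  # a confident miss: try generating again
    return "review"  # low-confidence flag: let a person decide


answer = "The retention period is six years and applies to all customers."
flagged = [Span(34, 63, 0.9, answer[34:63])]
print(choose_action(answer, flagged))
```

Keeping the policy outside the detector means the thresholds can be tuned per product without retraining or swapping the model.
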

LettuceDetect does not replace retrieval or generation. It checks the answer after the LLM has written it, using the context the RAG system retrieved. That placement makes the failure mode visible: if a phrase is not supported by the context, the detector returns the phrase rather than hiding the issue behind a single judge score.
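
Returning the phrase itself is what lets a reviewer see the failure in place. As a rough sketch, assuming each span is a dict with `start` and `end` character offsets into the answer, a hypothetical `highlight` helper could mark the unsupported text like this:

```python
def highlight(answer: str, spans: list,
              open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap each unsupported span in markers so a reviewer can see it in place."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end = max(span["start"], cursor), span["end"]
        if end <= start:
            continue  # skip empty or fully overlapped spans
        out.append(answer[cursor:start])  # supported text before the span
        out.append(open_mark + answer[start:end] + close_mark)
        cursor = end
    out.append(answer[cursor:])  # supported text after the last span
    return "".join(out)


answer = "The retention period is six years and applies to all customers."
marked = highlight(answer, [{"start": 34, "end": 63}])
print(marked)
```
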

Results

The public evaluation focuses on the production tradeoff: enough accuracy to be useful, with inference cost closer to an encoder than an LLM judge.

  • RAGTruth F1 79.22%

    Example-level detection score for the English base model on RAGTruth.

  • Encoder gain +14.8%

    Improvement over the previous best encoder-based hallucination detector reported in the paper.

  • Throughput 30 to 60 / sec

    Single-GPU inference range for checking context, question, and answer triples.

  • Deployment size about 30x smaller

    Compared with the strongest LLM-based detector class discussed in the paper.

The headline score is the RAGTruth example-level F1 reported in the paper. Throughput and size matter because hallucination detection usually sits in the request path: if the checker is as expensive as another LLM call, many teams skip it. For the full setup, category breakdown, and baseline comparison, read the arXiv paper.
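
For readers who want to reproduce an example-level score on their own logs, the arithmetic can be sketched as follows. This assumes an answer counts as hallucinated when the detector returns at least one unsupported span; the labels below are toy values, not results from the paper.

```python
def example_level_f1(gold, predicted):
    """Precision, recall, and F1 for the 'hallucinated' class at answer level."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels: True means the answer contains a hallucination.
gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]

precision, recall, f1 = example_level_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```
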

Models on Hugging Face

The model family separates language coverage and deployment size, so teams can choose the detector that matches their corpus and latency budget.

For developers

Use LettuceDetect when your pipeline already produces context, question, and answer triples and you need a support check before returning the answer.

inspect

Inspect the moving parts

The repo, paper, and model cards are public so teams can test detection quality against their own RAG logs before adoption.

run

Start with a support check

The detector takes the RAG triple directly. Use the single-record call for application paths and batch calls for evaluation jobs.

Install

Add the package to the service that receives generated answers and retrieved context.

bash
pip install lettucedetect

Detect

Return unsupported spans so the application can review, regenerate, or block the answer.

python
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
)

# predict() takes the context as a list of passages.
spans = detector.predict(
    context=["The retention period under Article 6 is six years."],
    question="What is the retention period?",
    answer="The retention period is six years and applies to all customers.",
    output_format="spans",
)

# Each span is a dict with "start", "end", "confidence", and "text".
for span in spans:
    print("unsupported:", repr(span["text"]))

Batch

Run the same detector over saved RAG logs to measure how often answers contain unsupported spans.

python
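records = [
    {
        "context": "The contract renews every 12 months.",
        "question": "How often does the contract renew?",
        "answer": "The contract renews every 12 months.",
    },
    {
        "context": "The policy applies to EU customers.",
        "question": "Who does the policy apply to?",
        "answer": "The policy applies to EU and US customers.",
    },
]

# Check each saved record with the same predict() call used above.
results = [
    detector.predict(
        context=[record["context"]],
        question=record["question"],
        answer=record["answer"],
        output_format="spans",
    )
    for record in records
]

# An empty list means no unsupported spans were found for that answer.
for record, spans in zip(records, results):
    print(record["question"], [span["text"] for span in spans])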
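
The measurement described above, how often answers contain unsupported spans, can then be summarised with a few lines of plain Python. This sketch assumes each result has already been reduced to a list of unsupported spans (empty when the answer is fully supported); `unsupported_rate` and the sample data are hypothetical.

```python
def unsupported_rate(span_lists):
    """Summarise how many answers contain at least one unsupported span."""
    total = len(span_lists)
    flagged = sum(1 for spans in span_lists if spans)
    rate = flagged / total if total else 0.0
    return {"answers": total, "flagged": flagged, "rate": rate}


# Toy results for two logged answers: the first is clean, the second has one span.
span_lists = [
    [],
    [{"start": 22, "end": 41, "text": "EU and US customers"}],
]

summary = unsupported_rate(span_lists)
print(summary)
```

Tracking this rate over time on a fixed sample of logs gives a simple regression signal when the retriever, prompt, or generator changes.
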
cite

Paper reference

bibtex
@article{kovacs-recski-2025-lettucedetect,
  title = {LettuceDetect: a hallucination-detection framework for RAG applications},
  author = {Kovacs, Adam and Recski, Gabor},
  year = {2025},
  journal = {arXiv preprint arXiv:2502.17125},
  url = {https://arxiv.org/abs/2502.17125}
}

Compatibility and licensing

The adoption boundary is simple: code runs in Python, models run through the usual transformer stack, and the detector only needs the RAG triple.

code
github.com/KRLabsOrg/LettuceDetect, MIT licence.
models
English ModernBERT, Hungarian multilingual-encoder variants, and TinyLettuce models under the KR Labs Hugging Face organisation.
runtime
PyTorch, CPU or CUDA, with batch inference for offline evaluation and request-path checks for production systems.
integration
Works with any RAG pipeline that can provide context, question, and answer.

Combine with the rest of the stack

Each product can run on its own. Together, they turn an LLM answer into something a team can inspect, reject, or enforce.

LettuceDetect is the support-check layer: it asks whether the answer is actually grounded in the retrieved context. Pair it with an evidence extraction layer to make the answer easier to verify, and a rules layer to decide what should happen after the check. The result is not just a pass or fail, but a record of which text was supported, which text was not, and what action followed.

Check the answer before it leaves the system

Start with the evaluation details, or inspect the repository for the detector, model scripts, examples, and deployment path.