Because even AI needs a reality check! 🥬
We present LettuceDetect, a lightweight hallucination detector for Retrieval-Augmented Generation (RAG) pipelines. It is an encoder-based model built on ModernBERT, released under the MIT license with ready-to-use Python packages and pretrained models.
LettuceDetect keeps your RAG framework fresh by spotting the rotten parts of your LLM's outputs. 🥬
Install the package:
pip install lettucedetect
Then, you can use the package as follows:
from lettucedetect.models.inference import HallucinationDetector
# For a transformer-based approach:
detector = HallucinationDetector(
method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)
contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."
# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)
# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
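To build on the quick-start above, here is a small sketch of how the span output could gate answers in a RAG pipeline. The helper function and the 0.9 confidence cutoff are our own illustration, not part of the lettucedetect API; only detector.predict is taken from the example above.
# Hypothetical helper (not part of lettucedetect): reject an answer if any
# predicted hallucination span exceeds a chosen confidence cutoff.
def flag_hallucinations(detector, contexts, question, answer, cutoff=0.9):
    spans = detector.predict(
        context=contexts, question=question, answer=answer, output_format="spans"
    )
    flagged = [s for s in spans if s["confidence"] >= cutoff]
    return len(flagged) == 0, flagged

ok, flagged = flag_hallucinations(detector, contexts, question, answer)
if not ok:
    print("Potentially hallucinated:", [s["text"] for s in flagged])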
Large Language Models (LLMs) such as GPT-4 [4], the Llama-3 models [5], and Mistral [6] (among many others) have driven considerable advances in NLP tasks. Despite this success, hallucinations remain a key obstacle to deploying LLMs in high-stakes scenarios such as healthcare or law [7,8].
Retrieval-Augmented Generation (RAG) attempts to mitigate hallucinations by grounding an LLM's responses in retrieved documents, providing external knowledge that the model can reference [9]. But even though RAG is a powerful way to reduce hallucinations, LLMs still hallucinate in these settings [1]. Hallucinations are information in the output that is nonsensical, factually incorrect, or inconsistent with the retrieved context [8]. Ji et al. [10] categorize hallucinations into intrinsic hallucinations, where the output directly contradicts the provided source, and extrinsic hallucinations, where the output cannot be verified against the source.
While RAG approaches can mitigate intrinsic hallucinations, they are not immune to extrinsic ones. Sun et al. [11] showed that models tend to prioritize their intrinsic knowledge over the external context. Since LLMs remain prone to hallucinations, their use in critical domains such as medicine or law can still be flawed.
Current solutions for hallucination detection fall into a few categories based on the approach they take: prompt-based detectors that use large LLMs as judges (e.g., GPT-4 prompting, LM vs LM cross-examination [13], or Chainpoll [14]); fine-tuned LLM detectors, such as the Llama-based models of RAGTruth [1] and RAG-HAT [15]; and encoder-based detectors, such as Luna, which train smaller BERT-style classifiers.
ModernBERT [3] is a drop-in replacement for BERT and a state-of-the-art encoder-only transformer architecture. It incorporates several modern design improvements over the original BERT model: Rotary Positional Embeddings (RoPE) to handle sequences of up to 8,192 tokens, an unpadding optimization that eliminates wasted computation on padding tokens, GeGLU activation layers for enhanced expressiveness, and alternating attention for more efficient attention computation.
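As a quick way to inspect these properties yourself, the sketch below loads a ModernBERT checkpoint through Hugging Face transformers and prints its context length. It assumes the public answerdotai/ModernBERT-base checkpoint and a transformers version recent enough to include ModernBERT support; it is not part of LettuceDetect.
# Minimal sketch: load ModernBERT and check its long-context configuration.
from transformers import AutoTokenizer, AutoModel

model_name = "answerdotai/ModernBERT-base"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(model.config.max_position_embeddings)  # expected: 8192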
LettuceDetect capitalizes on ModernBERT's extended context window to build a token-level classifier for hallucination detection. This approach sidesteps many limitations of older BERT-based models (e.g., short context bounds) and avoids the inference overhead of large LLM-based detectors. Our experiments show that LettuceDetect outperforms other encoder-based systems while remaining competitive with fine-tuned LLM detectors at a fraction of their computational cost.
RAGTruth is the first large-scale benchmark specifically designed to evaluate hallucination detection in Retrieval-Augmented Generation (RAG) settings. It contains 18,000 annotated examples spanning multiple tasks: question answering, data-to-text generation, and news summarization.
Each example in RAGTruth includes a question or prompt, one or more reference passages, an LLM-generated response, and human annotations that mark the hallucinated spans along with a hallucination type and a written rationale.
Below is an illustrative sample from RAGTruth, showing an instance of hallucination:
QUESTION
How to explain quartiles?
REFERENCE
Passage 2:
Second quartile (Q2) which is more commonly known as median splits the data in half (50%).
Median divides the data into a lower half and an upper half.
Third quartile (Q3), also known as upper quartile, splits lowest 75% (or highest 25%) of data.
(…)
RESPONSE
The first quartile (Q1) splits the lowest 25% of the data, while the second quartile (Q2) splits the data into two equal halves, with the median being the middle value of the lower half. Finally, the third quartile (Q3) splits the highest 75% of the data.
(…)
ANNOTATION
This example showcases a RAGTruth annotation: specific spans in the answer conflict with the reference passages, and the annotation provides both the type of hallucination (e.g., Evident Conflict) and a human-readable rationale. Our method uses only the binary annotations (hallucinated or not) and ignores the hallucination-type labels.
A high-level depiction of LettuceDetect. Here, an example Question, Context, and Answer triplet is processed. First, the text is tokenized, after which LettuceDetect performs token-level classification. Tokens from both the question and context are masked (indicated by the red line in the figure) to exclude them from the loss function. Each token in the answer receives a probability indicating whether it is hallucinated or supported. For span-level detection, we merge consecutive tokens with hallucination probabilities above 0.5 into a single predicted span.
We train ModernBERT-base and ModernBERT-large variants as token-classification models on the RAGTruth dataset. The input to the model is a concatenation of the Context, Question, and Answer segments, marked with the special tokens [CLS] (preceding the context) and [SEP] (separating the segments). We limit the sequence length to 4,096 tokens for computational feasibility, though ModernBERT can theoretically handle up to 8,192 tokens.
We insert [CLS] and [SEP] appropriately. Tokens belonging to the question and the context are assigned the ignore label (-100 in PyTorch) so that they do not contribute to the loss. Our models build on Hugging Face's AutoModelForTokenClassification, using ModernBERT as the encoder and a classification head on top. Unlike some previous encoder-based approaches (e.g., ones pre-trained on NLI tasks), our method uses only ModernBERT with no additional pretraining stage.
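To make this concrete, below is a hedged sketch of how the inputs and labels described above could be assembled for a single example. It is our own illustration rather than the released training code: the exact preprocessing may differ, and the answer-token labels (0 = supported, 1 = hallucinated) would come from the RAGTruth span annotations instead of the zeros used here.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "answerdotai/ModernBERT-base"  # assumed backbone; -large for the large variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

# One training example, reusing the strings from the quick-start above.
context = "France is a country in Europe. The capital of France is Paris. The population of France is 67 million."
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Build [CLS] context + question [SEP] answer [SEP] explicitly.
prompt_ids = tokenizer(context + "\n" + question, add_special_tokens=False)["input_ids"]
answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
input_ids = (
    [tokenizer.cls_token_id] + prompt_ids + [tokenizer.sep_token_id]
    + answer_ids + [tokenizer.sep_token_id]
)

# Only answer tokens receive real labels; [CLS], [SEP], context, and question
# tokens get -100 so PyTorch's cross-entropy ignores them.
labels = [-100] * (len(prompt_ids) + 2) + [0] * len(answer_ids) + [-100]

# Truncate to the 4,096-token training limit.
input_ids, labels = input_ids[:4096], labels[:4096]

batch = {
    "input_ids": torch.tensor([input_ids]),
    "attention_mask": torch.ones(1, len(input_ids), dtype=torch.long),
    "labels": torch.tensor([labels]),
}
loss = model(**batch).loss  # standard token-classification loss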
During training, we monitor token-level F1 scores on a validation split, saving checkpoints using the safetensors format. Once training is complete, we upload the best-performing models to Hugging Face for public access.
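The monitoring and checkpointing step could look roughly like the sketch below (our own illustration, reusing model and tokenizer from the previous sketch; the dataloader is assumed to yield batches shaped like the one built there). save_pretrained with safe_serialization=True writes the weights in the safetensors format.
import torch
from sklearn.metrics import f1_score

@torch.no_grad()
def validation_token_f1(model, dataloader, device="cuda"):
    # Token-level F1 on the validation split, scored over answer tokens only.
    model.to(device)
    model.eval()
    preds, golds = [], []
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        ).logits
        keep = batch["labels"] != -100  # ignore masked (context/question) positions
        preds.extend(logits.argmax(dim=-1)[keep].tolist())
        golds.extend(batch["labels"][keep].tolist())
    return f1_score(golds, preds)  # F1 for the hallucinated class (label 1)

# If the validation F1 improves, save a safetensors checkpoint.
model.save_pretrained("checkpoints/best", safe_serialization=True)
tokenizer.save_pretrained("checkpoints/best")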
At inference time, the model outputs a probability of hallucination for each token in the answer. We aggregate consecutive tokens exceeding a 0.5 threshold to produce span-level predictions, indicating exactly which segments of the answer are likely to be hallucinated. The figure above illustrates this workflow.
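The aggregation itself is simple; the sketch below is our own illustration of merging consecutive above-threshold tokens into character-level spans (the released implementation may aggregate the span confidence differently; here we take the maximum token probability).
# token_offsets: (char_start, char_end) for each answer token;
# token_probs: hallucination probability for each answer token.
def merge_spans(token_offsets, token_probs, threshold=0.5):
    spans, current = [], None
    for (start, end), prob in zip(token_offsets, token_probs):
        if prob > threshold:
            if current is None:
                current = {"start": start, "end": end, "confidence": prob}
            else:  # extend the running span over consecutive flagged tokens
                current["end"] = end
                current["confidence"] = max(current["confidence"], prob)
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans

print(merge_spans([(0, 5), (5, 10), (11, 20)], [0.2, 0.9, 0.95]))
# [{'start': 5, 'end': 20, 'confidence': 0.95}]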
Next, we provide a more detailed evaluation of the modelâs performance.
We evaluate our models on the RAGTruth test set across all task types (Question Answering, Data-to-Text, and Summarization). For each example, RAGTruth includes manually annotated spans indicating hallucinated content.
We first assess the example-level question: does the generated answer contain any hallucination at all? Our large model (lettucedetect-large-v1) attains an overall F1 score of 79.22%, surpassing prompt-based baselines such as GPT-4, encoder-based baselines such as Luna, and the fine-tuned Llama-2-13B detector from the RAGTruth paper [1].
It is second only to the fine-tuned Llama-3-8B from the RAG-HAT paper [15] (83.9%), but LettuceDetect is significantly smaller and faster to run. Meanwhile, our base model (lettucedetect-base-v1) remains highly competitive while using fewer parameters.
Above is a comparison table illustrating how LettuceDetect compares against both prompt-based methods (e.g., GPT-4) and alternative encoder-based solutions (e.g., Luna). Overall, lettucedetect-large-v1 and lettucedetect-base-v1 deliver strong detection performance while remaining efficient at inference time.
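For clarity, the example-level decision can be derived from the span predictions as in the sketch below (our own illustration; the RAGTruth evaluation protocol is the authoritative definition): an answer counts as hallucinated if at least one span is predicted, and this binary label is scored against the gold annotations.
from sklearn.metrics import precision_recall_fscore_support

def example_level_scores(predicted_spans_per_example, gold_spans_per_example):
    # An example is "hallucinated" if it has at least one (predicted / gold) span.
    y_pred = [int(len(spans) > 0) for spans in predicted_spans_per_example]
    y_true = [int(len(spans) > 0) for spans in gold_spans_per_example]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    return precision, recall, f1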
Beyond detecting whether an answer contains hallucinations, we also examine LettuceDetect's ability to identify the exact spans of unsupported content. Here, LettuceDetect achieves state-of-the-art results among models that have reported span-level performance, substantially outperforming the fine-tuned Llama-2-13B model from the RAGTruth paper [1] and other baselines.
Most methods, like RAG-HAT [15], do not report span-level metrics, so we do not compare to them here.
Both lettucedetect-base-v1 and lettucedetect-large-v1 require fewer parameters than typical LLM-based detectors (e.g., GPT-4 or Llama-3-8B) and can process 30–60 examples per second on a single NVIDIA A100 GPU. This makes them practical for industrial workloads, real-time user-facing systems, and resource-constrained environments.
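To get a rough throughput number on your own hardware, a timing loop like the following can be used (our own sketch, reusing the detector and the example from the quick-start; batching, sequence lengths, and the GPU all affect the result).
import time

examples = [(contexts, question, answer)] * 200  # repeat the quick-start example
start = time.perf_counter()
for ctx, q, ans in examples:
    detector.predict(context=ctx, question=q, answer=ans, output_format="spans")
elapsed = time.perf_counter() - start
print(f"{len(examples) / elapsed:.1f} examples/second")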
Overall, these results show that LettuceDetect strikes a good balance: it achieves near state-of-the-art accuracy at a fraction of the size and cost of large LLM-based judges, while offering precise, token-level hallucination detection.
We introduced LettuceDetect, a lightweight and efficient framework for hallucination detection in RAG systems. By utilizing ModernBERT's extended context capabilities, our models achieve strong performance on the RAGTruth benchmark while retaining high inference efficiency. This work lays the groundwork for future research directions, such as expanding to additional datasets, supporting multiple languages, and exploring more advanced architectures. Even at this stage, LettuceDetect demonstrates that effective hallucination detection can be achieved using lean, purpose-built encoder-based models.
If you find this work useful, please cite it as follows:
@misc{Kovacs:2025,
title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
author={Ádám Kovács and Gábor Recski},
year={2025},
eprint={2502.17125},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.17125},
}
Also, if you use our code, please don't forget to give us a star ⭐ on our GitHub repository here.
[1] Niu et al., 2024, RAGTruth: A Dataset for Hallucination Detection in Retrieval-Augmented Generation
[3] Warner et al., 2024, Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (ModernBERT)
[4] OpenAI, 2023, GPT-4 Technical Report
[5] Meta AI (Llama Team), 2024, The Llama 3 Herd of Models
[6] Jiang et al., 2023, Mistral 7B
[7] Kaddour et al., 2023, Challenges and Applications of Large Language Models
[9] Gao et al., 2024, Retrieval-Augmented Generation for Large Language Models: A Survey
[10] Ji et al., 2023, Survey of Hallucination in Natural Language Generation
[13] Cohen et al., 2023, LM vs LM: Detecting Factual Errors via Cross Examination
[14] Friel et al., 2023, Chainpoll: A high efficacy method for LLM hallucination detection