
VerbatimRAG

Retrieval-augmented generation where the answer is assembled from source text the system can point back to, with the aim of zero unsupported factual claims.

Live now: Ask the ACL Anthology with cited source spans

What it does

VerbatimRAG replaces free-form factual generation with source-span extraction, so the target is zero hallucinated factual claims in the final answer.

Most RAG systems retrieve documents, then let a language model write a free-form answer that draws on them. That final step is where unsupported claims enter: the model can paraphrase too far, merge sources, or invent a fact that was never in the retrieved context.

VerbatimRAG moves the factual work into extraction. A query-conditioned span extractor identifies source passages that answer the question. The answer is then assembled from those passages, so the reader can inspect the exact text behind each cited claim.

The key property is not that the answer sounds plausible. It is that every factual claim is either replayable against source text, source id, and character offsets, or left out.
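As a rough illustration of what "replayable" means here, the sketch below checks a citation by slicing the source at the cited offsets and comparing the result with the quoted text. The `Citation` record, `replays`, and `keep_replayable` are hypothetical names invented for this example and are not part of the VerbatimRAG package; offsets are assumed to be zero-based with an exclusive end.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record (not the package's own type)."""
    source_id: str
    start: int  # inclusive character offset into the source (assumed)
    end: int    # exclusive character offset into the source (assumed)
    text: str   # the span text the answer quotes


def replays(citation: Citation, sources: dict[str, str]) -> bool:
    """Return True only if the quoted text sits at the cited offsets."""
    source = sources.get(citation.source_id)
    if source is None:
        return False  # unknown source id: nothing to replay against
    if not (0 <= citation.start < citation.end <= len(source)):
        return False  # offsets fall outside the source text
    return source[citation.start:citation.end] == citation.text


def keep_replayable(citations: list[Citation], sources: dict[str, str]) -> list[Citation]:
    """Drop every citation that cannot be replayed against its source."""
    return [c for c in citations if replays(c, sources)]


sources = {"doc_1": "The study found that X leads to Y. Results were stable."}
quote = "The study found that X leads to Y."
start = sources["doc_1"].find(quote)

good = Citation("doc_1", start, start + len(quote), quote)
# Same offsets, but the wording has drifted away from the source.
bad = Citation("doc_1", start, start + len(quote), "The study found that X causes Y.")

kept = keep_replayable([good, bad], sources)
print([c.text for c in kept])
```

Only the first citation survives, because the second one no longer matches the characters at its offsets.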

model: 150M ModernBERT extractor, small enough to run as a dedicated evidence model.
context: 8,192-token context window for long retrieved passages.
dataset: 195k training rows in the public verbatim-spans dataset.
licence: MIT code licence for the VerbatimRAG package.

How it works

The pipeline is designed around one constraint: if the source text does not contain the fact, the answer should not present it.

evidence path

VerbatimRAG moves the factual decision before generation. The extractor selects source spans; the answer can only cite what passed that boundary.

retrieved context

Source text with offsets

doc_14 · chars 1482 to 1533: The extractor runs on ModernBERT and supports an 8,192-token context window for long-form context.

doc_27 · chars 904 to 957: Each predicted span returns character offsets relative to the source for replay against the original document.

evidence boundary

ModernBERT span extractor

input: question + retrieved_context[]
decision: token labels for supported answer spans
failure mode: no selected span means no grounded answer
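One way to picture the step from token labels to answer spans is the decoding sketch below. It assumes the extractor yields one binary label per token together with each token's character offsets, and it merges consecutive positive tokens into a single span. The function name, the 0/1 label scheme, and the whitespace tokenisation are illustrative assumptions, not the model's actual decoding code.

```python
import re


def labels_to_spans(offsets, labels, text):
    """Merge runs of positively labelled tokens into character spans.

    offsets: list of (start, end) character offsets, one pair per token
    labels:  list of 0/1 labels, where 1 marks a supporting token (assumed scheme)
    Returns a list of (start, end, span_text) tuples; empty if nothing was selected.
    """
    spans = []
    current = None  # [start, end] of the span currently being built
    for (start, end), label in zip(offsets, labels):
        if label == 1:
            if current is None:
                current = [start, end]  # open a new span
            else:
                current[1] = end        # extend the open span
        elif current is not None:
            spans.append(tuple(current))  # a negative token closes the span
            current = None
    if current is not None:
        spans.append(tuple(current))
    return [(s, e, text[s:e]) for s, e in spans]


text = "X leads to Y. Z is unrelated."
# Whitespace tokens stand in for the model's subword tokens in this sketch.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
labels = [1, 1, 1, 1, 0, 0, 0]

print(labels_to_spans(offsets, labels, text))
```

An all-zero label sequence produces an empty list, which corresponds to the "no grounded answer" failure mode above.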

generated presentation

Cited answer

The extractor uses a ModernBERT backbone that supports an 8,192-token context window ^[1]. Each cited span carries character offsets relative to the source ^[2], so the claim can be checked against the original document.

answer.text: Readable response assembled from selected spans.
citations[]: Marker, source id, span text, character offsets.
abstain: Returned when the extractor finds no supporting span.

VerbatimRAG treats answer generation as evidence selection before language generation. Retrieval supplies candidate context, the extractor selects the source spans that can support an answer, and generation is limited to arranging those spans into readable prose. That is the basis for the zero-unsupported-claims target. The output includes citations, source ids, and character offsets, and if no supporting span is found, the system should abstain rather than ask the LLM to fill the gap.
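The assemble-or-abstain behaviour described above can be sketched as follows. The field names mirror the output fields listed on this page (`text`, `citations`, `abstain`), but the `Span` and `Answer` classes and the `assemble` function are hypothetical and do not reproduce the package's response model. The sketch simply joins the selected spans with numbered markers, where the real pipeline arranges them into readable prose.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A selected evidence span (hypothetical shape)."""
    source_id: str
    start: int
    end: int
    text: str


@dataclass
class Answer:
    """Hypothetical output record: answer text, citations, abstain flag."""
    text: str
    citations: list = field(default_factory=list)
    abstain: bool = False


def assemble(spans):
    """Build a cited answer from selected spans, or abstain if there are none."""
    if not spans:
        # No supporting span: abstain instead of generating unsupported text.
        return Answer(text="", abstain=True)
    parts, citations = [], []
    for marker, span in enumerate(spans, start=1):
        parts.append(f"{span.text} [{marker}]")
        citations.append({
            "marker": marker,
            "source_id": span.source_id,
            "text": span.text,
            "start": span.start,
            "end": span.end,
        })
    return Answer(text=" ".join(parts), citations=citations)


answer = assemble([Span("doc_a", 0, 34, "The study found that X leads to Y.")])
print(answer.text)
print(assemble([]).abstain)
```

The point of the design is that the abstain branch is taken before any text is produced, so there is no later step that could fill the gap with unsupported content.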

Results

The public evidence is split across two evaluations: the BioNLP paper tests the full verbatim pipeline in clinical QA, while the Hugging Face model card shows the generic extractor leading every reported Word-F1 slice.

Clinical QA: 42.01%. Full VerbatimRAG pipeline score on ArchEHR-QA 2025, with top-10 in core metrics.
ACL evidence selection: 0.463 Word-F1. Best Word-F1 on ACL gold.
RAG domains: 0.618 Word-F1. Best Word-F1 across multi-domain QA, including finance, medical, legal, and product-manual sources.
Generalization: 0.588 / 0.513 Word-F1. Best Word-F1 on Squeez tool-output pruning and QASPER scientific QA slices.

The clinical-QA result evaluates the full VerbatimRAG pipeline. The Word-F1 scores evaluate the public span extractor: whether it selects the same evidence text as the reference annotations. The strongest claim we can make from the public model card is specific: best Word-F1 on ACL gold, RAGBench, Squeez, and QASPER, not a blanket claim over all RAG systems. For setup, baselines, and slice-level details, read the BioNLP paper and the Hugging Face model card.
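For readers unfamiliar with the metric, a word-level F1 between predicted and reference evidence text can be sketched as below. This is the common token-overlap formulation with lowercased whitespace tokens, used here as an assumption; the tokenisation and aggregation behind the reported scores may differ, so treat the model card as the authority.

```python
from collections import Counter


def word_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between predicted and reference evidence text."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred and not ref:
        return 1.0  # both empty: treated as agreement in this sketch
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# The prediction covers half of the reference words and adds nothing extra:
# precision 1.0, recall 0.5.
score = word_f1("X leads to Y", "The study found that X leads to Y")
print(round(score, 3))
```

A selection that is too narrow loses recall and one that is too broad loses precision, which is why the metric suits span selection better than exact match.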

Built with VerbatimRAG

VerbatimRAG is not only a package. We use it as the evidence layer for live corpus products, agent tooling, and hosted API workflows.

These surfaces show the same architecture at different levels: a user-facing research tool, a reusable API, an MCP server for agent contexts, and a Claude Code skill for working directly inside development workflows.

live corpus: ACL-Verbatim. Trustworthy QA over the ACL Anthology. It indexes a real research corpus, answers questions, and returns cited source text.
hosted API: Verbatim API. Hosted query and transform endpoints for teams that want to test source-span extraction before running a full retrieval stack.
Claude Code: verbatim-acl-skill. Claude Code workflow for searching ACL papers and transforming source context into cited answers.
MCP: verbatim-mcp. MCP server for querying academic papers, retrieving metadata, and exporting citations inside agent environments.

For developers

Use the lightweight transform when you already have context, or the full package when you want retrieval, extraction, and cited answers in one stack.

inspect

Inspect the moving parts

The package, model, dataset, and paper are public so teams can inspect the implementation before adopting the hosted workflow.

GitHub: Core VerbatimRAG package, examples, docs, web interface, and tests.
PyPI: verbatim-rag. Full RAG package for retrieval, span extraction, and cited answers.
PyPI: verbatim-core. Lightweight transform package with a small dependency surface.
Model: verbatim-rag-modern-bert-v2. 150M ModernBERT token classifier for query-conditioned evidence spans.
Dataset: verbatim-spans. Multi-domain evidence-selection data with ACL, RAGBench, Squeez, and related sources.
Paper: ArchEHR-QA 2025. The BioNLP shared-task paper describing the original verbatim pipeline.

run

Start with the package boundary

Use verbatim-core when your application already has retrieved context. Use verbatim-rag when you also want indexing, retrieval, extraction, and cited answer assembly.

Install

Add the model extra when the lightweight transform should use a local ModernBERT extractor.

bash

pip install "verbatim-core[model]"
pip install verbatim-rag

Transform provided context

Use verbatim-core when your application already has question and context pairs. Pass your own extractor when span selection should run through a local model.

python

from verbatim_core import VerbatimTransform
from verbatim_core.extractors import ModelSpanExtractor

extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v2",
    threshold=0.2,
    device=None,
)

transform = VerbatimTransform(extractor=extractor)

response = transform.transform(
    question="What is the main finding?",
    context=[
        {
            "content": "The study found that X leads to Y.",
            "title": "Paper A",
        },
        {
            "content": "Results show Z is significant.",
            "title": "Paper B",
        }
    ],
)

print(response.answer)

for document in response.documents:
    for highlight in document.highlights:
        print(document.title, highlight.start, highlight.end, highlight.text)

Retrieve and answer

Use verbatim-rag when the system should retrieve candidate passages before extraction and answer assembly.

python

from verbatim_rag import VerbatimIndex, VerbatimRAG
from verbatim_rag.ingestion import DocumentProcessor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

processor = DocumentProcessor()
document = processor.process_url(
    url="https://aclanthology.org/2025.bionlp-share.8.pdf",
    title="KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering",
    metadata={"authors": ["Adam Kovacs", "Paul Schmitt", "Gabor Recski"]},
)

sparse = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cpu",
)

store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)

index = VerbatimIndex(
    vector_store=store,
    sparse_provider=sparse,
)
index.add_documents([document])

rag = VerbatimRAG(index)
response = rag.query("What is the main contribution of the paper?")
print(response.answer)

cite

Paper reference

bibtex

@inproceedings{kovacs-etal-2025-kr,
  title = {{KR} Labs at {A}rch{EHR}-{QA} 2025: A Verbatim Approach for Evidence-Based Question Answering},
  author = {Kovacs, Adam and Schmitt, Paul and Recski, Gabor},
  booktitle = {Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)},
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  pages = {69--74},
  url = {https://aclanthology.org/2025.bionlp-share.8/}
}

Compatibility and licensing

The public stack separates code, model, and dataset artefacts so technical and legal teams can review the adoption boundary.

code: verbatim-rag and verbatim-core, MIT licence.
model: KRLabsOrg/verbatim-rag-modern-bert-v2, Apache-2.0. ModernBERT token classification with an 8,192-token context window.
dataset: KRLabsOrg/verbatim-spans, Apache-2.0. Multi-domain evidence-selection data.
runtime: Python package, hosted API, self-hosted service, local experiments, MCP server, and Claude Code workflow.

Combine with the rest of the stack

Each product can run on its own. Together, they turn an LLM answer into something a team can inspect, reject, or enforce.

VerbatimRAG is the evidence extraction layer in the pipeline: retrieval supplies candidate context, then VerbatimRAG selects the source spans an answer is allowed to rely on. Pair it with a detector to check whether the answer is supported by the retrieved context, and a rules layer to apply the domain constraints that should govern the output. The result is not just an answer, but a record of what was retrieved, what was checked, and which rules fired.

Trace the answer back to the source

Start with the evaluation details, or inspect the repository for the extractor, API, examples, and deployment surface.
