
Trustworthy and Responsible Retrieval-Augmented Systems


Session Information

  • Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate
  • FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG
  • SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs
  • Bribery-Resistant Ranking Systems: A Multipartite User-Agnostic Framework for AI Act Compliance
  • RAC: Retrieval-Augmented Clarification for Faithful Conversational Search
Mar 31, 2026, 14:30 - 16:00 (Europe/Amsterdam)
Venue: Chaos

Sub Sessions

Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

Full papers | Machine Learning and Large Language Models | 02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 14:00:00 UTC
Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces a methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, without requiring multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves token-level hallucination detection over state-of-the-art methods. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top 10 per token), confirming its practical efficiency and suitability for API-constrained deployments. This work provides a lightweight technique to enhance the trustworthiness of LLM responses, at the token level, after a single generation pass for QA and Retrieval-Augmented Generation (RAG) systems, as well as for a private finance framework analyzing responses to queries on annual company reports.
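The uncertainty signal the abstract describes can be pictured with a short sketch. This is not the authors' implementation: it only shows how a per-token entropy estimate can be computed from the truncated top-k log-probabilities that black-box APIs typically return (function names are hypothetical).

```python
import math

def token_entropy(top_logprobs):
    """Approximate per-token entropy from the top-k log-probabilities
    an API exposes; the truncated tail of the distribution is ignored."""
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_sequence_entropy(per_token_top_logprobs):
    """Average per-token entropy over one generated sequence: a simple
    single-pass uncertainty indicator in the spirit of the paper's
    entropy-production-rate baseline."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)
```

A run of high-entropy tokens flags spans where the model was uncertain, which the paper's learned detector then refines with supervised features.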
Presenters
  • Charles Moslonka (Senior Research Scientist, Artefact Research Center)
Co-Authors
  • Hicham Randrianarivo
  • Arthur Garnier (Ardian)
  • Emmanuel Malherbe (Artefact Research Center)

FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG

Full papers | Machine Learning and Large Language Models | 02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 14:00:00 UTC
Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.
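FACTUM's four mechanistic scores are not detailed in the abstract, but the "pathway alignment" idea it names can be sketched. The sketch below is purely illustrative: it computes a cosine alignment between two contribution vectors, standing in for the attention-pathway and FFN-pathway contributions the paper measures inside the model.

```python
import math

def pathway_alignment(attn_contrib, ffn_contrib):
    """Cosine similarity between an attention-pathway and an FFN-pathway
    contribution vector. High values mean the two pathways push the
    residual stream in similar directions; near-zero means they add
    largely orthogonal information."""
    dot = sum(a * f for a, f in zip(attn_contrib, ffn_contrib))
    norm_a = math.sqrt(sum(a * a for a in attn_contrib))
    norm_f = math.sqrt(sum(f * f for f in ffn_contrib))
    return dot / (norm_a * norm_f)
```

Under the paper's finding, correct citations in Llama-3.2-3B would tend toward high alignment, while in Llama-3.1-8B they would tend toward low (orthogonal) alignment.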
Presenters
  • Maxime Dassen (PhD Student, University Of Amsterdam)
Co-Authors
  • Rebecca Kotula (Department Of Defense, Washington DC)
  • Kenton Murray (Johns Hopkins University Human Language Technology Center Of Excellence (HLTCOE))
  • Andrew Yates (Johns Hopkins University, HLTCOE)
  • Dawn Lawrie (Senior Research Scientist, HLTCOE at Johns Hopkins University)
  • Efsun Kayi (The Johns Hopkins University Applied Physics Laboratory)
  • James Mayfield (Johns Hopkins University)
  • Kevin Duh (Johns Hopkins University Human Language Technology Center Of Excellence (HLTCOE))

SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

Full papers | Applications | Search and ranking | 02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 14:00:00 UTC
With the rapid proliferation of online sports journalism, extracting meaningful pre-game and post-game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two-step validation pipeline leveraging both open-source and proprietary large language models (LLMs). We then utilize multiple state-of-the-art LLMs (GPT-4o, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore-based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT-4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests. Our results demonstrate the effectiveness of this approach in generating high-quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content.
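The ranking step can be pictured with a toy sketch. This is not the SUMMIR architecture (which is a learned model); it only shows the shape of the task, combining several per-sentence metric scores into one importance ordering, with metric names and weights invented for illustration.

```python
def rank_insights(insights, weights):
    """Order insight sentences by a weighted sum of their metric scores.
    `insights` is a list of {"text": ..., "scores": {metric: value}};
    `weights` maps metric names to their importance for a given user."""
    def combined(item):
        return sum(w * item["scores"][metric] for metric, w in weights.items())
    return sorted(insights, key=combined, reverse=True)
```

A user who values factual consistency over novelty would simply supply a larger weight for the factuality metric.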
Presenters
  • Nitish Kumar (Doctoral Candidate, Indian Institute Of Technology Patna)
Co-Authors
  • Manish Gupta (Microsoft India)
  • Sriparna Saha (Associate Professor, Indian Institute Of Technology Patna)

Bribery-Resistant Ranking Systems: A Multipartite User-Agnostic Framework for AI Act Compliance

Full papers | Search and ranking | Societally-motivated IR research | 02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 14:00:00 UTC
Modern ranking systems must comply with emerging AI regulations while resisting manipulation attacks. The EU AI Act's prohibition on individual user scoring creates a critical gap: reputation-based systems often violate compliance requirements, user-agnostic approaches lack resistance to bribery, and both remain vulnerable to demographic bias. We propose a user-agnostic multipartite ranking framework addressing regulatory compliance, security, personalization, and bias. Our approach clusters users based on rating patterns and applies localized statistical filtering within clusters to remove anomalous ratings, thereby eliminating individual profiling while preserving personalization and enhancing manipulation resistance. Evaluation across three datasets shows substantial bribery resistance improvements, with profitable attacks in only 7 of 18 scenarios versus 8--11 for state-of-the-art baselines. The framework also reduces demographic bias values by a factor of 100 compared to a user-agnostic bipartite approach. These results are achieved with a design that avoids the individual user scoring prohibited by the EU AI Act. Robustness analysis reveals enhanced spam resistance on two datasets, with computational overhead as the primary trade-off.
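The core mechanism, cluster-local filtering of anomalous ratings instead of per-user reputation, can be sketched as follows. This is a toy stand-in, not the paper's method: it uses a simple z-score cutoff (threshold chosen arbitrarily) and assumes user clusters are already given.

```python
import statistics

def filter_cluster_ratings(ratings, z_max=2.0):
    """Drop ratings farther than z_max standard deviations from their
    cluster's mean. Filtering happens per cluster, so no individual
    user is ever scored, only cluster-level statistics."""
    if len(ratings) < 2:
        return list(ratings)
    mu = statistics.mean(ratings)
    sigma = statistics.pstdev(ratings)
    if sigma == 0:
        return list(ratings)
    return [r for r in ratings if abs(r - mu) / sigma <= z_max]

def item_score(clusters):
    """Aggregate an item's score from the filtered per-cluster means."""
    cluster_means = [statistics.mean(filter_cluster_ratings(c)) for c in clusters]
    return statistics.mean(cluster_means)
```

A bribed rating that sits far from its cluster's consensus is discarded before aggregation, which illustrates why bribery becomes less profitable under such filtering.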
Presenters
  • Martim Baltazar
Co-Authors
  • Ludovico Boratto (Associate Professor Of Computer Science, University Of Cagliari)
  • Mirko Marras (Tenure-Track Assistant Professor, University Of Cagliari)
  • Guilherme Ramos (SQIG - Instituto de Telecomunicações)

RAC: Retrieval-Augmented Clarification for Faithful Conversational Search

Full papers | Conversational search and recommender systems | Machine Learning and Large Language Models | 02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 14:00:00 UTC
Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based questions. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrates significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.
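The corpus-grounding idea can be illustrated with a deliberately simple proxy. The paper uses NLI- and data-to-text-based metrics; the sketch below substitutes a crude lexical-overlap score (stop-word list and all names are illustrative) just to show what "anchored in the retrieved context" means operationally.

```python
def grounding_score(question, passages):
    """Fraction of the clarification question's content words that occur
    in the retrieved passages. A real system would use an NLI model;
    this lexical proxy only illustrates the grounding check."""
    stopwords = {"the", "a", "an", "is", "of", "to", "do", "you", "in", "or"}
    q_words = {w.lower().strip("?.,!") for w in question.split()} - stopwords
    p_words = {w.lower().strip("?.,!") for w in " ".join(passages).split()}
    if not q_words:
        return 0.0
    return len(q_words & p_words) / len(q_words)
```

A clarification question scoring near zero asks about things the corpus cannot answer, exactly the failure mode RAC's preference optimization is designed to penalize.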
Presenters
  • Ahmed Rayane Kebir (PhD Student, Toulouse University)
Co-Authors
  • Vincent Guigue (AgroParisTech, UMR MIA-PS)
  • Lynda Said Lhadj
  • Laure Soulier

Session Participants

Session speakers, moderators & attendees:
  • Senior Research Scientist, Artefact Research Center
  • PhD Student, University Of Amsterdam
  • Doctoral Candidate, Indian Institute Of Technology Patna
  • PhD Student, Toulouse University
  • Professor, University Of Regensburg
