Trustworthy and Responsible Retrieval-Augmented Systems
ECIR 2026 · March 31, 2026, 02:30 PM - 04:00 PM (Europe/Amsterdam)
Contact: n.fontein@tudelft.nl
Learned Hallucination Detection in Black-Box LLMs using
Token-level Entropy Production Rate
Full paper · Machine Learning and Large Language Models · 02:30 PM - 04:00 PM (Europe/Amsterdam)
Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces a methodology for robust, one-shot hallucination detection, designed specifically for scenarios with limited data access, such as black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) that offers solid baseline performance, which we then augment with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no repeated queries. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves token-level hallucination detection over state-of-the-art methods. Crucially, high performance is achieved using only the typically small set of available log-probabilities (e.g., the top 10 per token), confirming its practical efficiency and suitability for API-constrained deployments. This lightweight technique enhances the trustworthiness of LLM responses at the token level after a single generation pass, for QA and Retrieval-Augmented Generation (RAG) systems as well as for a private finance framework that analyzes responses to queries over annual company reports.
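The single-pass uncertainty signal the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact estimator: it computes a truncated Shannon entropy per token from the top-k log-probabilities a black-box API returns, and averages them over the sequence as a stand-in for the Entropy Production Rate; the function names and the decision to ignore the tail mass beyond the top k are assumptions of this sketch.

```python
import math

def token_entropy(top_logprobs):
    """Approximate Shannon entropy of the next-token distribution from
    the top-k log-probabilities an API exposes. Probability mass beyond
    the top k is ignored (the truncation the black-box setting imposes)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_production_rate(per_token_top_logprobs):
    """Mean per-token entropy over one generated sequence: a one-shot
    uncertainty indicator, no re-runs of the query required."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)
```

In use, tokens whose entropy sits far above the sequence mean would be the hallucination candidates; the paper's learned model replaces such a fixed rule with supervised features over these same entropic contributions.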
FACTUM: Mechanistic Detection of Citation Hallucination in
Long-Form RAG
Full paper · Machine Learning and Large Language Models
Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.
Kevin Duh, Johns Hopkins University, Human Language Technology Center of Excellence (HLTCOE)
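One of FACTUM's four mechanistic scores is the alignment between a model's attention and FFN pathways. One plausible reading of such an alignment score, offered purely as an illustration and not as the paper's definition, is the cosine similarity between the two pathways' contribution vectors to a token's residual stream: high values mean the pathways push in the same direction, low or negative values mean they contribute distinct, near-orthogonal information (the signature the abstract reports for Llama-3.1-8B).

```python
import numpy as np

def pathway_alignment(attn_contrib, ffn_contrib):
    """Cosine similarity between the attention-path and FFN-path
    contribution vectors at a citation token. An illustrative stand-in
    for FACTUM's alignment score, not its published formula."""
    a = np.asarray(attn_contrib, dtype=float)
    f = np.asarray(ffn_contrib, dtype=float)
    return float(a @ f / (np.linalg.norm(a) * np.linalg.norm(f)))
```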
SUMMIR: A Hallucination-Aware Framework for Ranking Sports
Insights from LLMs
Full paper · Applications · Search and ranking
With the rapid proliferation of online sports journalism, extracting meaningful pre-game and post-game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two-step validation pipeline leveraging both open-source and proprietary large language models (LLMs). We then use multiple state-of-the-art LLMs (GPT-4o, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore-based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT-4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests. Our results demonstrate the effectiveness of this approach in generating high-quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content.
Sriparna Saha, Associate Professor, Indian Institute of Technology Patna
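The pipeline above scores each candidate insight on factuality (FactScore-style), consistency (SummaC-style), and user interest, then ranks. A toy version of that final ranking step is sketched below; the field names and the hand-set weights are assumptions of this sketch, whereas SUMMIR itself is a learned multimetric model rather than a fixed weighted sum.

```python
def rank_insights(insights, w_fact=0.5, w_consist=0.3, w_interest=0.2):
    """Rank candidate insights by a weighted combination of a
    factuality score, a consistency score, and a user-interest score.
    Each insight is a dict with 'factscore', 'summac', and 'interest'
    fields in [0, 1]; higher combined score ranks first."""
    def score(insight):
        return (w_fact * insight["factscore"]
                + w_consist * insight["summac"]
                + w_interest * insight["interest"])
    return sorted(insights, key=score, reverse=True)
```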
Bribery-Resistant Ranking Systems: A Multipartite
User-Agnostic Framework for AI Act Compliance
Full paper · Search and ranking · Societally-motivated IR research
Modern ranking systems must comply with emerging AI regulations while resisting manipulation attacks. The EU AI Act's prohibition on individual user scoring creates a critical gap: reputation-based systems often violate compliance requirements, while user-agnostic approaches lack resistance to bribery, and both remain vulnerable to demographic bias. We propose a user-agnostic multipartite ranking framework addressing regulatory compliance, security, personalization, and bias. Our approach clusters users based on rating patterns and applies localized statistical filtering within clusters to remove anomalous ratings, thereby eliminating individual profiling while preserving personalization and enhancing manipulation resistance. Evaluation across three datasets shows substantial bribery-resistance improvements, with profitable attacks in only 7 of 18 scenarios versus 8--11 for state-of-the-art baselines. The framework also reduces demographic bias values by a factor of 100 compared to a user-agnostic bipartite approach. These results are achieved with a design that avoids the individual user scoring prohibited by the EU AI Act. Robustness analysis reveals enhanced spam resistance on two datasets, with computational overhead as the primary trade-off.
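The core mechanism, clustering users by rating pattern and then filtering anomalous ratings against each cluster's local statistics rather than per-user reputations, can be illustrated with a simple z-score rule. This is a toy stand-in under assumed conventions (clusters already computed, one item's ratings per cluster, a z-score cut-off), not the paper's actual filter.

```python
import statistics

def filter_ratings(clusters, z_max=2.0):
    """Localized statistical filtering: within each cluster of
    like-rating users, drop ratings whose z-score against the
    cluster's own mean and standard deviation exceeds z_max.
    No individual user is ever scored, only cluster-level statistics."""
    kept = {}
    for cluster_id, ratings in clusters.items():
        if len(ratings) < 2:
            kept[cluster_id] = list(ratings)  # too small to filter
            continue
        mu = statistics.fmean(ratings)
        sd = statistics.pstdev(ratings)
        kept[cluster_id] = [r for r in ratings
                            if sd == 0 or abs(r - mu) / sd <= z_max]
    return kept
```

Because the statistics are computed per cluster, a bribed outlier stands out against similar users' ratings even when it would look unremarkable globally, which is the source of the bribery resistance the evaluation measures.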
RAC: Retrieval-Augmented Clarification for Faithful
Conversational Search
Full paper · Conversational search and recommender systems · Machine Learning and Large Language Models
Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of the retrieved context and to encourage the generation of evidence-based questions. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrates significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.
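The NLI-derived anchoring metric mentioned at the end of the abstract could be sketched as the fraction of a question's facets that some retrieved passage entails. Everything in this snippet is an illustrative assumption, including the facet decomposition, the pluggable `entails(premise, hypothesis) -> [0, 1]` scorer (in practice an NLI model), and the threshold; it is not RAC's published metric.

```python
def groundedness(question_facets, passages, entails, tau=0.5):
    """Fraction of a clarification question's facets that are entailed
    (score >= tau) by at least one retrieved passage, given a pluggable
    NLI scorer entails(premise, hypothesis) -> [0, 1]. Returns a value
    in [0, 1]; 1.0 means every facet is anchored in the corpus."""
    if not question_facets:
        return 0.0
    supported = sum(
        1 for facet in question_facets
        if any(entails(passage, facet) >= tau for passage in passages)
    )
    return supported / len(question_facets)
```

A question scoring low under such a metric is exactly the failure mode the abstract warns about: it asks about things the available documents cannot answer.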