Trustworthy and Responsible Retrieval-Augmented Systems
ECIR 2026 · March 31, 2026, 02:30 PM - 04:00 PM (Europe/Amsterdam)
Contact: n.fontein@tudelft.nl
Learned Hallucination Detection in Black-Box LLMs using
Token-level Entropy Production Rate
Full paper · Machine Learning and Large Language Models · 02:30 PM - 04:00 PM (Europe/Amsterdam)
Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces a methodology for robust, one-shot hallucination detection, designed specifically for scenarios with limited data access, such as black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) that offers solid baseline performance, which we then augment with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no repeated queries. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves token-level hallucination detection over state-of-the-art methods. Crucially, high performance is achieved using only the typically small set of available log-probabilities (e.g., the top 10 per token), confirming its practical efficiency and suitability for API-constrained deployments. This lightweight technique enhances the trustworthiness of LLM responses at the token level after a single generation pass, for QA and Retrieval-Augmented Generation (RAG) systems as well as for a private finance framework that analyzes responses to queries over annual company reports.
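The single-pass uncertainty signal the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact estimator: it computes a truncated Shannon entropy per token from the top-k log-probabilities a black-box API returns, and averages them over the sequence as a stand-in for the Entropy Production Rate; the function names and the decision to ignore the tail mass beyond the top k are assumptions of this sketch.

```python
import math

def token_entropy(top_logprobs):
    """Approximate Shannon entropy of the next-token distribution from
    the top-k log-probabilities an API exposes. Probability mass beyond
    the top k is ignored (the truncation the black-box setting imposes)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_production_rate(per_token_top_logprobs):
    """Mean per-token entropy over one generated sequence: a one-shot
    uncertainty indicator, no re-runs of the query required."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)
```

In use, tokens whose entropy sits far above the sequence mean would be the hallucination candidates; the paper's learned model replaces such a fixed rule with supervised features over these same entropic contributions.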
FACTUM: Mechanistic Detection of Citation Hallucination in
Long-Form RAG
Full paper · Machine Learning and Large Language Models
Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.
Kevin Duh, Johns Hopkins University, Human Language Technology Center of Excellence (HLTCOE)
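One of FACTUM's four mechanistic scores is the alignment between a model's attention and FFN pathways. One plausible reading of such an alignment score, offered purely as an illustration and not as the paper's definition, is the cosine similarity between the two pathways' contribution vectors to a token's residual stream: high values mean the pathways push in the same direction, low or negative values mean they contribute distinct, near-orthogonal information (the signature the abstract reports for Llama-3.1-8B).

```python
import numpy as np

def pathway_alignment(attn_contrib, ffn_contrib):
    """Cosine similarity between the attention-path and FFN-path
    contribution vectors at a citation token. An illustrative stand-in
    for FACTUM's alignment score, not its published formula."""
    a = np.asarray(attn_contrib, dtype=float)
    f = np.asarray(ffn_contrib, dtype=float)
    return float(a @ f / (np.linalg.norm(a) * np.linalg.norm(f)))
```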
SUMMIR: A Hallucination-Aware Framework for Ranking Sports
Insights from LLMs
Full paper · Applications · Search and ranking
With the rapid proliferation of online sports journalism, extracting meaningful pre-game and post-game insights from articles is essential for enhancing user engagement and comprehension. In this paper, we address the task of automatically extracting such insights from articles published before and after matches. We curate a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. To ensure contextual relevance, we employ a two-step validation pipeline leveraging both open-source and proprietary large language models (LLMs). We then use multiple state-of-the-art LLMs (GPT-4o, Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1) to generate comprehensive insights. The factual accuracy of these outputs is rigorously assessed using a FactScore-based methodology, complemented by hallucination detection via the SummaC (Summary Consistency) framework with GPT-4o. Finally, we propose SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests. Our results demonstrate the effectiveness of this approach in generating high-quality, relevant insights, while also revealing significant differences in factual consistency and interestingness across LLMs. This work contributes a robust framework for automated, reliable insight generation from sports news content.
Sriparna Saha, Associate Professor, Indian Institute of Technology Patna
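The pipeline above scores each candidate insight on factuality (FactScore-style), consistency (SummaC-style), and user interest, then ranks. A toy version of that final ranking step is sketched below; the field names and the hand-set weights are assumptions of this sketch, whereas SUMMIR itself is a learned multimetric model rather than a fixed weighted sum.

```python
def rank_insights(insights, w_fact=0.5, w_consist=0.3, w_interest=0.2):
    """Rank candidate insights by a weighted combination of a
    factuality score, a consistency score, and a user-interest score.
    Each insight is a dict with 'factscore', 'summac', and 'interest'
    fields in [0, 1]; higher combined score ranks first."""
    def score(insight):
        return (w_fact * insight["factscore"]
                + w_consist * insight["summac"]
                + w_interest * insight["interest"])
    return sorted(insights, key=score, reverse=True)
```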
Bribery-Resistant Ranking Systems: A Multipartite
User-Agnostic Framework for AI Act Compliance
Full paper · Search and ranking · Societally-motivated IR research
Modern ranking systems must comply with emerging AI regulations while resisting manipulation attacks. The EU AI Act's prohibition on individual user scoring creates a critical gap: reputation-based systems often violate compliance requirements, while user-agnostic approaches lack resistance to bribery, and both remain vulnerable to demographic bias. We propose a user-agnostic multipartite ranking framework addressing regulatory compliance, security, personalization, and bias. Our approach clusters users based on rating patterns and applies localized statistical filtering within clusters to remove anomalous ratings, thereby eliminating individual profiling while preserving personalization and enhancing manipulation resistance. Evaluation across three datasets shows substantial bribery-resistance improvements, with profitable attacks in only 7 of 18 scenarios versus 8--11 for state-of-the-art baselines. The framework also reduces demographic bias values by a factor of 100 compared to a user-agnostic bipartite approach. These results are achieved with a design that avoids the individual user scoring prohibited by the EU AI Act. Robustness analysis reveals enhanced spam resistance on two datasets, with computational overhead as the primary trade-off.
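The core mechanism, clustering users by rating pattern and then filtering anomalous ratings against each cluster's local statistics rather than per-user reputations, can be illustrated with a simple z-score rule. This is a toy stand-in under assumed conventions (clusters already computed, one item's ratings per cluster, a z-score cut-off), not the paper's actual filter.

```python
import statistics

def filter_ratings(clusters, z_max=2.0):
    """Localized statistical filtering: within each cluster of
    like-rating users, drop ratings whose z-score against the
    cluster's own mean and standard deviation exceeds z_max.
    No individual user is ever scored, only cluster-level statistics."""
    kept = {}
    for cluster_id, ratings in clusters.items():
        if len(ratings) < 2:
            kept[cluster_id] = list(ratings)  # too small to filter
            continue
        mu = statistics.fmean(ratings)
        sd = statistics.pstdev(ratings)
        kept[cluster_id] = [r for r in ratings
                            if sd == 0 or abs(r - mu) / sd <= z_max]
    return kept
```

Because the statistics are computed per cluster, a bribed outlier stands out against similar users' ratings even when it would look unremarkable globally, which is the source of the bribery resistance the evaluation measures.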
RAC: Retrieval-Augmented Clarification for Faithful
Conversational Search
Full paper · Conversational search and recommender systems · Machine Learning and Large Language Models
Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of the retrieved context and to encourage the generation of evidence-based questions. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrates significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.
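The NLI-derived anchoring metric mentioned at the end of the abstract could be sketched as the fraction of a question's facets that some retrieved passage entails. Everything in this snippet is an illustrative assumption, including the facet decomposition, the pluggable `entails(premise, hypothesis) -> [0, 1]` scorer (in practice an NLI model), and the threshold; it is not RAC's published metric.

```python
def groundedness(question_facets, passages, entails, tau=0.5):
    """Fraction of a clarification question's facets that are entailed
    (score >= tau) by at least one retrieved passage, given a pluggable
    NLI scorer entails(premise, hypothesis) -> [0, 1]. Returns a value
    in [0, 1]; 1.0 means every facet is anchored in the corpus."""
    if not question_facets:
        return 0.0
    supported = sum(
        1 for facet in question_facets
        if any(entails(passage, facet) >= tau for passage in passages)
    )
    return supported / len(question_facets)
```

A question scoring low under such a metric is exactly the failure mode the abstract warns about: it asks about things the available documents cannot answer.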