ECIR 2026 · Session: Applied Generation, Evaluation & Analysis with LLMs
2026/03/30, 10:30 AM - 12:30 PM (Europe/Amsterdam)
Contradictions in Context: Challenges for
Retrieval-Augmented Generation in Healthcare
Full papers · Applications · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
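The contradiction-aware filtering this abstract argues for can be sketched as a recency-preferring filter over retrieved passages: given pairwise contradiction judgments (in practice from an NLI model, stubbed here), newer evidence suppresses older evidence it contradicts before the prompt is assembled. This is an illustrative sketch only; the names (`Passage`, `filter_contradictions`, `toy_contradicts`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Passage:
    """A retrieved abstract with its publication year."""
    text: str
    year: int

def filter_contradictions(
    passages: List[Passage],
    contradicts: Callable[[Passage, Passage], bool],
) -> List[Passage]:
    """Keep passages newest-first, dropping any passage that
    contradicts an already-kept (i.e., more recent) one."""
    kept: List[Passage] = []
    for p in sorted(passages, key=lambda p: p.year, reverse=True):
        if not any(contradicts(p, k) for k in kept):
            kept.append(p)
    return kept

# Toy contradiction oracle; a real system would call an NLI model here.
def toy_contradicts(a: Passage, b: Passage) -> bool:
    return {a.text, b.text} == {"drug X is safe", "drug X is unsafe"}

evidence = [
    Passage("drug X is safe", 2012),
    Passage("drug X is unsafe", 2023),
    Passage("drug X treats condition Y", 2020),
]
kept = filter_contradictions(evidence, toy_contradicts)
```

The design choice here is the one the abstract motivates: retrieval similarity alone admits both sides of a temporal contradiction, so a post-retrieval pass must arbitrate, and recency is the simplest defensible tie-breaker for medical evidence.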
Presenter: Saeedeh Javadi, PhD Candidate, RMIT University
Small Models, Big Picture! A Language Model Augmentation
for Enhanced Reader-Aware Summarization
Full papers · Applications · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Integrating heterogeneous modalities for effective information access remains a central challenge in Information Retrieval (IR), particularly in reader-aware summarization, where user perspectives must be incorporated alongside textual and multimedia content. In this work, we present a novel augmentation framework that combines the strengths of Language Models (LMs) and multimodal models to generate holistic news summaries. Our approach seamlessly integrates textual articles, visual evidence from images, user-generated comments, and distilled insights from video streams. Through extensive experiments, we show that this LM-ensembled multimodal framework consistently surpasses specialized Video Language Models (Video LMs) in terms of coherence, informativeness, and user-sensitivity across multiple benchmarks. To further advance multimodal IR research, we extend the Reader-Aware Multi-Document Summarization (RAMDS) dataset with video components, introducing VARAMDS (Video-Augmented-RAMDS), the first resource to explicitly couple news text, imagery, reader comments, and video content. Our findings demonstrate that LM-driven augmentation not only improves multimodal summarization quality but also sets a new standard for reader-aware, comment-sensitive synthesis, bridging gaps between heterogeneous information sources and supporting richer retrieval-oriented applications in resource-constrained environments.
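One way to read the augmentation framework described above: per-modality signals are first distilled to text, then fused into a single prompt for a text-only LM, rather than handed to a monolithic Video LM. The sketch below is a minimal illustration of that fusion step; the function and field names are assumptions, not the paper's API.

```python
from typing import List

def build_summary_prompt(article: str, image_caption: str,
                         video_insights: str, comments: List[str]) -> str:
    """Fuse the article text, visual evidence, distilled video
    insights, and reader comments into one LM prompt."""
    comment_block = "\n".join(f"- {c}" for c in comments)
    return (
        "Summarize the news story for its readers.\n"
        f"Article: {article}\n"
        f"Image shows: {image_caption}\n"
        f"Video highlights: {video_insights}\n"
        f"Reader comments:\n{comment_block}\n"
    )

prompt = build_summary_prompt(
    "City council approves new transit plan.",
    "crowded bus stop at rush hour",
    "mayor announces funding timeline",
    ["Will fares go up?", "Long overdue."],
)
```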
Presenter: Raghvendra Kumar, 4th-year PhD Student, Indian Institute of Technology Patna
Co-Author: Sriparna Saha, Associate Professor, Indian Institute of Technology Patna
From Comments to Conclusions: Adaptive Reader-Aware Summary
Generation in Low-Resource Languages via Agent Debate
Full papers · Applications · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Reader-aware summarization distills articles while embedding user opinions and contextual grounding, shaping results that resonate with diverse readers and easing the challenge of extracting meaning from abundant news sources. However, research so far has centered on English and Chinese, with the complex multilingual and multimodal ecosystem of Indian news, shaped by articles, images, and user comments, still largely overlooked. Traditional single large language models (LLMs) often fail to integrate such heterogeneous evidence, yielding shallow or biased outputs. We introduce a Multi-Agent Debate (MAD) framework for reader-aware summarization, built on the COSMMIC dataset, a multilingual, multimodal, and comment-sensitive resource for Indian news. MAD employs role-specialized agents (article analyst, comment integrator, image contextualizer, summary planner, and judge) that deliberate to produce a final summary, accompanied by a justification that attributes information to its source modality. This design not only enhances informativeness and factual consistency but also provides interpretability crucial for trustworthy Information Retrieval (IR) systems. Extensive automatic and human evaluations demonstrate that MAD significantly outperforms strong baselines in generating summaries that are more grounded, diverse, and aligned with reader context, especially in low-resource Indian languages.
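The role-specialized deliberation sketched in the abstract can be pictured as a minimal pipeline: each agent is a function from shared context to a note, and a judge synthesizes the notes into the final summary. In the real framework each agent would be an LLM call; here they are stubs, and every name (`debate`, the role labels, `judge`) is illustrative rather than the paper's implementation.

```python
from typing import Callable, Dict

# An agent maps the shared multimodal context to a textual note.
Agent = Callable[[Dict[str, str]], str]

def debate(context: Dict[str, str], agents: Dict[str, Agent],
           judge: Callable[[Dict[str, str]], str]) -> str:
    """Run each role-specialized agent over the shared context,
    then let the judge compose their notes into one summary."""
    notes = {role: agent(context) for role, agent in agents.items()}
    return judge(notes)

# Stub agents standing in for LLM calls, one per modality.
agents: Dict[str, Agent] = {
    "article_analyst": lambda c: f"article: {c['article']}",
    "comment_integrator": lambda c: f"comments: {c['comments']}",
    "image_contextualizer": lambda c: f"image: {c['image']}",
}
# Stub judge: concatenates notes so each claim stays attributable
# to its source modality, mirroring the justification step.
judge = lambda notes: " | ".join(notes[r] for r in sorted(notes))

summary = debate(
    {"article": "floods in Assam", "comments": "calls for relief",
     "image": "submerged fields"},
    agents, judge,
)
```

Keeping the per-role notes separate until the judge step is what makes the attribution-to-modality justification possible, which the abstract highlights as the interpretability benefit.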
Presenter: Raghvendra Kumar, 4th-year PhD Student, Indian Institute of Technology Patna
Co-Author: Sriparna Saha, Associate Professor, Indian Institute of Technology Patna
Prompt Compression in the Wild: Measuring Latency, Rate
Adherence, and Quality for Faster LLM Inference
Full papers · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
With the wide adoption of language models for IR, and RAG systems in particular, the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and four GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3 s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
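The break-even intuition from the abstract (compression pays only when its preprocessing overhead is recouped by processing fewer prompt tokens) can be written down directly. The sketch below is a simplified model assuming per-token prompt-processing cost and an unchanged decode time; all numeric figures are placeholders, not the paper's measurements.

```python
def end_to_end_latency(prompt_tokens: int, ratio: float,
                       compress_overhead_s: float,
                       s_per_prompt_token: float,
                       decode_s: float = 0.0) -> float:
    """Total latency with compression: overhead, plus processing
    the shortened prompt, plus (assumed unchanged) decode time."""
    return (compress_overhead_s
            + ratio * prompt_tokens * s_per_prompt_token
            + decode_s)

def compression_pays_off(prompt_tokens: int, ratio: float,
                         compress_overhead_s: float,
                         s_per_prompt_token: float,
                         decode_s: float = 0.0) -> bool:
    """True iff the compressed pipeline beats the uncompressed one."""
    full = prompt_tokens * s_per_prompt_token + decode_s
    compressed = end_to_end_latency(prompt_tokens, ratio,
                                    compress_overhead_s,
                                    s_per_prompt_token, decode_s)
    return compressed < full

# Long prompt, modest overhead: compression wins.
win = compression_pays_off(8000, ratio=0.5, compress_overhead_s=0.5,
                           s_per_prompt_token=0.0005)
# Short prompt: the same fixed overhead dominates and cancels the gain,
# the "outside the operating window" regime the abstract describes.
lose = compression_pays_off(500, ratio=0.5, compress_overhead_s=0.5,
                            s_per_prompt_token=0.0005)
```

Solving `compressed < full` for the overhead gives the break-even condition `compress_overhead_s < (1 - ratio) * prompt_tokens * s_per_prompt_token`, which is presumably the kind of quantity the paper's profiler estimates per model-hardware setup.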
ExpertMix: Aspect and Severity Detection in Conversational
Complaints
Full papers · Applications · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC