Applied Generation, Evaluation & Analysis with LLMs

Session Information

  • Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare
  • Small Models, Big Picture! A Language Model Augmentation for Enhanced Reader-Aware Summarization
  • From Comments to Conclusions: Adaptive Reader-Aware Summary Generation in Low-Resource Languages via Agent Debate
  • Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
  • Towards Quantitative Summarization Evaluation: An Integrated Atomic-Based Evaluation Framework and Dataset for Text Summarization
  • ExpertMix: Aspect and Severity Detection in Conversational Complaints
  • MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-Turn Conversations
Mar 30, 2026, 10:30 - 12:30 (Europe/Amsterdam)
Venue: Chaos

Sub Sessions

Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
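The abstract leaves the contradiction-aware filtering strategy open; a minimal sketch, assuming an off-the-shelf NLI model and an illustrative confidence threshold rather than the authors' actual pipeline, would screen retrieved passage pairs before they reach the generator:

```python
# Hedged sketch: flag contradictory passage pairs with an off-the-shelf
# NLI model before RAG generation. The model choice and 0.9 threshold
# are illustrative assumptions, not the paper's method.
from itertools import combinations
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def filter_contradictions(passages, threshold=0.9):
    """Keep the higher-ranked passage of any pair the NLI model labels
    CONTRADICTION with high confidence; drop the other."""
    dropped = set()
    for i, j in combinations(range(len(passages)), 2):
        if i in dropped or j in dropped:
            continue
        pred = nli([{"text": passages[i], "text_pair": passages[j]}])[0]
        if pred["label"] == "CONTRADICTION" and pred["score"] >= threshold:
            dropped.add(j)  # prefer the earlier-retrieved passage
    return [p for k, p in enumerate(passages) if k not in dropped]
```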
Presenters Saeedeh Javadi
PhD Candidate, RMIT University
Co-Authors
SM
Sara Mirabi
PhD Candidate, Deakin University
BO
Bahadorreza Ofoghi
Deakin University

Small Models, Big Picture! A Language Model Augmentation for Enhanced Reader-Aware Summarization

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Integrating heterogeneous modalities for effective information access remains a central challenge in Information Retrieval (IR), particularly in reader-aware summarization, where user perspectives must be incorporated alongside textual and multimedia content. In this work, we present a novel augmentation framework that combines the strengths of Language Models (LMs) and multimodal models to generate holistic news summaries. Our approach seamlessly integrates textual articles, visual evidence from images, user-generated comments, and distilled insights from video streams. Through extensive experiments, we show that this LM-ensembled multimodal framework consistently surpasses specialized Video Language Models (Video LMs) in terms of coherence, informativeness, and user-sensitivity across multiple benchmarks. To further advance multimodal IR research, we extend the Reader-Aware Multi-Document Summarization (RAMDS) dataset with video components, introducing VARAMDS (Video-Augmented-RAMDS), the first resource to explicitly couple news text, imagery, reader comments, and video content. Our findings demonstrate that LM-driven augmentation not only improves multimodal summarization quality but also sets a new standard for reader-aware, comment-sensitive synthesis, bridging gaps between heterogeneous information sources and supporting richer retrieval-oriented applications in resource-constrained environments.
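As a minimal illustration of the augmentation idea (not the paper's actual components), the modality-specific signals can be distilled to text and assembled into a single LM prompt; every input below is assumed to be pre-textualized:

```python
# Hedged sketch: compose article text, image captions, video notes, and
# reader comments into one prompt for a general-purpose LM. All inputs
# are assumed to already be strings/lists; this is not the paper's framework.

def build_reader_aware_prompt(article, image_captions, comments, video_notes):
    image_block = "\n".join(image_captions)
    comment_block = "\n".join(comments[:20])  # cap reader comments for context budget
    return (
        "Summarize the news story for its readers.\n"
        f"Article:\n{article}\n\n"
        f"Visual evidence:\n{image_block}\n\n"
        f"Video insights:\n{video_notes}\n\n"
        f"Reader comments (reflect their concerns):\n{comment_block}\n"
    )
```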
Presenters Raghvendra Kumar
PhD Student 4th Year, Indian Institute Of Technology Patna
Co-Authors
AP
A S Poornash
Indian Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

From Comments to Conclusions: Adaptive Reader-Aware Summary Generation in Low-Resource Languages via Agent Debate

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Reader-aware summarization distills articles while embedding user opinions and contextual grounding, shaping results that resonate with diverse readers and ease the challenge of extracting meaning from abundant news sources. However, research so far has centered on English and Chinese, with the complex multilingual and multimodal ecosystem of Indian news, shaped by articles, images, and user comments, still largely overlooked. Traditional single large language models (LLMs) often fail to integrate such heterogeneous evidence, yielding shallow or biased outputs. We introduce a Multi-Agent Debate (MAD) framework for reader-aware summarization, built on the COSMMIC dataset, a multilingual, multimodal, and comment-sensitive resource for Indian news. MAD employs role-specialized agents (article analyst, comment integrator, image contextualizer, summary planner, and judge) that deliberate to produce a final summary, accompanied by a justification that attributes information to its source modality. This design not only enhances informativeness and factual consistency but also provides interpretability crucial for trustworthy Information Retrieval (IR) systems. Extensive automatic and human evaluations demonstrate that MAD significantly outperforms strong baselines in generating summaries that are more grounded, diverse, and aligned with reader context, especially in low-resource Indian languages.
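A minimal sketch of such a role-specialized debate loop, assuming a generic chat callable, illustrative role prompts, and a fixed round count (the paper's actual agent prompts and judging protocol are not reproduced here):

```python
# Hedged sketch of a multi-agent debate in the spirit of MAD. `chat` stands
# in for any LLM call (e.g., an OpenAI-compatible client); roles, prompts,
# and the two-round schedule are illustrative assumptions.

ROLES = {
    "article analyst": "Extract the article's key claims.",
    "comment integrator": "Surface what readers care about in the comments.",
    "image contextualizer": "Describe what the images add to the story.",
    "summary planner": "Draft a summary honoring all prior notes.",
}

def debate(chat, article, comments, image_captions, rounds=2):
    transcript = []
    context = (f"Article:\n{article}\nComments:\n{comments}\n"
               f"Images:\n{image_captions}")
    for _ in range(rounds):
        for role, instruction in ROLES.items():
            notes = "\n".join(transcript)
            reply = chat(f"You are the {role}. {instruction}\n"
                         f"{context}\nDebate so far:\n{notes}")
            transcript.append(f"{role}: {reply}")
    # A judge agent produces the final summary plus per-modality attribution.
    return chat("You are the judge. Write the final summary and attribute "
                "each point to article, comments, or images.\n"
                + "\n".join(transcript))
```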
Presenters Raghvendra Kumar
PhD Student 4th Year, Indian Institute Of Technology Patna
Co-Authors
MA
Mohammed Salman S A
National Institute Of Technology Tiruchirappalli
JV
Jaya Verma
Indian Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Full papers · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
With the wide adoption of Language Models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and four GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
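The break-even reasoning reduces to a back-of-the-envelope model: compression pays off only while its overhead stays below the prefill time it saves. A sketch with illustrative timing numbers, assuming (for simplicity) that decode speed is unaffected by the shorter prompt; these are not measurements from the paper:

```python
# Hedged sketch of the break-even logic a profiler like this embodies.
# All throughput and overhead numbers below are illustrative assumptions.

def end_to_end_latency(prompt_tokens, output_tokens,
                       prefill_tok_per_s, decode_tok_per_s,
                       compress_s=0.0, rate=1.0):
    """rate = compressed length / original length (1.0 = no compression)."""
    kept = prompt_tokens * rate
    return compress_s + kept / prefill_tok_per_s + output_tokens / decode_tok_per_s

baseline = end_to_end_latency(8000, 512,
                              prefill_tok_per_s=4000, decode_tok_per_s=60)
with_cmp = end_to_end_latency(8000, 512,
                              prefill_tok_per_s=4000, decode_tok_per_s=60,
                              compress_s=0.8, rate=0.5)
print(f"baseline {baseline:.2f}s vs compressed {with_cmp:.2f}s")
# Compression wins only while
#   compress_s < prompt_tokens * (1 - rate) / prefill_tok_per_s
# (plus any decode speed-up from the smaller KV cache).
```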
Presenters
CK
Cornelius Kummer
TU Dresden
Co-Authors
LJ
Lena Jurkschat
Research Associate, TU Dresden, ScaDS.AI
MF
Michael Färber
ScaDS.AI & TU Dresden
SV
Sahar Vahdati
TU Dresden

Towards Quantitative Summarization Evaluation: An Integrated Atomic-Based Evaluation Framework and Dataset for Text Summarization

Full papers · Applications · Evaluation research | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Despite the dramatic advances in Large Language Models (LLMs), traditional summarization benchmarks are critically saturated, failing to differentiate state-of-the-art models or reflect real-world user needs. Existing evaluation datasets suffer from fundamental constraints in instructional diversity, narrow domain coverage, and homogeneous text lengths, missing the practical demands of modern summarization. To address these limitations, we introduce SumBench, a challenging benchmark derived from an in-depth analysis of user requirements and designed to stress-test advanced LLM capabilities. SumBench incorporates a diverse range of goal-oriented instructions, cross-domain long documents (up to 32K tokens), and complex domain knowledge. Crucially, we integrate the Atomic Summarization Evaluation Framework (ATMSumE), which leverages atomic decompositions of instructions and references to enable fine-grained, multi-dimensional assessment across instruction adherence, key point coverage, and factual accuracy. Our analysis on SumBench reveals systemic LLM limitations: performance substantially degrades when source length exceeds 16K tokens, showing pronounced weaknesses in completeness and factuality within specialized domains. Critically, failure modes are highly task-dependent: completeness gaps emerge in Timeline and Global Summarization, while reasoning-intensive tasks incur higher factual error rates. These results empirically establish that LLM performance is strongly modulated by document length, domain complexity, and instruction type, providing an evidence-based roadmap for robust model development. The benchmark and tools will be publicly released.
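As a rough sketch of atomic-based scoring (the exact ATMSumE protocol is not reproduced here), key-point coverage can be estimated by asking a judge model whether each reference atom is supported by the candidate summary; `llm_judge` below is a placeholder callable, not a component from the paper:

```python
# Hedged sketch of atomic key-point coverage. `llm_judge` stands in for
# any entailment or LLM-judge call; the prompt wording is an assumption.

def atomic_coverage(reference_atoms, summary, llm_judge):
    """reference_atoms: list of short, self-contained fact strings.
    Returns the fraction of atoms the judge deems supported, in [0, 1]."""
    supported = sum(
        llm_judge(
            "Does the summary support this fact?\n"
            f"Fact: {atom}\nSummary: {summary}\nAnswer yes or no."
        ).strip().lower().startswith("yes")
        for atom in reference_atoms
    )
    return supported / max(len(reference_atoms), 1)
```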
Presenters
YL
Yan Lei
PhD Student, Key Laboratory Of AI Safety, Institute Of Computing Technology, University Of Chinese Academy Of Sciences
Co-Authors
SZ
Suncong Zheng
RW
Roberts Wang
LP
Liang Pang
Institute Of Computing Technology, Chinese Academy Of Sciences
LH
Lei He
SC
Shuang Chen
WY
Wang Yu
HS
Huawei Shen
XC
Xueqi Cheng
YW
Yuanzhuo Wang
Institute Of Computing Technology, Chinese Academy Of Sciences

ExpertMix: Aspect and Severity Detection in Conversational Complaints

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Prior research on fine-grained complaint analysis has largely focused on short, context-limited inputs such as tweets or product reviews. In this work, we advance the field by introducing a paradigm that leverages multi-turn customer-support dialogues as a richer medium for capturing nuanced dissatisfaction. Such conversations provide dynamic signals (emotional shifts, iterative follow-ups, and detailed issue descriptions) that enable more accurate detection of aspect categories (e.g., service quality, software issues) and severity levels (e.g., disapproval, accusation). To this end, we extend a publicly available customer-support corpus with fine-grained aspect and severity annotations in the cellular services domain. Building on this dataset, we propose CompSense, a multi-task Mixture-of-Experts (MoE) framework underpinned by Large Language Models (LLMs) and enriched with commonsense-aware contextualization for robust complaint understanding. Extensive evaluations show that CompSense consistently outperforms task-specific conversational LLMs and decoder-only causal LLM baselines, highlighting the value of bidirectional and commonsense-aware modeling. This work marks a step toward practical, real-world systems capable of sophisticated conversational complaint analysis.
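A minimal sketch of multi-task Mixture-of-Experts routing in the spirit of CompSense, with illustrative layer sizes and two task heads (aspect and severity); this is an assumption-laden toy module, not the paper's architecture:

```python
# Hedged sketch: a soft-gated MoE over a shared dialogue encoding, with
# separate classification heads per task. Expert count, hidden size, and
# label counts are illustrative assumptions.
import torch
import torch.nn as nn

class TwoTaskMoE(nn.Module):
    def __init__(self, hidden=768, n_experts=4, n_aspects=5, n_severity=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
            for _ in range(n_experts))
        self.gate = nn.Linear(hidden, n_experts)
        self.aspect_head = nn.Linear(hidden, n_aspects)
        self.severity_head = nn.Linear(hidden, n_severity)

    def forward(self, h):  # h: [batch, hidden] pooled dialogue encoding
        weights = torch.softmax(self.gate(h), dim=-1)              # [batch, n_experts]
        expert_out = torch.stack([e(h) for e in self.experts], 1)  # [batch, n_experts, hidden]
        mixed = (weights.unsqueeze(-1) * expert_out).sum(1)        # weighted expert mix
        return self.aspect_head(mixed), self.severity_head(mixed)
```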
Presenters
SD
Sarmistha Das
Indian Institute Of Technology Patna
Co-Authors
AS
Apoorva Singh
Fondazione Bruno Kessler, Italy
RS
Rishu Kumar Singh
IIT Patna
NS
Navneet Shreya
National Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-Turn Conversations

Full papers · Search and ranking · System aspects | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically retrieving and utilizing relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically retrieve and manage tools or MCP server contexts across multi-turn conversations, outperforming previous state-of-the-art tool retrieval approaches that lack multi-turn support and memory management of available tools or MCPs. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. We evaluate all modes across 13+ LLMs on the ScaleMCP benchmark, conducting experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency), task completion accuracy, and comprehensive cost analysis across modes. Our results significantly outperform existing state-of-the-art tool retrieval methods which cannot handle multi-turn tool retrieval and management. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90–94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0–60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs, cost analysis, and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
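A rough sketch of the short-term tool-memory bookkeeping, assuming a simple staleness-based eviction policy and a 3-turn window for the removal ratio (MemTool's actual policies differ by mode and are not reproduced here):

```python
# Hedged sketch: track which tools stay relevant across turns, evict stale
# ones, and report a rolling tool-removal ratio. Window size and the
# two-turn staleness rule are illustrative assumptions.
from collections import deque

class ToolMemory:
    def __init__(self, window=3):
        self.active = {}                       # tool name -> turns since last use
        self.removal_ratios = deque(maxlen=window)

    def update(self, retrieved, used):
        """retrieved: tools fetched this turn; used: tools actually invoked."""
        for name in retrieved:
            self.active[name] = 0              # refresh newly retrieved tools
        for name in [n for n in self.active if n not in used]:
            self.active[name] += 1             # age tools that went unused
        evicted = [n for n, age in self.active.items() if age >= 2]
        for name in evicted:
            del self.active[name]              # drop tools stale for 2+ turns
        before = len(self.active) + len(evicted)
        self.removal_ratios.append(len(evicted) / before if before else 0.0)
        return list(self.active)               # tool context for the next turn
```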
Presenters
EL
Elias Lumer
Lead AI Researcher, PricewaterhouseCoopers U.S.
Co-Authors
AG
Anmol Gulati
Associate, PricewaterhouseCoopers U.S.
VS
Vamse Kumar Subbiah
PB
Pradeep Honaganahalli Basavaraju
JB
James A. Burke