Applied Generation, Evaluation & Analysis with LLMs

Session Information

  • Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare
  • Small Models, Big Picture! A Language Model Augmentation for Enhanced Reader-Aware Summarization
  • From Comments to Conclusions: Adaptive Reader-Aware Summary Generation in Low-Resource Languages via Agent Debate
  • Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
  • Towards Quantitative Summarization Evaluation: An Integrated Atomic-Based Evaluation Framework and Dataset for Text Summarization
  • ExpertMix: Aspect and Severity Detection in Conversational Complaints
  • MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-Turn Conversations
Mar 30, 2026, 10:30 - 12:30 (Europe/Amsterdam)
Venue: Chaos

Sub Sessions

Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
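The abstract leaves the contradiction-aware filtering strategy open; a minimal sketch, assuming an off-the-shelf NLI model and an illustrative confidence threshold rather than the authors' actual pipeline, would screen retrieved passage pairs before they reach the generator:

```python
# Hedged sketch: flag contradictory passage pairs with an off-the-shelf
# NLI model before RAG generation. The model choice and 0.9 threshold
# are illustrative assumptions, not the paper's method.
from itertools import combinations
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def filter_contradictions(passages, threshold=0.9):
    """Keep the higher-ranked passage of any pair the NLI model labels
    CONTRADICTION with high confidence; drop the other."""
    dropped = set()
    for i, j in combinations(range(len(passages)), 2):
        if i in dropped or j in dropped:
            continue
        pred = nli([{"text": passages[i], "text_pair": passages[j]}])[0]
        if pred["label"] == "CONTRADICTION" and pred["score"] >= threshold:
            dropped.add(j)  # prefer the earlier-retrieved passage
    return [p for k, p in enumerate(passages) if k not in dropped]
```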
Presenters Saeedeh Javadi
PhD Candidate, RMIT University
Co-Authors
SM
Sara Mirabi
PhD Candidate, Deakin University
BO
Bahadorreza Ofoghi
Deakin University

Small Models, Big Picture! A Language Model Augmentation for Enhanced Reader-Aware Summarization

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Integrating heterogeneous modalities for effective information access remains a central challenge in Information Retrieval (IR), particularly in reader-aware summarization, where user perspectives must be incorporated alongside textual and multimedia content. In this work, we present a novel augmentation framework that combines the strengths of Language Models (LMs) and multimodal models to generate holistic news summaries. Our approach seamlessly integrates textual articles, visual evidence from images, user-generated comments, and distilled insights from video streams. Through extensive experiments, we show that this LM-ensembled multimodal framework consistently surpasses specialized Video Language Models (Video LMs) in terms of coherence, informativeness, and user-sensitivity across multiple benchmarks. To further advance multimodal IR research, we extend the Reader-Aware Multi-Document Summarization (RAMDS) dataset with video components, introducing VARAMDS (Video-Augmented-RAMDS), the first resource to explicitly couple news text, imagery, reader comments, and video content. Our findings demonstrate that LM-driven augmentation not only improves multimodal summarization quality but also sets a new standard for reader-aware, comment-sensitive synthesis, bridging gaps between heterogeneous information sources and supporting richer retrieval-oriented applications in resource-constrained environments.
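As a minimal illustration of the augmentation idea (not the paper's actual components), the modality-specific signals can be distilled to text and assembled into a single LM prompt; every input below is assumed to be pre-textualized:

```python
# Hedged sketch: compose article text, image captions, video notes, and
# reader comments into one prompt for a general-purpose LM. All inputs
# are assumed to already be strings/lists; this is not the paper's framework.

def build_reader_aware_prompt(article, image_captions, comments, video_notes):
    image_block = "\n".join(image_captions)
    comment_block = "\n".join(comments[:20])  # cap reader comments for context budget
    return (
        "Summarize the news story for its readers.\n"
        f"Article:\n{article}\n\n"
        f"Visual evidence:\n{image_block}\n\n"
        f"Video insights:\n{video_notes}\n\n"
        f"Reader comments (reflect their concerns):\n{comment_block}\n"
    )
```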
Presenters Raghvendra Kumar
PhD Student 4th Year, Indian Institute Of Technology Patna
Co-Authors
AP
A S Poornash
Indian Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

From Comments to Conclusions: Adaptive Reader-Aware Summary Generation in Low-Resource Languages via Agent Debate

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Reader-aware summarization distills articles while embedding user opinions and contextual grounding, shaping results that resonate with diverse readers and ease the challenge of extracting meaning from abundant news sources. However, research so far has centered on English and Chinese, with the complex multilingual and multimodal ecosystem of Indian news, shaped by articles, images, and user comments, still largely overlooked. Traditional single large language models (LLMs) often fail to integrate such heterogeneous evidence, yielding shallow or biased outputs. We introduce a Multi-Agent Debate (MAD) framework for reader-aware summarization, built on the COSMMIC dataset, a multilingual, multimodal, and comment-sensitive resource for Indian news. MAD employs role-specialized agents (article analyst, comment integrator, image contextualizer, summary planner, and judge) that deliberate to produce a final summary, accompanied by a justification that attributes information to its source modality. This design not only enhances informativeness and factual consistency but also provides interpretability crucial for trustworthy Information Retrieval (IR) systems. Extensive automatic and human evaluations demonstrate that MAD significantly outperforms strong baselines in generating summaries that are more grounded, diverse, and aligned with reader context, especially in low-resource Indian languages.
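A minimal sketch of such a role-specialized debate loop, assuming a generic chat callable, illustrative role prompts, and a fixed round count (the paper's actual agent prompts and judging protocol are not reproduced here):

```python
# Hedged sketch of a multi-agent debate in the spirit of MAD. `chat` stands
# in for any LLM call (e.g., an OpenAI-compatible client); roles, prompts,
# and the two-round schedule are illustrative assumptions.

ROLES = {
    "article analyst": "Extract the article's key claims.",
    "comment integrator": "Surface what readers care about in the comments.",
    "image contextualizer": "Describe what the images add to the story.",
    "summary planner": "Draft a summary honoring all prior notes.",
}

def debate(chat, article, comments, image_captions, rounds=2):
    transcript = []
    context = (f"Article:\n{article}\nComments:\n{comments}\n"
               f"Images:\n{image_captions}")
    for _ in range(rounds):
        for role, instruction in ROLES.items():
            notes = "\n".join(transcript)
            reply = chat(f"You are the {role}. {instruction}\n"
                         f"{context}\nDebate so far:\n{notes}")
            transcript.append(f"{role}: {reply}")
    # A judge agent produces the final summary plus per-modality attribution.
    return chat("You are the judge. Write the final summary and attribute "
                "each point to article, comments, or images.\n"
                + "\n".join(transcript))
```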
Presenters Raghvendra Kumar
PhD Student 4th Year, Indian Institute Of Technology Patna
Co-Authors
MA
Mohammed Salman S A
National Institute Of Technology Tiruchirappalli
JV
Jaya Verma
Indian Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Full papers · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
With the wide adoption of Language Models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and four GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
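The break-even reasoning reduces to a back-of-the-envelope model: compression pays off only while its overhead stays below the prefill time it saves. A sketch with illustrative timing numbers, assuming (for simplicity) that decode speed is unaffected by the shorter prompt; these are not measurements from the paper:

```python
# Hedged sketch of the break-even logic a profiler like this embodies.
# All throughput and overhead numbers below are illustrative assumptions.

def end_to_end_latency(prompt_tokens, output_tokens,
                       prefill_tok_per_s, decode_tok_per_s,
                       compress_s=0.0, rate=1.0):
    """rate = compressed length / original length (1.0 = no compression)."""
    kept = prompt_tokens * rate
    return compress_s + kept / prefill_tok_per_s + output_tokens / decode_tok_per_s

baseline = end_to_end_latency(8000, 512,
                              prefill_tok_per_s=4000, decode_tok_per_s=60)
with_cmp = end_to_end_latency(8000, 512,
                              prefill_tok_per_s=4000, decode_tok_per_s=60,
                              compress_s=0.8, rate=0.5)
print(f"baseline {baseline:.2f}s vs compressed {with_cmp:.2f}s")
# Compression wins only while
#   compress_s < prompt_tokens * (1 - rate) / prefill_tok_per_s
# (plus any decode speed-up from the smaller KV cache).
```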
Presenters
CK
Cornelius Kummer
TU Dresden
Co-Authors
LJ
Lena Jurkschat
Research Associate, TU Dresden, ScaDS.AI
MF
Michael Färber
ScaDS.AI & TU Dresden
SV
Sahar Vahdati
TU Dresden

Towards Quantitative Summarization Evaluation: An Integrated Atomic-Based Evaluation Framework and Dataset for Text Summarization

Full papers · Applications · Evaluation research | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Despite the dramatic advances in Large Language Models (LLMs), traditional summarization benchmarks are critically saturated, failing to differentiate state-of-the-art models or reflect real-world user needs. Existing evaluation datasets suffer from fundamental constraints in instructional diversity, narrow domain coverage, and homogeneous text lengths, missing the practical demands of modern summarization. To address these limitations, we introduce SumBench, a challenging benchmark derived from an in-depth analysis of user requirements and designed to stress-test advanced LLM capabilities. SumBench incorporates a diverse range of goal-oriented instructions, cross-domain long documents (up to 32K tokens), and complex domain knowledge. Crucially, we integrate the Atomic Summarization Evaluation Framework (ATMSumE), which leverages atomic decompositions of instructions and references to enable fine-grained, multi-dimensional assessment across instruction adherence, key point coverage, and factual accuracy. Our analysis on SumBench reveals systemic LLM limitations: performance substantially degrades when source length exceeds 16K tokens, showing pronounced weaknesses in completeness and factuality within specialized domains. Critically, failure modes are highly task-dependent: completeness gaps emerge in Timeline and Global Summarization, while reasoning-intensive tasks incur higher factual error rates. These results empirically establish that LLM performance is strongly modulated by document length, domain complexity, and instruction type, providing an evidence-based roadmap for robust model development. The benchmark and tools will be publicly released.
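As a rough sketch of atomic-based scoring (the exact ATMSumE protocol is not reproduced here), key-point coverage can be estimated by asking a judge model whether each reference atom is supported by the candidate summary; `llm_judge` below is a placeholder callable, not a component from the paper:

```python
# Hedged sketch of atomic key-point coverage. `llm_judge` stands in for
# any entailment or LLM-judge call; the prompt wording is an assumption.

def atomic_coverage(reference_atoms, summary, llm_judge):
    """reference_atoms: list of short, self-contained fact strings.
    Returns the fraction of atoms the judge deems supported, in [0, 1]."""
    supported = sum(
        llm_judge(
            "Does the summary support this fact?\n"
            f"Fact: {atom}\nSummary: {summary}\nAnswer yes or no."
        ).strip().lower().startswith("yes")
        for atom in reference_atoms
    )
    return supported / max(len(reference_atoms), 1)
```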
Presenters
YL
Yan Lei
PhD Student, Key Laboratory Of AI Safety, Institute Of Computing Technology, University Of Chinese Academy Of Sciences
Co-Authors
SZ
Suncong Zheng
RW
Roberts Wang
LP
Liang Pang
Institute Of Computing Technology, Chinese Academy Of Sciences
LH
Lei He
SC
Shuang Chen
WY
Wang Yu
HS
Huawei Shen
XC
Xueqi Cheng
YW
Yuanzhuo Wang
Institute Of Computing Technology, Chinese Academy Of Sciences

ExpertMix: Aspect and Severity Detection in Conversational Complaints

Full papers · Applications · Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Prior research on fine-grained complaint analysis has largely focused on short, context-limited inputs such as tweets or product reviews. In this work, we advance the field by introducing a paradigm that leverages multi-turn customer-support dialogues as a richer medium for capturing nuanced dissatisfaction. Such conversations provide dynamic signals (emotional shifts, iterative follow-ups, and detailed issue descriptions) that enable more accurate detection of aspect categories (e.g., service quality, software issues) and severity levels (e.g., disapproval, accusation). To this end, we extend a publicly available customer-support corpus with fine-grained aspect and severity annotations in the cellular services domain. Building on this dataset, we propose CompSense, a multi-task Mixture-of-Experts (MoE) framework underpinned by Large Language Models (LLMs) and enriched with commonsense-aware contextualization for robust complaint understanding. Extensive evaluations show that CompSense consistently outperforms task-specific conversational LLMs and decoder-only causal LLM baselines, highlighting the value of bidirectional and commonsense-aware modeling. This work marks a step toward practical, real-world systems capable of sophisticated conversational complaint analysis.
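A minimal sketch of multi-task Mixture-of-Experts routing in the spirit of CompSense, with illustrative layer sizes and two task heads (aspect and severity); this is an assumption-laden toy module, not the paper's architecture:

```python
# Hedged sketch: a soft-gated MoE over a shared dialogue encoding, with
# separate classification heads per task. Expert count, hidden size, and
# label counts are illustrative assumptions.
import torch
import torch.nn as nn

class TwoTaskMoE(nn.Module):
    def __init__(self, hidden=768, n_experts=4, n_aspects=5, n_severity=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
            for _ in range(n_experts))
        self.gate = nn.Linear(hidden, n_experts)
        self.aspect_head = nn.Linear(hidden, n_aspects)
        self.severity_head = nn.Linear(hidden, n_severity)

    def forward(self, h):  # h: [batch, hidden] pooled dialogue encoding
        weights = torch.softmax(self.gate(h), dim=-1)              # [batch, n_experts]
        expert_out = torch.stack([e(h) for e in self.experts], 1)  # [batch, n_experts, hidden]
        mixed = (weights.unsqueeze(-1) * expert_out).sum(1)        # weighted expert mix
        return self.aspect_head(mixed), self.severity_head(mixed)
```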
Presenters
SD
Sarmistha Das
Indian Institute Of Technology Patna
Co-Authors
AS
Apoorva Singh
Fondazione Bruno Kessler, Italy
RS
Rishu Kumar Singh
IIT Patna
NS
Navneet Shreya
National Institute Of Technology Patna
SS
Sriparna Saha
Associate Professor, Indian Institute Of Technology Patna

MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-Turn Conversations

Full papers · Search and ranking · System aspects | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/30 08:30:00 UTC - 2026/03/30 10:30:00 UTC
Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically retrieving and utilizing relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically retrieve and manage tools or MCP server contexts across multi-turn conversations, outperforming previous state-of-the-art tool retrieval approaches that lack multi-turn support and memory management of available tools or MCPs. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. We evaluate all modes across 13+ LLMs on the ScaleMCP benchmark, conducting experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency), task completion accuracy, and comprehensive cost analysis across modes. Our results significantly outperform existing state-of-the-art tool retrieval methods which cannot handle multi-turn tool retrieval and management. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90–94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0–60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs, cost analysis, and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
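A rough sketch of the short-term tool-memory bookkeeping, assuming a simple staleness-based eviction policy and a 3-turn window for the removal ratio (MemTool's actual policies differ by mode and are not reproduced here):

```python
# Hedged sketch: track which tools stay relevant across turns, evict stale
# ones, and report a rolling tool-removal ratio. Window size and the
# two-turn staleness rule are illustrative assumptions.
from collections import deque

class ToolMemory:
    def __init__(self, window=3):
        self.active = {}                       # tool name -> turns since last use
        self.removal_ratios = deque(maxlen=window)

    def update(self, retrieved, used):
        """retrieved: tools fetched this turn; used: tools actually invoked."""
        for name in retrieved:
            self.active[name] = 0              # refresh newly retrieved tools
        for name in [n for n in self.active if n not in used]:
            self.active[name] += 1             # age tools that went unused
        evicted = [n for n, age in self.active.items() if age >= 2]
        for name in evicted:
            del self.active[name]              # drop tools stale for 2+ turns
        before = len(self.active) + len(evicted)
        self.removal_ratios.append(len(evicted) / before if before else 0.0)
        return list(self.active)               # tool context for the next turn
```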
Presenters
EL
Elias Lumer
Lead AI Researcher, PricewaterhouseCoopers U.S.
Co-Authors
AG
Anmol Gulati
Associate, PricewaterhouseCoopers U.S.
VS
Vamse Kumar Subbiah
PB
Pradeep Honaganahalli Basavaraju
JB
James A. Burke