LLMs as Rankers, Rerankers & Judges
ECIR 2026 · March 31, 2026, 10:30 AM – 12:30 PM (Europe/Amsterdam) · Centrale (Plenary Room)
Session contact: n.fontein@tudelft.nl
Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval
Full papers · Search and ranking
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that dense retrievers broadly prefer text generated by large language models (LLMs) over human-written text. This preference, called "source bias", has been hypothesized to stem from the lower perplexity of LLM-generated text. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference; the direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe, in which a language modeling head is reattached to the fine-tuned dense retriever encoder, shows near-chance agreement between perplexity and relevance, weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
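The parallel human/LLM evaluation described above suggests a simple way to quantify source bias. The sketch below, with hypothetical scores and a made-up function name (nothing here is taken from the paper), computes the fraction of parallel document pairs for which a retriever scores the LLM-generated version higher; 0.5 indicates no preference.

```python
def source_bias_rate(scores_human, scores_llm):
    """Fraction of paired (human, LLM) documents where the retriever scores
    the LLM-generated counterpart higher. 0.5 = no preference, >0.5 = pro-LLM
    bias, <0.5 = pro-human bias. Ties count as half a win."""
    assert len(scores_human) == len(scores_llm)
    wins = 0.0
    for h, l in zip(scores_human, scores_llm):
        if l > h:
            wins += 1.0
        elif l == h:
            wins += 0.5
    return wins / len(scores_human)

# Hypothetical retriever scores for four parallel document pairs.
human = [0.71, 0.64, 0.80, 0.55]
llm = [0.75, 0.60, 0.83, 0.58]
print(source_bias_rate(human, llm))  # 0.75: three of four pairs favor LLM text
```

Comparing this rate across checkpoints (unsupervised vs. fine-tuned on different corpora) is one way to observe the training-induced shifts the paper reports.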
OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning
Full papers · Machine Learning and Large Language Models · Search and ranking
LLM-based Listwise Reranking under the Effect of Positional Bias
Full papers · Search and ranking
LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias: passages positioned toward the end of the input are less likely to be moved to top positions in the ranking. We hypothesize two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to re-weight the contributions of different input positions in the training loss, adjusting for positional bias. Position-aware augmentation expands the training data so that each passage appears equally often across varied positions in the input list. This approach markedly enhances both effectiveness and robustness to the initial ordering across diverse first-stage retrievers, reducing the dependence of NDCG@10 on the position of relevant documents. DebiasFirst also complements inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
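A minimal sketch of the two ingredients described above, assuming a reciprocal positional-exposure model for the propensities and a simple shuffling scheme for the augmentation; both are illustrative choices, not the paper's exact formulation.

```python
import random

def propensity_weights(n, eta=1.0):
    """Inverse propensity weights for n input positions: later positions,
    which the reranker is biased against, receive larger loss weights.
    The exposure model p_k = (1 / (k + 1)) ** eta is a hypothetical choice."""
    return [1.0 / ((1.0 / (k + 1)) ** eta) for k in range(n)]

def augment_positions(passages, n_perms, seed=0):
    """Position-aware augmentation: emit shuffled copies of the candidate
    list so each passage is seen at varied input positions during fine-tuning."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_perms):
        perm = list(passages)
        rng.shuffle(perm)
        out.append(perm)
    return out

print(propensity_weights(4))  # [1.0, 2.0, 3.0, 4.0]: tail positions up-weighted
```

In training, each position's term in the listwise loss would be multiplied by its weight, while the augmented lists balance where relevant passages appear.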
RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking
Full papers · Applications · Search and ranking
Conventional vision-based document retrievers operate at page-level granularity, compelling subsequent reranking models to process documents containing substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built upon a vision-language model assigns query-conditioned relevance scores to image patches. A subsequent masking step then suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we constructed new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements. On PDF-MVQA, it increases Recall@1 from 0.66 to 0.738 and MRR@3 from 0.735 to 0.789. On Paper-VISA, it improves Recall@1 from 0.64 to 0.677 and MRR@3 from 0.709 to 0.740, while also achieving competitive fine-grained grounding performance (F1=0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction. A case study also reveals our model's ability to identify multiple non-contiguous relevant regions, despite being trained exclusively with single-bounding-box supervision. RerAnchor effectively transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
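The anchoring-then-masking step might be sketched as follows; the scores, threshold, and interface are hypothetical, but the idea of suppressing low-scoring patches before late-interaction scoring is as described above.

```python
def anchor_and_mask(patch_scores, threshold):
    """Context Anchoring sketch: keep image patches whose query-conditioned
    relevance score clears the threshold; the rest would be masked out before
    the late-interaction retriever performs the final scoring."""
    return [i for i, s in enumerate(patch_scores) if s >= threshold]

# Hypothetical per-patch relevance scores from a token-level classifier.
scores = [0.05, 0.62, 0.10, 0.81, 0.40, 0.77]
print(anchor_and_mask(scores, threshold=0.5))  # [1, 3, 5]
```

Note that the kept indices need not be contiguous, mirroring the case study's observation that the model can identify multiple non-contiguous relevant regions; the threshold corresponds to the "masking regime" whose stable range the paper analyzes.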
How role-play shapes relevance judgment in zero-shot LLM rankers
Full papers · Explainability methods
Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; and (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while interacting only weakly with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
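For concreteness, a role-play listwise ranking prompt of the kind studied here might be assembled as below; the role string, wording, and format are illustrative assumptions, not the paper's actual templates.

```python
def role_play_prompt(role, query, passages):
    """Assemble a zero-shot listwise ranking prompt in which the model is
    assigned a functional role. Only the role description varies across
    conditions; the task instruction and passage list stay fixed."""
    lines = [
        f"You are {role}.",
        f"Rank the following passages by relevance to the query: {query!r}.",
        "Answer with the passage identifiers in descending order of relevance.",
    ]
    for i, passage in enumerate(passages, 1):
        lines.append(f"[{i}] {passage}")
    return "\n".join(lines)

prompt = role_play_prompt(
    "an expert search-quality rater",
    "effects of caffeine on sleep",
    ["Caffeine delays sleep onset.", "Coffee production in Brazil."],
)
print(prompt.splitlines()[0])  # You are an expert search-quality rater.
```

Holding everything but the role constant is what makes the causal interventions interpretable: any change in the ranking can be attributed to the role description.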
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Full papers · Explainability methods · Machine Learning and Large Language Models
LANCER: LLM Reranking for Nugget Coverage
Full papers · Search and ranking
Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG applications require retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it requires not only relevant information but also broad coverage of the information a thorough response should include. Yet existing retrieval methods are primarily optimized for relevance rather than information coverage. To address this limitation, we propose LANCER, an LLM-based reranking method for Nugget Coverage. LANCER predicts which sub-questions should be answered to satisfy an information need, predicts which documents answer those sub-questions, and reranks the documents so that the top of the ranking covers as many sub-questions as possible. Our empirical results show that LANCER enhances retrieval quality as measured by nugget coverage metrics and achieves better α-nDCG and information coverage than other LLM-based reranking methods. Further analysis demonstrates that sub-question generation is a key component for optimizing coverage.
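The reranking step described above is reminiscent of greedy maximum coverage. The sketch below applies that classic heuristic to hypothetical LLM predictions of which sub-questions each document answers; LANCER's actual objective and components may differ.

```python
def rerank_for_coverage(doc_nuggets):
    """Greedy coverage reranking: repeatedly pick the document that answers
    the most not-yet-covered sub-questions. `doc_nuggets` maps a document id
    to the set of sub-question ids it is predicted to answer."""
    remaining = dict(doc_nuggets)
    covered, order = set(), []
    while remaining:
        # Largest marginal nugget gain wins; ties break by document id.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical predictions: which sub-questions (q*) each document answers.
docs = {"d1": {"q1", "q2"}, "d2": {"q2", "q3", "q4"}, "d3": {"q1"}}
print(rerank_for_coverage(docs))  # ['d2', 'd1', 'd3']
```

Greedy selection front-loads coverage: d2 answers three sub-questions, so it outranks d1 even though d1 might be the most individually relevant document under a relevance-only reranker.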