LLMs as Rankers, Rerankers & Judges
ECIR 2026 · Centrale (Plenary Room)
2026/03/31, 10:30 AM - 12:30 PM (Europe/Amsterdam)
Contact: conference-secretariat@blueboxevents.nl
OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning
Full papers · Machine Learning and Large Language Models · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Resolving complex information needs with multiple constraints requires enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well-suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems, as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog significantly boosts top-rank precision compared to LLM reasoning, with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens by ~90% per query-entity pair. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.
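The decoupling the abstract describes (per-predicate plausibility from an LLM, logical combination by a probabilistic engine) can be sketched as follows. This is a minimal illustration, not OrLog's actual engine: the operator semantics (product for conjunction, noisy-OR for disjunction, complement for negation) and the predicate-independence assumption are our own simplifications, and the example predicates are invented.

```python
def satisfy(node, scores):
    """Recursively score a query parse tree under simple probabilistic semantics.

    node: ("pred", name) | ("and", [children]) | ("or", [children]) | ("not", child)
    scores: dict mapping predicate name -> LLM plausibility in [0, 1]
    """
    kind = node[0]
    if kind == "pred":
        return scores[node[1]]
    if kind == "and":  # conjunction: product of child probabilities (independence assumed)
        p = 1.0
        for child in node[1]:
            p *= satisfy(child, scores)
        return p
    if kind == "or":   # disjunction: noisy-OR, i.e. 1 - P(all children fail)
        q = 1.0
        for child in node[1]:
            q *= 1.0 - satisfy(child, scores)
        return 1.0 - q
    if kind == "not":  # negation: complement
        return 1.0 - satisfy(node[1], scores)
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical query: is_novel AND (set_in_paris OR NOT translated)
query = ("and", [("pred", "is_novel"),
                 ("or", [("pred", "set_in_paris"),
                         ("not", ("pred", "translated"))])])
scores = {"is_novel": 0.9, "set_in_paris": 0.2, "translated": 0.5}
posterior = satisfy(query, scores)  # candidate entities are then ranked by this value
```

Ranking candidates by this posterior is what lets disjunctions and negations act on the answer set directly instead of being approximated inside a generative reasoning chain.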
Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that neural retrievers broadly prefer text generated by large language models (LLMs) over human-written text. This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference; the direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe, which reattaches a language modeling head to the fine-tuned dense retriever encoder, indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
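The perplexity-probe logic can be made concrete with a small sketch. This is our own illustration of the general idea, not the paper's probe: we assume per-token log-probabilities from the reattached LM head are available, and the `agreement` check (does the lower-perplexity side of a human/LLM pair also rank higher?) is a simplified stand-in for the paper's analysis.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp of mean NLL."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def agreement(pairs):
    """Fraction of cases where the perplexity hypothesis predicts the ranking.

    pairs: (human_ppl, llm_ppl, llm_ranked_higher) triples; a hit is when the
    LLM text has lower perplexity exactly when the retriever ranks it higher.
    Values near 0.5 (chance) weaken perplexity as an explanation of source bias.
    """
    hits = sum(1 for human_ppl, llm_ppl, llm_top in pairs
               if (llm_ppl < human_ppl) == llm_top)
    return hits / len(pairs)
```

A near-chance `agreement` score is exactly the pattern the abstract reports: perplexity and ranking preference come apart, pointing at training data rather than fluency.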
LLM-based Listwise Reranking under the Effect of Positional Bias
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias, where passages positioned towards the end of the input are less likely to be moved to top positions in the ranking. We hypothesize that there are two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to adjust for positional bias by re-weighting the contributions of different positions in the loss function during training. Position-aware augmentation expands the training data so that each passage appears equally often across positions in the input list. This approach markedly enhances both effectiveness and robustness to the original ranking across diverse first-stage retrievers, reducing the dependence of NDCG@10 performance on the position of relevant documents. DebiasFirst also complements inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
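The inverse-propensity re-weighting step can be sketched as below. This is an illustration of the general IPS recipe, not DebiasFirst's actual loss: we assume per-position propensities have been estimated beforehand, and the function name and clipping rule are our own.

```python
def ips_weighted_loss(per_position_losses, propensities, clip=0.1):
    """Average loss with inverse-propensity re-weighting over input positions.

    per_position_losses: loss contribution of the passage at each input slot
    propensities: estimated probability that a relevant passage at that slot
                  is handled correctly (typically lower for late positions)
    clip: floor on propensities, the standard trick to bound weight variance
    """
    total = 0.0
    for loss, p in zip(per_position_losses, propensities):
        total += loss / max(p, clip)  # rarely-promoted positions weigh more
    return total / len(per_position_losses)
```

Up-weighting late positions this way counteracts the tendency of the fine-tuned reranker to under-promote passages that arrive at the end of the input list.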
RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking
Full papers · Applications · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Conventional vision-based document retrievers operate at page-level granularity, compelling subsequent reranking models to process documents containing substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built upon a vision-language model assigns query-conditioned relevance scores to image patches. A subsequent masking step then suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we constructed new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements. On PDF-MVQA, it increases Recall@1 from 0.66 to 0.738 and MRR@3 from 0.735 to 0.789. On Paper-VISA, it improves Recall@1 from 0.64 to 0.677 and MRR@3 from 0.709 to 0.740, while also achieving competitive fine-grained grounding performance (F1=0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction. A case study also reveals our model's ability to identify multiple non-contiguous relevant regions, despite being trained exclusively with single-bounding-box supervision. RerAnchor effectively transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
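The masking step at the heart of Context Anchoring can be sketched as follows. This is a simplified stand-in, assuming the token-level classifier has already produced one query-conditioned relevance score per image patch; the top-fraction threshold rule and `keep_ratio` parameter are our own illustration of the "stable masking regime" idea, not the paper's exact mechanism.

```python
def mask_patches(patch_scores, keep_ratio=0.3):
    """Keep the top-scoring fraction of patches; mask the rest.

    patch_scores: query-conditioned relevance score per image patch
    keep_ratio: fraction of patches to retain (the masking regime)
    Returns a boolean keep-mask over patches; masked patches are
    suppressed before the late-interaction retriever scores the page.
    """
    k = max(1, int(len(patch_scores) * keep_ratio))
    threshold = sorted(patch_scores)[-k]  # score of the k-th best patch
    return [score >= threshold for score in patch_scores]
```

Because the mask is computed per patch rather than per region, non-contiguous relevant areas survive naturally, which is consistent with the multi-region behavior the case study reports.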
How role-play shapes relevance judgment in zero-shot LLM rankers
Full papers · Explainability methods
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Full papers · Explainability methods · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundedness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is semantically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
Presenters Yuxi Xia PhD Student, University Of Vienna
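A groundedness-style metric of the kind the abstract describes can be sketched as follows. This is our own simplified reading, not TracVC's definition: we assume influence estimation has already produced a ranked list of influential training examples, and that each example carries a label saying whether it is content-related (relevant to the question and answer) or a generic confidence expression.

```python
def content_groundedness(influential, content_related):
    """Fraction of influential training examples that are content-related.

    influential: ranked list of training-example ids returned by the
                 influence-estimation step for one confidence expression
    content_related: set of example ids judged relevant to the Q/A content
    Low values indicate the model's confidence wording is driven by generic
    confidence-verbalization examples rather than by content grounding.
    """
    if not influential:
        return 0.0
    hits = sum(1 for example_id in influential if example_id in content_related)
    return hits / len(influential)
```

Under this reading, the OLMo2-13B finding corresponds to systematically low scores: the top influential examples teach the *phrasing* of certainty, not the facts that would justify it.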
LANCER: LLM Reranking for Nugget Coverage
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG applications require retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it requires not only relevant information but also coverage broad enough to support an elaborate response. Yet, existing retrieval methods are primarily optimized for relevance rather than information coverage. To address this limitation, we propose LANCER, an LLM-based rerAnking method for Nugget CovERage. LANCER predicts what sub-questions should be answered to satisfy an information need, predicts which documents answer these sub-questions, and reranks documents in order to provide a ranked list covering as many sub-questions as possible at the top of the ranking. Our empirical results show that LANCER enhances the quality of retrieval as measured by nugget coverage metrics and can achieve better α-nDCG and information coverage than other LLM-based reranking methods. Further analysis demonstrates that sub-question generation is one of the key components for optimizing coverage.
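The final reranking step can be sketched with a standard greedy coverage heuristic. This is illustrative only: we assume the earlier stages have already predicted, for each document, the set of sub-questions it answers, and the greedy marginal-gain rule (with an arbitrary id tie-break) is our own stand-in for LANCER's actual reranker.

```python
def coverage_rerank(doc_coverage):
    """Order documents to cover as many sub-questions as possible early.

    doc_coverage: dict mapping doc_id -> set of sub-question ids the
                  document is predicted to answer
    Greedily picks the document with the largest marginal gain in
    sub-question coverage, so the top of the ranking covers the most.
    """
    remaining = dict(doc_coverage)
    covered = set()
    order = []
    while remaining:
        # largest number of still-uncovered sub-questions; ties by doc_id
        best = max(remaining, key=lambda d: (len(remaining[d] - covered), d))
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

Greedy selection of this kind is the classic approximation for set-cover-style objectives, which is why it is a natural reading of "covering as many sub-questions as possible at the top of the ranking"; metrics like α-nDCG reward exactly this early, non-redundant coverage.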