LLMs as Rankers, Rerankers & Judges
ECIR 2026 · Centrale (Plenary Room)
2026/03/31, 10:30 AM - 12:30 PM (Europe/Amsterdam)
Contact: conference-secretariat@blueboxevents.nl
OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning
Full papers · Machine Learning and Large Language Models · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Resolving complex information needs with multiple constraints requires enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well-suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems, as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog significantly boosts top-rank precision compared to LLM reasoning, with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens by ~90% per query-entity pair. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.
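The decoupling the abstract describes (per-predicate plausibility from an LLM, logical combination by a probabilistic engine) can be sketched as follows. This is a minimal illustration, not OrLog's actual engine: the operator semantics (product for conjunction, noisy-OR for disjunction, complement for negation) and the predicate-independence assumption are our own simplifications, and the example predicates are invented.

```python
def satisfy(node, scores):
    """Recursively score a query parse tree under simple probabilistic semantics.

    node: ("pred", name) | ("and", [children]) | ("or", [children]) | ("not", child)
    scores: dict mapping predicate name -> LLM plausibility in [0, 1]
    """
    kind = node[0]
    if kind == "pred":
        return scores[node[1]]
    if kind == "and":  # conjunction: product of child probabilities (independence assumed)
        p = 1.0
        for child in node[1]:
            p *= satisfy(child, scores)
        return p
    if kind == "or":   # disjunction: noisy-OR, i.e. 1 - P(all children fail)
        q = 1.0
        for child in node[1]:
            q *= 1.0 - satisfy(child, scores)
        return 1.0 - q
    if kind == "not":  # negation: complement
        return 1.0 - satisfy(node[1], scores)
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical query: is_novel AND (set_in_paris OR NOT translated)
query = ("and", [("pred", "is_novel"),
                 ("or", [("pred", "set_in_paris"),
                         ("not", ("pred", "translated"))])])
scores = {"is_novel": 0.9, "set_in_paris": 0.2, "translated": 0.5}
posterior = satisfy(query, scores)  # candidate entities are then ranked by this value
```

Ranking candidates by this posterior is what lets disjunctions and negations act on the answer set directly instead of being approximated inside a generative reasoning chain.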
Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that neural retrievers broadly prefer text generated by large language models (LLMs) over human-written text. This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference; the direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe, which reattaches a language modeling head to the fine-tuned dense retriever encoder, indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
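The perplexity-probe logic can be made concrete with a small sketch. This is our own illustration of the general idea, not the paper's probe: we assume per-token log-probabilities from the reattached LM head are available, and the `agreement` check (does the lower-perplexity side of a human/LLM pair also rank higher?) is a simplified stand-in for the paper's analysis.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp of mean NLL."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def agreement(pairs):
    """Fraction of cases where the perplexity hypothesis predicts the ranking.

    pairs: (human_ppl, llm_ppl, llm_ranked_higher) triples; a hit is when the
    LLM text has lower perplexity exactly when the retriever ranks it higher.
    Values near 0.5 (chance) weaken perplexity as an explanation of source bias.
    """
    hits = sum(1 for human_ppl, llm_ppl, llm_top in pairs
               if (llm_ppl < human_ppl) == llm_top)
    return hits / len(pairs)
```

A near-chance `agreement` score is exactly the pattern the abstract reports: perplexity and ranking preference come apart, pointing at training data rather than fluency.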
LLM-based Listwise Reranking under the Effect of Positional Bias
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias, where passages positioned towards the end of the input are less likely to be moved to top positions in the ranking. We hypothesize that there are two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to adjust for positional bias by re-weighting the contributions of different positions in the loss function during training. Position-aware augmentation expands the training data so that each passage appears equally often across positions in the input list. This approach markedly enhances both effectiveness and robustness to the original ranking across diverse first-stage retrievers, reducing the dependence of NDCG@10 performance on the position of relevant documents. DebiasFirst also complements inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
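The inverse-propensity re-weighting step can be sketched as below. This is an illustration of the general IPS recipe, not DebiasFirst's actual loss: we assume per-position propensities have been estimated beforehand, and the function name and clipping rule are our own.

```python
def ips_weighted_loss(per_position_losses, propensities, clip=0.1):
    """Average loss with inverse-propensity re-weighting over input positions.

    per_position_losses: loss contribution of the passage at each input slot
    propensities: estimated probability that a relevant passage at that slot
                  is handled correctly (typically lower for late positions)
    clip: floor on propensities, the standard trick to bound weight variance
    """
    total = 0.0
    for loss, p in zip(per_position_losses, propensities):
        total += loss / max(p, clip)  # rarely-promoted positions weigh more
    return total / len(per_position_losses)
```

Up-weighting late positions this way counteracts the tendency of the fine-tuned reranker to under-promote passages that arrive at the end of the input list.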
RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking
Full papers · Applications · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Conventional vision-based document retrievers operate at page-level granularity, compelling subsequent reranking models to process documents containing substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built upon a vision-language model assigns query-conditioned relevance scores to image patches. A subsequent masking step then suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we constructed new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements. On PDF-MVQA, it increases Recall@1 from 0.66 to 0.738 and MRR@3 from 0.735 to 0.789. On Paper-VISA, it improves Recall@1 from 0.64 to 0.677 and MRR@3 from 0.709 to 0.740, while also achieving competitive fine-grained grounding performance (F1=0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction. A case study also reveals our model's ability to identify multiple non-contiguous relevant regions, despite being trained exclusively with single-bounding-box supervision. RerAnchor effectively transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
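The masking step at the heart of Context Anchoring can be sketched as follows. This is a simplified stand-in, assuming the token-level classifier has already produced one query-conditioned relevance score per image patch; the top-fraction threshold rule and `keep_ratio` parameter are our own illustration of the "stable masking regime" idea, not the paper's exact mechanism.

```python
def mask_patches(patch_scores, keep_ratio=0.3):
    """Keep the top-scoring fraction of patches; mask the rest.

    patch_scores: query-conditioned relevance score per image patch
    keep_ratio: fraction of patches to retain (the masking regime)
    Returns a boolean keep-mask over patches; masked patches are
    suppressed before the late-interaction retriever scores the page.
    """
    k = max(1, int(len(patch_scores) * keep_ratio))
    threshold = sorted(patch_scores)[-k]  # score of the k-th best patch
    return [score >= threshold for score in patch_scores]
```

Because the mask is computed per patch rather than per region, non-contiguous relevant areas survive naturally, which is consistent with the multi-region behavior the case study reports.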
How role-play shapes relevance judgment in zero-shot LLM rankers
Full papers · Explainability methods
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Full papers · Explainability methods · Machine Learning and Large Language Models
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundedness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is semantically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
Presenters Yuxi Xia PhD Student, University Of Vienna
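A groundedness-style metric of the kind the abstract describes can be sketched as follows. This is our own simplified reading, not TracVC's definition: we assume influence estimation has already produced a ranked list of influential training examples, and that each example carries a label saying whether it is content-related (relevant to the question and answer) or a generic confidence expression.

```python
def content_groundedness(influential, content_related):
    """Fraction of influential training examples that are content-related.

    influential: ranked list of training-example ids returned by the
                 influence-estimation step for one confidence expression
    content_related: set of example ids judged relevant to the Q/A content
    Low values indicate the model's confidence wording is driven by generic
    confidence-verbalization examples rather than by content grounding.
    """
    if not influential:
        return 0.0
    hits = sum(1 for example_id in influential if example_id in content_related)
    return hits / len(influential)
```

Under this reading, the OLMo2-13B finding corresponds to systematically low scores: the top influential examples teach the *phrasing* of certainty, not the facts that would justify it.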
LANCER: LLM Reranking for Nugget Coverage
Full papers · Search and ranking
10:30 AM - 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 - 10:30 UTC
Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG applications require retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it requires not only relevant information but also coverage broad enough to support an elaborate response. Yet, existing retrieval methods are primarily optimized for relevance rather than information coverage. To address this limitation, we propose LANCER, an LLM-based rerAnking method for Nugget CovERage. LANCER predicts what sub-questions should be answered to satisfy an information need, predicts which documents answer these sub-questions, and reranks documents in order to provide a ranked list covering as many sub-questions as possible at the top of the ranking. Our empirical results show that LANCER enhances the quality of retrieval as measured by nugget coverage metrics and can achieve better α-nDCG and information coverage than other LLM-based reranking methods. Further analysis demonstrates that sub-question generation is one of the key components for optimizing coverage.
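The final reranking step can be sketched with a standard greedy coverage heuristic. This is illustrative only: we assume the earlier stages have already predicted, for each document, the set of sub-questions it answers, and the greedy marginal-gain rule (with an arbitrary id tie-break) is our own stand-in for LANCER's actual reranker.

```python
def coverage_rerank(doc_coverage):
    """Order documents to cover as many sub-questions as possible early.

    doc_coverage: dict mapping doc_id -> set of sub-question ids the
                  document is predicted to answer
    Greedily picks the document with the largest marginal gain in
    sub-question coverage, so the top of the ranking covers the most.
    """
    remaining = dict(doc_coverage)
    covered = set()
    order = []
    while remaining:
        # largest number of still-uncovered sub-questions; ties by doc_id
        best = max(remaining, key=lambda d: (len(remaining[d] - covered), d))
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

Greedy selection of this kind is the classic approximation for set-cover-style objectives, which is why it is a natural reading of "covering as many sub-questions as possible at the top of the ranking"; metrics like α-nDCG reward exactly this early, non-redundant coverage.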