LLMs as Rankers, Rerankers & Judges

Session Information

  • OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning
  • Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval
  • LLM-based Listwise Reranking under the Effect of Positional Bias
  • RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking
  • How role-play shapes relevance judgment in zero-shot LLM rankers
  • Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
  • LANCER: LLM Reranking for Nugget Coverage
Mar 31, 2026 10:30 - 12:30 (Europe/Amsterdam)
Venue: Centrale (Plenary Room)

Sub Sessions

OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning

Full papers | Machine Learning and Large Language Models | Search and ranking | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Resolving complex information needs with multiple constraints requires enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems, as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog can significantly boost top-rank precision compared to LLM reasoning, with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens per query-entity pair by roughly 90%. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.
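The decoupling described above can be illustrated with a toy sketch (not the paper's implementation): an LLM is assumed to have already produced plausibility scores for the atomic predicates, and a small probabilistic engine combines them under conjunction, disjunction, and negation, treating predicates as independent.

```python
# Illustrative sketch of combining predicate plausibilities with
# probabilistic logic. Assumes predicate scores are independent
# probabilities; the query structure and scores are invented examples.

def satisfy(expr, scores):
    """Posterior probability that a candidate satisfies a logical query.

    expr is a nested tuple: ("and", e1, ...), ("or", e1, ...),
    ("not", e1), or ("pred", name), where name looks up a score.
    """
    op = expr[0]
    if op == "pred":
        return scores[expr[1]]
    if op == "not":
        return 1.0 - satisfy(expr[1], scores)
    if op == "and":  # product rule under independence
        p = 1.0
        for sub in expr[1:]:
            p *= satisfy(sub, scores)
        return p
    if op == "or":   # complement of "no disjunct holds"
        q = 1.0
        for sub in expr[1:]:
            q *= 1.0 - satisfy(sub, scores)
        return 1.0 - q
    raise ValueError(op)

# Toy query: "a city in Europe that is not a capital"
scores = {"city": 0.95, "in_europe": 0.9, "capital": 0.2}
query = ("and", ("pred", "city"), ("pred", "in_europe"),
         ("not", ("pred", "capital")))
print(round(satisfy(query, scores), 3))  # 0.95 * 0.9 * 0.8 = 0.684
```

Candidates can then be ranked by this posterior, which is where the constraint-awareness comes from: a disjunctive query rewards covering any branch, while a negated predicate actively penalizes high plausibility.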
Presenters
MH
Mohanna Hoveyda
PhD Student, Radboud University
Co-Authors
JP
Jelle Piepenbrock
Eindhoven University Of Technology
AV
Arjen De Vries
Radboud University
MD
Maarten De Rijke
Distinguished University Professor, University Of Amsterdam
FH
Faegheh Hasibi
Radboud University

Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval

Full papers | Search and ranking | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that such retrievers exhibit a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
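The controlled preference comparison described above can be sketched as follows; the word-overlap scorer and the two triples are toy stand-ins for a dense retriever and the parallel human/LLM corpora, just to make the probe concrete.

```python
# Hedged sketch of a source-bias probe: score parallel human- and
# LLM-written passages for the same queries and measure how often a
# retriever prefers the LLM version. `score` is any query-passage
# relevance function; the toy scorer below is an assumption.

def source_preference(pairs, score):
    """Fraction of (query, human_text, llm_text) triples where the
    LLM-generated passage outranks its human-written counterpart."""
    wins = sum(1 for q, h, l in pairs if score(q, l) > score(q, h))
    return wins / len(pairs)

def overlap(query, passage):
    """Toy word-overlap scorer standing in for a dense retriever."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

pairs = [
    ("capital of france",
     "Paris is the capital city.",
     "The capital of France is Paris."),
    ("speed of light",
     "Light travels at 299792 km/s.",
     "Light moves very fast."),
]
print(source_preference(pairs, overlap))  # 0.5 on this toy data
```

A value near 0.5 indicates no systematic preference; tracking this statistic across training checkpoints is the kind of measurement that separates training-induced bias from an inherent model property.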
Presenters
WX
William Xion
PhD Student, Leibniz University Hannover, L3S Research Center
Co-Authors
WN
Wolfgang Nejdl
L3S Research Center

LLM-based Listwise Reranking under the Effect of Positional Bias

Full papers | Search and ranking | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias, where passages positioned towards the end of the input are less likely to be moved to top positions in the ranking. We hypothesize that there are two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to adjust for positional bias by re-weighting the contributions of different positions in the loss function during training. Position-aware augmentation augments training data to ensure that each passage appears equally across varied positions in the input list. This approach markedly enhances both effectiveness and robustness to the original ranking across diverse first-stage retrievers, reducing the dependence of NDCG@10 performance on the position of relevant documents. DebiasFirst also complements inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
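The inverse-propensity-scoring idea behind positional calibration can be sketched roughly like this; the per-position losses and propensity estimates are illustrative inputs, not the paper's actual training setup.

```python
# Minimal sketch of inverse propensity weighting in a listwise loss.
# Assumes we can estimate the propensity of a relevant passage being
# promoted from each input position; values below are invented.

def ips_weighted_loss(losses, propensities):
    """Re-weight per-position losses by 1/propensity so that positions
    the reranker tends to neglect contribute more to training."""
    assert len(losses) == len(propensities)
    return sum(l / p for l, p in zip(losses, propensities)) / len(losses)

# Toy example: later positions have lower propensity (stronger bias),
# so an identical raw loss there is up-weighted.
losses = [0.2, 0.2, 0.2, 0.2]
propensities = [0.9, 0.7, 0.5, 0.3]
print(round(ips_weighted_loss(losses, propensities), 3))
```

Position-aware augmentation is complementary: instead of re-weighting the loss, it permutes where each passage appears across training lists, so no position is systematically associated with relevance.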
Presenters
JQ
Jingfen Qiao
PhD Student, University Of Amsterdam
Co-Authors
EK
Evangelos Kanoulas
University Of Amsterdam
AY
Andrew Yates
Johns Hopkins University, HLTCOE

RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking

Full papers | Applications | Search and ranking | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Conventional vision-based document retrievers operate at page-level granularity, compelling subsequent reranking models to process documents containing substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built upon a vision-language model assigns query-conditioned relevance scores to image patches. A subsequent masking step then suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we constructed new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements. On PDF-MVQA, it increases Recall@1 from 0.66 to 0.738 and MRR@3 from 0.735 to 0.789. On Paper-VISA, it improves Recall@1 from 0.64 to 0.677 and MRR@3 from 0.709 to 0.740, while also achieving competitive fine-grained grounding performance (F1=0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction. A case study also reveals our model's ability to identify multiple non-contiguous relevant regions, despite being trained exclusively with single-bounding-box supervision. RerAnchor effectively transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
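A minimal sketch of the masking step, with NumPy arrays standing in for the vision-language model's patch scores and embeddings; the threshold value is an assumption, corresponding to the "masking regime" tuned in the paper.

```python
# Illustrative Context Anchoring step: query-conditioned patch
# relevance scores are assumed to be given by a token-level classifier,
# and low-scoring patches are masked out before late-interaction
# scoring. Shapes and the threshold are toy choices.
import numpy as np

def mask_patches(patch_embeddings, relevance_scores, threshold=0.3):
    """Zero out embeddings of patches whose query-conditioned
    relevance falls below the threshold; returns (masked, keep)."""
    keep = relevance_scores >= threshold
    return patch_embeddings * keep[:, None], keep

emb = np.ones((4, 8))                      # 4 patches, 8-dim embeddings
scores = np.array([0.9, 0.1, 0.5, 0.2])    # per-patch relevance
masked, keep = mask_patches(emb, scores)
print(keep.tolist())  # [True, False, True, False]
```

Raising the threshold trades noise reduction against evidence retention, which is exactly the balance the stable masking regime mentioned above is meant to strike.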
Presenters
TH
Tz-Huan Hsu
Data Scientist, CyCraft Technology Corporation Taiwan Branch
Co-Authors
SH
Sian-Yao Huang
Data Scientist Technical Lead, CyCraft Technology Corporation Taiwan Branch
KL
KuanLun Liao
Data Scientist, CyCraft Technology
CL
Che-Yu Lin
CY
Cheng-Lin Yang
CyCraft AI Lab

How role-play shapes relevance judgment in zero-shot LLM rankers

Full papers | Explainability methods | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
Presenters
YW
Yumeng Wang
PhD Student, Leiden University
Co-Authors
JQ
Jirui Qi
Center For Language And Cognition, University Of Groningen
CC
Catherine Chen
PhD Candidate, Brown University
PE
Panagiotis Eustratiadis
University Of Amsterdam
Suzan Verberne
Leiden University

Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

Full papers | Explainability methods | Machine Learning and Large Language Models | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is semantically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
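A metric in the spirit of content groundness might be sketched as follows; the influence ranking and the content-relatedness labels are assumed to be given by upstream components (influence estimation and relevance judgment), and all example ids are invented.

```python
# Hedged sketch of a content-groundness-style metric: among the top-k
# most influential training examples for a confidence expression, what
# fraction are content-related (relevant to the question and answer)
# rather than generic confidence phrasing?

def content_groundness(influential, labels, k=5):
    """influential: example ids ranked by influence score (descending).
    labels: id -> True if the example is content-related to the query.
    Returns the content-related fraction of the top-k examples."""
    top = influential[:k]
    return sum(labels[i] for i in top) / len(top)

ranked = ["ex7", "ex2", "ex9", "ex4", "ex1", "ex3"]
labels = {"ex7": True, "ex2": False, "ex9": True,
          "ex4": False, "ex1": False, "ex3": True}
print(content_groundness(ranked, labels))  # 2 of top 5 -> 0.4
```

A low value on this kind of probe is what would suggest the model is imitating how confidence sounds rather than grounding it in content.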
Presenters
YX
Yuxi Xia
PhD Student, University Of Vienna

LANCER: LLM Reranking for Nugget Coverage

Full papers | Search and ranking | 10:30 AM - 12:30 PM (Europe/Amsterdam) | 2026/03/31 08:30:00 UTC - 2026/03/31 10:30:00 UTC
Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG applications require retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it needs not only relevant information but also a more elaborate response that draws on more of it. Yet existing retrieval methods are primarily optimized for relevance rather than information coverage. To address this limitation, we propose LANCER, an LLM-based reranking method for nugget coverage. LANCER predicts which sub-questions should be answered to satisfy an information need, predicts which documents answer these sub-questions, and reranks documents to provide a ranked list covering as many sub-questions as possible at the top of the ranking. Our empirical results show that LANCER enhances retrieval quality as measured by nugget coverage metrics and can achieve better α-nDCG and information coverage than other LLM-based reranking methods. Further analysis demonstrates that sub-question generation is one of the key components for optimizing coverage.
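The coverage-oriented reranking idea can be sketched with a greedy selection over predicted sub-question coverage; this is an illustrative approximation, not necessarily LANCER's exact procedure, and the coverage sets are invented.

```python
# Sketch of coverage-maximizing reranking: given a prediction of which
# sub-questions each document answers, greedily order documents so each
# pick covers as many not-yet-covered sub-questions as possible.

def coverage_rerank(doc_nuggets):
    """doc_nuggets: dict of doc id -> set of covered sub-questions.
    Returns a ranking that greedily maximizes marginal coverage."""
    remaining = dict(doc_nuggets)
    covered, ranking = set(), []
    while remaining:
        # pick the doc adding the most new sub-questions (ties: doc id)
        best = max(sorted(remaining),
                   key=lambda d: len(remaining[d] - covered))
        ranking.append(best)
        covered |= remaining.pop(best)
    return ranking

docs = {
    "d1": {"q1", "q2"},
    "d2": {"q2", "q3", "q4"},
    "d3": {"q1"},
}
print(coverage_rerank(docs))  # ['d2', 'd1', 'd3']
```

Note how d3 falls to the bottom despite being relevant: everything it covers is already covered by d1, which is exactly the redundancy that α-nDCG-style metrics penalize.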
Presenters
JJ
Jia-Huei Ju
PhD Student, University Of Amsterdam
Co-Authors
FL
François G. Landry
Eugene Yang
Research Scientist, Human Language Technology Center Of Excellence, Johns Hopkins University
Suzan Verberne
Leiden University
AY
Andrew Yates
Johns Hopkins University, HLTCOE