
LLMs as Rankers, Rerankers & Judges


Session Information

  • Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval
  • OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning
  • LLM-based Listwise Reranking under the Effect of Positional Bias
  • RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking
  • How role-play shapes relevance judgment in zero-shot LLM rankers
  • Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
  • LANCER: LLM Reranking for Nugget Coverage
Mar 31, 2026, 10:30 – 12:30 (Europe/Amsterdam)
Venue: Centrale (Plenary Room)

Sub Sessions

Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that dense retrievers broadly prefer text generated by large language models (LLMs) over human-written text. This bias is called "source bias", and it has been hypothesized that the lower perplexity of LLM-generated text contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference; the direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe, which reattaches a language modeling head to the fine-tuned dense retriever encoder, indicates agreement between perplexity and relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.
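The parallel-corpus setup in the abstract can be sketched as a simple bias measurement: for each query, compare a retriever's score for the human-written document with its score for the LLM-generated counterpart. This is an illustrative sketch, not the paper's code; the scores below are made-up values standing in for real retriever similarities.

```python
# Sketch: quantifying source bias on a parallel corpus where each query
# has a human-written and an LLM-generated version of the same relevant
# document. A positive delta means the retriever favors the LLM variant.

def source_bias_delta(pairs):
    """pairs: list of (score_human, score_llm) for parallel documents.
    Returns the mean score difference (llm - human) and the fraction of
    pairs in which the LLM variant scores higher."""
    diffs = [llm - human for human, llm in pairs]
    win_rate = sum(d > 0 for d in diffs) / len(diffs)
    return sum(diffs) / len(diffs), win_rate

# Toy similarity scores for three query-document pairs.
pairs = [(0.71, 0.74), (0.63, 0.68), (0.80, 0.78)]
mean_delta, win_rate = source_bias_delta(pairs)
```

Running the same measurement on unsupervised versus fine-tuned checkpoints is what lets the study attribute the bias to training rather than to the retriever architecture.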
Presenters
William Xion, PhD Student, Leibniz University Hannover, L3S Research Center
Co-Authors
Wolfgang Nejdl, L3S Research Center

OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning

Full papers · Machine Learning and Large Language Models · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Presenters
Mohanna Hoveyda, PhD Student, Radboud University
Co-Authors
Jelle Piepenbrock, Eindhoven University of Technology
Arjen de Vries, Radboud University
Maarten de Rijke, Distinguished University Professor, University of Amsterdam
Faegheh Hasibi, Radboud University

LLM-based Listwise Reranking under the Effect of Positional Bias

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
LLM-based listwise passage reranking has attracted attention for its effectiveness in ranking candidate passages. However, these models suffer from positional bias, where passages positioned towards the end of the input are less likely to be moved to top positions in the ranking. We hypothesize that there are two primary sources of positional bias: (1) architectural bias inherent in LLMs and (2) the imbalanced positioning of relevant documents. To address this, we propose DebiasFirst, a method that integrates positional calibration and position-aware data augmentation during fine-tuning. Positional calibration uses inverse propensity scoring to adjust for positional bias by re-weighting the contributions of different positions in the loss function during training. Position-aware augmentation augments training data to ensure that each passage appears equally across varied positions in the input list. This approach markedly enhances both effectiveness and robustness to the original ranking across diverse first-stage retrievers, reducing the dependence of NDCG@10 performance on the position of relevant documents. DebiasFirst also complements inference-stage debiasing methods, offering a practical solution for mitigating positional bias in reranking.
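The positional-calibration step can be sketched as inverse-propensity re-weighting of per-position loss terms: positions where relevant passages are over-represented in the training data are down-weighted, so the model is not rewarded for simply favoring early slots. This is a minimal sketch, not DebiasFirst's implementation; the loss values and propensities are illustrative.

```python
# Sketch: inverse propensity scoring applied to a per-position loss.
# propensities[i] estimates how often a relevant passage appears at
# input position i in the training data; dividing by it counteracts
# that imbalance in the aggregate loss.

def ips_weighted_loss(per_position_losses, propensities):
    """Sum of per-position losses, each re-weighted by the inverse of
    its estimated position propensity."""
    return sum(l / p for l, p in zip(per_position_losses, propensities))

losses = [0.9, 0.6, 0.3]   # toy per-position loss contributions
props = [0.5, 0.3, 0.2]    # relevant docs skew toward early positions
weighted = ips_weighted_loss(losses, props)
```

The paper's second component, position-aware augmentation, attacks the same imbalance from the data side by permuting each passage through all input positions.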
Presenters
Jingfen Qiao, PhD Student, University of Amsterdam
Co-Authors
Evangelos Kanoulas, University of Amsterdam
Andrew Yates, Johns Hopkins University, HLTCOE

RerAnchor: Anchoring Important Context in Multi-Modal Document Reranking

Full papers · Applications · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Conventional vision-based document retrievers operate at page-level granularity, compelling subsequent reranking models to process documents containing substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built upon a vision-language model assigns query-conditioned relevance scores to image patches. A subsequent masking step then suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we constructed new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements. On PDF-MVQA, it increases Recall@1 from 0.66 to 0.738 and MRR@3 from 0.735 to 0.789. On Paper-VISA, it improves Recall@1 from 0.64 to 0.677 and MRR@3 from 0.709 to 0.740, while also achieving competitive fine-grained grounding performance (F1=0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction. A case study also reveals our model's ability to identify multiple non-contiguous relevant regions, despite being trained exclusively with single-bounding-box supervision. RerAnchor effectively transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
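The Context Anchoring idea described above can be sketched in a few lines: score image patches against the query, then suppress low-scoring patches before the late-interaction scorer runs. This is a sketch under stated assumptions, not RerAnchor's released code; in the real system the patch scores come from a vision-language model, whereas here they are toy values.

```python
# Sketch: query-conditioned patch masking. Patches scoring below a
# threshold are masked out, denoising the page image before final
# late-interaction scoring.

def anchor_mask(patch_scores, threshold):
    """Return a keep/suppress mask over patches: True keeps the patch."""
    return [s >= threshold for s in patch_scores]

scores = [0.9, 0.1, 0.4, 0.7]   # toy query-conditioned patch relevance
mask = anchor_mask(scores, threshold=0.4)
```

The abstract's "stable masking regime" corresponds to choosing this threshold so that enough evidence survives while most irrelevant page content is dropped.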
Presenters
Tz-Huan Hsu, Data Scientist, CyCraft Technology Corporation Taiwan Branch
Co-Authors
Cheng-Lin Yang, CyCraft AI Lab

How role-play shapes relevance judgment in zero-shot LLM rankers

Full papers · Explainability methods · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
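The prompt manipulation being studied can be illustrated as a role-play preamble prepended to a zero-shot pointwise relevance prompt. The role text and prompt wording below are assumed examples for illustration, not the paper's actual prompts.

```python
# Sketch: building a zero-shot relevance-judgment prompt with and
# without a role-play preamble. The paper studies how such role
# formulations shape the resulting ranking quality.

def relevance_prompt(query, passage, role=None):
    """Pointwise relevance prompt; `role` optionally adds a role-play
    preamble such as "a search-quality rater"."""
    preamble = f"You are {role}.\n" if role else ""
    return (preamble
            + f"Query: {query}\nPassage: {passage}\n"
            + "Is the passage relevant to the query? Answer Yes or No.")

plain = relevance_prompt("history of ECIR", "The first edition was held decades ago.")
roled = relevance_prompt("history of ECIR", "The first edition was held decades ago.",
                         role="a search-quality rater")
```

The paper's causal interventions then compare internal activations between such prompt variants to locate where the role signal is encoded.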
Presenters
Yumeng Wang, Leiden Institute of Advanced Computer Science, Leiden University
Co-Authors
Jirui Qi, Center for Language and Cognition, University of Groningen
Catherine Chen, PhD Candidate, Brown University
Panagiotis Eustratiadis, University of Amsterdam
Suzan Verberne, Leiden University

Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

Full papers · Explainability methods · Machine Learning and Large Language Models · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Presenters
Yuxi Xia, PhD Student, University of Vienna

LANCER: LLM Reranking for Nugget Coverage

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam) · 2026/03/31 08:30 – 10:30 UTC
Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG applications require retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it requires not only relevant information but also a more elaborate response that covers a broader range of content. Yet, existing retrieval methods are primarily optimized for relevance rather than information coverage. To address this limitation, we propose LANCER, an LLM-based rerAnking method for Nugget CovERage. LANCER predicts which sub-questions should be answered to satisfy an information need, predicts which documents answer these sub-questions, and reranks documents so that the top of the ranked list covers as many sub-questions as possible. Our empirical results show that LANCER enhances the quality of retrieval as measured by nugget coverage metrics and can achieve better α-nDCG and information coverage than other LLM-based reranking methods. Further analysis demonstrates that sub-question generation is one of the key components for optimizing coverage.
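The coverage-oriented reranking step can be sketched as a greedy selection: given a prediction of which sub-questions each candidate document answers, repeatedly promote the document that covers the most not-yet-covered sub-questions. This is an illustrative sketch of the coverage idea, not LANCER's implementation; the document-to-sub-question mapping below is made up, and in the paper it is produced by an LLM.

```python
# Sketch: greedy reranking for sub-question coverage, so that the top
# of the ranking answers as many distinct sub-questions as possible.

def coverage_rerank(doc_coverage):
    """doc_coverage: dict mapping doc_id -> set of sub-question ids the
    document is predicted to answer. Returns a reranked doc_id list."""
    remaining = dict(doc_coverage)
    covered, order = set(), []
    while remaining:
        # Pick the doc adding the most new sub-questions; break ties
        # deterministically on the doc id.
        best = max(remaining, key=lambda d: (len(remaining[d] - covered), d))
        order.append(best)
        covered |= remaining.pop(best)
    return order

docs = {"d1": {"q1", "q2"}, "d2": {"q2"}, "d3": {"q3", "q4", "q5"}}
ranking = coverage_rerank(docs)
```

Metrics such as α-nDCG reward exactly this behavior: redundant documents contribute less once their sub-questions are already covered higher in the list.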
Presenters
Dylan Jia-Huei Ju, PhD Student, University of Amsterdam
Co-Authors
Eugene Yang, Research Scientist, Human Language Technology Center of Excellence, Johns Hopkins University
Suzan Verberne, Leiden University
Andrew Yates, Johns Hopkins University, HLTCOE
Session Participants

Session speakers, moderators & attendees:
Professor, University of Padova
