
Core Retrieval Models, Representations & Evaluation


Session Information

  • Sample-Free Almost-Exact Estimation of Plackett-Luce Propensities for Off-Policy Ranking
  • Validating Search Query Simulations: A Taxonomy of Measures
  • Reducing Human Effort to Validate LLM Relevance Judgements via Stratified Sampling
  • Revealing MonoT5's Learning Mechanisms via Prompt-Token Adaptation
  • When Reducing Representations Improves Performance
  • An Empirical Study of Model Casing in Learned Sparse Retrieval
  • Improving Instruction-Aware Retrieval with Query-Preserving Regularization
Mar 30, 2026, 10:30–12:30 (Europe/Amsterdam)
Venue: Centrale (Plenary Room)

Sub Sessions

Sample-Free Almost-Exact Estimation of Plackett-Luce Propensities for Off-Policy Ranking

Full papers · Machine Learning and Large Language Models · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Off-policy evaluation (OPE) and optimization for learning to rank (LTR) leverage document placement probabilities to correct for the effects of various statistical biases, e.g., position bias. However, computing these propensities poses a challenge, as for most ranking models this requires iterating over all possible rankings. A common solution is to approximate them by sampling multiple rankings and using the observed document frequencies per position. Nevertheless, even when using extremely large numbers of sampled rankings, these estimates often still contain significant estimation errors. In this work, we propose the novel marginalized Plackett-Luce (MPL) method to efficiently and accurately calculate document-rank placement probabilities under the widely used Plackett-Luce (PL) ranking model. In particular, we establish MPL by first showing that this probability is the expected value of a Poisson binomial distribution over the document scores; subsequently, we leverage a known connection between the Poisson binomial distribution, convolutional operations and numerical integration, to achieve efficient and accurate propensity estimation. Furthermore, we argue that MPL provides near-exact estimation when computing the function over a practical number of evaluation points. Our experiments confirm that the propensity estimation of MPL is highly accurate, efficient, and leads to substantial improvements over the sampling-based method in downstream applications, thus opening the door to a wider use of PL policies in off-policy learning to rank.
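The construction is easiest to see through the Gumbel-max view of Plackett-Luce: conditioned on document d's own Gumbel noise, every other document outranks d independently, so d's rank follows a Poisson binomial distribution that can be marginalized over the noise by numerical integration. The sketch below illustrates that idea under stated assumptions (a fixed quadrature grid and the standard O(n²) Poisson-binomial convolution); it is not the authors' implementation.

```python
import numpy as np

def placement_propensities(scores: np.ndarray, d: int, n_points: int = 512) -> np.ndarray:
    """P(document d is placed at rank k), k = 0..n-1, under a Plackett-Luce
    model with logits `scores`, via the Gumbel-max reformulation."""
    n = len(scores)
    # Quadrature grid over d's own Gumbel noise, weighted by the Gumbel(0,1) pdf.
    g = np.linspace(-10.0, 10.0, n_points)
    w = np.exp(-g - np.exp(-g))
    w /= w.sum()  # normalise so the returned pmf sums to 1

    others = np.delete(scores, d)
    pmf = np.zeros(n)
    for gi, wi in zip(g, w):
        # Given G_d = gi, document j outranks d independently with this probability.
        p_beat = 1.0 - np.exp(-np.exp(-(scores[d] + gi - others)))
        # Poisson binomial pmf of "number of documents that beat d",
        # built up by the standard O(n^2) convolution.
        pb = np.zeros(n)
        pb[0] = 1.0
        for p in p_beat:
            pb[1:] = pb[1:] * (1.0 - p) + pb[:-1] * p
            pb[0] *= (1.0 - p)
        pmf += wi * pb
    return pmf
```

As a sanity check, placement_propensities(np.array([2.0, 1.0, 0.0]), d=0)[0] ≈ exp(2)/(exp(2)+exp(1)+1), the exact PL first-pick probability, while a Monte Carlo estimate of the same quantity would still be noisy after many sampled rankings; that contrast mirrors the efficiency argument the abstract makes.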
Presenters
Norman Knyazev
PhD Candidate, Radboud University
Co-Authors
Harrie Oosterhuis
Assistant Professor, Radboud University

Validating Search Query Simulations: A Taxonomy of Measures

Full papers · Evaluation research · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Assessing the validity of user simulators when used for the evaluation of information retrieval systems remains an open question, constraining their effective use and the reliability of simulation-based results. To address this issue, we conduct a comprehensive literature review with a particular focus on methods for the validation of simulated user queries with regard to real queries. Based on the review, we develop a taxonomy that structures the current landscape of available measures. We empirically corroborate the taxonomy by analyzing the relationships between the different measures applied to four different datasets representing diverse search scenarios. Finally, we provide concrete recommendations on which measures or combinations of measures should be considered when validating user simulation in different contexts. Furthermore, we release a dedicated library with the most commonly used measures to facilitate future research.
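As a flavour of the kind of measure such a taxonomy organises, the sketch below computes one simple distributional validity score: the KL divergence between the unigram term distributions of a real and a simulated query set. The function name and the crude floor for unseen terms are illustrative assumptions, not the paper's released library.

```python
from collections import Counter
import math

def term_kl(real_queries, simulated_queries, eps=1e-9):
    """KL(real || simulated) over unigram term distributions of two query sets.
    `eps` is a crude floor for terms unseen in the simulated set
    (illustrative, not principled smoothing)."""
    def dist(queries):
        counts = Counter(t for q in queries for t in q.lower().split())
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    p, q = dist(real_queries), dist(simulated_queries)
    # Terms with zero probability under p contribute nothing to KL(p || q).
    return sum(pt * math.log(pt / q.get(t, eps)) for t, pt in p.items())
```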
Presenters
Andreas Konstantin Kruff
PhD Student, TH Köln
Co-Authors
Nolwenn Bernard
TH Köln
Philipp Schaer
Professor, TH Köln

Reducing Human Effort to Validate LLM Relevance Judgements via Stratified Sampling

Full papers · Evaluation research · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Information Retrieval (IR) evaluation deeply relies on human-made relevance judgments. To overcome the high costs of the judgment collection process, a potential solution is to utilize LLMs as judges to replace human annotators. However, the validation of LLM-generated judgments is fundamental for informed use. Standard validation approaches typically rely on simple sampling techniques to collect a sample of the LLM-generated judgments and estimate the LLM's agreement with human annotators. In this work, we propose using stratified sampling, a more sophisticated sampling strategy that, by leveraging appropriate stratification features, reduces human involvement in the validation process while still providing statistical guarantees on the human-LLM agreement estimate. Through the analysis of various candidate features, we identify the LLM-generated judgments themselves as the most promising one. Our approach achieves up to an 85% reduction in the required human involvement in the validation process.
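The estimator itself is standard stratified sampling: partition the LLM-judged items by a stratification feature (here the LLM's own label, the feature the abstract identifies as most promising), sample within each stratum, and combine per-stratum agreement rates weighted by stratum size. A minimal sketch, where human_label_fn is a hypothetical stand-in for collecting a human judgment:

```python
import random

def stratified_agreement(llm_labels, human_label_fn, n_per_stratum=50, seed=0):
    """Estimate human-LLM agreement, stratifying on the LLM's own judgement.
    `human_label_fn(i)` stands in for obtaining a human label for item i;
    names here are illustrative, not the paper's API."""
    rng = random.Random(seed)
    strata = {}
    for i, lab in enumerate(llm_labels):
        strata.setdefault(lab, []).append(i)

    total = len(llm_labels)
    estimate = 0.0
    for lab, items in strata.items():
        sample = rng.sample(items, min(n_per_stratum, len(items)))
        agree = sum(human_label_fn(i) == llm_labels[i] for i in sample) / len(sample)
        estimate += (len(items) / total) * agree  # weight by stratum size
    return estimate
```

Because variance is controlled within each stratum, far fewer human labels are needed for the same confidence than with simple random sampling, which is where the reported reduction in human effort comes from.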
Presenters
Simone Merlo
Ph.D. Student, University Of Padova
Co-Authors
Stefano Marchesin
University Of Padua
Guglielmo Faggioli
University Of Padova
Nicola Ferro
Full Professor, University Of Padova

Revealing MonoT5's Learning Mechanisms via Prompt-Token Adaptation

Full papers · Machine Learning and Large Language Models · Recommender systems · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Transformer-based cross-encoders, such as MonoT5, achieve state-of-the-art performance in several retrieval tasks. In particular, MonoT5 is based on a sequence-to-sequence architecture and trained on MS MARCO for passage re-ranking, where it predicts the relevance of a passage given an input query. In this paper, to understand what MonoT5 learned, we analyse the parameter updates during its training process. We observe that the largest shifts occur in a small set of parameters, i.e. less than 1% of the model, while the rest of the model remains relatively unchanged. Motivated by this finding, we propose Light-MonoT5, a parameter-efficient variant of MonoT5 that updates only this small set of parameters during training, and leaves the rest of the network unchanged. Extensive evaluation on both in- and out-domain benchmarks shows that Light-MonoT5 achieves statistically equivalent effectiveness compared to MonoT5. Since relevance can be captured by updating only a subset of T5 parameters, we hypothesise that MonoT5, which updates all the original model's parameters, primarily learns to evaluate passage quality rather than explicitly assessing the relevance of a passage to the query. To test our hypothesis, we employ QT5, a T5-based quality estimation model, to prune low-quality passages before indexing. On the pruned collection, Light-MonoT5 achieves performance on par with MonoT5, indicating that MonoT5's strong performance is largely attributable to quality assessment, with minimal adaptation required once low-quality content is removed.
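The parameter-efficient recipe can be sketched in a few lines of PyTorch: freeze the whole network, then re-enable gradients only for a pre-identified set of parameters. Which parameters to keep trainable is exactly what the paper derives from its update analysis; here trainable_names is an assumed input and the helper is illustrative, not the authors' code.

```python
import torch

def freeze_all_but(model: torch.nn.Module, trainable_names):
    """Freeze every parameter except those whose name starts with an entry
    in `trainable_names` (assumed to be identified beforehand, e.g. by
    analysing which parameters shift most during full fine-tuning)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(t) for t in trainable_names)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"training {trainable / total:.2%} of parameters")
```

Per the abstract, the trainable set covers less than 1% of the model, so the optimizer state and gradient computation shrink accordingly while effectiveness stays statistically equivalent.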
Presenters
Marco Braga
PhD Student, Politecnico Di Torino

When Reducing Representations Improves Performance

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Neural models have transformed Information Retrieval (IR) by enabling semantic search, representing queries and documents as dense embeddings in latent spaces. However, recent work indicates that the contribution of individual dimensions in these representations to ranking quality is uneven: some dimensions are essential, while others may even degrade performance. Dimension IMportance Estimators (DIMEs) are heuristics to guide the search for the subsets of dimensions that induce an optimal subspace where retrieval is more effective. To explore these subspaces, DIMEs rely on two simplifying assumptions: the linearity of subspaces and the independence of dimensions. In this paper, we move a step forward by relaxing the independence assumption and employing genetic algorithms to select the optimal set of dimensions. We show that selecting optimal dimensions for individual queries can achieve up to 0.981 nDCG@10 and 0.831 AP using state-of-the-art dense retrieval models on the considered datasets. Additionally, we identify subsets of dimensions that improve ranking quality across multiple queries simultaneously. Finally, we show that a dataset-specific subset of dimensions enables dense retrieval models to generalize across other datasets without loss of performance.
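Relaxing the independence assumption means scoring whole subsets of dimensions jointly, which is where the genetic algorithm comes in: binary masks over dimensions are the genomes, and retrieval quality under the masked representation is the fitness. The sketch below shows a generic GA of this shape; the truncation selection, uniform crossover, and mutation rate are assumptions, not the paper's exact configuration.

```python
import numpy as np

def ga_select_dims(fitness, dim, pop=32, gens=50, p_mut=0.02, seed=0):
    """Evolve a boolean mask over embedding dimensions that maximises `fitness`.
    `fitness(mask)` is assumed to run retrieval using only the masked-in
    dimensions and return a quality score such as nDCG@10."""
    rng = np.random.default_rng(seed)
    population = rng.random((pop, dim)) < 0.5               # random initial masks
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]  # keep top half
        children = []
        while len(parents) + len(children) < pop:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(dim) < 0.5, a, b)   # uniform crossover
            child ^= rng.random(dim) < p_mut                # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    return max(population, key=fitness)
```

Because fitness is evaluated on whole masks, interactions between dimensions are captured automatically, unlike per-dimension importance scores that assume independence.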
Presenters
Andrea Pasin
University Of Padua, Italy
Co-Authors
Guglielmo Faggioli
University Of Padova
Nicola Ferro
Full Professor, University Of Padova
Raffaele Perego
ISTI-CNR
Nicola Tonellotto
University Of Pisa

An Empirical Study of Model Casing in Learned Sparse Retrieval

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have almost exclusively relied on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models built on cased backbones perform substantially worse by default than their uncased counterparts; performing lowercasing during preprocessing eliminates this gap. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://anonymous.4open.science/r/Uncased-vs-cased-models-in-LSR-F75D
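The remedy the abstract describes is deliberately simple: keep the cased backbone but lowercase all text before tokenisation. A minimal sketch of that preprocessing step, with an illustrative backbone name (the paper evaluates several paired cased/uncased models, not necessarily this one):

```python
from transformers import AutoTokenizer

# Backbone name is illustrative; the point is the lowercasing step.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_for_lsr(text: str):
    # Lowercase before tokenisation so the cased vocabulary's case-sensitive
    # entries are bypassed and the model behaves like its uncased counterpart.
    return tokenizer(text.lower(), truncation=True, return_tensors="pt")
```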
Presenters
Emmanouil Georgios Lionis
PhD Student, University Of Glasgow
Co-Authors
Dylan Jia-Huei Ju
PhD Student, University Of Amsterdam
Angelos Nalmpantis
TKH AI Technology
Casper Thuis
TKH AI Technology
Sean MacAvaney
Senior Lecturer, University Of Glasgow
Andrew Yates
Johns Hopkins University, HLTCOE

Session Participants

Dr. Jaap Kamps
University Of Amsterdam
