
Core Retrieval Models, Representations & Evaluation


Session Information

  • Sample-Free Almost-Exact Estimation of Plackett-Luce Propensities for Off-Policy Ranking
  • Validating Search Query Simulations: A Taxonomy of Measures
  • Reducing Human Effort to Validate LLM Relevance Judgements via Stratified Sampling
  • Revealing MonoT5's Learning Mechanisms via Prompt-Token Adaptation
  • When Reducing Representations Improves Performance
  • An Empirical Study of Model Casing in Learned Sparse Retrieval
  • Improving Instruction-Aware Retrieval with Query-Preserving Regularization
Mar 30, 2026, 10:30–12:30 (Europe/Amsterdam)
Venue: Centrale (Plenary Room)

Sub Sessions

Sample-Free Almost-Exact Estimation of Plackett-Luce Propensities for Off-Policy Ranking

Full papers · Machine Learning and Large Language Models · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Off-policy evaluation (OPE) and optimization for learning to rank (LTR) leverage document placement probabilities to correct for the effects of various statistical biases, e.g., position bias. However, computing these propensities poses a challenge, as for most ranking models this requires iterating over all possible rankings. A common solution is to approximate them by sampling multiple rankings and using the observed document frequencies per position. Nevertheless, even when using extremely large numbers of sampled rankings, these estimates often still contain significant estimation errors. In this work, we propose the novel marginalized Plackett-Luce (MPL) method to efficiently and accurately calculate document-rank placement probabilities under the widely used Plackett-Luce (PL) ranking model. In particular, we establish MPL by first showing that this probability is the expected value of a Poisson binomial distribution over the document scores; subsequently, we leverage a known connection between the Poisson binomial distribution, convolutional operations and numerical integration, to achieve efficient and accurate propensity estimation. Furthermore, we argue that MPL provides near-exact estimation when computing the function over a practical number of evaluation points. Our experiments confirm that the propensity estimation of MPL is highly accurate, efficient, and leads to substantial improvements over the sampling-based method in downstream applications, thus opening the door to a wider use of PL policies in off-policy learning to rank.
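The construction is easiest to see through the Gumbel-max view of Plackett-Luce: conditioned on document d's own Gumbel noise, every other document outranks d independently, so d's rank follows a Poisson binomial distribution that can be marginalized over the noise by numerical integration. The sketch below illustrates that idea under stated assumptions (a fixed quadrature grid and the standard O(n²) Poisson-binomial convolution); it is not the authors' implementation.

```python
import numpy as np

def placement_propensities(scores: np.ndarray, d: int, n_points: int = 512) -> np.ndarray:
    """P(document d is placed at rank k), k = 0..n-1, under a Plackett-Luce
    model with logits `scores`, via the Gumbel-max reformulation."""
    n = len(scores)
    # Quadrature grid over d's own Gumbel noise, weighted by the Gumbel(0,1) pdf.
    g = np.linspace(-10.0, 10.0, n_points)
    w = np.exp(-g - np.exp(-g))
    w /= w.sum()  # normalise so the returned pmf sums to 1

    others = np.delete(scores, d)
    pmf = np.zeros(n)
    for gi, wi in zip(g, w):
        # Given G_d = gi, document j outranks d independently with this probability.
        p_beat = 1.0 - np.exp(-np.exp(-(scores[d] + gi - others)))
        # Poisson binomial pmf of "number of documents that beat d",
        # built up by the standard O(n^2) convolution.
        pb = np.zeros(n)
        pb[0] = 1.0
        for p in p_beat:
            pb[1:] = pb[1:] * (1.0 - p) + pb[:-1] * p
            pb[0] *= (1.0 - p)
        pmf += wi * pb
    return pmf
```

As a sanity check, placement_propensities(np.array([2.0, 1.0, 0.0]), d=0)[0] ≈ exp(2)/(exp(2)+exp(1)+1), the exact PL first-pick probability, while a Monte Carlo estimate of the same quantity would still be noisy after many sampled rankings; that contrast mirrors the efficiency argument the abstract makes.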
Presenters
Norman Knyazev
PhD Candidate, Radboud University
Co-Authors
Harrie Oosterhuis
Assistant Professor, Radboud University

Validating Search Query Simulations: A Taxonomy of Measures

Full papers · Evaluation research · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Assessing the validity of user simulators when used for the evaluation of information retrieval systems remains an open question, constraining their effective use and the reliability of simulation-based results. To address this issue, we conduct a comprehensive literature review with a particular focus on methods for the validation of simulated user queries with regard to real queries. Based on the review, we develop a taxonomy that structures the current landscape of available measures. We empirically corroborate the taxonomy by analyzing the relationships between the different measures applied to four different datasets representing diverse search scenarios. Finally, we provide concrete recommendations on which measures or combinations of measures should be considered when validating user simulation in different contexts. Furthermore, we release a dedicated library with the most commonly used measures to facilitate future research.
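As a flavour of the kind of measure such a taxonomy organises, the sketch below computes one simple distributional validity score: the KL divergence between the unigram term distributions of a real and a simulated query set. The function name and the crude floor for unseen terms are illustrative assumptions, not the paper's released library.

```python
from collections import Counter
import math

def term_kl(real_queries, simulated_queries, eps=1e-9):
    """KL(real || simulated) over unigram term distributions of two query sets.
    `eps` is a crude floor for terms unseen in the simulated set
    (illustrative, not principled smoothing)."""
    def dist(queries):
        counts = Counter(t for q in queries for t in q.lower().split())
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    p, q = dist(real_queries), dist(simulated_queries)
    # Terms with zero probability under p contribute nothing to KL(p || q).
    return sum(pt * math.log(pt / q.get(t, eps)) for t, pt in p.items())
```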
Presenters
Andreas Konstantin Kruff
PhD Student, TH Köln
Co-Authors
Nolwenn Bernard
TH Köln
Philipp Schaer
Professor, TH Köln

Reducing Human Effort to Validate LLM Relevance Judgements via Stratified Sampling

Full papers · Evaluation research · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Information Retrieval (IR) evaluation deeply relies on human-made relevance judgments. To overcome the high costs of the judgment collection process, a potential solution is to utilize LLMs as judges to replace human annotators. However, the validation of LLM-generated judgments is fundamental for informed use. Standard validation approaches typically rely on simple sampling techniques to collect a sample of the LLM-generated judgments and estimate the LLM's agreement with human annotators. In this work, we propose using stratified sampling, a more sophisticated sampling strategy that, by leveraging appropriate stratification features, reduces human involvement in the validation process while still providing statistical guarantees on the human-LLM agreement estimate. Through the analysis of various candidate features, we identify the LLM-generated judgments themselves as the most promising one. Our approach achieves up to an 85% reduction in the required human involvement in the validation process.
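The estimator itself is standard stratified sampling: partition the LLM-judged items by a stratification feature (here the LLM's own label, the feature the abstract identifies as most promising), sample within each stratum, and combine per-stratum agreement rates weighted by stratum size. A minimal sketch, where human_label_fn is a hypothetical stand-in for collecting a human judgment:

```python
import random

def stratified_agreement(llm_labels, human_label_fn, n_per_stratum=50, seed=0):
    """Estimate human-LLM agreement, stratifying on the LLM's own judgement.
    `human_label_fn(i)` stands in for obtaining a human label for item i;
    names here are illustrative, not the paper's API."""
    rng = random.Random(seed)
    strata = {}
    for i, lab in enumerate(llm_labels):
        strata.setdefault(lab, []).append(i)

    total = len(llm_labels)
    estimate = 0.0
    for lab, items in strata.items():
        sample = rng.sample(items, min(n_per_stratum, len(items)))
        agree = sum(human_label_fn(i) == llm_labels[i] for i in sample) / len(sample)
        estimate += (len(items) / total) * agree  # weight by stratum size
    return estimate
```

Because variance is controlled within each stratum, far fewer human labels are needed for the same confidence than with simple random sampling, which is where the reported reduction in human effort comes from.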
Presenters
Simone Merlo
Ph.D. Student, University Of Padova
Co-Authors
Stefano Marchesin
University Of Padua
Guglielmo Faggioli
University Of Padova
Nicola Ferro
Full Professor, University Of Padova

Revealing MonoT5's Learning Mechanisms via Prompt-Token Adaptation

Full papers · Machine Learning and Large Language Models · Recommender systems · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Transformer-based cross-encoders, such as MonoT5, achieve state-of-the-art performance in several retrieval tasks. In particular, MonoT5 is based on a sequence-to-sequence architecture and trained on MS MARCO for passage re-ranking, where it predicts the relevance of a passage given an input query. In this paper, to understand what MonoT5 learned, we analyse the parameter updates during its training process. We observe that the largest shifts occur in a small set of parameters, i.e. less than 1% of the model, while the rest of the model remains relatively unchanged. Motivated by this finding, we propose Light-MonoT5, a parameter-efficient variant of MonoT5 that updates only this small set of parameters during training, and leaves the rest of the network unchanged. Extensive evaluation on both in- and out-domain benchmarks shows that Light-MonoT5 achieves statistically equivalent effectiveness compared to MonoT5. Since relevance can be captured by updating only a subset of T5 parameters, we hypothesise that MonoT5, which updates all the original model's parameters, primarily learns to evaluate passage quality rather than explicitly assessing the relevance of a passage to the query. To test our hypothesis, we employ QT5, a T5-based quality estimation model, to prune low-quality passages before indexing. On the pruned collection, Light-MonoT5 achieves performance on par with MonoT5, indicating that MonoT5's strong performance is largely attributable to quality assessment, with minimal adaptation required once low-quality content is removed.
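The parameter-efficient recipe can be sketched in a few lines of PyTorch: freeze the whole network, then re-enable gradients only for a pre-identified set of parameters. Which parameters to keep trainable is exactly what the paper derives from its update analysis; here trainable_names is an assumed input and the helper is illustrative, not the authors' code.

```python
import torch

def freeze_all_but(model: torch.nn.Module, trainable_names):
    """Freeze every parameter except those whose name starts with an entry
    in `trainable_names` (assumed to be identified beforehand, e.g. by
    analysing which parameters shift most during full fine-tuning)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(t) for t in trainable_names)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"training {trainable / total:.2%} of parameters")
```

Per the abstract, the trainable set covers less than 1% of the model, so the optimizer state and gradient computation shrink accordingly while effectiveness stays statistically equivalent.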
Presenters
Marco Braga
PhD Student, Politecnico Di Torino

When Reducing Representations Improves Performance

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Neural models have transformed Information Retrieval (IR) by enabling semantic search, representing queries and documents as dense embeddings in latent spaces. However, recent work indicates that the contribution of individual dimensions in these representations to ranking quality is uneven: some dimensions are essential, while others may even degrade performance. Dimension IMportance Estimators (DIMEs) are heuristics to guide the search for the subsets of dimensions that induce an optimal subspace where retrieval is more effective. To explore these subspaces, DIMEs rely on two simplifying assumptions: the linearity of subspaces and the independence of dimensions. In this paper, we move a step forward by relaxing the independence assumption and employing genetic algorithms to select the optimal set of dimensions. We show that selecting optimal dimensions for individual queries can achieve up to 0.981 nDCG@10 and 0.831 AP using state-of-the-art dense retrieval models on the considered datasets. Additionally, we identify subsets of dimensions that improve ranking quality across multiple queries simultaneously. Finally, we show that a dataset-specific subset of dimensions enables dense retrieval models to generalize across other datasets without loss of performance.
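Relaxing the independence assumption means scoring whole subsets of dimensions jointly, which is where the genetic algorithm comes in: binary masks over dimensions are the genomes, and retrieval quality under the masked representation is the fitness. The sketch below shows a generic GA of this shape; the truncation selection, uniform crossover, and mutation rate are assumptions, not the paper's exact configuration.

```python
import numpy as np

def ga_select_dims(fitness, dim, pop=32, gens=50, p_mut=0.02, seed=0):
    """Evolve a boolean mask over embedding dimensions that maximises `fitness`.
    `fitness(mask)` is assumed to run retrieval using only the masked-in
    dimensions and return a quality score such as nDCG@10."""
    rng = np.random.default_rng(seed)
    population = rng.random((pop, dim)) < 0.5               # random initial masks
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]  # keep top half
        children = []
        while len(parents) + len(children) < pop:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(dim) < 0.5, a, b)   # uniform crossover
            child ^= rng.random(dim) < p_mut                # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    return max(population, key=fitness)
```

Because fitness is evaluated on whole masks, interactions between dimensions are captured automatically, unlike per-dimension importance scores that assume independence.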
Presenters
Andrea Pasin
University Of Padua, Italy
Co-Authors
Guglielmo Faggioli
University Of Padova
Nicola Ferro
Full Professor, University Of Padova
Raffaele Perego
ISTI-CNR
Nicola Tonellotto
University Of Pisa

An Empirical Study of Model Casing in Learned Sparse Retrieval

Full papers · Search and ranking · 10:30 AM – 12:30 PM (Europe/Amsterdam)
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have almost exclusively relied on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models built on cased backbones perform substantially worse by default than their uncased counterparts; performing lowercasing during preprocessing eliminates this gap. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://anonymous.4open.science/r/Uncased-vs-cased-models-in-LSR-F75D
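The remedy the abstract describes is deliberately simple: keep the cased backbone but lowercase all text before tokenisation. A minimal sketch of that preprocessing step, with an illustrative backbone name (the paper evaluates several paired cased/uncased models, not necessarily this one):

```python
from transformers import AutoTokenizer

# Backbone name is illustrative; the point is the lowercasing step.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_for_lsr(text: str):
    # Lowercase before tokenisation so the cased vocabulary's case-sensitive
    # entries are bypassed and the model behaves like its uncased counterpart.
    return tokenizer(text.lower(), truncation=True, return_tensors="pt")
```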
Presenters
Emmanouil Georgios Lionis
PhD Student, University Of Glasgow
Co-Authors
Dylan Jia-Huei Ju
PhD Student, University Of Amsterdam
Angelos Nalmpantis
TKH AI Technology
Casper Thuis
TKH AI Technology
Sean MacAvaney
Senior Lecturer, University Of Glasgow
Andrew Yates
Johns Hopkins University, HLTCOE

Session Participants

Dr. Jaap Kamps
University Of Amsterdam
