ECIR 2026 · Reproducibility I: Recommender Systems
30 March 2026, 14:30 - 15:30 (Europe/Amsterdam)
Contact: conference-secretariat@blueboxevents.nl
Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities
Reproducibility · 02:30 PM - 03:30 PM (Europe/Amsterdam) · 2026/03/30 12:30:00 UTC - 2026/03/30 13:30:00 UTC
Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach is founded on the intuitive assumption that incorporating multimodal embeddings enhances recommendation performance. However, despite its popularity, this assumption lacks comprehensive empirical verification, which presents a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation?
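The setup the abstract describes can be sketched as follows: per-item text and visual embeddings from pre-trained encoders are fused into a single item representation, commonly by concatenation. This is a minimal illustrative sketch, not the paper's actual pipeline; the toy vectors stand in for Sentence-BERT and Vision Transformer outputs, and all names and dimensions are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_modalities(text_emb, visual_emb):
    """Concatenate modality embeddings into one normalized item representation."""
    return l2_normalize(text_emb + visual_emb)

# Toy stand-ins for pre-trained embeddings of a single item.
text_emb = [0.2, 0.5, 0.1]   # e.g. from Sentence-BERT
visual_emb = [0.7, 0.3]      # e.g. from a Vision Transformer

item_repr = fuse_modalities(text_emb, visual_emb)
print(len(item_repr))        # 5 = text dim + visual dim
```

The paper's central question is whether representations like `item_repr` actually improve recommendation over using either modality, or neither, on its own.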
RecRankerEval: A Reproducible Framework for Deploying and Evaluating LLM-based Top-$k$ Recommenders
Large Language Models (LLMs) have shown promising effectiveness in recommender systems. RecRanker, a recent LLM-based recommendation model, has demonstrated strong results on the top-$k$ recommendation task. However, the contribution of each of its core components, namely user sampling, initial ranking list generation, prompt construction, and an instruction tuning strategy, remains underexplored. In this work, we inspect the reproducibility of RecRanker and study the impact and role of its various components on recommendation performance. We begin by reproducing RecRanker's pipeline through the implementation of all its key components. Our reproduction shows that the pairwise and listwise instruction tuning methods achieve performance comparable to that reported in the original paper. For the pointwise method, while we are also able to reproduce the original paper's results, further analysis shows that the abnormally high performance stems from data leakage caused by the inclusion of ground-truth information in the prompts. To enable a fair and comprehensive evaluation of LLM-based top-$k$ recommendation, we propose RecRankerEval, an extensible framework that covers five key dimensions: user sampling strategy, initial recommendation model, LLM backbone, dataset selection, and instruction tuning method. Using the RecRankerEval framework, we show that the original results of RecRanker can be reproduced on the ML-100K and ML-1M datasets, as well as an additional Amazon-Music dataset, but not on BookCrossing, due to the lack of timestamp information in the original RecRanker paper. Furthermore, we demonstrate that RecRanker's performance can be improved by employing alternative user sampling methods (e.g., DBSCAN), stronger initial recommenders (e.g., XSimGCL), and more capable LLMs (e.g., Llama3).
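The data-leakage problem identified above can be illustrated with a toy pointwise prompt builder: if the ground-truth rating is embedded in the prompt, the LLM is handed the very label it is asked to predict. This is a hedged sketch; the function and field names are illustrative assumptions, not RecRanker's actual implementation.

```python
def build_pointwise_prompt(user_history, candidate, true_rating=None):
    """Build a toy pointwise rating prompt; true_rating=None is the clean variant."""
    prompt = (
        f"User history: {', '.join(user_history)}\n"
        f"Candidate item: {candidate}\n"
    )
    if true_rating is not None:
        # Leaky variant: the label the model should predict appears in its input.
        prompt += f"Observed rating: {true_rating}\n"
    prompt += "Predict the user's rating for the candidate item."
    return prompt

history = ["Toy Story", "Heat", "GoldenEye"]
leaky = build_pointwise_prompt(history, "Casino", true_rating=5)
clean = build_pointwise_prompt(history, "Casino")
print("Observed rating" in leaky, "Observed rating" in clean)  # True False
```

Any evaluation run on prompts like `leaky` will report inflated accuracy, which is why the framework separates prompt construction out as an auditable dimension.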
Efficient Optimization of Hierarchical Identifiers for Generative Recommendation
SEATER is a generative retrieval model that improves recommendation inference efficiency and retrieval quality by utilizing balanced tree-structured item identifiers and contrastive training objectives. We reproduce and validate SEATER's reported improvements in retrieval quality over strong baselines across all datasets from the original work, and extend the evaluation to Yambda, a large-scale music recommendation dataset. Our experiments verify SEATER's strong performance, but show that its tree construction step during training becomes a major bottleneck as the number of items grows. To address this, we implement and evaluate two alternative construction algorithms: a greedy method optimized for minimal build time, and a hybrid method that combines greedy clustering at high levels with more precise grouping at lower levels. The greedy method reduces tree construction time to less than 2% of the original with only a minor drop in quality on the dataset with the largest item collection. The hybrid method achieves retrieval quality on par with the original, and even improves on the largest dataset, while cutting construction time to just 5–8%. All data and code are publicly available for full reproducibility at https://anonymous.4open.science/r/re-seater-8003/.
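The idea of balanced tree-structured identifiers can be sketched in miniature: recursively split the item set into k equally sized groups and record the branch taken at each level as one token of the item's identifier. This is a hedged, illustrative sketch only; real SEATER clusters on learned embeddings, whereas here sorting items by name is an assumed stand-in for grouping similar items, and the greedy/fast flavor lies in the cheap equal-size splits.

```python
def assign_identifiers(items, k=2, depth=2):
    """Return {item: identifier}, each identifier a tuple of branch indices."""
    ids = {item: [] for item in items}

    def split(group, level):
        if level == depth or len(group) <= 1:
            return
        size = -(-len(group) // k)  # ceiling division keeps groups balanced
        for branch in range(k):
            chunk = group[branch * size:(branch + 1) * size]
            for item in chunk:
                ids[item].append(branch)  # one identifier token per tree level
            split(chunk, level + 1)

    split(sorted(items), 0)  # sorting stands in for clustering similar items
    return {item: tuple(path) for item, path in ids.items()}

identifiers = assign_identifiers(["a", "b", "c", "d"], k=2, depth=2)
print(identifiers)  # {'a': (0, 0), 'b': (0, 1), 'c': (1, 0), 'd': (1, 1)}
```

Because each split is a constant-time slice rather than an optimization step, build time scales roughly linearly with the catalogue, which is the trade-off the greedy construction above exploits.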
A Reproducible and Fair Evaluation of Partition-aware Collaborative Filtering
Similarity-based collaborative filtering (CF) models have long demonstrated strong offline performance and conceptual simplicity. However, their scalability is limited by the quadratic cost of maintaining dense item–item similarity matrices. Partitioning-based paradigms have recently emerged as an effective strategy to balance effectiveness and efficiency, allowing models to learn local similarities within coherent subgraphs while maintaining limited global context. In this work, we focus on the Fine-tuning Partition-aware Similarity Refinement (FPSR) framework, a prominent representative of this family, and its extension FPSR+. Reproducible evaluation of partition-aware collaborative filtering remains challenging, as prior FPSR/FPSR+ reports often rely on splits of unclear provenance and omit some similarity-based baselines, complicating fair comparison. We present a transparent, fully reproducible benchmark of FPSR and FPSR+. Based on our results, the family of FPSR models does not consistently perform at the highest level. Overall, it remains competitive, validates its design choices, and shows significant advantages in long-tail scenarios. This highlights the accuracy–coverage trade-offs resulting from partitioning, global components, and hub design. Our investigation clarifies when partition-aware similarity modeling is most beneficial and offers actionable guidance for scalable recommender system design under reproducible protocols. Source code at https://split.to/rep_ecir.
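The efficiency argument above can be made concrete with a toy sketch: computing item–item cosine similarities only within partitions replaces the quadratic full matrix with a few small blocks. This is an assumed minimal illustration, not FPSR itself, which additionally refines these local similarities with a global component; the partitions and vectors here are placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def partitioned_similarities(item_vecs, partitions):
    """Similarity entries (i, j) -> score, restricted to within-partition pairs."""
    sims = {}
    for part in partitions:
        for i in part:
            for j in part:
                if i < j:  # each unordered pair once
                    sims[(i, j)] = cosine(item_vecs[i], item_vecs[j])
    return sims

item_vecs = {"i1": [1, 0], "i2": [1, 1], "i3": [0, 1], "i4": [1, 2]}
partitions = [["i1", "i2"], ["i3", "i4"]]
sims = partitioned_similarities(item_vecs, partitions)
print(len(sims))  # 2 within-partition pairs instead of 6 for the full matrix
```

Pairs that cross partitions (e.g. `("i1", "i3")`) are never materialized, which is exactly where the accuracy–coverage trade-off discussed above comes from.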
A Systematic Reproducibility Study of BSARec for Sequential Recommendation
In sequential recommendation (SR), the self-attention mechanism of Transformer-based models acts as a low-pass filter, limiting their ability to capture high-frequency signals that reflect short-term user interests. To overcome this, BSARec augments the Transformer encoder with a frequency layer that rescales high-frequency components using the Fourier transform. However, the overall effectiveness of BSARec and the roles of its individual components have yet to be systematically validated. We reproduce BSARec and show that it outperforms other SR methods on some datasets. To empirically assess whether BSARec improves performance on high-frequency signals, we propose a metric to quantify user history frequency and evaluate SR methods across different user groups. We compare digital signal processing (DSP) techniques and find that the discrete wavelet transform (DWT) offers only slight improvements over Fourier transforms, and that DSP methods provide no clear advantage over simple residual connections. Finally, we explore padding strategies and find that non-constant padding significantly improves recommendation performance, whereas constant padding hinders the frequency rescaler's ability to capture high-frequency signals.
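A frequency-based view of a user history, in the spirit of the metric described above, can be sketched by taking the Fourier transform of a (toy) scalar interaction signal and measuring the share of spectral energy in its upper frequency bins. The exact metric in the study may differ; this stdlib-only sketch merely illustrates how rapidly switching interests register as high-frequency energy.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def high_freq_energy_ratio(signal):
    """Fraction of spectral energy in the upper half of the unique frequencies
    (DC bin dropped; mirrored bins beyond n/2 ignored for a real signal)."""
    spectrum = dft(signal)
    energies = [abs(c) ** 2 for c in spectrum[1:len(signal) // 2 + 1]]
    total = sum(energies)
    high = sum(energies[len(energies) // 2:])
    return high / total if total else 0.0

smooth = [1, 2, 3, 4, 5, 6, 7, 8]           # slowly drifting interest signal
alternating = [1, -1, 1, -1, 1, -1, 1, -1]  # rapidly switching interests
print(high_freq_energy_ratio(alternating) > high_freq_energy_ratio(smooth))  # True
```

Under such a metric, users like `alternating` are the ones a low-pass self-attention encoder would serve worst, which is how per-group evaluation can probe whether BSARec's frequency rescaling actually helps them.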