March 31, 2026, 14:30-16:00 (Europe/Amsterdam)
Multimodal Retrieval & Embeddings
Papers: Event-aware Video Corpus Moment Retrieval; Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings; Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models; Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio; Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Room: Centrale (Plenary Room)
ECIR2026 | Contact: conference-secretariat@blueboxevents.nl
Event-aware Video Corpus Moment Retrieval
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Video corpus moment retrieval is a challenging task that requires locating a specific moment within a large corpus of untrimmed videos using a natural language query. Existing methods typically rely on frame-level retrieval, ranking videos by the maximum similarity between the query and individual frames. However, such approaches often overlook the semantic structure underlying consecutive frames: the concept of "events", which is fundamental to human video comprehension. To address this limitation, we propose EventFormer, a novel model that explicitly treats events as the fundamental units for video retrieval. Our approach constructs event representations by first grouping consecutive, visually similar frames into coherent events via an event reasoning module, and then hierarchically encoding information at both the frame and event levels. Additionally, we introduce an anchor multi-head self-attention mechanism to enhance the modeling of local dependencies within the Transformer. Extensive experiments on three benchmark datasets (TVR, ANetCaps, and DiDeMo) demonstrate that EventFormer achieves state-of-the-art performance in both effectiveness and efficiency. The code for this work will be available on GitHub.
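The abstract's first step, grouping consecutive visually similar frames into events, can be illustrated with a minimal sketch. This is not EventFormer's actual event reasoning module; the greedy cosine-similarity rule and the `sim_threshold` knob are assumptions for illustration only.

```python
import numpy as np

def group_frames_into_events(frames, sim_threshold=0.9):
    """Greedily group consecutive frame embeddings into events by
    cosine similarity (illustrative stand-in for an event reasoning
    module; `sim_threshold` is a hypothetical hyperparameter)."""
    events, current = [], [0]
    for t in range(1, len(frames)):
        a, b = frames[t - 1], frames[t]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos >= sim_threshold:
            current.append(t)   # same event as the previous frame
        else:
            events.append(current)
            current = [t]       # similarity dropped: start a new event
    events.append(current)
    # one event representation = mean of its member frame embeddings
    return [frames[idx].mean(axis=0) for idx in events]
```

Hierarchical encoding would then attend over these event representations alongside the raw frames.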
Presenter: Liang Pang, Institute of Computing Technology, Chinese Academy of Sciences
Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Music Cover Retrieval, also known as Version Identification, aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and music retrieval. State-of-the-art approaches have largely focused on harmonic and melodic features, employing increasingly complex audio pipelines designed to be invariant to musical attributes that often vary widely across covers. While effective, these methods demand substantial training time and computational resources. By contrast, lyrics constitute a strong invariant across covers, though their use has been limited by the difficulty of extracting them accurately and efficiently from polyphonic audio. Early methods relied on simple frameworks that limited downstream performance, while more recent systems deliver stronger results but require large models integrated within complex multimodal architectures. We introduce LIVI (Lyrics-Informed Version Identification), an approach that seeks to balance retrieval accuracy with computational efficiency. First, LIVI leverages supervision from state-of-the-art transcription and text embedding models during training to achieve retrieval accuracy on par with, or superior to, harmonic-based systems. Second, LIVI remains lightweight and efficient by removing the transcription step at inference, challenging the dominance of complexity-heavy pipelines.
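The train-time supervision described above, where transcription and text embedding models guide the audio encoder so that transcription can be dropped at inference, follows a distillation pattern. A minimal sketch of one plausible objective (cosine distance between the student's audio embedding and the teacher's lyrics-text embedding; this is an assumption, not LIVI's published loss):

```python
import numpy as np

def lyrics_distillation_loss(student_audio_emb, teacher_lyrics_emb):
    """Cosine-distance loss pulling an audio embedding toward the text
    embedding of the clip's transcribed lyrics (hypothetical sketch of
    a LIVI-style objective). At inference only the audio encoder runs,
    so the transcription step is no longer needed."""
    a = student_audio_emb / (np.linalg.norm(student_audio_emb) + 1e-8)
    t = teacher_lyrics_emb / (np.linalg.norm(teacher_lyrics_emb) + 1e-8)
    return 1.0 - float(a @ t)   # 0 when perfectly aligned, up to 2
```

Because the teacher only supplies targets during training, the deployed retrieval system is a single lightweight audio encoder.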
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Full papers | Machine Learning and Large Language Models | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This keeps image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with a 0.959 Pearson correlation with human judgement and a 4.4× speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves a 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
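The early-exit mechanism described above can be sketched in a few lines. This is an illustrative simplification, not ICAR's interface: the `complexity` score (here a number in [0, 1] from a complexity classifier), the 0.5 cutoff, and the `exit_depth` of 4 are all hypothetical choices.

```python
import numpy as np

def encode_adaptive(image_tokens, blocks, complexity, exit_depth=4):
    """Run a stack of transformer-like blocks over image tokens,
    exiting early for low-complexity images (hypothetical sketch).
    Dual-path training is what would make the early-exit embedding
    compatible with full-depth and text embeddings; here we only
    show the depth selection and pooling."""
    depth = exit_depth if complexity < 0.5 else len(blocks)
    h = image_tokens
    for block in blocks[:depth]:
        h = block(h)
    emb = h.mean(axis=0)                      # mean-pool the tokens
    return emb / (np.linalg.norm(emb) + 1e-8), depth
```

A simple image thus pays for `exit_depth` blocks instead of the full stack, which is where the reported speedup would come from.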
Presenter: Mikel Williams Lekuona, Research Associate, Loughborough University
Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
Full papers | Societally-motivated IR research | User aspects in IR
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Query formulation from internal information needs (IN) is required across all IR paradigms (ad hoc retrieval, chatbots, conversational search) but remains fundamentally challenging due to the complexity of realising an IN and to physical or mental impairments. Brain Passage Retrieval (BPR) was proposed to bypass explicit query formulation by directly mapping EEG queries to passage representations without intermediate text translation. However, existing BPR research focuses exclusively on visual stimuli, creating notable limitations: no evidence exists that auditory EEG can serve as an effective query representation, despite the importance of auditory processing to conversational search and to accessibility for visually impaired users; and, critically, whether training on combined EEG datasets from different modalities improves retrieval performance remains entirely unexplored. To address these gaps, we investigate whether auditory EEG enables effective BPR and the potential benefits of cross-sensory training. Using a dual-encoder architecture, we compare four pooling strategies across modalities. Controlled experiments with auditory and visual datasets compare three training regimes: auditory only, visual only, and combined cross-sensory. Results show that auditory EEG consistently outperforms visual EEG across architectures, and that cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). These findings establish auditory neural interfaces as viable for IR and demonstrate that cross-sensory training outperforms individual sensory training, whilst enabling inclusive brain-machine interfaces.
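The dual-encoder retrieval setup with pooling can be sketched as follows. The paper compares four pooling strategies but does not name them in the abstract, so the `cls`, `mean`, and `max` options below are illustrative choices, and the dot-product scoring is an assumed similarity function.

```python
import numpy as np

def pool(hidden, strategy="cls"):
    """Pool a (T, d) sequence of encoder states into one vector.
    Illustrative strategies only; 'cls' takes the first ([CLS]-style)
    state, as in the cross-sensory result quoted above."""
    if strategy == "cls":
        return hidden[0]
    if strategy == "mean":
        return hidden.mean(axis=0)
    if strategy == "max":
        return hidden.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

def rank_passages(eeg_states, passage_states, strategy="cls"):
    """Dual-encoder retrieval sketch: score each candidate passage by
    the dot product of its pooled states with the pooled EEG query,
    and return passage indices sorted by descending score."""
    q = pool(eeg_states, strategy)
    scores = [float(pool(p, strategy) @ q) for p in passage_states]
    return sorted(range(len(passage_states)), key=lambda i: -scores[i])
```

In the actual system the two encoders would be trained jointly so that EEG queries and text passages land in a shared space; here both sides are just given as state matrices.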
Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals, because "motorcycle" was not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" → "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (MAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
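The idea of aligning soft-label distributions across modalities (step 1 above) can be sketched with a symmetric KL divergence between the teacher's audio and visual class distributions. This is a simplified stand-in: the paper's exact AV-SAL formulation is not given in the abstract, and the temperature `tau` is a hypothetical knob.

```python
import numpy as np

def semantic_alignment_loss(audio_logits, visual_logits, tau=1.0):
    """Symmetric KL between audio and visual soft-label distributions
    (hypothetical sketch of an AV-SAL-style objective). A low loss
    means both modalities assign similar probability mass, including
    to co-occurring but unannotated classes like the motorcycle
    example above."""
    def softmax(z):
        z = z / tau
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(audio_logits), softmax(visual_logits)
    def kl(a, b):
        return float((a * np.log((a + 1e-8) / (b + 1e-8))).sum())
    return 0.5 * (kl(p, q) + kl(q, p))
```

The resulting soft labels are then the input to the graph-inference step (ILI), and their pairwise probabilities weight the student's regularizer (LIR).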
Presenter: Donghuo Zeng, Researcher, KDDI Research, Inc., Japan