2026/03/31 14:30 - 16:00 (Europe/Amsterdam)
Session: Multimodal Retrieval & Embeddings
Papers: Event-aware Video Corpus Moment Retrieval · Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings · Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models · Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio · Learning Audio–Visual Embeddings with Inferred Latent Interaction Graphs
Location: Centrale (Plenary Room), ECIR 2026
Contact: n.fontein@tudelft.nl
Event-aware Video Corpus Moment Retrieval
Full papers · Applications · Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) · 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Video corpus moment retrieval is a challenging task: locating a specific moment within a large corpus of untrimmed videos given a natural language query. Existing methods typically rely on frame-level retrieval, ranking videos by the maximum similarity between the query and individual frames. Such approaches, however, overlook the semantic structure underlying consecutive frames, in particular the notion of "events," which is fundamental to human video comprehension. To address this limitation, we propose EventFormer, a novel model that explicitly treats events as the fundamental units of video retrieval. Our approach constructs event representations by first grouping consecutive, visually similar frames into coherent events via an event reasoning module, and then hierarchically encoding information at both the frame and event levels. We further introduce an anchor multi-head self-attention mechanism to strengthen the modeling of local dependencies within the Transformer. Extensive experiments on three benchmark datasets (TVR, ANetCaps, and DiDeMo) show that EventFormer achieves state-of-the-art performance in both effectiveness and efficiency. The code for this work will be available on GitHub.
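The core event-grouping idea, clustering consecutive, visually similar frames into coherent events before encoding, can be sketched with a simple similarity threshold. This is a minimal illustration, not the paper's learned event reasoning module; the function names, the cosine threshold, and mean-pooling for event representations are all assumptions for exposition.

```python
import numpy as np

def group_frames_into_events(frame_feats: np.ndarray, threshold: float = 0.8):
    """Group consecutive frames into events by cosine similarity.

    A new event starts whenever a frame's similarity to the previous frame
    drops below `threshold`. Simplified stand-in for a learned event
    reasoning module (illustrative only).
    """
    # Normalize features so dot products are cosine similarities.
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True)
    feats = frame_feats / np.clip(norms, 1e-8, None)

    events, current = [], [0]
    for i in range(1, len(feats)):
        if float(feats[i] @ feats[i - 1]) >= threshold:
            current.append(i)       # visually similar: same event
        else:
            events.append(current)  # similarity dropped: close the event
            current = [i]
    events.append(current)
    return events

def event_representations(frame_feats: np.ndarray, events):
    """Event-level representations: mean-pool the frames within each event."""
    return np.stack([frame_feats[idx].mean(axis=0) for idx in events])
```

A hierarchical encoder, as described in the abstract, would then attend over both the raw frame features and these pooled event vectors.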
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Full papers · Machine Learning and Large Language Models · Search and ranking
Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
Full papers · Societally-motivated IR research · User aspects in IR
Query formulation from internal information needs (INs) is required across all IR paradigms (ad hoc retrieval, chatbots, conversational search) but remains fundamentally challenging, both because INs are difficult to articulate and because physical or mental impairments can prevent their expression. Brain Passage Retrieval (BPR) was proposed to bypass explicit query formulation by directly mapping EEG queries to passage representations without intermediate text translation. However, existing BPR research focuses exclusively on visual stimuli, leaving two notable gaps: there is no evidence on whether auditory EEG can serve as an effective query representation, despite the importance of auditory processing for conversational search and for accessibility for visually impaired users; and, critically, whether training on combined EEG datasets from different modalities improves retrieval performance remains entirely unexplored. To address these gaps, we investigate whether auditory EEG enables effective BPR and what benefits cross-sensory training offers. Using a dual-encoder architecture, we compare four pooling strategies across modalities. Controlled experiments with auditory and visual datasets compare three training regimes: auditory only, visual only, and combined cross-sensory. Results show that auditory EEG consistently outperforms visual EEG across architectures, and that cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). These findings establish auditory neural interfaces as viable for IR and demonstrate that cross-sensory training outperforms individual sensory training, whilst enabling more inclusive brain-machine interfaces.
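The dual-encoder setup with different pooling strategies can be sketched as follows. This is a hedged illustration of how CLS-style and mean pooling collapse a sequence of EEG token embeddings into a single query vector for dense retrieval; the function names, shapes, and dot-product scoring are assumptions, not the authors' implementation.

```python
import numpy as np

def pool_tokens(token_embs: np.ndarray, strategy: str = "cls") -> np.ndarray:
    """Collapse a (seq_len, dim) matrix of token embeddings into one query vector.

    'cls'  : take the first (CLS-style) token embedding.
    'mean' : average over all token embeddings.
    Two of the standard pooling strategies a dual encoder can use;
    this sketch is illustrative, not the paper's code.
    """
    if strategy == "cls":
        return token_embs[0]
    if strategy == "mean":
        return token_embs.mean(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

def retrieve(query_vec: np.ndarray, passage_matrix: np.ndarray, k: int = 10):
    """Rank passages by dot-product similarity to the pooled EEG query."""
    scores = passage_matrix @ query_vec
    return np.argsort(-scores)[:k]
```

In the paper's reported results, the CLS-style pooling combined with cross-sensory training is the regime that yields the largest gains.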
Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Full papers · Applications · Search and ranking
Learning robust audio–visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals, because "motorcycle" is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio–Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" → "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (MAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
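The LIR idea, pulling together embeddings of dependency-linked pairs in proportion to their soft-label probabilities, can be sketched for a single clip as below. This is a minimal illustration under stated assumptions: the dependency graph is given as a plain dict, the pull term is a weighted squared distance, and all names are hypothetical, not the authors' code.

```python
import numpy as np

def latent_interaction_regularizer(audio_emb, visual_emb,
                                   soft_audio, soft_visual, dep_graph):
    """Illustrative sketch of an ILI-guided pull term (not the authors' code).

    For one clip:
      audio_emb, visual_emb   : embedding vectors from the two modalities.
      soft_audio, soft_visual : teacher soft-label distributions over classes.
      dep_graph : {source class -> list of target classes} from the inferred
                  directed graph, e.g. train (visual) -> motorcycle (audio).
    Each edge pulls the two embeddings together with a weight proportional
    to the soft-label probabilities of the linked classes.
    """
    loss = 0.0
    # Squared Euclidean distance between the cross-modal embeddings.
    sq_dist = float(np.sum((audio_emb - visual_emb) ** 2))
    for src, targets in dep_graph.items():
        for tgt in targets:
            weight = soft_visual[src] * soft_audio[tgt]  # edge strength
            loss += weight * sq_dist
    return loss
```

In training, a term like this would be added to the student's metric loss, so that co-occurring but unannotated classes (the "train"/"motorcycle" example above) stop acting as false negatives.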