March 31, 2026, 14:30-16:00 (Europe/Amsterdam)
Multimodal Retrieval & Embeddings
Papers: Event-aware Video Corpus Moment Retrieval; Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings; Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models; Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio; Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Room: Centrale (Plenary Room)
ECIR2026 | Contact: conference-secretariat@blueboxevents.nl
Event-aware Video Corpus Moment Retrieval
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Video corpus moment retrieval is a challenging task that requires locating a specific moment within a large corpus of untrimmed videos using a natural language query. Existing methods typically rely on frame-level retrieval, ranking videos by the maximum similarity between the query and individual frames. However, such approaches often overlook the semantic structure underlying consecutive frames: the concept of "events", which is fundamental to human video comprehension. To address this limitation, we propose EventFormer, a novel model that explicitly treats events as the fundamental units for video retrieval. Our approach constructs event representations by first grouping consecutive, visually similar frames into coherent events via an event reasoning module, and then hierarchically encoding information at both the frame and event levels. Additionally, we introduce an anchor multi-head self-attention mechanism to enhance the modeling of local dependencies within the Transformer. Extensive experiments on three benchmark datasets (TVR, ANetCaps, and DiDeMo) demonstrate that EventFormer achieves state-of-the-art performance in both effectiveness and efficiency. The code for this work will be available on GitHub.
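The abstract's first step, grouping consecutive visually similar frames into events, can be illustrated with a minimal sketch. This is not EventFormer's actual event reasoning module; the greedy cosine-similarity rule and the `sim_threshold` knob are assumptions for illustration only.

```python
import numpy as np

def group_frames_into_events(frames, sim_threshold=0.9):
    """Greedily group consecutive frame embeddings into events by
    cosine similarity (illustrative stand-in for an event reasoning
    module; `sim_threshold` is a hypothetical hyperparameter)."""
    events, current = [], [0]
    for t in range(1, len(frames)):
        a, b = frames[t - 1], frames[t]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos >= sim_threshold:
            current.append(t)   # same event as the previous frame
        else:
            events.append(current)
            current = [t]       # similarity dropped: start a new event
    events.append(current)
    # one event representation = mean of its member frame embeddings
    return [frames[idx].mean(axis=0) for idx in events]
```

Hierarchical encoding would then attend over these event representations alongside the raw frames.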
Presenter: Liang Pang, Institute of Computing Technology, Chinese Academy of Sciences
Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Music Cover Retrieval, also known as Version Identification, aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and music retrieval. State-of-the-art approaches have largely focused on harmonic and melodic features, employing increasingly complex audio pipelines designed to be invariant to musical attributes that often vary widely across covers. While effective, these methods demand substantial training time and computational resources. By contrast, lyrics constitute a strong invariant across covers, though their use has been limited by the difficulty of extracting them accurately and efficiently from polyphonic audio. Early methods relied on simple frameworks that limited downstream performance, while more recent systems deliver stronger results but require large models integrated within complex multimodal architectures. We introduce LIVI (Lyrics-Informed Version Identification), an approach that seeks to balance retrieval accuracy with computational efficiency. First, LIVI leverages supervision from state-of-the-art transcription and text embedding models during training to achieve retrieval accuracy on par with, or superior to, harmonic-based systems. Second, LIVI remains lightweight and efficient by removing the transcription step at inference, challenging the dominance of complexity-heavy pipelines.
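The train-time supervision described above, where transcription and text embedding models guide the audio encoder so that transcription can be dropped at inference, follows a distillation pattern. A minimal sketch of one plausible objective (cosine distance between the student's audio embedding and the teacher's lyrics-text embedding; this is an assumption, not LIVI's published loss):

```python
import numpy as np

def lyrics_distillation_loss(student_audio_emb, teacher_lyrics_emb):
    """Cosine-distance loss pulling an audio embedding toward the text
    embedding of the clip's transcribed lyrics (hypothetical sketch of
    a LIVI-style objective). At inference only the audio encoder runs,
    so the transcription step is no longer needed."""
    a = student_audio_emb / (np.linalg.norm(student_audio_emb) + 1e-8)
    t = teacher_lyrics_emb / (np.linalg.norm(teacher_lyrics_emb) + 1e-8)
    return 1.0 - float(a @ t)   # 0 when perfectly aligned, up to 2
```

Because the teacher only supplies targets during training, the deployed retrieval system is a single lightweight audio encoder.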
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Full papers | Machine Learning and Large Language Models | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This keeps image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with a 0.959 Pearson correlation with human judgement and a 4.4× speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves a 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
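The early-exit mechanism described above can be sketched in a few lines. This is an illustrative simplification, not ICAR's interface: the `complexity` score (here a number in [0, 1] from a complexity classifier), the 0.5 cutoff, and the `exit_depth` of 4 are all hypothetical choices.

```python
import numpy as np

def encode_adaptive(image_tokens, blocks, complexity, exit_depth=4):
    """Run a stack of transformer-like blocks over image tokens,
    exiting early for low-complexity images (hypothetical sketch).
    Dual-path training is what would make the early-exit embedding
    compatible with full-depth and text embeddings; here we only
    show the depth selection and pooling."""
    depth = exit_depth if complexity < 0.5 else len(blocks)
    h = image_tokens
    for block in blocks[:depth]:
        h = block(h)
    emb = h.mean(axis=0)                      # mean-pool the tokens
    return emb / (np.linalg.norm(emb) + 1e-8), depth
```

A simple image thus pays for `exit_depth` blocks instead of the full stack, which is where the reported speedup would come from.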
Presenter: Mikel Williams Lekuona, Research Associate, Loughborough University
Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
Full papers | Societally-motivated IR research | User aspects in IR
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Query formulation from internal information needs (IN) is required across all IR paradigms (ad hoc retrieval, chatbots, conversational search) but remains fundamentally challenging due to the complexity of realising an IN and to physical or mental impairments. Brain Passage Retrieval (BPR) was proposed to bypass explicit query formulation by directly mapping EEG queries to passage representations without intermediate text translation. However, existing BPR research focuses exclusively on visual stimuli, creating notable limitations: no evidence exists that auditory EEG can serve as an effective query representation, despite the importance of auditory processing to conversational search and to accessibility for visually impaired users; and, critically, whether training on combined EEG datasets from different modalities improves retrieval performance remains entirely unexplored. To address these gaps, we investigate whether auditory EEG enables effective BPR and the potential benefits of cross-sensory training. Using a dual-encoder architecture, we compare four pooling strategies across modalities. Controlled experiments with auditory and visual datasets compare three training regimes: auditory only, visual only, and combined cross-sensory. Results show that auditory EEG consistently outperforms visual EEG across architectures, and that cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). These findings establish auditory neural interfaces as viable for IR and demonstrate that cross-sensory training outperforms individual sensory training, whilst enabling inclusive brain-machine interfaces.
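The dual-encoder retrieval setup with pooling can be sketched as follows. The paper compares four pooling strategies but does not name them in the abstract, so the `cls`, `mean`, and `max` options below are illustrative choices, and the dot-product scoring is an assumed similarity function.

```python
import numpy as np

def pool(hidden, strategy="cls"):
    """Pool a (T, d) sequence of encoder states into one vector.
    Illustrative strategies only; 'cls' takes the first ([CLS]-style)
    state, as in the cross-sensory result quoted above."""
    if strategy == "cls":
        return hidden[0]
    if strategy == "mean":
        return hidden.mean(axis=0)
    if strategy == "max":
        return hidden.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

def rank_passages(eeg_states, passage_states, strategy="cls"):
    """Dual-encoder retrieval sketch: score each candidate passage by
    the dot product of its pooled states with the pooled EEG query,
    and return passage indices sorted by descending score."""
    q = pool(eeg_states, strategy)
    scores = [float(pool(p, strategy) @ q) for p in passage_states]
    return sorted(range(len(passage_states)), key=lambda i: -scores[i])
```

In the actual system the two encoders would be trained jointly so that EEG queries and text passages land in a shared space; here both sides are just given as state matrices.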
Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals, because "motorcycle" was not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" → "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (MAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
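The idea of aligning soft-label distributions across modalities (step 1 above) can be sketched with a symmetric KL divergence between the teacher's audio and visual class distributions. This is a simplified stand-in: the paper's exact AV-SAL formulation is not given in the abstract, and the temperature `tau` is a hypothetical knob.

```python
import numpy as np

def semantic_alignment_loss(audio_logits, visual_logits, tau=1.0):
    """Symmetric KL between audio and visual soft-label distributions
    (hypothetical sketch of an AV-SAL-style objective). A low loss
    means both modalities assign similar probability mass, including
    to co-occurring but unannotated classes like the motorcycle
    example above."""
    def softmax(z):
        z = z / tau
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(audio_logits), softmax(visual_logits)
    def kl(a, b):
        return float((a * np.log((a + 1e-8) / (b + 1e-8))).sum())
    return 0.5 * (kl(p, q) + kl(q, p))
```

The resulting soft labels are then the input to the graph-inference step (ILI), and their pairwise probabilities weight the student's regularizer (LIR).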
Presenter: Donghuo Zeng, Researcher, KDDI Research, Inc., Japan