
Multimodal Retrieval & Embeddings

Session Information

  • Event-aware Video Corpus Moment Retrieval
  • Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings
  • Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
  • Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
  • Learning Audio–Visual Embeddings with Inferred Latent Interaction Graphs
Mar 31, 2026, 14:30 - 16:00 (Europe/Amsterdam)
Venue: Centrale (Plenary Room)

Sub Sessions

Event-aware Video Corpus Moment Retrieval

Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Video corpus moment retrieval is a challenging task that requires locating a specific moment from a large corpus of untrimmed videos using a natural language query. Existing methods typically rely on frame-level retrieval, which ranks videos by the maximum similarity between the query and individual frames. However, such approaches often overlook the semantic structure underlying consecutive frames, specifically the concept of "events", which is fundamental to human video comprehension. To address this limitation, we propose EventFormer, a novel model that explicitly treats events as the fundamental units for video retrieval. Our approach constructs event representations by first grouping consecutive, visually similar frames into coherent events via an event reasoning module, and then hierarchically encoding information at both the frame and event levels. Additionally, we introduce an anchor multi-head self-attention mechanism to enhance the modeling of local dependencies within the Transformer. Extensive experiments on three benchmark datasets (TVR, ANetCaps, and DiDeMo) demonstrate that EventFormer achieves state-of-the-art performance in both effectiveness and efficiency. The code for this work will be available on GitHub.
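The event-grouping step described in the abstract can be sketched in a few lines. This is a toy illustration under assumed details (cosine similarity against a running event centroid, a hypothetical `sim_threshold`), not the authors' released EventFormer code:

```python
import numpy as np

def group_frames_into_events(frame_feats: np.ndarray, sim_threshold: float = 0.8):
    """Greedily merge consecutive, visually similar frames into events.

    A new event starts whenever the cosine similarity between a frame and
    the running event centroid drops below `sim_threshold` (an assumed knob).
    """
    # Normalise features so dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    events, current = [], [0]
    for i in range(1, len(feats)):
        centroid = feats[current].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(feats[i] @ centroid) >= sim_threshold:
            current.append(i)          # frame joins the current event
        else:
            events.append(current)     # close the event, start a new one
            current = [i]
    events.append(current)
    # Event representation = mean of its member frame features.
    event_feats = np.stack([feats[idx].mean(axis=0) for idx in events])
    return events, event_feats
```

In the paper, these event representations would then be encoded hierarchically alongside the frame-level features; this sketch only covers the grouping.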
Presenters
Danyang Hou
Institute Of Computing Technology, Chinese Academy Of Sciences, Beijing, China
Co-Authors
Liang Pang
Institute Of Computing Technology, Chinese Academy Of Sciences

Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings

Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Music Cover Retrieval, also known as Version Identification, aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and music retrieval. State-of-the-art approaches have largely focused on harmonic and melodic features, employing increasingly complex audio pipelines designed to be invariant to musical attributes that often vary widely across covers. While effective, these methods demand substantial training time and computational resources. By contrast, lyrics constitute a strong invariant across covers, though their use has been limited by the difficulty of extracting them accurately and efficiently from polyphonic audio. Early methods relied on simple frameworks that limited downstream performance, while more recent systems deliver stronger results but require large models integrated within complex multimodal architectures. We introduce LIVI (Lyrics-Informed Version Identification), an approach that seeks to balance retrieval accuracy with computational efficiency. First, LIVI leverages supervision from state-of-the-art transcription and text embedding models during training to achieve retrieval accuracy on par with, or superior to, harmonic-based systems. Second, LIVI remains lightweight and efficient by removing the transcription step at inference, challenging the dominance of complexity-heavy pipelines.
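The two-phase idea behind LIVI (align audio embeddings to lyrics-text embeddings during training, then drop transcription entirely at inference) can be sketched as follows. The function names and the cosine-distance objective are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cosine_alignment_loss(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Training-time supervision (assumed form): pull the audio embedding
    toward the text embedding of the transcribed lyrics."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return 1.0 - float(a @ t)   # 0 when perfectly aligned

def retrieve(query_emb: np.ndarray, catalog_embs: np.ndarray, k: int = 5):
    """Inference: no transcription step, just nearest neighbours among
    precomputed catalog embeddings by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]   # indices of the top-k candidate versions
```

The efficiency claim in the abstract rests on the second function: once trained, the audio encoder alone produces lyrics-aligned embeddings, so the heavy transcription model never runs at query time.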
Presenters
Joanne Affolter
Machine Learning Engineer, Deezer
Co-Authors
Benjamin Martin
Deezer
Elena V. Epure
Deezer
Gabriel Meseguer-Brocal
Deezer
Frédéric Kaplan

Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

Full papers | Machine Learning and Large Language Models | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with a 0.959 Pearson correlation with human judgement and a 4.4× speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves a 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
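At inference, the routing idea (spend less compute on simple images) reduces to a small gate. A minimal sketch, assuming a scalar complexity score from a lightweight classifier and interchangeable encoder callables; all names and the threshold are hypothetical:

```python
def adaptive_encode(image, complexity_score, shallow_encoder, full_encoder,
                    exit_threshold=0.5):
    """Pick the processing depth from an image-complexity score.

    `complexity_score` would come from a lightweight classifier
    (ConvNeXt-IC in the paper); `exit_threshold` is an assumed knob.
    Because dual-path training aligns both paths' embeddings with text
    in the same semantic space, the caller can use either output
    directly for image-text matching, with no reranking stage.
    """
    if complexity_score < exit_threshold:
        return shallow_encoder(image)  # early exit: fewer layers for simple images
    return full_encoder(image)         # full network depth for complex scenes
```

The non-trivial part is not this gate but the dual-path training that makes the two outputs interchangeable; this sketch only shows the inference-time dispatch.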
Presenters
Mikel Williams Lekuona
Research Associate, Loughborough University
Co-Authors
Georgina Cosma
Loughborough University

Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio

Full papers | Societally-motivated IR research | User aspects in IR
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Query formulation from internal information needs (IN) is required across all IR paradigms (ad hoc retrieval, chatbots, conversational search) but remains fundamentally challenging due to IN realisation complexity and physical/mental impairments. Brain Passage Retrieval (BPR) was proposed to bypass explicit query formulation by directly mapping EEG queries to passage representations without intermediate text translation. However, existing BPR research focuses exclusively on visual stimuli, creating notable limitations: there is no evidence as to whether auditory EEG can serve as an effective query representation, despite auditory processing being important to conversational search and to accessibility for visually impaired users; and critically, whether training on combined EEG datasets from different modalities improves retrieval performance remains entirely unexplored. To address these gaps, we investigate whether auditory EEG enables effective BPR and the potential benefits of cross-sensory training. Using a dual encoder architecture, we compare four pooling strategies across modalities. Controlled experiments with auditory and visual datasets compare three training regimes: auditory only, visual only, and combined cross-sensory. Results show auditory EEG consistently outperforms visual EEG across architectures, and cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). These findings establish auditory neural interfaces as viable for IR and demonstrate that cross-sensory training outperforms individual sensory training, whilst enabling inclusive brain-machine interfaces.
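In a dual-encoder setup like the one described, a pooling strategy collapses a sequence of EEG token embeddings into a single query vector before matching against passage embeddings. The abstract names only CLS pooling among the four strategies compared, so the sketch below shows CLS plus two common alternatives as assumptions:

```python
import numpy as np

def pool(token_embs: np.ndarray, strategy: str = "cls") -> np.ndarray:
    """Collapse a (seq_len, dim) sequence of token embeddings to one (dim,) vector.

    Only "cls" is confirmed by the abstract; "mean" and "max" are common
    alternatives included here for illustration.
    """
    if strategy == "cls":
        return token_embs[0]            # first ([CLS]-style) token as summary
    if strategy == "mean":
        return token_embs.mean(axis=0)  # average over the sequence
    if strategy == "max":
        return token_embs.max(axis=0)   # element-wise maximum over the sequence
    raise ValueError(f"unknown pooling strategy: {strategy}")
```

The pooled EEG vector and the passage vector would then be scored by a similarity function (e.g. dot product) for ranking, as in standard dense retrieval.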
Presenters
Niall McGuire
PhD Student, Strathclyde University
Co-Authors
Yashar Moshfeghi
Associate Professor, University Of Strathclyde

Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs

Full papers | Applications | Search and ranking
02:30 PM - 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30:00 UTC - 2026/03/31 14:00:00 UTC
Learning robust audio–visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals, because "motorcycle" is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio–Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" → "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (MAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
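The LIR term in step (3) can be illustrated as a pairwise penalty weighted by soft-label mass along the inferred dependency edges. A deliberately naive O(n²) sketch, assuming dense soft labels and an edge list of class-index pairs; the weighting and distance choices are assumptions, not the paper's exact regularizer:

```python
import numpy as np

def latent_interaction_regularizer(embs: np.ndarray,
                                   soft_labels: np.ndarray,
                                   dep_graph: list[tuple[int, int]]) -> float:
    """Toy LIR: penalise squared distance between samples whose classes are
    linked by a directed edge in the inferred dependency graph, weighted by
    the product of their soft-label probabilities for the linked classes.

    embs:        (n, dim) sample embeddings from the student network
    soft_labels: (n, num_classes) teacher soft-label distributions
    dep_graph:   directed edges (source_class, target_class) from the ILI graph
    """
    loss = 0.0
    for c_src, c_dst in dep_graph:
        for i in range(len(embs)):
            for j in range(len(embs)):
                if i == j:
                    continue
                # Dependency-linked but possibly unlabeled pair: weight by
                # how strongly the teacher believes each side's class.
                w = soft_labels[i, c_src] * soft_labels[j, c_dst]
                loss += w * float(np.sum((embs[i] - embs[j]) ** 2))
    return loss
```

In training, this term would be added to the metric loss so that gradient descent pulls dependency-linked embeddings together in proportion to the soft-label weights.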
Presenters
Donghuo Zeng
Researcher, KDDI Research, Inc. Japan
Co-Authors
Hao Niu
Yanan Wang
Masato Taya


