
Multimodal Retrieval & Embeddings


Session Information

  • Event-aware Video Corpus Moment Retrieval
  • Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings
  • Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
  • Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio
  • Learning Audio–Visual Embeddings with Inferred Latent Interaction Graphs
Mar 31, 2026 14:30 - 16:00(Europe/Amsterdam)
Venue : Centrale (Plenary Room)

Sub Sessions

Event-aware Video Corpus Moment Retrieval

Full papers | Applications · Search and ranking | 02:30 PM – 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30 – 14:00 UTC
Video corpus moment retrieval is a challenging task that requires locating a specific moment from a large corpus of untrimmed videos using a natural language query. Existing methods typically rely on frame-level retrieval, which ranks videos by the maximum similarity between the query and individual frames. However, such approaches often overlook the semantic structure underlying consecutive frames, specifically the concept of "events", which is fundamental to human video comprehension. To address this limitation, we propose EventFormer, a novel model that explicitly treats events as fundamental units for video retrieval. Our approach constructs event representations by first grouping consecutive, visually similar frames into coherent events via an event reasoning module, and then hierarchically encoding information at both the frame and event levels. Additionally, we introduce an anchor multi-head self-attention mechanism to enhance the modeling of local dependencies within the Transformer. Extensive experiments on three benchmark datasets (TVR, ANetCaps, and DiDeMo) demonstrate that EventFormer achieves state-of-the-art performance in both effectiveness and efficiency. The code for this work will be available on GitHub.
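The event-grouping idea in the abstract (merging consecutive, visually similar frames into coherent events) can be sketched with a simple greedy boundary rule. This is an illustrative sketch only; the threshold and the greedy rule are assumptions for exposition, not EventFormer's actual event reasoning module:

```python
import numpy as np

def group_frames_into_events(frames: np.ndarray, threshold: float = 0.9) -> list:
    """Group consecutive, visually similar frame embeddings into events.

    frames: (num_frames, dim) array of frame embeddings.
    A new event starts whenever the cosine similarity between a frame
    and its predecessor drops below `threshold`.
    """
    # Normalise so the dot product equals cosine similarity.
    norms = np.linalg.norm(frames, axis=1, keepdims=True)
    unit = frames / np.clip(norms, 1e-8, None)

    events = [[0]]
    for i in range(1, len(frames)):
        sim = float(unit[i] @ unit[i - 1])
        if sim >= threshold:
            events[-1].append(i)   # visually continuous: same event
        else:
            events.append([i])     # similarity drop: event boundary
    return events
```

Each returned group of frame indices would then be pooled into a single event representation before hierarchical encoding.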
Presenters
DH
Danyang Hou
Beijing, China, Institute Of Computing Technology, Chinese Academy Of Sciences
Co-Authors
LP
Liang Pang
Institute Of Computing Technology, Chinese Academy Of Sciences

Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings

Full papers | Applications · Search and ranking | 02:30 PM – 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30 – 14:00 UTC

Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

Full papers | Machine Learning and Large Language Models · Search and ranking | 02:30 PM – 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30 – 14:00 UTC
Presenters
MW
Mikel Williams-Lekuona
Loughborough University
Co-Authors
GC
Georgina Cosma
Loughborough University

Cross-Sensory Brain Passage Retrieval: Scaling Beyond Visual to Audio

Full papers | Societally-motivated IR research · User aspects in IR | 02:30 PM – 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30 – 14:00 UTC
Query formulation from internal information needs (IN) is required across all IR paradigms (ad hoc retrieval, chatbots, conversational search) but remains fundamentally challenging due to the complexity of realising an IN and to physical or mental impairments. Brain Passage Retrieval (BPR) was proposed to bypass explicit query formulation by directly mapping EEG queries to passage representations without intermediate text translation. However, existing BPR research focuses exclusively on visual stimuli, leaving notable gaps: there is no evidence on whether auditory EEG can serve as an effective query representation, despite auditory processing being central to conversational search and to accessibility for visually impaired users; and, critically, whether training on combined EEG datasets from different modalities improves retrieval performance remains entirely unexplored. To address these gaps, we investigate whether auditory EEG enables effective BPR and what benefits cross-sensory training brings. Using a dual-encoder architecture, we compare four pooling strategies across modalities. Controlled experiments with auditory and visual datasets compare three training regimes: auditory only, visual only, and combined cross-sensory. Results show that auditory EEG consistently outperforms visual EEG across architectures, and that cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). These findings establish auditory neural interfaces as viable for IR and demonstrate that cross-sensory training outperforms individual sensory training, whilst enabling inclusive brain-machine interfaces.
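The MRR and Hit@k figures reported above are standard rank-based metrics computed from a dual encoder's query-passage similarity matrix. A minimal sketch of how such numbers are derived (the function name and input layout are illustrative, not taken from the paper):

```python
import numpy as np

def retrieval_metrics(scores: np.ndarray, relevant: np.ndarray, ks=(1, 10)) -> dict:
    """Compute MRR and Hit@k for single-relevant-passage retrieval.

    scores:   (num_queries, num_passages) similarity matrix
    relevant: (num_queries,) index of the relevant passage for each query
    """
    # Sort passages by descending score, then find the 1-based rank
    # of the relevant passage for each query.
    order = np.argsort(-scores, axis=1)
    ranks = np.array([
        int(np.where(order[q] == relevant[q])[0][0]) + 1
        for q in range(scores.shape[0])
    ])
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Hit@{k}"] = float(np.mean(ranks <= k))
    return metrics
```

For EEG queries, `scores` would come from the dot product between pooled EEG query embeddings and passage embeddings produced by the two encoders.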
Presenters
NM
Niall McGuire
PhD Student, Strathclyde University
Co-Authors
YM
Yashar Moshfeghi
Associate Professor, University Of Strathclyde

Learning Audio–Visual Embeddings with Inferred Latent Interaction Graphs

Full papers | Applications · Search and ranking | 02:30 PM – 04:00 PM (Europe/Amsterdam) | 2026/03/31 12:30 – 14:00 UTC
Learning robust audio–visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences: background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals, because "motorcycle" is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio–Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" → "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): a student network is trained with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (MAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
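The regularizer idea in (3), pulling dependency-linked embeddings together in proportion to soft-label probabilities, could be sketched as follows. The specific weighting (product of soft-label probabilities) and squared-distance penalty are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def latent_interaction_regularizer(audio_emb, visual_emb,
                                   soft_labels_a, soft_labels_v, graph):
    """Illustrative LIR-style penalty.

    audio_emb:     (Na, dim) audio embeddings
    visual_emb:    (Nv, dim) visual embeddings
    soft_labels_a: (Na, num_classes) teacher soft labels for audio clips
    soft_labels_v: (Nv, num_classes) teacher soft labels for visual clips
    graph:         list of directed class-index edges (i, j) from the
                   inferred dependency graph
    """
    loss = 0.0
    for (i, j) in graph:
        # Weight each audio/visual pair by how strongly the teacher
        # believes both ends of the dependency edge.
        w = np.outer(soft_labels_a[:, i], soft_labels_v[:, j])  # (Na, Nv)
        # Squared Euclidean distance between every cross-modal pair.
        d = ((audio_emb[:, None, :] - visual_emb[None, :, :]) ** 2).sum(-1)
        # Weighted average distance: high-confidence linked pairs
        # are pulled together hardest.
        loss += float((w * d).sum() / max(w.sum(), 1e-8))
    return loss
```

In training, this term would be added to the student's metric loss with a tradeoff coefficient, so that dependency-linked but unlabeled pairs are drawn closer without collapsing unrelated classes.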
Presenters
DZ
Donghuo Zeng
KDDI Research, Inc.


