Resource II: Domain- and Language-specific Datasets

Session Information

  • FaE: A Resource of Logs, Profiles, and Rankings for Academic Expert Finding
  • SciNUP: Natural Language User Interest Profiles for Scientific Literature Recommendation
  • FoodNexus: Massive Food Knowledge for Recommender Systems
  • pt-image-ir-dataset: An Image Retrieval Dataset in European Portuguese
  • CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
  • ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles
  • BioGraphletQA: Knowledge-Anchored Generation of Complex Question Answering Datasets
Apr 01, 2026 10:30 - 12:30 (Europe/Amsterdam)
Venue: Chemie
Organizer: ECIR2026 (n.fontein@tudelft.nl)

Sub Sessions

FaE: A Resource of Logs, Profiles, and Rankings for Academic Expert Finding

Resource
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
Expert-finding systems aim to identify knowledgeable individuals in specific domains based on evidence such as publications, activities, and social network data. Academic uses include identifying potential supervisors, collaborators, or peer reviewers. However, most existing benchmark datasets for academic expert finding contain only publication information and lack authentic query logs. We introduce the Find an Expert (FaE) dataset from the University of Melbourne, comprising three interconnected components: structured profiles for 8,984 academic staff, providing text biographies, research interests, and lists of (and links to) publications and current projects; 712,937 interaction records captured over 239 days in 2025, which record queries, clicks, and temporal patterns; and system-generated rankings for 530 queries where users clicked on profiles. Together, these resources provide the first publicly available expert-finding dataset combining profile data, log interactions, and system outputs.
Presenters
Marjan Azimi
PhD Candidate, University of Melbourne
Co-Authors
Alistair Moffat
The University of Melbourne
Justin Zobel

SciNUP: Natural Language User Interest Profiles for Scientific Literature Recommendation

Resource
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
The use of natural language (NL) user profiles in recommender systems offers greater transparency and user control than traditional representations. However, there is a scarcity of large-scale, publicly available test collections for evaluating NL profile-based recommendation. To address this gap, we introduce SciNUP, a novel synthetic dataset for scholarly recommendation that leverages authors' publication histories to generate NL profiles and corresponding ground truth items. We use this dataset to compare baseline methods, ranging from sparse and dense retrieval approaches to state-of-the-art LLM-based rerankers. Our results show that while baseline methods achieve comparable performance, they often retrieve different items, indicating complementary behaviors. At the same time, considerable headroom for improvement remains, highlighting the need for effective NL-based recommendation approaches. The SciNUP dataset thus serves as a valuable resource for fostering future research and development in this area.
Presenters
Mariam Arustashvili
PhD Fellow, University of Stavanger
Co-Authors
Krisztian Balog
Professor, University of Stavanger

FoodNexus: Massive Food Knowledge for Recommender Systems

Resource; Evaluation research; Recommender systems
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
Presenters
Ludovico Boratto
Associate Professor of Computer Science, University of Cagliari
Co-Authors
Mirko Marras
Tenure-Track Assistant Professor, University of Cagliari
Giacomo Medda
University of Cagliari

pt-image-ir-dataset: An Image Retrieval Dataset in European Portuguese

Resource
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
With the surge of multimodal models and the demand for effective image Information Retrieval (IR) systems, high-quality text-to-image datasets have become paramount. However, most existing datasets are in English, limiting their applicability to multilingual settings. To address this, we introduce the pt-image-ir-dataset, a manually annotated resource for text-based Image IR in European Portuguese. The dataset comprises 80 diverse queries and a curated pool of 5,201 images, each annotated for relevance by multiple human judges. The proposed dataset is a step forward in supporting the development and evaluation of image IR systems for European Portuguese, addressing a clear gap in multilingual multimodal research. To this end, we have made our dataset publicly available, alongside baseline experimental results demonstrating its suitability for the Image IR task across different retrieval paradigms, including traditional text-based lexical IR methods, semantic dense retrieval models based on language embeddings, cutting-edge vision-language models, and end-to-end image retrieval systems. Results demonstrate that vision-language models, particularly OpenCLIP/xlm-roberta-base-ViT-B-32, significantly outperform other approaches (MRR = 0.610).
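For readers unfamiliar with the metric reported above, Mean Reciprocal Rank (MRR) averages, over all queries, the reciprocal of the rank at which the first relevant item appears. The following is a minimal sketch only: the function name, the query and image identifiers, and the toy judgments are hypothetical and are not taken from the pt-image-ir-dataset or its evaluation code.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Average over queries of 1/rank of the first relevant item.

    A query whose ranking contains no relevant item contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        judged_relevant = relevant.get(query_id, set())
        for rank, item_id in enumerate(ranked_ids, start=1):
            if item_id in judged_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy example (hypothetical identifiers and judgments).
rankings = {
    "q1": ["img3", "img1", "img7"],  # first relevant image at rank 2
    "q2": ["img2", "img9"],          # first relevant image at rank 1
}
relevant = {"q1": {"img1"}, "q2": {"img2"}}

mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR = {mrr:.3f}")
```

Under this definition, an MRR of 0.610 would mean that, averaged over queries, the reciprocal rank of the first relevant image is 0.610, which is higher than it would be if that image always appeared at rank 2.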
Presenters
Rodrigo Duarte
University of Beira Interior; INESC TEC
Co-Authors
Ricardo Campos
Professor, University of Beira Interior / INESC TEC

CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes

Resource; Machine Learning and Large Language Models; Societally-motivated IR research
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
Presenters
Ricardo Campos
Professor, University of Beira Interior / INESC TEC
Co-Authors
Ana Pacheco
University of Porto; INESC TEC
Ana Fernandes
University of Porto; INESC TEC
Rute Rebouças
University of Porto; INESC TEC
Miguel Marques
University of Beira Interior; INESC TEC

ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

Resource; Machine Learning and Large Language Models; Societally-motivated IR research
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because corrections typically take longer to reach consumers than the original misinformation does; accelerating corrections through automation can therefore help combat misinformation more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claim sentences is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. European Portuguese, like other low-resource languages, still lacks accessible and licensed datasets, limiting both research and NLP tool development. In this paper, we introduce ClaimPT, a new dataset of annotated claims from European Portuguese news articles, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources, which are based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure high-quality annotations, each article was manually annotated by two trained annotators and validated by a curator, following a newly proposed annotation scheme. We also provide baseline models for claim detection, establishing initial performance benchmarks and enabling future applications of Natural Language Processing (NLP) and Information Retrieval (IR) techniques. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.
Presenters
Ricardo Campos
Professor, University of Beira Interior / INESC TEC

BioGraphletQA: Knowledge-Anchored Generation of Complex Question Answering Datasets

Resource
10:30 AM - 11:00 AM (Europe/Amsterdam), 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. First, we demonstrate the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Second, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
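The general idea of anchoring a prompt in a small KG subgraph can be illustrated with a toy sketch. This is not the authors' implementation (their code is at the GitHub link above): the breadth-first extraction strategy, the function names `extract_graphlet` and `build_prompt`, the prompt wording, and the toy triples are all assumptions made for illustration, and the triples are not drawn from the OREGANO KG.

```python
from collections import deque
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def extract_graphlet(triples: List[Triple], seed: str, max_nodes: int = 5) -> List[Triple]:
    """Grow a connected subgraph around `seed` breadth-first, capped at `max_nodes` nodes."""
    nodes = {seed}
    edges: List[Triple] = []
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for head, rel, tail in triples:
            if current not in (head, tail):
                continue
            other = tail if head == current else head
            if other not in nodes:
                if len(nodes) >= max_nodes:
                    continue  # node budget exhausted: skip edges to new nodes
                nodes.add(other)
                queue.append(other)
            if (head, rel, tail) not in edges:
                edges.append((head, rel, tail))
    return edges


def build_prompt(graphlet: List[Triple]) -> str:
    """Format the graphlet as a structured prompt (hypothetical wording)."""
    facts = "\n".join(f"- {h} --{r}--> {t}" for h, r, t in graphlet)
    return (
        "Using ONLY the facts below, write one question that requires "
        "combining all of them, followed by its answer.\n" + facts
    )


# Toy knowledge graph (illustrative triples only).
toy_kg: List[Triple] = [
    ("aspirin", "targets", "PTGS2"),
    ("PTGS2", "involved_in", "inflammation"),
    ("aspirin", "treats", "pain"),
    ("ibuprofen", "targets", "PTGS2"),
    ("ibuprofen", "treats", "fever"),
    ("fever", "symptom_of", "influenza"),
]

graphlet = extract_graphlet(toy_kg, seed="aspirin", max_nodes=5)
graphlet_nodes = {n for h, _, t in graphlet for n in (h, t)}
prompt = build_prompt(graphlet)
print(prompt)
```

Capping the node count bounds how many facts a generated question can combine, which is one plausible way a graphlet can control question complexity while keeping every fact traceable to the KG.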
Presenters
Richard Jonker
PhD Student, IEETA - University of Aveiro, Portugal