Resource II: Domain- and Language-specific Datasets
ECIR 2026 · 2026/04/01, 10:30–12:30 (Europe/Amsterdam)
FaE: A Resource of Logs, Profiles, and Rankings for Academic Expert Finding
Resource · 10:30 AM - 11:00 AM (Europe/Amsterdam) · 2026/04/01 08:30:00 UTC - 2026/04/01 09:00:00 UTC
Expert-finding systems aim to identify knowledgeable individuals in specific domains based on evidence such as publications, activities, and social network data; academic uses include identifying potential supervisors, collaborators, or peer reviewers. However, most existing benchmark datasets for academic expert finding contain only publication information and lack authentic query logs. We introduce the Find an Expert (FaE) dataset from the University of Melbourne, comprising three interconnected components: structured profiles for 8,984 academic staff, providing text biographies, research interests, and lists of (and links to) publications and current projects; 712,937 interaction records captured over 239 days in 2025, recording queries, clicks, and temporal patterns; and system-generated rankings for the 530 queries on which users clicked profiles. Together, these resources provide the first publicly available expert-finding dataset combining profile data, interaction logs, and system outputs.
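The three interconnected components described in the abstract might be modeled as record types like the following. This is a minimal sketch with hypothetical field names (`staff_id`, `clicked_profile`, etc.); the released dataset's actual schema is not shown here and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schemas for the three FaE components; field names
# are illustrative, not the dataset's documented format.

@dataclass
class Profile:
    staff_id: str
    biography: str
    research_interests: List[str]
    publications: List[str]   # links to publication records
    projects: List[str]       # links to current projects

@dataclass
class Interaction:
    timestamp: str            # one of 712,937 log records over 239 days
    query: str
    clicked_profile: Optional[str]  # staff_id if a profile was clicked

@dataclass
class Ranking:
    query: str                  # one of the 530 click-yielding queries
    ranked_profiles: List[str]  # system-generated ordering of staff_ids

def clicked_rank(ranking: Ranking, interaction: Interaction) -> Optional[int]:
    """1-based rank of the clicked profile in the system output, or None."""
    if interaction.clicked_profile in ranking.ranked_profiles:
        return ranking.ranked_profiles.index(interaction.clicked_profile) + 1
    return None
```

Joining the log component to the rankings via the query string would link each click to the ranking the user actually saw, which is what makes the combination of profiles, logs, and system outputs useful for click-model and evaluation studies.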
SciNUP: Natural Language User Interest Profiles for Scientific Literature Recommendation
Resource
The use of natural language (NL) user profiles in recommender systems offers greater transparency and user control compared to traditional representations. However, there is a scarcity of large-scale, publicly available test collections for evaluating NL profile-based recommendation. To address this gap, we introduce SciNUP, a novel synthetic dataset for scholarly recommendation that leverages authors' publication histories to generate NL profiles and corresponding ground-truth items. We use this dataset to conduct a comparison of baseline methods, ranging from sparse and dense retrieval approaches to state-of-the-art LLM-based rerankers. Our results show that while the baseline methods achieve comparable performance, they often retrieve different items, indicating complementary behaviors. At the same time, considerable headroom for improvement remains, highlighting the need for more effective NL-based recommendation approaches. The SciNUP dataset thus serves as a valuable resource for fostering future research and development in this area.
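The observation that baselines with comparable accuracy retrieve different items can be quantified with a simple top-k overlap measure between two rankings. This is an illustrative sketch of such a measure, not the paper's actual analysis; the rankings below are made up.

```python
def overlap_at_k(run_a, run_b, k=10):
    """Fraction of shared items in the top-k of two ranked lists.

    Low overlap between systems with similar accuracy suggests
    complementary behavior (e.g., candidates for fusion).
    """
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / k

# Hypothetical top-4 rankings from two baseline retrievers
sparse = ["p1", "p2", "p3", "p4"]
dense = ["p3", "p9", "p1", "p7"]
print(overlap_at_k(sparse, dense, k=4))  # 0.5: only p1 and p3 are shared
```

A natural follow-up, when overlap is low, is rank fusion of the complementary runs, which is one common way to exploit exactly this behavior.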
Presenter: Mariam Arustashvili, PhD Fellow, University of Stavanger
pt-image-ir-dataset: An Image Retrieval Dataset in European Portuguese
Resource
With the surge of multimodal models and the demand for effective image Information Retrieval (IR) systems, high-quality text-to-image datasets have become paramount. However, most existing datasets are primarily in English, limiting their applicability to multilingual settings. To address this, we introduce the pt-image-ir-dataset, a manually annotated resource for text-based image IR in European Portuguese. The dataset comprises 80 diverse queries and a curated pool of 5,201 images, each annotated for relevance by multiple human judges. The dataset is a step forward in supporting the development and evaluation of image IR systems for European Portuguese, addressing a clear gap in multilingual multimodal research. To this end, we have made the dataset publicly available alongside baseline experimental results that demonstrate its suitability for the image IR task across different retrieval paradigms, including traditional lexical text-based IR methods, semantic dense retrieval models based on language embeddings, cutting-edge vision-language models, and end-to-end image retrieval systems. Results show that vision-language models, particularly OpenCLIP/xlm-roberta-base-ViT-B-32, significantly outperform other approaches (MRR = 0.610).
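The reported metric, Mean Reciprocal Rank (MRR), averages over queries the reciprocal rank of the first relevant item. A minimal sketch of the computation; the run/qrel dictionary structures and identifiers here are hypothetical, not the dataset's release format:

```python
def mean_reciprocal_rank(runs, qrels):
    """MRR over a set of queries.

    runs:  {query_id: ranked list of image ids}
    qrels: {query_id: set of relevant image ids}
    Each query scores 1/rank of the first relevant image retrieved
    (0 if none appears in the ranking); MRR is the mean over queries.
    """
    total = 0.0
    for qid, ranking in runs.items():
        for rank, image_id in enumerate(ranking, start=1):
            if image_id in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(runs)

# Toy example with two queries
runs = {"q1": ["img3", "img1"], "q2": ["img9", "img2", "img5"]}
qrels = {"q1": {"img1"}, "q2": {"img5"}}
print(mean_reciprocal_rank(runs, qrels))  # (1/2 + 1/3) / 2 ≈ 0.417
```

An MRR of 0.610 thus corresponds, roughly, to the first relevant image appearing between ranks 1 and 2 on average across the 80 queries.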
Presenter: Ricardo Campos, Professor, University of Beira Interior / INESC TEC
CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
Resource · Machine Learning and Large Language Models · Societally-motivated IR research
ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles
Resource · Machine Learning and Large Language Models · Societally-motivated IR research
Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This matters because debunking false information typically takes longer to reach consumers than the original misinformation does; accelerating corrections through automation can therefore help combat misinformation more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claim sentences is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. European Portuguese, like other low-resource languages, still lacks accessible, licensed datasets, limiting both research and the development of NLP tools. In this paper, we introduce ClaimPT, a new dataset of annotated claims from European Portuguese news articles, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese news agency. To ensure high-quality annotations, each article was manually annotated by two trained annotators and validated by a curator, following a newly proposed annotation scheme. We also provide baseline models for claim detection, establishing initial performance benchmarks and enabling future applications of Natural Language Processing (NLP) and Information Retrieval (IR) techniques. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.
Presenter: Ricardo Campos, Professor, University of Beira Interior / INESC TEC
BioGraphletQA: Knowledge-Anchored Generation of Complex Question Answering Datasets
Resource
This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, in which small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models (LLMs). The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, and most pairs are enriched with relevant document snippets from PubMed. We first demonstrate the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. We then establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility, and extension.
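The graphlet-anchored process described above could be sketched as follows: render a small KG subgraph into a structured prompt that constrains the generator to the graphlet's facts, with the node cap controlling question complexity. The triple format, node limit check, and prompt wording here are illustrative assumptions, not the paper's actual template or code.

```python
def graphlet_to_prompt(triples, max_nodes=5):
    """Build a generation prompt anchored to a KG graphlet.

    triples: (subject, relation, object) edges from a KG subgraph.
    Listing the facts explicitly and instructing the model to use
    only them is what provides factual grounding; the node cap
    bounds the question's complexity.
    """
    nodes = {n for s, _, o in triples for n in (s, o)}
    if len(nodes) > max_nodes:
        raise ValueError(f"graphlet exceeds {max_nodes} nodes")
    facts = "\n".join(f"- {s} --{r}--> {o}" for s, r, o in triples)
    return (
        "Using ONLY the facts below, write one complex question "
        "and its answer.\nFacts:\n" + facts
    )

# Toy biomedical graphlet (invented triples, not from OREGANO)
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
]
print(graphlet_to_prompt(triples))
```

The resulting prompt would then be sent to an LLM, and in the paper's pipeline the generated pair is further enriched with PubMed snippets and validated.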