2025
Authors
Lopes, F; Soares, C; Cortez, P;
Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II
Abstract
This research addresses the challenge of generating synthetic data that resembles real-world data while preserving privacy. With privacy laws protecting sensitive information such as healthcare data, accessing sufficient training data becomes difficult, which makes training Machine Learning models harder and leads to worse models overall. Recently, there has been increased interest in using Generative Adversarial Networks (GANs) to generate synthetic data, since they enable researchers to generate more data to train their models. GANs, however, may not be suitable for privacy-sensitive data since they have no concern for the privacy of the generated data. We propose modifying the well-known Conditional Tabular GAN (CTGAN) model by incorporating a privacy-aware loss function, resulting in the Private CTGAN (PCTGAN) method. Several experiments were carried out using 10 public domain classification datasets, comparing PCTGAN with CTGAN and the state-of-the-art privacy-preserving model, the Differential Privacy CTGAN (DP-CTGAN). The results demonstrated that PCTGAN enables users to fine-tune the privacy-fidelity trade-off by adjusting its parameters and, if desired, to achieve a higher level of privacy.
2025
Authors
Silva, Aline Santos; Plácido da Silva, Hugo; Correia, Miguel; Gonçalves da Costa, Andreia Cristina; Laranjo, Sérgio;
Publication
Abstract
Our team previously introduced an innovative concept for an "invisible"
Electrocardiography (ECG) system, incorporating electrodes and sensors into a
toilet seat design to enable signal acquisition from the thighs. Building upon
that work, we now present a novel dataset featuring real-world, single-lead
ECG signals captured at the thighs, offering a valuable resource for advancing
research on thigh-based ECG for cardiovascular disease assessment. To our
knowledge, this is the first dataset of its kind.
The tOLIet dataset comprises 149 ECG recordings collected from 86 individuals
(50 females, 36 males) with an average age of 31.73 ± 13.11 years, a mean
weight of 66.89 ± 10.70 kg, and an average height of 166.82 ± 6.07 cm.
Participants were recruited through direct contact with the Principal
Investigator at Centro Hospitalar Universitario de Lisboa Central (CHULC) and
via clinical consultations conducted at the same institution. Each recording
includes four differential signals acquired from electrode pairs embedded in
the toilet seat, with reference signals obtained from a standard 12-lead
hospital ECG system.
2025
Authors
Neto, R; Alencar, B; Gomes, HM; Bifet, A; Gama, J; Cassales, G; Rios, R;
Publication
DATA MINING AND KNOWLEDGE DISCOVERY
Abstract
Traditional machine learning techniques assume that data is drawn from a stationary source. This assumption is challenged in data stream settings, which present constant and potentially infinite sequences whose distribution is prone to change over time. In these settings, detecting changes (a.k.a. concept drifts) is necessary to keep learning models up-to-date. Although state-of-the-art detection methods were designed to monitor the loss of predictive models, such monitoring falls short in many real-world scenarios where the true labels are not readily available. Therefore, unsupervised concept drift detection methods, such as the one addressed in this paper, are receiving increasing attention. In this work, we present an unsupervised and interpretable method based on Radial Basis Function Networks (RBFN) and Markov Chains (MC), referred to as RMIDDM (Radial Markov Interpretable Drift Detection Method). In our method, the RBFN performs, in the intermediate layer, an activation process that implicitly produces groups of observations collected over time. Simultaneously, the MC models the transitions between groups to support the detection of concept drifts, which happens when the active group changes and its probability exceeds a given threshold. A set of experiments with synthetic datasets and comparisons with state-of-the-art algorithms demonstrated that the proposed method can detect drifts at runtime in an efficient, interpretable, and label-independent way, presenting competitive results and behavior. Additionally, to show its applicability in a real-world scenario, we analyzed new COVID-19 cases, deaths, and vaccinations to identify new waves as concept drifts and generate Markov models that allow understanding of their interaction.
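The general idea described in this abstract (RBF activations that group observations, plus a Markov chain over group transitions) can be illustrated with a minimal sketch. This is not the authors' RMIDDM implementation: the fixed `centers`, the `gamma` width, the class name `MarkovDriftSketch`, and the reading of "its probability" as the new group's estimated self-transition probability are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def rbf_activations(x, centers, gamma=1.0):
    """Gaussian RBF activation of observation x for each (assumed fixed) center."""
    return [
        math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))
        for c in centers
    ]


class MarkovDriftSketch:
    """Hypothetical label-free drift detector: RBF groups + transition counts."""

    def __init__(self, centers, threshold=0.7, gamma=1.0):
        self.centers = centers
        self.threshold = threshold
        self.gamma = gamma
        # counts[i][j] = number of observed transitions from group i to group j
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None     # group of the previous observation
        self.concept = None  # group currently treated as the active concept

    def self_transition_prob(self, g):
        """Empirical probability of staying in group g (0.0 if g is unseen)."""
        total = sum(self.counts[g].values())
        return self.counts[g][g] / total if total else 0.0

    def update(self, x):
        """Process one observation; return True if a drift is signalled."""
        acts = rbf_activations(x, self.centers, self.gamma)
        g = max(range(len(acts)), key=acts.__getitem__)  # most active group
        if self.prev is not None:
            self.counts[self.prev][g] += 1
        self.prev = g
        if self.concept is None:
            self.concept = g
            return False
        # Assumption: drift = active group changed AND the stream has settled
        # in the new group (self-transition probability above the threshold).
        if g != self.concept and self.self_transition_prob(g) > self.threshold:
            self.concept = g
            return True
        return False


# Toy stream: 20 observations around one center, then 20 around another.
detector = MarkovDriftSketch(centers=[(0.0, 0.0), (5.0, 5.0)])
stream = [(0.0, 0.0)] * 20 + [(5.0, 5.0)] * 20
drifts = [i for i, x in enumerate(stream) if detector.update(x)]
```

Under this reading, a single outlier that falls into another group and immediately returns does not trigger a drift, because the other group's self-transition probability stays at zero.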
2025
Authors
Rodrigues, P; Teixeira, C; Guimaraes, L; Ferreira, NGC;
Publication
MOLECULAR BIOLOGY REPORTS
Abstract
Bees play a critical role as pollinators in ecosystem services, contributing significantly to the sexual reproduction and diversity of plants. The Caatinga biome in Brazil, home to around 200 bee species, provides an ideal habitat for these species due to its unique climate conditions. However, this biome faces threats from anthropogenic processes, making it urgent to characterise the local bee populations efficiently. Traditional taxonomic surveys for bee identification are complex due to the lack of suitable keys and expertise required. As a result, molecular barcoding has emerged as a valuable tool, using genome regions to compare and identify bee species. However, little is known about Caatinga bees to develop these molecular tools further. This study addresses this gap, providing an updated list of 262 Caatinga bee species across 86 genera and identifying approximately 40 primer sets to aid in barcoding these species. The findings highlight the ongoing work needed to fully characterise the Caatinga biome's bee distribution and species or subspecies to support more effective monitoring and conservation efforts.
2025
Authors
Roque, L; Cerqueira, V; Soares, C; Torgo, L;
Publication
THIRTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI-25, VOL 39 NO 19
Abstract
The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets - what most studies report - 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.
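The cherry-picking effect described in this abstract can be illustrated with a small sketch that enumerates every subset of k datasets and records which methods could be reported as best on at least one of them. The function name `can_be_best`, the use of mean error as the ranking criterion, and the toy error values are assumptions for illustration only; they are not the paper's protocol or data.

```python
from itertools import combinations


def can_be_best(errors, k):
    """Return the set of methods that win on at least one k-dataset subset.

    errors: dict mapping method name -> list of per-dataset errors
            (lower is better; all lists have the same length).
    """
    methods = list(errors)
    n_datasets = len(next(iter(errors.values())))
    winners = set()
    for subset in combinations(range(n_datasets), k):
        # Assumed criterion: mean error over the chosen datasets.
        mean_err = {m: sum(errors[m][d] for d in subset) / k for m in methods}
        winners.add(min(mean_err, key=mean_err.get))
    return winners


# Hypothetical errors of four methods on four datasets.
toy_errors = {
    "A": [1, 5, 5, 5],
    "B": [5, 1, 5, 5],
    "C": [3, 3, 3, 3],
    "D": [6, 6, 6, 6],
}
best_with_one = can_be_best(toy_errors, k=1)   # report a single chosen dataset
best_with_all = can_be_best(toy_errors, k=4)   # report the full benchmark
```

In this toy setup, three of the four methods can be presented as the best one when only a single dataset is reported, while evaluating on all four datasets leaves a single winner.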
2025
Authors
Giagnolini, L; Koch, I; Tomasi, F; Teixeira Lopes, C;
Publication
Journal of Documentation
Abstract
Purpose – This study aims to comparatively evaluate two semantic models, ArchOnto (CIDOC CRM based) and Records in Contexts Ontology (RiC-O), for archival representation within the Linked Open Data framework. The research seeks to critically analyse their ability to represent archival documents, events, activities, and provenance through the application on a case study of historical baptism records. Design/methodology/approach – The study adopted a comparative approach, utilising the two models to represent a dataset of baptism records from a Portuguese parish spanning several centuries. This involved information extraction and conversion processes, transforming XML EAD finding aids into RDF to facilitate more explicit semantic representation and analysis. Findings – The analysis revealed distinctive strengths and limitations of each semantic model, providing nuanced insights into their respective capacities for archival description. The findings guide cultural heritage institutions in selecting and implementing the most suitable semantic model for their needs and pave the way for semantic alignment between the two models. Research limitations/implications – Although the case study explored the representation of a wide range of features, potential limitations include the specific contextual constraints of parish records and the need for broader comparative studies across diverse archival contexts. Originality/value – This paper offers original insights into semantic modelling for archival representations by providing a detailed comparative analysis of two ontological approaches. It offers valuable perspectives for archivists, digital humanities researchers, and cultural heritage professionals seeking to enhance the semantic richness of archival descriptions. © 2025 Emerald Publishing Limited