2025
Authors
Cerqueira, V; Santos, M; Roque, L; Baghoussi, Y; Soares, C;
Publication
Progress in Artificial Intelligence - 24th EPIA Conference on Artificial Intelligence, EPIA 2025, Faro, Portugal, October 1-3, 2025, Proceedings, Part I
Abstract
2025
Authors
Lopes, F; Soares, C; Cortez, P;
Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II
Abstract
This research addresses the challenge of generating synthetic data that resembles real-world data while preserving privacy. With privacy laws protecting sensitive information such as healthcare data, accessing sufficient training data becomes difficult, which makes training Machine Learning models harder and results in worse models overall. Recently, there has been increased interest in using Generative Adversarial Networks (GANs) to generate synthetic data, since they enable researchers to generate more data to train their models. GANs, however, may not be suitable for privacy-sensitive data, since they have no concern for the privacy of the generated data. We propose modifying the known Conditional Tabular GAN (CTGAN) model by incorporating a privacy-aware loss function, resulting in the Private CTGAN (PCTGAN) method. Several experiments were carried out using 10 public domain classification datasets, comparing PCTGAN with CTGAN and the state-of-the-art privacy-preserving model, the Differential Privacy CTGAN (DP-CTGAN). The results demonstrated that PCTGAN enables users to fine-tune the privacy-fidelity trade-off through its parameters and, if desired, to achieve a higher level of privacy.
2025
Authors
Roque, L; Cerqueira, V; Soares, C; Torgo, L;
Publication
THIRTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI-25, VOL 39 NO 19
Abstract
The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets - what most studies report - 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.
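The cherry-picking effect described in this abstract can be illustrated with a minimal sketch. It is not the authors' experimental protocol: the function name `fraction_can_be_best`, the use of mean error as the ranking criterion, and the toy error values are all assumptions made for illustration. The sketch enumerates every subset of `k` datasets and counts how many methods come out best on at least one of them.

```python
from itertools import combinations


def fraction_can_be_best(errors, k):
    """Fraction of methods that achieve the lowest mean error on at least
    one subset of k datasets.

    errors: dict mapping method name -> list of per-dataset errors
            (lower is better; all lists have the same length).
    """
    methods = list(errors)
    n_datasets = len(next(iter(errors.values())))
    winners = set()
    for subset in combinations(range(n_datasets), k):
        # Mean error of each method restricted to the chosen datasets.
        mean_err = {m: sum(errors[m][d] for d in subset) / k for m in methods}
        best = min(mean_err.values())
        # Ties count as wins for every tied method.
        winners.update(m for m in methods if mean_err[m] == best)
    return len(winners) / len(methods)


# Hypothetical toy errors for four methods on four datasets.
toy_errors = {
    "method_a": [1.0, 3.0, 2.0, 2.0],
    "method_b": [2.0, 1.0, 3.0, 2.5],
    "method_c": [3.0, 2.0, 1.0, 3.0],
    "method_d": [2.5, 2.5, 2.5, 2.6],
}

share_with_one_dataset = fraction_can_be_best(toy_errors, 1)
share_with_all_datasets = fraction_can_be_best(toy_errors, 4)
```

On this toy data, three of the four methods can be presented as best when a single dataset is picked, while only one remains best when all four datasets are used.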
2025
Authors
Strecht, P; Mendes-Moreira, J; Soares, C;
Publication
MACHINE LEARNING, OPTIMIZATION, AND DATA SCIENCE, LOD 2024, PT I
Abstract
In many organizations with a distributed operation, not only is data collection distributed, but models are also developed and deployed separately. Understanding the combined knowledge of all the local models may be important and challenging, especially in the case of a large number of models. The automated development of consensus models, which aggregate multiple models into a single one, involves several challenges, including fidelity (ensuring that aggregation does not penalize the predictive performance severely) and completeness (ensuring that the consensus model covers the same space as the local models). In this paper, we address the latter, proposing two measures for geometrical and distributional completeness. The first quantifies the proportion of the decision space that is covered by a model, while the second takes into account the concentration of the data that is covered by the model. The use of these measures is illustrated in a real-world example of academic management, as well as four publicly available datasets. The results indicate that distributional completeness in the deployed models is consistently higher than geometrical completeness. Although consensus models tend to be geometrically incomplete, distributional completeness reveals that they cover the regions of the decision space with a higher concentration of data.
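The difference between the two completeness notions can be sketched as follows. This is only an illustration under stated assumptions, not the paper's definitions: it assumes a model is represented as a set of non-overlapping axis-aligned boxes (for example, the regions where a rule-based model makes a prediction), and all function names and toy values are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as a list of (low, high) pairs."""
    vol = 1.0
    for low, high in box:
        vol *= high - low
    return vol


def contains(box, point):
    """True if the point lies inside the box (bounds inclusive)."""
    return all(low <= x <= high for (low, high), x in zip(box, point))


def geometrical_completeness(boxes, space):
    """Proportion of the decision space volume covered by the model.
    Assumes the boxes do not overlap."""
    return sum(box_volume(b) for b in boxes) / box_volume(space)


def distributional_completeness(boxes, points):
    """Proportion of the data points that fall in a covered region."""
    covered = sum(1 for p in points if any(contains(b, p) for b in boxes))
    return covered / len(points)


# Hypothetical 2D example: the model covers one quadrant of the unit square,
# but most of the data is concentrated in that quadrant.
space = [(0.0, 1.0), (0.0, 1.0)]
model_boxes = [[(0.0, 0.5), (0.0, 0.5)]]
data = [(0.1, 0.1), (0.2, 0.4), (0.45, 0.3), (0.9, 0.9)]

geo = geometrical_completeness(model_boxes, space)
dist = distributional_completeness(model_boxes, data)
```

In this toy case the model covers a quarter of the space but three quarters of the data, which mirrors the abstract's observation that distributional completeness can be higher than geometrical completeness.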
2025
Authors
Freitas, F; Brazdil, P; Soares, C;
Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS
Abstract
Many current AutoML platforms include a very large space of alternatives (the configuration space). This increases the probability of including the best one for any dataset but makes the task of identifying it for a new dataset more difficult. In this paper, we explore a method that can reduce a large configuration space to a significantly smaller one and so help to reduce the search time for the potentially best algorithm configuration, with limited risk of significant loss of predictive performance. We empirically validate the method with a large set of alternatives based on five ML algorithms with different sets of hyperparameters and one preprocessing method (feature selection). Our results show that it is possible to reduce the given search space by more than one order of magnitude, from a few thousand to a few hundred items. After reduction, the search for the best algorithm configuration is about one order of magnitude faster than on the original space, without significant loss in predictive performance.
2025
Authors
Roberto, GF; Pereira, DC; Martins, AS; Tosta, TAA; Soares, C; Lumini, A; Rozendo, GB; Neves, LA; Nascimento, MZ;
Publication
PATTERN RECOGNITION LETTERS
Abstract
Covid-19 is a severe illness caused by the SARS-CoV-2 virus, initially identified in China in late 2019 and swiftly spreading globally. Since the virus primarily impacts the lungs, analyzing chest X-rays stands as a reliable and widely accessible means of diagnosing the infection. In computer vision, deep learning models such as CNNs have been the most widely adopted approach for detecting Covid-19 in chest X-ray images. However, we believe that handcrafted features can also provide relevant results, as shown previously in similar image classification challenges. In this study, we propose a method for identifying Covid-19 in chest X-ray images by extracting and classifying local and global percolation-based features. This technique was tested on three datasets: one comprising 2,002 segmented samples categorized into two groups (Covid-19 and Healthy); another with 1,125 non-segmented samples categorized into three groups (Covid-19, Healthy, and Pneumonia); and a third one composed of 4,809 non-segmented images representing three classes (Covid-19, Healthy, and Pneumonia). Then, 48 percolation features were extracted and given as input to six distinct classifiers. Subsequently, the AUC and accuracy metrics were assessed. We used the 10-fold cross-validation approach and evaluated lesion sub-types via binary and multiclass classification using the Hermite polynomial classifier, a novel approach in this domain. The Hermite polynomial classifier exhibited the most promising outcomes compared to five other machine learning algorithms, wherein the best obtained values for accuracy and AUC were 98.72% and 0.9917, respectively. We also evaluated the influence of noise on the features and on the classification accuracy. These results, based on the integration of percolation features with the Hermite polynomial, hold the potential for enhancing lesion detection and supporting clinicians in their diagnostic endeavors.