Publications - INESC TEC

Publications by Pedro Henriques Abreu

2018

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches

Authors
Santos, MS; Soares, JP; Abreu, PH; Araujo, H; Santos, J

Publication
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE

Abstract
Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers less familiar with the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset, resulting in biased models and overly optimistic estimates. We emphasize and distinguish overoptimism from overfitting, showing that the former is associated with the cross-validation procedure, while the latter is influenced by the chosen oversampling algorithm. Furthermore, we perform a thorough empirical comparison of well-established oversampling algorithms, supported by a data complexity analysis. The best oversampling techniques seem to possess three key characteristics: use of cleaning procedures, cluster-based synthesis of examples, and adaptive weighting of minority examples. Synthetic Minority Oversampling Technique coupled with Tomek Links and Majority Weighted Minority Oversampling Technique stand out, being capable of increasing the discriminative power of data.
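The flaw described in this abstract, oversampling the entire dataset before cross-validation, can be avoided by oversampling only the training portion of each fold. The following is a minimal sketch under stated assumptions, not the authors' experimental code: plain random oversampling (duplicating minority examples) stands in for the algorithms compared in the paper, and the function names `stratified_folds`, `random_oversample` and `cv_with_oversampling` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def random_oversample(indices, labels, rng):
    """Duplicate minority-class indices until every class matches the largest one.

    Assumption: random duplication is used here only as a simple stand-in
    for a real oversampling algorithm.
    """
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced


def cv_with_oversampling(labels, k=5, seed=0):
    """Yield (oversampled training indices, untouched test indices) per fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Oversampling sees the training indices only, never the test fold.
        yield random_oversample(train_idx, labels, rng), test_idx


# Toy imbalanced problem: 90 majority examples and 10 minority examples.
labels = [0] * 90 + [1] * 10
splits = list(cv_with_oversampling(labels, k=5))
```

Because the test fold is set aside before any example is duplicated, no copy of a test example can appear in the training data, which is the leakage that produces overly optimistic estimates when oversampling is applied first.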

2018

BI-RADS CLASSIFICATION OF BREAST CANCER: A NEW PRE-PROCESSING PIPELINE FOR DEEP MODELS TRAINING

Authors
Domingues, I; Abreu, PH; Santos, J

Publication
2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)

Abstract
One of the main difficulties in the use of deep learning strategies in medical contexts is the training set size. While these methods need large annotated training sets, these datasets are costly to obtain in medical contexts and suffer from intra- and inter-subject variability. In the present work, two new pre-processing techniques are introduced to improve a deep classifier's performance. First, data augmentation based on co-registration is suggested. Then, multi-scale enhancement based on Difference of Gaussians is proposed. Results are assessed on a public mammogram database, the InBreast, in the context of an ordinal problem, the BI-RADS classification. Moreover, a pre-trained Convolutional Neural Network with the AlexNet architecture was used as a base classifier. The multi-class classification experiments show that the proposed pipeline with the Difference of Gaussians and the data augmentation technique outperforms using the original dataset only and using the original dataset augmented by mirroring the images.
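As an illustration of the Difference of Gaussians idea mentioned in this abstract, the sketch below blurs an image at two scales and subtracts the results, which keeps structures whose size lies between the two scales. It is a minimal pure-Python sketch, not the paper's pipeline: the sigma values, the border handling (edge replication) and all function names are assumptions chosen for the example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigmas."""
    radius = max(1, round(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve a sequence with the kernel, replicating the border values."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * values[j]
        out.append(acc)
    return out


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows first, then the columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(col, kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def difference_of_gaussians(image, sigma_fine, sigma_coarse):
    """Subtract the coarse blur from the fine blur (a band-pass response)."""
    fine = gaussian_blur(image, sigma_fine)
    coarse = gaussian_blur(image, sigma_coarse)
    return [[f - c for f, c in zip(f_row, c_row)]
            for f_row, c_row in zip(fine, coarse)]


def multiscale_dog(image, sigmas=(1.0, 2.0, 4.0)):
    """One band per consecutive pair of scales (the scale values are assumed)."""
    return [difference_of_gaussians(image, a, b)
            for a, b in zip(sigmas, sigmas[1:])]


# Toy 21x21 image with a single bright pixel in the centre.
image = [[0.0] * 21 for _ in range(21)]
image[10][10] = 1.0
bands = multiscale_dog(image)
```

A flat image yields a response of approximately zero at every pixel, since both blurs leave it unchanged, while the bright pixel produces a positive response at its location in each band.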

2020

Interpretability vs. Complexity: The Friction in Deep Neural Networks

Authors
Amorim, JP; Abreu, PH; Reyes, M; Santos, J

Publication
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

Abstract
Saliency maps have been used as one possibility to interpret deep neural networks. This method estimates the relevance of each pixel in the image classification, with higher values representing pixels which contribute positively to classification. The goal of this study is to understand how the complexity of the network affects the interpretability of the saliency maps in classification tasks. To achieve that, we investigate how changes in the regularization affect the saliency maps produced, and their fidelity to the overall classification process of the network. The experimental setup consists of calculating the fidelity of five saliency map methods, which were compared by applying them to models trained on the CIFAR-10 dataset, using different levels of weight decay on some or all of the layers. Achieved results show that models with lower regularization are statistically (significance of 5%) more interpretable than the other models. Also, regularization applied only to the higher convolutional layers or fully-connected layers produces saliency maps with more fidelity.

2018

Exploring the effects of data distribution in missing data imputation

Authors
Pompeu Soares, J; Seoane Santos, M; Henriques Abreu, P; Araújo, H; Santos, J

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
In data imputation problems, researchers typically use several techniques, individually or in combination, in order to find the one that presents the best performance over all the features comprised in the dataset. This strategy, however, neglects the nature of data (data distribution) and makes the generalisation of the findings impractical, since for new datasets a huge number of new, time-consuming experiments need to be performed. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic for the choice of proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected considering different sample sizes, numbers of features, distributions and contexts, and missing values were inserted at different percentages and scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between features' distribution and algorithms' performance, and that their performance seems to be affected by the combination of missing rate and scenario at stake, and also by other less obvious factors such as sample size, goodness-of-fit of features and the ratio between the number of features and the different distributions comprised in the dataset. © Springer Nature Switzerland AG 2018.

2022

The impact of heterogeneous distance functions on missing data imputation and classification performance

Authors
Santos, MS; Abreu, PH; Fernandez, A; Luengo, J; Santos, J

Publication
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE

Abstract
This work performs an in-depth study of the impact of distance functions on K-Nearest Neighbours imputation of heterogeneous datasets. Missing data is generated at several percentages, on a large benchmark of 150 datasets (50 continuous, 50 categorical and 50 heterogeneous datasets) and data imputation is performed using different distance functions (HEOM, HEOM-R, HVDM, HVDM-R, HVDM-S, MDE and SIMDIST) and k values (1, 3, 5 and 7). The impact of distance functions on kNN imputation is then evaluated in terms of classification performance, through the analysis of a classifier learned from the imputed data, and in terms of imputation quality, where the quality of the reconstruction of the original values is assessed. By analysing the properties of heterogeneous distance functions over continuous and categorical datasets individually, we then study their behaviour over heterogeneous data. We discuss whether datasets with different natures may benefit from different distance functions and to what extent the component of a distance function that deals with missing values influences such choice. Our experiments show that missing data has a significant impact on distance computation and the obtained results provide guidelines on how to choose appropriate distance functions depending on data characteristics (continuous, categorical or heterogeneous datasets) and the objective of the study (classification or imputation tasks).
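To make the setting of this abstract concrete, the sketch below pairs one of the listed distance functions, HEOM (Heterogeneous Euclidean-Overlap Metric), with a simple kNN imputation loop. It follows the commonly cited definition of HEOM (overlap distance for categorical features, range-normalized difference for continuous ones, maximum distance of 1 when a value is missing) and is not the authors' implementation. The function names, the use of `None` for missing values, and the mean/mode aggregation over neighbours are assumptions for the example.

```python
import math
from collections import Counter


def heom(a, b, categorical, ranges):
    """HEOM distance between two rows; None marks a missing value."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            d = 1.0  # a missing value contributes the maximum distance
        elif j in categorical:
            d = 0.0 if x == y else 1.0  # overlap distance
        else:
            d = abs(x - y) / ranges[j] if ranges[j] else 0.0
        total += d * d
    return math.sqrt(total)


def feature_ranges(rows, categorical):
    """Observed range (max - min) of every continuous feature."""
    ranges = {}
    for j in range(len(rows[0])):
        if j in categorical:
            continue
        observed = [r[j] for r in rows if r[j] is not None]
        ranges[j] = max(observed) - min(observed) if observed else 0.0
    return ranges


def knn_impute(rows, categorical, k=3):
    """Fill each missing value from the k nearest rows that observe that feature."""
    ranges = feature_ranges(rows, categorical)
    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
            donors.sort(key=lambda r: heom(row, r, categorical, ranges))
            nearest = [r[j] for r in donors[:k]]
            if not nearest:
                continue  # no donor observes this feature; leave it missing
            if j in categorical:
                completed[i][j] = Counter(nearest).most_common(1)[0][0]  # mode
            else:
                completed[i][j] = sum(nearest) / len(nearest)  # mean
    return completed


# Toy heterogeneous dataset: features 0 and 2 are continuous, feature 1 is categorical.
rows = [
    [1.0, "a", 10.0],
    [1.2, "a", 12.0],
    [5.0, "b", 50.0],
    [5.1, "b", None],
    [0.9, None, 11.0],
]
completed = knn_impute(rows, categorical={1}, k=1)
```

With `k=1`, the fourth row takes its missing continuous value from the third row, its closest neighbour, and the fifth row takes its missing category from the first row. Swapping `heom` for another distance function changes which neighbours are selected, which is the effect the study evaluates.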

2017

Image descriptors in radiology images: a systematic review

Authors
Nogueira, MA; Abreu, PH; Martins, P; Machado, P; Duarte, H; Santos, J

Publication
ARTIFICIAL INTELLIGENCE REVIEW

Abstract
Clinical decisions are sometimes based on a variety of patient information, such as age, weight or information extracted from image exams, among others. Depending on the nature of the disease or anatomy, clinicians can base their decisions on different image exams, such as mammograms, positron emission tomography (PET) scans or magnetic resonance images. However, the analysis of those exams is far from a trivial task. Over the years, the use of image descriptors (computational algorithms that present a summarized description of image regions) has become an important tool to assist the clinician in such tasks. This paper presents an overview of the use of image descriptors in healthcare contexts, covering different image exams. In the making of this review, we analyzed over 70 studies related to the application of image descriptors of different natures (e.g., intensity, texture, shape) in medical image analysis. Four imaging modalities are featured: mammography, PET, CT and MRI. Pathologies typically covered by these modalities are addressed: breast masses and microcalcifications in mammograms, head and neck cancer and Alzheimer's disease in the case of PET images, lung nodules regarding CTs, and multiple sclerosis and brain tumors in the MRI section.
