Publications by LIAAD

2018

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches

Authors
Santos, MS; Soares, JP; Abreu, PH; Araujo, H; Santos, J;

Publication
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE

Abstract
Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers farther from the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset, resulting in biased models and overly-optimistic estimates. We emphasize and distinguish overoptimism from overfitting, showing that the former is associated with the cross-validation procedure, while the latter is influenced by the chosen oversampling algorithm. Furthermore, we perform a thorough empirical comparison of well-established oversampling algorithms, supported by a data complexity analysis. The best oversampling techniques seem to possess three key characteristics: use of cleaning procedures, cluster-based example synthetization and adaptive weighting of minority examples, where Synthetic Minority Oversampling Technique coupled with Tomek Links and Majority Weighted Minority Oversampling Technique stand out, being capable of increasing the discriminative power of data.
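The experimental flaw described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: it swaps in plain random duplication for the oversampling algorithms studied in the paper, and the helper names (`random_oversample`, `k_fold_indices`, `leaked_test_rows`) are hypothetical. It counts how many test rows share an original example with the training fold when oversampling is applied before splitting versus inside each training fold.

```python
import random

def random_oversample(rows, rng):
    """Duplicate minority rows at random until both classes have equal size.
    Each row is (example_id, label); label 1 is assumed to be the minority."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def leaked_test_rows(rows, k, rng, oversample_first):
    """Count test rows whose example_id also appears in the training fold."""
    data = random_oversample(rows, rng) if oversample_first else rows
    leaked = 0
    for fold in k_fold_indices(len(data), k, rng):
        held_out = set(fold)
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        if not oversample_first:
            # Oversample the training fold only; the test fold stays untouched.
            train = random_oversample(train, rng)
        train_ids = {r[0] for r in train}
        leaked += sum(1 for r in test if r[0] in train_ids)
    return leaked

rng = random.Random(0)
rows = [(i, 1 if i < 10 else 0) for i in range(100)]  # 10 minority, 90 majority
leaks_flawed = leaked_test_rows(rows, 5, rng, oversample_first=True)
leaks_correct = leaked_test_rows(rows, 5, rng, oversample_first=False)
print("oversample before splitting:", leaks_flawed, "leaked test rows")
print("oversample inside each training fold:", leaks_correct, "leaked test rows")
```

When oversampling happens inside each training fold, no test row can share an example with the training data, so the leak count is zero by construction. Oversampling the whole dataset first places copies of the same minority example on both sides of the split.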

2018

BI-RADS Classification of Breast Cancer: A New Pre-processing Pipeline for Deep Models Training

Authors
Domingues, I; Abreu, PH; Santos, J;

Publication
2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)

Abstract
One of the main difficulties in the use of deep learning strategies in medical contexts is the training set size. While these methods need large annotated training sets, these datasets are costly to obtain in medical contexts and suffer from intra- and inter-subject variability. In the present work, two new pre-processing techniques are introduced to improve a deep classifier's performance. First, data augmentation based on co-registration is suggested. Then, multi-scale enhancement based on Difference of Gaussians is proposed. Results are assessed on a public mammogram database, the InBreast, in the context of an ordinal problem, the BI-RADS classification. Moreover, a pre-trained Convolutional Neural Network with the AlexNet architecture was used as a base classifier. The multi-class classification experiments show that the proposed pipeline with the Difference of Gaussians and the data augmentation technique outperforms using the original dataset only and using the original dataset augmented by mirroring the images.
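As a rough illustration of the Difference of Gaussians idea mentioned in this abstract, the sketch below blurs an image at two scales and subtracts the coarser result from the finer one. The abstract does not give the paper's scales or the way bands are combined, so the sigma values, the border handling, and the function names here are all assumptions.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(values, kernel):
    """Convolve a 1-D sequence with the kernel, replicating border values."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D list of lists: rows, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def difference_of_gaussians(image, sigma_small, sigma_large):
    """Fine blur minus coarse blur: keeps detail between the two scales."""
    fine = gaussian_blur(image, sigma_small)
    coarse = gaussian_blur(image, sigma_large)
    return [[f - c for f, c in zip(fr, cr)] for fr, cr in zip(fine, coarse)]

def multi_scale_dog(image, sigmas):
    """One band-pass image per pair of consecutive scales (assumed scheme)."""
    return [difference_of_gaussians(image, s1, s2)
            for s1, s2 in zip(sigmas, sigmas[1:])]

# Toy 9x9 image with a single bright pixel in the centre.
spot = [[0.0] * 9 for _ in range(9)]
spot[4][4] = 1.0
bands = multi_scale_dog(spot, [1.0, 2.0, 4.0])  # illustrative sigma values
```

Each band responds most strongly to structures whose size falls between its two sigma values, which is why a stack of such bands can serve as a multi-scale enhancement of the input.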

2018

Exploring the effects of data distribution in missing data imputation

Authors
Pompeu Soares, J; Seoane Santos, M; Henriques Abreu, P; Araújo, H; Santos, J;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
In data imputation problems, researchers typically use several techniques, individually or in combination, in order to find the one that presents the best performance over all the features comprised in the dataset. This strategy, however, neglects the nature of data (data distribution) and makes generalising the findings impractical, since for new datasets, a huge number of new, time-consuming experiments need to be performed. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic on the choice of proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected considering different sample sizes, number of features, distributions and contexts, and missing values were inserted at different percentages and scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between features’ distribution and algorithms’ performance, and that their performance seems to be affected by the combination of missing rate and scenario at stake and also other less obvious factors such as sample size, goodness-of-fit of features and the ratio between the number of features and the different distributions comprised in the dataset. © Springer Nature Switzerland AG 2018.

2018

Evaluation of Oversampling Data Balancing Techniques in the Context of Ordinal Classification

Authors
Domingues, I; Amorim, JP; Abreu, PH; Duarte, H; Santos, JAM;

Publication
2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018

Abstract
Data imbalance is characterized by a discrepancy in the number of examples per class of a dataset. This phenomenon is known to deteriorate the performance of classifiers, since they are less able to learn the characteristics of the less represented classes. For most imbalanced datasets, the application of sampling techniques improves the classifier's performance. For small datasets, oversampling has been shown to be the most appropriate strategy since it augments the original set of samples. Although several oversampling strategies have been proposed and tested over the years, the work has mostly focused on binary or multi-class tasks. Motivated by medical applications, where there is often an order associated with the classes (increasing likelihood of malignancy, for instance), the present work tests some existing oversampling techniques in ordinal contexts. Moreover, four new oversampling techniques are proposed. Experiments were conducted on both private and public datasets. Private datasets concern the assessment of response to treatment in oncologic diseases. The 15 public datasets were chosen since they are widely used in the literature. Results show that data balance techniques improve classification results on ordinal imbalanced datasets, even when these techniques are not specifically designed for ordinal problems. With our pipeline, results better than or equal to published ones were obtained for 10 out of the 15 public datasets, with improvements reaching a decrease of 0.43 in MMAE.
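The abstract reports results in MMAE without defining it. The sketch below assumes the definition commonly used in ordinal classification, the maximum over classes of the per-class mean absolute error, with class labels encoded as consecutive integers. The function names and the toy labels are illustrative only.

```python
def per_class_mae(y_true, y_pred):
    """Mean absolute error computed separately for each true class."""
    errors = {}
    for t, p in zip(y_true, y_pred):
        errors.setdefault(t, []).append(abs(t - p))
    return {label: sum(v) / len(v) for label, v in errors.items()}

def mmae(y_true, y_pred):
    """Maximum of the per-class MAEs: the error of the worst-served class."""
    return max(per_class_mae(y_true, y_pred).values())

def mae(y_true, y_pred):
    """Ordinary MAE over all examples, dominated by the largest class."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ordinal problem: class 0 is frequent, class 2 has a single example.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 2, 0]
print(per_class_mae(y_true, y_pred))  # {0: 0.0, 1: 0.5, 2: 2.0}
print(mmae(y_true, y_pred))           # 2.0
print(round(mae(y_true, y_pred), 3))  # 0.429
```

In this toy case the overall MAE looks modest because the majority class is predicted perfectly, while MMAE exposes the large error on the rarest class. That sensitivity is what makes a per-class maximum a natural choice for imbalanced ordinal data.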

2018

Improving the Classifier Performance in Motor Imagery Task Classification: What are the steps in the classification process that we should worry about?

Authors
Santos, MS; Abreu, PH; Rodriguez Bermudez, G; Garcia Laencina, PJ;

Publication
INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS

Abstract
Brain-Computer Interface systems based on motor imagery are able to identify an individual's intent to initiate control through the classification of encephalography patterns. Correctly classifying such patterns is instrumental and strongly depends on a robust machine learning block that is able to properly process the features extracted from a subject's encephalograms. The main objective of this work is to provide an overview of machine learning stages, aiming to answer the following question: "What are the steps in the classification process that we should worry about?". The obtained results suggest that future research in the field should focus on two main aspects: exploring techniques for dimensionality reduction, in particular, supervised linear approaches, and evaluating adequate validation schemes to allow a more precise interpretation of results.

2018

Missing data imputation via denoising autoencoders: The untold story

Authors
Costa, AF; Santos, MS; Soares, JP; Abreu, PH;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Missing data is the lack of information in a dataset and, since it directly influences classification performance, neglecting it is not a valid option. Over the years, several studies presented alternative imputation strategies to deal with the three missing data mechanisms: Missing Completely At Random, Missing At Random and Missing Not At Random. However, there are no studies regarding the influence of all three mechanisms on the latest high-performance Artificial Intelligence techniques, such as Deep Learning. The goal of this work is to perform a comparative study between state-of-the-art imputation techniques and a Stacked Denoising Autoencoders approach. To that end, the missing data mechanisms were synthetically generated in 6 different ways; 8 different imputation techniques were implemented; and finally, 33 complete datasets from different open source repositories were selected. The obtained results showed that Support Vector Machines imputation ensures the best classification performance while Multiple Imputation by Chained Equations performs better in terms of imputation quality. © Springer Nature Switzerland AG 2018.
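The three mechanisms named in this abstract differ in what drives the missingness. The sketch below shows one simple way to generate each on a complete table. These are not the six generation schemes used in the paper, which the abstract does not describe; the function names and the "remove the largest values" rule are assumptions chosen for illustration.

```python
import random

def mask_rows(rows, col, targets):
    """Return a copy of rows with column `col` set to None in the target rows."""
    return [[None if (j == col and i in targets) else v for j, v in enumerate(r)]
            for i, r in enumerate(rows)]

def mcar(rows, col, rate, rng):
    """Missing Completely At Random: target rows are picked uniformly."""
    n_missing = round(rate * len(rows))
    targets = set(rng.sample(range(len(rows)), n_missing))
    return mask_rows(rows, col, targets)

def mar(rows, col, driver, rate):
    """Missing At Random: missingness in `col` depends on another, fully
    observed column `driver` (here, its largest values)."""
    n_missing = round(rate * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][driver], reverse=True)
    return mask_rows(rows, col, set(order[:n_missing]))

def mnar(rows, col, rate):
    """Missing Not At Random: missingness depends on the unobserved values
    themselves (here, the largest values of `col` are removed)."""
    return mar(rows, col, col, rate)

# Toy complete dataset: 20 rows, 2 numeric features.
rows = [[float(i), float((i * 7) % 20)] for i in range(20)]
rng = random.Random(0)
with_mcar = mcar(rows, 1, 0.25, rng)
with_mar = mar(rows, 1, 0, 0.25)
with_mnar = mnar(rows, 1, 0.25)
```

Under MNAR the observed part of the column is systematically biased (its largest values are gone), which is why imputation methods are usually compared separately under each mechanism.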