Publications

Publications by LIAAD

2018

Resampling with neighbourhood bias on imbalanced domains

Authors
Branco, P; Torgo, L; Ribeiro, RP;

Publication
EXPERT SYSTEMS

Abstract
Imbalanced domains are an important problem that arises in predictive tasks causing a loss in the performance on the most relevant cases for the user. This problem has been extensively studied for classification problems, where the target variable is nominal. Recently, it was recognized that imbalanced domains occur in several other contexts and for multiple tasks, such as regression tasks, where the target variable is continuous. This paper focuses on imbalanced domains in both classification and regression tasks. Resampling strategies are among the most successful approaches to address imbalanced domains. In this work, we propose variants of existing resampling strategies that are able to take into account the information regarding the neighbourhood of the examples. Instead of performing sampling uniformly, our proposals bias the strategies to reinforce some regions of the data sets. With an extensive set of experiments, we provide evidence of the advantage of introducing a neighbourhood bias in the resampling strategies for both classification and regression tasks with imbalanced data sets.
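
The idea of biasing a resampling strategy by neighbourhood information can be illustrated with a minimal sketch. It is not the authors' method: the function names, the choice of k, and the specific bias (preferring to keep majority-class examples whose nearest neighbours share their label) are assumptions made only for illustration, and the abstract does not say which regions the proposed variants reinforce.

```python
import math
import random


def other_class_fraction(X, y, k):
    """For each example, the fraction of its k nearest neighbours
    (Euclidean distance) that carry a different label."""
    fractions = []
    for i, xi in enumerate(X):
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        fractions.append(sum(y[j] != y[i] for j in neighbours) / k)
    return fractions


def biased_undersample(X, y, majority_label, n_keep, k=3, seed=0):
    """Return the indices kept after undersampling the majority class.

    Hypothetical bias: a majority example is more likely to be kept when
    few of its neighbours belong to another class. Uniform undersampling
    would instead give every majority example the same weight.
    """
    rng = random.Random(seed)
    frac = other_class_fraction(X, y, k)
    majority = [i for i, label in enumerate(y) if label == majority_label]

    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # draw u in [0, 1), use u ** (1 / weight), keep the largest keys.
    keyed = []
    for i in majority:
        weight = 1.0 - frac[i] + 1e-6  # small constant avoids a zero weight
        keyed.append((rng.random() ** (1.0 / weight), i))
    kept = {i for _, i in sorted(keyed, reverse=True)[:n_keep]}

    # All non-majority examples are retained unchanged.
    return [i for i in range(len(y)) if y[i] != majority_label or i in kept]
```

Calling `biased_undersample(X, y, majority_label=0, n_keep=2, k=2)` on a toy data set returns a sorted list of row indices. Inverting the weight (for example `frac[i] + 1e-6`) would instead favour examples near the class frontier.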


2018

MetaUtil: Meta Learning for Utility Maximization in Regression

Authors
Branco, P; Torgo, L; Ribeiro, RP;

Publication
Discovery Science - 21st International Conference, DS 2018, Limassol, Cyprus, October 29-31, 2018, Proceedings

Abstract
Several important real-world problems of predictive analytics involve handling different costs of the predictions of the learned models. The research community has developed multiple techniques to deal with these tasks. The utility-based learning framework is a generalization of cost-sensitive tasks that takes into account both costs of errors and benefits of accurate predictions. This framework has important advantages, such as making it possible to represent more complex settings that reflect the domain knowledge in a more complete and precise way. Most existing work addresses classification tasks, with only a few proposals tackling regression problems. In this paper we propose a new method, MetaUtil, for solving utility-based regression problems. The MetaUtil algorithm is versatile, allowing the conversion of any out-of-the-box regression algorithm into a utility-based method. We show the advantage of our proposal in a large set of experiments on a diverse set of domains. © 2018, Springer Nature Switzerland AG.


2018

SMOTEBoost for Regression: Improving the Prediction of Extreme Values

Authors
Moniz, N; Ribeiro, RP; Cerqueira, V; Chawla, N;

Publication
2018 IEEE 5TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA)

Abstract
Supervised learning with imbalanced domains is one of the biggest challenges in machine learning. Such tasks differ from standard learning tasks by assuming a skewed distribution of target variables, and user domain preference towards under-represented cases. Most research has focused on imbalanced classification tasks, where a wide range of solutions has been tested. Still, little work has been done concerning imbalanced regression tasks. In this paper, we propose an adaptation of the SMOTEBoost approach for the problem of imbalanced regression. Originally designed for classification tasks, it combines boosting methods and the SMOTE resampling strategy. We present four variants of SMOTEBoost and provide an experimental evaluation using 30 datasets with an extensive analysis of results in order to assess the ability of SMOTEBoost methods in predicting extreme target values, and their predictive trade-off concerning baseline boosting methods. SMOTEBoost is publicly available in a software package.


2018

REBAGG: REsampled BAGGing for Imbalanced Regression

Authors
Branco, P; Torgo, L; Ribeiro, RP;

Publication
Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA@ECML/PKDD 2018, Dublin, Ireland, September 10, 2018

2018

Comparing Reverse Complementary Genomic Words Based on Their Distance Distributions and Frequencies

Authors
Tavares, AH; Raymaekers, J; Rousseeuw, PJ; Silva, RM; Bastos, CAC; Pinho, A; Brito, P; Afreixo, V;

Publication
INTERDISCIPLINARY SCIENCES-COMPUTATIONAL LIFE SCIENCES

Abstract
In this work, we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.
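
The two quantities compared in this abstract, the distance distribution and the frequency of a word versus those of its reverse complement, can be sketched as follows. This is an illustration under stated assumptions: all function names are hypothetical, overlapping occurrences are counted, and total variation distance is used as a generic stand-in because the abstract does not define the peak dissimilarity it favours.

```python
from collections import Counter

# Translation table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def occurrence_positions(sequence, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def distance_distribution(sequence, word):
    """Relative frequency of each distance between consecutive occurrences."""
    pos = occurrence_positions(sequence, word)
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    total = sum(gaps.values())
    return {d: c / total for d, c in gaps.items()} if total else {}


def total_variation(p, q):
    """Generic dissimilarity between two discrete distributions
    (a stand-in here, not the paper's peak dissimilarity)."""
    return 0.5 * sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in set(p) | set(q))


def compare_with_reverse_complement(sequence, word):
    """Return (distribution dissimilarity, count of word, count of its reverse complement)."""
    rc = reverse_complement(word)
    dissimilarity = total_variation(
        distance_distribution(sequence, word),
        distance_distribution(sequence, rc),
    )
    return (
        dissimilarity,
        len(occurrence_positions(sequence, word)),
        len(occurrence_positions(sequence, rc)),
    )
```

On a real genome the sequence would be read from a FASTA file, and characters outside A, C, G, T (such as N or the masked regions of a repeat-masked assembly) would need explicit handling, which this sketch omits.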


2018

Outlier detection in interval data

Authors
Duarte Silva, APD; Filzmoser, P; Brito, P;

Publication
ADVANCES IN DATA ANALYSIS AND CLASSIFICATION

Abstract
A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots give deeper insight into the structure of real-world interval data.
