Publications

Publications by LIAAD

2018

Outliers and the Simpson's Paradox

Authors
Portela, E; Ribeiro, RP; Gama, J;

Publication
ADVANCES IN SOFT COMPUTING, MICAI 2017, PT I

Abstract
There is no standard definition of outliers, but most authors agree that outliers are points far from other data points. Several outlier detection techniques have been developed, mainly with two different purposes. On the one hand, outliers are the interesting observations, as in fraud detection; on the other hand, outliers are considered measurement errors that should be removed from the analysis, as in robust statistics. In this work, we start from the observation that outliers are affected by the so-called Simpson's paradox: a trend that appears in different groups of data but disappears or reverses when these groups are combined. Given a dataset, we learn a regression tree. The tree grows by partitioning the data into groups that are increasingly homogeneous with respect to the target variable. At each partition defined by the tree, we apply a box plot on the target variable to detect outliers. We would expect deeper nodes of the tree to contain fewer and fewer outliers. We observe that some points previously signaled as outliers are no longer signaled as such, but new outliers appear. The identification of outliers depends on the context considered. Based on this observation, we propose a new method to quantify the level of outlierness of data points. © Springer Nature Switzerland AG 2018.
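
The context dependence described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it applies the usual box plot rule (values beyond 1.5 times the interquartile range from the quartiles) to made-up target values, first pooled and then within one hypothetical group standing in for a tree partition. The function name, the groups, and the numbers are all assumptions for illustration.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (box plot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical target values for two partitions (made-up numbers).
group_a = [1.0, 1.1, 0.9, 1.0, 1.2, 2.0]
group_b = [1.5, 2.5, 3.0, 2.0, 3.5, 2.8]

# Context 1: all data together (e.g. the root of a tree).
pooled_outliers = boxplot_outliers(group_a + group_b)

# Context 2: one partition only (e.g. a deeper node of the tree).
group_a_outliers = boxplot_outliers(group_a)

print("pooled:", pooled_outliers)
print("group A:", group_a_outliers)
```

In this toy data the value 2.0 is unremarkable in the pooled sample but is flagged inside group A, which mirrors the statement that new outliers can appear as the context narrows.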

2018

SMOTEBoost for Regression: Improving the Prediction of Extreme Values

Authors
Moniz, N; Ribeiro, RP; Cerqueira, V; Chawla, N;

Publication
2018 IEEE 5TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA)

Abstract
Supervised learning with imbalanced domains is one of the biggest challenges in machine learning. Such tasks differ from standard learning tasks by assuming a skewed distribution of target variables, and user domain preference towards under-represented cases. Most research has focused on imbalanced classification tasks, where a wide range of solutions has been tested. Still, little work has been done concerning imbalanced regression tasks. In this paper, we propose an adaptation of the SMOTEBoost approach for the problem of imbalanced regression. Originally designed for classification tasks, it combines boosting methods and the SMOTE resampling strategy. We present four variants of SMOTEBoost and provide an experimental evaluation using 30 datasets with an extensive analysis of results in order to assess the ability of SMOTEBoost methods in predicting extreme target values, and their predictive trade-off concerning baseline boosting methods. SMOTEBoost is publicly available in a software package.
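
As a rough illustration of the resampling idea mentioned in this abstract, the sketch below generates one synthetic case by interpolating between two rare cases, with the numeric target interpolated by the same fraction. This is a simplified assumption about how SMOTE-style oversampling can be carried over to regression; it does not reproduce the four SMOTEBoost variants or the boosting step, and the function name and data are hypothetical.

```python
import random


def smote_interpolate(case_a, case_b, rng):
    """Create one synthetic (features, target) case on the segment between two cases."""
    (xa, ya), (xb, yb) = case_a, case_b
    t = rng.random()  # interpolation fraction in [0, 1)
    x_new = [a + t * (b - a) for a, b in zip(xa, xb)]
    y_new = ya + t * (yb - ya)  # target interpolated with the same fraction
    return x_new, y_new


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two hypothetical rare cases with extreme target values (made-up numbers).
rare_a = ([1.0, 10.0], 95.0)
rare_b = ([3.0, 14.0], 99.0)

x_new, y_new = smote_interpolate(rare_a, rare_b, rng)
print(x_new, y_new)
```

In a boosting setting, synthetic cases of this kind would be added to the training sample at each iteration; that loop is omitted here.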

2018

REBAGG: REsampled BAGGing for Imbalanced Regression

Authors
Branco, P; Torgo, L; Ribeiro, RP;

Publication
Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA@ECML/PKDD 2018, Dublin, Ireland, September 10, 2018

Abstract

2018

Comparing Reverse Complementary Genomic Words Based on Their Distance Distributions and Frequencies

Authors
Tavares, AH; Raymaekers, J; Rousseeuw, PJ; Silva, RM; Bastos, CAC; Pinho, A; Brito, P; Afreixo, V;

Publication
INTERDISCIPLINARY SCIENCES-COMPUTATIONAL LIFE SCIENCES

Abstract
In this work, we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.
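
The two quantities compared in this abstract, the frequency of a word and the distribution of distances between its successive occurrences, can be sketched for a word and its reverse complement as follows. The sequence is a made-up toy string, the function names are hypothetical, and the dissimilarity measures studied in the paper (including the peak dissimilarity) are not implemented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Reverse the word and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def occurrence_positions(sequence, word):
    """Start positions of all occurrences of word, overlaps included."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def distance_distribution(sequence, word):
    """Count the distances between successive occurrences of word."""
    pos = occurrence_positions(sequence, word)
    return Counter(b - a for a, b in zip(pos, pos[1:]))


sequence = "ACGTTACGAACGTCGTACGT"  # toy sequence, not real genomic data
word = "ACG"
rc_word = reverse_complement(word)

for w in (word, rc_word):
    freq = len(occurrence_positions(sequence, w))
    print(w, "frequency:", freq, "distances:", dict(distance_distribution(sequence, w)))
```

Here both words occur equally often while their distance distributions differ, which is the kind of pairing the abstract contrasts.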

2018

Outlier detection in interval data

Authors
Silva, APD; Filzmoser, P; Brito, P;

Publication
ADVANCES IN DATA ANALYSIS AND CLASSIFICATION

Abstract
A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.

2018

Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering

Authors
Cunha, T; Soares, C; de Carvalho, ACPLF;

Publication
INFORMATION SCIENCES

Abstract
The problem of information overload motivated the appearance of Recommender Systems. Among the several open problems in this area, deciding which recommendation algorithm is best for a specific problem is one of the most important and least studied. The current trend to solve this problem is the experimental evaluation of several recommendation algorithms on a handful of datasets. However, these studies require an extensive amount of computational resources, particularly processing time. To avoid these drawbacks, researchers have investigated the use of Metalearning to select the best recommendation algorithms in different scopes. Such studies make it possible to understand the relationships between data characteristics and the relative performance of recommendation algorithms, which can be used to select the best algorithm(s) for a new problem. The contributions of this study are two-fold: 1) to identify and discuss the key concepts of algorithm selection for recommendation algorithms via a systematic literature review and 2) to perform an experimental study on the Metalearning approaches reviewed in order to identify the most promising concepts for automatic selection of recommendation algorithms.
