Publications

Publications by Carlos Manuel Soares

2016

CHADE: Metalearning with Classifier Chains for Dynamic Combination of Classifiers

Authors
Pinto, F; Soares, C; Moreira, JM;

Publication
Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I

Abstract
Dynamic selection or combination (DSC) methods allow the selection of one or more classifiers from an ensemble according to the characteristics of a given test instance x. Most methods proposed for this purpose are based on the nearest neighbours algorithm: it is assumed that if a classifier performed well on a set of instances similar to x, it will also perform well on x. We address the problem of dynamically combining a pool of classifiers by combining two approaches: metalearning and multi-label classification. Taking into account that diversity is a fundamental concept in ensemble learning and that the interdependencies between the classifiers cannot be ignored, we solve the multi-label classification problem by using a widely known technique: Classifier Chains (CC). Additionally, we extend a typical metalearning approach by combining metafeatures characterizing the interdependencies between the classifiers with the base-level features. We executed experiments on 42 classification datasets and compared our method with several state-of-the-art DSC techniques, including another metalearning approach. Results show that our method allows an improvement over the other metalearning approach and is very competitive with the other four DSC methods. © Springer International Publishing AG 2016.


2017

autoBagging: Learning to Rank Bagging Workflows with Metalearning

Authors
Pinto, F; Cerqueira, V; Soares, C; Moreira, JM;

Publication
Proceedings of the International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms co-located with the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, AutoML@PKDD/ECML 2017, Skopje, Macedonia, September 22, 2017.

Abstract
Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind most of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests or Boosting. The complexity of applying these techniques, together with the market scarcity of ML experts, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and metalearning. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package on CRAN.


2015

Customer segmentation in a large database of an online customized fashion business

Authors
Brito, PQ; Soares, C; Almeida, S; Monte, A; Byvoet, M;

Publication
Robotics and Computer-Integrated Manufacturing

Abstract
Data mining (DM) techniques have been used to solve marketing and manufacturing problems in the fashion industry. These approaches are expected to be particularly important for highly customized industries because the diversity of products sold makes it harder to find clear patterns of customer preferences. The goal of this project was to investigate two different data mining approaches for customer segmentation: clustering and subgroup discovery. The models obtained produced six market segments and 49 rules that allowed a better understanding of customer preferences in a highly customized fashion manufacturer/e-tailor. The scope and limitations of these clustering DM techniques will lead to further methodological refinements.


2016

RetweetPatterns: detection of spatio-temporal patterns of retweets

Authors
Rodrigues, T; Cunha, T; Ienco, D; Poncelet, P; Soares, C;

Publication
New Advances in Information Systems and Technologies, Vol 1

Abstract
Social media is strongly present in people's everyday life and Twitter is one example that stands out. The data within these types of services can be analyzed in order to discover useful knowledge. One interesting approach is to use data mining techniques to perceive hidden behaviours and patterns. The primary focus of this paper is to identify patterns of retweets and to understand how information spreads over time in Twitter. The aim of this work lies in the adaptation of the GetMove tool, which is capable of extracting spatio-temporal pattern trajectories, and TweeProfiles, which identifies tweet profiles regarding several dimensions: spatial, temporal, social and content. We hope that the more flexible clustering strategy from TweeProfiles will enhance the results extracted by GetMove. We study the application of said mechanism to one case study and develop a visualization tool to interpret the results.


2014

Simulation of the ensemble generation process: The divergence between data and model similarity

Authors
Pinto, F; Mendes Moreira, J; Soares, C; Rossetti, RJF;

Publication
Modelling and Simulation 2014 - European Simulation and Modelling Conference, ESM 2014

Abstract
In this paper we present a NetLogo simulation model for a Data Mining methodological process: ensemble classifier generation. The model allows the study of the trade-off between data characteristics and diversity, a key concept in Ensemble Learning. We studied the research hypothesis that data characteristics should also be taken into account while generating ensemble classifier models. The results of our experiments indicate that diversity is in fact a key concept in Ensemble Learning but, regarding our research hypothesis, the findings are inconclusive.


2014

TweeProfiles: Detection of Spatio-temporal Patterns on Twitter

Authors
Cunha, T; Soares, C; Rodrigues, EM;

Publication
Advanced Data Mining and Applications, ADMA 2014

Abstract
Online social networks present themselves as valuable information sources about their users and their respective behaviours and interests. Many researchers in data mining have analysed these types of data, aiming to find interesting patterns. This paper addresses the problem of identifying and displaying tweet profiles by analysing multiple types of data: spatial, temporal, social and content. The data mining process that extracts the patterns consists of manipulating the dissimilarity matrices for each type of data, which are fed to a clustering algorithm to obtain the desired patterns. This paper studies appropriate distance functions for the different types of data, the normalization and combination methods available for different dimensions and the existing clustering algorithms. The visualization platform is designed for dynamic and intuitive usage, aimed at revealing the extracted profiles in an understandable and interactive manner. In order to accomplish this, various visualization patterns were studied and widgets were chosen to better represent the information. The use of the project is illustrated with data from the Portuguese twittosphere.
