Publications by Carlos Manuel Soares

2018

Machine Learning for Drugs Prescription

Authors
Silva, P; Rivolli, A; Rocha, P; Correia, F; Soares, C;

Publication
Intelligent Data Engineering and Automated Learning - IDEAL 2018 - 19th International Conference, Madrid, Spain, November 21-23, 2018, Proceedings, Part I

Abstract
In a medical appointment, patient information, including past exams, is analyzed in order to define a diagnosis. This process is prone to errors, since there may be many possible diagnoses, and the analysis depends heavily on the experience of the doctor. Even with the correct diagnosis, prescribing medicines can be a problem, because there are multiple drugs for each disease and some may not be usable due to allergies or high cost. Therefore, it would be helpful if doctors could use a system that, for each diagnosis, provided a list of the most suitable medicines. Our approach is to support the physician in this process. Rather than trying to predict a single medicine, we aim to predict, given the available information, the set of the most likely drugs. The prescription problem may be treated as a multi-label classification problem since, for each diagnosis, multiple drugs may be prescribed at the same time. Due to its complexity, some simplifications were made to render the problem tractable, and multiple approaches were tested under different assumptions. The data supplied was also complex and had significant quality problems, which led to a strong investment in data preparation, in particular feature engineering. Overall, the results in each scenario are good, with performance almost twice that of the baseline, especially when using Binary Relevance as the transformation approach. © 2018, Springer Nature Switzerland AG.

2018

Bandit-Based Automated Machine Learning

Authors
Das Dores, SCN; Soares, C; Ruiz, D;

Publication
2018 7th Brazilian Conference on Intelligent Systems (BRACIS)

Abstract
Machine Learning (ML) has been successfully applied to a wide range of domains and applications. Since the number of ML applications is growing, there is a need for tools that boost the data scientist's productivity. Automated Machine Learning (AutoML) is the field of ML that aims to address these needs through the development of solutions which enable data science practitioners, experts and non-experts alike, to efficiently create fine-tuned predictive models with minimal intervention. In this paper, we present the application of the multi-armed bandit optimization algorithm Hyperband to address the AutoML problem of generating customized classification workflows, i.e., combinations of preprocessing methods and ML algorithms, including hyperparameter optimization. Experimental results comparing the bandit-based approach against AutoML Bayesian Optimization methods show that this new approach is superior to the state-of-the-art methods in the test evaluation and equivalent to them in a statistical analysis.
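
As an illustration of the bandit-style resource allocation this abstract refers to, the following is a minimal sketch of successive halving, the routine that Hyperband runs repeatedly with different trade-offs between the number of configurations and the budget per configuration. It is not the authors' implementation: the function name, the `evaluate` callback, and the toy loss are assumptions made for this example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations each round, growing the budget.

    configs    -- candidate configurations (hypothetical: any objects)
    evaluate   -- callable (config, budget) -> loss, lower is better (assumed)
    min_budget -- budget given to every configuration in the first round
    eta        -- reduction factor between rounds
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving configuration with the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Keep the top 1/eta (at least one) and give them more budget.
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]


# Toy usage: the "loss" is the distance of a single hyperparameter to 0.3.
# A real workflow search would train a model for `budget` units instead.
candidates = [i / 10 for i in range(10)]
best = successive_halving(candidates, lambda c, budget: abs(c - 0.3))
```

Poor configurations are discarded after a small budget, so most of the computation goes to the promising ones.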

2018

Analysing the Footprint of Classifiers in Overlapped and Imbalanced Contexts

Authors
Mercier, M; Santos, MS; Abreu, PH; Soares, C; Soares, JP; Santos, J;

Publication
Advances in Intelligent Data Analysis XVII - 17th International Symposium, IDA 2018, 's-Hertogenbosch, The Netherlands, October 24-26, 2018, Proceedings

Abstract
It is recognised that the imbalanced data problem is aggravated by other difficulty factors, such as class overlap. Over the years, several research works have focused on this problem, although with two major shortcomings: the limited range of test domains and the lack of a formulation of the overlap degree, which makes results hard to generalise. This work studies the performance degradation of classifiers with distinct learning biases in overlapped and imbalanced contexts, focusing on the characteristics of the test domains (shape, dimensionality and imbalance ratio) and on the extent to which our proposed overlapping measure (degOver) is aligned with the observed performance results. Our results show that MLP and CART classifiers are the most robust to high levels of class overlap, even for complex domains, and that KNN and linear SVM are the most aligned with degOver. Furthermore, we found that the dimensionality of data also plays an important role in explaining performance results. © Springer Nature Switzerland AG 2018.

2018

Label Expansion for Multi-Label Classification

Authors
Rivolli, A; Soares, C; de Carvalho, ACPLF;

Publication
2018 7th Brazilian Conference on Intelligent Systems (BRACIS)

Abstract
In multi-label classification tasks, instances are simultaneously associated with multiple labels, representing different and possibly related concepts from a domain. One characteristic of these tasks is a high class-label imbalance. In order to obtain improved predictive models, several algorithms have either explored the label dependencies or dealt with the problem of imbalanced labels. This work proposes a label expansion approach which combines both alternatives. To this end, some labels are expanded with data from a related class label, making the labels more balanced and representative. Preliminary experiments show the effectiveness of this approach in improving the Binary Relevance strategy. In particular, it reduced the number of labels that were never predicted in the test instances. Although the results are preliminary, they are potentially attractive, considering the scale and consistency of the improvement obtained, as well as the broad scope of the proposed approach.
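
For readers unfamiliar with the Binary Relevance strategy named in this abstract, the sketch below shows the basic idea: one independent binary classifier per label, plus a helper that lists labels never predicted for any test instance. It does not implement the label expansion proposed in the paper. The `NearestCentroid` base learner and all class and function names are hypothetical stand-ins chosen to keep the example self-contained.

```python
class NearestCentroid:
    """Toy binary base learner (hypothetical stand-in for any classifier)."""

    def fit(self, X, y):
        # Store one centroid per class that actually occurs in y.
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, t in zip(X, y) if t == cls]
            if rows:
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda cls: sq_dist(self.centroids[cls]))


class BinaryRelevance:
    """Train one independent binary model per label column."""

    def __init__(self, make_base):
        self.make_base = make_base

    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models = [
            self.make_base().fit(X, [row[j] for row in Y]) for j in range(n_labels)
        ]
        return self

    def predict(self, X):
        return [[m.predict_one(x) for m in self.models] for x in X]


def never_predicted(predictions):
    """Indices of labels that were predicted for no instance at all."""
    return [j for j, col in enumerate(zip(*predictions)) if not any(col)]


# Toy data: label 2 has no positive training example, so it is never predicted.
X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
Y_train = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
model = BinaryRelevance(NearestCentroid).fit(X_train, Y_train)
preds = model.predict([[0, 0.5], [5.5, 5]])
missing = never_predicted(preds)
```

Because each label is learned in isolation, rare labels can end up never being predicted, which is the symptom the abstract reports reducing.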

2019

Constructive Aggregation and Its Application to Forecasting with Dynamic Ensembles

Authors
Cerqueira, V; Pinto, F; Torgo, L; Soares, C; Moniz, N;

Publication
Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, Part I

Abstract
While the predictive advantage of ensemble methods is nowadays widely accepted, the most appropriate way of estimating the weights of each individual model remains an open research question. Meanwhile, several studies report that combining different ensemble approaches leads to improvements in performance, due to a better trade-off between the diversity and the error of the individual models in the ensemble. We contribute to this research line by proposing an aggregation framework for a set of independently created forecasting models, i.e. heterogeneous ensembles. The general idea is to, instead of directly aggregating these models, first rearrange them into different subsets, creating a new set of combined models which is then aggregated into a final decision. We present this idea as constructive aggregation, and apply it to time series forecasting problems. Results from empirical experiments show that applying constructive aggregation to state-of-the-art dynamic aggregation methods provides a consistent advantage. Constructive aggregation is publicly available in a software package. Data related to this paper are available at: https://github.com/vcerqueira/timeseriesdata. Code related to this paper is available at: https://github.com/vcerqueira/tsensembler.
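
The two-stage idea described in this abstract (rearrange models into subsets, combine each subset, then aggregate the combined models) can be sketched as follows. This is only an illustration under stated assumptions and not the method from the paper or the tsensembler package: here the subsets are all combinations of a fixed size, each subset is combined by a plain mean, and the final weights are the inverse of each combined model's mean absolute error on past observations. All names are hypothetical.

```python
from itertools import combinations


def constructive_aggregate(past_preds, past_truth, new_preds, subset_size=2):
    """Two-stage aggregation of forecasts from several models.

    past_preds -- list of time steps, each a list with one forecast per model
    past_truth -- observed values for those time steps
    new_preds  -- one forecast per model for the step to be predicted
    """
    n_models = len(new_preds)
    # Stage 1 (assumption): every subset of `subset_size` models is a combined model.
    subsets = list(combinations(range(n_models), subset_size))

    def combine(preds, subset):
        # A combined model forecasts the mean of its members.
        return sum(preds[i] for i in subset) / len(subset)

    # Stage 2 (assumption): weight combined models by inverse past absolute error.
    weights = []
    for subset in subsets:
        mae = sum(
            abs(combine(step, subset) - y) for step, y in zip(past_preds, past_truth)
        ) / len(past_truth)
        weights.append(1.0 / (mae + 1e-8))

    total = sum(weights)
    return sum(w * combine(new_preds, s) for w, s in zip(weights, subsets)) / total


# Toy usage: models 0 and 1 together have been exact in the past,
# so their combined forecast dominates the final decision.
history = [[1.0, 2.0, 4.0], [2.0, 3.0, 5.0]]
observed = [1.5, 2.5]
forecast = constructive_aggregate(history, observed, [3.0, 4.0, 6.0])
```

The design point is that weighting happens at the level of combined models rather than individual ones, so a pair that errs in opposite directions can receive a high weight even if neither member does on its own.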

2019

Data mining based framework to assess solution quality for the rectangular 2D strip-packing problem

Authors
Neuenfeldt Junior, A; Silva, E; Gomes, M; Soares, C; Oliveira, JF;

Publication
Expert Systems with Applications

Abstract
In this paper, we explore the use of reference values (predictors), obtained by data mining techniques, instead of bounds for the optimal objective function value of hard combinatorial optimization problems; these reference values may be used to assess the quality of heuristic solutions for the problem. For this purpose, we resort to the rectangular two-dimensional strip-packing problem (2D-SPP), which can be found in many industrial contexts. This problem is mostly solved by heuristic methods, which provide good solutions. However, heuristic approaches do not guarantee optimality, and lower bounds, in particular the area lower bound, are generally used to give information on the solution quality. This bound, however, has a severe accuracy problem. Therefore, we propose a data mining-based framework capable of assessing the quality of heuristic solutions for the 2D-SPP. A regression model was fitted by comparing the strip height solutions obtained with the bottom-left-fill heuristic and 19 predictors provided by problem characteristics. Random forest was selected as the data mining technique with the best level of generalisation for the problem, and 30,000 problem instances were generated to represent different 2D-SPP variations found in real-world applications. Height predictions for new problem instances can be obtained from the fitted regression model. In the computational experimentation, we demonstrate that the proposed data mining-based framework is consistent, opening the door to its application to finding predictions for other combinatorial optimisation problems, in particular other cutting and packing problems. However, how to use a reference value instead of a bound still leaves considerable room for discussion and innovative ideas. Some directions for the use of reference values as a stopping criterion in search algorithms are also provided.
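
To make the area lower bound mentioned in this abstract concrete: for a strip of fixed width, no packing can be shorter than the total item area divided by the strip width. The sketch below computes that bound and the relative gap of a heuristic height against it. It assumes integer item dimensions and strip width; the function names and the example instance are hypothetical and are not taken from the paper.

```python
def area_lower_bound(items, strip_width):
    """Smallest integer height whose strip area can hold all items.

    items       -- list of (width, height) pairs with integer dimensions (assumed)
    strip_width -- integer width of the strip
    """
    total_area = sum(w * h for w, h in items)
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_area // strip_width)


def relative_gap(heuristic_height, reference):
    """Gap of a heuristic solution relative to a bound or reference value."""
    return (heuristic_height - reference) / reference


# Toy instance: three rectangles in a strip of width 6.
rectangles = [(4, 3), (3, 3), (5, 2)]
bound = area_lower_bound(rectangles, 6)  # total area 31 -> ceil(31 / 6)
gap = relative_gap(8, bound)             # a hypothetical heuristic height of 8
```

The bound ignores item shapes entirely, which is why it can sit far below the true optimum; the same gap calculation applies if a predicted reference height is used in place of the bound.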
