Publications

Publications by Carlos Manuel Soares

2022

Density Estimation in High-Dimensional Spaces: A Multivariate Histogram Approach

Authors
Strecht, P; Mendes Moreira, J; Soares, C;

Publication
ADVANCED DATA MINING AND APPLICATIONS, ADMA 2022, PT II

Abstract
Density estimation is an important tool for data analysis. Non-parametric approaches have a reputation for offering state-of-the-art density estimates limited to few dimensions. Despite providing less accurate density estimates, histogram-based approaches remain the only alternative for datasets in high-dimensional spaces. In this paper, we present a multivariate histogram approach to estimate the density of a dataset without restrictions on the number of dimensions, containing both numerical and categorical variables (without numerical encoding) and allowing missing data (without the need to preprocess them). Results from the empirical evaluation show that it is possible to estimate the density of datasets without restrictions on dimensionality, and the method is robust to missing values and categorical variables.
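The general idea of a histogram over mixed data can be illustrated with a minimal sketch. This is not the authors' method: the cell encoding, the clipping of out-of-range values, and the names `fit_cell_counts` and `cell_probability` are all assumptions made for illustration. Numerical values are binned, categorical values are used as they are, and a missing value (here `None`) gets a cell coordinate of its own, so no encoding or imputation step is needed. Only occupied cells are stored, which keeps memory proportional to the number of rows rather than to the number of possible cells.

```python
import bisect
from collections import Counter


def make_cell_fn(numeric_edges):
    """Return a function mapping a row to its histogram cell (a hashable tuple).

    numeric_edges: {column index: sorted list of bin edges} (hypothetical format).
    Columns without an entry are treated as categorical.
    """
    def cell(row):
        key = []
        for j, value in enumerate(row):
            if value is None:
                key.append(("na",))            # missing value: its own coordinate
            elif j in numeric_edges:
                edges = numeric_edges[j]
                idx = bisect.bisect_right(edges, value) - 1
                # Assumption: values outside the edges fall into the first/last bin.
                idx = min(max(idx, 0), len(edges) - 2)
                key.append(("num", idx))
            else:
                key.append(("cat", value))     # category used without encoding
        return tuple(key)
    return cell


def fit_cell_counts(rows, numeric_edges):
    """Count how many rows fall into each occupied cell."""
    cell = make_cell_fn(numeric_edges)
    return Counter(cell(r) for r in rows), cell


def cell_probability(row, counts, cell, n_rows):
    """Relative frequency of the cell containing `row` (0.0 for empty cells)."""
    return counts.get(cell(row), 0) / n_rows
```

Dividing each relative frequency by the volume of the numerical bins would turn these cell probabilities into a density over the numerical variables; the sketch stops at probabilities to stay short.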

2021

Inmplode: A framework to interpret multiple related rule-based models

Authors
Strecht, P; Mendes Moreira, J; Soares, C;

Publication
EXPERT SYSTEMS

Abstract
There is a growing trend to split problems into separate subproblems and develop separate models for each (e.g., different churn models for separate customer segments; different failure prediction models for separate university courses, etc.). While it may lead to better predictive models, the use of multiple models makes interpretability more challenging. In this paper, we address the problem of synthesizing the knowledge contained in a set of models without a significant loss of prediction performance. We focus on decision tree models because their interpretability makes them suitable for problems involving knowledge extraction. We detail the process, identifying alternative methods to address the different phases involved. An extensive set of experiments is carried out on the problem of predicting the failure of students in courses at the University of Porto. We assess the effect of using different methods for the operations of the methodology, in terms of both the knowledge extracted and the accuracy of the combined models.

2022

A case study comparing machine learning with statistical methods for time series forecasting: size matters

Authors
Cerqueira, V; Torgo, L; Soares, C;

Publication
JOURNAL OF INTELLIGENT INFORMATION SYSTEMS

Abstract
Time series forecasting is one of the most active research topics. Machine learning methods have been increasingly adopted to solve these predictive tasks. However, in a recent work, evidence was shown that these approaches systematically present a lower predictive performance relative to simple statistical methods. In this work, we counter these results. We show that these are only valid under an extremely low sample size. Using a learning curve method, our results suggest that machine learning methods improve their relative predictive performance as the sample size grows. The R code to reproduce all of our experiments is available at https://github.com/vcerqueira/MLforForecasting.

2024

VEST: automatic feature engineering for forecasting

Authors
Cerqueira, V; Moniz, N; Soares, C;

Publication
MACHINE LEARNING

Abstract
Time series forecasting is a challenging task with applications in a wide range of domains. Auto-regression is one of the most common approaches to address these problems. Accordingly, observations are modelled by multiple regression using their past lags as predictor variables. We investigate the extension of auto-regressive processes using statistics which summarise the recent past dynamics of time series. The result of our research is a novel framework called VEST, designed to perform feature engineering using univariate and numeric time series automatically. The proposed approach works in three main steps. First, recent observations are mapped onto different representations. Second, each representation is summarised by statistical functions. Finally, a filter is applied for feature selection. We discovered that combining the features generated by VEST with auto-regression significantly improves forecasting performance in a database composed of 90 time series with high sampling frequency. However, we also found that there are no improvements when the framework is applied to multi-step forecasting or to time series with low sample size. VEST is publicly available online.
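The three steps named in the abstract can be illustrated with a small sketch. It does not reproduce the VEST implementation: the two representations (raw window and first differences), the two statistics (mean and standard deviation), the correlation filter with its threshold, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np


def embed(series, n_lags):
    """Lag embedding: each row of X holds n_lags past values, y is the next value."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y


def summary_features(X):
    """Steps 1 and 2: map each window to representations, then summarise each."""
    representations = {
        "raw": X,                       # the recent observations themselves
        "diff": np.diff(X, axis=1),     # first differences within each window
    }
    statistics = {"mean": np.mean, "std": np.std}
    names, columns = [], []
    for rep_name, rep in representations.items():
        for stat_name, stat in statistics.items():
            names.append(f"{rep_name}_{stat_name}")
            columns.append(stat(rep, axis=1))
    return names, np.column_stack(columns)


def correlation_filter(F, names, y, threshold=0.1):
    """Step 3: keep features whose absolute Pearson correlation with y
    exceeds `threshold` (an assumed filter, chosen for illustration)."""
    keep = []
    for j in range(F.shape[1]):
        if np.std(F[:, j]) < 1e-12:     # skip (near-)constant features
            continue
        r = np.corrcoef(F[:, j], y)[0, 1]
        if abs(r) > threshold:
            keep.append(j)
    return [names[j] for j in keep], F[:, keep]
```

The selected columns can then be stacked next to the lags, for example with `np.hstack([X, F_selected])`, and passed to any regression model, which corresponds to the combination with auto-regression described in the abstract.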

2022

Metalearning

Authors
Brazdil, P; van Rijn, JN; Soares, C; Vanschoren, J;

Publication
Cognitive Technologies

Abstract

2022

On Usefulness of Outlier Elimination in Classification Tasks

Authors
Hetlerovic, D; Popelinsky, L; Brazdil, P; Soares, C; Freitas, F;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS XX, IDA 2022

Abstract
Although outlier detection/elimination has been studied before, few comprehensive studies exist on when exactly this technique would be useful as preprocessing in classification tasks. The objective of our study is to fill in this gap. We have performed experiments with 12 different outlier elimination methods (OEMs) and 10 classification algorithms on 50 different datasets. The results were then processed by the proposed reduction method, whose aim is to identify the most useful workflows for a given set of tasks (datasets). The reduction method has identified just three OEMs that are generally useful for the given set of tasks. We have shown that the inclusion of these OEMs is indeed useful, as it leads to lower loss in accuracy and the difference is quite significant (0.5%) on average.
