Publications - INESC TEC

Publications

Publications by LIAAD

2024

Anomaly detection-based undersampling for imbalanced classification problems

Authors
Park, YJ; Brito, P; Ma, YC

Publication
ENGINEERING OPTIMIZATION

Abstract
In various machine learning applications, classification plays an important role in categorizing and predicting data. To improve classification performance, it is crucial to identify and remove anomalies. Class imbalance is also a very common problem in many machine learning applications, since most classifiers tend to be biased toward the majority class and ignore the minority class instances. Thus, in this research, we propose a new under-sampling technique based on anomaly detection and removal to enhance the performance of imbalanced classification problems. To demonstrate the effectiveness of the proposed method, comprehensive experiments are conducted on forty imbalanced data sets, and two non-parametric hypothesis tests are employed to show the statistical difference in classification performance between the proposed method and other traditional resampling methods. The experiments show that the proposed method improves classification performance by effectively detecting and eliminating the anomalies among true-majority or pseudo-majority class instances.
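
The general idea of anomaly-based under-sampling can be illustrated with a minimal sketch. The abstract does not specify the anomaly detector, so the scoring rule below (largest absolute z-score within the majority class) and the names `undersample_majority` and `drop_fraction` are assumptions made for illustration only, not the authors' method.

```python
from statistics import mean, pstdev


def undersample_majority(X, y, majority_label, drop_fraction=0.2):
    """Drop the most anomalous majority-class rows.

    Hypothetical scoring rule: largest absolute z-score across features,
    computed from majority-class statistics only.
    """
    maj = [i for i, label in enumerate(y) if label == majority_label]
    n_features = len(X[0])
    # Per-feature mean and standard deviation over the majority class.
    mu = [mean(X[i][j] for i in maj) for j in range(n_features)]
    sd = [pstdev([X[i][j] for i in maj]) or 1.0 for j in range(n_features)]
    # Anomaly score of each majority-class row.
    score = {
        i: max(abs(X[i][j] - mu[j]) / sd[j] for j in range(n_features))
        for i in maj
    }
    n_drop = int(len(maj) * drop_fraction)
    dropped = set(sorted(maj, key=lambda i: score[i], reverse=True)[:n_drop])
    keep = [i for i in range(len(y)) if i not in dropped]
    return [X[i] for i in keep], [y[i] for i in keep]
```

Only majority-class rows are ever removed, so the minority class keeps all of its instances while the class ratio moves toward balance.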

2024

Immigrant groups in the Luxembourgish labour market: A Symbolic Data Analysis approach

Authors
Silva, CC; Brito, P; Campos, P

Publication
Statistical Journal of the IAOS

Abstract
Luxembourg, known for its immigration history, attracts immigrants to work. This study analyses different immigrant groups in the labour market from 2014 to 2022 by using Labor Force Survey (LFS) data, Symbolic Data Analysis (SDA), and the Monitoring the Evolution of Clusters (MEC) framework. Based on birthplace and length of residence in Luxembourg, the microdata for each year were aggregated into 21 symbolic objects. They were primarily described by 16 modal variables, which are multi-valued variables with a frequency attached to each category. Clustering using complete linkage and Chernoff's distance was then applied. The Heuristic Identification of Noisy Variables (HINoV) suggested that with just six variables, objects may be grouped homogeneously. The MEC framework traced temporal relations and transitions between the clusters, revealing some movements across the different years. Results indicate that people from the European Union (EU) and neighbouring countries have similar profiles, while the Portuguese have opposite characteristics. The Luxembourgers are somewhere in between. Profiling people from non-EU countries was challenging. The data and methodology used make it easy to replicate the work in other nations, enabling comparison of results and continued monitoring in the future.
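
As a rough illustration of what a modal variable is, the sketch below aggregates individual records into groups and attaches a relative frequency to each category. The function name `to_modal` and the record fields used in the usage note are hypothetical and are not taken from the LFS data or from the paper.

```python
from collections import Counter, defaultdict


def to_modal(records, group_key, variable):
    """Aggregate microdata into one modal variable per group.

    Returns {group: {category: relative frequency}}.
    """
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record[variable]] += 1
    return {
        group: {cat: n / sum(c.values()) for cat, n in c.items()}
        for group, c in counts.items()
    }
```

For example, four records of a hypothetical group "A" with sector values services, services, industry, services would yield the modal description `{"services": 0.75, "industry": 0.25}` for that group.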

2024

New skills in symbolic data analysis for official statistics

Authors
Verde R.; Batagelj V.; Brito P.; Silva A.P.D.; Korenjak-Cerne S.; Dobša J.; Diday E.

Publication
Statistical Journal of the IAOS

Abstract
The paper draws attention to the use of Symbolic Data Analysis (SDA) in the field of Official Statistics. It is composed of three sections presenting three pilot techniques in the field of SDA. The three contributions are a technique based on the notion of exactly unified summaries for the creation of symbolic objects, a model-based approach for interval data as an innovative parametric strategy in this context, and measures of similarity defined between a class and a collection of classes based on the frequency of the categories that characterize them. The paper shows the effectiveness of the proposed approaches as prototypes of numerous techniques developed within the SDA framework and opens the way to further developments.

2024

Special issue on New methodologies in clustering and classification for complex and/or big data

Authors
Brito, P; Cerioli, A; Garcia-Escudero, LA; Saporta, G

Publication
ADVANCES IN DATA ANALYSIS AND CLASSIFICATION

Abstract
[No abstract available]

2024

Multidimensional subgroup discovery on event logs

Authors
Ribeiro, J; Fontes, T; Soares, C; Borges, JL

Publication
EXPERT SYSTEMS WITH APPLICATIONS

Abstract
Subgroup discovery (SD) aims at finding significant subgroups of a given population of individuals characterized by statistically unusual properties of interest. SD on event logs provides insight into particular behaviors of processes, which may be a valuable complement to traditional process analysis techniques, especially for low-structured processes. This paper proposes a scalable and efficient method to search for significant SD rules on frequent sequences of events, exploiting their multidimensional nature. The method is intended to identify significant subsequences of events where the distribution of values of some target aspect differs significantly from the same distribution for the entire event log. A publicly available real-life event log of a Dutch hospital is used as a running example to demonstrate the applicability of our method. The proposed approach was applied to a real-life case study based on the public transport of a medium-sized European city (Porto, Portugal), for which the event data consists of 133 million smartcard travel validations from buses, trams and trains. The results include a characterization of mobility flows over multiple aspects, as well as the identification of unexpected behaviors in the flow of commuters (public transport). The generated knowledge provided useful insight into the behavior of travelers, which can be applied at operational, tactical and strategic business levels, enhancing the current view of the transport services for transport authorities and operators.
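
The notion of a subsequence whose target distribution deviates from that of the whole log can be sketched with a standard subgroup discovery quality measure, weighted relative accuracy (WRAcc). The abstract does not state which measure the paper uses, so WRAcc is swapped in here as an assumption. The helper names `contains` and `wracc`, and the representation of the log as (trace, boolean target) pairs, are likewise hypothetical.

```python
def contains(trace, pattern):
    """True if `pattern` occurs as a contiguous subsequence of `trace`."""
    n = len(pattern)
    return any(
        tuple(trace[i:i + n]) == tuple(pattern)
        for i in range(len(trace) - n + 1)
    )


def wracc(log, pattern):
    """Weighted relative accuracy of the subgroup of traces containing `pattern`.

    `log` is a list of (trace, target) pairs with a boolean target.
    WRAcc = coverage * (P(target | subgroup) - P(target)).
    """
    covered = [target for trace, target in log if contains(trace, pattern)]
    if not covered:
        return 0.0
    coverage = len(covered) / len(log)
    p_subgroup = sum(covered) / len(covered)
    p_overall = sum(target for _, target in log) / len(log)
    return coverage * (p_subgroup - p_overall)
```

A positive score means the target is over-represented among traces containing the pattern, a negative score means it is under-represented, and the coverage factor penalizes patterns that match only a handful of traces.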

2024

VEST: automatic feature engineering for forecasting

Authors
Cerqueira, V; Moniz, N; Soares, C

Publication
MACHINE LEARNING

Abstract
Time series forecasting is a challenging task with applications in a wide range of domains. Auto-regression is one of the most common approaches to address these problems. Accordingly, observations are modelled by multiple regression using their past lags as predictor variables. We investigate the extension of auto-regressive processes using statistics which summarise the recent past dynamics of time series. The result of our research is a novel framework called VEST, designed to automatically perform feature engineering on univariate, numeric time series. The proposed approach works in three main steps. First, recent observations are mapped onto different representations. Second, each representation is summarised by statistical functions. Finally, a filter is applied for feature selection. We discovered that combining the features generated by VEST with auto-regression significantly improves forecasting performance in a database composed of 90 time series with high sampling frequency. However, we also found that there are no improvements when the framework is applied to multi-step forecasting or to time series with low sample size. VEST is publicly available online.
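
The three steps described above can be sketched roughly as follows. This is not the VEST implementation or its API: the two representations (raw values and first differences), the two summary statistics, the constant-feature filter and all function names are assumptions chosen to keep the example short.

```python
from statistics import mean, pstdev

# Step 2 summaries (hypothetical choice): mean and standard deviation.
SUMMARIES = {"mean": mean, "sd": pstdev}


def representations(window):
    """Step 1: map recent observations onto different representations."""
    return {
        "raw": list(window),
        "diff": [b - a for a, b in zip(window, window[1:])],
    }


def extract_features(series, k):
    """Steps 1 and 2: summarise each representation of the last k values."""
    rows = []
    for t in range(k, len(series)):
        window = series[t - k:t]
        rows.append({
            f"{rep}_{name}": fn(values)
            for rep, values in representations(window).items()
            for name, fn in SUMMARIES.items()
        })
    return rows


def variance_filter(rows, min_sd=1e-8):
    """Step 3: drop features that are (almost) constant across all rows."""
    keep = [name for name in rows[0] if pstdev([r[name] for r in rows]) > min_sd]
    return [{name: r[name] for name in keep} for r in rows]
```

In an auto-regressive setting, the filtered features of each row would be appended to the k lagged values of the same window before fitting the regression model.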
