Publications - INESC TEC

Publications by Paula Brito

2015

Clustering of symbolic data

Authors
Brito, P

Publication
Handbook of Cluster Analysis

Abstract
In this chapter, we present clustering methods for symbolic data. We start by recalling that symbolic data are data presenting inherent variability, and by reviewing the motivations for the introduction of this new paradigm. We then define the different types of variables that allow for the representation of symbolic data, and recall some distance measures appropriate for the new data types. Next, we present clustering methods for different types of symbolic data, both hierarchical and nonhierarchical. An application illustrates two well-known methods for clustering symbolic data. © 2016 by Taylor & Francis Group, LLC.

2024

Anomaly detection-based undersampling for imbalanced classification problems

Authors
Park, YJ; Brito, P; Ma, YC

Publication
Engineering Optimization

Abstract
In various machine learning applications, classification plays an important role in categorizing and predicting data. To improve classification performance, it is crucial to identify and remove anomalies. Class imbalance is also a very common problem in machine learning applications, since most classifiers tend to be biased toward the majority class and ignore minority class instances. In this research, we therefore propose a new under-sampling technique based on anomaly detection and removal to enhance performance on imbalanced classification problems. To demonstrate the effectiveness of the proposed method, comprehensive experiments are conducted on forty imbalanced data sets, and two non-parametric hypothesis tests are employed to show the statistical difference in classification performance between the proposed method and other traditional resampling methods. The experiments show that the proposed method improves classification performance by effectively detecting and eliminating the anomalies among true-majority or pseudo-majority class instances.

2024

Immigrant groups in Luxembourg's labour market: A symbolic data analysis approach

Authors
Silva, CC; Brito, P; Campos, P

Publication
Statistical Journal of the IAOS

Abstract
Luxembourg, known for its immigration history, attracts immigrants to work. This study analyses different immigrant groups in the labour market from 2014 to 2022 using Labor Force Survey (LFS) data, Symbolic Data Analysis (SDA), and the Monitoring the Evolution of Clusters (MEC) framework. Based on birthplace and length of residence in Luxembourg, microdata were aggregated each year into 21 symbolic objects. These were primarily described by 16 modal variables, which are multi-valued variables with a frequency attached to each category. Clustering using complete linkage and Chernoff's distance was then applied. The Heuristic Identification of Noisy Variables (HINoV) suggested that with just six variables, objects may be grouped homogeneously. The MEC framework traced temporal relations and transitions between the clusters, revealing some movements across the different years. Results indicate that people from the European Union (EU) and neighbouring countries have similar profiles, while the Portuguese have opposite characteristics. The Luxembourgers are somewhere in between. Profiling people from non-EU countries was challenging. The data and methodology used make it easy to replicate the work in other nations, enabling comparison of results and continued monitoring in the future.

2024

New skills in symbolic data analysis for official statistics

Authors
Verde, R; Batagelj, V; Brito, P; Silva, APD; Korenjak-Cerne, S; Dobša, J; Diday, E

Publication
Statistical Journal of the IAOS

Abstract
The paper draws attention to the use of Symbolic Data Analysis (SDA) in the field of Official Statistics. It is composed of three sections, each presenting a pilot technique in the field of SDA. The three contributions are: a technique based on the notion of exactly unified summaries for the creation of symbolic objects; a model-based approach for interval data as an innovative parametric strategy in this context; and measures of similarity defined between a class and a collection of classes, based on the frequency of the categories which characterize them. The paper shows the effectiveness of the proposed approaches as prototypes of the numerous techniques developed within the SDA framework and opens the way to further developments.

2024

Special issue on New methodologies in clustering and classification for complex and/or big data

Authors
Brito, P; Cerioli, A; Garcia-Escudero, LA; Saporta, G

Publication
Advances in Data Analysis and Classification

Abstract
[No abstract available]

2025

Parametric models for distributional data

Authors
Brito, P; Silva, APD

Publication
Advances in Data Analysis and Classification

Abstract
We present parametric probabilistic models for numerical distributional variables. The proposed models are based on the representation of each distribution by a location measure and inter-quantile ranges, for given quantiles, thereby characterizing the underlying empirical distributions in a flexible way. Multivariate Normal distributions are assumed for the whole set of indicators, considering alternative structures of the variance-covariance matrix. For all cases, maximum likelihood estimators of the corresponding parameters are derived. This modelling allows for hypothesis testing and multivariate parametric analysis. The proposed framework is applied to Analysis of Variance and parametric Discriminant Analysis of distributional data. A simulation study examines the performance of the proposed models in classification problems under different data conditions. Applications to Internet traffic data and Portuguese official data illustrate the relevance of the proposed approach.
