Publications

Publications by LIAAD

2025

Anomaly Detection in Pet Behavioural Data

Authors
Silva, I; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
Pet owners are increasingly becoming conscious of their pet's necessities and are paying more attention to their overall wellness. The well-being of their pets is intricately linked to their own emotional and physical well-being. Some veterinary system solutions are emerging to provide proactive healthcare options for pets. One such solution offers the continuous monitoring of a pet's activity through accelerometer tracking devices. Based on data collected by this application, in this paper, we study different time aggregation and three unsupervised machine learning techniques to identify anomalies in pet behaviour data. Specifically, three algorithms, Isolation Forest, Local Outlier Factor, and K-Nearest Neighbour, with various thresholds to differentiate between normal and abnormal events. Results conducted on ten pets (five cats and five dogs) show that the most effective approach is to use daily data divided into periods. Moreover, the Local Outlier Factor is the best algorithm for detecting anomalies when prioritizing the identification of true positives. However, it also produces a high false positive ratio.

CloseRead Abstract

2025

Data Science for Fighting Environmental Crime

Authors
Barbosa, M; Ribeiro, C; Gomes, F; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
The rise of environmental crimes has become a major concern globally as they cause significant damage to ecosystems, public health and result in economic losses. The availability of vast sensor data provides an opportunity to analyze environmental data proactively. This helps to detect irregularities and uncover potential criminal activities. This paper highlights the critical role played by machine learning (ML) and remote sensing technologies in the continuously evolving scenarios of environmental crime. By examining some case studies on detecting illegal fishing, illegal oil spills, illegal landfills, and illegal logging, we delve into the practical implementation of data-driven approaches for environmental crime detection. Our goal with this study is to provide an overview of the existing research in this area and foster the use of ML and data science techniques to enhance environmental crime detection.

CloseRead Abstract

2025

Evaluating Short Text Stream Clustering on Large E-commerce Datasets

Authors
Andrade, C; Ribeiro, RP; Gama, J;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT III

Abstract
Latent Dirichlet Allocation (LDA) is a fundamental method for clustering short text streams. However, when applied to large datasets, it often faces significant challenges, and its performance is typically evaluated in domain-specific datasets such as news and tweets. This study aims to fill this gap by evaluating the effectiveness of short text clustering methods in a large and diverse e-commerce dataset. We specifically investigate how well these clustering algorithms adapt to the complex dynamics and larger scale of e-commerce text streams, which differ from their usual application domains. Our analysis focuses on the impact of high homogeneity scores on the reported Normalized Mutual Information (NMI) values. We particularly examine whether these scores are inflated due to the prevalence of single-element clusters. To address potential biases in clustering evaluation, we propose using the Akaike Information Criterion (AIC) as an alternative metric to reduce the formation of single-element clusters and provide a more balanced measure of clustering performance. We present new insights for applying short text clustering methodologies in real-world situations, especially in sectors like e-commerce, where text data volumes and dynamics present unique challenges.

CloseRead Abstract

2025

Network-Based Anomaly Detection in Waste Transportation Data

Authors
Shaji, N; Tabassum, S; Ribeiro, RP; Gama, J; Santana, P; Garcia, A;

Publication
COMPLEX NETWORKS & THEIR APPLICATIONS XIII, COMPLEX NETWORKS 2024, VOL 1

Abstract
Waste transport management is a critical sector where maintaining accurate records and preventing fraudulent or illegal activities is essential for regulatory compliance, environmental protection, and public safety. However, monitoring and analyzing large-scale waste transport records to identify suspicious patterns or anomalies is a complex task. These records often involve multiple entities and exhibit variability in waste flows between them. Traditional anomaly detection methods relying solely on individual transaction data, may struggle to capture the deeper, network-level anomalies that emerge from the interactions between entities. To address this complexity, we propose a hybrid approach that integrates network-based measures with machine learning techniques for anomaly detection in waste transport data. Our method leverages advanced graph analysis techniques, such as sub-graph detection, community structure analysis, and centrality measures, to extract meaningful features that describe the network's topology. We also introduce novel metrics for edge weight disparities. Further, advanced machine learning techniques, including clustering, neural network, density-based, and ensemble methods are applied to these structural features to enhance and refine the identification of anomalous behaviors.

CloseRead Abstract

2025

Screening Urban Soil Contamination in Rome: Insights from XRF and Multivariate Analysis

Authors
Chandramohan, MS; da Silva, IM; Ribeiro, RP; Jorge, A; da Silva, JE;

Publication
ENVIRONMENTS

Abstract
This study investigates spatial distribution and chemical elemental composition screening in soils in Rome (Italy) using X-ray fluorescence analysis. Fifty-nine soil samples were collected from various locations within the urban areas of the Rome municipality and were analyzed for 19 elements. Multivariate statistical techniques, including nonlinear mapping, principal component analysis, and hierarchical cluster analysis, were employed to identify clusters of similar soil samples and their spatial distribution and to try to obtain environmental quality information. The soil sample clusters result from natural geological processes and anthropogenic activities on soil contamination patterns. Spatial clustering using the k-means algorithm further identified six distinct clusters, each with specific geographical distributions and elemental characteristics. Hence, the findings underscore the importance of targeted soil assessments to ensure the sustainable use of land resources in urban areas.

CloseRead Abstract

2025

Efficient Instance Selection in Tree-Based Models for Data Streams Classification

Authors
Paim, AM; Gama, J; Veloso, B; Enembreck, F; Ribeiro, RP;

Publication
40TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING

Abstract
The learning from continuous data streams is a relevant area within machine learning, focusing on the creation and updating of predictive models in real time as new data becomes available for training and prediction. Among the most widely used methods for this type of task, Hoeffding Trees are highly valued for their simplicity and robustness across a variety of applications and are considered the primary choice for generating decision trees in data stream contexts. However, Hoeffding Trees tend to continuously expand as new data is incorporated, resulting in increased processing time and memory consumption, often without providing significant gains in accuracy. In this study, we propose an instance selection scheme that combines different strategies to regularize Hoeffding Trees and their variants, mitigating excessive growth without compromising model accuracy. The method selects misclassified instances and a fraction of correctly classified instances during the training phase. After extensive experimental evaluation, the instance selection scheme demonstrates superior predictive performance compared to the original models (without selection), for both real and synthetic datasets for data streams, using a reduced subset of examples. Additionally, the method achieves relevant improvements in processing time, model complexity, and memory consumption, highlighting the effectiveness of the proposed instance selection scheme.

CloseRead Abstract