Publications

Publications by AI

2022

How are you Riding? Transportation Mode Identification from Raw GPS Data

Authors
Andrade, T; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022

Abstract
Analyzing the way individuals move is fundamental to understanding the dynamics of humanity. Transportation mode plays a significant role in human behavior, as it changes how individuals travel, how far, and how often they can move. The identification of transportation modes can be used in many applications and is a key component of the Internet of Things (IoT) and the Smart Cities concept, as it helps to organize traffic control and transport management. In this paper, we propose the use of ensemble methods to infer transportation modes from raw GPS data. From latitude, longitude, and timestamp, we perform feature engineering to obtain more discriminative fields for the classification. We test our features with several machine learning algorithms and, for those with the best results, perform feature selection using the Boruta method in order to boost accuracy and decrease the amount of data, processing time, and noise in the model. We assess the validity of our approach on a real-world dataset with several different transportation modes, and the results show the efficacy of our approach.
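
The abstract does not list the engineered features. As a minimal sketch of the kind of fields that can be derived from latitude, longitude, and timestamp alone, the following computes per-segment distance, duration, and speed. The haversine formula, the function names `haversine_m` and `segment_features`, and the choice of features are assumptions made for illustration, not the authors' method.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_features(track):
    """Derive simple features for each pair of consecutive GPS fixes.

    track: list of (lat_deg, lon_deg, unix_seconds), sorted by time.
    Returns one dict per segment with distance (m), duration (s), speed (m/s).
    """
    feats = []
    for (la1, lo1, t1), (la2, lo2, t2) in zip(track, track[1:]):
        dist = haversine_m(la1, lo1, la2, lo2)
        dt = t2 - t1
        # Guard against duplicate timestamps to avoid division by zero.
        speed = dist / dt if dt > 0 else 0.0
        feats.append({"distance_m": dist, "duration_s": dt, "speed_mps": speed})
    return feats
```

Per-segment values like these would typically be aggregated per trip (for example mean and maximum speed) before being passed to a classifier.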


2022

Methods and tools for causal discovery and causal inference

Authors
Nogueira, AR; Pugnana, A; Ruggieri, S; Pedreschi, D; Gama, J;

Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Causality is a complex concept with roots in several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact that a change in a certain variable has on an outcome of interest. This article covers several methodologies that have been developed for both tasks. The survey does not focus only on theoretical aspects but also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples. This article is categorized under: Algorithmic Development > Causality Discovery; Fundamental Concepts of Data and Knowledge > Explainable AI; Technologies > Machine Learning


2022

Improving the Prediction of Age of Onset of TTR-FAP Patients Using Graph-Embedding Features

Authors
Pedroto, M; Jorge, A; Mendes Moreira, J; Coelho, T;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022

Abstract
Transthyretin Familial Amyloid Polyneuropathy (TTR-FAP) is a neurological genetic illness that inflicts severe symptoms after the onset occurs. Age of onset represents the moment a patient starts to experience the symptoms of a disease. An accurate prediction of this event can improve clinical and operational guidelines that define the work of doctors, nurses, and operational staff. In this work, we transform family trees into compact vectors, that is, embeddings, and use these as input features to predict the age of onset of patients with TTR-FAP. Our purpose is to evaluate how information present in genealogical trees can be transformed and used to improve a regression-based setting for TTR-FAP age of onset prediction. Our results show that combining manual and graph-embedding features decreases the mean prediction error when there is less information regarding a patient's family. With this work, we open the way for future work in representation learning for genealogical data, enabling a more effective exploitation of machine learning approaches.


2022

Novel features for time series analysis: a complex networks approach

Authors
Silva, VF; Silva, ME; Ribeiro, P; Silva, F;

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Being able to capture the characteristics of a time series with a feature vector is a very important task with a multitude of applications, such as classification, clustering, or forecasting. Usually, the features are obtained from linear and nonlinear time series measures that may present several data-related drawbacks. In this work, we introduce NetF as an alternative set of features, incorporating several representative topological measures of different complex network mappings of the time series. Our approach does not require data preprocessing and is applicable regardless of any data characteristics. Exploring our novel feature vector, we are able to connect mapped network features to properties inherent in diversified time series models, showing that NetF can be useful to characterize time data. Furthermore, we demonstrate the applicability of our methodology in clustering synthetic and benchmark time series sets, comparing its performance with more conventional features and showcasing how NetF can achieve high-accuracy clusters. Our results are very promising, with network features from different mapping methods capturing different properties of the time series, adding a different and rich feature set to the literature.
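
The abstract does not specify which network mappings or topological measures NetF uses. As a rough illustration of the general idea of mapping a time series to a network and reading features off it, the sketch below builds a natural visibility graph, one well-known mapping, and computes node degrees. The choice of mapping, the function names, and the degree-based feature are assumptions made for illustration, not a description of NetF itself.

```python
def natural_visibility_edges(series):
    """Map a time series to a natural visibility graph.

    Nodes are time indices. Two points a < b are linked when every point c
    between them lies strictly below the straight line joining
    (a, series[a]) and (b, series[b]).
    """
    n = len(series)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                series[c] < series[a] + (series[b] - series[a]) * (c - a) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.append((a, b))
    return edges


def degrees(n, edges):
    """Degree of each of the n nodes in an undirected edge list."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


series = [1.0, 3.0, 2.0, 4.0]
edges = natural_visibility_edges(series)
deg = degrees(len(series), edges)
mean_degree = sum(deg) / len(deg)  # one simple topological feature
```

This brute-force construction is quadratic in the series length for each candidate pair check, which is acceptable for short series but would need a faster algorithm for long ones.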


2022

Which distance dimensions matter in international research collaboration? A cross-country analysis by scientific domain

Authors
Vieira, ES; Cerdeira, J; Teixeira, AAC;

Publication
JOURNAL OF INFORMETRICS

Abstract
The relevance of international research collaboration (IRC) in bolstering intellectual capital, increasing embeddedness in networks, and promoting innovation has been acknowledged by scientists and policymakers. However, large-scale studies involving different scientific domains and periods aimed at exploring the factors that influence IRC are missing, which could deepen our understanding of the factors affecting IRC. Based on a novel dataset of 193 countries over three periods, 1990-1999, 2000-2009 and 2010-2018, we have examined the impact of geographical, socioeconomic, political, cultural, intellectual, and excellence distances on the propensity to engage in IRC at the global level, by scientific domain and over time. In general, all the distances considered obstruct IRC, with geographical and cultural distance emerging as the barriers with the highest impact. Two exceptions are worth noting: excellence distance fosters IRC in the Medical & Health Sciences (MHS) and intellectual distance fosters IRC in the Agricultural Sciences (AS). At the global level, the negative impact of socioeconomic, political, and intellectual distances on IRC has increased over time, whereas the negative impact of geographical and cultural distances has decreased.


2022

Semi-causal decision trees

Authors
Nogueira, AR; Ferreira, CA; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
Typically, classification algorithms use correlation analysis to make decisions. However, these decisions and the models they learn are not easily understandable for the typical user. Causal discovery is the field that studies the means to find causal relationships in observational data. Although highly interpretable, causal discovery algorithms tend not to perform as well in classification problems. This paper proposes a hybrid decision tree approach (SC tree) that mixes causal discovery with correlation analysis through a custom metric for splitting the data during the tree's construction (Semi-causal gain ratio). In the results, the proposed methodology obtained a significant performance improvement (11.26% mean error rate) when compared to several causal baselines, CDT-PS (23.67%) and CDT-SPS (25.14%), closely matching the performance of J48 (10.20%), used as a correlation baseline, in ten binary data sets. Moreover, when compared with PC in discrete data sets, the proposed approach obtained a substantial improvement (16.17% against 28.07% in terms of mean error rate).
