Publications

Publications by João Gama

2022

Temporal Nodes Causal Discovery for in Intensive Care Unit Survival Analysis

Authors
Nogueira, AR; Ferreira, CA; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022

Abstract
In-hospital and post-ICU-discharge deaths are common, given the severity of the conditions under which many patients are admitted to these wings. Because of this, there is an urgent need to identify and follow these cases closely. Furthermore, as ICU data is usually composed of variables measured in varying time intervals, there is a need for a method that can capture causal relationships in this type of data. To solve this problem, we propose ItsPC, a causal Bayesian network that can model irregular multivariate time-series data. The preliminary results show that ItsPC creates smaller and more concise networks while maintaining the temporal properties. Moreover, its irregular approach to time-series can capture more relationships with the target than the Dynamic Bayesian Networks.

2020

Interconnect bypass fraud detection: a case study

Authors
Veloso, B; Tabassum, S; Martins, C; Espanha, R; Azevedo, R; Gama, J;

Publication
ANNALS OF TELECOMMUNICATIONS

Abstract
The high asymmetry of international termination rates is fertile ground for the appearance of fraud in telecom companies. International calls have higher values when compared with national ones, which raises the attention of fraudsters. In this paper, we present a solution for a real problem called interconnect bypass fraud, more specifically, a newly identified distributed pattern that crosses different countries and keeps fraudsters from being tracked by almost all fraud detection techniques. This problem is one of the most expressive in the telecommunication domain, and it has some abnormal behaviours like the occurrence of a burst of calls from specific numbers. Based on this assumption, we propose the adoption of a new fast forgetting technique that works together with the Lossy Counting algorithm. We apply frequent set mining to capture distributed patterns from different countries. Our goal is to detect as soon as possible items with abnormal behaviours, e.g., bursts of calls, repetitions, mirrors, distributed behaviours and a small number of calls spread across a vast set of destination numbers. The results show that the application of different techniques improves the detection ratio and not only complements the techniques used by the telecom company but also improves the performance of the Lossy Counting algorithm in terms of run-time, memory used and sensitivity in detecting the abnormal behaviours. Additionally, the application of frequent set mining allows us to capture distributed fraud patterns.
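For readers unfamiliar with the Lossy Counting algorithm mentioned in this abstract, the following is a minimal sketch of the standard algorithm only. It does not reproduce the paper's fast forgetting technique or its frequent set mining step. The class name `LossyCounter`, the parameter names, and the toy `calls` stream are illustrative assumptions, not taken from the paper.

```python
import math


class LossyCounter:
    """Standard Lossy Counting for approximate frequent items in a stream.

    Illustrative sketch only; names are hypothetical and the paper's
    fast forgetting extension is not included.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon                # maximum relative undercount
        self.width = math.ceil(1 / epsilon)   # number of items per bucket
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # An unseen item may have been pruned in any earlier bucket.
            self.entries[item] = [1, bucket - 1]
        # At each bucket boundary, prune items that cannot be frequent.
        if self.n % self.width == 0:
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return sorted(k for k, v in self.entries.items() if v[0] >= threshold)


# Toy stream: caller "A" produces a burst, the other callers appear once.
calls = ["A", "x1", "A", "x2", "A", "A", "x3", "A", "A", "x4",
         "A", "x5", "A", "x6", "A", "A", "x7", "A", "x8", "A"]
counter = LossyCounter(epsilon=0.1)
for caller in calls:
    counter.add(caller)
print(counter.frequent(support=0.5))  # prints ['A']
```

The pruning step is what bounds memory: an item survives a bucket boundary only if its count plus its possible undercount exceeds the current bucket number, so callers seen once are discarded while the bursting caller is retained.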

2021

A new self-organizing map based algorithm for multi-label stream classification

Authors
Cerri, R; Costa Júnior, JD; Faria, ER; Gama, J;

Publication
SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021

Abstract
Several algorithms have been proposed for offline multi-label classification. However, applications in areas such as traffic monitoring, social networks, and sensors produce data continuously, the so-called data streams, posing challenges to batch multi-label learning. With the lack of stationarity in the distribution of data streams, new algorithms are needed to adapt online to such changes (concept drift). Also, in realistic applications, changes occur in scenarios with infinitely delayed labels, where the true classes of the arriving instances are never available. We propose an online unsupervised incremental method based on self-organizing maps for multi-label stream classification in scenarios with infinitely delayed labels. We consider the existence of an initial set of labeled instances to train a self-organizing map for each label. The learned models are then used and adapted in an evolving stream to classify new instances, considering that their classes will never be available. We adapt to incremental concept drifts by online updating the weight vectors of winner neurons and the dataset label cardinality. Predictions are obtained using the Bayes rule and the outputs of each neuron, adapting the prior probabilities and conditional probabilities of the classes in the stream. Experiments using synthetic and real datasets show that our method is highly competitive with several ones from the literature, in both stationary and concept drift scenarios. © 2021 ACM.
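The online update of winner-neuron weight vectors described in this abstract can be illustrated with the generic self-organizing map update rule. This is a sketch under assumptions: the function names, the squared Euclidean distance, and the fixed `learning_rate` are illustrative choices, and the paper's per-label maps, label cardinality update, and Bayes-rule prediction are not reproduced.

```python
def winner_index(neurons, x):
    """Return the index of the neuron whose weight vector is closest to x."""
    return min(
        range(len(neurons)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(neurons[i], x)),
    )


def update_winner(neurons, x, learning_rate=0.1):
    """Move the winner neuron's weights a fraction of the way toward x.

    Hypothetical helper: only the winner is updated here, with no
    neighbourhood function and no learning-rate decay.
    """
    i = winner_index(neurons, x)
    neurons[i] = [w + learning_rate * (v - w) for w, v in zip(neurons[i], x)]
    return i


# Two neurons in a 2-D feature space and one unlabeled arriving instance.
neurons = [[0.0, 0.0], [1.0, 1.0]]
winner = update_winner(neurons, [0.8, 1.0], learning_rate=0.5)
print(winner, neurons[winner])
```

Because each arriving instance pulls only its nearest neuron toward itself, the map can follow a gradual shift in the data without needing the true label of that instance.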

2022

Methods and tools for causal discovery and causal inference

Authors
Nogueira, AR; Pugnana, A; Ruggieri, S; Pedreschi, D; Gama, J;

Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Causality is a complex concept, which roots its developments across several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article aims at covering several methodologies that have been developed for both tasks. This survey does not only focus on theoretical aspects, but also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples. This article is categorized under: Algorithmic Development > Causality Discovery; Fundamental Concepts of Data and Knowledge > Explainable AI; Technologies > Machine Learning.

2020

Profiling high leverage points for detecting anomalous users in telecom data networks

Authors
Tabassum, S; Azad, MA; Gama, J;

Publication
ANNALS OF TELECOMMUNICATIONS

Abstract
Fraud in telephony incurs huge revenue losses and poses a menace to both the service providers and legitimate users. This problem is growing alongside augmenting technologies. Yet, the works in this area are hindered by the availability of data and confidentiality of approaches. In this work, we deal with the problem of detecting different types of unsolicited users, from spammers to fraudsters, in a massive phone call network. Most of the malicious users in telecommunications have some characteristics in common. These characteristics can be defined by a set of features whose values are uncommon for normal users. We made use of graph-based metrics to detect profiles that are significantly far from the common user profiles in a real data log with millions of users. To achieve this, we looked for the high leverage points in the 99.99th percentile, which identified a substantial number of users as extreme anomalous points. Furthermore, clustering these points helped distinguish malicious users efficiently and minimized the problem space significantly. Convincingly, the learned profiles of these detected users coincided with fraudulent behaviors.
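The idea of flagging high leverage points above a percentile cutoff can be sketched for the simplest case of a single feature. The formula below is the textbook leverage for a one-predictor linear model with intercept, h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2. The paper works with several graph-based metrics over millions of users, so the single feature, the function names, the nearest-rank percentile, and the toy `out_degrees` data are all assumptions made for illustration.

```python
import math


def leverages(values):
    """Leverage of each point for a one-feature linear model with intercept."""
    n = len(values)
    mean = sum(values) / n
    spread = sum((v - mean) ** 2 for v in values)
    return [1 / n + (v - mean) ** 2 / spread for v in values]


def high_leverage_indices(values, percentile=99.99):
    """Indices of points whose leverage lies strictly above the percentile.

    Uses the nearest-rank percentile; the default mirrors the abstract's
    99.99th percentile, which is only meaningful for very large samples.
    """
    h = leverages(values)
    ranked = sorted(h)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    cutoff = ranked[rank - 1]
    return [i for i, hi in enumerate(h) if hi > cutoff]


# Toy data: number of distinct callees per user (hypothetical feature).
out_degrees = [3, 4, 5, 4, 3, 5, 4, 3, 4, 500]
print(high_leverage_indices(out_degrees, percentile=90))  # prints [9]
```

A useful sanity check is that the leverages of this one-feature model always sum to 2, the number of fitted parameters, so a single user holding almost all of that total is far from the common profile.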

2022

Semi-causal decision trees

Authors
Nogueira, AR; Ferreira, CA; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
Typically, classification algorithms use correlation analysis to make decisions. However, these decisions and the models they learn are not easily understandable for the typical user. Causal discovery is the field that studies the means to find causal relationships in observational data. Although highly interpretable, causal discovery algorithms tend not to perform as well in classification problems. This paper aims to propose a hybrid decision tree approach (SC tree) that mixes causal discovery with correlation analysis through the implementation of a custom metric to split the data in the tree's construction (Semi-causal gain ratio). In the results, the proposed methodology obtained a significant performance improvement (11.26% mean error rate) when compared to several causal baselines, CDT-PS (23.67%) and CDT-SPS (25.14%), matching closely the performance of J48 (10.20%), used as a correlation baseline, in ten binary data sets. Besides, when compared with PC in discrete data sets, the proposed approach obtained substantial improvement (16.17% against 28.07% in terms of mean error rate).
