Publications

Publications by João Gama

2020

Optimizing Waste Collection: A Data Mining Approach

Authors
Londres, G; Filipe, N; Gama, J;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2019, PT I

Abstract
The smart cities concept (the use of connected services and intelligent systems to support decision making in city governance) aims to build better sustainability and living conditions for urban spaces, which grow more complex every day. This work aims to optimize the waste collection circuits for non-residential customers in a city in Portugal. It is developed through a simple, low-cost methodology compared to commercially available sensor systems. The main goal is to build a classifier for each client that forecasts the presence or absence of containers and, in a second step, predicts how many containers of glass, paper or plastic will be available for collection. Data were acquired over one year, from January to December 2017, from more than 100 customers, resulting in a dataset of more than 26,000 records. For its degree of interpretability, we use decision trees, implemented with a sliding window that runs through the months of the year, stacking months one by one and/or merging small groups of months to maximize prediction accuracy. This project results in more efficient waste-collection routes, increasing operating profits and reducing both costs and fuel consumption, thereby diminishing the operation's environmental footprint.
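The sliding-window scheme described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the features, window size of three months, and synthetic data are all invented for the example.

```python
# Hypothetical sketch: retrain a decision tree on the most recent months
# and score it on the next month, sliding the window through the year.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
months = 12
records_per_month = 50
# Invented features, e.g. weekday, last observed fill level, client type.
X = rng.random((months * records_per_month, 3))
y = (X[:, 1] > 0.5).astype(int)  # container present / absent (synthetic)

window = 3  # train on the last `window` months, predict the next one
scores = []
for m in range(window, months):
    train = slice((m - window) * records_per_month, m * records_per_month)
    test = slice(m * records_per_month, (m + 1) * records_per_month)
    clf = DecisionTreeClassifier(max_depth=4).fit(X[train], y[train])
    scores.append(clf.score(X[test], y[test]))

print(round(sum(scores) / len(scores), 2))  # mean monthly accuracy
```

Stacking months "one by one", as the abstract puts it, would correspond to growing the training slice instead of sliding it at a fixed width.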

2020

Fraud Detection using Heavy Hitters: a Case Study

Authors
Veloso, B; Martins, C; Espanha, R; Azevedo, R; Gama, J;

Publication
PROCEEDINGS OF THE 35TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING (SAC'20)

Abstract
The high asymmetry of international termination rates, where calls are charged at higher values, is fertile ground for the appearance of fraud in telecom companies. In this paper, we present three different and complementary solutions for a real problem called Interconnect Bypass Fraud. This problem is one of the most common in the telecommunications domain and can be detected by the occurrence of abnormal behaviours from specific numbers. Our goal is to detect, as soon as possible, numbers with abnormal behaviours, e.g. bursts of calls, repetitions and mirror behaviours. Based on this assumption, we propose: (i) the adoption of a new fast forgetting technique that works together with the Lossy Counting algorithm; (ii) a single-pass hierarchical heavy hitters algorithm that also contains a forgetting technique; and (iii) the application of HyperLogLog sketches to each phone number. We use the heavy hitters to detect abnormal behaviours such as bursts of calls, repetitions and mirror behaviours. The hierarchical heavy hitters algorithm is used to detect numbers that make calls to a huge set of destinations, and destination numbers that receive a huge set of calls to provoke a denial of service. Additionally, to estimate the cardinality of destination numbers for each origin number, we use the HyperLogLog algorithm. The results show that these three approaches combined complement the techniques used by the telecom company and make the fraud task more difficult.
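The combination of counting and forgetting described above can be illustrated with a small sketch. This is not the paper's algorithm: the decay factor, threshold, pruning rule, and phone numbers are invented, and real Lossy Counting uses bucket-based error bounds rather than this simplified prune.

```python
# Minimal sketch of "heavy hitters with forgetting": counts decay each
# batch, so only numbers with sustained bursts remain heavy.
from collections import defaultdict

class DecayingCounter:
    def __init__(self, decay=0.9, threshold=5.0):
        self.counts = defaultdict(float)
        self.decay = decay          # forgetting factor applied per batch
        self.threshold = threshold  # minimum decayed count to be "heavy"

    def observe_batch(self, numbers):
        for k in list(self.counts):
            self.counts[k] *= self.decay   # fast forgetting
            if self.counts[k] < 0.5:
                del self.counts[k]         # prune near-zero entries
        for n in numbers:
            self.counts[n] += 1.0

    def heavy_hitters(self):
        return {k for k, v in self.counts.items() if v >= self.threshold}

hh = DecayingCounter()
for _ in range(3):
    hh.observe_batch(["+351-A"] * 10 + ["+351-B"])  # A bursts, B is occasional
print(sorted(hh.heavy_hitters()))  # only the bursting number survives
```

A HyperLogLog sketch per origin number would play the complementary role of estimating how many distinct destinations each number calls, without storing them all.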

2020

Improving Prediction with Causal Probabilistic Variables

Authors
Nogueira, AR; Gama, J; Ferreira, CA;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS XVIII, IDA 2020

Abstract
The application of feature engineering in classification problems has been commonly used as a means to increase the performance of classification algorithms. There are already many methods for constructing features based on the combination of attributes but, to the best of our knowledge, none of these methods takes into account a particular characteristic found in many problems: causality. In many observational data sets, causal relationships can be found between the variables, meaning that it is possible to extract those relations from the data and use them to create new features. The main goal of this paper is to propose a framework for the creation of new supposed causal probabilistic features that encode the inferred causal relationships between the target and the other variables. An improvement in performance was achieved when the framework was applied with the Random Forest algorithm.
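The general flavour of a probabilistic feature can be sketched as below. This is a hedged illustration only: the paper infers causal relationships first, whereas this sketch simply appends an empirical conditional probability P(target | attribute value) as an extra input, with invented synthetic data.

```python
# Illustrative sketch: derive a probabilistic feature from training data
# and feed it, alongside the raw attribute, to a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
x = rng.integers(0, 5, size=300)                       # a discrete attribute
y = (rng.random(300) < 0.15 + 0.15 * x).astype(int)    # target depends on x

# Probabilistic feature: empirical P(y = 1 | x = v) per attribute value.
p_y_given_x = {v: y[x == v].mean() for v in np.unique(x)}
prob_feature = np.array([p_y_given_x[v] for v in x])

X = np.column_stack([x, prob_feature])                 # raw + engineered
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(clf.score(X, y), 2))                       # training accuracy
```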

2019

Special Issue of DASFAA 2019

Authors
Li, G; Gama, J; Yang, J;

Publication
Data Science and Engineering

Abstract

2020

A scalable saliency-based feature selection method with instance-level information

Authors
Cancela, B; Bolon Canedo, V; Alonso Betanzos, A; Gama, J;

Publication
KNOWLEDGE-BASED SYSTEMS

Abstract
Classic feature selection techniques remove irrelevant or redundant features to achieve a subset of relevant features in compact models that are easier to interpret, and so improve knowledge extraction. Most such techniques operate on the whole dataset, but are unable to provide the user with useful information when only instance-level information is required; in other words, classic feature selection algorithms do not identify the most relevant information in a sample. We have developed a novel feature selection method, called saliency-based feature selection (SFS), based on deep-learning saliency techniques. Our algorithm works with any architecture trained using gradient descent techniques (neural networks, SVMs, ...), and can be used for classification or regression problems. Experimental results show that our algorithm is robust, as it allows the feature ranking to be transferred between different architectures, achieving remarkable results. The versatility of our algorithm has also been demonstrated, as it works in big data environments as well as with small datasets.
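The core idea behind gradient-based saliency can be sketched in a few lines. This is not the paper's SFS method: it ranks features by the mean absolute gradient of a hand-rolled logistic model's output with respect to each input, on invented synthetic data, purely to illustrate the mechanism.

```python
# Illustrative sketch: saliency of feature j = average |d output / d x_j|.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (2.0 * X[:, 0] - 0.1 * X[:, 3] > 0).astype(float)  # feature 0 dominates

# Train a logistic model with plain gradient descent.
w = np.zeros(4)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

# For a logistic model, d p / d x = p (1 - p) * w; average its magnitude.
p = 1.0 / (1.0 + np.exp(-X @ w))
saliency = np.mean(np.abs(p * (1 - p))[:, None] * np.abs(w), axis=0)
ranking = np.argsort(saliency)[::-1]
print(ranking[0])  # index of the most salient feature
```

Evaluating the per-sample gradient `p * (1 - p) * w` before averaging is what yields the instance-level information the abstract emphasizes.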

2020

BRIGHT-Drift-Aware Demand Predictions for Taxi Networks

Authors
Saadallah, A; Moreira Matias, L; Sousa, R; Khiari, J; Jenelius, E; Gama, J;

Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract
Massive data broadcast by GPS-equipped vehicles provide unprecedented opportunities. One of the main tasks in optimizing our transportation networks is to build data-driven real-time decision support systems. However, the dynamic environments in which the networks operate invalidate the traditional assumptions, such as finite training sets or stationary distributions, required to put many off-the-shelf supervised learning algorithms into practice. In this paper, we propose BRIGHT: a drift-aware supervised learning framework to predict demand quantities. BRIGHT aims to provide accurate predictions for short-term horizons through a creative ensemble of time series analysis methods that handles distinct types of concept drift. By selecting neighborhoods dynamically, BRIGHT reduces the likelihood of overfitting. By ensuring diversity among the base learners, BRIGHT achieves a high reduction in variance while keeping bias stable. Experiments were conducted on three large-scale heterogeneous real-world transportation networks in Porto (Portugal), Shanghai (China), and Stockholm (Sweden), as well as in controlled experiments with synthetic data in which multiple distinct drifts were artificially induced. The results illustrate the advantages of BRIGHT over state-of-the-art methods for this task.
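A drift-aware ensemble of the general kind the abstract describes can be sketched as follows. This is a toy reconstruction, not the BRIGHT implementation: the three base predictors, the error-forgetting factor, and the synthetic demand series are all invented for illustration.

```python
# Toy sketch: base predictors are reweighted by their recent error, so the
# ensemble's combination tracks concept drift in the series.
import numpy as np

def mean_model(history):   return float(np.mean(history[-24:]))
def last_model(history):   return history[-1]
def trend_model(history):  return history[-1] + (history[-1] - history[-2])

models = [mean_model, last_model, trend_model]
weights = np.ones(len(models)) / len(models)
alpha = 0.3  # forgetting factor for the running error estimate

series = list(np.sin(np.arange(100) / 5) * 10 + 20)  # synthetic demand
errors = np.zeros(len(models))
preds = []
for t in range(24, len(series)):
    history = series[:t]
    p = np.array([m(history) for m in models])
    preds.append(float(weights @ p))          # weighted ensemble forecast
    errors = (1 - alpha) * errors + alpha * np.abs(p - series[t])
    inv = 1.0 / (errors + 1e-9)
    weights = inv / inv.sum()                 # favor low recent error

print(len(preds))
```

When the series' dynamics change (a drift), the error-weighted update shifts mass toward whichever base predictor copes best with the new regime.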
