Publications

Publications by João Gama

2013

Avoiding Anomalies in Data Stream Learning

Authors
Gama, J; Kosina, P; Almeida, E;

Publication
DISCOVERY SCIENCE

Abstract
The presence of anomalies in data compromises data quality and can reduce the effectiveness of learning algorithms. Standard data mining methodologies treat data cleaning as a pre-processing step before the learning task. The problem of data cleaning is exacerbated when learning in the computational model of data streams. In this paper we present a streaming algorithm for learning classification rules that is able to detect contextual anomalies in the data. Contextual anomalies are surprising attribute values in the context defined by the conditional part of the rule. For each example we compute the degree of anomaliness based on the probability of the attribute values given the conditional part of the rule covering the example. The examples with a high degree of anomaliness are signaled to the user and not used to train the classifier. The experimental evaluation on real-world data sets shows the ability to discover anomalous examples in the data. The main advantage of the proposed method is the ability to inform the context and explain why the anomaly occurs.
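
The idea in this abstract, scoring an example by how unlikely its attribute values are among the examples a rule already covers, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class `RuleStats`, the Laplace-smoothed frequency estimate, the "least likely attribute" score, and the `threshold` and `min_seen` values are hypothetical and are not taken from the paper, which may define the degree of anomaliness differently. The sketch handles categorical attributes only and expects non-empty examples.

```python
from collections import Counter, defaultdict


class RuleStats:
    """Value counts for the examples covered by one rule (hypothetical helper)."""

    def __init__(self, threshold=0.9, min_seen=5):
        self.counts = defaultdict(Counter)  # attribute -> value -> count
        self.n = 0                          # examples used for training so far
        self.threshold = threshold          # assumed cut-off, not from the paper
        self.min_seen = min_seen            # warm-up before anything is flagged

    def probability(self, attr, value):
        """Laplace-smoothed estimate of P(attr = value | rule covers example)."""
        seen = self.counts[attr]
        # Number of distinct values, counting this one if it is new.
        k = len(seen) + (0 if value in seen else 1)
        return (seen[value] + 1) / (self.n + k)

    def score(self, example):
        """Anomaly score in [0, 1): one minus the least likely attribute value."""
        return 1.0 - min(self.probability(a, v) for a, v in example.items())

    def process(self, example):
        """Return True if the example is flagged; flagged examples are not learned."""
        if self.n >= self.min_seen and self.score(example) > self.threshold:
            return True
        for attr, value in example.items():
            self.counts[attr][value] += 1
        self.n += 1
        return False
```

Because the score is tied to a single rule and a single attribute value, a flagged example can be reported together with the rule's condition and the surprising value, which is the kind of explanation the abstract highlights.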


2017

Comparison Between Co-training and Self-training for Single-target Regression in Data Streams using AMRules

Authors
Sousa, R; Gama, J;

Publication
Proceedings of the Workshop on IoT Large Scale Learning from Data Streams co-located with the 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2017), Skopje, Macedonia, September 18-22, 2017.

Abstract
A comparison between co-training and self-training methods for single-target regression based on multiple learners is performed. Data streaming systems can create a significant amount of unlabeled data, because labels may be impossible to assign, costly to obtain, or slow to produce. In supervised learning, this data is wasted. In order to take advantage of unlabeled data, semi-supervised approaches such as Co-training and Self-training have been created to benefit from the input information contained in unlabeled data. However, these approaches have been applied to classification and batch training scenarios. For this reason, this paper presents a comparison between Co-training and Self-training methods for single-target regression in data streams. Rule learning is used in this context since it makes it possible to exploit the input information. The experimental evaluation consisted of a comparison between the standard scenario, where all unlabeled data is discarded, and scenarios where unlabeled data is used to improve the regression model. Results show evidence of better performance in terms of error reduction, and at high levels of unlabeled examples in the stream. Nevertheless, the improvements are not substantial.


2017

Efficient Incremental Laplace Centrality Algorithm for Dynamic Networks

Authors
Sarmento, RP; Cordeiro, M; Brazdil, P; Gama, J;

Publication
Complex Networks & Their Applications VI - Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications), COMPLEX NETWORKS 2017, Lyon, France, November 29 - December 1, 2017.

Abstract
Social Network Analysis (SNA) is an important research area. It originated in sociology but has spread to other areas of research, including anthropology, biology, information science, organizational studies, political science, and computer science. This has stimulated research on how to support SNA with the development of new algorithms. One of the critical areas involves the calculation of different centrality measures. The challenge is how to do this quickly, as increasingly large datasets become available. Our contribution is an incremental version of the Laplacian Centrality measure that can be applied not only to large graphs but also to dynamically changing networks. We have conducted several tests with different types of evolving networks. We show that our incremental version can process a given large network faster than the corresponding batch version in both incremental and full dynamic network setups. © Springer International Publishing AG 2018.
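
To make the incremental idea concrete, here is a minimal sketch for simple, undirected, unweighted graphs. It relies on the standard closed form of Laplacian centrality in that setting, the drop in Laplacian energy when a node is removed, which equals d(v)^2 + d(v) + 2 * (sum of the neighbours' degrees). Since that value depends only on a node and its neighbours, adding an edge can only change the scores of the two endpoints and their neighbours. The function names and the adjacency-set representation are assumptions made for illustration; this is not the algorithm from the paper, which also covers edge removals and other dynamic setups.

```python
from collections import defaultdict


def laplacian_centrality(adj, v):
    """Closed form for a simple unweighted graph: d^2 + d + 2 * sum of neighbour degrees."""
    d = len(adj[v])
    return d * d + d + 2 * sum(len(adj[u]) for u in adj[v])


def batch_scores(adj):
    """Recompute every node's score from scratch (the batch baseline)."""
    return {v: laplacian_centrality(adj, v) for v in list(adj)}


def add_edge_incremental(adj, scores, a, b):
    """Insert edge (a, b) and refresh only the scores that can change.

    Assumes a != b and that the edge is not already present.
    Returns the set of nodes whose scores were recomputed.
    """
    adj[a].add(b)
    adj[b].add(a)
    affected = {a, b} | adj[a] | adj[b]
    for v in affected:
        scores[v] = laplacian_centrality(adj, v)
    return affected


# Example: start from the path 1 - 2 - 3, then add the edge 3 - 4.
graph = defaultdict(set)
for x, y in [(1, 2), (2, 3)]:
    graph[x].add(y)
    graph[y].add(x)
scores = batch_scores(graph)
touched = add_edge_incremental(graph, scores, 3, 4)
```

In this example only nodes 2, 3 and 4 are recomputed; node 1 keeps its previous score, and the result matches a full batch recomputation.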


2016

MINAS: multiclass learning algorithm for novelty detection in data streams

Authors
de Faria, ER; de Leon Ferreira Carvalho, ACPDF; Gama, J;

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Data stream mining is an emergent research area that aims at extracting knowledge from large amounts of continuously generated data. Novelty detection (ND) is a classification task that assesses whether one or a set of examples differ significantly from the previously seen examples. This is an important task for data streams, as new concepts may appear, disappear or evolve over time. Most of the works found in the ND literature present it as a binary classification task. In several real-life data stream problems, ND must be treated as a multiclass task, in which the known concept is composed of one or more classes and different new classes may appear. This work proposes MINAS, an algorithm for ND in data streams. MINAS deals with ND as a multiclass task. In the initial training phase, MINAS builds a decision model based on a labeled data set. In the online phase, new examples are classified using this model, or marked as unknown. Groups of unknown examples can be used later to create valid novelty patterns (NP), which are added to the current model. The decision model is updated as new data arrive over the stream in order to reflect changes in the known classes and allow the addition of NP. This work also presents a set of experiments comparing MINAS and the main novelty detection algorithms found in the literature, using artificial and real data sets. The experimental results show the potential of the proposed algorithm.


2017

Mobility Mining Using Nonnegative Tensor Factorization

Authors
Nosratabadi, HE; Fanaee T, H; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)

Abstract
Mobility mining has many applications in urban planning and transportation systems. In particular, extracting mobility patterns gives service providers a global insight into mobility behaviors, which in turn leads to better services for citizens. In recent years several data mining techniques have been presented to tackle this problem. These methods are usually either spatial extensions of temporal methods or temporal extensions of spatial methods. However, a framework that preserves the natural structure of mobility data has not yet been considered. Non-negative tensor factorizations (NNTF) have shown great applications in topic modelling and pattern recognition. Their usefulness in mobility mining, however, is less explored. In this paper we propose a new mobility pattern mining framework based on a recent non-negative tensor model called BetaNTF. We also present a new approach, based on the concept of interpretability, for determining the number of components in the tensor rank selection process. We then demonstrate some meaningful mobility patterns extracted with the proposed method from bike sharing network mobility data in Boston, USA.


2013

On recommending urban hotspots to find our next passenger

Authors
Moreira Matias, L; Fernandes, R; Gama, J; Ferreira, M; Mendes Moreira, J; Damas, L;

Publication
CEUR Workshop Proceedings

Abstract
Rising fuel costs are ruling out random cruising strategies for passenger finding. Here, a recommendation model to suggest the most passenger-profitable urban area/stand is presented. This framework is able to combine 1) the underlying historical patterns of passenger demand and 2) the current network status to decide which is the best zone to head to at each moment. The major contribution of this work is on how to combine well-known methods for learning from data streams (such as the historical GPS traces) as an approach to solve this particular problem. The results were promising: 395.361/506.873 of the services dispatched were correctly predicted. The experiments also highlighted that a fleet equipped with such a framework outperformed a fleet without it: its average waiting time to pick up a passenger was 5% lower than its competitor's. © 2013 IJCAI.
