Publications - INESC TEC

Publications

Publications by João Gama

2015

Guest Editors introduction: special issue of the ECMLPKDD 2015 journal track

Authors
Bielza, C; Gama, J; Jorge, A; Zliobaite, I;

Publication
MACHINE LEARNING

Abstract

2015

Classification of Evolving Data Streams with Infinitely Delayed Labels

Authors
Souza, VMA; Silva, DF; Batista, GEAPA; Gama, J;

Publication
2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA)

Abstract
The majority of evolving data streams classification algorithms assume that the actual labels of the predicted examples are readily available without any time delay just after a prediction is made. However, given the high label costs, dependence of an expert, limitations in data transmission or even restrictions imposed by the problem's nature, there is a large number of real-world applications in which the availability of actual labels is infinitely delayed (never available). In these cases, it is necessary the use of algorithms that does not follow the traditional process of monitoring the error rate to detect changes in data distribution and uses the most recent labeled data to update the classification model. In this paper, we propose the method MClassification to classify evolving data streams with infinitely delayed labels. Our method is inspired on the use of Micro-Cluster representation from online clustering algorithms. Considering the presence of incremental drifts, our approach uses a distance-based strategy to maintain the Micro-Clusters' positions updated. An evaluation in several synthetic and real data shows that MClassification achieves competitive accuracy results to state-of-the-art methods and adequate computational cost. The main advantage of the proposed method is the absence of critical parameters that require user's prior knowledge, as occurs with rival methods.
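As a rough illustration of the idea in the abstract above (labeled micro-clusters whose positions are kept up to date with a distance-based strategy), the following is a minimal sketch and not the authors' algorithm. The names `MicroCluster` and `classify_and_update` are hypothetical, and the sketch assumes the simplest possible policy: every unlabeled point is absorbed by its nearest micro-cluster, with no radius check, merging, or forgetting.

```python
import math


class MicroCluster:
    """Hypothetical labeled summary: point count, per-dimension linear sum, class label."""

    def __init__(self, point, label):
        self.n = 1
        self.ls = list(point)
        self.label = label

    def centroid(self):
        # Mean of all absorbed points, recovered from the linear sum.
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        # Incremental update: the centroid drifts toward recent points.
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]


def classify_and_update(clusters, point):
    """Predict the label of the nearest micro-cluster, then move that cluster.

    Assumption of this sketch: the point is always absorbed by the nearest
    micro-cluster, so no true label is ever needed after initialization.
    """
    nearest = min(clusters, key=lambda mc: math.dist(mc.centroid(), point))
    nearest.absorb(point)
    return nearest.label


clusters = [MicroCluster([0.0, 0.0], "a"), MicroCluster([10.0, 10.0], "b")]
predicted = classify_and_update(clusters, [1.0, 1.0])
```

Because only the initial micro-clusters carry true labels, a sketch like this can follow gradual (incremental) drift but has no way to recover from abrupt changes.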

2013

Data Stream Clustering: A Survey

Authors
Silva, JA; Faria, ER; Barros, RC; Hruschka, ER; de Carvalho, ACPLF; Gama, J;

Publication
ACM COMPUTING SURVEYS

Abstract
Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories are also available for helping researchers and practitioners. Finally, some important issues and open questions that can be subject of future research are discussed.

2014

Distributed Adaptive Model Rules for Mining Big Data Streams

Authors
Vu, AT; De Francisci Morales, GD; Gama, J; Bifet, A;

Publication
2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)

Abstract
Decision rules are among the most expressive data mining models. We propose the first distributed streaming algorithm to learn decision rules for regression tasks. The algorithm is available in SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams. It uses a hybrid of vertical and horizontal parallelism to distribute Adaptive Model Rules (AMRules) on a cluster. The decision rules built by AMRules are comprehensible models, where the antecedent of a rule is a conjunction of conditions on the attribute values, and the consequent is a linear combination of the attributes. Our evaluation shows that this implementation is scalable in relation to CPU and memory consumption. On a small commodity Samza cluster of 9 nodes, it can handle a rate of more than 30,000 instances per second, and achieve a speedup of up to 4.7x over the sequential version.
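The rule structure described in the abstract above (an antecedent that is a conjunction of attribute conditions, and a consequent that is a linear combination of the attributes) can be sketched as a small data structure. This is an illustrative sketch only, not the SAMOA or AMRules API: `Condition`, `Rule`, and their fields are hypothetical names, and rule learning, adaptation, and distribution are left out entirely.

```python
import operator
from dataclasses import dataclass


@dataclass
class Condition:
    """Hypothetical single test on one numeric attribute."""
    attribute: int      # index into the instance vector
    op: str             # "<=" or ">"
    threshold: float

    def holds(self, x):
        compare = operator.le if self.op == "<=" else operator.gt
        return compare(x[self.attribute], self.threshold)


@dataclass
class Rule:
    """Antecedent: conjunction of conditions. Consequent: linear model."""
    conditions: list
    weights: list
    bias: float

    def covers(self, x):
        # A rule fires only if every condition in the conjunction holds.
        return all(c.holds(x) for c in self.conditions)

    def predict(self, x):
        # Linear combination of the attribute values plus an intercept.
        return self.bias + sum(w * v for w, v in zip(self.weights, x))


rule = Rule(
    conditions=[Condition(0, "<=", 5.0), Condition(1, ">", 1.0)],
    weights=[2.0, 0.5],
    bias=1.0,
)
covered = rule.covers([3.0, 4.0])
prediction = rule.predict([3.0, 4.0])
```

Keeping the antecedent and consequent as separate, explicit parts is what makes such rules readable: the conditions say when the rule applies, and the weights say how the prediction is formed.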

2017

The Initialization and Parameter Setting Problem in Tensor Decomposition-Based Link Prediction

Authors
Silva Fernandes, Sd; Tork, HF; da Gama, JMP;

Publication
2017 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2017, Tokyo, Japan, October 19-21, 2017

Abstract
Link prediction is the task of social network analysis whose goal is to predict the links that will appear in the network in future instants. Among the link predictors exploiting the time evolution of the networks, we can find the tensor decomposition-based methods. A major limitation of these methods is the lack of appropriate approaches for estimating their parameters and initialization. In this paper, we address this problem by proposing a parameter setting method. Our proposed approach resorts to optimization techniques to drive the search for an adequate parameter and initialization choice. © 2017 IEEE.

2017

Proceedings of the Workshop on IoT Large Scale Learning from Data Streams co-located with the 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2017), Skopje, Macedonia, September 18-22, 2017

Authors
Mouchaweh, MS; Bifet, A; Bouchachia, H; Gama, J; Ribeiro, RP;

Publication
IOTSTREAMING@PKDD/ECML

Abstract