Publications - INESC TEC

Publications

Publications by Pedro Pereira Rodrigues

2009

Issues in Evaluation of Stream Learning Algorithms

Authors
Gama, J; Sebastiao, R; Rodrigues, PP;

Publication
KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING

Abstract
Learning from data streams is a research area of increasing importance. Several stream learning algorithms have been developed in recent years. Most of them learn decision models that continuously evolve over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet adequately addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. There are no gold standards for assessing performance in non-stationary environments. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of Predictive Sequential methods for error estimation: the prequential error. The prequential error allows us to monitor the evolution of the performance of models that evolve over time. Nevertheless, it is known to be a pessimistic estimator in comparison to holdout estimates. To obtain more reliable estimators we need some forgetting mechanism. Two viable alternatives are sliding windows and fading factors. We observe that the prequential error converges to a holdout estimator when estimated over a sliding window or using fading factors. We present illustrative examples of the use of prequential error estimators, using fading factors, for the tasks of: i) assessing the performance of a learning algorithm; ii) comparing learning algorithms; iii) hypothesis testing using the McNemar test; and iv) change detection using the Page-Hinkley test. In these tasks, the prequential error estimated using fading factors provides reliable estimators. In comparison to sliding windows, fading factors are faster and memoryless, a requirement for streaming applications. This paper is a contribution to the discussion on good practices in performance assessment when learning dynamic models that evolve over time.
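
As a rough illustration of the two forgetting mechanisms named in the abstract, the sketch below computes a prequential error over a stream of per-example losses, once with a fading factor and once with a sliding window. The abstract does not give the formulas, so the recursive form used here (faded loss sum divided by faded example count), the class names, and the parameter values `alpha` and `size` are assumptions made for illustration only.

```python
from collections import deque


class FadingPrequentialError:
    """Prequential error with an exponential fading factor (assumed form).

    Keeps only two numbers in memory, regardless of stream length.
    """

    def __init__(self, alpha=0.99):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.loss_sum = 0.0  # faded sum of observed losses
        self.weight = 0.0    # faded count of observed examples

    def update(self, loss):
        # Older losses are multiplied by alpha at every step.
        self.loss_sum = loss + self.alpha * self.loss_sum
        self.weight = 1.0 + self.alpha * self.weight
        return self.loss_sum / self.weight


class WindowPrequentialError:
    """Prequential error averaged over the last `size` losses."""

    def __init__(self, size=100):
        self.losses = deque(maxlen=size)  # must store the whole window

    def update(self, loss):
        self.losses.append(loss)
        return sum(self.losses) / len(self.losses)


def demo():
    # Synthetic 0/1 losses: the model is wrong on the first 50 examples,
    # then right on the following 200.
    stream = [1.0] * 50 + [0.0] * 200
    fading = FadingPrequentialError(alpha=0.9)
    window = WindowPrequentialError(size=20)
    plain = FadingPrequentialError(alpha=1.0)  # alpha = 1 means no forgetting
    for loss in stream:
        f = fading.update(loss)
        w = window.update(loss)
        p = plain.update(loss)
    return f, w, p


if __name__ == "__main__":
    print(demo())
```

With no forgetting, the early mistakes keep weighing on the estimate long after the model has recovered, while both forgetting mechanisms track recent performance. The fading-factor version stores two floats, whereas the window version stores `size` losses.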


2008

Online reliability estimates for individual predictions in data streams

Authors
Rodrigues, PP; Gama, J; Bosnic, Z;

Publication
Proceedings - IEEE International Conference on Data Mining Workshops, ICDM Workshops 2008

Abstract
Several predictive systems are nowadays vital for operations and decision support. The quality of these systems is most often defined by their average accuracy, which carries little or no information about the estimated error of each individual prediction. In many sensitive applications, users should be able to associate a measure of reliability with each prediction. For batch systems, reliability measures have already been defined, mostly empirical ones such as estimation using local sensitivity analysis. However, with the advent of data streams, these reliability estimates should also be computed online, based only on the available data and the current model's state. In this paper we define empirical measures to perform online estimation of the reliability of individual predictions made in the context of online learning systems. We present preliminary results and evaluate the estimators on two different problems. © 2008 IEEE.


2007

An overview on learning from data streams - Preface

Authors
Gama, J; Rodrigues, P; Aguilar Ruiz, J;

Publication
NEW GENERATION COMPUTING

2010

A Simple Dense Pixel Visualization for Mobile Sensor Data Mining

Authors
Rodrigues, PP; Gama, J;

Publication
KNOWLEDGE DISCOVERY FROM SENSOR DATA

Abstract
Sensor data is usually represented by streaming time series. Current state-of-the-art systems for visualization include line plots and three-dimensional representations, which most of the time require screen resolutions that are not available on small transient mobile devices. Moreover, when data presents cyclic behaviors, such as in the electricity domain, predictive models may tend to give higher errors at certain recurrent points in time, but the human eye is not trained to notice these cycles in a long stream. In these contexts, information is usually hard to extract from visualization. New visualization techniques may help to detect recurrent faulty predictions. In this paper we inspect visualization techniques in the scope of a real-world sensor network, briefly delving into future trends in visualization on transient mobile devices. We propose a simple dense pixel display visualization system, exploiting the benefits it may offer for detecting and correcting recurrent faulty predictions. A case study is also presented, where a simple corrective strategy is studied in the context of global electrical load demand, exemplifying the utility of the new visualization method when compared with automatic detection of recurrent errors.


2009

An overview on mining data streams

Authors
Gama, J; Rodrigues, PP;

Publication
Studies in Computational Intelligence

Abstract
The most challenging applications of knowledge discovery involve dynamic environments where data flow continuously at high speed and exhibit non-stationary properties. In this chapter we discuss the main challenges and most relevant issues in knowledge discovery from data streams: incremental learning, cost-performance management, change detection, and novelty detection. We present illustrative algorithms for these learning tasks, and a real-world application illustrating the advantages of stream processing. The chapter ends with some open issues that emerge from this new research area. © 2009 Springer-Verlag Berlin Heidelberg.


2007

Semi-fuzzy splitting in Online Divisive-Agglomerative Clustering

Authors
Rodrigues, PP; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
Online Divisive-Agglomerative Clustering (ODAC) is an incremental approach for clustering streaming time series using a hierarchical procedure over time. It constructs a tree-like hierarchy of clusters of streams, using a top-down strategy based on the correlation between streams. The system also possesses an agglomerative phase to enhance a dynamic behavior capable of structural change detection. However, the split decision used in the algorithm focuses on the crisp boundary between two groups, which implies a high risk since it has to decide based on only a small subset of the entire data. In this work we propose a semi-fuzzy approach to the assignment of variables to newly created clusters, for a better trade-off between validity and performance. Experimental work supports the benefits of our approach.
