Publications - INESC TEC

Publications

Publications by LIAAD

2015

Online tree-based ensembles and option trees for regression on evolving data streams

Authors
Ikonomovska, E; Gama, J; Dzeroski, S;

Publication
NEUROCOMPUTING

Abstract
The emergence of ubiquitous sources of streaming data has given rise to the popularity of algorithms for online machine learning. In that context, Hoeffding trees represent the state-of-the-art algorithms for online classification. Their popularity stems in large part from their ability to process large quantities of data with a speed that goes beyond the processing power of any other streaming or batch learning algorithm. As a consequence, Hoeffding trees have often been used as base models of many ensemble learning algorithms for online classification. However, despite the existence of many algorithms for online classification, ensemble learning algorithms for online regression do not exist. In particular, the field of online any-time regression analysis seems to have experienced a serious lack of attention. In this paper, we address this issue through a study and an empirical evaluation of a set of online algorithms for regression, which includes the baseline Hoeffding-based regression trees, online option trees, and an online least mean squares filter. We also design, implement and evaluate two novel ensemble learning methods for online regression: online bagging with Hoeffding-based model trees, and an online RandomForest method in which we have used a randomized version of the online model tree learning algorithm as a basic building block. Within the study presented in this paper, we evaluate the proposed algorithms along several dimensions: predictive accuracy and quality of models, time and memory requirements, bias-variance and bias-variance-covariance decomposition of the error, and responsiveness to concept drift.
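
The online bagging idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common Poisson(1) resampling scheme for online bagging and, for brevity, uses a simple online least mean squares model as the base learner in place of the Hoeffding-based model trees used in the paper. All class and function names here are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


class LMSRegressor:
    """Online least-mean-squares linear model (stand-in base learner)."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def learn_one(self, x, y):
        # Gradient step on the squared error of a single example.
        err = y - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b += self.lr * err


def poisson1(rng):
    """Draw a sample from Poisson(1) using Knuth's multiplication method."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


class OnlineBaggingRegressor:
    """Each base model sees every example k ~ Poisson(1) times."""

    def __init__(self, n_models, n_features, seed=0):
        self.rng = random.Random(seed)
        self.models = [LMSRegressor(n_features) for _ in range(n_models)]

    def predict(self, x):
        # The ensemble prediction is the plain average of the members.
        return sum(m.predict(x) for m in self.models) / len(self.models)

    def learn_one(self, x, y):
        for m in self.models:
            for _ in range(poisson1(self.rng)):
                m.learn_one(x, y)


def demo():
    """Train on a noise-free synthetic stream y = 2*x0 - x1 + 0.5."""
    rng = random.Random(42)
    ens = OnlineBaggingRegressor(n_models=10, n_features=2, seed=1)
    for _ in range(5000):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5
        ens.learn_one(x, y)
    return ens
```

Calling `demo()` returns an ensemble whose averaged prediction should be close to the synthetic target. The synthetic stream is an assumption made only for illustration and is unrelated to the data sets evaluated in the paper.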


2015

Data Mining Frequent Temporal Events In Agrieconomic Time Series

Authors
Correa, FE; Gama, J; Correa, PLP; Alves, LRA;

Publication
IEEE LATIN AMERICA TRANSACTIONS

Abstract
Agricultural commodities are important to the economies of several countries, especially Brazil. Despite the amount of money involved, it is known that agribusiness activities lack accurate information throughout the process. Therefore, some research centers in Brazil, such as the Center for Advanced Studies on Applied Economics (CEPEA), collect and provide daily price indices for several agricultural products and disseminate this information to researchers, markets, producers, and public policy makers. The idea is to understand the evolution and patterns of the time series of grain price indices over seven years. The aim of this paper is to find common patterns in these time series, i.e., to highlight events that happen frequently over seven years of daily grain price quotations for several products. The results give an understanding of the dynamics of these grain time series; for example, one important aspect detected was that these products compete for cropland.


2015

Prediction Intervals for Electric Load Forecast: Evaluation for Different Profiles

Authors
Almeida, V; Gama, J;

Publication
2015 18TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEM APPLICATION TO POWER SYSTEMS (ISAP)

Abstract
Electricity industries throughout the world have been using load profiles for many years. Electrical load data contain valuable information that can be useful for both electricity producers and consumers. Load forecasting is a fundamental task for operating power systems efficiently and economically. Prediction intervals (PIs) are gaining importance relative to point forecasts, which cannot properly handle forecast uncertainty, because PIs can balance informativeness and correctness. This paper aims to demonstrate that different demand profiles clearly influence PI reliability and width. The evaluation is performed using data from different customers, grouped by their electricity usage behavior using hierarchical clustering with the Kullback-Leibler divergence as the distance metric. PIs are obtained using two different strategies: (1) the dual perturb and combine algorithm and (2) conformal prediction. It was possible to demonstrate that different demand profiles clearly influence PI reliability and width for both models. The knowledge retrieved from the analysis of the load patterns is useful and can support the selection of the best interval forecasting method for a specific location. It can also support the selection of an optimum confidence level, considering that an overly wide PI conveys little information and is of no use for decision making.
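
As a rough illustration of the second strategy, the sketch below builds a prediction interval with split (inductive) conformal prediction, using absolute residuals on a held-out calibration set as the nonconformity score. This is one common variant and is an assumption here; the abstract does not say which conformal variant or score the paper uses. The function names are hypothetical.

```python
import math


def split_conformal_halfwidth(calib_true, calib_pred, alpha=0.1):
    """Half-width of a (1 - alpha) interval from calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    Returns infinity when the calibration set is too small for the
    requested level.
    """
    scores = sorted(abs(y - p) for y, p in zip(calib_true, calib_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return scores[rank - 1]


def interval(point_forecast, halfwidth):
    """Symmetric prediction interval around a point forecast."""
    return point_forecast - halfwidth, point_forecast + halfwidth
```

A wider interval at a fixed level signals a less predictable profile, which matches the trade-off between reliability and width discussed in the abstract. The coverage guarantee of this construction relies on exchangeability of the calibration and test points, an assumption that load time series may only approximately satisfy.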


2015

Distributed Reasoning

Authors
Rodrigues, P; Gama, J;

Publication
MATHEMATICS OF ENERGY AND CLIMATE CHANGE

Abstract
This paper discusses the problem of learning a global model from local information. We consider ubiquitous streaming data sources, such as sensor networks, and discuss efficient distributed learning algorithms. We present the generic framework of distributed sources of data, an illustrative algorithm to monitor the global state of the network using limited communication between peers, and an efficient distributed clustering algorithm.


2015

Evaluation of Multiclass Novelty Detection Algorithms for Data Streams

Authors
de Faria, ER; Goncalves, IR; Gama, J; de Leon Ferreira Carvalho, ACPDF;

Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract
Data stream mining is an emergent research area that investigates knowledge extraction from large amounts of continuously generated data, produced by non-stationary distributions. Novelty detection, the ability to identify new or previously unknown situations, is a useful ability for learning systems, especially when dealing with data streams, where concepts may appear, disappear, or evolve over time. There are several studies currently investigating the application of novelty detection techniques in data streams. However, there is no consensus regarding how to evaluate the performance of these techniques. In this study, we propose a new evaluation methodology for multiclass novelty detection in data streams able to deal with: i) unsupervised learning, which generates novelty patterns without an association with the true classes, where one class may be composed of a novelty set, ii) confusion matrix that increases over time, iii) confusion matrix with a column representing unknown examples, i.e., those not explained by the model, and iv) representation of the evaluation measures over time. We propose a new methodology to associate the novelty patterns detected by the algorithm, in an unsupervised fashion, with the true classes. Finally, we evaluate the performance of the proposed methodology through the use of known novelty detection algorithms with artificial and real data sets.
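
Points ii) to iv) describe a confusion matrix that grows over time, carries a column for unknown examples, and supports measures tracked along the stream. A minimal sketch of such a structure follows. It is only an illustration under assumed conventions: the class name, the use of `None` to mark an unexplained example, and the tracked unknown rate are hypothetical choices, and the step that associates novelty patterns with true classes is not modelled.

```python
from collections import defaultdict

UNKNOWN = "unknown"  # column for examples the model does not explain


class StreamConfusionMatrix:
    """Confusion matrix whose rows and columns appear as the stream evolves."""

    def __init__(self):
        # counts[true_label][predicted_label] -> number of examples
        self.counts = defaultdict(lambda: defaultdict(int))
        # unknown rate recorded after every update (a measure over time)
        self.history = []

    def update(self, true_label, predicted):
        # A prediction of None means the model marked the example as unknown.
        col = UNKNOWN if predicted is None else predicted
        self.counts[true_label][col] += 1
        self.history.append(self.unknown_rate())

    def unknown_rate(self):
        total = sum(sum(row.values()) for row in self.counts.values())
        unk = sum(row.get(UNKNOWN, 0) for row in self.counts.values())
        return unk / total if total else 0.0
```

New true classes and newly detected novelty patterns simply add rows and columns on first use, so the matrix never needs to know the label set in advance.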


2015

Multi-Target Regression from High-Speed Data Streams with Adaptive Model Rules

Authors
Duarte, J; Gama, J;

Publication
PROCEEDINGS OF THE 2015 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (IEEE DSAA 2015)

Abstract
Many real-life prediction problems involve predicting a structured output. Multi-target regression is an instance of structured output prediction whose task is to predict multiple target variables. Structured output algorithms are usually computationally and memory demanding, and hence are not suited for dealing with massive amounts of data. Most of these algorithms can be categorized as local or global methods. Local methods produce individual models for each output component and combine them to produce the structured prediction. Global methods adapt traditional learning algorithms to predict the output structure as a whole. We propose the first rule-based algorithm for solving multi-target regression problems from data streams. The algorithm builds on the adaptive model rules framework. In contrast to the majority of structured output predictors, this particular algorithm does not fall into the local and global categories. Instead, each rule specializes on related subsets of the output attributes. To evaluate the performance of the proposed algorithm, two other rule-based algorithms were developed, one using the local strategy and the other using the global strategy. These methods were compared considering their prediction error, memory usage, computational time, and model complexity. Experimental results on synthetic and real data show that the local-strategy algorithm usually obtains the lowest error. However, the proposed and the global-strategy algorithms use much less memory and run significantly faster at the cost of a slight increase in error, which makes them very attractive when computational resources are an important factor. Also, the models produced by the latter approaches are much easier to understand since considerably fewer rules are produced.
