2009
Authors
Gama, J; Ganguly, A; Omitaomu, O; Vatsavai, R; Gaber, M;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
2011
Authors
Carmona Cejudo, JM; Baena Garcia, M; del Campo Avila, J; Bifet, A; Gama, J; Morales Bueno, R;
Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS X: IDA 2011
Abstract
Real-time email classification is a challenging task because of its online nature, which makes it subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data stream evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
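The abstract does not spell out how a fading factor enters the prequential error. A minimal sketch of one common formulation follows, in which both the accumulated loss and the example count are decayed by a factor alpha at every step. The function name, the 0/1 loss encoding, and the default alpha are assumptions made for illustration, and nothing here is taken from GNUsmail's code.

```python
def faded_prequential_error(losses, alpha=0.995):
    """Prequential error with a fading factor (illustrative sketch).

    losses: iterable of per-example losses, e.g. 1 for a misclassified
            message and 0 for a correct one, in arrival order.
    alpha:  fading factor in (0, 1]; alpha = 1 gives the plain
            prequential (running mean) error. The default is an assumption.
    """
    faded_loss = 0.0   # S_i = L_i + alpha * S_{i-1}
    faded_count = 0.0  # N_i = 1   + alpha * N_{i-1}
    errors = []
    for loss in losses:
        faded_loss = loss + alpha * faded_loss
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_loss / faded_count)
    return errors


if __name__ == "__main__":
    # A classifier that is wrong early on and correct afterwards.
    stream = [1, 1, 1, 0, 0, 0, 0, 0]
    print(faded_prequential_error(stream, alpha=1.0)[-1])  # plain running mean
    print(faded_prequential_error(stream, alpha=0.5)[-1])  # old mistakes fade
```

With alpha below 1, early mistakes weigh less and less, so the estimate tracks recent performance instead of averaging over the whole stream.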
2012
Authors
Oliveira, M; Gama, J;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
The study of evolution has become an important research issue, especially in the last decade, due to our ability to collect and store highly detailed and time-stamped data. The need for describing and understanding the behavior of a given phenomenon over time led to the emergence of new frameworks and methods focused on the temporal evolution of data and models. In this paper we address the problem of monitoring the evolution of clusters over time and propose the MEC framework. MEC traces evolution through the detection and categorization of cluster transitions, such as births, deaths and merges, and enables their visualization through bipartite graphs. It includes a taxonomy of transitions, a tracking method based on the computation of conditional probabilities, and a transition detection algorithm. We use MEC with two main goals: to determine the general evolution trends and to detect abnormal behavior or rare events. To demonstrate the applicability of our framework we present real-world economic and financial case studies, using datasets extracted from the Banco de Portugal Central Balance-Sheet Database and The Data Page of New York University - Leonard N. Stern School of Business. The results allow us to draw interesting conclusions about the evolution of activity sectors and European companies.
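To make the idea of tracking clusters through conditional probabilities concrete, here is a simplified sketch. It estimates P(new cluster | old cluster) as the fraction of an old cluster's members found in a new cluster, then labels births, deaths, merges and survivals with a single threshold. The unweighted estimate, the threshold value `tau`, and all function names are assumptions for illustration. The abstract does not give MEC's actual definitions, and splits are left out.

```python
def conditional_probabilities(old, new):
    """P(new cluster j | old cluster i), estimated as |Ci & Cj| / |Ci|.

    old, new: dicts mapping a cluster name to the set of object ids it
    contains at two consecutive time points.
    """
    return {
        (i, j): len(members_i & members_j) / len(members_i)
        for i, members_i in old.items() if members_i
        for j, members_j in new.items()
    }


def detect_transitions(old, new, tau=0.5):
    """Label transitions with one threshold tau (a hypothetical parameter)."""
    probs = conditional_probabilities(old, new)
    incoming = {j: [] for j in new}  # old clusters that map onto each new one
    deaths = []
    for i in old:
        matches = [j for j in new if probs.get((i, j), 0.0) >= tau]
        if not matches:
            deaths.append(i)         # no new cluster inherits this one
        for j in matches:
            incoming[j].append(i)
    births = [j for j, src in incoming.items() if not src]
    merges = {j: src for j, src in incoming.items() if len(src) > 1}
    survivals = {src[0]: j for j, src in incoming.items() if len(src) == 1}
    return {"births": births, "deaths": deaths,
            "merges": merges, "survivals": survivals}


if __name__ == "__main__":
    old = {"A": {1, 2, 3, 4}, "B": {5, 6}, "C": {7, 8}, "D": {9, 10}}
    new = {"X": {1, 2, 3, 4, 5, 6}, "Y": {7, 8, 11}, "Z": {12, 13}}
    print(detect_transitions(old, new))
    # A and B merge into X, C survives as Y, D dies, Z is born.
```

The old-to-new links found this way form the edges of a bipartite graph between the two clusterings, which is the kind of structure the abstract mentions for visualization.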
2009
Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
This paper presents and evaluates an approach to novelty detection that addresses it as the problem of identifying novel concepts in a continuous learning scenario, as an extension to a single-class classification problem. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm that implements this approach, uses efficient standard clustering algorithms to continuously generate candidate clusters among examples that were not explained by the current known concepts. Clusters complying with a validation criterion that takes cohesiveness and representativeness into account are initially identified as concepts. By merging similar concepts, OLINDDA may enhance the representation of some concepts as it advances toward its final goal of describing novel emerging concepts in an unsupervised way. The proposed approach is experimentally evaluated by the use of several measures taken throughout the learning process. Results show that it is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.
2011
Authors
Gama, J; Rodrigues, PP; Lopes, L;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server about the new state it is in. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The experimental work on synthetic data supports our proposal, presenting robustness to a high number of sensors, and the application to real data from physiological sensors exposes the aforementioned advantages of the system.
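The communication pattern described in this abstract can be sketched in a few lines. Each local site maps its reading to a grid cell and reports to the central site only when the cell changes, while the central site counts the global states it observes. This is a rough illustration under stated assumptions, not DGClust itself. It uses a fixed equal-width grid with a known value range instead of an online discretization, a plain `Counter` instead of a bounded list of frequent-state counters, and it omits the final clustering step. All class and method names are hypothetical.

```python
from collections import Counter


class LocalSite:
    """One sensor with a fixed equal-width grid (assumed known value range)."""

    def __init__(self, low, high, cells):
        self.low, self.high, self.cells = low, high, cells
        self.state = None  # index of the cell triggered by the last reading

    def update(self, value):
        """Return the new cell if the state changed, otherwise None."""
        width = (self.high - self.low) / self.cells
        cell = int((value - self.low) / width)
        cell = min(max(cell, 0), self.cells - 1)  # clamp out-of-range readings
        if cell != self.state:
            self.state = cell
            return cell
        return None


class CentralSite:
    """Holds the global state and counts how often each state is seen."""

    def __init__(self, n_sites):
        self.global_state = [None] * n_sites
        self.counts = Counter()
        self.messages = 0

    def notify(self, site_id, cell):
        self.messages += 1
        self.global_state[site_id] = cell

    def tick(self):
        # Count the current global state once every site has reported.
        if None not in self.global_state:
            self.counts[tuple(self.global_state)] += 1


sites = [LocalSite(0.0, 10.0, 5), LocalSite(0.0, 10.0, 5)]
central = CentralSite(len(sites))
readings = [(1.0, 9.0), (1.5, 9.5), (1.9, 8.1), (5.0, 9.0)]
for step in readings:
    for site_id, value in enumerate(step):
        cell = sites[site_id].update(value)
        if cell is not None:  # only state changes are transmitted
            central.notify(site_id, cell)
    central.tick()

print(central.messages)              # 3 messages for 8 readings
print(central.counts.most_common(1)) # [((0, 4), 3)]
```

In this toy run only three of the eight readings cause a message, which is the communication saving the abstract refers to. A clustering algorithm would then be run on the central points of the most frequent states.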
2009
Authors
Rodrigues, PP; Gama, J;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
Sensors distributed all around electrical-power distribution networks produce streams of data at high speed. From a data mining perspective, this sensor network problem is characterized by a large number of variables (sensors), producing a continuous flow of data, in a dynamic non-stationary environment. Companies make decisions to buy or sell energy based on load profiles and forecasts. In this work we analyze the most relevant data mining problems and issues: continuously learning clusters and predictive models, model adaptation in large domains, and change detection and adaptation. The goal is to continuously maintain a clustering model, defining profiles, and a predictive model able to incorporate new information at the speed data arrives, detecting changes and adapting the decision models to the most recent information. We present experimental results in a large real-world scenario, illustrating the advantages of continuous learning and its competitiveness against wavelet-based prediction. We also propose a light electrical load visualization system which enhances the ability to inspect forecast results on mobile devices.