2008
Authors
Rodrigues, PP; Gama, J;
Publication
ECAI 2008, PROCEEDINGS
Abstract
Online learning algorithms that address fast data streams should process examples at the rate they arrive, using a single scan of the data and fixed memory, maintaining a decision model at any time and being able to adapt the model to the most recent data. These features make approximate models necessary. One problem that usually arises with approximate models is the definition of a minimum number of observations necessary to assure convergence, which implies a high risk since the system may have to decide based only on a small subset of the entire data. One approach is to apply techniques based on the Hoeffding bound to enforce decisions with a confidence level. In divisive clustering of time series, the goal is to find clusters of similar time series over time. In online approaches there are two decisions to make: when to split and how to assign variables to new clusters. We can define a confidence level for both the decision to split and the assignment of data variables to new clusters. Previous works have already addressed confident decisions on the moment of split. Our proposal is to add a confidence level to the assignment process. When a split point is reported, creating two new clusters, we can directly assign points that are confidently closer to one cluster than the other, and use a different strategy for those variables that do not satisfy the confidence level. In this paper we propose to assign the unsure variables to a third cluster. Experimental evaluation is presented in the context of a recently proposed hierarchical algorithm, assessing the advantages of the proposal and also revealing advantages in memory usage reduction and processing speed. Although this proposal is evaluated within the scope of an existing method, it can be generalized to any divisive procedure.
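The confident assignment step described in this abstract can be illustrated with a minimal Python sketch. It uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)); the function names, the `"unsure"` label, and the use of averaged distances as the compared quantities are assumptions made for illustration, not the authors' implementation.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this margin of the
    mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def assign_variable(dist_a, dist_b, value_range, delta, n):
    """Hypothetical assignment rule: pick cluster A or B only when the
    observed distance gap exceeds the Hoeffding margin; otherwise send
    the variable to a third, 'unsure' cluster."""
    eps = hoeffding_bound(value_range, delta, n)
    if dist_b - dist_a > eps:
        return "A"
    if dist_a - dist_b > eps:
        return "B"
    return "unsure"


# Distances are assumed to be normalised to [0, 1] (value_range = 1.0).
print(assign_variable(0.20, 0.60, value_range=1.0, delta=0.05, n=1000))
print(assign_variable(0.40, 0.41, value_range=1.0, delta=0.05, n=1000))
```

With more observations the margin shrinks, so fewer variables would fall into the unsure cluster under this rule.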
2011
Authors
Correa, FE; Oliveira, MDB; Alves, LRA; Gama, J; Correa, PLP;
Publication
EFITA/WCCA '11
Abstract
Agribusiness, like many other activities, produces huge amounts of spatio-temporal data. We need a system to store, analyze, and mine this data. In previous work, we developed data warehouse tools to store, organize and query Brazilian agribusiness data from several regions over 10 years. In this paper, we go a step further and propose specific data mining techniques to discover marks and evolution patterns in agribusiness data. We propose the use of Tucker decomposition to automatically detect short time windows that exhibit large changes in the correlation structure between the time series of prices from the Brazilian grain market.
2001
Authors
Amado, N; Gama, J; Silva, FMA;
Publication
Progress in Artificial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving, 10th Portuguese Conference on Artificial Intelligence, EPIA 2001, Porto, Portugal, December 17-20, 2001, Proceedings
Abstract
In the fields of data mining and machine learning, the amount of data available for building classifiers is growing very fast. Therefore, there is a great need for algorithms that can build classifiers from very large datasets while remaining computationally efficient and scalable. One possible solution is to employ parallelism to reduce the time spent building classifiers from very large datasets while preserving classification accuracy. This work first overviews some strategies for implementing decision tree construction algorithms in parallel, based on techniques such as task parallelism, data parallelism and hybrid parallelism. We then describe a new parallel implementation of the C4.5 decision tree construction algorithm. Even though the implementation of the algorithm is still in its final development phase, we present some experimental results that can be used to predict the expected behavior of the algorithm. © Springer-Verlag Berlin Heidelberg 2001.
2009
Authors
Sebastiao, R; Rodrigues, PP; Gama, J;
Publication
2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009)
Abstract
This paper addresses the space-time change detection problem in climate data over the Iberian Peninsula using a 50-year dataset. The data were analyzed with respect to temporal and geographical information, using the following methodology: information about space-time drifts in climate data was obtained by applying a change detection algorithm to all the temporal data available for each physical location considered in this study; the performance and robustness of this algorithm were then assessed by the McNemar nonparametric statistical test on cluster structures; geographical correlations were inferred using visualization tools and graphical representations of the data. Most of the spatio-temporal drifts detected by the algorithm were confirmed by the results of the McNemar test and are in accordance with the visual and graphical representations, supporting the advantage of using interdisciplinary methods. This analysis also shows that there are locations that do not reveal any change throughout the observed years.
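For readers unfamiliar with the McNemar test mentioned in this abstract, a minimal Python sketch of the standard test on paired binary outcomes follows. The abstract does not say how the paired outcomes were built from the cluster structures, so the counts `b` and `c` (the two kinds of discordant pairs) and the function name are purely illustrative assumptions. The sketch uses the continuity-corrected statistic and the chi-square distribution with one degree of freedom.

```python
import math


def mcnemar_test(b, c):
    """McNemar test with continuity correction.

    b and c are the counts of discordant pairs (cases where the two
    paired outcomes disagree, in each direction). Returns the test
    statistic and the p-value under a chi-square distribution with
    one degree of freedom.
    """
    if b + c == 0:
        # No discordant pairs: no evidence of a difference.
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, not taken from the paper.
stat, p = mcnemar_test(b=15, c=5)
print(f"statistic={stat:.2f}, p-value={p:.4f}")
```

A small p-value indicates that the two paired outcomes disagree asymmetrically, which is the kind of evidence such a test would contribute when confirming a detected change.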
2012
Authors
Oliveira, M; Gama, J;
Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY
Abstract
Data mining is being increasingly applied to social networks. Two relevant reasons are the growing availability of large volumes of relational data, boosted by the proliferation of social media web sites, and the intuition that an individual's connections can yield richer information than his/her isolated attributes. This synergistic combination can prove germane to a variety of applications such as churn prediction, fraud detection and marketing campaigns. This paper attempts to provide a general and succinct overview of the essentials of social network analysis for those interested in taking a first look at this area and in using data mining in social networks. © 2012 Wiley Periodicals, Inc.
2011
Authors
Rodrigues, PP; Gama, J; Araújo, J; Lopes, LMB;
Publication
Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), TaiChung, Taiwan, March 21 - 24, 2011
Abstract
In ubiquitous streaming data sources, such as sensor networks, clustering nodes by the data they produce is an important problem that gives insight into the phenomenon being monitored by such networks. However, if these techniques require data to be gathered centrally, communication and storage requirements are often unbounded. The goal of this paper is to assess the feasibility of computing local clustering at each node, using only neighbors' centroids, as an approximation of the global clustering computed by a centralized process. A local algorithm is proposed to perform clustering of sensors based on the moving average of each node's data over time: the moving average of each node is approximated using a memory-less fading average; clustering is based on the furthest point algorithm applied to the centroids computed by the node's direct neighbors. The algorithm was evaluated on a state-of-the-art sensor network simulator, measuring the agreement between local and global clustering. Experimental work on synthetic data with spherical Gaussian clusters is consistently analyzed for different network sizes, numbers of clusters and degrees of cluster overlap. Results show a high level of agreement between each node's clustering definitions and the global clustering definition, with special emphasis on separability agreement. Overall, local approaches are able to keep a good approximation of the global clustering, improving privacy among nodes and decreasing communication and computation load in the network. Hence, the basic requirements for distributed clustering of streaming data sensors recommend that clustering in these settings be performed locally. © 2011 ACM.
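The two building blocks named in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the fading average is written as a fading sum divided by a fading count with a hypothetical decay factor `alpha`, and the furthest point algorithm is the classic greedy heuristic that starts from the first point. The class and function names are invented for the example.

```python
import math


class FadingAverage:
    """Memory-less approximation of a moving average: keeps only a
    fading sum and a fading count, both decayed by alpha per update."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha  # assumed decay factor, 0 < alpha < 1
        self.fading_sum = 0.0
        self.fading_count = 0.0

    def update(self, x):
        self.fading_sum = x + self.alpha * self.fading_sum
        self.fading_count = 1.0 + self.alpha * self.fading_count
        return self.fading_sum / self.fading_count


def furthest_point_centers(points, k):
    """Greedy furthest-point heuristic: start from the first point, then
    repeatedly add the point furthest from its nearest chosen center."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Hypothetical centroids received from a node's direct neighbors.
neighbor_centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
print(furthest_point_centers(neighbor_centroids, k=2))
```

Because the fading average needs only two numbers per node, it fits the fixed-memory constraint that motivates computing the clustering locally.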