Publications by João Gama

2018

Self Hyper-Parameter Tuning for Data Streams

Authors
Veloso, B; Gama, J; Malheiro, B;

Publication
Discovery Science - 21st International Conference, DS 2018, Limassol, Cyprus, October 29-31, 2018, Proceedings

Abstract
The widespread usage of smart devices and sensors together with the ubiquity of Internet access is behind the exponential growth of data streams. Nowadays, there are hundreds of machine learning algorithms able to process high-speed data streams. However, these algorithms rely on human expertise to perform complex processing tasks like hyper-parameter tuning. This paper addresses the problem of data variability modelling in data streams. Specifically, we propose and evaluate a new parameter tuning algorithm called Self Parameter Tuning (SPT). SPT consists of an online adaptation of the Nelder & Mead optimisation algorithm for hyper-parameter tuning. The method uses a dynamic-size sampling scheme to evaluate the current solution, and uses the Nelder & Mead operators to update the current set of parameters. The main contribution is the adaptation of the Nelder-Mead algorithm to automatically tune regression hyper-parameters for data streams. Additionally, whenever concept drifts occur in the data stream, it re-initiates the search for new hyper-parameters. The proposed method has been evaluated on a regression scenario. Experiments with well-known time-evolving data streams show that the proposed SPT hyper-parameter optimisation outperforms the results of previous expert hyper-parameter tuning efforts. © 2018, Springer Nature Switzerland AG.
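
As a rough illustration of the Nelder & Mead operators mentioned in the abstract above, the following sketch runs the classic batch simplex search (reflection, expansion, contraction, shrink) on a fixed loss. It is not the SPT algorithm: the dynamic-size sampling and the restart on concept drift are not shown, and `toy_loss`, the starting simplex, and the operator coefficients are assumptions chosen for the example.

```python
def nelder_mead(loss, simplex, iterations=200):
    """Minimise `loss` with a simplified batch Nelder-Mead search.

    `simplex` is a list of n+1 points in n dimensions.
    Illustrative sketch only; not the online SPT variant.
    """
    simplex = [list(p) for p in simplex]
    n = len(simplex[0])
    for _ in range(iterations):
        simplex.sort(key=loss)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if loss(reflected) < loss(best):
            # Reflection improved on the best vertex: try going further.
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if loss(expanded) < loss(reflected) else reflected
        elif loss(reflected) < loss(second_worst):
            simplex[-1] = reflected
        else:
            # Inside contraction: move the worst vertex halfway to the centroid.
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if loss(contracted) < loss(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=loss)


def toy_loss(params):
    """Hypothetical stand-in for a model's validation error (minimum at (3, -1))."""
    x, y = params
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


tuned = nelder_mead(toy_loss, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In a streaming setting the loss would have to be re-estimated on fresh samples at each step, which is the part the paper's online adaptation addresses.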

2018

Weightless neural modeling for mining data streams

Authors
Cardoso, DO; Gama, J; França, F;

Publication
Data Mining in Time Series and Streaming Databases

Abstract
Learning from data streams can only be realized by systems which are not only effective but also efficient. That is, knowledge discovery in this context is impossible without being aware of the computational resources available. Weightless artificial neural networks (WANNs) are based on an alternative principle to the iterative optimization of weights employed by most mainstream artificial neural network models and related tools. WANNs explicitly manage knowledge pieces, which are stored by RAM nodes. This foundational difference is reflected in the adaptability of these models to streaming inputs: in such a scenario, the application of weightless models can be considered more natural than that of their weighted counterparts, with ample control over learning capability as well as resource consumption. This chapter details a WANN-based approach for mining data streams, which allows the maintenance of an up-to-date data summary which can be used for several purposes. The insights and original ideas which power this model are explained as well, enabling novel applications and further development of them.

2018

A comparison of hierarchical multi-output recognition approaches for anuran classification

Authors
Colonna, JG; Gama, J; Nakamura, EF;

Publication
MACHINE LEARNING

Abstract
In bioacoustic recognition approaches, a flat classifier is usually trained to recognize several species of anurans, where the number of classes is equal to the number of species. Consequently, the complexity of the classification function increases proportionally with the number of species. To avoid this issue, we propose a hierarchical approach that decomposes the problem into three taxonomic levels: the family, the genus, and the species. To accomplish this, we transform the original single-labelled problem into a multi-output problem (multi-label and multi-class) considering the biological taxonomy of the species. We then develop a top-down method using a set of classifiers organized as a hierarchical tree. We test and compare two hierarchical methods, using (1) one classifier per parent node and (2) one classifier per level, against a flat approach. Thus, we conclude that it is possible to predict the same set of species as a flat classifier, and additionally obtain new information about the samples and their taxonomic relationship. This helps us to better understand the problem and reach additional conclusions by inspecting the confusion matrices at the three classification levels. In addition, we propose a soft decision rule based on the joint probabilities of hierarchy pathways. With this we are able to identify and reject confusing cases. We carry out our experiments using cross-validation performed by individuals. This form of CV avoids mixing syllables that belong to the same specimens in the testing and training sets, preventing an overestimate of the accuracy and generalizing the predictive capabilities of the system. We tested our methods on a dataset with sixty individual frogs, from ten different species, eight genera, and four families, achieving a final Macro-Fscore of 80% and 70% with and without applying the rejection rule, respectively.
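
One way to picture the soft decision rule described in the abstract above is to score every family/genus/species pathway by a joint probability and reject the sample when even the best pathway scores too low. The sketch below assumes the joint probability is the product of per-level classifier outputs and uses a fixed threshold; the taxonomy labels, the probabilities, and the threshold value are all hypothetical, and the paper's actual rule may differ.

```python
def best_pathway(p_family, p_genus, p_species, taxonomy, threshold=0.5):
    """Return (pathway, joint) for the most probable pathway, or (None, joint) if rejected.

    Assumption: joint probability = product of the per-level probabilities.
    """
    scored = []
    for family, genera in taxonomy.items():
        for genus, species_list in genera.items():
            for species in species_list:
                joint = p_family[family] * p_genus[genus] * p_species[species]
                scored.append((joint, (family, genus, species)))
    joint, pathway = max(scored)
    # Reject confusing cases whose best pathway is still unconvincing.
    return (pathway, joint) if joint >= threshold else (None, joint)


# Hypothetical three-level taxonomy and classifier outputs for one sample.
taxonomy = {
    "family_A": {"genus_a1": ["species_1", "species_2"], "genus_a2": ["species_3"]},
    "family_B": {"genus_b1": ["species_4"]},
}
p_family = {"family_A": 0.9, "family_B": 0.1}
p_genus = {"genus_a1": 0.8, "genus_a2": 0.2, "genus_b1": 1.0}
p_species = {"species_1": 0.7, "species_2": 0.3, "species_3": 1.0, "species_4": 1.0}

accepted = best_pathway(p_family, p_genus, p_species, taxonomy, threshold=0.5)
rejected = best_pathway(p_family, p_genus, p_species, taxonomy, threshold=0.6)
```

With these made-up numbers the best pathway has a joint probability of about 0.504, so it is accepted at a threshold of 0.5 and rejected at 0.6.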

2018

A local algorithm to approximate the global clustering of streams generated in ubiquitous sensor networks

Authors
Rodrigues, PP; Araujo, J; Gama, J; Lopes, L;

Publication
INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS

Abstract
In ubiquitous streaming data sources, such as sensor networks, clustering nodes by the data they produce gives insights on the phenomenon being monitored. However, centralized algorithms force communication and storage requirements to grow unbounded. This article presents L2GClust, an algorithm to compute local clusterings at each node as an approximation of the global clustering. L2GClust performs local clustering of the sources based on the moving average of each node's data over time: the moving average is approximated using memory-less statistics; clustering is based on the furthest-point algorithm applied to the centroids computed by the node's direct neighbors. Evaluation is performed both on synthetic and real sensor data, using a state-of-the-art sensor network simulator and measuring sensitivity to network size, number of clusters, cluster overlapping, and communication incompleteness. A high level of agreement was found between local and global clusterings, with special emphasis on separability agreement, while an overall robustness to incomplete communications emerged. Communication reduction was also theoretically shown, with communication ratios empirically evaluated for large networks. L2GClust is able to keep a good approximation of the global clustering, using less communication than a centralized alternative, supporting the recommendation to use local algorithms for distributed clustering of streaming data sources.
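
The two building blocks named in the abstract above can be sketched in a few lines. This is not L2GClust itself: the abstract does not specify which memory-less statistic is used, so the exponentially weighted update below is an assumption, and `furthest_point_centers` is the generic greedy furthest-point heuristic applied to a hypothetical list of neighbour centroids.

```python
import math


def ewma(previous, value, alpha=0.1):
    """Update a running average without storing past values (assumed estimator)."""
    return (1 - alpha) * previous + alpha * value


def furthest_point_centers(points, k):
    """Greedy furthest-point selection of k centers.

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        next_center = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers


# Hypothetical centroids received from a node's direct neighbours.
neighbour_centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
centers = furthest_point_centers(neighbour_centroids, k=2)
smoothed = ewma(10.0, 20.0, alpha=0.5)
```

Both pieces need only constant memory per node, which is in the spirit of the local, communication-light design the abstract argues for.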

2018

On analyzing user preference dynamics with temporal social networks

Authors
Pereira, FSF; Gama, J; de Amo, S; Oliveira, GMB;

Publication
MACHINE LEARNING

Abstract
The preferences adopted by individuals are constantly modified as these are driven by new experiences, natural life evolution and, mainly, influence from friends. Studying these temporal dynamics of user preferences has become increasingly important for personalization tasks in the information retrieval and recommendation systems domains. However, existing models are too constrained to capture the complexity of the underlying phenomenon. Online social networks contain rich information about social interactions and relations. Thus, these become an essential source of knowledge for understanding the evolution of user preferences. In this work, we investigate the interplay between user preferences and social networks over time. First, we propose a temporal preference model able to detect preference change events of a given user. Following this, we use temporal network concepts to analyze the evolution of social relationships and propose strategies to detect changes in the network structure based on node centrality. Finally, we look for a correlation between preference change events and node centrality change events over Twitter and Jam social music datasets. Our findings show that there is a strong correlation between both change events, especially when modeling social interactions by means of a temporal network.

2018

Incremental TextRank - Automatic Keyword Extraction for Text Streams

Authors
Sarmento, RP; Cordeiro, M; Brazdil, P; Gama, J;

Publication
Proceedings of the 20th International Conference on Enterprise Information Systems, ICEIS 2018, Funchal, Madeira, Portugal, March 21-24, 2018, Volume 1.

Abstract
Text Mining and NLP techniques are a hot topic nowadays. Researchers strive to develop new and faster algorithms to cope with larger amounts of data. In particular, text data analysis has been attracting increasing interest due to the growth of social network media. Given this, the development of new algorithms and/or the upgrade of existing ones is now a crucial task to deal with text mining problems under this new scenario. In this paper, we present an update to TextRank, a well-known implementation used to do automatic keyword extraction from text, adapted to deal with streams of text. In addition, we present results for this implementation and compare them with the batch version. The major improvements are lower computation times for the processing of the same text data, in a streaming environment, both in sliding window and incremental setups. The speedups obtained in the experimental results are significant. Therefore the approach was considered valid and useful to the research community.
