Publications

Publications by Carlos Manuel Soares

2016

TweeProfiles3: visualization of spatio-temporal patterns on Twitter

Authors
Maia, A; Cunha, T; Soares, C; Abreu, PH;

Publication
NEW ADVANCES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 1

Abstract
With the advent of social networking, a lot of user-specific, voluntarily provided data has been generated. Researchers and companies noticed the value that lay within those enormous amounts of data and developed algorithms and tools to extract patterns in order to act on them. TweeProfiles is an offline clustering tool that analyses tweets over multiple dimensions: spatial, temporal, content and social. This project was extended in TweeProfiles2 by enabling the processing of real-time data. In this work, we developed a visualization tool suitable for data streaming, using multiple widgets to better represent all the information. The usefulness of the developed tool for journalism was evaluated based on a usability test, which, despite its reduced number of participants, yielded good results.


2017

A guidance of data stream characterization for meta-learning

Authors
Debiaso Rossi, ALD; de Souza, BF; Soares, C; de Leon Ferreira de Carvalho, ACPDF;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
The problem of selecting learning algorithms has been studied by the meta-learning community for more than two decades. One of the most important tasks for the success of a meta-learning system is gathering data about the learning process. This data is used to induce a (meta) model able to map characteristics extracted from different data sets to the performance of learning algorithms on these data sets. These systems are built under the assumption that the data are generated by a stationary distribution, i.e., a learning algorithm will perform similarly for new data from the same problem. However, many applications generate data whose characteristics can change over time. Therefore, a suitable bias at a given time may become inappropriate at another time. Although meta-learning has been used to continuously select a learning algorithm in data streams, data characterization has received less attention in this context. In this study, we provide a set of guidelines to support the proposal of characteristics able to describe non-stationary data over time. This guidance considers both the order of arrival of the examples and the type of variables involved in the base-level learning. In addition, we analyze the influence of characteristics regarding their dependence on data morphology. Experimental results using real data streams showed the effectiveness of the proposed general data characterization scheme to support algorithm selection by meta-learning systems. Moreover, the dependent metafeatures provided crucial information for the success of some meta-models.


2013

POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter

Authors
Saleiro, P; Rei, L; Pasquali, A; Soares, C; Teixeira, J; Pinto, F; Nozari, M; Felix, C; Strecht, P;

Publication
CEUR Workshop Proceedings

Abstract
Filtering tweets relevant to a given entity is an important task for online reputation management systems. This contributes to a reliable analysis of opinions and trends regarding a given entity. In this paper we describe our participation at the Filtering Task of RepLab 2013. The goal of the competition is to classify a tweet as relevant or not relevant to a given entity. To address this task we studied a large set of features that can be generated to describe the relationship between an entity and a tweet. We explored different learning algorithms as well as different types of features: text, keyword similarity scores between entities' metadata and tweets, Freebase entity graph and Wikipedia. The test set of the competition comprises more than 90000 tweets of 61 entities of four distinct categories: automotive, banking, universities and music. Results show that our approach is able to achieve a Reliability of 0.72 and a Sensitivity of 0.45 on the test set, corresponding to an F-measure of 0.48 and an Accuracy of 0.908.


2015

Pruning Bagging Ensembles with Metalearning

Authors
Pinto, F; Soares, C; Mendes Moreira, J;

Publication
MULTIPLE CLASSIFIER SYSTEMS (MCS 2015)

Abstract
Ensemble learning algorithms often benefit from pruning strategies that reduce the number of individual models and improve performance. In this paper, we propose a Metalearning method for pruning bagging ensembles. Our proposal differs from other pruning strategies in that it allows the ensemble to be pruned before actually generating the individual models. The method consists of generating a set of characteristics from the bootstrap samples and relating them to the impact of the predictive models in multiple tested combinations. We executed experiments with bagged ensembles of 20 and 100 decision trees for 53 UCI classification datasets. Results show that our method is competitive with a state-of-the-art pruning technique and bagging, while using only 25% of the models.


2015

TwitterJam: Identification of Mobility Patterns in Urban Centers Based on Tweets

Authors
Rebelo, F; Soares, C; Rossetti, RJF;

Publication
2015 IEEE FIRST INTERNATIONAL SMART CITIES CONFERENCE (ISC2)

Abstract
In the early twenty-first century, social networks served only to let the world know our tastes, share our photos and share some thoughts. A decade later, these services are filled with an enormous amount of information. Now, industry and academia are exploring this information in order to extract implicit patterns. TwitterJam is a tool that analyses the contents of the social network Twitter to extract events related to road traffic. To reach this goal, we started by analysing tweets to identify those that actually contain road traffic information. The second step was to gather official information to confirm the extracted information. With these two types of information (official and general), we correlated them in order to verify the credibility of public tweets. The correlation between the two types of information was done separately in two ways: the first concerns the number of tweets at a certain time of day and the second concerns the location of these tweets. Two hypotheses were also devised concerning these correlations. The results were not perfect but were reasonable enough. We also analysed tools suitable for the visualization of data to decide on the best strategy to follow. In the end, we developed a web application that shows the results, to help with their analysis.


2015

Using Metalearning for Prediction of Taxi Trip Duration Using Different Granularity Levels

Authors
Zarmehri, MN; Soares, C;

Publication
Advances in Intelligent Data Analysis XIV

Abstract
Trip duration is an important metric for the management of taxi companies, as it affects operational efficiency, driver satisfaction and, above all, customer satisfaction. In particular, the ability to predict trip duration in advance can be very useful for allocating taxis to stands and finding the best route for trips. A data mining approach can be used to generate models for trip time prediction. In fact, given the amount of data available, different models can be generated for different taxis. Given the difference between the data collected by different taxis, the best model for each one can be obtained with different algorithms and/or parameter settings. However, finding the configuration that generates the best model for each taxi is computationally very expensive. In this paper, we propose the use of metalearning to address the problem of selecting the algorithm that generates the model with the most accurate predictions for each taxi. The approach is tested on data collected in the Drive-In project. Our results show that metalearning can help to select the algorithm with the best accuracy.
