Publications

Publications by Carlos Manuel Soares

2016

Learning from the News: Predicting Entity Popularity on Twitter

Authors
Saleiro, P; Soares, C;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS XV

Abstract
In this work, we tackle the problem of predicting entity popularity on Twitter based on the news cycle. We apply a supervised learning approach and extract four types of features: (i) signal, (ii) textual, (iii) sentiment and (iv) semantic, which we use to predict whether the popularity of a given entity will be high or low in the following hours. We ran several experiments on six different entities in a dataset of over 150M tweets and 5M news items and obtained F1 scores over 0.70. Error analysis indicates that news performs better at predicting entity popularity on Twitter when it is the primary information source of the event, as opposed to events such as live TV broadcasts, political debates or football matches.


2016

Active learning and data manipulation techniques for generating training examples in meta-learning

Authors
Sousa, AFM; Prudencio, RBC; Ludermir, TB; Soares, C;

Publication
NEUROCOMPUTING

Abstract
Algorithm selection is an important task in different domains of knowledge. Meta-learning treats this task by adopting a supervised learning strategy. Training examples in meta-learning (called meta-examples) are generated from experiments performed with a pool of candidate algorithms in a number of problems, usually collected from data repositories or synthetically generated. A meta-learner is then applied to acquire knowledge relating features of the problems and the best algorithms in terms of performance. In this paper, we address an important aspect in meta-learning, which is to produce a significant number of relevant meta-examples. Generating a high-quality set of meta-examples can be difficult due to the low availability of real datasets in some domains and the high computational cost of labelling the meta-examples. In the current work, we focus on the generation of meta-examples for meta-learning by combining: (1) a promising approach to generate new datasets (called datasetoids) by manipulating existing ones; and (2) active learning methods to select the most relevant datasets previously generated. The datasetoids approach is adopted to augment the number of useful problem instances for meta-example construction. However, not all generated problems are equally relevant. Active meta-learning then arises to select only the most informative instances to be labelled. Experiments were performed with different scenarios, meta-learning algorithms and dataset selection strategies. Our experiments revealed that it is possible to reduce the computational cost of generating meta-examples while maintaining good meta-learning performance.


2017

Early Fusion Strategy for Entity-Relationship Retrieval

Authors
Saleiro, P; Frayling, NM; Rodrigues, EM; Soares, C;

Publication
Proceedings of the First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR 2017) co-located with the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Shinjuku, Tokyo, Japan, August 11, 2017.

Abstract
We address the task of entity-relationship (E-R) retrieval, i.e., given a query characterizing types of two or more entities and relationships between them, retrieve the relevant tuples of related entities. Answering E-R queries requires gathering and joining evidence from multiple unstructured documents. In this work, we consider entities and relationships of any type, i.e., characterized by context terms instead of pre-defined types or relationships. We propose a novel IR-centric approach for E-R retrieval that builds on the basic early fusion design pattern for object retrieval to provide extensible entity-relationship representations, suitable for complex, multi-relationship queries. We performed experiments with Wikipedia articles as entity representations combined with relationships extracted from ClueWeb-09-B with FACC1 entity linking. We obtained promising results using 3 different query collections comprising 469 E-R queries. © Copyright by the paper's authors.


2017

Learning Word Embeddings from the Portuguese Twitter Stream: A Study of Some Practical Aspects

Authors
Saleiro, P; Sarmento, L; Rodrigues, EM; Soares, C; Oliveira, E;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)

Abstract
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focusing on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up vocabulary size from 2048 words embedded and 500K training examples to 32768 words over 10M training examples while keeping a stable validation loss and an approximately linear trend in training time per epoch. We also observed that using less than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine similarity thresholds, indicating that a wider range of metrics needs to be developed to track progress.


2017

Effect of Metalearning on Feature Selection Employment

Authors
das Dôres, SN; Soares, C; Ruiz, DDA;

Publication
Proceedings of the International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms co-located with the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, AutoML@PKDD/ECML 2017, Skopje, Macedonia, September 22, 2017.

Abstract
Feature Selection (FS) is important to improve learning performance, reduce computational complexity and decrease required storage. There are multiple methods for feature selection, with varying impact and computational cost. Therefore, choosing the right method for a given data set is important. In this paper, we analyze the advantages of metalearning for feature selection employment. This issue is relevant because a wrong decision may imply additional processing, when FS is unnecessarily applied, or a loss of performance, when FS is not used in a problem for which it is appropriate. Our results showed that, although there is an advantage in using metalearning, these gains are not yet sufficiently relevant, which opens the way for new research to be carried out in the area.


2015

POPmine: Tracking Political Opinion on the Web

Authors
Saleiro, P; Amir, S; Silva, M; Soares, C;

Publication
CIT/IUCC/DASC/PICOM 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY - UBIQUITOUS COMPUTING AND COMMUNICATIONS - DEPENDABLE, AUTONOMIC AND SECURE COMPUTING - PERVASIVE INTELLIGENCE AND COMPUTING

Abstract
The automatic content analysis of mass media in the social sciences has become necessary and possible with the rise of social media and computational power. One particularly promising avenue of research concerns the use of opinion mining. We design and implement the POPmine system, which is able to collect texts from web-based conventional media (news items in mainstream media sites) and social media (blogs and Twitter) and to process those texts, recognizing topics and political actors, analyzing relevant linguistic units, and generating indicators of both frequency of mention and polarity (positivity/negativity) of mentions of political actors across sources, types of sources, and time.
