Publications

Publications by LIAAD

2017

TexRep: A Text Mining Framework for Online Reputation Monitoring

Authors
Saleiro, P; Rodrigues, EM; Soares, C; Oliveira, E;

Publication
NEW GENERATION COMPUTING

Abstract
This work aims to understand, formalize and explore the scientific challenges of using unstructured text data from different Web sources for Online Reputation Monitoring. We present TexRep, an adaptable text mining framework specifically tailored for Online Reputation Monitoring that can be reused in multiple application scenarios, from politics to finance. The framework is able to collect texts from online media, such as Twitter, identify entities of interest, and classify sentiment polarity and intensity. It supports multiple data aggregation methods, as well as visualization and modeling techniques that can be used for both descriptive analytics, such as analyzing how political polls evolve over time, and predictive analytics, such as predicting elections. We present case studies that illustrate and validate TexRep for Online Reputation Monitoring. In particular, we provide an evaluation of the TexRep Entity Filtering and Sentiment Analysis modules using well-known external benchmarks. We also present an illustrative example of applying TexRep in the political domain.

2017

Early Fusion Strategy for Entity-Relationship Retrieval

Authors
Saleiro, P; Frayling, NM; Rodrigues, EM; Soares, C;

Publication
Proceedings of the First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR 2017) co-located with the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Shinjuku, Tokyo, Japan, August 11, 2017.

Abstract
We address the task of entity-relationship (E-R) retrieval, i.e., given a query characterizing types of two or more entities and relationships between them, retrieve the relevant tuples of related entities. Answering E-R queries requires gathering and joining evidence from multiple unstructured documents. In this work, we consider entities and relationships of any type, i.e., characterized by context terms instead of pre-defined types or relationships. We propose a novel IR-centric approach for E-R retrieval that builds on the basic early fusion design pattern for object retrieval to provide extensible entity-relationship representations, suitable for complex, multi-relationship queries. We performed experiments with Wikipedia articles as entity representations combined with relationships extracted from ClueWeb-09-B with FACC1 entity linking. We obtained promising results using 3 different query collections comprising 469 E-R queries. © Copyright by the paper's authors.

2017

Learning Word Embeddings from the Portuguese Twitter Stream: A Study of Some Practical Aspects

Authors
Saleiro, P; Sarmento, L; Rodrigues, EM; Soares, C; Oliveira, E;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017)

Abstract
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focusing on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up vocabulary size from 2048 embedded words and 500K training examples to 32768 words over 10M training examples while keeping a stable validation loss and an approximately linear trend in training time per epoch. We also observed that using less than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine similarity thresholds, indicating that a wider range of metrics needs to be developed to track progress.

2017

Effect of Metalearning on Feature Selection Employment

Authors
das Dôres, SN; Soares, C; Ruiz, DDA;

Publication
Proceedings of the International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms co-located with the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, AutoML@PKDD/ECML 2017, Skopje, Macedonia, September 22, 2017.

Abstract
Feature Selection (FS) is important to improve learning performance, reduce computational complexity and decrease required storage. There are multiple methods for feature selection, with varying impact and computational cost. Therefore, choosing the right method for a given data set is important. In this paper, we analyze the advantages of metalearning for deciding when to employ feature selection. This issue is relevant because a wrong decision may imply additional processing, when FS is unnecessarily applied, or a loss of performance, when FS is not used in a problem for which it is appropriate. Our results showed that, although there is an advantage in using metalearning, the gains are not yet sufficiently relevant, which opens the way for new research to be carried out in the area.

2017

Metalearning

Authors
Brazdil, P; Vilalta, R; Giraud Carrier, CG; Soares, C;

Publication
Encyclopedia of Machine Learning and Data Mining

Abstract

2017

RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval

Authors
Saleiro, P; Frayling, NM; Rodrigues, EM; Soares, C;

Publication
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017

Abstract
Improvements to entity-relationship (E-R) search techniques have been hampered by a lack of test collections, particularly for complex queries involving multiple entities and relationships. In this paper we describe a method for generating E-R test queries to support comprehensive E-R search experiments. Queries and relevance judgments are created from content that exists in a tabular form where columns represent entity types and the table structure implies one or more relationships among the entities. Editorial work involves creating natural language queries based on relationships represented by the entries in the table. We have publicly released the RELink test collection comprising 600 queries and relevance judgments obtained from a sample of Wikipedia List-of-lists-of-lists tables. The latter comprise tuples of entities that are extracted from columns and labelled by corresponding entity types and the relationships they represent. In order to facilitate research in complex E-R retrieval, we have created and released as open source the RELink Framework, which includes Apache Lucene indexing and search specifically tailored to E-R retrieval. RELink includes entity and relationship indexing based on the ClueWeb-09-B Web collection with FACC1 text span annotations linked to Wikipedia entities. With ready-to-use search resources and a comprehensive test collection, we support the community in pursuing E-R research at scale. © 2017 ACM.
