Publications

Publications by LIAAD

2013

Comparing relational and non-relational algorithms for clustering propositional data

Authors
Motta, R; Nogueira, BM; Jorge, AM; De Andrade Lopes, A; Rezende, SO; De Oliveira, MCF;

Publication
Proceedings of the ACM Symposium on Applied Computing

Abstract
Cluster detection methods are widely studied in Propositional Data Mining. In this context, each data example is represented individually as a feature vector. This data has a natural non-relational structure, but can be represented in a relational form through similarity-based network models. In these models, examples are represented by vertices and an edge connects two examples with high similarity. This relational representation allows network-based algorithms from Relational Data Mining to be employed. Specifically in clustering tasks, these models allow the use of community detection algorithms in networks to detect data clusters. In this work, we compared traditional non-relational data-based clustering algorithms with cluster detection algorithms based on relational data, using measures for community detection in networks. We carried out an exploratory analysis over 23 numerical datasets and 10 textual datasets. Results show that network models can efficiently represent the data topology, allowing their application in cluster detection with higher precision when compared to non-relational methods. Copyright 2013 ACM.
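The idea of turning feature vectors into a similarity-based network can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a k-nearest-neighbour rule for deciding which examples are "highly similar", and it uses plain connected components as a stand-in for a community detection algorithm; none of these choices are taken from the paper, and the sample points are invented for illustration.

```python
import math

def knn_graph(points, k):
    """Build an undirected k-nearest-neighbour graph as an adjacency dict."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank all other examples by Euclidean distance to example i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # Connect example i to its k most similar examples.
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def connected_components(adj):
    """Group vertices into connected components (a crude proxy for communities)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Hypothetical propositional data: two well-separated groups of 2-D vectors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
clusters = connected_components(graph)
print(clusters)
```

On real data the network is usually connected, which is why a proper community detection algorithm would replace the connected-components step.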

2013

SMOTE for Regression

Authors
Torgo, L; Ribeiro, RP; Pfahringer, B; Branco, P;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2013

Abstract
Several real-world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that has already been studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of task. Namely, we propose to address such tasks with sampling approaches. These approaches change the distribution of the given training data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present a modification of the well-known SMOTE algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm, turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable. © 2013 Springer-Verlag.
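A rough Python sketch of SMOTE-style over-sampling for a continuous target follows. The abstract does not give the algorithm's details, so several choices here are assumptions made only for illustration: rare cases are picked with a fixed threshold on the target, neighbours are found with Euclidean distance, and the synthetic target is a distance-weighted average of the two seed targets. The function name and sample data are hypothetical, and the sketch omits any under-sampling of the frequent cases.

```python
import math
import random

def smote_regression_sketch(X, y, threshold, k=2, n_new=1, seed=0):
    """Generate synthetic rare cases by interpolating between rare neighbours."""
    rng = random.Random(seed)
    # Assumption: a case is "rare" when its target reaches the threshold.
    rare = [i for i, t in enumerate(y) if t >= threshold]
    new_X, new_y = [], []
    for i in rare:
        # k nearest rare neighbours of case i in feature space.
        neighbours = sorted(
            (j for j in rare if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        if not neighbours:
            continue
        for _ in range(n_new):
            j = rng.choice(neighbours)
            frac = rng.random()
            # Interpolate the features between the two seed cases.
            x = tuple(a + frac * (b - a) for a, b in zip(X[i], X[j]))
            d_i, d_j = math.dist(x, X[i]), math.dist(x, X[j])
            if d_i + d_j == 0:
                t = (y[i] + y[j]) / 2
            else:
                # The closer seed gets the larger weight in the new target.
                t = (d_j * y[i] + d_i * y[j]) / (d_i + d_j)
            new_X.append(x)
            new_y.append(t)
    return new_X, new_y

# Hypothetical data: three common low-valued cases, three rare extreme ones.
X = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
y = [1.0, 1.2, 0.9, 9.0, 9.5, 10.0]
synth_X, synth_y = smote_regression_sketch(X, y, threshold=9.0, n_new=2)
print(len(synth_X), "synthetic rare cases generated")
```

The synthetic cases would then be appended to the training set before fitting any regression algorithm, which is what makes a sampling approach of this kind independent of the learner.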

2013

POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter

Authors
Saleiro, P; Rei, L; Pasquali, A; Soares, C; Teixeira, J; Pinto, F; Nozari, M; Felix, C; Strecht, P;

Publication
CEUR Workshop Proceedings

Abstract
Filtering tweets relevant to a given entity is an important task for online reputation management systems. This contributes to a reliable analysis of opinions and trends regarding a given entity. In this paper we describe our participation in the Filtering Task of RepLab 2013. The goal of the competition is to classify a tweet as relevant or not relevant to a given entity. To address this task we studied a large set of features that can be generated to describe the relationship between an entity and a tweet. We explored different learning algorithms as well as different types of features: text, keyword similarity scores between entities' metadata and tweets, the Freebase entity graph and Wikipedia. The test set of the competition comprises more than 90000 tweets of 61 entities of four distinct categories: automotive, banking, universities and music. Results show that our approach is able to achieve a Reliability of 0.72 and a Sensitivity of 0.45 on the test set, corresponding to an F-measure of 0.48 and an Accuracy of 0.908.

2013

Clustering for decision support in the fashion industry: A case study

Authors
Monte, A; Soares, C; Brito, P; Byvoet, M;

Publication
Lecture Notes in Mechanical Engineering

Abstract
The scope of this work is the segmentation of the orders of Bivolino, a Belgian company that sells custom tailored shirts. The segmentation is done based on clustering, following a Data Mining approach. We use the K-Medoids clustering method because it is less sensitive to outliers than other methods and it can handle nominal variables, which are the most common in the data used in this work. We interpret the results from both the design and marketing perspectives. The results of this analysis contain useful knowledge for the company regarding its business. This knowledge, as well as the continued usage of clustering to support both the design and marketing processes, is expected to allow Bivolino to make important business decisions and, thus, obtain competitive advantage over its competitors. © Springer International Publishing Switzerland 2013.
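K-Medoids on nominal data can be sketched in a few lines of Python. This is an illustrative simplification, not the procedure used in the study: it assumes a simple-matching dissimilarity (the number of attributes on which two orders differ), a naive initialisation with the first k rows, and an alternating assign/update loop rather than the classic PAM swap search. The shirt attributes below are invented.

```python
def mismatch(a, b):
    """Simple-matching dissimilarity: count of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(rows, medoids):
    """Assign every row index to its nearest medoid."""
    groups = {m: [] for m in medoids}
    for i, row in enumerate(rows):
        nearest = min(medoids, key=lambda m: mismatch(row, rows[m]))
        groups[nearest].append(i)
    return groups

def k_medoids(rows, k, max_iter=100):
    medoids = list(range(k))  # assumption: naive initialisation
    for _ in range(max_iter):
        groups = assign(rows, medoids)
        # The new medoid of a group is the member with the lowest total
        # dissimilarity to the other members; an empty group keeps its medoid.
        new_medoids = [
            min(members, key=lambda c: sum(mismatch(rows[c], rows[o])
                                           for o in members))
            if members else m
            for m, members in groups.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(rows, medoids)

# Hypothetical orders described by nominal attributes (fit, colour, fabric).
orders = [
    ("slim", "white", "cotton"),
    ("regular", "blue", "linen"),
    ("slim", "white", "linen"),
    ("regular", "blue", "cotton"),
    ("slim", "white", "cotton"),
    ("regular", "blue", "linen"),
]
medoids, segments = k_medoids(orders, k=2)
for m, members in segments.items():
    print(orders[m], "->", members)
```

Because each medoid is an actual order rather than an average, every segment can be described by a real, representative shirt configuration, which suits nominal attributes where a mean is undefined.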

2013

CN2-SD for subgroup discovery in a highly customized textile industry: A case study

Authors
Almeida, S; Soares, C;

Publication
Lecture Notes in Mechanical Engineering

Abstract
The success of the textile industry largely depends on the products offered and on the speed of response to variations in demand that are induced by changes in consumer lifestyles. The study of behavioral habits and buying trends can provide models to be integrated into the decision support systems of companies. Data mining techniques can be used to develop models based on data. This approach has been used in the past to develop models to improve sales in the textile industry. However, the discovery of scientific models based on subgroup discovery algorithms that characterize subgroups of observations with rare distributions has not been attempted in this area. The goal of this work is to investigate whether these algorithms can extract knowledge that is useful for a particular kind of textile industry, which produces highly customized garments. We apply the CN2-SD subgroup discovery method to find rare and interesting subgroups of products in a database provided by a manufacturer of custom-made shirts. The results show that it is possible to obtain knowledge that is useful to understand customer preferences in highly customized textile industries using subgroup discovery techniques. © Springer International Publishing Switzerland 2013.

2013

Active Selection of Training Instances for a Random forest Meta-Learner

Authors
Sousa, AFM; Prudencio, RBC; Soares, C; Ludermir, TB;

Publication
2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

Abstract
Several approaches have been applied to the task of algorithm selection. In this context, Meta-Learning provides an efficient solution by adopting a supervised strategy. Despite its promising results, Meta-Learning requires an adequate number of instances to produce a rich set of meta-examples. Recent approaches to generate synthetic or manipulated datasets have been adopted with success in the context of Meta-Learning. These proposals include the datasetoids approach, a simple data manipulation technique that generates new datasets from existing ones. Although such proposals can produce relevant datasets, they can also produce redundant, or even irrelevant, problem instances. Active Meta-Learning arises in this context to select only the most informative instances for meta-example generation. In this work, we investigate Active Meta-Learning combined with datasetoids, focusing on using the Random Forest algorithm in meta-learning. Our experiments revealed that it is possible to reduce the computational cost of generating meta-examples and obtain a significant gain in Meta-Learning performance.
