Publications

Publications by LIAAD

2013

SMOTE for Regression

Authors
Torgo, L; Ribeiro, RP; Pfahringer, B; Branco, P;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2013

Abstract
Several real-world prediction problems involve forecasting rare values of a target variable. When this variable is nominal we have a problem of class imbalance that was already studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of task. Namely, we propose to address such tasks with sampling approaches. These approaches change the distribution of the given training data set to decrease the problem of imbalance between the rare target cases and the most frequent ones. We present a modification of the well-known SMOTE algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm, turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable. © 2013 Springer-Verlag.
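
The over-sampling idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `smoter_oversample`, the threshold-based `is_rare` predicate, and the parameters `k` and `n_new_per_case` are hypothetical, and the abstract does not specify the interpolation details. The sketch assumes a SMOTE-style scheme in which each synthetic case lies between a rare case and one of its nearest rare neighbours, and its target value is a distance-weighted average of the two seed targets.

```python
import random


def smoter_oversample(X, y, is_rare, n_new_per_case=1, k=3, seed=0):
    """Create synthetic rare cases by interpolating between rare neighbours.

    X: list of numeric feature vectors; y: list of continuous targets.
    is_rare: hypothetical predicate marking a target value as rare.
    Returns (new_X, new_y) holding only the synthetic cases.
    """
    rng = random.Random(seed)
    rare = [(x, t) for x, t in zip(X, y) if is_rare(t)]
    if len(rare) < 2:
        return [], []  # nothing to interpolate between

    def dist(a, b):
        # Euclidean distance between two feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    new_X, new_y = [], []
    for i, (x, t) in enumerate(rare):
        others = [rare[j] for j in range(len(rare)) if j != i]
        neighbours = sorted(others, key=lambda c: dist(x, c[0]))[:k]
        for _ in range(n_new_per_case):
            nx, nt = rng.choice(neighbours)
            frac = rng.random()
            # Synthetic features: a random point on the segment x -> nx
            sx = [xi + frac * (ni - xi) for xi, ni in zip(x, nx)]
            d1, d2 = dist(sx, x), dist(sx, nx)
            if d1 + d2 == 0:
                st = (t + nt) / 2
            else:
                # The closer seed contributes more to the synthetic target
                st = (d2 * t + d1 * nt) / (d1 + d2)
            new_X.append(sx)
            new_y.append(st)
    return new_X, new_y


# Usage: treat targets >= 9 as rare (an assumed threshold for illustration)
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y = [1.0, 1.0, 1.0, 9.0, 10.0]
extra_X, extra_y = smoter_oversample(X, y, lambda t: t >= 9, n_new_per_case=2, k=1)
```

The synthetic cases would be appended to the training set (typically together with under-sampling of the frequent cases) before fitting any regression algorithm.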

2013

POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter

Authors
Saleiro, P; Rei, L; Pasquali, A; Soares, C; Teixeira, J; Pinto, F; Nozari, M; Felix, C; Strecht, P;

Publication
CEUR Workshop Proceedings

Abstract
Filtering tweets relevant to a given entity is an important task for online reputation management systems. This contributes to a reliable analysis of opinions and trends regarding a given entity. In this paper we describe our participation at the Filtering Task of RepLab 2013. The goal of the competition is to classify a tweet as relevant or not relevant to a given entity. To address this task we studied a large set of features that can be generated to describe the relationship between an entity and a tweet. We explored different learning algorithms as well as different types of features: text, keyword similarity scores between entities' metadata and tweets, the Freebase entity graph and Wikipedia. The test set of the competition comprises more than 90000 tweets of 61 entities of four distinct categories: automotive, banking, universities and music. Results show that our approach is able to achieve a Reliability of 0.72 and a Sensitivity of 0.45 on the test set, corresponding to an F-measure of 0.48 and an Accuracy of 0.908.

2013

Clustering for decision support in the fashion industry: A case study

Authors
Monte, A; Soares, C; Brito, P; Byvoet, M;

Publication
Lecture Notes in Mechanical Engineering

Abstract
The scope of this work is the segmentation of the orders of Bivolino, a Belgian company that sells custom tailored shirts. The segmentation is done based on clustering, following a Data Mining approach. We use the K-Medoids clustering method because it is less sensitive to outliers than other methods and it can handle nominal variables, which are the most common in the data used in this work. We interpret the results from both the design and marketing perspectives. The results of this analysis contain useful knowledge for the company regarding its business. This knowledge, as well as the continued usage of clustering to support both the design and marketing processes, is expected to allow Bivolino to make important business decisions and, thus, obtain competitive advantage over its competitors. © Springer International Publishing Switzerland 2013.

2013

CN2-SD for subgroup discovery in a highly customized textile industry: A case study

Authors
Almeida, S; Soares, C;

Publication
Lecture Notes in Mechanical Engineering

Abstract
The success of the textile industry largely depends on the products offered and on the speed of response to variations in demand that are induced by changes in consumer lifestyles. The study of behavioral habits and buying trends can provide models to be integrated into the decision support systems of companies. Data mining techniques can be used to develop models based on data. This approach has been used in the past to develop models to improve sales in the textile industry. However, the discovery of scientific models based on subgroup discovery algorithms that characterize subgroups of observations with rare distributions has not been made in this area. The goal of this work is to investigate whether these algorithms can extract knowledge that is useful for a particular kind of textile industry, which produces highly customized garments. We apply the CN2-SD subgroup discovery method to find rare and interesting subgroups of products on a database provided by a manufacturer of custom-made shirts. The results show that it is possible to obtain knowledge that is useful to understand customer preferences in highly customized textile industries using subgroup discovery techniques. © Springer International Publishing Switzerland 2013.

2013

Active Selection of Training Instances for a Random forest Meta-Learner

Authors
Sousa, AFM; Prudencio, RBC; Soares, C; Ludermir, TB;

Publication
2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

Abstract
Several approaches have been applied to the task of algorithm selection. In this context, Meta-Learning provides an efficient solution by adopting a supervised strategy. Despite its promising results, Meta-Learning requires an adequate number of instances to produce a rich set of meta-examples. Recent approaches to generate synthetic or manipulated datasets have been adopted with success in the context of Meta-Learning. These proposals include the datasetoids approach, a simple data manipulation technique that generates new datasets from existing ones. Although such proposals can actually produce relevant datasets, they can eventually produce redundant, or even irrelevant, problem instances. Active Meta-Learning arises in this context to select only the most informative instances for meta-example generation. In this work, we investigate Active Meta-Learning combined with datasetoids, focusing on using the Random forest algorithm in meta-learning. Our experiments revealed that it is possible to reduce the computational cost of generating meta-examples and obtain a significant gain in Meta-Learning performance.

2013

Space allocation in the retail industry: A decision support system integrating evolutionary algorithms and regression models

Authors
Pinto, F; Soares, C;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
One of the hardest resources to manage in retail is space. Retailers need to assign limited store space to a growing number of product categories such that sales and other performance metrics are maximized. Although this seems to be an ideal task for a data mining approach, there is one important barrier: the representativeness of the available data. In fact, changes to the layout of retail stores are infrequent. This means that very few values of the space variable are represented in the data, which makes it hard to generalize. In this paper, we describe a Decision Support System to assist retailers in this task. The system uses an Evolutionary Algorithm to optimize space allocation based on the estimated impact on sales caused by changes in the space assigned to product categories. We assess the quality of the system on a real case study, using different regression algorithms to generate the estimates. The system obtained very good results when compared with the recommendations made by the business experts. We also investigated the effect of the representativeness of the sample on the accuracy of the regression models. We selected a few product categories based on a heuristic assessment of their representativeness. The results indicate that the best regression models were obtained on products for which the sample was not the best. The reason for this unexpected result remains to be explained. © 2013 Springer-Verlag.
