2015
Authors
Pinto, F; Soares, C; Brazdil, P;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
Data Mining (DM) researchers often focus on the development and testing of models for a single decision (e.g., direct mailing, churn detection, etc.). In practice, however, multiple interdependent decisions often have to be made simultaneously, and the best global solution is often not the combination of the best individual solutions. This problem can be addressed by searching for the overall best solution using optimization methods based on the predictions made by the DM models. We describe a case study where this approach was used to optimize the layout of a retail store in order to maximize predicted sales. A metaheuristic is used to search different hypotheses of space allocation for multiple product categories, guided by the predictions made by regression models that estimate the sales for each category based on the assigned space. We test three metaheuristics and three regression algorithms on this task. Results show that the Particle Swarm Optimization method guided by the models obtained with Random Forests and Support Vector Machines achieves good results. We also provide insights about the relationship between the accuracy of the regression models and the performance of the metaheuristics.
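A minimal sketch of the kind of pipeline the abstract describes: per-category regression models act as a surrogate for sales, and a particle swarm searches over space allocations under a fixed-total-space constraint. The category count, synthetic training data, and swarm settings are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_categories, total_space = 5, 100.0

# Fit one surrogate sales model per category on synthetic (space, sales) data.
models = []
for _ in range(n_categories):
    space = rng.uniform(5, 40, size=(200, 1))
    sales = 10 * np.log1p(space[:, 0]) + rng.normal(0, 1, 200)  # toy response
    models.append(RandomForestRegressor(n_estimators=50, random_state=0).fit(space, sales))

def predicted_sales(allocation):
    """Total predicted sales for an allocation (one space value per category)."""
    return sum(m.predict([[a]])[0] for m, a in zip(models, allocation))

def normalize(x):
    """Project a candidate onto the constraint: non-negative, fixed total space."""
    x = np.clip(x, 1e-3, None)
    return x / x.sum() * total_space

# Minimal particle swarm optimization over allocations.
n_particles, n_iter, w, c1, c2 = 20, 30, 0.7, 1.5, 1.5
pos = np.array([normalize(rng.uniform(1, 10, n_categories)) for _ in range(n_particles)])
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([predicted_sales(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.array([normalize(p) for p in pos + vel])
    vals = np.array([predicted_sales(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print("best allocation:", np.round(gbest, 1))
print("predicted sales:", round(predicted_sales(gbest), 1))
```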
2015
Authors
de Sa, CR; Rebelo, C; Soares, C; Knobbe, A;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE
Abstract
The problem of Label Ranking is receiving increasing attention from several research communities. The algorithms that have been developed/adapted to treat rankings as the target object follow two different approaches: distribution-based (e.g., using the Mallows model) or correlation-based (e.g., using Spearman's rank correlation coefficient). Decision trees have been adapted for label ranking following both approaches. In this paper we evaluate an existing correlation-based approach and propose a new one, Entropy-based Ranking trees. We then compare the results with those of a distribution-based approach. The results clearly indicate that both approaches are competitive.
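To make the correlation-based criterion concrete, here is a hypothetical sketch of how a candidate split in a label ranking tree could be scored with Spearman's rank correlation: a node is good when the rankings it holds agree with each other. The scoring function and toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def leaf_score(rankings):
    """Mean pairwise Spearman correlation: high when rankings in a node agree."""
    n = len(rankings)
    if n < 2:
        return 1.0
    corrs = [spearmanr(rankings[i], rankings[j])[0]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

def split_score(x, rankings, threshold):
    """Size-weighted agreement of the two child nodes induced by x <= threshold."""
    left, right = rankings[x <= threshold], rankings[x > threshold]
    n = len(rankings)
    return (len(left) / n) * leaf_score(left) + (len(right) / n) * leaf_score(right)

# Toy example: one feature, target rankings over 4 labels.
x = np.array([0.1, 0.2, 0.8, 0.9])
rankings = np.array([[1, 2, 3, 4], [1, 3, 2, 4], [4, 3, 2, 1], [4, 2, 3, 1]])
print(split_score(x, rankings, threshold=0.5))  # strong agreement on each side
```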
2015
Authors
Zarmehri, MN; Soares, C;
Publication
2015 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Abstract
Traditionally, a single model is developed for a data mining task. As more data is collected at a more detailed level, organizations are becoming more interested in having specific models for distinct parts of the data (e.g., customer segments). From the business perspective, data can be divided naturally into different dimensions. Each of these dimensions is usually hierarchically organized (e.g., country, city, zip code), which means that, when developing a model for a given part of the problem (e.g., a zip code), the training data may be collected at different levels of this nested hierarchy (e.g., the same zip code, the city and the country it is located in). Selecting different levels of granularity may change the performance of the whole process, so the question is which level to use for a given part. We propose a metalearning model which recommends the level of granularity of the training data that is expected to yield the best-performing model. We apply decision tree and random forest algorithms for metalearning. At the base level, our experiment uses results obtained by outlier detection methods on the problem of detecting errors in foreign trade transactions. The results show that metalearning helps to find the best level of granularity.
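A hypothetical sketch of the metalearning setup described above: one metadataset row per part of the data, metafeatures describing the candidate training sets, and the best-performing hierarchy level (determined from base-level runs) as the target. The metafeature names and synthetic metadataset are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_parts = 200

# Metafeatures describing the training data available for each part.
metafeatures = np.column_stack([
    rng.integers(50, 5000, n_parts),  # number of training examples at the level
    rng.uniform(0, 1, n_parts),       # estimated outlier rate
    rng.uniform(0, 5, n_parts),       # feature skewness summary
])
# Target: the granularity level that performed best in base-level experiments.
best_level = rng.choice(["zip", "city", "country"], n_parts)

meta_model = RandomForestClassifier(n_estimators=100, random_state=0)
meta_model.fit(metafeatures, best_level)

# Recommend a granularity level for a new part of the data.
new_part = [[800, 0.12, 1.7]]
print("recommended level:", meta_model.predict(new_part)[0])
```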
2015
Authors
Brito, PQ; Soares, C; Almeida, S; Monte, A; Byvoet, M;
Publication
ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING
Abstract
Data mining (DM) techniques have been used to solve marketing and manufacturing problems in the fashion industry. These approaches are expected to be particularly important for highly customized industries, because the diversity of products sold makes it harder to find clear patterns of customer preferences. The goal of this project was to investigate two different data mining approaches to customer segmentation: clustering and subgroup discovery. The models obtained produced six market segments and 49 rules that allowed a better understanding of customer preferences in a highly customized fashion manufacturer/e-tailer. The scope and limitations of these DM techniques will lead to further methodological refinements.
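A minimal sketch of the clustering side of such a segmentation study, mirroring the six-segment result reported above. The customer attributes, their distributions, and k=6 are illustrative assumptions; the subgroup discovery step that produces the rules is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy customer table: spend, order count, customization options per order.
X = np.column_stack([
    rng.gamma(2.0, 50.0, 500),   # total spend
    rng.poisson(3, 500),         # number of orders
    rng.integers(1, 10, 500),    # customization options chosen
])

# Standardize, then partition customers into six segments.
segments = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
for s in range(6):
    print(f"segment {s}: {np.sum(segments == s)} customers, "
          f"mean spend {X[segments == s, 0].mean():.0f}")
```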
2015
Authors
Pinto, F; Soares, C; Mendes Moreira, J;
Publication
MULTIPLE CLASSIFIER SYSTEMS (MCS 2015)
Abstract
Ensemble learning algorithms often benefit from pruning strategies that reduce the number of individual models and improve performance. In this paper, we propose a metalearning method for pruning bagging ensembles. Our proposal differs from other pruning strategies in that it allows the ensemble to be pruned before the individual models are actually generated. The method consists of generating a set of characteristics from the bootstrap samples and relating them to the impact of the predictive models in multiple tested combinations. We ran experiments with bagged ensembles of 20 and 100 decision trees on 53 UCI classification datasets. Results show that our method is competitive with a state-of-the-art pruning technique and with bagging, while using only 25% of the models.
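A hypothetical sketch of the key idea: characterize the bootstrap samples before any model is trained, then build only a selected 25% of the planned ensemble. The sample descriptors and the scoring rule here are illustrative assumptions, not the paper's learned metamodel.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_models, keep = 20, 5  # keep 25% of the planned ensemble

# Draw bootstrap index sets and characterize each one without training.
samples = [rng.integers(0, len(X), len(X)) for _ in range(n_models)]

def characteristics(idx):
    """Simple bootstrap descriptors: coverage and class-balance shift."""
    coverage = len(np.unique(idx)) / len(X)
    balance_shift = abs(y[idx].mean() - y.mean())
    return coverage - balance_shift  # higher is assumed better in this sketch

scores = np.array([characteristics(idx) for idx in samples])
selected = np.argsort(scores)[-keep:]

# Train only the selected members; the other 15 are never generated.
trees = [DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
         for idx in (samples[i] for i in selected)]
votes = np.mean([t.predict(X) for t in trees], axis=0)
print("training accuracy of pruned bag:", ((votes > 0.5) == y).mean())
```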
2015
Authors
Rebelo, F; Soares, C; Rossetti, RJF;
Publication
2015 IEEE FIRST INTERNATIONAL SMART CITIES CONFERENCE (ISC2)
Abstract
In the early twenty-first century, social networks served only to let the world know our tastes, share our photos and share some thoughts. A decade later, these services are filled with an enormous amount of information, and industry and academia are exploring it in order to extract implicit patterns. TwitterJam is a tool that analyses the contents of the social network Twitter to extract events related to road traffic. To reach this goal, we started by analysing tweets to identify those which really contain road traffic information. The second step was to gather official information to confirm the extracted information. We then correlated the two types of information (official and general) in order to verify the credibility of public tweets. The correlation was computed separately in two ways: the first concerns the number of tweets at a certain time of day and the second the localization of these tweets. Two hypotheses were also devised concerning these correlations. The results were not perfect, but were reasonable. We also analysed tools suitable for the visualization of data to decide on the best strategy to follow. Finally, we developed a web application that shows the results and supports their analysis.
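A minimal sketch of the first of the two correlation checks described above: comparing hourly counts of traffic-related tweets against official incident reports. The counts below are illustrative assumptions, not data from the study.

```python
import numpy as np
from scipy.stats import pearsonr

# Hourly counts over one day (24 bins): public tweets vs. official reports.
tweets = np.array([2, 1, 0, 0, 1, 3, 12, 25, 30, 14, 8, 6,
                   7, 6, 5, 9, 18, 28, 33, 15, 9, 5, 3, 2])
official = np.array([1, 0, 0, 0, 1, 2, 9, 20, 24, 10, 6, 5,
                     5, 4, 4, 7, 14, 22, 27, 12, 7, 4, 2, 1])

# A high correlation supports the credibility of the public tweets.
r, p = pearsonr(tweets, official)
print(f"hourly correlation r={r:.2f} (p={p:.3f})")
```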