Publications - INESC TEC

Publications

Publications by LIAAD

2014

Failure Prediction - An Application in the Railway Industry

Authors
Pereira, P; Ribeiro, RP; Gama, J;

Publication
DISCOVERY SCIENCE, DS 2014

Abstract
Machine or system failures have a high impact at both technical and economic levels. Most modern equipment has logging systems that allow us to collect a diversity of data regarding its operation and health. Using data mining models for novelty detection enables us to explore those datasets, building classification systems that can detect a failure and issue an alert when it starts evolving, so that it does not develop unnoticed into a breakdown. In the present case we use a failure detection system to predict train door breakdowns before they happen, using data from the doors' logging system. We study three methods for failure detection: outlier detection, novelty detection and a supervised SVM. Given the problem's features, namely the possibility of a passenger interrupting the movement of a door, the three predictors are prone to false alarms. The main contribution of this work is the use of a low-pass filter to process the output of the predictors, leading to a strong reduction in the false alarm rate.
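
The low-pass filtering idea can be illustrated with a minimal sketch. The abstract does not say which filter the authors used, so the first-order exponential smoothing below, the function names, and the values of `alpha` and `threshold` are all assumptions chosen for illustration. The point is only that an isolated alert (for example, a passenger blocking a door once) is damped, while a sustained run of alerts still raises an alarm.

```python
def low_pass_filter(signal, alpha=0.3):
    """First-order exponential smoothing: y[t] = alpha * x[t] + (1 - alpha) * y[t-1].

    alpha is a hypothetical smoothing factor, not a value from the paper.
    """
    smoothed = []
    previous = 0.0
    for x in signal:
        previous = alpha * x + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed


def filtered_alarms(raw_alarms, alpha=0.3, threshold=0.5):
    """Raise an alarm only when the smoothed predictor output exceeds the threshold."""
    return [y > threshold for y in low_pass_filter(raw_alarms, alpha)]


# Made-up predictor output: 1 = the predictor flagged this door cycle as abnormal.
raw = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
alarms = filtered_alarms(raw)

# The isolated flag at position 2 is suppressed; the sustained run is not.
print(alarms)
```

A smaller `alpha` suppresses more isolated flags at the cost of a longer delay before a genuine failure is reported.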


2014

Symbolic Data Analysis: another look at the interaction of Data Mining and Statistics

Authors
Brito, P;

Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Symbolic Data Analysis (SDA) provides a framework for the representation and analysis of data that comprehends inherent variability. While in Data Mining and classical Statistics the data to be analyzed usually presents one single value for each variable, that is no longer the case when the entities under analysis are not single elements, but groups gathered on the basis of some given criteria. Then, for each variable, variability inherent to each group should be taken into account. Also, when analysing concepts, such as botanic species, disease descriptions, car models, and so on, data entail intrinsic variability, which should be explicitly considered. To this purpose, new variable types have been introduced, whose realizations are not single real values or categories, but sets, intervals, or, more generally, distributions over a given domain. SDA provides methods for the (multivariate) analysis of such data, where the variability expressed in the data representation is taken into account, using various approaches. (C) 2014 John Wiley & Sons, Ltd.
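
As a rough illustration of an interval-valued variable, the sketch below aggregates individual records into groups and describes each group by the interval [min, max] of a numeric variable. The function name, the field names, and the data are hypothetical and made up for this example; SDA also covers set-valued and distribution-valued variables, which are not shown here.

```python
from collections import defaultdict


def to_interval_data(records, group_key, variable):
    """Describe each group by the interval (min, max) of one numeric variable."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[variable])
    return {group: (min(values), max(values)) for group, values in groups.items()}


# Made-up microdata: individual cars, to be analysed at the level of the car model.
records = [
    {"model": "Model A", "price": 18.5},
    {"model": "Model A", "price": 21.0},
    {"model": "Model A", "price": 19.9},
    {"model": "Model B", "price": 30.0},
    {"model": "Model B", "price": 34.5},
]

intervals = to_interval_data(records, "model", "price")
print(intervals)
```

Each car model is now a single entity whose price is an interval rather than a single value, which is the kind of realization the abstract refers to.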


2014

Social Networks as Symbolic Data

Authors
Giordano, G; Brito, P;

Publication
ANALYSIS AND MODELING OF COMPLEX DATA IN BEHAVIORAL AND SOCIAL SCIENCES

Abstract
Starting from the main idea of Symbolic Data Analysis, which is to extend Statistics and Data Mining methods from first-order to second-order objects, we focus on network data (as defined in the framework of Social Network Analysis) to define a graph structure and the underlying network in the context of complex data objects. A Network Symbolic description is defined according to the statistical characterization of the network's topological properties. We use suitable network measures, which are represented by means of symbolic variables. Their study through multidimensional data analysis allows for the synthetic representation of a network as a point in a metric space. The proposed approach is discussed on the basis of a simulation study considering three classical network growth processes.
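
To make the idea of describing a network by symbolic variables more concrete, here is a minimal sketch that summarises one network measure, node degree, as an interval and as an empirical distribution. The choice of degree as the measure, the function names, and the two toy graphs are assumptions for illustration only; the abstract does not list the measures the authors actually use.

```python
from collections import Counter


def degree_sequence(edges):
    """Return {node: degree} for an undirected edge list (isolated nodes are not counted)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def symbolic_description(edges):
    """Summarise node degree as an interval and as a relative-frequency distribution."""
    degrees = list(degree_sequence(edges).values())
    counts = Counter(degrees)
    return {
        "degree_interval": (min(degrees), max(degrees)),
        "degree_distribution": {k: counts[k] / len(degrees) for k in sorted(counts)},
    }


# Two toy networks on five nodes: a star and a ring.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

star_description = symbolic_description(star)
ring_description = symbolic_description(ring)
print(star_description)
print(ring_description)
```

The star and the ring have nearly the same number of edges but clearly different symbolic descriptions, so each whole network can be treated as one second-order object and compared with others.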


2014

Merging Decision Trees: A Case Study in Predicting Student Performance

Authors
Strecht, P; Mendes Moreira, J; Soares, C;

Publication
ADVANCED DATA MINING AND APPLICATIONS, ADMA 2014

Abstract
Predicting the failure of students in university courses can provide useful information for course and programme managers and can help explain the dropout phenomenon. While it is important to have models at the course level, their number makes it hard to extract knowledge that can be useful at the university level. Therefore, to support decision making at this level, it is important to generalize the knowledge contained in those models. We propose an approach to group and merge interpretable models in order to replace them with more general ones without compromising the quality of predictive performance. We evaluate our approach using data from the University of Porto. The results obtained are promising, although they suggest alternative approaches to the problem.


2014

An Empirical Methodology to Analyze the Behavior of Bagging

Authors
Pinto, F; Soares, C; Mendes Moreira, J;

Publication
ADVANCED DATA MINING AND APPLICATIONS, ADMA 2014

Abstract
In this paper we propose and apply a methodology to study the relationship between the performance of bagging and the characteristics of the bootstrap samples. The methodology consists of 1) an extensive set of experiments to estimate the empirical distribution of performance of the population of all possible ensembles that can be created with those bootstraps and 2) a metalearning approach to analyze that distribution based on characteristics of the bootstrap samples and their relationship with the complete training set. Given the large size of the population of all ensembles, we empirically show that it is possible to apply the methodology to a sample. We applied the methodology to 53 classification datasets for ensembles of 20 and 100 models. Our results show that diversity is crucial for an important bootstrap and we show evidence of a metric that can measure diversity without any learning process involved. We also found evidence that the best bootstraps have a predictive power very similar to the one presented by the training set using naive models.
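
For readers unfamiliar with bootstrap samples, the sketch below draws a few of them and computes a simple overlap score between pairs that requires no model training. The Jaccard overlap used here is an assumed stand-in chosen for illustration; it is not claimed to be the diversity metric the paper reports, and the sample size, number of samples, and seed are arbitrary.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from a training set of n rows."""
    return [rng.randrange(n) for _ in range(n)]


def jaccard(a, b):
    """Share of distinct rows that two bootstrap samples have in common."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 100            # hypothetical training set size
samples = [bootstrap_indices(n_rows, rng) for _ in range(20)]

# Pairwise overlap between all bootstrap samples; lower overlap suggests
# more diverse training data for the ensemble members.
overlaps = [
    jaccard(samples[i], samples[j])
    for i in range(len(samples))
    for j in range(i + 1, len(samples))
]
mean_overlap = sum(overlaps) / len(overlaps)
print(round(mean_overlap, 3))
```

Because the score depends only on which rows each sample contains, it can be computed before any model is trained.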


2014

Simulation of the ensemble generation process: The divergence between data and model similarity

Authors
Pinto, F; Mendes Moreira, J; Soares, C; Rossetti, RJF;

Publication
Modelling and Simulation 2014 - European Simulation and Modelling Conference, ESM 2014

Abstract
In this paper we present a NetLogo simulation model for a Data Mining methodological process: ensemble classifier generation. The model makes it possible to study the trade-off between data characteristics and diversity, a key concept in Ensemble Learning. We studied the research hypothesis that data characteristics should also be taken into account while generating ensemble classifier models. The results of our experiments indicate that diversity is in fact a key concept in Ensemble Learning, but regarding our research hypothesis the findings are inconclusive.
