Publications

Publications by Pedro Gabriel Ferreira

2006

Mining approximate motifs in time series

Authors
Ferreira, PG; Azevedo, PJ; Silva, CG; Brito, RMM;

Publication
DISCOVERY SCIENCE, PROCEEDINGS

Abstract
The problem of discovering previously unknown frequent patterns in time series, also called motifs, has been recently introduced. A motif is a subseries pattern that appears a significant number of times. Results demonstrate that motifs may provide valuable insights about the data and have a wide range of applications in data mining tasks. The main motivation for this study was the need to mine time series data from protein folding/unfolding simulations. We propose an algorithm that extracts approximate motifs, i.e. motifs that capture portions of time series with a similar and possibly symmetric behavior. Preliminary results on the analysis of protein unfolding data support this proposal as a valuable tool. Additional experiments demonstrate that the utility of our algorithm is not limited to this particular problem. Rather, it can be a useful tool for many real-world problems.


2009

Using data mining techniques to probe the role of hydrophobic residues in protein folding and unfolding simulations

Authors
Silva, CG; Ferreira, PG; Azevedo, PJ; Brito, RMM;

Publication
Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions

Abstract
The protein folding problem, i.e. the identification of the rules that determine the acquisition of the native, functional, three-dimensional structure of a protein from its linear sequence of amino acids, is still a major challenge in structural molecular biology. Moreover, the identification of a series of neurodegenerative diseases as protein unfolding/misfolding disorders highlights the importance of a detailed characterisation of the molecular events driving the unfolding and misfolding processes in proteins. One way of exploring these processes is through the use of molecular dynamics simulations. The analysis and comparison of the enormous amount of data generated by multiple protein folding or unfolding simulations is not a trivial task, and it presents many interesting challenges to the data mining community. Considering the central role of the hydrophobic effect in protein folding, we show here the application of two data mining methods, hierarchical clustering and association rules, to the analysis and comparison of the solvent accessible surface area (SASA) variation profiles of each of the 127 amino-acid residues in the amyloidogenic protein Transthyretin, across multiple molecular dynamics protein unfolding simulations. © 2010, IGI Global.


2007

Evaluating protein motif significance measures: A case study on Prosite patterns

Authors
Ferreira, PG; Azevedo, PJ;

Publication
2007 IEEE Symposium on Computational Intelligence and Data Mining, Vols 1 and 2

Abstract
The existence of preserved subsequences in a set of related protein sequences suggests that they might play a structural and functional role in protein mechanisms. Due to its exploratory approach, the mining process tends to deliver a large number of motifs. Therefore, it is critical to have methods that identify relevant and significant motifs. Many measures of interest and significance have been proposed. However, since motifs have a wide range of applications, the choice of appropriate significance measures is application dependent. Some measures show consistent results and are highly correlated, while others disagree. In this paper we review existing measures and study their behavior in order to assist the selection of the most appropriate set of measures. An experimental evaluation of the measures for high quality patterns from the Prosite database is presented.


2005

Protein sequence classification through relevant sequence mining and Bayes Classifiers

Authors
Ferreira, PG; Azevedo, PJ;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and has a minimal length. For each query sequence a vector of features is obtained. The features consist of the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes Classifier. The combination of these characteristics results in a multi-class and multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method using three collections of protein datasets. The tests showed that the method performs on par with state-of-the-art methods in protein classification.
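
The feature vector described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that relevant subsequences are contiguous substrings, and the names `relevant_subsequences`, `features`, `min_len`, `max_len` and `min_support`, as well as the toy families, are hypothetical. The Bayes classification step is left out.

```python
from collections import Counter


def relevant_subsequences(family, min_len=2, max_len=4, min_support=2):
    """Contiguous substrings (an assumption) of length min_len..max_len that
    occur in at least min_support sequences of the family."""
    support = Counter()
    for seq in family:
        seen = set()  # count each substring once per sequence
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        support.update(seen)
    return {s for s, count in support.items() if count >= min_support}


def features(query, families, **kwargs):
    """For each family, return (number, average length) of the relevant
    subsequences that also occur in the query sequence."""
    result = {}
    for name, seqs in families.items():
        shared = [s for s in relevant_subsequences(seqs, **kwargs) if s in query]
        avg_len = sum(map(len, shared)) / len(shared) if shared else 0.0
        result[name] = (len(shared), avg_len)
    return result


# Toy data, invented for illustration only.
families = {
    "A": ["MKVLA", "MKVLG", "TMKVL"],
    "B": ["GGSPQ", "AGSPQ"],
}
feats = features("QMKVLT", families)
print(feats)
```

In a full pipeline, the per-family pairs would be the inputs to the Bayes classifier mentioned in the abstract.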


2005

Protein sequence pattern mining with constraints

Authors
Ferreira, PG; Azevedo, PJ;

Publication
KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005

Abstract
Considering the characteristics of biological sequence databases, which typically have a small alphabet, a very long length and a relatively small size (several hundreds of sequences), we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and results from the combination of some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing a smooth and direct introduction of various types of features into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, has shown that our algorithm outperforms state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying the patterns of interest to the user.


2007

A closer look on protein unfolding simulations through hierarchical clustering

Authors
Ferreira, PG; Silva, CG; Brito, RMM; Azevedo, PJ;

Publication
2007 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology

Abstract
Understanding protein folding and unfolding mechanisms is a central problem in molecular biology. Data obtained from molecular dynamics unfolding simulations may provide valuable insights for a better understanding of these mechanisms. Here, we propose the application of an augmented version of hierarchical clustering analysis to detect clusters of amino-acid residues with similar behavior in protein unfolding simulations. These clusters share a similar global pattern of solvent accessible surface area (SASA) variation in unfolding simulations of the protein Transthyretin (TTR). Classical hierarchical clustering was applied to build a dendrogram based on the SASA variation of each amino-acid residue. The dendrogram was enriched with background information on the amino-acid residues, enabling the extraction of sub-clusters with well differentiated characteristics.
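
As a rough illustration of the classical clustering step only, the sketch below groups residues by their SASA profiles with agglomerative clustering. The Euclidean distance, the average linkage, the function name `agglomerate` and the toy SASA values are all assumptions made for the example. The abstract does not state which distance or linkage was used, and the enrichment with background information is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long SASA profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles, k):
    """Average-linkage agglomerative clustering, stopped at k clusters.

    profiles maps a residue label to its SASA values over time.
    Returns a list of k clusters, each a sorted list of residue labels.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        dists = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy SASA profiles, invented for illustration only.
sasa = {
    "res10": [5.0, 20.0, 60.0, 90.0],   # becomes exposed during unfolding
    "res11": [6.0, 22.0, 58.0, 88.0],
    "res50": [80.0, 82.0, 85.0, 84.0],  # exposed throughout
    "res51": [79.0, 80.0, 86.0, 85.0],
}
clusters = agglomerate(sasa, 2)
print(clusters)
```

Recording the order of merges instead of stopping at `k` clusters would yield the dendrogram that the abstract refers to.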
