2005
Authors
Azevedo, PJ; Silva, CG; Rodrigues, JR; Loureiro Ferreira, N; Brito, RMM;
Publication
BIOLOGICAL AND MEDICAL DATA ANALYSIS, PROCEEDINGS
Abstract
One way of exploring protein unfolding events associated with the development of Amyloid diseases is through the use of multiple Molecular Dynamics Protein Unfolding Simulations. The analysis of the huge amount of data generated in these simulations is not a trivial task. In the present report, we demonstrate the use of Association Rules applied to the analysis of the variation profiles of the Solvent Accessible Surface Area of the 127 amino-acid residues of the protein Transthyretin, along multiple simulations. This allowed us to identify a set of 28 hydrophobic residues forming a hydrophobic cluster that might be essential in the unfolding and folding processes of Transthyretin.
2009
Authors
Silva, CG; Ferreira, PG; Azevedo, PJ; Brito, RMM;
Publication
Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions
Abstract
The protein folding problem, i.e. the identification of the rules that determine the acquisition of the native, functional, three-dimensional structure of a protein from its linear sequence of amino-acids, is still a major challenge in structural molecular biology. Moreover, the identification of a series of neurodegenerative diseases as protein unfolding/misfolding disorders highlights the importance of a detailed characterisation of the molecular events driving the unfolding and misfolding processes in proteins. One way of exploring these processes is through the use of molecular dynamics simulations. The analysis and comparison of the enormous amount of data generated by multiple protein folding or unfolding simulations is not a trivial task, presenting many interesting challenges to the data mining community. Considering the central role of the hydrophobic effect in protein folding, we show here the application of two data mining methods (hierarchical clustering and association rules) to the analysis and comparison of the solvent accessible surface area (SASA) variation profiles of each of the 127 amino-acid residues in the amyloidogenic protein Transthyretin, across multiple molecular dynamics protein unfolding simulations. © 2010, IGI Global.
2007
Authors
Ferreira, PG; Azevedo, PJ;
Publication
2007 IEEE Symposium on Computational Intelligence and Data Mining, Vols 1 and 2
Abstract
The existence of preserved subsequences in a set of related protein sequences suggests that they might play a structural and functional role in the protein's mechanisms. Due to its exploratory approach, the mining process tends to deliver a large number of motifs. It is therefore critical to provide methods that identify relevant and significant motifs. Many measures of interest and significance have been proposed. However, since motifs have a wide range of applications, the choice of appropriate significance measures is application dependent. Some measures show consistent results and are highly correlated, while others disagree. In this paper we review existing measures and study their behavior in order to assist the selection of the most appropriate set of measures. An experimental evaluation of the measures for high quality patterns from the Prosite database is presented.
2005
Authors
Ferreira, PG; Azevedo, PJ;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS
Abstract
We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and has a minimal length. For each query sequence a vector of features is obtained. The features consist of the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes Classifier. The combination of these characteristics results in a multi-class and multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method using three collections of protein datasets. The tests showed that the method performs on par with state-of-the-art methods in protein classification.
2005
Authors
Ferreira, PG; Azevedo, PJ;
Publication
KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005
Abstract
Considering the characteristics of biological sequence databases, which typically have a small alphabet, very long sequences and a relatively small size (several hundred sequences), we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and results from the combination of some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing a smooth and direct introduction of various types of features into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, has shown that our algorithm outperforms state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying the patterns of interest to the user.
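To make the notion of a rigid gap pattern concrete, the following minimal sketch counts the support of such a pattern over a small set of sequences. It is not the gIL algorithm: it is a naive scan, and the function names, the `.` wildcard symbol, and the toy sequences are all assumptions introduced here for illustration.

```python
def matches_at(sequence, pattern, start, wildcard="."):
    """Return True if `pattern` matches `sequence` at offset `start`.

    A wildcard position stands for a gap of exactly one symbol, so the
    pattern has a fixed (rigid) length.
    """
    if start + len(pattern) > len(sequence):
        return False
    window = sequence[start:start + len(pattern)]
    return all(p == wildcard or p == s for p, s in zip(pattern, window))


def support(sequences, pattern, wildcard="."):
    """Number of sequences that contain at least one match of `pattern`."""
    count = 0
    for seq in sequences:
        last_start = len(seq) - len(pattern)
        if any(matches_at(seq, pattern, i, wildcard) for i in range(last_start + 1)):
            count += 1
    return count


def is_frequent(sequences, pattern, min_support, wildcard="."):
    """A pattern is frequent if its support reaches `min_support`."""
    return support(sequences, pattern, wildcard) >= min_support


# Hypothetical toy database of three short sequences.
toy_sequences = ["MKVLA", "MRVLG", "AKVIA"]
```

With these toy sequences, `support(toy_sequences, "M.VL")` counts the first two sequences, since each contains `M`, any one symbol, then `VL`. A real miner would enumerate candidate patterns rather than test a single one.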
2012
Authors
Castro, NC; Azevedo, PJ;
Publication
Statistical Analysis and Data Mining
Abstract
Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem within applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs. They are typically evaluated by human experts. This is infeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the data mining communities to evaluate extracted patterns. In this work we present an approach to calculate the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov Chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution enables the application of a powerful technique, statistical tests, to a time series setting. This provides researchers and practitioners with an important tool to automatically evaluate the degree of relevance of each extracted motif. © 2012 Wiley Periodicals, Inc.
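The idea of comparing an observed motif frequency against a Markov-chain expectation can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the time series has already been converted to a string of symbols, uses a first-order chain, and uses a one-sided binomial test that treats candidate positions as independent. All function names are hypothetical.

```python
import math
from collections import Counter


def markov_word_probability(seq, word):
    """First-order Markov estimate of the probability that `word`
    starts at a given position of the symbolic sequence `seq`."""
    symbol_counts = Counter(seq)
    pair_counts = Counter(zip(seq, seq[1:]))  # transitions a -> b
    source_counts = Counter(seq[:-1])         # symbols that have a successor
    prob = symbol_counts[word[0]] / len(seq)
    for a, b in zip(word, word[1:]):
        if source_counts[a] == 0:
            return 0.0
        prob *= pair_counts[(a, b)] / source_counts[a]
    return prob


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def motif_p_value(seq, word):
    """Return (observed count, expected count, p-value) for `word` in `seq`."""
    positions = len(seq) - len(word) + 1
    observed = sum(1 for i in range(positions) if seq[i:i + len(word)] == word)
    p = markov_word_probability(seq, word)
    return observed, positions * p, binomial_upper_tail(positions, p, observed)
```

A small p-value suggests the symbolic motif occurs more often than the chain predicts. Overlapping occurrences are not independent, so the binomial tail is only an approximation, and a higher-order chain or a different test may be more appropriate in practice.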