Publications - INESC TEC

Publications

Publications by Paulo Jorge Azevedo

2007

Deterministic motif mining in protein databases

Authors
Ferreira, PG; Azevedo, PJ;

Publication
Successes and New Directions in Data Mining

Abstract
Protein sequence motifs describe, by means of an enhanced regular expression syntax, regions of amino acids that have been conserved across several functionally related proteins. These regions may have implications at the structural and functional level of the proteins. Sequence motif analysis can bring significant improvements towards a better understanding of the protein sequence-structure-function relation. In this chapter, we review the subject of mining deterministic motifs from protein sequence databases. We start by giving a formal definition of the different types of motifs and their respective specificities. Then, we explore the methods available to evaluate the quality and interest of such patterns. Examples of applications and motif repositories are described. We discuss the algorithmic aspects and different methodologies for motif extraction. A brief description of how sequence motifs can be used to extract structural-level information patterns is also provided. © 2008, IGI Global.
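To make the idea of a deterministic motif written in an "enhanced regular expression syntax" concrete, here is a minimal sketch in Python. The abstract does not specify a notation, so the motif syntax is an assumption: a simplified, PROSITE-like form in which elements are joined by `-`, `x` matches any residue, `[..]` lists allowed residues, `{..}` lists forbidden residues, and `(n)` or `(n,m)` gives a repetition count. The function name `motif_to_regex`, the example motif, and the example sequences are hypothetical.

```python
import re

# One motif element: x, a single residue, [allowed], or {forbidden},
# optionally followed by a repetition such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def motif_to_regex(motif):
    """Translate a simplified PROSITE-like motif into a compiled regex."""
    parts = []
    for element in motif.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported motif element: %r" % element)
        core, low, high = match.groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            token = core                       # literal residue or [class]
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return re.compile("".join(parts))

# Hypothetical motif: C, any two residues, S or T, anything but P, then H.
pattern = motif_to_regex("C-x(2)-[ST]-{P}-H")
hit = pattern.search("MKCAASGHL")    # matches the region starting at index 2
miss = pattern.search("MKCAASPHL")   # no match: P is forbidden before H
```

A deterministic motif of this kind either matches a sequence or does not, which is what distinguishes it from a probabilistic model of the same region.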


2009

Deterministic pattern mining on genetic sequences

Authors
Ferreira, PG; Azevedo, PJ;

Publication
Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques

Abstract
The recent increase in the number of complete genetic sequences freely available through specialized Internet databases presents big challenges for the research community. One such challenge is the efficient and effective search for sequence patterns, also known as motifs, among a set of related genetic sequences. Such patterns describe regions that may provide important insights about the structural and functional role of DNA and proteins. Two main classes can be considered: probabilistic patterns, which represent a model that simulates the sequences or part of the sequences under consideration, and deterministic patterns, which either match the input sequences or do not. In this chapter, a general overview of deterministic sequence mining over sets of genetic sequences is presented. The authors formulate an architecture that divides the mining process workflow into a set of blocks. Each of these blocks is discussed individually. © 2010, IGI Global.


2011

Time series motifs statistical significance

Authors
Castro, N; Azevedo, PJ;

Publication
Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011

Abstract
Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem within applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs. They are typically evaluated by human experts. This is infeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the bioinformatics and association rule mining communities to evaluate the extracted patterns. In this work we present an approach to calculate the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution enables the application of a powerful technique, statistical tests, to a time series setting. This provides researchers and practitioners with an important tool to evaluate automatically the degree of relevance of each extracted motif. Copyright © SIAM.
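The pipeline described in this abstract (expected frequency from a Markov chain, then a hypothesis test against the observed frequency) can be illustrated with a rough sketch. The abstract does not state the chain order or the test used, so both are assumptions here: a first-order chain fitted to the symbolic series itself, and a one-sided binomial test that treats window positions as independent, which is a simplification because overlapping windows are correlated. All function names are hypothetical.

```python
from collections import Counter
from math import comb

def markov_word_probability(sequence, word):
    """Probability of `word` under a first-order Markov chain fitted to `sequence`."""
    symbol_counts = Counter(sequence)
    pair_counts = Counter(zip(sequence, sequence[1:]))
    from_counts = Counter(sequence[:-1])   # symbols that have a successor
    prob = symbol_counts[word[0]] / len(sequence)
    for a, b in zip(word, word[1:]):
        if from_counts[a] == 0:
            return 0.0
        prob *= pair_counts[(a, b)] / from_counts[a]
    return prob

def count_occurrences(sequence, word):
    """Number of (possibly overlapping) occurrences of `word` in `sequence`."""
    m = len(word)
    return sum(1 for i in range(len(sequence) - m + 1) if sequence[i:i + m] == word)

def binomial_upper_tail(trials, p, observed):
    """P(X >= observed) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(observed, trials + 1))

def motif_significance(sequence, word):
    """Return (observed count, expected count, one-sided p-value)."""
    trials = len(sequence) - len(word) + 1
    p = markov_word_probability(sequence, word)
    observed = count_occurrences(sequence, word)
    return observed, trials * p, binomial_upper_tail(trials, p, observed)

# A toy symbolic series; real input would be a discretized time series.
observed, expected, p_value = motif_significance("abababab", "ab")
```

A small p-value would indicate that the motif occurs more often than the fitted chain predicts; in the toy series above the word "ab" is fully explained by the chain, so it is not significant.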


2010

Multiresolution motif discovery in time series

Authors
Castro, N; Azevedo, P;

Publication
Proceedings of the 10th SIAM International Conference on Data Mining, SDM 2010

Abstract
Time series motif discovery is an important problem with applications in a variety of areas that range from telecommunications to medicine. Several algorithms have been proposed to solve the problem. However, these algorithms rely heavily on expensive random disk accesses or assume the data can fit into main memory. They only consider motifs at a single resolution and are not suited to interactivity. In this work, we tackle the motif discovery problem as an approximate Top-K frequent subsequence discovery problem. We fully exploit the multiresolution capability of the state-of-the-art iSAX representation to obtain motifs at different resolutions. This property yields interactivity, allowing the user to navigate along the Top-K motifs structure. This permits a deeper understanding of the time series database. Further, we apply the Top-K Space-Saving algorithm to our frequent subsequences approach. A scalable algorithm is obtained that is suitable for data-stream-like applications where small-memory devices such as sensors are used. Our approach is scalable and disk-efficient since it needs only a single pass over the time series database. We provide empirical evidence of the validity of the algorithm on datasets from different areas that aim to represent practical applications. Copyright © by SIAM.
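The Space-Saving algorithm mentioned in this abstract keeps at most K counters while scanning a stream once, which is what makes a single pass with bounded memory possible. The sketch below shows the generic counter-update rule only; how the paper combines it with iSAX words is not described in the abstract, so the short strings used as stream items are merely stand-ins for symbolic subsequences, and the function name is hypothetical.

```python
def space_saving(stream, k):
    """Approximate Top-K frequent items using at most k counters (one pass)."""
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already monitored: just count it.
            counters[item] += 1
        elif len(counters) < k:
            # Free slot available: start monitoring the item.
            counters[item] = 1
        else:
            # No free slot: evict the item with the smallest count and let
            # the newcomer inherit that count plus one (an overestimate).
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)

# Stand-in stream of symbolic words, monitored with three counters.
top = space_saving(["a", "b", "a", "c", "a", "b", "d", "a"], k=3)
```

Counts are upper bounds on true frequencies, so items that entered by eviction (here "d") may be overestimated, while genuinely frequent items stay in the summary.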


1997

Magic sets with full sharing

Authors
Azevedo, PJ;

Publication
Journal of Logic Programming

Abstract
In this paper, we study the relationship between tabulation and goal-oriented bottom-up evaluation of logic programs. Differences emerge when one tries to identify features of one evaluation method in the other. We show that to obtain the same effect as tabulation in top-down evaluation, one has to perform a careful adornment of the programs to be evaluated bottom-up. Furthermore, we propose an efficient algorithm to perform forward subsumption checking over adorned magic facts. © Elsevier Science Inc., 1997.


2006

Mining approximate motifs in time series

Authors
Ferreira, PG; Azevedo, PJ; Silva, CG; Brito, RMM;

Publication
Discovery Science, Proceedings

Abstract
The problem of discovering previously unknown frequent patterns in time series, also called motifs, has been recently introduced. A motif is a subseries pattern that appears a significant number of times. Results demonstrate that motifs may provide valuable insights about the data and have a wide range of applications in data mining tasks. The main motivation for this study was the need to mine time series data from protein folding/unfolding simulations. We propose an algorithm that extracts approximate motifs, i.e. motifs that capture portions of time series with a similar and possibly symmetric behavior. Preliminary results on the analysis of protein unfolding data support this proposal as a valuable tool. Additional experiments demonstrate that the utility of our algorithm is not limited to this particular problem. Rather, it can be an interesting tool to apply to many real-world problems.
