Publications - INESC TEC

Publications

Publications by LIAAD

2011

Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

Authors
Cejudo, JMC; García, MB; Bueno, RM; Gama, J; Bifet, A;

Publication
Proceedings of the Second Workshop on Applications of Pattern Analysis, WAPA 2011, Castro Urdiales, Spain, October 19-21, 2011


2011

Learning from medical data streams: An introduction

Authors
Rodrigues, PP; Pechenizkiy, M; Gaber, MM; Gama, J;

Publication
CEUR Workshop Proceedings

Abstract
Clinical practice and research are facing a new challenge created by the rapid growth of health information science and technology, and the complexity and volume of biomedical data. Machine learning from medical data streams is a recent area of research that aims to provide better knowledge extraction and evidence-based clinical decision support in scenarios where data are produced as a continuous flow. This year's edition of AIME, the Conference on Artificial Intelligence in Medicine, enabled the sound discussion of this area of research, mainly by the inclusion of a dedicated workshop. This paper is an introduction to LEMEDS, the Learning from Medical Data Streams workshop, which highlights the contributed papers, the invited talk and expert panel discussion, as well as related papers accepted to the main conference.


2011

Learning decision rules from data streams

Authors
Gama, J; Kosina, P;

Publication
IJCAI International Joint Conference on Artificial Intelligence

Abstract
Decision rules, which can provide good interpretability and flexibility for data mining tasks, have received very little attention in the stream mining community so far. In this work we introduce a new algorithm to learn rule sets, designed for open-ended data streams. The proposed algorithm is able to continuously learn compact ordered and unordered rule sets. The experimental evaluation shows competitive results in comparison with VFDT and C4.5rules.


2011

Speeding up Hoeffding-based regression trees with options

Authors
Ikonomovska, E; Gama, J; Zenko, B; Dzeroski, S;

Publication
Proceedings of the 28th International Conference on Machine Learning, ICML 2011

Abstract
Data streams are ubiquitous and have in the last two decades become an important research topic. For their predictive non-parametric analysis, Hoeffding-based trees are often a method of choice, offering a possibility of any-time predictions. However, one of their main problems is the delay in learning progress due to the existence of equally discriminative attributes. Options are a natural way to deal with this problem. Option trees build upon regular trees by adding splitting options in the internal nodes. As such they are known to improve accuracy and stability and to reduce ambiguity. In this paper, we present on-line option trees for faster learning on numerical data streams. Our results show that options improve the any-time performance of ordinary on-line regression trees, while preserving the interpretable structure of trees and without significantly increasing the computational complexity of the algorithm. Copyright 2011 by the author(s)/owner(s).
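
The delay caused by equally discriminative attributes can be illustrated with the generic Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below is not the authors' algorithm; the function names and the parameter values (range R = 1, delta = 1e-7) are assumptions chosen for illustration. It shows how the number of examples needed before a split can be decided grows as the merit gap between the two best attributes shrinks.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Generic Hoeffding bound on the deviation of a sample mean after n examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, gap: float) -> int:
    """Smallest n for which the bound drops strictly below the observed merit gap."""
    threshold = value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2)
    return math.floor(threshold) + 1


if __name__ == "__main__":
    # Hypothetical merit gaps between the best and second-best split candidates.
    for gap in (0.1, 0.01, 0.001):
        print(gap, examples_needed(1.0, 1e-7, gap))
```

Dividing the gap by ten multiplies the required number of examples by roughly one hundred, which is the kind of stall that splitting options are meant to avoid.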


2011

Assemblathon 1: A competitive assessment of de novo short read assembly methods

Authors
Earl, D; Bradnam, K; St John, J; Darling, A; Lin, DW; Fass, J; Hung, OKY; Buffalo, V; Zerbino, DR; Diekhans, M; Nguyen, N; Ariyaratne, PN; Sung, WK; Ning, ZM; Haimel, M; Simpson, JT; Fonseca, NA; Birol, I; Docking, TR; Ho, IY; Rokhsar, DS; Chikhi, R; Lavenier, D; Chapuis, G; Naquin, D; Maillet, N; Schatz, MC; Kelley, DR; Phillippy, AM; Koren, S; Yang, SP; Wu, W; Chou, WC; Srivastava, A; Shaw, TI; Ruby, JG; Skewes Cox, P; Betegon, M; Dimon, MT; Solovyev, V; Seledtsov, I; Kosarev, P; Vorobyev, D; Ramirez Gonzalez, R; Leggett, R; MacLean, D; Xia, FF; Luo, RB; Li, ZY; Xie, YL; Liu, BH; Gnerre, S; MacCallum, I; Przybylski, D; Ribeiro, FJ; Yin, SY; Sharpe, T; Hall, G; Kersey, PJ; Durbin, R; Jackman, SD; Chapman, JA; Huang, XQ; DeRisi, JL; Caccamo, M; Li, YR; Jaffe, DB; Green, RE; Haussler, D; Korf, I; Paten, B;

Publication
Genome Research

Abstract
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.


2011

Amino acid pair- and triplet-wise groupings in the interior of alpha-helical segments in proteins

Authors
de Sousa, MM; Munteanu, CR; Pazos, A; Fonseca, NA; Camacho, R; Magalhaes, AL;

Publication
Journal of Theoretical Biology

Abstract
A statistical approach has been applied to analyse primary structure patterns at inner positions of alpha-helices in proteins. A systematic survey was carried out in a recent sample of non-redundant proteins selected from the Protein Data Bank, which were used to analyse alpha-helix structures for amino acid pairing patterns. Only residues more than three positions apart from both termini of the alpha-helix were considered as inner. Amino acid pairings i, i+k (k = 1, 2, 3, 4, 5) were analysed and the corresponding 20 x 20 matrices of relative global propensities were constructed. An analysis of (i, i+4, i+8) and (i, i+3, i+4) triplet patterns was also performed. These analyses yielded information on a series of amino acid patterns (pairings and triplets) showing either high or low preference for alpha-helical motifs and suggested a novel approach to protein alphabet reduction. In addition, it has been shown that the individual amino acid propensities are not enough to define the statistical distribution of these patterns. Global pair propensities also depend on the type of pattern, its composition and orientation in the protein sequence. The data presented should prove useful for obtaining and refining predictive rules which can further the development and fine-tuning of protein structure prediction algorithms and tools.
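
The i, i+k pair counting described in the abstract can be sketched as follows. The abstract does not give the propensity formula, so the observed/expected ratio used here is an assumed, commonly used definition and not necessarily the authors'. The function names, the reading of "inner" as dropping four residues from each end, and the requirement that both residues of a pair be inner are also assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def inner_segment(helix: str, margin: int = 4) -> str:
    """Keep residues more than three positions from both termini (assumed reading)."""
    return helix[margin:len(helix) - margin] if len(helix) > 2 * margin else ""


def pair_propensities(helices, k):
    """Ordered (i, i+k) pair propensities as observed / expected counts.

    Assumed definition: expected = total_pairs * f(a) * f(b), where f is the
    single-residue frequency at inner positions.
    """
    singles = Counter()
    pairs = Counter()
    for helix in helices:
        inner = inner_segment(helix)
        singles.update(inner)
        for i in range(len(inner) - k):
            pairs[(inner[i], inner[i + k])] += 1

    total_singles = sum(singles.values())
    total_pairs = sum(pairs.values())
    propensities = {}
    if total_singles == 0:
        return propensities
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            expected = (total_pairs
                        * singles[a] / total_singles
                        * singles[b] / total_singles)
            if expected > 0:  # skip residues never seen at inner positions
                propensities[(a, b)] = pairs[(a, b)] / expected
    return propensities


if __name__ == "__main__":
    # Toy sequence, not real data: alternating A/E flanked by glycines.
    print(pair_propensities(["GGGGAEAEAEAEGGGG"], k=2))
```

Keeping the pairs ordered preserves the orientation dependence mentioned in the abstract, since (a, b) and (b, a) are counted separately.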
