2015
Authors
Rodrigues, P; Gama, J;
Publication
MATHEMATICS OF ENERGY AND CLIMATE CHANGE
Abstract
This paper discusses the problem of learning a global model from local information. We consider ubiquitous streaming data sources, such as sensor networks, and discuss efficient distributed learning algorithms. We present a generic framework for distributed data sources, an illustrative algorithm for monitoring the global state of the network using limited communication between peers, and an efficient distributed clustering algorithm.
2015
Authors
de Faria, ER; Goncalves, IR; Gama, J; de Leon Ferreira Carvalho, ACPDF;
Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract
Data stream mining is an emergent research area that investigates knowledge extraction from large amounts of continuously generated data, produced by non-stationary distributions. Novelty detection, the ability to identify new or previously unknown situations, is a useful ability for learning systems, especially when dealing with data streams, where concepts may appear, disappear, or evolve over time. There are several studies currently investigating the application of novelty detection techniques in data streams. However, there is no consensus regarding how to evaluate the performance of these techniques. In this study, we propose a new evaluation methodology for multiclass novelty detection in data streams that is able to deal with: i) unsupervised learning, which generates novelty patterns without an association with the true classes, where one class may be composed of a novelty set, ii) a confusion matrix that grows over time, iii) a confusion matrix with a column representing unknown examples, i.e., those not explained by the model, and iv) the representation of the evaluation measures over time. We propose a new methodology to associate the novelty patterns detected by the algorithm, in an unsupervised fashion, with the true classes. Finally, we evaluate the performance of the proposed methodology through the use of known novelty detection algorithms with artificial and real data sets.
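As an illustration of points ii) to iv) above, the following is a minimal sketch of a confusion matrix that grows as examples arrive and keeps a separate column for unknown examples. It is not the methodology proposed in the paper: the class name `StreamingConfusionMatrix`, the `UNKNOWN` label, and the two measures are hypothetical, and the sketch assumes that detected novelty patterns have already been associated with true classes.

```python
from collections import defaultdict

UNKNOWN = "unknown"  # hypothetical column label for examples the model does not explain


class StreamingConfusionMatrix:
    """Confusion matrix updated one example at a time (illustrative only)."""

    def __init__(self):
        # Rows are true classes; columns are predicted labels, possibly UNKNOWN.
        self.counts = defaultdict(lambda: defaultdict(int))
        # Unknown rate recorded after every update, so it can be plotted over time.
        self.history = []

    def update(self, true_label, predicted_label=None):
        # A prediction of None means the model could not explain the example.
        column = UNKNOWN if predicted_label is None else predicted_label
        self.counts[true_label][column] += 1
        self.history.append(self.unknown_rate())

    def total(self):
        return sum(sum(row.values()) for row in self.counts.values())

    def unknown_rate(self):
        total = self.total()
        if total == 0:
            return 0.0
        unknown = sum(row.get(UNKNOWN, 0) for row in self.counts.values())
        return unknown / total

    def error_rate(self):
        # Misclassification rate computed over explained examples only.
        explained = 0
        wrong = 0
        for true_label, row in self.counts.items():
            for column, n in row.items():
                if column == UNKNOWN:
                    continue
                explained += n
                if column != true_label:
                    wrong += n
        return wrong / explained if explained else 0.0
```

Keeping unknown examples in their own column, rather than counting them as errors, lets the two measures be tracked separately as the stream evolves.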
2015
Authors
Duarte, J; Gama, J;
Publication
PROCEEDINGS OF THE 2015 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (IEEE DSAA 2015)
Abstract
Many real-life prediction problems involve predicting a structured output. Multi-target regression is an instance of structured output prediction whose task is to predict multiple target variables. Structured output algorithms are usually computationally and memory demanding, and hence are not suited to dealing with massive amounts of data. Most of these algorithms can be categorized as local or global methods. Local methods produce individual models for each output component and combine them to produce the structured prediction. Global methods adapt traditional learning algorithms to predict the output structure as a whole. We propose the first rule-based algorithm for solving multi-target regression problems from data streams. The algorithm builds on the adaptive model rules framework. In contrast to the majority of structured output predictors, this particular algorithm does not fall into either the local or the global category. Instead, each rule specializes on related subsets of the output attributes. To evaluate the performance of the proposed algorithm, two other rule-based algorithms were developed, one using the local strategy and the other using the global strategy. These methods were compared considering their prediction error, memory usage, computational time, and model complexity. Experimental results on synthetic and real data show that the local-strategy algorithm usually obtains the lowest error. However, the proposed and the global-strategy algorithms use much less memory and run significantly faster at the cost of a slight increase in error, which makes them very attractive when computational resources are an important factor. Also, the models produced by the latter approaches are much easier to understand, since considerably fewer rules are produced.
2015
Authors
Ferreira, P; Fonseca, NA; Dutra, I; Woods, R; Burnside, E;
Publication
INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS
Abstract
The main goal of this work is to produce machine learning models that predict the outcome of a mammography from a reduced set of annotated mammography findings. In the study we used a dataset consisting of 348 consecutive breast masses that underwent image-guided core biopsy performed between October 2005 and December 2007 on 328 female subjects. We applied various algorithms with parameter variation to learn from the data. The tasks were to predict mass density and to predict malignancy. The best classifier that predicts mass density is based on a support vector machine and has an accuracy of 81.3%. The expert correctly annotated 70% of the mass densities. The best classifier that predicts malignancy is also based on a support vector machine and has an accuracy of 85.6%, with a positive predictive value of 85%. One important contribution of this work is that our model can predict malignancy in the absence of the mass density attribute, since we can fill in this attribute using our mass density predictor.
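For readers unfamiliar with the two measures reported above, the sketch below shows how accuracy and positive predictive value are derived from the four cells of a binary confusion matrix. The function names and the counts in the usage note are hypothetical and are not taken from the study.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases that were classified correctly."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no cases to evaluate")
    return (tp + tn) / total


def positive_predictive_value(tp, fp):
    """Fraction of cases predicted positive (e.g. malignant) that are truly positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        raise ValueError("no positive predictions to evaluate")
    return tp / predicted_positive
```

With made-up counts of 40 true positives, 10 false positives, 45 true negatives and 5 false negatives, `accuracy(40, 10, 45, 5)` is 85/100 and `positive_predictive_value(40, 10)` is 40/50.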
2015
Authors
Frankish, A; Uszczynska, B; Ritchie, GRS; Gonzalez, JM; Pervouchine, D; Petryszak, R; Mudge, JM; Fonseca, N; Brazma, A; Guigo, R; Harrow, J;
Publication
BMC GENOMICS
Abstract
Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
2015
Authors
Aguiar, B; Vieira, J; Cunha, AE; Fonseca, NA; Iezzoni, A; van Nocker, S; Vieira, CP;
Publication
PLOS ONE
Abstract
S-RNase-based gametophytic self-incompatibility (GSI) has evolved once before the split of the Asteridae and Rosidae. This conclusion is based on the phylogenetic history of the S-RNase that determines pistil specificity. In Rosaceae, molecular characterizations of Prunus species, and species from the tribe Pyreae (i.e., Malus, Pyrus, Sorbus) revealed different numbers of genes determining S-pollen specificity. In Prunus, only one pistil gene and one pollen gene determine GSI, while in Pyreae there is one pistil gene but multiple pollen genes, implying different specificity recognition mechanisms. It is thus conceivable that within Rosaceae the genes involved in GSI in the two lineages are not orthologous but possibly paralogous. To address this hypothesis we characterised the S-RNase lineage and S-pollen lineage genes present in the genomes of five Rosaceae species from three genera: M. x domestica (apple, self-incompatible (SI); tribe Pyreae), P. persica (peach, self-compatible (SC); Amygdaleae), P. mume (mei, SI; Amygdaleae), Fragaria vesca (strawberry, SC; Potentilleae), and F. nipponica (mori-ichigo, SI; Potentilleae). Phylogenetic analyses revealed that the Malus and Prunus S-RNase and S-pollen genes belong to distinct gene lineages, and that only Prunus S-RNase and SFB-lineage genes are present in Fragaria. Thus, the S-RNase-based GSI system of Malus evolved independently from the ancestral system of Rosaceae. Using expression patterns based on RNA-seq data, the ancestral S-RNase lineage gene is inferred to be expressed in pistils only, while the ancestral S-pollen lineage gene is inferred to be expressed in tissues other than pollen.