2017
Authors
Kuang, Z; Peissig, PL; Costa, VS; Maclin, R; Page, D;
Publication
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017
Abstract
Several prominent public health incidents [29] that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness of governments and industries about pharmacovigilance (PhV) [6, 7], the science and activities to monitor and prevent adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) [6] such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Multiple Self-Controlled Case Series (MSCCS) model [27], arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times. We apply the proposed method as well as MSCCS to the Marshfield Clinic EHR. Experimental results suggest that incorporating the heterogeneity among different patients and different times helps to improve the performance in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth [26]. © 2017 Copyright held by the owner/author(s).
2013
Authors
Costa, VS; Vaz, D;
Publication
THEORY AND PRACTICE OF LOGIC PROGRAMMING
Abstract
The widespread availability of large data-sets poses both an opportunity and a challenge to logic programming. A first approach is to couple a relational database with logic programming, say, a Prolog system with MySQL. While this approach does pay off in cases where the data cannot reside in main memory, it is known to introduce substantial overheads. Ideally, we would like the Prolog system to deal with large data-sets in an efficient way both in terms of memory and of processing time. Just In Time Indexing (JITI) was mainly motivated by this challenge, and can work quite well in many applications. Exo-compilation, designed to deal with large tables, is a next step that achieves very interesting results, reducing the memory footprint by over two thirds. We show that combining exo-compilation with Just In Time Indexing can have significant advantages both in terms of memory usage and in terms of execution time. An alternative path that is relevant for many applications is User-Defined Indexing (UDI). This allows the use of specialized indexing for specific applications, say the spatial indexing crucial to any spatial system. The UDI sees indexing as pluggable modules, and can naturally be combined with exo-compilation. We do so by using UDI with exo-data, and incorporating ideas from the UDI into high-performance indexers for specific tasks.
2015
Authors
Schwartz, MP; Hou, ZG; Propson, NE; Zhang, J; Engstrom, CJ; Costa, VS; Jiang, P; Nguyen, BK; Bolin, JM; Daly, W; Wang, Y; Stewart, R; Page, CD; Murphy, WL; Thomson, JA;
Publication
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
Abstract
Human pluripotent stem cell-based in vitro models that reflect human physiology have the potential to reduce the number of drug failures in clinical trials and offer a cost-effective approach for assessing chemical safety. Here, human embryonic stem (ES) cell-derived neural progenitor cells, endothelial cells, mesenchymal stem cells, and microglia/macrophage precursors were combined on chemically defined polyethylene glycol hydrogels and cultured in serum-free medium to model cellular interactions within the developing brain. The precursors self-assembled into 3D neural constructs with diverse neuronal and glial populations, interconnected vascular networks, and ramified microglia. Replicate constructs were reproducible by RNA sequencing (RNA-Seq) and expressed neurogenesis, vasculature development, and microglia genes. Linear support vector machines were used to construct a predictive model from RNA-Seq data for 240 neural constructs treated with 34 toxic and 26 nontoxic chemicals. The predictive model was evaluated using two standard hold-out testing methods: a nearly unbiased leave-one-out cross-validation for the 60 training compounds and an unbiased blinded trial using a single hold-out set of 10 additional chemicals. The linear support vector machine produced an estimate for future data of 0.91 in the cross-validation experiment and correctly classified 9 of 10 chemicals in the blinded trial.
2014
Authors
Amaral, C; Florido, M; Costa, VS;
Publication
FUNCTIONAL AND LOGIC PROGRAMMING, FLOPS 2014
Abstract
We present PrologCheck, an automatic tool for property-based testing of programs in the logic programming language Prolog with randomised test data generation. The tool is inspired by the well-known QuickCheck, originally designed for the functional programming language Haskell. It includes features that deal with specific characteristics of Prolog such as its relational nature (as opposed to Haskell) and the absence of a strong type discipline. PrologCheck's expressiveness stems from describing properties as Prolog goals. It enables the definition of custom test data generators for random testing tailored for the property to be tested. Further, it allows the use of a predicate specification language that supports types, modes and constraints on the number of successful computations. We evaluate our tool on a number of examples and apply it successfully to debug a Prolog library for AVL search trees.
2014
Authors
Kuusisto, F; Costa, VS; Nassif, H; Burnside, E; Page, D; Shavlik, J;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
Machine learning is continually being applied to a growing set of fields, including the social sciences, business, and medicine. Some fields present problems that are not easily addressed using standard machine learning approaches and, in particular, there is growing interest in differential prediction. In this type of task we are interested in producing a classifier that specifically characterizes a subgroup of interest by maximizing the difference in predictive performance for some outcome between subgroups in a population. We discuss adapting maximum margin classifiers for differential prediction. We first introduce multiple approaches that do not affect the key properties of maximum margin classifiers, but which also do not directly attempt to optimize a standard measure of differential prediction. We next propose a model that directly optimizes a standard measure in this field, the uplift measure. We evaluate our models on real data from two medical applications and show excellent results. © 2014 Springer-Verlag.
2014
Authors
Peissig, PL; Costa, VS; Caldwell, MD; Rottscheit, C; Berg, RL; Mendonca, EA; Page, D;
Publication
JOURNAL OF BIOMEDICAL INFORMATICS
Abstract
Objective: Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping. Methods: Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance. Results: We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p = 0.039), J48 (p = 0.003) and JRIP (p = 0.003). Discussion: ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts. Conclusion: Relational learning using ILP offers a viable approach to EHR-driven phenotyping.