Publications by LIAAD

2022

Scalable transcriptomics analysis with Dask: applications in data science and machine learning

Authors
Moreno, M; Vilaca, R; Ferreira, PG;

Publication
BMC BIOINFORMATICS

Abstract
Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.

2022

A systematic evaluation of deep learning methods for the prediction of drug synergy in cancer

Authors
Baptista, D; Ferreira, PG; Rocha, M;

Publication

Abstract
One of the main obstacles to the successful treatment of cancer is the phenomenon of drug resistance. A common strategy to overcome resistance is the use of combination therapies. However, the space of possibilities is huge and efficient search strategies are required. Machine Learning (ML) can be a useful tool for the discovery of novel, clinically relevant anti-cancer drug combinations. In particular, deep learning (DL) has become a popular choice for modeling drug combination effects. Here, we set out to examine the impact of different methodological choices on the performance of multimodal DL-based drug synergy prediction methods, including the use of different input data types, preprocessing steps and model architectures. Focusing on the NCI ALMANAC dataset, we found that feature selection based on prior biological knowledge has a positive impact on performance. Drug features appeared to be more predictive of drug response. Molecular fingerprint-based drug representations performed slightly better than learned representations, and gene expression data of cancer or drug response-specific genes also improved performance. In general, fully connected feature-encoding subnetworks outperformed other architectures, with DL outperforming other ML methods. Using a state-of-the-art interpretability method, we showed that DL models can learn to associate drug and cell line features with drug response in a biologically meaningful way. The strategies explored in this study will help to improve the development of computational methods for the rational design of effective drug combinations for cancer therapy.

Author summary
Cancer therapies often fail because tumor cells become resistant to treatment. One way to overcome resistance is by treating patients with a combination of two or more drugs. Some combinations may be more effective than when considering individual drug effects, a phenomenon called drug synergy. Computational drug synergy prediction methods can help to identify new, clinically relevant drug combinations. In this study, we developed several deep learning models for drug synergy prediction. We examined the effect of using different types of deep learning architectures, and different ways of representing drugs and cancer cell lines. We explored the use of biological prior knowledge to select relevant cell line features, and also tested data-driven feature reduction methods. We tested both precomputed drug features and deep learning methods that can directly learn features from raw representations of molecules. We also evaluated whether including genomic features, in addition to gene expression data, improves the predictive performance of the models. Through these experiments, we were able to identify strategies that will help guide the development of new deep learning models for drug synergy prediction in the future.

2022

Multiscale partial information decomposition of dynamic processes with short and long-range correlations: theory and application to cardiovascular control

Authors
Pinto, H; Pernice, R; Silva, ME; Javorka, M; Faes, L; Rocha, AP;

Publication
PHYSIOLOGICAL MEASUREMENT

Abstract
Objective. In this work, an analytical framework for the multiscale analysis of multivariate Gaussian processes is presented, whereby the computation of Partial Information Decomposition measures is achieved accounting for the simultaneous presence of short-term dynamics and long-range correlations. Approach. We consider physiological time series mapping the activity of the cardiac, vascular and respiratory systems in the field of Network Physiology. In this context, the multiscale representation of transfer entropy within the network of interactions among systolic arterial pressure (S), respiration (R) and heart period (H), as well as the decomposition into unique, redundant and synergistic contributions, is obtained using a Vector AutoRegressive Fractionally Integrated (VARFI) framework for Gaussian processes. This novel approach makes it possible to quantify the directed information flow accounting for the simultaneous presence of short-term dynamics and long-range correlations among the analyzed processes. Additionally, it provides analytical expressions for the computation of the information measures, by exploiting the theory of state space models. The approach is first illustrated in simulated VARFI processes and then applied to H, S and R time series measured in healthy subjects monitored at rest and during mental and postural stress. Main Results. We demonstrate the ability of the VARFI modeling approach to account for the coexistence of short-term and long-range correlations in the study of multivariate processes. Physiologically, we show that postural stress induces larger redundant and synergistic effects from S and R to H at short time scales, while mental stress induces larger information transfer from S to H at longer time scales, thus evidencing the different nature of the two stressors. Significance. The proposed methodology makes it possible to extract useful information about the dependence of the information transfer on the balance between short-term and long-range correlations in coupled dynamical systems, which cannot be observed using standard methods that do not consider long-range correlations.

2022

Novel features for time series analysis: a complex networks approach

Authors
Silva, VF; Silva, ME; Ribeiro, P; Silva, F;

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Being able to capture the characteristics of a time series with a feature vector is a very important task with a multitude of applications, such as classification, clustering or forecasting. Usually, the features are obtained from linear and nonlinear time series measures, which may present several data-related drawbacks. In this work we introduce NetF as an alternative set of features, incorporating several representative topological measures of different complex network mappings of the time series. Our approach does not require data preprocessing and is applicable regardless of any data characteristics. Exploring our novel feature vector, we are able to connect mapped network features to properties inherent in diversified time series models, showing that NetF can be useful to characterize time data. Furthermore, we also demonstrate the applicability of our methodology in clustering synthetic and benchmark time series sets, comparing its performance with more conventional features, showcasing how NetF can achieve high-accuracy clusters. Our results are very promising, with network features from different mapping methods capturing different properties of the time series, adding a different and rich feature set to the literature.

2022

Censored Multivariate Linear Regression Model

Authors
Sousa, R; Pereira, I; Silva, ME;

Publication
RECENT DEVELOPMENTS IN STATISTICS AND DATA SCIENCE, SPE2021

Abstract
Often, real-life problems require modelling several response variables together. This work analyses a multivariate linear regression model when the data are censored. Censoring distorts the correlation structure of the underlying variables and increases the bias of the usual estimators. Thus, we propose three methods to deal with multivariate data under left censoring, namely Expectation Maximization (EM), Data Augmentation (DA) and Gibbs Sampler with Data Augmentation (GDA). Results from a simulation study show that both DA and GDA estimates are consistent for low and moderate correlation. Under high correlation scenarios, EM estimates present a lower bias.

2022

Statistical education and official statistics - training future data scientists

Authors
Silva, ME; Campos, P;

Publication
Proceedings of the IASE 2021 Satellite Conference

Abstract
EMOS (the European Master in Official Statistics) was set up to strengthen the collaboration between academia and producers of official statistics and to help develop professionals able to work with European official data at different levels in the fast-changing production system of the 21st century. In this paper we address the need for training in Official Statistics, particularly in current times, where new skill sets and competencies are necessary. In particular, the new data sources currently used by national statistical systems require the development of new methodologies. For that purpose, we match the needs of National Statistical Offices (NSOs) with what universities offer.
