2022
Authors
Dantas, M; Leitao, D; Cui, P; Macedo, R; Liu, XL; Xu, WJ; Paulo, J;
Publication
2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022)
Abstract
We present MONARCH, a framework-agnostic storage middleware that transparently employs storage tiering to accelerate Deep Learning (DL) training. It leverages the existing storage tiers of modern supercomputers (i.e., the compute nodes' local storage and the shared parallel file system (PFS)), while considering the I/O patterns of DL frameworks to improve data placement across tiers. MONARCH aims at accelerating DL training and decreasing the I/O pressure imposed on the PFS. We apply MONARCH to TensorFlow and PyTorch, while validating its performance and applicability under different models and dataset sizes. Results show that, even when the training dataset can only be partially stored in local storage, MONARCH reduces TensorFlow's and PyTorch's training time by up to 28% and 37% for I/O-intensive models, respectively. Furthermore, MONARCH decreases the number of I/O operations submitted to the PFS by up to 56%.
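The tiering idea in this abstract can be illustrated with a minimal sketch. It is not MONARCH's implementation: the class name `TieredReader`, the first-come placement policy, and the byte-capacity parameter are all assumptions made for illustration. The sketch serves a file from a local directory when a copy exists there, and otherwise reads it from a directory standing in for the PFS, keeping a local copy only while capacity remains.

```python
import os
import tempfile


class TieredReader:
    """Hypothetical two-tier reader: local storage in front of a PFS directory."""

    def __init__(self, pfs_dir, local_dir, local_capacity_bytes):
        self.pfs_dir = pfs_dir
        self.local_dir = local_dir
        self.free_bytes = local_capacity_bytes  # remaining local-tier space
        self.pfs_reads = 0    # reads that reached the (simulated) PFS
        self.local_reads = 0  # reads served from the local tier

    def read(self, name):
        local_path = os.path.join(self.local_dir, name)
        if os.path.exists(local_path):
            # Local-tier hit: the PFS is not touched.
            self.local_reads += 1
            with open(local_path, "rb") as f:
                return f.read()

        # Miss: fetch from the PFS tier.
        with open(os.path.join(self.pfs_dir, name), "rb") as f:
            data = f.read()
        self.pfs_reads += 1

        # Assumed placement policy: keep a copy locally only if it still fits.
        if len(data) <= self.free_bytes:
            with open(local_path, "wb") as f:
                f.write(data)
            self.free_bytes -= len(data)
        return data


def demo():
    """Read a 4-file dataset for 2 epochs with room for only part of it locally."""
    with tempfile.TemporaryDirectory() as pfs, tempfile.TemporaryDirectory() as local:
        for i in range(4):
            with open(os.path.join(pfs, f"sample{i}.bin"), "wb") as f:
                f.write(bytes(100))  # 100-byte dummy sample
        reader = TieredReader(pfs, local, local_capacity_bytes=250)
        for _epoch in range(2):
            for i in range(4):
                reader.read(f"sample{i}.bin")
        return reader.pfs_reads, reader.local_reads
```

In this toy run only two of the four files fit in the local tier, so the second epoch still sends two reads to the PFS directory. The figures reported in the abstract come from the authors' evaluation, not from a sketch like this one.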
2022
Authors
Macedo, R; Miranda, M; Tanimura, Y; Haga, J; Ruhela, A; Harrell, SL; Evans, RT; Paulo, J;
Publication
2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022)
Abstract
Modern large-scale I/O applications that run on HPC infrastructures are increasingly becoming metadata-intensive. Unfortunately, having multiple concurrent applications submitting massive amounts of metadata operations can easily saturate the shared parallel file system's metadata resources, leading to unresponsiveness of the storage backend and overall performance degradation. To address these challenges, we present PADLL, a storage middleware that enables system administrators to proactively control and ensure QoS over metadata workflows in HPC storage systems. We demonstrate its performance and feasibility by controlling the rate of both synthetic and realistic I/O workloads. Results show that PADLL can dynamically control metadata-aggressive workloads, prevent I/O burstiness, and ensure I/O fairness and prioritization.
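The abstract does not say which mechanism PADLL uses to control request rates. As a generic illustration of rate-limiting metadata operations, the sketch below uses a token bucket, a common technique chosen here as an assumption. The class `TokenBucket` and its parameters are hypothetical, and the clock is passed in explicitly so that the behaviour is deterministic.

```python
class TokenBucket:
    """Hypothetical token-bucket limiter for metadata operations (e.g. open, stat)."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second (sustained ops/s)
        self.capacity = capacity  # maximum burst size, in tokens
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of the last refill

    def try_acquire(self, now, cost=1.0):
        """Return True if an operation may proceed at time `now`, else False."""
        # Refill in proportion to the elapsed time, capped at the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: allow 2 ops/s with a burst of at most 2 operations.
bucket = TokenBucket(rate=2.0, capacity=2.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(3)]  # third call is throttled
later = bucket.try_acquire(now=0.5)                      # one token has been refilled
```

A caller that receives False would typically queue or delay the operation rather than drop it. Giving each job its own bucket with a different rate is one simple way to express prioritization between jobs.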
2022
Authors
Pereira, P; Fernandes, JP; Cunha, J;
Publication
2022 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2022, Rome, Italy, September 12-16, 2022
Abstract
Data collection is pervasively bound to our digital lifestyle. A recent study reports that the volume of data created and replicated in 2020 grew even faster than in previous years, reaching an astonishing global total of 64.2 zettabytes. Numerous companies offer services or products that rely heavily on data analysis, and mining the produced data has already revealed great value for businesses in different sectors. To support the professionals who do this job, typically known as data scientists, we first need to characterize them. To contribute towards this characterization, we conducted a public survey, and in this work we present the results on one particular aspect of their life: the tools they use and need.
2022
Authors
Moreno, M; Vilaca, R; Ferreira, PG;
Publication
BMC BIOINFORMATICS
Abstract
Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
2022
Authors
Alves, J; Soares, B; Brito, C; Sousa, A;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022
Abstract
Healthcare environments are generating a deluge of sensitive data. Nonetheless, dealing with large amounts of data is an expensive task, and current solutions resort to the cloud environment. Additionally, the intersection of the cloud environment and healthcare data opens new challenges regarding data privacy. With this in mind, we propose MEDCLOUDCARE (MCC), a healthcare application offering medical image viewing and processing tools while integrating cloud computing and AI. Moreover, MCC provides security and privacy features, scalability and high availability. The system is intended for two user groups: health professionals and researchers. The former can remotely view, process and share medical imaging information in the DICOM format. They can also use pre-trained Machine Learning (ML) models to aid the analysis of medical images. The latter can remotely add, share, and deploy ML models to perform inference on DICOM images. MCC incorporates a DICOM web viewer enabling users to view and process DICOM studies, which they can also upload and store. Regarding the security and privacy of the data, all sensitive information is encrypted at rest and in transit. Furthermore, MCC is intended for cloud environments. Thus, the system is deployed using Kubernetes, increasing the efficiency, availability and scalability of the ML inference process.
2022
Authors
Costa, L; Ribeiro, AN;
Publication
INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS, ISDA 2021
Abstract
The process of migrating from a monolithic to a microservices-based architecture is currently described as a form of modernizing applications. The core principles of microservices, which mostly reside in achieving loose coupling between the services, depend heavily on the implementation approaches used. Because microservices represent a complete change of paradigm that contrasts with the traditional way of developing software, the current lack of established principles often results in implementations that conflict with their alleged benefits. Given their distributed nature, performance is affected, but specific implementation patterns can further impact it. This paper aims to address the impact that microservices-based solutions, featuring different implementation patterns, have on performance and how that performance compares with that of monolithic applications. To do so, benchmarks are conducted over one application developed following a traditional monolithic approach, and two equivalent microservices-based implementations featuring distinct inter-service communication mechanisms and data management methodologies.