Publications

Publications by João Paiva Cardoso

2020

Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey

Authors
Paulin, N; Ferreira, JC; Cardoso, JMP;

Publication
ACM COMPUTING SURVEYS

Abstract
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Simultaneously, Instruction Level Parallelism in applications is considerably underexplored. This article reviews notable approaches that focus on exploiting this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans over more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6x to 5.6x are reported, mostly considering bare-metal embedded applications, along with power consumption reductions between 1.3x and 3.9x. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.

CloseRead Abstract

2020

Exploration of FPGA-Based Hardware Designs for QR Decomposition for Solving Stiff ODE Numerical Methods Using the HARP Hybrid Architecture

Authors
de Souza, CAO; Bispo, J; Cardoso, JMP; Diniz, PC; Marques, E;

Publication
ELECTRONICS

Abstract
In this article, we focus on the acceleration of a chemical reaction simulation that relies on a system of stiff ordinary differential equation (ODEs) targeting heterogeneous computing systems with CPUs and field-programmable gate arrays (FPGAs). Specifically, we target an essential kernel of the coupled chemistry aerosol-tracer transport model to the Brazilian developments on the regional atmospheric modeling system (CCATT-BRAMS). We focus on a linear solve step using the QR factorization based on the modified Gram-Schmidt method as the basis of the ODE solver in this application. We target Intel hardware accelerator research program (HARP) architecture with the OpenCL programming environment for these early experiments. Our design exploration reveals a hardware design that is up to 4 times faster than the original iterative Jacobi method used in this solver. Still, even with hardware support, the overall performance of our QR-based hardware is lower than its original software version.

CloseRead Abstract

2020

Optimizing OpenCL Code for Performance on FPGA: k-Means Case Study With Integer Data Sets

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
IEEE ACCESS

Abstract
High Level Synthesis (HLS) tools targeting Field Programmable Gate Arrays (FPGAs) aim to provide a method for programming these devices via high-level abstractions. Initially, HLS support for FPGAs focused on compiling C/C CC to hardware circuits. This raised the issue of determining the programming practices which resulted in the best performing circuits. Recently, to further increase the applicability of HLS approaches, renewed effort was placed on support for HLS of OpenCL code for FPGA, raising the same issues of coding practices and performance portability. This paper explores the performance of OpenCL code compiled for FPGAs for different coding techniques. We evaluate the use of task-kernels versus NDRange kernels, data vectorization, the use of on-chip local memories, and data transfer optimizations by exploiting burst access inference. We present this exploration via a case study of the k-means algorithm, and produce a total of 10 OpenCL implementations of the kernel. To determine the effects of different data set characteristics, and to determine the gains from specialization based on number of attributes, we generated a total of 12 integer data sets. The data sets vary regarding the number of instances, number of attributes (i.e., features), and number of clusters. We also vary the number of processing cores, and present the resulting required resources and operating frequencies. Finally, we execute the same OpenCL code on a 4 GHz Intel i7-6700K CPU, showing that the FPGA achieves speedups up to 1.54 x for four cases, and energy savings up to 80% in all cases.

CloseRead Abstract

2020

Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
2020 30TH INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL)

Abstract
Performance and power efficiency in edge and embedded systems can benefit from specialized hardware. To avoid the effort of manual hardware design, we explore the generation of accelerator circuits from binary instruction traces for several Instruction Set Architectures.

CloseRead Abstract

2020

kNN Prototyping Schemes for Embedded Human Activity Recognition with Online Learning

Authors
Ferreira, PJS; Cardoso, JMP; Moreira, JM;

Publication
Comput.

Abstract
The kNN machine learning method is widely used as a classifier in Human Activity Recognition (HAR) systems. Although the kNN algorithm works similarly both online and in offline mode, the use of all training instances is much more critical online than offline due to time and memory restrictions in the online mode. Some methods propose decreasing the high computational costs of kNN by focusing, e.g., on approximate kNN solutions such as the ones relying on Locality-Sensitive Hashing (LSH). However, embedded kNN implementations also need to address the target device’s memory constraints, especially as the use of online classification needs to cope with those constraints to be practical. This paper discusses online approaches to reduce the number of training instances stored in the kNN search space. To address practical implementations of HAR systems using kNN, this paper presents simple, energy/computationally efficient, and real-time feasible schemes to maintain at runtime a maximum number of training instances stored by kNN. The proposed schemes include policies for substituting the training instances, maintaining the search space to a maximum size. Experiments in the context of HAR datasets show the efficiency of our best schemes. © 2020 by the authors. Licensee MDPI, Basel, Switzerland.

CloseRead Abstract

2022

Pegasus: Performance Engineering for Software Applications Targeting HPC Systems

Authors
Pinto, P; Bispo, J; Cardoso, J; Barbosa, JG; Gadioli, D; Palermo, G; Martinovic, J; Golasowski, M; Slaninova, K; Cmar, R; Silvano, C;

Publication
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING

Abstract
Developing and optimizing software applications for high performance and energy efficiency is a very challenging task, even when considering a single target machine. For instance, optimizing for multicore-based computing systems requires in-depth knowledge about programming languages, application programming interfaces (APIs), compilers, performance tuning tools, and computer architecture and organization. Many of the tasks of performance engineering methodologies require manual efforts and the use of different tools not always part of an integrated toolchain. This paper presents Pegasus, a performance engineering approach supported by a framework that consists of a source-to-source compiler, controlled and guided by strategies programmed in a Domain-Specific Language, and an autotuner. Pegasus is a holistic and versatile approach spanning various decision layers composing the software stack, and exploiting the system capabilities and workloads effectively through the use of runtime autotuning. The Pegasus approach helps developers by automating tasks regarding the efficient implementation of software applications in multicore computing systems. These tasks focus on application analysis, profiling, code transformations, and the integration of runtime autotuning. Pegasus allows developers to program their strategies or to automatically apply existing strategies to software applications in order to ensure the compliance of non-functional requirements, such as performance and energy efficiency. We show how to apply Pegasus and demonstrate its applicability and effectiveness in a complex case study, which includes tasks from a smart navigation system.

CloseRead Abstract