Publications

Publications by João Paiva Cardoso

2016

SSA-based MATLAB-to-C compilation and optimization

Authors
Reis, L; Bispo, J; Cardoso, JMP;

Publication
Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY@PLDI 2016, Santa Barbara, CA, USA, June 14, 2016

Abstract
Many fields of engineering, science and finance use models that are developed and validated in high-level languages such as MATLAB. However, when moving to environments with resource constraints or portability challenges, these models often have to be rewritten in lower-level languages such as C. Doing so manually is costly and error-prone, but automated approaches tend to generate code that can be substantially less efficient than the handwritten equivalents. Additionally, it is usually difficult to read and improve code generated by these tools. In this paper, we describe how we improved our MATLAB-to-C compiler, based on the MATISSE framework, to be able to compete with handwritten C code. We describe our new IR and the most important optimizations that we use in order to obtain acceptable performance. We also analyze multiple C code versions to identify where the generated code is slower than the handwritten code and identify a few key improvements to generate code capable of outperforming handwritten C. We evaluate the new version of our compiler using a set of benchmarks, including the Disparity benchmark, from the San Diego Vision Benchmark Suite, on a desktop computer and on an embedded device. The achieved results clearly show the efficiency of the current version of the compiler. Copyright is held by the owner/author(s). Publication rights licensed to ACM.
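To make the discussion concrete, the sketch below is a hypothetical example (not actual MATISSE output) of C code that a MATLAB-to-C compiler could emit for the MATLAB expression y = a .* x + b once the two element-wise operations are fused into a single loop; an SSA-based IR, in which every value has exactly one definition, makes it straightforward to see that the intermediate result of a .* x can live in a scalar instead of a temporary array.

/* Illustrative sketch only: possible C output for the MATLAB
 * expression y = a .* x + b after loop fusion. Function and
 * parameter names are hypothetical. */
#include <stddef.h>

void elementwise_axpb(const double *a, const double *x, const double *b,
                      double *y, size_t n)
{
    /* A naive translation would materialize a temporary array for
     * (a .* x) and then run a second loop for the addition; fusing
     * both operations keeps the computation in one pass over the data. */
    for (size_t i = 0; i < n; i++) {
        y[i] = a[i] * x[i] + b[i];
    }
}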

2016

Towards a Multi-softcore FPGA Approach for the HOG Algorithm

Authors
Mascagni de Holanda, JAM; Paiva Cardoso, JMP; Marques, E;

Publication
2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN)

Abstract
Object detection in images is a computationally demanding task that usually needs to deal with the detection of different classes of objects, and thus requires variations and adaptations that are easily provided by software solutions. Object detection algorithms are becoming part of smart real-time embedded systems, such as automotive, medical, robotics, and security systems. In most embedded systems, efficient implementations of object detection algorithms need to provide high performance, low power consumption, and programmability to allow greater development flexibility. The Histogram of Oriented Gradients (HOG) is one of the most widely used algorithms for object detection in images. In this paper, we show our work towards mapping the HOG algorithm to an FPGA-based system consisting of multiple Nios II softcore processors, bearing in mind high-performance and programmability issues. We show how to reduce the algorithm's execution time by 19x through source-to-source transformations and, especially, by avoiding redundant processing. Furthermore, we show how pipelined processing across three Nios II processors achieves a speedup of 49x over the baseline embedded application.
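For context, the computational core of HOG builds per-cell histograms of gradient orientations weighted by gradient magnitude. The sketch below is a minimal, hypothetical C version of that step (not the paper's implementation); the 8x8-pixel cell, the 9-bin unsigned-orientation histogram, and the row-major 8-bit grayscale layout are assumptions. Because each cell is shared by several overlapping blocks and detection windows, computing its histogram only once is one example of the redundant processing that can be avoided.

/* Hypothetical sketch of the per-cell orientation-histogram step of HOG.
 * Assumes (x0, y0) lies at least one pixel inside the image so the
 * central differences below stay in bounds. Link with -lm. */
#include <math.h>

#define NBINS 9          /* assumed: 9 unsigned-orientation bins */
#define CELL  8          /* assumed: 8x8-pixel cells */

void hog_cell_histogram(const unsigned char *img, int width,
                        int x0, int y0, float hist[NBINS])
{
    for (int b = 0; b < NBINS; b++)
        hist[b] = 0.0f;

    for (int y = y0; y < y0 + CELL; y++) {
        for (int x = x0; x < x0 + CELL; x++) {
            /* Central-difference gradients. */
            float dx = (float)img[y * width + (x + 1)] -
                       (float)img[y * width + (x - 1)];
            float dy = (float)img[(y + 1) * width + x] -
                       (float)img[(y - 1) * width + x];
            float mag = sqrtf(dx * dx + dy * dy);

            /* Unsigned orientation folded into [0, 180) degrees. */
            float ang = atan2f(dy, dx) * (180.0f / 3.14159265f);
            if (ang < 0.0f)
                ang += 180.0f;

            int bin = (int)(ang / (180.0f / NBINS)) % NBINS;
            hist[bin] += mag;   /* magnitude-weighted vote */
        }
    }
}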

2015

Transparent Acceleration of Program Execution Using Reconfigurable Hardware

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)

Abstract
The acceleration of applications running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware, is an approach that does not involve the program's source code and still ensures program portability across different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as the GPP, and the very encouraging results achieved with a number of benchmarks.
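As a hypothetical illustration (not taken from the paper), the C kernel below shows the kind of small, frequently executed loop whose instruction trace forms a repeating pattern that trace-based binary acceleration can detect and migrate to an RPU; because detection operates on the executed GPP instructions, source code like this is never required by the accelerator.

/* Illustrative only: a loop whose traced instruction sequence repeats
 * every iteration (loads, multiply-accumulate, index update, branch),
 * making it a natural candidate for migration to a reconfigurable
 * processing unit and for loop pipelining. */
#include <stdint.h>

int32_t dot_product(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}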

2013

Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units

Authors
Bispo, J; Paulino, N; Cardoso, JMP; Ferreira, JC;

Publication
International Journal of Reconfigurable Computing

Abstract
The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces, called Megablocks, to RPUs. The proposed approach considers offline partitioning and mapping stages without ignoring their future runtime applicability. We present a toolchain that automatically extracts Megablocks from MicroBlaze instruction traces and generates an RPU for executing those loops. Our hardware infrastructure is able to move loop execution from the microprocessor to the RPU transparently, at runtime, and without changing the executable binaries. The toolchain and the system are fully operational. Three FPGA implementations of the system, differing in the hardware interfaces used, were tested and evaluated with a set of 15 application kernels. Speedups ranging from 1.26x to 3.69x were achieved for the best alternative using a MicroBlaze processor with local memory. © 2013 João Bispo et al.

2013

Transparent Trace-Based Binary Acceleration for Reconfigurable HW/SW Systems

Authors
Bispo, J; Paulino, N; Cardoso, JMP; Ferreira, JC;

Publication
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS

Abstract
This paper presents a novel approach to accelerate program execution by mapping repetitive traces of executed instructions, called Megablocks, to a runtime reconfigurable array of functional units. An offline tool suite extracts Megablocks from microprocessor instruction traces and generates a Reconfigurable Processing Unit (RPU) tailored for the execution of those Megablocks. The system is able to transparently move computations from the microprocessor to the RPU at runtime. A prototype implementation of the system using a cacheless MicroBlaze microprocessor running code located in external memory reaches speedups from 2.2x to 18.2x for a set of 14 benchmark kernels. For a system setup that maximizes microprocessor performance by having the application code located in internal block RAMs, speedups from 1.4x to 2.8x were estimated.

2016

The ANTAREX approach to autotuning and adaptivity for energy efficient HPC systems

Authors
Silvano, C; Agosta, G; Cherubin, S; Gadioli, D; Palermo, G; Bartolini, A; Benini, L; Martinovic, J; Palkovic, M; Slaninová, K; Bispo, J; Cardoso, JMP; Abreu, R; Pinto, P; Cavazzoni, C; Sanna, N; Beccari, AR; Cmar, R; Rohou, E;

Publication
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, Como, Italy, May 16-19, 2016

Abstract
The ANTAREX project aims at expressing application self-adaptivity through a Domain Specific Language (DSL) and at runtime management and autotuning of applications for green and heterogeneous High Performance Computing (HPC) systems up to Exascale. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies, as well as their enforcement at runtime through application autotuning and resource and power management. Through a mini-app extracted from one of the project's application use cases, we show an initial exploration of application precision tuning through mechanisms enabled by the DSL. © 2016 Copyright held by the owner/author(s).
