Publications

Publications by João Paiva Cardoso

2013

The REFLECT design-flow

Authors
Cardoso, JMP; De F. Coutinho, JG; Nane, R; Sima, VM; Olivier, B; Carvalho, T; Nobre, R; Diniz, PC; Petrov, Z; Bertels, K; Gonçalves, F; Van Someren, H; Hübner, M; Constantinides, G; Luk, W; Becker, J; Krátký, K; Bhattacharya, S; Alves, JC; Ferreira, JC;

Publication
Compilation and Synthesis for Embedded Reconfigurable Systems: An Aspect-Oriented Approach

Abstract
This chapter describes the design-flow approach developed in the REFLECT project as presented originally in [1]. Over the course of the project, this design-flow has evolved and has been extended into a fully operational toolchain. We begin by presenting an overview of the underlying aspect-oriented compilation flow followed by an extended description of the design-flow and its toolchain. © Springer Science+Business Media New York 2013. All rights are reserved.

2014

Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
2014 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA)

Abstract
This paper presents a binary acceleration approach based on extending a General Purpose Processor (GPP) with a Reconfigurable Processing Unit (RPU), both sharing an external data memory. In this approach repeating sequences of GPP instructions are migrated to the RPU. The RPU resources are selected and organized off-line using execution trace information. The RPU core is composed of Functional Units (FUs) that correspond to single CPU instructions. The FUs are arranged in stages of mutually independent operations. The RPU can enable several stages in tandem, depending on the data dependencies. External data memory accesses are handled by a configurable dual-port cache. A prototype implementation of the architecture on a Spartan-6 FPGA was validated with 12 benchmarks and achieved an overall geometric mean speedup of 1.91x.

2015

Use of previously acquired positioning of optimizations for phase ordering exploration

Authors
Nobre, R; Martins, LGA; Cardoso, JMP;

Publication
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES 2015

Abstract
This paper presents a new approach to efficiently search for suitable compiler pass sequences, a challenge known as phase ordering. Our approach relies on information about the relative positions of compiler passes in compiler pass sequences previously generated for a set of functions when compiling for a specific processor. We enhanced two iterative compiler pass exploration schemes, one relying on simple sequential compiler pass insertion and the other implementing an auto-tuned simulated annealing process, with a data structure that holds information about the relative positions of compiler passes in those sequences. This reduces the set of compiler passes considered for insertion at a given position of a candidate compiler pass sequence to only the passes that have a higher probability of performing well at that relative position, which shortens the exploration time. We tested our approach with two different compilers and two different targets: the ReflectC and the LLVM compilers, targeting a MicroBlaze processor and a LEON3 processor, respectively. The experimental results show that we can reduce the number of algorithm iterations by a factor of up to more than an order of magnitude when targeting the MicroBlaze or the LEON3, while finding compiler sequences that result in binaries that, when executed on the target processor/simulator, outperform (i.e., use fewer CPU cycles than) all the standard optimization levels (i.e., we compare against the best-performing optimization level flag on each kernel, e.g., -O1, -O2 or -O3 in the case of LLVM). The geometric mean performance improvement is 1.23x and 1.20x when targeting the MicroBlaze processor, and 1.94x and 2.65x when targeting the LEON3 processor, for each of the two exploration algorithms and two kernel sets considered. © 2015 ACM.
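The position-based filtering described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bucketed table, the function names, the `buckets` and `keep` parameters, and the example sequences are all assumptions made for illustration. The pass names are used only as labels.

```python
from collections import defaultdict


def build_position_table(good_sequences, buckets=4):
    """Count how often each pass appears in each relative-position bucket.

    Hypothetical data structure: pass name -> list of counts, one per bucket.
    """
    table = defaultdict(lambda: [0] * buckets)
    for seq in good_sequences:
        for i, name in enumerate(seq):
            # Map absolute index i to a relative-position bucket.
            b = min(buckets - 1, i * buckets // len(seq))
            table[name][b] += 1
    return dict(table)


def candidates_for(table, index, length, buckets=4, keep=2):
    """Return up to `keep` passes seen most often at this relative position."""
    b = min(buckets - 1, index * buckets // max(length, 1))
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(table, key=lambda name: (-table[name][b], name))
    return [name for name in ranked if table[name][b] > 0][:keep]


# Illustrative (made-up) sequences assumed to have worked well before.
previous = [
    ["mem2reg", "licm", "gvn", "dce"],
    ["mem2reg", "gvn", "licm", "dce"],
    ["mem2reg", "licm", "dce"],
]
table = build_position_table(previous)
print(candidates_for(table, index=1, length=4))
```

An exploration loop (sequential insertion or simulated annealing) would then evaluate only the passes returned by `candidates_for` at each position instead of every available pass, which is where the reduction in iterations would come from under these assumptions.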

2014

Coarse/Fine-grained Approaches for Pipelining Computing Stages in FPGA-Based Multicore Architectures

Authors
Azarian, A; Cardoso, JMP;

Publication
EURO-PAR 2014: PARALLEL PROCESSING WORKSHOPS, PT II

Abstract
In recent years, there has been increasing interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of producer/consumer tasks. This paper presents coarse/fine-grained data flow synchronization approaches to achieve pipelined execution of the producer/consumer tasks in FPGA-based multicore architectures. Our approaches are able to speed up the overall execution of successive, data-dependent tasks by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated with the producer/consumer tasks. The experimental results show the feasibility of the approach when dealing with producer/consumer tasks with out-of-order communication and reveal noticeable performance improvements for a number of benchmarks over a single core implementation without task-level pipelining.

2014

Multi-target C code generation from MATLAB

Authors
Bispo, J; Reis, L; Cardoso, JMP;

Publication
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

Abstract
This paper describes our recent work on MATISSE, a framework for MATLAB to C compilation. We focus on the new optimizations and transformations, as well as on OpenCL generation. MATISSE is controlled with LARA, an aspect-oriented language, able to specify transformations to the input MATLAB code (e.g., insertion of code for variable initialization and for monitoring) and to express information concerning types and shapes of variables. We evaluate the compiler with a set of benchmarks when targeting both an embedded system and a desktop system. The results show that we were able to achieve a speedup of up to 1.8× by employing information provided by LARA aspects. We also compare the execution time of the generated C code with the original code running on MATLAB, and we achieve a geometric mean speedup of 19×. The geometric mean speedup reduces to 12× when optimizing the MATLAB code with LARA aspects. Finally, we present a preliminary version of a fully-functioning pragma-based OpenCL generator, built over the MATISSE framework.

2016

Pipelining data-dependent tasks in FPGA-based multicore architectures

Authors
Azarian, A; Cardoso, JMP;

Publication
MICROPROCESSORS AND MICROSYSTEMS

Abstract
In recent years, there has been increasing interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of producer/consumer tasks. This paper proposes fine- and coarse-grained data synchronization approaches to achieve pipelined execution of producer/consumer tasks in FPGA-based multicore architectures. Our approaches are able to speed up the overall execution of successive, data-dependent tasks by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated with the producer/consumer tasks. We propose techniques to reduce the number of accesses to external memory in our fine-grained data synchronization approach. The experimental results show the feasibility of the approach for both in-order and out-of-order producer/consumer tasks. Moreover, the results using our approach reveal noticeable performance improvements for a number of benchmarks over a single core implementation without task-level pipelining.
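As a rough software analogy of task-level pipelining through an inter-stage buffer, the following Python sketch lets a consumer start working on each element as soon as the producer emits it, rather than waiting for the whole producer task to finish. It covers only the in-order, fine-grained case and uses threads with a bounded queue. The function name, the buffer size, and the toy computations are assumptions, and nothing here models the FPGA buffer schemes of the paper.

```python
import queue
import threading


def run_pipelined(n, buffer_size=4):
    """Run a toy producer and consumer concurrently through a bounded buffer."""
    buf = queue.Queue(maxsize=buffer_size)  # stands in for the inter-stage buffer
    done = object()                         # sentinel marking the end of the stream
    results = []

    def producer():
        for i in range(n):
            buf.put(i * i)   # blocks while the buffer is full
        buf.put(done)

    def consumer():
        while True:
            item = buf.get()  # blocks while the buffer is empty
            if item is done:
                break
            results.append(item + 1)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipelined(5))  # [1, 2, 5, 10, 17]
```

The blocking `put` and `get` calls provide the per-element synchronization. A coarse-grained variant would hand over larger chunks of data per synchronization step instead of single elements.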
