Cookies Policy
The website need some cookies and similar means to function. If you permit us, we will use those means to collect data on your visits for aggregated statistics to improve our service. Find out More
Accept Reject
  • Menu
Publications

Publications by HumanISE

2015

Techniques for efficient MATLAB-to-C compilation

Authors
Bispo, J; Reis, L; Cardoso, JMP;

Publication
Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY@PLDI, Portland, OR, USA, June 15 - 17, 2015

Abstract
MATLAB to C translation is foreseen to raise the overall abstraction level when mapping computations to embedded systems (possibly consisting of software and hardware components), and thus for increasing productivity and for providing an automated modeldriven design-flow. This paper describes recent work developed in the context of MATISSE, a MATLAB to C compiler targeting embedded systems. We introduce several techniques to allow the efficient generation of C code, such as weak types, primitives and matrix views. We evaluate the compiler with a set of 9 publicly available benchmarks, targeting both embedded systems and a desktop system. We compare the execution time of the generated C code with the original code running on MATLAB, achieving a geometric mean speedup of 8.1 ×, and qualitatively compare our results with the performance of related approaches. The use of the new techniques allowed the compiler to achieve performance improvements of 46% on average.

2015

C and OpenCL Generation from MATLAB

Authors
Bispo, J; Reis, L; Cardoso, JMP;

Publication
30TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, VOLS I AND II

Abstract
In many engineering and science areas, models are developed and validated using high-level programing languages and environments as is the case with MATLAB. In order to target the multicore heterogeneous architectures being used on embedded systems to provide high performance computing with acceptable energy/power envelops, developers manually migrate critical code sections to lower-level languages such as C and OpenCL, a time consuming and error prone process. Thus, automatic source-to-source approaches are highly desirable. We present an approach to compile MATLAB and output equivalent C/OpenCL code to target architectures, such as GPU based hardware accelerators. We evaluate our approach on an existing MATLAB compiler framework named MATISSE. The OpenCL generation relies on the manual insertion of directives to guide the compilation and is also capable of generating C wrapper code to interface and synchronize with the OpenCL code. We evaluated the compiler with a number of benchmarks from different domains and the results are very encouraging.

2015

Transparent Acceleration of Program Execution Using Reconfigurable Hardware

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)

Abstract
The acceleration of applications, running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware is an approach which does not involve program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, and able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP and the very encouraging results achieved with a number of benchmarks.

2015

Use of previously acquired positioning of optimizations for phase ordering exploration

Authors
Nobre, R; Martins, LGA; Cardoso, JMP;

Publication
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES 2015

Abstract
This paper presents a new approach to efficiently search for suitable compiler pass sequences, a challenge known as phase ordering. Our approach relies on information about the relative positions of compiler passes in compiler pass sequences previously generated for a set of functions when compiling for a specific processor. We enhanced two iterative compiler pass exploration schemes, one relying on simple sequential compiler pass insertion and other implementing an auto-tuned simulated annealing process, with a data structure that holds information about the relative positions of compiler sequences; in order to reduce the set of compiler passes considered for insertion in a given position of a given candidate compiler pass sequence to include only the passes that have a higher probability of performing well on that relative position in the compiler sequence, speeding up the exploration time as a result. We tested our approach with two different compilers and two different targets; the ReflectC and the LLVM compilers, targeting a MicroBlaze processor and a LEON3 processor, respectively. The experimental results show that we can considerably reduce the number of algorithm iterations by a factor of up to more than an order of magnitude when targeting the MicroBlaze or the LEON3, while finding compiler sequences that result in binaries that when executed on the target processor/simulator are able to outperform (i.e. use less CPU cycles) all the standard optimization levels (i.e., we compare against the most performing optimization level flag on each kernel, e.g. -O1, -O2 or -O3 in the case of LLVM) by a geometric mean performance improvement of 1.23x and 1.20x when targeting the MicroBlaze processor, and 1.94x and 2.65x when targetting the LEON3 processor; for each of the two exploration algorithms and two kernel sets considered. © 2015 ACM.

2015

Enabling FPGA routing configuration sharing in dynamic partial reconfiguration

Authors
Al Farisi, B; Heyse, K; Bruneel, K; Cardoso, J; Stroobandt, D;

Publication
DESIGN AUTOMATION FOR EMBEDDED SYSTEMS

Abstract
Using dynamic partial reconfiguration (DPR), several circuits can be time-multiplexed on the same FPGA region, saving considerable area compared to an implementation without DPR. However, the long reconfiguration time to switch between circuits remains a significant problem. In this work we show that it is possible to significantly reduce this overhead when the number of circuits is limited. We lower the DPR overhead by reducing the number of configuration bits that needs to be reconfigured. This is achieved by keeping a (predetermined) part of the configuration frames of the DPR region constant/static for all circuits and, consequentially, sharing this part of the configuration between all the circuits. We show that this can be done maintaining the possibility to implement completely unrelated circuits in the DPR region. An extension of the Pathfinder algorithm, called StaticRoute, is presented. It is able to route the nets of the different circuits simultaneously in such a way that the routing of the different circuits is the same in the static part and may only differ in the dynamic part. Our approach is evaluated on the architecture of a commercially available SRAM-based FPGA. We explore how the static part in the configuration memory is best chosen and investigate the associated impact on maximum operating clock frequency as the number of circuits increases. Our experiments show that it is possible to make 50 % of the routing configuration static and therefore reduce the routing reconfiguration time by 50 %, without a significant impact on maximum clock frequency of the circuits. This corresponds to a reduction of total reconfiguration time of 34 %.

2015

Guest Editorial FPL 2013

Authors
Cardoso, JMP; Diniz, PC; Morrow, K;

Publication
ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS

Abstract

  • 356
  • 589