Cookies Policy
The website need some cookies and similar means to function. If you permit us, we will use those means to collect data on your visits for aggregated statistics to improve our service. Find out More
Accept Reject
  • Menu
Publications

Publications by HumanISE

2016

AutoTuning and Adaptivity appRoach for Energy efficient eXascale HPC systems: the ANTAREX Approach

Authors
Silvano, C; Agosta, G; Bartolini, A; Beccari, AR; Benini, L; Bispo, J; Cmar, R; Cardoso, JMP; Cavazzoni, C; Martinovic, J; Palermo, G; Palkovic, M; Pinto, P; Rohou, E; Sanna, N; Slaninova, K;

Publication
PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)

Abstract
The main goal of the ANTAREX 1 project is to express by a Domain Specific Language (DSL) the application self-adaptivity and to runtime manage and autotune applications for green and heterogeneous High Performance Computing (HPC) systems up to the Exascale level. Key innovations of the project include the introduction of a separation of concerns between self-adaptivity strategies and application functionalities. The DSL approach will allow the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management.

2016

SSA-based MATLAB-to-C compilation and optimization

Authors
Reis, L; Bispo, J; Cardoso, JMP;

Publication
Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY@PLDI 2016, Santa Barbara, CA, USA, June 14, 2016

Abstract
Many fields of engineering, science and finance use models that are developed and validated in high-level languages such as MATLAB. However, when moving to environments with resource constraints or portability challenges, these models often have to be rewritten in lower-level languages such as C. Doing so manually is costly and error-prone, but automated approaches tend to generate code that can be substantially less efficient than the handwritten equivalents. Additionally, it is usually difficult to read and improve code generated by these tools. In this paper, we describe how we improved our MATLAB-to-C compiler, based on the MATISSE framework, to be able to compete with handwritten C code. We describe our new IR and the most important optimizations that we use in order to obtain acceptable performance. We also analyze multiple C code versions to identify where the generated code is slower than the handwritten code and identify a few key improvements to generate code capable of outperforming handwritten C. We evaluate the new version of our compiler using a set of benchmarks, including the Disparity benchmark, from the San Diego Vision Benchmark Suite, on a desktop computer and on an embedded device. The achieved results clearly show the efficiency of the current version of the compiler. Copyright is held by the owner/author(s). Publication rights licensed to ACM.

2016

Towards a Multi-softcore FPGA Approach for the HOG Algorithm

Authors
Mascagni de Holanda, JAM; Paiva Cardoso, JMP; Marques, E;

Publication
2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN)

Abstract
Object detection in images is a computing demanding task which usually needs to deal with the detection of different classes of objects, and thus requiring variations and adaptations easily provided by software solutions. Object detection algorithms are being part of real-time smarter embedded systems, such as automotive, medical, robotics and security systems. In most embedded systems, efficient implementations of object oriented algorithms need to provide high performance, low power consumption, and programmability to allow greater development flexibility. The Histogram of Oriented Gradients (HOG) is one of the most widely used algorithms for object detection in images. In this paper, we show our work towards mapping the HOG algorithm to an FPGA-based system consisting of multiple Nios II softcore processors and bearing in mind high-performance and programmability issues. We show how to reduce 19x the algorithms execution time by source to source transformations and specially avoiding redundant processing. Furthermore, we show how the use of pipelining processing using three Nios II processors allows a speedup of 49x compared to the embedded baseline application.

2016

The ANTAREX approach to autotuning and adaptivity for energy efficient HPC systems

Authors
Silvano, C; Agosta, G; Cherubin, S; Gadioli, D; Palermo, G; Bartolini, A; Benini, L; Martinovic, J; Palkovic, M; Slaninová, K; Bispo, J; Cardoso, JMP; Abreu, R; Pinto, P; Cavazzoni, C; Sanna, N; Beccari, AR; Cmar, R; Rohou, E;

Publication
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, Como, Italy, May 16-19, 2016

Abstract
The ANTAREX 1 project aims at expressing the application selfadaptivity through a Domain Specific Language (DSL) and to runtime manage and autotune applications for green and heterogeneous High Performance Computing (HPC) systems up to Exascale. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management. We show through a mini-App extracted from one of the project application use cases some initial exploration of application precision tuning by means enabled by the DSL. © 2016 Copyright held by the owner/author(s).

2016

Pipelining data-dependent tasks in FPGA-based multicore architectures

Authors
Azarian, A; Cardoso, JMP;

Publication
MICROPROCESSORS AND MICROSYSTEMS

Abstract
In recent years, there has been increasing interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of producer/consumer tasks. This paper proposes fine- and coarse-grained data synchronization approaches to achieve pipelining execution of producer/consumer tasks in FPGA-based multicore architectures. Our approaches are able to speedup the overall execution of successive, data-dependent tasks, by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated with the producer/consumer tasks. We propose techniques to reduce the number of accesses to external memory in our fine-grained data synchronization approach. The experimental results show the feasibility of the approach in both in-order and out-of-order producer/consumer tasks. Moreover, the results using our approach reveal noticeable performance improvements for a number of benchmarks over a single core implementation without using task-level pipelining.

2016

A Pipelined Multi-softcore Approach for the HOG Algorithm

Authors
Mascagni de Holanda, JAM; Paiva Cardoso, JMP; Marques, E;

Publication
PROCEEDINGS OF THE 2016 CONFERENCE ON DESIGN AND ARCHITECTURES FOR SIGNAL & IMAGE PROCESSING

Abstract
This paper describes the mapping and the acceleration of an object detection algorithm on a multiprocessor system based on an FPGA. We use HOG ( Histogram of Oriented Gradients), one of the most popular algorithms for detection of different classes of objects and currently being used in smart embedded systems. The use of HOG on such systems requires efficient implementations in order to provide high performance possibly with low energy/power consumption budgets. Also, as variations and adaptations of this algorithm are needed to deal with different scenarios and classes of objects, programmability is required to allow greater development flexibility. In this paper we show our approach towards implementing the HOG algorithm into a multi-softcore Nios II based-system, bearing in mind high-performance and programmability issues. By applying sourceto-source transformations we obtain speedups of 19x and by using pipelined processing we reduce the algorithms execution time 49x. We also show that improving the hardware with acceleration units can result in speedups of 72.4x compared to the embedded baseline application.

  • 320
  • 589