2016
Autores
de Oliveira, CB; Menotti, R; Cardoso, JMP; Marques, E;
Publicação
LANGUAGES, DESIGN METHODS, AND TOOLS FOR ELECTRONIC SYSTEM DESIGN
Abstract
2019
Autores
Paulino, NMC; Ferreira, JC; Cardoso, JMP;
Publicação
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Abstract
The use of specialized accelerator circuits is a feasible solution to address performance and energy issues in embedded systems. This paper extends a previous field-programmable gate array-based approach that automatically generates pipelined customized loop accelerators (CLAs) from runtime instruction traces. Despite efficient acceleration, the approach suffered from high area and resource requirements when offloading a large number of kernels from the target application. This paper addresses this by enhancing the CLA with dynamic partial reconfiguration (DPR) support. Each kernel to accelerate is implemented as a variant of a reconfigurable area of the CLA which hosts all functional units and configuration memory. Evaluation of the proposed system is performed on a Virtex-7 device. We show, for a set of 21 kernels, that when comparing two CLAs capable of accelerating the same subset of kernels, the one which benefits from DPR can be up to 4.3x smaller. Resorting to DPR allows for the implementation of CLAs which support numerous kernels without a significant decrease in operating frequency and does not affect the initiation intervals at which kernels are scheduled. Finally, the area required by a CLA instance can be further reduced by increasing the IIs of the scheduled kernels.
2018
Autores
Silvano, C; Agosta, G; Bartolini, A; Beccari, AR; Benini, L; Besnard, L; Bispo, J; Cmar, R; Cardoso, JMP; Cavazzoni, C; Cherubin, S; Gadioli, D; Golasowski, M; Lasri, I; Martinovic, J; Palermo, G; Pinto, P; Rohou, E; Sanna, N; Slaninová, K; Vitali, E;
Publicação
21st Euromicro Conference on Digital System Design, DSD 2018, Prague, Czech Republic, August 29-31, 2018
Abstract
The ANTAREX project relies on a Domain Specific Language (DSL) based on Aspect Oriented Programming (AOP) concepts to allow applications to enforce extra functional properties such as energy-efficiency and performance and to optimize Quality of Service (QoS) in an adaptive way. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management. In this paper, we present an overview of the ANTAREX DSL and some of its capabilities through a number of examples, including how the DSL is applied in the context of one of the project use cases. © 2018 IEEE.
2019
Autores
Nobre, R; Reis, L; Cardoso, JMP;
Publicação
EURO-PAR 2018: PARALLEL PROCESSING WORKSHOPS
Abstract
Iterative compilation focused on specialized phase orders (i.e., custom selections of compiler passes and orderings for each program or function) can significantly improve the performance of compiled code. However, phase ordering specialization typically needs to deal with large solution space. A previous approach, evaluated by targeting an x86 CPU, mitigates this issue by first using a training phase on reference codes to produce a small set of high-quality reusable phase orders. This approach then uses these phase orders to compile new codes, without any code analysis. In this paper, we evaluate the viability of using this approach to optimize the GPU execution performance of OpenCL kernels. In addition, we propose and evaluate the use of a heuristic to further reduce the number of evaluated phase orders, by comparing the speedups of the resulting binaries with those of the training phase for each phase order. This information is used to predict which untested phase order is most likely to produce good results (e.g., highest speedup). We performed our measurements using the PolyBench/GPU OpenCL benchmark suite on an NVIDIA Pascal GPU. Without heuristics, we can achieve a geomean execution speedup of 1.64x, using cross-validation, with 5 non-standard phase orders. With the heuristic, we can achieve the same speedup with only 3 non-standard phase orders. This is close to the geomean speedup achieved in our iterative compilation experiments exploring thousands of phase orders. Given the significant reduction in exploration time and other advantages of this approach, we believe that it is suitable for a wide range of compiler users concerned with performance.
2018
Autores
Bartolini, A; Cardoso, JMP; Silvano, C;
Publicação
ANDARE@PACT
Abstract
2018
Autores
Bartolini, A; Cardoso, JMP; Silvano, C;
Publicação
ACM International Conference Proceeding Series
Abstract
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.