Publications

Publications by João Paiva Cardoso

2018

An Approach Based on a DSL plus API for Programming Runtime Adaptivity and Autotuning Concerns

Authors
Carvalho, T; Cardoso, JMP;

Publication
33RD ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING

Abstract
In the context of compiler optimizations, tuning of parameters and selection of algorithms, runtime adaptivity and autotuning are becoming increasingly important, especially due to the complexity of applications, workloads, computing devices and execution environments. For identifying and specifying adaptivity, different phases are required: analysis of program hotspots and adaptivity opportunities, code restructuring, and programming of adaptivity strategies. These phases usually require different tools and modifications to the source code that may result in code that is difficult to maintain and error-prone. This paper presents a flexible approach to support the different phases when developing adaptive applications. The approach is based on a single domain-specific language (DSL), able to specify and evaluate multiple strategies and to maintain a separation of concerns. We describe the requirements and the design of the DSL, an accompanying API, and of a Java-to-Java compiler that implements the approach. In addition, we present and evaluate the use of the approach to specify runtime adaptivity strategies in the context of Java programs, especially when considering runtime autotuning of optimization parameters and runtime selection of algorithms. Although simple, the case studies shown truly demonstrate the main advantages of the approach in terms of the programming model and of the performance impact.


2018

Impact of Vectorization Over 16-bit Data-Types on GPUs

Authors
Reis, L; Nobre, R; Cardoso, JMP;

Publication
PARMA-DITAM 2018: 9TH WORKSHOP ON PARALLEL PROGRAMMING AND RUNTIME MANAGEMENT TECHNIQUES FOR MANY-CORE ARCHITECTURES AND 7TH WORKSHOP ON DESIGN TOOLS AND ARCHITECTURES FOR MULTICORE EMBEDDED COMPUTING PLATFORMS

Abstract
Since the introduction of Single Instruction Multiple Thread (SIMT) GPU architectures, vectorization has seldom been recommended. However, for efficient use of 8-bit and 16-bit data types, vector types are necessary even on these GPUs. When only integer types were natively supported in sizes of less than 32 bits, the usefulness of vectors was limited, but the introduction of hardware support for packed half-precision floating point computations in recent GPU architectures changes this, as now floating-point programs can also benefit from vector types. Given a GPU kernel, using smaller data-types might not be sufficient to achieve the optimal performance for a given device, even on hardware with native support for half-precision, because the compiler targeting the GPU may not be able to automatically vectorize the code. In this paper, we present a number of examples that make use of the OpenCL vector data-types, which we are currently implementing in our tool for automatic vectorization. We present a number of experiments targeting a graphics card with an AMD Vega 10 XT GPU, which has 2x peak arithmetic throughput using half-precision when compared with single-precision. For comparison, we also target an older GPU architecture, without native support for half-precision arithmetic. We found that, on an AMD Vega 10 XT GPU, half-precision vectorization leads to performance improvements over the scalar version using the same precision (geometric mean speedup of 1.50x), which can be attributed to the GPU being able to make use of native support for arithmetic over packed half-precision data. However, we found that most of the performance improvement of vectorization is caused by related transformations, such as thread coarsening or loop unrolling.


2018

Rapid Prototyping and Verification of Hardware Modules Generated Using HLS

Authors
Caba, J; Cardoso, JMP; Rincón, F; Dondo, J; López, JC;

Publication
Applied Reconfigurable Computing. Architectures, Tools, and Applications - 14th International Symposium, ARC 2018, Santorini, Greece, May 2-4, 2018, Proceedings

Abstract
Most modern design suites include HLS tools that raise the design abstraction level and provide a fast and direct flow to programmable devices, getting rid of manual coding at the RTL. While HLS greatly reduces the design productivity gap, non-negligible problems arise. For instance, the co-simulation strategy may not provide trustworthy results due to the variable accuracy of simulation, especially when considering dynamic reconfiguration and access to system busses. This work proposes mechanisms aimed at improving the verification accuracy using a real device and a testing framework. One of the mechanisms is the inclusion of physical configuration macros (e.g., clock rate configuration macro) and test assertions based on physical parameters in the verification environment (e.g., timing assertions). In addition, it is possible to change some of those parameters, such as clock speed rate, and check the behavior of a hardware component in an overclocking or underclocking scenario. Our on-board testing flow allows faster FPGA iterations to ensure the design intent and the hardware-design behavior match. This flow uses a real device to carry out the verification process and synthesizes only the DUT generating its partial bitstream in a few minutes. © Springer International Publishing AG, part of Springer Nature 2018.


2018

SOCRATES - A seamless online compiler and system runtime autotuning framework for energy-aware applications

Authors
Gadioli, D; Nobre, R; Pinto, P; Vitali, E; Ashouri, AH; Palermo, G; Cardoso, JMP; Silvano, C;

Publication
2018 Design, Automation & Test in Europe Conference & Exhibition, DATE 2018, Dresden, Germany, March 19-23, 2018

Abstract
Configuring program parallelism and selecting optimal compiler options according to the underlying platform architecture is a difficult task. Typically, this task is either assigned to the programmer or done by a standard one-fits-all policy generated by the compiler or runtime system. A runtime selection of the best configuration requires the insertion of a lot of glue code for profiling and runtime selection. This represents a programming wall for application developers. This paper presents a structured approach, called SOCRATES, based on an aspect-oriented language (LARA) and a runtime autotuner (mARGOt) to mitigate this problem. LARA has been used to hide the glue code insertion, thus separating the pure functional application description from extra-functional requirements. mARGOt has been used for the automatic selection of the best configuration according to the runtime evolution of the application. © 2018 EDAA.


2017

The ANTAREX tool flow for monitoring and autotuning energy efficient HPC systems

Authors
Silvano, C; Agosta, G; Barbosa, JG; Bartolini, A; Beccari, AR; Benini, L; Bispo, J; Cardoso, JMP; Cavazzoni, C; Cherubin, S; Cmar, R; Gadioli, D; Manelfi, C; Martinovic, J; Nobre, R; Palermo, G; Palkovic, M; Pinto, P; Rohou, E; Sanna, N; Slaninová, K;

Publication
2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017, Pythagorion, Greece, July 17-20, 2017

Abstract
Designing and optimizing HPC applications are difficult and complex tasks, which require mastering specialized languages and tools for performance tuning. As this is incompatible with the current trend to open HPC infrastructures to a wider range of users, the availability of more sophisticated programming languages and tools to assist and automate the design stages is crucial to provide smooth migration paths towards novel heterogeneous HPC platforms. The ANTAREX project intends to address these issues by providing a tool flow, a Domain Specific Language and APIs to provide application adaptivity and to runtime manage and autotune applications for heterogeneous HPC systems. Our DSL provides a separation of concerns, where analysis, runtime adaptivity, performance tuning and energy strategies are specified separately from the application functionalities with the goal to increase productivity, significantly reduce time to solution, while making possible the deployment of substantially improved implementations. This paper presents the ANTAREX tool flow and shows the impact of optimization strategies in the context of one of the ANTAREX use cases related to personalized drug design. We show how simple strategies, not devised by typical compilers, can substantially speed up the execution and reduce energy consumption. © 2017 IEEE.


2018

Autotuning and Adaptivity in Energy Efficient HPC Systems: The ANTAREX Toolbox

Authors
Silvano, C; Palermo, G; Agosta, G; Ashouri, AH; Gadioli, D; Cherubin, S; Vitali, E; Benini, L; Bartolini, A; Cesarini, D; Cardoso, J; Bispo, J; Pinto, P; Nobre, R; Rohou, E; Besnard, L; Lasri, I; Sanna, N; Cavazzoni, C; Cmar, R; Martinovic, J; Slaninova, K; Golasowski, M; Beccari, AR; Manelfi, C;

Publication
2018 ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS

Abstract
Designing and optimizing applications for energy-efficient High Performance Computing systems up to the Exascale era is an extremely challenging problem. This paper presents the toolbox developed in the ANTAREX European project for autotuning and adaptivity in energy efficient HPC systems. In particular, the modules of the ANTAREX toolbox are described as well as some preliminary results of the application to two target use cases.
