
Publications by João Paiva Cardoso

2017

Impact of Compiler Phase Ordering When Targeting GPUs

Authors
Nobre, R; Reis, L; Cardoso, JMP;

Publication
Euro-Par 2017: Parallel Processing Workshops - Euro-Par 2017 International Workshops, Santiago de Compostela, Spain, August 28-29, 2017, Revised Selected Papers

Abstract
Research in compiler pass phase ordering (i.e., selection of compiler analysis/transformation passes and their order of execution) has been mostly performed in the context of CPUs and, in a small number of cases, FPGAs. In this paper we present experiments regarding compiler pass phase ordering specialization of OpenCL kernels targeting NVIDIA GPUs using Clang/LLVM 3.9 and the libclc OpenCL library. More specifically, we analyze the impact of using specialized compiler phase orders on the performance of 15 PolyBench/GPU OpenCL benchmarks. In addition, we analyze the final NVIDIA PTX assembly code generated by the different compilation flows in order to identify the main reasons for the cases with significant performance improvements. Using specialized compiler phase orders, we were able to achieve performance improvements over the CUDA version and OpenCL compiled with the NVIDIA driver. Compared to CUDA, we were able to achieve geometric mean improvements of 1.54× (up to 5.48×). Compared to the OpenCL driver version, we were able to achieve geometric mean improvements of 1.65× (up to 5.70×). © Springer International Publishing AG, part of Springer Nature 2018.
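As a rough, hypothetical illustration of what phase-order specialization involves (not the paper's actual flow or the pass orders it selected), the C sketch below shells out to LLVM's opt and llc to apply a few candidate pass orders to a kernel already lowered to LLVM IR (the file kernel.ll and the timing harness are assumptions), keeping track of which orders build; in a real setup each resulting PTX would then be timed and the fastest order kept.

/* Hedged sketch: try a handful of candidate LLVM pass orders for one kernel.
 * The pass names are real LLVM passes, but these orders are illustrative,
 * not the ones found in the paper; "kernel.ll" is a hypothetical input. */
#include <stdio.h>
#include <stdlib.h>

static const char *orders[] = {
    "-sroa -instcombine -licm -gvn -loop-unroll",
    "-instcombine -loop-rotate -licm -loop-unroll -gvn",
    "-mem2reg -gvn -instcombine -licm",
};

int main(void) {
    char cmd[512];
    for (size_t i = 0; i < sizeof orders / sizeof orders[0]; i++) {
        /* apply the candidate pass order, then lower the optimized IR to PTX */
        snprintf(cmd, sizeof cmd,
                 "opt %s kernel.ll -o kernel.opt.bc && "
                 "llc -march=nvptx64 kernel.opt.bc -o kernel.ptx",
                 orders[i]);
        if (system(cmd) != 0)
            continue; /* skip orders that fail to build */
        /* here one would load kernel.ptx through the driver API, time the
         * kernel on representative inputs, and remember the fastest order */
        printf("built PTX with order %zu: %s\n", i, orders[i]);
    }
    return 0;
}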

2017

On Coding Techniques for Targeting FPGAs via OpenCL

Authors
Paulino, N; Reis, L; Cardoso, JMP;

Publication
Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing, ParCo 2017, 12-15 September 2017, Bologna, Italy

Abstract
Software developers have always found it difficult to adopt Field-Programmable Gate Arrays (FPGAs) as computing platforms. Recent advances in High-Level Synthesis (HLS) tools aim to ease the mapping of computations to FPGAs by abstracting the hardware design effort via a standard OpenCL interface and execution model. However, OpenCL is a low-level programming language and requires that developers master the target architecture in order to achieve efficient results. Thus, efforts addressing the generation of OpenCL from high-level languages are of paramount importance to increase design productivity and to help software developers. Existing approaches bridge this gap by translating MATLAB/Octave code into C, or similar languages, in order to improve performance by efficiently compiling for the target hardware. One example is the MATISSE source-to-source compiler, which translates MATLAB code into standard-compliant C and/or OpenCL code. In this paper, we analyse the viability of combining both flows so that sections of MATLAB code can be translated to specialized hardware with a small amount of effort, and test a few code optimizations and their effect on performance. We present preliminary results on execution times, resource usage, and power consumption for two OpenCL kernels generated by MATISSE, together with manual optimizations of each kernel based on different coding techniques. © 2018 The authors and IOS Press.
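To make the idea of kernel-level coding techniques concrete, below is a minimal, hypothetical OpenCL C example (not a MATISSE-generated kernel nor one of the specific optimizations evaluated in the paper): a baseline multiply-accumulate kernel, and a variant that adds restrict qualifiers and per-work-item unrolling, the kind of hints FPGA OpenCL toolchains commonly use to build deeper pipelines.

/* Baseline kernel (illustrative example, not from the paper). */
__kernel void mac_baseline(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           const int n)
{
    int i = (int)get_global_id(0);
    if (i < n)
        c[i] += a[i] * b[i];
}

/* Variant with restrict qualifiers and partial unrolling: each work-item
 * processes a small block, and the unroll pragma (honoured by common FPGA
 * OpenCL compilers) exposes more parallelism to the HLS scheduler. */
__kernel void mac_unrolled(__global const float * restrict a,
                           __global const float * restrict b,
                           __global float * restrict c,
                           const int n)
{
    int base = (int)get_global_id(0) * 4;
    #pragma unroll
    for (int k = 0; k < 4; k++) {
        int i = base + k;
        if (i < n)
            c[i] += a[i] * b[i];
    }
}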

2014

Proceedings of the 5th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 3rd Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2014, Vienna, Austria, January 20, 2014

Authors
Silvano, C; Cardoso, JMP; Hübner, M;

Publication
PARMA-DITAM@HiPEAC

Abstract

2016

Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms, PARMA-DITAM 2016, Prague, Czech Republic, January 18, 2016

Authors
Silvano, C; Cardoso, JMP; Agosta, G; Hübner, M;

Publication
PARMA-DITAM@HiPEAC

Abstract

2018

Aspect-Driven Mixed-Precision Tuning Targeting GPUs

Authors
Nobre, R; Reis, L; Bispo, J; Carvalho, T; Cardoso, JMP; Cherubin, S; Agosta, G;

Publication
PARMA-DITAM 2018: 9TH WORKSHOP ON PARALLEL PROGRAMMING AND RUNTIME MANAGEMENT TECHNIQUES FOR MANY-CORE ARCHITECTURES AND 7TH WORKSHOP ON DESIGN TOOLS AND ARCHITECTURES FOR MULTICORE EMBEDDED COMPUTING PLATFORMS

Abstract
Writing mixed-precision kernels makes it possible to achieve higher throughput while keeping output precision within given limits. The recent introduction of native half-precision arithmetic in several GPUs, such as the NVIDIA P100 and AMD Vega 10, makes precision tuning even more relevant. However, it is not trivial to manually find which variables should be represented in half precision instead of single or double precision. Although the use of half-precision arithmetic can speed up kernel execution considerably, it can also produce unusable kernel outputs whenever the wrong variables are declared with the half-precision data type. In this paper we present an automatic approach for precision tuning. Given an OpenCL kernel with a set of inputs declared by a user (i.e., the person responsible for programming and/or tuning the kernel), our approach derives mixed-precision versions of the kernel that improve upon the original with respect to a given metric (e.g., time-to-solution, energy-to-solution). We allow the user to declare and/or select a metric to measure and to filter solutions based on the quality of the output. We implement a proof-of-concept of our approach using LARA, an aspect-oriented programming language. It generates mixed-precision kernels with considerably higher performance than the original single-precision floating-point versions, while producing outputs that can be acceptable in some scenarios.
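As a hypothetical source-level illustration of what a mixed-precision variant looks like (not a kernel or tuning decision taken from the paper), the OpenCL C sketch below demotes the streamed inputs and the intermediate arithmetic to half precision while keeping the stored result in single precision; whether a particular split like this is acceptable is exactly what the user-declared output-quality metric decides.

/* Illustrative only; assumes a device exposing the cl_khr_fp16 extension. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Original single-precision kernel. */
__kernel void axpy_fp32(__global const float *x,
                        __global const float *y,
                        __global float *out,
                        const float alpha)
{
    int i = (int)get_global_id(0);
    out[i] = alpha * x[i] + y[i];
}

/* Mixed-precision variant: inputs are stored as half (halving bandwidth) and
 * the multiply-add is performed in half precision, while the result is kept
 * in float. Choosing which variables can safely be demoted like this is what
 * the tuning flow automates. */
__kernel void axpy_mixed(__global const half *x,
                         __global const half *y,
                         __global float *out,
                         const float alpha)
{
    int i = (int)get_global_id(0);
    half t = (half)alpha * x[i] + y[i];  /* half-precision arithmetic */
    out[i] = (float)t;                   /* single-precision result */
}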

2018

AutoPar-Clava: An Automatic Parallelization source-to-source tool for C code applications

Authors
Arabnejad, H; Bispo, J; Barbosa, JG; Cardoso, JMP;

Publication
PARMA-DITAM 2018: 9TH WORKSHOP ON PARALLEL PROGRAMMING AND RUNTIME MANAGEMENT TECHNIQUES FOR MANY-CORE ARCHITECTURES AND 7TH WORKSHOP ON DESIGN TOOLS AND ARCHITECTURES FOR MULTICORE EMBEDDED COMPUTING PLATFORMS

Abstract
Automatic parallelization of sequential code has become increasingly relevant in multicore programming. In particular, loop parallelization continues to be a promising optimization technique for scientific applications, and can provide considerable speedups for program execution. Furthermore, if we can verify that there are no true data dependencies between loop iterations, the loops can be easily parallelized. This paper describes Clava AutoPar, a library for the Clava weaver that performs automatic and symbolic parallelization of C code. The library is composed of two main parts: parallel loop detection and source-to-source code parallelization. The system is entirely automatic and attempts to statically detect parallel loops for a given input program, without any user intervention or profiling information. We obtained a geometric mean speedup of 1.5 for a set of programs from the C version of the NAS benchmarks, and experimental results suggest that the performance obtained with Clava AutoPar is comparable to or better than that of other similar research and commercial tools.
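For intuition, the C fragment below shows the kind of source-to-source transformation such a tool performs (a generic, hypothetical example, not actual Clava AutoPar output): loops whose iterations carry no true data dependences are annotated with OpenMP work-sharing directives, and a reduction gets the clause needed to combine partial results safely.

/* Illustrative parallelized C code, not actual tool output; build with -fopenmp. */

/* Each iteration writes only c[i] and reads only a[i] and b[i], so there are
 * no true (flow) dependences across iterations and the loop is parallelizable. */
void vector_add(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* The accumulation into sum needs a reduction clause so that each thread keeps
 * a private partial sum that is combined at the end of the loop. */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}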
