Cookies Policy
The website need some cookies and similar means to function. If you permit us, we will use those means to collect data on your visits for aggregated statistics to improve our service. Find out More
Accept Reject
  • Menu
Publications

Publications by João Paiva Cardoso

2015

13th IEEE International Conference on Embedded and Ubiquitous Computing, EUC 2013, Porto, Portugal, October 21-23, 2015

Authors
Bozorgzadeh, E; Cardoso, JMP; Abreu, R; Memik, SO;

Publication
EUC

Abstract

2016

A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach

Authors
Nobre, R; Martins, LGA; Cardoso, JMP;

Publication
ACM SIGPLAN NOTICES

Abstract
Nowadays compilers include tens or hundreds of optimization passes, which makes it difficult to find sequences of optimizations that achieve compiled code more optimized than the one obtained using typical compiler options such as -O2 and -O3. The problem involves both the selection of the compiler passes to use and their ordering in the compilation pipeline. The improvement achieved by the use of custom phase orders for each function can be significant, and thus important to satisfy strict requirements such as the ones present in high-performance embedded computing systems. In this paper we present a new and fast iterative approach to the phase selection and ordering challenges resulting in compiled code with higher performance than the one achieved with the standard optimization levels of the LLVM compiler. The obtained performance improvements are comparable with the ones achieved by other iterative approaches while requiring considerably less time and resources. Our approach is based on sampling over a graph representing transitions between compiler passes. We performed a number of experiments targeting the LEON3 microarchitecture using the Clang/LLVM 3.7 compiler, considering 140 LLVM passes and a set of 42 representative signal and image processing C functions. An exhaustive cross-validation shows our new exploration method is able to achieve a geometric mean performance speedup of 1.28x over the best individually selected -OX flag when considering 100,000 iterations; versus geometric mean speedups from 1.16x to 1.25x obtained with state-of-the-art iterative methods not using the graph. From the set of exploration methods tested, our new method is the only one consistently finding compiler sequences that result in performance improvements when considering 100 or less exploration iterations. Specifically, it achieved geometric mean speedups of 1.08x and 1.16x for 10 and 100 iterations, respectively.

2015

A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS

Abstract
This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) is used as a coprocessor of the General Purpose Processor (GPP) to accelerate the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. The implementation of Megablocks with memory accesses uses a memory-sharing mechanism to support concurrent accesses to the entire address space of the GPP's data memory. The scheduling of load/store operations and memory access handling have been optimized to minimize the latency introduced by memory accesses. The system is able to dynamically switch the execution between the GPP and the RPU when executing the original binaries of the input application. Our proof-of-concept prototype achieved geometric mean speedups of 1.60x and 1.18x for, respectively, a set of 37 benchmarks and a subset considering the 9 most complex benchmarks. With respect to a previous version of our approach, we achieved geometric mean speedup improvements from 1.22 to 1.53 for the 10 benchmarks previously used.

2016

Clustering-Based Selection for the Exploration of Compiler Optimization Sequences

Authors
Martins, LGA; Nobre, R; Cardoso, JMP; Delbem, ACB; Marques, E;

Publication
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION

Abstract
A large number of compiler optimizations are nowadays available to users. These optimizations interact with each other and with the input code in several and complex ways. The sequence of application of optimization passes can have a significant impact on the performance achieved. The effect of the optimizations is both platform and application dependent. The exhaustive exploration of all viable sequences of compiler optimizations for a given code fragment is not feasible. As this exploration is a complex and time-consuming task, several researchers have focused on Design Space Exploration (DSE) strategies both to select optimization sequences to improve the performance of each function of the application and to reduce the exploration time. In this article, we present a DSE scheme based on a clustering approach for grouping functions with similarities and exploration of a reduced search space resulting from the combination of optimizations previously suggested for the functions in each group. The identification of similarities between functions uses a data mining method that is applied to a symbolic code representation. The data mining process combines three algorithms to generate clusters: the Normalized Compression Distance, the Neighbor Joining, and a new ambiguity-based clustering algorithm. Our experiments for evaluating the effectiveness of the proposed approach address the exploration of optimization sequences in the context of the ReflectC compiler, considering 49 compilation passes while targeting a Xilinx MicroBlaze processor, and aiming at performance improvements for 51 functions and four applications. Experimental results reveal that the use of our clustering-based DSE approach achieves a significant reduction in the total exploration time of the search space (20x over a Genetic Algorithm approach) at the same time that considerable performance speedups (41% over the baseline) were obtained using the optimized codes. Additional experiments were performed considering the LLVM compiler, considering 124 compilation passes, and targeting a LEON3 processor. The results show that our approach achieved geometric mean speedups of 1.49x, 1.32x, and 1.24x for the best 10, 20, and 30 functions, respectively, and a global improvement of 7% over the performance obtained when compiling with -O2.

2016

Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach

Authors
Cardoso, JMP; Coutinho, JGF; Carvalho, T; Diniz, PC; Petrov, Z; Luk, W; Goncalves, F;

Publication
SOFTWARE-PRACTICE & EXPERIENCE

Abstract
The development of applications for high-performance embedded systems is a long and error-prone process because in addition to the required functionality, developers must consider various and often conflicting nonfunctional requirements such as performance and/or energy efficiency. The complexity of this process is further exacerbated by the multitude of target architectures and mapping tools. This article describes LARA, an aspect-oriented programming language that allows programmers to convey domain-specific knowledge and nonfunctional requirements to a toolchain composed of source-to-source transformers, compiler optimizers, and mapping/synthesis tools. LARA is sufficiently flexible to target different tools and host languages while also allowing the specification of compilation strategies to enable efficient generation of software code and hardware cores (using hardware description languages) for hybrid target architectures - a unique feature to the best of our knowledge not found in any other aspect-oriented programming language. A key feature of LARA is its ability to deal with different models of join points, actions, and attributes. In this article, we describe the LARA approach and evaluate its impact on code instrumentation and analysis and on selecting critical code sections to be migrated to hardware accelerators for two embedded applications from industry. Copyright (c) 2014 John Wiley & Sons, Ltd.

2015

Programming strategies for contextual runtime specialization

Authors
Carvalho, T; Pinto, P; Cardoso, JMP;

Publication
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES 2015

Abstract
Runtime adaptability is expected to adjust the application and the mapping of computations according to usage contexts, operating environments, resources availability, etc. However, extending applications with adaptive features can be a complex task, especially due to the current lack of programming models and compiler support. One of the runtime adaptability possibilities is the use of specialized code according to data workloads and environments. Traditional approaches use multiple code versions generated offline and, during runtime, a strategy is responsible to select a code version. Moving code generation to runtime can achieve important improvements but may impose unacceptable overhead. This paper presents an aspect-oriented programming approach for runtime adaptability. We focus on a separation of concerns (strategies vs. application) promoted by a domain-specific language for programming runtime strategies. Our strategies allow runtime specialization based on contextual information. We use a template-based runtime code generation approach to achieve program specialization. We demonstrate our approach with examples from image processing, which depict the benefits of runtime specialization and illustrate how several factors need to be considered to eficiently adapt the application. © 2015 ACM.

  • 1
  • 44