1997
Authors
Diniz, P; Rinard, M;
Publication
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
Abstract
This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
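The sampling/production alternation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two "versions" and the timing-based overhead measure are stand-ins for compiler-generated code variants under different optimization policies.

```python
import time

# Two hypothetical versions of the same computation, each standing in for
# code generated under a different optimization policy.
def version_a(data):
    return sum(x * x for x in data)

def version_b(data):
    total = 0
    for x in data:
        total += x * x
    return total

VERSIONS = [version_a, version_b]

def sampling_phase(data):
    """Measure the overhead of each version (here: wall-clock time) and
    return the index of the version with the least overhead."""
    timings = []
    for version in VERSIONS:
        start = time.perf_counter()
        version(data)
        timings.append(time.perf_counter() - start)
    return min(range(len(VERSIONS)), key=lambda i: timings[i])

def run_with_dynamic_feedback(data, production_iters=100, resample_every=50):
    """Alternate sampling and production phases; resample periodically so
    the computation adapts to changes in the execution environment."""
    best = sampling_phase(data)
    result = None
    for i in range(production_iters):
        if i > 0 and i % resample_every == 0:
            best = sampling_phase(data)  # re-measure, adapt
        result = VERSIONS[best](data)
    return result
```

Both versions compute the same result, so whichever the sampling phase selects, the production phases are correct; only performance differs.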
1998
Authors
Diniz, PC; Rinard, MC;
Publication
Journal of Parallel and Distributed Computing
Abstract
Atomic operations are a key primitive in parallel computing systems. The standard implementation mechanism for atomic operations uses mutual exclusion locks. In an object-based programming system, the natural granularity is to give each object its own lock. Each operation can then make its execution atomic by acquiring and releasing the lock for the object that it accesses. But this fine lock granularity may have high synchronization overhead because it maximizes the number of executed acquire and release constructs. To achieve good performance it may be necessary to reduce the overhead by coarsening the granularity at which the computation locks objects. In this article we describe a static analysis technique - lock coarsening - designed to automatically increase the lock granularity in object-based programs with atomic operations. We have implemented this technique in the context of a parallelizing compiler for irregular, object-based programs and used it to improve the generated parallel code. Experiments with two automatically parallelized applications show these algorithms to be effective in reducing the lock overhead to negligible levels. The results also show, however, that an overly aggressive lock coarsening algorithm may harm the overall parallel performance by serializing sections of the parallel computation. A successful compiler must therefore negotiate a trade-off between reducing lock overhead and increasing the serialization. © 1998 Academic Press.
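The granularity trade-off described above can be illustrated with a small sketch. The paper's lock coarsening is a static compiler analysis; here the "before" and "after" forms are written out by hand, with a per-object lock in the fine-grained version and a single shared lock (a hypothetical result of coarsening) in the coarse version.

```python
import threading

class Obj:
    """Object-based system: each object carries its own lock."""
    def __init__(self, value=0):
        self.lock = threading.Lock()
        self.value = value

def update_fine(objs, delta):
    """Fine granularity: one acquire/release pair per object accessed,
    which maximizes the number of executed synchronization constructs."""
    for o in objs:
        with o.lock:
            o.value += delta

# After coarsening, a group of objects shares one lock, so the operation
# executes a single acquire/release pair -- but other computations that
# touch any object in the group are now serialized against it.
coarse_lock = threading.Lock()

def update_coarse(objs, delta):
    """Coarse granularity: one acquire/release pair for the whole group."""
    with coarse_lock:
        for o in objs:
            o.value += delta
```

The two functions produce identical results; the difference is the count of acquire/release operations versus the size of the serialized region, which is exactly the trade-off the abstract describes.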
1999
Authors
Bondalapati, K; Diniz, P; Duncan, P; Granacki, J; Hall, M; Jain, R; Ziegler, H;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
The lack of high-level design tools hampers the widespread adoption of adaptive computing systems. Application developers have to master a wide range of functions, from the high-level architecture design, to the timing of actual control and data signals. In this paper we describe DEFACTO, an end-to-end design environment aimed at bridging the gap in tools for adaptive computing by bringing together parallelizing compiler technology and synthesis techniques. © Springer-Verlag Berlin Heidelberg 1999.
1999
Authors
Diniz, PC; Rinard, MC;
Publication
Concurrency Practice and Experience
Abstract
This article describes a framework for synchronization optimizations and a set of transformations for programs that implement critical sections using mutual exclusion locks. The basic synchronization transformations take constructs that acquire and release locks and move these constructs both within and between procedures. They also eliminate acquire and release constructs that use the same lock and are adjacent in the program. The article also presents a synchronization optimization algorithm, lock elimination, that uses these transformations to reduce the synchronization overhead. This algorithm locates computations that repeatedly acquire and release the same lock, then transforms the computations so that they acquire and release the lock only once. The goal of this algorithm is to reduce the lock overhead by reducing the number of times that computations acquire and release locks. But because the algorithm also increases the sizes of the critical sections, it may decrease the amount of available concurrency. The algorithm addresses this trade-off by providing several different optimization policies. The policies differ in the amount by which they increase the sizes of the critical sections. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of the lock elimination algorithm. For three benchmark applications, the algorithm can dramatically reduce the number of times the applications acquire and release locks, which significantly reduces the amount of time processors spend acquiring and releasing locks. The resulting overall performance improvements for these benchmarks range from no observable improvement up to a 30% performance improvement.
2003
Authors
Baradaran, N; Chame, J; Chen, C; Diniz, P; Hall, M; Lee, YJ; Liu, B; Lucas, R;
Publication
Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003
Abstract
In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Due to the growing complexity of accurate performance prediction, our system incorporates empirical techniques to execute variants of code segments with representative data on the target architecture. In this paper, we discuss how empirical techniques and performance modeling can be effectively combined. We also discuss the role of historical information from prior runs, and programmer specifications supporting run-time adaptation. These techniques can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs. © 2003 IEEE.
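The empirical core of such a system, executing code variants on representative data, timing them, and reusing measurements from prior runs, can be sketched as follows. The blocking-factor parameter and the variant itself are hypothetical stand-ins for compiler-generated code segments; the real system combines these measurements with performance models.

```python
import time

def variant(block):
    """A hypothetical code variant parameterized by a blocking factor."""
    data = list(range(10000))
    total = 0
    for i in range(0, len(data), block):
        total += sum(data[i:i + block])
    return total

def empirical_search(candidates, history=None):
    """Time each candidate variant on representative data. Historical
    results from prior runs seed the search, so already-measured points
    are not re-executed."""
    history = dict(history or {})
    for block in candidates:
        if block in history:
            continue  # reuse the prior measurement
        start = time.perf_counter()
        variant(block)
        history[block] = time.perf_counter() - start
    best = min(history, key=history.get)
    return best, history
```

Returning the measurement history alongside the winner is what lets a later tuning run start from prior knowledge rather than from scratch.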
2003
Authors
Shayee, KRS; Park, J; Diniz, PC;
Publication
IEEE Symposium on FPGAs for Custom Computing Machines, Proceedings
Abstract
Digital image processing algorithms are a good match for direct implementation on FPGAs, as current FPGA architectures can naturally match the fine-grained parallelism in these applications. Typically, these algorithms are structured as a sequence of operations, expressed in high-level programming languages as tight loop nests. The loops usually define a shifting-window region over which the algorithm applies a simple localized operator (e.g., a differential gradient, or a min/max). In this research we focus on the development of fast, yet accurate performance and area modeling of complete FPGA designs that combines analytical, empirical and behavioral estimation techniques. We model the application of a set of important program transformations for image processing algorithms, namely loop unrolling, tiling, loop interchange, loop fission and array privatization, and explore pipelined and non-pipelined execution modes. We take into consideration the impact of the various transformations, in the presence of limited I/O resources such as address generators and external memory data channels, on the performance of a complete design implemented in an FPGA-based architecture. © 2003 IEEE.
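One of the modeled transformations, loop unrolling of a shifting-window operator, can be illustrated in software. This is an illustrative 1-D min-window example, not taken from the paper: on an FPGA the two unrolled iteration bodies would become parallel hardware, gaining throughput at the cost of area and extra memory ports, which is exactly the performance/area trade-off the models estimate.

```python
def window_min(image, w=3):
    """Shifting-window min operator over a 1-D signal."""
    return [min(image[i:i + w]) for i in range(len(image) - w + 1)]

def window_min_unrolled(image, w=3):
    """The same operator with the loop unrolled by a factor of 2, plus a
    remainder loop for an odd iteration count. In hardware, the two bodies
    can execute concurrently if memory bandwidth allows."""
    out = []
    n = len(image) - w + 1
    i = 0
    while i + 1 < n:
        out.append(min(image[i:i + w]))          # body, iteration i
        out.append(min(image[i + 1:i + w + 1]))  # body, iteration i + 1
        i += 2
    while i < n:                                 # remainder loop
        out.append(min(image[i:i + w]))
        i += 1
    return out
```

The unrolled variant computes exactly the same output; the transformation only changes how much work is exposed per loop iteration, which is what the area and performance models must account for.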