
Publications by Pedro Diniz

2003

Eliminating Synchronization Bottlenecks Using Adaptive Replication

Authors
Rinard, MC; Diniz, PC;

Publication
ACM Transactions on Programming Languages and Systems

Abstract
The use of the adaptive replication technique for the automatic elimination of synchronization bottlenecks is analyzed. Adaptive replication dynamically detects contention at each object, which eliminates unnecessary replication. The technique was implemented in the context of a parallelizing compiler for a subset of C++. The combination of lock coarsening and adaptive replication eliminated synchronization bottlenecks and also reduced the overheads associated with the synchronization and replication processes.

2003

Compiler-generated communication for pipelined FPGA applications

Authors
Ziegler, HE; Hall, MW; Diniz, PC;

Publication
Proceedings - Design Automation Conference

Abstract
In this paper, we describe a set of compiler analyses and an implementation that automatically map a sequential, unannotated C program into a pipelined implementation targeted for an FPGA with multiple external memories. For this purpose, we extend array data-flow analysis techniques from parallelizing compilers to identify pipeline stages, the required inter-stage communication, and opportunities to minimize program execution time by trading communication overhead against the amount of computation overlap between stages. Using the results of this analysis, we automatically generate application-specific pipelined FPGA hardware designs. We use a sample image processing kernel to illustrate these concepts. Our algorithm finds a solution in which transmitting a row of an array between pipeline stages per communication instance leads to a speedup of 1.76 over an implementation that communicates the entire array at once.
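
The granularity trade-off described in this abstract lends itself to a small illustration. The sketch below, in Python rather than the compiler-generated hardware, models two pipeline stages connected by a bounded FIFO and contrasts row-by-row communication (which lets the stages overlap) with sending the whole array at once. The stage functions, image size, and queue depth are illustrative assumptions, not details from the paper.

```python
import threading
import queue

def stage1(image, out_q, row_granularity=True):
    # First pipeline stage: transform the image, then transmit it.
    if row_granularity:
        for row in image:
            out_q.put([p + 1 for p in row])       # one row per communication instance
    else:
        out_q.put([[p + 1 for p in row] for row in image])  # entire array at once
    out_q.put(None)                                # end-of-stream marker

def stage2(in_q, results, row_granularity=True):
    # Second pipeline stage: consume messages as they arrive, overlapping
    # its computation with stage 1 when rows are sent individually.
    while True:
        item = in_q.get()
        if item is None:
            break
        rows = [item] if row_granularity else item
        for row in rows:
            results.append(sum(row))

image = [list(range(64)) for _ in range(64)]
fifo = queue.Queue(maxsize=4)                      # bounded, like a hardware FIFO
results = []
t1 = threading.Thread(target=stage1, args=(image, fifo))
t2 = threading.Thread(target=stage2, args=(fifo, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

With row granularity, stage 2 starts working after the first row arrives; with whole-array granularity it idles until stage 1 finishes, which is the overlap-versus-overhead trade the analysis navigates.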

2003

Using estimates from behavioral synthesis tools in compiler-directed design space exploration

Authors
So, B; Diniz, PC; Hall, MW;

Publication
Proceedings - Design Automation Conference

Abstract
This paper considers the role of performance and area estimates from behavioral synthesis in design space exploration. We have developed a compilation system that automatically maps high-level algorithms written in C to application-specific designs for Field Programmable Gate Arrays (FPGAs), through a collaboration between parallelizing compiler technology and high-level synthesis tools. Using several code transformations, the compiler optimizes a design to increase parallelism and utilization of external memory bandwidth, and selects the best design among a set of candidates. Performance and area estimates from behavioral synthesis provide feedback to the compiler to guide this selection. Estimates can be derived far more quickly (up to several orders of magnitude faster) than full synthesis and place-and-route, thus allowing the compiler to consider many more designs than would otherwise be practical. In this paper, we examine the accuracy of the estimates from behavioral synthesis as compared to the fully synthesized designs for a collection of 209 designs for five multimedia kernels. Though the estimates are not completely accurate, our results show that the same design would be selected by the design space exploration algorithm, whether we use estimates or actual results from place-and-route, because it favors smaller designs and only increases complexity when the benefit is significant.
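
The selection criterion described here (favor smaller designs, accept more area only for a significant benefit) can be illustrated with a simple greedy rule over estimated (cycles, area) pairs. The 10% speedup threshold and the candidate data below are hypothetical; the paper's actual exploration algorithm and estimates are not reproduced here.

```python
def select_design(candidates, min_speedup=1.10):
    """Pick a design from (name, estimated_cycles, estimated_area) tuples.

    Greedy sketch: start from the smallest design and move to a larger one
    only when the estimated speedup over the current choice is significant.
    """
    ordered = sorted(candidates, key=lambda c: c[2])   # smallest area first
    best = ordered[0]
    for cand in ordered[1:]:
        if best[1] / cand[1] >= min_speedup:           # significant benefit?
            best = cand
    return best

# Hypothetical estimates for three unrolled variants of one kernel.
designs = [("unrolled_x1", 12000,  900),
           ("unrolled_x2",  7000, 1600),
           ("unrolled_x4",  6600, 3100)]
print(select_design(designs))   # picks unrolled_x2: x4 gains too little for its area
```

Because the rule only compares candidates relative to each other, moderately inaccurate estimates can still yield the same choice as post-place-and-route numbers, which matches the paper's finding.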

1999

Eliminating synchronization bottlenecks in object-based programs using adaptive replication

Authors
Rinard, M; Diniz, P;

Publication
Proceedings of the International Conference on Supercomputing

Abstract
This paper presents a technique, adaptive replication, for automatically eliminating synchronization bottlenecks in multithreaded programs that perform atomic operations on objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently update the same object. It is often possible to eliminate synchronization bottlenecks by replicating objects. Each thread can then update its own local replica without synchronization and without interacting with other threads. When the computation needs to access the original object, it combines the replicas to produce the correct values in the original object. One potential problem is that eagerly replicating all objects may lead to performance degradation and excessive memory consumption. Adaptive replication eliminates unnecessary replication by dynamically measuring the amount of contention at each object to detect and replicate only those objects that would otherwise cause synchronization bottlenecks. We have implemented adaptive replication in the context of a parallelizing compiler for a subset of C++. Given an unannotated sequential program written in C++, the compiler automatically extracts the concurrency, determines when it is legal to apply adaptive replication, and generates parallel code that uses adaptive replication to efficiently eliminate synchronization bottlenecks. Our experimental results show that for our set of benchmark programs, adaptive replication can improve the overall performance by up to a factor of three over versions that use no replication, and can reduce the memory consumption by up to a factor of four over versions that fully replicate updated objects. Furthermore, adaptive replication never significantly harms the performance and never increases the memory usage without a corresponding increase in the performance.
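
A rough sketch of the adaptive-replication idea for a commutative update (a running sum) follows. Per-object contention is approximated by counting failed lock acquisitions, and once a threshold is crossed the object switches to thread-private replicas that are folded back in when the original object is read. The class name, threshold, and lock-based bookkeeping are illustrative assumptions; the compiler-generated code in the paper is not reproduced here.

```python
import threading

CONTENTION_THRESHOLD = 8   # illustrative; the paper measures contention dynamically

class AdaptiveAccumulator:
    """Minimal sketch of adaptive replication for a commutative update (a sum)."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()
        self.contention = 0              # approximate count of failed lock attempts
        self.replicated = False
        self._local = threading.local()  # holds this thread's replica
        self._all_replicas = []          # registry used when combining

    def _replica(self):
        # Lazily create and register a per-thread replica cell.
        if not hasattr(self._local, "cell"):
            cell = [0]
            self._local.cell = cell
            with self.lock:
                self._all_replicas.append(cell)
        return self._local.cell

    def add(self, delta):
        if self.replicated:
            # Hot object: update the private replica with no synchronization.
            self._replica()[0] += delta
            return
        if self.lock.acquire(blocking=False):
            try:
                self.value += delta
            finally:
                self.lock.release()
        else:
            # Contention observed: count it, replicate once it becomes frequent.
            self.contention += 1
            if self.contention >= CONTENTION_THRESHOLD:
                self.replicated = True
            with self.lock:
                self.value += delta

    def read(self):
        # Accessing the original object folds every replica back into it.
        with self.lock:
            for cell in self._all_replicas:
                self.value += cell[0]
                cell[0] = 0
        return self.value
```

Uncontended objects never replicate and pay only the cost of an uncontended lock, which is how the technique avoids the memory blow-up of eager full replication.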

1999

Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback

Authors
Diniz, PC; Rinard, MC;

Publication
ACM Transactions on Computer Systems

Abstract
This article presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
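
The alternation of sampling and production phases can be illustrated with a short driver. The sketch below times each generated version on a few work items, then runs the fastest one for a production interval before resampling; the function names and interval lengths are illustrative assumptions rather than parameters from the article.

```python
import time

def run_with_dynamic_feedback(versions, work_items,
                              sample_iters=10, production_iters=200):
    """versions: alternative functions implementing the same computation."""
    i, n = 0, len(work_items)
    while i < n:
        # Sampling phase: measure the per-item overhead of every version.
        overheads = []
        for version in versions:
            done = 0
            start = time.perf_counter()
            while done < sample_iters and i < n:
                version(work_items[i])
                i += 1
                done += 1
            overheads.append((time.perf_counter() - start) / max(done, 1))
        # Production phase: run the cheapest version, then resample so the
        # computation adapts if the environment changes.
        best = versions[overheads.index(min(overheads))]
        count = 0
        while count < production_iters and i < n:
            best(work_items[i])
            i += 1
            count += 1

# Two hypothetical "policies" for the same computation.
versions = [lambda x: sum(range(x)), lambda x: x * (x - 1) // 2]
run_with_dynamic_feedback(versions, [1000] * 5000)
```

The periodic resampling is what distinguishes this from a one-shot measurement: a version that wins early does not stay selected if conditions later favor another.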

2001

Matching and searching analysis for parallel hardware implementation on FPGAs

Authors
Moisset, P; Diniz, P; Park, J;

Publication
ACM/SIGDA International Symposium on Field Programmable Gate Arrays - FPGA

Abstract
Matching and searching computations play an important role in the indexing of data. These computations are typically encoded in very tight loops with a single index variable and a simple search/matching predicate. Their inherently sequential nature, sometimes due to data dependences but more often due to very strong control dependences, makes it impossible for existing data dependence and parallelization analyses to exploit significant levels of parallelism on traditional architectures. This paper describes a class of searching and matching computations and a strategy for mapping these computations to hardware. We have developed a compiler analysis in SUIF that uses array data dependence analysis and implicit loop unrolling to expose more parallelism for the parallel evaluation of these computations. Our compiler generates parallel hardware specifications in VHDL. The resulting parallel hardware yields significant performance improvements when these kernel operators are repeated over shifted portions of the input data on FPGA-based computing architectures.
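
To make the targeted loop shape concrete, the sketch below shows a tight first-match search and a hand-unrolled variant that evaluates the predicate at several shifted offsets per step; in the generated hardware such checks become replicated comparators operating concurrently. The unroll factor and function names are illustrative, and the Python stands in for the VHDL the compiler actually emits.

```python
def find_first_match(data, pattern):
    # Original tight loop: one index variable, one simple predicate, and a
    # strong control dependence through the early exit.
    for i in range(len(data) - len(pattern) + 1):
        if data[i:i + len(pattern)] == pattern:
            return i
    return -1

def find_first_match_unrolled(data, pattern, unroll=4):
    # Unrolled form: test the predicate at `unroll` shifted offsets per
    # iteration; in hardware these comparisons evaluate in parallel.
    n = len(data) - len(pattern) + 1
    for base in range(0, n, unroll):
        hits = [base + j for j in range(unroll)
                if base + j < n
                and data[base + j:base + j + len(pattern)] == pattern]
        if hits:
            return hits[0]   # earliest offset wins, preserving semantics
    return -1

assert find_first_match(list("abcabd"), list("abd")) == \
       find_first_match_unrolled(list("abcabd"), list("abd")) == 3
```

Returning the smallest hit within a group is what keeps the unrolled version equivalent to the sequential early-exit loop despite the parallel evaluation.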
