2005
Authors
Baradaran, N; Diniz, PC;
Publication
Proceedings - 2005 IEEE International Conference on Field Programmable Technology
Abstract
Emerging computing architectures exhibit a rich variety of controllable storage resources. Allocation and management of these resources critically affect the performance of data intensive applications. In this paper we describe a synergistic collaboration between compiler data dependence analysis and execution modeling techniques to explore the application of data caching and software prefetching for hardware designs in high-level synthesis. We describe a design space exploration algorithm that selects between data caching and prefetching of array references along the critical paths of the computation with the objective of minimizing the overall execution time, while meeting the architecture's storage and bandwidth constraints. We present preliminary results of the application of the algorithm for a set of image/signal processing kernels on a commercial FPGA. The high precision of our execution model (average 94%) results in the selection of the fastest design in every case. © 2005 IEEE.
2005
Authors
Diniz, PC;
Publication
Proceedings - 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2005
Abstract
Fine-grain configurable architectures such as contemporary Field-Programmable Gate-Arrays (FPGAs) offer ample opportunities for data reuse through application-specific storage structures, making them an ideal target for memory-intensive image/signal processing computations. In this paper we explore the area and time trade-off in terms of configurable resources and overall wall-clock time of several implementation schemes that exploit opportunities for data reuse using scalar replacement in fine-grain FPGAs. The preliminary results, on a Xilinx Virtex™ FPGA device, reveal that rotation-based solutions combined with predicated accesses tend to lead to higher-quality designs. © 2005 IEEE.
2005
Authors
Diniz, PC; Liu, B;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
Fitting algorithms to meet input data characteristics and/or a changing computing environment is a tedious and error prone task. Programmers need to deal with code instrumentation details and implement the selection of which algorithm best suits a given data set. In this paper we describe a set of simple programming constructs for C that allows programmers to specify and generate applications that can select at run-time the best of several possible implementations based on measured run-time performance and/or algorithmic input values. We describe the application of this approach to a realistic linear solver for an engineering crash analysis code. The preliminary experimental results reveal that this approach provides an effective mechanism for creating sophisticated dynamic application behavior with minimal effort. © Springer-Verlag Berlin Heidelberg 2005.
1999
Authors
Hall, M; Kogge, P; Koller, J; Diniz, P; Chame, J; Draper, J; LaCoss, J; Granacki, J; Brockman, J; Srivastava, A; Athas, W; Freeh, V; Shin, J; Park, J;
Publication
ACM/IEEE SC 1999 Conference, SC 1999
Abstract
Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM interconnect; and, (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-based architectures in accelerating the performance of three irregular computations, sparse conjugate gradient, a natural-join database operation and an object-oriented database query. © 1999 IEEE.
2001
Authors
Park, J; Diniz, P;
Publication
Proceedings - 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2001
Abstract
The proposed architecture and related interfaces present several advantages over current design implementation practices. First, we provide a target-independent view of the core datapath design and, in particular, of its execution control. The simple full/empty and pop/push signal protocol for retrieving/storing data from the channels allows for a smooth integration of behavioral design specifications with the structural descriptions that interface with memory. Second, given that the components of the proposed architecture and interfaces are parameterizable, we developed several code generation functions that can be integrated as part of a compilation system. Third, decoupling the scheduling of the computation in the core datapath from the memory accesses exposes several opportunities for memory operation optimizations that are beyond the scope of current behavioral tools - memory operation pipelining and grouping [3]. The main disadvantage of the proposed approach is the overhead incurred on memory operations by the addition of an extra layer of abstraction (e.g., the stream channels) and its interfaces. While this potentially increases the latency of memory accesses, throughput and clock rates can potentially improve due to the simpler (and therefore possibly shorter) connections between the core datapath and the FIFO queues. Also, deeper conversion FIFO queues allow overlapping of computation with communication. We believe the advantages of simpler interfaces and amenability to automation outweigh the disadvantages of the additional latency. We have successfully integrated the external memory architecture presented here in the context of DEFACTO [1] for a set of simple image processing kernels without a substantial performance sacrifice. © 2001 Non IEEE.
2001
Authors
Diniz, P; Venkatachar, A;
Publication
Proceedings - 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2001
Abstract
We have briefly described a uniform estimation interface to two commercially available behavioral synthesis tools. Using this interface we have developed a simple design exploration strategy and applied it to a set of kernel computations. This experience reveals the importance of the proposed interface in making the estimation features of existing tools accessible to a wide range of users and programmers. © 2001 Non IEEE.