2005
Authors
Lee, YJ; Diniz, PC; Hall, MW; Lucas, R;
Publication
International Journal of Parallel Programming
Abstract
This paper describes initial experiences with semi-automated performance tuning of a sparse linear solver in LS-DYNA, a large, widely used engineering application. Through a collection of tools supporting empirical optimization, we alleviate the burden of performance tuning for mapping today's sophisticated engineering software to increasingly complex hardware platforms. We describe a tool that automatically isolates code segments to create benchmark subsets for the purposes of performance tuning. We present a collection of automatically generated empirical results that demonstrate the sensitivity of the application's performance to optimization parameters. Through this case study, we demonstrate the importance of developing automatic performance tuning support for performance-sensitive applications. © 2005 Springer Science+Business Media, Inc.
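The empirical-optimization loop the abstract describes can be sketched as a simple search that times an isolated benchmark subset under each candidate parameter value and keeps the fastest. The `tune` helper and the toy benchmark below are illustrative assumptions, not part of the paper's toolset:

```python
import time

def tune(benchmark, candidates):
    """Time the benchmark for each candidate parameter value
    and return the best-performing one (empirical search)."""
    best_param, best_time = None, float("inf")
    for param in candidates:
        start = time.perf_counter()
        benchmark(param)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param

# Toy benchmark standing in for an isolated code segment:
# sum a list in chunks of a tunable size.
def toy_benchmark(chunk):
    data = list(range(100_000))
    total = 0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])
    return total

best = tune(toy_benchmark, [64, 256, 1024])
print(best)
```

In the actual system the benchmark would be an automatically extracted subset of the application, and the candidates would be compiler optimization parameters rather than a chunk size.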
2005
Authors
Diniz, P; Hall, M; Park, J; So, B; Ziegler, H;
Publication
Microprocessors and Microsystems
Abstract
The DEFACTO compilation and synthesis system is capable of automatically mapping computations expressed in high-level imperative programming languages such as C to FPGA-based systems. DEFACTO combines parallelizing compiler technology with behavioral VHDL synthesis tools to guide the application of high-level compiler transformations in the search for high-quality hardware designs. In this article we illustrate the effectiveness of this approach in automatically mapping several kernel codes to an FPGA quickly and correctly. We also present a detailed comparison of the performance of an automatically generated design against a manually generated implementation of the same computation. The design-space-exploration component of DEFACTO is able to explore a large number of designs for a particular computation that would otherwise be impractical for any designer to evaluate.
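Design-space exploration of this kind can be caricatured as enumerating transformation parameters (e.g. an unroll factor), estimating area and latency for each, and keeping the fastest design that fits the device. The cost model below is entirely assumed for illustration; it is not DEFACTO's:

```python
def explore(unroll_factors, area_budget):
    """Toy design-space exploration: estimate area and latency for
    each unroll factor, keep the fastest design within budget."""
    base_area, base_cycles = 100, 10_000   # assumed cost model
    best = None
    for u in unroll_factors:
        area = base_area * u                # area grows with parallelism
        cycles = base_cycles // u + 50 * u  # latency shrinks, overhead grows
        if area <= area_budget and (best is None or cycles < best[2]):
            best = (u, area, cycles)
    return best  # (unroll factor, area, cycles)

print(explore([1, 2, 4, 8, 16], area_budget=800))
```

A real exploration would query synthesis estimates rather than a closed-form model, which is precisely why automating the loop matters: each design point is expensive to evaluate by hand.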
2006
Authors
Baradaran, N; Diniz, PC;
Publication
Proceedings - 2006 International Conference on Field Programmable Logic and Applications, FPL
Abstract
Configurable architectures offer the unique opportunity of customizing the storage allocation to meet specific applications' needs. In this paper we describe a compiler approach to map the arrays of a loop-based computation to internal memories of a configurable architecture with the objective of minimizing the overall execution time. We present an algorithm that considers the data access patterns of the arrays along the critical path of the computation as well as the available storage and memory bandwidth. We demonstrate experimental results of the application of this approach for a set of kernel codes when targeting a Field-Programmable Gate-Array (FPGA). The results reveal that our algorithm outperforms naive and custom data layouts for these kernels by an average of 33% and 15% in terms of execution time, while taking into account the available hardware resources. © 2006 IEEE.
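A greedy variant of such an array-to-memory mapping (illustrative only; the paper's algorithm additionally models the critical path and memory bandwidth) places the most frequently accessed arrays into internal banks first, spilling the rest off-chip:

```python
def allocate(arrays, banks):
    """Greedy sketch: map arrays (name, size, access count) to
    internal memory banks by access frequency, under capacity."""
    placement = {}
    remaining = list(banks)  # free capacity (in words) per bank
    for name, size, accesses in sorted(arrays, key=lambda a: -a[2]):
        for i, free in enumerate(remaining):
            if size <= free:
                placement[name] = i
                remaining[i] -= size
                break
        else:
            placement[name] = "external"  # spill to off-chip memory
    return placement

arrays = [("A", 512, 9000), ("B", 1024, 4000), ("C", 2048, 100)]
print(allocate(arrays, [1024, 1024]))
```

Here the hot arrays `A` and `B` land in the two internal banks while the cold, oversized `C` is left in external memory, which is the intuition behind favoring data on the critical path.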
2006
Authors
Diniz, PC; Govindu, G;
Publication
Proceedings - 2006 International Conference on Field Programmable Logic and Applications, FPL
Abstract
The growth in FPGA capacity and the inclusion of embedded arithmetic cores has enabled the use of these devices for general-purpose floating-point computing. Despite their clock rate handicap with respect to contemporary general-purpose processors, these devices can be field-programmed to meet the precision requirements and operator-level parallelism of a specific computation. In this paper we describe and evaluate the performance of dual-precision, pipelined, floating-point arithmetic cores for addition, multiplication and division. Each of these arithmetic cores can be switched at run-time to perform either one double-precision operation, or, with the same hardware resources, two single-precision operations. We also implemented quad-precision cores which can be switched to perform either one quad-precision operation or two double-precision operations. As an application of these cores, we describe and evaluate the performance potential of a custom, but flexible, vector processing unit as part of a system-level architecture targeting a Xilinx Virtex-II Pro™ 100 FPGA device connected to multiple SRAM banks. ©2006 IEEE.
2006
Authors
Ziegler, HE; Malusare, PL; Diniz, PC;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
Configurable architectures, with multiple independent on-chip RAM modules, offer the unique opportunity to exploit inherent parallel memory accesses in a sequential program by not only tailoring the number and configuration of the modules in the resulting hardware design but also the accesses to them. In this paper we explore the possibility of array replication for loop computations that is beyond the reach of traditional privatization and parallelization analyses. We present a compiler analysis that identifies portions of array variables that can be temporarily replicated within the execution of a given loop iteration, enabling the concurrent execution of statements or even non-perfectly nested loops. For configurable architectures where array replication is essentially free in terms of execution time, this replication enables not only parallel execution but also reduces or even eliminates memory contention. We present preliminary experiments applying the proposed technique to hardware designs for commercially available FPGA devices. © 2006 Springer-Verlag Berlin Heidelberg.
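The effect of the replication can be illustrated in software (the paper targets hardware RAM modules, and all names below are hypothetical): each concurrent stage works on its own temporary copy of a shared array, so the stages contend for no common storage and can run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def stage(coeff, data):
    """One of two stages that only read `data`; a temporary
    replica removes all sharing between concurrent stages."""
    local = list(data)  # temporary replication of the array
    return sum(coeff * x for x in local)

data = list(range(1000))
with ThreadPoolExecutor(max_workers=2) as pool:
    r1 = pool.submit(stage, 2, data)  # runs concurrently with r2
    r2 = pool.submit(stage, 3, data)
print(r1.result(), r2.result())
```

In a hardware design the replica costs only an extra on-chip RAM module rather than copy time, which is why the paper can treat replication as essentially free.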
2006
Authors
Chame, J; Chen, C; Diniz, P; Hall, M; Lee, YJ; Lucas, RF;
Publication
20th International Parallel and Distributed Processing Symposium, IPDPS 2006
Abstract
In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. The overall approach can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs. The main focus of the paper is an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. We have developed an initial compiler implementation, and present automatically generated results on matrix multiply. Results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve performance comparable to the ATLAS self-tuning library and the hand-tuned vendor BLAS library. This paper describes other components of the ECO system, including supporting tools and experiments with programmer-guided performance tuning. This approach has provided a foundation for a general framework for systematic optimization of domain-specific applications. Specifically, we are developing an optimization system for signal and image processing that exploits signal properties, and we are exploring how machine learning and a knowledge-rich representation can be exploited to optimize molecular dynamics simulation. © 2006 IEEE.
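The paper's combination of memory-hierarchy optimization with empirical search can be sketched, for matrix multiply, as a blocked (tiled) loop nest whose tile size is chosen by timing a few candidates. The candidate set and the pure-Python kernel are assumptions for illustration; the actual system generates optimized native code:

```python
import time

def matmul_tiled(A, B, n, tile):
    """Blocked n x n matrix multiply; `tile` controls cache reuse."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 64
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]

# Guided empirical search: models would prune this candidate set,
# and measurements pick the winner on the target machine.
best_tile, best_time = None, float("inf")
for tile in (8, 16, 32):
    start = time.perf_counter()
    C = matmul_tiled(A, B, n, tile)
    elapsed = time.perf_counter() - start
    if elapsed < best_time:
        best_tile, best_time = tile, elapsed
print(best_tile)
```

The winning tile size depends on the machine's cache hierarchy, which is exactly why the paper argues for measuring candidates empirically rather than trusting a static model alone.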