2008
Authors
Demertzi, M; Diniz, PC; Hall, MW; Gilbert, AC; Wang, Y;
Publication
2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8
Abstract
This paper evaluates the potential of exploiting computation reuse in a signal recognition system that is jointly optimized across mathematical representation, algorithm design, and final implementation. Walsh wavelet packets in conjunction with a BestBasis algorithm are used to derive transforms that discriminate between signals. The FPGA implementation of this computation exploits the structure of the resulting transform matrices in several ways to derive a highly optimized hardware representation of this signal recognition problem. Specifically, we observe in the transform matrices a significant amount of reuse of subrows, which indicates redundant computation. Through analysis of this reuse, we discover the potential for a 3X reduction in the computation required to combine a transform matrix and a signal. In this paper, we focus on how the implementation might exploit this reuse in a profitable way. By exploiting a subset of this computation reuse, the system can navigate the tradeoff between reduced computation and the extra storage required.
2012
Authors
Hukerikar, S; Diniz, PC; Lucas, RF;
Publication
Proceedings of the International Conference on Dependable Systems and Networks
Abstract
System resilience is an important challenge that needs to be addressed in the era of extreme scale computing. Exascale supercomputers will be architected using millions of processor cores and memory modules. As process technology scales, the reliability of such systems will be challenged by the inherent unreliability of individual components due to extremely small transistor geometries, variability in silicon manufacturing processes, device aging, etc. Therefore, errors and failures in extreme scale systems will increasingly be the norm rather than the exception. Not all errors detected warrant catastrophic system failure, but there are presently no mechanisms for the programmer to communicate their knowledge of algorithmic fault tolerance to the system. We present a programming model approach for system resilience that allows programmers to explicitly express their fault tolerance knowledge. We propose novel resilience oriented programming model extensions and programming directives, and illustrate their effectiveness. An inference engine leverages this information and combines it with runtime gathered context to increase the dependability of HPC systems. © 2012 IEEE.
2012
Authors
Hukerikar, S; Diniz, PC; Lucas, RF;
Publication
Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
Abstract
System resilience is a key challenge to building extreme scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience in which we propose a set of programming model extensions and develop a runtime inference framework that can reason about the context and significance of faults, as they occur, relative to the application programmer's fault tolerance expectations. Using a set of accelerated fault injection experiments, we demonstrate the validity of our approach on a set of real scientific and engineering codes. Our experiments show that a cross-layer approach that explicitly engages the programmer in expressing fault tolerance knowledge, which is then leveraged across the layers of system abstraction, can significantly improve the dependability of long running HPC applications. © 2012 IEEE.
2012
Authors
Abramson, J; Diniz, PC;
Publication
2012 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2012
Abstract
VLIW architectures are seeing increased deployment in a number of hostile environments. In addition, softcore VLIW architectures, which allow for run-time customization of the VLIW datapath, are becoming viable for a number of safety-critical applications. As error and failure rates rise, these applications elicit a need for automated and resilient architecture configuration tools. To mitigate these issues, this paper presents a Resiliency-aware Scheduling (RaS) approach to the configuration of a custom VLIW architecture, providing computational resilience via software duplication. The automated RaS tool determines the optimal set of resources needed to provide a given level of resilience for a reconfigurable softcore VLIW architecture. For a sample case study, based on a common physics code kernel, the RaS approach is compared to traditional hardware (TMR) and software (source-level code replication) approaches. Results show that a RaS-generated architecture configuration can potentially require up to 50% fewer functional units than a TMR-hardened machine of similar performance, and can potentially improve performance by up to 40% over source-level software approaches. © 2012 IEEE.
2012
Authors
Abramson, J; Diniz, PC;
Publication
FPT 2012 - 2012 International Conference on Field-Programmable Technology
Abstract
The number of configurable systems deployed in hostile environments continues to rise. This, along with decreasing geometries and lower operating voltages, leads to an expected increase in transient errors. This paper presents Resiliency-aware Scheduling (RaS), a novel approach to resource allocation for hardening computations on configurable systems. Using modular and replicated functional units called hybrid TMR that exploit a computation's Intrinsic Resiliency, our results show that for designs with similar performance, RaS exhibits a 60% area savings over a traditional TMR configuration with the same operation coverage. © 2012 IEEE.
2012
Authors
Abramson, J; Diniz, PC;
Publication
Proceedings - 22nd International Conference on Field Programmable Logic and Applications, FPL 2012
Abstract
Hostile environments, shrinking feature sizes and processor aging elicit a need for resilient computing. Coarse-grained hardware approaches, such as Triple Modular Redundancy (TMR) and Temporal Redundancy (TR), while exhibiting acceptable levels of fault coverage [1], are often wasteful of resources such as time, device/chip area and power. A TMR-hardened computation can exhibit poor performance relative to a non-TMR hardware configuration with similar area. This is because the resources that are used to replicate functional units in parallel (in the case of TMR) can only execute one operation at a time. Conversely, in an equivalent non-TMR configuration, those same resources could execute three different operations concurrently (albeit with no resiliency coverage). In short, TMR is very rigid in its allocation of resources, using them only for resiliency. © 2012 IEEE.