Power efficiency and performance for embedded and HPC systems with custom CGRAs
The domains of embedded systems (ES) and high-performance computing (HPC) are usually seen as distant, but some of their requirements are converging: modern ES run complex algorithms that demand high computational power, while the power consumption of ever larger HPC systems calls for new levels of power efficiency ("green computing"). Both domains are gaining importance for society as a whole and for individual citizens: complex embedded systems (including cyber-physical systems) have a tangible impact on human well-being, natural resource management (smart homes, intelligent transportation) and industrial capabilities, while HPC systems are used to extract new knowledge from the mass of information acquired by ES and to provide innovative services both to enterprises (including small and medium ones) and to individuals (ranging from traffic management to computer-aided drug discovery). ES have adopted heterogeneous architectures, often based on reconfigurable accelerators such as FPGAs or Coarse-Grained Reconfigurable Arrays (CGRAs), because they combine hardware specialization (improving performance and power efficiency) with adaptability at runtime. A similar trend towards heterogeneity is observed in HPC systems, with GPUs as accelerators: HPC stakeholders have identified several challenges related to auto-tuning, self-adapting systems and power-aware resource management, matching the trends identified by ES stakeholders on the relevance of heterogeneous accelerators. Many applications in HPC and ES have a small number of regular computational kernels that account for most of the execution time and energy consumption. Introducing reconfigurable accelerators manually requires significant design time and hardware expertise. It is vital not to compromise developer productivity by requiring manual hardware development and source code alterations.
Therefore, this project focuses on approaches that only require knowledge about the program binary code and its runtime behavior. The goal of this project is to devise efficient techniques for dynamically mapping computations extracted from execution behavior to the resources of specialized reconfigurable accelerators. The techniques will identify the hotspots of program execution at runtime. These hotspots are then optimized and mapped to CGRAs tailored to the actual set of executing kernels. Whenever a hotspot needs to be executed, the accelerator is transparently invoked. The use of specialized CGRAs reduces resource usage and improves performance. The project will apply these concepts in the ES and HPC domains. Expected achievements include significant performance improvements over CPU-only computation (at least 3x on average) while maintaining almost the same energy consumption. Alternatively, the project expects to reduce energy consumption by at least 30% by lowering the CPU clock frequency while still matching the performance of the system without the accelerator.
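As an illustration of the runtime hotspot identification step, the following minimal Python sketch counts taken backward branches in an execution trace and reports those that exceed a threshold as candidate loops. It is not the project's actual detection mechanism: the trace format (pairs of branch address and target address), the function name `find_hotspots` and the threshold value are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical trace record: (address of a taken branch, branch target address).
BranchEvent = Tuple[int, int]


def find_hotspots(trace: Iterable[BranchEvent],
                  threshold: int) -> List[Tuple[int, int, int]]:
    """Return (loop_start, loop_end, count) for each backward branch that was
    taken at least `threshold` times, ordered from hottest to coldest."""
    counts: Counter = Counter()
    for branch_addr, target_addr in trace:
        # A taken branch to a lower or equal address is treated as a loop
        # back-edge (an assumption; real binaries need more careful analysis).
        if target_addr <= branch_addr:
            counts[(target_addr, branch_addr)] += 1
    hot = [(start, end, n) for (start, end), n in counts.items() if n >= threshold]
    hot.sort(key=lambda h: h[2], reverse=True)
    return hot


# Made-up trace: one hot loop, one forward branch, one rarely taken loop.
trace = [(0x120, 0x100)] * 500 + [(0x200, 0x240)] * 50 + [(0x310, 0x300)] * 3
hotspots = find_hotspots(trace, threshold=100)
for start, end, n in hotspots:
    print(f"candidate kernel {start:#x}-{end:#x}, back-edge taken {n} times")
```

In such a scheme, the address ranges returned would be the candidates handed to the optimization and CGRA mapping steps; the threshold trades detection latency against the risk of accelerating code that is not actually hot.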