2021
Authors
Paulino, N; Bispo, J; Ferreira, JC; Cardoso, JMP;
Publication
IEEE MICRO
Abstract
As applications move to the edge, efficiency in computing power and power/energy consumption is required. Heterogeneous computing promises to meet these requirements through application-specific hardware accelerators. Runtime adaptivity might be of paramount importance to realize the potential of hardware specialization, but further study is required on workload retargeting and offloading to reconfigurable hardware. This article presents our framework for the exploration of both offloading and hardware generation techniques. The framework is currently able to process instruction sequences from MicroBlaze, ARMv8, and riscv32imaf binaries, and to represent them as Control and Dataflow Graphs for transformation to implementations of hardware modules. We illustrate the framework's capabilities for identifying binary sequences for hardware translation with a set of 13 benchmarks.
2021
Authors
Santos, T; Paulino, N; Bispo, J; Cardoso, JMP; Ferreira, JC;
Publication
2021 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT)
Abstract
By using Dynamic Binary Translation, instruction traces from pre-compiled applications can be offloaded, at runtime, to FPGA-based accelerators, such as Coarse-Grained Loop Accelerators, in a transparent way. However, scheduling onto coarse-grain accelerators is challenging, with two currently known issues being the density of computations that can be mapped, and the effects of memory accesses on performance. Using an in-house framework for analysis of instruction traces, we explore the effect of different window sizes when applying list scheduling, to map the window operations to a coarse-grain loop accelerator model that has been previously experimentally validated. For all window sizes, we vary the number of ALUs and memory ports available in the model, and comment on how these parameters affect the resulting latency. For a set of benchmarks taken from the PolyBench suite, compiled for the 32-bit MicroBlaze softcore, we have achieved an average iteration speedup of 5.10x for a basic block repeated 5 times and scheduled with 8 ALUs and memory ports, and an average speedup of 5.46x when not considering resource constraints. We also identify which benchmarks contribute to the difference between these two speedups, and break down their limiting factors. Finally, we reflect on the impact memory dependencies have on scheduling.
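As a rough illustration of the scheduling problem described in this abstract, the following is a minimal sketch of resource-constrained list scheduling. It is not the authors' framework: the function name, the two operation kinds ("alu" and "mem"), the unit latency for every operation, and the example window are all assumptions made for illustration.

```python
def list_schedule(ops, deps, kind, n_alus, n_mem_ports):
    """Greedy list scheduling of a window of operations (illustrative only).

    ops:   operation ids in program order (used as the priority order)
    deps:  dict mapping an op to the set of ops it depends on (must be acyclic)
    kind:  dict mapping an op to "alu" or "mem"
    Assumes every operation takes exactly one cycle.
    Returns (issue_cycle_per_op, total_latency_in_cycles).
    """
    if n_alus < 1 or n_mem_ports < 1:
        raise ValueError("at least one ALU and one memory port are required")
    issue = {}
    pending = list(ops)
    cycle = 0
    while pending:
        free = {"alu": n_alus, "mem": n_mem_ports}
        issued = []
        for op in pending:
            # An op is ready once all predecessors were issued in an earlier cycle.
            ready = all(p in issue for p in deps.get(op, ()))
            if ready and free[kind[op]] > 0:
                free[kind[op]] -= 1
                issued.append(op)
        for op in issued:
            issue[op] = cycle
            pending.remove(op)
        cycle += 1
    return issue, cycle


# Hypothetical window: two loads feed an add, whose result is stored.
ops = ["ld0", "ld1", "add", "st"]
deps = {"add": {"ld0", "ld1"}, "st": {"add"}}
kind = {"ld0": "mem", "ld1": "mem", "add": "alu", "st": "mem"}

_, latency_one_port = list_schedule(ops, deps, kind, n_alus=1, n_mem_ports=1)
_, latency_two_ports = list_schedule(ops, deps, kind, n_alus=1, n_mem_ports=2)
```

In this toy window a second memory port lets both loads issue in the same cycle, which shortens the schedule by one cycle; varying `n_alus` and `n_mem_ports` in this way mirrors, in a very simplified form, the parameter sweep the abstract mentions.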
2021
Authors
Paulino, N; Pessoa, LM; Branquinho, A; Goncalves, E;
Publication
2021 JOINT EUROPEAN CONFERENCE ON NETWORKS AND COMMUNICATIONS & 6G SUMMIT (EUCNC/6G SUMMIT)
Abstract
The recent Bluetooth 5.1 specification introduced the use of Angle-of-Arrival (AoA) information which enables the design of novel low-cost indoor positioning systems. Existing approaches rely on multiple fixed gateways equipped with antenna arrays, in order to determine the location of an arbitrary number of simple mobile omni-directional emitters. In this paper, we instead present an approach where mobile receivers are equipped with antenna arrays, and the fixed infrastructure is composed of battery-powered beacons. We implement a simulator to evaluate the solution using a real-world data set of AoA measurements. We evaluated the solution as a function of the number of beacons, their transmission period, and algorithmic parameters of the position estimation. Sub-meter accuracy is achievable using 1 beacon per 15 m² and a beacon transmission period of 500 ms.
2021
Authors
Silva, PF; Bispo, J; Paulino, N;
Publication
2021 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT)
Abstract
We discuss the concept of FPGA-unfriendliness, the property of certain algorithms, programs, or domains which may limit their applicability to FPGAs. Specifically, we look at graph analysis, which has recently seen increased interest in combination with High-Level Synthesis, but has yet to find great success compared to established acceleration mechanisms. To this end, we make use of Xilinx's Vitis Graph Library to implement Single-Source Shortest Paths (SSSP) and PageRank (PR), and present a custom kernel written from the ground up for Distinctiveness Centrality (DC, a novel graph centrality measure). We use public datasets to test these implementations, and analyse power consumption and execution time. Our comparisons against published data for GPU and CPU execution show FPGA slowdowns in execution time between around 18.5x and 328x for SSSP, and around 1.8x and 195x for PR, respectively. In some instances, we obtained FPGA speedups versus CPU of up to 2.5x for PR. Regarding DC, results show speedups from 0.1x to 3.5x, and energy efficiency increases from 0.8x to 6x. Lastly, we provide some insights regarding the applicability of FPGAs in FPGA-unfriendly domains, and comment on the future as FPGA and HLS technology advances.
2022
Authors
Sousa, LM; Paulino, N; Ferreira, JC; Bispo, J;
Publication
2022 IEEE 21ST MEDITERRANEAN ELECTROTECHNICAL CONFERENCE (IEEE MELECON 2022)
Abstract
Decision trees are often preferred when implementing Machine Learning in embedded systems for their simplicity and scalability. Hoeffding Trees are a type of Decision Tree that takes advantage of the Hoeffding Bound to learn patterns in data without having to continuously store the data samples for future reprocessing. This makes them especially suitable for deployment on embedded devices. In this work we highlight the features of an HLS implementation of the Hoeffding Tree. The implementation parameters include the feature size of the samples (D), the number of output classes (K), and the maximum number of nodes to which the tree is allowed to grow (Nd). We target a Xilinx MPSoC ZCU102, and evaluate: the design's resource requirements and clock frequency for different numbers of classes and feature sizes, the execution time on several synthetic datasets of varying sizes (N), and the execution time and accuracy for two datasets from UCI. For a problem size of D=3, K=5, and N=40000, a single decision tree operating at 103 MHz is capable of 8.3x faster inference than the 1.2 GHz ARM Cortex-A53 core. Compared to a reference implementation of the Hoeffding tree, we achieve comparable classification accuracy for the UCI datasets.
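For readers unfamiliar with the Hoeffding Bound mentioned in this abstract, the sketch below shows the standard textbook form of the bound and how it is commonly used to decide when a tree node may split. It is not taken from the paper's HLS implementation; the function names and the example values are assumptions for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the observed quantity (e.g. of the split metric)
    delta:           allowed probability that the bound does not hold
    n:               number of samples seen so far at the node
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical values: the bound tightens as more samples arrive.
eps_1k = hoeffding_bound(1.0, 1e-7, 1_000)
eps_10k = hoeffding_bound(1.0, 1e-7, 10_000)
```

Because epsilon shrinks with the sample count, the split decision can be taken from running statistics alone, which is why samples do not need to be stored for reprocessing.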
2022
Authors
Paulino, N; Pessoa, LM; Branquinho, A; Gonçalves, E;
Publication
13th International Symposium on Communication Systems, Networks and Digital Signal Processing, CSNDSP 2022, Porto, Portugal, July 20-22, 2022
Abstract
One of the applications in the realm of the Internet-of-Things (IoT) is real-time localization of assets in specific application environments where satellite-based global positioning is unviable. Numerous approaches for localization relying on wireless sensor mesh systems have been evaluated, but the recent Bluetooth Low Energy (BLE) 5.1 direction finding features based on Angle-of-Arrival (AoA) promise a low-cost solution for this application. In this paper, we present an implementation of a BLE 5.1-based circular antenna array, and perform two experimental evaluations of the quality of the retrieved data sampled from the array. Specifically, we retrieve samples of the phase value of the Constant Tone Extension which enables the direction finding functionalities through calculation of phase differences between antenna pairs. We evaluate the quality of the sampled phase data in an anechoic chamber, and in a real-world environment using a setup composed of four BLE beacons.
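The relation between a phase difference and an angle of arrival can be sketched for the simplest case of a single antenna pair under a far-field, plane-wave assumption. This is a simplification and not the paper's method, which uses a circular array; the function name, the half-wavelength spacing, and the 0.125 m wavelength (roughly 2.4 GHz) are assumptions for illustration.

```python
import math


def aoa_from_phase_difference(delta_phi, spacing_m, wavelength_m):
    """Estimate the angle of arrival (radians) for one antenna pair.

    Uses the plane-wave relation delta_phi = 2*pi*d*sin(theta)/lambda,
    where theta is measured from the broadside of the pair.
    """
    # Wrap the measured phase difference into (-pi, pi].
    wrapped = math.atan2(math.sin(delta_phi), math.cos(delta_phi))
    sin_theta = wrapped * wavelength_m / (2.0 * math.pi * spacing_m)
    # Clamp to guard against noise pushing the value outside [-1, 1].
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.asin(sin_theta)


# Hypothetical setup: half-wavelength spacing and a 90 degree phase difference.
theta = aoa_from_phase_difference(math.pi / 2, spacing_m=0.0625, wavelength_m=0.125)
theta_deg = math.degrees(theta)
```

With half-wavelength spacing the mapping from phase difference to angle is unambiguous over the full front half-plane, which is one reason that spacing is a common choice in such sketches.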