University of Texas, Dallas, United States of America
Many computationally intensive applications are accelerated on FPGAs following the stream computing paradigm, also called dataflow computing. In this paradigm, data is streamed through the different components of an application in wide, deep pipelines to maximize throughput. One of its main drawbacks is that it consumes a large amount of hardware resources.
Thus, in this work, we propose a partially runtime-reconfigurable overlay onto which any computationally intensive application composed of multiple stages, and therefore a natural fit for the stream computing paradigm, can be mapped from a behavioral description for High-Level Synthesis (HLS). The overlay stores the intermediate results of each stage in the FPGA's internal BlockRAM in order to speed up the computation, and time-multiplexes the different stages by reconfiguring the computational part.
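As a rough illustration of the time-multiplexing idea, the following Python sketch models a single reconfigurable region that is loaded with one stage at a time, while a buffer standing in for BlockRAM carries the intermediate results from one stage to the next. It is a behavioral model under stated assumptions, not the overlay itself: the function name `run_time_multiplexed`, the `reconfig_cost` parameter, and all cycle counts are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A stage maps one buffer of samples to another (hypothetical abstraction).
Stage = Callable[[List[int]], List[int]]


def run_time_multiplexed(
    stages: Sequence[Tuple[Stage, int]],
    data: List[int],
    reconfig_cost: int,
) -> Tuple[List[int], int]:
    """Run the stages one after another on a single modeled region.

    Each entry of `stages` is (stage function, cycles per input element).
    Both the cycle counts and `reconfig_cost` are assumed values.
    Returns the final buffer and the total number of modeled cycles.
    """
    buf_in = list(data)  # models the on-chip buffer read by the current stage
    total_cycles = 0
    for stage_fn, cycles_per_elem in stages:
        total_cycles += reconfig_cost                  # load this stage into the region
        total_cycles += cycles_per_elem * len(buf_in)  # process the whole buffer
        buf_out = stage_fn(buf_in)                     # results stay on chip
        buf_in = buf_out                               # output buffer feeds the next stage
    return buf_in, total_cycles


if __name__ == "__main__":
    demo_stages = [
        (lambda xs: [x - 128 for x in xs], 1),  # e.g. a level shift
        (lambda xs: [x // 2 for x in xs], 2),   # e.g. a coarse scaling step
    ]
    result, cycles = run_time_multiplexed(demo_stages, [128, 130, 132, 134], reconfig_cost=100)
    print(result, cycles)
```

In this model the reconfiguration cost is paid once per stage, so it is amortized better as the amount of data buffered between stages grows.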
This work also includes a design methodology that optimizes the micro-architectural implementation of each stage in order to balance the dataflow architecture and to generate systems with unique area vs. performance trade-offs. The proposed architecture and methodology have been prototyped on a Xilinx ZedBoard, which carries a Zynq FPGA, using a variety of synthetic dataflows, and a JPEG encoder case study is presented to highlight their benefits. The overlay will be made public and open source after the publication of this paper.
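To make the notion of area vs. performance trade-offs concrete, the sketch below enumerates one micro-architectural variant per stage and keeps only the non-dominated design points. It is a brute-force illustration, not the methodology proposed in this work. It assumes that, because stages are time-multiplexed, the area of a design is set by its largest stage and its latency is the sum of the stage latencies; the function name `pareto_designs` and all numbers are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# (area, latency) of one implementation of a stage, in arbitrary assumed units.
Variant = Tuple[int, int]


def pareto_designs(stage_variants: Sequence[Sequence[Variant]]) -> List[Tuple[int, int]]:
    """Return the non-dominated (area, latency) points over all variant choices.

    Assumption: only one stage is loaded at a time, so the area of a design
    is the maximum stage area and its latency is the sum of stage latencies.
    """
    points = set()
    for combo in product(*stage_variants):  # one variant per stage
        area = max(a for a, _ in combo)
        latency = sum(l for _, l in combo)
        points.add((area, latency))

    front: List[Tuple[int, int]] = []
    best_latency = None
    for area, latency in sorted(points):    # increasing area, then latency
        # A larger design is kept only if it is strictly faster.
        if best_latency is None or latency < best_latency:
            front.append((area, latency))
            best_latency = latency
    return front


if __name__ == "__main__":
    variants = [
        [(10, 80), (20, 40)],  # stage 0: small and slow vs. large and fast
        [(15, 60), (30, 30)],  # stage 1
    ]
    print(pareto_designs(variants))
```

Exhaustive enumeration grows exponentially with the number of stages, so it is only practical for small examples such as this one.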