Paper

SoftCache: A Software Cache for PCIe-Attached Hardware Accelerators

Monday, June 3, 2024
17:30
-
18:00
CEST
Climate, Weather and Earth Sciences
Chemistry and Materials
Computer Science and Applied Mathematics
Humanities and Social Sciences
Engineering
Life Sciences
Physics

Description

Hardware accelerators are used to speed up computationally expensive
applications. Offloading
tasks to accelerator cards requires data to be transferred between
the memory of the host and the external memory of the accelerator
card; this data movement becomes the bottleneck for increasing
accelerator performance. Here, we explore the use
of a software cache to optimize communication and alleviate the
data-movement bottleneck by transparently exploiting locality and
data reuse. We present a generic, application-agnostic framework,
dubbed SoftCache, that can be used with GPU and FPGA accelerator
cards. SoftCache exploits locality to optimize data movement
in a non-intrusive manner (i.e., no algorithmic changes are
necessary) and allows the programmer to tune the cache size,
organization, and replacement policy to the application's needs.
Each cache line can store data of any size, thereby eliminating the
need for separate caches for different data types. We used a phylogenetic
application to showcase SoftCache. Phylogenetics studies
the evolutionary history and relationships among different species
or groups of organisms. The phylogenetic application implements
a tree-search algorithm to create and evaluate phylogenetic trees,
while hardware accelerators are used to reduce the computation
time of probability vectors at every tree node. Using SoftCache,
we observed that the total number of bytes transferred during a
complete run of the application was reduced by as much as 89%,
resulting in up to 1.7x (81% of the theoretical peak) and 3.5x (75%
of the theoretical peak) higher accelerator performance (as seen by
the application) for a GPU and an FPGA accelerator, respectively.
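The caching idea described above can be sketched in a few lines. The following is a minimal, illustrative model, not the SoftCache implementation: the class name `SoftwareCacheSketch` and the `transfer_to_device` callback are hypothetical stand-ins for the host-to-accelerator copy, the capacity is counted in bytes so entries of any size share one cache, and LRU is used as one example of a tunable replacement policy.

```python
from collections import OrderedDict

class SoftwareCacheSketch:
    """Host-side record of buffers already resident in accelerator memory.

    Each entry may hold data of any size; capacity is counted in bytes.
    The replacement policy here is LRU (the framework described above lets
    the programmer choose the policy, size, and organization).
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = OrderedDict()   # key -> size of the device-resident buffer
        self.used = 0
        self.bytes_requested = 0       # traffic a cache-less offload would incur
        self.bytes_transferred = 0     # actual host-to-device traffic

    def access(self, key, size, transfer_to_device):
        """Ensure `key` is resident on the device; copy over PCIe only on a miss."""
        self.bytes_requested += size
        if key in self.entries:                 # hit: no PCIe transfer needed
            self.entries.move_to_end(key)       # mark as most recently used
            return
        # Miss: evict least-recently-used entries until the new buffer fits.
        while self.used + size > self.capacity and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        transfer_to_device(key, size)           # hypothetical copy callback
        self.bytes_transferred += size
        self.entries[key] = size
        self.used += size
```

Because the cache sits between the application and the transfer layer, the offloading code is unchanged (the non-intrusive property claimed above): repeated accesses to the same buffer simply skip the copy, which is where the reduction in total bytes transferred comes from.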

Authors