How fast are DMA transfers on Xilinx Zynq?

Many online sources ([1], [2], …) quote data transfer speeds between the Processing System (PS) and the Programmable Logic (PL) of the Xilinx Zynq. User FPGA logic can be connected to the Processing System in several ways (HP, GP, ACP), and each interface offers different throughput and properties. The theoretical throughput differs from interface to interface. The HP port’s theoretical maximum is stated to be 1200 MB/s. However, you are unlikely to reach this performance, especially on Linux. The platform has a high overall throughput when only the PL is involved in the transfers: according to our measurements, the total R/W throughput of the system (ACP and all HPs) is ~3500 MB/s. When the PS is involved (that is, when it processes data from the PL), the throughput can decrease significantly, regardless of the PL frequency. How much it decreases depends heavily on the target application and its demands.
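As a rough illustration of where a figure like 1200 MB/s can come from, a bus that moves one full-width beat per clock cycle peaks at width times frequency. The sketch below assumes a 64-bit data path and a 150 MHz PL clock; neither value is stated in this post, and the function name is made up for the example.

```python
def theoretical_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak throughput in MB/s of a bus moving one full-width beat per clock."""
    bytes_per_beat = bus_width_bits / 8
    # bytes/beat * 1e6 beats/s gives MB/s when 1 MB = 1e6 bytes
    return bytes_per_beat * clock_mhz


# Assumed example values (not from the post): 64-bit port, 150 MHz PL clock.
hp_peak = theoretical_mb_per_s(64, 150.0)
print(f"theoretical peak: {hp_peak:.0f} MB/s")
```

This is an upper bound only: it ignores arbitration, DDR latency, and all software overhead, which is why measured numbers end up well below it.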

We measure throughput using the RSoC Framework to obtain a practical figure, i.e. a throughput that you can really reach in your application with some safety in mind (e.g. respecting the boundaries of the Linux kernel). With the standard I/O interfaces (read, write), the maximum throughput is strongly limited by the memcpy operation: the maximum speed is about 160 MB/s. Faster transfers require a zero-copy API. Our zero-copy implementation (and API) can read data at more than 400 MB/s with its initial settings (smallest buffer and ring sizes). Unfortunately, practical applications, which also process the received data, usually see less. Copying from an FPGA unit to stdout reaches 250 MB/s with the zero-copy API, a drop of more than a third. The most significant issue here is software-controlled cache coherency. We have measured that software coherency accounts for about 60 % of the system’s performance! For some applications the ACP can help, but it is unlikely to be a general solution.
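The difference between a copying interface and a zero-copy one can be illustrated in plain user space. The sketch below is not the RSoC Framework API; it only contrasts consuming a buffer chunk by chunk with a copy per chunk (analogous to read() copying into a user buffer) against handing out views onto the same memory. All function names are hypothetical, and the numbers it prints depend entirely on the machine it runs on.

```python
import time


def copy_read(src: bytearray, chunk: int) -> int:
    """Consume src in chunks, copying every chunk before use."""
    total = 0
    for off in range(0, len(src), chunk):
        block = src[off:off + chunk]  # slicing a bytearray allocates and copies
        total += len(block)
    return total


def zero_copy_read(src: bytearray, chunk: int) -> int:
    """Consume src in chunks through views that share the original memory."""
    view = memoryview(src)
    total = 0
    for off in range(0, len(src), chunk):
        block = view[off:off + chunk]  # slicing a memoryview does not copy
        total += len(block)
    return total


def throughput_mb_s(reader, src: bytearray, chunk: int) -> float:
    """Wall-clock throughput of one pass over src, in MB/s (1 MB = 1e6 bytes)."""
    start = time.perf_counter()
    n = reader(src, chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return n / elapsed / 1e6


data = bytearray(8 * 1024 * 1024)  # 8 MiB stand-in for a DMA buffer
for name, reader in (("copy", copy_read), ("zero-copy", zero_copy_read)):
    print(f"{name:>9}: {throughput_mb_s(reader, data, 64 * 1024):.0f} MB/s")
```

On a real Zynq system the copy is only part of the cost: a zero-copy path still has to pay for the software cache maintenance described above, which this user-space model does not capture.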

Some applications can reach even higher throughput than presented here. The size of the DMA buffers is an important factor. Unfortunately, not many applications can produce such high contiguous bandwidth. For example, network packets are 64 to 1500 bytes long. For more efficient transfers, buffers the size of a system page are much more suitable, but this requires concatenating the packets in hardware.
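To get a feel for what packet concatenation buys, here is a small software model of the idea: packets are packed greedily, in arrival order, into fixed-size buffers, and each buffer then counts as one DMA transfer instead of one transfer per packet. The 4096-byte page size, the function name, and the sample packet lengths are assumptions for illustration; in a real design this logic would live in the PL.

```python
def pack_packets(lengths, buf_size=4096):
    """Greedily pack packet lengths into buffers of buf_size bytes.

    Returns a list of buffers, each a list of the packet lengths it holds.
    Packets are never split or reordered.
    """
    buffers = []
    current, used = [], 0
    for n in lengths:
        if n > buf_size:
            raise ValueError(f"packet of {n} bytes does not fit into a buffer")
        if used + n > buf_size:  # buffer full: close it and start a new one
            buffers.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        buffers.append(current)
    return buffers


# Hypothetical traffic mix within the 64 to 1500 byte range.
packets = [64, 1500, 1500, 1500, 64, 512]
packed = pack_packets(packets)
print(f"{len(packets)} packets -> {len(packed)} DMA transfers")
for buf in packed:
    print(f"  {len(buf)} packets, {sum(buf)} of 4096 bytes used")
```

The trade-off is latency: a packet may wait in the buffer until enough others arrive to fill it, so a real implementation would likely also need a flush timeout.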
