Many Xilinx Zynq users around the Internet ask about data transfers between the Processing System and the Programmable Logic: how to build a Linux-based application that talks to their FPGA firmware, and how to make that communication fast.
The first issue, the architecture of a Linux-based (or other OS-based) system, is solved by the RSoC Framework. It provides an FPGA bridge with DMAs and drivers, so the user application just uses the API and leaves the “how” question up to the framework.
One of the biggest flaws of the Xilinx Zynq (well, of the ARMv7 architectures in general) is the lack of a cache-coherent interconnect for peripherals. I simply call it I/O coherency. Common x86-based computers with PCI bridges do not have to solve this issue, as the I/O coherency is built into the PCI bridge. On ARMv7-based chips, I/O coherency has to be handled explicitly by software. The Linux kernel provides its DMA API for this purpose. However, you need to perform system calls to use it. Another possibility is to use uncached buffers, which leads to slow processing of the data.
The RSoC Framework provides two kinds of APIs to the (Linux) userspace: standard I/O and zero-copy I/O. The standard I/O is based on the common read(2) and write(2) functions. Thus, all transferred data is copied between the FPGA and a kernelspace buffer, and again between the kernelspace buffer and a userspace buffer. To avoid the userspace-kernelspace copy, the zero-copy I/O API can be used. It uses the read(2) and write(2) system calls to transfer information about the completed buffers (much less data copying) and the mmap(2) system call to access the actual DMA buffers.
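The difference between the two APIs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RSoC Framework's actual interface: the buffer size, the ring length and the two-field descriptor layout are hypothetical, and a temporary file stands in for the device node so that the sketch runs on any machine.

```python
import mmap
import os
import struct
import tempfile

BUF_SIZE = 4096              # hypothetical size of one DMA buffer in bytes
RING_LEN = 4                 # hypothetical number of buffers in the DMA ring
DESC = struct.Struct("<II")  # hypothetical descriptor: (buffer index, bytes used)


def standard_write(fd, payload):
    """Standard I/O: write(2) copies the whole payload into the kernel."""
    return os.write(fd, payload)


def zero_copy_fill(ring, index, payload):
    """Zero-copy I/O: store the payload directly in the mapped buffer.

    Returns the small descriptor that would be passed to write(2)
    instead of the data itself.
    """
    if not 0 <= index < RING_LEN or len(payload) > BUF_SIZE:
        raise ValueError("payload does not fit the selected buffer")
    start = index * BUF_SIZE
    ring[start:start + len(payload)] = payload
    return DESC.pack(index, len(payload))


def demo():
    payload = b"sample" * 100  # 600 bytes of user data
    with tempfile.TemporaryFile() as dev:  # stand-in for the device node
        fd = dev.fileno()
        os.ftruncate(fd, BUF_SIZE * RING_LEN)
        # Standard path: all 600 bytes cross the system-call boundary.
        copied = standard_write(fd, payload)
        # Zero-copy path: data goes into the mapping, only a descriptor remains.
        with mmap.mmap(fd, BUF_SIZE * RING_LEN) as ring:
            desc = zero_copy_fill(ring, 1, payload)
            stored = bytes(ring[BUF_SIZE:BUF_SIZE + len(payload)])
    return copied, len(desc), stored == payload


if __name__ == "__main__":
    print(demo())
```

Here demo() returns (600, 8, True): the standard path pushes all 600 bytes through write(2), while the zero-copy path would only need to pass an 8-byte descriptor, because the payload already sits in the mapped buffer.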
Let’s have a look at the following statistics, based on measurements of the RSoC Framework throughput on the Xilinx Zynq. They show how the Zynq behaves with certain buffer and DMA ring sizes. Moreover, they demonstrate the high overhead of the explicit I/O coherency handling performed by the ARM cores.
Zero-copy I/O with explicit cache control
You can see quite poor performance with the explicit cache-coherency control, even with the zero-copy approach. Remember, the HP interface’s theoretical throughput is 1200 MB/s!
Zero-copy I/O without cache control
Well, 694 MB/s is much better. However, you need large buffers in the DMA ring, and the buffers must be uncached. These properties are not always drawbacks, but most applications cannot go with this approach. Alternatively, it is possible to utilize the ACP. But the ACP cannot solve all the issues, as it influences the Processing System cache performance and shares the DRAM bandwidth with the MPCore.
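To put the 694 MB/s figure in perspective against the 1200 MB/s theoretical limit of the HP interface, here is a small back-of-the-envelope calculation. It assumes that 1 MB means 10^6 bytes; the function names and the 1 MB example transfer are only illustrative.

```python
HP_THEORETICAL_MBPS = 1200  # theoretical HP interface throughput (from the text)
ZERO_COPY_UNCACHED_MBPS = 694  # measured zero-copy I/O without cache control


def utilization(measured_mbps, peak_mbps=HP_THEORETICAL_MBPS):
    """Fraction of the theoretical throughput that is actually reached."""
    return measured_mbps / peak_mbps


def transfer_time_ms(n_bytes, mbps):
    """Time in milliseconds to move n_bytes at the given rate (1 MB = 10**6 B)."""
    return n_bytes / (mbps * 1e6) * 1e3


if __name__ == "__main__":
    print(f"utilization: {utilization(ZERO_COPY_UNCACHED_MBPS):.1%}")
    print(f"1 MB transfer: {transfer_time_ms(1_000_000, ZERO_COPY_UNCACHED_MBPS):.2f} ms")
```

So even the best case measured here reaches a bit under 58 % of the theoretical limit, and a 1 MB transfer takes roughly 1.44 ms at that rate.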