Why hardware coherency matters…

Many Xilinx Zynq users around the Internet ask about data transfers between the Processing System and the Programmable Logic: how to build a Linux-based application that talks to their FPGA firmware, and how to make that communication fast.

The first issue, the architecture of a Linux-based (or other OS-based) system, is solved by the RSoC Framework. The framework provides an FPGA bridge with DMAs and drivers, so the user application just uses the API and leaves the “how” question to the framework.

One of the biggest flaws of the Xilinx Zynq (well, of ARMv7 architectures in general) is the lack of a cache-coherent interconnect for peripherals. I simply call this property I/O coherency. Common x86-based computers with PCI bridges do not have to solve this issue, as I/O coherency is built into the PCI bridge. On ARMv7-based chips, I/O coherency has to be handled explicitly by software. The Linux kernel provides its DMA API for this purpose. However, you need to perform system calls to use it from userspace. Another possibility is to use uncached buffers, which leads to slow processing of the data.

The RSoC Framework provides two kinds of APIs to the (Linux) userspace: standard I/O and zero-copy I/O. The standard I/O is based on the common read(2) and write(2) functions. Thus, all transferred data is copied twice: between the FPGA and a kernelspace buffer, and between the kernelspace buffer and a userspace buffer. To avoid the userspace-kernelspace copy, the zero-copy I/O API can be used. It uses the read(2) and write(2) system calls only to transfer information about the completed buffers (far less data to copy) and the mmap(2) system call to access the actual DMA buffers.
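To make the zero-copy flow more concrete, here is a minimal Python sketch of the userspace side. It is not the RSoC Framework API: the descriptor layout (two little-endian 32-bit integers, buffer index and payload length), the fixed buffer size, and the name `drain_completed` are assumptions made only for illustration. The idea it shows is the one described above: small records arrive through read(2), while the payload is accessed in place through the mmap(2)-ed ring.

```python
import mmap
import os
import struct

# Hypothetical descriptor: (buffer index, payload length), both uint32.
# The real framework's record format is not given in the text.
DESC = struct.Struct("<II")


def drain_completed(ctrl_fd, ring, buf_size):
    """Yield a memoryview for each completed buffer, without copying payload.

    ctrl_fd  -- file descriptor that delivers completion descriptors
    ring     -- mmap object covering the whole DMA ring
    buf_size -- size of one ring buffer in bytes (assumed fixed)
    """
    view = memoryview(ring)
    while True:
        raw = os.read(ctrl_fd, DESC.size)   # only 8 bytes cross the syscall
        if len(raw) < DESC.size:            # end of stream or short read
            break
        index, length = DESC.unpack(raw)
        start = index * buf_size
        yield view[start:start + length]    # a slice of the mapping, no copy


if __name__ == "__main__":
    # Stand-ins for the device: an anonymous mapping plays the DMA ring
    # and a pipe plays the descriptor channel.
    ring = mmap.mmap(-1, 4 * 4096)
    ring[4096:4101] = b"hello"
    rfd, wfd = os.pipe()
    os.write(wfd, DESC.pack(1, 5))
    os.close(wfd)
    for payload in drain_completed(rfd, ring, 4096):
        print(bytes(payload))
```

With a real device, the mapping and the descriptor channel would presumably come from the same device file; the sketch keeps them separate only so that it runs anywhere.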

Let’s have a look at the following statistics, based on measurements of the RSoC Framework throughput on the Xilinx Zynq. They show how the Zynq behaves with various buffer and DMA ring sizes. Moreover, they demonstrate the high overhead of the explicit I/O coherency handling performed by the ARM cores.

Zero-copy I/O with explicit cache control

size (B)  MB/s  pps
64        35    578k
128       61    506k
256       103   423k
512       140   288k
1024      186   190k
2048      215   110k
4096      238   61k
8192      251   32k
16384     256   16k

You can see quite poor performance with explicit cache-coherency control, even with the zero-copy approach. Remember, the HP interface’s theoretical throughput is 1200 MB/s!
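The numbers in the table can be cross-checked with a bit of arithmetic. The following Python sketch recomputes the throughput from the buffer size and the pps column, and expresses each measured value as a share of the 1200 MB/s quoted above. One assumption to flag: the MB/s column seems to fit size × pps / 2^20 best (that is, MiB/s), as far as the rounded figures allow; the text does not state this explicitly. The helper names are mine.

```python
MIB = 1 << 20          # assumption: the MB/s column is really MiB/s
HP_PEAK_MB_S = 1200    # theoretical HP throughput quoted in the text

# (buffer size in bytes, measured MB/s, packets per second),
# copied from the table with explicit cache control
WITH_CACHE_CONTROL = [
    (64, 35, 578_000), (128, 61, 506_000), (256, 103, 423_000),
    (512, 140, 288_000), (1024, 186, 190_000), (2048, 215, 110_000),
    (4096, 238, 61_000), (8192, 251, 32_000), (16384, 256, 16_000),
]


def throughput_from_pps(size, pps):
    """Throughput implied by the packet rate, in MiB/s."""
    return size * pps / MIB


def share_of_peak(mb_s, peak=HP_PEAK_MB_S):
    """Fraction of the theoretical HP throughput that was reached."""
    return mb_s / peak


if __name__ == "__main__":
    for size, mb_s, pps in WITH_CACHE_CONTROL:
        print(size, mb_s, round(throughput_from_pps(size, pps), 1),
              f"{share_of_peak(mb_s):.0%}")
```

Even the best row reaches only about a fifth of the theoretical figure, which is the point of the paragraph above.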

Zero-copy I/O without cache control

size (B)  MB/s  pps
64        37    613k
128       67    553k
256       115   471k
512       176   361k
1024      241   247k
2048      295   151k
4096      426   109k
8192      546   70k
16384     637   40k
32768     694   22k

Well, 694 MB/s is much better. However, this approach requires large buffers in the DMA ring, and the buffers must be uncached. These properties are not always drawbacks, but most applications cannot live with them. Alternatively, it is possible to use the ACP. But the ACP cannot solve all the issues, as it affects the Processing System cache performance and shares the DRAM bandwidth with the MPCore.
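The two tables can also be compared row by row to get a rough feeling for what the explicit cache control costs per buffer. The Python sketch below converts each throughput into time per buffer and subtracts the two. It rests on two assumptions of mine: that the MB/s values are MiB/s, and that the whole difference between the two measurements can be attributed to cache maintenance. Treat the result as an estimate, not a measurement.

```python
MIB = 1 << 20  # assumption: the MB/s columns are really MiB/s

# buffer size in bytes -> measured MB/s, copied from the two tables
WITH_CACHE_CONTROL = {
    64: 35, 128: 61, 256: 103, 512: 140, 1024: 186,
    2048: 215, 4096: 238, 8192: 251, 16384: 256,
}
WITHOUT_CACHE_CONTROL = {
    64: 37, 128: 67, 256: 115, 512: 176, 1024: 241,
    2048: 295, 4096: 426, 8192: 546, 16384: 637, 32768: 694,
}


def time_per_buffer_us(size, mb_s):
    """Microseconds spent per buffer at the given throughput."""
    return size / (mb_s * MIB) * 1e6


def cache_control_cost_us(size):
    """Estimated extra time per buffer caused by explicit cache control."""
    return (time_per_buffer_us(size, WITH_CACHE_CONTROL[size])
            - time_per_buffer_us(size, WITHOUT_CACHE_CONTROL[size]))


if __name__ == "__main__":
    # Only sizes present in both tables can be compared.
    for size in WITH_CACHE_CONTROL:
        print(size, round(cache_control_cost_us(size), 2), "us")
```

Under these assumptions the extra cost is well below a microsecond for 64 B buffers and grows to several tens of microseconds for 16 KiB buffers, which is what one would expect if the cache maintenance work scales with the amount of data touched.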
