How fast are DMA transfers on Xilinx Zynq?

Many online sources ([1], [2], …) report various possible data transfer speeds between the Xilinx Zynq’s Processing System (PS) and Programmable Logic (PL). There are several ways to connect user FPGA logic to the Processing System (HP, GP, ACP), and each provides different throughput and properties. The theoretical throughput differs from interface to interface; the HP port’s theoretical maximum is stated to be 1200 MB/s. However, you are unlikely to reach this performance, especially on Linux. The platform has a high overall throughput when only the PL is involved in the transfers (according to our measurements, the total R/W throughput of the system over the ACP and all HP ports is ~3500 MB/s). When the PS is used to process data coming from the PL, the throughput can decrease significantly, regardless of the PL frequency. How much it decreases depends heavily on the target application and its demands.

We measure throughput using the RSoC Framework to obtain a practical figure, i.e. a throughput that you can really reach in your application and with some safety in mind (e.g. respecting the boundaries of the Linux kernel). With the standard I/O interfaces (read, write), the maximal throughput is heavily limited by the memcpy operation; the maximal speeds are about 160 MB/s. To reach faster transfers, a zero-copy API can be designed. Our zero-copy implementation (and API) can read data at more than 400 MB/s with its initial settings (smallest buffer and ring sizes). Unfortunately, for practical applications (which actually process the received data), the throughput is usually lower. Copying from an FPGA unit to stdout can reach 250 MB/s with the zero-copy API, a drop of almost one half. The most significant issue here is the software-controlled cache coherency: according to our measurements, the software coherency costs about 60 % of the system’s performance. For some applications the ACP can help; however, it is unlikely to be a general solution.

There are applications that can reach even higher throughputs than those presented above. The size of the DMA buffers is an important factor here. Unfortunately, there are not many applications that can produce such a high contiguous bandwidth; network packets, for example, are only 64-1500 B long. To make the transfers more efficient, a system page-sized buffer is much more suitable, but this requires hardware concatenation of the packets (see the sketch below).
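To illustrate the concatenation idea, the following sketch packs variable-length packets into one page-sized buffer with a two-byte length header per packet. It is only a software model of what the FPGA logic would do before the DMA transfer; the frame layout and all names are hypothetical, not part of the RSoC Framework.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    /* Hypothetical page-sized frame that concatenates several packets.
     * Each packet is prefixed with a 16-bit length so the consumer can
     * split the page back into individual packets. */
    struct page_frame {
        uint8_t data[PAGE_SIZE];
        size_t  used;              /* bytes already written into data[] */
    };

    /* Append one packet; returns 0 on success, -1 when the page is full
     * and should be handed to the DMA engine as a single buffer. */
    static int frame_append(struct page_frame *f, const uint8_t *pkt, uint16_t len)
    {
        size_t need = 2u + len;

        if (f->used + need > PAGE_SIZE)
            return -1;                      /* flush the page first */

        f->data[f->used]     = len & 0xff;  /* little-endian length header */
        f->data[f->used + 1] = len >> 8;
        memcpy(&f->data[f->used + 2], pkt, len);
        f->used += need;
        return 0;
    }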


Why hardware coherency matters…

Many users around the Internet who utilize the Xilinx Zynq for their applications ask about data transfers between the Processing System and the Programmable Logic: how to build a Linux-based application that talks to their FPGA firmware, and how to make that communication fast.

The first issue, the architecture of a Linux-based (or other OS-based) system, is solved by the RSoC Framework. The RSoC Framework provides an FPGA bridge with DMAs and the corresponding drivers, so the user application just utilizes the API and leaves the “how” question up to the framework.

One of the biggest flaws of the Xilinx Zynq (and of the ARMv7 architecture in general) is the lack of a cache-coherent interconnect for peripherals; I call it simply I/O coherency. Common x86-based computers with PCI bridges do not have to solve this issue, as I/O coherency is handled by the PCI bridge. On ARMv7-based chips, I/O coherency has to be handled explicitly by software. The Linux kernel provides its DMA API for this purpose; however, you need to perform system calls to make use of it. Another possibility is to use uncached buffers, which leads to slow processing of the data.
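To show what this explicit software coherency amounts to, here is a minimal kernel-side sketch using the Linux streaming DMA API. It is a generic illustration of the cache maintenance a driver performs around a device-to-memory transfer, not the RSoC Framework’s actual driver code; dev, buf and len are assumed to come from the driver’s context.

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Illustrative only: map a CPU buffer for a device-to-memory transfer
     * and hand the bus address to the DMA engine. The map/unmap calls are
     * where the ARM cores spend time on explicit cache maintenance. */
    static int start_rx_dma(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t bus;

        bus = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, bus))
            return -ENOMEM;

        /* ... program the DMA engine with 'bus' and wait for completion ... */

        /* Make the freshly written data visible to the CPU. */
        dma_unmap_single(dev, bus, len, DMA_FROM_DEVICE);
        return 0;
    }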

The RSoC Framework provides two kinds of APIs to the (Linux) userspace: standard I/O and zero-copy I/O. The standard I/O is based on the common read(2) and write(2) functions. Thus, all transferred data are copied between the FPGA and a kernel-space buffer, and then between the kernel-space buffer and a user-space buffer. To avoid the userspace-kernelspace copy, the zero-copy I/O API can be used. It utilizes the read(2) and write(2) system calls only to transfer information about the completed buffers (much less data to copy) and the mmap(2) system call to access the actual DMA buffers.
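Schematically, the zero-copy receive path could look as follows. This is only a sketch of the pattern described above, not the actual RSoC Framework API: the device node /dev/rsoc0, the rx_completion layout, and the mapping size are assumptions made for illustration.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical completion record delivered by read(2); the real layout
     * is defined by the framework's driver and may differ. */
    struct rx_completion {
        uint32_t offset;   /* offset of the filled buffer inside the mmap area */
        uint32_t length;   /* number of valid bytes */
    };

    int main(void)
    {
        int fd = open("/dev/rsoc0", O_RDWR);   /* hypothetical device node */
        size_t map_len = 64 * 4096;            /* assumed ring size */
        uint8_t *ring;
        struct rx_completion c;

        if (fd < 0)
            return 1;

        ring = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED)
            return 1;

        /* read(2) transfers only the small completion record; the payload
         * is accessed in place through the shared mapping (zero-copy). */
        while (read(fd, &c, sizeof(c)) == sizeof(c)) {
            fwrite(ring + c.offset, 1, c.length, stdout);
            /* a real application would hand the buffer back to the driver here */
        }

        munmap(ring, map_len);
        close(fd);
        return 0;
    }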

Let’s have a look at the following statistics based on measurements of the RSoC Framework throughput on the Xilinx Zynq. They show how the Zynq behaves with various buffer and DMA ring sizes. Moreover, they demonstrate the high overhead of the explicit I/O coherency maintenance performed by the ARM cores.

Zero-copy I/O with explicit cache control

size (B)    MB/s     pps
      64      35    578k
     128      61    506k
     256     103    423k
     512     140    288k
    1024     186    190k
    2048     215    110k
    4096     238     61k
    8192     251     32k
   16384     256     16k

You can see quite poor performance with the explicit cache-coherency control, even with the zero-copy approach. Remember, the HP interface’s theoretical throughput is 1200 MB/s!

Zero-copy I/O without cache control

size (B)    MB/s     pps
      64      37    613k
     128      67    553k
     256     115    471k
     512     176    361k
    1024     241    247k
    2048     295    151k
    4096     426    109k
    8192     546     70k
   16384     637     40k
   32768     694     22k

Well, 694 MB/s is much better. However, you need large buffers in the DMA ring, and the buffers must be uncached. These properties are not always drawbacks, but most applications cannot go with this approach. Alternatively, it is possible to utilize the ACP. The ACP cannot solve all the issues, though, as it influences the Processing System’s cache performance and shares the DRAM bandwidth with the MPCore.


RSoC Framework and the DPDK

We are happy to announce a great synergy between the RSoC Framework and DPDK. The RSoC Framework is a system providing various (high-performance or low-latency) data transfers with respect to the underlying architecture, which is usually a CPU with an FPGA. DPDK is a set of libraries and userspace drivers providing very fast access to network cards, which allows the network traffic to be processed at a very high packet rate.

As DPDK now works on the ARM platform, the RSoC Framework can serve as its backend, a reasonable solution for the data transfers. You can work with DPDK while the RSoC Framework ensures proper communication with your hardware accelerators and EMACs. The first working solution is already available; see the packet-capture video made on the Xilinx Zynq (ZedBoard).


RSoC Framework at SPS IPC Drives 2015

Meet the RSoC Framework at SPS IPC Drives 2015, Europe’s leading exhibition for electric automation, from 24 to 26 November 2015 in Nuremberg. The RSoC Framework will be presented by our local partner Ingenieurbüro Dübon in Hall 8-408.


DPDK for ARM has been merged to mainline

We are glad to announce that the DPDK community has finally accepted our ARMv7 patch set into the DPDK mainline. DPDK is a very promising technology enabling fast data transfers into userspace. We are bringing DPDK to the ARM platform to support high-performance processing there. The next step is to provide the RSoC Framework as a DPDK backend; the RSoC Framework fits into this area pretty well!


New DMA engines under the hood

We were quiet for some time while working on new cool DMA backends for the RSoC Framework. As the general principle of the RSoC Framework promises, it provides vendor-independent, stable interfaces for hardware accelerators and the related software. Accelerators can be quickly connected via simple AXI-Stream buses, and the software utilizes the well-known write(2) and read(2) system calls (and more), as the sketch below shows.
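From the application’s point of view, the standard I/O path is just an open/write/read sequence, for example as follows. The device node name is a hypothetical placeholder, not the framework’s actual node, and a real application would choose buffer sizes to match its accelerator.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char in[2048] = {0};   /* data to be processed by the accelerator */
        char out[2048];
        ssize_t n;

        /* Hypothetical character device exposed by the framework's driver. */
        int fd = open("/dev/rsoc0", O_RDWR);
        if (fd < 0)
            return 1;

        /* Send a block to the accelerator and read the processed result back;
         * each call copies the data between user and kernel space. */
        if (write(fd, in, sizeof(in)) < 0)
            return 1;
        n = read(fd, out, sizeof(out));

        close(fd);
        return n < 0 ? 1 : 0;
    }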

ARM PL330 DMA and Super DMA as new backends of the RSoC Framework

However, under the hood, we can run different units that move the data between the FPGA and the CPU (well, the memory). Until recently, we supported just a single DMA engine for both FPGA platforms, the Xilinx Zynq and the Altera SoC FPGA. But this is changing…

There are two new DMA engines inside the RSoC Framework:

  1. a slower one with very low resource consumption (utilizing the on-chip ARM PL330 DMA)
  2. a very fast, low-latency DMA (the Super DMA)

Same interfaces, different performance, and different resource consumption.


Throughput over 300 MB/s from FPGA to Linux userspace

The RSoC Framework allows reaching a throughput of over 300 MB/s (2400 Mbps) in Linux on the Xilinx Zynq.


FreeRTOS support

The RSoC Framework fully supports FreeRTOS in the form of a separate device driver. Now you can use the RSoC Framework with bare-metal, FreeRTOS, and Linux. A demonstration application will be added soon, so stay tuned…


RSoC Framework at DATE 2015

Thanks to the guys from the Brno University of Technology, the RSoC Framework will be presented at the Design, Automation & Test in Europe (DATE) conference in Grenoble, France, on 11-13 March 2015.


New version of video demonstration!

We have just updated our video demonstration. It is now possible to see one of the three modes of operation (ARM processor only, acceleration with the ARM NEON engine, or FPGA acceleration) on the ZedBoard’s integrated OLED display, as shown in this YouTube video. Try it and send us your feedback!
