A device driver for "cache-enabled" access to FPGA memory from Linux

Introduction

Disclaimer

The device driver presented in this article implements CPU data cache operations directly in arm64/arm assembly language. Data cache operations should normally go through the Linux Kernel API, but unfortunately there was nothing that could be used for this purpose (see ["How to access FPGA memory from Linux with "cache enabled"" @Qiita]).

Please note that this article only describes a small experiment I tried.

What I wanted to do

When exchanging data between the PS (Processing System) and the PL (Programmable Logic) on ZynqMP (ARM64) or Zynq (ARM), one approach is to place memory such as BRAM on the PL side and access it from the CPU in the PS. In that case, it is convenient if the following conditions are met.

  1. The CPU data cache can be enabled.
  2. The CPU data cache can be operated manually (flush or invalidate).
  3. The memory can be freely attached and detached after Linux has booted, e.g. with a Device Tree Overlay.

If you just want plain access, uio can be used. However, uio cannot enable the CPU data cache (condition 1), which is a disadvantage in terms of performance when transferring large amounts of data.

Also, with the method using /dev/mem and reserved-memory shown in ["Accessing BRAM In Linux"], the data cache can be enabled, but cache operations cannot be performed manually, so it is not suitable for exchanging data with the PL side. In addition, reserved-memory can only be specified at Linux boot time, so the memory cannot be freely attached or detached after Linux has booted.

What I did

I made a prototype device driver for accessing the memory on the PL side from Linux with the data cache enabled. Its name is uiomem, and it is published at the following URL (it is still an alpha-version prototype): https://github.com/ikwzm/uiomem/tree/v1.0.0-alpha.1

This article describes the effect of the data cache, an introduction to uiomem, and how uiomem controls the data cache.

Data cache effect

In this chapter, we will actually measure and show what effect the data cache has when accessing the memory on the PL side from the PS side.

Measurement environment

The environment used for the measurement is as follows.

The following design is implemented on the PL side. 256KByte of memory is implemented in BRAM on the PL side, and Xilinx's AXI BRAM Controller is used as the interface. The operating frequency is 100MHz. An ILA (Integrated Logic Analyzer) is connected to observe the waveforms of the AXI BRAM Controller's AXI I/F and the BRAM I/F.

Fig.1 Block diagram of PLBRAM-Ultra96


This environment is published on github.

Memory write when data cache is off

With the data cache turned off, it took 0.496 msec to write 256KByte of data to the BRAM on the PL side using memcpy(). The write speed is about 528MByte/sec.

The AXI I/F waveform at that time is as follows.

Fig.2 AXI I/F waveform of memory write when data cache is off


As you can see from the waveform, there is no burst transfer (AWLEN = 00); one word (16 bytes) is transferred at a time.
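The article does not show the measurement code itself. The following is a minimal sketch of how such a memcpy() throughput measurement could look, assuming the 256KByte BRAM area is already exposed as a device file that can be mmap()'ed (for example /dev/uiomem0 from the uiomem driver described later for the cached case, or a uio device for the uncached case):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>

#define BRAM_SIZE 0x40000    /* 256KByte of BRAM on the PL side */

int main(void)
{
	int             fd;
	void*           iomem;
	void*           buf = malloc(BRAM_SIZE);
	struct timespec t0, t1;
	double          sec;

	if (buf == NULL)
		return 1;
	memset(buf, 0xA5, BRAM_SIZE);                /* fill the source buffer */
	if ((fd = open("/dev/uiomem0", O_RDWR)) == -1)
		return 1;
	if ((iomem = mmap(NULL, BRAM_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memcpy(iomem, buf, BRAM_SIZE);               /* write 256KByte to the PL side BRAM */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("%u bytes in %.3f msec (%.1f MByte/sec)\n",
	       (unsigned int)BRAM_SIZE, sec * 1e3, BRAM_SIZE / sec / 1e6);

	munmap(iomem, BRAM_SIZE);
	close(fd);
	free(buf);
	return 0;
}

The read direction is measured the same way, with the source and destination of memcpy() swapped.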

Memory write when data cache is on

With the data cache turned on, it took 0.317 msec to write 256KByte of data to the BRAM on the PL side using memcpy(). The write speed is about 827MByte/sec.

The AXI I/F waveform at that time is as follows.

Fig.3 AXI I/F waveform of memory write when data cache is on


As you can see from the waveform, a burst transfer of 4 words (64 bytes) is performed in one write (AWLEN = 03).

A write to the BRAM does not occur at the moment the CPU writes. When the CPU writes, the data is first placed in the data cache and is not yet written to the BRAM. The BRAM is written only when a data cache flush instruction is executed manually, or when the data cache runs out of space and a cache line is evicted. At that time, writes are performed in units of the data cache line size (64 bytes on arm64).

Memory read when data cache is off

With the data cache turned off, it took 3.485 msec to read 256KByte of data from the BRAM on the PL side using memcpy(). The read speed is about 75MByte/sec.

The AXI I/F waveform at that time is as follows.

Fig.4 AXI I/F waveform of memory read when data cache is off


As you can see from the waveform, there is no burst transfer (ARLEN = 00); one word (16 bytes) is transferred at a time.

Memory read when data cache is on

With the data cache turned on, it took 0.409 msec to read 256KByte of data from the BRAM on the PL side using memcpy(). The read speed is about 641MByte/sec.

The AXI I/F waveform at that time is as follows.

Fig.5 AXI I/F waveform of memory read when data cache is on


As you can see from the waveform, a burst transfer of 4 words (64 bytes) is performed in one read (ARLEN = 03).

When the CPU reads memory and the data is not in the data cache, it reads the data from the BRAM and fills a cache line. At that time, one data cache line (64 bytes on arm64) is read from the BRAM in a single burst. After that, as long as the data remains in the data cache, it is supplied to the CPU from the cache and no access to the BRAM occurs. Therefore, memory reads are faster than with the data cache off. In this environment, the read performance improves significantly from 75MByte/sec with the data cache off to 641MByte/sec with the data cache on.

Introducing uiomem

What is uiomem

uiomem is a Linux device driver for accessing, from user space, memory areas that are not managed by the Linux kernel. uiomem provides the following functions.

  1. A device file (such as /dev/uiomem0) can be mapped into user memory space with mmap(), or the memory can be accessed from user space with read()/write().
  2. The start address and size of the memory area can be specified in the device tree, or as arguments when loading the device driver with the insmod command.

Supported platforms

Installation

Load uiomem with insmod. At this time, a memory area not managed by the Linux kernel can be specified as an argument.

shell$ sudo insmod uiomem.ko uiomem0_addr=0x0400000000 uiomem0_size=0x00040000
[  276.428346] uiomem uiomem0: driver version = 1.0.0-alpha.1
[  276.433903] uiomem uiomem0: major number   = 241
[  276.438534] uiomem uiomem0: minor number   = 0
[  276.442980] uiomem uiomem0: range address  = 0x0000000400000000
[  276.448901] uiomem uiomem0: range size     = 262144
[  276.453775] uiomem uiomem.0: driver installed.
shell$ ls -la /dev/uiomem0
crw------- 1 root root 241, 0 Aug  7 12:51 /dev/uiomem0

Settings by device tree

In addition to specifying the memory area that is not managed by the Linux kernel with insmod arguments, uiomem can take the memory area from the device tree that the Linux kernel loads at boot time. If you add the following entry to the device tree, /dev/uiomem0 will be created automatically when uiomem is loaded with insmod.

devicetree.dts


 		#address-cells = <2>;
		#size-cells = <2>;
		uiomem_plbram {
 			compatible  = "ikwzm,uiomem";
			device-name = "uiomem0";
			minor-number = <0>;
			reg = <0x04 0x00000000 0x0 0x00040000>;
		};

The memory area is indicated by the reg property. The first element of the reg property (the first two elements if #address-cells is 2) indicates the start address of the memory area. The remaining elements of the reg property (two elements if #size-cells is 2) indicate the size of the memory area in bytes. In the above example, the start address of the memory area is 0x04_0000_0000 and the size of the memory area is 0x40000.

Specify the device name in the device-name property.

Specify the minor number of uiomem in the minor-number property. Minor numbers can be from 0 to 255. However, the insmod argument takes precedence, and if the minor numbers conflict, the one specified in the device tree will fail. If the minor-number property is omitted, a free minor number will be assigned.

The device name is determined as follows.

  1. If the device-name property is specified: device-name.
  2. If device-name is omitted and minor-number is specified: sprintf("uiomem%d", minor-number).
  3. If both device-name and minor-number are omitted: the device tree entry name (uiomem_plbram in the example above).

Device file

Loading uiomem into the kernel creates a device file like the following. <device-name> is the device name described in the previous section.

/dev/<device-name>

/dev/<device-name> is used to map the memory area into user space with mmap(), or to access the memory area with read() and write().

uiomem_test.c


	int    fd;
	void*  iomem;

	/* Open the uiomem device file and map the memory area into user space */
	if ((fd = uiomem_open(uiomem, O_RDWR)) != -1) {
		iomem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
		uiomem_sync_for_cpu();       /* invalidate the CPU data cache before accessing */
		/* Process to access iomem here */
		uiomem_sync_for_device();    /* flush the CPU data cache after accessing */
		close(fd);
	}

Data cache control may be required when the memory area is mapped into user space with mmap(). The data cache is controlled through sync_for_cpu and sync_for_device, which are described later.

You can also read/write directly from the shell by specifying the device file with the dd command or the like.

shell$ dd if=/dev/urandom of=/dev/uiomem0 bs=4096 count=64
64+0 records in
64+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.00746404 s, 35.1 MB/s
shell$ dd if=/dev/uiomem0 of=random.bin bs=4096
64+0 records in
64+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.00578518 s, 45.3 MB/s
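The same access is possible from C with read()/write(). The following is a minimal sketch, assuming /dev/uiomem0 from the installation example and assuming the driver honours the file offset, as the dd example above suggests:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	unsigned char buf[4096];
	int fd = open("/dev/uiomem0", O_RDWR);

	if (fd == -1)
		return 1;

	/* Read the first 4096 bytes of the memory area */
	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		perror("pread");

	/* Write the same 4096 bytes back at offset 0x1000 */
	if (pwrite(fd, buf, sizeof(buf), 0x1000) != (ssize_t)sizeof(buf))
		perror("pwrite");

	close(fd);
	return 0;
}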

phys_addr

/sys/class/uiomem/<device-name>/phys_addr can be read to get the start address of the memory area.

size

/sys/class/uiomem/<device-name>/size can be read to get the size of the memory area.
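For example, a program can read these two attributes to find out where the memory area is and how large it is. A minimal sketch, assuming the device name uiomem0 (cat on the same sysfs files works just as well from the shell):

#include <stdio.h>
#include <stdlib.h>

/* Read one sysfs attribute of uiomem0 and convert it with strtoull()
   (base 0 accepts both decimal and 0x-prefixed hexadecimal) */
static unsigned long long read_attr(const char* name)
{
	char  path[128];
	char  buf[64] = {0};
	FILE* fp;
	snprintf(path, sizeof(path), "/sys/class/uiomem/uiomem0/%s", name);
	if ((fp = fopen(path, "r")) != NULL) {
		if (fgets(buf, sizeof(buf), fp) == NULL)
			buf[0] = '\0';
		fclose(fp);
	}
	return strtoull(buf, NULL, 0);
}

int main(void)
{
	printf("phys_addr = 0x%llx\n", read_attr("phys_addr"));
	printf("size      = %llu\n",   read_attr("size"));
	return 0;
}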

sync_direction

/sys/class/uiomem/<device-name>/sync_direction specifies the access direction used when the uiomem cache is controlled manually (0 = read/write bidirectional, 1 = write only, 2 = read only).

sync_offset

/sys/class/uiomem/<device-name>/sync_offset specifies the start of the range for manual cache control, as an offset from the beginning of the memory area.

sync_size

/sys/class/uiomem/<device-name>/sync_size specifies the size of the range for manual cache control.
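The following is a minimal sketch of setting these three attributes from C before triggering a manual cache operation, assuming the device name uiomem0 (writing the values with echo from the shell works the same way):

#include <stdio.h>

/* Write one numeric value to a sysfs attribute of uiomem0 */
static int write_attr(const char* name, unsigned long long value)
{
	char  path[128];
	FILE* fp;
	snprintf(path, sizeof(path), "/sys/class/uiomem/uiomem0/%s", name);
	if ((fp = fopen(path, "w")) == NULL)
		return -1;
	fprintf(fp, "%llu", value);
	fclose(fp);
	return 0;
}

int main(void)
{
	write_attr("sync_direction", 0);        /* 0 = read/write bidirectional */
	write_attr("sync_offset",    0x00000);  /* start of the range, offset within the memory area */
	write_attr("sync_size",      0x10000);  /* size of the range in bytes */
	return 0;
}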

sync_for_cpu

Writing a non-zero value to /sys/class/uiomem/<device-name>/sync_for_cpu invalidates the CPU data cache when the cache is controlled manually. This device file is write-only.

If you write 1 to this device file and sync_direction is 2 (= read only) or 0 (= read/write bidirectional), the CPU data cache of the range specified by /sys/class/uiomem/<device-name>/sync_offset and /sys/class/uiomem/<device-name>/sync_size is invalidated.

uiomem_test.c


void uiomem_sync_for_cpu(void)
{
	int            fd;
	char           attr[1024];
	unsigned long  sync_for_cpu = 1;
	if ((fd = open("/sys/class/uiomem/uiomem0/sync_for_cpu", O_WRONLY)) != -1) {
		sprintf(attr, "%lu", sync_for_cpu);
		write(fd, attr, strlen(attr));
		close(fd);
	}
}

The values written to this device file can include sync_offset, sync_size and sync_direction as follows:

uiomem_test.c


void uiomem_sync_for_cpu(unsigned long sync_offset, unsigned long sync_size, unsigned int sync_direction)
{
	int            fd;
	char           attr[1024];
	unsigned long  sync_for_cpu = 1;
	if ((fd = open("/sys/class/uiomem/uiomem0/sync_for_cpu", O_WRONLY)) != -1) {
		/* upper 32 bits: offset, lower 32 bits: size | direction << 2 | trigger */
		sprintf(attr, "0x%08X%08X",
		        (unsigned int)(sync_offset & 0xFFFFFFFF),
		        (unsigned int)((sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_cpu));
		write(fd, attr, strlen(attr));
		close(fd);
	}
}

The sync_offset, sync_size, and sync_direction specified in this way are temporary and do not affect the values of the device files /sys/class/uiomem/<device-name>/sync_offset, /sys/class/uiomem/<device-name>/sync_size, and /sys/class/uiomem/<device-name>/sync_direction.

Also, for format reasons, the range that can be specified with sync_offset and sync_size in this way is limited to what can be expressed in 32 bits.

sync_for_device

Writing a non-zero value to /sys/class/uiomem/<device-name>/sync_for_device flushes the CPU data cache when the cache is controlled manually. This device file is write-only.

If you write 1 to this device file and sync_direction is 1 (= write only) or 0 (= read/write bidirectional), the CPU data cache of the range specified by /sys/class/uiomem/<device-name>/sync_offset and /sys/class/uiomem/<device-name>/sync_size is flushed.

uiomem_test.c


void uiomem_sync_for_device(void)
{
	int            fd;
	char           attr[1024];
	unsigned long  sync_for_device = 1;
	if ((fd = open("/sys/class/uiomem/uiomem0/sync_for_device", O_WRONLY)) != -1) {
		sprintf(attr, "%lu", sync_for_device);
		write(fd, attr, strlen(attr));
		close(fd);
	}
}

The values written to this device file can include sync_offset, sync_size and sync_direction as follows:

uiomem_test.c


void uiomem_sync_for_device(unsigned long sync_offset, unsigned long sync_size, unsigned int sync_direction)
{
	int            fd;
	char           attr[1024];
	unsigned long  sync_for_device = 1;
	if ((fd = open("/sys/class/uiomem/uiomem0/sync_for_device", O_WRONLY)) != -1) {
		/* upper 32 bits: offset, lower 32 bits: size | direction << 2 | trigger */
		sprintf(attr, "0x%08X%08X",
		        (unsigned int)(sync_offset & 0xFFFFFFFF),
		        (unsigned int)((sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_device));
		write(fd, attr, strlen(attr));
		close(fd);
	}
}

The sync_offset, sync_size, and sync_direction specified in this way are temporary and do not affect the values of the device files /sys/class/uiomem/<device-name>/sync_offset, /sys/class/uiomem/<device-name>/sync_size, and /sys/class/uiomem/<device-name>/sync_direction.

Also, for format reasons, the range that can be specified with sync_offset and sync_size in this way is limited to what can be expressed in 32 bits.

Data cache control

If the memory on the PL side is only ever accessed from the CPU, it is enough to enable the data cache. However, enabling the data cache alone is not enough when devices other than the CPU also access the memory on the PL side, or when the memory on the PL side is attached or detached after Linux has booted. Since the contents of the data cache and of the memory on the PL side can become inconsistent, the two must be brought back into agreement in some way.
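Concretely, the typical ordering when PL-side logic also reads and writes the memory is: flush (sync_for_device) after the CPU has written and before the PL side reads, and invalidate (sync_for_cpu) before the CPU reads what the PL side has written. A minimal sketch using the uiomem_test.c helpers shown in the previous chapter; the PL-side kick itself is hypothetical and depends on the design:

#include <stddef.h>
#include <string.h>

/* Helpers shown earlier in the uiomem_test.c excerpts */
extern void uiomem_sync_for_cpu(unsigned long sync_offset, unsigned long sync_size, unsigned int sync_direction);
extern void uiomem_sync_for_device(unsigned long sync_offset, unsigned long sync_size, unsigned int sync_direction);

/* Exchange one buffer with PL-side logic through the mmap()'ed area iomem */
void exchange_with_pl(void* iomem, const void* send_data, size_t send_size,
                      void* recv_data, size_t recv_size)
{
	/* CPU writes land in the data cache first */
	memcpy(iomem, send_data, send_size);
	/* Flush so that the BRAM on the PL side actually holds the written data */
	uiomem_sync_for_device(0, send_size, 1);     /* direction 1 = write only */

	/* ... start the PL-side logic here and wait for it to finish ... */

	/* Invalidate so that the CPU does not read stale cache lines */
	uiomem_sync_for_cpu(0, recv_size, 2);        /* direction 2 = read only */
	memcpy(recv_data, iomem, recv_size);
}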

uiomem implements data cache control directly, using arm64/arm data cache instructions.

uiomem.c


#if (defined(CONFIG_ARM64))
static inline u64  arm64_read_dcache_line_size(void)
{
    u64       ctr;
    u64       dcache_line_size;
    const u64 bytes_per_word = 4;
    asm volatile ("mrs %0, ctr_el0" : "=r"(ctr) : : );
    asm volatile ("nop" : : : );
    dcache_line_size = (ctr >> 16) & 0xF;
    return (bytes_per_word << dcache_line_size);
}
static inline void arm64_inval_dcache_area(void* start, size_t size)
{
    u64   vaddr           = (u64)start;
    u64   __end           = (u64)start + size;
    u64   cache_line_size = arm64_read_dcache_line_size();
    u64   cache_line_mask = cache_line_size - 1;
    if ((__end & cache_line_mask) != 0) {
        __end &= ~cache_line_mask;
        asm volatile ("dc civac, %0" :  : "r"(__end) : );
    }
    if ((vaddr & cache_line_mask) != 0) {
        vaddr &= ~cache_line_mask;
        asm volatile ("dc civac, %0" :  : "r"(vaddr) : );
    }
    while (vaddr < __end) {
        asm volatile ("dc ivac, %0"  :  : "r"(vaddr) : );
        vaddr += cache_line_size;
    }
    asm volatile ("dsb	sy"  :  :  : );
}
static inline void arm64_clean_dcache_area(void* start, size_t size)
{
    u64   vaddr           = (u64)start;
    u64   __end           = (u64)start + size;
    u64   cache_line_size = arm64_read_dcache_line_size();
    u64   cache_line_mask = cache_line_size - 1;
    vaddr &= ~cache_line_mask;
    while (vaddr < __end) {
        asm volatile ("dc cvac, %0"  :  : "r"(vaddr) : );
        vaddr += cache_line_size;
    }
    asm volatile ("dsb	sy"  :  :  : );
}
static void arch_sync_for_cpu(void* virt_start, phys_addr_t phys_start, size_t size, enum uiomem_direction direction)
{
    if (direction != UIOMEM_WRITE_ONLY)
        arm64_inval_dcache_area(virt_start, size);
}
static void arch_sync_for_dev(void* virt_start, phys_addr_t phys_start, size_t size, enum uiomem_direction direction)
{
    if (direction == UIOMEM_READ_ONLY)
        arm64_inval_dcache_area(virt_start, size);
    else
        arm64_clean_dcache_area(virt_start, size);
}
#endif

sync_for_cpu and sync_for_device call the architecture-dependent arch_sync_for_cpu() and arch_sync_for_dev(), respectively.
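For reference, the packed value written to sync_for_cpu/sync_for_device in the earlier user-space examples (offset in the upper 32 bits; size, direction and trigger bit in the lower 32 bits) could be decoded along the following lines. This is a simplified illustration of the format only, not the actual driver code:

#include <stdint.h>

/* Decode the "0x%08X%08X" value written to sync_for_cpu / sync_for_device:
 * upper 32 bits = offset, lower 32 bits = size (16-byte aligned)
 *                 | direction << 2 | trigger bit */
struct sync_args {
	uint64_t     offset;
	uint64_t     size;
	unsigned int direction;
	int          trigger;
};

static struct sync_args decode_sync_value(uint64_t value)
{
	struct sync_args args;
	args.offset    = value >> 32;
	args.size      = value & 0xFFFFFFF0ULL;
	args.direction = (unsigned int)((value >> 2) & 0x3);
	args.trigger   = (int)(value & 0x1);
	return args;
}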

Reference

["How to access FPGA memory from Linux with "cache enabled"" @Qiita]: https://qiita.com/ikwzm/items/1580e89ecdb9cf9392eb
["Accessing BRAM In Linux"]: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842412/Accessing+BRAM+In+Linux
[uiomem v1.0.0-alpha.1]: https://github.com/ikwzm/uiomem/tree/v1.0.0-alpha.1
[ZynqMP-FPGA-Linux v2020.1.1]: https://github.com/ikwzm/ZynqMP-FPGA-Linux/tree/v2020.1.1
