AI Edge Contest (Implementation Contest) Tutorial [10: Control HW with Python ..., but ...]

It has been a long road, but the preparations are finally complete. Let's run the convolution circuit we designed on the Ultra96 V2 board!

Transfer the required files to the Ultra96 V2

You will need the files created last time (AI Edge Contest (Implementation Contest) Tutorial [9: Until HW synthesis to generate bitstream]).

As in Part 3 (AI Edge Contest (Implementation Contest) Tutorial [3: Inference Execution on Ultra96 Board CPU]), place these files in /home/xilinx/pynq/overlays/base on the Ultra96 V2 board.

Also, place the test bench files we have been using in /home/xilinx/data.

The notebook for controlling the hardware is in the tutorial repository (https://github.com/HirokiNakahara/FPGA_AI_2019, in Inference_PYNQ_1). Clone it and put the notebook in the home directory /home/xilinx on the Ultra96 V2. We will load it from the Ultra96 V2 Jupyter Notebook later.

Finally, run inference on the hardware

Connect to Jupyter Notebook on the Ultra96 V2 from your browser (see Part 3 for the connection procedure). Click Upload to load the notebook (`ultra96v2_pynq_convolution_layer0.ipynb`) and open it. Executing the cells from the top performs the setup, inference on the hardware (slow, as we will see), and inference on the CPU for comparison.

Below, I will focus on the main points.

ultra96v2_pynq_convolution_layer0.ipynb


from pynq import Overlay
import pynq

overlay = Overlay('/home/xilinx/pynq/overlays/base/pynq_ultra96_conv_l0_r1.bit')
dir(overlay)

PYNQ abstracts the hardware behind the concept of an overlay. If you display the overlay (for example with dir()), you should see the names of the IP cores from the block design we connected earlier, such as kernel_0 and axi_dma_0. We access these to operate the hardware.
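A PYNQ overlay also exposes an ip_dict attribute (standard PYNQ API) if you want a listing restricted to the IP cores; a quick sketch:

# List the IP cores detected in the loaded overlay. With this design the
# output should include 'kernel_0' (the HLS convolution core) and
# 'axi_dma_0' (the DMA engine).
for name in overlay.ip_dict:
    print(name)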

ultra96v2_pynq_convolution_layer0.ipynb


registers = overlay.kernel_0.register_map

For example, you can access register_map to control your own IP core from Python. Since we specified the interface as an AXI stream with a pragma this time, operating it is easy! This is wonderful (as anyone who has ever written AXI buses by hand in RTL will appreciate).
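Printing the register map is a handy way to see what the HLS tool generated; a minimal sketch (the exact fields depend on the generated IP, but a core using the standard ap_ctrl handshake exposes AP_START, AP_DONE, AP_IDLE and AP_READY in its CTRL register):

# Dump all control registers of the HLS core and their current values
print(registers)
# CTRL holds the ap_ctrl handshake bits (AP_START, AP_DONE, ...)
print(registers.CTRL)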

Next, the DMA setup:

ultra96v2_pynq_convolution_layer0.ipynb


import pynq.lib.dma

dma = overlay.axi_dma_0

This gives us access to the DMA engine in the overlay. Next, we allocate the transfer buffers:

ultra96v2_pynq_convolution_layer0.ipynb


from pynq import Xlnk
import numpy as np

inimg_size = 416*11*3     # 11 input lines: 416 pixels x 3 channels
outfmap_size = 102*64+1   # one output line: 64 channels x 102 pixels, plus one extra word (discarded later)

xlnk = Xlnk()

# physically contiguous buffers that the DMA engine can access
send_buf = xlnk.cma_array(shape=(inimg_size,), dtype=np.int32)
recv_buf = xlnk.cma_array(shape=(outfmap_size,), dtype=np.int32)

Here the arrays are allocated via Xlnk(), a wrapper around the Xilinx-provided middleware that hands out physically contiguous (CMA) memory for DMA. Easy win.
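One caveat the notebook does not show: CMA memory is a limited resource, so it is good practice to release the buffers once you are done with them. A minimal sketch using the standard PYNQ Xlnk calls:

# free the contiguous buffers when they are no longer needed
send_buf.freebuffer()
recv_buf.freebuffer()
# or release everything allocated through this Xlnk instance at once:
# xlnk.xlnk_reset()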

Transferring data to and from the hardware

ultra96v2_pynq_convolution_layer0.ipynb


%%time
# inimg (the input image), inimg_buf and outfmap_buf are prepared in earlier cells
for line in range(102):
    # load the 11 input lines needed for this output line (stride 4)
    for i in range(11):
        inimg_buf[i] = inimg[i+line*4]

    tmp = inimg_buf.copy().transpose((2,0,1)).reshape(-1,) # CH,Y,X
    send_buf[0:inimg_size] = tmp[0:inimg_size]

    # start the convolution IP core
    registers.CTRL.AP_START = 1

    # kick off the DMA transfers
    dma.sendchannel.transfer(send_buf)
    dma.recvchannel.transfer(recv_buf)

    # wait for the DMA (and hence the convolution) to complete
    dma.sendchannel.wait()
    dma.recvchannel.wait()

    # store output buffer (the trailing extra word is dropped)
    tmp2 = recv_buf[0:outfmap_size - 1]
    tmp2 = tmp2.reshape((64,102)) # CH, X
    outfmap_buf[line] = tmp2

We copy a slice of the NumPy array (`inimg` here) into the send buffer, set the kernel's start bit (`AP_START`), hand both buffers to the DMA with `transfer`, and `wait` until the transfers (and therefore the convolution itself) complete. Finally, we copy the received data back into a NumPy array, and that line is done. This is repeated for every output line.

I measured the time with %%time, a Jupyter Notebook cell magic that reports how long the cell took to run. The result:

CPU times: user 22.5 s, sys: 6.85 ms, total: 22.5 s
Wall time: 22.5 s

Slow... So the 22-second estimate from HLS was accurate after all... Verification comes after this; please run it and confirm that the hardware is computing correctly.
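As a rough idea of what such a check can look like, here is a hypothetical sketch (not from the original notebook) that recomputes layer 0 in PyTorch and compares it with the hardware output. The name weight is an assumption for a (64, 3, 11, 11) tensor holding the same layer-0 weights the hardware uses; since the hardware computes in int32, the acceptable tolerance depends on the quantization scheme:

import numpy as np
import torch

# inimg is the (416, 416, 3) test image used above; reorder to (1, CH, Y, X)
x = torch.from_numpy(inimg.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32))
conv = torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, bias=False)
with torch.no_grad():
    conv.weight.copy_(torch.as_tensor(weight, dtype=torch.float32))
    y = conv(x)[0].numpy()            # (64, 102, 102) = (CH, Y, X)

# outfmap_buf is (line, CH, X); transpose to (CH, Y, X) before comparing
hw = outfmap_buf.transpose(1, 0, 2)
print(np.abs(hw - y).max())           # should be ~0 up to quantization/rounding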

Bonus: what is the inference time on the CPU?

By the way, you should already have PyTorch installed, so let's check the CPU inference time.

ultra96v2_pynq_convolution_layer0.ipynb


import torch

x = torch.randn(1,3,416,416)
conv = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4, bias=False)
y = conv(x)

CPU times: user 259 ms, sys: 7.96 ms, total: 267 ms
Wall time: 93.2 ms

What, the CPU is about 100 times faster... A complete flop. (This sort of thing happens all the time in FPGA design.)

So what now?

With that, we have tried every step: training in PyTorch → software design → hardware design → actually running on the FPGA, but the result was disappointing... I hope this gives you a sense of how high the hurdles of hardware design are, quite apart from deep learning itself.

We can't leave it like this, so let's work on making it faster. The series will continue for a while longer.
