AI Edge Contest (Implementation Contest) Tutorial [10: Control HW with Python ..., but ...]

It has been a long road, but the preparations are finally complete. Let's run the convolution circuit we designed on the Ultra96 V2 board!

Transfer the required files to the Ultra96 V2

You will need the files created last time (AI Edge Contest (Implementation Contest) Tutorial [9: Until HW synthesis to generate bitstream]).

As in Part 3 (AI Edge Contest (Implementation Contest) Tutorial [3: Inference Execution on Ultra96 Board CPU]), place these files in /home/xilinx/pynq/overlays/base on the Ultra96 V2 board.

Also, place the test bench files we have been using in /home/xilinx/data.

The notebook for controlling the hardware is in the tutorial repository (https://github.com/HirokiNakahara/FPGA_AI_2019, in Inference_PYNQ_1). Clone it and put the notebook in the home directory /home/xilinx on the Ultra96 V2. We will load it from the Ultra96 V2 Jupyter Notebook later.

Finally, run inference on the hardware

Connect to Jupyter Notebook on the Ultra96 V2 from your browser (see Part 3 for the connection procedure). Click Upload to load the notebook (`ultra96v2_pynq_convolution_layer0.ipynb`) and open it. Executing the cells from the top performs the setup, inference on the hardware (slow, as we will see), and inference on the CPU for comparison.

Below, I will focus on the main points.

ultra96v2_pynq_convolution_layer0.ipynb


from pynq import Overlay
import pynq

overlay = Overlay('/home/xilinx/pynq/overlays/base/pynq_ultra96_conv_l0_r1.bit')
dir(overlay)

PYNQ abstracts the hardware behind the concept of an overlay. If you display the overlay (for example with dir()), you should see the names of the IP cores from the block design we connected earlier, such as kernel_0 and axi_dma_0. We access these to operate the hardware.
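A PYNQ overlay also exposes an ip_dict attribute (standard PYNQ API) if you want a listing restricted to the IP cores; a quick sketch:

# List the IP cores detected in the loaded overlay. With this design the
# output should include 'kernel_0' (the HLS convolution core) and
# 'axi_dma_0' (the DMA engine).
for name in overlay.ip_dict:
    print(name)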

ultra96v2_pynq_convolution_layer0.ipynb


registers = overlay.kernel_0.register_map

For example, you can access register_map to control your own IP core from Python. Since we specified the interface as an AXI stream with a pragma this time, operating it is easy! This is wonderful (as anyone who has ever written AXI buses by hand in RTL will appreciate).
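Printing the register map is a handy way to see what the HLS tool generated; a minimal sketch (the exact fields depend on the generated IP, but a core using the standard ap_ctrl handshake exposes AP_START, AP_DONE, AP_IDLE and AP_READY in its CTRL register):

# Dump all control registers of the HLS core and their current values
print(registers)
# CTRL holds the ap_ctrl handshake bits (AP_START, AP_DONE, ...)
print(registers.CTRL)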

Next, the DMA setup:

ultra96v2_pynq_convolution_layer0.ipynb


import pynq.lib.dma

dma = overlay.axi_dma_0

This gives us access to the DMA engine in the overlay. Next, we allocate the transfer buffers:

ultra96v2_pynq_convolution_layer0.ipynb


from pynq import Xlnk
import numpy as np

inimg_size = 416*11*3     # 11 input lines: 416 pixels x 3 channels
outfmap_size = 102*64+1   # one output line: 64 channels x 102 pixels, plus one extra word (discarded later)

xlnk = Xlnk()

# physically contiguous buffers that the DMA engine can access
send_buf = xlnk.cma_array(shape=(inimg_size,), dtype=np.int32)
recv_buf = xlnk.cma_array(shape=(outfmap_size,), dtype=np.int32)

Here the arrays are allocated via Xlnk(), a wrapper around the Xilinx-provided middleware that hands out physically contiguous (CMA) memory for DMA. Easy win.
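One caveat the notebook does not show: CMA memory is a limited resource, so it is good practice to release the buffers once you are done with them. A minimal sketch using the standard PYNQ Xlnk calls:

# free the contiguous buffers when they are no longer needed
send_buf.freebuffer()
recv_buf.freebuffer()
# or release everything allocated through this Xlnk instance at once:
# xlnk.xlnk_reset()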

Transferring data to and from the hardware

ultra96v2_pynq_convolution_layer0.ipynb


%%time
# inimg (the input image), inimg_buf and outfmap_buf are prepared in earlier cells
for line in range(102):
    # load the 11 input lines needed for this output line (stride 4)
    for i in range(11):
        inimg_buf[i] = inimg[i+line*4]

    tmp = inimg_buf.copy().transpose((2,0,1)).reshape(-1,) # CH,Y,X
    send_buf[0:inimg_size] = tmp[0:inimg_size]

    # start the convolution IP core
    registers.CTRL.AP_START = 1

    # kick off the DMA transfers
    dma.sendchannel.transfer(send_buf)
    dma.recvchannel.transfer(recv_buf)

    # wait for the DMA (and hence the convolution) to complete
    dma.sendchannel.wait()
    dma.recvchannel.wait()

    # store output buffer (the trailing extra word is dropped)
    tmp2 = recv_buf[0:outfmap_size - 1]
    tmp2 = tmp2.reshape((64,102)) # CH, X
    outfmap_buf[line] = tmp2

We copy a slice of the NumPy array (`inimg` here) into the send buffer, set the kernel's start bit (`AP_START`), hand both buffers to the DMA with `transfer`, and `wait` until the transfers (and therefore the convolution itself) complete. Finally, we copy the received data back into a NumPy array, and that line is done. This is repeated for every output line.

I measured the time with %%time, a Jupyter Notebook cell magic that reports how long the cell took to run. The result:

CPU times: user 22.5 s, sys: 6.85 ms, total: 22.5 s
Wall time: 22.5 s

Slow... So the 22-second estimate from HLS was accurate after all... Verification comes after this; please run it and confirm that the hardware is computing correctly.
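As a rough idea of what such a check can look like, here is a hypothetical sketch (not from the original notebook) that recomputes layer 0 in PyTorch and compares it with the hardware output. The name weight is an assumption for a (64, 3, 11, 11) tensor holding the same layer-0 weights the hardware uses; since the hardware computes in int32, the acceptable tolerance depends on the quantization scheme:

import numpy as np
import torch

# inimg is the (416, 416, 3) test image used above; reorder to (1, CH, Y, X)
x = torch.from_numpy(inimg.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32))
conv = torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, bias=False)
with torch.no_grad():
    conv.weight.copy_(torch.as_tensor(weight, dtype=torch.float32))
    y = conv(x)[0].numpy()            # (64, 102, 102) = (CH, Y, X)

# outfmap_buf is (line, CH, X); transpose to (CH, Y, X) before comparing
hw = outfmap_buf.transpose(1, 0, 2)
print(np.abs(hw - y).max())           # should be ~0 up to quantization/rounding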

Bonus: what is the inference time on the CPU?

By the way, you should already have PyTorch installed, so let's check the CPU inference time.

ultra96v2_pynq_convolution_layer0.ipynb


import torch

x = torch.randn(1,3,416,416)
conv = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4, bias=False)
y = conv(x)

CPU times: user 259 ms, sys: 7.96 ms, total: 267 ms
Wall time: 93.2 ms

What, the CPU is about 100 times faster... A complete flop. (This sort of thing happens all the time in FPGA design.)

So what now?

With that, we have tried every step: training in PyTorch → software design → hardware design → actually running on the FPGA, but the result was disappointing... I hope this gives you a sense of how high the hurdles of hardware design are, quite apart from deep learning itself.

We can't leave it like this, so let's work on making it faster. The series will continue for a while longer.
