It's been a long road, but the preparations are finally complete. Let's run the convolution circuit we designed on the Ultra96 V2 board!
You will need the files created last time (AI Edge Contest (Implementation Contest) Tutorial [9: HW synthesis to generate a bitstream]). As in Part 3 (AI Edge Contest (Implementation Contest) Tutorial [3: Inference Execution on the Ultra96 Board CPU]), place these files in /home/xilinx/pynq/overlays/base on the Ultra96 V2 board.
Also, place the test bench data we have used so far in /home/xilinx/data.
The notebook for controlling the hardware is in the tutorial repository (https://github.com/HirokiNakahara/FPGA_AI_2019, under Inference_PYNQ_1). Clone it and place it in the Ultra96 V2 home directory, /home/xilinx. We will load it later from the Ultra96 V2 Jupyter Notebook.
Connect to Jupyter Notebook on the Ultra96 V2 from your browser (see Part 3 for how). Click Upload to load the notebook (`ultra96v2_pynq_convolution_layer0.ipynb`) and open it. Executing the cells from the top performs setup, inference on the FPGA (slow, unfortunately), and inference on the CPU for comparison.
Below, I will focus on the main points.
ultra96v2_pynq_convolution_layer0.ipynb
from pynq import Overlay
import pynq

# Load the bitstream generated in Part 9
overlay = Overlay('/home/xilinx/pynq/overlays/base/pynq_ultra96_conv_l0_r1.bit')
dir(overlay)
PYNQ abstracts the hardware with the concept of an overlay. If you display it with `dir()`, you should see the names of the cores from the earlier IP integration (`kernel_0`, `axi_dma_0`, and so on). We access these to operate the hardware.
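Incidentally, the overlay also carries the metadata PYNQ parsed from the bitstream. Here is a minimal sketch for listing the detected IP cores using PYNQ's `ip_dict`; the exact names and addresses depend on your block design:

# List the IP cores PYNQ found in the overlay metadata,
# e.g. 'kernel_0' and 'axi_dma_0', with their base addresses.
for name, info in overlay.ip_dict.items():
    print(name, hex(info['phys_addr']))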
ultra96v2_pynq_convolution_layer0.ipynb
registers = overlay.kernel_0.register_map
You can access `register_map` like this to control your own IP core from Python. Since the interface was specified as an AXI stream with a pragma this time, operation is easy. This is wonderful (as anyone who has written AXI buses in RTL will appreciate).
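As a sketch of what this enables, assuming the standard block-level control register that Vivado HLS generates for an `s_axilite` interface (bit names may differ in your IP):

# Kick off one invocation of the HLS core by setting AP_START.
registers.CTRL.AP_START = 1
# The CTRL register also exposes AP_DONE / AP_IDLE / AP_READY status
# bits, which can be polled if you are not synchronizing via the DMA.
print(registers.CTRL)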
Setting up the DMA is just as simple:
ultra96v2_pynq_convolution_layer0.ipynb
import pynq.lib.dma
dma = overlay.axi_dma_0
The DMA controller is obtained directly from the overlay. Next, we allocate the transfer buffers:
ultra96v2_pynq_convolution_layer0.ipynb
import numpy as np
from pynq import Xlnk

inimg_size = 416*11*3    # one 11-row input stripe: X * K * CH
outfmap_size = 102*64+1  # one output row: X * CH, plus one extra word

xlnk = Xlnk()
send_buf = xlnk.cma_array(shape=(inimg_size,), dtype=np.int32)
recv_buf = xlnk.cma_array(shape=(outfmap_size,), dtype=np.int32)
Using `Xlnk()` (a wrapper around the Xilinx-provided middleware for allocating DMA-accessible contiguous memory), we secure the arrays. Easy win.
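One caveat: CMA memory is a limited physical resource, so it is worth releasing the buffers when you are done. A minimal sketch, assuming the PYNQ 2.x Xlnk API (`cma_stats()` and `freebuffer()`):

# Check how much contiguous memory remains, then release the buffers.
print(xlnk.cma_stats())
send_buf.freebuffer()
recv_buf.freebuffer()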
Transferring data to and from the hardware
ultra96v2_pynq_convolution_layer0.ipynb
%%time
for line in range(102):
    # load input image
    for i in range(11):
        inimg_buf[i] = inimg[i+line*4]
    tmp = inimg_buf.copy().transpose((2,0,1)).reshape(-1,) # CH,Y,X
    send_buf[0:inimg_size] = tmp[0:inimg_size]

    # activate DMA
    registers.CTRL.AP_START = 1

    # DMA access
    dma.sendchannel.transfer(send_buf)
    dma.recvchannel.transfer(recv_buf)

    # wait DMA
    dma.sendchannel.wait()
    dma.recvchannel.wait()

    # store output buffer
    tmp2 = recv_buf[0:outfmap_size - 1]
    tmp2 = tmp2.reshape((64,102)) # CH, X
    outfmap_buf[line] = tmp2
We copy the numpy data (a stripe of `inimg`, 11 rows per output line at stride 4) into the send buffer and set the transfer start register (`AP_START`) to 1. Then we hand the buffers to `transfer` and `wait` until the transfer, and with it the convolution processing, completes. Finally, the received data is copied into the numpy output array (`outfmap_buf`, allocated earlier in the notebook), and we're done. This repeats for each of the 102 output lines.
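The buffer sizes follow from the layer parameters: with a 416x416 input, an 11x11 kernel, and stride 4, each output row consumes a stripe of 11 input rows and produces 102 pixels across 64 channels. A quick sanity check of the arithmetic:

# Output width of a convolution: floor((in - kernel) / stride) + 1
out_w = (416 - 11) // 4 + 1
print(out_w)            # 102 (same for the output height)
print(416 * 11 * 3)     # inimg_size: one 11-row input stripe
print(102 * 64 + 1)     # outfmap_size: one output row (+1 word)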
The time is measured with `%%time`, a Jupyter Notebook cell magic. The result:
CPU times: user 22.5 s, sys: 6.85 ms, total: 22.5 s
Wall time: 22.5 s
Slow... As it turns out, the 22-second estimate from HLS was accurate... Verification follows after this; please run it and confirm that the hardware is working correctly.
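For the verification, one straightforward approach is to compare the hardware feature map against PyTorch. A sketch of how I might check it (not necessarily the notebook's exact code; it assumes `inimg` and `outfmap_buf` hold the layouts described above):

import numpy as np
import torch

# Reference convolution with the same layer-0 geometry.
# NOTE: for the comparison to be meaningful, its weights must be the
# same ones baked into the hardware (loaded from file in practice).
conv = torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, bias=False)

x = torch.from_numpy(inimg.transpose((2, 0, 1))[np.newaxis].astype(np.float32))
y_ref = conv(x).detach().numpy()[0]        # (CH, Y, X) = (64, 102, 102)

# outfmap_buf was filled as (Y, CH, X); line up the axes before comparing.
y_hw = outfmap_buf.transpose((1, 0, 2))    # -> (CH, Y, X)
print(np.abs(y_hw - y_ref).max())          # small, up to quantization error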
By the way, PyTorch should already be installed, so let's check the CPU inference time (measured with `%%time` as well).
ultra96v2_pynq_convolution_layer0.ipynb
import torch

# Layer 0 with the same geometry as the hardware design
x = torch.randn(1, 3, 416, 416)
conv = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4, bias=False)
y = conv(x)
CPU times: user 259 ms, sys: 7.96 ms, total: 267 ms
Wall time: 93.2 ms
Huh, the CPU is about 100x faster... It falls flat with a thud. (A common story in FPGA design.)
So we went through every step, from training in PyTorch → software design → hardware design → actually running on the FPGA, and the result was disappointing... I hope this conveys how high the hurdles of hardware design are, on top of deep learning itself.
We can't leave it at this, so let's work on making it faster. The series continues for a while longer.
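As a first step toward the speedup, it helps to see where the 22 seconds actually go. A minimal profiling sketch of one representative iteration, using only the standard library (the breakdown is something to measure, not a claimed result):

import time

t0 = time.perf_counter()
tmp = inimg_buf.copy().transpose((2, 0, 1)).reshape(-1,)
send_buf[0:inimg_size] = tmp[0:inimg_size]
t1 = time.perf_counter()

registers.CTRL.AP_START = 1
dma.sendchannel.transfer(send_buf)
dma.recvchannel.transfer(recv_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()
t2 = time.perf_counter()

print(f"numpy copy: {t1 - t0:.4f}s, DMA + compute: {t2 - t1:.4f}s")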