[PYTHON] A document on approaches to embedded development on ARM

I found a presentation [1] on how to approach embedded development on ARM, so I am making a note of it here for reference.

The presentation "OpenCV for Embedded: Lessons Learned" is concisely organized in 15 pages. Page 6 briefly describes the following steps.

・ Prototyping (x86)
・ Porting regression tests
・ Profiling performance tests
・ Bottleneck optimization
・ Fine Tuning
・ Production

For comparison, here is what each step looks like in my case, using Python.

・ Prototyping (x86) Use OpenCV + Python + NumPy + matplotlib + scikit-learn, etc. For image recognition, this is the stage for approaches that are only possible once you understand the algorithm, such as changing the resolution, downscaling by simple resampling instead of area-based reduction, or restricting the region that is processed. Trial and error is easy in Python, so I appreciate that the data structures are easy to understand even for an algorithm I am using for the first time. (Every object can be printed, every data member can be referenced, and numerical data can be checked with matplotlib, which has a rich set of plotting functions similar to MATLAB.)
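As a rough illustration of the kind of experiment described above, the sketch below compares two ways of halving the resolution using NumPy only. The function names and the synthetic 8x8 image are assumptions made for this example; a real prototype would more likely load an image and call OpenCV.

```python
import numpy as np

def shrink_by_resampling(image, factor):
    """Keep every `factor`-th pixel (cheap, but may alias)."""
    return image[::factor, ::factor]

def shrink_by_area(image, factor):
    """Average each factor x factor block (smoother, more work per pixel)."""
    h, w = image.shape
    h_crop, w_crop = h - h % factor, w - w % factor
    blocks = image[:h_crop, :w_crop].reshape(
        h_crop // factor, factor, w_crop // factor, factor
    )
    return blocks.mean(axis=(1, 3))

# Synthetic 8x8 grayscale gradient standing in for a real image.
image = np.arange(64, dtype=np.float64).reshape(8, 8)
resampled = shrink_by_resampling(image, 2)
averaged = shrink_by_area(image, 2)

# Everything can simply be printed and inspected while prototyping.
print(resampled.shape, averaged.shape)
print(resampled)
print(averaged)
```

Printing both results side by side is usually enough to judge whether the cheaper resampling is acceptable for the algorithm at hand.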

In addition, the software design is reviewed at this prototyping stage and split into a structure that is easy to maintain. Write documentation comments so that a third party can see what each function, class, or method does.

Of course, version control starts at this stage. Create an SVN repository on your local hard disk and keep your own work under version control. Use a Unicode character encoding, on the assumption that the code will be ported to another OS.
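A documentation comment in Python is normally written as a docstring. The function below is a hypothetical example, not taken from the presentation; it only shows the general shape.

```python
def normalize(values):
    """Scale a sequence of numbers linearly into the range [0.0, 1.0].

    Args:
        values: non-empty sequence of numbers.

    Returns:
        A list of floats; all zeros if every input value is equal.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# The docstring is available at run time, for example through help().
print(normalize.__doc__)
```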

・ Porting Even after moving to Linux (ARM), use the same stack: OpenCV + Python + NumPy + matplotlib + scikit-learn, etc. These libraries can be installed with apt-get install followed by the library name.

Regression test: Write the tests with unittest. To compare how closely numerical data match, it is better to use numpy.testing.assert_almost_equal(actual, desired[, ...]), which NumPy provides. Since unittest follows the same concept as CppUnit, it should be easy to port the tests to CppUnit in C++ later.
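A minimal sketch of such a regression test follows. The function to_gray and its weights are hypothetical stand-ins for the code being ported; only unittest and numpy.testing.assert_almost_equal come from the text above.

```python
import unittest

import numpy as np
from numpy.testing import assert_almost_equal

def to_gray(rgb):
    """Hypothetical function under test: weighted sum of the R, G, B channels."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights

class ToGrayRegressionTest(unittest.TestCase):
    def test_matches_reference(self):
        # One white pixel and one black pixel.
        rgb = np.array([[[255.0, 255.0, 255.0], [0.0, 0.0, 0.0]]])
        expected = np.array([[255.0, 0.0]])
        # Compare to 6 decimal places instead of requiring exact equality,
        # because floating-point results can differ slightly between CPUs.
        assert_almost_equal(to_gray(rgb), expected, decimal=6)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToGrayRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the same test file on the x86 prototype and on the ARM target gives a quick check that the port still produces matching numbers.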

・ Profiling
Performance test: The standard library includes profilers (cProfile and profile). These unit tests and performance tests can share the same source code regardless of OS or CPU type.
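As a small sketch of profiling with the standard library, the example below uses cProfile and pstats. The function slow_sum_of_squares is a made-up workload chosen only so there is something to measure.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately plain Python loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
value = slow_sum_of_squares(100000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Because only the standard library is involved, the same script can be run unchanged on the x86 prototype and on the ARM board to compare where the time goes.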

・ Bottleneck optimization The approach depends on whether the target CPU has a single core or multiple cores. Embedded CPUs now have multiple cores too, so on a Raspberry Pi 2 and similar boards it should be possible to speed things up by following the article in [2]. Even with an ARM dual-core + FPGA configuration such as Zynq, the code can be rewritten to take advantage of the multiple cores. The right approach depends on what the bottleneck is.
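The sketch below shows one way to spread work over several cores. Note that it uses the standard multiprocessing module rather than the Parallel Python library discussed in [2], and that the worker function and the synthetic image are assumptions made for illustration.

```python
import multiprocessing

def count_bright_pixels(rows):
    """Worker: count values above a threshold in one chunk of rows."""
    return sum(1 for row in rows for value in row if value > 128)

def split_into_chunks(rows, n_chunks):
    """Split a list of rows into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def count_parallel(rows, n_workers):
    """Count bright pixels with one chunk of rows per worker process."""
    chunks = split_into_chunks(rows, n_workers)
    with multiprocessing.Pool(processes=n_workers) as pool:
        return sum(pool.map(count_bright_pixels, chunks))

if __name__ == "__main__":
    # Synthetic 64x64 image as a list of rows.
    image = [[(x * y) % 256 for x in range(64)] for y in range(64)]
    workers = multiprocessing.cpu_count()
    total = count_parallel(image, workers)
    print(total, "bright pixels using", workers, "processes")
```

Whether this helps depends on the bottleneck: starting processes and copying data between them has a cost, so small workloads can end up slower than the single-core version.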

・ Fine Tuning In Python, overall performance depends heavily on the extension modules written in C++ (pyd files). As I understand it, performance differs significantly depending on whether the build settings used to create the pyd file are optimal for the target CPU. At my level, the improvements made at the prototyping stage came from reducing the amount of data processed; I did not get as far as optimizing the parts written in C++. On embedded ARM, the key seems to be whether you can make good use of NEON instructions. To learn how to use NEON instructions, reading the source code of the OpenCV modules seems to be a good approach.

Look for the code enclosed between #if CV_NEON and #endif in files such as c:\opencv\sources\modules\imgproc\src*.cpp, and imitate it.

When Python code is CPU-bound, Cython can improve its speed. For image processing and recognition, execution speed differs significantly depending on how efficiently NumPy is used. If you rebuild NumPy against a library that accelerates linear algebra, such as MKL, it runs faster even though the Python code is unchanged.
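One way to check how an installed OpenCV binding was built is cv2.getBuildInformation(), which returns the build configuration as text. The sketch below simply filters that text for lines mentioning NEON. The helper name is an assumption, and the exact wording of the build report varies between OpenCV versions, so treat the output as a hint only.

```python
def lines_mentioning(text, keyword):
    """Return the stripped lines of `text` that contain `keyword`."""
    return [line.strip() for line in text.splitlines() if keyword in line]

try:
    import cv2  # only available if the OpenCV Python binding is installed
except ImportError:
    cv2 = None

if cv2 is not None:
    build_info = cv2.getBuildInformation()
    for line in lines_mentioning(build_info, "NEON"):
        print(line)
else:
    print("OpenCV (cv2) is not installed; nothing to inspect.")
```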
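To give a feel for how much the way NumPy is used matters, the sketch below times the same thresholding operation written as a Python loop and as a single vectorized expression. The function names, image size, and threshold are assumptions chosen for the example, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

def threshold_loop(image, level):
    """Pixel-by-pixel Python loop: simple, but slow."""
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            if image[y, x] > level:
                out[y, x] = 255
    return out

def threshold_vectorized(image, level):
    """Same operation expressed as one NumPy array operation."""
    return np.where(image > level, 255, 0).astype(image.dtype)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 160), dtype=np.uint8)

same = np.array_equal(threshold_loop(image, 128), threshold_vectorized(image, 128))
t_loop = timeit.timeit(lambda: threshold_loop(image, 128), number=3)
t_vec = timeit.timeit(lambda: threshold_vectorized(image, 128), number=3)
print("same result:", same)
print("loop: %.4f s, vectorized: %.4f s" % (t_loop, t_vec))
```

Separately, np.show_config() prints which BLAS/LAPACK libraries a given NumPy build is linked against, which is a quick way to check whether an accelerated linear algebra library is actually in use.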

・ Production

Additional notes
It is a pity that the GPU cannot be used yet from the OpenCV Python binding. I know, of course, that there are GPU libraries for Python unrelated to OpenCV. I have not used an ARM embedded board that can exploit GPU performance (Jetson TK1), so I would like to hear from anyone who knows how useful Python is on such a board.

The resources of an ARM board differ significantly from those of a PC, so check carefully whether everything you need to implement can be achieved on that board. If the requirements are not yet clear enough, be sure to choose an ARM board with plenty of headroom. Are 2 cores enough, do you need 4 cores (Raspberry Pi 3 if you want four 64-bit cores), do you need a GPU (Jetson TX1), or do you need an FPGA? The choices are increasing.

URL
[1]: OpenCV for Embedded: Lessons Learned http://www.slideshare.net/YuryGorbachev1/opencv-for-embedded-lessons-learned

[2]: Try to use up the Raspberry Pi 2's 4-core CPU with Parallel Python http://qiita.com/akidn8/items/ea3dcae810ff36fd5401