Build a faster OpenCV for Raspberry Pi3
So you want to build a faster OpenCV for the Raspberry Pi 3, but you want to be sure – are you using the right build flags? How can you prove the result is actually faster? Well, you can never be 100% sure in advance, but here is a methodical way to get there.
Build a faster OpenCV deb package for Raspberry Pi
Let’s cut to the chase and start with the conclusion. Later I’ll show you how to prove it’s true, with a short discussion on how to make sure it holds for your use case.
From my personal experience and testing, this is how you would build the best-performing OpenCV 3.1.0:
- The source is extracted in ~/Downloads/opencv-3.1.0 (comment – for big builds I usually prefer to mount the source and build directories externally rather than keep them on the Pi).
- You have already followed this post – Intel TBB on Raspberry Pi. (Optional for reasons I’ll detail later on, but I’m showing below how to build with TBB – if you don’t want it, remove the TBB-related configuration parameters below.)
cd ~/Downloads/opencv-3.1.0
mkdir build_rpi3_release_fp_tbb
cd build_rpi3_release_fp_tbb
cmake -DCMAKE_CXX_FLAGS="-DTBB_USE_GCC_BUILTINS=1 -D__TBB_64BIT_ATOMICS=0" \
      -DENABLE_VFPV3=ON -DENABLE_NEON=ON -DBUILD_TESTS=OFF \
      -DWITH_TBB=ON -DCMAKE_BUILD_TYPE=Release ..
make -j 4
Please note that building in parallel also uses a lot of RAM, and the Pi doesn’t have much of it. If make fails with “out of memory”, or slows to a crawl because of swapping, reduce the parallelism and/or close redundant processes (for example, disable the GUI and connect to the Pi over ssh instead).
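One way to pick a safe parallelism level is to check the available RAM first. The sketch below is a rough heuristic, not a measured figure for OpenCV – the 1 GB-per-job budget is an assumption:

```shell
# Rough heuristic (assumed): allow about 1 GB of available RAM per
# compile job, capped at the number of CPU cores.
avail_mb=$(awk '/MemAvailable/ { print int($2 / 1024) }' /proc/meminfo)
jobs=$(( avail_mb / 1024 ))
[ "$jobs" -lt 1 ] && jobs=1          # always allow at least one job
ncpu=$(nproc)
[ "$jobs" -gt "$ncpu" ] && jobs=$ncpu  # no point exceeding core count
echo "make -j $jobs"
```

On a Pi 3 with the GUI running this will usually land at a value below 4, which is exactly when dropping from `make -j 4` helps.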
Prepare a package
This will create a package named opencv_3.1.0-1_armhf.deb in the current build directory.
sudo apt-get install checkinstall
echo "opencv 3.1.0 build_rpi3_release_fp_tbb" > description-pak
echo | sudo checkinstall -D --install=no --pkgname=opencv --pkgversion=3.1.0 --provides=opencv --nodoc --backup=no --exclude=$HOME
sudo dpkg -i opencv_3.1.0-1_armhf.deb
How much faster is it?
In short – about 30% faster.
How did I come up with this number?
“Faster” is by definition a relative term, so we need to determine which use case we’re comparing and with which build configuration.
I’m comparing this build with a simple ‘Release’ configuration, also built on the Raspberry Pi 3. As for the use case – there are so many of them, so this is how I came up with an average:
- Build different configurations to compare in different build directories.
- Execute the performance tests which come with OpenCV 3.1.0 (the run.py Python script supplied with OpenCV).
- Compare the different builds against the ‘Release’ build (the summary.py Python script supplied with OpenCV). The output gives an ‘x-factor’ per test, which is how much slower or faster the test ran relative to the base ‘Release’ build.
- Calculate an average for this ‘x-factor’ (a small Perl script I wrote to parse the HTML output of summary.py).
The resulting x-factor was 1.2975, which is ~ 30% faster.
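To make the averaging step concrete, here is a minimal sketch of the idea. The factor values below are invented for illustration – the real ones come out of the summary.py output:

```shell
# Hypothetical per-test x-factors, as they would be extracted from the
# summary.py HTML output (values invented for this example)
cat > factors.txt <<'EOF'
1.10
1.45
1.32
EOF
# Arithmetic mean of the x-factors
avg=$(awk '{ s += $1; n++ } END { printf "%.4f", s / n }' factors.txt)
echo "average x-factor: $avg"   # prints: average x-factor: 1.2900
```

The real run averages over hundreds of tests, which is where the 1.2975 figure comes from.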
Was this performance gain due to TBB?
No. 28% of it should be credited to building with the floating point optimization flags, as I verified in my tests. Nevertheless, squeezing out an extra 2% does no harm. I use TBB in many of my projects anyway for its great parallelism tools (mainly the pipeline), but if you don’t want it, leave it out.
Show me the numbers
See this GitHub project’s comparisons subdirectory (download the HTML files and open them locally in a browser).
But this is an average – how can I make sure it holds for my use case?
The bottom line is this – if you don’t have performance issues, don’t optimize. If you do have performance issues, and you’re sure there is nothing left to improve in your own code, then go ahead and investigate several build configurations. Try linking with each one and measure your benefit.
The simple automated “build – test – compare” system I used to perform the above tests is available in this GitHub project, and you can use it as you wish. See the project’s page for simple usage instructions.
- Don’t optimize unless you need to.
- Optimize your own code first.
- You can build a 30% faster OpenCV package for Raspberry Pi3.
- Building a package is preferable since packages are easier to maintain and deliver for installations.
- Using TBB is optional since OpenCV has an alternative parallelism mechanism.
- Independently, you can build a simple and fast pipeline with TBB for your own usage with OpenCV (I’ll share example code in a future post).
- UPDATE: See more on using OpenCV with TBB at https://www.theimpossiblecode.com/blog/faster-opencv-smiles-tbb.
- You can freely use the build/pkg/test/compare system I used for OpenCV 3.1.0 from here.
See you soon in an upcoming post.
Thank you for your great and well explained posts!
I have installed your TBB package and compiled OpenCV according to your instructions, the only differences being that my OpenCV is in a different directory and that I use opencv_contrib.
I encountered a problem while trying to run the dmp example code in opencv_contrib – it says that TBB is not defined.
Maybe you have some advice? I think that maybe I’m using a wrong command line when I compile the code – can you write out the compilation command?
Can you copy-paste your compilation command and the error?
Thanks for this interesting article. I had not heard of TBB, but my initial look at it makes it appear very useful. For now I have just been using Boost for basic threading. It would be interesting to see an article with some pointers and tips on how you make use of TBB. Cheers!
See this new post https://www.theimpossiblecode.com/blog/faster-opencv-smiles-tbb
Thanks for that. Very useful example, showing the parallel pipeline, especially for OpenCV 🙂
Hi, thanks for your useful post. I have a question. After compiling OpenCV with the NEON flags, should I set the -mfpu=neon flag for compiling my own program? How much will -O3 affect the performance?
Hi, I’m glad this is useful to you 🙂
– I’m not sure you’ll benefit transparently from these NEON flags for your own programs. You can try and see.
– Using cmake to build a ‘Release’ version automatically adds ‘-O3 -DNDEBUG’ (use ‘make VERBOSE=1’ to see the compilation lines).
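For example, a compile line printed by ‘make VERBOSE=1’ for a Release build looks roughly like the simulated line below (the exact path and flag order vary per system – this line is illustrative, not copied from a real build):

```shell
# Simulated 'make VERBOSE=1' compile line from a Release build (illustrative)
line='/usr/bin/c++ -O3 -DNDEBUG -mfpu=neon -c core.cpp -o core.o'
# Check that the Release optimization flags are present
flags=$(echo "$line" | grep -o -- '-O3 -DNDEBUG')
echo "$flags"
```

Piping the real verbose make output through the same grep is a quick way to confirm the optimization flags reached the compiler.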
Thanks for the reply. I have successfully built OpenCV 3.1 with the NEON flags and TBB. The effect of the NEON flags is clear, but TBB won’t speed up the run time.
My functions use parallel_for and I see my CPU usage reach almost 50%, but setting the number of threads used by OpenCV (via ‘setNumThreads()’) to 1 or 4 doesn’t change the run time! I tried to use OpenMP and #pragma directives instead, but no improvement is seen. What should I do? The same code is well parallelized on a PC, where the run time speed-ups are clear.
The strange thing is that the CPU usage increases up to 50% when setting the number of threads to 4, but no improvement in run time is seen compared to setting the number of threads to 1, where CPU usage drops to almost 30%.
It seems that you have specific optimization issues and I can’t help without looking at your code. Maybe these general insights will help you:
1. About the absolute times – check your bottlenecks again on the Pi to make sure you are improving the correct code parts. Improving the faster part, only to wait for non-optimized slower parts, might give you exactly such symptoms.
2. Using parallel_for is not the only optimization with TBB. See more on using OpenCV with TBB in other ways at https://www.theimpossiblecode.com/blog/faster-opencv-smiles-tbb
3. Work on smaller resolutions.
My bottleneck is the built-in KLT tracker function, which uses parallel_for as I have seen in its source code. But setting the number of threads to 1 or 4 doesn’t change performance!
Could you please give me some hints?
I guess I found the problem. I use clock() from time.h to measure the run time of the code. I guess the Raspberry Pi has only one clock in the CPU and cannot truly measure the elapsed time when multi-threading is enabled.
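For what it’s worth, the likely culprit is not the Pi’s clock hardware: clock() from time.h returns CPU time, and on Linux that is summed across all of a process’s threads, so a 4-thread run can report the same (or even a higher) clock() value while the wall time actually drops. Measuring wall time instead – cv::getTickCount() in OpenCV code, or as sketched below in shell – should reveal the real speed-up:

```shell
# Measure wall (elapsed) time rather than CPU time
start=$(date +%s%N)        # GNU date: nanoseconds since the epoch
sleep 0.2                  # stand-in for the parallel workload being timed
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
echo "elapsed: ${elapsed_ms} ms"
```

Unlike clock(), this number shrinks when the work is genuinely spread over more cores.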
How can we validate that the OpenCV build actually has NEON turned ON?
1. Before you build, you can do this to see if there are NEON flags in the compile command (from my OpenCV 3.3.0 build on a Pi 3):
$ grep OPENCV_EXTRA_CXX_FLAGS CMakeCache.txt |grep -i neon
OPENCV_EXTRA_CXX_FLAGS:INTERNAL= -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wno-narrowing -Wno-delete-non-virtual-dtor -Wno-comment -fdiagnostics-show-option -pthread -fomit-frame-pointer -ffunction-sections -mfpu=neon -mfp16-format=ieee
2. After you build, you can do this to verify NEON support in the resulting library:
$ readelf -A /usr/local/lib/libopencv_core.so |grep -i neon
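On a NEON-enabled armhf build, that readelf command should surface an ARM build-attribute line like the one simulated below (the tag names come from the ARM EABI build attributes; the exact values depend on your toolchain and are illustrative here):

```shell
# Illustrative ARM build attributes as printed by 'readelf -A' for a
# NEON-enabled library (exact values depend on your toolchain)
attrs='Tag_FP_arch: VFPv4
Tag_Advanced_SIMD_arch: NEONv1 with Fused-MAC'
# A NEON build reports an Advanced SIMD architecture tag
echo "$attrs" | grep -i 'neon'
```

If the grep prints nothing on your real library, the build was not compiled with NEON enabled.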