I want to start learning OpenCV little by little, but first I need to decide which OpenCV API is more useful. I expect that the Python implementation will be shorter to write, but that its running time will be slower compared to native C++ implementations. Can anyone comment on the performance and coding differences between these two approaches?
As mentioned in earlier answers, Python is slower compared to C++ or C. Python is built for simplicity, portability and, moreover, creativity, where users need to worry only about their algorithm, not about programming details.
But here in OpenCV, there is something different. Python-OpenCV is just a wrapper around the original C/C++ code. It is normally used for combining the best features of both languages: the performance of C/C++ and the simplicity of Python.
So when you call an OpenCV function from Python, what actually runs is the underlying C/C++ code, and there won't be much difference in performance. (I remember reading somewhere that the performance penalty is <1%, though I don't remember where. A rough estimate with some basic OpenCV functions shows a worst-case penalty of <4%, i.e. penalty = [maximum time taken in Python - minimum time taken in C++] / minimum time taken in C++.)
The problem arises when your code contains a lot of native Python code. For example, if you write your own functions that are not available in OpenCV, things get worse: such code runs natively in Python, which reduces performance considerably.
But the new OpenCV-Python interface has full support for Numpy. Numpy is a package for scientific computing in Python. It is also a wrapper around native C code. It is a highly optimized library which supports a wide variety of matrix operations and is highly suitable for image processing. So if you combine OpenCV functions and Numpy functions correctly, you will get very high-speed code.
The thing to remember is: always try to avoid loops and iteration in Python. Instead, use the array manipulation facilities available in Numpy (and OpenCV). Simply adding two numpy arrays using C = A + B is many times faster than using double loops, as the sketch below illustrates.
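For example, a minimal sketch of that difference (the array size and timing code are only illustrative):

import time
import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Vectorized addition: the loop runs in optimized C inside Numpy
t0 = time.time()
C = A + B
print("vectorized: {:.4f} s".format(time.time() - t0))

# The same operation as a double loop in native Python: dramatically slower
t0 = time.time()
D = np.empty_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        D[i, j] = A[i, j] + B[i, j]
print("double loop: {:.4f} s".format(time.time() - t0))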
For example, you can check these articles:
Fast Array Manipulation in Python
Performance comparison of OpenCV-Python interfaces, cv and cv2
All Google results for OpenCV state the same thing: that Python will only be slightly slower. But I have never seen any profiling to back that up, so I decided to do some myself and discovered:
Python is significantly slower than C++ with OpenCV, even for trivial programs.
The simplest example I could think of was to display the output of a webcam on-screen along with the number of frames per second. With Python, I achieved 50 FPS (on an Intel Atom). With C++, I got 65 FPS, an increase of about 30%. In both cases, the workload used a single core and, to the best of my knowledge, was bound by the performance of the CPU.
Additionally, this test case roughly aligns with what I have seen in projects I've ported from one language to the other in the past.
Where does this difference come from? In Python, all of the OpenCV functions return new copies of the image matrices. Whenever you capture an image, or resize it, in C++ you can re-use existing memory; in Python you cannot. I suspect the time spent allocating memory is the major difference, because, as others have said, the underlying code of OpenCV is C++.
Before you throw Python out the window: Python is much faster to develop in, and as long as you aren't running into hardware constraints, or if development speed is more important than performance, then use Python. In many applications I've done with OpenCV, I've started in Python and later converted only the computer vision components to C++ (e.g. using Python's ctypes module and compiling the CV code into a shared library).
Python Code:
import cv2
import time

FPS_SMOOTHING = 0.9

cap = cv2.VideoCapture(2)

fps = 0.0
prev = time.time()
while True:
    now = time.time()
    fps = (fps*FPS_SMOOTHING + (1/(now - prev))*(1.0 - FPS_SMOOTHING))
    prev = now

    print("fps: {:.1f}".format(fps))

    got, frame = cap.read()
    if got:
        cv2.imshow("asdf", frame)
    if (cv2.waitKey(2) == 27):
        break
C++ Code:
#include <opencv2/opencv.hpp>
#include <stdint.h>
#include <cstdio>
#include <ctime>

using namespace std;
using namespace cv;

#define FPS_SMOOTHING 0.9

int main(int argc, char** argv){
    VideoCapture cap(2);

    Mat frame;
    float fps = 0.0;
    double prev = clock()/(double)CLOCKS_PER_SEC;
    while (true){
        double now = (clock()/(double)CLOCKS_PER_SEC);
        fps = (fps*FPS_SMOOTHING + (1/(now - prev))*(1.0 - FPS_SMOOTHING));
        prev = now;

        printf("fps: %.1f\n", fps);

        if (cap.isOpened()){
            cap.read(frame);
        }
        imshow("asdf", frame);
        if (waitKey(2) == 27){
            break;
        }
    }
    return 0;
}
Possible benchmark limitations:
Camera frame rate
Timer measuring precision
Time spent in print formatting
The answer from sdfgeoff is missing the fact that you can reuse arrays in Python. Preallocate them and pass them in, and they will get used. So:
image = numpy.zeros(shape=(height, width, 3), dtype=numpy.uint8)
#....
retval, _ = cap.read(image)   # cap is a cv2.VideoCapture instance; image is passed as the optional output array
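A fuller sketch of that idea in a capture loop (the camera index and frame size are illustrative, and it assumes OpenCV writes into the passed buffer when its size and type match the captured frame):

import numpy as np
import cv2

cap = cv2.VideoCapture(0)
height, width = 480, 640                              # must match the camera's frame size
buf = np.zeros((height, width, 3), dtype=np.uint8)    # allocated once, reused every frame

while True:
    got, frame = cap.read(buf)    # pass the buffer as the optional output array
    if not got:
        break
    cv2.imshow("reused buffer", frame)
    if cv2.waitKey(2) == 27:
        break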
You're right, Python is almost always significantly slower than C++, as it requires an interpreter, which C++ does not. However, that does require C++ to be statically typed, which leaves a much smaller margin for error. Some people prefer being forced to code strictly, whereas others enjoy Python's inherent leniency.
If you want a full discourse on Python coding styles vs. C++ coding styles, this is not the best place, try finding an article.
EDIT:
Because Python is an interpreted language, while C++ is compiled down to machine code, generally speaking, you can obtain performance advantages using C++. However, with regard to using OpenCV, the core OpenCV libraries are already compiled down to machine code, so the Python wrapper around the OpenCV library is executing compiled code. In other words, when it comes to executing computationally expensive OpenCV algorithms from Python, you're not going to see much of a performance hit since they've already been compiled for the specific architecture you're working with.
Why choose?
If you know both Python and C++, use Python for research using Jupyter Notebooks and then use C++ for implementation.
The Python stack of Jupyter, OpenCV (cv2) and Numpy provide for fast prototyping.
Porting the code to C++ is usually quite straightforward.
Related
The Github Page of Caffe contains a Windows Branch. I have taken this branch and created a Windows DLL. It is loosely based on https://github.com/BVLC/caffe/blob/master/examples/cpp_classification/classification.cpp.
The DLL works and outputs correct classification results. But it is 1.5-5 times slower than the pyCaffe interface. It is very interesting that the pyCaffe Interface takes around 1 second for four images using AlexNet on all computers tested. The DLL time ranges from 1.5 seconds to 2 seconds to 4 seconds.
We have measured the time before and after the loop (using Easily measure elapsed time) of
template <typename Dtype> Dtype Net<Dtype>::ForwardFromTo(int start, int end)
This function resides in https://github.com/BVLC/caffe/blob/master/src/caffe/net.cpp and is called by the CPP and Python Code.
We compiled Caffe as a 32-bit program without GPU support using Visual Studio 2013.
Possible causes we have checked so far:
Compiler Optimizations
The data
OS and computer configurations (like CPU/Memory etc.)
We have measured multiple times in one execution, such that the benchmark is more stable.
We also profiled the code using CodeXL, but I could not find anything unusual there, though that is of course a little vague.
We concluded the following: Caffe uses GLog. GLog has fatal checks (CHECK macros) which may look like this:
CHECK(a<=b) << "a must be bigger than b";
These checks crash the program and are hardly catchable. Because of that, we created a class to replace GLog. It is fairly simple and uses std::stringstream. Google has done something clever: whenever the condition is true, the right-hand side of the << operator is not evaluated.
https://github.com/google/glog/blob/de6149ef8e67b064a433a8b88924fa9f606ad5d5/src/windows/glog/logging.h#L569
They solved it using the (void) 0 trick; we missed that part, so in our replacement the right-hand side of the << operator was always evaluated. When I wanted to post the profiling data here, I noticed that some time was being lost in the << operator. We looked at the profiling data more closely and increased the number of function calls so that every number became a little larger and clearer. That led us to the solution.
I currently need to run an FFT on a 1024-sample-point signal. So far I have implemented my own DFT algorithm in Python, but it is very slow. If I use NumPy's fftpack, or even move to C++ and use FFTW, do you think it would be better?
If you are implementing the DFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. Not only are those libraries written in much lower-level languages; they are also (FFTW in particular) so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using numpy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.
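For reference, a minimal numpy version of the question's use case might look like this (the test signal is purely illustrative):

import numpy as np

# 1024-sample test signal: a 50 Hz sine sampled at 1 kHz, plus noise
fs = 1000.0
t = np.arange(1024) / fs
signal = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(1024)

spectrum = np.fft.fft(signal)                     # complex spectrum, computed in compiled code
freqs = np.fft.fftfreq(len(signal), d=1/fs)
peak = freqs[np.argmax(np.abs(spectrum[:512]))]   # should be roughly 50 Hz
print(peak)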
If you need speed, then you want to go for FFTW, check out the pyfftw project.
In order to use processor SIMD instructions, you need to align the data and there is not an easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.
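As a rough sketch of how that looks with pyfftw (the array size and thread count are illustrative, and the exact API may vary between pyfftw versions):

import numpy as np
import pyfftw

# Byte-aligned input array so FFTW can use SIMD loads/stores
a = pyfftw.empty_aligned(1024, dtype='complex128')
a[:] = np.random.randn(1024) + 1j * np.random.randn(1024)

# Plan the transform once (multithreaded), then execute it
fft_object = pyfftw.builders.fft(a, threads=4)
spectrum = fft_object()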
In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
samplesFreq_cv2 = [
    cv2.dft(samples[iS])
    for iS in xrange(nbSamples)]

samplesFreq_np = [
    np.fft.fft2(samples[iS])
    for iS in xrange(nbSamples)]
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicated C++ library like FFTW, but it's rather low-hanging fruit.
I want to perform image processing on a low-end (Atom processor) embedded computer or microcontroller that is running Linux.
I'm trying to decide whether I should write my image processing code in Octave or Python. I feel comfortable in both languages, but is there any reason why I should use one over the other? Are there huge performance differences? I feel as though Octave may more closely resemble, syntax-wise, the domain of image processing than Python.
Thanks for your input.
Edit: The motivation for this question comes from the fact that I design in Octave and get a working algorithm and then port the algorithm to C++. I am trying to avoid this double work and go from design to deployment easily.
I am a bit surprised that you don't stick to C/C++, since many convenient image processing libraries exist. Even though I have about 20 years of experience with C, 8 years with Matlab, and only 1 year with Python, I would choose Python together with OpenCV, an extremely optimized computer vision library that supports the Intel Performance Primitives. Once you have a working Python solution, it is easy to translate it to C or C++ to get additional performance or reduce power consumption. I would start with Python and Numpy, using matplotlib for display and prototyping, optimize using OpenCV from within Python, and finally use C++ and test it against the Python reference implementation.
MATLAB has a code generation feature which could potentially help with your workflow. Have a look at this example. My understanding is that the Atom is x86 architecture, so the generated code ought to work on it too. You could consider getting a Trial version and giving the above example a spin on your specific target to evaluate performance and inspect the generated C code.
I was using PIL to do image processing, and I tried to convert a color image into a grayscale one, so I wrote a Python function to do that, even though I know PIL already provides a convert function for this.
But the version I wrote in Python takes about 2 seconds to finish the grayscaling, while PIL's convert finishes almost instantly. So I read the PIL code and figured out that the algorithm I wrote is pretty much the same, except that
PIL's convert is written in C or C++.
So is this what makes the performance different?
Yes, if you code the same algorithm in Python and in C, the C implementation will be faster. This is definitely true for the usual Python interpreter, known as CPython. Another implementation, PyPy, uses a JIT and so can achieve impressive speeds, sometimes as fast as a C implementation. But running under CPython, the Python version will be slower.
If you want to do image processing, you can use:
OpenCV (cv2), SimpleCV, NumPy, SciPy, Cython, Numba ...
OpenCV, SimpleCV and SciPy already have many image processing routines.
NumPy can do operations on arrays at C speed.
If you want loops in Python, you can use Cython to compile your Python code, with static declarations, into an extension module.
Or you can use Numba for JIT compilation: it converts your Python code into machine code and gives you near-C speed. A NumPy version of the grayscale conversion is sketched below.
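For the grayscale conversion in the question, a plain NumPy version already avoids the per-pixel Python loop. A minimal sketch (the luminance weights are the usual ITU-R 601 values that PIL's "L" mode uses; the file name is just a placeholder):

import numpy as np
from PIL import Image

# Load the image as an (H, W, 3) float array
rgb = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float32)

# Weighted sum over the color channels, computed in C inside NumPy
gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
gray_img = Image.fromarray(gray.astype(np.uint8))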
I guess the question speaks for itself. I'm interested in doing some serious computations but am not a programmer by trade. I can string enough python together to get done what I want. But can I write a program in python and have the GPU execute it using CUDA? Or do I have to use some mix of python and C?
The examples on Klöckner's PyCUDA webpage had a mix of both Python and C, so I'm not sure what the answer is.
If anyone wants to chime in about OpenCL, feel free. I only heard about this CUDA business a couple of weeks ago and didn't know you could use your video cards like this.
You should take a look at CUDAmat and Theano. Both are approaches to writing code that executes on the GPU without really having to know much about GPU programming.
I believe that, with PyCUDA, your computational kernels will always have to be written as "CUDA C Code". PyCUDA takes charge of a lot of otherwise-tedious book-keeping, but does not build computational CUDA kernels from Python code.
pyopencl offers an interesting alternative to PyCUDA. It is described as a "sister project" to PyCUDA. It is a complete wrapper around OpenCL's API.
As far as I understand, OpenCL has the advantage of running on GPUs beyond Nvidia's.
Great answers already, but another option is Clyther. It will let you write OpenCL programs without even using C, by compiling a subset of Python into OpenCL kernels.
A promising library is Copperhead (alternative link): you just need to decorate the function that you want to run on the GPU, and you can then opt it in or out to see what works best for that function, CPU or GPU.
There is a good, basic set of math constructs with compute kernels already written that can be accessed through pyCUDA's cumath module. If you want to do more involved or specific/custom stuff you will have to write a touch of C in the kernel definition, but the nice thing about pyCUDA is that it will do the heavy C-lifting for you; it does a lot of meta-programming on the back-end so you don't have to worry about serious C programming, just the little pieces. One of the examples given is a Map/Reduce kernel to calculate the dot product:
import numpy as np
import pycuda.autoinit                  # initializes the CUDA device/context
from pycuda.reduction import ReductionKernel

dot_krnl = ReductionKernel(np.float32, neutral="0",
                           reduce_expr="a+b",
                           map_expr="x[i]*y[i]",
                           arguments="float *x, float *y")
The little snippets of code inside each of those arguments are lines of C, but PyCUDA actually writes the program for you. The ReductionKernel is a custom kernel type for map/reduce-style functions, but there are other types. The examples section of the official PyCUDA documentation goes into more detail, and a usage sketch follows below.
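For completeness, invoking that kernel typically looks something like this (a rough sketch; the array length is illustrative):

import numpy as np
import pycuda.gpuarray as gpuarray

x = gpuarray.to_gpu(np.random.randn(400).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(400).astype(np.float32))

# dot_krnl is the ReductionKernel defined above; it runs on the GPU,
# and .get() copies the scalar result back to the host
result = dot_krnl(x, y).get()
print(result)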
Good luck!
The scikits.cuda package could be a better option, since it doesn't require any low-level knowledge or C code for operations that can be represented as numpy array manipulation.
I was wondering the same thing and carried out a few searches. I found the article linked below, which seems to answer your question. However, you asked this back in 2014 and the Nvidia article does not have a date.
https://developer.nvidia.com/how-to-cuda-python
The video goes through the setup, an initial example and, quite importantly, profiling. However, I do not know whether you can implement all of the usual general compute patterns. I would think you can, because as far as I could tell there are no limitations in NumPy.