Have people had success speeding up video (post) processing in python/opencv?
I'm using 4.5.2 from brew on a (non-M1) MacBook Pro.
The two "tall poles" I expect to impact performance are:
I'm using ndarray types for Mat (the default), so I don't believe
algorithms can take advantage of OpenCL acceleration.
VideoCapture looks to be read()'ing in real-time, so I'm limited to 30 fps max
processing on a 30 fps source. A few sources suggest that building OpenCV with FFMPEG support (i.e. building from source) would allow reading as fast as the processor can run.
What I've seen from online searching (since I'm post-processing) suggests using parallel processing: splitting the file into chunks and reading each chunk in a separate thread. I can look at that, but I'd also like to understand whether people have had success fixing the two issues I raised. Ideally, faster reading and processing would allow me to avoid having to chunk and then recombine video snippets...
My initial attempts to convert Mat -> UMat have shown no noticeable improvement, even though I can see that the OpenCV source has OpenCL implementations for the methods I am using (undistort, cvtColor, calcOpticalFlowFarneback, etc.). FWIW, I'm getting < 5 fps for a 30 fps source.
To summarize:
Have people seen improvements from using UMat instead of Mat for OpenCL support?
Does a custom build with FFMPEG allow faster than real-time reading with VideoCapture?
Suggestions I haven't thought of?
The intent is to use Python because of the speed of iteration, so I'm hoping for some easy tricks to speed up the computational step of each iteration.
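For reference, a minimal sketch of the kind of Mat -> UMat conversion I've been attempting (the file path is just illustrative; the cv2.ocl calls assume an OpenCL-capable build):

import cv2

print(cv2.ocl.haveOpenCL())        # confirm this build actually sees an OpenCL device
cv2.ocl.setUseOpenCL(True)

frame = cv2.imread("frame.png")                      # plain ndarray (Mat)
uframe = cv2.UMat(frame)                             # upload into a UMat (OpenCL buffer)
ugray = cv2.cvtColor(uframe, cv2.COLOR_BGR2GRAY)     # should dispatch to the OpenCL path
gray = ugray.get()                                   # download back to an ndarray when needed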
Related
I am looking to do something very basic. I have a piece of code that I did not write which performs some processing that takes approximately 10 minutes to run on a single data set. I have 50,000 data sets, so I would like to utilize many GPUs to run this in parallel. I am familiar with how to do this on CPUs, but I do not know how to do it on GPUs. I see many examples of how to increase the speed of certain function calls with GPUs via numba, although I cannot find how I would run a for loop on a GPU. Is this possible? In essence, I have 50,000 image names, and I want a loop which reads through all the images, performs the processing, and then writes the extracted information to a .csv.
I participate in the Supercomputer Challenge.
From my experience, it's a complicated job to boost CPU code with GPU.
But there are some Python projects/libraries that may help you.
CuPy: Easy to convert numpy code to CUDA code
Numba: JIT compiler which you mention above
PyCUDA: run C CUDA code from Python
RAPIDS: the cuXX libraries developed by Nvidia
easy -> hard: CuPy/RAPIDS > Numba > PyCUDA
In summary, you should study CuPy if you are using numpy, or try to find similar processing methods in the RAPIDS libraries (e.g. cuGraph). PyCUDA is the most difficult option for this case.
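As a hedged illustration of the CuPy route (array names and sizes here are just placeholders; this assumes cupy is installed and a CUDA GPU is visible), most numpy calls translate almost one-to-one:

import numpy as np
import cupy as cp

x_cpu = np.random.rand(4096, 4096).astype(np.float32)
x_gpu = cp.asarray(x_cpu)             # copy host -> device
y_gpu = cp.fft.fft2(x_gpu) * 2.0      # same numpy-like API, executed on the GPU
y_cpu = cp.asnumpy(y_gpu)             # copy device -> host when you need the result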
Just some suggestions, Speed up!
I am generating super random binary images and I am doing that on one CPU core atm. Since I want to generate millions of images, I need to do this on my CUDA GPU. I think numba is the right tool to use, but which of its features? I would like to compute each image on a different GPGPU core, so my main process on the CPU should just copy the image info (basically only the id) and generate as many images as possible parallel on the GPGPU cores.
I thought about using jit but I am not sure if it suits my needs and that is why I want to hear some experts on the topic.
The code I want to execute in parallel is fairly simple:
import numpy as np

def gen_img(id):
    np.random.seed(id)
    a = np.random.randint(2, size=(1080, 1080))
    return a
Does numba.jit suit my needs?
Q : "Does numba.jit suit my needs?"
No. Given your aim is a high-performance production of a "just"-[CONCURRENT] workflow that generates 1080 x 1080 random bitmaps (the randomness itself being a topic of its own), neither plain Python nor numba.jit-accelerated code will perform anywhere near properly written, low-level CUDA-optimised code.
The quality of the PRNG-produced randomness, driven from a centrally dispatched seed-id, is the core problem here, not the GPU-hosted production code plus a bit of file I/O.
Achieving a high-quality distribution mapping from seed-id to PRNG output goes well beyond the scope of a Stack Overflow Q&A and belongs to the field of cryptography rather than to PRNG implementation. If you are interested in smart, high-quality PRNGs composable as CUDA kernels (i.e. not depending on the limits of the GPU hardware's built-in generators, which are rather shallow and often ship without published properties of their distributions, compared to other PRNGs, including those with published source code), there are many posts to start from here.
An inspiration for using the right-enough tools:
As an example, one may source such bitmaps directly from the shell, with whatever degree of job parallelism fits the hardware constraints, without ever calling a GIL-lock-dancing Python interpreter:
$ seq 4096 | parallel --jobs 32 \
--bar \
'(base64 -w0 /dev/urandom | head -c 145800 > random_data_1k80_1k80_1bit.{})'
Adding a file-format-specific header to the raw data, or sending the raw data over a pipe / socket to some other process, is easy and obvious using the right tools. Isn't that great?
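If you later need one of those raw bit-dumps back inside numpy (file name as produced by the pipeline above; numpy assumed available), a minimal sketch:

import numpy as np

raw = np.fromfile("random_data_1k80_1k80_1bit.1", dtype=np.uint8)   # 145800 raw bytes
bits = np.unpackbits(raw)                                           # 145800 * 8 = 1,166,400 bits
image = bits.reshape(1080, 1080)                                    # one 1080 x 1080 binary image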
I have a python script that loops through a dataset of videos and applies a face and lip detector function to each video. The function returns a 3D numpy array of pixel data centered on the human lips in each frame.
The dataset is quite large (70GB total, ~500,000 videos each about 1 second in duration) and executing on a normal CPU would take days. I have a Nvidia 2080 Ti that I would like to use to execute code. Is it possible to include some code that executes my entire script on the available GPU? Or am I oversimplifying a complex problem?
So far I have been trying to implement this using numba and pycuda and haven't made any progress, as the examples provided don't really fit my problem well.
Your first problem is actually getting your Python code to run on all CPU cores!
Python is not fast, and this is pretty much by design; more accurately, the design of Python emphasizes other qualities. Multi-threading is fairly hard in general, and Python can't make it easy due to those design constraints. A pity, because modern CPUs are highly parallel. In your case there's a lucky escape: your problem is also highly parallel. You can simply divide those 500,000 videos over the CPU cores, with each core running a copy of the Python script over its own subset of the input. Even a quad-core would process 4 x 125,000 files with that strategy (a minimal sketch follows).
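A minimal sketch of that per-core split (process_video and the glob pattern are placeholders for your own detector and file layout):

from multiprocessing import Pool
from pathlib import Path

def process_video(path):
    # placeholder: run the face/lip detector on one file and return its result
    return path

if __name__ == "__main__":
    video_paths = sorted(Path("videos").glob("*.mp4"))
    with Pool() as pool:                                   # one worker per CPU core by default
        results = pool.map(process_video, video_paths, chunksize=64)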
As for the GPU, that's not going to help much with Python code. Python simply doesn't know how to send data to the GPU, send it commands, or get results back. Some Python extensions can use the GPU, such as TensorFlow, but they use the GPU for their own internal purposes, not to run Python code.
Say there are many (about 300,000) JSON files that take much time (about 30 minutes) to load into a list of Python objects. Profiling revealed that it is in fact not the file access but the decoding, which takes most of the time. Is there a format that I can convert these files to, which can be loaded much faster into a python list of objects?
My attempt: I converted the files to ProtoBuf (aka Google's Protocol Buffers) but even though I got really small files (reduced to ~20% of their original size), the time to load them did not improve that dramatically (still more than 20 minutes to load them all).
You might be looking in the wrong direction with the conversion, as it will probably not cut your loading times as much as you would like. If the decoding is taking a lot of time, it will probably take quite some time for other formats as well, assuming the JSON decoder is not really badly written. I am assuming the standard library functions have decent implementations, and JSON is not a lousy format for data storage, speed-wise.
You could try running your program with PyPy instead of the default CPython implementation that I will assume you are using. PyPy could decrease the execution time tremendously. It has a faster JSON module and uses a JIT which might speed up your program a lot.
If you are using Python 3 you could also try using ProcessPoolExecutor to run the file loading and data deserialization / decoding concurrently. You will have to experiment with the degree of concurrency, but a good starting point is the number of your CPU cores, which you can halve or double. If your program waits for I/O a lot, you should run a higher degree of concurrency, if the degree of I/O is smaller you can try and reduce the concurrency. If you write each executor so that they load the data into Python objects and simply return them, you should be able to cut your loading times significantly. Note that you must use a process-driven approach, using threads will not work with the GIL.
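A rough sketch of that approach (the directory name and worker count are placeholders to tune for your machine):

import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_file(path):
    with open(path, "r") as f:
        return json.load(f)            # decoding happens inside the worker process

if __name__ == "__main__":
    paths = sorted(Path("data").glob("*.json"))
    with ProcessPoolExecutor(max_workers=8) as executor:
        objects = list(executor.map(load_file, paths, chunksize=100))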
You could also use a faster JSON library, which could speed up your execution times two- or three-fold in an optimal case; in a real-world use case the speed-up will probably be smaller. Do note that these libraries might not work with PyPy, since PyPy handles C extensions differently (it favours CFFI-based alternatives), and PyPy has a good JSON module anyway.
Try ujson, it's quite a bit faster.
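A minimal usage sketch (ujson mirrors the stdlib json module's load/loads interface; the file name is just an example):

import ujson   # pip install ujson

with open("example.json") as f:
    data = ujson.load(f)   # drop-in replacement for json.load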
"Decoding takes most of the time" can be seen as "building the Python objects takes all the time". Do you really need all these things as Python objects in RAM all the time? It must be quite a lot.
I'd consider using a proper database for e.g. querying data of such size.
If you need mass processing of a different kind, e.g. stats or matrix processing, I'd take a look at pandas.
I aim to start learning OpenCV little by little, but first I need to decide which OpenCV API is more useful. I expect the Python implementation to be shorter to write, but the running time to be slower than the native C++ implementation. Can anyone comment on the performance and coding differences between these two approaches?
As mentioned in earlier answers, Python is slower than C++ or C. Python is built for simplicity, portability and, moreover, creativity, where users need to worry only about their algorithm, not about programming troubles.
But here in OpenCV there is something different: Python-OpenCV is just a wrapper around the original C/C++ code. It is normally used to combine the best features of both languages, the performance of C/C++ and the simplicity of Python.
So when you call a function in OpenCV from Python, what actually runs is the underlying C/C++ code, and there won't be much difference in performance. (I remember reading somewhere that the penalty is <1%, though I don't remember where. A rough estimate with some basic OpenCV functions shows a worst-case penalty of <4%, i.e. penalty = [maximum time taken in Python - minimum time taken in C++] / minimum time taken in C++.)
The problem arises when your code has a lot of native Python code. For example, if you write your own functions that are not available in OpenCV, things get worse. Such code runs natively in Python, which reduces performance considerably.
But the new OpenCV-Python interface has full support for Numpy. Numpy is a package for scientific computing in Python and is itself a wrapper around native C code. It is a highly optimized library supporting a wide variety of matrix operations, highly suitable for image processing. So if you combine OpenCV functions and Numpy functions correctly, you will get very high-speed code.
The thing to remember is: always try to avoid loops and iterations in Python. Instead, use the array manipulation facilities available in Numpy (and OpenCV). Simply adding two numpy arrays with C = A+B is many times faster than using double loops (a small sketch follows the article links below).
For example, you can check these articles:
Fast Array Manipulation in Python
Performance comparison of OpenCV-Python interfaces, cv and cv2
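As a rough, hedged illustration of the loop-avoidance point above (the array size is chosen arbitrarily; absolute timings depend on the machine, but the gap is typically one to two orders of magnitude):

import time
import numpy as np

A = np.random.rand(2000, 2000)
B = np.random.rand(2000, 2000)

t0 = time.time()
C_loop = np.empty_like(A)
for i in range(A.shape[0]):            # explicit Python double loop
    for j in range(A.shape[1]):
        C_loop[i, j] = A[i, j] + B[i, j]
t1 = time.time()

C_vec = A + B                          # same result, computed inside numpy's compiled C code
t2 = time.time()

print("double loop: %.3f s" % (t1 - t0))
print("vectorized:  %.3f s" % (t2 - t1))
print("results match:", np.allclose(C_loop, C_vec))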
All google results for openCV state the same: that python will only be slightly slower. But not once have I seen any profiling on that. So I decided to do some and discovered:
Python is significantly slower than C++ with opencv, even for trivial programs.
The simplest example I could think of was to display the output of a webcam on-screen along with the number of frames per second. With Python, I achieved 50 FPS (on an Intel Atom). With C++, I got 65 FPS, an increase of 30%. In both cases, the processing used a single CPU core and, to the best of my knowledge, was bound by the performance of the CPU.
Additionally, this test case roughly aligns with what I have seen in projects I've ported from one language to the other in the past.
Where does this difference come from? In Python, all of the OpenCV functions return new copies of the image matrices. Whenever you capture an image, or resize it, in C++ you can reuse existing memory; in Python you cannot. I suspect the time spent allocating memory is the major difference, because, as others have said, the underlying code of OpenCV is C++.
Before you throw Python out the window: Python is much faster to develop in, and as long as you aren't running into hardware constraints, or if development speed is more important than performance, use Python. In many applications I've done with OpenCV, I've started in Python and later converted only the computer vision components to C++ (e.g. using Python's ctypes module and compiling the CV code into a shared library).
Python Code:
import cv2
import time
FPS_SMOOTHING = 0.9
cap = cv2.VideoCapture(2)
fps = 0.0
prev = time.time()
while True:
    now = time.time()
    fps = (fps*FPS_SMOOTHING + (1/(now - prev))*(1.0 - FPS_SMOOTHING))
    prev = now
    print("fps: {:.1f}".format(fps))

    got, frame = cap.read()
    if got:
        cv2.imshow("asdf", frame)
    if (cv2.waitKey(2) == 27):
        break
C++ Code:
#include <opencv2/opencv.hpp>
#include <stdint.h>
using namespace std;
using namespace cv;
#define FPS_SMOOTHING 0.9
int main(int argc, char** argv){
    VideoCapture cap(2);
    Mat frame;

    float fps = 0.0;
    double prev = clock() / (double)CLOCKS_PER_SEC;
    while (true){
        double now = (clock() / (double)CLOCKS_PER_SEC);
        fps = (fps*FPS_SMOOTHING + (1/(now - prev))*(1.0 - FPS_SMOOTHING));
        prev = now;
        printf("fps: %.1f\n", fps);

        if (cap.isOpened()){
            cap.read(frame);
        }
        imshow("asdf", frame);
        if (waitKey(2) == 27){
            break;
        }
    }
}
Possible benchmark limitations:
Camera frame rate
Timer measuring precision
Time spent in print formatting
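As a hedged sketch of the porting route mentioned above (ctypes plus a shared library): the library name, function name and signature below are all hypothetical, and the C++ side would be compiled separately into that shared library.

import ctypes
import numpy as np

# hypothetical shared library containing the ported C++ CV code
lib = ctypes.CDLL("./libcvstep.so")
lib.process_frame.argtypes = [ctypes.POINTER(ctypes.c_uint8), ctypes.c_int, ctypes.c_int]

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # frame produced on the Python side
lib.process_frame(frame.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)), 480, 640)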
The answer from sdfgeoff is missing the fact that you can reuse arrays in Python. Preallocate them and pass them in, and they will get used. So:
import numpy
import cv2 as cv
cap = cv.VideoCapture(0)   # whichever capture source you are using
image = numpy.zeros(shape=(height, width, 3), dtype=numpy.uint8)  # height/width = frame size
#....
retval, _ = cap.read(image)   # read() fills the preallocated array instead of allocating a new one
You're right, Python is almost always significantly slower than C++ as it requires an interpreter, which C++ does not. However, that does require C++ to be strongly-typed, which leaves a much smaller margin for error. Some people prefer to be made to code strictly, whereas others enjoy Python's inherent leniency.
If you want a full discourse on Python coding styles vs. C++ coding styles, this is not the best place, try finding an article.
EDIT:
Because Python is an interpreted language, while C++ is compiled down to machine code, generally speaking, you can obtain performance advantages using C++. However, with regard to using OpenCV, the core OpenCV libraries are already compiled down to machine code, so the Python wrapper around the OpenCV library is executing compiled code. In other words, when it comes to executing computationally expensive OpenCV algorithms from Python, you're not going to see much of a performance hit since they've already been compiled for the specific architecture you're working with.
Why choose?
If you know both Python and C++, use Python for research using Jupyter Notebooks and then use C++ for implementation.
The Python stack of Jupyter, OpenCV (cv2) and Numpy provide for fast prototyping.
Porting the code to C++ is usually quite straightforward.