compute multiple images in parallel using numpy - python

I am generating super random binary images, and I am doing that on one CPU core at the moment. Since I want to generate millions of images, I need to do this on my CUDA GPU. I think numba is the right tool to use, but which of its features? I would like to compute each image on a different GPGPU core, so my main process on the CPU should just copy the image info (basically only the id) and generate as many images as possible in parallel on the GPGPU cores.
I thought about using jit, but I am not sure it suits my needs, which is why I want to hear from some experts on the topic.
The code I want to execute in parallel is fairly simple:
import numpy as np

def gen_img(id):
    np.random.seed(id)
    a = np.random.randint(2, size=(1080, 1080))
    return a
Does numba.jit suit my needs?

Q : "Does numba.jit suit my needs?"
No. Given your aim is high-performance production of a "just"-[CONCURRENT] workflow generating 1080 x 1080 bitmaps (at random - which is a topic of its own), neither the plain Python nor the numba.jit-accelerated code will perform anywhere near right-enough, low-level CUDA-optimised code.
The quality of the PRNG-produced randomness, based on a centrally dispatched seed-id, is the core problem here, not the GPU-hosted production code plus a bit of file I/O.
The problem of achieving a high-quality distribution mapping between the seed-id and the PRNG-produced output goes well beyond the Stack Overflow Q/A site and belongs to the field of cryptography, not the PRNG implementation. If you are interested in using smart, high-quality PRNGs composable as CUDA kernels (i.e. not depending on the GPU hardware's limited, quite shallow random bit-vector generators, which often come without published properties of the produced distributions, compared to other PRNGs, including those with published source code), there are many posts to start from here.
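For illustration only (this is not the recommendation above, and the seed handling, block size and 0.5 threshold are assumptions of mine, not a vetted PRNG design), a minimal sketch of a GPU-hosted generator built on numba.cuda.random's xoroshiro128p streams could look roughly like this; it needs a CUDA-capable GPU:

import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

@cuda.jit
def random_bits(states, out):
    i = cuda.grid(1)                      # one thread per pixel
    if i < out.shape[0]:
        out[i] = 1 if xoroshiro128p_uniform_float32(states, i) < 0.5 else 0

n = 1080 * 1080
threads = 256
blocks = (n + threads - 1) // threads
states = create_xoroshiro128p_states(threads * blocks, seed=42)   # illustrative seed
d_out = cuda.device_array(n, dtype=np.uint8)
random_bits[blocks, threads](states, d_out)
img = d_out.copy_to_host().reshape(1080, 1080)

Each thread draws from its own PRNG stream, so the statistical quality is bounded by what xoroshiro128p offers - which is exactly the caveat raised above.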
An inspiration for using right-enough tools:
As an example, one may source such bitmaps from the shell directly, with whatever degree of job parallelism fits the hardware constraints, without ever calling a GIL-lock-dancing Python interpreter:
$ seq 4096 | parallel --jobs 32 \
             --bar \
             '(base64 -w0 /dev/urandom | head -c 145800 > random_data_1k80_1k80_1bit.{})'
Adding a file-format-specific header to the raw data, or sending the raw data over a pipe / socket to some other process, is easy and obvious using the right tools. Isn't that great?

Related

Speed up opencv from python?

Have people had success speeding up video (post) processing in python/opencv?
I'm using 4.5.2 from brew on a (non-M1) MacBook Pro.
The two "tall poles" I expect to impact performance are:
I'm using ndarray types for Mat (the default), so I don't believe the algorithms can take advantage of OpenCL acceleration.
VideoCapture looks to be read()'ing in real time, so I'm limited to 30 fps max processing on a 30 fps source. A few sources suggest that building with FFMPEG (i.e. building from source) would allow reading as fast as the processor can run.
What I've seen from online searching (since I'm post-processing) suggests using parallel processing by splitting the file into chunks and reading them in separate threads. I can look at that. But I'd also like to understand whether people have had success fixing the two issues I raised. Ideally, faster reading and processing would allow me to avoid having to chunk and then combine video snippets...
My initial attempts to convert Mat -> UMat have had no noticeable improvement, even though I can see that OpenCV source has OpenCL implementations for the methods I am using (undistort, cvtColor, calcOpticalFlowFarneback, etc). FWIW, I'm getting < 5 fps for a 30 fps source.
To summarize:
Have people seen improvements from using UMat instead of Mat for OpenCL support?
Does a custom build with FFMPEG allow faster than real-time reading with VideoCapture?
Any suggestions I haven't thought of?
The intent is to use Python because of the speed of iteration, so I'm hoping for some easy tricks to speed up the computational step of each iteration.
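For reference, a minimal sketch of the Mat -> UMat path described above; the video path is a placeholder, and cvtColor/GaussianBlur stand in for the actual pipeline (undistort, calcOpticalFlowFarneback, ...):

import cv2

cv2.ocl.setUseOpenCL(True)                      # request the OpenCL (T-API) backend
print("OpenCL available:", cv2.ocl.haveOpenCL())

cap = cv2.VideoCapture("input.mp4")             # hypothetical source file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    u = cv2.UMat(frame)                         # upload once, keep intermediate work on the UMat
    gray = cv2.cvtColor(u, cv2.COLOR_BGR2GRAY)  # T-API call, may run via OpenCL
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    result = blurred.get()                      # download back to a numpy array only at the end
cap.release()

Whether this helps depends on whether every operation in the chain has an OpenCL implementation and on how much time is lost in upload/download, which matches the mixed results reported above.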

Tensorflow - executing a model in production as efficiently as possible

I have a semantic-segmentation model created using Keras.
Now I want to use it in production where I need to execute the model on a large folder with 10k-100k images a few times a day. This takes several hours, so every improvement is helpful.
I'm wondering what is the correct way to use it in production. I currently just use model.predict() on a created Sequence. But everywhere I look I see all kinds of different libraries or technologies that seem relevant.
TensorFlow Serving, converting to C, different libraries by Intel, and others.
I'm wondering what is the bottom-line recommended way to execute a model as production-grade and as efficiently as possible.
I'm not sure this has a canonical answer — as with many things there are lots of tradeoffs between different choices — but I'll attempt to give an answer.
I've been pretty happy using TensorFlow Serving to do model deployment, with separate services doing the business logic calling those models and doing something with the predictions. That provides a small boost because there won't be as much resource contention — the TensorFlow Serving instances do nothing but run the models. We have them deployed via Kubernetes, and that makes grouping a cluster of TensorFlow Serving instances very easy if you want to scale horizontally as well for more throughput.
You're unlikely to get meaningful improvements by messing around at the edges with things like making sure the TensorFlow Serving deployment is compiled with the right flags to use all of Intel's vector instructions. The big boost comes from running everything in fast C++ code. The one (probably very obvious) way to boost performance is to run the inference on a GPU rather than a CPU. That's going to scale more or less the way you'd expect: the more powerful the GPU, the faster the inference is going to be.
There are probably more involved things you could do to eke out more single-percentage-point gains. But this strikes a pretty good balance of speed with flexibility. It's definitely a bit more finicky to have this separate-service architecture: if you're not doing something too complicated, it might be easier (if quite a bit slower) to use your models "as-is" in a Python script rather than going to the trouble of setting up TensorFlow Serving. On the other hand, the speedup is pretty significant, and it's pretty easy to manage. At the other end of the spectrum, I have no idea what crazy things you could do to eke out more marginal performance gains, but instinct tells me that they're going to be pretty exotic, and therefore pretty difficult to maintain.
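As a concrete flavour of the separate-service setup described above, a client call against TensorFlow Serving's REST API might look roughly like this; the host name, model name, and input shape are assumptions about your deployment:

import json
import numpy as np
import requests

batch = np.random.rand(4, 256, 256, 3).tolist()        # placeholder image batch
payload = json.dumps({"instances": batch})

response = requests.post(
    "http://tf-serving:8501/v1/models/segmentation:predict",   # hypothetical endpoint
    data=payload,
    headers={"content-type": "application/json"},
)
predictions = response.json()["predictions"]            # one prediction per input instance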
It is difficult to answer, but I will consider the following orthogonal aspects:
Is it possible to run the model at a lower resolution? If so, resize the image before running the model -- this should give you roughly an X**2 speed-up, where X is the downsampling factor you use (see the sketch after this list).
Production models are often executed remotely, so understanding your remote machine's configuration is very important. If you only have CPU machines, options like OpenVINO typically give more speed-up than native TensorFlow. If you have GPU machines, options like TensorRT can also help you. The actual speed-up is very difficult to estimate, but I would say at least 2x faster.
Upload/download JPEG images instead of PNG or BMP. This should greatly reduce your communication time.
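A minimal sketch of the lower-resolution idea from the first point; the factor of 2, the model path, and the choice to upsample the masks afterwards are assumptions for illustration:

import tensorflow as tf

model = tf.keras.models.load_model("segmentation_model.h5")     # hypothetical model file

def predict_downsampled(batch, factor=2):
    # batch: array/tensor of shape (n, H, W, C)
    h, w = batch.shape[1] // factor, batch.shape[2] // factor
    small = tf.image.resize(batch, (h, w))       # ~factor**2 fewer pixels to push through the model
    masks = model.predict(small)
    return tf.image.resize(masks, (batch.shape[1], batch.shape[2]))   # upsample masks back if needed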

How to compute the performance of a python program on different machines

I would like to know what performance characteristics can be used to measure the performance of Python code on two different systems. Also, is it possible to extrapolate its performance to a different machine? Is this kind of thing possible?
Let's assume that one of the two systems does its computation on a GPU and the other on a CPU.
I want to extrapolate the Python code's performance to a different, CPU-based system.
Can this also be derived analytically?
In my experience, making assumptions based on hands-on performance analysis has been sufficient for identifying initial instance sizes/requirements, and then using real-time telemetry and instrumentation to closely monitor those solutions.
There are a couple of ways I've used to characterize performance (the terms below are gibberish I've made up):
Informal Characterization of Bottlenecks
This has involved informally understanding where the bottlenecks of your application are likely to be, to give a very rough idea of capacity/machine requirements. If you're performing CPU-bound calculations with little to no network traffic, then you could skip starting with a network-optimized instance. Also, if you're materializing processing results to the filesystem and the memory overhead is pretty small or bounded, then you don't need a high-memory instance.
External Performance Experiments
This involves creating performance test harnesses to establish baseline experiments, allowing you to change compute variables and determine what sort of effect they have on your program's performance. I like to set up queue-based systems with throughput tests, i.e. at 10k requests/second, what is the queue saturation and what is the service time? It involves adding logging/telemetry to the code to record those numbers. Also set up a backlog test to understand how fast a single instance can work through a backlog.
For HTTP there are many tools to generate load.
Hopefully there is an automated tool that supports your input format, but if not you may have to write your own.
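An illustrative (and deliberately simplified) throughput harness of the kind described above; handle_request and the request count are placeholders for your own service code and load profile:

import time

def handle_request(payload):
    return sum(payload)                          # stand-in for real per-request work

requests = [list(range(100))] * 10_000           # synthetic backlog of 10k requests
start = time.perf_counter()
for payload in requests:
    handle_request(payload)
elapsed = time.perf_counter() - start

print(f"{len(requests) / elapsed:.0f} requests/second, "
      f"{1000 * elapsed / len(requests):.3f} ms mean service time")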
Performance Profiling
I consider this to be using "low-level" tools to scientifically (as opposed to the informal analysis) determine where your code is spending its time. It usually involves using a Python profiler to determine which routines you're spending time in, and then trying to optimize them. http://www.brendangregg.com/linuxperf.html
If the performance test harness already shows acceptable performance, this step can be skipped :p
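A small sketch of that profiling step using the standard-library cProfile; process_batch is a stand-in for your own workload:

import cProfile
import pstats

def process_batch(n=100_000):
    return sum(i * i for i in range(n))          # placeholder for the real work

profiler = cProfile.Profile()
profiler.enable()
process_batch()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)   # top 10 routines by cumulative time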
Real time telemetry
After acceptable performance and an instance size have been determined, real-time telemetry is critical for seeing how the program performs in (near) real time against real-life workloads.
I've found throughput, processing counts, errors, etc. to all be critical to maintaining high-performance systems:
http://www.brendangregg.com/usemethod.html

How to speed up Python code for running on a powerful machine? [closed]

I've completed writing a multiclass classification algorithm that uses boosted classifiers. One of the main calculations consists of weighted least squares regression.
The main libraries I've used include:
statsmodels (for regression)
numpy (pretty much everywhere)
scikit-image (for extracting HoG features of images)
I've developed the algorithm in Python, using Anaconda's Spyder.
I now need to use the algorithm to start training classification models. So I'll be passing approximately 7000-10000 images to this algorithm, each about 50x100, all in gray scale.
Now I've been told that a powerful machine is available in order to speed up the training process. And they asked me "am I using GPU?" And a few other questions.
To be honest I have no experience in CUDA/GPU, etc. I've only ever heard of them. I didn't develop my code with any such thing in mind. In fact I had the (ignorant) impression that a good machine will automatically run my code faster than a mediocre one, without my having to do anything about it. (Apart from obviously writing regular code efficiently in terms of loops, O(n), etc).
Is it still possible for my code to get sped up simply by virtue of running on a high-performance computer? Or do I need to modify it to make use of a parallel-processing machine?
The comments and Moj's answer give a lot of good advice. I have some experience with signal/image processing in Python, have banged my head against the performance wall repeatedly, and I just want to share a few thoughts about making things faster in general. Maybe these help in figuring out possible solutions for slow algorithms.
Where is the time spent?
Let us assume that you have a great algorithm which is just too slow. The first step is to profile it to see where the time is spent. Sometimes the time is spent doing trivial things in a stupid way. It may be in your own code, or it may even be in the library code. For example, if you want to run a 2D Gaussian filter with a largish kernel, direct convolution is very slow, and even FFT may be slow. Approximating the filter with computationally cheap successive sliding averages may speed things up by a factor of 10 or 100 in some cases and give results which are close enough.
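A rough sketch of the "successive sliding averages" idea, assuming scipy is available; the kernel size, sigma, and number of passes are arbitrary choices for illustration:

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

img = np.random.rand(1024, 1024)

reference = gaussian_filter(img, sigma=8)        # reference Gaussian result

approx = img
for _ in range(3):                               # repeated box filters approximate a Gaussian
    approx = uniform_filter(approx, size=15)

print(np.abs(reference - approx).max())          # often "close enough" in practice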
If a lot of time is spent in some module/library code, you should check if the algorithm is just a slow algorithm, or if there is something slow with the library. Python is a great programming language, but for pure number crunching operations it is not good, which means most great libraries have some binary libraries doing the heavy lifting. On the other hand, if you can find suitable libraries, the penalty for using python in signal/image processing is often negligible. Thus, rewriting the whole program in C does not usually help much.
Writing a good algorithm even in C is not always trivial, and sometimes the performance may vary a lot depending on things like CPU cache. If the data is in the CPU cache, it can be fetched very fast, if it is not, then the algorithm is much slower. This may introduce non-linear steps into the processing time depending on the data size. (Most people know this from the virtual memory swapping, where it is more visible.) Due to this it may be faster to solve 100 problems with 100 000 points than 1 problem with 10 000 000 points.
One thing to check is the precision used in the calculation. In some cases float32 is as good as float64 but much faster. In many cases there is no difference.
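A quick illustration of the precision point; the matrix size is arbitrary and the actual ratio depends entirely on your BLAS and hardware:

import time
import numpy as np

a64 = np.random.rand(2000, 2000)                 # float64 by default
a32 = a64.astype(np.float32)

t0 = time.perf_counter()
_ = a64 @ a64
t64 = time.perf_counter() - t0

t0 = time.perf_counter()
_ = a32 @ a32
t32 = time.perf_counter() - t0

print(f"float64: {t64:.3f} s, float32: {t32:.3f} s")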
Multi-threading
Python - did I mention? - is a great programming language, but one of its shortcomings is that in its basic form it runs a single thread. So, no matter how many cores you have in your system, the wall clock time is always the same. The result is that one of the cores is at 100 %, and the others spend their time idling. Making things parallel and having multiple threads may improve your performance by a factor of, e.g., 3 in a 4-core machine.
It is usually a very good idea if you can split your problem into small independent parts. It helps with many performance bottlenecks.
And do not expect technology to come to the rescue. If the code is not written to be parallel, it is very difficult for a machine to make it parallel.
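A minimal sketch of the "split into small independent parts" advice using the standard-library multiprocessing module; process_image and the worker count are placeholders:

from multiprocessing import Pool

def process_image(image_id):
    return image_id * image_id                   # stand-in for real per-image work

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # roughly one worker per physical core
        results = pool.map(process_image, range(1000))
    print(len(results))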
GPUs
Your machine may have a great GPU with maybe 1536 number-hungry cores ready to crunch everything you toss at them. The bad news is that making GPU code is a bit different from writing CPU code. There are some slightly generic APIs around (CUDA, OpenCL), but if you are not accustomed to writing parallel code for GPUs, prepare for a steepish learning curve. On the other hand, it is likely someone has already written the library you need, and then you only need to hook to that.
With GPUs the sheer number-crunching power is impressive, almost frightening. We may talk about 3 TFLOPS (3 x 10^12 single-precision floating-point ops per second). The problem there is how to get the data to the GPU cores, because the memory bandwidth will become the limiting factor. This means that even though using GPUs is a good idea in many cases, there are a lot of cases where there is no gain.
Typically, if you are performing a lot of local operations on the image, the operations are easy to make parallel, and they fit a GPU well. If you are doing global operations, the situation is a bit more complicated. An FFT requires information from all over the image, and thus the standard algorithm does not work well with GPUs. (There are GPU-based algorithms for FFTs, and they sometimes make things much faster.)
Also, beware that making your algorithms run on a GPU binds you to that GPU. The portability of your code across OSes or machines suffers.
Buy some performance
Also, one important thing to consider is if you need to run your algorithm once, once in a while, or in real time. Sometimes the solution is as easy as buying time from a larger computer. For a dollar or two an hour you can buy time from quite fast machines with a lot of resources. It is simpler and often cheaper than you would think. Also GPU capacity can be bought easily for a similar price.
One possibly slightly under-advertised property of some cloud services is that in some cases the IO speed of the virtual machines is extremely good compared to physical machines. The difference comes from the fact that there are no spinning platters with the average penalty of half-revolution per data seek. This may be important with data-intensive applications, especially if you work with a large number of files and access them in a non-linear way.
I am afraid you cannot speed up your program just by running it on a powerful computer. I had this issue a while back. I first used Python (very slow), then moved to C (slow), and then had to use other tricks and techniques. For example, it is sometimes possible to apply some dimensionality reduction to speed things up while still getting reasonably accurate results, or, as you mentioned, to use multiprocessing techniques.
Since you are dealing with an image processing problem, you do a lot of matrix operations, and a GPU would surely be a great help. There are some nice and active CUDA wrappers in Python that you can use easily without knowing much CUDA. I tried Theano, PyCUDA and scikit-cuda (there should be more options by now).
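For a flavour of what such wrappers look like, here is a tiny PyCUDA sketch (one of the libraries named above); it assumes a working CUDA installation and only does a trivial elementwise operation:

import numpy as np
import pycuda.autoinit                           # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024, 1024).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)                       # copy the array to the GPU
b_gpu = a_gpu * 2.0 + 1.0                        # elementwise ops run as CUDA kernels
b = b_gpu.get()                                  # copy the result back to a numpy array
print(np.allclose(b, a * 2.0 + 1.0))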

Python Multiprocessing/EM

I did a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model1 for doing machine translation (here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python Multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) that performs a numerical computation one element at a time, you pay for that one computation - plus all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you have vectorized your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples of how to vectorize Python, as well as some of the alternative options.
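A small illustration of the vectorization advice: the same weighted sum written as an element-at-a-time loop and as a single numpy call; the array sizes are arbitrary:

import numpy as np

values = np.random.rand(1_000_000)
weights = np.random.rand(1_000_000)

# interpreter-bound version: one Python-level operation per element
total = 0.0
for v, w in zip(values, weights):
    total += v * w

# vectorized version: the loop runs inside numpy's compiled code
total_vec = np.dot(values, weights)
print(np.isclose(total, total_vec))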
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason, no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have no improvement, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language in which to learn to write parallel code. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were but one CPU core. This means bizarre hacks (such as the one you linked) must be done, which have their own additional drawbacks and issues (there are times when such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing any parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat - when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, i.e. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: computing the distance between data points and models, and updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems, whose size depends on the number of parameters in the model being solved for.
Your bottleneck may be in computing the residuals or distances between the data points and the models, i.e. stage 1, the E-step. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map-reduce or some other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
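A hedged sketch of that embarrassingly parallel E-step, vectorized with numpy broadcasting rather than an explicit parallel map; the toy sizes and the squared-distance residual are assumptions, not taken from the linked code:

import numpy as np

M, N, D = 10_000, 5, 3                              # data points, models, dimensions
data = np.random.rand(M, D)
model_means = np.random.rand(N, D)

# residuals[m, n] = squared distance between data point m and model n
diff = data[:, None, :] - model_means[None, :, :]   # shape (M, N, D) via broadcasting
residuals = np.einsum("mnd,mnd->mn", diff, diff)    # shape (M, N)
print(residuals.shape)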
