As an OpenCL beginner, I have a simple conceptual question about optimizing GPU computing.
As far as I understand, I can take, for example, a 1000 x 1000 matrix and run one piece of code on every element at the same time using a GPU. What about the following options?
I have 100 different 100 x 100 matrices and need to calculate each of them differently. Do I have to process them serially, or can I start 100 instances, i.e. start 100 Python multiprocessing workers that each send a matrix calculation to the GPU (assuming there are enough resources)?
The other way round: I have one 1000 x 1000 matrix and 100 different calculations to run on it. Can I do these at the same time, or only serially?
Any advice or a concept for how to solve this the fastest way is appreciated.
Thanks
Adrian
The OpenCL execution model revolves around kernels, which are just functions that execute for each point in your problem domain. When you launch a kernel for execution on your OpenCL device, you define a 1, 2 or 3-dimensional index space for this domain (aka the NDRange or global work size). It's entirely up to you how you map the NDRange onto your actual problem.
For example, you could launch an NDRange that is 100x100x100, in order to process 100 sets of 100x100 matrices (assuming they are all independent). Your kernel then defines the computation for a single element of one of these matrices. Alternatively, you could launch 100 kernels, each with a 100x100 NDRange to achieve the same thing. The former is probably faster, since it avoids the overhead of launching multiple kernels.
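As a rough illustration of the first option, here is a minimal pyopencl sketch (the kernel body, buffer names, and the placeholder "square each element" computation are assumptions made up for the example); each work-item handles one element of one of the 100 matrices:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# 100 independent 100x100 matrices in one contiguous buffer
mats = np.random.rand(100, 100, 100).astype(np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=mats)

prg = cl.Program(ctx, """
__kernel void square(__global float *m) {
    // one work-item per element: (matrix index, row, column)
    int k = get_global_id(0);
    int r = get_global_id(1);
    int c = get_global_id(2);
    int idx = (k * 100 + r) * 100 + c;
    m[idx] *= m[idx];   // placeholder computation
}
""").build()

prg.square(queue, mats.shape, None, buf)   # NDRange = (100, 100, 100)
cl.enqueue_copy(queue, mats, buf)          # copy the results back to the host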
I strongly recommend taking a look at the OpenCL specification for more information about the OpenCL execution model. Specifically, section 3.2 has a great description of the core concepts surrounding kernel execution.
Related
I am currently working with a dataset of approximately 4 million data points. I am using R in RStudio on a MacBook Pro (32 GB RAM, 2.2 GHz Intel Core i7, 155 GB available on the hard drive).
My goal is to perform a non-linear mixed-effects regression on the data. The data has two random effects, which are nested, and the model needs varying slopes as well as intercepts.
My code for this model is:
model <- lmer(DV ~ I(IVr^2) + IV + (IV | group/episode), data = data, REML=FALSE)
However, when running the model, the process uses up ~ 42gb of RAM before crashing. The log is:
Vector memory exhausted (limit reached?)
I want to manipulate R in some way so that it can run slower but use my available hard drive space as memory, so that it can handle this run. The closest solution I have found is the biglm package in R; however, I cannot find an equivalent for lmer(). I also can't find much on manipulating swap space on a Mac. Any solutions to the problem are welcome.
Alternatively, I speak Python, so a solution using Python instead would be great (i.e. a module which can handle polynomial mixed-effects models and a module to sort out my memory issue).
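In case it helps to see what the Python route might look like: the sketch below uses statsmodels' MixedLM and is only an approximation of the lmer formula above (the column names are taken from it, and expressing the nested episode-within-group effect through a variance-components formula is an assumption on my part, not a drop-in equivalent):
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("data.csv")   # hypothetical file with DV, IV, IVr, group, episode columns

# Random intercept and IV slope by group, plus a variance component for
# episode nested within group, as a rough stand-in for (IV | group/episode).
md = smf.mixedlm(
    "DV ~ I(IVr**2) + IV",
    data,
    groups=data["group"],
    re_formula="~IV",
    vc_formula={"episode": "0 + C(episode)"},
)
fit = md.fit(reml=False)
print(fit.summary())
Note that this does not by itself solve the memory problem; with roughly 4 million rows you may still need to subsample or fit on chunks.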
I want to test using cupy whether a float is positive, e.g.:
import cupy as cp
u = cp.array(1.3)
u < 2.
array(True)
My problem is that this operation is extremely slow:
%timeit u < 2. gives 26 microseconds on my computer, which is orders of magnitude more than what I get on the CPU. I suspect it is because u has to be copied back to the CPU...
I'm trying to find a faster way to do this operation.
Thanks !
Edit for clarification
My code is something like:
import cupy as cp
n = 100000
X = cp.random.randn(n)  # can be greater
for _ in range(100):  # There may be more iterations
    result = X.dot(X)
    if result < 1.2:
        break
And it seems like the bottleneck of this code (for this n) is the evaluation of result < 1.2. The loop is still much faster than it would be on the CPU, since the dot product costs far less on the GPU.
Running a single operation on the GPU is always a bad idea. To get performance gains out of your GPU, you need to realize a good 'compute intensity'; that is, the amount of computation performed relative to the movement of memory, either from global RAM to GPU memory, or from GPU memory into the cores themselves. If you don't have at least a few hundred flops per byte of compute intensity, you can safely forget about realizing any speedup on the GPU. That said, your problem may lend itself to GPU acceleration, but you certainly cannot benchmark statements like this in isolation in any meaningful way.
But even if your algorithm consists of chaining a number of such simple low-compute-intensity operations on the GPU, you will still be disappointed by the speedup. Your bottleneck will be GPU memory bandwidth, which really isn't as great compared to CPU memory bandwidth as it may look on paper. Unless you will be writing your own compute-intense kernels, or have plans for running some big FFTs or such using cupy, don't expect any silver-bullet speedups from just porting your numpy code.
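To make the compute-intensity point concrete for the X.dot(X) in the question, here is a back-of-the-envelope sketch (the flop and byte counts are the usual rough approximations, not measurements on any particular GPU):
# Arithmetic intensity of a float64 dot product: roughly two flops
# (one multiply, one add) per element, against eight bytes read per element.
n = 100_000
flops = 2 * n
bytes_moved = 8 * n
print(flops / bytes_moved, "flops per byte")   # ~0.25, far below "a few hundred"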
This may be because, when using CUDA, the array must be copied to the GPU before processing. Therefore, if your array has only one element, the operation can be slower on the GPU than on the CPU. You should try a larger array and see if the difference persists.
I think the problem here is that you're just leveraging one GPU device. Consider using, say, 100 devices to do all the loop computations in parallel (although in the case of your simple example code it would only need doing once). https://docs-cupy.chainer.org/en/stable/tutorial/basic.html
Also, there is a cupy greater function you could use to do the comparison on the GPU.
Also, the first time dot gets called, the kernel function will need to be compiled for the GPU, which takes significantly longer than subsequent calls.
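A small sketch of those last two suggestions (cp.less is the mirror image of cp.greater, matching the u < 2 in the question; the warm-up call just triggers the one-time kernel compilation): the comparison itself runs on the GPU, and the device-to-host transfer only happens when Python actually needs the value:
import cupy as cp

X = cp.random.randn(100_000)
_ = X.dot(X)                    # warm-up: the first call compiles/loads the kernel

result = X.dot(X)
flag = cp.less(result, 1.2)     # comparison computed on the GPU, result stays there
if bool(flag):                  # the synchronizing device-to-host copy happens here
    print("converged")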
I am trying to run a very capacity-intensive Python program which processes text with NLP methods for different classification tasks.
The program takes several days to run, so I am trying to allocate more capacity to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
It is a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, if I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do to make my Python code run faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores. In those cases, a calculation done in steps, where the next step depends on the results of the previous step, will not be faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
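As a minimal sketch of the multi-core suggestion (the classify function and document list are placeholders, not your actual NLP pipeline), the standard library's multiprocessing.Pool spreads independent per-document work over the physical cores:
from multiprocessing import Pool, cpu_count

def classify(document):
    # placeholder for the real per-document NLP work
    return len(document.split())

if __name__ == "__main__":   # the guard is required for multiprocessing on Windows
    documents = ["first text", "second text", "third text"]
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(classify, documents)
    print(results)
Also note that the vmoptions file only tunes the JVM that runs the PyCharm IDE itself; it does not give the Python interpreter that executes your script any extra CPU or memory.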
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
I am working in Python and I have encountered a problem: I have to initialize a huge array (a 21 x 2000 x 4000 matrix) so that I can copy a submatrix onto it.
The problem is that I want it to be really quick since it is for a real-time application, but when I run numpy.ones((21,2000,4000)), it takes about one minute to create this matrix.
When I run numpy.zeros((21,2000,4000)), it is instantaneous, but as soon as I copy the submatrix onto it, that takes one minute, whereas in the first case the copying part was instantaneous.
Is there a faster way to initialize a huge array?
I guess there's not a faster way. The matrix you're building is quite large (8 byte float64 x 21 x 2000 x 4000 = 1.25 GB), and might be using up a large fraction of the physical memory on your system; thus, the one minute that you're waiting might be because the operating system has to page other stuff out to make room. You could check this by watching top or similar (e.g., System Monitor) while you're doing your allocation and watching memory usage and paging.
numpy.zeros seems to be instantaneous when you call it, because memory is allocated lazily by the OS. However, as soon as you try to use it, the OS actually has to fit that data somewhere. See Why the performance difference between numpy.zeros and numpy.zeros_like?
Can you restructure your code so that you only create the submatrices that you were intending to copy, without making the big matrix?
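A small timing sketch (assuming NumPy and enough free RAM) that illustrates the lazy-allocation point: the zeros call itself returns immediately, and the cost only shows up when the pages are first written to:
import time
import numpy as np

shape = (21, 2000, 4000)            # roughly 1.3 GB of float64

t0 = time.perf_counter()
big = np.zeros(shape)               # usually near-instant: the OS hands out pages lazily
print("zeros:", time.perf_counter() - t0)

sub = np.random.rand(2000, 4000)
t0 = time.perf_counter()
big[0] = sub                        # the pages touched by the copy must now be physically backed
print("copy into one slice:", time.perf_counter() - t0)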
I wrote an Expectation Maximization machine learning algorithm in Python, basically an implementation of IBM Model 1 for machine translation (here is my GitHub if you want to look at the code), and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) that performs a numerical computation one element at a time, you pay for that one computation plus all of the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you have vectorized your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python, and some of the alternative options.
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have no improvement, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were only one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times where such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or help you with your course.
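To make the vectorization point above concrete, here is an illustrative sketch (not taken from the linked repository) of a per-element normalization written as a plain Python loop next to its vectorized equivalent:
import numpy as np

def update_loop(counts, totals):
    # one interpreted division per element: interpreter overhead on every step
    probs = np.empty_like(counts)
    for i in range(counts.shape[0]):
        for j in range(counts.shape[1]):
            probs[i, j] = counts[i, j] / totals[j]
    return probs

def update_vectorized(counts, totals):
    # same computation pushed into numpy's precompiled loops via broadcasting
    return counts / totals

counts = np.random.rand(200, 300)
totals = counts.sum(axis=0)
assert np.allclose(update_loop(counts, totals), update_vectorized(counts, totals))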
I think you will have some good success, depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, e.g. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems, where the size depends on the number of parameters in the model being solved for.
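For step 2, here is a compact sketch of one such weighted least-squares solve (the design matrix, observations, and weights are generic placeholders; in the EM setting the weights would come from the E-step):
import numpy as np

def weighted_least_squares(A, b, w):
    # minimize sum_i w_i * (A[i] @ p - b[i])**2 by rescaling each row with sqrt(w_i)
    sw = np.sqrt(w)
    p, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return p

# toy usage: M = 1000 points, k = 3 parameters for one model
A = np.random.randn(1000, 3)
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * np.random.randn(1000)
w = np.random.rand(1000)
print(weighted_least_squares(A, b, w))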
Your bottleneck may be in the stage of computing the residuals or distances between the data points and the models, i.e. stage 1, the E-step. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map, MapReduce, or some other tools in Python; a rough sketch follows below. I have had good success using IPython for such tasks, but there are other good Python packages as well.
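A rough sketch of that parallel E-step with the standard library (the residual computation and the model list are hypothetical placeholders; ipyparallel or a map-reduce framework would follow the same chunk-and-map pattern):
from multiprocessing import Pool
import numpy as np

MODELS = np.array([0.0, 1.0, -1.0])              # N hypothetical model parameters

def residuals_for_chunk(chunk):
    # distance from each data point in the chunk to every model: shape (len(chunk), N)
    return np.abs(chunk[:, None] - MODELS[None, :])

if __name__ == "__main__":
    data_points = np.random.randn(100_000)        # M data points
    chunks = np.array_split(data_points, 8)       # one chunk per worker
    with Pool(processes=8) as pool:
        parts = pool.map(residuals_for_chunk, chunks)
    residuals = np.vstack(parts)                  # shape (M, N), ready for the M-step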