Vectorization of recursively defined problem - Python


I was wondering if anyone has an idea of how I could vectorize the following loop:
for i in range(1, (T*n)+1):
    Y = Y + np.diag(mu) @ Y * dt + np.multiply(np.diag(sigma) @ Y, L @ np.random.normal(0, dt, (d, N)))
The following terms are already d x N matrices (I have already vectorized a loop to get there):
Y (this is the recursive parameter)
np.diag(mu) @ Y * dt
np.diag(sigma) @ Y
L @ np.random.normal(0, dt, (d, N))
Any help would be very appreciated. :)
With best regards!

Unfortunately, this doesn't look like vectorizable code:
Iterations should be independent. Typically, vectorization means making several iterations at once. Typically, it also implies using AVX, SSE or FMA instructions (if we talk about x86 processors) to make iterations go truly in parallel on a hardware level.
As for vector assembly instructions, that level of optimization is typically out of reach from Python code because the interpreter isn't that smart. Each iteration also does too much to be vectorized as a unit: it actually contains sub-loops. We don't see them, but matrix multiplications involve more loops.
So I wouldn't call optimizing this loop "vectorization". Luckily, there are still things to check:
Profile it. Find out what part of the computation consumes most of the time.
Verify that np.random doesn't slow down the program significantly. If yes, you can rely on pre-generated values instead.
Check that code that can be vectorized is vectorized. That means verifying that your numpy is built with SSE/AVX support and that matrix multiplications use it under the hood. It can be a bit tricky to do, but up to 4x speedups* are possible with AVX.
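If parts of the code are indeed vectorized at the assembly level, storing the data in narrower float arrays (e.g. float32 instead of float64) can make it even faster, since more values fit in each SIMD register. Note that float16 is usually not a win on CPUs: to my knowledge, most x86 processors only provide instructions to convert 16-bit floats, not to do arithmetic on them, so NumPy float16 operations tend to be slower.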
Rewrite it in C/Cython or try out Numba JIT compilation for the same task.
If compilation, even with Numba, is not an option, I wonder if Tensorflow can help here. With Tensorflow, Python code doesn't kick off computations immediately but rather constructs a computational graph that is then executed without returning to the interpreter level. Tensorflow does support AVX and SSE (although not without pain), so you may expect more control over low-level details than with numpy. You can also try to launch it on a GPU.
Last thing, I don't quite believe in it, but does loop unrolling help?
for i in range(1, (T * n + 1) // 4):
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
* - subject to Amdahl's law
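The suggestion above to rely on pre-generated random values can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the toy sizes, and the choice to draw all the noise up front as an array `Z` of shape `(steps, d, N)` are assumptions made for illustration. It also relies on the fact that `np.diag(v) @ Y` only scales row `i` of `Y` by `v[i]`, so the diagonal products can be replaced by broadcasting.

```python
import numpy as np

def simulate_loop(Y0, mu, sigma, L, Z, dt):
    """Reference: the recursion from the question, with the noise Z drawn up front."""
    Y = Y0.copy()
    for Zk in Z:  # Z has shape (steps, d, N)
        Y = Y + np.diag(mu) @ Y * dt + np.multiply(np.diag(sigma) @ Y, L @ Zk)
    return Y

def simulate_pregenerated(Y0, mu, sigma, L, Z, dt):
    """Same recursion with all work that does not depend on Y hoisted out of the loop."""
    # Per-step elementwise factor; L @ Z broadcasts the matmul over the step axis.
    growth = 1.0 + mu[:, None] * dt + sigma[:, None] * (L @ Z)  # (steps, d, N)
    Y = Y0.copy()
    for Gk in growth:
        Y = Y * Gk
    return Y

# Toy sizes (hypothetical); in the question the number of steps is T * n.
rng = np.random.default_rng(0)
d, N, steps, dt = 3, 5, 50, 0.01
mu = rng.uniform(-0.1, 0.1, d)
sigma = rng.uniform(0.1, 0.3, d)
L = np.linalg.cholesky(np.eye(d) * 0.5 + 0.5)
Y0 = np.ones((d, N))
Z = rng.normal(0, dt, (steps, d, N))  # same scale argument as in the question

Y_loop = simulate_loop(Y0, mu, sigma, L, Z, dt)
Y_fast = simulate_pregenerated(Y0, mu, sigma, L, Z, dt)
print(np.allclose(Y_loop, Y_fast))
```

Because each step then multiplies `Y` elementwise by a factor that does not depend on `Y`, the final state should also equal `Y0 * growth.prod(axis=0)`, which removes the Python loop at the cost of holding every step in memory.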

Related

Efficient computation of entropy-like formula (sum(xlogx)) in Python

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.
Since the vectors aren't probability vectors, and shouldn't be normalized, I can't use scipy's entropy function.
So far I couldn't find a single numpy or scipy function to obtain this, and as a result my alternatives involve breaking the computation into 2 steps, which involve intermediate arrays and slow down the run time. If anyone can think of a single function for this computation, it would be interesting.
Below is a timeit script for measuring several alternatives that I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations at run time. You can select which alternative to run by setting the value of func_code. I included the nansum approach offered by one of the answers. The measurements on my 2019 MacBook Pro are:
matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003
import timeit
import numpy as np
from scipy import special

def matmul(arg):
    a, log_a = arg
    log_a.fill(0)
    np.log2(a, where=a > 0, out=log_a)
    return (a[:, None, :] @ log_a[..., None]).ravel()

def xlogy(arg):
    a, log_a = arg
    a[a < 0] = 0
    return np.sum(special.xlogy(a, a), axis=1) * (1/np.log(2))

def nansum(arg):
    a, log_a = arg
    return np.nansum(a * np.log2(a, out=log_a), axis=1)

def setup():
    a = np.random.rand(20, 1000) - 0.1
    log = np.empty_like(a)
    return a, log

setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup()
"""
func_code = "matmul(data)"
print(timeit.timeit(func_code, setup=setup_code, number=100000))

On my machine, computing the logarithms takes about 80% of the time of matmul, so it is definitely the bottleneck, and optimizing the other operations will result in a negligible speed-up.
The bad news is that the default implementation of np.log is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (i.e. basically Skylake processors on servers and Ice Lake processors on PCs, but not the more recent Alder Lake). This means the computation could be significantly faster once vectorized. AFAIK, the closed-source SVML library does support AVX/AVX2 and could speed it up (on x86-64 processors only). SVML is supported by Numexpr and Numba, which can be faster because of that, assuming you have access to the non-free SVML, which is part of Intel tools often available on HPC machines (e.g. MKL, oneAPI, etc.).
If you do not have access to the SVML there are two possible remaining options:
Implement your own optimized SIMD log2 function, which is possible but hard since it requires a good understanding of the hardware SIMD units and certainly requires writing C or Cython code. This solution consists in computing the log2 function as an n-degree polynomial approximation (it can be exact to 1 ULP with a big n, though one generally does not need that). Naive approximations (e.g. n=1) are much simpler to implement but often too inaccurate for scientific use.
Implement a multi-threaded log computation, typically using Numba/Cython. This is a last-resort solution, as multithreading can slow things down if the input data is not large enough.
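To make the polynomial idea from the first option concrete, here is a minimal NumPy sketch of the math only. It is not a SIMD implementation and is not expected to be faster than np.log2; the function name `approx_log2`, the degree 7, and the use of a Chebyshev least-squares fit are assumptions chosen for illustration. It splits each input into mantissa and exponent with np.frexp and approximates log2 on the mantissa range [0.5, 1) only.

```python
import numpy as np

# Fit a degree-7 polynomial to log2 on the mantissa range [0.5, 1).
# The degree and the number of sample points are arbitrary choices here.
_m = np.linspace(0.5, 1.0, 2001)
_log2_poly = np.polynomial.Chebyshev.fit(_m, np.log2(_m), deg=7)

def approx_log2(x):
    """Approximate log2 for strictly positive x (hypothetical helper)."""
    mantissa, exponent = np.frexp(x)  # x == mantissa * 2**exponent
    # log2(x) = exponent + log2(mantissa), with mantissa in [0.5, 1)
    return exponent + _log2_poly(mantissa)

x = np.random.default_rng(0).uniform(1e-3, 1e3, 10_000)
max_abs_error = np.max(np.abs(approx_log2(x) - np.log2(x)))
print(max_abs_error)
```

A hand-written SIMD version would evaluate the same kind of polynomial with vector instructions in C or Cython; the accuracy/degree trade-off is what this sketch is meant to show.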
Here is an example of a multi-threaded Numba solution:
import numba as nb
import numpy as np

@nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
    a, log_a = arg
    result = np.empty(a.shape[0])
    # Rows are independent, so they can be processed in parallel
    for i in nb.prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            if a[i, j] > 0:
                s += a[i, j] * np.log2(a[i, j])
        result[i] = s
    return result
This is about 4.3 times faster on my 6-core PC (200 µs vs. 46.4 µs). However, you should be careful if you run this on a server with many cores on such a small dataset, as it can actually be slower on some platforms.

Taking np.log2 of negative numbers (or zero) just gives a runtime warning and yields np.nan (or -inf for zero, which becomes np.nan once multiplied by zero), which is probably the best way to deal with them. If you don't want them to pollute your sum, just use
np.nansum(v_i*np.log2(v_i))
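For comparison, here is a minimal sketch of the same sum computed without triggering the runtime warnings, by only evaluating the logarithm where the input is positive. The function name `xlog2x_sum` and the small test array are hypothetical; the `where=` and `out=` arguments are the same ones used in the question's matmul variant.

```python
import numpy as np

def xlog2x_sum(a):
    """Row-wise sum of a * log2(a), ignoring non-positive entries."""
    log_a = np.zeros_like(a, dtype=float)   # entries left at 0 where a <= 0
    np.log2(a, where=a > 0, out=log_a)      # log only where it is defined
    return np.sum(a * log_a, axis=1)

a = np.array([[0.5, -1.0, 0.0, 2.0],
              [4.0, 0.25, -3.0, 1.0]])

masked = xlog2x_sum(a)

# The nansum variant gives the same values but needs the warnings silenced.
with np.errstate(invalid="ignore", divide="ignore"):
    via_nansum = np.nansum(a * np.log2(a), axis=1)

print(masked, via_nansum)
```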

Python: what is the difference between a package and a compiler?

I was reading the wiki page for Numba, and it says Numba is a "compiler". But then later on, it says that to use Numba, you import it like a package. I later looked up how to use Numba, and indeed, you just pip install it.
So now I am confused. I thought Numba was a compiler? But it seems to be used just like any other package, like numpy or pandas? What's the difference?

A compiler is a program that inputs something in human-readable form (usually a program in a specified language) and outputs a functionally equivalent stream in another, more machine-digestible form. Just as with any other transformation, it's equally viable as a command-line invocation or a function call.
As long as it's wrapped properly in a package for general use, it's perfectly reasonable to deliver a compiler as a Python package.
Does that clear up the difficulty?
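As a small illustration of a compiler being "just a function call", Python itself exposes its own bytecode compiler as the built-in compile() function. This sketch is only an analogy: Numba compiles to machine code rather than bytecode, and the source string and names below are made up for the example.

```python
# Source text in human-readable form (hypothetical example function).
source = "def square(x):\n    return x * x\n"

# compile() is an ordinary function call that turns source text
# into a code object (Python bytecode).
code_obj = compile(source, "<example>", "exec")

# Executing the code object defines square() in the given namespace.
namespace = {}
exec(code_obj, namespace)
result = namespace["square"](7)
print(result)
```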

From what I have read in the Numba documentation, it's a package that you import into your project; you then use the Numba decorator to indicate the parts of your code that you would like to have JIT (just-in-time) compiled in order to optimize them, as in the following example:
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
When the monte_carlo_pi function is called, Numba compiles it in order to optimize it, so there isn't a separate compilation step that you have to run yourself.

Decomposition of matrices for CPLEX and machine learning application

I am dealing with big matrices, and from time to time my code ends with a 'killed:9' message in my terminal. I'm working on macOS.
A wise programmer tells me the problem in my code is linked to the stored matrix I am dealing with.
nn = 35000
dd = 35
XX = np.random.rand(nn,dd)
XX = XX.dot(XX.T) #it should be faster than np.dot(XX,XX.T)
yy = np.random.rand(nn,1)
XX = np.multiply(XX,yy.T)
I have to store this huge matrix XX. My guess is to split the matrix with
upp = np.triu(XX)
Do I actually save space in terms of stored data?
What if later on I store
low = upp.T
am I wasting memory and computational time?

It should take up the same total amount of memory. To avoid the error you are probably looking at a few options:
Option 1: Process batch-wise
If you create your model through the CPLEX API, I believe the data is handled by CPLEX once you have supplied it. So you could split the data, load it piece by piece, and add it to the model consecutively.
Option 2: Allocate memory manually
If you use Cython, you can use the function malloc to allocate memory manually for your array; the size will very likely be no issue then.
Option 1 would be the preferred option in my opinion.
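To check the claim that np.triu does not save memory, here is a minimal sketch on a small stand-in matrix, together with the arithmetic for the full-size case. The size `n = 1000` is an arbitrary choice so that the example runs quickly; `nn = 35000` is the value from the question.

```python
import numpy as np

n = 1000  # small stand-in for nn = 35000
XX = np.random.rand(n, n)
upp = np.triu(XX)  # dense array: the zeros below the diagonal are still stored
same_size = upp.nbytes == XX.nbytes

# Footprint of the full nn x nn float64 matrix from the question.
nn = 35000
full_bytes = nn * nn * np.dtype(np.float64).itemsize
print(same_size, full_bytes / 1e9, "GB")
```

Storing both `upp` and its transpose as dense arrays would therefore double the footprint rather than reduce it.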
EDIT:
I constructed a little example. It actually combines the two options. The array is not stored as a Python object, but as a C array and the values are computed piecewise.
I am allocating the memory for the array using Cython and malloc. To run the code you have to install Cython. Then you can open a Python interpreter in the directory where you saved the file and write:
import pyximport;pyximport.install()
import nameofscript
An example for processing your array:
import numpy as np
from libc.stdlib cimport malloc  # Allocate memory manually
from cython.parallel import prange  # Parallel processing without GIL

dd = 35
# With cdef we can define C variables in Cython.
cdef double **XXN
cdef double y[35000]
cdef int i, j, nn
nn = 35000
# Allocate memory for the matrix with 1.225 billion double elements
XXN = <double **>malloc(nn * sizeof(double *))
for i in range(nn):
    XXN[i] = <double *>malloc(nn * sizeof(double))
XX = np.random.rand(nn, dd)
for i in range(nn):
    for j in range(nn):
        # Compute the values for the new matrix element by element
        XXN[i][j] = XX[i].dot(XX[j].T)
# Multiply the new matrix with y column wise
for i in prange(nn, nogil=True, num_threads=4):
    for j in range(nn):
        XXN[i][j] = XXN[i][j] * y[i]
Save this file as nameofscript.pyx and run it as described above. I have briefly tested this script, and it runs for about half an hour on my machine. You can extend this script and use the result array XXN for your further computations.
A note on the parallelization example: I did not initialize y or assign any values to it. If you declare y as a C array, you can, e.g., fill it with values from Python objects. Then you can perform the last multiplication without the GIL, in parallel, as shown in the code sample.
Regarding computational efficiency: this is probably not the fastest way (which may be to write your code entirely against the CPLEX C interface), but it does not throw the memory error and runs in an acceptable time if you do not have to repeat this computation too often.
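The batch-wise idea from Option 1 can also be sketched in plain NumPy, without Cython. This is a minimal sketch under stated assumptions: the generator name `gram_blocks` and the block size are hypothetical, and how each block is handed to CPLEX is left out because it depends on how the model is built. Each block holds only `block_rows` rows of the scaled matrix from the question, so the full matrix never has to exist in memory at once.

```python
import numpy as np

def gram_blocks(X, y, block_rows):
    """Yield (start_row, block) with block = (X[start:stop] @ X.T) * y.T."""
    for start in range(0, X.shape[0], block_rows):
        stop = min(start + block_rows, X.shape[0])
        # Same column scaling as np.multiply(XX, yy.T) in the question.
        yield start, (X[start:stop] @ X.T) * y.T

# Small stand-in sizes so the result can be compared with the full matrix.
rng = np.random.default_rng(0)
X = rng.random((50, 5))
y = rng.random((50, 1))

for start, block in gram_blocks(X, y, block_rows=7):
    # Here each block would be passed on, e.g. added to the model piece by piece.
    print(start, block.shape)
```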

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that would return, for each element of "source", the index of its closest point in "destination", with this limitation: I can only use numpy... So no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
# Note: numpy.core.umath_tests is a deprecated internal module;
# np.einsum('ij,ij->i', dlt, dlt) computes the same row-wise inner product.
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point)  # delta between destinations and given point
        d = inner1d(dlt,dlt)  # get distances
        return np.argmin(d)  # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations-sources[i])  # difference between destinations and given point
        d = inner1d(dlt,dlt)  # get distances
        results.append(np.argmin(d))  # append closest index
    return results

# Setup the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000  # 10k random sources
destinations = np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100

# Compare!
print('threaded: %s' % timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0])
print('unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0])
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x increase, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!

OK, I've been reading the Maya documentation on Python and came to these conclusions/guesses:
They're probably using CPython inside (there are several references to its documentation and to no other implementation).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways to get parallelism without them:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd just try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes.
One of the above should work for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be careful to match Maya's version. This may have been the reason why you couldn't build scipy. Of course, scipy has a huge codebase, and Windows is not the most forgiving platform to build things on.
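Independently of the threading question, a NumPy-only way to reduce the per-point Python overhead is to process the sources in chunks and let one matrix product handle a whole chunk. This is a minimal sketch, not a tuned implementation: the function name `nearest_indices` and the chunk size are assumptions, and the chunk size trades memory (a `chunk x n_destinations` array) against the number of Python-level iterations.

```python
import numpy as np

def nearest_indices(sources, destinations, chunk=1024):
    """Index of the closest destination for each source, using NumPy only."""
    dest_sq = np.einsum("ij,ij->i", destinations, destinations)  # |d|^2 per row
    out = np.empty(len(sources), dtype=np.intp)
    for start in range(0, len(sources), chunk):
        s = sources[start:start + chunk]
        # |s - d|^2 = |s|^2 - 2 s.d + |d|^2; |s|^2 is constant per source row,
        # so it does not affect the argmin and is dropped.
        scores = dest_sq[None, :] - 2.0 * (s @ destinations.T)
        out[start:start + chunk] = np.argmin(scores, axis=1)
    return out

# Small random data, same value range as in the question.
rng = np.random.default_rng(0)
destinations = rng.random((300, 3)) * 100
sources = rng.random((200, 3)) * 100
idx = nearest_indices(sources, destinations, chunk=64)
print(idx[:10])
```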

pyOpenCL getting different results compared to numpy

I'm trying to get started with pyOpenCL and GPGPU in general.
For the below dot product code I'm getting fairly different results between the GPU and CPU versions. What am I doing wrong?
The difference of ~0.5% seems too large for floating-point errors to account for. The difference does seem to increase with array size (~9e-8 relative difference with an array size of 10000). Maybe it's an issue with combining results across blocks...? Either way, color me disconcerted.
I don't know if it matters: I'm running this on a MacBook Air, Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz, with Intel HD Graphics 5000.
Thanks in advance.
import pyopencl as cl
import numpy
from pyopencl.reduction import ReductionKernel
import pyopencl.clrandom as cl_rand

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
dot = ReductionKernel(
    ctx,
    dtype_out=numpy.float32,
    neutral="0",
    map_expr="x[i]*y[i]",
    reduce_expr="a+b",
    arguments="__global const float *x, __global const float *y",
)
x = cl_rand.rand(queue, 100000000, dtype=numpy.float32)
y = cl_rand.rand(queue, 100000000, dtype=numpy.float32)
x_dot_y = dot(x, y).get()  # GPU: array(25001304.0, dtype=float32)
x_dot_y_cpu = numpy.dot(x.get(), y.get())  # CPU: 24875690.0
print(abs(x_dot_y_cpu - x_dot_y) / x_dot_y)  # 0.0050496689740063489

The order in which values are reduced will likely be very different between these two methods. Across large data sets, the tiny errors in floating point rounding can soon add up. There could also be other details about the underlying implementations that affect the precision of the result.
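The effect of reduction order can be seen on the CPU alone with a minimal NumPy sketch. The three constants and the array size below are arbitrary choices for illustration, and the printed relative errors will vary by platform and NumPy build, so no particular values are claimed for them.

```python
import numpy as np

# float32 addition is not associative: the order of operations changes the result.
a, b, c = np.float32(1e8), np.float32(1.0), np.float32(-1e8)
left_to_right = (a + b) + c  # the 1.0 is lost when added to 1e8
reordered = (a + c) + b      # cancelling the large terms first keeps the 1.0

# Same products, different accumulation strategies.
rng = np.random.default_rng(0)
x = rng.random(1_000_000, dtype=np.float32)
y = rng.random(1_000_000, dtype=np.float32)
reference = np.dot(x.astype(np.float64), y.astype(np.float64))
sequential32 = np.cumsum(x * y)[-1]  # float32 running sum, one element at a time
rel_err_sequential = abs(sequential32 - reference) / reference
rel_err_dot32 = abs(np.dot(x, y) - reference) / reference

print(left_to_right, reordered)
print(rel_err_sequential, rel_err_dot32)
```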
I've run your example code on my own machine and get a similar sort of difference in the final result (~0.5%). As a data point, you can implement a very simple dot product in raw Python and see how much that differs from both the OpenCL result and from Numpy.
For example, you could add something simple like this to your example:
x_dot_y_naive = sum(a*b for a,b in zip(x.get(), y.get()))
Here are the results I get on my machine:
OPENCL: 25003466.000000
NUMPY: 24878146.000000 (0.5012%)
NAIVE: 25003465.601387 (0.0000%)
As you can see, the naive implementation is closer to the OpenCL version than Numpy is. One explanation for this could be that Numpy's dot function probably makes use of fused multiply-add (FMA) operations, which change how intermediate results are rounded. Without any compiler options to tell it otherwise, OpenCL should fully comply with the IEEE 754 standard rather than using the faster FMA operations.
