efficiently computing parafac / CP product in numpy - python

This question focuses on numpy.
I have a set of matrices which all share the same number of columns but have different numbers of rows. Let's call them A, B, C, D, etc. and let their dimensions be Ia x K, Ib x K, Ic x K, etc.
What I want is to efficiently compute the Ia x Ib x Ic x ... tensor P defined as follows:
P(i_a, i_b, i_c, ...) = \sum_k A(i_a, k) B(i_b, k) C(i_c, k) ...
So if I have only two factors, I end up with a simple matrix product.
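For example (a quick illustration of the two-factor case, my own addition):
import numpy as np

A = np.random.rand(5, 8)   # Ia x K
B = np.random.rand(4, 8)   # Ib x K
P2 = A.dot(B.T)            # P2[ia, ib] = sum_k A[ia, k] * B[ib, k]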
Of course I can compute this "by hand" through outer products, something like:
import numpy as np

def parafac(factors, components=None):
    ndims = len(factors)
    ncomponents = factors[0].shape[1]
    total_result = np.array([])
    if components is None:
        components = range(ncomponents)
    for k in components:
        # for each component (to save memory)
        result = np.array([])
        for dim in range(ndims - 1, -1, -1):
            # augment the model with the next dimension
            current_dim_slice = [slice(None, None, None)]
            current_dim_slice.extend([None] * (ndims - dim - 1))
            current_dim_slice.append(k)
            if result.size:
                result = factors[dim][tuple(current_dim_slice)] * result[None, ...]
            else:
                result = factors[dim][tuple(current_dim_slice)]
        if total_result.size:
            total_result += result
        else:
            total_result = result
    return total_result
Still, I would like something much more computationally efficient, for example relying on built-in numpy functions, but I cannot find the relevant functions. Can someone help me?
Cheers, thanks

Thank you all very much for your answers. I've spent the day on this and I eventually found the solution, so I post it here for the record.
This solution requires numpy >= 1.6 and makes use of einsum, which is powerful voodoo magic.
Basically, if you had factors=[A,B,C,D] with A, B, C and D matrices with the same number of columns, then you would compute the parafac model using:
import numpy
P = numpy.einsum('az,bz,cz,dz->abcd', A, B, C, D)
so, one line!
In the general case, I end up with this:
import string
import numpy

def parafac(factors):
    ndims = len(factors)
    request = ''
    for temp_dim in range(ndims):
        request += string.ascii_lowercase[temp_dim] + 'z,'
    request = request[:-1] + '->' + string.ascii_lowercase[:ndims]
    return numpy.einsum(request, *factors)
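As a quick sanity check (my own example, not from the original post), the general function reproduces the explicit four-factor call above:
A = numpy.random.rand(5, 8)
B = numpy.random.rand(4, 8)
C = numpy.random.rand(3, 8)
D = numpy.random.rand(2, 8)
print(numpy.allclose(parafac([A, B, C, D]),
                     numpy.einsum('az,bz,cz,dz->abcd', A, B, C, D)))   # expected: True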

Keeping in mind that the outer product is the Kronecker product in disguise, your problem should be solved by these simple functions:
import numpy as np
from functools import reduce

def outer(vectors):
    shape = [v.shape[0] for v in vectors]
    return reduce(np.kron, vectors).reshape(shape)

def cp2Tensor(l, A):
    terms = []
    for r in range(A[0].shape[1]):
        term = l[r] * outer([A[n][:, r] for n in range(len(A))])
        terms.append(term)
    return sum(terms)
cp2Tensor takes a list of real numbers (the component weights) and a list of matrices.
Edited after comment by Jaime.
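For example (my own illustration), with unit weights this reproduces the einsum result from the first answer:
A = np.random.rand(5, 8)
B = np.random.rand(4, 8)
C = np.random.rand(3, 8)
weights = np.ones(8)
P = cp2Tensor(weights, [A, B, C])
print(np.allclose(P, np.einsum('az,bz,cz->abc', A, B, C)))   # expected: True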

Ok, so the following works. First a worked out example of what's going on...
import numpy as np

a = np.random.rand(5, 8)
b = np.random.rand(4, 8)
c = np.random.rand(3, 8)

ret = np.ones((5, 4, 3, 8))
ret *= a.reshape(5, 1, 1, 8)
ret *= b.reshape(1, 4, 1, 8)
ret *= c.reshape(1, 1, 3, 8)
ret = ret.sum(axis=-1)
And a full function
def tensor(elems):
    cols = elems[0].shape[-1]
    n_elems = len(elems)
    ret = np.ones(tuple([j.shape[0] for j in elems] + [cols]))
    for j, el in enumerate(elems):
        ret *= el.reshape((1,) * j + (el.shape[0],) +
                          (1,) * (n_elems - j - 1) + (cols,))
    return ret.sum(axis=-1)
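As a quick check (my own addition), the function reproduces the worked example above:
print(np.allclose(tensor([a, b, c]), ret))   # expected: True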

Related

How to speed up an N dimensional interval tree in python?

Consider the following problem: Given a set of n intervals and a set of m floating-point numbers, determine, for each floating-point number, the subset of intervals that contain the floating-point number.
This problem has been addressed by constructing an interval tree (also called a range tree or segment tree). Implementations exist for the one-dimensional case, e.g. python's intervaltree package. Usually, these implementations consider one or a few floating-point numbers, i.e. a small m above.
In my problem setting, both n and m are extremely large numbers (from solving an image processing problem). Further, I need to consider N-dimensional intervals (called cuboids when N=3, because I was modeling human brains with the Finite Element Method). I have implemented a simple N-dimensional interval tree in python, but it runs in a loop and can only take one floating-point number at a time. Can anyone help improve the implementation in terms of efficiency? You can change the data structure freely.
import sys
import time
import numpy as np

# find the indices of a satisfying x > a in one dimension
def find_index_smaller(a, x):
    idx = np.argsort(a)
    ss = np.searchsorted(a, x, sorter=idx)
    res = idx[0:ss]
    return res

# find the indices of a satisfying x < a in one dimension
def find_index_larger(a, x):
    return find_index_smaller(-a, -x)

# find the indices of a satisfying amin < x < amax in one dimension
def find_intv_at(amin, amax, x):
    idx = find_index_smaller(amin, x)
    idx2 = find_index_larger(amax[idx], x)
    res = idx[idx2]
    return res

# find the indices of a satisfying amin < x < amax in N dimensions
def find_intv_at_nd(amin, amax, x):
    dim = amin.shape[0]
    res = np.arange(amin.shape[-1])
    for i in range(dim):
        idx = find_intv_at(amin[i, res], amax[i, res], x[i])
        res = res[idx]
    return res
I also have two test examples for sanity check and performance testing:
def demo1():
    print("By default, we do a correctness test")
    n_intv = 2
    n_point = 2
    # generate the test data
    point = np.random.rand(3, n_point)
    intv_min = np.random.rand(3, n_intv)
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point")
    print(point)
    print("intv_min")
    print(intv_min)
    print("intv_max")
    print(intv_max)
    print("===Indexes of intervals that contain the point===")
    for i in range(n_point):
        print(find_intv_at_nd(intv_min, intv_max, point[:, i]))

def demo2():
    print("Performance:")
    n_points = 100
    n_intv = 1000000
    # generate the test data
    points = np.random.rand(n_points, 3) * 512
    intv_min = np.random.rand(3, n_intv) * 512
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point.shape = " + str(points.shape))
    print("intv_min.shape = " + str(intv_min.shape))
    print("intv_max.shape = " + str(intv_max.shape))
    starttime = time.time()
    for point in points:
        tmp = find_intv_at_nd(intv_min, intv_max, point)
    print("it took this long to run {} points, with {} intervals: {}".format(n_points, n_intv, time.time() - starttime))
My ideas would be:
- Remove np.argsort() from the algorithm: the intervals do not change, so the sorting could be done once in pre-processing (a rough sketch of this idea follows after this list).
- Vectorize x: the algorithm currently runs a loop for each x, and it would be nice to get rid of that loop.
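A minimal sketch of the pre-sorting idea (my own, untested), reusing the same one-dimensional building block as find_index_smaller above:
import numpy as np

# Hypothetical pre-processing step: compute the sort order of the lower bounds
# once, instead of calling np.argsort on every query.
def presort(amin):
    return np.argsort(amin)          # reused for every query point x

# Same result as find_index_smaller, but with the sorter precomputed
def find_index_smaller_presorted(a, sorter, x):
    ss = np.searchsorted(a, x, sorter=sorter)
    return sorter[:ss]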
Any contribution would be appreciated.

Why is Z3 slow for tiny search space?

I'm trying to make a Z3 program (in Python) that generates boolean circuits that do certain tasks (e.g. adding two n-bit numbers) but the performance is terrible to the point where a brute-force search of the entire solution space would be faster. This is my first time using Z3 so I could be doing something that impacts my performance, but my code seems fine.
The following is copied from my code here:
from z3 import *

BITLEN = 1  # Number of bits in input
STEPS = 1   # How many steps to take (e.g. time)
WIDTH = 2   # How many operations/values can be stored in parallel, has to be at least BITLEN * #inputs

# Input variables
x = BitVec('x', BITLEN)
y = BitVec('y', BITLEN)

# Define operations used
op_list = [BitVecRef.__and__, BitVecRef.__or__, BitVecRef.__xor__, BitVecRef.__xor__]
unary_op_list = [BitVecRef.__invert__]
for uop in unary_op_list:
    op_list.append(lambda x, y: uop(x))

# Chooses a function to use by setting all others to 0
def chooseFunc(i, x, y):
    res = 0
    for ind, op in enumerate(op_list):
        res = res + (ind == i) * op(x, y)
    return res

s = Solver()
steps = []

# First step is just the bits of the input padded with constants
firststep = Array("firststep", IntSort(), BitVecSort(1))
for i in range(BITLEN):
    firststep = Store(firststep, i * 2, Extract(i, i, x))
    firststep = Store(firststep, i * 2 + 1, Extract(i, i, y))
for i in range(BITLEN * 2, WIDTH):
    firststep = Store(firststep, i, BitVec("const_0_%d" % i, 1))
steps.append(firststep)

# Generate remaining steps
for i in range(1, STEPS + 1):
    this_step = Array("step_%d" % i, IntSort(), BitVecSort(1))
    last_step = steps[-1]
    for j in range(WIDTH):
        func_ind = Int("func_%d_%d" % (i, j))
        s.add(func_ind >= 0, func_ind < len(op_list))
        x_ind = Int("x_%d_%d" % (i, j))
        s.add(x_ind >= 0, x_ind < WIDTH)
        y_ind = Int("y_%d_%d" % (i, j))
        s.add(y_ind >= 0, y_ind < WIDTH)
        node = chooseFunc(func_ind, Select(last_step, x_ind), Select(last_step, y_ind))
        this_step = Store(this_step, j, node)
    steps.append(this_step)

# Set the result to the first BITLEN bits of the last step
if BITLEN == 1:
    result = Select(steps[-1], 0)
else:
    result = Concat(*[Select(steps[-1], i) for i in range(BITLEN)])

# Set goal
goal = x | y
s.add(ForAll([x, y], goal == result))
print(s)
print(s.check())
print(s.model())
The code basically lays out the inputs as individual bits, then at each "step" one of 5 boolean functions can operate on the values from the previous step, where the final step represents the end result.
In this example, I generate a circuit to calculate the boolean OR of two 1-bit inputs, and an OR function is available in the circuit, so the solution is trivial.
I have a solution space of only 5*5*2*2*2*2 = 400:
- 5 possible functions for each of the two function nodes
- 2 inputs for each function, each of which has two possible values
This code takes a few seconds to run and provides a correct answer, but I feel like it should run instantaneously as there are only 400 possible solutions, of which quite a few are valid. If I increase the inputs to be two bits long, the solution space has a size of (5^4)*(4^8)=40,960,000 and never finishes on my computer, though I feel this should be easily doable with Z3.
I also tried effectively the same code but substituted Arrays/Store/Select for Python lists and "selected" the variables by using the same trick I used in chooseFunc(). The code is here and it runs in around the same time the original code does, so no speedup.
Am I doing something that would drastically slow down the solver? Thanks!
You have a duplicated __xor__ in your op_list; but that's not really the major problem. The slowdown is inevitable as you increase bit-size, but on a first look you can (and should) avoid mixing integer reasoning with booleans here. I'd code your chooseFunc as follows:
def chooseFunc(i, x, y):
    res = False
    for ind, op in enumerate(op_list):
        res = If(ind == i, op(x, y), res)
    return res
See if that improves run-times in any meaningful way. If not, the next thing to do would be to get rid of arrays as much as possible.
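On that last point, a minimal sketch (my own, not from the answer) of how Select on a Z3 Array could be replaced by an If-chain over a plain Python list:
from z3 import If

# Hypothetical helper: pick values[idx] from a Python list of Z3 expressions,
# where idx is a symbolic Int, without using a Z3 Array.
def select_from_list(values, idx):
    res = values[0]
    for k, v in enumerate(values):
        res = If(idx == k, v, res)
    return res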

How do you use a list as an index argument for numpy ndarrays?

So I have a problem that might be super duper simple.
I have these numpy ndarrays that I allocated and want to assign values to them via indices returned as lists. It might be easier if I showed you some example code. The questionable code I have is at the bottom, and in my testing (before actually taking this to scale) I keep getting syntax errors :'(
EDIT: edited to make it easier to troubleshoot and put some example code at the bottoms
import numpy as np

def do_stuff(index, mask):
    # this is where the calculations are made
    magic = sum(mask)
    return index, magic

def foo(full_index, comparison_dims, *xargs):
    # I have this function executed in Parallel since I'm using a machine with 36 nodes per core,
    # and can access up to 16 cores for each script #blessed
    # figure out how many dimensions there are, and how big they are
    parent_dims = []
    parent_diffs = []
    for j in xargs:
        parent_dims += [len(j)]
        parent_diffs += [j[1] - j[0]]  # this is used to find a mask
    index = []  # this is where the individual dimension indices will be stored
    dim_n = 0
    # loop through the dimensions
    while dim_n < len(parent_dims):
        dim_index = full_index % parent_dims[dim_n]
        index += [dim_index]
        if dim_n == 0:
            mask = (comparison_dims[dim_n] > xargs[dim_n][dim_index] - parent_diffs[dim_n] / 2) * \
                   (comparison_dims[dim_n] <= xargs[dim_n][dim_index] + parent_diffs[dim_n] / 2)
        else:
            mask *= (comparison_dims[dim_n] > xargs[dim_n][dim_index] - parent_diffs[dim_n] / 2) * \
                    (comparison_dims[dim_n] <= xargs[dim_n][dim_index] + parent_diffs[dim_n] / 2)
        full_index //= parent_dims[dim_n]
        dim_n += 1
    return do_stuff(index, mask)

def bar(comparison_dims, *xargs):
    if len(xargs) == comparison_dims.shape[0]:
        pass
    elif len(comparison_dims.shape) == 2:
        pass
    else:
        raise ValueError("silly person, you failed")
    from joblib import Parallel, delayed
    dims = []
    for j in xargs:
        dims += [len(j)]
    myArray = np.empty(tuple(dims))
    results = Parallel(n_jobs=1)(
        delayed(foo)(index, comparison_dims, *xargs)
        for index in range(np.prod(dims))
    )
    # LOOK HERE, HELP HERE!
    for index_list, result in results:
        # I thought this would work, but oh golly was I wrong; index_list here is a list of ints,
        # and result is a value
        # for example index, result = [0,3,7], 45.4
        # so in execution, that would yield: myArray[0,3,7] = 45.4
        # instead it yields SyntaxError because I don't know what I'm doing XD
        myArray[*index_list] = result
    return myArray
Any ideas how I can make that work? What do I need to do?
I'm not the sharpest tool in the shed, but I think with your help we might be able to figure this out!
A quick example to troubleshoot this problem would be:
compareDims = np.array([np.random.rand(1000), np.random.rand(1000)])
dim0 = np.arange(0,1,1./20)
dim1 = np.arange(0,1,1./30)
myArray = bar(compareDims, dim0, dim1)
To index a numpy array with an arbitrary list of indices (one per dimension), you actually need to use a tuple:
for index_list, result in results:
    myArray[tuple(index_list)] = result
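A tiny standalone demonstration (my own addition) of the same idea:
import numpy as np

myArray = np.zeros((5, 5, 10))
index_list = [0, 3, 7]
myArray[tuple(index_list)] = 45.4   # equivalent to myArray[0, 3, 7] = 45.4
print(myArray[0, 3, 7])             # 45.4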

I need to vectorize the following so the code can run faster

This portion I was able to vectorize and get rid of a nested loop.
import numpy as np
from scipy import stats

def EMalgofast(obsdata, beta, pjt):
    n = np.shape(obsdata)[0]
    g = np.shape(pjt)[0]
    zijtpo = np.zeros(shape=(n, g))
    for j in range(g):
        zijtpo[:, j] = pjt[j] * stats.expon.pdf(obsdata, scale=beta[j])
    zijdenom = np.sum(zijtpo, axis=1)
    zijtpo = zijtpo / np.reshape(zijdenom, (n, 1))
    pjtpo = np.mean(zijtpo, axis=0)
I wasn't able to vectorize the portion below; I need to figure that out.
    betajtpo_1 = []
    for j in range(g):
        num = 0
        denom = 0
        for i in range(n):
            num = num + zijtpo[i][j] * obsdata[i]
            denom = denom + zijtpo[i][j]
        betajtpo_1.append(num / denom)
    betajtpo = np.asarray(betajtpo_1)
    return (pjtpo, betajtpo)
I'm guessing Python is not your first programming language, based on what I see. The reason I'm saying this is that in Python we normally don't have to manipulate indexes; you act directly on the value or the key returned. Please don't take this as an offense, I do the same coming from C++ myself. It's a hard habit to shake ;).
If you're interested in performance, there is a good presentation by Raymond Hettinger on how to write Python that is optimized and beautiful:
https://www.youtube.com/watch?v=OSGv2VnC0go
As for the code you need help with, does this help? It's unfortunately untested, as I need to leave...
ref:
Iterating over a numpy array
http://docs.scipy.org/doc/numpy/reference/generated/numpy.true_divide.html
def EMalgofast(obsdata, beta, pjt):
    n = np.shape(obsdata)[0]
    g = np.shape(pjt)[0]
    zijtpo = np.zeros(shape=(n, g))
    for j in range(g):
        zijtpo[:, j] = pjt[j] * stats.expon.pdf(obsdata, scale=beta[j])
    zijdenom = np.sum(zijtpo, axis=1)
    zijtpo = zijtpo / np.reshape(zijdenom, (n, 1))
    pjtpo = np.mean(zijtpo, axis=0)
    # manipulating arrays of numerators and denominators instead of creating objects each iteration
    num = np.zeros(shape=(g, 1))
    denom = np.zeros(shape=(g, 1))
    # accumulating num and denom for the end result
    # (x is the observation index, y is the component index)
    for (x, y), value in np.ndenumerate(zijtpo):
        num[y], denom[y] = num[y] + value * obsdata[x], denom[y] + value
    # dividing all at once afterwards instead of inside the loop
    betajtpo = np.true_divide(num, denom).ravel()
    return (pjtpo, betajtpo)
Please leave me some feedback !
Regards,
Eric Lafontaine
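For completeness, the remaining loop can also be collapsed entirely into a single matrix-vector product; a minimal sketch (my own addition, assuming zijtpo has shape (n, g) and obsdata has shape (n,)):
import numpy as np

def betas_vectorized(zijtpo, obsdata):
    # num[j]   = sum_i zijtpo[i, j] * obsdata[i]
    # denom[j] = sum_i zijtpo[i, j]
    num = zijtpo.T @ obsdata      # shape (g,)
    denom = zijtpo.sum(axis=0)    # shape (g,)
    return num / denom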

How to apply my own function along each rows and columns with NumPy

I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
import numpy as np

TMP = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I was looking for the various questions around here, but I couldn't find the best fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store number of rows in TMP as a parameter
N = TMP.shape[0]
# Get the indices that would be used as row indices to select rows off TMP and
# also as row,column indices for setting output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)
# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)
# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or would be replaceable with dot-products and as such could be faster alternatives. So, with those we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
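Either approach can be checked against the loopy version from the question (my own addition):
# compare the vectorized output with RESULT filled in by the original double loop
print(np.allclose(out, RESULT))   # expected: True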
