Efficient tensor contraction in python

I have a list L of tensors (ndarray objects), with several indices each. I need to contract these indices according to a graph of connections.
The connections are encoded in a list of tuples of the form ((m,i),(n,j)), meaning "contract the i-th index of the tensor L[m] with the j-th index of the tensor L[n]".
How can I handle non-trivial connectivity graphs? The first problem is that as soon as I contract a pair of indices, the result is a new tensor that does not belong to the list L. But even if I solved this (e.g. by giving a unique identifier to all the indices of all the tensors), there is the issue that one can pick any order to perform the contractions, and some choices yield unnecessarily enormous beasts in mid-computation (even if the final result is small). Suggestions?

Memory considerations aside, I believe you can do the contractions in a single call to einsum, although you'll need some preprocessing. I'm not entirely sure what you mean by "as I contract a pair of indices, the result is a new tensor that does not belong to the list L", but I think doing the contraction in a single step would exactly solve this problem.
I suggest using the alternative, numerically indexed syntax of einsum:
einsum(op0, sublist0, op1, sublist1, ..., [sublistout])
So what you need to do is encode the indices to contract in integer sequences. First you'll need to set up a range of unique indices initially, and keep another copy to be used as sublistout. Then, iterating over your connectivity graph, you need to set contracted indices to the same index where necessary, and at the same time remove the contracted index from sublistout.
import numpy as np

def contract_all(tensors, conns):
    '''
    Contract the tensors inside the list tensors
    according to the connectivities in conns
    Example input:
    tensors = [np.random.rand(2,3),np.random.rand(3,4,5),np.random.rand(3,4)]
    conns = [((0,1),(2,0)), ((1,1),(2,1))]
    returned shape in this case is (2,3,5)
    '''
    ndims = [t.ndim for t in tensors]
    totdims = sum(ndims)
    dims0 = np.arange(totdims)
    # keep track of sublistout throughout
    sublistout = set(dims0.tolist())
    # cut up the index array according to tensors
    # (throw away empty list at the end)
    inds = np.split(dims0, np.cumsum(ndims))[:-1]
    # we also need to convert to a list, otherwise einsum chokes
    inds = [ind.tolist() for ind in inds]
    # if there were no contractions, we'd call
    # np.einsum(*zip(tensors,inds),sublistout)
    # instead we need to loop over the connectivity graph
    # and manipulate the indices
    for (m, i), (n, j) in conns:
        # tensors[m][i] contracted with tensors[n][j]
        # remove the old indices from sublistout which is a set
        sublistout -= {inds[m][i], inds[n][j]}
        # contract the indices
        inds[n][j] = inds[m][i]
    # zip and flatten the tensors and indices
    args = [subarg for arg in zip(tensors, inds) for subarg in arg]
    # assuming there are no multiple contractions, we're done here;
    # a set has no guaranteed order, so pass the output labels as a sorted list
    return np.einsum(*args, sorted(sublistout))
A trivial example:
>>> tensors = [np.random.rand(2,3), np.random.rand(4,3)]
>>> conns = [((0,1),(1,1))]
>>> contract_all(tensors,conns)
array([[ 1.51970003, 1.06482209, 1.61478989, 1.86329518],
[ 1.16334367, 0.60125945, 1.00275992, 1.43578448]])
>>> np.einsum('ij,kj',tensors[0],tensors[1])
array([[ 1.51970003, 1.06482209, 1.61478989, 1.86329518],
[ 1.16334367, 0.60125945, 1.00275992, 1.43578448]])
In case there are multiple contractions, the logistics in the loop become a bit more complex, because we need to handle all the duplicates. The logic, however, is the same. Furthermore, the above is obviously missing checks to ensure that the corresponding indices can be contracted.
In hindsight I realized that the default sublistout doesn't have to be specified, einsum uses that order anyway. I decided to leave that variable in the code, because in case we want a non-trivial output index order, we'll have to handle that variable appropriately, and it might come handy.
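To illustrate the role of the output sublist, here is a minimal sketch of the integer-indexed einsum form, with and without an explicit output order. The arrays and labels are made up for illustration and are not part of the function above.

```python
import numpy as np

a = np.random.rand(2, 3)
b = np.random.rand(4, 3)

# Sublist form: label the axes of each operand with integers.
# Axis 1 of `a` and axis 1 of `b` share label 1, so they are contracted.
default_order = np.einsum(a, [0, 1], b, [2, 1])

# Explicit output sublist: keep the same labels but ask for order (2, 0).
swapped_order = np.einsum(a, [0, 1], b, [2, 1], [2, 0])

print(default_order.shape, swapped_order.shape)
```

Without the output sublist the surviving labels come out in ascending order, which is why the default only needs overriding when a different axis order is wanted.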
As for optimization of the contraction order, np.einsum can optimize internally as of NumPy 1.12 (as noted by @hpaulj in a now-deleted comment). That version introduced the optional optimize keyword argument to np.einsum, which lets it choose a contraction order that cuts down on computation time at the cost of memory for intermediate arrays. Passing 'greedy' or 'optimal' as the optimize keyword will make numpy choose a contraction order in roughly decreasing order of sizes of the dimensions.
The options available for the optimize keyword come from the apparently undocumented (as far as online documentation goes; help() fortunately works) function np.einsum_path:
einsum_path(subscripts, *operands, optimize='greedy')
Evaluates the lowest cost contraction order for an einsum expression by
considering the creation of intermediate arrays.
The output contraction path from np.einsum_path can also be used as an input for the optimize argument of np.einsum. In your question you were worried about too much memory being used, so I suspect that the default of no optimization (with potentially longer runtime and a smaller memory footprint) may suit you best.
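As a rough sketch of how the two functions fit together, the snippet below computes a path once with np.einsum_path and feeds it back into np.einsum. The three-matrix chain and its shapes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((8, 16))
b = rng.random((16, 32))
c = rng.random((32, 4))

# einsum_path returns the contraction path and a printable report
path, report = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='greedy')

# Default: a single pass with no intermediate arrays
unoptimized = np.einsum('ij,jk,kl->il', a, b, c)
# Reuse the precomputed path: pairwise contractions with intermediates
optimized = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)

print(path)
print(report)
```

The printed report lists the intermediate contractions, which gives a way to judge the memory cost before committing to an optimized path.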

Maybe helpful: take a look at https://arxiv.org/abs/1402.0939, which describes an efficient framework for contracting so-called tensor networks in a single function ncon(...). As far as I can see, implementations are directly available for Matlab (can be found within the link) and for Python 3 (https://github.com/mhauru/ncon).

Related

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]
d = some_algorithm(S,t)
# d = [[2,1],
#      [3,2],
#      [7,2]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set we can write each element in t as a M digit number to the base L using elements from S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial ... x mod L to get back the index of the least significant digit. floor(x/L) and repeat until all indices are extracted. lookup the values in S and construct the tuple.
since you can now represent t as a set of integers (read: hash table), calculating the inverse set d becomes rather trivial
loop from 0 to L^M - 1 and ask your hash table whether each number is in t; if it is not, it belongs to d
if the size of S is too big you can also just draw random numbers against the hashtable for a subset of the inverse of t
does this help you?
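The base-transform idea above can be sketched in NumPy for M=2. This is only an illustration under the assumption that S is sorted and small enough for all L^2 codes to fit in memory; the helper names encode and decode are made up here.

```python
import numpy as np

S = np.array([0, 1, 2, 3, 7])            # unique values, sorted
t = np.array([[0, 1], [7, 3], [3, 1]])
L = len(S)

def encode(pairs):
    # I(x): position of each value in the sorted array S
    idx = np.searchsorted(S, pairs)
    # f(x) = I(x[1])*L + I(x[0])
    return idx[:, 1] * L + idx[:, 0]

def decode(codes):
    # inverse transform: least significant digit first
    return np.stack([S[codes % L], S[codes // L]], axis=1)

all_codes = np.arange(L ** 2)
free_codes = np.setdiff1d(all_codes, encode(t))   # codes not used by t
d = decode(free_codes)                            # every pair over S missing from t
print(d.shape)
```

For a large S the full range of codes is too big to enumerate, and drawing random codes against the set of encoded t values is the more practical route.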
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that, by redrawing until you get a new element, the expected total work is O((1/(1-C)) * |d|) draws, which is O(|d|) if C is indeed constant.
Checking whether an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has already been "seen" though it has not, but never the other way around, so all elements in d will still be unique.
Sorting t in place and using binary search. This adds O(|t|log|t|) pre-processing and O(log|t|) for each lookup, but requires no additional space (other than where you store d).
If in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use Fisher Yates shuffle on the available choices, and choosing the first |d| elements that do not appear in t.
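A minimal sketch of the redraw-until-new approach with a hash set, assuming S is a sequence of unique values and every pair in t is drawn from S. The function name sample_unseen_pairs is hypothetical.

```python
import random

def sample_unseen_pairs(S, t, m, seed=None):
    """Draw m distinct pairs over S that do not occur in t (hypothetical helper)."""
    rng = random.Random(seed)
    seen = {tuple(pair) for pair in t}       # hash set of t, later also of d
    if m > len(S) ** 2 - len(seen):
        raise ValueError("not enough unused pairs to draw from")
    result = []
    while len(result) < m:
        candidate = (rng.choice(S), rng.choice(S))
        if candidate not in seen:            # O(1) expected lookup
            seen.add(candidate)              # keeps d free of duplicates too
            result.append(candidate)
    return result

S = [0, 1, 2, 3, 7]
t = [(0, 1), (7, 3), (3, 1)]
d = sample_unseen_pairs(S, t, m=5, seed=0)
print(d)
```

The guard at the top matters when |t| + |d| approaches |S|^2, where the loop would otherwise spin for a long time or never finish.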

Filtered Numpy Array Changes Number of Dimensions

I'm having trouble getting used to Numpy arrays (I'm a Matlab user). When I try to select just a range of values from an array, I see the resulting array has an extra dimension:
ioi = np.nonzero((self.data_array[0,:] >= range_start) & (self.data_array[0,:] <= range_end))
print("self.data_array.shape = {0}".format(self.data_array.shape))
print("self.data_array.shape[:,ioi] = {0}".format(self.data_array[:,ioi].shape))
The result is:
self.data_array.shape = (5, 50000)
self.data_array.shape[:,ioi] = (5, 1, 408)
I also see that ioi is a tuple. I don't know if that has anything to do with it.
What is happening here to create that extra dimension and what should I do, in the most direct way, to get an array shape of (5,408) in this case?
The simplest and most efficient thing would be to get rid of the np.nonzero call, and use logical indexing just as one would in Matlab. Here's an example. (I'm using random data of the same shape, FYI.)
>>> data = np.random.randn(5, 5000)
>>> start, end = -0.5, 0.5
>>> ioi = (data[0] > start) & (data[0] < end)
>>> print(ioi.shape)
(5000,)
>>> print(ioi.sum())
1900
>>> print(data[:, ioi].shape)
(5, 1900)
The np.nonzero call is not usually needed. Just like Matlab's find function, it's slow compared with logical indexing, and usually one's goal can be more efficiently accomplished with logical indexing. np.nonzero, just like find, should mostly be used only when you need the actual index values themselves.
As you suspected, the reason for the extra dimensions is that tuples are handled differently from other types of indexing arrays in NumPy. This is to allow more flexible indexing, such as with slices, ellipses, etc. See this useful page for in-depth explanation, especially the last section.
There are at least two other options to solve the problem. One is to index with the array inside the tuple rather than with the tuple itself, as in self.data_array[:, ioi[0]]. Part of why you have an extra dimension is that you actually have two sets of indices in your call: the slice (:) and the tuple ioi. np.nonzero always returns a tuple, with one index array per dimension of its input, so that its output can be used directly to index the array it was computed from; here that array is the 1-D condition, not the 2-D data array.
The last option is to call np.squeeze on the returned array, but I'd opt for one of the above first.
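The options discussed in this answer can be compared side by side. The following sketch uses random data in place of self.data_array, so the number of selected columns will differ from the question.

```python
import numpy as np

data = np.random.randn(5, 5000)
start, end = -0.5, 0.5

mask = (data[0] >= start) & (data[0] <= end)   # boolean array, shape (5000,)
ioi = np.nonzero(mask)                          # 1-tuple holding one index array

with_tuple = data[:, ioi]       # the tuple becomes a (1, n) index array
with_array = data[:, ioi[0]]    # index with the array inside the tuple
with_mask = data[:, mask]       # logical indexing, no nonzero needed
squeezed = np.squeeze(with_tuple, axis=1)

print(with_tuple.shape, with_array.shape, with_mask.shape, squeezed.shape)
```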

Python - sum the intersection of rows and columns in a matrix

Let's suppose we have a matrix and a list of indexes:
adj_mat = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
indexes = [0,2]
What I want is to sum the rows and columns corresponding to the sub matrix we get by the intersection of the rows and columns of the indexes list. In this case it would be:
sub_matrix = [[1,3],
              [7,9]]
result_rows = [4,16]
result_columns = [8,12]
However, I do this calculation many times with the same original matrix and different indexes lists, so I am looking for an efficient solution that does not create the sub matrix on each iteration. My solution so far is (and for columns respectively):
def sum_rows(matrix, indexes):
    sum_r = [0]*len(indexes)
    for i in range(len(indexes)):
        for j in indexes:
            sum_r[i] += matrix.item(indexes[i], j)
    return sum_r
I'm looking for a more efficient algorithm as I remember there is a method which looks like this that sums all rows (or columns?) in the indexes:
matrix.sum(:, indexes)
matrix.sum(indexes, indexes)
I assume what I need is the second line, if it exists. I tried to google it, with or without numpy, but couldn't find the right syntax.
Is there a solution as I described here but I'm just using the wrong syntax? Or any other suggestions for improvement?
IIUC:
import numpy as np
adj_mat = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
indexes = np.array([1, 3]) - 1
sub_matrix = adj_mat[np.ix_(indexes, indexes)]
result_rows, result_columns = sub_matrix.sum(axis=1), sub_matrix.sum(axis=0)
Result:
array([ 4, 16]) # result_rows
array([ 8, 12]) # result_columns
So assuming you made a mistake and you meant indexes = [0,2] and sub_matrix = [[1,3], [7,9]], then this should do what you want
def sum_sub(matrix, indices):
    """
    Returns the sum of each row and column (as a tuple)
    for each index in indices (as an array)
    """
    # note that indexing with np.ix_ is advanced ("fancy") indexing,
    # so sub_mat is a copy of the selected entries, not a view of matrix
    sub_mat = matrix[np.ix_(indices, indices)]
    return sub_mat.sum(axis=1), sub_mat.sum(axis=0)

sum_row, sum_col = sum_sub(np.arange(1,10).reshape((3,3)), [0,2])
The results of this are
sum_col # --> [ 8 12]
sum_row # --> [ 4 16]
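Whether a sub-matrix is a copy or a view can be checked directly with np.shares_memory. The sketch below compares np.ix_ indexing with a plain slice that happens to select the same entries for this particular example.

```python
import numpy as np

adj_mat = np.arange(1, 10).reshape(3, 3)
indexes = [0, 2]

fancy = adj_mat[np.ix_(indexes, indexes)]   # advanced indexing
sliced = adj_mat[::2, ::2]                  # basic slicing, same entries here

fancy_shares = np.shares_memory(fancy, adj_mat)
sliced_shares = np.shares_memory(sliced, adj_mat)

print(fancy_shares, sliced_shares)
print(fancy.sum(axis=1), fancy.sum(axis=0))
```

A slice only works when the wanted indices are evenly spaced, so it is not a general replacement for np.ix_.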
Since the point of efficiency was brought up in the question, a little further analysis should probably be done.
First and foremost, the code looks like code to find a matrix inverse using the adjoint matrix. Unless that particular method is important to the project, the standard np.linalg.inv() is almost certainly going to be faster than anything we cook up here. Moreover, in many applications you can get away with solving a system of linear equations rather than finding an inverse and multiplying by it, cutting run times in half or more again.
Second, any discussion of efficient numpy code needs to address views as opposed to copies. Memory allocation, writing to memory, and memory deallocation are all extremely expensive operations when compared with standard floating point arithmetic. That's not to say that they're slow, but you can notice an order of magnitude or two of difference in the speed of memory-efficient code versus nearly anything else. That's the entire premise behind the fastest implementation of persistent homology calculations I know of, among other things.
All of the other answers (at the time of writing) create a copy of the data they're working with, explicitly storing that information in a new variable sub_matrix. It isn't possible to express every fancy-indexed matrix as a view, but oftentimes equivalent operations can be performed.
For example, if this really is a set of computations on adjoint matrices so that your indexes variable consists of all but one of the available indices (in your example, all but the middle index), then instead of explicitly summing over all the intended indices, we can sum over all indices and subtract the one we don't care about. The effect is that all the intermediate matrices are views rather than copies, preventing the expensive memory allocations. On my machine, this is twice as fast for the tiny 3x3 example given and 10x as fast for 500x500 matrices.
bad_row = 1
bad_col = 1
result_rows = (np.sum(adj_mat, axis=1)-adj_mat[:,bad_col])[np.arange(adj_mat.shape[0])!=bad_row]
result_cols = (np.sum(adj_mat, axis=0)-adj_mat[bad_row,:])[np.arange(adj_mat.shape[1])!=bad_col]
Of course, it's even faster if you can use slices to represent whatever you're doing and you don't have to work around the problem with extra operations as I did, but the example you gave doesn't easily permit slices.
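The subtraction trick can be wrapped in a small function and cross-checked against the explicit sub-matrix. This is a sketch that assumes a square matrix with the same index dropped from rows and columns; the name sums_excluding is made up for illustration.

```python
import numpy as np

def sums_excluding(matrix, bad):
    """Row and column sums of matrix with row and column `bad` left out."""
    keep = np.arange(matrix.shape[0]) != bad
    # sum everything, subtract the unwanted column/row, then drop entry `bad`
    row_sums = (matrix.sum(axis=1) - matrix[:, bad])[keep]
    col_sums = (matrix.sum(axis=0) - matrix[bad, :])[keep]
    return row_sums, col_sums

adj_mat = np.arange(1, 10).reshape(3, 3)
rows, cols = sums_excluding(adj_mat, bad=1)

# Cross-check against the explicit sub-matrix
kept = [0, 2]
sub = adj_mat[np.ix_(kept, kept)]
print(rows, sub.sum(axis=1))
print(cols, sub.sum(axis=0))
```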

Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array

I have a (large) length-N array of k distinct functions, and a length-N array of abscissas. I want to evaluate the functions at the abscissas to return a length-N array of ordinates, and critically, I need to do it very fast.
I have tried the following loop over a call to np.where, which is too slow:
Create some fake data to illustrate the problem:
def trivial_functional(i): return lambda x : i*x
k = 250
func_table = [trivial_functional(j) for j in range(k)]
func_table = np.array(func_table) # possibly unnecessary
We have a table of 250 distinct functions. Now I create a large array with many repeated entries of those functions, and a set of points of the same length at which these functions should be evaluated.
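Npts = int(1e6)  # array sizes must be integers
abcissa_array = np.random.random(Npts)
# randint's upper bound is exclusive, unlike the deprecated random_integers
function_indices = np.random.randint(0, len(func_table), Npts)
func_array = func_table[function_indices]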
Finally, loop over every function used by the data and evaluate it on the set of relevant points:
desired_output = np.zeros(Npts)
for func_index in set(function_indices):
    idx = np.where(function_indices==func_index)[0]
    desired_output[idx] = func_table[func_index](abcissa_array[idx])
This loop takes ~0.35 seconds on my laptop, the biggest bottleneck in my code by an order of magnitude.
Does anyone see how to avoid the blind lookup call to np.where? Is there a clever use of numba that can speed this loop up?
This does almost the same thing as your (excellent!) self-answer, but with a bit less rigamarole. It seems marginally faster on my machine as well -- about 30ms based on a cursory test.
def apply_indexed_fast(array, func_indices, func_table):
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(array)
    for f, start, end in zip(func_table, func_ranges, func_ranges[1:]):
        ix = func_argsort[start:end]
        out[ix] = f(array[ix])
    return out
Like yours, this splits a sequence of argsort indices into chunks, each of which corresponds to a function in func_table. It then uses each chunk to select input and output indices for its corresponding function. To determine the chunk boundaries, it uses np.searchsorted instead of np.unique -- where searchsorted(a, b) could be thought of as a binary search algorithm that returns the index of the first value in a equal to or greater than the given value or values in b.
Then the zip function simply iterates over its arguments in parallel, returning a single item from each one, collected together in a tuple, and stringing those together into a list. (So zip([1, 2, 3], ['a', 'b', 'c'], ['b', 'c', 'd']) returns [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd')].) This, along with the for statement's built-in ability to "unpack" those tuples, allows for a terse but expressive way to iterate over multiple sequences in parallel.
In this case, I've used it to iterate over the functions in func_table alongside two out-of-sync copies of func_ranges. This ensures that the item from func_ranges in the end variable is always one step ahead of the item in the start variable. By appending None to func_ranges, I ensure that the final chunk is handled gracefully -- zip stops when any one of its arguments runs out of items, which cuts off the final value in the sequence. Conveniently, the None value also serves as an open-ended slice index!
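To make the chunking concrete, here is a tiny worked example of the searchsorted boundaries and the trailing None. The six-element index array is made up, and a stable sort is requested only so that the printed order is reproducible.

```python
import numpy as np

func_indices = np.array([2, 0, 2, 1, 0, 2])
order = func_indices.argsort(kind='stable')
sorted_indices = func_indices[order]          # 0, 0, 1, 2, 2, 2

# First position of each function id in the sorted array
starts = list(np.searchsorted(sorted_indices, range(3)))
starts.append(None)                           # open-ended slice for the last chunk

# Pair each start with the next start to slice out one chunk per function
chunks = [order[s:e].tolist() for s, e in zip(starts, starts[1:])]
print(chunks)   # [[1, 4], [3], [0, 2, 5]]
```

Each inner list holds the positions in the original array that belong to one function id.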
Another trick that does the same thing requires a few more lines, but has lower memory overhead, especially when used with the itertools equivalent of zip, izip:
range_iter_a = iter(func_ranges)   # create iterators that walk over the
range_iter_b = iter(func_ranges)   # values in `func_ranges` without making copies
next(range_iter_b, None)           # advance the second iterator by one
# itertools.izip exists only in Python 2; in Python 3 use the built-in zip
for f, start, end in itertools.izip(func_table, range_iter_a, range_iter_b):
    ...
However, these low-overhead generator-based approaches can sometimes be a bit slower than vanilla lists. Also, note that in Python 3, zip behaves more like izip.
Thanks to hpaulj for the suggestion to pursue a groupby approach. There are lots of canned routines out there for this operation, such as Pandas DataFrames, but they all come with the overhead cost of the data structure initialization, which is one-time-only, but can be costly if using for just a single calculation.
Here is my pure numpy solution that is a factor of 13 faster than the original where loop I was using. The upshot summary is that I use np.argsort and np.unique together with some fancy indexing gymnastics.
First we sort the function indices, and then find the elements of the sorted array where each new index begins
idx_funcsort = np.argsort(function_indices)
unique_funcs, unique_func_indices = np.unique(function_indices[idx_funcsort], return_index=True)
Now there is no longer a need for blind lookups, since we know exactly which slice of the sorted array corresponds to each unique function. So we still loop over each called function, but without calling where:
for func_index in range(len(unique_funcs)-1):
    idx_func = idx_funcsort[unique_func_indices[func_index]:unique_func_indices[func_index+1]]
    func = func_table[unique_funcs[func_index]]
    desired_output[idx_func] = func(abcissa_array[idx_func])
That covers all but the final index, which somewhat annoyingly we need to call individually due to Python indexing conventions:
func_index = len(unique_funcs)-1
idx_func = idx_funcsort[unique_func_indices[func_index]:]
func = func_table[unique_funcs[func_index]]
desired_output[idx_func] = func(abcissa_array[idx_func])
This gives identical results to the where loop (a bookkeeping sanity check), but the runtime of this loop is 0.027 seconds, a speedup of 13x over my original calculation.
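One possible way to avoid treating the final index separately is to append the total length to the array of chunk starts. The sketch below is a variation on the loop above, not the original code; the function name apply_grouped and the small test data are assumptions.

```python
import numpy as np

def apply_grouped(func_table, function_indices, abcissa_array):
    order = np.argsort(function_indices)
    used_funcs, starts = np.unique(function_indices[order], return_index=True)
    # Appending the total length closes the last chunk, so no special case is needed
    bounds = np.append(starts, len(order))
    out = np.zeros(len(abcissa_array))
    for func_id, lo, hi in zip(used_funcs, bounds[:-1], bounds[1:]):
        idx = order[lo:hi]
        out[idx] = func_table[func_id](abcissa_array[idx])
    return out

# Same kind of fake data as in the question, on a smaller scale
func_table = [lambda x, i=i: i * x for i in range(5)]
rng = np.random.default_rng(0)
x = rng.random(1000)
which = rng.integers(0, len(func_table), size=1000)
result = apply_grouped(func_table, which, x)
print(np.allclose(result, which * x))
```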
That is a beautiful example of functional programming being somewhat emulated in Python.
Now, if you want to apply your function to a set of points, I'd recommend numpy's ufunc framework, which will allow you to create blazingly fast vectorized versions of your functions.

Which is faster, numpy transpose or flip indices?

I have a dynamic programming algorithm (modified Needleman-Wunsch) which requires the same basic calculation twice, but the calculation is done in the orthogonal direction the second time. For instance, from a given cell (i,j) in matrix scoreMatrix, I want to both calculate a value from values "up" from (i,j), as well as a value from values to the "left" of (i,j). In order to reuse the code I have used a function in which in the first case I send in parameters i,j,scoreMatrix, and in the next case I send in j,i,scoreMatrix.transpose(). Here is a highly simplified version of that code:
def calculateGapCost(i,j,scoreMatrix,gapcost):
    return scoreMatrix[i-1,j] - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost)
gapUp = calculateGapCost(j,i,scoreMatrix.transpose(),gapcost)
...
I realized that I could alternatively send in a function that would in the one case pass through arguments (i,j) when retrieving a value from scoreMatrix, and in the other case reverse them to (j,i), rather than transposing the matrix each time.
def passThrough(i,j,matrix):
    return matrix[i,j]
def flipIndices(i,j,matrix):
    return matrix[j,i]
def calculateGapCost(i,j,scoreMatrix,gapcost,retrieveValue):
    return retrieveValue(i-1,j,scoreMatrix) - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost,passThrough)
gapUp = calculateGapCost(j,i,scoreMatrix,gapcost,flipIndices)
...
However if numpy transpose uses some features I'm unaware of to do the transpose in just a few operations, it may be that transpose is in fact faster than my pass-through function idea. Can anyone tell me which would be faster (or if there is a better method I haven't thought of)?
The actual method would call retrieveValue 3 times, and involves 2 matrices that would be referenced (and thus transposed if using that approach).
In NumPy, transpose returns a view with a different shape and strides. It does not touch the data.
Therefore, you will likely find that the two approaches have identical performance, since in essence they are exactly the same.
However, the only way to be sure is to benchmark both.
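A small sketch of both points: checking that the transpose is a view, and timing the two lookups with timeit. The matrix and the snake_case helper names are stand-ins for the question's code, and the timings will vary by machine.

```python
import timeit
import numpy as np

score = np.arange(12.0).reshape(3, 4)
transposed = score.transpose()

# The transpose reuses the same buffer with the strides reversed
shares = np.shares_memory(score, transposed)
strides_swapped = transposed.strides == score.strides[::-1]
print(shares, strides_swapped)

def pass_through(i, j, matrix):
    return matrix[i, j]

def flip_indices(i, j, matrix):
    return matrix[j, i]

# Both calls read the same element, score[1, 2]
t_transpose = timeit.timeit(lambda: pass_through(2, 1, score.transpose()), number=10000)
t_flip = timeit.timeit(lambda: flip_indices(2, 1, score), number=10000)
print(t_transpose, t_flip)
```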
