I have an N-dimensional numpy array S. Every iteration, exactly one value in this array will change.
I have a second array, G, that stores the gradient of S, as calculated by numpy's gradient() function. Currently, my code recalculates all of G every time I update S, but this is unnecessary: only one value in S has changed, so I should only have to recalculate 1 + 2*d values in G, where d is the number of dimensions of S.
This would be an easier problem to solve if I knew the dimensionality of the arrays, but the solutions I have come up with in the absence of this knowledge have been quite inefficient (not substantially better than just recalculating all of G).
Is there an efficient way to recalculate only the necessary values in G?
Edit: adding my attempt, as requested
The function returns a vector indicating the gradient of S at coords in each dimension. It calculates this without calculating the gradient of S at every point, but the problem is that it does not seem to be very efficient.
It looks similar in some ways to the answers already posted, but maybe there is something quite inefficient about it?
The idea is the following: I iterate through each dimension, creating a slice that is a vector only in that dimension. For each of these slices, I calculate the gradient and place the appropriate value from that gradient into the correct place in the returned vector grad.
The use of min() and max() is to deal with the boundary conditions.
def getSGradAt(self, coords):
    """Returns the gradient of S at the position specified by
    the vector argument 'coords'.
    self.nDim  : the number of dimensions of S
    self.nBins : the width of S (same in every dim)
    self.s     : S
    """
    grad = zeros(self.nDim)
    for d in xrange(self.nDim):
        # create a slice through S that has size > 1 only in the current
        # dimension, d
        slices = list(coords)
        slices[d] = slice(max(0, coords[d] - 1), min(self.nBins, coords[d] + 2))
        # take the middle value from the gradient of the 1-D slice
        grad[d] = gradient(self.s[tuple(slices)])[1]
    return grad
The problem is that this doesn't run very quickly. In fact, just taking the gradient of the whole array S seems to run faster (for nBins = 25 and nDim = 4).
Edited again, to add my final solution
Here is what I ended up using. This function updates S, changing the value at X by the amount change. It then updates G using a variation on the technique proposed by Jaime.
def changeSField(self, X, change):
    # change s
    self.s[X] += change
    # update g (gradient field)
    slices = tuple(slice(None if j-2 <= 0 else j-2, j+3, 1) for j in X)
    newGrads = gradient(self.s[slices])
    for i in arange(self.nDim):
        self.g[i][slices] = newGrads[i]
Your question is much too open for you to get a good answer: it is always a good idea to post your inefficient code, so that potential answerers can better help you. Anyway, let's say you know the coordinates of the point that has changed, and that you store them in a tuple named coords. First, let's construct a tuple of slices encompassing your point:
slices = tuple(slice(None if j-1 <= 0 else j-1, j+2, 1) for j in coords)
You may want to extend the limits to j-2 and j+3 so that the gradient is calculated using central differences whenever possible, but it will be slower.
You can now update your array with something like:
G[slices] = np.gradient(N[slices])
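To make the slice construction concrete, here is what that tuple looks like for a made-up 3-D coordinate (the coordinates are chosen only for illustration):

coords = (0, 4, 7)
slices = tuple(slice(None if j - 1 <= 0 else j - 1, j + 2, 1) for j in coords)
print(slices)
# (slice(None, 2, 1), slice(3, 6, 1), slice(6, 9, 1))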
I could work better with an example, but what about just creating a secondary array, S2 (by the way, I'd choose longer and more meaningful names for your variables), recalculating the gradient for it, G2, and then introducing it back into G?
Another question is: if you don't know the dimensionality of S, how are you changing the particular element that changes? Are you just recalculating the whole of S?
I suggest you clarify these things so that people can help you better.
Cheers!
I have two arrays that are related to each other via a mapping operation. I will call them S(fk,fq) and Z(fi,αj). The arguments are all sampling frequencies. The mapping rule is fairly straightforward:
fi = 0.5 · (fk - fq)
αj = fk + fq
S is the result of several FFTs and complex multiplications and is defined on a rectangular grid. However, Z is defined on a diamond-shaped grid and it is not clear to me how best to store this. The image below is an attempt at visualizing the operation for a simple example of a 4×4 array, but in general the dimensions are not equal and are much larger (maybe 64×16384, but this is user-selectable). Blue points are the resulting values of fi and αj and the text describes how these are related to fk, fq, and the discrete indices.
The diamond-shaped nature of Z means that in one "row" there will be "columns" that fall in between the "columns" of adjacent "rows". Another way to think of this is that fi can take on fractional index values!
Note that using zeros or NaNs to fill in elements that don't exist in any given row has two drawbacks: 1) it inflates the size of what may already be a very large 2-D array, and 2) it does not really represent the true nature of Z (e.g. the array size will not really be correct).
Currently I am using a dictionary indexed on the actual values of αj to store the results:
import numpy as np
from collections import defaultdict
nrows = 64
ncolumns = 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
# using random numbers here to simplify the example
# in practice S is the result of several FFTs and complex multiplications
S = np.random.random(size=(nrows,ncolumns)) + 1j*np.random.random(size=(nrows,ncolumns))
ret = defaultdict(lambda: {"fi":[],"Z":[]})
for k in range(-nrows//2, nrows//2):
    for q in range(-ncolumns//2, ncolumns//2):
        fi = 0.5*fk[k] - fq[q]
        alphaj = fk[k] + fq[q]
        Z = S[k, q]
        ret[alphaj]["fi"].append(fi)
        ret[alphaj]["Z"].append(Z)
I still find this a bit cumbersome to work with and wonder if anyone has suggestions for a better approach? "Better" here would be defined as more computationally and memory efficient and/or easier to interact with and visualize using something like matplotlib.
Note: This is related to another question about how to get rid of those nasty for-loops. Since this is about storing the results I thought it would be better to create two separate questions.
You could still view it as a straight two-dimensional array, but represent it as an array of rows, each row of which has a different number of items. For example, here's your 4x4 as a 2D array (each 0 here is a unique data item):
xxx0xxx
xx0x0xx
x0x0x0x
0x0x0x0
x0x0x0x
xx0x0xx
xxx0xxx
Its sparse representation would be:
[
[0],
[0,0],
[0,0,0],
[0,0,0,0],
[0,0,0],
[0,0],
[0]
]
With this representation you eliminate the empty space. There's a little math involved in converting from αj to a row index and from fi to a column index (and vice versa), but that's tractable. You know the bounds and that items are evenly spaced across each row, so it should be easy enough to do the translation.
Unless I'm missing something . . .
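As a rough sketch of that layout (sizes taken from the 4x4 example above; the names are mine, not from the question), the jagged rows could be held as a list of numpy arrays, one per αj "row":

import numpy as np

# 4x4 example: 7 diagonal "rows" holding 1, 2, 3, 4, 3, 2, 1 items
n = 4
row_lengths = list(range(1, n + 1)) + list(range(n - 1, 0, -1))
Z_rows = [np.empty(m, dtype=complex) for m in row_lengths]
print([len(r) for r in Z_rows])   # [1, 2, 3, 4, 3, 2, 1]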
It turns out that the answer to a related question on optimization effectively solved my problem of how to better store the data. The new code returns 2-D arrays for fi and αj, and these can be used to directly index S. So to get all values of S for αj = 0, for example, one can do
S[alphaj == 0]
I can use this pretty effectively and it seems like the quickest way to create a reasonable data structure.
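For completeness, here is a minimal sketch of that approach (my reconstruction, not the exact code from the linked answer): build fi and αj as 2-D arrays by broadcasting, then use boolean masks on them to index S directly.

import numpy as np

nrows, ncolumns = 64, 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)

# 2-D coordinate arrays, same mapping as the double loop above
fi = 0.5 * fk[:, None] - fq[None, :]     # shape (nrows, ncolumns)
alphaj = fk[:, None] + fq[None, :]       # shape (nrows, ncolumns)

S = np.random.random((nrows, ncolumns)) + 1j * np.random.random((nrows, ncolumns))
z_row = S[alphaj == 0]                   # all S values where alpha_j == 0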
I need to find the closest possible sentence.
I have an array of sentences and a user sentence, and I need to find the element of the array that is closest to the user's sentence.
I represented each sentence as a vector using word2vec:
def get_avg_vector(word_list, model_w2v, size=500):
    sum_vec = np.zeros(shape=(1, size))
    count = 0
    for w in word_list:
        if w in model_w2v and w != '':
            sum_vec += model_w2v[w]
            count += 1
    if count == 0:
        return sum_vec
    else:
        return sum_vec / count + 1
As a result, the array element looks like this:
array([[ 0.93162371, 0.95618944, 0.98519795, 0.98580566, 0.96563747,
0.97070891, 0.99079191, 1.01572807, 1.00631016, 1.07349398,
1.02079309, 1.0064849 , 0.99179418, 1.02865136, 1.02610303,
1.02909719, 0.99350413, 0.97481178, 0.97980362, 0.98068508,
1.05657591, 0.97224562, 0.99778703, 0.97888296, 1.01650529,
1.0421448 , 0.98731804, 0.98349052, 0.93752996, 0.98205837,
1.05691232, 0.99914532, 1.02040555, 0.99427229, 1.01193818,
0.94922226, 0.9818139 , 1.03955 , 1.01252615, 1.01402485,
...
0.98990598, 0.99576604, 1.0903802 , 1.02493086, 0.97395976,
0.95563786, 1.00538653, 1.0036294 , 0.97220088, 1.04822631,
1.02806122, 0.95402776, 1.0048053 , 0.97677222, 0.97830801]])
I also represent the user's sentence as a vector, and I compute the closest element to it like this:
%%cython
from scipy.spatial.distance import euclidean

def compute_dist(v, list_sentences):
    dist_dict = {}
    for key, val in list_sentences.items():
        dist_dict[key] = euclidean(v, val)
    return sorted(dist_dict.items(), key=lambda x: x[1])[0][0]
list_sentences in the method above is a dictionary in which the keys are text representations of sentences and the values are vectors.
It takes a very long time, because I have more than 60 million sentences.
How can I speed up and optimize this process?
I'll be grateful for any advice.
The initial calculation of the 60 million sentences' vectors is essentially a fixed cost you'll pay once. I'm assuming you mainly care about the time for each subsequent lookup, for a single user-supplied query sentence.
Using numpy native array operations can speed up the distance calculations over doing your own individual calculations in a Python loop. (It's able to do things in bulk using its optimized code.)
But first you'd want to replace list_sentences with a true numpy array, accessed only by array-index. (If you have other keys/texts you need to associate with each slot, you'd do that elsewhere, with some dict or list.)
Let's assume you've done that, in whatever way is natural for your data, and now have array_sentences, a 60-million by 500-dimension numpy array, with one sentence average vector per row.
Then a 1-liner way to get an array full of the distances is to take the vector length ("norm") of the difference between each of the 60 million candidates and the 1 query, which gives a 60-million-entry array of distances:
dists = np.linalg.norm(array_sentences - v, axis=1)
Another 1-liner is to use the scipy utility function cdist() (from scipy.spatial.distance) for computing the distance between each pair of two collections of inputs. Here, your first collection is just the one query vector v (but if you had batches to do at once, supplying more than one query at a time could offer an additional slight speedup):
from scipy.spatial.distance import cdist
dists = cdist(np.atleast_2d(v), array_sentences)[0]
(Note that such vector comparisons often use cosine-distance/cosine-similarity rather than euclidean-distance. If you switch to that, you might be doing other norming/dot-products instead of the first option above, or use the metric='cosine' option to cdist().)
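As an illustration of the dot-product route (a sketch, assuming v has been flattened to a 1-D vector of length 500 and array_sentences is the 60-million-row array described above):

# cosine similarity of the query against every row, then pick the best match
v1 = v.ravel()
sims = array_sentences.dot(v1) / (np.linalg.norm(array_sentences, axis=1) * np.linalg.norm(v1))
best_index = np.argmax(sims)   # highest cosine similarity = smallest cosine distance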
Once you have all the distances in a numpy array, using a numpy-native sort option is likely to be faster than Python's sorted(). For example, numpy's indirect sort argsort() just returns the sorted indexes (and thus avoids moving the vector coordinates around), which is all you need to find the best match(es):
sorted_indexes = np.argsort(dists)
best_index = sorted_indexes[0]
If you need to turn that int index back into your other key/text, you'd use your own dict/list that remembered the slot-to-key relationships.
All these still give an exactly right result, by comparing against all candidates, which (even when done optimally well) is still time-consuming.
There are ways to get faster results, based on pre-building indexes to the full set of candidates – but such indexes become very tricky in high-dimensional spaces (like your 500-dimensional space). They often trade off perfectly accurate results for faster results. (That is, what they return for 'closest 1' or 'closest N' will have some errors, but usually not be off by much.) For examples of such libraries, see Spotify's ANNOY or Facebook's FAISS.
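For example, a rough sketch with ANNOY (assuming the annoy package is installed and array_sentences / v are defined as above; the tree count is purely illustrative):

from annoy import AnnoyIndex

index = AnnoyIndex(500, 'euclidean')         # 500-dimensional vectors
for i, vec in enumerate(array_sentences):    # one-time, fairly slow build step
    index.add_item(i, vec)
index.build(10)                              # more trees -> better accuracy, bigger index

best_index = index.get_nns_by_vector(v.ravel(), 1)[0]   # approximate nearest neighbour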
At least if you are doing this procedure for multiple sentences, you could try using scipy.spatial.cKDTree (I don't know whether it pays for itself on a single query; also, 500 dimensions is quite high, and I seem to remember k-d trees work better with fewer dimensions, so you'll have to experiment).
Assuming you've put all your vectors (dict values) into one large numpy array:
>>> import numpy as np
>>> from scipy.spatial import cKDTree as KDTree
>>>
# 100,000 vectors (that's all my RAM can take)
>>> a = np.random.random((100000, 500))
>>>
>>> t = KDTree(a)
# create one new vector and find distance and index of closest
>>> t.query(np.random.random(500))
(8.20910072933986, 83407)
I can think about 2 possible ways of optimizing this process.
First, if your goal is only to get the closest vector (or sentence), you could get rid of the dist_dict dictionary and only keep in memory the closest sentence you have found so far. This way, you won't need to sort the complete (and presumably very large) list at the end, and can simply return the closest one.
def compute_dist(v, list_sentences):
    min_dist = float('inf')
    closest_sentence = None
    for key, val in list_sentences.items():
        dist = euclidean(v, val)
        if dist < min_dist:
            closest_sentence = key
            min_dist = dist
    return closest_sentence
The second one is maybe a little more unsound. You could try to re-implement the euclidean method, giving it a third argument: the current minimum distance min_dist between the user vector and the closest vector found so far. I don't know how the scipy euclidean method is implemented, but I guess it is close to summing squared differences along all the vector's dimensions. What you want is for the method to stop as soon as that running sum exceeds min_dist squared (the final distance will be higher than min_dist anyway, so you won't keep it).
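A sketch of that idea (my illustration of the concept, not how scipy actually implements euclidean()):

def euclidean_with_cutoff(v, val, min_dist):
    """Return the euclidean distance, or None as soon as it is known to exceed min_dist."""
    limit = min_dist ** 2
    acc = 0.0
    for x, y in zip(v, val):
        acc += (x - y) ** 2
        if acc > limit:          # already worse than the best candidate so far
            return None
    return acc ** 0.5

Note that a pure-Python loop like this may well be slower than scipy's vectorized call despite the early exit, so it is worth benchmarking before committing to it.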
For an assignment I have to use different combinations of features belonging to some data, to evaluate a classification system. By features I mean measurements, e.g. height, weight, age, income. So for instance I want to see how well a classifier performs when given just the height and weight to work with, and then the height and age say. I not only want to be able to test what two features work best together, but also what 3 features work best together and would like to be able to generalise this to n features.
I've been attempting this using numpy's mgrid to create n-dimensional arrays, flattening them, and then making arrays that use the same elements from each array to create new ones. Tricky to explain, so here is some code and pseudocode:
import numpy as np

def test_feature_combos(data, combinations):
    dimensions = combinations.shape[0]
    grid = np.empty(dimensions)
    for i in xrange(dimensions):
        grid[i] = combinations[i].flatten()
    # The above throws a "setting an array element with a sequence" error,
    # which I understand, but it shows my approach.

# **Pseudo code begin**
# For each element of each element of this new array, create a new array like so:
#   [[1,1,2,2],[1,2,1,2]] ---> [[1,1],[1,2],[2,1],[2,2]]
# Call this new array combo_indices.
# Then choose the columns (features) from the data in a loop using:
#   new_data = data[:, combo_indices[j]]

combinations = np.mgrid[1:5, 1:5]
test_feature_combos(data, combinations)
I concede that this approach means a lot of unnecessary combinations due to repeats; however, I cannot even implement this, so beggars can't be choosers.
Please can someone advise me on how I can either a) implement my approach or b) achieve this goal in a much more elegant way.
Thanks in advance, and let me know if any clarification needs to be made, this was tough to explain.
To generate all combinations of k elements drawn without replacement from a set of size n you can use itertools.combinations, e.g.:
idx = np.vstack(list(itertools.combinations(range(n), k)))  # a (C(n, k), k) array of indices
For the special case where k=2 it's often faster to use the indices of the upper triangle of an n x n matrix, e.g.:
idx = np.vstack(np.triu_indices(n, 1)).T
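A hedged usage sketch tying this back to the original goal (data, labels, and evaluate_classifier are placeholders for the asker's own objects, not anything defined above):

import itertools
import numpy as np

def iter_feature_subsets(data, k):
    """Yield (index tuple, column subset) for every k-feature combination."""
    n_features = data.shape[1]
    for combo in itertools.combinations(range(n_features), k):
        yield combo, data[:, list(combo)]

# e.g. evaluate every pair of features
# for combo, subset in iter_feature_subsets(data, 2):
#     score = evaluate_classifier(subset, labels)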
I have a dynamic programming algorithm (modified Needleman-Wunsch) which requires the same basic calculation twice, but the calculation is done in the orthogonal direction the second time. For instance, from a given cell (i,j) in matrix scoreMatrix, I want to both calculate a value from values "up" from (i,j), as well as a value from values to the "left" of (i,j). In order to reuse the code I have used a function in which in the first case I send in parameters i,j,scoreMatrix, and in the next case I send in j,i,scoreMatrix.transpose(). Here is a highly simplified version of that code:
def calculateGapCost(i, j, scoreMatrix, gapcost):
    return scoreMatrix[i-1, j] - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost)
gapUp = calculateGapCost(j,i,scoreMatrix.transpose(),gapcost)
...
I realized that I could alternatively send in a function that would in the one case pass through arguments (i,j) when retrieving a value from scoreMatrix, and in the other case reverse them to (j,i), rather than transposing the matrix each time.
def passThrough(i, j, matrix):
    return matrix[i, j]

def flipIndices(i, j, matrix):
    return matrix[j, i]

def calculateGapCost(i, j, scoreMatrix, gapcost, retrieveValue):
    return retrieveValue(i-1, j, scoreMatrix) - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost,passThrough)
gapUp = calculateGapCost(j,i,scoreMatrix,gapcost,flipIndices)
...
However, if numpy's transpose uses some feature I'm unaware of to do the transpose in just a few operations, it may be that transposing is in fact faster than my pass-through-function idea. Can anyone tell me which would be faster (or if there is a better method I haven't thought of)?
The actual method would call retrieveValue 3 times, and involves 2 matrices that would be referenced (and thus transposed if using that approach).
In NumPy, transpose returns a view with a different shape and strides. It does not touch the data.
Therefore, you will likely find that the two approaches have identical performance, since in essence they are exactly the same.
However, the only way to be sure is to benchmark both.
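A quick way to convince yourself that transpose is only a view (a small standalone check, not taken from the answer above):

import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = a.T
print(b.base is a)             # True: the transpose is a view onto a's buffer
print(a.strides, b.strides)    # (24, 8) vs (8, 24): only the strides differ
b[0, 0] = 99.0
print(a[0, 0])                 # 99.0: writing through the view changes a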
I have two arrays A,B and want to take the outer product on their last dimension,
e.g.
result[:,i,j]=A[:,i]*B[:,j]
when A,B are 2-dimensional.
How can I do this if I don't know whether they will be 2 or 3 dimensional?
In my specific problem, A and B are slices out of a bigger 3-dimensional array Z. Sometimes this may be called with integer indices (A=Z[:,1,:], B=Z[:,2,:]) and other times with slices (A=Z[:,1:3,:], B=Z[:,4:6,:]). Since integer indexing "squeezes" out singleton dimensions, I won't know what dimensions my inputs will be.
The array-outer-product I'm trying to define should satisfy
array_outer_product( Y[a,b,:], Z[i,j,:] ) == scipy.outer( Y[a,b,:], Z[i,j,:] )
array_outer_product( Y[a:a+N,b,:], Z[i:i+N,j,:])[n,:,:] == scipy.outer( Y[a+n,b,:], Z[i+n,j,:] )
array_outer_product( Y[a:a+N,b:b+M,:], Z[i:i+N, j:j+M,:] )[n,m,:,:]==scipy.outer( Y[a+n,b+m,:] , Z[i+n,j+m,:] )
for any rank-3 arrays Y,Z and integers a,b,...i,j,k...n,N,...
The kind of problem I'm dealing with involves a 2-D spatial grid, with a vector-valued function at each grid point. I want to be able to calculate the covariance matrix (outer product) of these vectors, over regions defined by slices in the first two axes.
You may have some luck with einsum:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
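For instance, a sketch of how einsum can express this batched outer product over the last axis (the shapes here are arbitrary; it works the same whether A and B are 2- or 3-dimensional, as long as their leading dimensions match):

import numpy as np

A = np.random.random((7, 4, 3))
B = np.random.random((7, 4, 5))
result = np.einsum('...i,...j->...ij', A, B)   # shape (7, 4, 3, 5)

# spot-check one leading index against the plain outer product
print(np.allclose(result[2, 1], np.outer(A[2, 1], B[2, 1])))   # True

An equivalent broadcasting form is result = A[..., :, None] * B[..., None, :].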
After discovering the use of ellipsis indexing in numpy/scipy arrays, I ended up implementing it as a recursive function:
def array_outer_product(A, B, result=None):
    ''' Compute the outer-product in the final two dimensions of the given arrays.
        If the result array is provided, the results are written into it.
    '''
    assert A.shape[:-1] == B.shape[:-1]
    if result is None:
        result = scipy.zeros(A.shape + B.shape[-1:], dtype=A.dtype)
    if A.ndim == 1:
        result[:, :] = scipy.outer(A, B)
    else:
        for idx in xrange(A.shape[0]):
            array_outer_product(A[idx, ...], B[idx, ...], result[idx, ...])
    return result
Assuming I've understood you correctly, I encountered a similar issue in my research a couple weeks ago. I realized that the Kronecker product is simply an outer product which preserves dimensionality. Thus, you could do something like this:
import numpy as np
# Generate some data
a = np.random.random((3,2,4))
b = np.random.random((2,5))
# Now compute the Kronecker product
c = np.kron(a,b)
# Check the shape
np.prod(c.shape) == np.prod(a.shape)*np.prod(b.shape)
I'm not sure what shape you want at the end, but you could use array slicing in combination with np.rollaxis, np.reshape, np.ravel (etc.) to shuffle things around as you wish. I guess the downside of this is that it does some extra calculations. This may or may not matter, depending on your limitations.