I have a numpy vector and a numpy 2D array (a matrix).
From every row of the matrix I need to take the first N (let's say 3) values that are smaller than (or equal to) the corresponding entry in the vector.
so if this is my vector:
7,
9,
22,
38,
6,
15
and this is my matrix:
[[ 20., 9., 7., 5., None, None],
 [ 33., 21., 18., 9., 8., 7.],
 [ 31., 21., 13., 12., 4., 0.],
 [ 36., 18., 11., 7., 7., 2.],
 [ 20., 14., 10., 6., 6., 3.],
 [ 14., 14., 13., 11., 5., 5.]]
the output should be:
[[7, 5, None],
 [9, 8, 7],
 [21, 13, 12],
 [36, 18, 11],
 [6, 6, 3],
 [14, 14, 13]]
Is there any efficient way to do that with masks or something, without an ugly for loop?
Any help will be appreciated!
Approach #1
Here's one with broadcasting -
import numpy as np

def takeN_le_per_row_broadcasting(a, b, N=3):  # a, b : 1D, 2D arrays respectively
    # First column index in each row of b that is <= the corresponding entry in a
    idx = (b <= a[:,None]).argmax(1)
    # All N ranged column indices starting from those first columns
    all_idx = idx[:,None] + np.arange(N)
    # Finally advanced-index with those indices into b for the desired output
    return b[np.arange(len(all_idx))[:,None], all_idx]
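A quick check against the sample data from the question (a sketch: the None entries are replaced with np.nan so the matrix stays a float array; NaN compares False against everything, so those cells never get picked as the starting column):

a = np.array([7., 9., 22., 38., 6., 15.])
b = np.array([[20., 9., 7., 5., np.nan, np.nan],
              [33., 21., 18., 9., 8., 7.],
              [31., 21., 13., 12., 4., 0.],
              [36., 18., 11., 7., 7., 2.],
              [20., 14., 10., 6., 6., 3.],
              [14., 14., 13., 11., 5., 5.]])

print(takeN_le_per_row_broadcasting(a, b))
# [[  7.   5.  nan]
#  [  9.   8.   7.]
#  [ 21.  13.  12.]
#  [ 36.  18.  11.]
#  [  6.   6.   3.]
#  [ 14.  14.  13.]]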
Approach #2
Inspired by the solution to NumPy Fancy Indexing - Crop different ROIs from different channels, we can leverage np.lib.stride_tricks.as_strided for efficient patch extraction (here through scikit-image's view_as_windows, which is built on it), like so -
import numpy as np
from skimage.util.shape import view_as_windows

def takeN_le_per_row_strides(a, b, N=3):  # a, b : 1D, 2D arrays respectively
    # First column index in each row of b that is <= the corresponding entry in a
    idx = (b <= a[:,None]).argmax(1)
    # 1D sliding windows of length N along each row of the data
    w = view_as_windows(b, (1,N))[:,:,0]
    # Use fancy/advanced indexing to select the window starting at idx in each row
    return w[np.arange(len(idx)), idx]
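On the same sample data this should agree with the broadcasting version; a quick sanity check (equal_nan needs NumPy 1.19+, since the NaN cells would otherwise compare unequal):

out1 = takeN_le_per_row_broadcasting(a, b)
out2 = takeN_le_per_row_strides(a, b)
print(np.array_equal(out1, out2, equal_nan=True))
# True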
Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.
Basically, I need to be able to take arrays like this:
array([[ 0., 2.],
       [ 2., 2.],
       [ 3., 1.]])
array([[ 0., 3.],
       [ 1., 1.],
       [ 2., 5.]])
And then "add" them so they look like this:
array([[ 0., 5.],
       [ 1., 1.],
       [ 2., 7.],
       [ 3., 1.]])
If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.
Thanks!
Edit: Ready for some speedtests?
Intel win 64bit machine. All of the following values are in seconds; 20000 loops.
collections.Counter results:
2.131000, 2.125000, 2.125000
Divakar's union1d + masking results:
1.641000, 1.633000, 1.625000
Divakar's union1d + indexing results:
0.625000, 0.625000, 0.641000
Histogram results:
1.844000, 1.938000, 1.858000
Pandas results:
16.659000, 16.686000, 16.885000
Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!
Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).
Later Edit: I've been thinking about this a lot, and I've come to realize that, with NumPy, it might be more effective to zero-fill each array so that the first column isn't even needed, since we can just use the index; that would also make it much easier to add multiple arrays together, as well as to apply other functions. Additionally, Pandas makes more sense than NumPy since there would be no need to zero-fill, and it would definitely be more effective with large data sets (however, NumPy has the advantage of being compatible with more platforms, like GAE, if that matters at all). Lastly, the answer I accepted was definitely the best answer for the exact question I asked--adding the two arrays in the way I showed--but I think what I needed was a change in perspective.
Here's one approach with np.union1d and masking -
def app1(a, b):
    # Union of the ID columns gives the output IDs in sorted order
    c0 = np.union1d(a[:,0], b[:,0])
    out = np.zeros((len(c0), 2))
    out[:,0] = c0
    # Scatter a's values into the rows whose IDs appear in a
    mask1 = np.in1d(c0, a[:,0])
    out[mask1,1] = a[:,1]
    # Accumulate b's values likewise
    mask2 = np.in1d(c0, b[:,0])
    out[mask2,1] += b[:,1]
    return out
Sample run -
In [174]: a
Out[174]:
array([[  0.,   2.],
       [ 12.,   2.],
       [ 23.,   1.]])
In [175]: b
Out[175]:
array([[  0.,   3.],
       [  1.,   1.],
       [ 12.,   5.]])
In [176]: app1(a,b)
Out[176]:
array([[  0.,   5.],
       [  1.,   1.],
       [ 12.,   7.],
       [ 23.,   1.]])
Here's another with np.union1d and indexing -
def app2(a, b):
    # n is large enough to index every ID directly
    n = np.maximum(a[:,0].max(), b[:,0].max()) + 1
    c0 = np.union1d(a[:,0], b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int), 1] = a[:,1]
    out0[b[:,0].astype(int), 1] += b[:,1]
    # Keep only the IDs that actually occur
    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out
For the case where all indices are covered by the first column values in a and b -
def app2_specific(a, b):
    c0 = np.union1d(a[:,0], b[:,0])
    n = c0[-1] + 1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int), 1] = a[:,1]
    out0[b[:,0].astype(int), 1] += b[:,1]
    out0[:,0] = c0
    return out0
Sample run -
In [234]: a
Out[234]:
array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])
In [235]: b
Out[235]:
array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])
In [236]: app2_specific(a,b)
Out[236]:
array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])
If you know the number of fields, use np.bincount.
c = np.vstack([a, b])
# bincount needs integer bins, so cast the ID column
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.
out = np.zeros((numFields, 2))
out[:, 0] = np.arange(numFields)
# np.add.at accumulates repeated indices, unlike plain fancy-index assignment
np.add.at(out[:, 1], a[:, 0].astype(int), a[:, 1])
np.add.at(out[:, 1], b[:, 0].astype(int), b[:, 1])
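For concreteness, a quick run of the bincount version on the sample arrays from the question, with an assumed numFields = 4 (the IDs run from 0 to 3):

import numpy as np

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])
numFields = 4  # assumed: IDs cover 0..3

c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
print(out)
# [[ 0.  5.]
#  [ 1.  1.]
#  [ 2.  7.]
#  [ 3.  1.]]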
You can use a basic histogram; this will deal with gaps, too. You can filter out zero-count entries if need be.
import numpy as np

x = np.array([[ 0., 2.],
              [ 2., 2.],
              [ 3., 1.]])
y = np.array([[ 0., 3.],
              [ 1., 1.],
              [ 2., 5.],
              [ 5., 3.]])

c, w = np.vstack((x, y)).T
h, b = np.histogram(c, weights=w,
                    bins=np.arange(c.min(), c.max() + 2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0. 5.]
# [ 1. 1.]
# [ 2. 7.]
# [ 3. 1.]
# [ 4. 0.]
# [ 5. 3.]]
r_nonzero = r[r[:,1]!=0]
Pandas has functions that do exactly what you intend:
import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
Edit: If you actually need the data back in numpy format, just do
array_res = result.reset_index(name=1).values
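A quick sanity check on the sample arrays from the question (a sketch; output shown approximately):

import numpy as np
import pandas as pd

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])

pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
print(result.reset_index(name=1).values)
# [[ 0.  5.]
#  [ 1.  1.]
#  [ 2.  7.]
#  [ 3.  1.]]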
This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:
import numpy as np
import numpy_indexed as npi

C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])
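On the sample arrays from the question this should give the labels in sorted order along with their sums:

print(labels)  # [ 0.  1.  2.  3.]
print(sums)    # [ 5.  1.  7.  1.]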
Note: it's cleaner to maintain your label arrays as a separate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not conveying the full binary state. Better to use ints for that.
I am working with a large array of data, but every so often I wind up with a nan instead of a value. I need to remove these somehow. Here is an example of my dataset
1 2
3 4
nan 5
6 7
8 nan
9 10
and I would like to remove the bad data to end up with:
1 2
3 4
6 7
9 10
If you're just using numpy, use logical indexing:
import numpy as np
x = np.array([[ 1., 2.],
              [ 3., 4.],
              [ np.nan, 5.],
              [ 6., 7.],
              [ 8., np.nan],
              [ 9., 10.]])
# find which rows contain nans
ix = np.any(np.isnan(x), axis=1)
# remove them
x = x[~ix]
Which gives:
array([[  1.,   2.],
       [  3.,   4.],
       [  6.,   7.],
       [  9.,  10.]])
This will work for arrays of any number of columns: if a row contains a NaN in at least one column, it is removed.
Alternatively, if you're using pandas, simply use dropna:
import pandas as pd
df = pd.DataFrame(x)
df = df.dropna()  # dropna returns a new DataFrame rather than modifying in place
You can do:
# NaN compares unequal to itself, so any row containing a NaN fails the all() test
my_numpy_arr = my_numpy_arr[(my_numpy_arr == my_numpy_arr).all(1)]
I have two arrays A and B:
A = array([[ 5., 5., 5.],
           [ 8., 9., 9.]])
B = array([[ 1., 1., 2.],
           [ 3., 2., 1.]])
Anywhere there is a "1" in B I want to sum the same row and column locations in A.
So for example for this one the answer would be 5+5+9=19
I would want this to continue for 2,3....n (all unique values in B)
So for the 2's... it would be 9+5=14 and for the 3's it would be 8
I found the unique values by using:
numpy.unique(B)
I realize this may take multiple steps, but I can't really wrap my head around using the index matrix to sum those locations in another matrix.
For each unique value x, you can do
A[B == x].sum()
Example:
>>> A[B == 1.0].sum()
19.0
I think numpy.bincount is what you want. If B is an array of small integers like in your example, you can do something like this:
import numpy

A = numpy.array([[ 5., 5., 5.],
                 [ 8., 9., 9.]])
B = numpy.array([[ 1, 1, 2],
                 [ 3, 2, 1]])

print(numpy.bincount(B.ravel(), weights=A.ravel()))
# [  0.  19.  14.   8.]
or if B contains anything but small integers, you can do something like this:
import numpy

A = numpy.array([[ 5., 5., 5.],
                 [ 8., 9., 9.]])
B = numpy.array([[ 1., 1., 2.],
                 [ 3., 2., 1.]])

uniqB, inverse = numpy.unique(B, return_inverse=True)
print(uniqB, numpy.bincount(inverse, weights=A.ravel()))
# [ 1.  2.  3.] [ 19.  14.   8.]
[(val, np.sum(A[B==val])) for val in np.unique(B)] gives you a list of tuples where the first element is one of the unique values in B, and the second element is the sum of elements in A where the corresponding value in B is that value.
>>> [(val, np.sum(A[B==val])) for val in np.unique(B)]
[(1.0, 19.0), (2.0, 14.0), (3.0, 8.0)]
The key is that you can use A[B==val] to access items in A at positions where B equals val.
Edit: If you just want the sums, just do [np.sum(A[B==val]) for val in np.unique(B)].
I'd use numpy masked arrays. These are standard numpy arrays with an associated mask that blocks off certain values. The process is pretty straightforward: create a masked array using
numpy.ma.masked_array(data, mask)
where mask is generated by using a masking function
mask = numpy.ma.masked_not_equal(B, 1).mask
and data is A
for i in numpy.unique(B):
    print(numpy.ma.masked_array(A, numpy.ma.masked_not_equal(B, i).mask).sum())
19.0
14.0
8.0
I found an old question here; one of the answers:
def sum_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    values.cumsum(out=values)
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    values[1:] = values[1:] - values[:-1]
    return values, groups
In your case, you can flatten your arrays:
aflat = A.flatten()
bflat = B.flatten()
sum_by_group(aflat, bflat)
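For the sample A and B above (this assumes numpy is imported as np), that call should give the per-label sums:

values, groups = sum_by_group(A.flatten(), B.flatten())
print(groups)  # [ 1.  2.  3.]
print(values)  # [ 19.  14.   8.]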
I want to center multi-dimensional data in an n x m matrix (<class 'numpy.matrixlib.defmatrix.matrix'>), let's say X. I defined a new array, ones(645), let's say centVector, to hold the mean of every row in matrix X. Now I want to iterate over every row in X, compute its mean, and assign that value to the corresponding index in centVector. Isn't this possible in a single line in scipy/numpy? I am not used to this language and was thinking of something like:
centVector = ones(645)
for key, val in X:
    centVector[key] = centVector[key] * (val.sum/val.size)
Afterwards, I just need to subtract the mean from every row:
X = X - centVector
How can I simplify this?
EDIT: And besides, the above code is not actually working - for a key-value loop I need something like enumerate(X). And I am not sure if X - centVector is returning the proper solution.
First, some example data:
>>> import numpy as np
>>> X = np.matrix(np.arange(25).reshape((5,5)))
>>> print X
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
numpy conveniently has a mean function. By default however, it'll give you the mean over all the values in the array. Since you want the mean of each row, you need to specify the axis of the operation:
>>> np.mean(X, axis=1)
matrix([[  2.],
        [  7.],
        [ 12.],
        [ 17.],
        [ 22.]])
Note that axis=1 says: find the mean along the columns (for each row), where 0 = rows and 1 = columns (and so on). Now, you can subtract this mean from your X, as you did originally.
Unsolicited advice
Usually, it's best to avoid the matrix class (see docs). If you remove the np.matrix call from the example data, then you get a normal numpy array.
Unfortunately, in this particular case, using an array slightly complicates things because np.mean will return a 1D array:
>>> X = np.arange(25).reshape((5,5))
>>> r_means = np.mean(X, axis=1)
>>> print r_means
[ 2. 7. 12. 17. 22.]
If you try to subtract this from X, r_means gets broadcast to a row vector, instead of a column vector:
>>> X - r_means
array([[ -2.,  -6., -10., -14., -18.],
       [  3.,  -1.,  -5.,  -9., -13.],
       [  8.,   4.,   0.,  -4.,  -8.],
       [ 13.,   9.,   5.,   1.,  -3.],
       [ 18.,  14.,  10.,   6.,   2.]])
So, you'll have to reshape the 1D array into an N x 1 column vector:
>>> X - r_means.reshape((-1, 1))
array([[-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.]])
The -1 passed to reshape tells numpy to figure out this dimension based on the original array shape and the rest of the dimensions of the new array. Alternatively, you could have reshaped the array using r_means[:, np.newaxis].
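As a further shorthand (an alternative, not something the reshape above requires): np.mean also accepts a keepdims argument that preserves the reduced axis as length 1, so the result is already a column vector and no reshape is needed:

>>> X - X.mean(axis=1, keepdims=True)
array([[-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.]])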