numpy.where equivalent for csr_matrix in Python

I am trying to use numpy.where with a csr_matrix, which does not work. Is there a built-in function equivalent to numpy.where for sparse matrices? Here is an example of what I would like to do, without using a for loop or .todense():
import scipy.sparse as spa
import numpy as np
N = 100
A = np.zeros((N,N))
di = np.diag_indices((len(A[:,0])))
A[di] = 2.3
'''
adding some values to non-diagonal terms
for sake of example
'''
for k in range(0,len(A)-1):
    for j in range(-1,3,2):
        A[k,k+j] = 4.0
A[2,3] = 0.1
A[3,3] = 0.1
A[0,4] = 0.2
A[0,2] = 3
'''
creating sparse matrix
'''
A = spa.csc_matrix(A)
B = A.copy()
'''
Here I get
TypeError: unsupported operand type(s) for &: 'csc_matrix' and 'csc_matrix'
'''
ind1 = np.where((A>0.0) & (A<=1.0))
B[ind1] = (3.0-B[ind1])**5-6.0*(2.0-B[ind1])**5

How about working with the underlying arrays of A and B, in particular their data arrays?
In [36]: ind2=np.where((A.data>0.0)&(A.data<=1.0))
In [37]: A.indices[ind2]
Out[37]: array([2, 3, 0])
In [38]: A.indptr[ind2]
Out[38]: array([28, 31, 37])
In [39]: A.data[ind2]
Out[39]: array([ 0.1, 0.1, 0.2])
In [41]: B.data[ind2]=(3.0-B.data[ind2])**5-6.0*(2.0-B.data[ind2])**5
In [42]: B.data[ind2]
Out[42]: array([ 56.54555, 56.54555, 58.7296 ])
To see what ind2 corresponds to in the dense version, convert the matrix to coo format:
In [53]: Ac=A.tocoo()
In [54]: (Ac.row[ind2], Ac.col[ind2])
Out[54]: (array([2, 3, 0]), array([3, 3, 4]))
where, for reference, the where on the dense array is:
In [57]: np.where((A.A>0.0) & (A.A<=1.0))
Out[57]: (array([0, 2, 3]), array([4, 3, 3]))
One important caution - working with A.data means you exclude all of the zero entries of the dense array.
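The same data-array trick can be wrapped in a small helper so it reads more like np.where. A minimal sketch (where_data is an illustrative name, not a scipy function):
import numpy as np
import scipy.sparse as spa

def where_data(A, cond, func):
    # Apply func to the stored (nonzero) entries of A that satisfy cond.
    # Only A.data is touched, so the implicit zeros are never considered.
    B = A.copy()
    mask = cond(B.data)
    B.data[mask] = func(B.data[mask])
    return B

A = spa.csc_matrix(np.array([[0.0, 0.1, 2.3],
                             [0.2, 0.0, 4.0],
                             [0.0, 3.0, 0.0]]))
B = where_data(A, lambda d: (d > 0.0) & (d <= 1.0),
               lambda d: (3.0 - d)**5 - 6.0*(2.0 - d)**5)
print(B.toarray())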

applying a function to rows of an ndarray && converting an itertools object to a numpy array

I am trying to create permutations of size 4 from a group of real numbers. After that, I'd like to know the position of the first element in a permutation after I sort it. Here is what I have tried so far. What's the best way to do this?
import numpy as np
from itertools import chain, permutations
N_PLAYERS = 4
N_STATES = 60
np.random.seed(0)
state_space = np.linspace(0.0, 1.0, num=N_STATES, retstep=True)[0].tolist()
perms = permutations(state_space, N_PLAYERS)
perms_arr = np.fromiter(chain(*perms), dtype=np.float16)
def loc(row):
    return np.where(np.argsort(row) == 0)[0].tolist()[0]
locs = np.apply_along_axis(loc, 0, perms)
In [153]: N_PLAYERS = 4
...: N_STATES = 60
...: np.random.seed(0)
...: state_space = np.linspace(0.0, 1.0, num=N_STATES, retstep=True)[0].tolist()
...: perms = itertools.permutations(state_space, N_PLAYERS)
In [154]: alist = list(perms)
In [155]: len(alist)
Out[155]: 11703240
Simply making a list from the permutations produces a list of tuples, each of length N_PLAYERS.
Making an array from that with chain flattens it:
In [156]: perms = itertools.permutations(state_space, N_PLAYERS)
In [158]: perms_arr = np.fromiter(itertools.chain(*perms),dtype=np.float16)
In [159]: perms_arr.shape
Out[159]: (46812960,)
That array could be reshaped to (11703240, 4).
Using apply on that 1d array doesn't work (or make sense):
In [170]: perms_arr.shape
Out[170]: (46812960,)
In [171]: locs = np.apply_along_axis(loc, 0, perms_arr)
In [172]: locs.shape
Out[172]: ()
Reshape to 4 columns:
In [173]: locs = np.apply_along_axis(loc, 0, perms_arr.reshape(-1,4))
In [174]: locs.shape
Out[174]: (4,)
In [175]: locs
Out[175]: array([ 0, 195054, 578037, 769366])
This applies loc to each column, returning one value for each. But loc has a row variable. Is that supposed to be significant?
I could switch the axis; this takes much longer:
In [176]: locs = np.apply_along_axis(loc, 1, perms_arr.reshape(-1,4))
In [177]: locs.shape
Out[177]: (11703240,)
list comprehension
This iteration does the same thing as your apply_along_axis, and I expect it is faster (though I haven't timed it; both are slow).
In [188]: locs1 = np.array([loc(row) for row in perms_arr.reshape(-1,4)])
In [189]: np.allclose(locs, locs1)
Out[189]: True
whole array sort
But argsort takes an axis, so I can sort all rows at once (instead of iterating):
In [185]: np.nonzero(np.argsort(perms_arr.reshape(-1,4), axis=1)==0)
Out[185]:
(array([ 0, 1, 2, ..., 11703237, 11703238, 11703239]),
array([0, 0, 0, ..., 3, 3, 3]))
In [186]: np.allclose(_[1],locs)
Out[186]: True
Or going the other direction (compare with Out[175]):
In [187]: np.nonzero(np.argsort(perms_arr.reshape(-1,4), axis=0)==0)
Out[187]: (array([ 0, 195054, 578037, 769366]), array([0, 1, 2, 3]))

can't reverse reshaped numpy array

I want to undo a reshape by calling reshape again on the array to restore its original dimensions.
I have an array train_X with dimensions (x, y, z). I reshape it:
train_X_1 = train_X.reshape(train_X.shape[0], train_X.shape[1] * train_X.shape[2])
then I want to reverse the reshape:
train_X_2 = train_X_1.reshape((train_X.shape[0], train_X.shape[1], train_X.shape[2]))
When I compare
print((train_X_2 == train_X).all())
I get False.
What's wrong with my code? Thanks.
Are you just trying this:
In [184]: x = np.arange(24).reshape(2,3,4)
In [185]: x1 = x.reshape(2,12)
In [186]: x2 = x1.reshape(2,3,4)
In [187]: np.allclose(x,x2)
Out[187]: True
What's your dtype? allclose is better for floats.
In [218]: data = np.load('../Downloads/train_X.npy')
In [219]: data.shape
Out[219]: (97848, 20, 2)
In [220]: data.dtype
Out[220]: dtype('float64')
In [221]: data1 = data.reshape(data.shape[0], data.shape[1]*data.shape[2])
In [222]: data1.shape
Out[222]: (97848, 40)
In [223]: data2 = data1.reshape(data.shape)
In [224]: data2.shape
Out[224]: (97848, 20, 2)
In [225]: np.allclose(data, data2)
Out[225]: False
In [226]: np.max(np.abs(data - data2))
Out[226]: nan
In [247]: np.isnan(data).sum()
Out[247]: 2514
In [248]: np.isnan(data2).sum()
Out[248]: 2514
There's your problem: the array contains nan, and nan never compares equal (nan != nan). Let's compare with the nan replaced:
In [251]: np.allclose(np.nan_to_num(data),np.nan_to_num(data2))
Out[251]: True
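With a reasonably recent numpy you can also let the comparison itself treat nan positions as equal; a sketch (allclose and array_equal both take an equal_nan flag):
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = a.reshape(3, 1).reshape(3)   # round-trip reshape, values unchanged

print(np.allclose(a, b, equal_nan=True))      # True
print(np.array_equal(a, b, equal_nan=True))   # True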
It sounds like you want to flatten, then reverse, then reshape.
starting with an array:
import numpy as np
arr = np.arange(6).reshape((2,3)) #[[0, 1, 2,], [3, 4, 5]]
We can flatten into a 1D array using ravel
arr = arr.ravel() #[0,1,2,3,4,5]
We can then reverse the order
arr = arr[::-1] #[5,4,3,2,1,0]
Then we reshape it
arr.reshape(2,3) #[[5, 4, 3], [2, 1, 0]]
Altogether:
import numpy as np
arr = np.arange(6).reshape((2,3))
arr = arr.ravel()[::-1].reshape(2,3)
print(arr)
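For a C-contiguous array, reversing every axis is the same as reversing the flattened order, so this can also be written with np.flip (a sketch):
import numpy as np

arr = np.arange(6).reshape((2, 3))
print(np.flip(arr))   # [[5 4 3]
                      #  [2 1 0]]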

How to add element to empty 2d numpy array

I'm trying to insert elements to an empty 2d numpy array. However, I am not getting what I want.
I tried np.hstack but it is giving me a normal array only. Then I tried using append but it is giving me an error.
Error:
ValueError: all the input arrays must have same number of dimensions
import numpy as np
import pandas as pd

randomReleaseAngle1 = np.random.uniform(20.0, 77.0, size=(5, 1))
randomVelocity1 = np.random.uniform(40.0, 60.0, size=(5, 1))
randomArray = np.concatenate((randomReleaseAngle1, randomVelocity1), axis=1)
arr1 = np.empty((2,2), float)
arr = np.array([])
for i in randomArray:
    data = [[170, 68.2, i[0], i[1]]]
    df = pd.DataFrame(data, columns=['height', 'release_angle', 'velocity', 'holding_angle'])
    test_y_predictions = model.predict(df)
    print(test_y_predictions)
    if np.any(test_y_predictions == 1):
        arr = np.hstack((arr, np.array([i[0], i[1]])))
        arr1 = np.append(arr1, np.array([i[0], i[1]]), axis=0)
print(arr)
print(arr1)
I wanted to get something like
[[1.5,2.2],
[3.3,4.3],
[7.1,7.3],
[3.3,4.3],
[3.3,4.3]]
However, I'm getting
[56.60290125 49.79106307 35.45102444 54.89380834 47.09359271 49.19881675
22.96523274 44.52753514 67.19027156 54.10421167]
The recommended list append approach:
In [39]: alist = []
In [40]: for i in range(3):
...: alist.append([i, i+10])
...:
In [41]: alist
Out[41]: [[0, 10], [1, 11], [2, 12]]
In [42]: np.array(alist)
Out[42]:
array([[ 0, 10],
[ 1, 11],
[ 2, 12]])
If we start with an empty((2,2)) array:
In [47]: arr = np.empty((2,2),int)
In [48]: arr
Out[48]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952]])
In [49]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[49]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952],
[ 1, 10],
[ 2, 11]])
Note that empty does not mean the same thing as the list []. It's a real 2x2 array, with 'unspecified' values. And those values remain when we add other arrays to it.
I could start with an array with a 0 dimension:
In [51]: arr = np.empty((0,2),int)
In [52]: arr
Out[52]: array([], shape=(0, 2), dtype=int64)
In [53]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[53]:
array([[ 1, 10],
[ 2, 11]])
That looks more like the list append approach. But why start with the (0,2) array in the first place?
np.concatenate takes a list of arrays (or lists that can be made into arrays). I used nested lists that make (1,2) arrays. With this I can join them on axis 0.
Each concatenate makes a new array. So if done iteratively it is more expensive than the list append.
np.append just takes 2 arrays and does a concatenate, so it doesn't add much. hstack tweaks shapes and joins on the 2nd (horizontal) dimension; vstack is another variant. But they all end up calling concatenate.
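Applied to the loop in the question, the list-append pattern would look roughly like this (a sketch with made-up pairs standing in for the model.predict filtering):
import numpy as np

rows = []
for angle, velocity in [(1.5, 2.2), (3.3, 4.3), (7.1, 7.3)]:
    # in the real code, append only when the prediction is 1
    rows.append([angle, velocity])

arr = np.array(rows)
print(arr)   # shape (3, 2), one pair per row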
With the hstack method, you can just reshape after you get the final array:
arr = arr.reshape(-1, 2)
print(arr)
The other method can be more easily done in a similar way:
arr1 = np.append(arr1, np.array([i[0], i[1]]))  # in the loop
arr1 = arr1.reshape(-1, 2)
print(arr1)

How to efficiently sum and mean 2D NumPy arrays by id?

I have a 2d array a and a 1d array b. I want to compute the sum of rows in array a, grouped by each id in b. For example:
import numpy as np
a = np.array([[1,2,3],[2,3,4],[4,5,6]])
b = np.array([0,1,0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls),a.shape[1]))
for i in ls:
res[i] = np.array([a[x] for x in range(0,count) if b[x] == i]).sum(axis=0)
print res
I got the printed result as:
[[ 5. 7. 9.]
[ 2. 3. 4.]]
What I want to do is: since the 1st and 3rd elements of b are 0, I compute a[0]+a[2], which is [5, 7, 9], as one row of the result. Similarly, the 2nd element of b is 1, so a[1], which is [2, 3, 4], is another row of the result.
But my implementation seems quite slow for large arrays. Is there any better implementation?
I know there is a bincount function in numpy, but it seems to support only 1d arrays.
Thank you all for helping me!
The numpy_indexed package (disclaimer: I am its author) was made to address problems exactly of this kind in an efficiently vectorized and general manner:
import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)
Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work with different axes, and also the ability to group by more complicated things than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.
import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
Approach #1
You can use np.add.at, which works for ndarrays of generic dimensions, unlike np.bincount, which expects only 1D arrays -
np.add.at(res, b, a)
Sample run -
In [40]: a
Out[40]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [41]: b
Out[41]: array([0, 1, 0])
In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)
In [46]: np.add.at(res, b, a)
In [47]: res
Out[47]:
array([[5, 7, 9],
[2, 3, 4]])
To compute mean values, we need to use np.bincount to get the counts per label/tag and then divide by those along each row, like so -
In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Generalizing to handle b that is not necessarily a sequence starting from 0, we can put this in a nice little function that handles summations and averages in a cleaner way, like so -
def groupby_addat(a, b, out="sum"):
    unqb, tags, counts = np.unique(b, return_inverse=1, return_counts=1)
    res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
    np.add.at(res, tags, a)
    if out=="mean":
        return unqb, res/counts[:,None].astype(float)
    elif out=="sum":
        return unqb, res
    else:
        print "Invalid output"
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [204]: b_ids, means = groupby_addat(a, b, out="mean")
In [205]: b_ids
Out[205]: array([ 5, 10])
In [206]: means
Out[206]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Approach #2
We could also make use of np.add.reduceat, which might be more performant -
def groupby_addreduceat(a, b, out="sum"):
    sidx = b.argsort()
    sb = b[sidx]
    spt_idx = np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
    sums = np.add.reduceat(a[sidx], spt_idx[:-1])
    if out=="mean":
        counts = spt_idx[1:] - spt_idx[:-1]
        return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
    elif out=="sum":
        return sb[spt_idx[:-1]], sums
    else:
        print "Invalid output"
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")
In [208]: b_ids
Out[208]: array([ 5, 10])
In [209]: means
Out[209]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
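For completeness, the grouped sums and means can also be written with plain np.bincount; since bincount is 1d-only, loop over the columns (a short sketch):
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([5, 10, 5])   # ids need not start at 0

unique_b, tags = np.unique(b, return_inverse=True)
sums = np.column_stack([np.bincount(tags, weights=a[:, j])
                        for j in range(a.shape[1])])
counts = np.bincount(tags)
means = sums / counts[:, None]
print(sums)    # [[5. 7. 9.]
               #  [2. 3. 4.]]
print(means)   # [[2.5 3.5 4.5]
               #  [2.  3.  4. ]]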

Numpy - Matrix Generation from list comprehension

I have the following list of np.array:
dataset = [np.random.normal(r_mean/(p*t), r_vol/t/np.sqrt(p), n)
           for t in rule]
I want to transform it into a 2D np.array (i.e. a matrix). I could use np.asarray, but (I believe) it would be inefficient.
Also, each np.random.normal(r_mean/(p*t), r_vol/t/np.sqrt(p), n) is meant to be a column of the resulting matrix, not a row (i.e. I'd have to transpose np.asarray(dataset)).
What is the best way of achieving the result ?
You can use broadcasting to create dataset with a single call to numpy.random.normal. Instead of using a list comprehension, make rule a numpy array and use it where you have t in your expression, and request a sample with size (n, len(rule)):
In [66]: r_mean = 1.0
In [67]: r_vol = 3.0
In [68]: p = 2.0
In [69]: rule = np.array([1.0, 100.0, 10000.0])
In [70]: n = 8
In [71]: dataset = np.random.normal(r_mean/(p*rule), r_vol/rule/np.sqrt(p), size=(n, len(rule)))
In [72]: dataset
Out[72]:
array([[ 7.44295301e-01, -1.57786106e-03, -1.85518458e-04],
[ -2.37293991e+00, -2.27875859e-02, 3.38182239e-04],
[ 2.01362974e+00, 5.93566418e-02, -3.00178175e-04],
[ 2.52533022e+00, 8.15380813e-03, 1.82511343e-04],
[ 7.32980563e-01, 2.67511372e-02, -1.95965258e-04],
[ 2.91958598e+00, -1.36314059e-02, 2.45200175e-04],
[ -4.43329724e+00, -5.85052629e-02, -1.75796458e-04],
[ -2.45005431e-01, -1.68543495e-02, 1.69715542e-04]])
If you are unsure that the columns correctly match the parameters, we can test a large sample:
In [73]: n = 100000
Create mu and std so we can see the requested means and standard deviations:
In [74]: mu = r_mean/(p*rule)
In [75]: std = r_vol/rule/np.sqrt(p)
Generate the data:
In [76]: dataset = np.random.normal(mu, std, size=(n, len(rule)))
Here's the mu that we requested:
In [77]: mu
Out[77]: array([ 5.00000000e-01, 5.00000000e-03, 5.00000000e-05])
And here's what we got in the sample:
In [78]: dataset.mean(axis=0)
Out[78]: array([ 4.95672937e-01, 5.08624034e-03, 5.02922664e-05])
Here are the desired standard deviations:
In [79]: std
Out[79]: array([ 2.12132034e+00, 2.12132034e-02, 2.12132034e-04])
And here's what we got:
In [80]: dataset.std(axis=0)
Out[80]: array([ 2.11258192e+00, 2.12437161e-02, 2.11784163e-04])
Alternatively, preallocate the result and fill it column by column:
ds = np.empty((dataset[0].size, len(dataset)), dtype=dataset[0].dtype)
for i in range(ds.shape[1]):
    ds[:, i] = dataset[i]
but only do that if you must precompute the dataset list first.
Else use a generator:
ds = np.empty((n, len(rule)))
dataset = (np.random.normal(r_mean/(p*t), r_vol/t/np.sqrt(p), n) for t in rule)
for i, d in enumerate(dataset):
    ds[:, i] = d
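For comparison, the np.asarray route from the question is a one-liner; the transpose (or np.column_stack) puts each sample in a column. A sketch with the same names as above:
import numpy as np

r_mean, r_vol, p, n = 1.0, 3.0, 2.0, 8
rule = np.array([1.0, 100.0, 10000.0])

dataset = [np.random.normal(r_mean/(p*t), r_vol/t/np.sqrt(p), n) for t in rule]
ds = np.asarray(dataset).T          # shape (n, len(rule))
ds2 = np.column_stack(dataset)      # same result
print(ds.shape, ds2.shape)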
