I have a sparse 2D matrix, typically something like this:
test
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  2.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
I'm interested in all nonzero elements in "test"
index = numpy.nonzero(test) returns a tuple of arrays giving me the indices for the nonzero elements:
index
(array([0, 2, 2, 3]), array([0, 1, 2, 3]))
For each row I would like to print out all the nonzero elements, skipping any rows that contain only zero elements.
Any hints would be appreciated.
Thanks for the hints. This solved the problem:
>>> test
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  2.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> transp = np.transpose(np.nonzero(test))
>>> transp
array([[0, 0],
       [2, 1],
       [2, 2],
       [3, 3]])
>>> for index in range(len(transp)):
...     row, col = transp[index]
...     print('Row index', row, 'Col index', col, 'value :', test[row, col])
giving me:
Row index 0 Col index 0 value : 1.0
Row index 2 Col index 1 value : 2.0
Row index 2 Col index 2 value : 1.0
Row index 3 Col index 3 value : 1.0
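A slightly more idiomatic variant of the same loop, a sketch assuming import numpy as np, iterates over the (row, col) pairs directly instead of indexing transp by position:
for row, col in zip(*np.nonzero(test)):
    # one (row, col) pair per nonzero element
    print('Row index', row, 'Col index', col, 'value :', test[row, col])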
Given
rows, cols = np.nonzero(test)
you could also use so-called advanced integer indexing:
test[rows, cols]
For example,
test = np.array([[ 1.,  0.,  0.,  0.],
                 [ 0.,  0.,  0.,  0.],
                 [ 0.,  2.,  1.,  0.],
                 [ 0.,  0.,  0.,  1.]])
rows, cols = np.nonzero(test)
print(test[rows, cols])
yields
array([ 1., 2., 1., 1.])
Use boolean array indexing:
test[test != 0]
There is no single array operation to do this per row (rather than for the entire matrix), since the result would have a variable number of elements per row. You can use something like
[row[row != 0] for row in test]
to achieve that.
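Tying this back to the original question, here is a minimal sketch (assuming import numpy as np and the test array from above) that prints each row's nonzero elements while skipping all-zero rows:
for i, row in enumerate(test):
    if row.any():  # skip rows that contain only zeros
        print('Row', i, ':', row[row != 0])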
A is a NumPy array with shape (6, 8).
I want:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])
A[x_id, y_id] += 1  # this doesn't actually work
which is equivalent to:
A[0, 1] += 1
A[3, 3] += 1
A[0, 4] += 1
A[3, 7] += 1
Tricks like ::2 won't work because the indices do not increase regularly.
I don't want to use extra memory to repeat [0, 3] into a new array [0, 3, 0, 3], because that is slow.
The index arrays for the two dimensions do not have equal length.
Can numpy do something like this?
Update:
I'm not sure whether broadcast_to or stride_tricks would be faster than nested Python loops (see Repeat NumPy array without replicating data?).
You can convert y_id to a 2D array whose second dimension has the same length as x_id; the two index arrays will then be broadcast together automatically because of the dimension difference:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])
A = np.zeros((6, 8))
A[x_id, y_id.reshape(-1, x_id.size)] += 1
A
array([[ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
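One caveat: a fancy-indexed += does not accumulate when the same element is addressed more than once. The broadcast pairs above are all distinct, so += is safe here, but if your index pairs can repeat, np.add.at performs an unbuffered, accumulating update:
np.add.at(A, (x_id, y_id.reshape(-1, x_id.size)), 1)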
I have the following list of indices, [2, 4, 3, 4], which correspond to my target indices. I'm creating a matrix of zeros with targets = np.zeros((features.shape[0], 5)). I'm wondering if it's possible to slice in such a way that I could update the specific indices all at once and set those values to 1 without a for loop. Ideally the matrix would look like:
([0, 0, 1, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1])
I believe you can do something like this:
targets = np.zeros((4, 5))
ind = [2, 4, 3, 4]
targets[np.arange(0, 4), ind] = 1
Here is the result:
array([[ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
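An alternative sketch that builds the same one-hot rows by indexing into an identity matrix (it allocates the rows directly instead of filling a zero matrix):
ind = np.array([2, 4, 3, 4])
targets = np.eye(5)[ind]  # row i is the one-hot vector for ind[i]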
I know this can easily be done with the pandas package, but the data is too sparse and large (170,000 x 5000), and at the end I need to use sklearn on the data again, so I'm wondering whether there is a way to do it with sklearn. I tried the one-hot encoder, but got stuck associating the dummies with the 'id'.
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})
   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b
dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()
   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0
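For reference, pd.crosstab produces the same count table in a single call, with id as the index, though the result is still a dense DataFrame and so does not solve the sparsity problem:
pd.crosstab(df['id'], df['item'])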
Update:
Now I'm here, but the 'id' is lost; how do I do the aggregation?
import numpy as np
import sklearn.preprocessing

lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))
dummy.todense()
matrix([[ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]])
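One way to recover the per-id aggregation at this point without going dense is to build a sparse group-indicator matrix from the encoded ids and multiply; this is only a sketch, and the names id_enc, row_groups, and indicator are illustrative:
import numpy as np
import scipy.sparse as sp
import sklearn.preprocessing

# encode the ids to 0..n_groups-1 (a second LabelEncoder, separate from the item one)
id_enc = sklearn.preprocessing.LabelEncoder()
row_groups = id_enc.fit_transform(np.array(df.id))

# indicator[g, i] == 1 iff row i of dummy belongs to group g
n = dummy.shape[0]
indicator = sp.csr_matrix((np.ones(n), (row_groups, np.arange(n))))

counts = indicator @ dummy  # one row per id, with summed dummy counts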
In case anyone needs a reference in the future, I put my solution here.
I used a SciPy sparse matrix.
First, do a grouping and count the number of records.
df = df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})
This takes some time, but not days.
Then build a pivot table as a sparse matrix, using a solution I found here.
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id, item, count):
    id_u = list(df[id].unique())
    item_u = list(np.sort(df[item].unique()))
    data = df[count].tolist()
    row = df[id].astype('category', categories=id_u).cat.codes
    col = df[item].astype('category', categories=item_u).cat.codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))
Then call the function
result = to_sparse_pivot(df, 'id', 'item', 'count')
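Note that the astype('category', categories=...) form was removed in later pandas versions; here is a sketch of the same function using pd.Categorical instead (the parameter names are changed only to avoid shadowing the id builtin):
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id_col, item_col, count_col):
    id_u = list(df[id_col].unique())
    item_u = list(np.sort(df[item_col].unique()))
    data = df[count_col].tolist()
    # pd.Categorical maps each value to its position in the categories list
    row = pd.Categorical(df[id_col], categories=id_u).codes
    col = pd.Categorical(df[item_col], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))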
OneHotEncoder requires integers, so here is one way to map your items to a unique integer. Because the mapping is one-to-one, we can also reverse this dictionary.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'Item': ['a', 'a', 'c', 'b', 'a', 'b']})
mapping = {letter: integer for integer, letter in enumerate(df.Item.unique())}
reverse_mapping = {integer: letter for letter, integer in mapping.items()}
>>> mapping
{'a': 0, 'b': 2, 'c': 1}
>>> reverse_mapping
{0: 'a', 1: 'c', 2: 'b'}
Now create a OneHotEncoder and map your values.
hot = OneHotEncoder()
h = hot.fit_transform(df.Item.map(mapping).values.reshape(len(df), 1))
>>> h
<6x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> h.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])
And for reference, these would be the appropriate columns:
>>> [reverse_mapping[n] for n in reverse_mapping.keys()]
['a', 'c', 'b']
From your data, you can see that the value c in the dataframe was in the third row (index 2). It was mapped to the integer 1, which the reverse mapping shows corresponds to the middle column. And the third row is the only one in the matrix with a one in the middle column, confirming the result.
Beyond this, I'm not sure where you'd be stuck. If you still have issues, please clarify.
To concatenate the ID values:
>>> np.concatenate((df.ID.values.reshape(len(df), 1), h.toarray()), axis=1)
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])
To keep the array sparse:
from scipy.sparse import hstack, lil_matrix
id_vals = lil_matrix(df.ID.values.reshape(len(df), 1))
h_sparse = hstack([id_vals, h.tolil()])
>>> type(h_sparse)
scipy.sparse.coo.coo_matrix
>>> h_sparse.toarray()
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])
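Since hstack returns a COO matrix (as the type check above shows), you may want to convert to CSR before handing it to scikit-learn, which generally works best with row-oriented sparse input:
h_sparse = h_sparse.tocsr()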
I would like to change the diagonal elements of a 2D matrix, both the main diagonal and the off-diagonals.
numpy.diagonal()
In NumPy 1.10 this will return a read/write view; writing to the returned array will alter your original array.
numpy.fill_diagonal(), numpy.diag_indices()
These only work with the main diagonal.
Here is my use case: I want to recreate a matrix of the following form, which is trivial to express in diagonal notation, given that I have x, y, and z as arrays.
Try this:
>>> A = np.zeros((6, 6))
>>> i, j = np.indices(A.shape)
>>> z = [1, 2, 3, 4, 5]
Now you can intuitively access any diagonal:
>>> A[i == j - 1] = z
>>> A
array([[ 0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  2.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  3.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  4.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  5.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])
In the same way you can assign arrays to A[i == j], etc.
You could always use slicing to assign a value or array to the diagonals.
Passing in a list of row indices and a list of column indices lets you access the locations directly (and efficiently). For example:
>>> z = np.zeros((5,5))
>>> z[np.arange(5), np.arange(5)] = 1 # diagonal is 1
>>> z[np.arange(4), np.arange(4) + 1] = 2 # first upper diagonal is 2
>>> z[np.arange(4) + 1, np.arange(4)] = [11, 12, 13, 14] # first lower diagonal values
changes the array of zeros z to:
array([[  1.,   2.,   0.,   0.,   0.],
       [ 11.,   1.,   2.,   0.,   0.],
       [  0.,  12.,   1.,   2.,   0.],
       [  0.,   0.,  13.,   1.,   2.],
       [  0.,   0.,   0.,  14.,   1.]])
In general for a k x k array called z, you can set the ith upper diagonal with
z[np.arange(k-i), np.arange(k-i) + i]
and the ith lower diagonal with
z[np.arange(k-i) + i, np.arange(k-i)]
Note: if you want to avoid calling np.arange several times, you can write ix = np.arange(k) once and then slice that range as needed: np.arange(k-i) is the same as ix[:-i] for i > 0 (for i = 0, ix[:-0] is empty, so use ix itself).
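For example, with hypothetical values k = 5 and i = 2:
ix = np.arange(5)
z = np.zeros((5, 5))
z[ix[:-2], ix[:-2] + 2] = 7  # second upper diagonal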
Here is another approach, just for fun: you can write your own diagonal function that returns a writable view of the diagonal you need.
import numpy as np

def diag(a, k=0):
    # Shift the view so the requested diagonal starts at element (0, 0)
    if k > 0:
        a = a[:, k:]
    elif k < 0:
        a = a[-k:, :]
    # The diagonal has min(rows, cols) elements; stepping one element along
    # it advances by one row plus one column, i.e. sum(a.strides) bytes
    shape = (min(a.shape),)
    strides = (sum(a.strides),)
    return np.lib.stride_tricks.as_strided(a, shape, strides)

a = np.arange(20).reshape((4, 5))
diag(a, 2)[:] = 88
diag(a, -2)[:] = 99
print(a)
# [[ 0  1 88  3  4]
#  [ 5  6  7 88  9]
#  [99 11 12 13 88]
#  [15 99 17 18 19]]
I have a large numpy matrix M. Some rows of the matrix have all of their elements equal to zero, and I need to get the indices of those rows. The naive approach I'm considering is to loop through each row in the matrix and then check each element.
What would be a better and a faster approach to accomplish this using numpy?
Here's one way. I assume numpy has been imported using import numpy as np.
In [20]: a
Out[20]:
array([[0, 1, 0],
       [1, 0, 1],
       [0, 0, 0],
       [1, 1, 0],
       [0, 0, 0]])
In [21]: np.where(~a.any(axis=1))[0]
Out[21]: array([2, 4])
It's a slight variation of this answer: How to check that a matrix contains a zero column?
Here's what's going on:
The any method returns True if any value in the array is "truthy". Nonzero numbers are considered True, and 0 is considered False. By using the argument axis=1, the method is applied to each row. For the example a, we have:
In [32]: a.any(axis=1)
Out[32]: array([ True, True, False, True, False], dtype=bool)
So each value indicates whether the corresponding row contains a nonzero value. The ~ operator is the bitwise "not" (complement), which on boolean arrays acts as an element-wise logical not:
In [33]: ~a.any(axis=1)
Out[33]: array([False, False, True, False, True], dtype=bool)
(An alternative expression that gives the same result is (a == 0).all(axis=1).)
To get the row indices, we use the where function. It returns the indices where its argument is True:
In [34]: np.where(~a.any(axis=1))
Out[34]: (array([2, 4]),)
Note that where returned a tuple containing a single array. where works for n-dimensional arrays, so it always returns a tuple. We want the single array in that tuple.
In [35]: np.where(~a.any(axis=1))[0]
Out[35]: array([2, 4])
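Equivalently, np.flatnonzero bundles the where-plus-[0] step into a single call:
In [36]: np.flatnonzero(~a.any(axis=1))
Out[36]: array([2, 4])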
The accepted answer works when the elements compare exactly equal to zero (e.g. int 0). If you want to find rows whose float values are only approximately zero, you can use np.isclose():
print(labels)
# output
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
np.where(np.all(np.isclose(labels, 0), axis=1))
(array([0, 3]),)
Note: this also works with PyTorch Tensors, which is nice for when you want to find zeroed multihot encoding vectors.
A solution using np.sum, useful if you want to apply a threshold:
a = np.array([[1.0, 1.0, 2.99],
              [0.0000054, 0.00000078, 0.00000232],
              [0, 0, 0],
              [1, 1, 0.0],
              [0.0, 0.0, 0.0]])
print(np.where(np.sum(np.abs(a), axis=1) == 0)[0])
>> [2 4]
print(np.where(np.sum(np.abs(a), axis=1) < 0.0001)[0])
>> [1 2 4]
Use np.prod to check whether a row contains at least one zero element (a row's product is zero exactly when some element is zero):
print(np.where(np.prod(a, axis=1) == 0)[0])
>> [2 3 4]
a = numpy.array([[10, 0], [0, 0], [0, 10]])
isZero = numpy.all(a == 0, axis=1)
deleteFullZero = a[~isZero]
# isZero         >> [False  True False]
# deleteFullZero >> [[10  0] [ 0 10]]