np.ufunc.at for 2D array

To compute a confusion matrix (not just the accuracy), one may need to loop over the predicted and true labels. How can this be done in a NumPy manner, given that the following code does not give the needed result?
>> a = np.zeros((5, 5))
>> indices = np.array([
       [0, 0],
       [2, 2],
       [4, 4],
       [0, 0],
       [2, 2],
       [4, 4],
   ])
>> np.add.at(a, indices, 1)
>> a
array([[4., 4., 4., 4., 4.],
       [0., 0., 0., 0., 0.],
       [4., 4., 4., 4., 4.],
       [0., 0., 0., 0., 0.],
       [4., 4., 4., 4., 4.]])
# Wanted
array([[2., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 2.]])

The docs say: "If first operand has multiple dimensions, indices can be a tuple of array like index objects or slice objects."
Passing the two columns of indices as such a tuple gives the wanted result:
np.add.at(a, (indices[:, 0], indices[:, 1]), 1)
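As a worked example, the confusion matrix itself then reduces to a single np.add.at call over the label vectors; a minimal sketch (y_true, y_pred and n_classes are illustrative names, not from the original post):

import numpy as np

# hypothetical true/predicted label vectors with classes 0..4
y_true = np.array([0, 2, 4, 0, 2, 4])
y_pred = np.array([0, 2, 4, 0, 2, 4])

n_classes = 5
cm = np.zeros((n_classes, n_classes))
# rows indexed by the true label, columns by the predicted label
np.add.at(cm, (y_true, y_pred), 1)
print(cm)   # 2s on the diagonal at (0, 0), (2, 2) and (4, 4)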

Related

Checking non zero-sum rows in numpy array and removing them

I have a numpy array like this:
array([[ 3., 2., 3., ..., 0., 0., 0.],
[ 3., 2., -4., ..., 0., 0., 0.],
[ 3., -4., 1., ..., 0., 0., 0.],
...,
[-1., -2., 4., ..., 0., 0., 0.],
[ 4., -2., -2., ..., 0., 0., 0.],
[-2., 2., 4., ..., 0., 0., 0.]], dtype=float32)
What I want to do is remove all the rows that do not sum to zero, while also saving those rows' indexes/positions so that I can remove the same rows from another array.
I'm trying the following:
for i in range(len(arr1)):
    count = 0
    for j in arr1[i]:
        count += j
    if count != 0:
        arr_1 = np.delete(arr1, i, axis=0)
        arr_2 = np.delete(arr2, i, axis=0)
The resulting arr_1 and arr_2 still contain rows that do not sum to zero. What am I doing wrong?
You can compute the row sums and then keep the rows whose sum == 0, like below:
a=np.array([
[ 3., 2., 3., 0., 0., 0.],
[ 3., 2., -4., 0., 0., 0.],
[ 3., -4., 1., 0., 0., 0.]])
b = a.sum(axis=1)
# array([8., 1., 0.])
print(a[b==0])
Output:
array([[ 3., -4., 1., 0., 0., 0.]])
Just use sum(axis=1):
mask = a.sum(axis=1) != 0
do_sum_to_0 = a[~mask]
dont_sum_to_0 = a[mask]
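Since the question also asks for the positions of the removed rows so the same rows can be dropped from a second array, a minimal sketch building on the mask above (arr1 and arr2 are the asker's arrays; the example data here is illustrative):

import numpy as np

arr1 = np.array([[ 3.,  2.,  3., 0.],
                 [ 3.,  2., -4., 0.],
                 [ 3., -4.,  1., 0.]], dtype=np.float32)
arr2 = np.arange(len(arr1))          # stand-in for the second array

keep = arr1.sum(axis=1) == 0         # True for rows that sum to zero
removed_idx = np.nonzero(~keep)[0]   # positions of the rows being dropped

arr_1 = arr1[keep]                   # rows of arr1 that sum to zero
arr_2 = arr2[keep]                   # the same rows kept in the second array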

How to distribute a Numpy array along the diagonal of an array of higher dimension?

I have three two-dimensional NumPy arrays x, w, d and want to create a fourth one called a. w and d define only the shape of a, which is d.shape + w.shape. I want x in the entries of a, with zeros elsewhere.
Specifically, I want a loop-free version of this code:
a = np.zeros(d.shape + w.shape)
for j in range(d.shape[1]):
    a[:, j, :, j] = x
For example, given:
x = np.array([
[2, 3],
[1, 1],
[8,10],
[0, 1]
])
w = np.array([
[ 0, 1, 1],
[-1,-2, 1]
])
d = np.matmul(x,w)
I want a to be
array([[[[ 2., 0., 0.],
[ 3., 0., 0.]],
[[ 0., 2., 0.],
[ 0., 3., 0.]],
[[ 0., 0., 2.],
[ 0., 0., 3.]]],
[[[ 1., 0., 0.],
[ 1., 0., 0.]],
[[ 0., 1., 0.],
[ 0., 1., 0.]],
[[ 0., 0., 1.],
[ 0., 0., 1.]]],
[[[ 8., 0., 0.],
[10., 0., 0.]],
[[ 0., 8., 0.],
[ 0., 10., 0.]],
[[ 0., 0., 8.],
[ 0., 0., 10.]]],
[[[ 0., 0., 0.],
[ 1., 0., 0.]],
[[ 0., 0., 0.],
[ 0., 1., 0.]],
[[ 0., 0., 0.],
[ 0., 0., 1.]]]])
This answer inspired the following solution:
# shape a: (4, 3, 2, 3)
# shape x: (4, 2)
a = np.zeros(d.shape + w.shape)
a[:, np.arange(a.shape[1]), :, np.arange(a.shape[3])] = x
It uses NumPy's broadcasting in combination with advanced indexing to enlarge x to fit the slicing.
I happen to have an even simpler solution: a = np.tensordot(x, np.identity(3), axes=0).swapaxes(1, 2)
The size of the identity matrix is determined by the number of times you wish to repeat the elements of x.
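A quick sanity check on the example data from the question, comparing the loop, advanced-indexing and tensordot variants:

import numpy as np

x = np.array([[2, 3], [1, 1], [8, 10], [0, 1]])
w = np.array([[0, 1, 1], [-1, -2, 1]])
d = np.matmul(x, w)

# loop version from the question
a_loop = np.zeros(d.shape + w.shape)
for j in range(d.shape[1]):
    a_loop[:, j, :, j] = x

# advanced-indexing version
a_idx = np.zeros(d.shape + w.shape)
a_idx[:, np.arange(d.shape[1]), :, np.arange(w.shape[1])] = x

# tensordot version; the identity size equals the number of diagonal slots
a_td = np.tensordot(x, np.identity(d.shape[1]), axes=0).swapaxes(1, 2)

print(np.array_equal(a_loop, a_idx), np.array_equal(a_loop, a_td))  # True True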

Alternatives to np.newaxis() for saving memory when comparing arrays

I want to compare each vector from one array with all vectors from another array, and count how many symbols match per vector. Let me show an example.
I have two arrays, a and b.
For each vector in a, I want to compare it with each vector in b. I then want to return a new array of shape (len(a), 14), where each row holds the number of times a vector in a had 0, 1, 2, 3, 4, ..., 12, 13 matches with vectors from b. The desired result is shown in array c below.
I have already solved this problem using np.newaxis (see my function below), but the issue is that it takes up so much memory that my computer can't handle it when a and b get larger. Hence, I am looking for a more efficient way to do this calculation, as adding dimensions to the vectors hurts my memory big time. One alternative is a plain for loop, but that method is rather slow.
Is it possible to make these calculations more efficient?
a = array([[1., 1., 1., 2., 1., 1., 2., 1., 0., 2., 2., 2., 2.],
[0., 2., 2., 0., 1., 1., 0., 1., 1., 0., 2., 1., 2.],
[0., 0., 0., 1., 1., 0., 2., 1., 2., 0., 1., 2., 2.],
[1., 2., 2., 0., 1., 1., 0., 2., 0., 1., 1., 0., 2.],
[1., 2., 0., 2., 2., 0., 2., 0., 0., 1., 2., 0., 0.]])
b = array([[0., 2., 0., 0., 0., 0., 0., 1., 1., 1., 0., 2., 2.],
[1., 0., 1., 2., 2., 0., 1., 1., 1., 1., 2., 1., 2.],
[1., 2., 1., 2., 0., 0., 0., 1., 1., 2., 2., 0., 2.],
[0., 1., 2., 0., 2., 1., 0., 1., 2., 0., 0., 0., 2.],
[0., 2., 2., 1., 2., 1., 0., 1., 1., 1., 2., 2., 2.],
[0., 2., 2., 1., 0., 1., 1., 0., 1., 0., 2., 2., 1.],
[1., 0., 2., 2., 0., 1., 0., 1., 0., 1., 1., 2., 2.],
[1., 1., 0., 2., 1., 1., 1., 1., 0., 2., 0., 2., 2.],
[1., 2., 0., 0., 0., 1., 2., 1., 0., 1., 2., 0., 1.],
[1., 2., 1., 2., 2., 1., 2., 0., 2., 0., 0., 1., 1.]])
c = array([[0, 0, 0, 2, 1, 2, 2, 2, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 2, 3, 1, 2, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 3, 2, 4, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 0, 3, 2, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 4, 0, 3, 0, 1, 0, 0, 0, 0, 0]])
My solution:
def new_method_test(a, b):
    test = (a[:, np.newaxis] == b).sum(axis=2)
    zero = (test == 0).sum(axis=1)
    one = (test == 1).sum(axis=1)
    two = (test == 2).sum(axis=1)
    three = (test == 3).sum(axis=1)
    four = (test == 4).sum(axis=1)
    five = (test == 5).sum(axis=1)
    six = (test == 6).sum(axis=1)
    seven = (test == 7).sum(axis=1)
    eight = (test == 8).sum(axis=1)
    nine = (test == 9).sum(axis=1)
    ten = (test == 10).sum(axis=1)
    eleven = (test == 11).sum(axis=1)
    twelve = (test == 12).sum(axis=1)
    thirteen = (test == 13).sum(axis=1)
    c = np.concatenate((zero, one, two, three, four, five, six, seven, eight,
                        nine, ten, eleven, twelve, thirteen), axis=0).reshape(14, len(a)).T
    return c
Thank you for your help.
Welcome to Stack Overflow! I think a for loop is the way to go if you want to save memory (and it's really not that slow). Additionally, you can go directly from one test vector to your c output matrix with np.bincount. I think this method will be approximately as fast as yours, while using significantly less memory.
import numpy as np

c = np.empty((a.shape[0], a.shape[1] + 1), dtype=int)   # 0..13 matches -> 14 columns
for i in range(a.shape[0]):
    test_one_vector = (a[i, :] == b).sum(axis=1)
    c[i, :] = np.bincount(test_one_vector, minlength=a.shape[1] + 1)
A small side note: if you are really dealing with floating-point numbers in a and b, you should consider dropping the equality check (==) in favor of a proximity check such as np.isclose.
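A minimal sketch of that variant, assuming a small absolute tolerance is acceptable (the helper name match_histogram and the atol parameter are illustrative):

import numpy as np

def match_histogram(a, b, atol=1e-8):
    # Count per-row matches of `a` against all rows of `b`, using np.isclose
    # instead of exact equality so floating-point noise does not break it.
    n_bins = a.shape[1] + 1                       # 0..13 matches -> 14 bins
    c = np.empty((a.shape[0], n_bins), dtype=int)
    for i in range(a.shape[0]):
        matches = np.isclose(a[i, :], b, atol=atol).sum(axis=1)
        c[i, :] = np.bincount(matches, minlength=n_bins)
    return c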

tf.gather_nd without "flattening" shape?

I'm still playing around with TensorFlow and have been trying to use the gather_nd op, but the return value is not in the shape/format I want...
Input tensor (shape: (2, 7, 4)):
array([[[ 0., 0., 1., 2.],
[ 0., 0., 2., 2.],
[ 0., 0., 3., 3.],
[ 0., 0., 4., 3.],
[ 0., 0., 5., 4.],
[ 0., 0., 6., 4.],
[ 0., 0., 7., 5.]],
[[ 1., 1., 0., 2.],
[ 1., 2., 0., 2.],
[ 1., 3., 0., 3.],
[ 1., 4., 0., 3.],
[ 1., 5., 0., 4.],
[ 1., 6., 0., 5.],
[ 1., 7., 0., 5.]]], dtype=float32)
Indices returned by the tf.where op (shape: (3, 2)):
array([[0, 0],
[0, 1],
[1, 0]])
tf.gather_nd results (shape: (3, 4)):
array([[ 0., 0., 1., 2.],
[ 0., 0., 2., 2.],
[ 1., 1., 0., 2.]], dtype=float32)
Desired results (shape: (2, sparse, 4)):
array([[[ 0., 0., 1., 2.],
[ 0., 0., 2., 2.]],
[[ 1., 1., 0., 2.]]], dtype=float32)
What's the best way to achieve this, keeping in mind that tf.where produces dynamic shapes and there is no guarantee of shape consistency across the 2nd dimension (axis=1)?
NB: Ignore this question - See my answer
I think it's a TensorFlow version problem. In my version (1.2.1), I get the exact desired output from your inputs. However, I also tried the following code for the older version.
import tensorflow as tf

indices = [[[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]],
           [[0, 1, 0], [0, 1, 1], [0, 1, 2], [0, 1, 3]],
           [[1, 0, 0], [1, 0, 1], [1, 0, 2], [1, 0, 3]]]
params = [[[0., 0., 1., 2.],
           [0., 0., 2., 2.],
           [0., 0., 3., 3.],
           [0., 0., 4., 3.],
           [0., 0., 5., 4.],
           [0., 0., 6., 4.],
           [0., 0., 7., 5.]],
          [[1., 1., 0., 2.],
           [1., 2., 0., 2.],
           [1., 3., 0., 3.],
           [1., 4., 0., 3.],
           [1., 5., 0., 4.],
           [1., 6., 0., 5.],
           [1., 7., 0., 5.]]]
output = tf.gather_nd(params, indices)
with tf.Session() as sess:
    print(sess.run(output))
Hope this helps.
I realized the idiocy of my question:
# of tuples when 1st dim is 0 != # of tuples when 1st dim is 1,
so the result cannot be a regular, rectangular tensor. I'm not sure that what I'm asking is feasible ...
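For completeness, newer TensorFlow versions can hold exactly this kind of ragged result; a minimal sketch, assuming TF 2.x where tf.RaggedTensor is available (this goes beyond the version discussed above):

import tensorflow as tf

params = tf.constant([[[0., 0., 1., 2.],
                       [0., 0., 2., 2.]],
                      [[1., 1., 0., 2.],
                       [1., 2., 0., 2.]]])
indices = tf.constant([[0, 0], [0, 1], [1, 0]])      # as returned by tf.where

values = tf.gather_nd(params, indices)               # shape (3, 4)
# group the gathered rows back by their first index -> shape (2, None, 4)
ragged = tf.RaggedTensor.from_value_rowids(values, value_rowids=indices[:, 0],
                                           nrows=params.shape[0])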

Scikit: Convert one-hot encoding to encoding with integers

I need to convert a one-hot encoding back to categories represented by unique integers. So a one-hot encoding created with the following code:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)
for x in [1, 2, 3]:
    print(enc.transform([[x]]).toarray())
Out:
[[ 1. 0. 0.]]
[[ 0. 1. 0.]]
[[ 0. 0. 1.]]
Could be converted back to a set of unique integers, for example:
[1,2,3] or [11,37, 45] or any other where each integer uniquely represents a single class.
Is it possible to do with scikit-learn or any other python lib?
* Update *
Tried to:
labels = [[1],[2],[3], [4], [5],[6],[7]]
enc.fit(labels)
lst = []
for x in [1, 2, 3, 4, 5, 6, 7]:
    lst.append(enc.transform([[x]]).toarray())
lst
Out:
[array([[ 1., 0., 0., 0., 0., 0., 0.]]),
array([[ 0., 1., 0., 0., 0., 0., 0.]]),
array([[ 0., 0., 1., 0., 0., 0., 0.]]),
array([[ 0., 0., 0., 1., 0., 0., 0.]]),
array([[ 0., 0., 0., 0., 1., 0., 0.]]),
array([[ 0., 0., 0., 0., 0., 1., 0.]]),
array([[ 0., 0., 0., 0., 0., 0., 1.]])]
a = np.array(lst)
np.where(a==1)[1]
Out:
array([0, 0, 0, 0, 0, 0, 0], dtype=int64)
Not what I need
You can do that using np.where as follows:
import numpy as np
a=np.array([[ 0., 1., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.]])
np.where(a==1)[1]
This prints array([1, 0, 2], dtype=int64). This works since np.where(a==1)[1] returns the column indices of the 1's, which are exactly the labels.
In addition, since a is a 0,1-matrix, you can also replace np.where(a==1)[1] with just np.where(a)[1].
Update: The following solution should work with your format:
l=[np.array([[ 1., 0., 0., 0., 0., 0., 0.]]),
np.array([[ 0., 0., 1., 0., 0., 0., 0.]]),
np.array([[ 0., 1., 0., 0., 0., 0., 0.]]),
np.array([[ 0., 0., 0., 0., 1., 0., 0.]]),
np.array([[ 0., 0., 0., 0., 1., 0., 0.]]),
np.array([[ 0., 0., 0., 0., 0., 1., 0.]]),
np.array([[ 0., 0., 0., 0., 0., 0., 1.]])]
a=np.array(l)
np.where(a)[2]
This prints
array([0, 2, 1, 4, 4, 5, 6], dtype=int64)
Alternatively, you could use the original solution together with #ml4294's comment.
You can use np.argmax():
from sklearn.preprocessing import OneHotEncoder
import numpy as np
enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)
x = enc.transform(labels).toarray()
# x = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
xr = (np.argmax(x, axis=1)+1).reshape(-1, 1)
print(xr)
This should return array([[1], [2], [3]]). If you want instead array([[0], [1], [2]]), just remove the +1 in the definition of xr.
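If the original labels are not consecutive integers starting at 1 (the question also mentions e.g. [11, 37, 45]), one variant is to map the argmax positions back through the encoder's learned categories; a minimal sketch, assuming scikit-learn 0.20+ where enc.categories_ is available:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
labels = [[11], [37], [45]]               # arbitrary integer labels
enc.fit(labels)
x = enc.transform(labels).toarray()

cats = enc.categories_[0]                 # array([11, 37, 45])
decoded = cats[np.argmax(x, axis=1)]      # map column positions back to labels
print(decoded)                            # [11 37 45]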
Since you are using sklearn.preprocessing.OneHotEncoder to 'encode' the data, you can use its .inverse_transform() method to 'decode' the data (I think this requires scikit-learn version 0.20.1 or newer):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
encoder = enc.fit(labels)
encoded_labels = encoder.transform(labels)
decoded_labels = encoder.inverse_transform(encoded_labels)
decoded_labels
# array([[1],
#        [2],
#        [3]])
N.B. decoded_labels is a numpy array, not a list.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.inverse_transform
