Keras np_utils.to_categorical behaves differently

Why does Keras to_categorical behave differently on [1, -1] and [2, -2]?
y = [1, -1, -1]
y_ = np_utils.to_categorical(y)
array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])

y = [2, -2, -2]
y_ = np_utils.to_categorical(y)
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])

to_categorical does not accept negative values. If your dataset contains them, you can pass y - y.min() to to_categorical so it works as you would expect:
>>> y = numpy.array([2, -2, -2])
>>> to_categorical(y)
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
>>> to_categorical(y - y.min())
array([[ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.]])

y = np.array(y, dtype='int').ravel()
if not num_classes:
    num_classes = np.max(y) + 1
n = y.shape[0]
categorical = np.zeros((n, num_classes))
categorical[np.arange(n), y] = 1
The above is the implementation of to_categorical.
So in the [1, -1, -1] case, what happens is:
num_classes = np.max(y) + 1 = 2
categorical gets shape (3, 2)
When -1 comes along, NumPy's negative indexing writes the 1 at the last index, which is index 1; for 1 it also writes at index 1 (indices start from 0).
That is why the final output becomes
array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])
In the [2, -2, -2] case, what happens is:
num_classes = np.max(y) + 1 = 3
categorical gets shape (3, 3)
When -2 comes along, it writes the 1 at the second-to-last index, which is index 1; for 2 it writes at index 2.
That is why the final output becomes
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
So if you try something like [2, -4, -4], it will raise an IndexError, since there is no index -4 for an array of shape (3, 3).
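For reference, here is a minimal runnable sketch of that indexing behavior, using just the implementation lines shown above:

import numpy as np

# What to_categorical does internally for y = [1, -1, -1]
y = np.array([1, -1, -1], dtype='int').ravel()
num_classes = np.max(y) + 1                   # 2
categorical = np.zeros((y.shape[0], num_classes))
categorical[np.arange(y.shape[0]), y] = 1     # -1 wraps around to index 1
print(categorical)
# [[0. 1.]
#  [0. 1.]
#  [0. 1.]]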


How to distribute a Numpy array along the diagonal of an array of higher dimension?

I have three two-dimensional NumPy arrays x, w, d and want to create a fourth one called a. w and d define only the shape of a, which is d.shape + w.shape. I want to have x in the entries of a, with zeros elsewhere.
Specifically, I want a loop-free version of this code:
a = np.zeros(d.shape + w.shape)
for j in range(d.shape[1]):
    a[:, j, :, j] = x
For example, given:
x = np.array([[ 2,  3],
              [ 1,  1],
              [ 8, 10],
              [ 0,  1]])
w = np.array([[ 0,  1,  1],
              [-1, -2,  1]])
d = np.matmul(x, w)
I want a to be
array([[[[ 2.,  0.,  0.],
         [ 3.,  0.,  0.]],
        [[ 0.,  2.,  0.],
         [ 0.,  3.,  0.]],
        [[ 0.,  0.,  2.],
         [ 0.,  0.,  3.]]],
       [[[ 1.,  0.,  0.],
         [ 1.,  0.,  0.]],
        [[ 0.,  1.,  0.],
         [ 0.,  1.,  0.]],
        [[ 0.,  0.,  1.],
         [ 0.,  0.,  1.]]],
       [[[ 8.,  0.,  0.],
         [10.,  0.,  0.]],
        [[ 0.,  8.,  0.],
         [ 0., 10.,  0.]],
        [[ 0.,  0.,  8.],
         [ 0.,  0., 10.]]],
       [[[ 0.,  0.,  0.],
         [ 1.,  0.,  0.]],
        [[ 0.,  0.,  0.],
         [ 0.,  1.,  0.]],
        [[ 0.,  0.,  0.],
         [ 0.,  0.,  1.]]]])
This answer inspired the following solution:
# shape a: (4, 3, 2, 3)
# shape x: (4, 2)
a = np.zeros(d.shape + w.shape)
a[:, np.arange(a.shape[1]), :, np.arange(a.shape[3])] = x
It uses NumPy's broadcasting in combination with advanced indexing to enlarge x to fit the slicing.
I happen to have an even simpler solution:
a = np.tensordot(x, np.identity(3), axes=0).swapaxes(1, 2)
The size of the identity matrix is determined by the number of times you wish to repeat the elements of x.
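Putting the three approaches side by side, a quick sanity check (a sketch, assuming the example arrays above) confirms they agree:

import numpy as np

x = np.array([[2, 3], [1, 1], [8, 10], [0, 1]])
w = np.array([[0, 1, 1], [-1, -2, 1]])
d = np.matmul(x, w)

# Loop version (the reference)
a_loop = np.zeros(d.shape + w.shape)
for j in range(d.shape[1]):
    a_loop[:, j, :, j] = x

# Broadcasting + advanced indexing
a_idx = np.zeros(d.shape + w.shape)
a_idx[:, np.arange(a_idx.shape[1]), :, np.arange(a_idx.shape[3])] = x

# tensordot with an identity matrix
a_td = np.tensordot(x, np.identity(3), axes=0).swapaxes(1, 2)

print(np.allclose(a_loop, a_idx), np.allclose(a_loop, a_td))  # True True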

Map numpy categorical data to a numpy vector

I have a numpy array that looks like:
my_arr = array([[0., 0., 0., 0., 1., 0.],
                [0., 1., 0., 0., 0., 0.],
                [0., 0., 0., 1., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                [1., 0., 0., 0., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                [0., 1., 0., 0., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                ...
               ])
I want to return a vector that contains, for each row of my_arr, the index of the entry with value one. How can I do so?
You can use np.argmax() for that:
inds = np.argmax(my_arr, axis=1)
# array([4, 1, 3, 4, 0, 4, 1, 4])
Alternatively:
np.where(my_arr)[1]
See the docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
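One caveat worth noting: np.where(my_arr)[1] matches np.argmax row for row only when every row contains exactly one nonzero entry, which holds for one-hot data. A small sketch:

import numpy as np

my_arr = np.array([[0., 0., 0., 0., 1., 0.],
                   [0., 1., 0., 0., 0., 0.],
                   [0., 0., 0., 1., 0., 0.]])

print(np.argmax(my_arr, axis=1))  # [4 1 3]
print(np.where(my_arr)[1])        # [4 1 3], identical here because each
                                  # row has exactly one nonzero entry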
You can use np.argwhere to return an array of coordinates:
arr = np.random.randint(0, 2, (5, 5))
print(arr)
[[0 0 1 1 1]
 [0 1 0 1 1]
 [1 1 0 0 1]
 [1 1 1 0 0]
 [1 1 1 1 0]]
res = np.argwhere(arr)
print(res)
array([[0, 2], [0, 3], ..., [4, 2], [4, 3]], dtype=int64)

A neater way to set values at indexes with NumPy

I have a numpy array initially with zeros, like this:
v = np.zeros((5, 5))
v
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
I also have a set of arrays idx1 and idx2.
idx1
array([[0, 3],
       [0, 4],
       [1, 3],
       [2, 4]])
idx2
array([[0, 1],
       [0, 2],
       [0, 4],
       [1, 3]])
Treat each pair of values as row and column indices. So, for example, the first pair (0, 3) in idx1 indexes v[0, 3], and so on.
I want to first set values at indexes specified by idx1 to 1, followed by all indexes specified by idx2 to 0.
Also, please note that if there is a pair (i, j) in some array, I want to set v[i, j] and v[j, i] at the same time.
My final result becomes:
array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])
I currently achieve this by doing:
def set_vals(x, i, j, v):
    x[i, j] = x.T[i, j] = v

v = np.zeros((5, 5))
i1, j1 = idx1[:, 0], idx1[:, 1]
i2, j2 = idx2[:, 0], idx2[:, 1]
set_vals(v, i1, j1, 1)
set_vals(v, i2, j2, 0)
v  # the result
However, I believe there might be a better way. Would love to hear any thoughts/suggestions for improvement. Thanks!
In search of a more "compact" way of expressing it, I got this:
v = np.zeros((5, 5))
v[tuple(np.r_[idx1,idx1[:,::-1]].T)] = 1
v[tuple(np.r_[idx2,idx2[:,::-1]].T)] = 0
On Python 3.6+, you can use the * unpacking operator to reduce this further:
v[[*np.r_[idx1,idx1[:,::-1]].T]] = 1
v[[*np.r_[idx2,idx2[:,::-1]].T]] = 0
v
array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])
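To see why the np.r_ trick covers both v[i, j] and v[j, i], it helps to print the intermediate index array (a sketch using idx1 from above):

import numpy as np

idx1 = np.array([[0, 3], [0, 4], [1, 3], [2, 4]])

# np.r_ stacks each (i, j) pair on top of its mirrored (j, i) pair,
# so a single assignment hits both halves of the symmetric matrix:
sym = np.r_[idx1, idx1[:, ::-1]]
print(sym.T)
# [[0 0 1 2 3 4 3 4]
#  [3 4 3 4 0 0 1 2]]
# tuple(sym.T) is then (row_indices, col_indices) for fancy indexing.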

Scikit: Convert one-hot encoding to encoding with integers

I need to convert a one-hot encoding to categories represented by unique integers. So a one-hot encoding created with the following code:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)
for x in [1, 2, 3]:
    print(enc.transform([[x]]).toarray())
Out:
[[ 1. 0. 0.]]
[[ 0. 1. 0.]]
[[ 0. 0. 1.]]
Could be converted back to a set of unique integers, for example:
[1, 2, 3] or [11, 37, 45] or any other where each integer uniquely represents a single class.
Is it possible to do with scikit-learn or any other python lib?
Update: I tried the following:
labels = [[1], [2], [3], [4], [5], [6], [7]]
enc.fit(labels)
lst = []
for x in [1, 2, 3, 4, 5, 6, 7]:
    lst.append(enc.transform([[x]]).toarray())
lst
Out:
[array([[ 1., 0., 0., 0., 0., 0., 0.]]),
 array([[ 0., 1., 0., 0., 0., 0., 0.]]),
 array([[ 0., 0., 1., 0., 0., 0., 0.]]),
 array([[ 0., 0., 0., 1., 0., 0., 0.]]),
 array([[ 0., 0., 0., 0., 1., 0., 0.]]),
 array([[ 0., 0., 0., 0., 0., 1., 0.]]),
 array([[ 0., 0., 0., 0., 0., 0., 1.]])]
a = np.array(lst)
np.where(a==1)[1]
Out:
array([0, 0, 0, 0, 0, 0, 0], dtype=int64)
Not what I need
You can do that using np.where as follows:
import numpy as np
a = np.array([[ 0., 1., 0.],
              [ 1., 0., 0.],
              [ 0., 0., 1.]])
np.where(a==1)[1]
This prints array([1, 0, 2], dtype=int64). This works since np.where(a==1)[1] returns the column indices of the 1's, which are exactly the labels.
In addition, since a is a 0,1-matrix, you can also replace np.where(a==1)[1] with just np.where(a)[1].
Update: your lst is a list of (1, 7) arrays, so a = np.array(lst) has shape (7, 1, 7), and np.where(a)[1] returns the middle-axis indices, which are all zero. Taking the last axis instead should work with your format:
l = [np.array([[ 1., 0., 0., 0., 0., 0., 0.]]),
     np.array([[ 0., 0., 1., 0., 0., 0., 0.]]),
     np.array([[ 0., 1., 0., 0., 0., 0., 0.]]),
     np.array([[ 0., 0., 0., 0., 1., 0., 0.]]),
     np.array([[ 0., 0., 0., 0., 1., 0., 0.]]),
     np.array([[ 0., 0., 0., 0., 0., 1., 0.]]),
     np.array([[ 0., 0., 0., 0., 0., 0., 1.]])]
a=np.array(l)
np.where(a)[2]
This prints
array([0, 2, 1, 4, 4, 5, 6], dtype=int64)
Alternatively, you could use the original solution together with #ml4294's comment.
You can use np.argmax():
from sklearn.preprocessing import OneHotEncoder
import numpy as np
enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)
x = enc.transform(labels).toarray()
# x = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
xr = (np.argmax(x, axis=1)+1).reshape(-1, 1)
print(xr)
This should return array([[1], [2], [3]]). If you want instead array([[0], [1], [2]]), just remove the +1 in the definition of xr.
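If the original labels are not consecutive integers starting at 1, the +1 trick no longer applies. A sketch under that assumption (scikit-learn 0.20+, where the fitted encoder exposes categories_; the label values here are made up):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
labels = [[11], [37], [45]]     # hypothetical non-consecutive labels
x = enc.fit_transform(labels).toarray()

# enc.categories_[0] maps a column index back to its original label value
decoded = enc.categories_[0][np.argmax(x, axis=1)].reshape(-1, 1)
print(decoded)
# [[11]
#  [37]
#  [45]]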
Since you are using sklearn.preprocessing.OneHotEncoder to encode the data, you can use its .inverse_transform() method to decode it (I think this requires sklearn.__version__ of 0.20.1 or newer):
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
encoder = enc.fit(labels)
encoded_labels = encoder.transform(labels)
decoded_labels = encoder.inverse_transform(encoded_labels)
decoded_labels
# array([[1],
#        [2],
#        [3]])
N.B.: decoded_labels is a NumPy array, not a list.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.inverse_transform

Numpy Cyclic Broadcast of Fancy Indexing

A is a numpy array with shape (6, 8).
I want:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])
A[x_id, y_id] += 1  # this doesn't actually work
Tricks like ::2 won't work because the indices do not increase regularly.
I don't want to use extra memory to repeat [0, 3] and make a new array [0, 3, 0, 3] because that is slow.
The indices for the two dimensions do not have equal length.
which is equivalent to:
A[0, 1] += 1
A[3, 3] += 1
A[0, 4] += 1
A[3, 7] += 1
Can numpy do something like this?
Update:
Not sure if broadcast_to or stride_tricks is faster than nested Python loops (see: Repeat NumPy array without replicating data?).
You can reshape y_id into a 2-D array whose second dimension matches the length of x_id; the two index arrays are then broadcast together automatically because of the dimension difference:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])

A = np.zeros((6, 8))
A[x_id, y_id.reshape(-1, x_id.size)] += 1
A
array([[ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
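To make the pairing explicit, you can materialize the broadcast index arrays (a sketch; np.broadcast_arrays only creates views, so no large copies are made):

import numpy as np

x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])

# y_id.reshape(-1, x_id.size) is [[1, 3], [4, 7]]; broadcasting x_id
# against it pairs the indices cyclically:
rows, cols = np.broadcast_arrays(x_id, y_id.reshape(-1, x_id.size))
print(list(zip(rows.ravel(), cols.ravel())))
# [(0, 1), (3, 3), (0, 4), (3, 7)]

One caveat: if the broadcast index pairs contain duplicates, += increments each element only once; np.add.at(A, (x_id, y_id.reshape(-1, x_id.size)), 1) accumulates correctly in that case.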
