I am running into RuntimeWarning: invalid value encountered in divide:
import numpy
a = numpy.random.rand(1000000, 100)
b = numpy.random.rand(1, 100)
dots = numpy.dot(b, a.T) / numpy.dot(b, b.T)
norms = numpy.linalg.norm(a, axis=1)
angles = dots / norms  # basically I am calculating the angle between two vectors
Some of the vectors in a have a norm of 0, so calculating the angles raises the runtime warning.
Is there a one-line, Pythonic way to compute the angles while taking the zero norms into account?
angles = [i/j if j != 0 else -2 for i, j in zip(dots, norms)]  # takes 10.6 seconds
Since all angles lie between -1 and 1 and I only need the 10 largest values, filling in -2 for the zero norms works for me. But this takes around 10.6 seconds, which is insane.
You can ignore the warnings with the np.errstate context manager and later replace the NaNs with whatever you want:
import numpy as np

angle = np.arange(-5., 5.)
norm = np.arange(10.)

with np.errstate(divide='ignore', invalid='ignore'):
    print(np.where(norm != 0., angle / norm, -2))

# or:
with np.errstate(divide='ignore', invalid='ignore'):
    res = angle / norm
res[~np.isfinite(res)] = -2  # 0/0 gives nan, x/0 gives +/-inf
In newer versions of NumPy there is a third option that avoids needing the errstate context manager.
All NumPy ufuncs accept an optional where argument. This acts slightly differently from the np.where function, in that the ufunc is only evaluated where the mask is True. When the mask is False the output is left unchanged, so preallocating it via the out argument lets us supply any default we want.
import numpy as np
angle = np.arange(-5., 5.)
norm = np.arange(10.)
# version 1
with np.errstate(divide='ignore', invalid='ignore'):
    res1 = np.where(norm != 0., angle / norm, -2)

# version 2
with np.errstate(divide='ignore', invalid='ignore'):
    res2 = angle / norm
res2[~np.isfinite(res2)] = -2  # catches inf (x/0) as well as nan (0/0)

# version 3
res3 = -2. * np.ones(angle.shape)
np.divide(angle, norm, out=res3, where=norm != 0)
print(res1)
print(res2)
print(res3)
np.testing.assert_array_almost_equal(res1, res2)
np.testing.assert_array_almost_equal(res1, res3)
You could use angles[~np.isfinite(angles)] = ... to replace nan values with some other value.
For example:
In [103]: angles = dots/norms
In [104]: angles
Out[104]: array([[ nan, nan, nan, ..., nan, nan, nan]])
In [105]: angles[~np.isfinite(angles)] = -2
In [106]: angles
Out[106]: array([[-2., -2., -2., ..., -2., -2., -2.]])
Note that division by zero may result in infs, rather than nans,
In [140]: np.array([1, 2, 3, 4, 0])/np.array([1, 2, 0, -0., 0])
Out[140]: array([ 1., 1., inf, -inf, nan])
so it is better to call np.isfinite rather than np.isnan to identify the places where there was division by zero.
In [141]: np.isfinite(np.array([1, 2, 3, 4, 0])/np.array([1, 2, 0, -0., 0]))
Out[141]: array([ True, True, False, False, False], dtype=bool)
Note that if you only want the top ten values from a NumPy array, using np.argpartition may be quicker than fully sorting the entire array, especially for large arrays:
In [110]: N = 3
In [111]: x = np.array([50, 40, 30, 20, 10, 0, 100, 90, 80, 70, 60])
In [112]: idx = np.argpartition(-x, N)
In [113]: idx
Out[113]: array([ 6, 7, 8, 9, 10, 0, 1, 4, 3, 2, 5])
In [114]: x[idx[:N]]
Out[114]: array([100, 90, 80])
This shows np.argpartition is quicker even for only moderately large arrays:
In [123]: x = np.array([50, 40, 30, 20, 10, 0, 100, 90, 80, 70, 60]*1000)
In [124]: %timeit np.sort(x)[-N:]
1000 loops, best of 3: 233 µs per loop
In [125]: %timeit idx = np.argpartition(-x, N); x[idx[:N]]
10000 loops, best of 3: 53.3 µs per loop
You want to be using np.where. See the documentation.
angles = np.where(norms != 0, dots/norms, -2)
angles will consist of dots/norms wherever norms != 0, and will be -2 otherwise. You will still get the RuntimeWarning, since np.where still evaluates the entire vector dots/norms internally, but you can safely ignore it.
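If the warning itself bothers you, np.errstate can silence it while keeping the one-liner; a minimal sketch, assuming dots and norms are defined as in the question:
import numpy as np

# dots/norms is still evaluated everywhere, but the divide/invalid warnings
# triggered by the zero norms are suppressed inside the context manager
with np.errstate(divide='ignore', invalid='ignore'):
    angles = np.where(norms != 0, dots / norms, -2)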
You can use np.where(condition) to get the indices where norms is nonzero, and divide only at those positions:
nonzero = np.where(norms != 0)
angles = dots[nonzero] / norms[nonzero]
Input
known_array : numpy array; consisting of scalar values only; shape: (m, 1)
test_array : numpy array; consisting of scalar values only; shape: (n, 1)
Output
indices : numpy array; shape: (n, 1); For each value in test_array finds the index of the closest value in known_array
residual : numpy array; shape: (n, 1); For each value in test_array finds the difference from the closest value in known_array
Example
In [17]: known_array = np.array([random.randint(-30,30) for i in range(5)])
In [18]: known_array
Out[18]: array([-24, -18, -13, -30, 29])
In [19]: test_array = np.array([random.randint(-10,10) for i in range(10)])
In [20]: test_array
Out[20]: array([-6, 4, -6, 4, 8, -4, 8, -6, 2, 8])
Sample Implementation (Not fully vectorized)
def find_nearest(known_array, value):
    idx = (np.abs(known_array - value)).argmin()
    diff = known_array[idx] - value
    return [idx, -diff]
In [22]: indices = np.zeros(len(test_array))
In [23]: residual = np.zeros(len(test_array))
In [24]: for i in range(len(test_array)):
....: [indices[i], residual[i]] = find_nearest(known_array, test_array[i])
....:
In [25]: indices
Out[25]: array([ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])
In [26]: residual
Out[26]: array([ 7., 17., 7., 17., 21., 9., 21., 7., 15., 21.])
What is the best way to speed up this task? Cython is an option, but I would always prefer to remove the for loop and keep the code pure NumPy.
NB: the following Stack Overflow questions were consulted:
Python/Numpy - Quickly Find the Index in an Array Closest to Some Value
Find the index of numerically closest value
Find nearest value in numpy array
Finding the nearest value and return the index of array in Python
finding nearest items across two lists/arrays in Python
Updates
I did some small benchmarks comparing the non-vectorized and the vectorized solution (the accepted answer).
In [48]: [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
In [53]: [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
In [54]: indices1==indices2
Out[54]: array([ True, True, True, True, True, True, True, True, True, True], dtype=bool)
In [55]: residual1==residual2
Out[55]: array([ True, True, True, True, True, True, True, True, True, True], dtype=bool)
In [56]: %timeit [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
10000 loops, best of 3: 173 µs per loop
In [57]: %timeit [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
100000 loops, best of 3: 16.8 µs per loop
About a 10-fold speedup!
Clarification
known_array is not sorted.
I ran the benchmarks as given in the answer by @cyborg below.
Case 1: If known_array were sorted
known_array = np.arange(0,1000)
test_array = np.random.randint(0, 100, 10000)
print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print(func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name)))
    assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))
Speedups:
diffs is x0.4 faster than base.
searchsorted1 is x81.3 faster than base.
searchsorted2 is x107.6 faster than base.
First, for large arrays the diffs method is actually slower; it also eats up a lot of RAM, and my system hung when I ran it on actual data.
Case 2: When known_array is not sorted, which represents the actual scenario
known_array = np.random.randint(0,100,100)
test_array = np.random.randint(0, 100, 100)
Speedups:
diffs is x8.9 faster than base.
AssertionError Traceback (most recent call last)
<ipython-input-26-3170078c217a> in <module>()
5 for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
6 print func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name))
----> 7 assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))
AssertionError:
searchsorted1 is x14.8 faster than base.
I must also note that the approach should be memory-efficient; otherwise my 8 GB of RAM is not sufficient. In the base case it easily is.
If the array is large, you should use searchsorted:
import numpy as np
np.random.seed(0)
known_array = np.random.rand(1000)
test_array = np.random.rand(400)
%%time
differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
indices = np.abs(differences).argmin(axis=0)
residual = np.diagonal(differences[indices,])
output:
CPU times: user 11 ms, sys: 15 ms, total: 26 ms
Wall time: 26.4 ms
searchsorted version:
%%time
index_sorted = np.argsort(known_array)
known_array_sorted = known_array[index_sorted]
idx1 = np.searchsorted(known_array_sorted, test_array)
idx1[idx1 == len(known_array_sorted)] = len(known_array_sorted) - 1  # guard the right edge
idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted) - 1)
diff1 = known_array_sorted[idx1] - test_array
diff2 = test_array - known_array_sorted[idx2]
indices2 = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
residual2 = test_array - known_array[indices2]
output:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 311 µs
We can check that the results are the same:
assert np.all(residual == residual2)
assert np.all(indices == indices2)
TL;DR: use numpy.searchsorted().
from timeit import timeit
import numpy as np
known_array = np.arange(-10, 10)
test_array = np.random.randint(-10, 10, 1000)
number = 1000
def base(known_array, test_array):
    def find_nearest(known_array, value):
        idx = (np.abs(known_array - value)).argmin()
        return idx
    indices = np.zeros_like(test_array, dtype=known_array.dtype)
    for i in range(len(test_array)):
        indices[i] = find_nearest(known_array, test_array[i])
    return indices

def diffs(known_array, test_array):
    differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
    indices = np.abs(differences).argmin(axis=0)
    return indices

def searchsorted1(known_array, test_array):
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]
    idx1 = np.searchsorted(known_array_sorted, test_array)
    idx1[idx1 == len(known_array)] = len(known_array) - 1
    idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted) - 1)
    diff1 = known_array_sorted[idx1] - test_array
    diff2 = test_array - known_array_sorted[idx2]
    indices2 = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
    return indices2

def searchsorted2(known_array, test_array):
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]
    known_array_middles = known_array_sorted[1:] - np.diff(known_array_sorted.astype('f'))/2
    idx1 = np.searchsorted(known_array_middles, test_array)
    indices = index_sorted[idx1]
    return indices

def time_f(func_name):
    return timeit(func_name + "(known_array, test_array)",
                  'from __main__ import known_array, test_array, ' + func_name, number=number)

print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print(func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name)))
Output:
Speedups:
diffs is x29.9 faster than base.
searchsorted1 is x37.4 faster than base.
searchsorted2 is x64.3 faster than base.
For example, you can compute all the differences in one go with:
differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
And use argmin and fancy indexing along with np.diagonal to get the desired indices and differences:
indices = np.abs(differences).argmin(axis=0)
residual = np.diagonal(differences[indices,])
So for
>>> known_array = np.array([-24, -18, -13, -30, 29])
>>> test_array = np.array([-6, 4, -6, 4, 8, -4, 8, -6, 2, 8])
one gets
>>> indices
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>> residual
array([ 7, 17, 7, 17, 21, 9, 21, 7, 15, 21])
This is the fastest version I could come up with. It requires the array to be sorted.
The value may be a scalar or a list/array.
def find_nearest(value, array):
    # array must be sorted; value may be a scalar or array-like
    idx = np.searchsorted(array, value, side="left")
    idx_safe = np.minimum(idx, len(array) - 1)  # avoid reading past the end
    cond = (idx > 0) & ((idx == len(array)) |
                        (np.fabs(value - array[idx - 1]) < np.fabs(value - array[idx_safe])))
    return np.where(cond, array[idx - 1], array[idx_safe])
I have an array of size 2 x 2 and I want to change it to 3 x 4.
A = [[1, 2], [2, 3]]
A_new = [[1, 2, 0, 0], [2, 3, 0, 0], [0, 0, 0, 0]]
I tried reshape but it didn't work, and append can only append a row, not a column. I don't want to iterate through each row to add the column.
Is there any vectorized way to do this, like MATLAB's A(:,3:4) = 0; and A(3,:) = 0;, which convert A from 2 x 2 to 3 x 4? Is there a similar way in Python?
In Python, if the input is a numpy array, you can use np.lib.pad to pad zeros around it -
import numpy as np
A = np.array([[1, 2 ],[2, 3]]) # Input
A_new = np.lib.pad(A, ((0,1),(0,2)), 'constant', constant_values=(0)) # Output
Sample run -
In [7]: A # Input: A numpy array
Out[7]:
array([[1, 2],
[2, 3]])
In [8]: np.lib.pad(A, ((0,1),(0,2)), 'constant', constant_values=(0))
Out[8]:
array([[1, 2, 0, 0],
[2, 3, 0, 0],
[0, 0, 0, 0]]) # Zero padded numpy array
If you don't want to do the math of how many zeros to pad, you can let the code do it for you given the output array size -
In [29]: A
Out[29]:
array([[1, 2],
[2, 3]])
In [30]: new_shape = (3,4)
In [31]: shape_diff = np.array(new_shape) - np.array(A.shape)
In [32]: np.lib.pad(A, ((0,shape_diff[0]),(0,shape_diff[1])),
'constant', constant_values=(0))
Out[32]:
array([[1, 2, 0, 0],
[2, 3, 0, 0],
[0, 0, 0, 0]])
Or, you can start off with a zero initialized output array and then put back those input elements from A -
In [38]: A
Out[38]:
array([[1, 2],
[2, 3]])
In [39]: A_new = np.zeros(new_shape,dtype = A.dtype)
In [40]: A_new[0:A.shape[0],0:A.shape[1]] = A
In [41]: A_new
Out[41]:
array([[1, 2, 0, 0],
[2, 3, 0, 0],
[0, 0, 0, 0]])
In MATLAB, you can use padarray -
A_new = padarray(A,[1 2],'post')
Sample run -
>> A
A =
1 2
2 3
>> A_new = padarray(A,[1 2],'post')
A_new =
1 2 0 0
2 3 0 0
0 0 0 0
A pure Python way to achieve this:
row = 3
column = 4
A = [[1, 2],[2, 3]]
A_new = list(map(lambda x: x + [0] * (column - len(x)), A + [[0] * column] * (row - len(A))))  # list() needed in Python 3
then A_new is [[1, 2, 0, 0], [2, 3, 0, 0], [0, 0, 0, 0]].
Good to know:
[x] * n will repeat x n-times
Lists can be concatenated using the + operator
Explanation:
map(function, list) iterates over each item in list, passes it to function, and replaces the item with the return value
A + ([[0] * column] * (row - len(A))): A is extended with the missing "zeroed" rows:
[0] * column repeats 0 column-many times, building one zero row
[[0] * column] * (row - len(A)) repeats that zero row once for each missing row
x + ([0] * (column - len(x))): each existing row x is extended with the missing count of zero columns
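The same construction can also be written as a list comprehension; a sketch of equivalent logic (it also avoids the aliased zero rows that the * repetition shares):
row, column = 3, 4
A = [[1, 2], [2, 3]]

# pad each existing row with zeros, then append freshly-built zero rows
A_new = [x + [0] * (column - len(x)) for x in A] \
        + [[0] * column for _ in range(row - len(A))]
print(A_new)  # [[1, 2, 0, 0], [2, 3, 0, 0], [0, 0, 0, 0]]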
Q: Is there a vectorised way to ...
A: Yes, there is
A = np.ones( (2,2) ) # numpy create/assign 1-s
B = np.zeros( (4,5) ) # numpy create/assign 0-s "padding" mat
B[:A.shape[0],:A.shape[1]] += A[:,:] # numpy vectorised .ADD at a cost of ~270 us
B[:A.shape[0],:A.shape[1]] = A[:,:] # numpy vectorised .STO at a cost of ~180 us
B[:A.shape[0],:A.shape[1]] = A # numpy high-level .STO at a cost of ~450 us
B
Out[4]:
array([[ 1., 1., 0., 0., 0.],
[ 1., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
Q: Is it resource-efficient to "extend" A's data structure in a smart way "behind the curtain"?
A: No, not really. Try bigger, big or huge sizes to feel the resource-allocation and processing costs...
Numpy has a genuine data structure "behind the curtain" that allows a lot of smart tricks like strided (re-)mapping, view-based operations and fast vectorised/broadcast operations; however, changing the memory layout across the strided smart mapping is rather expensive.
For this reason, since 1.7.0 numpy has had a built-in layout modifier, .lib.pad(), that is well aware of the structures "behind the curtain" and optimised to handle them both smartly and fast.
B = np.lib.pad( A,
                ( ( 0, 2 ), ( 0, 3 ) ),   # pad widths matching the (4,5) B above
                'constant',
                constant_values = ( 0, 0 )
                )                          # .pad() at a cost of ~ 270 us
Given a 1D array of indices:
a = array([1, 0, 3])
I want to one-hot encode this as a 2D array:
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
In case you are using Keras, there is a built-in utility for that:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=3)
And it does pretty much the same as @YXD's answer (see the source code).
Here is what I find useful:
def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
Here num_classes stands for the number of classes you have. So if you have a vector with shape (10000,), this function transforms it to (10000, C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
Exactly what you wanted to have I believe.
PS: the source is Sequence models - deeplearning.ai
You can also use the eye function of numpy:
numpy.eye(number of classes)[vector containing the labels]
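Concretely, with the labels from the question:
import numpy as np

labels = np.array([1, 0, 3])
print(np.eye(labels.max() + 1)[labels])
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]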
You can use sklearn.preprocessing.LabelBinarizer:
Example:
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
output:
[[0 1 0 0]
[1 0 0 0]
[0 0 0 1]]
Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
For one-hot encoding with pandas:
one_hot_encode = pandas.get_dummies(array)
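For example, a minimal sketch; note that get_dummies only creates columns for values actually present, so absent classes get no column:
import numpy as np
import pandas

array = np.array([1, 0, 3])
one_hot_encode = pandas.get_dummies(array)
print(one_hot_encode.columns.tolist())       # [0, 1, 3]
print(one_hot_encode.to_numpy().astype(int))
# [[0 1 0]
#  [1 0 0]
#  [0 0 1]]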
You can use the following code to convert to a one-hot vector,
where x is a 1-D class vector containing integers from 0 up to some maximum:
import numpy as np
np.eye(x.max() + 1)[x]
If 0 is not a class, remove the +1.
Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np
def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector) + 1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)  # classes are zero-indexed

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Below is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# for 2-dimensional data with 4 classes
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[tuple(list(np.indices(z.shape[:-1])) + [a])] = 1  # tuple: list-of-arrays indexing is no longer allowed
I am wondering if there is a better solution; I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit, and the numpy-based (indices/arange) and the iterative versions perform about the same.
Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by @YXD (slightly changed, so that they offer the same API, except that the latter works only with 1-D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If using tensorflow, there is one_hot():
import tensorflow as tf
import numpy as np
a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 4), dtype=float32, numpy=
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]], dtype=float32)>
def one_hot(n, class_num, col_wise=True):
    a = np.eye(class_num)[n.reshape(-1)]
    return a.T if col_wise else a

# each sample is a column
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# each sample is a row
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))
I recently ran into a problem of the same kind and found that the posted solutions are only satisfying if your numbers follow a certain pattern. For example, if you want to one-hot encode the following list:
all_good_list = [0,1,2,3,4]
go ahead, the solutions posted above work fine. But what about data like this:
problematic_list = [0,23,12,89,10]
If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns, because all the answers include something like n = np.max(a) + 1. I found a more generic solution that worked for me and want to share it:
import numpy as np
import sklearn.preprocessing

sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1, 2, 44, 3, 2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope someone who ran into the same limitation of the above solutions finds this handy.
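With the values above, b gets one column per unique label (columns follow the sorted unique values), and the encoding is reversible:
print(b)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 0 1]
#  [0 0 1 0]
#  [0 1 0 0]]
print(sklb.inverse_transform(b))  # [ 1  2 44  3  2]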
Here's a dimensionality-independent standalone solution.
It converts any N-dimensional array arr of nonnegative integers into a one-hot (N+1)-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1).
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    one_hot[tuple(flat_grids + [arr.ravel()])] = 1  # tuple: list-of-arrays indexing is no longer allowed
    assert (one_hot.sum(-1) == 1).all()
    assert np.allclose(np.argmax(one_hot, -1), arr)
    return one_hot
This type of encoding is usually done with a numpy array. If you are using a numpy array like this:
a = np.array([1, 0, 3])
then there is a very simple way to convert it to a one-hot encoding:
out = (np.arange(4) == a[:, None]).astype(np.float32)
That's it.
Suppose p is a 2-D ndarray and we want to know which value is the highest in each row, to put a 1 there and 0 everywhere else.
A clean and easy solution:
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
I find the easiest solution combines np.take and np.eye
def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)
It works for x of any shape.
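For example, with a 2-D input the output simply gains a trailing axis of length depth:
import numpy as np

x = np.array([[0, 2],
              [1, 3]])
print(one_hot(x, depth=4).shape)  # (2, 2, 4)
print(one_hot(x, depth=4)[0, 1])  # [0. 0. 1. 0.]  (one-hot for class 2)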
Here is an example function that I wrote to do this based upon the answers above and my own use case:
def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0., 0., 1.],
             [ 1., 0., 0.],
             [ 0., 1., 0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of the 'one-hot' row vector

    Returns:
        np.array of size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)
    one_hot = np.zeros((squeezed_vector.size, one_hot_size))
    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1
    return one_hot
label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)
For completeness, here is a simple function that uses only numpy operators:
def probs_to_onehot(output_probabilities):
    argmax_indices_array = np.argmax(output_probabilities, axis=1)
    # use the number of columns as the number of classes; np.unique would
    # break whenever some class is never the argmax
    onehot_output_array = np.eye(output_probabilities.shape[1])[argmax_indices_array.reshape(-1)]
    return onehot_output_array
It takes as input a probability matrix, e.g.:
[[0.03038822 0.65810204 0.16549407 0.3797123 ]
...
[0.02771272 0.2760752 0.3280924 0.33458805]]
And it will return
[[0 1 0 0] ... [0 0 0 1]]
Use the following code. It works best.
def one_hot_encode(x):
    """
    argument
    - x: a list of labels
    return
    - one hot encoding matrix (number of labels, number of class)
    """
    encoded = np.zeros((len(x), 10))  # assumes 10 classes; adjust as needed

    for idx, val in enumerate(x):
        encoded[idx][val] = 1

    return encoded
Using a Neuraxle pipeline step:
Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
Assert it works
assert (b_pred == b).all()
Link to documentation: neuraxle.steps.numpy.OneHotEncoder
Is it possible to use numpy.nanargmin so that it returns numpy.nan for columns containing only NaNs? Right now it raises a ValueError when that happens. And I can't use numpy.argmin, since that fails when there are only a few NaNs in the column.
The documentation (http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanargmin.html) says the ValueError is raised for all-NaN slices. In that case, I want it to return numpy.nan (just to further mask the "non-data" with NaNs).
This next bit does that, but it is super slow and not really Pythonic:
for i in range(R.shape[0]):
    bestindex = numpy.nanargmin(R[i,:])
    if numpy.isnan(bestindex):
        bestepsilons[i] = numpy.nan
    else:
        bestepsilons[i] = epsilon[bestindex]
This next bit works too, but only if no all-nan columns are involved:
ar = numpy.nanargmin(R, axis=1)
bestepsilons = epsilon[ar]
So ideally I would want this last bit to work with all-NaN columns as well.
>>> def _nanargmin(arr, axis):
...     try:
...         return np.nanargmin(arr, axis)
...     except ValueError:
...         return np.nan
Demo:
>>> a = np.array([[np.nan]*10, np.ones(10)])
>>> _nanargmin(a, axis=1)
nan
>>> _nanargmin(a, axis=0)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Anyway, it's unlikely to be what you want. Not sure what exactly you are after. If all you want is to filter away the nans, then use boolean indexing:
>>> a[~np.isnan(a)]
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
>>> np.argmin(_)
0
EDIT 2: Looks like you're after masked arrays:
>>> a = np.vstack(([np.nan]*10, np.arange(10), np.arange(11, 1, -1)))
>>> a[2, 4] = np.nan
>>> m = np.ma.masked_array(a, np.isnan(a))
>>> np.argmin(m, axis=0)
array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
>>> np.argmin(m, axis=1)
array([0, 0, 9])
Found a solution:
# make everything nan to start with
bestepsilons1 = numpy.zeros(R.shape[0]) + numpy.nan
# find the indices where the entire slice is nan, where nanargmin would raise an error
d0 = numpy.nanmin(R, axis=1)
# where the slice is not all-nan, get the right index with nanargmin,
# and then put the right value in those points
bestepsilons1[~numpy.isnan(d0)] = epsilon[numpy.nanargmin(R[~numpy.isnan(d0),:], axis=1)]
This is basically a workaround: take the nanargmin only where it will not raise an error, since at the remaining places we want the resulting index to be NaN anyway.
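A minimal sketch wrapping the same workaround into a reusable function (the function name is mine, not from the question):
import numpy as np

def nanargmin_rows(R):
    """Row-wise nanargmin that yields nan for all-nan rows instead of raising."""
    out = np.full(R.shape[0], np.nan)
    # np.nanmin warns ('All-NaN slice') on all-nan rows but returns nan,
    # just like the original snippet
    has_data = ~np.isnan(np.nanmin(R, axis=1))
    out[has_data] = np.nanargmin(R[has_data, :], axis=1)
    return out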