Convert array of indices to one-hot encoded array in NumPy - python

Given a 1D array of indices:
a = array([1, 0, 3])
I want to one-hot encode this as a 2D array:
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

In case you are using Keras, there is a built-in utility for that:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=3)
It does pretty much the same as @YXD's answer (see the source code).
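For instance, a minimal sketch (the labels in int_labels here are hypothetical, assuming three classes 0..2):
import numpy as np
from keras.utils.np_utils import to_categorical

int_labels = np.array([0, 2, 1])
categorical_labels = to_categorical(int_labels, num_classes=3)
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]], dtype=float32)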

Here is what I find useful:
def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
Here num_classes stands for the number of classes you have. So if you have a vector with shape (10000,), this function transforms it to (10000, C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
Exactly what you wanted to have I believe.
PS: the source is Sequence models - deeplearning.ai

You can also use the eye function of numpy:
numpy.eye(number_of_classes)[vector_containing_the_labels]
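For the example in the question, that would be:
import numpy as np
np.eye(4)[np.array([1, 0, 3])]
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]])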

You can use sklearn.preprocessing.LabelBinarizer:
Example:
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
output:
[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]
Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
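For example, a minimal sketch of the sparse variant (sparse_output is an existing LabelBinarizer option):
label_binarizer = sklearn.preprocessing.LabelBinarizer(sparse_output=True)
label_binarizer.fit(range(max(a) + 1))
b_sparse = label_binarizer.transform(a)  # a scipy.sparse matrix instead of a dense array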

For one-hot encoding you can also use pandas:
one_hot_encode = pandas.get_dummies(array)
See the example sketch below. Enjoy coding!
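A minimal sketch; note that get_dummies returns a DataFrame with one column per distinct value (0/1 integers in older pandas, booleans in recent versions), so the absent label 2 gets no column here:
import pandas as pd
pd.get_dummies([1, 0, 3])
#    0  1  3
# 0  0  1  0
# 1  1  0  0
# 2  0  0  1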

You can use the following code to convert into a one-hot vector.
Let x be the usual class vector, having a single column with classes 0 to some number:
import numpy as np
np.eye(x.max() + 1)[x]
If 0 is not a class, then remove the +1.

Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector) + 1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)  # strictly greater, since labels are zero-indexed

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Below is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[list(np.indices(z.shape[:-1])) + [a]] = 1
I am wondering if there is a better solution, as I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit, and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.

Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If using TensorFlow, there is one_hot():
import tensorflow as tf
import numpy as np

a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 4), dtype=float32, numpy=
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]], dtype=float32)>

def one_hot(n, class_num, col_wise=True):
    a = np.eye(class_num)[n.reshape(-1)]
    return a.T if col_wise else a

# One column per input label
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# One row per input label
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))

I recently ran into a problem of the same kind and found that the posted solutions are only satisfying if your labels form a contiguous range. For example, if you want to one-hot encode the following list:
all_good_list = [0, 1, 2, 3, 4]
go ahead; the solutions above work fine. But what about this data:
problematic_list = [0, 23, 12, 89, 10]
If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all the answers include something like n = np.max(a) + 1. I found a more generic solution that worked for me and want to share it with you:
import numpy as np
import sklearn.preprocessing

sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1, 2, 44, 3, 2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope this comes in handy for anyone who hit the same restrictions of the solutions above.
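With this, b gets one column per distinct label (here 4 columns for the classes 1, 2, 3 and 44, which LabelBinarizer stores sorted) rather than np.max(a) + 1 columns:
print(b)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 0 1]
#  [0 0 1 0]
#  [0 1 0 0]]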

Here's a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1).
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    # index with a tuple; indexing with a plain list is deprecated in modern NumPy
    one_hot[tuple(flat_grids + [arr.ravel()])] = 1
    assert (one_hot.sum(-1) == 1).all()
    assert np.allclose(np.argmax(one_hot, -1), arr)
    return one_hot
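A quick usage sketch:
arr = np.array([[0, 2], [1, 0]])
one_hot = expand_integer_grid(arr, 3)
print(one_hot.shape)            # (2, 2, 3)
print(np.argmax(one_hot, -1))   # recovers arr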

Such encoding is usually done with a numpy array. If you are using a numpy array like this:
a = np.array([1, 0, 3])
then there is a very simple way to convert it to one-hot encoding:
out = (np.arange(4) == a[:, None]).astype(np.float32)
That's it.

p will be a 2d ndarray.
We want to know which value is the highest in each row, to put a 1 there and 0 everywhere else.
A clean and easy solution:
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
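For example, with a hypothetical score matrix p:
p = np.array([[0.1, 0.7, 0.2],
              [0.8, 0.1, 0.1]])
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
# one_hot -> [[0., 1., 0.],
#             [1., 0., 0.]]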

I find the easiest solution combines np.take and np.eye:
def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)
It works for x of any shape.
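For instance, with a 2-D index array (a minimal sketch):
x = np.array([[1, 0],
              [3, 2]])
one_hot(x, depth=4).shape   # (2, 2, 4): one one-hot row per index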

Here is an example function that I wrote to do this based upon the answers above and my own use case:
def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix.

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0., 0., 1.],
             [ 1., 0., 0.],
             [ 0., 1., 0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int), optional: size of the 'one-hot' row vector

    Returns:
        np.array of size (vector.size, one_hot_size): converted 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)
    one_hot = np.zeros((squeezed_vector.size, one_hot_size))
    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1
    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)

I am adding for completion a simple function, using only numpy operators:
def probs_to_onehot(output_probabilities):
    argmax_indices_array = np.argmax(output_probabilities, axis=1)
    # use the number of columns as the class count; np.unique would give
    # too few columns if some class never wins the argmax
    onehot_output_array = np.eye(output_probabilities.shape[1])[argmax_indices_array.reshape(-1)]
    return onehot_output_array
It takes as input a probability matrix: e.g.:
[[0.03038822 0.65810204 0.16549407 0.3797123 ]
...
[0.02771272 0.2760752 0.3280924 0.33458805]]
And it will return
[[0 1 0 0] ... [0 0 0 1]]

Use the following code. It works best.
def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one-hot encoding matrix (number of labels, number of classes)
    """
    encoded = np.zeros((len(x), 10))  # assumes 10 classes

    for idx, val in enumerate(x):
        encoded[idx][val] = 1

    return encoded
Found it here. P.S.: You don't need to go into the link.

Using a Neuraxle pipeline step:
Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
Assert it works
assert np.array_equal(b_pred, b)
Link to documentation: neuraxle.steps.numpy.OneHotEncoder

Related

Working with 2 arrays with different lengths Numpy Python

Is there a way I could modify the function below so that it can handle arrays of different lengths? The length of the Numbers array is 7 and the length of Formating is 5. The code below checks whether any number in Numbers lies between two consecutive values in Formating and, if so, sums the values that fall in between. So for the first calculation, since no element in Numbers is between 0 and 2, the result will be 0. Link to the code it was derived from: issue.
Code:
Numbers = np.array([3, 4, 5, 7, 8, 10,20])
Formating = np.array([0, 2 , 5, 12, 15])
x = np.sort(Numbers);
l = np.searchsorted(x, Formating, side='left')
mask=(Formating[:-1,None]<=Numbers)&(Numbers<Formating[1:,None])
N=Numbers[:,None].repeat(5,1).T
result= np.ma.masked_array(N,~mask)
result = result.filled(0)
result = np.sum(result, axis=1)
Expected output:
[ 0 7 30 0]
Here's an approach with bincount. Note that you have your x and l mixed up, and recall that you could/should use np.digitize:
# Formating goes here
x = np.sort(Formating);
# digitize
l = np.digitize(Numbers, x)
# output:
np.bincount(l, weights=Numbers)
Out:
array([ 0., 0., 7., 30., 0., 20.])
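To match the expected four-interval output exactly, you can slice off the out-of-range bins (a small sketch; minlength guards against missing trailing bins):
np.bincount(l, weights=Numbers, minlength=len(Formating) + 1)[1:len(Formating)]
# array([ 0.,  7., 30.,  0.])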

How does one convert a data set with two labels +1 and -1 to a hot one vector representation in a vectorized way in Python?

I have a data set in numpy with an x vector and a y vector. The y vector takes only two values, +1 or -1 (or 0 or 1), because it is a binary-valued function. I know I can just loop over the data set, and if I see a +1 map it to 1, and if I see a -1 map it to 0, one by one. However, I was hoping that, given the whole vector y = [N x 1], I could map it in one step to a vector y = [N x 2], since N can be quite large and I wanted to do it as quickly as possible (I also didn't want to keep two copies of the data set).
Is there a vectorized way to do this transformation quickly in python?
For the reference here is the looping code:
def transform_data_to_one_hot(X, Y):
    N = Y.size
    Y_new = np.zeros((N, 2))
    for i in range(N):
        if Y[i] == -1:
            Y_new[i] = np.array([1, 0])
        else:
            Y_new[i] = np.array([0, 1])
    return Y_new
Let's do the parity function using Rademacher variables (i.e. +1/-1 instead of 0 and 1). In this case the parity function is just the product function:
>>> X = np.array([[-1,-1],[-1,1],[1,-1],[1,1]])
>>> X
array([[-1, -1],
       [-1,  1],
       [ 1, -1],
       [ 1,  1]])
>>> Y = np.reshape(np.prod(X,axis=1),[4,1])
>>> Y
array([[ 1],
       [-1],
       [-1],
       [ 1]])
the Y vector, when one-hot encoded, should be:
>>> Y
array([[0, 1],
       [1, 0],
       [1, 0],
       [0, 1]])
Here's one initialization-based approach -
def initialization_based(y):
    out = np.zeros((len(y), 2), dtype=int)
    out[np.arange(out.shape[0]), (y == 1).astype(int)] = 1
    return out
Sample run -
In [244]: y
Out[244]: array([ 1, -1, 1, 1, -1, 1, -1, 1])
In [245]: initialization_based(y)
Out[245]:
array([[0, 1],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 1]])
Other ways to use the initialization method -
def initialization_based_v2(y):
    out = np.zeros((len(y), 2), dtype=int)
    out[np.arange(out.shape[0]), (y + 1) // 2] = 1
    return out

def initialization_based_v3(y):
    yc = y.copy()
    yc[yc == -1] = 0
    out = np.zeros((len(y), 2), dtype=int)
    out[np.arange(out.shape[0]), yc] = 1
    return out
The two new additions differ only in how we set up the column indices. For version 2, they are computed simply as (y+1)//2, while for version 3 as: yc = y.copy(); yc[yc==-1] = 0.
Another one that gets pretty close to @Eric's, but uses a boolean array -
def initialization_based_v4(y):
    out = np.empty((len(y), 2), dtype=int)
    mask = y == 1
    out[:, 0] = mask
    out[:, 1] = ~mask
    return out
Runtime test -
In [320]: y = 2*np.random.randint(0,2,(1000000))-1
In [321]: %timeit sign_to_one_hot(y, dtype=int)
...: %timeit initialization_based(y)
...: %timeit initialization_based_v2(y)
...: %timeit initialization_based_v3(y)
...: %timeit initialization_based_v4(y)
...:
100 loops, best of 3: 3.16 ms per loop
100 loops, best of 3: 8.39 ms per loop
10 loops, best of 3: 27.2 ms per loop
100 loops, best of 3: 13.8 ms per loop
100 loops, best of 3: 3.11 ms per loop
In [322]: from sklearn.preprocessing import OneHotEncoder
In [323]: enc = OneHotEncoder(sparse=False)
In [324]: %timeit enc.fit_transform(np.where(y>=0, y, 0))
10 loops, best of 3: 77.3 ms per loop
A few simple observations for making this efficient:
Preallocate the result, rather than using concatenate
empty is faster than zeros if you're just going to overwrite those zeros
Use the out argument, to avoid temporaries
def sign_to_one_hot(x, dtype=np.float64):
    out = np.empty(x.shape + (2,), dtype=dtype)
    plus_one = out[..., 0]
    minus_one = out[..., 1]
    np.equal(x, 1, out=plus_one)
    np.subtract(1, plus_one, out=minus_one)
    return out
Choose your dtype carefully - casting because you chose the wrong one will incur a copy.
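A quick sanity check of this function (note that it puts the +1 indicator in column 0 and the -1 indicator in column 1, the reverse of the ordering shown in the question):
y = np.array([1, -1, -1, 1])
sign_to_one_hot(y, dtype=int)
# array([[1, 0],
#        [0, 1],
#        [0, 1],
#        [1, 0]])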
You can also use the sklearn.preprocessing.OneHotEncoder method.
NOTE: it doesn't accept negative numbers, so we have to replace them.
Demo:
from sklearn.preprocessing import OneHotEncoder

# by default it generates a sparse matrix - that can be very useful for huge data sets
enc = OneHotEncoder(sparse=False)
rslt = enc.fit_transform(np.where(Y >= 0, Y, 0))
Result:
In [140]: rslt
Out[140]:
array([[ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])
Source array:
In [141]: Y
Out[141]:
array([[ 1],
       [-1],
       [-1],
       [ 1]])
Pandas solution:
In [148]: pd.get_dummies(Y.ravel())
Out[148]:
   -1  1
0   0  1
1   1  0
2   1  0
3   0  1

Apply function to result of numpy.where

Assume I have an array:
a = [1,2,3,4,0,-1,-2,3,4]
Using np.where(a < 0) returns the indices of the elements of a that are < 0.
How can I apply a function to those elements of a?
If you convert your list into a numpy array, it gets easier: you can index arrays with boolean arrays:
In [2]: a = np.asarray([1,2,3,4,0,-1,-2,3,4])
In [3]: a[a < 0]
Out[3]: array([-1, -2])
In [4]: np.sin(a[a < 0])
Out[4]: array([-0.84147098, -0.90929743])
In [5]: a[a < 0]**2
Out[5]: array([1, 4])
The key here is that a < 0 is an array itself:
In [6]: a < 0
Out[6]: array([False, False, False, False, False, True, True, False, False], dtype=bool)
In general I would recommend following @ev-br's approach using boolean masks. But it's also possible with np.where if you use all three arguments. The second argument specifies the value chosen at the indices where the condition is True and the third where it's False:
>>> import numpy as np
>>> a = np.array([1,2,3,4,0,-1,-2,3,4])
>>> np.where(a < 0, 1000, a) # replace values below 0 with 1000
array([ 1, 2, 3, 4, 0, 1000, 1000, 3, 4])
If you want to apply a numpy-ufunc (for example np.sin) just replace the 1000:
>>> np.where(a < 0, np.sin(a), a)
array([ 1., 2., 3., 4., 0., -0.84147098, -0.90929743, 3., 4.])
Alternatively (this requires that the array already has the correct dtype to store the result of the function), you could use the indices returned by np.where to apply the result on:
>>> a = np.array([1,2,3,4,0,-1,-2,3,4], dtype=float) # must be floating point now
>>> idx = np.where(a < 0)
>>> a[idx] = np.sin(a[idx])
>>> a
array([ 1., 2., 3., 4., 0., -0.84147098, -0.90929743, 3., 4.])
Something like this ought to work:
square = lambda x: x**2
applied_func_array = [square(x) for x in a if x < 0]
Or with numpy.vectorize (assuming less_than_zero holds the selected elements, e.g. a column of the values below zero):
vec_square = np.vectorize(square)
vec_square(less_than_zero)
Which yields:
Out[220]:
array([[1],
       [4]])
I don't work with numpy a lot, but did you mean something along these lines?
my_arr = np.array(a)

def my_func(my_array):
    for elem in np.where(my_array < 0):
        my_array[elem] = my_array[elem] + 1 * 3
    return my_array

np.apply_along_axis(my_func, 0, my_arr)
array([1, 2, 3, 4, 0, 2, 1, 3, 4])

Python how to find unique entries and get the minimum values from a matching array

I have a numpy array, indices:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
I have another array of the same length called distances:
array([  0.98713981,   1.04705992,   1.42340327,  74.0139111 ,
        74.4285216 ,  74.84623217])
All of the rows in indices have a match in the distances array. The problem is, there are duplicates in the indices array, and they have different values in the corresponding distances array. I would like to get the minimum distance for all triplets of indices, and discard the others. Therefore, with the inputs above, I want the output:
indicesOUT =
array([[ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
distancesOUT =
array([  0.98713981,   1.42340327,  74.84623217])
My current strategy is as follows:
import numpy as np

indicesOUT = []
distancesOUT = []

for i in range(6):
    for j in range(6):
        for k in range(6):
            if len([s for s in indicesOUT if [i, j, k] == s]) == 0:
                current = np.array([i, j, k])
                ind = np.where((indices == current).all(-1) == True)[0]
                currentDistances = distances[ind]
                dist = np.amin(currentDistances)
                indicesOUT.append([i, j, k])
                distancesOUT.append(dist)
The problem is, the actual arrays have about 4 million elements each, so this approach is way too slow. What is the most efficient way of doing this?
This is essentially a grouping operation, and NumPy is not well-optimized for it. Fortunately, the Pandas package has some very fast tools that can be adapted to this exact problem.
With your data above, we can do this:
import pandas as pd

def drop_duplicates(indices, distances):
    data = pd.Series(distances)
    grouped = data.groupby(list(indices.T)).min().reset_index()
    return grouped.values[:, :3], grouped.values[:, 3]
And the output for your data is
array([[  0.,   0.,   0.],
       [  2.,   0.,   2.],
       [ 95.,  71.,  95.]]),
array([  0.98713981,   1.42340327,  74.84623217])
My benchmark shows that for 4,000,000 elements, this should run in about a second:
indices = np.random.randint(0, 100, size=(4000000, 3))
distances = np.random.random(4000000)
%timeit drop_duplicates(indices, distances)
# 1 loops, best of 3: 1.15 s per loop
As written above, the input order of the indices will not necessarily be preserved; keeping the original order would require a bit more thought.
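If first-occurrence order matters, one possible sketch is to pass sort=False to groupby (a standard pandas option that keeps groups in the order their keys first appear), at the cost of unsorted keys:
def drop_duplicates_keep_order(indices, distances):
    data = pd.Series(distances)
    grouped = data.groupby(list(indices.T), sort=False).min().reset_index()
    return grouped.values[:, :3], grouped.values[:, 3]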

python - increase array size and initialize new elements to zero

I have an array of a size 2 x 2 and I want to change the size to 3 x 4.
A = [[1 2 ],[2 3]]
A_new = [[1 2 0 0],[2 3 0 0],[0 0 0 0]]
I tried reshape, but it didn't work, and append can only append a row, not a column. I don't want to iterate through each row to add the column.
Is there any vectorized way to do this, like in MATLAB: A(:,3:4) = 0; and A(3,:) = 0; which convert A from 2 x 2 to 3 x 4? Is there a similar way in Python?
In Python, if the input is a numpy array, you can use np.lib.pad to pad zeros around it -
import numpy as np
A = np.array([[1, 2 ],[2, 3]]) # Input
A_new = np.lib.pad(A, ((0,1),(0,2)), 'constant', constant_values=(0)) # Output
Sample run -
In [7]: A # Input: A numpy array
Out[7]:
array([[1, 2],
       [2, 3]])
In [8]: np.lib.pad(A, ((0,1),(0,2)), 'constant', constant_values=(0))
Out[8]:
array([[1, 2, 0, 0],
       [2, 3, 0, 0],
       [0, 0, 0, 0]])  # zero-padded numpy array
If you don't want to do the math of how many zeros to pad, you can let the code do it for you given the output array size -
In [29]: A
Out[29]:
array([[1, 2],
       [2, 3]])
In [30]: new_shape = (3, 4)
In [31]: shape_diff = np.array(new_shape) - np.array(A.shape)
In [32]: np.lib.pad(A, ((0, shape_diff[0]), (0, shape_diff[1])),
                    'constant', constant_values=(0))
Out[32]:
array([[1, 2, 0, 0],
       [2, 3, 0, 0],
       [0, 0, 0, 0]])
Or, you can start off with a zero initialized output array and then put back those input elements from A -
In [38]: A
Out[38]:
array([[1, 2],
       [2, 3]])
In [39]: A_new = np.zeros(new_shape, dtype=A.dtype)
In [40]: A_new[0:A.shape[0], 0:A.shape[1]] = A
In [41]: A_new
Out[41]:
array([[1, 2, 0, 0],
       [2, 3, 0, 0],
       [0, 0, 0, 0]])
In MATLAB, you can use padarray -
A_new = padarray(A,[1 2],'post')
Sample run -
>> A
A =
     1     2
     2     3
>> A_new = padarray(A,[1 2],'post')
A_new =
     1     2     0     0
     2     3     0     0
     0     0     0     0
A pure Python way to achieve this:
row = 3
column = 4
A = [[1, 2], [2, 3]]
A_new = list(map(lambda x: x + ([0] * (column - len(x))), A + ([[0] * column] * (row - len(A)))))
then A_new is [[1, 2, 0, 0], [2, 3, 0, 0], [0, 0, 0, 0]]. (In Python 3, map returns an iterator, hence the list(...) wrapper.)
Good to know:
[x] * n will repeat x n times
Lists can be concatenated using the + operator
Explanation:
map(function, list) iterates over each item in list, passes it to function, and replaces the item with the return value
A + ([[0] * column] * (row - len(A))): A is extended with the remaining "zeroed" lists
[0] * column repeats the item in [0] by the column count
[[0] * column] * (row - len(A)) repeats that row by the remaining row count
([0] * (column - len(x))): for each row item x, appends a list with the remaining count of columns
Q: Is there a vectorised way to ...
A: Yes, there is
A = np.ones((2, 2))    # numpy create/assign 1-s
B = np.zeros((4, 5))   # numpy create/assign 0-s "padding" mat

B[:A.shape[0], :A.shape[1]] += A[:, :]  # numpy vectorised .ADD at a cost of ~270 us
B[:A.shape[0], :A.shape[1]] = A[:, :]   # numpy vectorised .STO at a cost of ~180 us
B[:A.shape[0], :A.shape[1]] = A         # numpy high-level .STO at a cost of ~450 us

B
Out[4]:
array([[ 1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
Q: Is it resource-efficient to "extend" A's data structure in a smart way "behind the curtain"?
A: No, unfortunately not much. Try bigger, big or huge sizes to feel the resource-allocation/processing costs...
Numpy has a genuine data structure "behind the curtain" that allows a lot of smart tricks like strided (re-)mapping, view-based operations and fast vectorised/broadcast operations; however, changing the memory layout "across the strided smart-mapping" is rather expensive.
For this reason, numpy has included since 1.7.0 a built-in layout/mapper-modifier, .lib.pad(), that is well aware of & optimised for handling the "behind-the-curtain" structures both smartly & quickly.
B = np.lib.pad(A,
               ((0, 3), (0, 2)),
               'constant',
               constant_values=(0, 0))  # .pad() at a cost of ~270 us
