Faster way to create cost matrix - python

I am using the Hungarian algorithm in SciPy, which takes as input the cost matrix of two sets of points. Building that matrix just means each element of array x is passed to a function f together with each element of array y. I currently implement this with a nested for loop in Python. Here is a basic example of what I do:
import numpy as np

def f(a, b):
    return a * b

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

cost_mat = np.zeros((x.shape[0], y.shape[0]))
for i in range(x.shape[0]):
    for j in range(y.shape[0]):
        cost_mat[i, j] = f(x[i], y[j])

print(cost_mat)
>> out:
[[1., 2., 3.]
 [2., 4., 6.]
 [3., 6., 9.]]
Is there a faster way to do this? For example, vectorizing it somehow?

Something like this works:
x = np.array([1, 2, 3], ndmin=2)
y = np.array([1, 2, 3], ndmin=2)
cost_mat = x * y.T
cost_mat is
array([[1, 2, 3],
       [2, 4, 6],
       [3, 6, 9]])
Let's time both solutions with bigger arrays:
x = np.random.rand(10000, 1)
y = np.random.rand(10000, 1)

def f(a, b):
    return a * b

# Start timing here
cost_mat1 = np.zeros((x.shape[0], y.shape[0]))
for i in range(x.shape[0]):
    for j in range(y.shape[0]):
        cost_mat1[i, j] = f(x[i], y[j])
# Wall time: 2min 13s
Using the transpose is way faster:
# Start timing here
cost_mat2 = x * y.T
# Wall time: 395 ms
And we can check that
np.array_equal(cost_mat1, cost_mat2)
returns True.
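More generally, if f is built only from NumPy operations (as a * b is), you can often vectorize the whole cost matrix with broadcasting instead of spelling out the outer product by hand. A minimal sketch, assuming f contains no Python-level branching on scalars:
import numpy as np

def f(a, b):
    return a * b  # any expression built from NumPy ufuncs broadcasts elementwise

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

# x[:, None] has shape (3, 1), y[None, :] has shape (1, 3);
# broadcasting evaluates f for every (i, j) pair at once
cost_mat = f(x[:, None], y[None, :])
print(cost_mat)
# [[1 2 3]
#  [2 4 6]
#  [3 6 9]]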

Related

How to vectorize performing pairwise sums given two numpy arrays?

I have two numpy arrays which look like this:
x = [v1, v2, v3, ..., vm]
y = [w1, w2, w3, ..., wn]
where vi, wj are numpy arrays of length 3.
I want to perform a pairwise summation of v's and w's and get a final array
z = [v1+w1, v1+w2,...,v1+wn,v2+w1, ..., vi+wj, ..., vm+wn]
A simple way of obtaining z is as follows:
z = np.zeros((m*n, 3))
for i in range(m):
    for j in range(n):
        z[n*i+j] = x[i] + y[j]
This computation is not feasible if m and n are very large.
I know scipy.spatial has methods to enumerate pairwise distances using distance_matrix in a vectorized fashion.
I want to ask if there is a vectorized version of performing such pairwise additions for numpy arrays?
You can take advantage of broadcasting by creating a 2D array, so that you easily get z[i, j] = x[i] + y[j]:
x = np.reshape(x, (-1, 1)) # shape (N, 1)
y = np.reshape(y, (-1, 1)) # shape (N, 1)
z = x + y.T # shape (N, N)
If you want to have z as a 1D array you can do z.reshape(-1).
If x is an (m, 3) matrix and y is an (n, 3) matrix:
x.shape # (m,3)
y.shape # (n,3)
x1 = x.reshape(m,1,3)
y1 = y.reshape(1,n,3)
z = x1 + y1 # shape (m,n,3)
z1 = z.reshape(-1,3) # (m*n, 3)
equivalently
z = x[:,None]+y
test:
In [263]: x=np.arange(12).reshape(4,3); y=np.arange(6).reshape(2,3)
In [264]: z = x[:,None]+y
In [265]: z.shape
Out[265]: (4, 2, 3)
In [266]: z
Out[266]:
array([[[ 0,  2,  4],
        [ 3,  5,  7]],

       [[ 3,  5,  7],
        [ 6,  8, 10]],

       [[ 6,  8, 10],
        [ 9, 11, 13]],

       [[ 9, 11, 13],
        [12, 14, 16]]])
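A minimal end-to-end sketch for the (m, 3) / (n, 3) case from the question, flattened to (m*n, 3) and checked against the nested loop:
import numpy as np

m, n = 4, 2
x = np.arange(12).reshape(m, 3)
y = np.arange(6).reshape(n, 3)

# broadcasting: (m, 1, 3) + (1, n, 3) -> (m, n, 3), then flatten to (m*n, 3)
z_fast = (x[:, None] + y).reshape(-1, 3)

# reference nested loop from the question
z_loop = np.zeros((m * n, 3))
for i in range(m):
    for j in range(n):
        z_loop[n * i + j] = x[i] + y[j]

assert np.array_equal(z_fast, z_loop)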

combination of numpy rows [duplicate]

I would like to implement itertools.combinations for numpy. Based on this discussion, I have a function that works for 1D input:
def combs(a, r):
    """
    Return successive r-length combinations of elements in the array a.
    Should produce the same output as array(list(combinations(a, r))), but
    faster.
    """
    a = asarray(a)
    dt = dtype([('', a.dtype)]*r)
    b = fromiter(combinations(a, r), dt)
    return b.view(a.dtype).reshape(-1, r)
and the output makes sense:
In [1]: list(combinations([1,2,3], 2))
Out[1]: [(1, 2), (1, 3), (2, 3)]
In [2]: array(list(combinations([1,2,3], 2)))
Out[2]:
array([[1, 2],
[1, 3],
[2, 3]])
In [3]: combs([1,2,3], 2)
Out[3]:
array([[1, 2],
[1, 3],
[2, 3]])
However, it would be best if I could expand it to N-D inputs, where additional dimensions simply allow you to speedily do multiple calls at once. So, conceptually, if combs([1, 2, 3], 2) produces [1, 2], [1, 3], [2, 3], and combs([4, 5, 6], 2) produces [4, 5], [4, 6], [5, 6], then combs((1,2,3) and (4,5,6), 2) should produce [1, 2], [1, 3], [2, 3] and [4, 5], [4, 6], [5, 6] where "and" just represents parallel rows or columns (whichever makes sense). (and likewise for additional dimensions)
I'm not sure:

1. How to make the dimensions work in a logical way that's consistent with the way other functions work (like how some numpy functions have an axis= parameter, and a default of axis 0. So probably axis 0 should be the one I am combining along, and all other axes just represent parallel calculations?)
2. How to get the above code to work with N-D (right now I get ValueError: setting an array element with a sequence.)
3. Is there a better way to do dt = dtype([('', a.dtype)]*r)?
You can use itertools.combinations() to create the index array, and then use NumPy's fancy indexing:
import numpy as np
from itertools import combinations, chain
from scipy.special import comb
def comb_index(n, k):
    count = comb(n, k, exact=True)
    index = np.fromiter(chain.from_iterable(combinations(range(n), k)),
                        int, count=count*k)
    return index.reshape(-1, k)
data = np.array([[1,2,3,4,5],[10,11,12,13,14]])
idx = comb_index(5, 3)
print(data[:, idx])
output:
[[[ 1  2  3]
  [ 1  2  4]
  [ 1  2  5]
  [ 1  3  4]
  [ 1  3  5]
  [ 1  4  5]
  [ 2  3  4]
  [ 2  3  5]
  [ 2  4  5]
  [ 3  4  5]]

 [[10 11 12]
  [10 11 13]
  [10 11 14]
  [10 12 13]
  [10 12 14]
  [10 13 14]
  [11 12 13]
  [11 12 14]
  [11 13 14]
  [12 13 14]]]
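As a quick sanity check (reusing data and idx from above), the fancy-indexed result matches building each row's combinations directly with itertools:
from itertools import combinations
import numpy as np

expected = np.array([list(combinations(row, 3)) for row in data])
assert np.array_equal(data[:, idx], expected)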
Case k = 2: np.triu_indices
I've tested the case k = 2 using lots of variations of the above-mentioned functions with perfplot. The winner is, without a doubt, np.triu_indices, and I now see that using the np.dtype([('', np.intp)] * 2) structured dtype can be a huge boost even for exotic data types such as igraph.EdgeList.
import numpy as np
import matplotlib.pyplot as plt
import perfplot
from itertools import combinations, chain
from scipy.special import comb
import igraph as ig   # graph library built on C
import networkx as nx # graph library, pure Python

def _combs(n):
    return np.array(list(combinations(range(n), 2)))

def _combs_fromiter(n):  ## Jaime
    indices = np.arange(n)
    dt = np.dtype([('', np.intp)]*2)
    indices = np.fromiter(combinations(indices, 2), dt)
    indices = indices.view(np.intp).reshape(-1, 2)
    return indices

def _combs_fromiterplus(n):
    dt = np.dtype([('', np.intp)]*2)
    indices = np.fromiter(combinations(range(n), 2), dt)
    indices = indices.view(np.intp).reshape(-1, 2)
    return indices

def _numpy(n):  ## endolith
    return np.transpose(np.triu_indices(n, 1))

def _igraph(n):
    return np.array(ig.Graph(n).complementer(False).get_edgelist())

def _igraph_fromiter(n):
    dt = np.dtype([('', np.intp)]*2)
    indices = np.fromiter(ig.Graph(n).complementer(False).get_edgelist(), dt)
    indices = indices.view(np.intp).reshape(-1, 2)
    return indices

def _nx(n):
    G = nx.Graph()
    G.add_nodes_from(range(n))
    return np.array(list(nx.complement(G).edges))

def _nx_fromiter(n):
    G = nx.Graph()
    G.add_nodes_from(range(n))
    dt = np.dtype([('', np.intp)]*2)
    indices = np.fromiter(nx.complement(G).edges, dt)
    indices = indices.view(np.intp).reshape(-1, 2)
    return indices

def _comb_index(n):  ## HYRY
    count = comb(n, 2, exact=True)
    index = np.fromiter(chain.from_iterable(combinations(range(n), 2)),
                        int, count=count*2)
    return index.reshape(-1, 2)
fig = plt.figure(figsize=(15, 10))
plt.grid(True, which="both")
out = perfplot.bench(
    setup=lambda x: x,
    kernels=[_numpy, _combs, _combs_fromiter, _combs_fromiterplus,
             _comb_index, _igraph, _igraph_fromiter, _nx, _nx_fromiter],
    n_range=[2 ** k for k in range(12)],
    xlabel='combinations(n, 2)',
    title='testing combinations',
    show_progress=False,
    equality_check=False)
out.show()
Wondering why np.triu_indices can't be extended to more dimensions?
Case 2 ≤ k ≤ 4: triu_indices (implemented here) = up to 2x speedup
np.triu_indices could actually be a winner for the case k = 3 and even k = 4 if we implement a generalised method instead. A current version of that method is equivalent to:
def triu_indices(n, k):
    x = np.less.outer(np.arange(n), np.arange(-k+1, n-k+1))
    return np.nonzero(x)
It constructs a matrix representation of the relation x < y for the two sequences 0, 1, ..., n-1 and finds the locations of the cells where it is nonzero. For the 3D case we need to add an extra dimension and intersect the relations x < y and y < z. For higher dimensions the procedure is the same, but it incurs a huge memory overhead since n^k binary cells are needed and only C(n, k) of them hold True values. Memory usage and run time therefore grow rapidly, so this algorithm outperforms itertools.combinations only for small values of k. It is actually best used for the cases k = 2 and k = 3:
def C(n, k):  # huge memory overload...
    if k == 0:
        return np.array([])
    if k == 1:
        return np.arange(1, n+1)
    elif k == 2:
        return np.less.outer(np.arange(n), np.arange(n))
    else:
        x = C(n, k-1)
        X = np.repeat(x[None, :, :], len(x), axis=0)
        Y = np.repeat(x[:, :, None], len(x), axis=2)
        return X & Y

def C_indices(n, k):
    return np.transpose(np.nonzero(C(n, k)))
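A small sanity check of C_indices (assuming the definitions above) against itertools:
from itertools import combinations
import numpy as np

print(C_indices(4, 2))
# [[0 1]
#  [0 2]
#  [0 3]
#  [1 2]
#  [1 3]
#  [2 3]]
assert np.array_equal(C_indices(4, 3),
                      np.array(list(combinations(range(4), 3))))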
Let's check it with perfplot:
import matplotlib.pyplot as plt
import numpy as np
import perfplot
from itertools import chain, combinations
from scipy.special import comb

def C(n, k):  # huge memory overload...
    if k == 0:
        return np.array([])
    if k == 1:
        return np.arange(1, n + 1)
    elif k == 2:
        return np.less.outer(np.arange(n), np.arange(n))
    else:
        x = C(n, k - 1)
        X = np.repeat(x[None, :, :], len(x), axis=0)
        Y = np.repeat(x[:, :, None], len(x), axis=2)
        return X & Y

def C_indices(data):
    n, k = data
    return np.transpose(np.nonzero(C(n, k)))

def comb_index(data):
    n, k = data
    count = comb(n, k, exact=True)
    index = np.fromiter(chain.from_iterable(combinations(range(n), k)),
                        int, count=count * k)
    return index.reshape(-1, k)

def build_args(k):
    return {'setup': lambda x: (x, k),
            'kernels': [comb_index, C_indices],
            'n_range': [2 ** x for x in range(2, {2: 10, 3: 10, 4: 7, 5: 6}[k])],
            'xlabel': 'N',
            'title': f'test of case C(N,{k})',
            'show_progress': True,
            'equality_check': lambda x, y: np.array_equal(x, y)}

outs = [perfplot.bench(**build_args(n)) for n in (2, 3, 4, 5)]
fig = plt.figure(figsize=(20, 20))
for i in range(len(outs)):
    ax = fig.add_subplot(2, 2, i + 1)
    ax.grid(True, which="both")
    outs[i].plot()
plt.show()
So the best performance boost is achieved for k = 2 (equivalent to np.triu_indices), and for k = 3 it's almost twice as fast.
Case k > 3: numpy_combinations (implemented here) = up to 2.5x speedup
Following this question (thanks @Divakar) I managed to find a way to calculate the values of a specific column based on the previous column and Pascal's triangle. It's not yet optimized as much as it could be, but the results are really promising. Here we go:
from scipy.linalg import pascal

def stretch(a, k):
    l = a.sum() + len(a)*(-k)
    out = np.full(l, -1, dtype=int)
    out[0] = a[0] - 1
    idx = (a - k).cumsum()[:-1]
    out[idx] = a[1:] - 1 - k
    return out.cumsum()

def numpy_combinations(n, k):
    # n, k = data  # use this unpacking instead in the benchmark version below
    x = np.array([n])
    P = pascal(n).astype(int)
    C = []
    for b in range(k-1, -1, -1):
        x = stretch(x, b)
        r = P[b][x - b]
        C.append(np.repeat(x, r))
    return n - 1 - np.array(C).T
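As a quick sanity check (small n and k, assuming the definitions above), the result should reproduce itertools.combinations:
from itertools import combinations
import numpy as np

ref = np.array(list(combinations(range(5), 3)))
assert np.array_equal(numpy_combinations(5, 3), ref)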
And the benchmark results are:
# script is the same as in previous example except this part
def build_args(k):
    return {'setup': lambda x: (k, x),
            'kernels': [comb_index, numpy_combinations],
            'n_range': [x for x in range(1, k)],
            'xlabel': 'N',
            'title': f'test of case C({k}, k)',
            'show_progress': True,
            'equality_check': False}

outs = [perfplot.bench(**build_args(n)) for n in (12, 15, 17, 23, 25, 28)]
fig = plt.figure(figsize=(20, 20))
for i in range(len(outs)):
    ax = fig.add_subplot(2, 3, i + 1)
    ax.grid(True, which="both")
    outs[i].plot()
plt.show()
Although it still can't compete with itertools.combinations for n < 15, it is the new winner in the other cases. Last but not least, numpy demonstrates its power when the number of combinations gets really big: it was able to survive while processing C(28, 14) combinations, which is around 40,000,000 rows of length 14.
When r = k = 2, you can also use numpy.triu_indices(n, 1), which indexes the upper triangle of a matrix.
idx = comb_index(5, 2)
from HYRY's answer is equivalent to
idx = np.transpose(np.triu_indices(5, 1))
but built-in, and a few times faster for N above ~20:
timeit comb_index(1000, 2)
32.3 ms ± 443 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit np.transpose(np.triu_indices(1000, 1))
10.2 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
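A quick check (using comb_index from HYRY's answer above) that the two produce identical index arrays:
import numpy as np

assert np.array_equal(comb_index(1000, 2),
                      np.transpose(np.triu_indices(1000, 1)))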
Not sure how it will work out performance-wise, but you can do the combinations on an index array, then extract the actual array slices with np.take:
from itertools import combinations
import numpy as np

def combs_nd(a, r, axis=0):
    a = np.asarray(a)
    if axis < 0:
        axis += a.ndim
    indices = np.arange(a.shape[axis])
    dt = np.dtype([('', np.intp)]*r)
    indices = np.fromiter(combinations(indices, r), dt)
    indices = indices.view(np.intp).reshape(-1, r)
    return np.take(a, indices, axis=axis)
>>> combs_nd([1,2,3], 2)
array([[1, 2],
       [1, 3],
       [2, 3]])
>>> combs_nd([[1,2,3],[4,5,6]], 2, axis=1)
array([[[1, 2],
        [1, 3],
        [2, 3]],

       [[4, 5],
        [4, 6],
        [5, 6]]])

How can I manipulate a numpy array without nested loops?

If I have a MxN numpy array denoted arr, I wish to index over all elements and adjust the values like so
for m in range(arr.shape[0]):
    for n in range(arr.shape[1]):
        arr[m, n] += x**2 * np.cos(2*np.pi*m) * np.sin(2*np.pi*n)
where x is a random float.
Is there a way to broadcast this over the entire array without needing to loop, thus speeding up the run time?
You are just adding zeros, because sin(2*pi*k) = 0 for integer k.
However, if you want to vectorize this, the function np.meshgrid could help you.
Check the following example, where I removed the 2 pi factors from the trigonometric functions so that something non-zero is added.
x = 2
arr = np.arange(12, dtype=float).reshape(4, 3)
n, m = np.meshgrid(np.arange(arr.shape[1]), np.arange(arr.shape[0]), sparse=True)
arr += x**2 * np.cos(m) * np.sin(n)
arr
Edit: use the sparse argument to reduce memory consumption.
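For reference, a quick check that this broadcasted version reproduces the nested loop from the question (with the 2*pi factors removed, as above):
import numpy as np

x = 2.0
arr = np.arange(12, dtype=float).reshape(4, 3)

# reference: the original nested loop, also without the 2*pi factors
ref = arr.copy()
for i in range(ref.shape[0]):
    for j in range(ref.shape[1]):
        ref[i, j] += x**2 * np.cos(i) * np.sin(j)

n, m = np.meshgrid(np.arange(arr.shape[1]), np.arange(arr.shape[0]), sparse=True)
vec = arr + x**2 * np.cos(m) * np.sin(n)

assert np.allclose(vec, ref)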
You can use nested list comprehensions to build the two-dimensional array:
import numpy as np
from random import random
x = random()
n, m = 10,20
arr = [[x**2 * np.cos(2*np.pi*j) * np.sin(2*np.pi*i) for j in range(m)] for i in range(n)]
In [156]: arr = np.ones((2, 3))
Replace the range with arange:
In [157]: m, n = np.arange(arr.shape[0]), np.arange(arr.shape[1])
And change the first array to (2,1) shape. A (2,1) array broadcasts with a (3,) to produce a (2,3) result.
In [158]: A = 0.23**2 * np.cos(m[:, None]) * np.sin(n)
In [159]: A
Out[159]:
array([[0. , 0.04451382, 0.04810183],
[0. , 0.02405092, 0.02598953]])
In [160]: arr + A
Out[160]:
array([[1. , 1.04451382, 1.04810183],
[1. , 1.02405092, 1.02598953]])
The meshgrid suggested in the accepted answer does the same thing:
In [161]: np.meshgrid(m, n, sparse=True, indexing="ij")
Out[161]:
[array([[0],
[1]]),
array([[0, 1, 2]])]
This broadcasting may be clearer with:
In [162]: m, n
Out[162]: (array([0, 1]), array([0, 1, 2]))
In [163]: m[:, None] * 10 + n
Out[163]:
array([[ 0, 1, 2],
[10, 11, 12]])

How to cut dataset into X and Y parameters

I have a time-series dataset that consists of evenly spaced timesteps and another parameter (say volume). I want to cut/split the dataset into X and Y parameters to train my ML model. I am looking for a logic/algorithm in Python that will be useful in tackling the simplified version below.
I have an array of evenly spaced timesteps (1 timestep = 1 day) ranging from 1 to 100:
array = [1,2,3,...,100]
I have also come up with the following parameters: N and K. N is used for the X parameter and K for the Y parameter.
If N = 5, then on the first iteration X = [1,2,3,4,5], on the second iteration X = [2,3,4,5,6], on the third iteration X = [3,4,5,6,7], and so forth. So the length of X is equal to N. If N = 10, then on the first iteration X = [1,2,3,...,10], on the second iteration X = [2,3,4,...,11], and so forth.
The K parameter represents the length of a geometric sequence. For example: K = 5 means k = (1,2,4,8,16), K = 3 means k = (1,2,4), and K = 7 means k = (1,2,4,8,16,32,64). The Y parameter takes the last element of the X array at each iteration and adds to it the values from the geometric sequence. So the length of Y is equal to K: if K = 5 -> len(Y) = 5, if K = 3 -> len(Y) = 3, and so forth.
Example 1: N= 5, K=5:
First step:
X = [1,2,3,4,5] and Y = [6,7,9,13, 21]
because K = (1,2,4,8, 16) and Y = [5+1, 5+2, 5+4, 5+8, 5+16] with 5 being the last element of an array X
Second Step:
X = [2,3,4,5, 6] and Y = [7,8,10,14, 22]
because K = (1,2,4,8, 16) and Y = [6+1, 6+2, 6+4, 6+8, 6+16] with 6 being the last element of an array X
Third Step:
X = [3,4,5, 6, 7] and Y = [8,9,11,15, 23]
because K = (1,2,4,8, 16) and Y = [7+1, 7+2, 7+4, 7+8, 7+16] with 7 being the last element of an array X
**Other steps**
Last Step:
X = [?,?,?,?,?]; Y = [?,?,?,?,100]
k = (1,2,4,8,16) because 100 is the last element of an array
Example 2: N = 6, K = 3:
First Step:
X = [1,2,3,4,5, 6] and Y = [7,8,10] Because K = (1,2,4) and Y = [6+1, 6+2, 6+4]
Second Step:
X = [2,3,4,5,6, 7] and Y = [8,9,11] Because K = (1,2,4) and Y = [7+1, 7+2, 7+4]
Third Step:
X = [3,4,5,6,7,8] and Y = [9,10,12] Because K = (1,2,4) and Y = [8+1, 8+2, 8+4]
**Other steps**
Last Step:
X = [91,92,93,94,95,96]; Y = [97,98,100], k = (1,2,4) because 100 is the last element of the array
Edit
I expect the function to look like:
def dataset_split(array, N, K):
It should return multiple X and Y arrays (basically chunks) based on the input array between 1 and 100. Basically it should go over the steps and save the results for X and Y in the form of matrices or arrays. Based on my Example 1 above, my X array after the first three steps will be
X = [[1,2,3,4,5], [2,3,4,5,6], [3,4,5, 6, 7]]
and my Y array after first three steps will be
Y = [[6,7,9,13, 21], [7,8,10,14, 22], [8,9,11,15, 23]]
The procedure should continue until the last element of the array (100 in this case) is reached.
A very simple way to get all the values of X is to create a sliding window view into array. You can do this directly with np.lib.stride_tricks.sliding_window_view:
n = ...
k = ...
x = np.lib.stride_tricks.sliding_window_view(array, n)
The geometric sequence K can be trivially generated with np.logspace:
K = np.logspace(0, k - 1, k, base=2)
OR
K = 2.0**np.arange(k)
Either way, you can pre-generate all of y as
y = x + K
Now you have two arrays with all of the data you need:
>>> array = np.arange(1, 101)
>>> n = k = 5
>>> x = np.lib.stride_tricks.sliding_window_view(array, n)
>>> x
array([[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6],
[ 3, 4, 5, 6, 7],
...
[ 94, 95, 96, 97, 98],
[ 95, 96, 97, 98, 99],
[ 96, 97, 98, 99, 100]])
>>> K = np.logspace(0, k - 1, k, base=2)
>>> K
array([ 1., 2., 4., 8., 16.])
>>> y = x + K
>>> y
array([[ 2., 4., 7., 12., 21.],
[ 3., 5., 8., 13., 22.],
[ 4., 6., 9., 14., 23.],
...
[ 95., 97., 100., 105., 114.],
[ 96., 98., 101., 106., 115.],
[ 97., 99., 102., 107., 116.]])
The nice thing about this approach is that you don't need to copy the original data of array to make x, and everything is fully vectorized. Whatever operation you are planning on doing can likely be performed in bulk using numpy functions.
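One caveat: y = x + K adds K to every element of each window. If, as in the question's Example 1, Y should instead be built from the last element of each window only, a small variant (same x and K as above) would be:
# keep the last column 2-D so it broadcasts against K
y = x[:, -1:] + K
# y[0] is then [6., 7., 9., 13., 21.], matching Example 1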
This satisfies your examples:
def split_dataset(array, N, K):
    k = 2**np.arange(K)
    # column stacking i-places shifted array for N columns
    X = np.c_[[np.roll(array, -i) for i in range(N)]].T[:-N+1]
    # masking rows that will go over the last value in array
    mask = X[:, -1] + k[-1] <= array[-1]
    X = X[mask]
    # adding k to the last column of X
    Y = X[:, -1].reshape(-1, 1) + k
    return X, Y
X, Y = split_dataset(array, 5, 5)
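A quick check against the question's Example 1 (N = 5, K = 5), assuming the function above:
import numpy as np

array = np.arange(1, 101)
X, Y = split_dataset(array, 5, 5)

print(X[:3])
# [[1 2 3 4 5]
#  [2 3 4 5 6]
#  [3 4 5 6 7]]
print(Y[:3])
# [[ 6  7  9 13 21]
#  [ 7  8 10 14 22]
#  [ 8  9 11 15 23]]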

Convert array of indices to one-hot encoded array in NumPy

Given a 1D array of indices:
a = array([1, 0, 3])
I want to one-hot encode this as a 2D array:
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
In case you are using keras, there is a built in utility for that:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=3)
And it does pretty much the same as @YXD's answer (see the source code).
Here is what I find useful:
def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
Here num_classes stands for the number of classes you have. So if you have a vector a with shape (10000,), this function transforms it to (10000, C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
Exactly what you wanted to have I believe.
PS: the source is Sequence models - deeplearning.ai
You can also use numpy's eye function:
numpy.eye(number of classes)[vector containing the labels]
You can use sklearn.preprocessing.LabelBinarizer:
Example:
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
output:
[[0 1 0 0]
[1 0 0 0]
[0 0 0 1]]
Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
For one-hot encoding with pandas:
one_hot_encode = pandas.get_dummies(array)
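A minimal sketch of the pandas route (note that get_dummies creates one column per distinct value that actually appears, not a full 0..max range):
import pandas as pd

a = [1, 0, 3]
one_hot = pd.get_dummies(a).to_numpy(dtype=int)
print(one_hot)
# [[0 1 0]
#  [1 0 0]
#  [0 0 1]]
# columns correspond to the distinct values [0, 1, 3], so value 2 gets no column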
You can use the following code to convert to a one-hot vector. Let x be the usual class vector, a single column with classes 0 to some number:
import numpy as np
np.eye(x.max()+1)[x]
If 0 is not a class, then remove the +1.
Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print one_hot_v

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes >= np.max(vector)

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Below is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[list(np.indices(z.shape[:-1])) + [a]] = 1
I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If using tensorflow, there is one_hot():
import tensorflow as tf
import numpy as np
a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 4), dtype=float32, numpy=
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]], dtype=float32)>
def one_hot(n, class_num, col_wise=True):
    a = np.eye(class_num)[n.reshape(-1)]
    return a.T if col_wise else a

# Column for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# Row for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))
I recently ran into a problem of the same kind and found that the posted solutions are only satisfying if your labels follow a certain pattern. For example, if you want to one-hot encode the following list:
all_good_list = [0,1,2,3,4]
go ahead, the solutions posted above work. But what about this data:
problematic_list = [0,23,12,89,10]
If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all the answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it with you:
import numpy as np
import sklearn.preprocessing

sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1, 2, 44, 3, 2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope someone else has run into the same restrictions with the above solutions and finds this handy.
Here's a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    # index with a tuple of index arrays (newer NumPy no longer accepts a list here)
    one_hot[tuple(flat_grids + [arr.ravel()])] = 1
    assert (one_hot.sum(-1) == 1).all()
    assert np.allclose(np.argmax(one_hot, -1), arr)
    return one_hot
This type of encoding is usually done with a numpy array. If you are using a numpy array like this:
a = np.array([1,0,3])
then there is a very simple way to convert it to a one-hot encoding:
out = (np.arange(4) == a[:,None]).astype(np.float32)
That's it.
Suppose p is a 2-D ndarray and we want to know which value is the highest in each row, putting a 1 there and 0 everywhere else.
A clean and easy solution:
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
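A small worked example of the approach above:
import numpy as np

p = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])

max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)  # [[1], [0]]
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
print(one_hot)
# [[0. 1. 0.]
#  [1. 0. 0.]]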
I find the easiest solution combines np.take and np.eye:
def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)
It works for x of any shape.
Here is an example function that I wrote to do this based upon the answers above and my own use case:
def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0., 0., 1.],
             [ 1., 0., 0.],
             [ 0., 1., 0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of 'one-hot' row vector

    Returns:
        np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)
    one_hot = np.zeros((squeezed_vector.size, one_hot_size))
    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1
    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)
I am adding for completion a simple function, using only numpy operators:
def probs_to_onehot(output_probabilities):
    argmax_indices_array = np.argmax(output_probabilities, axis=1)
    onehot_output_array = np.eye(np.unique(argmax_indices_array).shape[0])[argmax_indices_array.reshape(-1)]
    return onehot_output_array
It takes as input a probability matrix: e.g.:
[[0.03038822 0.65810204 0.16549407 0.3797123 ]
...
[0.02771272 0.2760752 0.3280924 0.33458805]]
And it will return
[[0 1 0 0] ... [0 0 0 1]]
Use the following code. It works best.
def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one hot encoding matrix (number of labels, number of class)
    """
    encoded = np.zeros((len(x), 10))
    for idx, val in enumerate(x):
        encoded[idx][val] = 1
    return encoded
Found it here. P.S. You don't need to follow the link.
Using a Neuraxle pipeline step:
Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
Assert it works:
assert np.array_equal(b_pred, b)
Link to documentation: neuraxle.steps.numpy.OneHotEncoder
