Generate one-hot vectors from an nx1 array of binary labels - python

I have an array like this:
X = [0,0,1,1,0,0,1,1,1,0,0,0]
I want to create an nx2 one-hot encoded array:
one_hotX = [[1,0],[1,0],[0,1],[0,1],[1,0]...]
Is there an easy way to do this? OneHotEncoder and LabelEncoder don't seem to work.

How about constructing the vector yourself:
[[0, 1] if i else [1, 0] for i in X]
#[[1, 0],
# [1, 0],
# [0, 1],
# [0, 1],
# [1, 0],
# [1, 0],
# [0, 1],
# [0, 1],
# [0, 1],
# [1, 0],
# [1, 0],
# [1, 0]]
If you are working with numpy, you could also do something like this (a vectorized approach):
import numpy as np
code = np.array([[1,0],[0,1]])
arrX = np.array(X)
code[arrX]
#array([[1, 0],
#       [1, 0],
#       [0, 1],
#       [0, 1],
#       [1, 0],
#       [1, 0],
#       [0, 1],
#       [0, 1],
#       [0, 1],
#       [1, 0],
#       [1, 0],
#       [1, 0]])
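Equivalently, a common idiom is to index into an identity matrix; np.eye gives you the same two-row lookup table without writing it out by hand (a sketch of the same trick):

```python
import numpy as np

X = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]

# np.eye(2) is [[1, 0], [0, 1]]; row i is the one-hot vector for label i
one_hotX = np.eye(2, dtype=int)[X]
```

This generalizes directly to k classes via np.eye(k).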

Related

Map a 2d 2-channel numpy array to a 2d 1-channel numpy array

Suppose I have a 2d 2-channel (3d) numpy array:
[[[-1, -1], [0, -1], [1, -1]],
 [[-1,  0], [0,  0], [1,  0]],
 [[-1,  1], [0,  1], [1,  1]]]
I want to map this to a 2d 1-channel (3d) numpy array:
[[[0], [1], [2]],
 [[3], [4], [5]],
 [[6], [7], [8]]]
So, for example, if I had the following array
[[[-1, -1], [0, 0], [1, 1]],
 [[ 0, 0], [1, 0], [1, 1]]]
then after applying the mapping I should get
[[[0], [4], [8]],
 [[4], [5], [8]]]
since [-1, -1] maps to [0], [0, 0] maps to [4], and so on.
I am writing a python program to preprocess images in CIELAB space. The L* has been stripped off leaving me with 'ab'. I want to convert individual pixels of ab to classes.
Let's generate your lookup arrays to get a hint. First the template:
ROWS = 3
COLS = 3
template = np.arange(ROWS * COLS).reshape(ROWS, COLS, 1)
This is equivalent to
template = np.array([[[0], [1], [2]],
                     [[3], [4], [5]],
                     [[6], [7], [8]]])
Then the input grid:
ROW_OFFSET = -1
COL_OFFSET = -1
grid = np.stack(np.mgrid[ROW_OFFSET:ROWS + ROW_OFFSET,
                         COL_OFFSET:COLS + COL_OFFSET], 2)
This is equivalent to
grid = np.array([[[-1, -1], [-1, 0], [-1, 1]],
                 [[ 0, -1], [ 0, 0], [ 0, 1]],
                 [[ 1, -1], [ 1, 0], [ 1, 1]]])
Given how we made grid, it should be clear that the "channels" are the row and column index, up to the offset. So given an index array, you can map it into template using fancy indexing:
index = np.array([[[-1, -1], [0, 0], [1, 1]],
                  [[ 0, 0], [1, 0], [1, 1]]])
result = template[index[:, :, 0] - ROW_OFFSET, index[:, :, 1] - COL_OFFSET, :]
If your template always fits the pattern shown above, you don't need indexing at all. You can just generate the result directly from COLS and the grid offsets:
result = (index[:, :, 0] - ROW_OFFSET) * COLS + index[:, :, 1] - COL_OFFSET
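As a self-contained check (reusing the names defined above), the fancy-indexing lookup and the direct formula agree with each other:

```python
import numpy as np

ROWS, COLS = 3, 3
ROW_OFFSET = COL_OFFSET = -1

template = np.arange(ROWS * COLS).reshape(ROWS, COLS, 1)
index = np.array([[[-1, -1], [0, 0], [1, 1]],
                  [[ 0, 0], [1, 0], [1, 1]]])

# lookup via fancy indexing (shape (2, 3, 1))
looked_up = template[index[:, :, 0] - ROW_OFFSET, index[:, :, 1] - COL_OFFSET, :]
# direct formula (shape (2, 3))
direct = (index[:, :, 0] - ROW_OFFSET) * COLS + index[:, :, 1] - COL_OFFSET
```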
import numpy as np
a = np.array([[[-1, -1], [0, -1], [1, -1]],
              [[-1,  0], [0,  0], [1,  0]],
              [[-1,  1], [0,  1], [1,  1]]])
b = np.array([[[-1, -1], [0, 0], [1, 1]],
              [[ 0, 0], [1, 0], [1, 1]]])
width = len(a[0])
start_row = a[0][0][0]
start_col = a[0][0][1]
result = []
for rows in b:
    line = []
    for d in rows:
        n = (d[1] - start_col) * width + d[0] - start_row
        line.append([n])
    result.append(line)
result = np.asarray(result)
Is this what you mean?
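For what it's worth, the inner loop above can also be vectorized with the same arithmetic (a sketch reusing the names b, width, start_row and start_col from the code above):

```python
import numpy as np

b = np.array([[[-1, -1], [0, 0], [1, 1]],
              [[ 0, 0], [1, 0], [1, 1]]])
width = 3
start_row, start_col = -1, -1

# same formula as the loop, applied to the whole array at once;
# [..., None] restores the trailing 1-channel axis
result = ((b[..., 1] - start_col) * width + b[..., 0] - start_row)[..., None]
```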

Numpy Use 2D array as heightmap-like index for 3D array

I want to use a 2D array as a heightmap-like index into axis 0 of a 3D array. Is there an efficient "numpy way" of doing this? In my example I want to set everything at or above the height given by the heightmap in each corresponding pillar to zero. Example:
3D Array:
[[[1, 1, 1],
  [1, 1, 1],
  [1, 1, 1]],
 [[1, 1, 1],
  [1, 1, 1],
  [1, 1, 1]],
 [[1, 1, 1],
  [1, 1, 1],
  [1, 1, 1]],
 [[1, 1, 1],
  [1, 1, 1],
  [1, 1, 1]]]
2D Array (heightmap):
[[0, 1, 2],
 [2, 3, 4],
 [2, 0, 0]]
Desired output:
[[[0, 1, 1],
  [1, 1, 1],
  [1, 0, 0]],
 [[0, 0, 1],
  [1, 1, 1],
  [1, 0, 0]],
 [[0, 0, 0],
  [0, 1, 1],
  [0, 0, 0]],
 [[0, 0, 0],
  [0, 0, 1],
  [0, 0, 0]]]
So far I have implemented this with a for python loop as in
for y in range(arr2d.shape[0]):
    for x in range(arr2d.shape[1]):
        height = arr2d[y, x]
        arr3d[height:, y, x] = 0
but this seems very inefficient and I feel like there might be a much better way to do this.
Drawing inspiration from a fast way of padding arrays:
In [104]: (np.arange(4)[:,None,None]<arr2d).astype(int)
Out[104]:
array([[[0, 1, 1],
        [1, 1, 1],
        [1, 0, 0]],
       [[0, 0, 1],
        [1, 1, 1],
        [1, 0, 0]],
       [[0, 0, 0],
        [0, 1, 1],
        [0, 0, 0]],
       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]]])

Unexpected behaviour in list value change

I defined this function:
def newMap(dim, n):
    tc = [0 for i in range(n)]
    return [[tc for _ in range(dim)] for _ in range(dim)]
Which creates a list of lists of zeroes. For example
m = newMap(2,2)
print(m)
returns
[[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
I want to change one of the zeroes to obtain [[[1, 0], [0, 0]], [[0, 0], [0, 0]]] and tried doing so by
m[0][0][0] = 1
which unexpectedly gives [[[1, 0], [1, 0]], [[1, 0], [1, 0]]] instead of [[[1, 0], [0, 0]], [[0, 0], [0, 0]]].
However, if I define a = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]] directly and then do
a[0][0][0] = 1
print(a)
it returns [[[1, 0], [0, 0]], [[0, 0], [0, 0]]], which is what I want.
Why does this happen? Shouldn't the two definitions be equivalent? How can I prevent it from happening in the first case?
Use tc.copy(); that fixes it. The original version puts the same list object tc into every cell, so mutating one cell mutates all of them. With .copy(), each cell gets its own independent list:
def newMap(dim, n):
    tc = [0 for i in range(n)]
    return [[tc.copy() for _ in range(dim)] for _ in range(dim)]
a = newMap(2,2)
a
#[[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
a[0][0][0] = 1
#[[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
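To see why the original version misbehaves, check object identity: every cell holds a reference to one and the same list. A small illustrative sketch (new_map_shared / new_map_copies are hypothetical names for the two variants above):

```python
def new_map_shared(dim, n):
    tc = [0 for i in range(n)]
    return [[tc for _ in range(dim)] for _ in range(dim)]  # every cell is the SAME list

def new_map_copies(dim, n):
    tc = [0 for i in range(n)]
    return [[tc.copy() for _ in range(dim)] for _ in range(dim)]  # fresh list per cell

shared = new_map_shared(2, 2)
copies = new_map_copies(2, 2)
print(shared[0][0] is shared[1][1])  # True: four references to one list
print(copies[0][0] is copies[1][1])  # False: independent lists
```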

Find indices of unique values of a 3-dim numpy array

I have an array with coordinates of N points. Another array contains the masses of these N points.
>>> import numpy as np
>>> N=10
>>> xyz=np.random.randint(0,2,(N,3))
>>> mass=np.random.rand(len(xyz))
>>> xyz
array([[1, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [0, 0, 0],
       [0, 1, 0],
       [1, 1, 0],
       [1, 0, 1],
       [0, 0, 1],
       [1, 0, 1],
       [0, 0, 1]])
>>> mass
array([ 0.38668401, 0.44385111, 0.47756182, 0.74896529, 0.20424403,
        0.21828435, 0.98937523, 0.08736635, 0.24790248, 0.67759276])
Now I want to obtain an array with unique values of xyz and a corresponding array of summed up masses. That means the following arrays:
>>> xyz_unique
array([[0, 1, 1],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 0, 0],
       [0, 1, 0]])
>>> mass_unique
array([ 0.47756182, 0.66213546, 0.76495911, 1.62396172, 0.74896529,
        0.20424403])
My attempt was the following code with a double for-loop:
>>> xyz_unique = np.array(list(set(tuple(p) for p in xyz)))
>>> mass_unique = np.zeros(len(xyz_unique))
>>> for j in np.arange(len(xyz_unique)):
...     indices = np.array([], dtype=np.int64)
...     for i in np.arange(len(xyz)):
...         if np.all(xyz[i] == xyz_unique[j]):
...             indices = np.append(indices, i)
...     mass_unique[j] = np.sum(mass[indices])
The problem is that this takes too long, I actually have N=100000.
Is there a faster way or how could I improve my code?
EDIT: My coordinates are actually floating-point numbers. To keep things simple, I used random integers here so there are duplicates at low N.
Case 1: Binary numbers in xyz
If the elements in the input array xyz are 0s and 1s, you can convert each row into an equivalent decimal number, then label each row by the uniqueness of its decimal value. Based on those labels, you can use np.bincount to accumulate the summations, just like one could use accumarray in MATLAB. Here's an implementation that achieves all of that -
import numpy as np
# Input arrays xyz and mass
xyz = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 1]])
mass = np.array([ 0.38668401, 0.44385111, 0.47756182, 0.74896529, 0.20424403,
                  0.21828435, 0.98937523, 0.08736635, 0.24790248, 0.67759276])
# Convert each row entry in xyz into equivalent decimal numbers
dec_num = np.dot(xyz,2**np.arange(xyz.shape[1])[:,None])
# Get indices of the first occurrences of the unique values and also label each row
_, unq_idx,row_labels = np.unique(dec_num, return_index=True, return_inverse=True)
# Find unique rows from xyz array
xyz_unique = xyz[unq_idx,:]
# Accumulate the summations from mass based on the row labels
mass_unique = np.bincount(row_labels, weights=mass)
Output -
In [148]: xyz_unique
Out[148]:
array([[0, 0, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 1, 1]])
In [149]: mass_unique
Out[149]:
array([ 0.74896529, 0.20424403, 0.66213546, 0.76495911, 1.62396172,
        0.47756182])
Case 2: Generic
For a general case, you can use this -
import numpy as np
# Perform lex sort and get the sorted indices
sorted_idx = np.lexsort(xyz.T)
sorted_xyz = xyz[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_xyz,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1
# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels
# Get unique indices
unq_idx = sorted_idx[df2]
# Get unique xyz's and the mass counts using accumulation with bincount
xyz_unique = xyz[unq_idx,:]
mass_unique = np.bincount(labels, weights=mass)
Sample run -
In [238]: xyz
Out[238]:
array([[1, 2, 1],
       [1, 2, 1],
       [0, 1, 0],
       [1, 0, 1],
       [2, 1, 2],
       [2, 1, 1],
       [0, 1, 0],
       [1, 0, 0],
       [2, 1, 0],
       [2, 0, 1]])
In [239]: mass
Out[239]:
array([ 0.5126308 , 0.69075674, 0.02749734, 0.384824  , 0.65151772,
        0.77718427, 0.18839268, 0.78364902, 0.15962722, 0.09906355])
In [240]: xyz_unique
Out[240]:
array([[1, 0, 0],
       [0, 1, 0],
       [2, 1, 0],
       [1, 0, 1],
       [2, 0, 1],
       [2, 1, 1],
       [1, 2, 1],
       [2, 1, 2]])
In [241]: mass_unique
Out[241]:
array([ 0.78364902, 0.21589002, 0.15962722, 0.384824  , 0.09906355,
        0.77718427, 1.20338754, 0.65151772])
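On NumPy 1.13 or newer, np.unique supports axis=0 directly, which condenses the generic case into two lines (note the unique rows come back in lexicographically sorted order rather than first-occurrence order):

```python
import numpy as np

xyz = np.array([[1, 2, 1], [1, 2, 1], [0, 1, 0], [1, 0, 1], [2, 1, 2],
                [2, 1, 1], [0, 1, 0], [1, 0, 0], [2, 1, 0], [2, 0, 1]])
mass = np.array([0.5126308 , 0.69075674, 0.02749734, 0.384824  , 0.65151772,
                 0.77718427, 0.18839268, 0.78364902, 0.15962722, 0.09906355])

# one unique-row pass replaces the lexsort/diff/cumsum bookkeeping
xyz_unique, labels = np.unique(xyz, axis=0, return_inverse=True)
labels = labels.ravel()  # inverse shape differs across NumPy versions; flatten to be safe
mass_unique = np.bincount(labels, weights=mass)
```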

Sort numpy.array rows by indices

I have 2D numpy.array and a tuple of indices:
a = array([[0, 0], [0, 1], [1, 0], [1, 1]])
ix = (2, 0, 3, 1)
How can I sort array's rows by the indices? Expected result:
array([[1, 0], [0, 0], [1, 1], [0, 1]])
I tried using numpy.take, but it works as I expect only with 1D arrays.
You can in fact use ndarray.take() for this. The trick is to supply the second argument (axis):
>>> a.take(ix, 0)
array([[1, 0],
       [0, 0],
       [1, 1],
       [0, 1]])
(Without axis, the array is flattened before elements are taken.)
Alternatively:
>>> a[ix, ...]
array([[1, 0],
       [0, 0],
       [1, 1],
       [0, 1]])
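Equivalently, the module-level np.take accepts the axis keyword as well:

```python
import numpy as np

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
ix = (2, 0, 3, 1)

result = np.take(a, ix, axis=0)  # same as a.take(ix, 0)
```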
