Numpy matrix binarization using only one expression - python

I am looking for a way to binarize a NumPy N-d array based on a threshold, using only one expression. So I have something like this:
np.random.seed(0)
np.set_printoptions(precision=3)
a = np.random.rand(4, 4)
threshold, upper, lower = 0.5, 1, 0
a is now:
array([[ 0.02 ,  0.833,  0.778,  0.87 ],
       [ 0.979,  0.799,  0.461,  0.781],
       [ 0.118,  0.64 ,  0.143,  0.945],
       [ 0.522,  0.415,  0.265,  0.774]])
Now I can fire these 2 expressions:
a[a>threshold] = upper
a[a<=threshold] = lower
and achieve what I want:
array([[ 0.,  1.,  1.,  1.],
       [ 1.,  1.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.]])
But is there a way to do this with just one expression?

We may consider np.where:
np.where(a>threshold, upper, lower)
Out[6]:
array([[0, 1, 1, 1],
       [1, 1, 0, 1],
       [0, 1, 0, 1],
       [1, 0, 0, 1]])
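Note that, unlike the pair of in-place assignments, np.where leaves a untouched and returns a new array whose dtype follows from upper and lower. A minimal sketch, reusing the variables above:
>>> binarized = np.where(a > threshold, upper, lower)   # a is unchanged
>>> np.where(a > threshold, 1.0, 0.0)                   # float output instead of int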

NumPy treats a 1-D array as a vector, a 2-D array as a matrix (a sequence of vectors), and a 3-D or higher array as a generic tensor. This means that when we perform operations, we are performing elementwise vector math. So you can just do:
>>> a = (a > 0.5).astype(np.int_)
For example:
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)
>>> a = np.random.rand(4, 4)
>>> a
array([[ 0.549,  0.715,  0.603,  0.545],
       [ 0.424,  0.646,  0.438,  0.892],
       [ 0.964,  0.383,  0.792,  0.529],
       [ 0.568,  0.926,  0.071,  0.087]])
>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])
What's going on here is that NumPy automatically iterates over every element of every row in the 4x4 matrix and applies the boolean comparison to each element: if the element is > 0.5 it yields True, else False.
Then, by calling the .astype method and passing np.int_ as the argument, you're telling numpy to replace all boolean values with their integer representation, in effect binarizing the matrix based on your comparison value.
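As a minimal illustration of the astype step in isolation:
>>> np.array([False, True, True]).astype(np.int_)
array([0, 1, 1])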

A shorter method is to simply multiply the boolean matrix from the condition by 1 or 1.0, depending on the type you want.
>>> a = np.random.rand(4,4)
>>> a
array([[ 0.63227032,  0.18262573,  0.21241511,  0.95181594],
       [ 0.79215808,  0.63868395,  0.41706148,  0.9153959 ],
       [ 0.41812268,  0.70905987,  0.54946947,  0.51690887],
       [ 0.83693151,  0.10929998,  0.19219377,  0.82919761]])
>>> (a>0.5)*1
array([[1, 0, 0, 1],
       [1, 1, 0, 1],
       [0, 1, 1, 1],
       [1, 0, 0, 1]])
>>> (a>0.5)*1.0
array([[ 1.,  0.,  0.,  1.],
       [ 1.,  1.,  0.,  1.],
       [ 0.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  1.]])

You can write the expression directly; this returns a boolean array, which can be used for further calculations just like a 1-byte unsigned integer ("uint8") array:
print(a > 0.5)
output
[[False  True  True  True]
 [ True  True False  True]
 [False  True False  True]
 [ True False False  True]]
With custom upper/lower values, you can write it in one line, for example:
upper = 10
lower = 3
threshold = 0.5
print(lower + (a > threshold) * (upper - lower))
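Since booleans behave as 0/1 in arithmetic, the same comparison also gives you counts for free; a small sketch using the array a from the question:
>>> (a > 0.5).sum()   # number of elements above the threshold
10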

How do you efficiently sum the occurrences of a value in one array at positions in another array

I'm looking for an efficient solution, avoiding 'for' loops, to an array-related problem I'm having. I want to use a huge 1-D array (A, size = 250,000) of values between 0 and 40 for indexing in one dimension, and an array (B) of the same size with values between 0 and 9995 for indexing in a second dimension.
The result should be an array of shape (41, 9996) that counts, for each pair of indices, how many times a value from array A co-occurs with the corresponding value from array B.
Example:
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
which should result in:
[[0, 1, 0],
 [0, 0, 0],
 [0, 0, 1],
 [0, 0, 2],
 [1, 0, 0]]
The dirty way is too slow, as the amount of data is huge:
out = np.zeros((41, 9996))
for i, j in zip(A, B):
    out[i, j] += 1
which still requires a Python-level loop over all 250,000 pairs...
I've tried this, which works partially:
out = np.zeros((41, 9996))
out[A, B] += 1
which puts a 1 at every indexed position, regardless of how many times each index pair occurs, because NumPy's fancy-index assignment does not accumulate duplicate indices.
Does anyone have a clue how to fix this? Thanks in advance!
You are looking for a sparse tensor:
import torch
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
idx = torch.LongTensor([A, B])
torch.sparse.FloatTensor(idx, torch.ones(idx.shape[1]), torch.Size([5,3])).to_dense()
Output:
tensor([[0., 1., 0.],
        [0., 0., 0.],
        [0., 0., 1.],
        [0., 0., 2.],
        [1., 0., 0.]])
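Note that torch.sparse.FloatTensor is the older constructor; in recent PyTorch versions the equivalent (to the best of my knowledge) is torch.sparse_coo_tensor:
idx = torch.tensor([A, B])
torch.sparse_coo_tensor(idx, torch.ones(idx.shape[1]), (5, 3)).to_dense()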
You can also do the same with scipy sparse matrix:
import numpy as np
from scipy.sparse import coo_matrix
coo_matrix((np.ones(len(A)), (np.array(A), np.array(B))), shape=(5,3)).toarray()
Output:
array([[0., 1., 0.],
       [0., 0., 0.],
       [0., 0., 1.],
       [0., 0., 2.],
       [1., 0., 0.]])
Sometimes it is better to leave the matrix in its sparse representation, rather than forcing it to be "dense" again.
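For instance, a sketch of staying sparse, converting to CSR for efficient arithmetic and row access:
counts = coo_matrix((np.ones(len(A)), (np.array(A), np.array(B))), shape=(5, 3))
csr = counts.tocsr()           # CSR supports efficient row slicing and arithmetic
row_totals = csr.sum(axis=1)   # per-row totals without densifying the matrix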
Use numpy.add.at:
import numpy as np
A = [0, 3, 2, 4, 3]
B = [1, 2, 2, 0, 2]
arr = np.zeros((5, 3))
np.add.at(arr, (A, B), 1)
print(arr)
Output
[[0. 1. 0.]
 [0. 0. 0.]
 [0. 0. 1.]
 [0. 0. 2.]
 [1. 0. 0.]]
Given that the numbers are in a small range, bincount would be a good choice for bin-based summing -
def accumulate_coords(A, B):
    nrows = A.max() + 1
    ncols = B.max() + 1
    # Flatten each (row, col) pair to a single bin index, count, then reshape.
    return np.bincount(A*ncols + B, minlength=nrows*ncols).reshape(-1, ncols)
Sample run -
In [55]: A
Out[55]: array([0, 3, 2, 4, 3])
In [56]: B
Out[56]: array([1, 2, 2, 0, 2])
In [58]: accumulate_coords(A,B)
Out[58]:
array([[0, 1, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 2],
       [1, 0, 0]])
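Equivalently, the manual flattening A*ncols+B can be written with np.ravel_multi_index, which reads more explicitly and adds bounds checking; a self-contained sketch:
import numpy as np

A = np.array([0, 3, 2, 4, 3])
B = np.array([1, 2, 2, 0, 2])
nrows, ncols = A.max() + 1, B.max() + 1
idx = np.ravel_multi_index((A, B), (nrows, ncols))   # same as A*ncols + B
out = np.bincount(idx, minlength=nrows*ncols).reshape(nrows, ncols)
Note that bincount requires the indices to be nonnegative integers, which holds here.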

Is there a better way to produce a membership matrix (one-hot array) for an array of cluster assignments in Python? [duplicate]

This question already has answers here:
Convert array of indices to one-hot encoded array in NumPy
After running k-means I can easily get an array with the assigned clusters for every data point. Now I want to get a membership matrix (one-hot array) which has the different clusters as columns and indicates the cluster assignment by either 1 or 0 in the matrix for each data point.
My code is shown below and it works but I am wondering if there is a more elegant way to do the same.
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3).fit(data)
membership_matrix = np.stack([np.where(km.labels_ == 0, 1, 0),
                              np.where(km.labels_ == 1, 1, 0),
                              np.where(km.labels_ == 2, 1, 0)],
                             axis=1)
You can create a 'one-hot array', which is equivalent to your membership matrix, from the array of cluster assignments, as described in this question. Here is how to do it using np.eye:
import numpy as np
clusters = np.array([2,1,2,2,0,1])
n_clusters = max(clusters) + 1
membership_matrix = np.eye(n_clusters)[clusters]
Output is as follows
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
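Applied to the k-means output from the question, this would be (assuming the labels are the consecutive integers 0..n_clusters-1, which is what scikit-learn's KMeans produces):
membership_matrix = np.eye(km.labels_.max() + 1)[km.labels_]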
Here's a method that's agnostic to the number of clusters you have (with your method, you'll have to "stack" more things if you have more clusters).
This code sample assumes you have six data points and 3 clusters:
NUM_DATA_POINTS = 6
NUM_CLUSTERS = 3
clusters = np.array([2,1,2,2,0,1]) # hard-coded as an example, but this is your KMeans output
# create your empty membership matrix
membership = np.zeros((NUM_DATA_POINTS, NUM_CLUSTERS))
membership[np.arange(NUM_DATA_POINTS), clusters] = 1
The key feature being used here is 2D array indexing - in the last line of code above, we index into the rows of membership sequentially (np.arange creates an incrementing sequence from 0 to NUM_DATA_POINTS-1) and into the columns of membership using the cluster assignments. Here's the relevant numpy reference.
It would produce the following membership matrix:
>>> membership
array([[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
You are looking for LabelBinarizer. Give this code a try:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
membership_matrix = lb.fit_transform(km.labels_)
In contrast to other solutions proposed here, this approach:
Generates a compact membership matrix when the labels are not consecutive numbers.
Is able to deal with categorical labels.
Sample run:
In [9]: lb.fit_transform([0, 1, 2, 0, 2, 2])
Out[9]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])
In [10]: lb.fit_transform([0, 1, 9, 0, 9, 9])
Out[10]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])
In [11]: lb.fit_transform(['first', 'second', 'third', 'first', 'third', 'third'])
Out[11]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])
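One caveat, based on my understanding of scikit-learn's behavior: with exactly two distinct labels, LabelBinarizer returns a single 0/1 column of shape (n_samples, 1) rather than a two-column matrix, so that case may need special handling:
In [12]: lb.fit_transform([0, 1, 1, 0])
Out[12]:
array([[0],
       [1],
       [1],
       [0]])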

How to split an array based on minimum row value using vectorization

I am trying to figure out how to take the following for loop, which splits an array based on the index of the lowest value in each row, and rewrite it using vectorization. I've looked at this link and have been trying to use the numpy.where function, but so far without success.
For example if an array has n columns, then all the rows where col[0] has the lowest value are put in one array, all the rows where col[1] are put in another, etc.
Here's the code using a for loop.
import numpy
a = numpy.array([[0., 1., 3.],
                 [0., 1., 3.],
                 [0., 1., 3.],
                 [1., 0., 2.],
                 [1., 0., 2.],
                 [1., 0., 2.],
                 [3., 1., 0.],
                 [3., 1., 0.],
                 [3., 1., 0.]])
result_0 = []
result_1 = []
result_2 = []
for value in a:
    if value[0] <= value[1] and value[0] <= value[2]:
        result_0.append(value)
    elif value[1] <= value[0] and value[1] <= value[2]:
        result_1.append(value)
    else:
        result_2.append(value)
print(result_0)
>>[array([ 0.,  1.,  3.]), array([ 0.,  1.,  3.]), array([ 0.,  1.,  3.])]
print(result_1)
>>[array([ 1.,  0.,  2.]), array([ 1.,  0.,  2.]), array([ 1.,  0.,  2.])]
print(result_2)
>>[array([ 3.,  1.,  0.]), array([ 3.,  1.,  0.]), array([ 3.,  1.,  0.])]
First, use argsort to see where the lowest value in each row is:
>>> a.argsort(axis=1)
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [1, 0, 2],
       [1, 0, 2],
       [1, 0, 2],
       [2, 1, 0],
       [2, 1, 0],
       [2, 1, 0]])
Note that the first entry in each row of the argsort result is the index of the column that holds the smallest value in that row.
Now you can build the results:
>>> sortidx = a.argsort(axis=1)
>>> [a[sortidx[:, 0] == i] for i in range(a.shape[1])]
[array([[ 0.,  1.,  3.],
        [ 0.,  1.,  3.],
        [ 0.,  1.,  3.]]),
 array([[ 1.,  0.,  2.],
        [ 1.,  0.,  2.],
        [ 1.,  0.,  2.]]),
 array([[ 3.,  1.,  0.],
        [ 3.,  1.,  0.],
        [ 3.,  1.,  0.]])]
So it is done with only a single loop over the columns, which will give a huge speedup if the number of rows is much larger than the number of columns.
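A note on the design: since only the position of each row's minimum is needed, argmin(axis=1) is a slightly more direct alternative to a full argsort; ties resolve to the first occurrence, which matches the if/elif ordering in the question:
>>> mins = a.argmin(axis=1)
>>> [a[mins == i] for i in range(a.shape[1])]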
This is not the best solution, since it relies on plain Python loops and is not very efficient for large data sets, but it should get you started.
The idea is to create a list of results "buckets", one per column. Then, for each row, find the smallest element and note its offset; the row is appended to the bucket at that offset. Finally, we print the buckets in the last loop.
Solution using loops:
import numpy

# random data set
a = numpy.array([[0, 1, 3],
                 [0, 1, 3],
                 [0, 1, 3],
                 [1, 0, 2],
                 [1, 0, 2],
                 [1, 0, 2],
                 [3, 1, 0],
                 [3, 1, 0],
                 [3, 1, 0]])

# create a list of results "buckets" as big as the depth of elements in an entry
results = list()
for _ in range(max(len(i) for i in a)):
    results.append(list())
# don't do the following, because every entry would be a reference to the
# same list and you would get duplicates:
# results = [[]] * max(len(i) for i in a)

for value in a:
    res_offset, _val = min(enumerate(value), key=lambda x: x[1])  # get the offset and min value
    results[res_offset].append(value)  # store the original row in the correct "bucket"

# print for visualization
for c, r in enumerate(results):
    print("result_%s: %s" % (c, r))
Outputs:
result_0: [array([0, 1, 3]), array([0, 1, 3]), array([0, 1, 3])]
result_1: [array([1, 0, 2]), array([1, 0, 2]), array([1, 0, 2])]
result_2: [array([3, 1, 0]), array([3, 1, 0]), array([3, 1, 0])]
I found a much easier way to do this. I hope that I am interpreting the OP correctly.
My sense is that the OP wants to create slices of the larger array based upon a set of conditions.
Note that the array-creation code in the question, as originally posted, was missing commas and did not run (at least in Python 3.5). I generated the array as follows.
a = np.array([0., 1., 3., 0., 1., 3., 0., 1., 3., 1., 0., 2., 1., 0., 2., 1., 0., 2., 3., 1., 0., 3., 1., 0., 3., 1., 0.]).reshape([9, 3])
Next, I sliced the original array into smaller arrays. Numpy has builtins to help with this.
result_0 = a[np.logical_and(a[:, 0] <= a[:, 1], a[:, 0] <= a[:, 2])]
result_1 = a[np.logical_and(a[:, 1] <= a[:, 0], a[:, 1] <= a[:, 2])]
result_2 = a[np.logical_and(a[:, 2] <= a[:, 0], a[:, 2] <= a[:, 1])]
This will generate new numpy arrays that match the given conditions.
Note that if you want to convert these individual rows into lists of arrays, you can run the following code to obtain the result.
result_0 = [np.array(x) for x in result_0.tolist()]
result_1 = [np.array(x) for x in result_1.tolist()]
result_2 = [np.array(x) for x in result_2.tolist()]
This should generate the outcome requested in the OP.
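As an aside, on boolean arrays the & operator is an equivalent shorthand for np.logical_and (the parentheses are required because & binds more tightly than the comparisons):
result_0 = a[(a[:, 0] <= a[:, 1]) & (a[:, 0] <= a[:, 2])]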

Understanding == applied to a NumPy array

I'm new to Python, and I am learning TensorFlow. In a tutorial using the notMNIST dataset, they give example code to transform the labels matrix to a one-of-n encoded array.
The goal is to take an array consisting of label integers 0...9, and return a matrix where each integer has been transformed into a one-of-n encoded array like this:
0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
...
The code they give to do this is:
# Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
However, I don't understand how this code does that at all. It looks like it just generates an array of integers in the range of 0 to 9, and then compares that with the labels matrix, and converts the result to a float. How does an == operator result in a one-of-n encoded matrix?
There are a few things going on here: numpy's vector ops, adding a singleton axis, and broadcasting.
First, you should be able to see how the == does the magic.
Let's say we start with a simple label array. == behaves in a vectorized fashion, which means that we can compare the entire array with a scalar and get an array consisting of the values of each elementwise comparison. For example:
>>> labels = np.array([1,2,0,0,2])
>>> labels == 0
array([False, False, True, True, False], dtype=bool)
>>> (labels == 0).astype(np.float32)
array([ 0., 0., 1., 1., 0.], dtype=float32)
First we get a boolean array, and then we coerce to floats: False==0 in Python, and True==1. So we wind up with an array which is 0 where labels isn't equal to 0 and 1 where it is.
But there's nothing special about comparing to 0, we could compare to 1 or 2 or 3 instead for similar results:
>>> (labels == 2).astype(np.float32)
array([ 0., 1., 0., 0., 1.], dtype=float32)
In fact, we could loop over every possible label and generate this array. We could use a listcomp:
>>> np.array([(labels == i).astype(np.float32) for i in np.arange(3)])
array([[ 0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.]], dtype=float32)
but this doesn't really take advantage of numpy. What we want to do is compare each possible label with each element, in other words, to compare
>>> np.arange(3)
array([0, 1, 2])
with
>>> labels
array([1, 2, 0, 0, 2])
And here's where the magic of numpy broadcasting comes in. Right now, labels is a 1-dimensional object of shape (5,). If we make it a 2-dimensional object of shape (5,1), then the operation will "broadcast" over the last axis and we'll get an output of shape (5,3) with the results of comparing each entry in the range with each element of labels.
First we can add an "extra" axis to labels using None (or np.newaxis), changing its shape:
>>> labels[:,None]
array([[1],
       [2],
       [0],
       [0],
       [2]])
>>> labels[:,None].shape
(5, 1)
And then we can make the comparison (this is the transpose of the arrangement we were looking at earlier, but that doesn't really matter).
>>> np.arange(3) == labels[:,None]
array([[False,  True, False],
       [False, False,  True],
       [ True, False, False],
       [ True, False, False],
       [False, False,  True]], dtype=bool)
>>> (np.arange(3) == labels[:,None]).astype(np.float32)
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]], dtype=float32)
Broadcasting in numpy is very powerful, and well worth reading up on.
In short, == applied to a numpy array means applying element-wise == to the array. The result is an array of booleans. Here is an example:
>>> b = np.array([1,0,0,1,1,0])
>>> b == 1
array([ True, False, False, True, True, False], dtype=bool)
To count, say, how many 1s there are in b, you don't need to cast the array to float; the .astype(np.float32) can be omitted, because bool is a subclass of int in Python, with True == 1 and False == 0. So here is how you count how many ones are in b:
>>> np.sum((b == 1))
3
Or:
>>> np.count_nonzero(b == 1)
3

Convolving a periodic image with python

I want to convolve an n-dimensional image which is conceptually periodic.
What I mean is the following: if I have a 2D image
>>> image2d = [[0,0,0,0],
... [0,0,0,1],
... [0,0,0,0]]
and I want to convolve it with this kernel:
>>> kernel = [[ 1,1,1],
... [ 1,1,1],
... [ 1,1,1]]
then I want the result to be:
>>> result = [[1,0,1,1],
... [1,0,1,1],
... [1,0,1,1]]
How to do this in python/numpy/scipy?
Note that I am not interested in creating the kernel, but mainly the periodicity of the convolution, i.e. the three leftmost ones in the resulting image (if that makes sense).
This is already built in, with scipy.signal.convolve2d's optional boundary='wrap' which gives periodic boundary conditions as padding for the convolution. The mode option here is 'same' to make the output size match the input size.
In [1]: image2d = [[0,0,0,0],
... [0,0,0,1],
... [0,0,0,0]]
In [2]: kernel = [[ 1,1,1],
... [ 1,1,1],
... [ 1,1,1]]
In [3]: from scipy.signal import convolve2d
In [4]: convolve2d(image2d, kernel, mode='same', boundary='wrap')
Out[4]:
array([[1, 0, 1, 1],
       [1, 0, 1, 1],
       [1, 0, 1, 1]])
The only disadvantage here is that you cannot use scipy.signal.fftconvolve which is often faster.
Alternatively, you can implement the circular convolution yourself via the FFT, since pointwise multiplication in the frequency domain corresponds to circular convolution in the spatial domain:
image2d = [[0,0,0,0,0],
           [0,0,0,1,0],
           [0,0,0,0,0],
           [0,0,0,0,0]]
kernel = [[1,1,1],
          [1,1,1],
          [1,1,1]]
image2d = np.asarray(image2d)
kernel = np.asarray(kernel)
img_f = np.fft.fft2(image2d)
krn_f = np.fft.fft2(kernel, s=image2d.shape)
conv = np.fft.ifft2(img_f*krn_f).real
>>> conv.round()
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  1.],
       [ 1.,  0.,  0.,  1.,  1.],
       [ 1.,  0.,  0.,  1.,  1.]])
Note that the kernel is placed with its top left corner at the position of the 1 in the image. You would need to roll the result to get what you are after:
k_rows, k_cols = kernel.shape
conv2 = np.roll(np.roll(conv, -(k_cols//2), axis=-1),
                -(k_rows//2), axis=-2)
>>> conv2.round()
array([[ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.]])
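As a side note, newer NumPy versions let np.roll take tuple arguments, so the two nested calls can (if I'm not mistaken) be collapsed into one:
conv2 = np.roll(conv, (-(k_rows//2), -(k_cols//2)), axis=(-2, -1))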
This kind of 'periodic convolution' is better known as circular or cyclic convolution. See http://en.wikipedia.org/wiki/Circular_convolution.
In the case of an n-dimensional image, as is the case in this question, one can use the scipy.ndimage.convolve function. It has a parameter mode which can be set to wrap for a circular convolution.
result = scipy.ndimage.convolve(image,kernel,mode='wrap')
>>> import numpy as np
>>> image = np.array([[0, 0, 0, 0],
... [0, 0, 0, 1],
... [0, 0, 0, 0]])
>>> kernel = np.array([[1, 1, 1],
... [1, 1, 1],
... [1, 1, 1]])
>>> from scipy.ndimage import convolve
>>> convolve(image, kernel, mode='wrap')
array([[1, 0, 1, 1],
       [1, 0, 1, 1],
       [1, 0, 1, 1]])
