Trying to understand what is happening in this Python function

def closest_centroid(points, centroids):
    """returns an array containing the index to the nearest centroid for each point"""
    distances = np.sqrt(((points - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)
Can someone explain the exact working of this function? The points array I currently have looks like:
31998888119 0.94 34
23423423422 0.45 43
....
And so on. In this numpy array each row is one entry: for the first entry, the first column holds the long ID, the second column holds 0.94, and the third holds 34.
Centroids is just a random selection from this particular array:
def initialize_centroids(points, k):
    """returns k centroids from the initial points"""
    centroids = points.copy()
    np.random.shuffle(centroids)
    return centroids[:k]
Now I want to get the Euclidean distance between the values of points and centroids, in both cases ignoring the first column of IDs. I don't exactly understand the syntax of the line distances = np.sqrt(((points - centroids[:, np.newaxis])**2).sum(axis=2)). Why exactly are we summing across the third axis, and what is the declaration of a new axis, np.newaxis, doing there? Also, along what axis am I supposed to make np.argmin work?

It helps to think about the dimensions. Let's assume that k=4 and there are 10 points, so points.shape = (10,3).
Next, centroids = initialize_centroids(points, 4) returns an array of shape (4,3).
Let's break up this line from the inside:
distances = np.sqrt(((points - centroids[:, np.newaxis])**2).sum(axis=2))
We want to subtract each centroid from each point. Since points and centroids are both 2-dimensional, each points - centroid result is 2-dimensional. If there were only 1 centroid, we'd be fine. But we have 4 centroids! So we need to perform points - centroid once per centroid, and we need another dimension to store those results. Hence the addition of np.newaxis: centroids[:, np.newaxis] has shape (4, 1, 3), which broadcasts against points of shape (10, 3) to give a (4, 10, 3) array of differences.
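A quick shape check makes the broadcasting concrete (a minimal sketch with random stand-in data):
import numpy as np
points = np.random.rand(10, 3)
centroids = points[:4]
print(centroids[:, np.newaxis].shape)             # (4, 1, 3)
print((points - centroids[:, np.newaxis]).shape)  # (4, 10, 3)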
We square the differences because we want a distance, so negatives must become positive (and squaring is exactly what the Euclidean distance requires).
We're not summing across the third column. We are summing the squared coordinate differences between point and centroid, for each point and each centroid; axis=2 is the coordinate axis of that (4, 10, 3) array.
np.argmin(distances, axis=0) then finds, for each point, the index of the centroid with the minimum distance (hence argmin instead of min: we want the index, not the distance itself). Axis 0 is the centroid axis of the (4, 10) distance array, and the resulting index is the centroid assigned to that point.
Here is an example:
points = np.array([
    [ 1, 2, 4],
    [ 1, 1, 3],
    [ 1, 6, 2],
    [ 6, 2, 3],
    [ 7, 2, 3],
    [ 1, 9, 6],
    [ 6, 9, 1],
    [ 3, 8, 6],
    [10, 9, 6],
    [ 0, 2, 0],
])
centroids = initialize_centroids(points, 4)
print(centroids)
array([[10, 9, 6],
[ 3, 8, 6],
[ 6, 2, 3],
[ 1, 1, 3]])
distances = (points - centroids[:, np.newaxis])**2  # squared differences, shape (4, 10, 3)
print(distances)
array([[[ 81, 49, 4],
[ 81, 64, 9],
[ 81, 9, 16],
[ 16, 49, 9],
[ 9, 49, 9],
[ 81, 0, 0],
[ 16, 0, 25],
[ 49, 1, 0],
[ 0, 0, 0],
[100, 49, 36]],
[[ 4, 36, 4],
[ 4, 49, 9],
[ 4, 4, 16],
[ 9, 36, 9],
[ 16, 36, 9],
[ 4, 1, 0],
[ 9, 1, 25],
[ 0, 0, 0],
[ 49, 1, 0],
[ 9, 36, 36]],
[[ 25, 0, 1],
[ 25, 1, 0],
[ 25, 16, 1],
[ 0, 0, 0],
[ 1, 0, 0],
[ 25, 49, 9],
[ 0, 49, 4],
[ 9, 36, 9],
[ 16, 49, 9],
[ 36, 0, 9]],
[[ 0, 1, 1],
[ 0, 0, 0],
[ 0, 25, 1],
[ 25, 1, 0],
[ 36, 1, 0],
[ 0, 64, 9],
[ 25, 64, 4],
[ 4, 49, 9],
[ 81, 64, 9],
[ 1, 1, 9]]])
print(distances.sum(axis=2))
array([[134, 154, 106, 74, 67, 81, 41, 50, 0, 185],
[ 44, 62, 24, 54, 61, 5, 35, 0, 50, 81],
[ 26, 26, 42, 0, 1, 83, 53, 54, 74, 45],
[ 2, 0, 26, 26, 37, 73, 93, 62, 154, 11]])
# For the first point, the minimum over the 4 centroids is at index 3. For the second point, it is index 3 again.
print(np.argmin(distances.sum(axis=2), axis=0))
array([3, 3, 1, 2, 2, 1, 1, 1, 0, 3])
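Putting it together, closest_centroid returns exactly this assignment (same points and centroids as above; the np.sqrt does not change which index is smallest):
print(closest_centroid(points, centroids))
# array([3, 3, 1, 2, 2, 1, 1, 1, 0, 3])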


Every Nth Element In Convolution

Is there a faster way to do np.convolve(A, B)[::N] using Numpy? It feels wasteful to compute all the convolution outputs and then throw N - 1 out of every N away... I could do a for loop or list comprehension, but I thought it would be faster to use only native Numpy methods.
EDIT
Or does Numpy do lazy evaluation? I just saw this in a JS library; it would be awesome for Numpy as well:
// Get first 3 unique values
const arr = [1, 2, 2, 3, 3, 4, 5, 6];
const result = R.pipe(
  arr,
  R.map(x => {
    console.log('iterate', x);
    return x;
  }),
  R.uniq(),
  R.take(3)
); // => [1, 2, 3]
/**
 * Console output:
 * iterate 1
 * iterate 2
 * iterate 2
 * iterate 3
 */
A convolution is the product of your kernel and a window on your array, followed by a sum. You can achieve the same thing manually using a rolling window.
First, let's see a dummy example:
A = np.arange(30)
B = np.ones(6)
N = 3
out = np.convolve(A, B)[::N]
print(out)
output: [ 0. 6. 21. 39. 57. 75. 93. 111. 129. 147. 135. 57.]
Now we do the same with a rolling view, padding, and slicing:
from numpy.lib.stride_tricks import sliding_window_view as swv
out = (swv(np.pad(A, B.shape[0]-1), B.shape[0])[::N]*B).sum(axis=1)
print(out)
output: [ 0. 6. 21. 39. 57. 75. 93. 111. 129. 147. 135. 57.]
Intermediate sliding view:
swv(np.pad(A, B.shape[0]-1), B.shape[0])
array([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 1],
[ 0, 0, 0, 0, 1, 2],
[ 0, 0, 0, 1, 2, 3],
[ 0, 0, 1, 2, 3, 4],
[ 0, 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5, 6],
[ 2, 3, 4, 5, 6, 7],
...
[24, 25, 26, 27, 28, 29],
[25, 26, 27, 28, 29, 0],
[26, 27, 28, 29, 0, 0],
[27, 28, 29, 0, 0, 0],
[28, 29, 0, 0, 0, 0],
[29, 0, 0, 0, 0, 0]])
# with slicing
swv(np.pad(A, B.shape[0]-1), B.shape[0])[::N]
array([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 1, 2, 3],
[ 1, 2, 3, 4, 5, 6],
[ 4, 5, 6, 7, 8, 9],
[ 7, 8, 9, 10, 11, 12],
[10, 11, 12, 13, 14, 15],
[13, 14, 15, 16, 17, 18],
[16, 17, 18, 19, 20, 21],
[19, 20, 21, 22, 23, 24],
[22, 23, 24, 25, 26, 27],
[25, 26, 27, 28, 29, 0],
[28, 29, 0, 0, 0, 0]])
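One caveat: np.convolve flips the kernel, whereas a windowed multiply-and-sum is a correlation. With the all-ones B above the two coincide; for a non-symmetric kernel you would flip it first. A minimal sketch reusing A, N, and swv from above, with a hypothetical kernel C:
C = np.array([1, 2, 3])  # non-symmetric, so the flip matters
ref = np.convolve(A, C)[::N]
out = (swv(np.pad(A, C.shape[0]-1), C.shape[0])[::N] * C[::-1]).sum(axis=1)
assert np.array_equal(ref, out)  # flipping C recovers the convolution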

Multiplying numpy ndarray with 1d array

So I can see many questions on this forum asking how to multiply numpy ndarrays with a 1d ndarray over a given axis. Most of the answers suggest making use of np.newaxis to meet broadcasting requirements. Here I have a more specific issue where I'd like to multiply over axis 2, e.g.:
>>> import numpy as np
>>> x = np.arange(27).reshape((3,3,3))
>>> y = np.arange(3)
>>> z = x*y[:,np.newaxis,np.newaxis]
>>> x
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]]])
>>> y
array([0, 1, 2])
>>> z
array([[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[36, 38, 40],
[42, 44, 46],
[48, 50, 52]]])
This is the kind of multiplication I want.
However, in my case I've got dimensions along axes 0 and 1 that do not match the dimension along axis 2, e.g. when I try to implement the above for my arrays I get this:
>>> x = np.arange(144).reshape(8,6,3)
>>> z = x*y[:,np.newaxis,np.newaxis]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (8,6,3) (3,1,1)
I understand why I get this broadcasting error; my issue is that if I adjust my broadcasting, e.g. do a valid multiplication:
>>> z = x*y[np.newaxis,np.newaxis,:]
I am now not multiplying across the correct axis.
Any ideas how to address this issue?
Hi, I have managed to get it into a format along the axis similar to your first example, but I'm not sure if this is correct?
import numpy as np
test = np.arange(144).reshape(8, 6, 3)
test2 = np.arange(3)
# note: len(test.shape) is 3 (the number of dimensions, not the number of
# blocks), so this only multiplies the first 3 of the 8 blocks along axis 0
np.array([test[i] * test2[i] for i in range(len(test.shape))])
>>>array([[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0]],
[[ 18, 19, 20],
[ 21, 22, 23],
[ 24, 25, 26],
[ 27, 28, 29],
[ 30, 31, 32],
[ 33, 34, 35]],
[[ 72, 74, 76],
[ 78, 80, 82],
[ 84, 86, 88],
[ 90, 92, 94],
[ 96, 98, 100],
[102, 104, 106]]])
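For reference, here is a minimal sketch of both alignments under broadcasting (variable names follow the question; which one you want depends on whether y should scale the last axis or the first):
import numpy as np
x = np.arange(144).reshape(8, 6, 3)
# multiply along axis 2: broadcasting aligns trailing axes, so plain x * y works
y2 = np.arange(3)
z2 = x * y2  # same as x * y2[np.newaxis, np.newaxis, :]
# multiply along axis 0: y must have one entry per block, i.e. length 8
y0 = np.arange(8)
z0 = x * y0[:, np.newaxis, np.newaxis]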

How to reshape Numpy array with padded 0's

I have a Numpy array that looks like
array([1, 2, 3, 4, 5, 6, 7, 8])
and I want to reshape it to an array
array([[5, 0, 0, 6],
[0, 1, 2, 0],
[0, 3, 4, 0],
[7, 0, 0, 8]])
More specifically, I'm trying to reshape a 2D numpy array to get a 3D Numpy array to go from
array([[ 1, 2, 3, 4, 5, 6, 7, 8],
[ 9, 10, 11, 12, 13, 14, 15, 16],
[17, 18, 19, 20, 21, 22, 23, 24],
...
[ 9, 10, 11, 12, 13, 14, 15, 16],
[89, 90, 91, 92, 93, 94, 95, 96]])
to a numpy array that looks like
array([[[ 5, 0, 0, 6],
[ 0, 1, 2, 0],
[ 0, 3, 4, 0],
[ 7, 0, 0, 8]],
[[13, 0, 0, 14],
[ 0, 9, 10, 0],
[ 0, 11, 12, 0],
[15, 0, 0, 16]],
...
[[93, 0, 0, 94],
[ 0, 89, 90, 0],
[ 0, 91, 92, 0],
[95, 0, 0, 96]]])
Is there an efficient way to do this using numpy functionality, particularly vectorized?
We can make use of slicing -
def expand(a):  # a is a 2D array, one 8-element row per output block
    out = np.zeros((len(a), 4, 4), dtype=a.dtype)
    out[:, 1:3, 1:3] = a[:, :4].reshape(-1, 2, 2)  # first 4 values -> center 2x2
    out[:, ::3, ::3] = a[:, 4:].reshape(-1, 2, 2)  # last 4 values -> corners
    return out
The benefit is memory and hence performance efficiency, as only the output occupies new memory space; the steps work with views thanks to the slicing on both input and output.
Sample run -
2D input :
In [223]: a
Out[223]:
array([[ 1, 2, 3, 4, 5, 6, 7, 8],
[ 9, 10, 11, 12, 13, 14, 15, 16]])
In [224]: expand(a)
Out[224]:
array([[[ 5, 0, 0, 6],
[ 0, 1, 2, 0],
[ 0, 3, 4, 0],
[ 7, 0, 0, 8]],
[[13, 0, 0, 14],
[ 0, 9, 10, 0],
[ 0, 11, 12, 0],
[15, 0, 0, 16]]])
1D input (feed in 2D extended input with None) :
In [225]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
In [226]: expand(a[None])
Out[226]:
array([[[5, 0, 0, 6],
[0, 1, 2, 0],
[0, 3, 4, 0],
[7, 0, 0, 8]]])
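As a quick sanity check (a sketch, with expand as defined above), the inverse slicing recovers the original rows:
a = np.arange(1, 17).reshape(2, 8)
e = expand(a)
inner = e[:, 1:3, 1:3].reshape(len(a), -1)   # values 1-4 of each row
corner = e[:, ::3, ::3].reshape(len(a), -1)  # values 5-8 of each row
assert np.array_equal(np.hstack([inner, corner]), a)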

Generalisation of vector outer product: apply it to every column of a matrix

I have a matrix A = [x1, x2, ..., xm] where each xi is a column vector of size [n, 1]. So A has shape [n, m]. I am trying to find the covariance matrix of each column vector so that if the result is another matrix C, C has shape [n, n, m] and C[:,:,i] = np.outer(xi, xi).
Can someone tell me how to do the above in numpy, or point me to a tensor operation that I should check out?
So your np.outer loop produces:
In [1147]: A = np.arange(12).reshape(3,4)
In [1148]: [np.outer(A[:,i],A[:,i]) for i in range(4)]
Out[1148]:
[array([[ 0, 0, 0],
[ 0, 16, 32],
[ 0, 32, 64]]), array([[ 1, 5, 9],
[ 5, 25, 45],
[ 9, 45, 81]]), array([[ 4, 12, 20],
[ 12, 36, 60],
[ 20, 60, 100]]), array([[ 9, 21, 33],
[ 21, 49, 77],
[ 33, 77, 121]])]
Stacking that on a new first dimension produces:
In [1149]: np.stack(_)
Out[1149]:
array([[[ 0, 0, 0],
[ 0, 16, 32],
[ 0, 32, 64]],
....
[ 21, 49, 77],
[ 33, 77, 121]]])
In [1150]: _.shape
Out[1150]: (4, 3, 3) # wrong order - can be transposed.
stack lets us specify a different axis:
In [1153]: np.stack([np.outer(A[:,i],A[:,i]) for i in range(4)],2)
Out[1153]:
array([[[ 0, 1, 4, 9],
[ 0, 5, 12, 21],
[ 0, 9, 20, 33]],
[[ 0, 5, 12, 21],
[ 16, 25, 36, 49],
[ 32, 45, 60, 77]],
[[ 0, 9, 20, 33],
[ 32, 45, 60, 77],
[ 64, 81, 100, 121]]])
np.einsum does this nicely as well:
In [1151]: np.einsum('mi,ni->mni',A,A)
Out[1151]:
array([[[ 0, 1, 4, 9],
[ 0, 5, 12, 21],
[ 0, 9, 20, 33]],
[[ 0, 5, 12, 21],
[ 16, 25, 36, 49],
[ 32, 45, 60, 77]],
[[ 0, 9, 20, 33],
[ 32, 45, 60, 77],
[ 64, 81, 100, 121]]])
In [1152]: _.shape
Out[1152]: (3, 3, 4)
A broadcasted multiply is also nice:
In [1156]: A[:,None,:]*A[None,:,:]
Out[1156]:
array([[[ 0, 1, 4, 9],
[ 0, 5, 12, 21],
...
[ 32, 45, 60, 77],
[ 64, 81, 100, 121]]])
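All three approaches agree; a quick consistency check (a sketch using the same A):
A = np.arange(12).reshape(3, 4)
C1 = np.einsum('mi,ni->mni', A, A)
C2 = A[:, None, :] * A[None, :, :]
C3 = np.stack([np.outer(A[:, i], A[:, i]) for i in range(A.shape[1])], 2)
assert np.array_equal(C1, C2) and np.array_equal(C2, C3)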

balance numpy array with over-sampling

Please help me find a clean way to create a new array out of an existing one. It should be over-sampled: if the number of examples of any class is smaller than the maximum number of examples in a class, samples should be taken from the original array (it makes no difference whether randomly or sequentially).
Let's say the initial array is this:
[ 2, 29, 30, 1]
[ 5, 50, 46, 0]
[ 1, 7, 89, 1]
[ 0, 10, 92, 9]
[ 4, 11, 8, 1]
[ 3, 92, 1, 0]
the last column contains classes:
classes = [ 0, 1, 9]
the distribution of the classes is the following:
distrib = [2, 3, 1]
What I need is to create a new array with an equal number of samples of all classes, taken randomly from the original array, e.g.
[ 5, 50, 46, 0]
[ 3, 92, 1, 0]
[ 5, 50, 46, 0] # one example added
[ 2, 29, 30, 1]
[ 1, 7, 89, 1]
[ 4, 11, 8, 1]
[ 0, 10, 92, 9]
[ 0, 10, 92, 9] # two examples
[ 0, 10, 92, 9] # added
The following code does what you are after:
a = np.array([[ 2, 29, 30, 1],
              [ 5, 50, 46, 0],
              [ 1, 7, 89, 1],
              [ 0, 10, 92, 9],
              [ 4, 11, 8, 1],
              [ 3, 92, 1, 0]])
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in range(len(unq)):  # range, not Python 2's xrange
    indices = np.random.choice(np.where(unq_idx == j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]
>>> out
array([[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 4, 11, 8, 1],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
With numpy >= 1.9, the first two lines can be condensed into:
unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)
Note that, the way np.random.choice works, there is no guarantee that all rows of the original array will be present in the output, as the example above shows. If that is needed, you could do something like:
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in range(len(unq)):
    indices = np.random.choice(np.where(unq_idx == j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
out = np.vstack((a, out))
>>> out
array([[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0],
[ 5, 50, 46, 0],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
This gives a random distribution with equal probability for each class:
distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()
In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]:
array([[ 5, 50, 46, 0],
[ 4, 11, 8, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 3, 92, 1, 0],
[ 1, 7, 89, 1],
[ 1, 7, 89, 1]])
Each class has equal probability, not guaranteed equal incidence.
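If guaranteed equal incidence is required, per-class sampling (as in the first answer) is the way to go; a compact sketch, with a and distrib as defined above:
idx = np.concatenate([np.random.choice(np.flatnonzero(a[:, -1] == c), distrib.max())
                      for c in np.unique(a[:, -1])])
balanced = a[idx]  # exactly distrib.max() rows per class, sampled with replacement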
You can use the imbalanced-learn package:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
data = np.array([
    [ 2, 29, 30, 1],
    [ 5, 50, 46, 0],
    [ 1, 7, 89, 1],
    [ 0, 10, 92, 9],
    [ 4, 11, 8, 1],
    [ 3, 92, 1, 0]
])
ros = RandomOverSampler()
# fit_resample expects two arguments: a matrix of sample data and a vector of
# sample labels. In this case, the sample data is in the first three columns of
# our array and the labels are in the last column
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])
# fit_resample returns a matrix of resampled data and a vector with the
# corresponding labels. Combine them into a single matrix
resampled = np.column_stack((X_resampled, y_resampled))
print(resampled)
Output:
[[ 2 29 30 1]
[ 5 50 46 0]
[ 1 7 89 1]
[ 0 10 92 9]
[ 4 11 8 1]
[ 3 92 1 0]
[ 3 92 1 0]
[ 0 10 92 9]
[ 0 10 92 9]]
The RandomOverSampler offers different sampling strategies, but the default is to resample all classes except the majority class.
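For other targets, sampling_strategy also accepts a dict mapping class label to desired sample count (a hedged sketch reusing data from above; the counts are arbitrary):
ros = RandomOverSampler(sampling_strategy={0: 4, 1: 4, 9: 4}, random_state=0)
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])
# each class now has 4 samples; random_state makes the draw reproducible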
