Using NumPy for K-means clustering


I'm new to machine learning and want to build a k-means algorithm with k = 2, but I'm struggling to calculate the new centroids. Here is my code for k-means:
def euclidean_distance(x: np.ndarray, y: np.ndarray):
    # x shape: (N1, D)
    # y shape: (N2, D)
    # output shape: (N1, N2)
    dist = []
    for i in x:
        for j in y:
            new_list = np.sqrt(sum((i - j) ** 2))
            dist.append(new_list)
    distance = np.reshape(dist, (len(x), len(y)))
    return distance

def kmeans(x, centroids, iterations=30):
    assignment = None
    for i in iterations:
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(y)):
            centroids[c] = np.mean(x[assignment == c], 0)  # error here
    return centroids, assignment
My inputs are x = [[1., 0.], [0., 1.], [0.5, 0.5]] and y = [[1., 0.], [0., 1.]], and
the distance array looks like this:
[[0. 1.41421356]
[1.41421356 0. ]
[0.70710678 0.70710678]]
When I run kmeans(x, y), it raises this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_40086/2170434798.py in
      5
      6 for c in range(len(y)):
----> 7     centroids[c] = (x[classes == c], 0)
      8     print(centroids)
TypeError: only integer scalar arrays can be converted to a scalar index
Does anyone know how to fix it or improve my code? Thank you in advance!

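The error occurs because x is a plain Python list, and a list cannot be indexed with the boolean array assignment == c. Changing the inputs to NumPy arrays should get rid of the error: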
x = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
y = np.array([[1., 0.], [0., 1.]])
It also seems that you must change for i in iterations to for i in range(iterations) in the kmeans function, since an integer is not iterable.
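Putting both fixes together, a minimal sketch of the corrected function might look like the following. The broadcasting-based distance, the conversion with np.asarray inside kmeans, the loop over len(centroids) instead of the global y, and the rule of leaving a centroid unchanged when its cluster is empty are assumptions added here for illustration and are not part of the original code.

```python
import numpy as np

def euclidean_distance(x, y):
    # x: (N1, D), y: (N2, D) -> pairwise distances of shape (N1, N2)
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def kmeans(x, centroids, iterations=30):
    # Convert inputs so boolean-mask indexing works even for plain lists.
    x = np.asarray(x, dtype=float)
    centroids = np.array(centroids, dtype=float)  # copy, caller's data is untouched
    assignment = None
    for _ in range(iterations):
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(centroids)):
            members = x[assignment == c]
            # Assumption: keep the old centroid if no point is assigned to it.
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assignment

x = [[1., 0.], [0., 1.], [0.5, 0.5]]
y = [[1., 0.], [0., 1.]]
centroids, assignment = kmeans(x, y)
print(centroids)
print(assignment)
```

With the inputs from the question, the third point is equidistant from both starting centroids, and np.argmin resolves the tie in favour of the first one.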

Related

Convert array to a single float in Python

I am trying to write a function that estimates the data noise (σ²) from three NumPy arrays: one augmented X matrix and two vectors, the y target and the MAP weights.
The function should return the empirical data noise estimate, σ².
I have the following function:
def estimDS(X, output_y, W):
    n = X.shape[0]  # number of observations (rows)
    d = X.shape[1]  # number of features (columns)
    matmul = np.matmul(X, W)  # use the arguments, not the globals aug_x / ml_weights
    mult_left = (1/(n-d))
    mult_right = (output_y-matmul)**2
    estimDS = mult_left * mult_right
    return estimDS
And this is an example on which I run the function:
output_y = np.array([208500, 181500, 223500,
                     140000, 250000, 143000,
                     307000, 200000, 129900,
                     118000])
aug_x = np.array([[1., 1710., 2003.],
                  [1., 1262., 1976.],
                  [1., 1786., 2001.],
                  [1., 1717., 1915.],
                  [1., 2198., 2000.],
                  [1., 1362., 1993.],
                  [1., 1694., 2004.],
                  [1., 2090., 1973.],
                  [1., 1774., 1931.],
                  [1., 1077., 1939.]])
W = np.array([-2.29223802e+06, 5.92536529e+01, 1.20780450e+03])
sig2 = estimDS(aug_x, output_y, W)
print(sig2)
The function returns an array, but I need the result as a single float, 3700666577282.7227:
[5.61083809e+07 2.17473754e+07 6.81288433e+06 4.40198178e+07
1.86225354e+06 3.95549405e+08 8.78575426e+08 3.04530677e+07
3.32164594e+07 2.87861673e+06]

You forgot to sum the squared residuals over i = 1 to n. mult_right should therefore be defined as:
mult_right = np.sum((output_y - matmul)**2, axis=0)
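As a small sketch of the whole estimator with the sum in place: the function name estimate_noise_variance and the three-row demo data below are made up for illustration, and d is taken to be the number of columns of the augmented matrix, as in the question.

```python
import numpy as np

def estimate_noise_variance(X, y, w):
    # sigma^2 = 1/(n - d) * sum_i (y_i - x_i . w)^2
    n, d = X.shape
    residuals = y - X @ w
    # float(...) turns the NumPy scalar into a plain Python float
    return float(np.sum(residuals ** 2) / (n - d))

# Tiny hypothetical data set, small enough to check by hand:
# predictions are [1, 2, 3], so the residuals are [0, 0, 1].
X_demo = np.array([[1., 0.], [1., 1.], [1., 2.]])
y_demo = np.array([1., 2., 4.])
w_demo = np.array([1., 1.])
sigma2 = estimate_noise_variance(X_demo, y_demo, w_demo)
print(sigma2)
```

Here n - d = 1 and the squared residuals sum to 1, so the estimate is 1.0.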

How to create a NumPy array with an extra dimension depending on a where clause?

Problem
I need to create an array that takes the argmax along each row and fills the position of the maximum with [1,0], while all positions that are not the maximum are filled with [0,1].
Example:
Given the vector a:
a.shape = (3,2)
a = np.array([[1,0],[1,2],[1,3]])
Return the vector b:
b.shape = (3,2,2)
b = np.array([[[1,0],[0,1]],[[0,1],[1,0]],[[0,1],[1,0]]])

c = np.argmax(a, axis=1)
b = np.empty(tuple(list(a.shape) + [2]))
b[range(len(c)), c, :] = [1, 0]
b[range(len(c)), ~c, :] = [0, 1]
b
>>> array([[[1., 0.],
            [0., 1.]],
           [[0., 1.],
            [1., 0.]],
           [[0., 1.],
            [1., 0.]]])
Note that this only works in this example because the argmax can only ever be 0 or 1, so ~c (which is -1 or -2) happens to select the other column. If the second dimension of a is greater than 2, I don't think this solution will work.
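For rows with more than two entries, one possible generalization is to build a boolean mask of the argmax positions and stack it with its negation. This is only a sketch: the function name argmax_pairs and the three-column test row are hypothetical, and ties are resolved the way np.argmax does (first maximum wins).

```python
import numpy as np

def argmax_pairs(a):
    # Column index of the maximum in each row, shape (N,)
    c = np.argmax(a, axis=1)
    # Boolean mask of shape (N, K): True only at the argmax position
    is_max = np.arange(a.shape[1]) == c[:, None]
    # Last axis holds [is_max, not is_max], i.e. [1, 0] at the max and [0, 1] elsewhere
    return np.stack([is_max, ~is_max], axis=-1).astype(float)

a = np.array([[1, 0], [1, 2], [1, 3]])
b = argmax_pairs(a)
print(b.shape)

wide = argmax_pairs(np.array([[1, 5, 2]]))
print(wide)
```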

I was able to create a function that returns the desired result, but it will only work for two classes. It could be adapted for multiple classes:
a = np.array([[1,0],[1,2],[1,3]])

def create_dist_prob_target(arr):
    # arr has shape (N, 2); np.squeeze(arr, axis=1) would raise here because axis 1 has size 2
    p_ = arr
    a = np.expand_dims(np.where((p_ == np.amax(p_,axis = 1)[:,None]),1,0),axis=-1)
    b = np.expand_dims(np.where((p_ == np.amax(p_,axis = 1)[:,None]),0,1),axis=-1)
    return np.concatenate((a,b),axis=2)

b = create_dist_prob_target(a)
print(b)

Need help understanding the gradient function in pytorch

The following code
w = np.array([[2., 2.],[2., 2.]])
x = np.array([[3., 3.],[3., 3.]])
b = np.array([[4., 4.],[4., 4.]])
w = torch.tensor(w, requires_grad=True)
x = torch.tensor(x, requires_grad=True)
b = torch.tensor(b, requires_grad=True)
y = w*x + b
print(y)
# tensor([[10., 10.],
# [10., 10.]], dtype=torch.float64, grad_fn=<AddBackward0>)
y.backward(torch.FloatTensor([[1, 1],[ 1, 1]]))
print(w.grad)
# tensor([[3., 3.],
# [3., 3.]], dtype=torch.float64)
print(x.grad)
# tensor([[2., 2.],
# [2., 2.]], dtype=torch.float64)
print(b.grad)
# tensor([[1., 1.],
# [1., 1.]], dtype=torch.float64)
Since the tensor argument passed to the gradient function is an all-ones tensor with the shape of the input tensor, my understanding is that:
1. w.grad is the derivative of y w.r.t. w, and should produce b,
2. x.grad is the derivative of y w.r.t. x, and should produce b, and
3. b.grad is the derivative of y w.r.t. b, and should produce all ones.
Of these, only point 3 matches my expectation. Can someone help me understand the first two results? I think I understand the accumulation part, but I don't think that is what is happening here.

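To find the correct derivatives in this example, we need to take the sum rule and the product rule into consideration.
Sum rule: $(f + g)' = f' + g'$
Product rule: $(f \cdot g)' = f' \cdot g + f \cdot g'$
That means the derivatives of your equation $y = w \cdot x + b$ are calculated as follows.
With respect to x: $\partial y / \partial x = w$
With respect to w: $\partial y / \partial w = x$
With respect to b: $\partial y / \partial b = 1$
The gradients reflect exactly that: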
torch.equal(w.grad, x) # => True
torch.equal(x.grad, w) # => True
torch.equal(b.grad, torch.tensor([[1, 1], [1, 1]], dtype=torch.float64)) # => True
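The same result can be cross-checked without autograd. Calling y.backward with an all-ones tensor differentiates the scalar sum(y), so a central finite difference on that sum should reproduce the three gradients. The sketch below uses plain NumPy; the helper names total and numerical_grad and the step size h are assumptions made for this illustration.

```python
import numpy as np

w = np.full((2, 2), 2.0)
x = np.full((2, 2), 3.0)
b = np.full((2, 2), 4.0)

def total(w, x, b):
    # backward(ones) corresponds to differentiating sum(y * ones) = sum(y)
    return np.sum(w * x + b)

def numerical_grad(f, arr, h=1e-6):
    # Central difference, perturbing one element at a time
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        up = arr.copy()
        up[idx] += h
        down = arr.copy()
        down[idx] -= h
        grad[idx] = (f(up) - f(down)) / (2 * h)
    return grad

grad_w = numerical_grad(lambda t: total(t, x, b), w)  # close to x (all 3s)
grad_x = numerical_grad(lambda t: total(w, t, b), x)  # close to w (all 2s)
grad_b = numerical_grad(lambda t: total(w, x, t), b)  # close to all 1s
print(grad_w, grad_x, grad_b, sep="\n")
```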

Create identity matrices with arbitrary shape with numpy

Is there a faster / inbuilt way to generate identity matrices with arbitrary shape in the first dimensions and an identity in the last m dimensions?
import numpy as np

base_shape = (10, 11, 12)
n_dim = 4

# m = 2
frames2d = np.zeros(base_shape + (n_dim, n_dim))
for i in range(n_dim):
    frames2d[..., i, i] = 1

# m = 3
frames3d = np.zeros(base_shape + (n_dim, n_dim, n_dim))
for i in range(n_dim):
    frames3d[..., i, i, i] = 1

Approach #1
We can leverage np.einsum to get a view of the diagonal, inspired by this post, and assign 1s there to get the desired output. So, for the m=3 case, after initializing with zeros, we can simply do -
diag_view = np.einsum('...iii->...i',frames3d)
diag_view[:] = 1
Generalizing to include those input params, it would be -
def ndeye_einsum(base_shape, n_dim, m):
    out = np.zeros(list(base_shape) + [n_dim]*m)
    diag_view = np.einsum('...'+'i'*m+'->...i',out)
    diag_view[:] = 1
    return out
So, to reproduce those same arrays, it would be -
frames2d = ndeye_einsum(base_shape, n_dim, m=2)
frames3d = ndeye_einsum(base_shape, n_dim, m=3)
Approach #2
Again, from the same linked post, we can also reshape to 2D and assign into a step-sized slice along the columns, like so -
def ndeye_reshape(base_shape, n_dim, m):
    # N is the flat-index stride between consecutive diagonal elements
    N = (n_dim**np.arange(m)).sum()
    out = np.zeros(list(base_shape) + [n_dim]*m)
    out.reshape(-1,n_dim**m)[:,::N] = 1
    return out
This again works on a view and hence should be as efficient as approach #1.
Approach #3
Another way would be to use integer-based indexing. So, for example for assigning into frames3d in one-go, it would be -
I = np.arange(n_dim)
frames3d[..., I, I, I] = 1
Generalizing that becomes -
def ndeye_ellipsis_indexer(base_shape, n_dim, m):
    I = np.arange(n_dim)
    indexer = tuple([Ellipsis]+[I]*m)
    out = np.zeros(list(base_shape) + [n_dim]*m)
    out[indexer] = 1
    return out
Extending to higher-dims with view
The dims along base_shape are basically replications of the elements in the last m dims. As such, we can get those higher dims as a view with np.broadcast_to: we create an m-dim identity array and then broadcast it as a view into the higher dims. This is applicable to all three approaches posted earlier. To demonstrate how to use it with the einsum-based solution, we would have -
# Create m-dim "trailing-base" array, basically an m-dim identity array
def ndeye_einsum_trailingbase(n_dim, m):
    out = np.zeros([n_dim]*m)
    diag_view = np.einsum('i'*m+'->...i',out)
    diag_view[:] = 1
    return out

def ndeye_einsum_view(base_shape, n_dim, m):
    trail_base = ndeye_einsum_trailingbase(n_dim, m)
    return np.broadcast_to(trail_base, list(base_shape) + [n_dim]*m)
Thus, again we would have, e.g. -
frames3d = ndeye_einsum_view(base_shape, n_dim, m=3)
This would be a view into an m-dim array and hence efficient in both memory and performance. Note that np.broadcast_to returns a read-only view, so copy it if you need to modify the result.
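A quick way to convince yourself that these approaches agree with the original loop is to compare them on a small case. The sketch below checks the integer-indexing variant and the broadcast view for m = 3; the smaller base_shape and the variable names are chosen only for this illustration.

```python
import numpy as np

base_shape = (2, 3)  # deliberately smaller than in the question
n_dim = 4

# Reference: the loop from the question (m = 3)
expected = np.zeros(base_shape + (n_dim, n_dim, n_dim))
for i in range(n_dim):
    expected[..., i, i, i] = 1

# Integer-array indexing on the full array
I = np.arange(n_dim)
indexed = np.zeros(base_shape + (n_dim,) * 3)
indexed[..., I, I, I] = 1

# m-dim identity broadcast into the leading dims as a view
core = np.zeros((n_dim,) * 3)
core[I, I, I] = 1
view = np.broadcast_to(core, base_shape + core.shape)

same_indexed = np.array_equal(indexed, expected)
same_view = np.array_equal(view, expected)
view_is_writeable = view.flags.writeable  # broadcast views are read-only
print(same_indexed, same_view, view_is_writeable)
```

If a writeable array is needed, calling view.copy() materializes the full array at the cost of the extra memory.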

One way to get an identity matrix along the last two dimensions of the array is to use np.broadcast_to and specify the shape the resulting ndarray should have (this does not generalize to higher dimensions):
base_shape = (10, 11, 12)
n_dim = 4
frame2d = np.broadcast_to(np.eye(n_dim), base_shape+(n_dim,)*2)
print(frame2d.shape)
# (10, 11, 12, 4, 4)
print(frame2d)
array([[[[[1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 1., 0.],
          [0., 0., 0., 1.]],
         [[1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 1., 0.],
          [0., 0., 0., 1.]],
...

What is the derivative of Shannon's Entropy?

I have the following simple Python function that calculates the entropy of a single input X according to Shannon's information theory:
import numpy as np
def entropy(X: np.ndarray):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies/X.shape[0]
    return -np.sum(probabilities*np.log2(probabilities))
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"entropy(a): {entropy(a)}")
print(f"entropy(b): {entropy(b)}")
print(f"entropy(c): {entropy(c)}")
With the output being the following:
entropy(a): 1.4591479170272446
entropy(b): 1.0
entropy(c): -0.0
However, I also need to calculate the derivative over dx:
d entropy / dx
This is not an easy task since the main formula
-np.sum(probabilities*np.log2(probabilities))
takes in probabilities, not x values, therefore it is not clear how to differentiate over dx.
Does anyone have an idea on how to do this?

One way to solve this is to use finite differences to compute the derivative numerically.
In this context, we can define a small constant to help us compute the numerical derivative. The function below takes a one-argument function f and approximates its derivative at the input x:
ε = 1e-12
def derivative(f, x):
    return (f(x + ε) - f(x)) / ε
To make our work easier, let us define a function that computes the innermost operation of the entropy:
def inner(x):
    return x * np.log2(x)
Recall that the derivative of the sum is the sum of derivatives. Therefore, the real derivative computation takes place in the inner function we just defined.
So, the numerical derivative of the entropy is:
def numerical_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([derivative(inner, p) for p in probabilities])
Can we do better? Of course we can! The key insight here is the product rule: (f g)' = fg' + gf', where f=x and g=np.log2(x). (Also notice that d[log_a(x)]/dx = 1/(x ln(a)).)
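Written out for a single term, with f = p and g = log2(p), the product rule gives the following (p here stands for one entry of probabilities):

```latex
\frac{d}{dp}\bigl[p \log_2 p\bigr]
  = 1 \cdot \log_2 p + p \cdot \frac{1}{p \ln 2}
  = \log_2 p + \frac{1}{\ln 2},
\qquad
\sum_i \frac{\partial H}{\partial p_i}
  = -\sum_i \left( \log_2 p_i + \frac{1}{\ln 2} \right).
```

The second expression is the quantity summed over all probabilities, with H denoting the entropy.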
So, the analytical derivative of the entropy can be computed as:
import math
def dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([(1/math.log(2, math.e) + np.log2(p)) for p in probabilities])
Using the sample vectors for testing, we have:
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"numerical d[entropy(a)]: {numerical_dentropy(a)}")
print(f"numerical d[entropy(b)]: {numerical_dentropy(b)}")
print(f"numerical d[entropy(c)]: {numerical_dentropy(c)}")
print(f"analytical d[entropy(a)]: {dentropy(a)}")
print(f"analytical d[entropy(b)]: {dentropy(b)}")
print(f"analytical d[entropy(c)]: {dentropy(c)}")
Which, when executed, gives us:
numerical d[entropy(a)]: 0.8417710972707937
numerical d[entropy(b)]: -0.8854028621385623
numerical d[entropy(c)]: -1.4428232973189605
analytical d[entropy(a)]: 0.8418398787754222
analytical d[entropy(b)]: -0.8853900817779268
analytical d[entropy(c)]: -1.4426950408889634
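The numerical and analytical values above differ from about the fourth decimal place onward. That is typical of a forward difference with a step as small as 1e-12, where floating-point rounding dominates. As a hedged sketch, a central difference with a larger step usually tracks the analytical value much more closely; the step h = 1e-6 and the function names below are assumptions chosen for this illustration.

```python
import numpy as np

def inner(p):
    return p * np.log2(p)

def central_derivative(f, x, h=1e-6):
    # Symmetric difference: truncation error shrinks with h**2
    return (f(x + h) - f(x - h)) / (2 * h)

def central_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([central_derivative(inner, p) for p in probabilities])

def analytical_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    # d/dp [p * log2(p)] = log2(p) + 1/ln(2)
    return -np.sum(np.log2(probabilities) + 1 / np.log(2))

samples = [
    np.array([1., 1., 1., 3., 3., 2.]),
    np.array([1., 1., 1., 3., 3., 3.]),
    np.array([1., 1., 1., 1., 1., 1.]),
]
gaps = [abs(central_dentropy(s) - analytical_dentropy(s)) for s in samples]
print(gaps)
```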
As a bonus, we can test whether this is correct with an automatic differentiation library:
import torch
a, b, c = torch.from_numpy(a), torch.from_numpy(b), torch.from_numpy(c)
def torch_entropy(X):
    _, frequencies = torch.unique(X, return_counts=True)
    frequencies = frequencies.type(torch.float32)
    probabilities = frequencies / X.shape[0]
    probabilities.requires_grad_(True)
    return -(probabilities * torch.log2(probabilities)).sum(), probabilities

for v in a, b, c:
    h, p = torch_entropy(v)
    print(f'torch entropy: {h}')
    h.backward()
    print(f'torch derivative: {p.grad.sum()}')
Which gives us:
torch entropy: 1.4591479301452637
torch derivative: 0.8418397903442383
torch entropy: 1.0
torch derivative: -0.885390043258667
torch entropy: -0.0
torch derivative: -1.4426950216293335
