K Means in Python from Scratch

I have some Python code for a k-means algorithm.
I am having a hard time understanding what it does.
Lines like C = X[numpy.random.choice(X.shape[0], k, replace=False), :] are very confusing to me.
Could someone explain what this code is actually doing?
Thank you
def k_means(data, k, num_of_features):
    # Make a matrix out of the data
    X = data.as_matrix()
    # Get k random points from the data
    C = X[numpy.random.choice(X.shape[0], k, replace=False), :]
    # Remove the last col
    C = [C[j][:-1] for j in range(len(C))]
    # Turn it into a numpy array
    C = numpy.asarray(C)
    # To store the value of centroids when it updates
    C_old = numpy.zeros(C.shape)
    # Make an array that will assign clusters to each point
    clusters = numpy.zeros(len(X))
    # Error func. - Distance between new centroids and old centroids
    error = dist(C, C_old, None)
    # Loop will run till the error becomes zero or the tries limit is reached
    tries = 0
    while error != 0 and tries < 1:
        # Assigning each value to its closest cluster
        for i in range(len(X)):
            # Get closest cluster in terms of distance
            clusters[i] = dist1(X[i][:-1], C)
        # Storing the old centroid values
        C_old = deepcopy(C)
        # Finding the new centroids by taking the average value
        for i in range(k):
            # Get all of the points that match the cluster you are on
            points = [X[j][:-1] for j in range(len(X)) if clusters[j] == i]
            # If there were no points assigned to cluster, put at origin
            if not points:
                C[i][:] = numpy.zeros(C[i].shape)
            else:
                # Get the average of all the points and put that centroid there
                C[i] = numpy.mean(points, axis=0)
        # Error is the distance between where the centroids used to be and where they are now
        error = dist(C, C_old, None)
        # Increase tries
        tries += 1
    return sil_coefficient(X,clusters,k)

(Expanded answer, will format later)
X is the data, as a matrix.
The [] notation takes slices, or selects single elements, from the matrix. You may want to review numpy array indexing: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
numpy.random.choice(X.shape[0], k, replace=False) picks k distinct integers at random from 0 to X.shape[0] - 1, that is, k row indices of the data matrix, sampled without replacement.
Notice that the [] indexing has two entries: the numpy.random.choice call and ":".
":" indicates that we are taking everything along that axis.
Thus, X[numpy.random.choice(X.shape[0], k, replace=False), :] selects k positions along the first axis and, for each of them, takes every element along the second axis. Effectively, we are selecting k random rows of the matrix, which become the initial centroids.
(The comments explain this code quite well; I would suggest you read up on numpy indexing and list comprehensions for further elucidation.)
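As a small illustration of the indexing described above, here is a sketch on a made-up 5-by-4 array (the array contents and the value of k are assumptions chosen for the example, not taken from the question's data):

```python
import numpy as np

# Toy data: 5 rows (points) and 4 columns; the last column stands in
# for the extra column that the original code strips off.
X = np.arange(20).reshape(5, 4)
k = 2

# k distinct row indices drawn from 0 .. X.shape[0] - 1
row_idx = np.random.choice(X.shape[0], k, replace=False)

# Fancy indexing: the chosen rows, every column -> shape (k, 4)
C = X[row_idx, :]

# Same effect as the list comprehension [C[j][:-1] for j in range(len(C))]
C = C[:, :-1]  # drop the last column -> shape (k, 3)

print(row_idx)
print(C)
```

The slice C[:, :-1] does in one step what the list comprehension plus numpy.asarray do in the original code.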
C = [C[j][:-1] for j in range(len(C))]
The right-hand side uses a list comprehension to build a new list from the rows of the matrix C.
C[j] is the jth row of the matrix C.
We use the [:-1] to take up to, but not including, the final element of the row. We do this for each row in the matrix C. This removes the last column of the matrix.
C = numpy.asarray(C). This converts the list of rows back into a numpy array so we can do special numpy things with it.
C_old = numpy.zeros(C.shape). This creates a zero matrix of the same size as C. We are initializing this array to be populated later.
clusters = numpy.zeros(len(X)). This creates a zero vector whose length is the number of rows in the matrix X. Again, we are initializing this array to be populated later.
error = dist(C, C_old, None). Take the distance between the two matrices. I believe this function to be defined elsewhere in your script.
tries = 0. Set the tries counter to 0.
while ...: repeat this block while the condition is true.
for i in [0...(number of rows in X - 1)]:
clusters[i] = dist1(X[i][:-1], C). Store, in the ith position of clusters, which cluster the ith row of X is closest to.
C_old = deepcopy(C). Create a new, independent copy of C. Don't just copy a reference.
for each i in (0..number of means - 1):
points = [X[j][:-1] for j in range(len(X)) if clusters[j] == i]. This is a list comprehension. Create a list of the rows of X, with all but the last entry, but only include a row if it belongs to the ith cluster.
if not points. If nothing belongs to the cluster:
C[i][:] = numpy.zeros(C[i].shape). Overwrite the ith row of the centroid matrix, C, with zeros, which places that centroid at the origin.
else:
C[i] = numpy.mean(points, axis=0). Set the ith row of the centroid matrix, C, to the average point in the cluster. The mean is taken over the rows (axis=0), giving one average per column. This is us updating our centroids.
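To see the assignment and update steps together, here is a minimal sketch of the same loop. The helpers closest_centroid and centroid_shift are hypothetical stand-ins for dist1 and dist, whose definitions are not shown in the question; Euclidean distance is an assumption, and the function takes the feature columns only (the last column already removed).

```python
import numpy as np

def closest_centroid(point, centroids):
    """Hypothetical stand-in for dist1: index of the nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

def centroid_shift(new, old):
    """Hypothetical stand-in for dist(C, C_old, None): total movement."""
    return float(np.linalg.norm(new - old))

def k_means_sketch(features, k, max_tries=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Start from k distinct rows of the data, as in the original code
    start = rng.choice(len(features), k, replace=False)
    centroids = features[start].astype(float)
    clusters = np.zeros(len(features), dtype=int)
    for _ in range(max_tries):
        # Assignment step: nearest centroid for every point
        clusters = np.array([closest_centroid(p, centroids) for p in features])
        old = centroids.copy()
        # Update step: move each centroid to the mean of its members
        for i in range(k):
            members = features[clusters == i]
            if len(members) == 0:
                centroids[i] = 0  # empty cluster goes to the origin, as in the question
            else:
                centroids[i] = members.mean(axis=0)
        if centroid_shift(centroids, old) == 0:
            break  # centroids stopped moving
    return centroids, clusters

# Made-up data: two well-separated groups of three points each
features = np.array([[0, 0], [0, 1], [1, 0],
                     [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, clusters = k_means_sketch(features, 2)
print(clusters)
```

Unlike the question's function, this sketch returns the centroids and labels instead of a silhouette score, since sil_coefficient is also defined elsewhere.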

Related

How to only iterate over one argument of a matrix array if both have the same variable in python?

I am trying to eliminate some non zero entries in a matrix where the 2 adjacent diagonals to the main diagonal are nonzero.
h = np.zeros((n**2,n**2))
for i in np.arange(0, n**2):
    for j in np.arange(0,n**2):
        if(i==j):
            for i in np.arange(0,n**2,n):
                h[i,j-1] = 0
print(h)
I want it to only eliminate the lower triangle non-zero entries, but it's erasing some entries in the upper triangle. I know this is because on the last if statement with the for loop, it is iterating for both arguments of the array, when I only want it to iterate for the first argument i, but since I set i=j, it runs for both.
The matrix I want to obtain is the following:
[Image: desired matrix]
PS: sorry for the extremely bad question format, this is my first question.

hamiltonian = np.zeros((n**2,n**2)) # store the Hamiltonian
for i in np.arange(0, n**2):
    for j in np.arange(0,n**2):
        if abs(i-j) == 1:
            hamiltonian[i,j] = 1

Is this what you are looking for?:
hamiltonian[0,1] = 1
hamiltonian[n**2-1,n**2-2] = 1
for i in np.arange(1, n**2-1):
    hamiltonian[i,i+1] = 1
    hamiltonian[i,i-1] = 1
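If the goal is the matrix the loops above build (ones on the two diagonals adjacent to the main diagonal, zeros elsewhere), it can also be written without loops. This is a sketch under that assumption; n = 3 is an arbitrary example value, and it does not attempt to reproduce the asker's desired matrix, which was only shown as an image.

```python
import numpy as np

n = 3
size = n**2

# np.eye with k=1 puts ones on the first superdiagonal,
# k=-1 puts them on the first subdiagonal.
hamiltonian = np.eye(size, k=1) + np.eye(size, k=-1)

print(hamiltonian)
```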

Write a function to calculate a unit vector

The homework problem is written as follows:
Write a function called unitVec that determines a unit vector in the direction of the line that connects two points (A and B) in space. The function should take as input two vectors (lists), each with the coordinates of a point in space. The output should be a vector (list) with the components of the unit vector in the direction from A to B. If points A and B have two coordinates each (i.e., they lie in the x y plane), the output vector should have two elements. If points A and B have three coordinates each (i.e., they lie in general space), the output vector should have three elements.
I have basically the entire code written but cannot for the life of me figure out how to square each element in the list called connects[].
To calculate a unit vector the program will subtract the elements in vector B with the corresponding elements in vector A and create a new list (connects[]) with these values. Then each of these elements needs to be squared and they all need to be added together. Then the square root will be taken of this number and each element in connects[] will be divided by this number and stored in a new list which will be the unit vector.
I'm trying to add the squares of elements in connects[] by using the line
add = add + (connects[i]**2)
but I know this only returns the list twice. The rest of my code is fine I just need help squaring these elements.
from math import *
vecA = []
vecB = []
unitV = []
connects = []
vec = []
elements = int(input("How many elements will your vectors contain?"))
for i in range(0,elements):
    A = float(input("Enter element for vector A:"))
    vecA.append(A)
    B = float(input("Enter element for vector B:"))
    vecB.append(B)
def unitVec(vecA,vecB):
    for i in range(0,elements):
        unit = 0
        add = 0
        connect = vecB[i] - vecA[i]
        connects.append(connect)
        add = add + (connects[i]**2)
        uVec = sqrt(add)
        result = connects[i]/uVec
        unitV.append(result)
    return unitV
print("The unit vector connecting your two vectors is:",unitVec(vecA,vecB))

You need to change your function to the following:
def unitVec(vecA,vecB):
    add = 0
    for i in range(0, elements):
        unit = 0
        connect = vecB[i] - vecA[i]
        connects.append(connect)
        add = add + (connect**2)
    uVec = sqrt(add)
    unitV = [val/uVec for val in connects]
    return unitV
You cannot do everything in a single for loop, since you need to add all the differences before being able to get the square root. Then you can divide the differences by this uVec.
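The same two-pass idea can be sketched as a self-contained function that does not rely on the global lists or on elements. The name unit_vec and the zero-length check are additions for illustration, not part of the homework statement.

```python
from math import sqrt

def unit_vec(vec_a, vec_b):
    """Unit vector pointing from point A to point B (lists of equal length)."""
    # First pass: component-wise differences B - A
    connects = [b - a for a, b in zip(vec_a, vec_b)]
    # The length needs the sum over all components, so compute it after the differences
    length = sqrt(sum(c**2 for c in connects))
    if length == 0:
        raise ValueError("A and B coincide, so the direction is undefined")
    # Second pass: divide every component by the length
    return [c / length for c in connects]

print(unit_vec([0, 0], [3, 4]))        # [0.6, 0.8]
print(unit_vec([1, 1, 1], [1, 1, 3]))  # [0.0, 0.0, 1.0]
```

Because zip pairs the coordinates, the function works for two or three coordinates without asking for the number of elements.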

A Python list is a general-purpose container, and its arithmetic operations differ from vector operations. For example, [1,2,3]*2 replicates the list rather than multiplying it by a scalar, so the result is [1,2,3,1,2,3] instead of [2,4,6].
I would use a numpy array, which is designed for numerical data and provides vector operations.
import numpy as np
a = [1,2,3]
# convert python list into numpy array
b = np.array(a)
# vector magnitude
magnitude = np.sqrt((b**2).sum()) # sqrt( sum(b_i^2))
# or
magnitude = (b**2).sum()**0.5 # sqrt( sum(b_i^2))
# unit vector calculation
unit_b = b/magnitude

How to generate a matrix with random entries and with constraints on row and columns?

How can I generate a matrix whose entries are random real numbers between zero and one inclusive, with the additional constraint that the sum of each row and the sum of each column must be less than or equal to one?
Example:
matrix = [0.3, 0.4, 0.2;
0.7, 0.0, 0.3;
0.0, 0.5, 0.1]

If you want a matrix that is uniformly distributed and fulfills those constraints, you probably need a rejection method. In Matlab it would be:
n = 3;
done = false;
while ~done
    matrix = rand(n);
    done = all(sum(matrix,1)<=1) & all(sum(matrix,2)<=1);
end
Note that this will be slow for large n.

If you're looking for a Python way, this is simply a transcription of Luis Mendo's rejection method. For simplicity, I'll be using NumPy:
import numpy as np
n = 3
done = False
while not done:
    matrix = np.random.rand(n,n)
    done = np.all(np.logical_and(matrix.sum(axis=0) <= 1, matrix.sum(axis=1) <= 1))
If you don't have NumPy, then you can generate your 2D matrix as a list of lists instead:
import random
n = 3
done = False
while not done:
    # Create matrix as a list of lists
    matrix = [[random.random() for _ in range(n)] for _ in range(n)]
    # Compute the row sums and check for each to be <= 1
    row_sums = [sum(matrix[i]) <= 1 for i in range(n)]
    # Compute the column sums and check for each to be <= 1
    col_sums = [sum([matrix[j][i] for j in range(n)]) <= 1 for i in range(n)]
    # Only quit if all row and column sums are at most 1
    done = all(row_sums) and all(col_sums)

The rejection method will surely give you a uniform solution, but it might take a long time to generate a good matrix, especially if your matrix is large. So another, but more tedious approach is to generate each element such that the sum can only be 1 in each direction. For this you always generate a new element between 0 and the remainder until 1:
n = 3
matrix = zeros(n+1); %dummy line in first row/column
for k1=2:n+1
    for k2=2:n+1
        matrix(k1,k2)=rand()*(1-max(sum(matrix(k1,1:k2-1)),sum(matrix(1:k1-1,k2))));
    end
end
matrix = matrix(2:end,2:end)
It's a bit tricky because for each element you check the row-sum and column-sum until that point, and use the larger of the two for generating a new element (in order to stay below a sum of 1 in both directions). For practical reasons I padded the matrix with a zero line and column at the beginning to avoid indexing problems with k1-1 and k2-1.
Note that, as @LuisMendo pointed out, this will have a different distribution than the rejection method. But if your constraints do not consider the distribution, this could do as well (and this will give you a matrix from a single run).
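For readers working in Python, the element-by-element approach can be sketched with NumPy as follows. This is a rough port of the idea rather than a tested equivalent of the Matlab code; the function name sequential_fill is made up, and zero-based indexing removes the need for the padding row and column.

```python
import numpy as np

def sequential_fill(n, rng=None):
    """Fill an n-by-n matrix entry by entry so row and column sums stay <= 1."""
    rng = np.random.default_rng() if rng is None else rng
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            row_so_far = matrix[i, :j].sum()   # entries already placed in this row
            col_so_far = matrix[:i, j].sum()   # entries already placed in this column
            # Room left before either sum would exceed 1 (clamped against rounding)
            room = max(0.0, 1.0 - max(row_so_far, col_so_far))
            matrix[i, j] = rng.random() * room
    return matrix

m = sequential_fill(3)
print(m)
print(m.sum(axis=0), m.sum(axis=1))
```

As with the Matlab version, later entries tend to be smaller than earlier ones, so the result is not uniformly distributed over the valid matrices.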

Generate "random" matrix of certain rank over a fixed set of elements

I'd like to generate matrices of size mxn and rank r, with elements coming from a specified finite set, e.g. {0,1} or {1,2,3,4,5}. I want them to be "random" in some very loose sense of that word, i.e. I want to get a variety of possible outputs from the algorithm with distribution vaguely similar to the distribution of all matrices over that set of elements with the specified rank.
In fact, I don't actually care that it has rank r, just that it's close to a matrix of rank r (measured by the Frobenius norm).
When the set at hand is the reals, I've been doing the following, which is perfectly adequate for my needs: generate matrices U of size mxr and V of nxr, with elements independently sampled from e.g. Normal(0, 2). Then U V' is an mxn matrix of rank r (well, <= r, but I think it's r with high probability).
If I just do that and then round to binary / 1-5, though, the rank increases.
It's also possible to get a lower-rank approximation to a matrix by doing an SVD and taking the first r singular values. Those values, though, won't lie in the desired set, and rounding them will again increase the rank.
This question is related, but accepted answer isn't "random," and the other answer suggests SVD, which doesn't work here as noted.
One possibility I've thought of is to make r linearly independent row or column vectors from the set and then get the rest of the matrix by linear combinations of those. I'm not really clear, though, either on how to get "random" linearly independent vectors, or how to combine them in a quasirandom way after that.
(Not that it's super-relevant, but I'm doing this in numpy.)
Update: I've tried the approach suggested by EMS in the comments, with this simple implementation:
real = np.dot(np.random.normal(0, 1, (10, 3)), np.random.normal(0, 1, (3, 10)))
bin = (real > .5).astype(int)
rank = np.linalg.matrix_rank(bin)
niter = 0
while rank > des_rank:
    cand_changes = np.zeros((21, 5))
    for n in range(20):
        i, j = random.randrange(5), random.randrange(5)
        v = 1 - bin[i,j]
        x = bin.copy()
        x[i, j] = v
        x_rank = np.linalg.matrix_rank(x)
        cand_changes[n,:] = (i, j, v, x_rank, max((rank + 1e-4) - x_rank, 0))
    cand_changes[-1,:] = (0, 0, bin[0,0], rank, 1e-4)
    cdf = np.cumsum(cand_changes[:,-1])
    cdf /= cdf[-1]
    i, j, v, rank, score = cand_changes[np.searchsorted(cdf, random.random()), :]
    bin[i, j] = v
    niter += 1
    if niter % 1000 == 0:
        print(niter, rank)
It works quickly for small matrices but falls apart for e.g. 10x10 -- it seems to get stuck at rank 6 or 7, at least for hundreds of thousands of iterations.
It seems like this might work better with a better (ie less-flat) objective function, but I don't know what that would be.
I've also tried a simple rejection method for building up the matrix:
def fill_matrix(m, n, r, vals):
    assert m >= r and n >= r
    trans = False
    if m > n: # more columns than rows I think is better
        m, n = n, m
        trans = True
    get_vec = lambda: np.array([random.choice(vals) for i in range(n)])
    vecs = []
    n_rejects = 0
    # fill in r linearly independent rows
    while len(vecs) < r:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            vecs.append(v)
        else:
            n_rejects += 1
    print("have {} independent ({} rejects)".format(r, n_rejects))
    # fill in the rest of the dependent rows
    while len(vecs) < m:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            n_rejects += 1
            if n_rejects % 1000 == 0:
                print(n_rejects)
        else:
            vecs.append(v)
    print("done ({} total rejects)".format(n_rejects))
    m = np.vstack(vecs)
    return m.T if trans else m
This works okay for e.g. 10x10 binary matrices with any rank, but not for 0-4 matrices or much larger binaries with lower rank. (For example, getting a 20x20 binary matrix of rank 15 took me 42,000 rejections; with 20x20 of rank 10, it took 1.2 million.)
This is clearly because the space spanned by the first r rows is too small a portion of the space I'm sampling from, e.g. {0,1}^10, in these cases.
We want the intersection of the span of the first r rows with the set of valid values.
So we could try sampling from the span and looking for valid values, but since the span involves real-valued coefficients that's never going to find us valid vectors (even if we normalize so that e.g. the first component is in the valid set).
Maybe this can be formulated as an integer programming problem, or something?

My friend, Daniel Johnson who commented above, came up with an idea but I see he never posted it. It's not very fleshed-out, but you might be able to adapt it.
If A is m-by-r and B is r-by-n and both have rank r then AB has rank r. Now, we just have to pick A and B such that AB has values only in the given set. The simplest case is S = {0,1,2,...,j}.
One choice would be to make A binary with appropriate row/col sums
that guaranteed the correct rank and B with column sums adding to no
more than j (so that each term in the product is in S) and row sums
picked to cause rank r (or at least encourage it as rejection can be
used).
I just think that we can come up with two independent sampling
schemes on A and B that are less complicated and quicker than trying
to attack the whole matrix at once. Unfortunately, all my matrix
sampling code is on the other computer. I know it generalized easily
to allowing entries in a bigger set than {0,1} (i.e. S), but I can't
remember how the computation scaled with m*n.
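One very simple instance of the "pick A and B separately" idea, offered only as a sketch and not as the scheme described above: let A be a binary selection matrix with a single 1 in each row, so every row of AB is a copy of some row of B and all entries stay in the set. The function name low_rank_over_set and the rejection loop for B are assumptions made for illustration.

```python
import numpy as np

def low_rank_over_set(m, n, r, vals, rng=None, max_tries=1000):
    """m-by-n matrix of rank r with entries from vals, built as A @ B."""
    assert m >= r and n >= r
    rng = np.random.default_rng() if rng is None else rng
    vals = np.asarray(vals)
    # B: r-by-n with entries from vals, resampled until it has rank r
    for _ in range(max_tries):
        B = rng.choice(vals, size=(r, n))
        if np.linalg.matrix_rank(B) == r:
            break
    else:
        raise RuntimeError("could not sample a rank-r matrix B")
    # A: one 1 per row; every column is used at least once, so A has rank r
    picks = np.concatenate([np.arange(r), rng.integers(0, r, size=m - r)])
    rng.shuffle(picks)
    A = np.zeros((m, r), dtype=B.dtype)
    A[np.arange(m), picks] = 1
    return A @ B

M = low_rank_over_set(8, 10, 3, [0, 1])
print(M)
print(np.linalg.matrix_rank(M))
```

The obvious limitation is that the output only contains repeated rows of B, so it covers a small and unrepresentative part of the rank-r matrices over the set; a richer choice of A would be needed for more variety.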

I am not sure how useful this solution will be, but you can construct a matrix that will allow you to search for the solution on another matrix with only 0 and 1 as entries. If you search randomly on the binary matrix, it is equivalent to randomly modifying the elements of the final matrix, but it is possible to come up with some rules to do better than a random search.
If you want to generate an m-by-n matrix over the element set E with elements ei, 0<=i<k, you start off with the m-by-k*m matrix, A:
Clearly, this matrix has rank m. Now, you can construct another matrix, B, that has 1s at certain locations to pick the elements from the set E. The structure of this matrix is:
Each Bi is a k-by-n matrix. So, the size of AB is m-by-n and rank(AB) is min(m, rank(B)). If we want the output matrix to have only elements from our set, E, then each column of Bi has to have exactly one element set to 1, and the rest set to 0.
If you want to search for a certain rank on B randomly, you need to start off with a valid B with max rank, and rotate a random column j of a random Bi by a random amount. This is equivalent to changing column i row j of A*B to a random element from our set, so it is not a very useful method.
However, you can do certain tricks with the matrices. For example, if k is 2, and there are no overlaps on first rows of B0 and B1, you can generate a linearly dependent row by adding the first rows of these two sub-matrices. The second row will also be linearly dependent on rows of these two matrices. I am not sure if this will easily generalize to k larger than 2, but I am sure there will be other tricks you can employ.
For example, one simple method to generate at most rank k (when m is k+1) is to get a random valid B0, keep rotating all rows of this matrix up to get B1 to Bm-2, set first row of Bm-1 to all 1, and the remaining rows to all 0. The rank cannot be less than k (assuming n > k), because B_0 columns have exactly 1 nonzero element. The remaining rows of the matrices are all linear combinations (in fact exact copies for almost all submatrices) of these rows. The first row of the last submatrix is the sum of all rows of the first submatrix, and the remaining rows of it are all zeros. For larger values of m, you can use permutations of rows of B0 instead of simple rotation.
Once you generate one matrix that satisfies the rank constraint, you may get away with randomly shuffling the rows and columns of it to generate others.

How about like this?
rank = 30
n1 = 100; n2 = 100
from sklearn.decomposition import NMF
model = NMF(n_components=rank, init='random', random_state=0)
U = model.fit_transform(np.random.randint(1, 5, size=(n1, n2)))
V = model.components_
M = np.around(U) # np.around(V)

Array element evaluation from reverse

I'm still very new to python and programming and I'm trying to figure out if I'm going about this problem in the correct fashion. I tend to have a matlab approach to things but here I'm just struggling...
Context:
I have two numpy arrays plotted in this image on flickr since I can't post photos here :(. They are of equal length properties (both 777x1600) and I'm trying to use the red array to help return the index(value on the x-axis of plot) and element value(y-axis) of the point in the blue plot indicated by the arrow for each row of the blue array.
The procedure I've been tasked with was to:
a) determine max value of red array (represented with red dot in figure and already achieved)
and b) Start at the end of the blue array with the final element and count backwards, comparing element to preceding element. Goal being to determine where the preceding value decreases. (for example, when element -1 is greater than element -2, indicative of the last peak in the image). Additionally, to prevent selecting "noise" at the tail end of the section with elevated values, I also need to constrain the selected value to be larger than the maximum of the red array.
Here's what I've got so far, but I'm stuck on line two where I have to evaluate the selected row of the array from the (-1) position in the row to the beginning, or (0) position:
for i,n in enumerate(blue): #select each row of blue in turn to analyze
    for j,m in enumerate(n): #select each element of blue ??how do I start from the end of array and work backwards??
        if m > m-1 and m > max_val_red[i]:
            indx_m[i] = j
            val_m[i] = m

To answer your question directly, you can use n[::-1] to reverse the array n.
So the code is:
for j, m in enumerate(n[::-1]):
    j = len(n)-j-1
    # here is your code
But to increase calculation speed, you should avoid Python loops:
import numpy as np
n = np.array([1,2,3,4,2,5,7,8,3,2,3,3,0,1,1,2])
idx = np.nonzero(np.diff(n) < 0)[0]
peaks = n[idx]
mask = peaks > 3 # peak must be larger than 3
print("index=", idx[mask])
print("value=", peaks[mask])
The output is:
index= [3 7]
value= [4 8]

I assume you mean:
if m > n[j-1] and m > max_val_red[i]:
    indx_m[i] = j
    val_m[i] = m
because m > m - 1 is always True
To reverse an array on an axis you can index the array using ::-1 on that axis, for example to reverse blue on axis 1 you can use:
blue_reverse = blue[:, ::-1]
Try to see whether you can write your function as a set of array operations instead of loops (that tends to be much faster). This is similar to the other answer, but it should allow you to avoid both loops you're currently using:
threshold = red.max(1)
threshold = threshold[:, np.newaxis] #this makes threshold's shape (n, 1)
blue = blue[:, ::-1]
index_from_end = np.argmax((blue[:, :-1] > blue[:, 1:]) & (blue[:, :-1] > threshold), 1)
value = blue[range(len(blue)), index_from_end]
index = blue.shape[1] - 1 - index_from_end

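Sorry, I didn't read all of it, but you could look into the built-in function reversed.
Note that reversed cannot be applied directly to enumerate(n), because an enumerate object is not a sequence. Use reversed(list(enumerate(n))) instead, which yields the elements in reverse order already paired with their original indices. Alternatively, enumerate(reversed(n)) counts from the end, in which case the original index is len(n) - 1 - j.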
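A small sketch of iterating from the end while keeping track of the original indices; the list values are made up for the example.

```python
n = [5, 7, 9]

# Option 1: materialize the (index, value) pairs, then reverse them.
# The indices are the original positions.
pairs = list(reversed(list(enumerate(n))))
print(pairs)  # [(2, 9), (1, 7), (0, 5)]

# Option 2: enumerate the reversed sequence and convert the counter
# back to the original position.
from_end = [(len(n) - 1 - j, m) for j, m in enumerate(reversed(n))]
print(from_end)  # [(2, 9), (1, 7), (0, 5)]
```

Option 2 avoids building the intermediate list, which matters little for short rows but can help for long ones.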
