clustering algorithm in python [duplicate] - python

This question already has answers here:
Numpy: Get random set of rows from 2D array
(10 answers)
Closed 3 years ago.
I'm trying to code this algorithm, but I'm struggling with it and step 6 is confusing me. My code so far is at the bottom.
1. Set a positive value for K.
2. Select K different rows from the data matrix at random.
3. For each of the selected rows: copy its values to a new list, let us call it c. Each element of c is a number. (At the end of step 3 you should have the lists c_1, c_2, ..., c_K. Each of these should have the same number of columns as the data matrix.)
4. For each row i in the data matrix: a. calculate the Manhattan distance between data row D'_i and each of the lists c_1, c_2, ..., c_K; b. assign row D'_i to the cluster of the nearest c. For instance, if the nearest c is c_3, then assign row i to cluster 3 (i.e. you should have a list whose ith entry is equal to 3; let's call this list S).
5. If the previous step does not change S, stop.
6. For each k = 1, 2, ..., K: update c_k so that each element j of c_k is equal to the median of the column D'_j, but only taking into consideration those rows that have been assigned to cluster k.
7. Go to step 4.
Notice that in the above, K is not the same thing as k.
# This is what I have so far:
def clustering(matrix, k):
    for i in range(k):  # k is an int, so iterate over range(k)
        ...
I'm stuck on how to choose the rows randomly, and I also don't understand what steps 5 and 6 mean. Could someone explain?

You need np.random.choice.
Use this:
import numpy as np

# some data with 10 rows and 5 columns
X = np.random.rand(10, 5)

def clustering(X, k):
    # create k random row indices (the selector); replace=False gives unique samples
    random_selector = np.random.choice(X.shape[0], size=k, replace=False)
    # select k rows at random; sampled_X.shape == (k, 5)
    sampled_X = X[random_selector]
    ...
    return  # SOMETHING
Now you can continue working on your homework.
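For completeness, here is one possible sketch of the whole loop described in the question (a K-medians variant using Manhattan distance). It is illustrative only; the names max_iter, S and centers are mine, not from the assignment:
import numpy as np

def clustering(X, K, max_iter=100):
    X = np.asarray(X, dtype=float)
    # Step 2: pick K distinct rows at random as the initial centers.
    idx = np.random.choice(X.shape[0], size=K, replace=False)
    centers = X[idx].copy()        # the lists c_1, ..., c_K, one per row
    S = np.full(X.shape[0], -1)    # cluster assignment for every data row
    for _ in range(max_iter):
        # Step 4: Manhattan distance from every row to every center.
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_S = dists.argmin(axis=1)
        # Step 5: stop once the assignments no longer change.
        if np.array_equal(new_S, S):
            break
        S = new_S
        # Step 6: move each center to the column-wise median of its cluster.
        for k in range(K):
            if np.any(S == k):
                centers[k] = np.median(X[S == k], axis=0)
    return S, centers

# e.g. labels, centers = clustering(np.random.rand(10, 5), K=3)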

Related

Pandas random n samples of consecutive rows / pairs

I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where each pair consists of 2 consecutive rows. I want to perform additional calculations on these 2 consecutive/paired rows.
E.g. if I say sample size n=2, I want to randomly pick pairs and find the difference in distance within each of the following picks.
Additional condition: value difference can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then distance of the following etc
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) pairs in positional terms (which decreases randomness a bit), you can select N even-position rows and take the row right after each:
N = 20000
# get the indices of N random even-position rows (0, 2, 4, ...)
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m|m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
increasing randomness
The drawback of the above approach is that there is a bias to always have (even, odd) pairs. To overcome this we can first remove a random fraction of the DataFrame, small enough to still leave enough rows to pick from, but large enough to shift the (even, odd) pairs to (odd, even) in many locations. The fraction of rows to remove should be tuned depending on the initial size and the sample size. I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
       .drop(df.sample(frac=frac).index)
       .loc[::2].sample(n=N)
       .index
       )
m = df.index.to_series().isin(idx)
df_sample = df.loc[m|m.shift()]
# check:
# len(df_sample)
# 40000
Here's my first attempt (I only just noticed your additional constraint; I'm not sure if you need the precise number of samples, in which case you'll have to do some fudging after the line first_indices = first_indices[mask] below).
import random
import numpy as np

# Temporarily reset the index so we have something we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))
# Keep only those indices where the diff with the next row down stays within 4000.
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in first_indices]
first_indices = first_indices[mask]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter
df_sample = df[df.index.isin(c)]
# Restore the original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter first_indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array

Masking few non-zero elements of certain rows of a matrix

I have a 3*3 matrix of 1s and 0s, A = [[1,0,1],[0,1,1],[1,0,0]], and an array indicating the limit on each row sum, B = [1,2,1]. I want to find the rows of A whose sum exceeds the corresponding value in B, and set non-zero elements of A to zero so that each row sum matches B. Finding the rows of A that exceed the sum is easy; masking the elements to adjust the sum is what I need help with. How can this be achieved (I want to scale it to larger matrices and tensors)?
I would do something like this:
import numpy as np
A = np.array([[1,0,1],[0,1,1],[1,0,0]])
B = np.array([1,2,1])
# a cumulative sum of each row will tell you how many
# ones were in that row up to each point.
A_cs = np.cumsum(A, axis = 1)
# thresholding according to the sum vector will let
# you know where you should start omitting values, since
# at that point the sum of the row exceeds its limit.
A_th = A_cs > B[:, None]
# then you can use the boolean array to create a new
# array where the appropriate values in the original
# array are set to zero to reduce the row sum.
A_nw = A * (1 - A_th)
output:
A_nw =
[[1 0 0]
[0 1 1]
[1 0 0]]
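Since the question mentions scaling to tensors: the same cumulative-sum trick broadcasts to higher dimensions. A minimal sketch (the shapes and names here are illustrative, not from the question):
import numpy as np

T = np.random.randint(0, 2, size=(2, 3, 4))   # toy 0/1 tensor
B = np.random.randint(1, 3, size=(2, 3))      # per-row limits along the last axis
# cumulative sum along the last axis, thresholded against the broadcast limits
T_nw = T * ~(np.cumsum(T, axis=-1) > B[..., None])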
Unrelated note:
The following note is here to help OP improve their dev-related search skills.
I can answer some questions instantaneously, but this was not one of them. I'm telling you this because I reached the answer through a simple Google search for "python find the i th non zero element in each row", which led me to this post, which in turn led me very quickly to an answer. You don't have to try to be a better, more independent code writer. But if you want to, know that you can.

Python: Selecting every Nth row of a matrix [duplicate]

This question already has answers here:
Pythonic way to return list of every nth item in a larger list
(9 answers)
Closed 4 years ago.
Does anyone know how to select multiple rows of a matrix to form a new one? E.g. I would like to select EVERY 3rd row of a matrix and build a new matrix with these rows.
Many thanks for your help,
Nicolas
An example using numpy's ndarray to create a matrix, with 10 rows and 3 columns as an example:
import numpy as np
matrix = np.ndarray(shape=(10, 3))                # example matrix with 10 rows and 3 columns
rows, columns = matrix.shape                      # number of rows and columns
l = range(rows)[0::3]                             # indices of every third row, including the first
new_matrix = np.ndarray(shape=(len(l), columns))  # your new matrix
for i in range(len(l)):
    new_matrix[i] = matrix[l[i]]                  # copy every third row of matrix into new_matrix
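For what it's worth, numpy's slicing gives the same result without the loop:
import numpy as np

matrix = np.arange(30).reshape(10, 3)  # toy 10x3 matrix
new_matrix = matrix[::3]               # rows 0, 3, 6 and 9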

Using scipy pdist on pandas data frames (and) lists

I've run into an odd problem yet again.
Suppose I have the following dummy data frame (by way of demonstrating my problem):
import numpy as np
import pandas as pd
import string

# Test data frame
N = 3
col_ids = string.ascii_uppercase[:N]  # string.letters is Python 2 only
df = pd.DataFrame(
    np.random.randn(5, 3 * N),
    columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])
df
This produces:
A_x A_y A_z B_x B_y B_z C_x C_y C_z
0 -1.339040 0.185817 0.083120 0.498545 -0.569518 0.580264 0.453234 1.336992 -0.346724
1 -0.938575 0.367866 1.084475 1.497117 0.349927 -0.726140 -0.870142 -0.371153 -0.881763
2 -0.346819 -1.689058 -0.475032 -0.625383 -0.890025 0.929955 0.683413 0.819212 0.102625
3 0.359540 -0.125700 -0.900680 -0.403000 2.655242 -0.607996 1.117012 -0.905600 0.671239
4 1.624630 -1.036742 0.538341 -0.682000 0.542178 -0.001380 -1.126426 0.756532 -0.701805
Now I would like to use scipy.spatial.distance.pdist on this pandas data frame. This turns out to be a rather non-trivial process. What pdist does is to compute the distance between m points using Euclidean distance (2-norm) as the distance metric between the points. The points are arranged as m n-dimensional row vectors in the matrix X (source).
So, there are a couple of things that one has to do to create a function that operates on a pandas data frame, such that the pdist function can be used. You will note that pdist is convenient when the number of points gets very large. I've tried making my own, which works for a one-row data-frame, but I cannot get it to work, ideally, on the whole data frame at once.
Here's my attempt:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd
import string
def Euclidean_distance(df):
    EcDist = pd.DataFrame(index=df.index)  # results container
    arr = df.values  # store the data frame values in a numpy array
    tag_list = [num for elem in arr for num in elem]  # flatten the array into a single list
    tag_list_3D = list(zip(*[iter(tag_list)] * 3))  # split into length-3 sub-lists that pdist() can work with
    EcDist = pdist(tag_list_3D)  # the distance between m points using Euclidean distance (2-norm)
    return EcDist
First I begin by creating a results container in pandas form to store the result in. Secondly, I save the pandas data frame as a numpy array in order to get it into list form in the next step. It has to be in list form because the pdist function only operates on list-like input. When the data frame is saved into an array, it is stored as a list within a list; this has to be flattened, and the result is saved in the tag_list variable. Thirdly, tag_list is further reduced into sub-lists of length three, so that the x, y and z coordinates can be obtained for each point, which can then be used to find the Euclidean distance between all of these points (in this example there are three points: A, B and C, each being three-dimensional).
As said, the function works if the data frame is a single row, but when using the function on the given example it calculates the Euclidean distance for all 5x3 points, which yields a total of 105 distances. What I want it to do is to calculate the distances per row (so pdist should only work on one row's three points at a time). My final results, for this example, would then look something like this:
dist_1 dist_2 dist_3
0 0.807271 0.142495 1.759969
1 0.180112 0.641855 0.257957
2 0.196950 1.334812 0.638719
3 0.145780 0.384268 0.577387
4 0.044030 0.735428 0.549897
(these are just dummy numbers to show the desired shape)
Hence how do I get my function to apply to the data frame in a row-wise fashion?
Or better yet, how can I get it to perform the function on the entire data frame at once, and then store the result in a new data frame?
Any help would be very appreciated. Thanks.
If I understand correctly, you have "groups" of points. In your example each group has three points, which you call A, B and C. A is represented by three columns A_x, A_y, A_z, and likewise for B and C.
What I suggest is that you restructure your "wide-form" data into a "long" form in which each row contains only one point. Each row then will have only three columns for the coordinates, and then you will add an additional column to represent which group a point is in. Here's an example:
>>> d = pandas.DataFrame(np.random.randn(12, 3), columns=["X", "Y", "Z"])
>>> d["Group"] = np.repeat([1, 2, 3, 4], 3)
>>> d
X Y Z Group
0 -0.280505 0.888417 -0.936790 1
1 0.823741 -0.428267 1.483763 1
2 -0.465326 0.005103 -1.107431 1
3 -1.009077 -1.618600 -0.443975 2
4 0.535634 0.562617 1.165269 2
5 1.544621 -0.858873 -0.349492 2
6 0.839795 0.720828 -0.973234 3
7 -2.273654 0.125304 0.469443 3
8 -0.179703 0.962098 -0.179542 3
9 -0.390777 -0.715896 -0.897837 4
10 -0.030338 0.746647 0.250173 4
11 -1.886581 0.643817 -2.658379 4
The three points with Group==1 correspond to A, B and C in your first row; the three points with Group==2 correspond to A, B, and C in your second row; etc.
With this structure, computing the pairwise distances by group using pdist becomes straightforward:
>>> d.groupby('Group')[["X", "Y", "Z"]].apply(lambda g: pandas.Series(distance.pdist(g), index=["D1", "D2", "D3"]))
D1 D2 D3
Group
1 2.968517 0.918435 2.926395
2 3.119856 2.665986 2.309370
3 3.482747 1.314357 2.346495
4 1.893904 2.680627 3.451939
It is possible to do a similar thing with your existing setup, but it will be more awkward. The problem with the way you set it up is that you have encoded critical information in a difficult-to-extract way: which columns are X, Y or Z coordinates, and which columns refer to point A versus B or C, is encoded only in the textual names of the columns. You as a human can see which columns are X values just by looking at them, but specifying that programmatically requires parsing the string names of the columns.
You can see this in how you made the column names with your '{}_{}'.format(letter, coord) business. This means that in order to get to use pdist on your data, you will have to do the reverse operation of parsing the column names as strings in order to decide which columns to compare. Needless to say, this will be awkward. On the other hand, if you put the data into "long" form, there is no such difficulty: the X coordinates of all points line up in one column, and likewise for Y and Z, and the information about which points are to be compared is also contained in one column (the "Group" column).
When you want to do large-scale operations on subsets of data, it's usually better to split out things into separate rows. This allows you to leverage the power of groupby, and is also usually what is expected by scipy tools.
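If you want to get from the question's wide layout to this long layout programmatically, one possible conversion (a sketch that assumes the letter_coord naming scheme used above) is to split the column names into a MultiIndex and stack:
import numpy as np
import pandas as pd

# toy wide-form frame, as in the question
df = pd.DataFrame(
    np.random.randn(5, 9),
    columns=['{}_{}'.format(p, c) for p in 'ABC' for c in 'xyz'])

# split the 'A_x'-style names into a (Point, Coord) MultiIndex ...
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split('_')) for name in df.columns], names=['Point', 'Coord'])

# ... then stack the Point level so that each row holds a single point
long_df = (df.stack(level='Point')
             .rename_axis(['Group', 'Point'])
             .reset_index())
From there the groupby/pdist pattern shown above applies directly, e.g. long_df.groupby('Group')[['x', 'y', 'z']].apply(...).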

How to sum over columns with some weight in a csr matrix in python

If I have a large csr_matrix A, I want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Are the corresponding axis values: 1->rows, 0->columns?
I'm stuck when I want to sum over the columns with some weights specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I, J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray().ravel()  # dense 1-D copy of the column
    sumtemp = np.dot(ratings, column)
    sumratings.append(sumtemp)
sumfreq = np.asarray(sumfreq).ravel()  # .sum() returns an np.matrix, which has no toarray()
average = np.true_divide(sumratings, sumfreq)
(Numpy is imported as np.) There is a weight vector "ratings"; the program is supposed to output the average rating for each column of the matrix "uniFeature".
I experimented with dotting column = uniFeature.getcol(j) directly with ratings (which is a list), but I get an error saying the formats do not agree. It works after column.toarray() and then dotting with ratings. But doesn't converting each column back to dense form defeat the point of having a sparse matrix, and wouldn't it be very slow? I ran the above code and it was too slow to produce results. I guess there should be a way to dot the vector "ratings" with each column of the sparse matrix efficiently.
Thanks in advance!
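For reference, the inner product described above is a single sparse matrix-vector product, so no column ever needs to be densified. A minimal sketch with toy data (the names and shapes are illustrative):
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.random.randint(0, 3, size=(6, 4)))  # toy matrix with 6 rows
ratings = np.random.rand(6)                           # one weight per row

sumfreq = np.asarray(A.sum(axis=0)).ravel()  # plain column sums (axis=0 collapses the rows)
sumratings = A.T @ ratings                   # all weighted column sums in one sparse product
average = sumratings / sumfreq               # assumes no column is entirely zero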
