Python - Kmeans - Add the centroids as a new column - python

Assume I have the following dataframe. How can I create a new column "new_col" containing the centroids? I can only create the column with the labels, not with the centroids.
Here is my code.
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
numbers = pd.DataFrame(list(range(1, 1000)), columns=['num'])
kmean_model = KMeans(n_clusters=5)
kmean_model.fit(numbers[['num']])
kmean_model.cluster_centers_
array([[699. ],
[297. ],
[497.5],
[899.5],
[ 99. ]])
numbers['new_col'] = kmean_model.predict(numbers[['num']])

It is simple. Just use .labels_ as follows.
numbers['new_col'] = kmean_model.labels_
Edit: sorry, my mistake. That only gives the labels, not the centroids.
Make a dictionary whose keys are the labels and whose values are the centers, then map the labels in new_col through it. The keys must come from the rows of cluster_centers_ (label k is row k), not from labels_, which has one entry per data point:
label_center_dict = {k: v[0] for k, v in enumerate(kmean_model.cluster_centers_)}
numbers['new_col'] = kmean_model.labels_
numbers['new_col'] = numbers['new_col'].map(label_center_dict)
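The same lookup can probably be done without a dictionary, by indexing the array of centers directly with the labels. The following is a minimal sketch under a few assumptions: `n_init=10` and `random_state=0` are added here only to make the run reproducible, and the extra "label" column is illustrative and not part of the question.

```python
import pandas as pd
from sklearn.cluster import KMeans

numbers = pd.DataFrame({"num": range(1, 1000)})

# n_init and random_state are assumptions added for reproducibility.
kmean_model = KMeans(n_clusters=5, n_init=10, random_state=0)
kmean_model.fit(numbers[["num"]])

# cluster_centers_ has shape (5, 1); labels_ holds one label per row.
# Indexing the centers with the labels picks each row's own centroid.
numbers["label"] = kmean_model.labels_
numbers["new_col"] = kmean_model.cluster_centers_[kmean_model.labels_, 0]
```

With a single feature the trailing `0` selects the only coordinate of each center; with several features you would get one column per coordinate instead.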

Related

How to get nearest match in csv file python

I want to get the nearest match in my big .csv file in Python. My (shortened) .csv file is:
0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0
I have written a program, but it isn't finished and I don't know how to complete it. Do I have to use another program?
with open("<dir>", "r") as file:
    file = file.readlines()
len_ = len(file)
# The string whose nearest match I want to find in the .csv data.
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"
list_ = []
for i in range(1, len_):
    item = str(file[i])
    item2 = item[2:]  # drop the first field and its comma
    list_.append(item2)
for item in list_:
    # Algorithm: look from left to right along the row and find the row
    # with the most sequential matches to the search data.
    pass  # not implemented yet
It seems you are handling a machine learning problem, with a dataset and a point for which you want the nearest neighbor. I assume you want the point of the dataset that has the shortest Euclidean distance (in 19 dimensions) to the given point.
I would use the pandas and scikit-learn packages with the NearestNeighbors algorithm.
Import the packages:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
Load file.csv as a pandas DataFrame (with generic column names):
df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))
Since you want the first column of values as the result, I move it to a pandas Series called "first_column" and drop it from the "df" dataframe:
first_column = df[0]
df.drop(columns=[0], inplace=True)
What you called "string" I call "y" and define as a NumPy array:
y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])
Now let's fit the NearestNeighbors model:
nnb = NearestNeighbors(n_neighbors=1).fit(df)
and now compute which point in the data set is the closest to the given point y:
distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]
So, the nearest point has index 13 in the dataframe. Let's print the value of first_column at index 13:
print(first_column.loc[13])
0
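Note that the question describes a different criterion than Euclidean distance: the row with the most sequential matches from left to right. A minimal sketch of that criterion follows. The helper name `leading_matches` is hypothetical, and the three rows and the query are truncated to their first fields from the question's data purely for illustration.

```python
def leading_matches(row, query):
    """Count fields that match from the left until the first mismatch."""
    count = 0
    for a, b in zip(row, query):
        if a != b:
            break
        count += 1
    return count

# Rows truncated from the question's data for illustration.
rows = [
    "0,4,5,0,132,24055,0,64",
    "0,4,5,0,52,24058,0,64",
    "1,4,5,0,40,1727,0,128",
]
query = "4,5,0,52,32345,0,64"

# Drop the first field of each row, as the question's code does with item[2:].
candidates = [r.split(",")[1:] for r in rows]
target = query.split(",")

scores = [leading_matches(c, target) for c in candidates]
best_index = max(range(len(rows)), key=scores.__getitem__)
print(best_index, rows[best_index])
```

Ties are resolved in favor of the earliest row here; whether that is the right choice depends on what "nearest" should mean for your data.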

For loop only returning last item

# Create random df
df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))
test = df[:50]
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
I am stumped on this problem and wondering if anyone can see where I'm going wrong. I am trying to calculate the euclidean distance between each row and every other row. Then, I sort those distances and return the index positions of the "most similar" rows by minimum distance in the list smallest_dist.
The issue is that this only returns the most similar index positions of the last row: [6.0, 3.0, 4.0]
What I want for output is something like this:
Original ID    Matches
1              4,5,6
2              8,2,5
I've tried this but it gives the same result:
list_of_mins = []
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
for i in range(len(test)):
    list_of_mins.append(smallest_dist_ixs)
Does anyone know what's causing this problem? thank you!
I don't have the distance library available, so I replaced it with a simple sum; it should work once you switch back to distance.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
dict_results = {'ids': [],
                'ids_min': []}
n_min = 2
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: np.sum(row), axis=1)
    # Create a new dataframe with distances.
    # print(euclidean_distances)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})
    selected_min = distance_frame.sort_values("dist").head(n_min)
    dict_results['ids'].append(i)
    dict_results['ids_min'].append(', '.join(selected_min['idx'].astype('str')))
print(pd.DataFrame(dict_results))
I made a few changes to your code:
Added an n_min parameter to define how many elements you want in the second column (the number of closest-row indices).
Created a dict in which the results are saved, to build the data frame you want.
Inside the loop, added append calls that add the results of each iteration to that dict.
After the loop, passing the dict to pd.DataFrame builds the frame the same way you were building distance_frame.
What happens if you try to return the results either in the data frame or (for convenience of testing) a dictionary? For example:
df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))
test = df[:50]
closest_nodes = {}
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    closest_nodes[i] = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
The thing I didn't see in your code was some sort of storage action to put the one result per test case into a permanent structure.
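The per-row loop can probably be avoided altogether by computing all pairwise distances at once. The sketch below uses plain NumPy broadcasting instead of the `distance` module from the question; the seeded random generator and the output column names are assumptions chosen to mirror the table the question asks for.

```python
import numpy as np
import pandas as pd

# Seeded random data, standing in for the question's `test` frame.
rng = np.random.default_rng(0)
test = pd.DataFrame(rng.integers(1, 10, size=(50, 23)))

values = test.to_numpy(dtype=float)

# Pairwise Euclidean distances via broadcasting: result has shape (50, 50).
diffs = values[:, None, :] - values[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Exclude each row from its own list of matches.
np.fill_diagonal(dist, np.inf)

# Positions of the three closest rows for every row, nearest first.
nearest = np.argsort(dist, axis=1)[:, :3]

matches = pd.DataFrame({
    "Original ID": test.index,
    "Matches": [",".join(map(str, test.index[row])) for row in nearest],
})
print(matches.head())
```

This keeps one result per row because `nearest` is a full array rather than a variable overwritten on each iteration. For large frames the intermediate `diffs` array can get big, so a dedicated nearest-neighbor routine may be a better fit there.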

Cannot group datapoints by cluster

I have a dataset where each datapoint has 5 features and a cluster assigned to each point.
You can see the beginning of it here, last column is the cluster number:
[[4.01682810e-01 2.14628527e-02 2.99529665e-02 2.79935965e-01 9.21441137e-01 9.00000000e+00]
[9.32087200e-03 3.38196129e-01 8.49571569e-01 3.69402590e-01 1.92096835e-01 1.20000000e+01]
[7.51465196e-01 4.45955645e-01 3.37174838e-01 3.65047097e-01 5.81725084e-01 1.00000000e+00]
I want to create a list of lists of datapoints of the same cluster, so I wrote the following function and tried to execute it:
def returnArrayOfClusters(data, clusterNumbers):
    # create an empty column
    column = []
    # create an empty list we want to output
    listOfClusters = []
    # fill it with a column for each cluster
    for i in clusterNumbers:
        listOfClusters.append(column)
    print(listOfClusters)
    ## fill the columns with points according to their cluster
    for datapoint in data:
        print(datapoint)
        cluster = int(datapoint[5])
        listOfClusters[cluster].append(datapoint)
    return listOfClusters
listOfClusters = returnArrayOfClusters(data_labeled_unfinished, range(0,14))
What I get is an unordered list of datapoints of this format (the end of the list), and as you can see all the points in the column are of different clusters (they have different last value):
array([ 0.81802695, 0.45533606, 0.33799001, 0.26154893, 0.64155249,
13. ]), array([0.12995366, 0.45586338, 0.85833814, 0.32153188, 0.28736836,
1. ]), array([0.06230581, 0.47400143, 0.78671841, 0.3162376 , 0.04819034,
9. ]), array([0.15291747, 0.54247295, 0.54407916, 0.87888682, 0.46639597,
8. ]), array([ 0.21578994, 0.178303 , 0.80642112, 0.39853499, 0.27832876,
10. ]), array([0.27426491, 0.32986967, 0.49411613, 0.50818875, 0.2336591 ,
5. ])]
Maybe it is a very stupid mistake, but I just cannot spot the error.
What I expect to see, however, is that all the points in each list are of the same cluster (i.e. in the output they have the same value of the 6th item).
If I understood you correctly, you can split your data using a list comprehension, for example:
from sklearn.cluster import KMeans
import numpy as np
X = np.random.normal(0,1,(100,5))
kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
data = np.concatenate((X,kmeans.labels_.reshape(-1,1)),axis=1)
[data[data[:,5]==i,] for i in np.unique(data[:,5])]
in your case:
[data_labeled_unfinished[data_labeled_unfinished[:,5]==i,] for i in np.unique(data_labeled_unfinished[:,5])]
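As for why the function in the question mixes clusters: it appends the same `column` list object once per cluster, so every entry of `listOfClusters` refers to one shared list and every datapoint ends up in it. A minimal sketch of a fix creates a fresh list per cluster. The function name `group_by_cluster` and the three sample rows are made up for illustration.

```python
import numpy as np

def group_by_cluster(data, cluster_numbers):
    # Build a separate empty list for every cluster. Appending one shared
    # list repeatedly would make all entries aliases of the same object.
    clusters = [[] for _ in cluster_numbers]
    for datapoint in data:
        clusters[int(datapoint[5])].append(datapoint)
    return clusters

# Illustrative rows: 5 features followed by the cluster number.
data = np.array([
    [0.4, 0.0, 0.0, 0.3, 0.9, 2.0],
    [0.0, 0.3, 0.8, 0.4, 0.2, 0.0],
    [0.8, 0.4, 0.3, 0.4, 0.6, 2.0],
])
groups = group_by_cluster(data, range(3))
print([len(g) for g in groups])
```

Unlike the list comprehension over `np.unique`, this version keeps an empty list for clusters that have no points, which may or may not be what you want.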

Python for loop, iterating over values from numpy arange method

I need to write code that tests a numpy array of cutoff values for a classification problem. The values to test are stored in the cutoff_list variable. I then want to place the list of resulting confusion matrices in a dictionary. However, the code below gives me only the first dictionary entry (confusion matrix for the first test value):
cutoff_list = [np.arange(0,1,0.01)] # list of test values
dictionary = {}
for i, v in enumerate(cutoff_list):
    actual = (df.observed)
    predicted = np.where(df.indicator > i, 1, 0)
    df_confusion = confusion_matrix(actual, predicted) / len(df.indicator)
    dictionary[i] = df_confusion
print(dictionary)
Libraries that I am using:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
Is this a problem with the loop or the dictionary update step? I'm new to Python, have more experience with R and still struggling here. Any help appreciated.
Your for loop runs only once because of the way you set it up.
You are wrapping the return value of np.arange(0, 1, 0.01) in a list, which breaks the loop: the outer list has just one item (the whole array), so the loop body runs just once and you get a single dictionary entry.
>>> cutoff_list = [np.arange(0,1,0.01)]
>>> cutoff_list
[array([0. , 0.01, ... 0.98, 0.99])]
>>> type(cutoff_list)
<class 'list'>
You want to get the actual numpy array:
>>> cutoff_list = np.arange(0,1,0.01)
>>> cutoff_list
array([0. , 0.01, ... 0.98, 0.99])
>>> type(cutoff_list)
<class 'numpy.ndarray'>
Change the line cutoff_list = [np.arange(0,1,0.01)] to cutoff_list = np.arange(0,1,0.01) and see whether that resolves your problem.
I would also imagine that you want to use v instead of i in this line:
predicted = np.where(df.indicator > i, 1, 0)
as the i will hold just the enumerating value that you use as key for your dict, whereas the v will hold the values from cutoff_list.
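Putting both fixes together, the loop might look like the sketch below. The dataframe here is random stand-in data (the question's `df` is not shown), and passing `labels=[0, 1]` and rounding the cutoff for use as a dictionary key are choices made for this example, not something taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Stand-in data: a binary outcome and a score between 0 and 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "observed": rng.integers(0, 2, size=200),
    "indicator": rng.random(200),
})

cutoff_list = np.arange(0, 1, 0.01)  # the array itself, not wrapped in a list
dictionary = {}
for v in cutoff_list:
    # Compare against the cutoff value v, not the loop counter.
    predicted = np.where(df["indicator"] > v, 1, 0)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    cm = confusion_matrix(df["observed"], predicted, labels=[0, 1])
    dictionary[round(float(v), 2)] = cm / len(df)

print(len(dictionary))
```

Keying the dictionary by the cutoff value makes it easy to look up a particular threshold later, for example `dictionary[0.5]`.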

for loop in scipy.stats.linregress

I am using the scipy stats module to calculate a linear regression, i.e.
slope, intercept, r_value, p_value, std_err = stats.linregress(data['cov_0.0075']['num'], data['cov_0.0075']['com'])
where data is a dictionary containing several 'cov_x' keys, each corresponding to a dataframe with columns 'num' and 'com'.
I want to be able to loop through this dictionary and do linear regression on each 'cov_x'. I am not sure how to do this. I tried:
for i in data:
    slope_+str(i), intercept+str(i), r_value+str(i),p_value+str(i),std_err+str(i)= stats.linregress(data[i]['num'],data[i]['com'])
Essentially I want len(x) slope_x values.
You could use a list comprehension to collect all the stats.linregress return values:
result = [stats.linregress(df['num'],df['com']) for key, df in data.items()]
result is a list of 5-tuples. To collect all the first, second, third, etc... elements from each 5-tuple into separate lists, use zip(*[...]):
slopes, intercepts, r_values, p_values, stderrs = zip(*result)
You should be able to do what you're trying to, but there are a couple of things to watch out for.
First, you can't build a variable name by adding a string to it. The left-hand side of an assignment must be a valid target (a name, attribute, or subscript), not an expression such as slope_+str(i).
Just make sure that you use the dict data type if you want string indexing:
import scipy.stats as stats
import pandas as pd
import numpy as np
data = {}
l = ['cov_0.0075','cov_0.005']
for i in l:
    x = np.random.random(100)
    y = np.random.random(100)+15
    d = {'num':x,'com':y}
    df = pd.DataFrame(data=d)
    data[i] = df
slope = {}
intercept = {}
r_value = {}
p_value = {}
std_error = {}
for i in data:
    slope[str(i)], \
    intercept[str(i)], \
    r_value[str(i)],\
    p_value[str(i)], std_error[str(i)]= stats.linregress(data[i]['num'],data[i]['com'])
print(slope,intercept,r_value,p_value,std_error)
This should work just fine. Otherwise, you can store individual values and put them in a list later.
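A variation on both answers is to keep one result object per key and then collect the fields into a table. This is only a sketch: the two dataframes are synthetic, exactly linear data invented so the slopes are easy to check, and the column names in `summary` are arbitrary.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, exactly linear data so the expected slopes are obvious.
x = np.arange(10.0)
data = {
    "cov_0.0075": pd.DataFrame({"num": x, "com": 2.0 * x + 1.0}),
    "cov_0.005": pd.DataFrame({"num": x, "com": -1.0 * x + 3.0}),
}

# One regression result per dictionary key.
results = {key: stats.linregress(df["num"], df["com"]) for key, df in data.items()}

# One row per key, one column per statistic.
summary = pd.DataFrame({
    key: {
        "slope": r.slope,
        "intercept": r.intercept,
        "r_value": r.rvalue,
        "p_value": r.pvalue,
        "std_err": r.stderr,
    }
    for key, r in results.items()
}).T
print(summary)
```

A single slope can then be read with `summary.loc["cov_0.0075", "slope"]`, which avoids having to generate variable names at all.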
