Python - KMeans - Add the centroids as a new column
Assume I have the following dataframe. How can I create a new column "new_col" containing the centroids? I can only create the column with the labels, not with the centroids.
Here is my code.
import pandas as pd
from sklearn.cluster import KMeans

numbers = pd.DataFrame(list(range(1, 1000)), columns=['num'])
kmean_model = KMeans(n_clusters=5)
kmean_model.fit(numbers[['num']])
kmean_model.cluster_centers_
array([[699. ],
[297. ],
[497.5],
[899.5],
[ 99. ]])
numbers['new_col'] = kmean_model.predict(numbers[['num']])
It is simple. Just use .labels_ as follows.
numbers['new_col'] = kmean_model.labels_
Edit: sorry, my mistake.

Make a dictionary whose keys are the cluster labels and whose values are the centers, then map the new column through it. Since cluster_centers_ is ordered by label, build the dictionary with enumerate (zipping labels_ against cluster_centers_ would pair only the first five labels with the five centers, which is not the mapping you want). See the following.

# label i corresponds to cluster_centers_[i]; take [0] to get a scalar
label_center_dict = {label: center[0] for label, center in enumerate(kmean_model.cluster_centers_)}
numbers['new_col'] = kmean_model.labels_
numbers['new_col'] = numbers['new_col'].replace(label_center_dict)
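A more compact alternative, as a sketch: since cluster_centers_ is indexed by label, you can skip the dictionary entirely and fancy-index the centers array with labels_.

# Sketch of a dictionary-free alternative: index the (5, 1) centers array
# with the per-row labels, then take column 0 to get scalar centroid values.
numbers['new_col'] = kmean_model.cluster_centers_[kmean_model.labels_, 0]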
Related
How to get the nearest match in a CSV file in Python
I want to get the nearest match in my big .csv file in Python. My (shortened) .csv file is:

0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0

I have made a program, but it isn't finished and I don't know how I can complete it. Do I have to use another program?

with open("<dir>", "r") as file:
    file = file.readlines()

len_ = len(file)
# The string that I want to find the nearest data to in the .csv file.
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"
list_ = []
for i in range(1, len_):
    item = str(file[i])
    item2 = item[2:]
    list_.append(item2)
for item in list_:
    # algorithm: look from left to right on the row and find the row
    # with the most sequential matches to the search data
It seems you are handling a machine learning problem: a dataset and a point whose nearest neighbor you want to find. I assume you want the point of the dataset that has the shortest Euclidean distance (in 19 dimensions) to the given point. I would use the pandas and scikit-learn packages with the NearestNeighbors algorithm.

Import the packages:

from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

Load file.csv as a pandas DataFrame (with generic column names):

df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))

Since you want the first column of values as results, I move it to a pandas Series called "first_column" and drop it from the "df" dataframe:

first_column = df[0]
df.drop(columns=[0], inplace=True)

What you called "string" I call "y" and set it as a numpy array:

y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])

Now let's fit the NearestNeighbors model:

nnb = NearestNeighbors(n_neighbors=1).fit(df)

and compute which point in the dataset is the closest to the given point y:

distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]

So the nearest point has index 13 in the dataframe. Let's print the 13th position of first_column:

print(first_column.loc[13])
0
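One refinement worth considering, as a sketch only and assuming the 19 features should carry equal weight: the columns here differ in magnitude by many orders, so unscaled Euclidean distance is dominated by the largest columns. Standardizing first can change which row comes out nearest.

from sklearn.preprocessing import StandardScaler

# Sketch: standardize features before the neighbor search so the
# huge-magnitude columns do not dominate the Euclidean distance.
scaler = StandardScaler().fit(df)
nnb_scaled = NearestNeighbors(n_neighbors=1).fit(scaler.transform(df))
distances, indices = nnb_scaled.kneighbors(scaler.transform(y), n_neighbors=1)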
For loop only returning last item
I am stumped on this problem and wondering if anyone can see where I'm going wrong. I am trying to calculate the Euclidean distance between each row and every other row:

# Create random df
df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))
test = df[:50]

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]

Then, I sort those distances and return the index positions of the "most similar" rows by minimum distance in the list smallest_dist. The issue is that this only returns the most similar index positions of the last row:

[6.0, 3.0, 4.0]

What I want for output is something like this:

Original ID    Matches
1              4,5,6
2              8,2,5

I've tried this but it gives the same result:

list_of_mins = []
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
    for i in range(len(test)):
        list_of_mins.append(smallest_dist_ixs)

Does anyone know what's causing this problem? Thank you!
I don't have the distance library available, so I changed that call to a simple sum, but it should work after you replace the distance call:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]

dict_results = {'ids': [], 'ids_min': []}
n_min = 2

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: np.sum(row), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    selected_min = distance_frame.sort_values("dist").head(n_min)
    dict_results['ids'].append(i)
    dict_results['ids_min'].append(', '.join(selected_min['idx'].astype('str')))

print(pd.DataFrame(dict_results))

I made a few changes to your code:

- Added an n_min parameter to define how many elements you want in the second column (the number of indices of the closest rows).
- Created a dict where the results are saved, to build the data frame you want.
- Inside the loop, appended the results of each iteration to that dict.
- After the loop, passing the dict to pd.DataFrame parses it the same way you were doing with distance_frame.
What happens if you store the results, either in the data frame or (for convenience of testing) a dictionary? For example:

df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))
test = df[:50]
closest_nodes = {}

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    closest_nodes[i] = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]

The thing I didn't see in your code was some sort of storage action to put the one result per test case into a permanent structure.
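For this row-by-row pattern, a vectorized sketch may also help (assuming distance in the question is scipy.spatial.distance): compute the full pairwise distance matrix once, then take the three smallest nonzero entries per row.

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]

# 50 x 50 matrix of pairwise Euclidean distances between rows of test.
dists = cdist(test, test)
# Column 0 of the argsort is each row itself (distance 0), so take 1:4.
matches = np.argsort(dists, axis=1)[:, 1:4]
result = pd.DataFrame({'Original ID': test.index,
                       'Matches': [','.join(map(str, row)) for row in matches]})
print(result.head())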
Cannot group datapoints by cluster
I have a dataset where each datapoint has 5 features and a cluster assigned to it. You can see the beginning of it here; the last column is the cluster number:

[[4.01682810e-01 2.14628527e-02 2.99529665e-02 2.79935965e-01 9.21441137e-01 9.00000000e+00]
 [9.32087200e-03 3.38196129e-01 8.49571569e-01 3.69402590e-01 1.92096835e-01 1.20000000e+01]
 [7.51465196e-01 4.45955645e-01 3.37174838e-01 3.65047097e-01 5.81725084e-01 1.00000000e+00]

I want to create a list of lists of datapoints of the same cluster, so I wrote the following function and tried to execute it:

def returnArrayOfClusters(data, clusterNumbers):
    # create an empty column
    column = []
    # create an empty list we want to output
    listOfClusters = []
    # fill it with a column for each cluster
    for i in clusterNumbers:
        listOfClusters.append(column)
    print(listOfClusters)
    # fill the columns with points according to their cluster
    for datapoint in data:
        print(datapoint)
        cluster = int(datapoint[5])
        listOfClusters[cluster].append(datapoint)
    return listOfClusters

listOfClusters = returnArrayOfClusters(data_labeled_unfinished, range(0, 14))

What I get is an unordered list of datapoints of this format (the end of the list), and as you can see, the points in one column belong to different clusters (they have different last values):

array([ 0.81802695,  0.45533606,  0.33799001,  0.26154893,  0.64155249, 13.        ]),
array([0.12995366, 0.45586338, 0.85833814, 0.32153188, 0.28736836, 1.        ]),
array([0.06230581, 0.47400143, 0.78671841, 0.3162376 , 0.04819034, 9.        ]),
array([0.15291747, 0.54247295, 0.54407916, 0.87888682, 0.46639597, 8.        ]),
array([ 0.21578994,  0.178303  ,  0.80642112,  0.39853499,  0.27832876, 10.        ]),
array([0.27426491, 0.32986967, 0.49411613, 0.50818875, 0.2336591 ,  5.        ])]

Maybe it is a very simple mistake, but I just cannot spot the error. What I expect, however, is for all the points in each inner list to be of the same cluster (i.e. to have the same value in the 6th item).
Hopefully I understood you correctly: you can split your data using a list comprehension, for example:

from sklearn.cluster import KMeans
import numpy as np

X = np.random.normal(0, 1, (100, 5))
kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
data = np.concatenate((X, kmeans.labels_.reshape(-1, 1)), axis=1)

[data[data[:, 5] == i, ] for i in np.unique(data[:, 5])]

In your case:

[data_labeled_unfinished[data_labeled_unfinished[:, 5] == i, ] for i in np.unique(data_labeled_unfinished[:, 5])]
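It may also help to name the bug in the original function: listOfClusters.append(column) appends the same list object for every cluster, so all fourteen entries alias one list and every datapoint lands in it. A sketch of the likely fix:

# Likely fix for the original function: create a fresh list per cluster
# instead of appending the same `column` object fourteen times.
listOfClusters = [[] for _ in clusterNumbers]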
Python for loop, iterating over values from numpy arange method
I need to write code that tests a numpy array of cutoff values for a classification problem. The values to test are stored in the cutoff_list variable. I then want to place the list of resulting confusion matrices in a dictionary. However, the code below gives me only the first dictionary entry (the confusion matrix for the first test value):

cutoff_list = [np.arange(0,1,0.01)]  # list of test values
dictionary = {}

for i, v in enumerate(cutoff_list):
    actual = (df.observed)
    predicted = np.where(df.indicator > i, 1, 0)
    df_confusion = confusion_matrix(actual, predicted) / len(df.indicator)
    dictionary[i] = df_confusion

print(dictionary)

Libraries that I am using:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

Is this a problem with the loop or the dictionary update step? I'm new to Python, have more experience with R, and am still struggling here. Any help appreciated.
Your for loop runs only once because of the way you set it up: you are enclosing the return value of np.arange(0,1,0.01) in a list, which breaks the way you want your for loop to run. You are getting just one value because the outer list has just one item.

>>> cutoff_list = [np.arange(0,1,0.01)]
>>> cutoff_list
[array([0.  , 0.01, ... 0.98, 0.99])]
>>> type(cutoff_list)
<class 'list'>

You want the actual numpy array:

>>> cutoff_list = np.arange(0,1,0.01)
>>> cutoff_list
array([0.  , 0.01, ... 0.98, 0.99])
>>> type(cutoff_list)
<class 'numpy.ndarray'>

Change the line

cutoff_list = [np.arange(0,1,0.01)]

to

cutoff_list = np.arange(0,1,0.01)

and see whether that resolves your problem. I would also imagine that you want to use v instead of i in this line:

predicted = np.where(df.indicator > i, 1, 0)

since i holds just the enumeration index that you use as the key for your dict, whereas v holds the values from cutoff_list.
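Putting both fixes together, a minimal sketch (assuming a df with observed and indicator columns as in the question):

# Sketch of the corrected loop: iterate over the array itself and
# threshold on the cutoff value v, keying the dict by the cutoff.
cutoff_list = np.arange(0, 1, 0.01)
dictionary = {}
for i, v in enumerate(cutoff_list):
    predicted = np.where(df.indicator > v, 1, 0)
    dictionary[v] = confusion_matrix(df.observed, predicted) / len(df.indicator)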
for loop in scipy.stats.linregress
I am using the scipy stats module to calculate the linear regression, i.e.

slope, intercept, r_value, p_value, std_err = stats.linregress(data['cov_0.0075']['num'], data['cov_0.0075']['com'])

where data is a dictionary containing several 'cov_x' keys, each corresponding to a dataframe with columns 'num' and 'com'. I want to be able to loop through this dictionary and do linear regression on each 'cov_x'. I am not sure how to do this. I tried:

for i in data:
    slope_+str(i), intercept+str(i), r_value+str(i), p_value+str(i), std_err+str(i) = stats.linregress(data[i]['num'], data[i]['com'])

Essentially I want len(x) slope_x values.
You could use a list comprehension to collect all the stats.linregress return values:

result = [stats.linregress(df['num'], df['com']) for key, df in data.items()]

result is a list of 5-tuples. To collect all the first, second, third, etc. elements from each 5-tuple into separate lists, use zip(*result):

slopes, intercepts, r_values, p_values, stderrs = zip(*result)
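If you'd rather keep each result associated with its 'cov_x' key, a dict comprehension is a possible variant (a sketch; linregress returns a named tuple, so the statistics are also reachable as attributes):

# Sketch: key each regression result by the original dictionary key.
results = {key: stats.linregress(df['num'], df['com']) for key, df in data.items()}
# The return value is a named tuple, so individual statistics are attributes.
print(results['cov_0.0075'].slope)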
You should be able to do what you're trying to, but there are a couple of things to watch out for. First, you can't add a string to a variable name and store it that way: no plus signs on the left of the equals sign, ever. You can still accomplish what you're after; just make sure that you use the dict data type if you want string indexing.

import scipy.stats as stats
import pandas as pd
import numpy as np

data = {}
l = ['cov_0.0075', 'cov_0.005']
for i in l:
    x = np.random.random(100)
    y = np.random.random(100) + 15
    d = {'num': x, 'com': y}
    df = pd.DataFrame(data=d)
    data[i] = df

slope = {}
intercept = {}
r_value = {}
p_value = {}
std_error = {}

for i in data:
    slope[str(i)], \
    intercept[str(i)], \
    r_value[str(i)], \
    p_value[str(i)], \
    std_error[str(i)] = stats.linregress(data[i]['num'], data[i]['com'])

print(slope, intercept, r_value, p_value, std_error)

This should work just fine. Otherwise, you can store the individual values and put them in a list later.
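As a possible follow-up, a sketch building on the dicts above: since all five dicts share the same keys, they collect neatly into a single summary DataFrame indexed by the 'cov_x' keys.

# Sketch: combine the per-key statistics into one DataFrame,
# one row per 'cov_x' key, one column per statistic.
summary = pd.DataFrame({'slope': slope, 'intercept': intercept,
                        'r_value': r_value, 'p_value': p_value,
                        'std_error': std_error})
print(summary)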