K-Means clustering with 6d vectors - python
I have a dataset of R-D (rate-distortion) curves such as the following:
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422)
(3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
Each curve consists of two 6-D vectors, one holding PSNR values and one holding bitrates. I am trying to cluster these curves using K-Means, but how should I pass them as input? Do I need to enter 2-D points that pair the values column by column, such as (33.3987, 3929.76)?
Or do I have to concatenate them into a single vector?
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422 3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
I am confused because I am not sure what form of vector K-Means expects as input.
I used this to combine two arrays as input to K-Means:
psnr=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/psnr_1080.npy')
bitrate=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/bitrate_1080.npy')
kmeans_input=np.array([psnr,bitrate])
and it produces this error:
Traceback (most recent call last):
  File "<ipython-input-33-28c2bfac9deb>", line 2, in <module>
    scaled_features = pd.DataFrame((kmeans_input))
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 497, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 190, in init_ndarray
    values = _prep_ndarray(values, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 324, in _prep_ndarray
    raise ValueError(f"Must pass 2-d input. shape={values.shape}")
ValueError: Must pass 2-d input. shape=(2, 71, 6)
You should create a list of the vectors, i.e. a NumPy array of shape (n_vectors, 6) with one vector per row.
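from sklearn.cluster import KMeans
import numpy as np
# Each row of X is one sample (one 6-D vector).
X = np.array([[33.3987, 34.7318, 35.9673, 36.8494, 37.6992, 38.422],
              [3929.76, 4946.93, 6069.78, 7243.61, 8185.01, 9387.84]])
# n_clusters cannot exceed the number of samples (two rows here).
kmeans = KMeans(n_clusters=2).fit(X)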
Obviously you will need to change n_clusters to get good results.
See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for more info.
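If each R-D curve should be treated as a single sample, one option is to place the PSNR and bitrate values side by side, which turns the two (71, 6) arrays from the question into one (71, 12) matrix. The sketch below uses random stand-in data instead of the .npy files, and the names psnr, bitrate, features and scaled are illustrative assumptions. The standardization step is a suggestion rather than a requirement: K-Means uses Euclidean distance, so without it the bitrate columns (thousands) would outweigh the PSNR columns (tens).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the two arrays in the question, each of shape (71, 6).
psnr = rng.uniform(30.0, 40.0, size=(71, 6))
bitrate = rng.uniform(3000.0, 10000.0, size=(71, 6))

# One row per R-D curve: 6 PSNR values followed by 6 bitrates.
features = np.hstack([psnr, bitrate])

# Standardize every column to zero mean and unit variance so that the
# large bitrate values do not dominate the distance computation.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(features.shape)
```

The resulting 2-D array can then be passed to KMeans(n_clusters=k).fit(scaled), where k is whatever number of clusters suits the data.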
Related
Can't get correct input for DBSCAN clustering
I have a node2vec embedding stored as a .csv file, values are a square symmetric matrix. I have two versions of this, one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't seem to figure out how to get the input right. I tried this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
input_file = "node2vec-labels-on-columns.emb"
# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")
# put the original column names in a python list
original_headers = list(df.columns.values)
emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
This leads to an error:
dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  emb = df.as_matrix()
Traceback (most recent call last):
  File "dbscan.py", line 15, in <module>
    db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
  File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
    X = self._validate_data(X, accept_sparse='csr')
  File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported from sklearn, so they are of no help in figuring out how to read from a file. Can anyone point me in the right direction?
The error does not come from the fact that you are reading the dataset from a file but from the content of the dataset. DBSCAN is meant to be used on numerical data. As stated in the error, it does not support NaNs. If you want to cluster strings or labels, you should find some other model.
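One way to check whether label text is what introduces the NaNs is to parse a small file and look at which columns come out non-finite. The sketch below is only an illustration under assumptions: the inline text is a made-up miniature of a tab-delimited embedding with node names in the first row and first column, and it uses np.genfromtxt rather than the pandas call from the question. The actual cause in the asker's file is not known.

```python
import io
import numpy as np

# Hypothetical miniature of the embedding file: a header row of node names,
# then one row per node with the node name in the first column.
text = "node\ta\tb\tc\na\t0.0\t0.5\t0.2\nb\t0.5\t0.0\t0.7\nc\t0.2\t0.7\t0.0\n"

# Parsing every column as float turns the text labels into NaN.
raw = np.genfromtxt(io.StringIO(text), delimiter="\t", skip_header=1)
nan_columns = np.isnan(raw).any(axis=0)
print(nan_columns)

# Keep only the numeric columns and confirm the matrix is clean
# before handing it to a clustering model.
emb = raw[:, 1:]
print(emb.shape, np.isfinite(emb).all())
```

If np.isfinite(emb).all() is True, the array contains only finite numbers and would pass the validation step that raised the error above.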
Python k-means get error Found array with 0 feature(s)
I am trying to read a csv file and apply k-means algorithm to identify the groups of the elements. My code is this:
import csv
import numpy as np
import scipy as sp
from sklearn import cluster as sk
print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
I use genfromtxt because there are some missing values and with this statement I can bypass these. For the moment I would like to see the full return of the k_means function but I get
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "ejercicio2.py", line 6, in <module>
    print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
  File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 345, in k_means
    x_squared_norms=x_squared_norms, random_state=random_state)
  File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 388, in _kmeans_single_elkan
    X = check_array(X, order="C")
  File "/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 424, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(3312, 0)) while a minimum of 1 is required.
You are passing all the rows but no columns by writing [:, :0], hence the error. If you want all the rows and all the columns, just remove that slice from the line. In general, the syntax is data[x:y, a:b], which selects rows x up to (but not including) y and columns a up to (but not including) b.
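The effect of the slice is easy to see on a small array. This is a minimal sketch with a made-up 3 x 4 array standing in for the contents of keywords.csv:

```python
import numpy as np

# Made-up 3 x 4 array standing in for the parsed CSV data.
data = np.arange(12).reshape(3, 4)

# [:, :0] keeps every row but stops before column 0, so no columns remain.
empty = data[:, :0]
print(empty.shape)      # (3, 0)

# Dropping the slice (or writing [:, :]) keeps every column.
full = data[:, :]
print(full.shape)       # (3, 4)

# Rows 0..1 and columns 1..2; both upper bounds are exclusive.
block = data[0:2, 1:3]
print(block)            # [[1 2]
                        #  [5 6]]
```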
KMeans in Python: ValueError: setting an array element with a sequence
I am trying to perform kmeans clustering in Python using numpy and sklearn. I have a txt file with 45 columns and 645 rows. The first row is Y and remaining 644 rows are X. My Python code is:
import numpy as np
import matplotlib.pyplot as plt
import csv
from sklearn.cluster import KMeans
#The following code reads the first row and terminates the loop
with open('trainDataXY.txt','r') as f:
    read = csv.reader(f)
    for first_row in read:
        y = list(first_row)
        break
#The following code skips the first row and reads rest of the rows
firstLine = True
with open('trainDataXY.txt','r') as f1:
    readY = csv.reader(f1)
    for rows in readY:
        if firstLine:
            firstLine=False
            continue
        x = list(readY)
X = np.array((x,y), dtype=object)
kmean = KMeans(n_clusters=2)
kmean.fit(X)
I get an error at this line: kmean.fit(X)
The error I get is:
Traceback (most recent call last):
  File "D:\file_path\kmeans.py", line 25, in <module>
    kmean.fit(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 812, in fit
    X = self._check_fit_data(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 786, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
trainDataXY.txt
1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5
47,64,50,39,66,51,46,37,43,37,37,35,36,34,37,38,37,39,104,102,103,103,102,108,109,107,106,115,116,116,120,122,121,121,116,116,131,131,130,132,126,127,131,128,127
47,65,58,30,39,48,47,35,42,37,38,37,37,36,38,38,38,40,104,103,103,103,101,108,110,108,106,116,115,116,121,121,119,121,116,116,133,131,129,132,127,128,132,126,127
49,69,55,28,56,64,50,30,41,37,39,37,38,36,39,39,39,40,105,103,104,104,103,110,110,108,107,116,115,117,120,120,117,121,115,116,134,131,129,134,128,125,134,126,127
51,78,52,46,56,74,50,28,38,38,39,38,38,37,40,39,39,41,96,101,99,104,97,101,111,101,104,115,116,116,119,110,112,119,116,116,135,130,129,135,120,108,133,120,125
55,79,53,65,52,102,55,28,36,39,40,38,39,37,40,39,40,42,79,86,84,105,84,57,110,85,76,117,118,115,110,66,86,117,117,118,123,130,130,129,106,93,130,113,114
48,80,59,81,50,120,63,26,31,39,40,39,40,38,42,37,41,42,53,73,77,90,47,34,76,52,63,106,102,97,80,33,68,105,105,113,115,130,124,111,83,91,128,105,110
45,95,56,86,38,137,60,27,27,39,40,38,40,37,41,52,38,41,24,44,44,79,40,32,48,26,28,63,52,59,42,30,62,79,67,77,116,121,122,114,96,90,126,93,103
45,93,47,86,35,144,60,26,27,39,40,45,39,38,43,87,46,58,33,21,26,62,42,49,49,37,24,33,41,56,29,28,68,79,58,74,115,111,115,119,117,104,132,92,97
48,85,50,83,37,142,62,25,29,57,47,77,43,64,61,115,70,101,41,28,28,48,39,46,42,38,37,47,43,74,32,28,64,86,80,81,127,113,99,130,140,112,139,92,97
48,94,78,77,30,138,57,28,29,91,66,94,61,94,103,129,89,140,38,34,32,38,33,43,38,36,39,50,39,75,31,33,65,89,82,84,127,112,100,133,141,107,136,95,97
45,108,158,77,30,140,67,29,26,104,97,113,92,106,141,137,116,151,33,32,32,43,44,40,37,34,37,54,86,77,55,48,77,112,83,109,120,111,105,124,133,98,129,89,99
48,139,173,64,40,159,61,55,27,115,117,128,106,124,150,139,125,160,27,26,29,54,51,47,36,36,32,80,125,105,97,96,86,130,102,118,117,104,105,118,117,92,130,94,97
131,157,143,66,87,130,57,118,26,124,137,129,133,138,156,133,132,173,29,25,28,81,48,38,48,32,24,134,165,144,149,142,110,145,147,161,114,112,103,118,115,94,126,87,102
160,162,146,78,116,127,52,133,71,116,141,125,125,141,169,115,110,161,69,53,46,97,79,47,76,59,32,148,147,134,165,152,111,155,139,145,116,113,101,118,105,86,123,92,99
Your data matrix should not be of type object. It should be a numeric matrix of shape n_samples x n_features. This error usually crops up when a list of samples is converted into a data matrix, each sample is an array or a list, and at least one sample does not have the same length as the others. You can check for this by evaluating np.unique(list(map(len, X))). Your case is different: np.array((x,y), dtype=object) pairs the list of rows x with the label row y instead of building a numeric matrix. The first thing to try is to replace the line X = np.array((x,y), dtype=object) with something that creates a data matrix. You should also consider using numpy.recfromcsv to read your data; it will make everything easier to read.
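A minimal sketch of building such a data matrix is shown below. It swaps in np.loadtxt for the reader mentioned above and uses a made-up four-column miniature of trainDataXY.txt. Treating each column as one sample (so that the label row lines up with the samples) is an assumption about the file layout, not something the question confirms.

```python
import io
import numpy as np

# Made-up miniature of trainDataXY.txt: the first row holds the labels,
# the remaining rows hold the measurements, one column per sample.
text = "1,1,2,2\n47,64,37,37\n47,65,37,38\n49,69,37,39\n"

data = np.loadtxt(io.StringIO(text), delimiter=",")

y = data[0].astype(int)   # label row
X = data[1:].T            # numeric matrix, shape (n_samples, n_features)
print(X.shape, X.dtype)

# The length check from the answer, applied to a deliberately ragged list:
# more than one distinct length means the rows cannot form a matrix.
ragged = [[1, 2, 3], [4, 5]]
lengths = np.unique(list(map(len, ragged)))
print(lengths)
```

With a real file, io.StringIO(text) would be replaced by the file name, and X could then be passed to KMeans.fit.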
Python: create multiple boxplots in one panel
I have been using R for a long time and have recently started learning Python. I would like to create multiple box plots in one panel in Python. My dataset is in vector form, and a label vector indicates which box plot each element of the data belongs to. The example looks like this:
N = 50
data = np.random.lognormal(size=N, mean=1.5, sigma=1.75)
label = np.repeat([1,2,3,4,5],N/5)
According to various websites (e.g., matplotlib: Group boxplots), creating multiple boxplots requires a matrix input in which each column contains the samples for one boxplot. So I created a list object based on data and label:
savelist = data[ label == 1]
for i in [2,3,4,5]:
    savelist = [savelist, data[ label == i]]
However, the code below gives me an error:
boxplot(savelist)
Traceback (most recent call last):
  File "<ipython-input-222-1a55d04981c4>", line 1, in <module>
    boxplot(savelist)
  File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2636, in boxplot
    meanprops=meanprops, manage_xticks=manage_xticks)
  File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3045, in boxplot
    labels=labels)
  File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/cbook.py", line 1962, in boxplot_stats
    stats['mean'] = np.mean(x)
  File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2727, in mean
    out=out, keepdims=keepdims)
  File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py", line 66, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (2,) (10,)
Can anyone explain what is going on?
You're ending up with a nested list instead of a flat list. Try this instead:
savelist = [data[label == 1]]
for i in [2,3,4,5]:
    savelist.append(data[label == i])
And it should work.
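The difference between the two loops can be seen without plotting anything. The sketch below rebuilds the question's data with a seeded generator (an assumption made so the example is reproducible) and uses N // 5 so it also runs on Python 3:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
data = rng.lognormal(mean=1.5, sigma=1.75, size=N)
label = np.repeat([1, 2, 3, 4, 5], N // 5)

# The question's loop: every pass wraps the previous result in a new
# two-element list, so the outer list never grows beyond length 2.
nested = data[label == 1]
for i in [2, 3, 4, 5]:
    nested = [nested, data[label == i]]
print(len(nested))                   # 2

# The answer's approach: one flat list holding five arrays of 10 values.
groups = [data[label == 1]]
for i in [2, 3, 4, 5]:
    groups.append(data[label == i])
print([len(g) for g in groups])      # [10, 10, 10, 10, 10]
```

The flat list groups is the form that can be handed to boxplot, with one array per box.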
ZeroDivisionError when using scipy.interpolate.griddata
I'm getting a ZeroDivisionError from the following code:
#stacking the array into a complex array allows np.unique to choose
#truly unique points. We also keep a handle on the unique indices
#to allow us to index `self` in the same order.
unique_points,index = np.unique(xdata[mask]+1j*ydata[mask],
                                return_index=True)
#Now we break it into the data structure we need.
points = np.column_stack((unique_points.real,unique_points.imag))
xx1,xx2 = self.meta['rcm_xx1'],self.meta['rcm_xx2']
yy1 = self.meta['rcm_yy2']
gx = np.arange(xx1,xx2+dx,dx)
gy = np.arange(-yy1,yy1+dy,dy)
GX,GY = np.meshgrid(gx,gy)
xi = np.column_stack((GX.ravel(),GY.ravel()))
gdata = griddata(points,self[mask][index],xi,method='linear',
                 fill_value=np.nan)
Here, xdata, ydata and self are all 2D numpy.ndarrays (or subclasses thereof) with the same shape and dtype=np.float32. mask is a 2D ndarray with the same shape and dtype=bool. Here's a link for those wanting to peruse the scipy.interpolate.griddata documentation.
Originally, xdata and ydata are derived from a non-uniform cylindrical grid that has a 4 point stencil. I thought that the error might be coming from the fact that the same point was defined multiple times, so I made the set of input points unique as suggested in this question. Unfortunately, that hasn't seemed to help.
The full traceback is:
Traceback (most recent call last):
  File "/xxxxxxx/rcm.py", line 428, in <module>
    x[...,1].to_pz0()
  File "/xxxxxxx/rcm.py", line 285, in to_pz0
    fill_value=fill_value)
  File "/usr/local/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.py", line 183, in griddata
    ip = LinearNDInterpolator(points, values, fill_value=fill_value)
  File "interpnd.pyx", line 192, in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:2935)
  File "qhull.pyx", line 996, in scipy.spatial.qhull.Delaunay.__init__ (scipy/spatial/qhull.c:6607)
  File "qhull.pyx", line 183, in scipy.spatial.qhull._construct_delaunay (scipy/spatial/qhull.c:1919)
ZeroDivisionError: float division
For what it's worth, the code "works" (No exception) if I use the "nearest" method.
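For readers unfamiliar with the de-duplication step described in the question, here is a small self-contained sketch of it. The five sample points and the values array are made up for illustration, and the sketch only shows how duplicates are removed; it does not reproduce or diagnose the ZeroDivisionError.

```python
import numpy as np

# Made-up scattered points; (0, 0) and (1, 1) each appear twice.
xdata = np.array([0.0, 1.0, 0.0, 2.0, 1.0], dtype=np.float32)
ydata = np.array([0.0, 1.0, 0.0, 3.0, 1.0], dtype=np.float32)
values = np.array([10.0, 20.0, 10.0, 30.0, 20.0])

# Encode each (x, y) pair as one complex number so np.unique compares
# whole points; return_index gives the first occurrence of each point.
unique_points, index = np.unique(xdata + 1j * ydata, return_index=True)

# Split back into an (n_points, 2) coordinate array and pick the
# matching values in the same order.
points = np.column_stack((unique_points.real, unique_points.imag))
unique_values = values[index]

print(points)
print(unique_values)
```

The arrays points and unique_values have matching lengths, which is the form griddata expects for its first two arguments.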