KMeans in Python: ValueError: setting an array element with a sequence
I am trying to perform kmeans clustering in Python using numpy and sklearn.
I have a txt file with 45 columns and 645 rows. The first row is Y and remaining 644 rows are X.
My Python code is:
import numpy as np
import matplotlib.pyplot as plt
import csv
from sklearn.cluster import KMeans
#The following code reads the first row and terminates the loop
with open('trainDataXY.txt','r') as f:
    read = csv.reader(f)
    for first_row in read:
        y = list(first_row)
        break
#The following code skips the first row and reads rest of the rows
firstLine = True
with open('trainDataXY.txt','r') as f1:
    readY = csv.reader(f1)
    for rows in readY:
        if firstLine:
            firstLine=False
            continue
        x = list(readY)
X = np.array((x,y), dtype=object)
kmean = KMeans(n_clusters=2)
kmean.fit(X)
I get an error at this line: kmean.fit(X)
The error I get is:
Traceback (most recent call last):
  File "D:\file_path\kmeans.py", line 25, in <module>
    kmean.fit(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 812, in fit
    X = self._check_fit_data(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 786, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
trainDataXY.txt
1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5
47,64,50,39,66,51,46,37,43,37,37,35,36,34,37,38,37,39,104,102,103,103,102,108,109,107,106,115,116,116,120,122,121,121,116,116,131,131,130,132,126,127,131,128,127
47,65,58,30,39,48,47,35,42,37,38,37,37,36,38,38,38,40,104,103,103,103,101,108,110,108,106,116,115,116,121,121,119,121,116,116,133,131,129,132,127,128,132,126,127
49,69,55,28,56,64,50,30,41,37,39,37,38,36,39,39,39,40,105,103,104,104,103,110,110,108,107,116,115,117,120,120,117,121,115,116,134,131,129,134,128,125,134,126,127
51,78,52,46,56,74,50,28,38,38,39,38,38,37,40,39,39,41,96,101,99,104,97,101,111,101,104,115,116,116,119,110,112,119,116,116,135,130,129,135,120,108,133,120,125
55,79,53,65,52,102,55,28,36,39,40,38,39,37,40,39,40,42,79,86,84,105,84,57,110,85,76,117,118,115,110,66,86,117,117,118,123,130,130,129,106,93,130,113,114
48,80,59,81,50,120,63,26,31,39,40,39,40,38,42,37,41,42,53,73,77,90,47,34,76,52,63,106,102,97,80,33,68,105,105,113,115,130,124,111,83,91,128,105,110
45,95,56,86,38,137,60,27,27,39,40,38,40,37,41,52,38,41,24,44,44,79,40,32,48,26,28,63,52,59,42,30,62,79,67,77,116,121,122,114,96,90,126,93,103
45,93,47,86,35,144,60,26,27,39,40,45,39,38,43,87,46,58,33,21,26,62,42,49,49,37,24,33,41,56,29,28,68,79,58,74,115,111,115,119,117,104,132,92,97
48,85,50,83,37,142,62,25,29,57,47,77,43,64,61,115,70,101,41,28,28,48,39,46,42,38,37,47,43,74,32,28,64,86,80,81,127,113,99,130,140,112,139,92,97
48,94,78,77,30,138,57,28,29,91,66,94,61,94,103,129,89,140,38,34,32,38,33,43,38,36,39,50,39,75,31,33,65,89,82,84,127,112,100,133,141,107,136,95,97
45,108,158,77,30,140,67,29,26,104,97,113,92,106,141,137,116,151,33,32,32,43,44,40,37,34,37,54,86,77,55,48,77,112,83,109,120,111,105,124,133,98,129,89,99
48,139,173,64,40,159,61,55,27,115,117,128,106,124,150,139,125,160,27,26,29,54,51,47,36,36,32,80,125,105,97,96,86,130,102,118,117,104,105,118,117,92,130,94,97
131,157,143,66,87,130,57,118,26,124,137,129,133,138,156,133,132,173,29,25,28,81,48,38,48,32,24,134,165,144,149,142,110,145,147,161,114,112,103,118,115,94,126,87,102
160,162,146,78,116,127,52,133,71,116,141,125,125,141,169,115,110,161,69,53,46,97,79,47,76,59,32,148,147,134,165,152,111,155,139,145,116,113,101,118,105,86,123,92,99
Your data matrix should not be of type object. It should be a matrix of numbers of shape n_samples x n_features.
This error usually crops up when people try to convert a list of samples into a data matrix, and each sample is an array or a list, and at least one of the samples does not have the same length as the others. This can be figured out by evaluating np.unique(list(map(len, X))).
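The length check mentioned above can be sketched as follows. The `samples` list is made up for illustration, with one row deliberately shorter than the others.

```python
import numpy as np

# Hypothetical samples: the third row is one element short.
samples = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0],
]

# Collect the distinct row lengths. More than one value means the rows are
# ragged and cannot form an n_samples x n_features matrix.
row_lengths = np.unique(list(map(len, samples)))
print(row_lengths)

is_rectangular = len(row_lengths) == 1
print(is_rectangular)
```

If more than one length is reported, the offending rows have to be fixed or dropped before the list can become a numeric matrix.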
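Your case is different: np.array((x,y), dtype=object) packs x (a list of rows) and y (a single row) into an object array, which is not a numeric matrix, so the conversion to float64 inside kmean.fit fails. Make sure you build a data matrix instead. The first thing to try is to replace the line X = np.array((x,y), dtype=object) with something that creates a 2-D numeric array.
Also consider reading your data with numpy.recfromcsv (or a similar NumPy loader such as numpy.genfromtxt or numpy.loadtxt with delimiter=','). It will make everything easier to read.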
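A minimal sketch of building such a data matrix, using numpy.loadtxt on a tiny made-up excerpt instead of the real file. It assumes that the first row holds one label per column and that each column is one sample; if each row is a sample instead, the transpose should be dropped.

```python
import io
import numpy as np

# Stand-in for trainDataXY.txt: the first row holds labels, the rest are measurements.
# (Tiny made-up excerpt; the real file has 45 columns and 645 rows.)
text = io.StringIO(
    "1,1,2,2\n"
    "47,64,37,35\n"
    "47,65,37,38\n"
    "49,69,37,39\n"
)

data = np.loadtxt(text, delimiter=",")  # 2-D float array, one entry per number

y = data[0].astype(int)  # first row: one label per column
X = data[1:].T           # assumption: each column is a sample, so transpose

print(X.shape, X.dtype)  # (n_samples, n_features) and a float dtype
```

An array shaped like this X is the kind of input KMeans(n_clusters=2).fit(X) expects; passing the file name 'trainDataXY.txt' to np.loadtxt in place of the StringIO object would read the real data.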
Related
Can't get correct input for DBSCAN clustering
I have a node2vec embedding stored as a .csv file, values are a square symmetric matrix. I have two versions of this, one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't seem to figure out how to get the input right. I tried this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
input_file = "node2vec-labels-on-columns.emb"
# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")
# put the original column names in a python list
original_headers = list(df.columns.values)
emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
This leads to an error:
dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  emb = df.as_matrix()
Traceback (most recent call last):
  File "dbscan.py", line 15, in <module>
    db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
  File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
    X = self._validate_data(X, accept_sparse='csr')
  File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported from sklearn, so those are of no help in figuring out how to read from a file. Can anyone point me in the right direction?
The error does not come from the fact that you are reading the dataset from a file but from the content of the dataset. DBSCAN is meant to be used on numerical data. As stated in the error, it does not support NaNs. If you want to cluster strings or labels, you should find some other model.
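Before fitting, it can help to check where the non-finite values are. The sketch below uses a made-up 3 x 3 array in place of the loaded embedding; dropping the affected rows is only one possible remedy and is shown as an assumption, not as the recommended fix for this dataset.

```python
import numpy as np

# Made-up 3 x 3 "embedding" with one missing value, standing in for the loaded file.
emb = np.array([
    [0.0, 0.2, np.nan],
    [0.2, 0.0, 0.5],
    [0.1, 0.5, 0.0],
])

# True wherever a value is NaN or infinite.
bad = ~np.isfinite(emb)
print("any non-finite values:", bad.any())
print("positions (row, col):", np.argwhere(bad).tolist())

# One possible fix (an assumption, not the only option): drop rows with bad values.
clean = emb[~bad.any(axis=1)]
print(clean.shape)
```

If the NaNs turn out to sit in the row or column that holds the node names, telling the reader which column is the index (for example the index_col argument of pd.read_csv) may be the more appropriate fix.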
K-Means clustering with 6d vectors
I have a dataset of R-D curves such as the following.
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422)
(3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
Each curve gives two 6D vectors, one for PSNR and one for bitrate. I am trying to cluster these vectors using K-Means clustering, but my question is: how can I use these vectors as input to K-Means? Do I need to enter 2D inputs for each column, such as (33.3987, 3929.76), or do I have to put them beside each other?
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422 3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
I am confused because I am not sure what K-Means expects as an input vector. I used this to combine two arrays as input to K-Means:
psnr_bitrate=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/bitrate_1080.npy')
bitrate=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/psnr_1080.npy')#***
kmeans_input=np.array([psnr_bitrate],[bitrate])
and it produces this error:
Traceback (most recent call last):
  File "<ipython-input-33-28c2bfac9deb>", line 2, in <module>
    scaled_features = pd.DataFrame((kmeans_input))
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 497, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 190, in init_ndarray
    values = _prep_ndarray(values, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 324, in _prep_ndarray
    raise ValueError(f"Must pass 2-d input. shape={values.shape}")
ValueError: Must pass 2-d input. shape=(2, 71, 6)
You should create a list of the vectors, i.e. a numpy array of shape=(n_vectors, 6).
from sklearn.cluster import KMeans
import numpy as np
# one row per vector; only two rows are shown here
X = np.array([[33.3987, 34.7318, 35.9673, 36.8494, 37.6992, 38.422],
              [3929.76, 4946.93, 6069.78, 7243.61, 8185.01, 9387.84]])
# n_clusters cannot exceed the number of rows in X
kmeans = KMeans(n_clusters=2).fit(X)
Obviously you will need to change n_clusters to get good results once X holds all of your vectors. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for more info.
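If the goal is instead to cluster whole curves, so that each sample combines its PSNR vector and its bitrate vector, the two (71, 6) arrays from the question can be placed side by side. This is a sketch with random stand-in data; the names `psnr` and `bitrate` and the value ranges are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the two loaded arrays, each of shape (71, 6).
psnr = rng.uniform(30.0, 40.0, size=(71, 6))
bitrate = rng.uniform(3000.0, 10000.0, size=(71, 6))

# Put the two 6-D vectors of each curve side by side:
# one row per curve, 12 features per row.
kmeans_input = np.hstack([psnr, bitrate])
print(kmeans_input.shape)  # (71, 12)
```

The result is 2-D, so it can be passed to pd.DataFrame or KMeans without the "Must pass 2-d input" error. Because PSNR and bitrate differ by orders of magnitude, scaling the features first (for example with sklearn.preprocessing.StandardScaler) is commonly done, though whether it suits this data is a judgment call.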
How to read and display MNIST dataset?
The code below opens the mnist dataset as a csv
import numpy as np
import csv
import matplotlib.pyplot as plt
with open('C:/Z_Uni/Individual_Project/Python_Projects/NeuralNet/MNIST_Dataset/mnist_train.csv/mnist_train.csv', 'r') as csv_file:
    for data in csv.reader(csv_file):
        # The first column is the label
        label = data[0]
        # The rest of columns are pixels
        pixels = data[1:]
        # Make those columns into a array of 8-bits pixels
        # This array will be of 1D with length 784
        # The pixel intensity values are integers from 0 to 255
        pixels = np.array(pixels, dtype='uint8')
        print(pixels.shape)
        # Reshape the array into 28 x 28 array (2-dimensional array)
        pixels = pixels.reshape((28, 28))
        print(pixels.shape)
        # Plot
        plt.title('Label is {label}'.format(label=label))
        plt.imshow(pixels, cmap='gray')
        plt.show()
        break # This stops the loop, I just want to see one
I got the code above from someone and cannot get it to display the mnist digits. I get the error:
Traceback (most recent call last):
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\Test_View_Mnist.py", line 16, in <module>
    pixels = np.array(pixels, dtype='uint8')
ValueError: invalid literal for int() with base 10: '1x1'
When I remove dtype='uint8' I get the error:
Traceback (most recent call last):
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\Test_View_Mnist.py", line 24, in <module>
    plt.imshow(pixels, cmap='gray')
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\_api\deprecation.py", line 456, in wrapper
    return func(*args, **kwargs)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\pyplot.py", line 2640, in imshow
    _ret = gca().imshow(
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\_api\deprecation.py", line 456, in wrapper
    return func(*args, **kwargs)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\__init__.py", line 1412, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\axes\_axes.py", line 5488, in imshow
    im.set_data(X)
  File "C:\Z_Uni\Individual_Project\Python_Projects\NeuralNet\source\lib\site-packages\matplotlib\image.py", line 706, in set_data
    raise TypeError("Image data of dtype {} cannot be converted to "
TypeError: Image data of dtype <U5 cannot be converted to float
Process finished with exit code 1
Could someone explain why this error is happening and how to fix it? Thanks.
There are two problems here. (1) You need to skip the first row because it holds the column headers (1x1, 1x2, etc.) rather than pixel data; this is what triggers the "invalid literal for int()" error. (2) The values must be converted to an integer dtype instead of being left as strings; the code below uses int64 (uint8 also works once the header row is skipped). The code below solves both. next(csvreader) skips the first row.
import numpy as np
import csv
import matplotlib.pyplot as plt
with open('./mnist_test.csv', 'r') as csv_file:
    csvreader = csv.reader(csv_file)
    next(csvreader)  # skip the header row
    for data in csvreader:
        # The first column is the label
        label = data[0]
        # The rest of columns are pixels
        pixels = data[1:]
        # Convert the pixel strings into a 1D integer array of length 784
        # The pixel intensity values are integers from 0 to 255
        pixels = np.array(pixels, dtype = 'int64')
        print(pixels.shape)
        # Reshape the array into 28 x 28 array (2-dimensional array)
        pixels = pixels.reshape((28, 28))
        print(pixels.shape)
        # Plot
        plt.title('Label is {label}'.format(label=label))
        plt.imshow(pixels, cmap='gray')
        plt.show()
        break  # show only the first digit, as in the question
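As an alternative to looping with csv.reader, the header can be skipped while loading the whole file into an array. The sketch below is not the answer's method; it uses numpy.loadtxt on a made-up two-row excerpt with four pixel columns, and the header name `label` is an assumption (only names such as 1x1 appear in the question).

```python
import io
import numpy as np

# Made-up stand-in for an MNIST-style CSV: a header row, then label + pixels.
# Real rows have 784 pixel columns; four are used here to keep the sketch short.
csv_text = io.StringIO(
    "label,1x1,1x2,2x1,2x2\n"
    "5,0,255,128,0\n"
    "0,12,0,0,200\n"
)

# skiprows=1 drops the header, so every remaining field parses as an integer.
data = np.loadtxt(csv_text, delimiter=",", skiprows=1, dtype=np.int64)

labels = data[:, 0]
# Use reshape(-1, 28, 28) for real MNIST rows.
images = data[:, 1:].astype(np.uint8).reshape(-1, 2, 2)

print(labels, images.shape)
```

Each entry of `images` can then be passed to plt.imshow(..., cmap='gray') in the same way as `pixels` in the code above.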
Python k-means get error Found array with 0 feature(s)
I am trying to read a csv file and apply k-means algorithm to identify the groups of the elements. My code is this:
import csv
import numpy as np
import scipy as sp
from sklearn import cluster as sk
print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
I use genfromtxt because there are some missing values and with this statement I can bypass these. For the moment I would like to see the full return of the k_means function but I get
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "ejercicio2.py", line 6, in <module>
    print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
  File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 345, in k_means
    x_squared_norms=x_squared_norms, random_state=random_state)
  File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 388, in _kmeans_single_elkan
    X = check_array(X, order="C")
  File "/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 424, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(3312, 0)) while a minimum of 1 is required.
By writing [:, :0] you are passing all the rows but no columns, hence the error. If you want to pass all the rows and all the columns, just remove the [:, :0] slice from that line. In general the syntax is data[x:y, a:b], which means rows from x to y (exclusive) and columns from a to b (exclusive).
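The effect of the slice can be seen on a small made-up array:

```python
import numpy as np

data = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

print(data[:, :0].shape)  # (3, 0): all rows, no columns
print(data[:, :].shape)   # (3, 4): all rows, all columns
print(data[0:2, 1:3])     # rows 0-1 and columns 1-2 (end indices are exclusive)
```

The (3, 0) shape in the first line is the same situation as the (3312, 0) shape reported in the error message.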
ValueError when trying to save ndarray (Numpy)
I am trying to translate a project I have in MATLAB to Python+Numpy because MATLAB keeps running out of memory. The file I have is rather long, so I have tried to make a minimal example that shows the same error.
Basically I'm making a 2d histogram of a dataset, and want to save it after some processing. The problem is that the numpy.save function throws a "ValueError: setting an array element with a sequence" when I try to save the output of the histogram function. I can't find the problem when I look at the docs of Numpy.
My version of Python is 2.6.6, Numpy version 1.4.1 on a Debian distro.
import numpy as np
import random
n_samples = 5
rows = 5
out_file = file('dens.bin','wb')
x_bins = np.arange(-2.005,2.005,0.01)
y_bins = np.arange(-0.5,n_samples+0.5)
listy = [random.gauss(0,1) for r in range(n_samples*rows)]
dens = np.histogram2d( listy, \
                       range(n_samples)*rows, \
                       [y_bins, x_bins])
print 'Write data'
np.savez(out_file, dens)
out_file.close()
Full output:
$ python error.py
Write data
Traceback (most recent call last):
  File "error.py", line 19, in <module>
    np.savez(out_file, dens)
  File "/usr/lib/pymodules/python2.6/numpy/lib/io.py", line 439, in savez
    format.write_array(fid, np.asanyarray(val))
  File "/usr/lib/pymodules/python2.6/numpy/core/numeric.py", line 312, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
Note that np.histogram2d actually returns a tuple of three arrays: (hist, xedges, yedges), which with the bins passed here is (hist, y_bins, x_bins). If you want to save all three of these, you have to unpack them as @Francesco said.
dens = np.histogram2d(listy, range(n_samples)*rows, [y_bins, x_bins])
np.savez('dens.npz', *dens)
Alternatively, if you only need the histogram itself, you could save just that.
np.savez('dens.npz', dens[0])
If you want to keep track of which of these is which, use the **kwds instead of the *args
denskw = dict(zip(['hist','y_bins','x_bins'], dens))
np.savez('dens.npz', **denskw)
Then, you can load it like
dens = np.load('dens.npz')
hist = dens['hist']  # etc
np.savez appends .npz to a file name that does not already end with it, so the full name is used here to keep the save and load calls consistent.
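For readers on Python 3, the same idea can be sketched as below. This is an illustrative variant, not the original poster's code: the bin edges, the names `x_edges` and `y_edges`, and the in-memory buffer used in place of a file on disk are all assumptions made to keep the example self-contained.

```python
import io
import random
import numpy as np

n_samples, rows = 5, 5
random.seed(0)
x = [random.gauss(0, 1) for _ in range(n_samples * rows)]
y = list(range(n_samples)) * rows  # Python 3: turn the range into a list first

x_edges = np.linspace(-2.0, 2.0, 41)        # 40 bins
y_edges = np.arange(-0.5, n_samples + 0.5)  # 5 bins centred on 0..4

# histogram2d returns a tuple: (counts, edges along x, edges along y)
dens = np.histogram2d(x, y, bins=[x_edges, y_edges])

buf = io.BytesIO()  # in-memory stand-in for a file such as 'dens.npz'
np.savez(buf, **dict(zip(["hist", "x_edges", "y_edges"], dens)))

buf.seek(0)
loaded = np.load(buf)
print(loaded["hist"].shape)
```

Saving under keyword names means each array can be retrieved by name after loading, instead of relying on the default arr_0, arr_1, arr_2 keys.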