Can't get correct input for DBSCAN clustering - python
I have a node2vec embedding stored as a .csv file; the values form a square symmetric matrix. I have two versions of this, one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't seem to figure out how to get the input right. I tried this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
input_file = "node2vec-labels-on-columns.emb"
# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")
# put the original column names in a python list
original_headers = list(df.columns.values)
emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
This leads to an error:
dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
emb = df.as_matrix()
Traceback (most recent call last):
File "dbscan.py", line 15, in <module>
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
X = self._validate_data(X, accept_sparse='csr')
File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
X = check_array(X, **check_params)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
return f(**kwargs)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported from sklearn, so they are of no help in figuring out how to read from a file. Can anyone point me in the right direction?
The error is not caused by reading the dataset from a file; it is caused by the content of the dataset.
DBSCAN is meant to be used on numerical data. As the error states, it does not accept NaNs.
If you want to cluster strings or labels, you should look for a different model.
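As a rough sketch of what that could look like in practice: the snippet below assumes the node names sit in the first column of a tab-delimited file, keeps them out of the numeric matrix with index_col=0, and counts the cells that fail to parse as numbers before calling DBSCAN. The in-memory data, the column names and the parameter values are made up for illustration; they are not taken from the asker's file.

```python
import io

import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the real .emb file: names in the first column,
# one bad cell ("missing") that cannot be parsed as a number.
tsv_text = (
    "node\tdim0\tdim1\n"
    "A\t0.10\t0.20\n"
    "B\t0.12\t0.21\n"
    "C\t0.11\t0.19\n"
    "D\t5.00\t5.10\n"
    "E\tmissing\t0.30\n"
)

# index_col=0 turns the node names into the index instead of a data column.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0, index_col=0)

# Force every column to numeric; anything non-numeric becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
nan_counts = numeric.isna().sum()
print("NaNs per column:")
print(nan_counts)

# Dropping incomplete rows is one option (an assumption here, not the only fix).
clean = numeric.dropna()
emb = clean.to_numpy()

db = DBSCAN(eps=0.3, min_samples=2).fit(emb)
labels_by_node = dict(zip(clean.index, db.labels_))
print(labels_by_node)
```

If the NaN count is non-zero, looking at which rows or columns hold the NaNs usually shows whether a header or a name column was read as data.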
Related
K-Means clustering with 6d vectors
I have a dataset of R-D curves such as the following.
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422)
(3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
Each is a 6D vector whose columns correspond to PSNR and bitrate. I am trying to cluster these vectors using K-Means clustering, but my question is how to use them as input to K-Means. Do I need to enter a 2D input for each column, such as (33.3987,3929.76)? Or do I have to put them beside each other?
(33.3987 34.7318 35.9673 36.8494 37.6992 38.422 3929.76 4946.93 6069.78 7243.61 8185.01 9387.84)
I am confused because I am not sure what form of vector K-Means expects as input. I used this to combine two arrays as input to K-Means:
psnr_bitrate=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/bitrate_1080.npy')
bitrate=np.load(r'F:/RD_data_from_twitch_system/RD_data_from_twitch_system/psnr_1080.npy')#***
kmeans_input=np.array([psnr_bitrate],[bitrate])
and it produces this error:
Traceback (most recent call last):
  File "<ipython-input-33-28c2bfac9deb>", line 2, in <module>
    scaled_features = pd.DataFrame((kmeans_input))
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 497, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 190, in init_ndarray
    values = _prep_ndarray(values, copy=copy)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 324, in _prep_ndarray
    raise ValueError(f"Must pass 2-d input. shape={values.shape}")
ValueError: Must pass 2-d input. shape=(2, 71, 6)
You should create a list of the vectors, i.e. a numpy array of shape=(n_vectors, 6).
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[33.3987, 34.7318, 35.9673, 36.8494, 37.6992, 38.422],
              [3929.76, 4946.93, 6069.78, 7243.61, 8185.01, 9387.84]])
# n_clusters cannot exceed the number of rows in X (2 in this toy example)
kmeans = KMeans(n_clusters=2).fit(X)
Obviously you will need to change n_clusters to get good results. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for more info.
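For the (2, 71, 6) case from the question, here is one possible sketch. It assumes each R-D curve should be a single sample, so the six PSNR values and six bitrate values are placed side by side to give a (71, 12) matrix. Whether to concatenate them this way is a modeling choice, not something the answer above prescribes. The random arrays stand in for the .npy files, and the StandardScaler step is an added assumption so that the much larger bitrate values do not dominate the distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_curves = 71

# Hypothetical stand-ins for psnr_1080.npy and bitrate_1080.npy, each (71, 6).
psnr = rng.uniform(30.0, 40.0, size=(n_curves, 6))
bitrate = rng.uniform(3000.0, 10000.0, size=(n_curves, 6))

# Stacking the two arrays along a new axis gives a 3-D array, which is
# what triggers "Must pass 2-d input".
stacked = np.array([psnr, bitrate])
print(stacked.shape)

# Placing them side by side gives one 12-dimensional row per curve.
X = np.hstack([psnr, bitrate])
print(X.shape)

# Put PSNR and bitrate on comparable scales before clustering.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])
```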
ValueError while fitting a model even after imputation
I am using the Melbourne Housing Dataset from Kaggle to fit a regression model on it, with Price being the target value. You can find the dataset here
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
from sklearn.preprocessing import Imputer
cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
data = pd.read_csv('data/melb_house_pricing.csv')
# drop rows where target is NaN
data = data.loc[~(data['Price'].isna())]
y = data.Price
X = data[cols_to_use]
my_imputer = Imputer()
imputed_X = my_imputer.fit_transform(X)
print(f"Contains NaNs in training data: {np.isnan(imputed_X).sum()}")
print(f"Contains NaNs in target data: {np.isnan(y).sum()}")
print(f"Contains Infinity: {np.isinf(imputed_X).sum()}")
print(f"Contains Infinity: {np.isinf(y).sum()}")
my_model = GradientBoostingRegressor()
my_model.fit(imputed_X, y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2], # column numbers of plots we want to show
                                   X=X, # raw predictors data.
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis
Even after using the Imputer from sklearn, I get the following error:
Contains NaNs in training data: 0
Contains NaNs in target data: 0
Contains Infinity: 0
Contains Infinity: 0
/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py:85: DeprecationWarning: Function plot_partial_dependence is deprecated; The function ensemble.plot_partial_dependence has been deprecated in favour of sklearn.inspection.plot_partial_dependence in 0.21 and will be removed in 0.23.
  warnings.warn(msg, category=DeprecationWarning)
Traceback (most recent call last):
  File "partial_dependency_plots.py", line 29, in <module>
    grid_resolution=10) # number of values to plot on x axis
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py", line 86, in wrapped
    return fun(*args, **kwargs)
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/ensemble/partial_dependence.py", line 286, in plot_partial_dependence
    X = check_array(X, dtype=DTYPE, order='C')
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
As you can see, when I print the number of NaNs in imputed_X, I get 0. So why do I still get a ValueError? Any help?
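Pass imputed_X instead of X to plot_partial_dependence. The imputer returns a new array; the original X still contains the NaNs, and that is the array the plotting call validates:
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2], # column numbers of plots we want to show
                                   X=imputed_X, # imputed predictors data.
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'], # labels on graphs
                                   grid_resolution=10) # number of values to plot on x axis
It will work.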
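A small sketch of why the two arrays behave differently: fit_transform returns a new array and leaves its input untouched. The numbers below are invented. The Imputer class used in the question was removed from later scikit-learn releases, so this uses SimpleImputer from sklearn.impute instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up predictors with one missing value per column.
X_raw = np.array([
    [2.5, 202.0, np.nan],
    [np.nan, 156.0, 79.0],
    [3.0, np.nan, 150.0],
])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_raw)  # new array; X_raw is not modified

raw_has_nan = bool(np.isnan(X_raw).any())
imputed_has_nan = bool(np.isnan(X_imputed).any())
print("raw contains NaN:", raw_has_nan)
print("imputed contains NaN:", imputed_has_nan)
```

Any later step that validates its input (plotting, prediction, scoring) needs the imputed array, not the raw one.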
Could not convert string to float while data preprocessing
I need help with this. I'm a beginner and I am really confused by this. This is my code for the beginning of my preprocessing.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:6].values
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
With this dataset (not full, I only put 10 of them as there are actually 10000)
Date, Open, High, Low, Close, Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
1/6/2012,328.34,328.77,323.68,648.24,"5,405,900"
1/9/2012,322.04,322.29,309.46,620.76,"11,688,800"
1/10/2012,313.7,315.72,307.3,621.43,"8,824,000"
1/11/2012,310.59,313.52,309.4,624.25,"4,817,800"
1/12/2012,314.43,315.26,312.08,627.92,"3,764,400"
1/13/2012,311.96,312.3,309.37,623.28,"4,631,800"
I get this error
Traceback (most recent call last):
  File "<ipython-input-10-94c47491afd8>", line 3, in <module>
    training_set_scaled = sc.fit_transform(training_set)
  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
    return self.partial_fit(X, y)
  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)
  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1,770,000'
Sample code to help fix this would be appreciated.
You need to get rid of the commas in your numbers: float("7,380,500") fails. I don't know how/if you can change the data, but if you can, str.replace(',', '') deletes all the commas from your number-strings. As your file is a csv, you need to make sure it only applies to the number-columns, not to all commas in your file.
You can use the thousands parameter of read_csv. It strips the commas from the numbers in the Volume column and parses the column as int (the default), which can then easily be converted to float.
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')
dataset_train['Volume'].dtype
# Output: int64
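A self-contained sketch of the thousands approach, using three of the rows from the question held in an in-memory string rather than the real file. The header here is written without the spaces after the commas, which is an assumption about the actual file.

```python
import io

import pandas as pd

# A few rows from the question, in memory instead of on disk.
csv_text = '''Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
'''

# thousands="," makes the parser treat the commas inside Volume as separators.
df = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(df.dtypes)

# Columns 1..5 (Open through Volume) as a float matrix, ready for a scaler.
training_set = df.iloc[:, 1:6].to_numpy(dtype=float)
print(training_set.shape)
```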
How can I manipulate my data to allow a random forest to run on it?
I want to train a random forest on a bunch of matrices (first link below for an example). I want to classify them as either "g" or "b" (good or bad, a or b, 1 or 0, it doesn't matter). I've called the script randfore.py. I am currently using 10 examples, but I will be using a much bigger data set once I actually get this up and running. Here is the code:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
working_dir = os.getcwd() # Grabs the working directory
directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located
sources = list() # Just sets up a list here which is going to become the input for the random forest
for i in range(10):
    cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from
    sources.append(cutoutfile) # add it to our sources list
targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad)
sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?
# Training sets
X_train = sources[:8] # Inputs
y_train = targets[:8] # Targets
# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)
Here is the current error output:
Traceback (most recent call last):
  File "randfore.py", line 31, in <module>
    rf_fit = rf.fit(X_train, y_train)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
I tried making the dtype = object, but it hasn't helped. I'm just not sure what sort of manipulation I need to perform to make this work. I think the problem is that the files I am appending to sources aren't just numbers but a mix of numbers, commas, and various square brackets (it's basically a big matrix). Is there a natural way to import this? The square brackets in particular are probably an issue.
Before I converted sources to a DataFrame I was getting the following error:
ValueError: cannot copy sequence with size 99 to array axis with dimension 1
This is due to the dimensions of my input (100 lines long) and my target, which has 10 rows and 1 column.
Here are the contents of the first file that's read into cutouts (they're all the exact same style) to be used as the input: https://pastebin.com/tkysqmVu
And here are the contents of faketargets.dat, the targets: https://pastebin.com/632RBqWc
Any ideas? Help greatly appreciated. I am sure there is a lot of fundamental confusion going on here.
Try writing:
X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets
I hope this will solve your problem!
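If taking .values alone is not enough, the cause may be that each sample is itself a matrix, while scikit-learn estimators expect one flat numeric row per sample. The sketch below shows a different technique from the answer above: flattening each matrix with ravel and stacking the rows. The random 10 x 10 arrays and the alternating "g"/"b" labels are hypothetical placeholders for the parsed cutout files and faketargets.dat.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for ten parsed cutout files, each a 10 x 10 numeric matrix.
cutouts = [rng.normal(size=(10, 10)) for _ in range(10)]
# Hypothetical targets, one label per cutout.
targets = np.array(["g", "b"] * 5)

# Flatten every matrix into one row: shape (n_samples, n_features) = (10, 100).
X = np.stack([cutout.ravel() for cutout in cutouts])
print(X.shape)

X_train, y_train = X[:8], targets[:8]
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Predict the two held-out cutouts.
predictions = rf.predict(X[8:])
print(predictions)
```

The real files would first have to be parsed into numeric arrays (for example by stripping the square brackets), which this sketch does not attempt.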
KMeans in Python: ValueError: setting an array element with a sequence
I am trying to perform kmeans clustering in Python using numpy and sklearn. I have a txt file with 45 columns and 645 rows. The first row is Y and remaining 644 rows are X. My Python code is:
import numpy as np
import matplotlib.pyplot as plt
import csv
from sklearn.cluster import KMeans
#The following code reads the first row and terminates the loop
with open('trainDataXY.txt','r') as f:
    read = csv.reader(f)
    for first_row in read:
        y = list(first_row)
        break
#The following code skips the first row and reads rest of the rows
firstLine = True
with open('trainDataXY.txt','r') as f1:
    readY = csv.reader(f1)
    for rows in readY:
        if firstLine:
            firstLine=False
            continue
        x = list(readY)
X = np.array((x,y), dtype=object)
kmean = KMeans(n_clusters=2)
kmean.fit(X)
I get an error at this line: kmean.fit(X)
The error I get is:
Traceback (most recent call last):
  File "D:\file_path\kmeans.py", line 25, in <module>
    kmean.fit(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 812, in fit
    X = self._check_fit_data(X)
  File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py", line 786, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=np.float64)
  File "C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
trainDataXY.txt
1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5
47,64,50,39,66,51,46,37,43,37,37,35,36,34,37,38,37,39,104,102,103,103,102,108,109,107,106,115,116,116,120,122,121,121,116,116,131,131,130,132,126,127,131,128,127
47,65,58,30,39,48,47,35,42,37,38,37,37,36,38,38,38,40,104,103,103,103,101,108,110,108,106,116,115,116,121,121,119,121,116,116,133,131,129,132,127,128,132,126,127
49,69,55,28,56,64,50,30,41,37,39,37,38,36,39,39,39,40,105,103,104,104,103,110,110,108,107,116,115,117,120,120,117,121,115,116,134,131,129,134,128,125,134,126,127
51,78,52,46,56,74,50,28,38,38,39,38,38,37,40,39,39,41,96,101,99,104,97,101,111,101,104,115,116,116,119,110,112,119,116,116,135,130,129,135,120,108,133,120,125
55,79,53,65,52,102,55,28,36,39,40,38,39,37,40,39,40,42,79,86,84,105,84,57,110,85,76,117,118,115,110,66,86,117,117,118,123,130,130,129,106,93,130,113,114
48,80,59,81,50,120,63,26,31,39,40,39,40,38,42,37,41,42,53,73,77,90,47,34,76,52,63,106,102,97,80,33,68,105,105,113,115,130,124,111,83,91,128,105,110
45,95,56,86,38,137,60,27,27,39,40,38,40,37,41,52,38,41,24,44,44,79,40,32,48,26,28,63,52,59,42,30,62,79,67,77,116,121,122,114,96,90,126,93,103
45,93,47,86,35,144,60,26,27,39,40,45,39,38,43,87,46,58,33,21,26,62,42,49,49,37,24,33,41,56,29,28,68,79,58,74,115,111,115,119,117,104,132,92,97
48,85,50,83,37,142,62,25,29,57,47,77,43,64,61,115,70,101,41,28,28,48,39,46,42,38,37,47,43,74,32,28,64,86,80,81,127,113,99,130,140,112,139,92,97
48,94,78,77,30,138,57,28,29,91,66,94,61,94,103,129,89,140,38,34,32,38,33,43,38,36,39,50,39,75,31,33,65,89,82,84,127,112,100,133,141,107,136,95,97
45,108,158,77,30,140,67,29,26,104,97,113,92,106,141,137,116,151,33,32,32,43,44,40,37,34,37,54,86,77,55,48,77,112,83,109,120,111,105,124,133,98,129,89,99
48,139,173,64,40,159,61,55,27,115,117,128,106,124,150,139,125,160,27,26,29,54,51,47,36,36,32,80,125,105,97,96,86,130,102,118,117,104,105,118,117,92,130,94,97
131,157,143,66,87,130,57,118,26,124,137,129,133,138,156,133,132,173,29,25,28,81,48,38,48,32,24,134,165,144,149,142,110,145,147,161,114,112,103,118,115,94,126,87,102
160,162,146,78,116,127,52,133,71,116,141,125,125,141,169,115,110,161,69,53,46,97,79,47,76,59,32,148,147,134,165,152,111,155,139,145,116,113,101,118,105,86,123,92,99
Your data matrix should not be of type object. It should be a numeric matrix of shape n_samples x n_features. This error usually crops up when people try to convert a list of samples into a data matrix, each sample is an array or a list, and at least one of the samples does not have the same length as the others. You can check for that by evaluating np.unique(list(map(len, X))). In your case the cause is different: np.array((x,y), dtype=object) pairs the list of rows x with the single row y instead of building a numeric matrix. Make sure you obtain a data matrix. The first thing to try is to replace the line X = np.array((x,y), dtype=object) with something that creates one. You should also consider reading your data with numpy.recfromcsv, or with numpy.genfromtxt and delimiter=','. It will make everything easier to read.
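One way to build such a data matrix, sketched with numpy.loadtxt rather than the csv module. It assumes that the first row holds one label per column and that each column is therefore one sample, which is why the remaining rows are transposed. That reading of the file layout is an assumption. The small in-memory text is invented data in the same style as trainDataXY.txt.

```python
import io

import numpy as np
from sklearn.cluster import KMeans

# Invented miniature of the file: first row = labels, remaining rows = features.
text = """1,1,1,2,2,2
47,64,50,104,102,103
47,65,58,104,103,103
49,69,55,105,103,104
"""

# loadtxt parses everything as float64, so no object arrays are involved.
data = np.loadtxt(io.StringIO(text), delimiter=",")

y = data[0].astype(int)   # one label per column
X = data[1:].T            # transpose so that each column becomes one sample row
print(X.shape, X.dtype)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_
print(cluster_labels)
```

KMeans never sees y; it is kept only so the clusters can be compared with the known labels afterwards.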