ValueError while fitting a model even after imputation - python

I am using the Melbourne Housing Dataset from Kaggle to fit a regression model, with Price as the target value. You can find the dataset here.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
from sklearn.preprocessing import Imputer
cols_to_use = ['Distance', 'Landsize', 'BuildingArea']
data = pd.read_csv('data/melb_house_pricing.csv')
# drop rows where target is NaN
data = data.loc[~(data['Price'].isna())]
y = data.Price
X = data[cols_to_use]
my_imputer = Imputer()
imputed_X = my_imputer.fit_transform(X)
print(f"Contains NaNs in training data: {np.isnan(imputed_X).sum()}")
print(f"Contains NaNs in target data: {np.isnan(y).sum()}")
print(f"Contains Infinity: {np.isinf(imputed_X).sum()}")
print(f"Contains Infinity: {np.isinf(y).sum()}")
my_model = GradientBoostingRegressor()
my_model.fit(imputed_X, y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2],  # column numbers of plots we want to show
                                   X=X,  # raw predictors data
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'],  # labels on graphs
                                   grid_resolution=10)  # number of values to plot on x axis
Even after using the Imputer from sklearn, I get the following error -
Contains NaNs in training data: 0
Contains NaNs in target data: 0
Contains Infinity: 0
Contains Infinity: 0
/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py:85: DeprecationWarning: Function plot_partial_dependence is deprecated; The function ensemble.plot_partial_dependence has been deprecated in favour of sklearn.inspection.plot_partial_dependence in 0.21 and will be removed in 0.23.
warnings.warn(msg, category=DeprecationWarning)
Traceback (most recent call last):
File "partial_dependency_plots.py", line 29, in <module>
grid_resolution=10) # number of values to plot on x axis
File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/deprecation.py", line 86, in wrapped
return fun(*args, **kwargs)
File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/ensemble/partial_dependence.py", line 286, in plot_partial_dependence
X = check_array(X, dtype=DTYPE, order='C')
File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/Users/adimyth/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
As you can see, when I print the number of NaNs in imputed_X, I get 0. So why do I still get a ValueError? Any help?

Just change the X argument in your plot_partial_dependence call. You fitted the model on imputed_X, but you are passing the raw X, which still contains NaNs, to the plotting function:
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2],  # column numbers of plots we want to show
                                   X=imputed_X,  # the imputed predictors the model was fitted on
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'],  # labels on graphs
                                   grid_resolution=10)  # number of values to plot on x axis
It will work.
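Since the deprecation warning points at the newer API, here is a rough equivalent for scikit-learn 0.21+ (a sketch, assuming sklearn.impute.SimpleImputer and sklearn.inspection.plot_partial_dependence are available; X and y as defined in the question):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import plot_partial_dependence

my_imputer = SimpleImputer()  # mean imputation, matching the old Imputer default
imputed_X = my_imputer.fit_transform(X)

my_model = GradientBoostingRegressor()
my_model.fit(imputed_X, y)

# note the newer call signature: estimator first, then the (already imputed) data
my_plots = plot_partial_dependence(my_model, imputed_X, features=[0, 2],
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'],
                                   grid_resolution=10)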

Related

Can't get correct input for DBSCAN clustering

I have a node2vec embedding stored as a .csv file; the values form a square symmetric matrix. I have two versions of this, one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't seem to figure out how to get the input right. I tried this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
input_file = "node2vec-labels-on-columns.emb"
# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")
# put the original column names in a python list
original_headers = list(df.columns.values)
emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
This leads to an error:
dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
emb = df.as_matrix()
Traceback (most recent call last):
File "dbscan.py", line 15, in <module>
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
X = self._validate_data(X, accept_sparse='csr')
File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
X = check_array(X, **check_params)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
return f(**kwargs)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported from sklearn, so those are of no help in figuring out how to read from a file. Can anyone point me in the right direction?
The error does not come from the fact that you are reading the dataset from a file, but from the content of the dataset.
DBSCAN is meant to be used on numerical data. As stated in the error, it does not support NaNs.
If you are trying to cluster strings or labels, you should look for a different kind of model.
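Since one version of your file carries node names, a likely fix is to keep only the numeric embedding values before fitting. A sketch, assuming the remaining columns are numeric:
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv("node2vec-labels-on-columns.emb", header=0, delimiter="\t")
emb = df.select_dtypes(include="number")  # keep numeric columns only, dropping any node-name column
emb = emb.dropna(axis=1, how="all").dropna(axis=0)  # drop empty columns (e.g. from trailing tabs) and rows with NaNs
db = DBSCAN(eps=0.3, min_samples=10).fit(emb.values)
print("Noise points:", list(db.labels_).count(-1))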

Using knn to predict values from another DataFrame (Python 3.6) [closed]

I created a DataFrame with geological data from a well log, then created a new column labelling each row with a rock name according to its different properties. That means each row now has a rock name.
My question: I already trained on my first DataFrame with all the data that I have, and now I want to predict the labels (rock names) of a new DataFrame that has the same columns (properties) as the first one. But I do not know how to do it. Here is my code so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
data = pd.read_excel('wellA.xlsx') #size (20956,26)
well1 = pd.concat([data['GR'], data['NPHI'], data['RHOB'], data['SW'],
                   data['VSH'], data['rock_name']], axis=1,
                  keys=['GR', 'NPHI', 'RHOB', 'SW', 'VSH', 'rock_name'])
well1 = well1.drop(well1.index[0:15167])
well1.dropna(axis=0, inplace=True)
knn = KNeighborsClassifier(n_neighbors = 9)
d = {'Claystone': 1, 'Calcareous Claystone': 2, 'Sandy Claystone': 3,
     'Limestone': 4, 'Muddy Limestone': 5, 'Muddy Sandstone': 6, 'Sandstone': 7}
well1['Label'] = well1['rock_name'].map(d) #size (5412,7)
X = well1[well1.columns[:5]] #size (5412, 5)
y = well1.rock_name #size (5412,)
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 0)
#sizes: X_train(4059,5), X_test(1353,5) , y_train(4059,), y_test(1353,)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
data2 = pd.read_excel('wellB.xlsx') #size (29070, 12)
well2 = pd.concat([data2['GR'], data2['NPHI'], data2['RHOB'], data2['SW'],
                   data2['VSH']], axis=1, keys=['GR', 'NPHI', 'RHOB', 'SW', 'VSH'])
well2.dropna(axis=0, inplace=True) #size (2124, 5)
# values of the properties
gammaray = well2['GR'].values
neutron = well2['NPHI'].values
density = well2['RHOB'].values
swat = well2['SW'].values
vshale = well2['VSH'].values
rock_name_pred = knn.predict([[gammaray, neutron, density, swat, vshale]])
and then I have the following error:
Traceback (most recent call last):
  File "C:\Users\laguiar\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Users\laguiar\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/laguiar/Desktop/Projeto Norne/exemploKNN.py", line 41, in <module>
    rock_name_pred = knn.predict([[gammaray, neutron, density, swat, vshale]])
  File "C:\Users\laguiar\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\neighbors\classification.py", line 143, in predict
    X = check_array(X, accept_sparse='csr')
  File "C:\Users\laguiar\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 451, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
The error says that KNN expects arrays with a dimension lower than or equal to 2. In your script, the properties like gammaray are already 1-D numpy arrays, one value per sample.
When you write [[gammaray, neutron, density, swat, vshale]] in your knn.predict call, the double brackets add two extra dimensions, so you end up with a 3-D array.
Stack the property arrays as columns instead, so each row is one sample with five features:
rock_name_pred = knn.predict(np.column_stack([gammaray, neutron, density, swat, vshale]))
Or you could call the predict method directly on your dataframe, just like the fit method, since its columns are already in the same order as the training features:
rock_name_pred = knn.predict(well2)
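For intuition, here is a tiny shape check with toy values (the arrays below are stand-ins, not data from the question):
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # pretend: one property, three samples
b = np.array([4.0, 5.0, 6.0])

print(np.array([[a, b]]).shape)       # (1, 2, 3) -- 3-D, rejected by check_array
print(np.column_stack([a, b]).shape)  # (3, 2)    -- n_samples x n_features, accepted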

Could not convert string to float while data preprocessing

I need help with this. I'm a beginner and really confused by it. This is my code for the beginning of my preprocessing.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:6].values
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
With this dataset (not the full file; I've only included 10 rows of the actual 10,000):
Date, Open, High, Low, Close, Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
1/6/2012,328.34,328.77,323.68,648.24,"5,405,900"
1/9/2012,322.04,322.29,309.46,620.76,"11,688,800"
1/10/2012,313.7,315.72,307.3,621.43,"8,824,000"
1/11/2012,310.59,313.52,309.4,624.25,"4,817,800"
1/12/2012,314.43,315.26,312.08,627.92,"3,764,400"
1/13/2012,311.96,312.3,309.37,623.28,"4,631,800"
I get this error
Traceback (most recent call last):
File "<ipython-input-10-94c47491afd8>", line 3, in <module>
training_set_scaled = sc.fit_transform(training_set)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
return self.partial_fit(X, y)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1,770,000'
Sample code to help fix this would be helpful.
You need to get rid of the commas in your numbers: float("7,380,500") fails.
I don't know how or if you can change the data, but if you can, str.replace(',', '') deletes all the commas from your number strings. Since your file is a CSV, you need to make sure this applies only to the number columns, not to every comma in your file.
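For example, applied only to the Volume column after reading (a sketch, assuming Volume is the only column with thousands separators):
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
dataset_train['Volume'] = dataset_train['Volume'].str.replace(',', '').astype(float)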
You can use the thousands parameter of read_csv. This strips the commas from the numbers in the Volume column while parsing and converts the column to int (the default), which can then easily be converted to float:
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')
dataset_train['Volume'].dtype
# Output: int64
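With that one change, the preprocessing from the question should then run end to end; a quick sketch of the full flow:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dataset_train = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')
training_set = dataset_train.iloc[:, 1:6].values  # Open..Volume, all numeric now
sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)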

KMeans in Python: ValueError: setting an array element with a sequence

I am trying to perform kmeans clustering in Python using numpy and sklearn.
I have a txt file with 45 columns and 645 rows. The first row is Y and the remaining 644 rows are X.
My Python code is:
import numpy as np
import matplotlib.pyplot as plt
import csv
from sklearn.cluster import KMeans
# The following code reads the first row and terminates the loop
with open('trainDataXY.txt', 'r') as f:
    read = csv.reader(f)
    for first_row in read:
        y = list(first_row)
        break

# The following code skips the first row and reads the rest of the rows
firstLine = True
with open('trainDataXY.txt', 'r') as f1:
    readY = csv.reader(f1)
    for rows in readY:
        if firstLine:
            firstLine = False
            continue
        x = list(readY)
X = np.array((x,y), dtype=object)
kmean = KMeans(n_clusters=2)
kmean.fit(X)
I get an error at this line: kmean.fit(X)
The error I get is:
Traceback (most recent call last):
File "D:\file_path\kmeans.py", line 25, in <module> kmean.fit(X)
File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py",
line 812, in fit X = self._check_fit_data(X)
File "C:\Anaconda2\lib\site-packages\sklearn\cluster\k_means_.py",
line 786, in _check_fit_data X = check_array(X, accept_sparse='csr',
dtype=np.float64)
File "C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py",
line 373, in check_array array = np.array(array, dtype=dtype,
order=order, copy=copy) ValueError: setting an array element with a
sequence.`
trainDataXY.txt
1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5
47,64,50,39,66,51,46,37,43,37,37,35,36,34,37,38,37,39,104,102,103,103,102,108,109,107,106,115,116,116,120,122,121,121,116,116,131,131,130,132,126,127,131,128,127
47,65,58,30,39,48,47,35,42,37,38,37,37,36,38,38,38,40,104,103,103,103,101,108,110,108,106,116,115,116,121,121,119,121,116,116,133,131,129,132,127,128,132,126,127
49,69,55,28,56,64,50,30,41,37,39,37,38,36,39,39,39,40,105,103,104,104,103,110,110,108,107,116,115,117,120,120,117,121,115,116,134,131,129,134,128,125,134,126,127
51,78,52,46,56,74,50,28,38,38,39,38,38,37,40,39,39,41,96,101,99,104,97,101,111,101,104,115,116,116,119,110,112,119,116,116,135,130,129,135,120,108,133,120,125
55,79,53,65,52,102,55,28,36,39,40,38,39,37,40,39,40,42,79,86,84,105,84,57,110,85,76,117,118,115,110,66,86,117,117,118,123,130,130,129,106,93,130,113,114
48,80,59,81,50,120,63,26,31,39,40,39,40,38,42,37,41,42,53,73,77,90,47,34,76,52,63,106,102,97,80,33,68,105,105,113,115,130,124,111,83,91,128,105,110
45,95,56,86,38,137,60,27,27,39,40,38,40,37,41,52,38,41,24,44,44,79,40,32,48,26,28,63,52,59,42,30,62,79,67,77,116,121,122,114,96,90,126,93,103
45,93,47,86,35,144,60,26,27,39,40,45,39,38,43,87,46,58,33,21,26,62,42,49,49,37,24,33,41,56,29,28,68,79,58,74,115,111,115,119,117,104,132,92,97
48,85,50,83,37,142,62,25,29,57,47,77,43,64,61,115,70,101,41,28,28,48,39,46,42,38,37,47,43,74,32,28,64,86,80,81,127,113,99,130,140,112,139,92,97
48,94,78,77,30,138,57,28,29,91,66,94,61,94,103,129,89,140,38,34,32,38,33,43,38,36,39,50,39,75,31,33,65,89,82,84,127,112,100,133,141,107,136,95,97
45,108,158,77,30,140,67,29,26,104,97,113,92,106,141,137,116,151,33,32,32,43,44,40,37,34,37,54,86,77,55,48,77,112,83,109,120,111,105,124,133,98,129,89,99
48,139,173,64,40,159,61,55,27,115,117,128,106,124,150,139,125,160,27,26,29,54,51,47,36,36,32,80,125,105,97,96,86,130,102,118,117,104,105,118,117,92,130,94,97
131,157,143,66,87,130,57,118,26,124,137,129,133,138,156,133,132,173,29,25,28,81,48,38,48,32,24,134,165,144,149,142,110,145,147,161,114,112,103,118,115,94,126,87,102
160,162,146,78,116,127,52,133,71,116,141,125,125,141,169,115,110,161,69,53,46,97,79,47,76,59,32,148,147,134,165,152,111,155,139,145,116,113,101,118,105,86,123,92,99
Your data matrix should not be of type object. It should be a matrix of numbers of shape n_samples x n_features.
This error usually crops up when people try to convert a list of samples into a data matrix, and each sample is an array or a list, and at least one of the samples does not have the same length as the others. This can be figured out by evaluating np.unique(list(map(len, X))).
In your case the problem is different: np.array((x, y), dtype=object) pairs a list of rows with a list of labels, producing a 2-element object array rather than a numeric data matrix. Replace that line with something that builds a proper samples-by-features matrix.
You could also use numpy.recfromcsv to read your data; it will make everything easier to read.
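For instance, since every value in trainDataXY.txt is numeric, you could load the whole file in one go and split it by rows. A minimal sketch, assuming the first row really is y and the remaining rows are the samples:
import numpy as np
from sklearn.cluster import KMeans

data = np.loadtxt('trainDataXY.txt', delimiter=',')  # shape (645, 45), all floats
y = data[0]    # first row: labels
X = data[1:]   # remaining 644 rows: n_samples x n_features
kmean = KMeans(n_clusters=2).fit(X)
print(kmean.labels_[:10])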

Having problems with dimensions in machine learning ( Python Scikit )

I am a bit new to applying machine learning, so I was trying to teach myself how to do linear regression with any kind of data from mldata.org using the Python scikit package. I tested the linear regression example code (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) and it worked well with the diabetes dataset. However, when I tried to use the code with other datasets, such as one about earthquakes on mldata (http://mldata.org/repository/data/viewslug/global-earthquakes/), I ran into dimension problems:
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 55
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 65
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
Traceback (most recent call last):
File "/home/anthony/Documents/Programming/Python/Machine Learning/Scikit/earthquake_linear_regression.py", line 38, in <module>
regr.fit(earthquake_X_train, earthquake_y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 371, in fit
linalg.lstsq(X, y)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 518, in lstsq
raise ValueError('incompatible dimensions')
ValueError: incompatible dimensions
How do I set up the dimensions of the data?
Size of the data:
earthquake_X.shape
(59209, 1, 4)
earthquake_X_train.shape
(59189, 1)
earthquake_y_test.shape
(3, 59209)
earthquake.target.shape
(3, 59209)
The code:
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
#Experimenting with earthquake data
from sklearn.datasets.mldata import fetch_mldata
import tempfile
test_data_home = tempfile.mkdtemp()
# Load the diabetes dataset
earthquake = fetch_mldata('Global Earthquakes', data_home = test_data_home)
# Use only one feature
earthquake_X = earthquake.data[:, np.newaxis]
earthquake_X_temp = earthquake_X[:, :, 2]
# Split the data into training/testing sets
earthquake_X_train = earthquake_X_temp[:-20]
earthquake_X_test = earthquake_X_temp[-20:]
# Split the targets into training/testing sets
earthquake_y_train = earthquake.target[:-20]
earthquake_y_test = earthquake.target[-20:]
print "Splitting of data for preformance check completed"
# Create linear regression object
regr = linear_model.LinearRegression()
print "Created linear regression object"
# Train the model using the training sets
regr.fit(earthquake_X_train, earthquake_y_train)
print "Dataset trained"
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(earthquake_X_test) - earthquake_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(earthquake_X_test, earthquake_y_test))
# Plot outputs
plt.scatter(earthquake_X_test, earthquake_y_test, color='black')
plt.plot(earthquake_X_test, regr.predict(earthquake_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Your array of targets (earthquake_y_train) is the wrong shape. Moreover, it's actually empty.
When you do
earthquake_y_train = earthquake.target[:-20]
you select all rows but the last 20 along the first axis. According to the data you posted, earthquake.target has shape (3, 59209), so its first axis has only 3 rows and [:-20] selects none of them!
But even if it selected something, it would still be an error, because the first dimensions of X and y must match. According to sklearn's documentation, LinearRegression's fit expects X of shape [n_samples, n_features] and y of shape [n_samples, n_targets].
In order to fix it change definitions of ys to the following:
earthquake_y_train = earthquake.target[:, :-20].T
earthquake_y_test = earthquake.target[:, -20:].T
P.S. Even if you fix all these problems, there's still an issue in your script: plt.scatter can't work with "multidimensional" ys.
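One way around that last point is to plot a single target column at a time. A sketch, assuming the corrected y definitions above (which of the three target rows is physically meaningful depends on the dataset):
# plot predictions against the first target column only
plt.scatter(earthquake_X_test[:, 0], earthquake_y_test[:, 0], color='black')
plt.plot(earthquake_X_test[:, 0], regr.predict(earthquake_X_test)[:, 0],
         color='blue', linewidth=3)
plt.show()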
