ValueError: could not convert string to float: 'ID1'

import pandas as pd
from sklearn.linear_model import LinearRegression

data = {
    'ID': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5'],
    'RMSE': [10.05616902165789, 9.496130901397015, 9.857060740380899, 9.528204292426823, 9.491117416326155]
}
df = pd.DataFrame(data)
X = df[['ID']]
y = df['RMSE']
reg = LinearRegression().fit(X, y)
preds = reg.predict(X)
mean_pred = preds.mean()
print('Mean of predicted RMSE values:', mean_pred)
How do I resolve this error?

You are getting the error because the ID column contains only str objects, which cannot be converted to float. The features you pass as X must be numerical for LinearRegression to work, so either drop the ID column or encode it as numbers first, for example as sketched below.
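Purely as a minimal sketch (assuming you do want one numeric regressor per row), you can replace the string IDs with integer codes; the column name ID_num below is just an illustrative choice:

import pandas as pd
from sklearn.linear_model import LinearRegression

data = {
    'ID': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5'],
    'RMSE': [10.05616902165789, 9.496130901397015, 9.857060740380899, 9.528204292426823, 9.491117416326155]
}
df = pd.DataFrame(data)

# Encode each string ID as an integer (ID1 -> 0, ID2 -> 1, ...)
df['ID_num'] = df['ID'].astype('category').cat.codes

X = df[['ID_num']]   # numeric feature column, as required by fit()
y = df['RMSE']
reg = LinearRegression().fit(X, y)
print('Mean of predicted RMSE values:', reg.predict(X).mean())

Whether an arbitrary ID code is a meaningful regressor is a separate question; if the IDs are only row labels, they probably should not be used as a feature at all.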

Related

LinearRegression TypeError

The screenshot above is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
X and y must have the same number of samples, so you need to reshape y as well so that it has 1 row and 1 column. In this case that comes down to the following:
np.array(mean).reshape(-1, 1) or np.array(mean).reshape(1, 1)
Given that you are training 5 regressors, each one on just a single data point, it is not surprising that all 5 models "learn" a coefficient of 0 and an intercept equal to y (here 3.34, the mean).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1, 1)
    y = np.array(mean).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make practical sense, given that you are providing only one example to train each model; a more conventional way to estimate beta is sketched below.
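For completeness, a per-stock beta is normally estimated from many observations per stock, by regressing the stock's periodic returns on the market's returns. The sketch below is only illustrative: the returns frame and the stock_return / market_return columns are made-up names and values, not data from the question.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical input: one row per (period, stock) with that period's returns
returns = pd.DataFrame({
    "stock":         ["ABCD", "ABCD", "ABCD", "XYZ", "XYZ", "XYZ"],
    "stock_return":  [0.010, -0.005, 0.020, 0.030, -0.010, 0.015],
    "market_return": [0.008, -0.002, 0.015, 0.008, -0.002, 0.015],
})

for symbol, grp in returns.groupby("stock"):
    X = grp[["market_return"]]   # many observations per stock
    y = grp["stock_return"]
    beta = LinearRegression().fit(X, y).coef_[0]
    print(symbol, beta)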

Converting strings to numpy arrays to run a for loop on dataframe linear regression

import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import os

os.chdir(r'path')
df = pd.read_excel("test.xlsx")
print(df)

def process_date(date_str):
    # the date format is month-day-year
    d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
    return d.timestamp()

df['Date'] = df['Date'].apply(process_date)
df.head()

for (columnName, columnData) in df.iteritems():
    cn = columnName
    cd = columnData
    print('Column Name : ', columnName)
    print('Column Contents : ', columnData.values)
    X = np.array(cn).reshape(-1, 1)   # note: cn is the column name, a string
    y = cd.to_numpy()
    model = LinearRegression()
    model.fit(X, y)
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
Here is my dataset
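The traceback points at X = np.array(cn).reshape(-1, 1): cn is the column name (a string), so the model is being fitted on text rather than on the column's values. A hedged sketch of what was probably intended, regressing each column against the processed Date column (the file test.xlsx and its layout are assumptions carried over from the question):

import datetime
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("test.xlsx")   # file name and layout assumed from the question

def process_date(date_str):
    # the date format is month-day-year
    return datetime.datetime.strptime(date_str, '%m-%d-%Y').timestamp()

df['Date'] = df['Date'].apply(process_date)

X = df[['Date']]                  # timestamps as the single numeric feature
for column_name in df.columns.drop('Date'):
    y = df[column_name].to_numpy()
    model = LinearRegression().fit(X, y)
    print(column_name, model.coef_, model.intercept_)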

could not convert string to float: 'Runny_nose'

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Disease_data = pd.read_csv("Disease_dataset.csv")
X = Disease_data.drop(columns='Diseases')
y = Disease_data['Diseases']
model = DecisionTreeClassifier()
model.fit(X, y)
I get this error:
ValueError: could not convert string to float: 'Runny_nose'
I tried
Disease_data = Disease_data['Diseases'].astype(float)
and
music_data = pd.to_numeric(music_data, errors='coerce')
but instead I get empty columns.
Some of your rows might not contain valid float data, which is why coercing to numeric leaves you with empty (NaN) columns.
Visit this thread for more info.
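More likely, though, the feature columns hold categorical symptom strings such as 'Runny_nose' and need to be encoded rather than cast to float. A minimal sketch using one-hot encoding (assuming every non-target column is categorical):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

Disease_data = pd.read_csv("Disease_dataset.csv")

X = Disease_data.drop(columns='Diseases')
y = Disease_data['Diseases']

# One-hot encode the string-valued feature columns so the tree receives numeric input
X_encoded = pd.get_dummies(X)

model = DecisionTreeClassifier()
model.fit(X_encoded, y)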

cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

My input data is under the form:
gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
I am reading my data and trying to concatenate two data sets that are subsets of the original data set. Here is the code I am using:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler

SeparateProjectLearning = False
CompleteCallersCallees = False
PartialTrainingSetCompleteCallersCallees = True

def main():
    X_train = {}
    X_test = {}
    y_train = {}
    y_test = {}
    dataset = pd.read_csv('InputData.txt', sep=',', index_col=False)
    # convert T into 1 and N into 0
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    pd.set_option('display.max_columns', None)
    row_count, column_count = dataset.shape
    Xcol = dataset.iloc[:, 1:column_count]
    CompleteSet = dataset.loc[dataset['CompleteCallersCallees'] == 1]
    CompleteSet_X = CompleteSet.iloc[:, 1:column_count].values
    CompleteSet_Y = CompleteSet.iloc[:, 0].values
    X_train, X_test, y_train, y_test = train_test_split(CompleteSet_X, CompleteSet_Y, test_size=0.2, random_state=0)
    TestSet = dataset.loc[dataset['CompleteCallersCallees'] == 0]
    X_test1 = TestSet.iloc[:, 1:column_count].values
    X_test = pd.concat(X_test1, X_test)
I want to build my own test set and training set using concatenation, and I am trying to concatenate X_test1 and X_test in the code above. However, the last line, X_test = pd.concat(X_test1, X_test), raises the error TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid. How can I fix this?
By adding .values to the end of your selections in the following lines:
CompleteSet_X = CompleteSet.iloc[:, 1:column_count].values
CompleteSet_Y = CompleteSet.iloc[:, 0].values
X_test1 = TestSet.iloc[:, 1:column_count].values
you are extracting the underlying NumPy ndarray from the pandas Series/DataFrame that the preceding code selects. Just remove the trailing .values and you can use concat directly on the Series or DataFrame. Note also that pd.concat expects a single list (or other iterable) of objects, e.g. pd.concat([X_test1, X_test]), as in the sketch below.
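A minimal sketch of the relevant tail end of main(), assuming the rest of the preprocessing stays as in the question (only the .values calls are dropped and a list is passed to pd.concat):

    # Keep these as pandas objects instead of calling .values
    CompleteSet_X = CompleteSet.iloc[:, 1:column_count]
    CompleteSet_Y = CompleteSet.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(
        CompleteSet_X, CompleteSet_Y, test_size=0.2, random_state=0)

    TestSet = dataset.loc[dataset['CompleteCallersCallees'] == 0]
    X_test1 = TestSet.iloc[:, 1:column_count]

    # pd.concat expects a list (or other iterable) of Series/DataFrames
    X_test = pd.concat([X_test1, X_test])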

How can I plug a prediction of X into BernoulliNB.predict_proba?

I have a Python script that, given an observed class (X) and some columns of binary features (Y), predicts a class (Pred_X). It then predicts the probability of each class (Prob(1), etc.). How can I get the probability of only the observed class (Prob(X))?
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()

# Data
df_1 = pd.DataFrame({'X' : [1,2,1,1,1,2,1,2,2,1],
                     'Y1': [1,0,0,1,0,0,1,1,0,1],
                     'Y2': [0,0,1,0,0,1,0,0,1,0],
                     'Y3': [1,0,0,0,0,0,1,0,0,0]})

# Split the data
df_I = df_1.loc[:, ['Y1', 'Y2', 'Y3']]
S_O = df_1['X']

# Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_I, S_O)

# Predict X
A_P = BNB.predict(df_I)
df_P = pd.DataFrame(A_P)
df_P.columns = ['Pred_X']

# Predict Probability
A_R = BNB.predict_proba(df_I)
df_R = pd.DataFrame(A_R)
df_R.columns = ['Prob_1', 'Prob_2']

# Join
df_1 = df_1.join(df_P)
df_1 = df_1.join(df_R)
Thanks @jezrael:
# Rename the columns after the classes of X
# (BNB.classes_ is in the same order as the columns returned by predict_proba)
classes = BNB.classes_
df_R.columns = classes
# Look up the predicted probability of the observed class X
df_1['Prob_X'] = df_R.lookup(df_R.index, df_1.X)
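In newer pandas versions DataFrame.lookup is deprecated (and removed in pandas 2.x), so the same per-row lookup can be done with plain NumPy indexing. A minimal sketch, assuming df_I, df_1 and the fitted BNB from above:

import numpy as np

# Probability matrix, shape (n_samples, n_classes), columns ordered by BNB.classes_
proba = BNB.predict_proba(df_I)
# Column position of each row's observed class (classes_ is sorted, so searchsorted works)
class_pos = np.searchsorted(BNB.classes_, df_1['X'])
# Pick, for every row, the probability of its observed class
df_1['Prob_X'] = proba[np.arange(len(df_1)), class_pos]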
