Value error when training model with randomforest classifier - python

from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
import time
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
enc = preprocessing.OneHotEncoder()
onehotencoder = OneHotEncoder(categories='auto')
enc.fit(X)
onehotlabels = enc.transform(X).toarray()
onehotlabels.shape
clf=RandomForestClassifier(n_estimators=10)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
predict = clf.predict(X_test)
print("Evaluation on Test Set",predict)
I am doing this to train my model with randomforest classifier. I am getting the following error:
ValueError: could not convert string to float: 'gorilla'

I can't tell for sure from your code, because the structure of X, X_train and X_test is not clear.
However, I suspect that the onehotlabels variable is never used: the classifier is fitted on X_train, which presumably still holds the raw strings.
If the one-hot encoded data had been passed to the classifier, the string 'gorilla' would not have reached it.
So I suggest you check whether something like the following was executed before fitting:
X_train, X_test = train_test_split(onehotlabels)
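To make the suggestion concrete, here is a minimal sketch of encoding first and then splitting the encoded array together with the target. The toy X and y are hypothetical stand-ins, since the original data is not shown, and for brevity the encoder is fitted on the full data before the split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature column and a binary target.
X = np.array([["gorilla"], ["cat"], ["dog"], ["gorilla"],
              ["cat"], ["dog"], ["gorilla"], ["cat"]])
y = np.array([1, 0, 0, 1, 0, 0, 1, 0])

# Encode the string categories as numeric indicator columns.
enc = OneHotEncoder(handle_unknown="ignore")
onehotlabels = enc.fit_transform(X).toarray()

# Split the encoded features (not the raw strings) together with the target.
X_train, X_test, y_train, y_test = train_test_split(
    onehotlabels, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

The important part is that the arrays passed to fit and predict come from onehotlabels, so no string values reach the classifier.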

Related

How to fix the error ValueError: could not convert string to float in an NLP project in Python?

I am writing Python code in a Jupyter notebook that trains and tests a model on a dataset in order to return the correct sentiment.
The problem is that when I try to predict the sentiment of a phrase, the code crashes and displays the error below:
ValueError: could not convert string to float: 'this book was so interstening it made me not happy'
Note that I have an imbalanced dataset, so I use SMOTE to oversample it.
code:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE  # for the imbalanced dataset
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
df = pd.read_csv("data/Apple-Twitter-Sentiment-DFE.csv",encoding="ISO-8859-1")
df
# data is cleaned using preprocessing functions
# Handling the imbalanced dataset using SMOTE
vectorizer = TfidfVectorizer()
vect_df =vectorizer.fit_transform(df["clean_text"])
oversample = SMOTE(random_state = 42)
x_smote,y_smote = oversample.fit_resample(vect_df, df["sentiment"])
print("shape x before SMOTE: {}".format(vect_df.shape))
print("shape x after SMOTE: {}".format(x_smote.shape))
print("balance of targets field %")
y_smote.value_counts(normalize = True)*100
# split the dataset into train and test
x_train,x_test,y_train,y_test = train_test_split(x_smote,y_smote,test_size = 0.2,random_state =42)
logreg = Pipeline([
('tfidf', TfidfTransformer()),
('clf', LogisticRegression(n_jobs=1, C=1e5)),
])
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))
# Make prediction
exl = "this book was so interstening it made me not happy"
logreg.predict(exl)
You should pass the text through the fitted vectorizer before predicting. Define exl as follows:
exl = vectorizer.transform(["this book was so interstening it made me not happy"])
and then call logreg.predict(exl).
That is, put the test text in a list and use the same vectorizer, so that the features extracted from your training data are used for the prediction.
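A small self-contained sketch of the same idea, assuming a hypothetical toy corpus in place of df["clean_text"] and df["sentiment"], and leaving SMOTE out to keep it short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for df["clean_text"] and df["sentiment"].
texts = [
    "i love this phone",
    "great battery and screen",
    "this was a terrible purchase",
    "not happy with the service",
]
labels = [1, 1, 0, 0]

# Fit the vectorizer once on the training text and keep it around.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(x_train, labels)

# New text goes in a list and through transform (not fit_transform),
# so it is mapped onto the vocabulary learned during training.
exl = vectorizer.transform(["this book was so interstening it made me not happy"])
prediction = model.predict(exl)
```

Passing the raw string to predict fails because the model only understands the numeric feature matrix produced by the vectorizer.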

How can I create an instance of a multi-layer perceptron network to use in a bagging classifier?

I am trying to create an instance of a multi-layer perceptron network to use in a bagging classifier, but I don't understand how to do it.
Here is my code:
My task is:
1. Apply a bagging classifier (with or without replacement) with eight base classifiers created at the previous step.
It would be really great if you could show me how to implement this in my algorithm. I searched but couldn't find a way to do it.
To train your BaggingClassifier:
import numpy as np  # used below for the per-class counts
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Load the digits data:
X,y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
# Feature scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Finally for the MLP- Multilayer Perceptron
mlp = MLPClassifier(hidden_layer_sizes=(16, 8, 4, 2), max_iter=1001)
clf = BaggingClassifier(mlp, n_estimators=8)
clf.fit(X_train,y_train)
To analyze your output you may try:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)
To see the number of correctly predicted instances per class:
print(cm[np.eye(len(clf.classes_)).astype("bool")])
To see the fraction of correctly predicted instances per class:
cm[np.eye(len(clf.classes_)).astype("bool")]/cm.sum(1)
To see the total accuracy of your algorithm:
(y_pred==y_test).mean()
EDIT
To access predictions on a per base estimator basis, i.e. your mlps, you can do:
estimators = clf.estimators_
# print(len(estimators), type(estimators[0]))
preds = []
for base_estimator in estimators:
    preds.append(base_estimator.predict(X_test))
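As a rough sketch of how the per-estimator predictions could be compared with the ensemble, the snippet below computes an accuracy for each of the eight MLPs. The smaller hidden layer and the fixed random_state values are assumptions made only to keep the example quick and repeatable; they are not part of the answer above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Smaller network and fixed seeds are assumptions to keep the sketch fast.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
clf = BaggingClassifier(mlp, n_estimators=8, random_state=0)
clf.fit(X_train, y_train)

# Base estimators predict encoded class indices; map them back via classes_.
per_estimator_acc = [
    (clf.classes_[est.predict(X_test)] == y_test).mean() for est in clf.estimators_
]
ensemble_acc = clf.score(X_test, y_test)
```

Comparing per_estimator_acc with ensemble_acc gives a feel for how much the aggregation helps over any single base network.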

How to feed data into random forest classifier and see prediction

I have built a random forest classifier using scikit-learn and Python, and I am having trouble actually feeding data in to see the prediction. I want to see the format of the output and convert it to a JSON file. I have attached the code for the random forest and what the data looks like. I believe I need to use 'y_pred', but I am not sure what format the input data needs to be in.
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
You can simply concatenate the predicted values with the matrix of features.
Also note that the pipeline is exactly for this purpose, when you first want to transform the data and then apply some classifier.
This should work for you:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# X_test is a NumPy array here, so wrap it in a DataFrame before concatenating
pred = pd.concat([pd.DataFrame(X_test), pd.Series(y_pred, name="pages")], axis=1)
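Since the question also asks about JSON output, here is a minimal sketch of turning such a table into JSON with pandas. The feature values, the predicted labels and the column names feature_1 and feature_2 are hypothetical, as the original dataset is not shown.

```python
import json
import pandas as pd

# Hypothetical stand-ins for the test features and the classifier output.
X_test = [[0.5, 1.2], [-0.3, 0.8], [1.1, -0.4]]
y_pred = ["home", "about", "home"]

# Column names "feature_1"/"feature_2" are illustrative, not from the original data.
pred = pd.DataFrame(X_test, columns=["feature_1", "feature_2"])
pred["pages"] = y_pred

# One JSON object per row; pass a file path to to_json to write a file instead.
json_text = pred.to_json(orient="records")
records = json.loads(json_text)
```

With orient="records" each row becomes one object holding the features and the predicted "pages" value, which is usually the easiest shape to consume elsewhere.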

How to fix NameError: name 'X_train' is not defined?

I am running the [code] of multi-label classification. How do I fix the NameError saying that "X_train" is not defined? The Python code is given below.
import scipy
from scipy.io import arff
data, meta = scipy.io.arff.loadarff('./yeast/yeast-train.arff')
from sklearn.datasets import make_multilabel_classification
# this will generate a random multi-label dataset
X, y = make_multilabel_classification(sparse = True, n_labels = 20,
return_indicator = 'sparse', allow_unlabeled = False)
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)
You forgot to split the dataset into train and test sets.
Import the library
from sklearn.model_selection import train_test_split
Add this line before classifier.fit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
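X_train does not exist; you have to split the data into train and test sets first. After splitting, you can scale the features, fitting the scaler on the training set only:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X_train = s.fit_transform(X_train)
X_test = s.transform(X_test)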
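Putting the split in place, a self-contained sketch might look like the following. It is only an illustration under assumptions: it uses dense random data, and scikit-learn's MultiOutputClassifier is swapped in for skmultilearn's BinaryRelevance so that the example needs no extra package (both train one GaussianNB per label).

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB

# Dense random multi-label data (GaussianNB does not accept sparse input).
X, y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=42)

# This is the step missing from the question: define X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# One GaussianNB per label; a scikit-learn stand-in for BinaryRelevance.
classifier = MultiOutputClassifier(GaussianNB())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Subset accuracy: a sample counts only if all of its labels are predicted correctly.
score = accuracy_score(y_test, predictions)
```

Once the split line runs before classifier.fit, the NameError goes away because all four names exist.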

How do cross_val_score and GridSearchCV work?

I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.
After getting odd results, I set up a sort of validation experiment, but I still do not understand what I am doing wrong.
To simplify, I am using GridSearchCV in the simplest possible way and trying to validate and understand what is happening.
Here it is:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer,r2_score,mean_absolute_error,mean_squared_error
from math import sqrt
I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and the simple linear regression. I have checked that the two datasets are identical:
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]
train_data = list(range(0,15))
test_data = list(range(15,21))
X_train, y_train=X[train_data,:],y[train_data]
X_test, y_test=X[test_data,:],y[test_data]
Here is what I do.
First, I instantiate a simple linear model and use it with the manually split data:
lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train)
->r2=0.4686662249071524
lmscore_test=lm.score(X_test,y_test)
->r2 0.6264021467338086
Now I try to do the exact same thing using a pipeline:
pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train)
->r2=0.4686662249071524
pscore_test=p.score(X_test,y_test)
->r2 0.6264021467338086
The LinearRegression and pipeline results match perfectly.
Now I try to do the same with cross_val_score, using the predefined split kf:
cv_scores = cross_val_score(lm, X, y, cv=kf)
->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)
Now let's try gridsearchCV
scoring = {'r_squared':'r2'}
grid_parameters = [{}]
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score=True,refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_
From cv_results_ I once again get:
->mean_test_r_squared->r2->-1.234474757883921292e+01
So cross_val_score and GridSearchCV match one another in the end, but the score is totally off and different from what it should be.
Could you please help me solve this puzzle?
cross_val_score and GridSearchCV first split the data, train the model on the training data only, and then score it on the test data.
Here you are training on the full data and then scoring on the test data, hence your results don't match those of cross_val_score.
Instead of this:
lm=lr.fit(X,y)
Try this:
lm=lr.fit(X_train, y_train)
Same for pipeline:
Instead of p=pipe.fit(X,y), do this:
p=pipe.fit(X_train, y_train)
You can look at my other answers for a more detailed description:
https://stackoverflow.com/a/42364900/3374996
https://stackoverflow.com/a/42230764/3374996
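To check this explanation, the sketch below fits on the training rows only and compares the resulting test score with what cross_val_score reports for the same predefined split. The synthetic X and y are hypothetical, since the original data is not shown; only the 15/6 layout follows the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Hypothetical synthetic data with the same 15/6 layout as in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=21)

# -1 marks rows that are always used for training; 0 marks the single test fold.
test_fold = np.append(np.full(15, -1), np.full(6, 0))
kf = PredefinedSplit(test_fold)

X_train, y_train = X[:15], y[:15]
X_test, y_test = X[15:], y[15:]

# Fit on the training rows only, then score on the held-out rows.
manual_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# cross_val_score clones the estimator and does the same fit/score for the fold.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
```

Here cv_scores holds a single value, and it agrees with manual_score because both come from a model that never saw the six test rows during fitting.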
