How to give input to a loaded .pkl model in Python

I have a Random Forest model saved in a .pkl file.
I have loaded the .pkl model, but now I have to feed it the test data and compute the accuracy.
How do I pass input to the loaded .pkl model?
import pickle
def read_from_pickle(RF):
    with open(RF, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass
This is the code I have used to load the model.
Next, how do I feed it input?
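For context, here is a minimal sketch of the missing step, assuming the .pkl file holds a single fitted estimator; the file names, the 'target' column, and accuracy_score as the metric are placeholders, not part of the question:

import pickle
import pandas as pd
from sklearn.metrics import accuracy_score

# load the single pickled estimator (file name is a placeholder)
with open('random_forest.pkl', 'rb') as f:
    model = pickle.load(f)

# the test data must have the same feature columns, in the same order, as at training time
test_df = pd.read_csv('test_data.csv')        # placeholder path
X_test = test_df.drop(columns=['target'])     # 'target' is a hypothetical label column
y_test = test_df['target']

predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))    # for a regressor, use a regression metric instead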

This solution uses a Random Forest regressor; my model was for dynamic price prediction.
import pandas as pd
import numpy as np
from sklearn import pipeline, preprocessing, metrics, model_selection, ensemble, linear_model
from sklearn_pandas import DataFrameMapper
from sklearn.metrics import mean_squared_error
# First we load these libraries, then the dataset, and do all the cleaning; after that:
data.to_csv("Pune_hpp.csv", index=False)
mapper = DataFrameMapper([
    (['area_type', 'size', 'new_total_sqft', 'bath', 'balcony'], preprocessing.StandardScaler()),
    # (['area_type', 'size'], preprocessing.OneHotEncoder())
], df_out=True)
# Here we create two pipelines, because we compare two algorithms using MSE and RMSE.
pipeline_obj_LR = pipeline.Pipeline([
    ('mapper', mapper),
    ("model", linear_model.LinearRegression())
])
pipeline_obj = pipeline.Pipeline([
    ('mapper', mapper),
    ("model", ensemble.RandomForestRegressor())
])
X = ['area_type', 'size', 'new_total_sqft', 'bath', 'balcony']  # X: input features
Y = ['price']  # Y: output / target
# Here the comparison starts
pipeline_obj_LR.fit(data[X], data[Y])  # linear regression
pipeline_obj.fit(data[X], data[Y])     # random forest
pipeline_obj.predict(data[X])          # exploratory predictions
predict = pipeline_obj_LR.predict(data[X])
# Below is the actual comparison to see which algorithm fits best.
predict = pipeline_obj_LR.predict(data[X])
# Root Mean Squared Error on train and test data
print('MSE using linear_regression: ', mean_squared_error(data[Y], predict))
print('RMSE using linear_regression: ', mean_squared_error(data[Y], predict)**(0.5))
# Above: linear regression
predict = pipeline_obj.predict(data[X])
# Root Mean Squared Error on train and test data
print('MSE using randomforestregression: ', mean_squared_error(data[Y], predict))
print('RMSE using randomforestregression: ', mean_squared_error(data[Y], predict)**(0.5))
Above is the Random Forest regressor. The reason I went with the random forest and saved it with joblib was that I had a huge dataset; joblib is easy to implement and takes very few lines of code. As you can see, I did not save pipeline_obj_LR. This is how the fitted model is written to a .pkl file:
import joblib
joblib.dump(pipeline_obj,'dynamic_price_pred.pkl')
modelReload=joblib.load('dynamic_price_pred.pkl')
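The reloaded pipeline answers the original question: because the mapper and the model are stored together, you pass a DataFrame with the same input columns and call predict. A minimal sketch with a made-up row of new data (the feature values are hypothetical and assume the same numeric encoding used during cleaning):

import pandas as pd
# hypothetical new observation; the columns must match the mapper's inputs
new_data = pd.DataFrame([{
    'area_type': 1, 'size': 2, 'new_total_sqft': 1056.0, 'bath': 2, 'balcony': 1
}])
predicted_price = modelReload.predict(new_data)
print(predicted_price)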

Related

Accuracy is high every time but the resulting prediction is wrong

I am trying to predict the crop name by entering the temperature, soil humidity, pH and average rainfall.
The accuracy percentage is always high, i.e. it ranges from 88% to 94% every time, but the final result after prediction is always wrong.
This is the code:
#importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#Reading the csv file
data=pd.read_csv('cpdata.csv')
#Creating dummy variable for target i.e label
label= pd.get_dummies(data.label).iloc[: , 1:]
data= pd.concat([data,label],axis=1)
data.drop('label', axis=1,inplace=True)
print('The data present in one row of the dataset is')
print(data.head(1))
train=data.iloc[:, 0:4].values
test=data.iloc[: ,4:].values
#Dividing the data into training and test set
X_train,X_test,y_train,y_test=train_test_split(train,test,test_size=0.3)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Importing Decision Tree classifier
from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor()
#Fitting the classifier into training set
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
from sklearn.metrics import accuracy_score
# Finding the accuracy of the model
a=accuracy_score(y_test,pred)
print("The accuracy of this model is: ", a*100)
ah=89.41
atemp=26.98
shum=28
pH=6.26
rain=58.54
l=[]
l.append(atemp)
l.append(ah)
l.append(pH)
l.append(rain)
predictcrop=[l]
# Putting the names of crop in a single list
crops=['rice','wheat','mungbean','Tea','millet','maize','lentil','jute','cofee','cotton','ground nut','peas','rubber','sugarcane','tobacco','kidney beans','moth beans','coconut','blackgram','adzuki beans','pigeon peas','chick peas','banana','grapes','apple','mango','muskmelon','orange','papaya','pomegranate','watermelon']
cr='rice'
#Predicting the crop
predictions = clf.predict(predictcrop)
count=0
for i in range(0, 31):
    if(predictions[0][i] == 1):
        c = crops[i]
        count = count + 1
        break
    i = i + 1
if(count == 0):
    print('The predicted crop is %s' % cr)
else:
    print('The predicted crop is %s' % c)
The output that I am getting is-
The accuracy of this model is: 90.43010752688173
The predicted crop is apple
Even though I enter the exact values for any other crop, I get apple or mango every time.
Kindly help.
Apply the scaler to your new data for prediction as well. I cannot test it without your data, but it should look something like this:
datascaled = sc.transform(predictcrop)
predictions = clf.predict(datascaled)
In order to apply the scaler also to new data later, you need to save it:
from sklearn.externals.joblib import dump, load
dump(sc, 'scaler.bin', compress=True)
and later:
sc=load('scaler.bin')
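Putting the pieces together, the later prediction step might look like the sketch below. Persisting the classifier itself with dump is an assumption (the answer only shows saving the scaler), and the feature values are the ones from the question:

from sklearn.externals.joblib import dump, load  # older sklearn; newer code would import joblib directly
# at training time (assumed): dump(clf, 'model.bin', compress=True)
sc = load('scaler.bin')
clf = load('model.bin')                          # hypothetical file name
predictcrop = [[26.98, 89.41, 6.26, 58.54]]      # atemp, ah, pH, rain from the question
datascaled = sc.transform(predictcrop)
predictions = clf.predict(datascaled)
print(predictions)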

sklearn - How to reload model with a pipeline and predict?

I've saved a trained model and the testing dataset, and I wish to reload the model just to verify I'm getting the same results for future use (I don't have new data to test on at the moment). The CSV I've saved does not contain the labels; it's the same test data as in the original train/test operation, which worked fine.
I created the model like so:
# copy split data for this model
dtc_test_X = test_X
dtc_test_y = test_y
dtc_train_X = train_X
dtc_train_y = train_y
# initialize the model
dtc = DecisionTreeClassifier(random_state = 1)
# fit the training data
dtc_yhat = dtc.fit(dtc_train_X, dtc_train_y).predict(dtc_test_X)
# scikit-learn's accuracy scoring
acc = accuracy_score(dtc_test_y, dtc_yhat)
# scikit-learn's Jaccard Index
jacc = jaccard_similarity_score(dtc_test_y, dtc_yhat)
# scikit-learn's classification report
class_report = classification_report(dtc_test_y, dtc_yhat)
I've saved the model and data below:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
# setup the pipe line
pipe = make_pipeline(DecisionTreeClassifier)
# save the model
joblib.dump(pipe, 'model.pkl')
dtc_test_X.to_csv('set_to_predict.csv')
When I reload the model and attempt a prediction as follows:
#Loading the saved model with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)
pred_cols
# apply the whole pipeline to data
pred = pd.Series(pipe.predict(pr[pred_cols]))
On the last line though (the prediction) it raised an exception:
TypeError: predict() missing 1 required positional argument: 'X'
Searching for an answer, I can only find examples of a similar exception but with Y instead of X and the answers don't seem to apply. Why am I getting this error?
Try substituting pipe.predict(pr[pred_cols]) with pipe.predict(X=pr[pred_cols]) to see whether it works or whether it gives you a different error.
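A likely root cause, not stated in the answer above but consistent with the traceback, is that make_pipeline(DecisionTreeClassifier) was given the class rather than an instance, and the pipeline was never fitted before being dumped; calling predict on the bare class makes the data bind to self, which produces exactly the "missing 1 required positional argument: 'X'" message. A minimal sketch of the save step under that assumption:

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals import joblib  # matches the question's sklearn version

# instantiate the estimator and fit the pipeline before saving it
pipe = make_pipeline(DecisionTreeClassifier(random_state=1))
pipe.fit(dtc_train_X, dtc_train_y)
joblib.dump(pipe, 'model.pkl')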

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying sklearn for the first time. I have two dataframes: one with data to train a logistic regression model (with 10-fold cross-validation) and another one whose classes (0/1) I want to predict using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.ix[:,:-1].values
# Target
df_target = df.ix[:,-1].values
# Values to predict
df_test = df_pred.ix[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. scale is a simple function; it is better to use something else, for example StandardScaler:
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting
How does your code work? You split the data 10 times into train and hold-out sets; 10 times you fit the model on the train set and calculate the score on the hold-out set. This way you get cross-validation scores, but each model is fitted on only part of the data. So it is better to fit the model on the whole dataset and then make the prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to note that there are advanced techniques like stacking and boosting, but if you are learning sklearn, it is better to stick to the basics.
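As a side note, not part of the answer above: the hand-written list comprehension can also be expressed with cross_val_score, which the question already imports. A minimal sketch using the same variable names as the question:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
kf = KFold(n_splits=10)
# equivalent to the list comprehension: fit and score once per fold
scores = cross_val_score(LogReg, X, y, cv=kf)
print(scores.mean())
# then fit on the full training data before predicting new samples
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)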

Load and predict new data sklearn

I trained a logistic model, cross-validated it, and saved it to a file using the joblib module. Now I want to load this model and predict new data with it.
Is this the correct way to do this? Especially the standardization. Should I use scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here.
Here is my code:
#Loading the saved model with joblib
model = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# Standardize new data
scaler = StandardScaler()
X_pred = scaler.fit(pr[pred_cols]).transform(pr[pred_cols])
pred = pd.Series(model.predict(X_pred))
print pred
No, it's incorrect. All the data preparation steps should be fitted using the training data. Otherwise, you risk applying the wrong transformations, because the means and variances that StandardScaler estimates are likely to differ between train and test data.
The easiest way to train, save, load and apply all the steps simultaneously is to use Pipelines:
At training:
# prepare the pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
joblib.dump(pipe, 'model.pkl')
At prediction:
#Loading the saved model with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# apply the whole pipeline to data
pred = pd.Series(pipe.predict(pr[pred_cols]))
print pred

Make predictions from a saved trained classifier in Scikit Learn

I wrote a classifier for tweets in Python, which I then saved in .pkl format on disk so I can run it again and again without needing to train it each time. This is the code:
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
from sklearn.externals import joblib
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters, numbers and lower-case the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
y = train['sentiment']
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)
a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())
#Export the trained model to load it in another project
joblib.dump(classifier, 'my_model.pkl', compress=9)
Let's say that I have another Python file and I want to classify a Tweet. How can I proceed to do the classification?
from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')
mytweet = 'Uh wow:#medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
Up to hasher.transform I can replicate the same procedure for the prediction, but then I have the problem that I cannot select the best 20k features. To use SelectKBest, you need to provide both features and labels. Since I want to predict the label, I cannot fit SelectKBest. So how can I get past this issue and continue with the prediction?
I support the comment of @EdChum that
you build a model by training it on data which presumably is representative enough for it to cope with unseen data
Practically, this means that you need to apply both FeatureHasher and SelectKBest to your new data at prediction time without refitting them. (It is wrong to fit FeatureHasher anew on the new data, because in general it will produce different features.)
To do this, either
pickle FeatureHasher and SelectKBest separately,
or (better)
make a Pipeline of FeatureHasher, SelectKBest, and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and use predict on new data, as sketched below.
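A minimal sketch of the pipeline option, reusing the names from the question (FeatureHasher's non_negative argument and sklearn.externals.joblib match the question's older sklearn version; the file name my_pipeline.pkl is made up):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

# training script: hashing, feature selection and the classifier live in one object
pipe = Pipeline([
    ('hasher', FeatureHasher(n_features=2**18, input_type="string", non_negative=True)),
    ('select', SelectKBest(chi2, k=20000)),
    ('clf', RandomForestClassifier(n_estimators=10)),
])
X_tokens = [list(tokens(d)) for d in train['text']]
pipe.fit(X_tokens, train['sentiment'])
joblib.dump(pipe, 'my_pipeline.pkl', compress=9)

# prediction script: the hasher and the selector are replayed, not refitted
pipe = joblib.load('my_pipeline.pkl')
mytweet = 'Uh wow:#medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
print(pipe.predict([list(tokens(mytweet))]))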
