sklearn - How to reload model with a pipeline and predict? - python

I've saved a trained model and the testing dataset and wish to reload it just to verify I'm getting the same results for future use of the model (I don't have new data to test on at the moment). The csv I've saved does not contain the labels, it's the same test data as in the original train/test operation which worked fine.
I created the model like so:
# copy split data for this model
dtc_test_X = test_X
dtc_test_y = test_y
dtc_train_X = train_X
dtc_train_y = train_y
# initialize the model
dtc = DecisionTreeClassifier(random_state = 1)
# fit on the training data and predict on the test data
dtc_yhat = dtc.fit(dtc_train_X, dtc_train_y).predict(dtc_test_X)
# scikit-learn's accuracy scoring
acc = accuracy_score(dtc_test_y, dtc_yhat)
# scikit-learn's Jaccard Index
jacc = jaccard_similarity_score(dtc_test_y, dtc_yhat)
# scikit-learn's classification report
class_report = classification_report(dtc_test_y, dtc_yhat)
I've saved the model and data below:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
# setup the pipe line
pipe = make_pipeline(DecisionTreeClassifier)
# save the model
joblib.dump(pipe, 'model.pkl')
dtc_test_X.to_csv('set_to_predict.csv')
When I reload the model and attempt a prediction as follows:
#Loading the saved model with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)
pred_cols
# apply the whole pipeline to data
pred = pd.Series(pipe.predict(pr[pred_cols]))
On the last line though (the prediction) it raised an exception:
TypeError: predict() missing 1 required positional argument: 'X'
Searching for an answer, I can only find examples of a similar exception but with Y instead of X and the answers don't seem to apply. Why am I getting this error?

Try replacing pipe.predict(pr[pred_cols]) with pipe.predict(X=pr[pred_cols]) to see whether it works or raises a different error.
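A likely cause, going by the code in the question: make_pipeline(DecisionTreeClassifier) receives the class itself rather than an instance, and the pipeline is never fitted before it is dumped. When predict is later called, the call reaches DecisionTreeClassifier.predict without a bound instance, which would explain the "missing 1 required positional argument: 'X'" message. Below is a minimal sketch of the save/reload round trip with a fitted instance. The synthetic data, the column names f0..f4, and the temporary directory are assumptions made for illustration, and joblib is imported as a standalone package.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

# Pass an instance (note the parentheses) and fit the pipeline before saving it.
pipe = make_pipeline(DecisionTreeClassifier(random_state=1))
pipe.fit(train_X, train_y)
original_pred = pipe.predict(test_X)

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.pkl")
csv_path = os.path.join(workdir, "set_to_predict.csv")

joblib.dump(pipe, model_path)
# index=False keeps the row index from coming back as an extra column.
test_X.to_csv(csv_path, index=False)

# Reload both artifacts and predict again.
reloaded = joblib.load(model_path)
pr = pd.read_csv(csv_path, float_precision="round_trip")
reloaded_pred = reloaded.predict(pr)

same = bool((original_pred == reloaded_pred).all())
print("predictions identical after reload:", same)
```

If the CSV is written without index=False, read_csv returns an additional "Unnamed: 0" column, so the reloaded frame no longer matches the features the model was trained on.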

Related

output decision tree in the pipeline manner

I am new to machine learning with the sklearn library. I am trying to put a decision tree into a pipeline and then both make predictions and output the fitted tree, but when I run the following code I get the error:
'Pipeline' object has no attribute 'tree_'
So I wonder whether a pipeline does not support tree output, and how I can fix this. I have also tried using the decision tree class directly, but then I get another error:
setting an array element with a sequence.
I know this appears because my vectors have different dimensions, but I still have no clue how to deal with it.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn import tree
# a function that reads the corpus, tokenizes it and returns the documents
# and their labels
def read_corpus(corpus_file, use_sentiment):
    documents = []
    labels = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            documents.append(tokens[3:])
            if use_sentiment:
                # 2-class problem: positive vs negative
                labels.append( tokens[1] )
            else:
                # 6-class problem: books, camera, dvd, health, music, software
                labels.append( tokens[0] )
    return documents, labels
# a dummy function that just returns its input
def identity(x):
    return x
# read the data and split it into train and test
X, Y = read_corpus('/Users/dengchenglong/Downloads/trainset', use_sentiment=False)
split_point = int(0.75*len(X))
Xtrain = X[:split_point]
Ytrain = Y[:split_point]
Xtest = X[split_point:]
Ytest = Y[split_point:]
# let's use the TF-IDF vectorizer
tfidf = False
# we use a dummy function as tokenizer and preprocessor,
# since the texts are already preprocessed and tokenized.
if tfidf:
    vec = TfidfVectorizer(preprocessor = identity,
                          tokenizer = identity)
else:
    vec = CountVectorizer(preprocessor = identity,
                          tokenizer = identity)
# combine the vectorizer with a decision tree classifier
classifier = Pipeline( [('vec', vec),
                        ('cls', tree.DecisionTreeClassifier())])
# train the classifier on the train dataset
decision_tree = classifier.fit(Xtrain, Ytrain)
# predict the labels of the test data
Yguess = classifier.predict(Xtest)
tree.plot_tree(classifier.fit(Xtest, Ytest))
# report performance of the classifier
print(accuracy_score(Ytest, Yguess))
print(classification_report(Ytest, Yguess))
What if you try this:
from sklearn.pipeline import make_pipeline
# combine the vectorizer with a decision tree classifier
clf = DecisionTreeClassifier()
classifier = make_pipeline(vec,clf)
It seems that you must instantiate the model before passing it to the pipeline. Let me know whether this works and, if not, which errors it returns.
From: Scikit-learn documentation
Example out of: Make pipeline example with trees

KeyError:"['class']" not found in axis

I found a tutorial about the decision tree algorithm that uses the PyXLL add-in for Excel and tried to run it. I get an error: KeyError:"['class']" not found in axis.
from pyxll import xl_func
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import os
@xl_func("float, int, int: object")
def ml_get_zoo_tree_2(train_size=0.75, max_depth=5, random_state=245245):
    # Load the zoo data
    dataset = pd.read_csv(os.path.join(os.path.dirname(__file__), "zoo.csv"))
    # Drop the animal names since this is not a good feature to split the data on
    dataset = dataset.drop("animal_name", axis=1)
    # Split the data into a training and a testing set
    features = dataset.drop("class", axis=1)
    targets = dataset["class"]
    train_features, test_features, train_targets, test_targets = \
        train_test_split(features, targets, train_size=train_size, random_state=random_state)
    # Train the model
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
    tree = tree.fit(train_features, train_targets)
    # Add the feature names to the tree for use in predict function
    tree._feature_names = features.columns
    return tree
If I remove lines 17 and 18 (the ones that use the class column), I get NameError: name 'features' is not defined; when I then remove features as well, I get an error saying that the target has to be defined.
You need the correct dataset to go with that tutorial. You can download it (and the code) from https://github.com/pyxll/pyxll-examples/tree/master/machine-learning.
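The KeyError means that the loaded file has no column named "class". A quick way to confirm this before dropping anything is to list the columns that are actually present. The sketch below uses a made-up three-row frame with a hypothetical class_type column in place of the real zoo.csv, so the column names are assumptions, not the contents of the tutorial's file:

```python
import pandas as pd

# Hypothetical stand-in for a zoo.csv whose label column is not called "class".
dataset = pd.DataFrame(
    {
        "animal_name": ["aardvark", "bass", "chicken"],
        "hair": [1, 0, 0],
        "class_type": [1, 4, 2],
    }
)

# Compare the columns the tutorial code expects with what the file contains.
required = ["animal_name", "class"]
missing = [name for name in required if name not in dataset.columns]
print("columns in file:", list(dataset.columns))
print("missing columns:", missing)

# Dropping a column that does not exist raises the KeyError from the question.
try:
    dataset.drop("class", axis=1)
    drop_error = None
except KeyError as exc:
    drop_error = str(exc)
print("drop error:", drop_error)
```

If the check reports a missing column, either use the dataset that ships with the tutorial or adjust the column name in the code to match the file.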

Multiclass Classification and probability prediction

import pandas as pd
import numpy
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
fi = "df.csv"
# Open the file for reading and read in data
file_handler = open(fi, "r")
data = pd.read_csv(file_handler, sep=",")
file_handler.close()
# split the data into training and test data
train, test = cross_validation.train_test_split(data,test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.ix[:,0:127]
train_label = train.iloc[:,127]
test_features = test.ix[:,0:127]
test_label = test.iloc[:,127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print "test_data\n",test_data["p_malw"]
print "Accuracy:", naive_b.score(test_features,test_label)
I have written this code to accept input from a csv file with 128 columns where 127 columns are features and the 128th column is the class label.
I want to predict the probability that a sample belongs to each class (there are 5 classes, 1-5), print the result in the form of a matrix, and determine the class of each sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.
GaussianNB.predict_proba returns the probabilities of the samples for each class in the model. In your case, it should return a result with five columns and the same number of rows as your test data. You can verify which column corresponds to which class using naive_b.classes_. So it is not clear why you say that this is not the desired output. Perhaps your problem comes from assigning the output of predict_proba to a single data frame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
If you want the predicted label for each sample you can use the predict method, followed by confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
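To make the shape of the predict_proba output concrete, here is a minimal sketch on synthetic data. The five well-separated Gaussian clusters, the four features, and the p_class_ column names are assumptions for illustration, not the asker's df.csv:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_classes = 5

# Synthetic stand-in: 40 samples per class, 4 features, one cluster per class.
features = np.vstack(
    [rng.normal(loc=3.0 * k, scale=1.0, size=(40, 4)) for k in range(n_classes)]
)
labels = np.repeat(np.arange(1, n_classes + 1), 40)

naive_b = GaussianNB().fit(features, labels)

# One row per sample, one column per class, ordered as in naive_b.classes_.
pred_prob = naive_b.predict_proba(features)

# Store the matrix as several named columns instead of a single column.
prob_table = pd.DataFrame(
    pred_prob, columns=[f"p_class_{c}" for c in naive_b.classes_]
)
# The predicted class is the one with the highest probability in each row.
prob_table["predicted"] = naive_b.classes_[pred_prob.argmax(axis=1)]

print(pred_prob.shape)
print(prob_table.head())
```

Each row of pred_prob sums to 1, and prob_table can be joined to the test data with pd.concat(..., axis=1) if everything is needed in one frame.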

Load and predict new data sklearn

I trained a Logistic model, cross-validated and saved it to file using joblib module. Now I want to load this model and predict new data with it.
Is this the correct way to do this? Especially the standardization. Should I use scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here.
Here is my code:
#Loading the saved model with joblib
model = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# Standardize new data
scaler = StandardScaler()
X_pred = scaler.fit(pr[pred_cols]).transform(pr[pred_cols])
pred = pd.Series(model.predict(X_pred))
print pred
No, it's incorrect. All the data preparation steps should be fit on the training data only. Otherwise you risk applying the wrong transformations, because the means and variances that StandardScaler estimates will probably differ between the training data and the new data.
The easiest way to train, save, load and apply all the steps simultaneously is to use Pipelines:
At training:
# prepare the pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# pass estimator instances, not classes
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
joblib.dump(pipe, 'model.pkl')
At prediction:
#Loading the saved model with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# apply the whole pipeline to data
pred = pd.Series(pipe.predict(pr[pred_cols]))
print(pred)
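To see why refitting the scaler on the new data is a problem, compare the two approaches on a toy example. The numbers below are made up for illustration; only the StandardScaler calls come from the discussion above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the new samples lie far outside the training range.
X_train = np.array([[0.0], [10.0], [20.0]])
X_new = np.array([[100.0], [110.0], [120.0]])

# Correct: reuse the mean and variance estimated on the training data.
train_scaler = StandardScaler().fit(X_train)
correct = train_scaler.transform(X_new)

# Incorrect: refitting on the new data hides the fact that it is shifted.
wrong = StandardScaler().fit(X_new).transform(X_new)

print("scaled with training statistics:", correct.ravel())
print("scaled with refitted statistics:", wrong.ravel())
```

With the training statistics, the new samples end up many standard deviations above zero, which is what the model should see. With a refitted scaler they are centred on zero and look like ordinary training points. A saved pipeline avoids the issue because the fitted scaler travels with the model.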

Make predictions from a saved trained classifier in Scikit Learn

I wrote a classifier for Tweets in Python and saved it to disk in .pkl format, so that I can run it again and again without training it each time. This is the code:
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
from sklearn.externals import joblib
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters, numbers and lower case the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
y = train['sentiment']
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)
a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())
#Export the trained model to load it in another project
joblib.dump(classifier, 'my_model.pkl', compress=9)
Let's say that I have another Python file and I want to classify a Tweet. How can I proceed to do the classification?
from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')
mytweet = 'Uh wow:#medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
Up to hasher.transform I can replicate the same procedure for the new Tweet, but then I have the problem that I cannot compute the best 20k features. SelectKBest needs both the features and the labels, and since the label is what I want to predict, I cannot use SelectKBest. So how can I get past this issue and continue with the prediction?
I support @EdChum's comment that
you build a model by training it on data which presumably is representative enough for it to cope with unseen data
In practice this means that you need to apply both FeatureHasher and SelectKBest to your new data using transform only, without fitting them again. (It is wrong to fit SelectKBest anew on the new data, because in general it will select different features; FeatureHasher is stateless, so it only has to be configured the same way.)
To do this, either
pickle FeatureHasher and SelectKBest separately
or (better)
make a Pipeline of FeatureHasher, SelectKBest, and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and call predict on new data.
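The pipeline option could look roughly like the sketch below. The eight toy tweets, their 0/1 labels, the reduced n_features and k values, and the temporary file path are assumptions chosen to keep the example small. alternate_sign=False is used so that the hashed counts stay non-negative, which chi2 requires, and tokenization is kept outside the pipeline, so the same tokens function has to be applied at prediction time:

```python
import os
import re
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline


def tokens(doc):
    """Lower-case a string and split it into word tokens."""
    return re.findall(r"\w+", doc.lower())


# Toy training data (made up for this sketch): 1 = positive, 0 = negative.
train_texts = [
    "i love this phone", "what a great day",
    "love the new update", "great service today",
    "i hate this phone", "what a terrible day",
    "hate the new update", "terrible service today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(
    FeatureHasher(n_features=2**10, input_type="string", alternate_sign=False),
    SelectKBest(chi2, k=20),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
# SelectKBest sees the labels only here, at training time.
pipe.fit([tokens(t) for t in train_texts], train_labels)

model_path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
joblib.dump(pipe, model_path, compress=9)

# In the other script: load the pipeline and call predict, no labels needed.
model_clone = joblib.load(model_path)
mytweet = "what a great phone"
prediction = model_clone.predict([tokens(mytweet)])
print(prediction)
```

Because the fitted SelectKBest step is stored inside the pipeline, the feature selection learned during training is reused as-is when predict is called on the new Tweet.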
