StratifiedKFold: shuffle and random_state - python

I've done the following for cross-validation purposes:
from sklearn.cross_validation import StratifiedKFold
n_folds = 5
SKFolds = list(StratifiedKFold(ytrain, n_folds, shuffle=True))
I'm just thinking about one detail: I'd like to get the same final results if somebody (my teacher, for instance!) runs the code again. However, I forgot to specify the random_state parameter, and unfortunately I can't start over from the beginning because my models take a very long time to fit and the run is nearly finished.
My question is the following: is it possible to find out which random_state led to my SKFolds? (My notebook is still open, so maybe the information can be found somewhere?) Or can I do something like save my SKFolds to a CSV file and then load it when I restart my notebook, so I'm sure to get the same split on my folds?
Thanks for your help!

You can save the SKFolds object with pickle; then you just have to load it and use it as is.
import cPickle as pickle

# To save the object
pickle.dump(SKFolds, open("skfolds.p", "wb"))

# To load the object
SKFolds = pickle.load(open("skfolds.p", "rb"))
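Alternatively, along the lines of the CSV idea in the question, here is a hedged sketch (the folds.npz filename is just an example) that stores only the fold index arrays with NumPy and rebuilds the same list of (train, test) splits later:

import numpy as np

# Collect the train/test index arrays of every fold (filename is arbitrary)
arrays = {}
for i, (train_idx, test_idx) in enumerate(SKFolds):
    arrays["train_%d" % i] = train_idx
    arrays["test_%d" % i] = test_idx
np.savez("folds.npz", **arrays)

# Later: rebuild exactly the same splits
data = np.load("folds.npz")
n_folds = len(data.files) // 2
SKFolds = [(data["train_%d" % i], data["test_%d" % i]) for i in range(n_folds)]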

Related

Problem loading ML model saved using joblib/pickle

I saved a model from a Jupyter notebook (.ipynb) to .pkl format using joblib.
My ML model is built using pandas, numpy and the statsmodels python library.
I saved the fitted model to a variable called fitted_model and here is how I used joblib:
from sklearn.externals import joblib
# Save RL_Model to file in the current working directory
joblib_file = "joblib_RL_Model.pkl"
joblib.dump(fitted_model, joblib_file)
I get this as output:
['joblib_RL_Model.pkl']
But when I try to load from file, in a new jupyter notebook, using:
# Load from file
joblib_file = "joblib_RL_Model.pkl"
joblib_LR_model = joblib.load(joblib_file)
joblib_LR_model
I only get this back:
<statsmodels.tsa.holtwinters.HoltWintersResultsWrapper at 0xa1a8a0ba8>
and no model. I was expecting to see the model load there and to see the graph outputs as in the original notebook.
Use with open; it is better because it automatically opens and closes the file, and with the proper mode.
import pickle

with open('joblib_RL_Model.pkl', 'wb') as f:
    pickle.dump(fitted_model, f)

with open('joblib_RL_Model.pkl', 'rb') as f:
    joblib_LR_model = pickle.load(f)
And my implementation in Colab is here. Check it.
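For what it's worth, the repr shown in the question suggests the load actually worked: a HoltWintersResultsWrapper is the fitted results object. Plots and printed output are not stored inside the pickle, so they have to be recreated by calling the results' methods again. A hedged sketch using standard statsmodels results attributes (the forecast horizon of 12 is arbitrary):

# The loaded object is the fitted results wrapper; call it to see output again
print(joblib_LR_model.params)          # fitted smoothing parameters
print(joblib_LR_model.forecast(12))    # forecast the next 12 periods

# Graphs must be redrawn; e.g., if the model was fitted on a pandas Series:
import matplotlib.pyplot as plt
joblib_LR_model.fittedvalues.plot()
plt.show()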
You can use pickle, Python's default serialization package, to save models.
You can use the following function to save an ML model:
import pickle
def save_model(model):
    pickle.dump(model, open("model.pkl", "wb"))
A more general template for the function would be:
import pickle
def save_model(model):
    pickle.dump(model, open(PATH_AND_FILE_NAME_TO_BE_SAVED, "wb"))
To load a model that was saved with pickle, you can use the following function:
def load_model(path):
    return pickle.load(open(path, 'rb'))
where path is the path to the file the model was saved to.
Note:
This only works for basic ML models and PyTorch models; it does not work for TensorFlow-based models, where you need to use
model.save(PATH_TO_MODEL_AND_NAME)
where model is an instance of tensorflow.keras.Model.
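As a quick usage sketch of the first save_model version above (clf is a hypothetical, already-fitted estimator):

# `clf` is a hypothetical, already-fitted model object
save_model(clf)                       # writes model.pkl to the working directory
restored = load_model("model.pkl")    # returns an equivalent model object
print(restored)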

why does pickle file of fbprophet model need so much memory on hard drive?

I created a simple fbprophet model with the airpassengers data:
import pandas as pd
import pickle
from fbprophet import Prophet
import sys
df = pd.read_csv("airline-passengers.csv")
# preprocess columns as fbprophet expects it
df.rename(columns={"Month": "ds", "Passengers": "y"}, inplace=True)
df["ds"] = pd.to_datetime(df["ds"])
m = Prophet()
m.fit(df)
However, when I save the object m:
with open("p_model", "wb") as f:
pickle.dump(m, f)
it needs >1 MB of space on my hard drive. The object m itself seems to be rather small, as sys.getsizeof(m) returns 56.
Why is the pickle file so large? Is there a suitable alternative for saving the object for later reuse? Thanks in advance.
I think that it pickles training data also, so try not to save model.history and it should be fine.
Here is nice discussion: https://github.com/facebook/prophet/issues/1159
Thanks to the link from #Kohelet, I found the solution; it was the stan_backend attribute:
m.stan_backend = None
This reduced the filesize on hard drive to around 18 KB.
I am still wondering why this is not visible when invoking sys.getsizeof(m).
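The likely explanation, illustrated with a hypothetical toy class (not part of fbprophet): sys.getsizeof reports only the shallow size of the instance itself, not of anything it references, while pickle serializes the whole object graph:

import pickle
import sys

class Holder:
    """Hypothetical container holding a large payload via an attribute."""
    def __init__(self, payload):
        self.payload = payload

h = Holder(list(range(1_000_000)))

print(sys.getsizeof(h))        # small: only the instance header, not its attributes
print(len(pickle.dumps(h)))    # large: the pickle contains the whole payload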

How to use pickle to save sklearn model

I want to dump and load my trained sklearn model using pickle. How do I do that?
Save:
import pickle
with open("model.pkl", "wb") as f:
pickle.dump(model, f)
Load:
with open("model.pkl", "rb") as f:
model = pickle.load(f)
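For completeness, a minimal end-to-end sketch; the LogisticRegression and the toy data are illustrative assumptions, not part of the question:

import pickle
from sklearn.linear_model import LogisticRegression

# Illustrative toy data; any fitted sklearn estimator round-trips the same way
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[1.5]]))  # same predictions as the original model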
In the specific case of scikit-learn, it may be better to use joblib's replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators:
Save:
import joblib
joblib.dump(model, "model.joblib")
Load:
model = joblib.load("model.joblib")
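If file size matters, joblib.dump also accepts a compress argument; a short example (compression level 3 is an arbitrary choice):

import joblib

# Trade a little CPU time for a smaller file on disk
joblib.dump(model, "model.joblib", compress=3)
model = joblib.load("model.joblib")  # loading is unchanged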
Using pickle is the same across all machine learning models, irrespective of type (clustering, regression, etc.).
To save your model, dump is used, where 'wb' means write binary.
pickle.dump(model, open(filename, 'wb')) #Saving the model
To load the saved model wherever it is needed, load is used, where 'rb' means read binary.
model = pickle.load(open(filename, 'rb')) #To load saved model from local directory
Here model is a KMeans model and filename is any local file, so adjust accordingly.
One can also use joblib:
from joblib import dump, load
dump(model, model_save_path)

Predicting from SciKitLearn RandomForestClassification with Categorical Data

I created a RandomForestClassification model using SkLearn using 10 different text features and a training set of 10000. Then, I pickled the model (76mb) in hopes of using it for prediction.
However, in order to produce the Random Forest, I used the LabelEncoder and OneHotEncoder for best results on the categorical/string data.
Now, I'd like to pull up the pickled model and get a classification prediction on 1 instance. However, I'm not sure how to encode the text on the 1 instance without loading the entire training & test dataset CSV
again and going through the entire encoding process.
It seems quite laborious to load the csv files every time. I'd like this to run 1000x per hour so it doesn't seem right to me.
Is there a way to quickly encode 1 row of data given the pickle or other variable/setting? Does encoding always require ALL the data?
If loading all the training data is required to encode a single row, would it be advantageous to encode the text data myself in a database, where each feature is assigned to a table, auto-incremented with a numeric id, and with a UNIQUE key on the text/categorical field, and then pass this id to the RandomForestClassifier? Obviously I would need to refit and pickle this new model, but then I would know exactly the (encoded) numeric representation of a new row and could simply request a prediction on those values.
It's highly likely that I'm missing a feature or misunderstanding SkLearn or Python, I only started both a 3 days ago. Please excuse my naivety.
Using pickle, you should save your LabelEncoder and OneHotEncoder. You can then load them each time and easily transform new instances. For example:
import cPickle as pickle
from sklearn.externals import joblib
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train_x = [0,1,2,6,'true','false']
le.fit_transform(train_x)
# Save your encoding
joblib.dump(le, '/path/to/save/model')
# OR
pickle.dump(le, open('/path/to/model', "wb"))

# Load those encodings
le = joblib.load('/path/to/save/model')
# OR
le = pickle.load(open('/path/to/model', "rb"))
# Then use as normal
new_x = [0,0,0,2,2,2,'false']
le.transform(new_x)
# array([0, 0, 0, 1, 1, 1, 3])
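A related option, shown here only as a hedged sketch for recent scikit-learn versions (the column names and toy data are invented): wrap the encoding and the classifier in a single Pipeline and pickle that one object, so a single new row is encoded and predicted in one call without reloading the training CSV:

import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented example data: two categorical text features and a label
train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "L", "M"],
    "label": [0, 1, 0, 1],
})

# One object that owns both the encoding and the forest
pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color", "size"])]
    )),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(train[["color", "size"]], train["label"])

# Pickle the whole pipeline once...
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)

# ...then a single new row is encoded and predicted without the training data
with open("pipeline.pkl", "rb") as f:
    pipe = pickle.load(f)
new_row = pd.DataFrame({"color": ["red"], "size": ["M"]})
print(pipe.predict(new_row))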

Can I call vectorizer.fit_transform multiple times to update the vectorizer

I'm training a Multinomial Naive Bayes classifier on a large dataset separated over multiple files. I would like to update the CountVectorizer with all my data, but only read one file into memory at a time.
My current code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

raw_documents = []
for f in files:
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    raw_documents.extend(list(text[:, 1]))
vectorizer = CountVectorizer(stop_words=None)
train_features = vectorizer.fit_transform(raw_documents)
Is it possible to partially call fit_transform, such that I can do:
vectorizer = CountVectorizer(stop_words=None)
for f in files:
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    train_features = vectorizer.fit_transform(text[:, 1])
Relevant documentation can be found here, but I haven't managed to fully understand it.
Thanks in advance!
The problem is that the CountVectorizer needs to know in advance what all the words in your corpus are, so that it has a way of mapping words to integers. (It would be nice if you could do a "partial fit" where, if it encounters new words, it adds them onto the end, but as far as I know this is not currently supported.)
An alternative would be to use HashingVectorizer; this doesn't need to be fit, as it just runs each word through a fixed hashing function to get its integer encoding.
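A hedged sketch of that idea, combined with MultinomialNB's partial_fit so only one file is in memory at a time; the label column, the classes array, and alternate_sign=False (which keeps the hashed counts non-negative, as MultinomialNB requires) are assumptions beyond the question:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps features non-negative, which MultinomialNB needs
vectorizer = HashingVectorizer(stop_words=None, alternate_sign=False)
clf = MultinomialNB()
classes = np.array([0, 1])  # assumed label set; adjust to your data

for f in files:  # `files` and the column layout are taken from the question
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    X = vectorizer.transform(text[:, 1])    # no fit needed: stateless hashing
    y = text[:, 0].astype(int)              # assumed: labels in the first column
    clf.partial_fit(X, y, classes=classes)  # incremental update, one file at a time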
