pickle/joblib AttributeError: module '__main__' has no attribute 'thing' in pytest - python

I have built a custom sklearn pipeline, as follows:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    SelectColumnsTransfomer(features_to_use),
    ToDummiesTransformer('feature_0', prefix='feat_0', drop_first=True, dtype=bool),  # Dummify customer_type
    ToDummiesTransformer('feature_1', prefix='feat_1'),  # Dummify the feature
    ToDummiesTransformer('feature_2', prefix='feat_2'),  # Dummify
    ToDummiesTransformer('feature_3', prefix='feat_3'),  # Dummify
)
pipeline.fit(df)
The classes SelectColumnsTransfomer and ToDummiesTransformer are custom sklearn steps implementing BaseEstimator and TransformerMixin.
To serialise this object I use
from sklearn.externals import joblib
joblib.dump(pipeline, 'data_pipeline.joblib')
but when I deserialise it with
pipeline = joblib.load('data_pipeline.joblib')
I get AttributeError: module '__main__' has no attribute 'SelectColumnsTransfomer'.
I have read other similar questions and followed the instructions in this blog post, but couldn't solve the issue.
I am copying and pasting the classes and importing them in the code. If I create a simplified version of this exercise, the whole thing works. The problem occurs because I am running some tests with pytest, and when I run pytest it seems it doesn't see my custom classes; in fact, the error also contains
self = <sklearn.externals.joblib.numpy_pickle.NumpyUnpickler object at 0x7f821508a588>, module = '__main__', name = 'SelectColumnsTransfomer'
which hints that the NumpyUnpickler doesn't see SelectColumnsTransfomer, even though it is imported in the test.
My test code:
import pytest
from sklearn.externals import joblib
from app.pipeline import *  # the pipeline objects:
# SelectColumnsTransfomer and ToDummiesTransformer
# are here!

@pytest.fixture(scope="module")
def clf():
    pipeline = joblib.load("persistence/data_pipeline.joblib")
    return pipeline

def test_fake(clf):
    assert True
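A common way out of this (a minimal sketch, not taken from the answers below; the script name build_pipeline.py is an assumption) is to keep the transformer classes in an importable module such as app/pipeline.py and to dump the pipeline from a script that imports them, rather than from a __main__ session where they are defined inline. Pickle stores only the import path of each class, so the path recorded at dump time must also be importable at load time:

# build_pipeline.py -- run as a script; the classes stay in app/pipeline.py
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib  # on newer scikit-learn: import joblib

from app.pipeline import SelectColumnsTransfomer, ToDummiesTransformer

# features_to_use and df are defined as in the question
pipeline = make_pipeline(
    SelectColumnsTransfomer(features_to_use),
    ToDummiesTransformer('feature_0', prefix='feat_0', drop_first=True, dtype=bool),
)
pipeline.fit(df)

# The pickle now records 'app.pipeline.SelectColumnsTransfomer' instead of
# '__main__.SelectColumnsTransfomer', so pytest can resolve it on load.
joblib.dump(pipeline, 'persistence/data_pipeline.joblib')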

I had the same error message when I was trying to save a Pytorch class like this:
import torch.nn as nn

class custom(nn.Module):
    def __init__(self):
        super(custom, self).__init__()
        print("Class loaded")

model = custom()
And then using Joblib to dump this model like so:
from joblib import dump
dump(model, 'some_filepath.jobjib')
The issue was that I was running the code above in a Kaggle kernel, then downloading the dumped file and trying to load it locally with this script:
from joblib import load
model = load('some_filepath.jobjib')
The way I fixed the issue was to run all of these code snippets locally on my computer, instead of creating the class and dumping it on Kaggle but loading it on my local machine. I wanted to add this here because the comments on the answer by @DarioB confused me: they refer to a 'function', which didn't apply in my simpler case.
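A related workaround for PyTorch models (a minimal sketch, not part of the original answer; the layer and file name are illustrative) is to save only the state_dict rather than pickling the whole module object, and to rebuild the model from the class on the loading side:

import torch
import torch.nn as nn

class custom(nn.Module):
    def __init__(self):
        super(custom, self).__init__()
        self.linear = nn.Linear(4, 2)  # illustrative layer

# Save only the weights, not a pickled instance of the class
model = custom()
torch.save(model.state_dict(), 'custom_weights.pt')

# On the loading side, recreate the model from the (imported or redefined)
# class and load the weights into it
model = custom()
model.load_state_dict(torch.load('custom_weights.pt'))

This avoids the unpickling step that has to resolve the class by its original module path.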

I had a similar issue with sklearn and complex pipelines.
I used cloudpickle 2.0.0 on Python 3.10 (instead of pickle or joblib) to dump the model, and then loaded it with joblib without error.
Hope it helps.
Note: the model was dumped from a Jupyter notebook and loaded inside a Python script.
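For reference, a minimal sketch of that approach (the file name is illustrative). cloudpickle serialises classes and functions defined in __main__ by value rather than by reference, which is why the load then works in a separate process:

import cloudpickle
import joblib

# In the notebook: dump with cloudpickle
with open('data_pipeline.pkl', 'wb') as f:
    cloudpickle.dump(pipeline, f)

# In the script: load with joblib (or plain pickle)
pipeline = joblib.load('data_pipeline.pkl')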

Related

HDBSCAN: clustering, persistence and approximate_predict()

I want to cache my model results in order to make predictions without redoing the clustering.
I read that I can do that with the memory parameter in HDBSCAN (see HDBSCAN cluster caching and persistence). I did the following instead, because I wanted to save the file in the same directory as my script rather than in the '/tmp/joblib' used there (a sketch of the memory-based approach follows the code below):
import hdbscan
import joblib

clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True).fit(data)
# save the model to disk
filename = 'finalized_model.joblib'
joblib.dump(clusterer, filename)
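For comparison, the caching route mentioned above passes a cache directory via the memory parameter (a minimal sketch; the directory name './hdbscan_cache' is an assumption, and hdbscan accepts either a path string or a joblib.Memory instance here):

import hdbscan

# Cache the expensive part of the computation in a local directory
# instead of the default '/tmp/joblib'
clusterer = hdbscan.HDBSCAN(min_cluster_size=30,
                            prediction_data=True,
                            memory='./hdbscan_cache').fit(data)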
I then tried to load the model in a different file:
from joblib import load
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = model.approximate_predict(model, test_points)
But I got this error: AttributeError: 'HDBSCAN' object has no attribute 'approximate_predict'
Last time I got this error, it was because prediction_data was not set to True, but what's the problem now?
approximate_predict() lives in the hdbscan package itself, not on the HDBSCAN object.
Here's what you need to do:
from joblib import load
import hdbscan
# load the model
model = load('finalized_model.joblib')
# make predictions
test_labels, strengths = hdbscan.approximate_predict(model, test_points)
API Reference:
https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict

Unable to restore a layer of class TextVectorization - Text Classification

System information
Google Colab
When I run the example provided by the official TensorFlow basic text classification tutorial, everything runs fine up to the point where the model is saved, but when I load the model it gives me this error:
RuntimeError: Unable to restore a layer of class TextVectorization. Layers of class TextVectorization require that the class be provided to the model loading code, either by registering the class using @keras.utils.register_keras_serializable on the class def and including that file in your program, or by passing the class in a keras.utils.CustomObjectScope that wraps this load call.
Expected Behavior: Model should be loaded successfully and process the raw input
https://colab.research.google.com/gist/amahendrakar/8b65a688dc87ce9ca07ffb0ce50b84c7/44199.ipynb#scrollTo=fEjmSrKIqiiM
Example Link: https://tensorflow.google.cn/tutorials/keras/text_classification
I also ran into this error message (RuntimeError: Unable to restore a layer of class TextVectorization. [...]) when I implemented (and customized) the code from the "Basic Text Classification" tutorial.
Instead of running the code in a notebook, I have two scripts, one for building, training and saving the model and the other one for loading it and making predictions. (Thus, the error does not seem to be limited to Google Colab).
This is what I had to do (see https://github.com/tensorflow/tensorflow/issues/45231):
First, I added this line in the first script before the function definition and built, trained and saved the model again:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_data):
    [...]

# Save model as SavedModel
export_model.save(model_path, save_format='tf')
Secondly, I also had to add the same line and the whole function definition in the second script to make sure that it works if I restart(!) ipython (where I currently run the scripts) and only run the second script:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')

[...]

# Load model
reloaded_model = tf.keras.models.load_model(model_path)

# Make predictions
predictions = reloaded_model.predict(examples)
Note: If I run the second script without restarting ipython after running the first script, I get this error:
ValueError: Custom>custom_standardization has already been registered [...]
Alternatively, you can just use the default standardization method in the vectorizer layer when building the model:
vectorize_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
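Another option, which the error message itself mentions, is to wrap the load call in a custom object scope instead of registering the function (a minimal sketch, assuming custom_standardization is defined or imported in the loading script):

import tensorflow as tf

# custom_standardization must be defined (or imported) in this script
with tf.keras.utils.custom_object_scope(
        {'custom_standardization': custom_standardization}):
    reloaded_model = tf.keras.models.load_model(model_path)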
I got something working as Hassan describes it, I think. Not sure it's the right way, but it seems to work for me...
I define, train, and archive the model in one notebook
I un-archive it, load it, and use it for predictions from another notebook.
See here: https://github.com/OlivierLD/oliv-ai/tree/master/JupyterNotebooks/tf-tutorials/sentiment-analysis

'NearMiss' object has no attribute '_validate_data'

This is the code that produces the error:
from imblearn.under_sampling import NearMiss

nm = NearMiss()
X_res, y_res = nm.fit_sample(X, Y)
You are probably trying to undersample your imbalanced dataset. For this purpose, you can use RandomUnderSampler instead of NearMiss.
Try the following code:
from imblearn.under_sampling import RandomUnderSampler
under_sampler = RandomUnderSampler()
X_res, y_res = under_sampler.fit_resample(X, y)
Now, your dataset is balanced. You can verify it using y_res.value_counts().
Cheers!
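If you want to keep NearMiss rather than switch samplers, note that recent imbalanced-learn releases expose fit_resample instead of the old fit_sample, and this AttributeError often points to a version mismatch between imbalanced-learn and scikit-learn. A minimal sketch, assuming both libraries are up to date:

from imblearn.under_sampling import NearMiss

nm = NearMiss()
# fit_resample replaced fit_sample in newer imbalanced-learn versions
X_res, y_res = nm.fit_resample(X, Y)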
Instead of "imblearn" package my conda installed a package named "imbalanced-learn" that's why it does not take the data. But it is strange that the jupyter notebook doesn't tell me that "imblearn" isn't installed.

Keras model import name is not defined

I'm not sure why Meso4 isn't defined.
Taken from here
https://github.com/DariusAf/MesoNet/blob/master/example.py
Code:
from classifiers import *
from pipeline import *
from keras.preprocessing.image import ImageDataGenerator
classifier = Meso4()
classifier.load('Meso4_DF')
gives error:
classifier = Meso4()
NameError: name 'Meso4' is not defined
The reason for this is that Meso4 is defined in classifiers.py, as you can see here.
Strictly speaking, your problem would be solved by also downloading the classifiers.py file and putting it in the same directory as your example.py file.
However, you should, in general, refrain from copy-pasting code from GitHub unless you know what you are doing, and if you need to wonder if you do, you don't.
Therefore, I recommend actually cloning the repo and working from the local copy.
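A minimal sketch of the layout that makes the import work (file names as listed in the linked repository; the weights path is taken from the question):

# Expected layout after cloning the repository:
#
#   MesoNet/
#     example.py      <- the script shown in the question
#     classifiers.py  <- defines Meso4
#     pipeline.py
#
# With classifiers.py next to example.py, the import resolves:
from classifiers import Meso4

classifier = Meso4()
classifier.load('Meso4_DF')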

Name 'RandomUnderSampler' is not defined

I'm trying to use RandomUnderSampler. I have correctly installed the imblearn module, but I still get the error NameError: name 'RandomUnderSampler' is not defined. Is there any specific reason for this? Can someone please help?
from imblearn.under_sampling import RandomUnderSampler

# Random under-sampling and over-sampling with imbalanced-learn
def random_under_sampling(X, Y):
    rus = RandomUnderSampler(return_indices=True)
    X_rus, y_rus, id_rus = rus.fit_sample(X, Y)
    print('Removed indexes:', id_rus)
    plot_2d_space(X_rus, y_rus, 'Random under-sampling')
(Screenshots showing the method definition and the place where the method is called were attached to the question.)
Since it seems that you are using IPython, it is important that you first execute the line importing the imblearn library (e.g. with Ctrl-Enter):
from imblearn.under_sampling import RandomUnderSampler
After that, the module should be imported and the name of the function will be defined.
If this does not work, could you reload the notebook and execute all the statements up to the random_under_sampling function to ensure nothing was missed?
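Once the import runs, a version-proof variant of the function may also help (a minimal sketch; newer imbalanced-learn releases dropped return_indices and fit_sample in favour of the sample_indices_ attribute and fit_resample, and plot_2d_space is the helper from the question):

from imblearn.under_sampling import RandomUnderSampler

def random_under_sampling(X, Y):
    rus = RandomUnderSampler()
    X_rus, y_rus = rus.fit_resample(X, Y)
    print('Selected (kept) indexes:', rus.sample_indices_)
    plot_2d_space(X_rus, y_rus, 'Random under-sampling')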
