How does pickle save and load a dataframe internally - python

I wonder how the "pickle" module saves and loads objects. I saved a DataFrame object to a file on disk,
import pandas as pd
import pickle
df = pd.read_excel(r".\test.xlsx")
with open("o.pkl", "wb") as file:
pickle.dump(df, file)
then I uninstalled pandas and tried to load the DataFrame object from the file, but I get the error "Exception has occurred: ModuleNotFoundError:
No module named 'pandas'":
import pickle
with open("o.pkl", "rb") as file:
e = pickle.load(file)
My question is: does the pickle module somehow use pandas when loading a DataFrame? If so, how is it done?

Pickle by default will go and import the class.
In this case, if you do not have pandas installed when you run the second snippet, it won't work by default (see below for more info on that default behaviour).
Quick primer on pickling
Essentially, everything in Python is an instance of a class, in some shape or form.
When you make a DataFrame, such as when you use pandas.read_excel, you create an instance of the DataFrame class. To create that instance you need:
the class definition (containing information about methods and attributes)
something that creates the instance from some input data
You can create instances of a class normally by directly instantiating the class, or by using another method/function. Example:
# This makes a string, '12345' by directly invoking the str constructor
s = str(12345)
# This makes a list by using the split method of the string
l = s.split('3')
Pickle works just the same. When you unpickle, you need the class definition as well as the function which transforms some input data (your .pkl file) into the instance.
A reference to the class (its module and name) will be available in the pickled data, but none of the class's code, supporting imports, or code outside of the class will be.
This means that even if you override the default behaviour and somehow manage to construct a DataFrame, that DataFrame won't work, because you're missing pandas. When you try to invoke a method on it, Python will try to access code that doesn't live in the original class definition. That code lives in other modules of the pandas package, so it is never captured in the pickle -- your code will become quite unhappy at this point.
Can I override the default behaviour for unpickling?
Yes, you can do this -- you can override the import behaviour by using a custom unpickler. That's described here in the Python doc: restricting globals (Python official doc).
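For illustration only, here is a minimal sketch of such a custom unpickler, following the find_class pattern from that section of the docs (raising instead of importing pandas is just an example of what you could do there):
import pickle

class NoPandasUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # find_class is called for every class/function reference in the pickle.
        # Here you could return a stand-in class instead of importing pandas,
        # or simply refuse to load the object, as shown below.
        if module.split(".")[0] == "pandas":
            raise pickle.UnpicklingError(
                f"refusing to load {module}.{name}: pandas is not installed")
        return super().find_class(module, name)

with open("o.pkl", "rb") as file:
    e = NoPandasUnpickler(file).load()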

I've run into a similar thing before where it needed a specific pandas version, but I didn't investigate. Running across your post here, I read some of the documentation and came across this line:
When a class instance is unpickled, its __init__() method is usually not invoked. The default behaviour first creates an uninitialized instance and then restores the saved attributes.
https://docs.python.org/3.8/library/pickle.html#pickle-inst
So to unpickle an arbitrary class instance, pickle has to be able to access that class in order to create the uninitialized instance and restore its saved attributes. If the class (and hence its module) isn't present, it can't do that.
That same page also says:
Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled.
If I make a pandas DataFrame, I can access df.__class__ which will return pandas.core.frame.DataFrame
Putting this all together on that page, here's what I think happens:
Pickling df saves the instance data, which includes the __class__ attribute
Unpickling goes and looks for this class to access its __setstate__ method
If the module containing this class definition can't be found: error!
Short answer: it saves that information.
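If you want to check this yourself, the standard library's pickletools module can disassemble a pickle; the disassembly should contain opcodes referencing the module path and class name (e.g. pandas.core.frame and DataFrame), but no class code:
import pickletools

with open("o.pkl", "rb") as file:
    data = file.read()

# Look for GLOBAL / STACK_GLOBAL opcodes naming the pandas module path and
# the DataFrame class -- that reference is all pickle stores about the class.
pickletools.dis(data)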

Related

How to properly pickle sklearn pipeline when using custom transformer

I am trying to pickle an sklearn machine-learning model and load it in another project. The model is wrapped in a pipeline that does feature encoding, scaling, etc. The problem starts when I want to use self-written transformers in the pipeline for more advanced tasks.
Let's say I have 2 projects:
train_project: it has the custom transformers in src.feature_extraction.transformers.py
use_project: it has other things in src, or has no src catalog at all
If in "train_project" I save the pipeline with joblib.dump(), and then in "use_project" i load it with joblib.load() it will not find something such as "src.feature_extraction.transformers" and throw exception:
ModuleNotFoundError: No module named 'src.feature_extraction'
I should also add that my intention from the beginning was to simplify usage of the model, so a programmer can load the model like any other model, pass very simple, human-readable features, and all the "magic" preprocessing of features for the actual model (e.g. gradient boosting) happens inside.
I thought of creating a /dependencies/xxx_model/ catalog in the root of both projects, and storing all the needed classes and functions there (copying the code from "train_project" to "use_project"), so the structure of the projects is equal and the transformers can be loaded. I find this solution extremely inelegant, because it would force the structure of any project where the model would be used.
I thought of just recreating the pipeline and all transformers inside "use_project" and somehow loading the fitted values of the transformers from "train_project".
The best possible solution would be if the dumped file contained all the needed info and needed no dependencies, and I am honestly shocked that sklearn.Pipelines seem not to have that possibility - what's the point of fitting a pipeline if I cannot load the fitted object later? Yes, it would work if I used only sklearn classes and did not create custom ones, but the non-custom ones do not have all the needed functionality.
Example code:
train_project
src.feature_extraction.transformers.py
from sklearn.base import TransformerMixin

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
train_project
main.py
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')
test_project
main.py
from sklearn.externals import joblib
pipeline = joblib.load('path.x')
The expected result is the pipeline loaded correctly, with its transform method available to use.
The actual result is an exception when loading the file.
I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:
Create a .py file where the custom transformer is defined and import it into the Jupyter notebook.
This is the file custom_transformer.py
from sklearn.base import TransformerMixin

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
Train your model importing this class from the .py file and save it using joblib.
import joblib
from custom_transformer import FilterOutBigValuesTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'pipeline.pkl')
When loading the .pkl file in a different python script, you will have to import the .py file in order to make it work:
import joblib
from utils import custom_transformer # decided to save it in a utils directory
pipeline = joblib.load('pipeline.pkl')
Apparently this problem arises when you split the definitions and the saving code into two different files. So I have found this workaround that has worked for me.
It consists of these steps:
Say we have your 2 projects/repositories: train_project and use_project.
train_project:
In your train_project, create a Jupyter notebook or .py file.
In that file, define every custom transformer in a class, and import all the other tools needed from sklearn to design the pipelines. Then write the pickling/saving code in that same file. (Don't create an external .py file such as src.feature_extraction.transformers to define your custom transformers.)
Then fit and dump your pipeline by running that file.
In use_project:
Create a customthings.py file with all the functions and transformers defined inside.
Create another file_where_load.py where you wish to load the pickle. Inside, make sure you have imported all the definitions from customthings.py. Ensure that the functions and classes have the same names as the ones you used in train_project.
I hope it works for everyone with the same problem.
I have created a workaround solution. I do not consider it a complete answer to my question, but nonetheless it let me move on from my problem.
Conditions for the workaround to work:
I. Pipeline needs to have only 2 kinds of transformers:
sklearn transformers
custom transformers, but with only attributes of types:
number
string
list
dict
or any combination of those, e.g. a list of dicts with strings and numbers. The generally important thing is that the attributes are JSON-serializable.
II. Names of pipeline steps need to be unique (even if there is pipeline nesting).
In short, the model would be stored as a catalog with joblib-dumped files, a JSON file for custom transformers, and a JSON file with other info about the model.
I have created a function that goes through the steps of a pipeline and checks the __module__ attribute of each transformer.
If it finds sklearn in it, it runs joblib.dump on that transformer under the name specified in steps (the first element of the step tuple), into some selected model catalog.
Otherwise (no sklearn in __module__) it adds the transformer's __dict__ to result_dict under a key equal to the name specified in steps. At the end I json.dump the result_dict to the model catalog under the name result_dict.json.
If there is a need to go into some transformer, because e.g. there is a Pipeline inside a pipeline, you can probably run this function recursively by adding some rules to the beginning of the function, but it becomes important to always have unique step/transformer names, even between the main pipeline and subpipelines.
If there is other information needed for the creation of the model pipeline, then save it in model_info.json.
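A rough sketch of that dump-side walk, under the conditions above (the names dump_pipeline and model_dir are illustrative, not taken from my actual code):
import json
import os

import joblib

def dump_pipeline(pipeline, model_dir):
    """Dump sklearn transformers with joblib and custom transformers as JSON."""
    os.makedirs(model_dir, exist_ok=True)
    result_dict = {}
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            # sklearn transformer: dump under the step name
            joblib.dump(transformer, os.path.join(model_dir, name + '.joblib'))
        else:
            # custom transformer: condition I above says __dict__ is JSON-serializable
            result_dict[name] = transformer.__dict__
    with open(os.path.join(model_dir, 'result_dict.json'), 'w') as f:
        json.dump(result_dict, f)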
Then, if you want to load the model for usage:
You need to create (without fitting) the same pipeline in the target project. If the pipeline creation is somewhat dynamic and you need information from the source project, then load it from model_info.json.
You can copy the function used for serialization and:
replace all joblib.dump statements with joblib.load, assigning the __dict__ from the loaded object to the __dict__ of the object already in the pipeline
replace all places where you added __dict__ to result_dict with assignment of the appropriate value from result_dict to the object's __dict__ (remember to load result_dict from the file beforehand)
After running this modified function, the previously unfitted pipeline should have all the transformer attributes that resulted from fitting loaded, and the pipeline as a whole should be ready to predict.
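Correspondingly, a sketch of the modified load-side function (again with illustrative names, mirroring the dump sketch above):
import json
import os

import joblib

def load_pipeline_state(pipeline, model_dir):
    """Fill an unfitted pipeline, recreated in the target project, with the saved state."""
    with open(os.path.join(model_dir, 'result_dict.json')) as f:
        result_dict = json.load(f)
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            fitted = joblib.load(os.path.join(model_dir, name + '.joblib'))
            transformer.__dict__ = fitted.__dict__
        else:
            transformer.__dict__.update(result_dict[name])
    return pipeline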
The main things I do not like about this solution are that it needs the pipeline code inside the target project, and needs all attributes of custom transformers to be JSON-serializable, but I leave it here for other people who stumble on a similar problem; maybe somebody comes up with something better.
I was similarly surprised when I came across the same problem some time ago. Yet there are multiple ways to address this.
Best practice solution:
As others have mentioned, the best practice solution is to move all dependencies of your pipeline into a separate Python package and define that package as a dependency of your model environment.
The environment then has to be recreated whenever the model is deployed. In simple cases this can be done manually e.g. via virtualenv or Poetry. But model stores and versioning frameworks (MLflow being one example) typically provide a way to define the required Python environment (e.g. via conda.yaml). They often can automatically recreate the environment at deployment time.
Solution by putting code into main:
In fact, class and function declarations can be serialized, but only declarations in __main__ actually get serialized. __main__ is the entry point of the script, the file that is run. So if all the custom code and all of its dependencies are in that file, then custom objects can later be loaded in Python environments that do not include the code. This kind of solves the problem, but who wants to have all that code in __main__? (Note that this property also applies to cloudpickle.)
Solution by "mainifying":
There is one other way, which is to "mainify" the classes or function objects before saving. I came across that same problem some time ago and have written a function that does that. It essentially redefines an existing object's code in __main__. Its application is simple: pass the object to the function, then serialize the object, and voilà, it can be loaded anywhere. Like so:
# ------ In file1.py: ------
class Foo():
    pass

# ------ In file2.py: ------
import dill
from file1 import Foo

foo = Foo()
foo = mainify(foo)

with open('path/file.dill', 'wb') as f:
    dill.dump(foo, f)
I post the function code below. Note that I have tested this with dill, but I think it should work with pickle as well.
Also note that the original idea is not mine, but came from a blog post that I cannot find right now. I will add the reference/acknowledgement when I find it.
Edit: Blog post by Oege Dijk by which my code was inspired.
import inspect
import types

def mainify(obj, warn_if_exist=True):
    ''' If obj is not defined in __main__ then redefine it in __main__. Allows dill
    to serialize custom classes and functions such that they can later be loaded
    without them being declared in the load environment.

    Parameters
    ----------
    obj : Object to mainify (function or class instance)
    warn_if_exist : Bool, default True. Throw exception if a function (or class) of the
                    same name as the mainified function (or same name as the mainified
                    object's __class__) was already defined in __main__. If False,
                    don't throw an exception and instead use what was defined in
                    __main__. See Limitations.

    Limitations
    -----------
    Assumes `obj` is either a function or an instance of a class.
    '''
    if obj.__module__ != '__main__':
        import __main__
        is_func = isinstance(obj, types.FunctionType)

        # Check if an obj with the same name is already defined in __main__ (for funcs)
        # or if a class with the same name as obj's class is already defined in __main__.
        # If so, simply return the func of the same name from __main__ (for funcs)
        # or assign the class of the same name to obj and return the modified obj.
        if is_func:
            on = obj.__name__
            if on in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Function with __name__ `{on}` already defined in __main__')
                return __main__.__dict__[on]
        else:
            ocn = obj.__class__.__name__
            if ocn in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Class with obj.__class__.__name__ `{ocn}` already defined in __main__')
                obj.__class__ = __main__.__dict__[ocn]
                return obj

        # Get source code and compile
        source = inspect.getsource(obj if is_func else obj.__class__)
        compiled = compile(source, '<string>', 'exec')

        # "Declare" in __main__, keeping track of which key of the __main__ dict is the new one
        pre = list(__main__.__dict__.keys())
        exec(compiled, __main__.__dict__)
        post = list(__main__.__dict__.keys())
        new_in_main = list(set(post) - set(pre))[0]

        # For a function, return the mainified version; else assign the new class to obj
        if is_func:
            obj = __main__.__dict__[new_in_main]
        else:
            obj.__class__ = __main__.__dict__[new_in_main]

    return obj
Have you tried using cloudpickle?
https://github.com/cloudpipe/cloudpickle
Based on my research it seems that the best solution is to create a Python package that includes your trained pipeline and all files.
Then you can pip install it in the project where you want to use it and import the pipeline with from <package name> import <pipeline name>.
Credit to Ture Friese for mentioning cloudpickle >=2.0.0, but here's an example for your use case.
import cloudpickle
import src.feature_extraction.transformers as transformers

# register_pickle_by_value expects a module, so register the module
# that defines the custom transformer
cloudpickle.register_pickle_by_value(transformers)

with open('./pipeline.cloudpkl', mode='wb') as file:
    cloudpickle.dump(obj=pipeline, file=file)
register_pickle_by_value() is the key, as it ensures your custom module (src.feature_extraction.transformers) is also included when serializing your primary object (pipeline). However, this is not built for recursive module dependencies, e.g. if FilterOutBigValuesTransformer itself imports from yet another local module.
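For completeness, loading in the other project would then be a plain pickle load; since the transformer module was registered to be pickled by value, the src package should no longer need to exist there (cloudpickle itself and sklearn still do):
import pickle

with open('./pipeline.cloudpkl', mode='rb') as file:
    pipeline = pickle.load(file)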
Adding the location of the transformers .py file to sys.path with sys.path.append may resolve the issue.
import sys
sys.path.append("src/feature_extraction/transformers")

Two Objects Created from the Same Class, isinstance = False

I'm trying to create some unit tests for some code here at work.
The code takes in an object and, based on information within that object, imports a specific module and then creates an instance of it.
The test I am trying to write creates the object and then checks that it is an instance of the class I expect it to import. The issue is that the isinstance check is failing.
Here is what my test looks like.
import importlib
from path.to.imported_api import SomeApi
api = importlib.import_module("path.to.imported_api").create_instance() # create_instance() is a function that returns SomeApi().
assert isinstance(api, SomeApi) # This returns false, but I am not sure why.
The reason for the difference is that, although both refer to the same module on disk, they end up as two different module objects (and therefore two different class objects) when the module is loaded a second time, bypassing the entry in sys.modules. See also the explanation here: https://bugs.python.org/issue40427
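As a rough illustration of how loading the same file twice produces two distinct class objects (the file path below is made up for the example):
import importlib.util

def load_fresh(name, path):
    # Builds a brand-new module object from the file, without reusing the
    # entry that a regular `import` would put in sys.modules.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

m1 = load_fresh("copy_one", "path/to/imported_api.py")
m2 = load_fresh("copy_two", "path/to/imported_api.py")

print(m1.SomeApi is m2.SomeApi)              # False: two separate class objects
print(isinstance(m1.SomeApi(), m2.SomeApi))  # False for the same reason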
A hack might be to compare the class name:
assert api.__class__.__name__ == SomeApi.__name__
There are a few things that could cause that:
First, it could be that the api is just returning something that looks like SomeApi(). It could also be that SomeApi overrides isinstance behaviour (e.g. via __instancecheck__).

Python : How to read pickle dump?

I have a pickle dump which I got from a friend, and he asked me to read it like:
import pickle
f = open('file.pickle', 'rb')
l = pickle.loads(f.read())
But I get an ImportError saying no module named sql.models
Can someone help me understand what is happening ?
You are missing the code required to reconstruct the pickled objects.
Pickles store the location where the class can be imported from, together with the instance attributes. The original module is still required to recreate the objects. From the documentation:
Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised. [4]
Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment:
class Foo:
    attr = 'a class attr'

picklestring = pickle.dumps(Foo)
These restrictions are why picklable functions and classes must be defined in the top level of a module.
In other words, the original data used to create the pickle includes at least one instance of a custom class that originates in a module named sql.models.
Do be careful reading arbitrary pickles, even from friends. A pickle is just a stack language that recreates arbitrary Python structures. You can construct a pickle that spawns a secret back-door server on your computer, with enough determination and skill. The pickle documentation warns you explicitly:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
This has been a problem in the past, even for experienced developers.

lazy load dictionary

I have a dictionary called fsdata at module level (like a global variable).
The content gets read from the file system. It should load its data once, on the first access. Up to now it loads the data when the module is imported; this should be optimized.
If no code accesses fsdata, the content should not be read from the file system (save CPU/IO).
Loading should also happen if you check its boolean value:
if mymodule.fsdata:
... do_something()
Update: Some code already uses mymodule.fsdata. I don't want to change the other places. It should be a variable, not a function. And "mymodule" needs to remain a module, since it is already used in a lot of code.
I think you should use a Future/Promise, like this: https://gist.github.com/2935416
The main point: you create not the object itself, but a 'promise' about the object, which behaves like the object.
You can replace your module with an object that has descriptor semantics:
import sys

class FooModule(object):
    @property
    def bar(self):
        print("get")

sys.modules[__name__] = FooModule()
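Applied to the question, the same module-replacement trick might look roughly like this (a sketch; load_fsdata_from_file stands in for whatever actually reads the data from disk), so existing code that uses mymodule.fsdata keeps working unchanged:
import sys

def load_fsdata_from_file():
    # placeholder for the real file-system read
    return {"example": "data"}

class LazyModule(object):
    _fsdata = None

    @property
    def fsdata(self):
        # read from disk only on the first attribute access, then cache it
        if LazyModule._fsdata is None:
            LazyModule._fsdata = load_fsdata_from_file()
        return LazyModule._fsdata

sys.modules[__name__] = LazyModule()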
Take a look at http://pypi.python.org/pypi/apipkg for a packaged approach.
You could just create a simple function that memoizes the data:
fsdata = []

def get_fsdata():
    if not fsdata:
        fsdata.append(load_fsdata_from_file())
    return fsdata[0]
(I'm using a list as that's an easy way to make a variable global without mucking around with the global keyword).
Now instead of referring to module.fsdata you can just call module.get_fsdata().
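A variant of the same memoization idea, if you'd rather avoid the list trick, is functools.lru_cache (a sketch, assuming the same load_fsdata_from_file loader as above):
import functools

@functools.lru_cache(maxsize=None)
def get_fsdata():
    # lru_cache guarantees load_fsdata_from_file() runs at most once
    return load_fsdata_from_file()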

Gracefully-degrading pickling in Python

(You may read this question for some background)
I would like to have a gracefully-degrading way to pickle objects in Python.
When pickling an object, let's call it the main object, sometimes the Pickler raises an exception because it can't pickle a certain sub-object of the main object. For example, an error I've been getting a lot is "can’t pickle module objects." That is because I am referencing a module from the main object.
I know I can write up a little something to replace that module with a facade that would contain the module's attributes, but that would have its own issues(1).
So what I would like is a pickling function that automatically replaces modules (and any other hard-to-pickle objects) with facades that contain their attributes. That may not produce a perfect pickling, but in many cases it would be sufficient.
Is there anything like this? Does anyone have an idea how to approach this?
(1) One issue would be that the module may be referencing other modules from within it.
You can decide and implement how any previously-unpicklable type gets pickled and unpickled: see standard library module copy_reg (renamed to copyreg in Python 3.*).
Essentially, you need to provide a function which, given an instance of the type, reduces it to a tuple -- with the same protocol as the __reduce__ special method (except that __reduce__ takes no arguments, since when provided it's called directly on the object, while the function you provide will take the object as its only argument).
Typically, the tuple you return has 2 items: a callable, and a tuple of arguments to pass to it. The callable must be registered as a "safe constructor" or equivalently have an attribute __safe_for_unpickling__ with a true value. Those items will be pickled, and at unpickling time the callable will be called with the given arguments and must return the unpicked object.
For example, suppose that you want to just pickle modules by name, so that unpickling them just means re-importing them (i.e. suppose for simplicity that you don't care about dynamically modified modules, nested packages, etc, just plain top-level modules). Then:
>>> import sys, pickle, copy_reg
>>> def savemodule(module):
... return __import__, (module.__name__,)
...
>>> copy_reg.pickle(type(sys), savemodule)
>>> s = pickle.dumps(sys)
>>> s
"c__builtin__\n__import__\np0\n(S'sys'\np1\ntp2\nRp3\n."
>>> z = pickle.loads(s)
>>> z
<module 'sys' (built-in)>
I'm using the old-fashioned ASCII form of pickle so that s, the string containing the pickle, is easy to examine: it instructs unpickling to call the built-in import function, with the string sys as its sole argument. And z shows that this does indeed give us back the built-in sys module as the result of the unpickling, as desired.
Now, you'll have to make things a bit more complex than just __import__ (you'll have to deal with saving and restoring dynamic changes, navigate a nested namespace, etc), and thus you'll have to also call copy_reg.constructor (passing as argument your own function that performs this work) before you copy_reg the module-saving function that returns your other function (and, if in a separate run, also before you unpickle those pickles you made using said function). But I hope this simple case helps to show that there's really nothing much to it that's at all "intrinsically" complicated!-)
How about the following, which is a wrapper you can use to wrap some modules (maybe any module) in something that's pickle-able. You could then subclass the Pickler object to check if the target object is a module, and if so, wrap it. Does this accomplish what you desire?
import pickle

class PickleableModuleWrapper(object):
    def __init__(self, module):
        # make a copy of the module's namespace in this instance
        self.__dict__ = dict(module.__dict__)
        # remove anything that's going to give us trouble during pickling
        self.remove_unpickleable_attributes()

    def remove_unpickleable_attributes(self):
        # iterate over a copy of the items so entries can be deleted safely
        for name, value in list(self.__dict__.items()):
            try:
                pickle.dumps(value)
            except Exception:
                del self.__dict__[name]

p = pickle.dumps(PickleableModuleWrapper(pickle))
wrapped_mod = pickle.loads(p)
Hmmm, something like this?
import sys

attribList = dir(someobject)
for attrib in attribList:
    if type(getattr(someobject, attrib)) == type(sys):  # is a module
        # put in a facade: either recursively list the module and do the same thing,
        # or just put in something like str('modulename_module')
        pass
    else:
        # proceed with normal pickle
        pass
Obviously, this would go into an extension of the pickle class with a reimplemented dump method...
