In Python, what is a method_descriptor (in plain English)?
I had this error, and I can't really find any information on it:
*** TypeError: can't pickle method_descriptor objects
Switch to dill.
I am not interested in debugging this error...
You should be. If you're uninterested in debugging errors, you're in the wrong field. For the sake of polite argumentation, let's charitably assume you authored that comment under the duress of an unreasonable deadline. (It happens.)
The standard pickle module is incapable of serializing so-called "exotic types," including but presumably not limited to: functions with yields, nested functions, lambdas, cells, methods, unbound methods, modules, ranges, slices, code objects, methodwrapper objects, dictproxy objects, getsetdescriptor objects, memberdescriptor objects, wrapperdescriptor objects, notimplemented objects, ellipsis objects, quit objects, and (...wait for it!) method_descriptor objects.
All is not lost, however. The third-party dill package is capable of serializing all of these types and substantially more. Since dill is a drop-in replacement for pickle, globally replacing all calls across your codebase to the pickle.dump() function with the equivalent dill.dump() function should suffice to pickle the problematic method descriptors in question.
I just want to know what a method_descriptor is, in plain English.
No, you don't. There is no plain-English explanation of method descriptors, because the descriptor protocol underlying method descriptors is deliciously dark voodoo.
It's voodoo, because it has to be; it's the fundamental basis for Python's core implementation of functions, properties, static methods, and class methods. It's dark, because only a dwindling cabal of secretive Pythonistas are actually capable of correctly implementing a descriptor in the wild. It's delicious, because the power that data descriptors in particular provide is nonpareil in the Python ecosystem.
Fortunately, you don't need to know what method descriptors are to pickle them. You only need to switch to dill.
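method_descriptor is the type of methods defined on built-in (C-implemented) classes, for example str.lower.
It is a descriptor: the descriptor protocol consists of the __get__, __set__ and __delete__ methods, of which a method_descriptor implements only __get__.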
You can check the link for more info at
Static vs instance methods of str in Python
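To make this concrete, here is a minimal sketch that inspects one method descriptor. It uses only built-ins; str.lower is chosen purely as an example of a method implemented in C, and the variable names are illustrative.

```python
# str.lower is implemented in C, so looking it up on the class yields a
# method_descriptor rather than a plain Python function.
descriptor = str.lower
type_name = type(descriptor).__name__
print(type_name)

# Its type defines __get__, which is what makes it a descriptor ...
has_get = hasattr(type(descriptor), "__get__")
# ... and it defines no __set__, so it is a non-data descriptor.
has_set = hasattr(type(descriptor), "__set__")

# __get__ binds the descriptor to an instance, producing a bound method.
bound = descriptor.__get__("HELLO", str)
result = bound()
print(result)
```

This is the same binding step that happens implicitly when you write "HELLO".lower().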
The pickle module documentation says right at the beginning:
Warning:
The pickle module is not intended to be secure against erroneous or
maliciously constructed data. Never unpickle data received from an
untrusted or unauthenticated source.
However, further down under restricting globals it seems to describe a way to make unpickling data safe using a whitelist of allowed objects.
Does this mean that I can safely unpickle untrusted data if I use a RestrictedUnpickler that allows only some "elementary" types, or are there additional security issues that are not addressed by this method? If there are, is there another way to make unpickling safe (obviously at the cost of not being able to unpickle every stream)?
With "elementary types" I mean precisely the following:
bool
str, bytes, bytearray
int, float, complex
tuple, list, dict, set and frozenset
In this answer we're going to explore what exactly the pickle protocol allows an attacker to do. This means we're only going to rely on documented features of the protocol, not implementation details (with a few exceptions). In other words, we'll assume that the source code of the pickle module is correct and bug-free and allows us to do exactly what the documentation says and nothing more.
What does the pickle protocol allow an attacker to do?
Pickle allows classes to customize how their instances are pickled. During the unpickling process, we can:
Call (almost) any class's __setstate__ method (as long as we manage to unpickle an instance of that class).
Invoke arbitrary callables with arbitrary arguments, thanks to the __reduce__ method (as long as we can gain access to the callable somehow).
Invoke (almost) any unpickled object's append, extend and __setitem__ methods, once again thanks to __reduce__.
Access any attribute that Unpickler.find_class allows us to.
Freely create instances of the following types: str, bytes, list, tuple, dict, int, float, bool. This is not documented, but these types are built into the protocol itself and don't go through Unpickler.find_class.
The most useful (from an attacker's perspective) feature here is the ability to invoke callables. If they can access exec or eval, they can make us execute arbitrary code. If they can access os.system or subprocess.Popen they can run arbitrary shell commands. Of course, we can deny them access to these with Unpickler.find_class. But how exactly should we implement our find_class method? Which functions and classes are safe, and which are dangerous?
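As a harmless illustration of the callable-invocation point, the sketch below builds a pickle whose __reduce__ asks the unpickler to call eval. The class name Payload and the arithmetic expression are made up for the example; a real attacker would substitute something far less friendly.

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form is 'call eval with this string'."""

    def __reduce__(self):
        # (callable, args): the unpickler will execute eval("6 * 7").
        return eval, ("6 * 7",)


data = pickle.dumps(Payload())

# Nothing named Payload is reconstructed; the unpickler simply performs the call
# and returns whatever the callable returned.
result = pickle.loads(data)
print(result)
```

Note that the receiving side never needs the Payload class: the stream only refers to builtins.eval and a string argument.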
An attacker's toolbox
Here I'll try to explain some methods an attacker can use to do evil things. Giving an attacker access to any of these functions/classes means you're in danger.
Arbitrary code execution during unpickling:
exec and eval (duh)
os.system, os.popen, subprocess.Popen and all other subprocess functions
types.FunctionType, which allows creating a function from a code object (which in turn can be created with compile or types.CodeType)
typing.get_type_hints. Yes, you read that right. How, you ask? Well, typing.get_type_hints evaluates forward references. So all you need is an object with __annotations__ like {'x': 'os.system("rm -rf /")'} and get_type_hints will run the code for you.
functools.singledispatch. I see you shaking your head in disbelief, but it's true. Single-dispatch functions have a register method, which internally calls typing.get_type_hints.
... and probably a few more
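The typing.get_type_hints item in the list above can be demonstrated without doing any damage. In this sketch the names probe, innocent and calls are hypothetical, and the globalns argument is passed explicitly only to keep the example self-contained; the string annotation stands in for attacker-controlled data.

```python
import typing

calls = []


def probe():
    """Stand-in for attacker code: records that it ran, then returns a type."""
    calls.append("evaluated")
    return int


def innocent(x):
    return x


# A string annotation is treated as a forward reference ...
innocent.__annotations__ = {"x": "probe()"}

# ... and get_type_hints evaluates it, which executes the call to probe().
hints = typing.get_type_hints(innocent, globalns={"probe": probe})
print(hints, calls)
```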
Accessing things without going through Unpickler.find_class:
Just because our find_class method prevents an attacker from accessing something directly doesn't mean there's no indirect way of accessing that thing.
Attribute access: Everything is an object in Python, and objects have lots of attributes. For example, an object's class can be accessed as obj.__class__, a class's parents can be accessed as cls.__bases__, etc.
getattr
operator.attrgetter
object.__getattribute__
Tools.scripts.find_recursionlimit.RecursiveBlowup5.__getattr__
... and many more
Indexing: Lots of things are stored in lists, tuples and dicts - being able to index data structures opens many doors for an attacker.
operator.itemgetter
list.__getitem__, dict.__getitem__, etc
... and almost certainly some more
See Ned Batchelder's Eval is really dangerous to find out how an attacker can use these to gain access to pretty much everything.
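A small sketch of why attribute access and indexing are so powerful: starting from an empty tuple, two operator helpers are enough to reach object and, from there, a list of loaded classes. The variable names are illustrative only.

```python
import operator

# Walk from a harmless empty tuple to `object` using only attribute access:
# ().__class__ is tuple, and tuple.__base__ is object.
root = operator.attrgetter("__class__.__base__")(())

# From `object`, its direct subclasses are one method call away ...
subclasses = root.__subclasses__()

# ... and indexing picks any one of them out of the list.
first = operator.itemgetter(0)(subclasses)
print(root, len(subclasses), first)
```

None of these steps names a class explicitly, which is why filtering by name in find_class alone does not stop them once such helpers are reachable.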
Code execution after unpickling:
An attacker doesn't necessarily have to do something dangerous during the unpickling process - they can also try to return a dangerous object and let you call a dangerous function by accident. Maybe you call typing.get_type_hints on the unpickled object, or maybe you expect to unpickle a CuteBunny but instead unpickle a FerociousDragon and get your hand bitten off when you try to .pet() it. Always make sure the unpickled object is of the type you expect, its attributes are of the types you expect, and it doesn't have any attributes you don't expect it to have.
At this point, it should be obvious that there aren't many modules/classes/functions you can trust. When you implement your find_class method, never ever write a blacklist - always write a whitelist, and only include things you're sure can't be abused.
So what's the answer to the question?
If you really only allow access to bool, str, bytes, bytearray, int, float, complex, tuple, list, dict, set and frozenset then you're most likely safe. But let's be honest - you should probably use JSON instead.
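For reference, an unpickler limited to exactly those types might look like the sketch below. It follows the pattern from the "restricting globals" section of the pickle documentation; the names ALLOWED_BUILTINS, restricted_loads and Evil are my own, and this is an illustration rather than a vetted security boundary.

```python
import io
import pickle

# The "elementary" types from the question, all of which live in builtins.
ALLOWED_BUILTINS = frozenset({
    "bool", "str", "bytes", "bytearray", "int", "float", "complex",
    "tuple", "list", "dict", "set", "frozenset",
})


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Allow only an exact match against the allowlist; everything else,
        # including dotted names such as "str.upper", is rejected.
        if module == "builtins" and name in ALLOWED_BUILTINS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but refuses any global outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Evil:
    """Hypothetical payload that tries to reach eval via __reduce__."""

    def __reduce__(self):
        return eval, ("1 + 1",)


sample = {"n": 1, "xs": [1.5, complex(1, 2)], "s": {1, 2},
          "b": bytearray(b"ab"), "f": frozenset({3})}
roundtrip = restricted_loads(pickle.dumps(sample))

try:
    restricted_loads(pickle.dumps(Evil()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(roundtrip == sample, blocked)
```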
In general, I think most classes are safe - with exceptions like subprocess.Popen, of course. The worst thing an attacker can do is call the class - which generally shouldn't do anything more dangerous than return an instance of that class.
What you really need to be careful about is allowing access to functions (and other non-class callables), and how you handle the unpickled object.
I'd go so far as saying that there is no safe way to use pickle to handle untrusted data.
Even with restricted globals, the dynamic nature of Python is such that a determined hacker still has a chance of finding a way back to the __builtins__ mapping and from there to the Crown Jewels.
See Ned Batchelder's blog posts on circumventing restrictions on eval() that apply in equal measure to pickle.
Remember that pickle is still a stack language and you cannot foresee all possible objects produced from allowing arbitrary calls even to a limited set of globals. The pickle documentation also doesn't mention the EXT* opcodes that allow calling copyreg-installed extensions; you'll have to account for anything installed in that registry too here. All it takes is one vector allowing an object call to be turned into a getattr equivalent for your defences to crumble.
At the very least, apply a cryptographic signature to your data so you can validate its integrity. You'll limit the risks, but if an attacker ever managed to steal your signing secrets (keys) then they could again slip you a hacked pickle.
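One way to do the signing is with an HMAC from the standard library, sketched below. The key value, the function names sign_dumps and verify_loads, and the layout (digest followed by payload) are assumptions made for the example; key storage and rotation are out of scope.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration only
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def sign_dumps(obj, key=SECRET_KEY):
    """Pickle obj and prepend an HMAC-SHA256 digest of the payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return digest + payload


def verify_loads(blob, key=SECRET_KEY):
    """Check the digest first; unpickle only if it matches."""
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


blob = sign_dumps({"user": "alice", "id": 7})
restored = verify_loads(blob)

# Flipping a single bit in the payload must make verification fail.
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    verify_loads(tampered)
    rejected = False
except ValueError:
    rejected = True
print(restored, rejected)
```

The important detail is the order of operations: the digest is verified before pickle.loads() ever sees the bytes.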
I would instead use an existing innocuous format like JSON and add type annotations; e.g. store data in dictionaries with a type key and convert when loading the data.
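A rough sketch of that idea, using the default and object_hook parameters of the json module. The "__type__" key and the encode/decode helper names are conventions invented for this example, and only complex and set are handled.

```python
import json


def encode(obj):
    """json.dumps 'default' hook: tag values that JSON cannot represent natively."""
    if isinstance(obj, complex):
        return {"__type__": "complex", "real": obj.real, "imag": obj.imag}
    if isinstance(obj, set):
        return {"__type__": "set", "items": list(obj)}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode(mapping):
    """json.loads 'object_hook': rebuild tagged values, pass other dicts through."""
    kind = mapping.get("__type__")
    if kind == "complex":
        return complex(mapping["real"], mapping["imag"])
    if kind == "set":
        return set(mapping["items"])
    return mapping


data = {"z": complex(1, 2), "tags": {"a", "b"}, "n": 3}
text = json.dumps(data, default=encode)
restored = json.loads(text, object_hook=decode)
print(restored)
```

Unlike pickle, the loader can only ever produce the handful of types that decode explicitly constructs.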
This idea has also been discussed on the python-ideas mailing list, in the context of adding a safe pickle alternative to the standard library. For example here:
To make it safer I would have a restricted unpickler as the default (for load/loads) and force people to override it if they want to loosen restrictions. To be really explicit, I would make load/loads only work with built-in types.
And also here:
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated.
Is the following enough for you: http://docs.python.org/3.4/library/pickle.html#restricting-globals ?
Indeed, it is. Thanks for pointing it out! I've never gotten past the module interface part of the docs. Maybe the warning at the top of the page could also mention that there are ways to mitigate the safety concerns, and point to #restricting-globals?
Yes, that would be a good idea :-)
So I don't know why the documentation has not been changed, but in my opinion, using a RestrictedUnpickler to restrict the types that can be unpickled is a safe solution. Of course, there could be bugs in the library that compromise the system, but there could also be a bug in OpenSSL that shows random memory data to everyone who asks.
As stated in the pickle documentation, classes are normally pickled in such a way that they require the exact same class to be present in a module on the receiving end. However, I do note that there's also some __getstate__() and __setstate__() methods for classes, which affect how their instances are pickled...
How feasible would it be to create a metaclass that would allow pickling and unpickling of the classes created from that metaclass (in other words, the instances of that metaclass) even without the classes being present on the receiving end? (Though I think the metaclass would probably have to be present.)
Would utilizing a __reduce__() method in the class or metaclass also be something to look into?
The classes have to be somehow present on the receiving end, because methods are not stored with the objects. So, I think that using a specific metaclass unfortunately can't help, here…
I am using the distribution classes in scipy.stats.distributions and need to serialize instances for storage and transfer. These are quite complex objects, and they don't pickle. I am trying to develop a mixin class that makes objects pickle-able, so that I can work with remixed subclasses that otherwise behave just like the objects from scipy.stats. The more I investigate the problem, the more confused I become, and I wonder if I am missing an obvious way to do this.
I have read a related question on how to pickle instance methods, but this is only part of the overall solution that I need and may not even be necessary. I have experimented with writing pickle support functions that closely follow the __init__ method and serialize the object as arguments to __init__, but this seems brittle, especially when subclasses can define arbitrary subclass-specific behavior in __init__.
Does someone have an elegant solution to share?
Update: I found a Python bug report with an example of registering pickle support functions with the copy_reg module to pickle instance methods. For my case, the instance method attributes were the only blockers. However, I would still like to know if there is a way to use a mixin class to solve this problem, because copy_reg has global effects which may not be desirable in all situations.
To avoid repeatedly accessing a SOAP server during development, I'm trying to cache the results so I can run the rest of my code without querying the server each time.
With the code below I get a PicklingError: Can't pickle <class suds.sudsobject.AdvertiserSearchResponse at 0x03424060>: it's not found as suds.sudsobject.AdvertiserSearchResponse when I try to pickle a suds result. I guess this is because the classes are dynamically created.
import pickle
from suds.client import Client
client = Client(...)
result = client.service.search(...)
file = open('test_pickle.dat', 'wb')
pickle.dump(result, file, -1)
file.close()
If I drop the -1 protocol version from pickle.dump(result, file, -1), I get a different error:
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
Is pickling the right thing to do? Can I make it work? Is there a better way?
As the error message you're currently getting is trying to tell you, you're trying to pickle instances that are not picklable (in the ancient legacy pickle protocol you're now using) because their class defines __slots__ but not a __getstate__ method.
However, even altering their class would not help because then you'd run into the other problem -- which you already correctly identified as being likely due to dynamically generated classes. All pickle protocols serialize classes (and functions) "by name", essentially constraining them to be at top-level names in their modules. And, serializing an instance absolutely does require serializing the class (how else could you possibly reconstruct the instance later if the class was not around?!).
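The "by name" constraint is easy to reproduce without suds. In the sketch below, make_class and GeneratedResponse are hypothetical stand-ins for a library that builds its classes at runtime; the exact exception type and message vary between Python versions, so both plausible types are caught.

```python
import pickle


def make_class(name):
    """Build a class at runtime; it is never bound to a module-level name."""
    return type(name, (), {})


Response = make_class("GeneratedResponse")
instance = Response()
instance.value = 42

try:
    pickle.dumps(instance)
    error = None
except (pickle.PicklingError, AttributeError) as exc:
    # pickle tried to record the class as "<module>.GeneratedResponse",
    # but no attribute of that name exists in the module.
    error = exc
print(type(error).__name__, error)
```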
So you'll need to save and reload your data in some other way, breaking your current direct dependence on concrete classes in suds.sudsobject in favor of depending on an interface (either formalized or just defined by duck typing) that can be implemented both by such concrete classes when you are in fact accessing the SOAP server, or simpler "homemade" ones when you're loading the data from a file. (The data representing instance state can no doubt be represented as a dict, so you can force it through pickle if you really want, e.g. via the copy_reg module which allows you to customize serialize/deserialize protocols for objects that you're forced to treat non-invasively [[so you can't go around adding __getstate__ or the like to their classes]] -- the problem will come only if there's a rich mesh of mutual references among such objects).
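Here is a minimal sketch of the non-invasive route, using copyreg (the Python 3 name of copy_reg). Record is a made-up stand-in for a third-party class with __slots__ that you cannot edit, and reduce_record is a hypothetical helper; the object is deliberately stored as a plain dict so that loading no longer depends on the original class.

```python
import copyreg
import pickle


class Record:
    """Stand-in for a third-party class with __slots__ that we cannot modify."""

    __slots__ = ("name", "clicks")

    def __init__(self, name, clicks):
        self.name = name
        self.clicks = clicks


def reduce_record(obj):
    # Capture the instance state as a dict and tell pickle to rebuild it by
    # calling dict(state), so only built-in types end up in the stream.
    state = {slot: getattr(obj, slot) for slot in obj.__slots__}
    return dict, (state,)


# Register the reducer for Record; note that this registration is global.
copyreg.pickle(Record, reduce_record)

data = pickle.dumps(Record("example", 3))
restored = pickle.loads(data)
print(restored)
```

The result is a dict rather than a Record, which fits the duck-typing approach described above: code that consumes the cached data reads fields from a simple container instead of requiring the concrete class.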
You are pickling the class object itself, and not instance objects of the class. This won't work if the class object is recreated. However, pickling instances of the class will work as long as the class object exists.