Change in pickle file for same object - python

I'm pickling an object, then doing things to it, then pickling it again and checking that the object is still the same. Maybe not the best way to ensure equality, but a very strict and simple one.
However, I get unexpected differences in the pickle files on some machines running Python 2.7.
I used pickletools to analyze the resulting files.
The difference is additional PUT opcodes, even though the memo registers they create are never accessed. When are PUT opcodes generated, and why are they generated differently when calling pickle twice on the same object?
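For anyone comparing streams the same way, here is a minimal sketch (the object and the mutation step are placeholders). pickletools.dis() prints one opcode per line so diffs are easy to spot, and pickletools.optimize() strips PUT opcodes whose memo slots are never read by a GET, which normalizes the bookkeeping before comparing:

    import pickle
    import pickletools

    obj = {"a": [1, 2, 3], "b": (4, 5)}   # placeholder object

    first = pickle.dumps(obj)
    # ... do things to obj and undo them here ...
    second = pickle.dumps(obj)

    # One opcode per line, including PUT/GET, so diffs are easy to spot.
    pickletools.dis(first)
    pickletools.dis(second)

    # optimize() drops PUTs whose registers are never fetched,
    # making the comparison insensitive to memo bookkeeping.
    print(pickletools.optimize(first) == pickletools.optimize(second))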

Related

How do I properly save state in case of an exception in python?

I want to
1. load data from files,
2. work on that data,
3. and eventually save that data back to files.
However, since step 2 may take several hours I want to make sure that progress is saved in case of an unexpected exception.
The data is loaded into an object to make it easy to work with.
The first thing that came to my mind was to turn that object's class into a context manager and use the with-statement. However, I'd have to write practically my entire program within that with-statement, and that doesn't feel right.
So I had a look around and found this question, which essentially asks for the same thing. Among the answers, the one suggesting weakref.finalize seemed the most promising to me. However, there is a note at the bottom of the documentation that says:
Note: It is important to ensure that func, args and kwargs do not own any references to obj, either directly or indirectly, since otherwise obj will never be garbage collected. In particular, func should not be a bound method of obj.
Since I'd want to save fields of that object, I'd reference them, running right into this problem.
Does this mean that the object's __exit__ function will never be called, or that it will not be called until the program crashes/exits?
What is the pythonic way of doing this?
It's a little hacky, and you still wrap your code, but I normally just wrap main() in a try/except block.
Then you can handle the except with pdb.set_trace(), which means that whatever exception you get, your program will drop into an interactive debugger instead of crashing.
After that, you can manually inspect the error and dump any processed data into a pickle, or whatever else you want to do. Once you fix the bug, set up your code to read the pickle and pick up from where it left off.
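A minimal sketch of that pattern (load_data, work_on and save are placeholder stand-ins for the real steps, with a simulated crash so the rescue path runs):

    import pdb
    import pickle

    def load_data():
        return list(range(10))            # placeholder: step 1

    def work_on(data):
        data.append(sum(data))            # placeholder: the multi-hour step 2
        raise RuntimeError("simulated crash mid-run")

    def save(data):
        with open("out.pkl", "wb") as f:  # placeholder: step 3
            pickle.dump(data, f)

    if __name__ == "__main__":
        state = {}
        try:
            state["data"] = load_data()
            work_on(state["data"])
            save(state["data"])
        except Exception:
            # Drop into an interactive debugger instead of crashing;
            # from the (Pdb) prompt the partial results can be rescued:
            #     (Pdb) pickle.dump(state["data"], open("rescue.pkl", "wb"))
            pdb.set_trace()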

Mocking disk-out-of-space in python unittests

I'm trying to write a unit test that exercises the behaviour of a function when the disk is full. I need file access to work normally for most of the test, so that the file I'm creating is actually created, and then at one point I need the disk to be 'full'. I can't find a way to do this with mock_open(), since the file object it creates doesn't seem to persist between function calls. I've also tried pyfakefs, setting the disk size with self.fs.set_disk_usage(MAX_FS_SIZE), but when I run this in my tests it allows used_size to go negative, meaning there is always free space (though oddly, their example code works correctly).
Is there a way to simulate a disk-out-of-space error at a particular point in my code? Mocking the write function to have a side effect would be my immediate thought, but I can't get at the file object I'm writing to from my test code, as it's buried deep inside function calls.
Edit: looks like I've found a bug in pyfakefs
Edit 2: the bug in pyfakefs has been fixed; it now works as expected. I'm still interested to know whether there's a way to get f.write() to throw an OSError with a simple mock.
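One way to do that on Python 3 without pyfakefs is to patch builtins.open so it returns a thin wrapper around the real file whose write() starts raising ENOSPC after a set number of calls. A sketch, where save_report is a hypothetical stand-in for the code under test:

    import errno
    import os
    import tempfile
    import unittest
    from unittest import mock

    def save_report(path):
        # Hypothetical stand-in for the function under test.
        with open(path, "w") as f:
            f.write("header\n")
            f.write("lots of data\n")   # this write should hit the 'full' disk

    class FullDiskFile:
        """Wraps a real file; lets `allowed` writes through, then raises."""

        def __init__(self, real_file, allowed):
            self._f = real_file
            self._allowed = allowed

        def write(self, data):
            if self._allowed <= 0:
                raise OSError(errno.ENOSPC, "No space left on device")
            self._allowed -= 1
            return self._f.write(data)

        def __getattr__(self, name):    # delegate everything else
            return getattr(self._f, name)

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return self._f.__exit__(*exc)

    class DiskFullTest(unittest.TestCase):
        def test_second_write_fails(self):
            real_open = open            # captured before patching

            def full_disk_open(*args, **kwargs):
                return FullDiskFile(real_open(*args, **kwargs), allowed=1)

            with tempfile.TemporaryDirectory() as tmp:
                path = os.path.join(tmp, "report.txt")
                with mock.patch("builtins.open", new=full_disk_open):
                    with self.assertRaises(OSError):
                        save_report(path)

    if __name__ == "__main__":
        unittest.main()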

How to write test cases for pickle backwards compatibility

You're writing a library, and you know your users pickle your objects. Sometimes you add new fields, and this creates a backwards-compatibility problem because old pickles of the objects don't have the needed fields. You'd like to add some tests for this case.
The obvious way to do this is to save some actual pickles from your old version, shove them into your test suite somehow, and make sure you can keep unpickling them. But having hard-coded binary test data is pretty uncool. Is there a way to write these tests without checking in binary blobs? I've tried playing around with faking the __module__ and __qualname__ fields on a locally declared class (which is intended to simulate the "old" version), but I get errors like "_pickle.PicklingError: Can't pickle : it's not the same object as torch.nn.modules.conv.Conv2d". Is there a good way to do this?
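That PicklingError comes from pickle's identity check: it looks the class up via its __module__ and __qualname__ and verifies it gets the same object back. One way to make the fake-old-class trick work is to temporarily patch the attribute on the real module while dumping, so the lookup succeeds. A sketch, with mylib.Thing and its fields as hypothetical stand-ins:

    import pickle
    import unittest
    from unittest import mock

    import mylib  # hypothetical library under test

    class OldThing:
        """mylib.Thing as it looked before `new_field` was added."""
        def __init__(self):
            self.old_field = 1

    # Make pickle treat OldThing as mylib.Thing.
    OldThing.__module__ = "mylib"
    OldThing.__qualname__ = "Thing"

    class BackCompatTest(unittest.TestCase):
        def test_unpickle_old_version(self):
            # While dumping, mylib.Thing must resolve to OldThing, or
            # pickle raises the "it's not the same object as" error.
            with mock.patch.object(mylib, "Thing", OldThing):
                old_bytes = pickle.dumps(OldThing())

            # Load with the *current* class; assert the old field is
            # there and the new one gets a default (e.g. in __setstate__).
            obj = pickle.loads(old_bytes)
            self.assertIsInstance(obj, mylib.Thing)
            self.assertEqual(obj.old_field, 1)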

Do Pickle and Dill have similar levels of risk of containing malicious script?

Dill is obviously a very useful module, and it seems that as long as you manage the files carefully it is relatively safe. But I was put off by this statement:
Thus dill is not intended to be secure against erroneously or maliciously constructed data. It is left to the user to decide whether the data they unpickle is from a trustworthy source.
which I read at https://pypi.python.org/pypi/dill. It's left to the user to decide how to manage their files.
If I understand correctly, once an object has been pickled by dill, you cannot easily find out what the original script will do without some special skill.
My question is: although I don't see a warning, does a similar situation also exist for pickle?
Dill is built on top of pickle, and the warnings apply just as much to pickle as they do to dill.
Pickle uses a stack language that can effectively execute arbitrary Python code. An attacker can sneak in instructions to open up a backdoor to your machine, for example. Don't ever unpickle data from untrusted sources.
The documentation includes an explicit warning:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
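To make that concrete, here is the classic proof of concept, with a harmless echo standing in for a real payload. __reduce__ lets an object name a callable and arguments for the unpickler to invoke, so the code runs the moment the data is loaded:

    import os
    import pickle

    class Exploit:
        def __reduce__(self):
            # The unpickler will call os.system with this argument.
            return (os.system, ('echo "this could have been rm -rf /"',))

    payload = pickle.dumps(Exploit())
    pickle.loads(payload)   # runs the shell command; you never even
                            # have to touch the returned object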
Yes
Because pickle allows you to override object serialization and deserialization via two hooks:
object.__getstate__()
Classes can further influence how their instances are pickled; if the class defines the method __getstate__(), it is called and the returned object is pickled as the contents for the instance, instead of the contents of the instance’s dictionary. If the __getstate__() method is absent, the instance’s __dict__ is pickled as usual.
object.__setstate__(state)
Upon unpickling, if the class defines __setstate__(), it is called with the unpickled state. In that case, there is no requirement for the state object to be a dictionary. Otherwise, the pickled state must be a dictionary and its items are assigned to the new instance’s dictionary.
Because these functions can execute arbitrary code at the user's permission level, it is relatively easy to write a malicious deserializer -- e.g. one that deletes all the files on your hard disk.
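A small demonstration of the hook (not an attack on its own, since the class must already exist on the loading side, but it shows that code runs during pickle.loads() before you ever touch the result):

    import pickle

    class Innocent:
        def __getstate__(self):
            return {"x": 1}

        def __setstate__(self, state):
            # Anything here runs automatically during pickle.loads();
            # a hostile version could delete files instead of printing.
            print("__setstate__ running with", state)
            self.__dict__.update(state)

    blob = pickle.dumps(Innocent())
    restored = pickle.loads(blob)   # prints: __setstate__ running with {'x': 1}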
Although I don't see a warning, does a similar situation also exist for pickle?
Never assume that something is safe to use just because nobody has explicitly stated that it's dangerous.
That being said, the pickle docs do say the same:
Warning The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
So yes, that security risk exists on pickle, too.
To explain the background: pickle and dill restore the state of Python objects. In CPython, the default Python implementation, this means restoring PyObject structs, which contain a length field. Modifying that, as an example, leads to funky effects and might have arbitrary consequences for your Python process's memory.
By the way, even assuming the data is not malicious doesn't mean you can un-pickle or un-dill just about anything, e.g. data that comes from a different Python version. So, to me, the question is a bit of a theoretical one: if you need portable objects, you will have to implement a rock-solid serialization/deserialization mechanism that transports exactly the data you need, nothing more and nothing less.

Should I use marshal if I am getting comma errors in JSON (Python)?

I am using Python to dump a huge data structure (multiple lists and dictionaries) to JSON and sending it over a socket to a client.
Every time I run the program I get a ValueError: Expecting ',' delimiter: line 1 column 16177 (char 16176), at a different location each run (it could be column 25000 or column 13000; it keeps changing).
Should I use marshal instead of json (or even pickle)? What is the most reliable format for large payloads?
I'd suggest using pickle (or cPickle if you're on Python 2.x), as it can serialize almost anything, including user-defined classes. And, as the docs say:
The marshal serialization format is not guaranteed to be portable across Python versions. Because its primary job in life is to support .pyc files, the Python implementers reserve the right to change the serialization format in non-backwards compatible ways should the need arise. The pickle serialization format is guaranteed to be backwards compatible across Python releases.
(Emphasis mine).
Another advantage of pickle:
The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again. marshal doesn’t do this.
This has implications both for recursive objects and object sharing. Recursive objects are objects that contain references to themselves. These are not handled by marshal, and in fact, attempting to marshal recursive objects will crash your Python interpreter. Object sharing happens when there are multiple references to the same object in different places in the object hierarchy being serialized. pickle stores such objects only once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects.
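A quick illustration of both properties:

    import pickle

    shared = {"config": True}
    a = [1, 2]
    a.append(a)                      # recursive: the list contains itself
    obj = {"first": shared, "second": shared, "cycle": a}

    restored = pickle.loads(pickle.dumps(obj))
    print(restored["first"] is restored["second"])    # True: sharing preserved
    print(restored["cycle"][2] is restored["cycle"])  # True: the cycle survives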
You can also use dill if pickle fails to serialize some data.
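As a side note on the actual error: a ValueError from the JSON parser at a different column on every run is consistent with parsing a partially received message, because sock.recv() is not guaranteed to return the whole payload in one call. Whichever format you end up with, length-prefix your messages. A minimal sketch with pickle (send_obj/recv_obj are illustrative names; only unpickle data from a trusted peer, per the warnings above):

    import pickle
    import struct

    def send_obj(sock, obj):
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        # 4-byte big-endian length prefix so the receiver knows
        # exactly where the message ends.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_obj(sock):
        (length,) = struct.unpack("!I", _recv_exact(sock, 4))
        # Only unpickle data from a trusted peer.
        return pickle.loads(_recv_exact(sock, length))

    def _recv_exact(sock, n):
        chunks = []
        while n:
            chunk = sock.recv(n)
            if not chunk:
                raise ConnectionError("socket closed mid-message")
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)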
