Function to save and load Python memory? - python

I wrote a program in Python that takes several hours to calculate. I now want the program to save all of its memory (mainly numpy arrays) from time to time, so that I can restart the calculation from the point where the last save happened. I am not looking for something like numpy.save(file, arr), but a way to save all of the memory in one go...
Kind Regards,
Mattias

I agree with @phyrox that dill can be used to persist your live objects to disk so you can restart later. dill can serialize numpy arrays with dump(), and the entire interpreter session with dump_session().
However, it sounds like you are really asking about some form of caching… so I'd have to say that the comment from @Alfe is probably a bit closer to what you want. If you want seamless caching and archiving of arrays to memory… then you want joblib or klepto.
klepto is built on top of dill, and can cache function inputs and outputs to memory (so that calculations don't need to be run twice), and it can seamlessly persist objects in the cache to disk or to a database.
The versions on GitHub are the ones you want: https://github.com/uqfoundation/klepto or https://github.com/joblib/joblib. klepto is newer, but has a much broader set of caching and archiving solutions than joblib. joblib has been in production use longer, so it's better tested -- especially for parallel computing.
Here's an example of typical klepto workflow: https://github.com/uqfoundation/klepto/blob/master/tests/test_workflow.py
Here's another that has some numpy in it:
https://github.com/uqfoundation/klepto/blob/master/tests/test_cache.py
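For a quick taste of the archive interface, here's a minimal sketch (the file name is illustrative; dump and load are the documented klepto calls for flushing the in-memory cache to disk and back):

from klepto.archives import file_archive
import numpy as np

arch = file_archive('results.pkl')   # a dict-like cache backed by a file
arch['arr'] = np.arange(10)          # lives in memory until dumped
arch.dump()                          # persist the cache to disk

# later, possibly in a fresh interpreter:
arch = file_archive('results.pkl')
arch.load()                          # pull the archive back into memory
print(arch['arr'])                   # [0 1 2 3 4 5 6 7 8 9]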

Dill can be your solution: https://pypi.python.org/pypi/dill
Dill provides the user the same interface as the 'pickle' module, and
also includes some additional features. In addition to pickling python
objects, dill provides the ability to save the state of an interpreter
session in a single command. Hence, it would be feasible to save an
interpreter session, close the interpreter, ship the pickled file to
another computer, open a new interpreter, unpickle the session and
thus continue from the 'saved' state of the original interpreter
session.
An example:
import dill as pickle
from numpy import array

a = array([1, 2])
pickle.dump_session('session.pkl')   # save the whole interpreter session
a = 0                                # clobber the variable...
pickle.load_session('session.pkl')   # ...and restore the saved session
print(a)                             # [1 2]
Since dill conforms to the 'pickle' interface, the examples and
documentation at http://docs.python.org/library/pickle.html also apply
to dill if one imports dill as pickle.
Note that there are several types of objects that cannot be saved; check the dill documentation first.

Related

Read python pickle with scala

I inherited a database with values stored as Python pickled objects. Is there a way to unpickle these values in Scala (without calling Python internally) ?
In general, you'd need to call python internally, because pickle allows classes to run arbitrary code on unpickling. (Do a search for "python pickle security" and you'll find a lot of interesting discussions about why this means you shouldn't unpickle from untrusted sources.)
I suspect it could be done for more common cases, though, if there's nothing particularly unusual in your pickled data. This similar question has an answer suggesting a Java library called Pyrolite.

Do Pickle and Dill have similar levels of risk of containing malicious script?

Dill is obviously a very useful module, and it seems as long as you manage the files carefully it is relatively safe. But I was put off by the statement:
Thus dill is not intended to be secure against erroneously or maliciously constructed data. It is left to the user to decide whether the data they unpickle is from a trustworthy source.
I read this at https://pypi.python.org/pypi/dill. It's left to the user to decide how to manage their files.
If I understand correctly, once something has been pickled by dill, you cannot easily find out what the original script will do without some special skill.
My question is: although I don't see a warning, does a similar situation also exist for pickle?
Dill is built on top of pickle, and the warnings apply just as much to pickle as they do to dill.
Pickle uses a stack language to effectively execute arbitrary Python code. An attacker can sneak in instructions that open a backdoor to your machine, for example. Never use pickled data from untrusted sources.
The documentation includes an explicit warning:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
Yes.
Because pickle allows you to override object serialization and deserialization via:
object.__getstate__()
Classes can further influence how their instances are pickled; if the
class defines the method __getstate__(), it is called and the returned
object is pickled as the contents for the instance, instead of the
contents of the instance’s dictionary. If the __getstate__() method is
absent, the instance’s __dict__ is pickled as usual.
object.__setstate__(state)
Upon unpickling, if the class defines __setstate__(), it is called
with the unpickled state. In that case, there is no requirement for
the state object to be a dictionary. Otherwise, the pickled state must
be a dictionary and its items are assigned to the new instance’s
dictionary.
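In normal use these hooks are benign; a minimal sketch (class and attribute names are illustrative) of a class that drops an unpicklable member when pickled and rebuilds it when unpickled:

import pickle

class Handle:
    def __init__(self, path):
        self.path = path
        self.fh = open(path)        # open file objects are not picklable

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['fh']             # strip the unpicklable member
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.fh = open(self.path)   # rebuild it on unpickling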
Because these functions can execute arbitrary code at the user's permission level, it is relatively easy to write a malicious deserializer -- e.g. one that deletes all the files on your hard disk.
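The classic demonstration of the danger uses __reduce__, a related hook that lets an object dictate exactly which call is made at unpickle time. A deliberately harmless sketch:

import os
import pickle

class Evil:
    def __reduce__(self):
        # At unpickle time, pickle will invoke os.system with this argument;
        # a real attacker would run something far worse than echo.
        return (os.system, ('echo arbitrary code ran here',))

payload = pickle.dumps(Evil())
pickle.loads(payload)   # runs the shell command, even though the receiving
                        # side never defined or imported the Evil class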
Although I don't see a warning, does a similar situation also exist for pickle?
Always assume the worst: the fact that nobody has stated something is dangerous does not make it safe to use.
That being said, Pickle docs do say the same:
Warning The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
So yes, that security risk exists on pickle, too.
To explain the background: pickle and dill restore the state of Python objects. In CPython, the default Python implementation, this means restoring PyObject structs, which contain a length field. Modifying that, as one example, leads to funky effects and can have arbitrary consequences for your Python process' memory.
By the way, even assuming the data is not malicious doesn't mean you can un-pickle or un-dill just about anything, e.g. data coming from a different Python version. So, to me, the question is a bit theoretical: if you need portable objects, you will have to implement a rock-solid serialization/deserialization mechanism that transports exactly the data you need, nothing more and nothing less.

Sharing large object between different processes in Python 3.4

I am trying to share a large object (~2 GB) between different processes in Python, in order to cut down on memory usage. I have learned about the Manager class and proxies in the multiprocessing library (https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing-managers). However, according to the documentation and to other Stackoverflow users, this can be very slow when it is used on large objects like this one. Is this correct, and if so, is there another faster Python library or function that I can use instead? Thanks.
EDIT: The object I created is a DAG (directed acyclic graph), though its constructor consists of standard Python values.
If your data is limited to standard values and arrays (no other Python objects) you can use Shared Memory (Value() and Array(), see https://docs.python.org/3.4/library/multiprocessing.html#shared-ctypes-objects). It is very fast.
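A minimal sketch of that approach (names are illustrative): the array lives in shared memory, so a child process writes into it without a 2 GB copy being made.

from multiprocessing import Process, Array

def worker(shared):
    shared[0] = 42.0            # visible to the parent, no copying

if __name__ == '__main__':
    arr = Array('d', 1000)      # 1000 C doubles in shared memory
    p = Process(target=worker, args=(arr,))
    p.start()
    p.join()
    print(arr[0])               # 42.0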
One solution is to make the graph a process that exposes methods which other processes execute through proxies. This means you have to build something similar to manager.dict and manager.Value.
This is done via a producer/consumer pattern, known as inter-process communication (IPC) or remote procedure call (RPC). Solutions might involve zeroless or Pyro.
A simpler solution is to use a database, for instance bsddb or lmdb, which support at least multi-process read access. Using ajgu, or a simpler design of your own, can save you from writing a lot of code.
The last solution is to build a file that you mmap into memory and read from there. But this only really works if your graph is read-only, because if you expect to modify the graph you will need to start writing an mmap'ed graph database. It has the advantage of being fully in memory.
My recommendation is to use lmdb to build a graph database, taking example from the simpler version of ajgu, with two scripts:
One to create the database
Another to use the graph from different processes.
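A minimal lmdb sketch of the idea (the key layout and map size are illustrative assumptions); readers in other processes simply open the same path:

import lmdb
import pickle

# writer: build the graph database
env = lmdb.open('graph.lmdb', map_size=2**31)     # reserve up to ~2 GB
with env.begin(write=True) as txn:
    txn.put(b'node:1', pickle.dumps([2, 3]))      # node 1 -> children 2, 3

# reader (can run in a different process):
with env.begin() as txn:
    children = pickle.loads(txn.get(b'node:1'))
    print(children)                               # [2, 3]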

Pickle: Not safe or fast?

I'm working through some scipy lectures (http://scipy-lectures.github.io/intro/language/standard_library.html#pickle-easy-persistence) and I came across this statement about Pickle:
Useful to store arbitrary objects to a file. Not safe or fast!
What do they mean by this? Not safe (according to Pickle docs) as in don't UnPickle files from an unknown origin or not safe as in you don't always retrieve the original object?
What's the alternative for something safer and faster? I know about cPickle being faster, but I don't think it solves the above definition of safer.
Thanks.
Using pickle in production code is vulnerable by design. Arbitrary code can be executed while unpickling. You can safely unpickle only data from trusted sources. Never unpickle data received from an untrusted or unauthenticated source.
See here for samples from real applications.
As for a faster alternative, there is marshal, Python's internal serialization library. But unlike pickle (or cPickle, which is just a C implementation), it is less stable (see the docs), and its output, while architecture- and OS-independent, depends on the Python version. That is, an object marshalled on Windows with Python 2.7.5 is guaranteed to be un-marshallable on OS X or Ubuntu with Python 2.7.5 installed, but not guaranteed to be un-marshallable with Python 2.6 on Windows.
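A minimal marshal round trip for illustration; it has the same dumps/loads shape as pickle, but only supports core types and only within a single Python version:

import marshal

data = {'a': [1, 2, 3], 'b': 'text'}
blob = marshal.dumps(data)      # fast, compact, version-specific bytes
print(marshal.loads(blob))      # {'a': [1, 2, 3], 'b': 'text'}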
Another faster, safer by design, but less functional serialization alternative is JSON.
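A sketch of what "safer by design" means in practice: loading JSON is pure data parsing, never code execution, but only basic types survive the trip:

import json

data = {'urls': ['http://example.com'], 'count': 3}
text = json.dumps(data)         # human-readable, no executable content
print(json.loads(text))         # parsing can never run code

# The trade-off: anything beyond dict/list/str/numbers/bool/None fails;
# json.dumps({1, 2, 3}) would raise a TypeError.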
The original pickle module is almost never used directly.
If you need speed, use cPickle.
If you need safety, try sPickle.

Common use-cases for pickle in Python

I've looked at the pickle documentation, but I don't understand where pickle is useful.
What are some common use-cases for pickle?
Some uses that I have come across:
1) saving a program's state data to disk so that it can carry on where it left off when restarted (persistence)
2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)
3) storing python objects in a database
4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).
There are some issues with the last one - two identical objects can be pickled and result in different strings - or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
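Here's a sketch of use 4 with that caveat in mind (the helper name is illustrative): pickling the arguments yields a hashable bytes key even for unhashable inputs such as lists, at the cost of an occasional spurious cache miss when pickles differ:

import pickle

_cache = {}

def cached_call(func, *args):
    # pickle.dumps turns unhashable arguments into hashable bytes
    key = (func.__name__, pickle.dumps(args))
    if key not in _cache:
        _cache[key] = func(*args)
    return _cache[key]

print(cached_call(sum, [1, 2, 3]))   # computed once
print(cached_call(sum, [1, 2, 3]))   # served from the cache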
To emphasise @lunaryorn's comment -- you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example, see https://blog.nelhage.com/2011/03/exploiting-pickle/
Minimal roundtrip example:
>>> import pickle
>>> class Anon:
...     pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'
Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you'd have to dig quite deep into the source) is ZODB:
http://svn.zope.org/
Otherwise, PyPI mentions several:
http://pypi.python.org/pypi?:action=search&term=pickle&submit=search
I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.
Pickle is like "Save As.." and "Open.." for your data structures and classes. Let's say I want to save my data structures so that it is persistent between program runs.
import pickle
from collections import defaultdict

Saving:
with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)
Loading:
try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, pickle.PickleError):   # no save file yet, or an unreadable one
    myStuff = defaultdict(dict)
Now I don't have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.
I have used it in one of my projects. If the app was terminated while it was working (it did a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the data was really big.
Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn't pickle, you can't send it to the other resources on another process, computer, etc. Also see here for a good example.
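A minimal multiprocessing sketch of that constraint: the mapped function must be defined at module level precisely so the pickle machinery can locate it by name inside the worker processes.

from multiprocessing import Pool

def square(x):      # module-level, so it pickles by reference
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # arguments and results cross process boundaries as pickles
        print(pool.map(square, range(10)))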
To make such functions serializable, I use dill, which can serialize almost anything in Python. dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
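For instance, dill.pickles reports whether an object will serialize; a small sketch of the kind of thing plain pickle refuses but dill handles:

import pickle
import dill

f = lambda x: x * x
print(dill.pickles(f))      # True: dill can serialize lambdas

try:
    pickle.dumps(f)
except Exception as exc:
    print(exc)              # plain pickle cannot serialize a lambda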
And, yes, people use pickling to save the state of a calculation, or your IPython session, or whatever.
For a beginner (as was the case with me) it's really hard to understand from the official documentation why to use pickle in the first place, maybe because the docs assume you already know the whole purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization, independent of any particular programming language, may also help:
https://stackoverflow.com/a/14482962/4383472 (What is serialization?)
https://stackoverflow.com/a/3984483/4383472
To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.
I can tell you the uses I use it for and have seen it used for:
Game profile saves
Game data saves like lives and health
Previous records of, say, numbers input to a program
Those are the ones I use it for, at least.
I used pickling during web scraping of a website where I wanted to store more than 8000k URLs and process them as fast as possible. Pickling worked very well for this:
you can easily get back to a URL, to the point where you stopped, and even to the job directory keyword, and fetch URL details very quickly to resume the process.
