Common use-cases for pickle in Python - python

I've looked at the pickle documentation, but I don't understand where pickle is useful.
What are some common use-cases for pickle?

Some uses that I have come across:
1) saving a program's state data to disk so that it can carry on where it left off when restarted (persistence)
2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)
3) storing python objects in a database
4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).
There are some issues with the last one - two identical objects can be pickled and result in different strings - or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
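A small sketch of the problem with use 4. It shows a different (and easier to reproduce) cause than reference counts: two dicts that compare equal but were filled in a different order pickle to different bytes. The helper names pickle_key and stable_key are made up for this illustration, and the sorted-items workaround only applies to flat dicts with sortable keys.

```python
import pickle

def pickle_key(obj):
    """Serialize obj to bytes so it can serve as a dictionary key."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# Two dicts that compare equal but were built in a different order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

same_value = (a == b)                        # the objects are equal
same_key = (pickle_key(a) == pickle_key(b))  # but their pickles differ

def stable_key(d):
    """Hypothetical workaround for flat dicts: pickle the items in sorted order."""
    return pickle.dumps(sorted(d.items()), protocol=pickle.HIGHEST_PROTOCOL)

# With the normalized key, a lookup made with b finds the entry stored for a.
cache = {stable_key(a): "cached result"}
hit = cache.get(stable_key(b))
```

So if you use pickles as cache keys, normalize the object first, or accept that some equal objects will miss the cache.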
To emphasise @lunaryorn's comment: you should never unpickle data from an untrusted source, since a carefully crafted pickle can execute arbitrary code on your system. For an example, see https://blog.nelhage.com/2011/03/exploiting-pickle/
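If you really have to load pickles you do not fully control, one mitigation described in the pickle documentation is to subclass pickle.Unpickler and override find_class so that only whitelisted globals can be loaded. Below is a minimal sketch along those lines; the whitelist SAFE_BUILTINS and the helper name restricted_loads are choices made for this example. It reduces the attack surface but is not a guarantee of safety.

```python
import builtins
import io
import pickle

# Only these builtins may be referenced by a pickle (example whitelist).
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global such as os.system.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    """Like pickle.loads(), but refuses anything outside the whitelist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers and whitelisted builtins load normally, while a pickle that references os.system raises pickle.UnpicklingError before anything is called.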

Minimal roundtrip example:
>>> import pickle
>>> class Anon(object):
...     pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'
Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you'd have to dig quite deep into the source) is ZODB:
http://svn.zope.org/
Otherwise, PyPI mentions several:
http://pypi.python.org/pypi?:action=search&term=pickle&submit=search
I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.

Pickle is like "Save As.." and "Open.." for your data structures and classes. Let's say I want to save my data structures so that they persist between program runs.
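Saving:
import pickle
with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)
Loading:
from collections import defaultdict
try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # no usable save file yet: start from an empty structure
    myStuff = defaultdict(dict)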
Now I don't have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.

I have used it in one of my projects. If the app was terminated while it was working (it ran a lengthy task and processed lots of data), I needed to save the whole data structure and reload it when the app was run again. I used cPickle for this, as speed was crucial and the data was really big. (On Python 3 there is no separate cPickle; the plain pickle module already uses the C implementation when it is available.)

Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn't pickle, you can't send it to the other resources on another process, computer, etc. Also see here for a good example.
To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
And, yes, people use pickling to save the state of a calculation, or an IPython session, or whatever.
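To see what "will pickle" means in practice, here is a small check using only the standard pickle module. The helper name can_pickle is made up for this sketch. A function that can be imported by name pickles fine, while a lambda does not, which is the kind of case where a tool like dill comes in.

```python
import math
import pickle

def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

named_ok = can_pickle(math.sqrt)          # found by name in the math module
lambda_ok = can_pickle(lambda x: x * x)   # anonymous, cannot be looked up by name
```

Running a check like this on the function you plan to map is a cheap way to find out early whether it can be sent to worker processes.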

For the beginner (as is the case with me) it's really hard to understand from the official documentation why you would use pickle in the first place. That may be because the docs assume you already know the purpose of serialization. Only after reading the general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization that are not tied to a particular programming language may also help:
https://stackoverflow.com/a/14482962/4383472, What is serialization?,
https://stackoverflow.com/a/3984483/4383472

To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.

I can tell you the uses I use it for and have seen it used for:
Game profile saves
Game data saves like lives and health
Previous records of, say, numbers entered into a program
Those are the ones I use it for at least

I used pickling while scraping a website. I needed to store more than 8000k URLs and process them as fast as possible, so I used pickle because its output quality is very high.
You can easily get back to the URL where you stopped, and details such as the job directory and keywords are also loaded very quickly, so the process can be resumed.

Related

Read python pickle with scala

I inherited a database with values stored as Python pickled objects. Is there a way to unpickle these values in Scala (without calling Python internally) ?
In general, you'd need to call python internally, because pickle allows classes to run arbitrary code on unpickling. (Do a search for "python pickle security" and you'll find a lot of interesting discussions about why this means you shouldn't unpickle from untrusted sources.)
I suspect it could be done for more common cases, though, if there's nothing particularly unusual in your pickled data. This similar question has an answer suggesting a Java library called Pyrolite.
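One way to check from the Python side whether your data falls into the "nothing unusual" case is to scan the pickle's opcodes with the standard pickletools module, which reads the stream without executing it. This is only a sketch: the set RISKY_OPCODES and the helper name risky_opcodes are choices made for this example.

```python
import datetime
import pickle
import pickletools

# Opcodes that import a global or call/construct something while loading.
RISKY_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
                 "NEWOBJ", "NEWOBJ_EX", "BUILD"}

def risky_opcodes(data):
    """List the risky opcode names found in a pickle, without loading it."""
    return sorted({op.name for op, arg, pos in pickletools.genops(data)
                   if op.name in RISKY_OPCODES})

plain = pickle.dumps([1, 2, {"a": 3.5}])               # lists, dicts, numbers only
with_class = pickle.dumps(datetime.date(2020, 1, 1))   # needs a class on load
```

A pickle for which risky_opcodes returns an empty list contains only plain data, which is the easy case for a reader written in another language.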

function to save and load python memory?

I wrote a program in python that takes several hours to calculate. I now want the program to save all the memory from time to time (mainly numpy-arrays). In that way I can restart calculations starting from the point where the last save happened. I am not looking for something like 'numpy.save(file,arr)' but a way to save all the memory in one time...
Kind Regards,
Mattias
I agree with @phyrox that dill can be used to persist your live objects to disk so you can restart later. dill can serialize numpy arrays with dump(), and the entire interpreter session with dump_session().
However, it sounds like you are really asking about some form of caching… so I'd have to say that the comment from @Alfe is probably a bit closer to what you want. If you want seamless caching and archiving of arrays to memory… then you want joblib or klepto.
klepto is built on top of dill, and can cache function inputs and outputs to memory (so that calculations don't need to be run twice), and it can seamlessly persist objects in the cache to disk or to a database.
The versions on github are the ones you want. https://github.com/uqfoundation/klepto or https://github.com/joblib/joblib. Klepto is newer, but has a much broader set of caching and archiving solutions than joblib. Joblib has been in production use longer, so it's better tested -- especially for parallel computing.
Here's an example of typical klepto workflow: https://github.com/uqfoundation/klepto/blob/master/tests/test_workflow.py
Here's another that has some numpy in it:
https://github.com/uqfoundation/klepto/blob/master/tests/test_cache.py
Dill can be your solution: https://pypi.python.org/pypi/dill
Dill provides the user the same interface as the 'pickle' module, and
also includes some additional features. In addition to pickling python
objects, dill provides the ability to save the state of an interpreter
session in a single command. Hence, it would be feasible to save an
interpreter session, close the interpreter, ship the pickled file to
another computer, open a new interpreter, unpickle the session and
thus continue from the 'saved' state of the original interpreter
session.
An example:
import dill as pickle
from numpy import array
a = array([1, 2])
pickle.dump_session('session.pkl')  # save the whole interpreter session
a = 0                               # overwrite a after saving
pickle.load_session('session.pkl')  # restore the saved session
print(a)
Since dill conforms to the 'pickle' interface, the examples and
documentation at http://docs.python.org/library/pickle.html also apply
to dill if you import dill as pickle.
Note that there are several types of data that you cannot save. Check them first.
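If you only need to save a known set of objects "from time to time" rather than the whole session, a periodic checkpoint with the standard pickle module may be enough (numpy arrays can be pickled too). The sketch below is not taken from the answers above; the function names and the state layout are made up. It writes to a temporary file first and then renames it, so a crash in the middle of a save does not destroy the previous checkpoint.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write state to path atomically: dump to a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)  # atomic when both are on the same filesystem
    except BaseException:
        os.unlink(tmp_path)         # do not leave a half-written file behind
        raise

def load_checkpoint(path, default=None):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default
```

In the long-running loop you would call load_checkpoint once at startup, keep everything you need in one dict, and call save_checkpoint every N iterations.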

lists or dicts over zeromq in python

What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages that have a certain string as the first element.
Manual serialization
You turn the data into a string, by concatenation or otherwise, and do your stuff. It's fast and doesn't take much space, but it requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. No DRY.
OK for very small data, but the amount of work is usually not worth it unless you are looking for speed and memory efficiency and you can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, and even callables. It's powerful, and it's so easy it's a no-brainer.
On the other hand, it's possible to end up with something you can't pickle, which breaks your code. Plus, you can't share the data with any library written in another language.
Finally, the format is not human-readable (hard to debug) and quite verbose.
Very nice to share objects and tasks, not so nice for messages.
json
Reasonably fast, and easy to implement for simple to moderately complex data structures. It's flexible, human-readable, and the data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity. Especially since the latest implementation in the Python standard library is in C and the speed is OK.
xml
Verbose, hard to create and a pain to maintain unless you have some heavy library that does all the work for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now as usual, speed and space efficiency is relative, and you must first answer the questions:
what efficiency do I need ?
what am I ready to pay (money, time, energy) for that ?
what solution fits in my current system ?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.
JSON:
# Client (send() needs bytes on Python 3, so encode the JSON text)
socket.send(json.dumps(message).encode("utf-8"))
# Server
message = json.loads(socket.recv().decode("utf-8"))
More info:
JSON encoder and decoder
hwserver.py
hwclient.py
In ZeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you need to first serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which for simple built-in types produces a string that you can evaluate at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language independent format because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead you might want to consider a format like MSGPACK or Protobufs.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
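The delimiter variant can be sketched without any ZeroMQ code, since it is only about how the bytes of one message are laid out. The names DELIMITER, encode_message and decode_message are made up for this example, and the topic is assumed not to contain the delimiter.

```python
import json

DELIMITER = b" "

def encode_message(topic, payload):
    """Build one frame: topic bytes, a delimiter, then the JSON payload."""
    if DELIMITER in topic:
        raise ValueError("topic must not contain the delimiter")
    return topic + DELIMITER + json.dumps(payload).encode("utf-8")

def decode_message(frame):
    """Split a frame at the first delimiter and decode the JSON part."""
    topic, _, body = frame.partition(DELIMITER)
    return topic, json.loads(body.decode("utf-8"))

frame = encode_message(b"weather", {"city": "Oslo", "temps": [3, 5, 4]})
topic, payload = decode_message(frame)
```

Because SUB sockets filter on a message prefix, a subscriber that sets zmq.SUBSCRIBE to b"weather" would receive frames built this way and can then call decode_message on them.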
Encode as JSON before sending, and decode as JSON after receiving.
Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"
In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple python RPC setup where you expose services between different apps. It uses the python zeromq built-in method for sending and receiving python objects (which I believe is simply cPickle)
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available that you can use, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
json is probably the fastest, and if you need even faster than what is included in zeromq, you could manually use cjson. If your focus is speed then this is a good option. But if you know you will be communicating only with other python services, then the benefit of cPickle is a native python serialization format that gives you a lot of control. You can easily define your classes to serialize the way you want, and end up with native python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for json if you wanted.
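For what it's worth, the object hook idea could look roughly like this with the standard json module. The Point class and the "__type__" marker are invented for the example; json.dumps(default=...) and json.loads(object_hook=...) are the standard extension points.

```python
import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def encode_point(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, Point):
        return {"__type__": "Point", "x": obj.x, "y": obj.y}
    raise TypeError("cannot serialize %s" % type(obj).__name__)

def decode_point(d):
    """Called by json.loads for every decoded JSON object (dict)."""
    if d.get("__type__") == "Point":
        return Point(d["x"], d["y"])
    return d

text = json.dumps({"points": [Point(1, 2), Point(3, 4)]}, default=encode_point)
restored = json.loads(text, object_hook=decode_point)
```

The receiving side gets real Point instances back, while the wire format stays plain JSON that other languages can read.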
There are a few questions in that question, but in terms of the best/correct way to send objects/dicts, it obviously depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html

What is the least resource intense data structure to distribute with a Python Application

I am building an application to distribute to fellow academics. The application will take three parameters that the user submits and output a list of dates and codes related to those events. I have been building this using a dictionary and intended to build the application so that the dictionary loaded from a pickle file when the application called for it. The parameters supplied by the user will be used to lookup the needed output.
I selected this structure because I have gotten pretty comfortable with dictionaries and pickle files and I see this going out the door with the smallest learning curve on my part. There might be as many as two million keys in the dictionary. I have been satisfied with the performance on my machine with a reasonable subset. I have already thought through about how to break the dictionary apart if I have any performance concerns when the whole thing is put together. I am not really that worried about the amount of disk space on their machine as we are working with terabyte storage values.
Having said all of that I have been poking around in the docs and am wondering if I need to invest some time to learn and implement an alternative data storage file. The only reason I can think of is if there is an alternative that could increase the lookup speed by a factor of three to five or more.
The standard shelve module will give you a persistent dictionary that is stored in a dbm-style database. Provided that your keys are strings and your values are picklable (since you're using pickle already, this must be true), this could be a better solution than simply storing the entire dictionary in a single pickle.
Example:
>>> import shelve
>>> d = shelve.open('mydb')
>>> d['key1'] = 12345
>>> d['key2'] = {'any': 'picklable value'}
>>> print(d['key1'])
12345
>>> d.close()
I'd also recommend Durus, but that requires some extra learning on your part. It'll let you create a PersistentDictionary. From memory, keys can be any pickleable object.
To get fast lookups, use the standard Python dbm module (see http://docs.python.org/library/dbm.html) to build your database file, and do lookups in it. The dbm file format may not be cross-platform, so you may want to distribute your data in pickle, repr, JSON, YAML or XML format, and build the dbm database when the user runs your program.
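A rough sketch of that "ship a portable format, build the dbm file locally" idea, using only the standard library. The function names and the choice of JSON for the values are assumptions for this example; keys and values are stored as bytes because that is what dbm works with.

```python
import dbm
import json

def build_db(records, db_path):
    """Create a fresh dbm database from a dict of {str: JSON-able value}."""
    with dbm.open(db_path, "n") as db:   # "n": always create a new, empty database
        for key, value in records.items():
            db[key.encode("utf-8")] = json.dumps(value).encode("utf-8")

def lookup(db_path, key):
    """Return the value stored for key, or None if the key is absent."""
    with dbm.open(db_path, "r") as db:
        raw = db.get(key.encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))
```

For many lookups in a row you would keep the database open instead of reopening it on every call. Only the requested record is read from disk, so startup does not have to load two million keys.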
How much memory can your application reasonably use? Is this going to be running on each user's desktop, or will there just be one deployment somewhere?
A python dictionary in memory can certainly cope with two million keys. You say that you've got a subset of the data; do you have the whole lot? Maybe you should throw the full dataset at it and see whether it copes.
I just tested creating a two million record dictionary; the total memory usage for the process came in at about 200MB. If speed is your primary concern and you've got the RAM to spare, you're probably not going to do better than an in-memory python dictionary.
See this solution at SourceForge, esp. the "endnotes" documentation:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
Here are three things you can try:
Compress the pickled dictionary with zlib: zlib.compress(pickle.dumps(d)). (On Python 2 this could also be written pickle.dumps(d).encode("zlib").)
Make your own serializing format (shouldn't be too hard).
Load the data in a sqlite database.
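For the first option, a full round trip could look like this. The helper names and the sample table are made up; how much you gain depends entirely on how repetitive your data is.

```python
import pickle
import zlib

def dump_compressed(obj):
    """Pickle obj and compress the resulting bytes with zlib."""
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

def load_compressed(blob):
    """Inverse of dump_compressed()."""
    return pickle.loads(zlib.decompress(blob))

# Sample data: a dictionary with many similar keys and small list values.
table = {"key%d" % i: [i, i + 1, i + 2] for i in range(1000)}
raw_size = len(pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL))
blob = dump_compressed(table)
```

Note that this saves disk space, not lookup time: the whole dictionary still has to be decompressed and unpickled before the first lookup.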

Python Disk-Based Dictionary

I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
The 3rd party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object, however it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even threadsafe. It's a very handy module:
from shove import Shove
mem_store = Shove()
file_store = Shove('file://mystore')
file_store['key'] = value
Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace
self.lengths[indx]
with
self.lengths[str(indx)]
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]
Last time I was facing a problem like this, I rewrote to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
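Something along these lines, as a minimal sketch. The class name DiskDict and the table layout are made up, and it assumes integer keys and integer values, as in the chain-length use case from the question.

```python
import sqlite3

class DiskDict:
    """Minimal dict-like store for int keys and int values, backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # The PRIMARY KEY gives an index on k, so lookups do not scan the table.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        return self.conn.execute(
            "SELECT 1 FROM kv WHERE k = ?", (key,)).fetchone() is not None

    def commit(self):
        """Flush pending writes; call this every few thousand inserts."""
        self.conn.commit()

lengths = DiskDict()      # pass a file name instead to keep the data on disk
lengths[27] = 112
lengths.commit()
```

Committing in batches rather than after every insert is usually what makes the difference between "300x slower" and something usable.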
With a little bit of thought it seems like you could get the shelve module to do what you want.
I've read you think shelve is too slow and you tried to hack your own dict using sqlite.
Someone else did this too:
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?
You should bring more than one item at a time if there's some heuristic to know which are the most likely items to be retrieved next, and don't forget the indexes like Charles mentions.
For simple use cases sqlitedict can help. However, when you have much more complex databases you might want to try one of the more upvoted answers.
It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup that is lazy-loading so it keeps everything on disk until it is needed and only loads the required slices into memory.
https://vaex.io/docs/tutorial.html#Getting-your-data-in
