Lists or dicts over ZeroMQ in Python

What is the correct/best way to send objects like lists or dicts over zeromq in python?
What if we use a PUB/SUB pattern, where the first part of the string would be used as a filter?
I am aware that there are multipart messages, but they were originally meant for a different purpose. Furthermore, you cannot subscribe to all messages that have a certain string as the first element.

Manual serialization
You turn the data into a string, concatenate, etc., then do your stuff. It's fast and doesn't take much space, but it requires work and maintenance, and it's not flexible.
If another language wants to read the data, you need to code it again. Not DRY.
OK for very small data, but really the amount of work is usually not worth it unless you are looking for speed and memory efficiency and can measure that your implementation is significantly better.
Pickle
Slow, but you can serialize complex objects, even callables. It's powerful, and it's so easy it's a no-brainer.
On the other hand, it's possible to end up with something you can't pickle, which breaks your code. Plus you can't share the data with any library written in another language.
Finally, the format is not human readable (hard to debug) and quite verbose.
Very nice for sharing objects and tasks, not so nice for messages.
json
Reasonably fast, easy to implement with simple to moderately complex data structures. It's flexible, human readable, and the data can be shared across languages easily.
For complex data, you'll have to write a bit of code.
Unless you have a very specific need, this is probably the best balance between features and complexity, especially since the latest implementation in the Python standard library is in C and the speed is OK.
xml
Verbose, hard to create, and a pain to maintain unless you have some heavy lib that does all the work for you. Slow.
Unless it's a requirement, I would avoid it.
In the end
Now as usual, speed and space efficiency are relative, and you must first answer these questions:
what efficiency do I need ?
what am I ready to pay (money, time, energy) for that ?
what solution fits in my current system ?
That's all that matters.
That wonderful moment of philosophy passed, use JSON.

JSON:
# Client
socket.send_string(json.dumps(message))  # send_string encodes the str from json.dumps (needed on Python 3)
# Server
message = json.loads(socket.recv_string())
More info:
JSON encoder and decoder
hwserver.py
hwclient.py

In ZeroMQ, a message is simply a binary blob. You can put anything in it that you want. When you have an object that has multiple parts, you first need to serialize it into something that can be deserialized on the other end. The simplest way to do this is to use repr(obj), which produces a string that you can evaluate at the other end to recreate the object. But that is not the best way.
First of all, you should try to use a language-independent format, because sooner or later you will need to interact with applications written in other languages. A JSON object is a good choice for this because it is a single string that can be decoded by many languages. However, a JSON object might not be the most efficient representation if you are sending lots of messages across the network. Instead you might want to consider a format like MessagePack or Protocol Buffers.
If you need a topic identifier for PUB/SUB, then simply tack it onto the beginning of the message. Either use a fixed-length topic, or place a delimiter between the topic and the real message.
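For illustration, here is a rough sketch of the delimiter approach with a JSON payload. The topic name, port and payload are made up, and the two sides would run as separate processes:
# publisher.py
import json
import zmq

TOPIC = "sensors"  # hypothetical topic used as the subscription filter

pub = zmq.Context().socket(zmq.PUB)
pub.bind("tcp://*:5556")
# topic, a space as delimiter, then the JSON payload
# (in practice you would publish in a loop; late subscribers miss earlier messages)
pub.send_string("%s %s" % (TOPIC, json.dumps({"temperature": 21.5, "unit": "C"})))

# subscriber.py
import json
import zmq

sub = zmq.Context().socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "sensors")  # ZeroMQ filters on the message prefix
topic, _, raw = sub.recv_string().partition(" ")
data = json.loads(raw)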

Encode as JSON before sending, and decode as JSON after receiving.

Also check out MessagePack
http://msgpack.org/
"It's like JSON. but fast and small"

In case you are interested in seeing examples, I released a small package called pyRpc that shows you how to do a simple Python RPC setup where you expose services between different apps. It uses pyzmq's built-in methods for sending and receiving Python objects (which I believe is simply cPickle under the hood).
http://pypi.python.org/pypi/pyRpc/0.1
https://github.com/justinfx/pyRpc
While my examples use the pyobj version of the send and receive calls, you can see there are other versions available that you can use, like send_json, send_unicode... Unless you need some specific type of serialization, you can easily just use the convenience send/receive functions that handle the serialization/deserialization on both ends for you.
http://zeromq.github.com/pyzmq/api/generated/zmq.core.socket.html
json is probably the fastest, and if you need something even faster than what is included in pyzmq, you could manually use cjson. If your focus is speed, then this is a good option. But if you know you will only be communicating with other Python services, then the benefit of cPickle is a native Python serialization format that gives you a lot of control. You can easily define how your classes serialize and end up with native Python objects in the end, as opposed to basic values. I'm sure you could also write your own object hook for json if you wanted.
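As a rough sketch of those convenience wrappers (the address and payload here are made up, and a matching REP server is assumed on the other end):
import zmq

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:5555")

# send_json / recv_json serialize to and from JSON for you
req.send_json({"op": "add", "args": [1, 2]})
reply = req.recv_json()

# send_pyobj / recv_pyobj pickle arbitrary Python objects (Python peers only)
req.send_pyobj({"op": "add", "args": [1, 2]})
reply = req.recv_pyobj()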

There are a few questions in that question, but in terms of the best / correct way to send objects / dicts, it obviously depends. For a lot of situations JSON is simple and familiar to most. To get it to work I had to use send_string and recv_string, e.g.
# client.py
socket.send_string(json.dumps({'data': ['a', 'b', 'c']}))
# server.py
result = json.loads(socket.recv_string())
Discussion in docs https://pyzmq.readthedocs.io/en/latest/unicode.html

Related

Packet Declaration Language

I'm writing a program which will communicate with another program and, obviously, should use same protocol for it.
What I need is something like protobuf, but not protobuf, because it won't let me describe packet formats exactly as I want. For example, it inserts field numbers into its packets. Pickle won't do either, for the same reasons.
I wrote my own thing using struct, but it is ugly and I don't fully understand how it works. I need something where I can describe different fields like shorts and integers, their endianness, complex fields which consist of primitive fields or other complex fields, arrays of primitive fields, and arrays of complex fields.
Could you recommend something like this? Or I doomed to stick to my own solution?
I've had to write Python code for dealing with binary formats before, and struct is no fun to work with. The construct module is much nicer. It allows you to both consume and generate complex binary formats using a simple declarative syntax.
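For example, a declarative packet description with construct might look roughly like this (this assumes the construct 2.8+ API; the field names and layout are invented):
from construct import Struct, Int16ub, Int32ub, Array, this

# Hypothetical packet: big-endian 16-bit version and count, then `count` 32-bit values
Packet = Struct(
    "version" / Int16ub,
    "count" / Int16ub,
    "values" / Array(this.count, Int32ub),
)

raw = Packet.build(dict(version=1, count=3, values=[10, 20, 30]))
parsed = Packet.parse(raw)
assert list(parsed["values"]) == [10, 20, 30]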

Common use-cases for pickle in Python

I've looked at the pickle documentation, but I don't understand where pickle is useful.
What are some common use-cases for pickle?
Some uses that I have come across:
1) saving a program's state data to disk so that it can carry on where it left off when restarted (persistence)
2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)
3) storing python objects in a database
4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).
There are some issues with the last one - two identical objects can be pickled and result in different strings - or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
To emphasise #lunaryorn's comment - you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/
Minimal round-trip example:
>>> import pickle
>>> class Anon(object):
...     pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'
Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you'd have to dig quite deep into the source) is ZODB:
http://svn.zope.org/
Otherwise, PyPI mentions several:
http://pypi.python.org/pypi?:action=search&term=pickle&submit=search
I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.
Pickle is like "Save As.." and "Open.." for your data structures and classes. Let's say I want to save my data structures so that it is persistent between program runs.
Saving:
with open("save.p", "wb") as f:
pickle.dump(myStuff, f)
Loading:
try:
with open("save.p", "rb") as f:
myStuff = pickle.load(f)
except:
myStuff = defaultdict(dict)
Now I don't have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.
I have used it in one of my projects. If the app was terminated while it was running (it did a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was started again. I used cPickle for this, as speed was crucial and the data was really big.
Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn't pickle, you can't send it to the other resources on another process, computer, etc. Also see here for a good example.
To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
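A tiny sketch of what dill buys you over plain pickle (assuming dill is installed):
import dill  # third-party: pip install dill

square = lambda x: x * x      # plain pickle refuses to serialize a lambda
payload = dill.dumps(square)  # dill serializes the function itself
restored = dill.loads(payload)
assert restored(4) == 16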
And, yes, people use pickling to save the state of a calculation, or your IPython session, or whatever.
For a beginner (as was the case with me) it's really hard to understand from the official documentation why to use pickle in the first place. That's maybe because the docs assume you already know the whole purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization that don't focus on a particular programming language may also help:
https://stackoverflow.com/a/14482962/4383472, What is serialization?,
https://stackoverflow.com/a/3984483/4383472
To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.
I can tell you the uses I use it for and have seen it used for:
Game profile saves
Game data saves like lives and health
Previous records of, say, numbers input to a program
Those are the ones I use it for at least
I used pickling while web scraping a website where I wanted to store more than 8000k URLs and process them as fast as possible.
Pickling made it easy to record which URL I had reached and where I had stopped, so those details could be loaded back very quickly to resume the process.

Are there reasons to use get/put methods instead of item access?

I find that I have recently been implementing Mapping interfaces on classes which on the surface fit the model (they are essentially just key-value stores with no extra metadata), but underneath they are sometimes quite complex.
Here are a couple examples of increasing severity:
An object which wraps another mapping converting all objects to strings when set.
An object which uses a local database as a back-end to store the key-value pairs.
An object which makes HTTP requests to remote servers to get/set data.
Let's suppose all of these examples seamlessly implement the Mapping interface, and the only indication that something fishy is going on is that item access could potentially take a few seconds, and an item may not be retrievable in the same form as it was stored (if at all). I'm perfectly content with something like the first example, pretty okay with the second, but I'm getting kind of uncomfortable with the last.
The question is, is there a line at which the API for these models should not use item access, even though the underlying structure may feel like it fits on the surface?
It sounds like you are describing the standard anydbm module semantics. And just as anydbm can raise exception anydbm.error, so too could your subclass raise derivatives like MyDbmTimeoutError as needed. Whether you implement it as dictionary operations or function calls, the caller will still have to contend with exceptions anyway (e.g. KeyError, NameError).
I think the fact that arbitrary "tied" hashes exist in Python 2 and 3.x is decent justification for saying this is a reasonable approach. Indeed, I've been looking for (and designing in my head) more complex bindings than simple key ⇒ value mappings, without a heavy ORM/SQL layer in between.
added: The more I think about it, the more Pythonic tied dictionaries seem to be. A key ⇒ value collection is a dictionary. Whether it lives in core or on disk or across the network is an implementation detail that is best abstracted away. The only substantive differences are increased latency and possible unavailability; however, on a virtual memory based OS, "core" can have higher latency than RAM and in a multiprocessing OS, "core" can become unavailable too. So these are differences in degree only, not kind.
From a strictly philosophical point of view, I don't think that there is a line you can cross with this. If some tool provides the needed functionality, but its API is different, adapt away. The only time you shouldn't do this is if the adapted to API simply is not expressive enough to manipulate the adapted component in a way that is needed.
I wouldn't hesitate to adapt a database into a dict, because that's a great way to manipulate collections, and it's already compatible with a heck of a lot of other code. If I find that my particular application must call the database connection's begin(), commit(), and rollback() methods to work right, then a dict won't do, since dicts don't have transaction semantics.
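To make the question's first example concrete, here is a rough sketch of a Mapping-style wrapper that stringifies values on the way in (the class name and backing store are invented for illustration):
from collections.abc import MutableMapping

class StringifyingDict(MutableMapping):
    """Dict-like wrapper that converts every stored value to a string."""

    def __init__(self, backend=None):
        self._backend = backend if backend is not None else {}

    def __setitem__(self, key, value):
        self._backend[key] = str(value)  # the conversion hides behind item access

    def __getitem__(self, key):
        return self._backend[key]

    def __delitem__(self, key):
        del self._backend[key]

    def __iter__(self):
        return iter(self._backend)

    def __len__(self):
        return len(self._backend)

d = StringifyingDict()
d["cost"] = 200000
print(d["cost"])  # '200000': the stored form differs from what was put in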

Why should we prefer to store serialized data rather than raw code in the DB?

If we have some code (a data structure) which should be stored in the DB, someone always suggests storing the serialized data rather than the raw code string.
So I'm not so sure why we should prefer the serialized data.
To give a simple instance (in Python):
we've got a field which will store a Python dict, like
{ "name" : "BMW", "category":"car", "cost" : "200000"}
so we can serialize it using pickle (a Python module) and then store the pickled data in the DB field.
Or we can store the dict's string representation directly in the DB without serializing.
Since we need to convert the string back to Python data, both approaches are easy to do, by using pickle.loads and exec respectively.
So which should be preferred? And why? Is it because exec is much slower than pickle? or some other reasons?
Thanks.
Or we can store the dict string
directly to DB without serializing.
There is no such thing as "the dict string". There are many ways to serialize a dict into a string; you may be thinking of repr, possibly with eval as the way to get the dict back (you mention exec, but that's simply absurd: what statement would you execute...?! I think you probably mean eval). They're different serialization methods with their trade-offs, and in many cases the trade-offs tend to favor pickling (cPickle, for speed, with protocol -1 meaning "the best you can do", usually).
Performance is surely an issue, e.g., in terms of size of what you're storing...:
$ python -c 'import cPickle; d=dict.fromkeys(range(99), "banana"); print len(repr(d))'
1376
$ python -c 'import cPickle; d=dict.fromkeys(range(99), "banana"); print len(cPickle.dumps(d,-1))'
412
...why would you want to store 1.4 KB rather than 0.4 KB each time you serialize a dict like this one...?-)
Edit: since some suggest Json, it's worth pointing out that json takes 1574 bytes here -- even bulkier than bulky repr!
As for speed...
$ python -mtimeit -s'import cPickle; d=dict.fromkeys(range(99), "chocolate")' 'eval(repr(d))'
1000 loops, best of 3: 706 usec per loop
$ python -mtimeit -s'import cPickle; d=dict.fromkeys(range(99), "chocolate")' 'cPickle.loads(cPickle.dumps(d, -1))'
10000 loops, best of 3: 70.2 usec per loop
...why take 10 times longer? What's the upside that would justify paying such a hefty price?
Edit: json takes 2.7 milliseconds -- almost forty times slower than cPickle.
Then there's generality -- not every serializable object can properly round-trip with repr and eval, while pickling is much more general. E.g.:
$ python -c'def f(): pass
d={23:f}
print d == eval(repr(d))'
Traceback (most recent call last):
File "<string>", line 3, in <module>
File "<string>", line 1
{23: <function f at 0x241970>}
^
SyntaxError: invalid syntax
vs
$ python -c'import cPickle
def f(): pass
d={"x":f}
print d == cPickle.loads(cPickle.dumps(d, -1))'
True
Edit: json is even less general than repr in terms of round-trips.
So, comparing the two serialization approaches (pickling vs repr/eval), we see: pickling is way more general, it can be e.g. 10 times faster, and take up e.g. 3 times less space in your database.
What compensating advantages do you envisage for repr/eval...?
BTW, I see some answers mention security, but that's not a real point: pickling is insecure too (the security issue with eval'ing untrusted strings may be more obvious, but unpickling an untrusted string is also insecure, though in subtler and darker ways).
Edit: json is more secure. Whether that's worth the huge cost in size, speed and generality, is a tradeoff worth pondering. In most cases it won't be.
I've preferred using a standard serialization format like JSON for storing this kind of data in the database. It makes it possible for consumers of the data to be written in other languages than python, it's basically human readable, and it's more easily query-able with SQL than pickled objects.
Both storing as a string and using pickle are serialization strategies. Pickle is more flexible in what it can store and can be more compact. Both strategies, eval (which is what you would use over exec in this instance) and pickle.loads are insecure—both of these can run arbitrary Python code.
Better would be to use a serialization format like JSON (the json module in 2.6, the simplejson 3rd-party module pre-2.6), which isn't tied specifically to being read by Python and won't execute arbitrary code if there ends up being data you do not expect in your database. Further, while the pickle formats are subject to change (and to you losing data!), a standard like JSON is not going to change on you in a backward-incompatible way.
If I had to choose between serializing the data into something like JSON or storing a pickled data structure, I'd choose the JSON option every time. Other than the security issues everyone else is mentioning, portability is the biggest reason not to store a native Python object in the database. There may be a requirement in the future to port your system to some other language, and storing a pickled Python object would make that rather difficult. Also, other applications may need to hit the data you're storing, but I can't speak to specific instances since I don't know your situation.
Also, if your system needed to do any kind of filtering, storing data in a JSON string still wouldn't be your best option. If you can, and there are a set number of fields, I would be very tempted to split them into atomic elements. It would make searching and filtering a lot easier and efficient.
The question is what does serialization gain you?
I bet the people recommending that you store the serialized data think you will save time because you don't need to mess around with SQL queries to construct Python objects. But there are some significant trade-offs in storing your data as serialized blobs, such as:
You lose referential integrity checking
The data format you choose may not work well for different access patterns. How will you get all cars that cost more than $20,000 efficiently, if all of the data is stored inside serialized objects?
What will you do if your object model changes significantly?
You lose interoperability with other languages if you use a native Python serialization format
You have to write support code to load data stored in a non-native Python serialization format
You can't use 3rd party tools for doing reporting on your data
The list goes on and on, make sure you are okay with these trade offs.
There's always a danger in exec that someone will somehow pass in a string with some nasty code. It might never be the case in your application but in general, it's a big problem, and using built-in serialization avoids it.
Another reason for using built-in serialization is that it makes it obvious what you're trying to do in your code. If you just fetch and exec, someone might not understand your actual intent.
Firstly, fetch-and-exec will leave your application vulnerable to code injection.
If someone entered "System(rm -r /);" in your name field, you would lose most of your files on a *nix system when you read the data back in.
The second reason is portability and upgradability. "pickled" objects will work on any python platform with any python release -- Guido promised!
Thirdly "pikling" will automgically handle special characters and wierd code pages. So there will be no problems if your users enter line feeds or semi colons.
Serialization means less worrying. When you marshal your data using some known serialization — Pickle, JSON, Google's Protocol Buffers — you can trust that the data you retrieve later is the data you stored earlier.
Limit capability. If you're storing static data, why open up the possibility of letting code be executed? It's unnecessary. Imagine the complication which will occur if, one year from now, another programmer starts adding functions and module imports to this "static" data.
If there is any possibility of manipulating this data in the DB or creating reports from it, I would seriously consider unpacking it into a table. A simple table with name, key and value columns would give you all the power of your relational database. Depending on the edits it might even perform better than fetch->modify->dump.
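As a rough sketch of that last suggestion, using sqlite3 and an invented table layout:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (name TEXT, key TEXT, value TEXT)")

car = {"name": "BMW", "category": "car", "cost": "200000"}
conn.executemany(
    "INSERT INTO attributes (name, key, value) VALUES (?, ?, ?)",
    [(car["name"], k, v) for k, v in car.items()],
)

# The database can now filter and report without deserializing anything
rows = conn.execute(
    "SELECT name FROM attributes WHERE key = 'cost' AND CAST(value AS INTEGER) > 100000"
).fetchall()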

Python Disk-Based Dictionary

I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
The 3rd-party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object; however, it can store to various back-ends (such as file, SVN, and S3), provides optional compression, and is even thread-safe. It's a very handy module:
from shove import Shove
mem_store = Shove()
file_store = Shove('file://mystore')
file_store['key'] = value
Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace
self.lengths[indx]
with
self.lengths[str(indx)]
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]
Last time I was facing a problem like this, I rewrote to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
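For example, such a wrapper might look roughly like this (the table name and schema are made up, and there is no caching or batching):
import sqlite3

class SQLiteDict:
    """Minimal dict-like wrapper storing integer keys and values in SQLite."""

    def __init__(self, path="lengths.db"):
        # isolation_level=None puts the connection in autocommit mode
        self._conn = sqlite3.connect(path, isolation_level=None)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS lengths (key INTEGER PRIMARY KEY, value INTEGER)"
        )

    def __setitem__(self, key, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO lengths (key, value) VALUES (?, ?)", (key, value)
        )

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM lengths WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        return self._conn.execute(
            "SELECT 1 FROM lengths WHERE key = ?", (key,)
        ).fetchone() is not None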
With a little bit of thought it seems like you could get the shelve module to do what you want.
I've read you think shelve is too slow and you tried to hack your own dict using sqlite.
Someone else did this too:
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?
You should bring more than one item at a time if there's some heuristic to know which are the most likely items to be retrieved next, and don't forget the indexes like Charles mentions.
For simple use cases sqlitedict can help.
However, when you have much more complex needs you might want to try one of the more upvoted answers.
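Minimal usage sketch, assuming the sqlitedict package is installed (the file name is made up; keys are stored as text and values are pickled to disk):
from sqlitedict import SqliteDict  # pip install sqlitedict

lengths = SqliteDict("lengths.sqlite", autocommit=True)
lengths["27"] = 111    # value is pickled and written to the SQLite file
print(lengths["27"])
lengths.close()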
It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup. It is lazy-loading, so it keeps everything on disk until it is needed and only loads the required slices into memory.
https://vaex.io/docs/tutorial.html#Getting-your-data-in
