ZeroMQ selective pub/sub pattern? - python

I'm trying to design a ZeroMQ architecture for N front-end servers and M back-end workers, where the front-end servers send tasks to the back-end ones. The front-end servers know about the back-end servers, but the back-end servers do not know about the front-end. I have two types of tasks: one type should be distributed round-robin and go to just one back-end server, while the other type should be broadcast to all back-end servers. I don't want a central broker, as it would be a single point of failure.
For the first type of task the request/reply pattern seems to be the right one, while for the second it would be the publisher/subscriber pattern. But what about a pattern combining the two? Is there any pattern that would let me choose at send time whether to send a message to all back-end servers or just to one random one?
The solution I've come up with is to just use publisher/subscriber and prepend each message with a back-end server ID, or with some magic value if it's addressed to all. However, this would create a lot of unnecessary traffic. Is there a cleaner and more efficient way to do it?

I'd probably use pub/sub message envelopes. If you're using pub/sub broadcast over UDP I don't believe it will generate unnecessary network traffic, but it will incur extra processing. Like most of these things it's a trade-off between design elegance and performance. ØMQ tends to take the route of performance first, but I'd be inclined to measure it and use quantified performance results to decide whether this is acceptable.
For me the elegant solution would be to use two sets of sockets, because that in itself differentiates the two workflows through the system, whereas using a single socket mixes things up in a very non-ØMQ way. Keeping them separate also leaves room for future changes and for dynamic or unstable systems.
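For reference, a minimal pyzmq sketch of the envelope idea (endpoints and worker IDs here are made up): each backend subscribes to its own ID plus a shared "ALL" topic, so a single PUB socket can address one worker or all of them.

```python
import time
import zmq

ctx = zmq.Context()

# Frontend side: one PUB socket that every backend connects to.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# Backend side (normally a separate process): subscribe to this worker's own
# ID and to the shared broadcast topic.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "worker-1")  # this worker's ID
sub.setsockopt_string(zmq.SUBSCRIBE, "ALL")       # broadcast topic
time.sleep(0.2)  # give the subscription time to propagate (slow-joiner issue)

# The first frame is the envelope (topic), the second the payload.
pub.send_multipart([b"worker-1", b"task for one worker"])
pub.send_multipart([b"ALL", b"task for every worker"])

topic, payload = sub.recv_multipart()
print(topic, payload)
```

Note that with prefix subscriptions over TCP, ZeroMQ 3.x and later filter on the publisher side, so targeted messages aren't actually sent to every subscriber.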

What I see as the only possibility is to use the DEALER-ROUTER combination: DEALER at the frontend, ROUTER at the backend. Every frontend server contains one DEALER socket per backend server (for broadcast) plus one extra DEALER socket connected to all the backend servers at once for the round-robin distribution. Let me explain why.
You can't really use PUB-SUB in such a critical case, because that pattern can very easily drop messages silently; it does not queue. So in fact a message posted to PUB may arrive at only some subset of the SUB sockets, since subscribers may be (dis)connecting in the background. For this reason you need to simulate broadcast by looping over the DEALER sockets assigned to the individual backend servers. DEALER will queue messages if the backend side is not connected, but beware of the HWM. The only complete solution is to use heartbeating to detect when a backend is dead and destroy the socket assigned to it.
A ROUTER socket at the backend is a logical choice, since you can asynchronously accept any number of requests, and because it's a ROUTER socket it is very easy to send the response back to the frontend that requested the task. With a single ROUTER on each backend server, the backends are not even aware that a broadcast is happening; they see everything as a direct request to them. Broadcasting is purely a frontend concern. The only issue with this solution is that if a backend server is not fast enough, all the frontend servers together may fill it up until it reaches the HWM and it starts dropping messages. You can prevent this by having more threads/processes processing the messages from the ROUTER socket; zmq_proxy() is a useful function for this kind of thing.
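A rough pyzmq sketch of that layout, assuming hypothetical backend addresses and leaving out heartbeats, HWM tuning and error handling:

```python
import zmq

BACKENDS = ["tcp://10.0.0.1:5570", "tcp://10.0.0.2:5570"]  # placeholder addresses

ctx = zmq.Context()

# --- frontend ---
# One DEALER connected to every backend: sends are load-balanced round-robin.
rr = ctx.socket(zmq.DEALER)
for addr in BACKENDS:
    rr.connect(addr)

# One DEALER per backend: looping over these simulates a broadcast.
per_backend = [ctx.socket(zmq.DEALER) for _ in BACKENDS]
for sock, addr in zip(per_backend, BACKENDS):
    sock.connect(addr)

def send_to_one(payload: bytes):
    rr.send(payload)            # exactly one backend gets it

def send_to_all(payload: bytes):
    for sock in per_backend:    # queued per backend; watch the HWM
        sock.send(payload)

# --- backend (runs on each backend server) ---
router = ctx.socket(zmq.ROUTER)
router.bind("tcp://*:5570")
identity, payload = router.recv_multipart()   # ROUTER prepends the peer identity
router.send_multipart([identity, b"result for " + payload])
```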
Hope this helps ;-)

Related

Concurrency test of UDP server

I need to write a script to stress test a UDP server. It needs to simulate about 5000 online users and about 400 concurrent users. I couldn't find an existing tool for this on Google, so I wrote a UDP client myself, but I had a problem simulating multiple clients. The solution I came up with:
One socket per client
How do I keep track of online users versus concurrent users when using multithreading and multiple sockets to simulate clients?
I encapsulated the client in a class; in its __init__ method I increment a counter variable to record the number of online users. Done this way, the concurrent operations cannot be performed successfully.
Is it feasible to create 5000 sockets with threads? Is this best practice? Will it perform well?
Other approaches?
Is there another approach I haven't thought of? Am I on the wrong track?
Is there a mature testing framework that can be used for reference?
Finally, English is not my mother tongue, so please forgive my typos and grammar. Thank you for reading and I look forward to your reply.
There is the Apache JMeter tool, which is free, open source and modular.
There is a UDP Request sampler plugin which adds support for the UDP protocol to JMeter; see its documentation for details.
The "5000 online users and 400 concurrent users" requirement may be interpreted in the following manner: real users don't hammer the system under test non-stop, they need some time to "think" between operations, i.e. read text, type a response, fill in forms, take a phone call, etc. So you need to introduce realistic think times using JMeter Timers so that you end up with a configuration where:
5000 users are "online" (connected to the server)
4600 are not doing anything, just "sleeping"
400 are actively sending requests
As long as your machine is capable of doing this without running out of CPU, RAM, network bandwidth, etc., it should be fine. Personally, though, I would use something like greenlet; see the sketch below.
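As an illustration of the greenlet route, here is a minimal sketch using gevent (the server address, message format and think times are placeholders): every simulated user gets its own UDP socket in a lightweight greenlet, but only a subset actively sends.

```python
from gevent import monkey; monkey.patch_all()  # make sockets cooperative
import random
import socket

import gevent

SERVER = ("127.0.0.1", 9999)   # hypothetical server under test
ONLINE = 5000
CONCURRENT = 400

def user(user_id, active):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect(SERVER)       # "online": the socket exists for the whole run
    while True:
        if active:
            sock.send(b"ping from user %d" % user_id)
        # think time keeps only the "active" users hammering the server
        gevent.sleep(random.uniform(1, 5))

users = [gevent.spawn(user, i, i < CONCURRENT) for i in range(ONLINE)]
gevent.joinall(users, timeout=60)   # run the test for one minute
```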

Efficient way to send results every 1-30 seconds from one machine to another

Key points:
I need to send roughly 100 floats every 1-30 seconds from one machine to another.
The first machine is catching those values through sensors connected to it.
The second machine is listening for them, passing them to an http server (nginx), a telegram bot and another program sending emails with alerts.
How would you do this and why?
Please be accurate. It's the first time I've worked with sockets and with Python, but I'm confident I can do this. Just give me the crucial details, enlighten me!
A small portion (a few lines) of the core code would be appreciated if you think it's a delicate part, but the main goal of my question is to see the big picture.
The main thing here is to decide on a connection design and to choose a protocol, i.e. will you keep a persistent connection to your server, or connect each time new data is ready?
Then, will you use HTTP POST, WebSockets or plain sockets? Will you rely exclusively on nginx, or will your data catcher be another serving process?
This would be the most secure way if other people will also be connecting to nginx to view sites etc.
Write or use another server running on another port, for example another nginx process just for this. Then use SSL (i.e. HTTPS) with basic authentication to prevent anyone else from abusing the connection.
Then on the client side, build a packet of all the data every x seconds (pickle.dumps(), JSON or something similar), connect to your port with your credentials and send the packet.
A Python script can wait for it there.
Or you can write a socket server from scratch in Python (not especially hard) to wait for your packets.
The caveat here is that you have to implement your own protocol and security, but you gain some other benefits: it is much easier to maintain a persistent connection if you want or need one. I don't think it is necessary though, and coding recovery from broken connections can get bulky.
No, just wait on some port for a connection. The client must clearly identify itself (otherwise you instantly drop the connection), prove that it speaks your protocol, and then send the data.
Use SSL sockets so that you don't have to implement encryption yourself to protect the authentication data. You can even rely solely on keys built in advance for security and then pass only the data.
Do not worry about the speed. Sockets are handled by the OS, and if you are on a Unix-like system you can connect as many times as you want in as short an interval as you need; nothing short of a DoS attack will impact it much.
If you are on Windows, it's better to use a finished server, because Windows sometimes does not release a socket on time, so you will be forced to wait or do some hackery to avoid this unfortunate behaviour (non-blocking sockets, address reuse and then some flow control will be needed).
As long as your data is small you don't have to worry much about the server protocol. I would use HTTPS myself, but I would write my own lightweight server in Python or modify and run one of the examples from the internet. That's me, though.
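To make the HTTPS option above concrete, here is a minimal client-side sketch using the requests library (my choice, not the answer's; the URL, credentials and interval are placeholders): pack the readings as JSON and POST them with basic auth.

```python
import json
import time

import requests

URL = "https://example.com/sensor-data"   # the extra nginx vhost / port
AUTH = ("sensor-box", "secret")           # HTTP basic auth credentials

def read_sensors():
    # stand-in for the real sensor read: ~100 floats
    return [0.0] * 100

while True:
    payload = json.dumps({"ts": time.time(), "values": read_sensors()})
    try:
        requests.post(URL, data=payload, auth=AUTH, timeout=10,
                      headers={"Content-Type": "application/json"})
    except requests.RequestException:
        pass   # in practice: log and retry or buffer locally
    time.sleep(1)  # send interval, 1-30 s in the question
```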
The simplest thing that could possibly work would be to take your N floats, convert them to a binary message using struct.pack(), and then send them via a UDP socket to the target machine (if it's on a single LAN you could even use UDP multicast, then multiple receivers could get the data if needed). You can safely send a maximum of 60 to 170 double-precision floats in a single UDP datagram (depending on your network).
This requires no application protocol, is easily debugged at the network level using Wireshark, is efficient, and makes it trivial to implement other publishers or subscribers in any language.
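A small sketch of that approach (addresses and values are examples): pack the floats as network-order doubles with struct.pack() and send the resulting datagram.

```python
import socket
import struct

values = [1.0, 2.5, 3.7]                          # your ~100 floats
msg = struct.pack("!%dd" % len(values), *values)  # network-order doubles

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(msg, ("192.168.1.20", 9999))          # receiver's address

# Receiver side: unpack the datagram back into floats.
# data, addr = recv_sock.recvfrom(2048)
# values = struct.unpack("!%dd" % (len(data) // 8), data)
```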

Python tcp socket client

I need a TCP socket client connected to a server to send and receive data.
This socket must always be on, and I cannot open another socket.
I always have some data to send over time, and then later I process the answer to the data sent previously.
If I could open many sockets, I think it would be easier, but in my case I have to send everything over the same socket asynchronously.
So the question is, what do you recommend within the Python ecosystem (Twisted, Tornado, etc.)?
Should I consider node.js or another option?
I highly recommend Twisted for this:
It comes with out-of-the-box support for many TCP protocols.
It is easy to maintain a single connection, there is a ReconnectingClientFactory that will deal with disconnections and use exponential backoff, and LoopingCall makes it easy to implement a heartbeat.
Stateful protocols are also easy to implement and intermingle with complex business logic.
It's fun.
I have a service that is exactly like the one you mention (single login, stays on all the time, processes data). It's been on for months working like a champ.
Twisted is possibly hard to get your head around, but the tutorials here are a great start. Knowing Twisted will get you far in the long run!
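A hedged sketch of such a persistent Twisted client (host, port and message framing are placeholders): ReconnectingClientFactory re-establishes the connection with exponential backoff, and LoopingCall drives a simple heartbeat.

```python
from twisted.internet import protocol, reactor
from twisted.internet.task import LoopingCall

class MyClient(protocol.Protocol):
    def connectionMade(self):
        # send a heartbeat every 30 seconds while connected
        self.heartbeat = LoopingCall(self.transport.write, b"ping\n")
        self.heartbeat.start(30, now=False)

    def dataReceived(self, data):
        # process the server's answers to previously sent data here
        print("received:", data)

    def connectionLost(self, reason):
        self.heartbeat.stop()

class MyFactory(protocol.ReconnectingClientFactory):
    protocol = MyClient

    def buildProtocol(self, addr):
        self.resetDelay()          # reset the backoff after a successful connect
        return super().buildProtocol(addr)

reactor.connectTCP("example.com", 1234, MyFactory())
reactor.run()
```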
"i have to send everything on the same socket asynchronously"
Add your data to a queue and have a separate thread take items off the queue and send them via socket.send().
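A minimal sketch of that suggestion (the server address is a placeholder): producers put data on a queue, and a single sender thread owns the socket and performs all the send calls.

```python
import queue
import socket
import threading

outgoing = queue.Queue()

def sender(sock):
    while True:
        data = outgoing.get()      # blocks until there is something to send
        sock.sendall(data)

sock = socket.create_connection(("example.com", 1234))  # placeholder address
threading.Thread(target=sender, args=(sock,), daemon=True).start()

# anywhere else in the program:
outgoing.put(b"some data\n")
```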

Sending image to server: http POST vs custom tcp protocol

I am working out how to build a python app to do image processing. A client (not a web browser) sends an image and some text data to the server and the server's response is based on the received image.
One method is to use a web server + WSGI module and have clients make an HTTP POST request (using multipart/form-data). The HTTP server then unpacks the uploaded image and the other data into something the program can use.
Another method is to create a protocol that only sends the needed data and is handled within the application. The application would be doing everything (listening on the port, etc).
Is one of these a stand-out 'best' way (if yes, which one?), or is it more up to preference (or is there another way which is better)?
I believe it's more up to your needs, the size of the images, and your general knowledge of network programming.
In terms of simplicity, posting an image to the webserver using WSGI would be fairly simple, and you wouldn't have to worry about handling connections, sockets, error handling due to busy network ports, etc.
Another argument in favor of this approach is that you can easily reuse this "feature" if you already have it working on a webserver, say, by including a browser client. It might not be one of your needs now, but the door is left open.
This would be my choice.
Also, in Python you have a huge plethora of web frameworks to choose from, from the likes of Django, which is probably huge overkill for your needs, to something a lot simpler, like http://flask.pocoo.org/, which might just suit your needs and is really simple to set up.
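As an illustration of how small that can be, here is a hedged Flask sketch (the route, field names and processing function are made up): the client POSTs multipart/form-data with an image file and a text field.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process():
    image = request.files["image"]       # the uploaded image file
    text = request.form.get("text", "")  # accompanying text data
    result = do_image_processing(image.read(), text)
    return jsonify(result)

def do_image_processing(image_bytes, text):
    # stand-in for the real image processing
    return {"size": len(image_bytes), "text": text}

if __name__ == "__main__":
    app.run()
```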
In my opinion HTTP is an ideal protocol for sending files or large data: it is in very common use and easy to adapt to almost any situation. If you use a self-invented protocol, you may find it hard to adapt when other client needs come along, like a web API.
Maybe the discussions about HTTP's lack of instantaneity and agility make you hesitate about choosing it, but those concerns are mostly about instant messaging and server push, where there are better protocols. When it comes to stability and flexibility, HTTP is always a good choice.

Adding authentication to beanstalkd from Python (or any UNIX) client

So what I like about beanstalkd: small, lightweight, has priorities for messages, has a great set of clients, easy to use.
What I dislike about beanstalkd: the lack of authentication, meaning that if you can connect to the port you can insert messages into it.
So my thoughts are to either firewall it off to trusted systems (which is a pain to maintain and, being external to the application, adds another layer of stuff to manage) or to wrap it in TLS/SSL using something like stunnel (which will incur a good chunk of overhead for establishing connections and whatnot). I did think of maybe signing jobs (an MD5 or SHA of job string + time value + secret, appended to the job), but if an attacker were to flood the server with bogus jobs I'd still be in trouble. Can anyone think of any other methods to secure beanstalkd against insertion of bogus messages by an attacker? Especially ones that don't incur a lot of computational or administrative overhead.
I have to disagree with the practice of just holding connections open indefinitely, since I use BeanstalkD from a web-scripting language (PHP) for various events. The overhead of opening a secure connection is something I would have to think about very carefully.
Like Memcached, beanstalkd is designed for use in a trusted environment, behind the firewall. If you don't control the entire private network, then limiting access to a set of machines (by IP address) would be the typical way of controlling that. Adding a security hash and throwing away invalid jobs is not difficult and costs little to check, but it wouldn't stop a flood of jobs being sent.
The questions to ask are: how often are machines likely to be added (at random IP addresses outside a given range), and how likely is it that a third party on the same local network would want to inject random jobs into your queues? The first part is about how much work it is to firewall the machines off, the latter about whether you need to at all.
This question really belongs on the beanstalkd talk list.
I added SASL support to memcached recently for a similar reason. The overhead is almost irrelevant in practice since you only authenticate at connect time (and you hold connections open indefinitely).
If authentication is something you need, I'd recommend bringing it up there where people are likely to help you solve your problems.
I do two things that reduce the issue you are referring to:
First, I always run beanstalkd on 127.0.0.1.
Second, I normally serialize the job structure and use, as the job string, a base64-encoded payload encrypted with a "secret key". Only workers that can decrypt the job string correctly can parse jobs.
I know that this is certainly not a substitute for authentication, but I hope it at least makes it harder for someone to hijack or inject enqueued jobs.
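In the same spirit as the signing idea in the question, here is a hedged sketch using an HMAC over the serialized job (the key, field names and job format are made up): workers can discard bogus jobs cheaply, though, as noted above, this does not stop someone flooding the queue.

```python
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret-key"   # distributed only to trusted producers/workers

def make_job(payload: dict) -> str:
    body = json.dumps({"ts": time.time(), "payload": payload})
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return json.dumps({"body": body, "sig": sig})

def verify_job(job_str: str):
    job = json.loads(job_str)
    expected = hmac.new(SECRET, job["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, job["sig"]):
        return None                     # bogus job: discard it
    return json.loads(job["body"])["payload"]
```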
