Handle multiple socket connections - python

I'm writing a client-server app in Python. The idea is to have a main server and thousands of clients that connect to it. The server sends small files to the clients at random for processing, and the clients must do the work and report their status to the server every minute. My problem is that for the moment I only have a small, old home server, so I think it can't handle so many connections. Maybe you could help me with this:
How can I increase the number of connections my server can handle?
How can I balance the load from client-side?
How could I improve the communication? I mean, I need to keep a list of clients on the server with their status (maybe in a DB?), and these updates will arrive from time to time, so I don't need a permanent connection. Is it a good idea to use UDP to send the updates? If not, do I have to create a new thread every time I receive an update?
EDIT: I updated the question to explain the problem a little better, but mainly to make it clear enough for people with the same problem. There is actually a good solution in Tim McNamara's answer.

Setting yourself up for success: access patterns matter
What design decisions could affect how you implement a networking solution? You immediately begin to list a few:
programmability
available memory
available processors
available bandwidth
This looks like a great list. We want something that is easy enough to program and fairly high spec. But this list falls short: we've only looked at the server. That might be all we can control in a web application, but what about distributed systems that we have full control over, like sensor networks?
Let's say we have 10,000 devices that want to update you with their latest sensor readings, which they take each minute. Now, we could use a high-end server that holds concurrent connections with all of the devices.
However, even if you had an extremely high-end server, you could still find yourself with performance trouble. If the devices all use the same clock and all attempt to send data at the top of the minute, the server would be doing lots of CPU work for 1-2 seconds of each minute and nothing for the rest. Extremely inefficient.
As we have control over the sensors, we could ask them to load balance themselves. One approach would be to give each device an ID, and then use the modulus operator to only send data at the right time per minute:
import time

def main(device_id):
    # read_sensors() and send() stand in for the device's own I/O routines
    data = None
    second_to_send = device_id % 60
    while 1:
        time_now = time.localtime().tm_sec
        if time_now == 0:
            data = read_sensors()
        if time_now == second_to_send and data:
            send(data)
        time.sleep(1)
One consequence of this type of load balancing is that we no longer need such a high powered server. The memory and CPU we thought we needed to maintain connections with everyone is not required.
What I'm trying to say here is that you should make sure that your particular solution focuses on the whole problem. With the brief description you have provided, it doesn't seem like we need to maintain huge numbers of connections the whole time. However, let's say we do need to have 100% connectivity. What options do we have?
Non-blocking networking
Non-blocking I/O means that a function asking a file descriptor for data returns immediately when there is none. For networking this can be awkward, because a function attempting to read from a socket may hand no data back to the caller. It can therefore be a lot simpler to spawn a thread and then call read: blocking inside the thread will not affect the rest of the program.
The problems with threads include memory inefficiency, latency involved with thread creation and computational complexity associated with context switching.
To take advantage of non-blocking I/O, you could potentially poll every relevant file descriptor in a while 1: loop. That would be great, except that the CPU would run at 100%.
To avoid this, event-based libraries have been created. They keep the CPU at 0% when there is no work to be done, activating only when there is data to be read or sent. Within the Python world, Twisted, Tornado and gevent are big players, but there are many options. In particular, diesel looks attractive.
Here's the relevant extract from the Tornado web page:
Because it is non-blocking and uses epoll or kqueue, it can handle thousands of simultaneous standing connections, which means it is ideal for real-time web services.
Each of those options takes a slightly different approach. Twisted and Tornado are fairly similar in their approach, relying on non-blocking operations. Tornado is focused on web applications, whereas the Twisted community is interested in networking more broadly, so it has more tooling for non-HTTP communication.
gevent is different. The library modifies the socket calls, so that each connection runs in an extremely lightweight thread-like context, although in effect this is hidden from you as a programmer. Whenever there is a blocking call, such as a database query or other I/O, gevent will switch contexts very quickly.
The upshot of each of these options is that you are able to serve many clients within a single OS thread.
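To make the event-loop idea concrete, here is a minimal sketch using Python 3's standard-library selectors module as a stand-in for those libraries; the port, buffer size and "OK" reply are placeholders for whatever protocol you define:

import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, addr = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)          # returns immediately; the selector said data is ready
    if data:
        # process the client's status update here (e.g. store it in a DB)
        conn.sendall(b"OK\n")       # a real server would also handle partial writes and errors
    else:                           # an empty read means the client closed the connection
        sel.unregister(conn)
        conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9000))
server.listen(1000)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():     # sleeps (0% CPU) until a socket becomes ready
        key.data(key.fileobj)       # key.data is the callback registered above

Everything here runs in a single OS thread, which is exactly the property the event-driven libraries above give you with far more polish.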
Tweaking the server
Your operating system imposes limits on the number of connections that it will allow. You may hit these limits if you reach the numbers you're talking about. In particular, Linux maintains limits for each user in /etc/security/limits.conf. You can access your user's limits by calling ulimit in the shell:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63357
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 63357
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The most relevant line here is open files. Open external connections count as open files. Once that 1024 limit is hit, your applications will not be able to open another file, and no more clients will be able to connect to your server. Let's say your web server runs as the user httpd. These lines give an idea of the change you could make in /etc/security/limits.conf to raise that limit:
httpd soft nofile 20480
httpd hard nofile 20480
For extremely high volumes, you may hit system-wide limits. You can view them through cat /proc/sys/fs/file-max:
$ cat /proc/sys/fs/file-max
801108
To modify this limit, use sudo sysctl -w fs.file-max=n, where n is the number of open files you wish to permit. Modify /etc/sysctl.conf to have this survive reboots.
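For example (the value below is purely illustrative), you could add a line like this to /etc/sysctl.conf and apply it with sudo sysctl -p:

    # /etc/sysctl.conf -- example value only; pick a limit that suits your workload
    fs.file-max = 500000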

Generally speaking, there is no problem with having even tens of thousands of sockets open at once on a very modest home server.
Just make sure you do not create a new thread or process for each connection.

Related

RabbitMQ memory control: queue is full and is not paging, connection hangs

I'm testing out a RabbitMQ/Celery setup.
In the current setup there is a jobqueue (2 GB RAM, 65 GB HD) and only one worker, which pushes a lot of messages to the queue (later we'll add a bunch of workers). When the jobqueue reaches about ~11 million messages, the connection hangs (pretty sure this is a case of blocking due to memory-based flow control, as described at http://www.rabbitmq.com/memory.html). But the connection hangs forever, never closing and never paging to disk. This is undesirable behavior, causing the Celery workers to become zombie processes.
In thinking about the total size the system might actually require, we would like the queue to be able to take something like 10,000 times this load: a maximum of around ~30 billion messages in the queue at a time.
Here are some relevant settings:
{vm_memory_high_watermark,0.8},
{vm_memory_high_watermark_paging_ratio,0.5}]
We initially changed vm_memory_high_watermark from 0.4 to 0.8, which allowed more messages in the queue, but still not enough.
We're thinking of course the system will need more RAM at some point, although before that happens we want to understand the current problem and how to deal with it.
Right now there are only 11M tasks in the queue, it is using 80% of 2 GB RAM, and the entire system is only using 8 GB of disk. The memory usage makes sense given that we set vm_memory_high_watermark to 0.8. The disk usage does not make sense to me at all, though, and suggests that paging is not happening. Why isn't RabbitMQ paging to disk in order to allow the queue to grow more? While that would obviously slow the queue machine down, it would allow it not to die, which seems like desirable fallback behavior. AFAIK this is indeed the whole point of paging.
Other notes:
We confirmed that the connections are hanging and have in fact been blocked for 41 hours since then (by examining the connections section of rabbitmqctl report). According to http://www.rabbitmq.com/memory.html this means that "flow control is taking place". The question is -- why isn't it paging messages to disk?
Other details:
Ubuntu 12.04.3 LTS
RabbitMQ 3.2.2, Erlang R14B04
Celery 3.0.24
Python 2.7.3
If your queue is not durable, no messages will be paged to disk and the system will be limited by available memory. If you need messages flushed to disk, use a durable=true queue.
Also, this design, with a lot of load and nothing consuming the messages, is not ideal. RabbitMQ is not a database; messages are meant to be transient. If you need a datastore, use Redis, an RDBMS, etc.
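If you declare the queue yourself with pika, for instance, a durable queue plus persistent messages looks roughly like the sketch below; the host, queue name and payload are placeholders, and Celery has its own queue-durability settings if it is declaring the queue for you:

import pika

# Sketch only: host, queue name and payload are placeholders.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# durable=True lets the queue survive a broker restart
channel.queue_declare(queue="jobs", durable=True)

channel.basic_publish(
    exchange="",
    routing_key="jobs",
    body="some task payload",
    properties=pika.BasicProperties(delivery_mode=2),  # delivery_mode=2 marks the message persistent
)
connection.close()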

Configuring connection-pool size with Andy McCurdy's python-for-redis library

I have a Python loader using Andy McCurdy's library that opens multiple Redis DB connections and sets millions of keys, looping through files whose lines each contain an integer that is the Redis DB number for that record. Altogether, only 20 databases are open at the present time, but eventually there may be as many as 100 or more.
I notice that the Redis log (set to verbose) always tells me there are "4 clients connected (0 slaves)", even though I know that my 20 connections are open and being used.
So I'm guessing this is about the connection pooling support built into the Python library. Am I correct in that guess? If so, the real question is: is there a way to increase the pool size? I have plenty of machine resources, much of it dedicated to Redis. Would increasing the pool size help performance as the number of virtual connections I'm making goes up?
At this point I am actually hitting only ONE connection at a time, even though I have many open, as I shuffle input records among them. But eventually there will be many scripts (2 dozen?) hitting Redis in parallel, mostly reading, and I am wondering what effect increasing the pool size would have.
Thanks
matthew
So I'm guessing this is about the connection pooling support built into the python library. Am I correct in that guess?
Yes.
If so, the real question is: is there a way to increase the pool size?
Not needed; by default the library will grow the pool up to 2**31 connections, so your connections are sitting idle anyway.
If you want to increase performance, you will need to change the application that uses Redis.
and I am wondering what effect increasing the pool size would have.
None, at least not in this case.
If Redis becomes the bottleneck at some point and you have a multi-core server, you must run multiple Redis instances to increase performance, as Redis only runs on a single core. When you run multiple instances and are doing mostly reads, the slave feature can increase performance, because slaves can serve all of the reads.
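For completeness, here is a minimal sketch of configuring the pool explicitly with redis-py; the host, port and max_connections values are only illustrative, and as noted above you are unlikely to need this:

import redis

# Sketch: explicit connection pool with redis-py; the values are illustrative.
pool = redis.ConnectionPool(host="localhost", port=6379, db=0, max_connections=50)
r = redis.Redis(connection_pool=pool)

r.set("example:key", "value")
print(r.get("example:key"))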

Python/Redis Multiprocessing

I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three Redis servers (which sit completely in memory). But for some reason all 4 CPU cores sit around 60% idle the whole time. The server has plenty of RAM, and iotop shows that there is no disk I/O happening.
I have 4 Python threads and 3 Redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of Redis operations on each server is well below what it has been benchmarked as capable of.
I can't find the bottleneck in this program. What would be the likely candidates?
Network latency may be contributing to your idle CPU time in your Python client application. If the network latency between client and server is even as little as 2 milliseconds, and you perform 10,000 Redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
Using multiple Python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniform, randomly distributed access across the servers (by hashing on key names to implement sharding or partitioning), the odds that three random requests will hash to the same Redis server are inversely proportional to the number of servers: with 1 server you hash to the same server 100% of the time, with 2 it's 50% of the time, with 3 it's 33% of the time. So what may be happening is that 1/3 of the time all of your threads are blocked waiting for the same server. Redis is single-threaded at handling data operations, so it must process each request one after another. Your observation that your CPU only reaches 60% utilization is consistent with your requests all being blocked on network latency to the same server.
Continuing the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection, and evaluate the partitioning hash before passing a request to a worker thread. This will ensure all threads are waiting on different network latency. But there may be an even better improvement by using pipelining.
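A minimal sketch of that kind of client-side partitioning; the CRC32-of-key choice and the server addresses are only examples:

import zlib
import redis

# Sketch: choose a server by hashing the key name; hosts/ports are examples.
servers = [
    redis.Redis(host="localhost", port=6379),
    redis.Redis(host="localhost", port=6380),
    redis.Redis(host="localhost", port=6381),
]

def server_for(key):
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]

server_for("ngram:foo bar").incr("ngram:foo bar")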
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since you are storing the results of data processing into Redis, it seems. To implement this with redis-py, periodically obtain a pipeline handle to an existing Redis connection object using the .pipeline() method, invoke multiple store commands against that handle just as you would against the primary redis.Redis connection object, then invoke .execute() to block on the replies. You can get orders-of-magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() on the pipeline handle.
If you apply both changes, and each worker thread communicates to just one server, pipelining multiple commands together (at least 5-10 to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The cpython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing by using the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.
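A short sketch of that pipelining pattern with redis-py; the batch size, key prefix and the shape of the counts iterable are assumptions for illustration:

import redis

r = redis.Redis(host="localhost", port=6379)

def store_counts(counts, batch_size=100):
    # counts: iterable of (ngram, count) pairs -- an assumed data shape
    pipe = r.pipeline()
    pending = 0
    for ngram, count in counts:
        pipe.incrby("ngram:" + ngram, count)
        pending += 1
        if pending >= batch_size:
            pipe.execute()   # one network round trip for the whole batch
            pending = 0
    if pending:
        pipe.execute()       # flush any remainder

store_counts([("foo bar", 3), ("baz qux", 1)])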

Multiple simultaneous TCP client connections for performance testing

I need to create multiple TCP connections simultaneously to a custom TCP server application for performance testing. I know of a lot of such tools for the web (e.g. curl-loader, based on libcurl), but I haven't found a general-purpose one.
The scenario for each client is the simplest possible: create a connection, send special data, read the answer and close the connection. At every step a timestamp is taken, and all timestamps should be written to a file for further calculations. I need about 10,000 such connections in parallel.
I'd prefer a ready-made solution, but I found nothing on Google, so I'm ready to write this myself in Python. If so, can you recommend suitable Python modules that could produce this number of connections? (multiprocessing, Twisted..?)
My two cents:
Go for Twisted, or any other asynchronous networking library.
Make sure you can open enough file descriptors on the client and on the server. On my Linux box, for instance, I can have no more than 1024 file descriptors by default:
carlos@marcelino:~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
[..]
open files (-n) 1024
[..]
It may pay to run the client and the server on different machines.
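If you end up writing it yourself, a rough sketch of the scenario with the standard library's asyncio (as an alternative to Twisted) might look like the following; the host, port, payload and connection count are placeholders:

import asyncio
import time

HOST, PORT = "127.0.0.1", 9000        # placeholder address of the server under test
N_CONNECTIONS = 10000                 # remember to raise ulimit -n accordingly
PAYLOAD = b"hello\n"                  # placeholder for the "special data"

async def one_client(i, results):
    t_connect = time.time()
    reader, writer = await asyncio.open_connection(HOST, PORT)
    t_send = time.time()
    writer.write(PAYLOAD)
    await writer.drain()
    await reader.read(4096)           # read the answer
    t_recv = time.time()
    writer.close()
    await writer.wait_closed()
    t_close = time.time()
    results.append((i, t_connect, t_send, t_recv, t_close))

async def main():
    results = []
    await asyncio.gather(*(one_client(i, results) for i in range(N_CONNECTIONS)))
    with open("timestamps.csv", "w") as f:   # timestamps for later calculations
        for row in results:
            f.write(",".join(str(x) for x in row) + "\n")

asyncio.run(main())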
Handling 10k connections is a tough problem (known as the C10K problem). If you need real numbers, stick with C++ (Boost/POCO libraries or OS-native APIs), or distribute the load across 10 load-generating client machines.
No way should you try this with Python (handling 10,000 connections on 1 CPU core is not realistic).

Evaluate my Python server structure

I'm building a game server in Python and I just wanted to get some input on the architecture of the server that I was thinking up.
So, as we all know, Python cannot scale across cores with a single process. Therefore, on a server with 4 cores, I would need to spawn 4 processes.
Here are the steps taken when a client wishes to connect to the server cluster:
The IP the client initially communicates with is the Gateway node. The gateway keeps track of how many clients are on each machine, and forwards the connection request to the machine with the lowest client count.
On each machine, there is one Manager process and X Server processes, where X is the number of cores on the processor (since Python cannot scale across cores, we need to spawn one process per core to use 100% of a quad-core processor).
The manager's job is to keep track of how many clients are on each process, as well as to restart processes if any of them crash. When a connection request is sent from the gateway to a manager, the manager looks at its server processes on that machine (3 in the diagram) and forwards the request to whichever process has the fewest clients.
The Server process is what actually does the communicating with the client.
Here is what a 3 machine cluster would look like. For the sake of the diagram, assume each node has 3 cores.
(Diagram: http://img152.imageshack.us/img152/5412/serverlx2.jpg)
This also got me thinking - could I implement hot swapping this way? Since each process is controlled by the manager, when I want to swap in a new version of the server process I just let the manager know that it should not send any more connections to it, and then I will register the new version process with the old one. The old version is kept alive as long as clients are connected to it, then terminates when there are no more.
Phew. Let me know what you guys think.
Sounds like you'll want to look at PyProcessing, now included in Python 2.6 and beyond as multiprocessing. It takes care of a lot of the machinery of dealing with multiple processes.
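As a very rough sketch of the one-process-per-core part (the port numbers and the serve placeholder are hypothetical):

import multiprocessing

def serve(port):
    # Placeholder: a real Server process would bind to its port and
    # handle the clients the Manager forwards to it.
    print("server process listening on port", port)

if __name__ == "__main__":
    n_cores = multiprocessing.cpu_count()
    workers = [
        multiprocessing.Process(target=serve, args=(9000 + i,))
        for i in range(n_cores)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()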
An alternative architectural model is to setup a work queue using something like beanstalkd and have each of the "servers" pull jobs from the queue. That way you can add servers as you wish, swap them out, etc, without having to worry about registering them with the manager (this is assuming the work you're spreading over the servers can be quantified as "jobs").
Finally, it may be worthwhile to build the whole thing on HTTP and take advantage of existing well known and highly scalable load distribution mechanisms, such as nginx. If you can make the communication HTTP based then you'll be able to use lots of off-the-shelf tools to handle most of what you describe.
