Reasonable settings for ZODB pool_size - python

What's a reasonable default for pool_size in a ZODB.DB call in a multi-threaded web application?
Leaving the default value of 7 gives me some connection WARNINGs even when I'm the only one navigating through db-interacting handlers. Is it possible to set a number that's too high? What factors play into deciding what exactly to set it to?

The pool size is only a 'guideline'; the warning is logged when you exceed that size, and if you were to use double that number of connections a CRITICAL message would be logged instead. These messages are there to indicate that you may be using too many connections in your application.
The pool will try to reduce the number of retained connections to the pool size as you close connections.
You need to set it to the maximum number of threads in your application. For Tornado, which I believe uses asynchronous events instead of threading almost exclusively, that might be harder to determine; if there is a maximum number of concurrent connections configurable in Tornado, then the pool size needs to be set to that number.
I am not sure how the ZODB will perform when your application scales to hundreds or thousands of concurrent connections, though. I've so far only used it with at most 100 or so concurrent connections spread across several processes and even machines (using ZEO or RelStorage to serve the ZODB across those processes).
I'd say that if most of these connections only read, you should be fine; it's writing on the same object concurrently that is ZODB's weak point as far as scalability is concerned.
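For illustration, here is a minimal sketch of wiring pool_size to the thread count, assuming a FileStorage-backed database; the thread count of 20 is illustrative, not a recommendation:

    # A minimal sketch, assuming a FileStorage-backed database; THREAD_COUNT
    # is illustrative and should match your server's worker-thread count.
    import ZODB
    import ZODB.FileStorage

    THREAD_COUNT = 20  # hypothetical: your WSGI server's maximum worker threads

    storage = ZODB.FileStorage.FileStorage('Data.fs')
    db = ZODB.DB(storage, pool_size=THREAD_COUNT)

    # Each request/thread opens its own connection and returns it to the pool on close.
    conn = db.open()
    try:
        root = conn.root()
        # ... read or write persistent objects here ...
    finally:
        conn.close()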

Related

tornado websocket server - connections queue

I have a tornado.websocket.WebSocketHandler which processes data. The idea is to instantiate a limited number of handlers (e.g. so they are bounded by the number of CPU cores). I would like to put the rest of the connections in a queue (as soon as they are opened) so one of them is activated when another finishes.
I was trying to do that via threading.Semaphore, but it seems that tornado socket handlers run in a single thread, so everything just hangs. How can I achieve that?
Tornado has its own asynchronous semaphore class in tornado.locks.Semaphore.
Tornado is designed to make connections very cheap, so one connection per core would be an extremely low limit. I suggest not limiting the number of connections per se, but limiting what you do with those connections. (And remember the GIL: unless you're calling out to C extensions for your CPU-intensive work, you can't make use of multiple CPU cores from Python anyway.) Doing your CPU-intensive work on a bounded ThreadPoolExecutor may be the best way to do what it sounds like you're trying to do.
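As a rough sketch of that suggestion (the DataHandler class and process_data function are hypothetical, not from the question): a tornado.locks.Semaphore bounds how much work runs at once, and the CPU-heavy part runs on a bounded ThreadPoolExecutor so the IOLoop is never blocked.

    # Hypothetical handler illustrating the idea; not the asker's code.
    from concurrent.futures import ThreadPoolExecutor

    import tornado.ioloop
    import tornado.locks
    import tornado.websocket

    MAX_WORKERS = 4                                   # e.g. number of cores
    executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)
    semaphore = tornado.locks.Semaphore(MAX_WORKERS)  # asynchronous, unlike threading.Semaphore

    def process_data(message):
        # placeholder for the CPU-intensive work (ideally a C extension that releases the GIL)
        return message.upper()

    class DataHandler(tornado.websocket.WebSocketHandler):
        async def on_message(self, message):
            async with semaphore:                     # excess connections wait here without blocking the IOLoop
                result = await tornado.ioloop.IOLoop.current().run_in_executor(
                    executor, process_data, message)
            await self.write_message(result)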

In Celery are there significant performance implications of using many queues

Are there substantial performance implications that I should keep in mind when Celery workers are pulling from multiple (or perhaps many) queues? For example, would there be a significant performance penalty if my system were designed so that workers pulled from 10 to 15 queues rather than just 1 or 2? As a follow-up, what if some of those queues are sometimes empty?
The short answer to your question on queue limits is:
Don't worry: having multiple queues will not be worse or better; brokers are designed to handle huge numbers of them. Of course, in a lot of use cases you don't need that many, except for really advanced ones. Empty queues don't create any problem; they just take a tiny amount of memory on the broker.
Don't forget that you also have other things like exchanges and bindings. There are no real limits on those either, but it is better to understand the performance implications of each of them before using them (a topic exchange will use more CPU than a direct one, for example).
To give you a more complete answer let's look at the performance topic from a more generic point of view.
When looking at a distributed system based on message passing like Celery there are 2 main topics to analyze from the point of view of performance:
The number of workers and the concurrency factor.
As you probably already know, each Celery worker has a concurrency parameter that sets how many tasks can be executed at the same time. This should be set in relation to the server's capacity (CPU, RAM, I/O) and, of course, also based on the type of tasks that the specific consumer will execute (which depends on the queue it consumes from).
Of course, depending on the total number of tasks you need to execute in a certain time window, you will need to decide how many workers/servers to have up and running.
The broker, the single point of failure in this architecture style.
The broker, especially RabbitMQ, is designed to manage millions of messages without any problem; however, the more messages it has to store, the more memory it will use, and the more messages it has to route, the more CPU it will use.
This machine should be well tuned too and, if possible, run in a high-availability setup.
Of course, the main thing to avoid is messages being consumed at a lower rate than they are produced; otherwise your queues will keep growing and your RabbitMQ will explode. Here you can find some hints.
There are cases where you may also need to increase the number of tasks executed in a certain time frame, but only in response to peaks of requests. The nice thing about this architecture is that you can monitor the size of the queues, and when you see one growing too fast you can create new machines on the fly with a Celery worker already configured, then turn them off when they are no longer needed. This is quite a cost-saving and efficient approach.
One hint: remember not to store Celery task results in RabbitMQ.
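For concreteness, a hedged sketch of declaring several queues and routing tasks to them; the project, broker URL, queue, and task names are made up for illustration.

    # Hypothetical Celery configuration with several queues and routing rules.
    from celery import Celery
    from kombu import Queue

    app = Celery('proj', broker='amqp://guest@localhost//')

    app.conf.task_queues = (
        Queue('default'),
        Queue('emails'),
        Queue('reports'),
    )
    app.conf.task_default_queue = 'default'
    app.conf.task_routes = {
        'proj.tasks.send_email': {'queue': 'emails'},
        'proj.tasks.build_report': {'queue': 'reports'},
    }

A worker can then be pointed at any subset of queues, e.g. celery -A proj worker -Q default,emails, and a queue that happens to be empty costs essentially nothing beyond the small amount of broker memory mentioned above.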

How does a python web server overcomes GIL

Any web server might have to handle a lot of requests at the same time. Since the Python interpreter has the GIL constraint, how is concurrency implemented?
Do they use multiple processes and use IPC for state sharing?
You usually have many workers (e.g. gunicorn), each being dispatched independent requests. Everything else (concurrency-related) is handled by the database, so it is abstracted from you.
You don't need IPC; you just need a "single source of truth", which will be the RDBMS, a cache server (Redis, memcached), etc.
First of all, requests can be handled independently. However, servers want to handle them concurrently in order to keep the number of requests handled per unit of time at a maximum.
The implementation of this concept of concurrency depends on the webserver.
Some implementations may have a fixed number of threads or processes for handling requests. If all are in use, additional requests have to wait until being handled.
Another possibility is that a process or thread is spawned for each request. Spawning a process per request incurs a large memory and CPU overhead; spawning lightweight threads is better. Doing so, you can serve hundreds of clients per second. However, threads also bring their own management overhead, which manifests itself in high memory and CPU consumption.
For serving thousands of clients per second, an event-driven architecture based on asynchronous coroutines is a state-of-the-art solution. It enables the server to serve clients at a high rate without spawning zillions of threads. On the Wikipedia page of the so-called C10k problem you find a list of web servers. Among those, many make use of this architecture.
Coroutines are available for Python, too. Have a look at http://www.gevent.org/. That's why a Python WSGI app based on, e.g., uWSGI + gevent is an extremely performant solution.
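To make the gevent suggestion concrete, here is a minimal sketch of an event-driven WSGI app served by gevent's built-in server; the app itself is a placeholder, and in production you would more likely run it under uWSGI or gunicorn with gevent workers.

    # A minimal gevent-based WSGI sketch; the app body is illustrative only.
    from gevent import monkey
    monkey.patch_all()  # make blocking stdlib I/O cooperate with the event loop

    from gevent.pywsgi import WSGIServer

    def app(environ, start_response):
        # each request is handled in a cheap greenlet rather than an OS thread
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'hello\n']

    if __name__ == '__main__':
        WSGIServer(('127.0.0.1', 8000), app).serve_forever()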
As normal. Web serving is mostly I/O-bound, and the GIL is released during I/O operations. So either threading is used without any special accommodations, or an event loop (such as Twisted) is used.

configuring connection-pool size with Andy McCurdy's python-for-redis library

I have a Python loader using Andy McCurdy's python library that opens multiple Redis DB connections and sets millions of keys, looping through files whose lines each contain an integer that is the Redis DB number for that record. Altogether, only 20 databases are open at the present time, but eventually there may be as many as 100 or more.
I notice that the Redis log (set to verbose) always tells me there are "4 clients connected (0 slaves)", even though I know that my 20 are open and are being used.
So I'm guessing this is about the connection pooling support built into the python library. Am I correct in that guess? If so, the real question is: is there a way to increase the pool size? I have plenty of machine resources, with a lot dedicated to Redis. Would increasing the pool size help performance as the number of virtual connections I'm making goes up?
At this point, I am actually hitting only ONE connection at a time, though I have many open, as I shuffle input records among them. But eventually there will be many scripts (2 dozen?) hitting Redis in parallel, mostly reading, and I am wondering what effect increasing the pool size would have.
Thanks
matthew
So I'm guessing this is about the connection pooling support built into the python library. Am I correct in that guess?
Yes.
If so the real question is is there a way to increase the pool size
Not needed; by default it will grow the pool up to 2**31 connections (in Andy's library), so your connections are idle anyway.
If you want to increase performance, you will need to change the application using redis.
and I am wondering what effect increasing the pool size would have.
None, at least not in this case.
If Redis becomes the bottleneck at some point and you have a multi-core server, you must run multiple Redis instances to increase performance, as each instance runs on a single core. When you run multiple instances and are doing mostly reads, the slave feature can increase performance, since the slaves can serve all the reads.
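For completeness, if you ever do want to control the pool explicitly (for example to share one pool between clients or to cap it), redis-py lets you build it yourself. A minimal sketch, with illustrative host/port and max_connections values:

    # A minimal sketch assuming redis-py; the values here are illustrative.
    import redis

    pool = redis.ConnectionPool(host='localhost', port=6379, db=0,
                                max_connections=50)
    r = redis.Redis(connection_pool=pool)

    r.set('example:key', 'value')
    print(r.get('example:key'))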

Python/Redis Multiprocessing

I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three Redis servers (which sit completely in memory). But for some reason all 4 CPU cores sit around 60% idle the whole time. The server has plenty of RAM and iotop shows that there is no disk I/O happening.
I have 4 Python threads and 3 Redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of Redis operations on each server is well below what it's benchmarked as capable of.
I can't find the bottleneck in this program. What would be the likely candidates?
Network latency may be contributing to your idle CPU time in your Python client application. If the network latency between client and server is even as little as 2 milliseconds, and you perform 10,000 Redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
Using multiple python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniformly random distributed access across the servers to service your requests (by hashing on key names to implement sharding or partitioning), the odds that random requests hash to the same Redis server are inversely proportional to the number of servers. For 1 server, 100% of the time you will hash to the same server; for 2 it's 50% of the time; for 3 it's 33% of the time. What may be happening is that 1/3 of the time, all of your threads are blocked waiting for the same server. Redis is single-threaded at handling data operations, so it must process each request one after another. Your observation that your CPU only reaches 60% utilization agrees with the probability that your requests are all blocked on network latency to the same server.
Continuing the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection, and evaluate the partitioning hash before passing a request to a worker thread. This will ensure all threads are waiting on different network latency. But there may be an even better improvement by using pipelining.
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since you are storing the results of data processing into Redis, it seems. To implement this with redis-py, periodically obtain a pipeline handle from an existing Redis connection object using the .pipeline() method, invoke multiple store commands against that handle just as you would against the primary redis.Redis connection object, and then invoke .execute() to block on the replies. You can get orders-of-magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() on the pipeline handle.
If you apply both changes, and each worker thread communicates to just one server, pipelining multiple commands together (at least 5-10 to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The cpython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing by using the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.
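As a concrete sketch of that batching idea, assuming redis-py; the key names, counter scheme, and batch size are made up, not taken from the question:

    # Hypothetical example of batching counter updates with a redis-py pipeline.
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    BATCH_SIZE = 100
    pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC wrapper

    words = ['foo', 'bar', 'baz'] * 1000  # stand-in for the parsed XML tokens
    for i, word in enumerate(words, start=1):
        pipe.hincrby('wordcounts', word, 1)
        if i % BATCH_SIZE == 0:
            pipe.execute()   # one network round trip for the whole batch

    pipe.execute()           # flush any remaining commands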
