I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three redis servers (which sit completely in memory). But for some reason all 4 CPU cores sit around 60% idle the whole time. The server has plenty of RAM, and iotop shows that there is no disk IO happening.
I have 4 python threads and 3 redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of redis operations on each server is well below what it's benchmarked as capable of.
I can't find the bottleneck in this program. What are the likely candidates?
Network latency may be contributing to your idle CPU time in your python client application. If the network latency between client and server is even as little as 2 milliseconds, and you perform 10,000 redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
Using multiple python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniformly distributed access across the servers (by hashing on key names to implement sharding or partitioning), then the odds that two random requests hash to the same redis server are inversely proportional to the number of servers. For 1 server, 100% of the time you will hash to the same server; for 2 it's 50% of the time; for 3 it's 33% of the time. What may be happening is that, much of the time, several of your threads end up blocked waiting for the same server. Redis is single-threaded at handling data operations, so it must process each request one after another. Your observation that the CPU cores sit largely idle agrees with your requests frequently being blocked on network latency to the same server.
Continuing the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection and evaluating the partitioning hash before passing a request to a worker thread. This ensures all threads are waiting on different servers' network latency; a sketch of the idea follows below. But there may be an even better improvement from using pipelining.
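For illustration, here is a minimal sketch of that partitioning idea using redis-py; the hostnames, ports, key names and the crc32-based hash are assumptions for the example, not a statement about your actual setup:

import zlib
import redis

# One connection per shard; in practice each worker thread would own
# exactly one of these instead of sharing the list.
SERVERS = [
    redis.Redis(host="localhost", port=6379),
    redis.Redis(host="localhost", port=6380),
    redis.Redis(host="localhost", port=6381),
]

def server_for(key):
    # Evaluate the partitioning hash up front, before the request is
    # handed to whichever worker owns that server connection.
    return SERVERS[zlib.crc32(key.encode()) % len(SERVERS)]

server_for("ngram:quick brown").incr("ngram:quick brown")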
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since it seems you are storing the results of data processing into redis. To implement this using redis-py, periodically obtain a pipeline handle from an existing redis connection object using the .pipeline() method and invoke multiple store commands against that new handle the same as you would for the primary redis.Redis connection object. Then invoke .execute() to block on the replies. You can get orders of magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() call on the pipeline handle.
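As a rough sketch of that pattern (the counter keys here are made up; transaction=False just skips the MULTI/EXEC wrapper, since you only need batching):

import redis

r = redis.Redis(host="localhost", port=6379)

pipe = r.pipeline(transaction=False)
for word in ("the", "quick", "brown", "fox"):
    pipe.incr("wordcount:" + word)   # queued locally, nothing sent yet
results = pipe.execute()             # one network round trip for the whole batch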
If you apply both changes, so that each worker thread communicates with just one server and pipelines multiple commands together (at least 5-10 per batch to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The CPython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing via the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.
Related
Hey, I am pretty new to this community and I was wondering if this is possible.
For example, I have a ThreadPoolExecutor with:
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Semaphore

lock = Semaphore(1)

# Just a pseudocode example
profileTasks = ["TEST1", "TEST2", "TEST3", "TEST4"]  # ... and many more
threads = 4  # assumed value; 'threads' was not defined in the original snippet

def runTask(index, profile):
    lock.acquire()
    print(f"{index} with {profile}")
    lock.release()

runningLoop = True
while runningLoop:
    # Launch anyway with the ThreadPoolExecutor
    tasks = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        for index, profile in enumerate(profileTasks):
            tasks.append(
                executor.submit(runTask, index, profile)
            )
    runningLoop = False
When I launch more than 100 tasks, for instance, the threads take very long to start from the executor. I want to split the workload: if I run 1000 tasks on a CPU with, say, 8 cores, I want to split the 1000 tasks across 8 processes, each of which runs its own thread pool executor.
I hope you understand what I mean. Threading in Python is in general not particularly clever, because it only uses 1 CPU core.
I tried counting the CPU cores and executing it in a MultiProcessExecutor, but it was a complete failure and froze my CPU.
The most widely used Python implementation (the one from python.org, commonly referred to as CPython because it is written in C) enforces that only one thread at a time can be executing Python bytecode.
So using threads to speed up computationally intensive applications does not work with this implementation.
If you want to use multiple cores for running the same job, you have to use e.g. a multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. The latter is built on top of multiprocessing.
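As a minimal sketch of combining the two, re-using the runTask and profileTasks names from your snippet (the chunk size and worker counts are arbitrary assumptions, not tuned values): each process gets a slice of the tasks and runs its own thread pool over that slice.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

profileTasks = [f"TEST{i}" for i in range(1000)]

def runTask(index, profile):
    print(f"{index} with {profile}")

def runChunk(chunk):
    # Each worker process runs its own thread pool over its slice of the tasks.
    with ThreadPoolExecutor(max_workers=25) as executor:
        for index, profile in chunk:
            executor.submit(runTask, index, profile)

if __name__ == "__main__":
    indexed = list(enumerate(profileTasks))
    workers = 8                                  # e.g. one process per core
    size = len(indexed) // workers + 1
    chunks = [indexed[i:i + size] for i in range(0, len(indexed), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        pool.map(runChunk, chunks)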
Based on your comments, if your task is to send HTTP or other network traffic, then a ThreadPoolExecutor might be more appropriate.
That is because I/O (be it disk or network) in CPython does not suffer from the aforementioned restriction. In technical terms, the Global Interpreter Lock in CPython is released during I/O, giving other threads time to run.
However, network I/O has its own problems.
If you look at performance:
The CPU running instructions and data from its cache is the fastest. (In order to keep this simple, I will not distinguish between bandwidth and latency here.)
If the CPU has to get data or instructions from memory, that is much slower than from the cache.
Disk I/O (especially HDD) is much slower than memory.
Network I/O is generally much slower than disk.
For example, when writing data (from /dev/zero to a file on disk) I've observed speeds of ≈200 MB/s on a SATA 3 hard disk using ZFS.
When using netcat to blast files from one computer to the other over a gigabit point-to-point ethernet link with no other traffic (probably the best possible case for consumer equipment at this time), I get a maximum of ≈120 MB/s. When downloading a video from the internet, for example, I might get on the order of ≈12 MB/s at most.
If you want to run 1000 simultaneous network queries, a couple of things can happen:
You could saturate your internet connection. Instead of the tasks competing for CPU time, they are now competing for network bandwidth. This improves neither throughput nor latency.
Your ISP might restrict throughput.
If all the queries go to the same domain, you might trigger a denial-of-service warning, and that domain's firewall may block or restrict your connections.
In short: running 1000 queries at the same time from a single IP address is probably not a good idea.
I have built a way to distribute a verification-based application over multiple machines, using the following excellent blog post as a starting point:
https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
…which itself is based on the Python multiprocessing docs (see the section on managers, remote managers and proxy objects):
https://docs.python.org/3/library/multiprocessing.html
However, for my application at least, I have noticed that the networking costs are not constant, but rather grow (at first glance linearly) with the overall verification time, to the point where the cost is no longer acceptable (from a handful of seconds to hundreds). Is there a way to minimise these costs, and what is most likely to be driving them?
In my setup, the server machine opens a server at a given port. The clients then connect and read jobs from a single jobs queue and put finished/solved jobs to a single reporting queue - these queues are registered multiprocessing.Queue() queues. When a result to the verification problem is returned, or the queue is empty, the server sends a multiprocessing.Event() signal to the clients, allowing them to terminate. With clients terminated, the server/controller machine then shuts down the server.
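Roughly, the setup looks like the sketch below, in the spirit of that blog post (the address, port, authkey and queue names here are placeholders, not my real code). Every get()/put() on these proxy queues goes through the manager process as a separate pickled message exchange.

# controller/server side
import queue
from multiprocessing.managers import BaseManager

job_queue = queue.Queue()
report_queue = queue.Queue()

class JobManager(BaseManager):
    pass

JobManager.register("get_job_queue", callable=lambda: job_queue)
JobManager.register("get_report_queue", callable=lambda: report_queue)

manager = JobManager(address=("", 50000), authkey=b"secret")
server = manager.get_server()
server.serve_forever()   # clients connect, pull jobs, push results

# client/worker side (on another machine), sketched as comments:
#   class JobManager(BaseManager): pass
#   JobManager.register("get_job_queue")
#   JobManager.register("get_report_queue")
#   m = JobManager(address=("controller-host", 50000), authkey=b"secret")
#   m.connect()
#   jobs, reports = m.get_job_queue(), m.get_report_queue()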
In terms of costs, I can think of the following:
the cost of opening the server (paid once)
the cost of keeping the server ‘open’ (is this a real cost?)
the cost of accepting connections from clients (paid once per client)
the cost of reading/writing from/to the serialized queues (variable)
In terms of the last cost, writing to the jobs queue by the controller machine is performed only once (though the number of items can vary). The amount a client reads from the jobs queue and writes to the reporting queue is variable, but tends to occur only a handful of times before the overall verification problem is complete, as opposed to hundreds of times.
Any web server might have to handle a lot of requests at the same time. Since the Python interpreter has the GIL constraint, how is concurrency implemented?
Do they use multiple processes and use IPC for state sharing?
You usually have many workers (e.g. with gunicorn), each being dispatched independent requests. Everything else (concurrency related) is handled by the database, so it is abstracted from you.
You don't need IPC; you just need a "single source of truth", which will be the RDBMS, a cache server (redis, memcached), etc.
First of all, requests can be handled independently. However, servers want to handle them simultaneously in order to keep the number of requests handled per unit of time at a maximum.
The implementation of this concept of concurrency depends on the webserver.
Some implementations may have a fixed number of threads or processes for handling requests. If all are in use, additional requests have to wait until being handled.
Another possibility is that a process or thread is spawned for each request. Spawning a process for each request leads to absurd memory and CPU overhead. Spawning lightweight threads is better; doing so, you can serve hundreds of clients per second. However, threads also bring their own management overhead, manifesting itself in high memory and CPU consumption.
For serving thousands of clients per second, an event-driven architecture based on asynchronous coroutines is a state-of-the-art solution. It enables the server to serve clients at a high rate without spawning zillions of threads. On the Wikipedia page of the so-called C10k problem you will find a list of web servers, many of which make use of this architecture.
Coroutines are available for Python, too. Have a look at http://www.gevent.org/. That's why a Python WSGI app based on e.g. uWSGI + gevent is an extremely performant solution.
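For illustration, a minimal gevent-based WSGI server (the port and response body are arbitrary); a single process like this can multiplex many slow clients on one core:

from gevent.pywsgi import WSGIServer

def application(environ, start_response):
    # Each request runs in a lightweight greenlet, not an OS thread.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

WSGIServer(("0.0.0.0", 8000), application).serve_forever()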
As normal. Web serving is mostly I/O-bound, and the GIL is released during I/O operations. So either threading is used without any special accommodations, or an event loop (such as Twisted) is used.
I am getting tweets at an extremely fast rate from a long-lived connection to the Twitter Streaming API server. I proceed by doing some heavy text processing and saving the tweets in my database.
I am using PyCurl for the connection, with a callback function that takes care of the text processing and saving to the database. See below my approach, which is not working properly.
I am not familiar with network programming, so would like to know:
How can I use threads, a Queue, or the Twisted framework to solve this problem?
import pycurl

def process_tweet(data):
    # do some heavy text processing on the received chunk
    pass

def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but should be higher if you are using httplib - the idea being you want to be able to have more than one query on the Twitter API at a time, so there is a steady amount of work to process.
When each Tweet arrives, it is pushed onto a Queue.Queue. The Queue ensures that there is thread-safety in the communications - each tweet will only be handled by one worker thread.
A pool of worker threads is responsible for reading from the Queue and dealing with the Tweet. Only the interesting tweets should be added to the database.
As the database is probably the bottleneck, there is a limit to the number of threads in the pool that are worth adding - more threads won't make it process faster, it'll just mean more threads are waiting in the queue to access the database.
This is a fairly common Python idiom. This architecture will scale only to a certain degree - i.e. what one machine can process.
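A bare-bones sketch of that idiom, with a made-up save_if_interesting() helper standing in for the database logic and an arbitrary pool size (queue.Queue is the Python 3 name for Queue.Queue):

import queue
import threading

tweet_queue = queue.Queue()

def save_if_interesting(tweet):
    # placeholder for the heavy text processing + database insert
    pass

def worker():
    while True:
        tweet = tweet_queue.get()        # blocks until a tweet is available
        try:
            save_if_interesting(tweet)
        finally:
            tweet_queue.task_done()

# A small pool is enough if the database is the bottleneck.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# The receiving thread (e.g. the pycurl write callback) just does:
#   tweet_queue.put(raw_tweet_data)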
Here's a simple setup if you are OK with using a single machine.
1 thread accepts connections. After a connection is accepted, it passes the accepted connection to another thread for processing.
You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.
If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database, then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low level networking to tell you how to do that :))
I suggest this organization:
one process reads Twitter, stuffs tweets into database
one or more processes read the database, process each tweet, and insert results into a new database; original tweets are either deleted or marked as processed.
That is, you have two or more processes/threads. The tweet database can be seen as a queue of work. Multiple worker processes take jobs (tweets) off the queue and create data in the second database.
I'm building a game server in Python and I just wanted to get some input on the architecture of the server that I was thinking up.
So, as we all know, Python cannot scale across cores with a single process. Therefore, on a server with 4 cores, I would need to spawn 4 processes.
Here are the steps taken when a client wishes to connect to the server cluster:
The IP the client initially communicates with is the Gateway node. The gateway keeps track of how many clients are on each machine, and forwards the connection request to the machine with the lowest client count.
On each machine, there is one Manager process and X Server processes, where X is the number of cores on the processor (since Python cannot scale across cores, we need to spawn 4 processes to use 100% of a quad-core processor).
The manager's job is to keep track of how many clients are on each process, as well as to restart the processes if any of them crash. When a connection request is sent from the gateway to a manager, the manager looks at its server processes on that machine (3 in the diagram) and forwards the request to whichever process has the fewest clients.
The Server process is what actually does the communicating with the client.
Here is what a 3 machine cluster would look like. For the sake of the diagram, assume each node has 3 cores.
(diagram of the three-machine cluster: http://img152.imageshack.us/img152/5412/serverlx2.jpg)
This also got me thinking - could I implement hot swapping this way? Since each process is controlled by the manager, when I want to swap in a new version of the server process I just let the manager know that it should not send any more connections to it, and then I will register the new version process with the old one. The old version is kept alive as long as clients are connected to it, then terminates when there are no more.
Phew. Let me know what you guys think.
Sounds like you'll want to look at PyProcessing, now included in Python 2.6 and beyond as multiprocessing. It takes care of a lot of the machinery of dealing with multiple processes.
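As a tiny, hedged sketch of the process-per-core part only (the serve() loop here is a stand-in for your Server process, not a real implementation):

import multiprocessing

def serve(worker_id):
    # stand-in for a Server process: accept connections, talk to clients
    print("server process %d started" % worker_id)

if __name__ == "__main__":
    processes = [multiprocessing.Process(target=serve, args=(i,))
                 for i in range(multiprocessing.cpu_count())]
    for p in processes:
        p.start()
    for p in processes:
        p.join()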
An alternative architectural model is to setup a work queue using something like beanstalkd and have each of the "servers" pull jobs from the queue. That way you can add servers as you wish, swap them out, etc, without having to worry about registering them with the manager (this is assuming the work you're spreading over the servers can be quantified as "jobs").
Finally, it may be worthwhile to build the whole thing on HTTP and take advantage of existing well known and highly scalable load distribution mechanisms, such as nginx. If you can make the communication HTTP based then you'll be able to use lots of off-the-shelf tools to handle most of what you describe.