Python threads seem to freeze the machine

I'm using Python threads to resolve website IP addresses. Below is my worker function for the resolving; it runs as a daemon thread.
import socket

def get_ip_worker():
    """Worker (daemon thread) for resolving IPs: takes a domain from domains_q,
    processes it, and saves the result to processed_q."""
    socket.setdefaulttimeout(3)
    while True:
        domain = domains_q.get()
        try:
            addr_info = socket.getaddrinfo(domain, 80, 0, 0, socket.SOL_TCP)
            for family, socktype, proto, name, sockaddr in addr_info:
                if family == socket.AF_INET:    # okay, it's IPv4: sockaddr is (ip, port)
                    ip, port = sockaddr
                    processed_q.put((ip, domain))
                elif family == socket.AF_INET6: # okay, it's IPv6: sockaddr is (ip, port, flowinfo, scopeid)
                    ip, port, flowinfo, scopeid = sockaddr
                    processed_q.put((ip, domain))
        except socket.error:
            pass  # print 'Socket Error'
        domains_q.task_done()
EDIT: the line domain = domains_q.get() blocks until an item is available in the Queue.
The problem comes when I run this with 300 threads: the load average seems okay, but a simple ls -la takes 5 seconds and everything is slow. Where did I go wrong? Should I use async I/O or multiprocessing?

Do you really need to process 300 connections in parallel with 300 threads? I have never tried creating that many threads, but it may well be a problem, and it is definitely not a good way of solving this; there are usually better options. First, you do not need 300 threads to service 300 connections. Create a number of threads that works well on your hardware and OS, use a single thread to retrieve requests from the main queue, and hand them to a thread from a thread pool (see the sketch below).
BTW, check whether your "retrieve from a queue" operation really blocks and waits while the queue is empty. If it doesn't, the loops will spin continuously whether or not there are incoming requests.
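A minimal sketch of that fixed-size pool, reusing the domains_q/processed_q queues and the getaddrinfo call from the question (the worker count of 50 is just an assumption to tune for your hardware):

import socket
import threading
import Queue  # 'queue' on Python 3

NUM_WORKERS = 50  # far fewer than one thread per domain

socket.setdefaulttimeout(3)
domains_q = Queue.Queue()
processed_q = Queue.Queue()

def worker():
    while True:
        domain = domains_q.get()  # blocks while the queue is empty, no busy-waiting
        try:
            for family, _, _, _, sockaddr in socket.getaddrinfo(domain, 80, 0, 0, socket.SOL_TCP):
                processed_q.put((sockaddr[0], domain))  # sockaddr[0] is the IP for both v4 and v6
        except socket.error:
            pass
        finally:
            domains_q.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()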
What you may really need is non-blocking mode for your sockets and something like select.select() to wait until one of them is ready for reading or writing. You can write that code yourself; if you are not eager to, a good asynchronous networking library like gevent (or twisted) can help improve the architecture of your program. Utilizing the full power of multicore CPUs is a separate question, but I've heard there are solutions, at least for gevent (they are based on gunicorn, which runs several processes; I have never tried it). But I think you are experiencing problems not with execution speed, but with the need to wait efficiently for I/O on many objects at a time. If so, avoid massive use of threads for that purpose; it is usually ineffective not only in Python, but even in languages without a GIL that are better suited to multithreaded programming. multiprocessing avoids the GIL but adds its own execution costs, so I would suggest not using it here.
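If you go the gevent route, a version of the resolver from the question might look like the sketch below. Assumptions: monkey.patch_all() also patches getaddrinfo so DNS lookups cooperate with gevent, gevent.pool.Pool caps the concurrency, and the domain list is illustrative.

from gevent import monkey; monkey.patch_all()  # must run before anything uses socket
import socket
from gevent.pool import Pool

def resolve(domain):
    try:
        return domain, socket.getaddrinfo(domain, 80, 0, 0, socket.SOL_TCP)
    except socket.gaierror:
        return domain, None

pool = Pool(300)  # 300 concurrent lookups, all on a single OS thread
domains = ['example.com', 'example.org']  # illustrative
for domain, info in pool.imap_unordered(resolve, domains):
    print('%s resolved: %s' % (domain, info is not None))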

Related

Python - Querying/Controlling multiple hosts across WAN link?

We have several data-centres located in several countries (Japan, Hong Kong, Singapore etc.).
We run applications on multiple hosts at each of these locations - probably around 50-100 hosts in total.
I'm working on a Python script that queries the status of each application, sends various triggers to them, and retrieves other things from them during runtime. This script could conceivably query a central server, which would then send the request to an agent running on each host.
One of the requirements is that the script is as responsive as possible - e.g. if I query the status of applications on all hosts in all locations, I would like the result within 1-3 seconds, as opposed to 20-30 seconds.
Hence, querying each host sequentially would be too slow, particularly considering the WAN hops we'd need to make.
We can assume that the query on each host itself is fairly trivial (e.g. is process running or not).
I'm fairly new to concurrent programming or asynchronous programming, so would value any input at all here. What is the "best" approach to tackling this problem?
Use a multi-threaded or multi-process approach - e.g. spawn a new thread for each host, send them all out, then wait for replies?
Use asyncore, twisted, tornado - any comments on which if any are suitable here? (I get the impression that asyncore isn't that popular. Tornado might be fun to try, but not sure how it could be used here?)
Use some kind of message queue (e.g. Kombu/RabbitMQ)?
Use celery, somehow? Would it be responsive enough for the response times we want (e.g. under 3 seconds, as above)?
Cheers,
Victor
Use gevent.
How?
from gevent import monkey; monkey.patch_socket()  # So anything socket-based now works asynchronously.
# This should be the first line of your code!
import gevent

def query_server(server_ip):
    pass  # do something with server_ip and sockets

server_ips = [....]

jobs = [gevent.spawn(query_server, server_ip) for server_ip in server_ips]
gevent.joinall(jobs)
print [job.result for job in jobs]
Why bother?
All your code will run in a single process and a single thread. This means you won't have to bother with locks, semaphores and message passing.
Your task seems to be mostly network-bound. Gevent will let you do network-bound work asynchronously, which means your code won't busy-wait on network connections, and instead will let OS notify it when the data is received.
It's a personal preference, but I think gevent is the easiest asynchronous library to use for one-off work (you don't have to start a reactor à la twisted, for example).
Will it work?
The overall response time will be the response time of your slowest server.
If using gevent doesn't do it, then you'll have to fix your network.
Use multiprocessing.Pool, especially the map() or map_async() methods.
Write a function that takes a single argument (e.g. the hostname, or a list/tuple of the hostname and other data). Let that function query a host and return the relevant data.
Now compile a list of input values (hostnames), and use multiprocessing.Pool.map() or multiprocessing.Pool.map_async() to execute the function in parallel. The async variant will start returning data sooner, but there is a limit to the amount of work you can do in the callback.
This will automatically use as many cores as your machine has to process the functions in parallel.
If there are network delays, however, there is not much the Python program can do about them.
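A sketch of that approach. The TCP connect to port 22 is just a stand-in for whatever query the real agent protocol would use, and the host names are hypothetical:

import socket
from multiprocessing import Pool

def check_host(hostname):
    """Return (hostname, is_up); a real version would query the agent instead."""
    try:
        conn = socket.create_connection((hostname, 22), timeout=3)
        conn.close()
        return hostname, True
    except socket.error:
        return hostname, False

if __name__ == '__main__':
    hosts = ['host-a.example.com', 'host-b.example.com']  # hypothetical
    pool = Pool(processes=20)  # for I/O-bound work, more workers than cores is fine
    for hostname, is_up in pool.map(check_host, hosts):
        print('%s: %s' % (hostname, 'up' if is_up else 'down'))
    pool.close()
    pool.join()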

Non-blocking, non-concurrent tasks in Python

I am working on an implementation of a very small library in Python that has to be non-blocking.
On some production code, at some point, a call to this library will be done and it needs to do its own work, in its most simple form it would be a callable that needs to pass some information to a service.
This "passing information to a service" is a non-intensive task, probably sending some data to an HTTP service or something similar. It also doesn't need to be concurrent or to share information, however it does need to terminate at some point, possibly with a timeout.
I have used the threading module before and it seems the most appropriate thing to use, but the application where this library will be used is so big that I am worried about hitting the threading limit.
On local testing I was able to hit that limit at around ~2500 threads spawned.
There is a good possibility (given the size of the application) that I could hit that limit easily. It also makes me wary of using a Queue, given the memory implications of placing tasks into it at a high rate.
I have also looked at gevent, but I couldn't find an example of spawning something that does some work and terminates without joining. The examples I went through were calling .join() on a spawned Greenlet or on an array of greenlets.
I don't need to know the result of the work being done! It just needs to fire off and try to talk to the HTTP service and die with a sensible timeout if it didn't.
Have I misinterpreted the guides/tutorials for gevent ? Is there any other possibility to spawn a callable in fully non-blocking fashion that can't hit a ~2500 limit?
This is a simple example in Threading that does work as I would expect:
from threading import Thread

class Synchronizer(Thread):
    def __init__(self, number):
        self.number = number
        Thread.__init__(self)

    def run(self):
        # Simulating some work
        import time
        time.sleep(5)
        print self.number

for i in range(4000):  # totally doesn't get past ~2,500
    sync = Synchronizer(i)
    sync.setDaemon(True)
    sync.start()
    print "spawned a thread, number %s" % i
And this is what I've tried with gevent, where it obviously blocks at the end to see what the workers did:
import gevent

def task(pid):
    """
    Some non-deterministic task
    """
    gevent.sleep(1)
    print('Task', pid, 'done')

for i in range(100):
    gevent.spawn(task, i)
EDIT:
My problem stemmed from my lack of familiarity with gevent. While the Thread code was indeed spawning threads, it also prevented the script from terminating while they did their work.
gevent doesn't do that in the code above unless you add a .join(). All I had to do to see the spawned greenlets do some work was to make the script a long-running process. That definitely fixes my problem, because the code that spawns the greenlets runs inside a framework that is itself a long-running process.
Nothing requires you to call join in gevent if you expect your main thread to last longer than any of your workers.
The only reason to call join is to make sure the main thread lasts at least as long as all of the workers (so that the program doesn't terminate early).
Why not spawn a subprocess with a connected pipe or similar and, instead of a callable, just drop your data on the pipe and let the subprocess handle it completely out of band?
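A rough sketch of that idea using multiprocessing; the SERVICE_URL endpoint and the send_to_service helper are hypothetical stand-ins for the real HTTP call:

import urllib2
from multiprocessing import Process, Queue

SERVICE_URL = 'http://localhost:8080/ingest'  # hypothetical endpoint

def send_to_service(payload):
    # Stand-in for the real HTTP call; errors are swallowed on purpose,
    # since the caller never waits for a result.
    try:
        urllib2.urlopen(SERVICE_URL, data=payload, timeout=3)
    except Exception:
        pass

def sender(q):
    """Runs in a separate process: drains the queue and talks to the service."""
    while True:
        payload = q.get()    # blocks until the main process sends something
        if payload is None:  # sentinel: shut down
            break
        send_to_service(payload)

if __name__ == '__main__':
    q = Queue()
    worker = Process(target=sender, args=(q,))
    worker.daemon = True  # don't let the worker keep the main program alive
    worker.start()
    q.put('event=something_happened')  # the caller never blocks on the HTTP work
    q.put(None)
    worker.join()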
As explained in Understanding Asynchronous/Multiprocessing in Python, the asyncoro framework supports asynchronous, concurrent processes. You can run tens or hundreds of thousands of concurrent processes; for reference, running 100,000 simple processes takes about 200 MB. If you want to, you can mix threads in the rest of the system with coroutines in asyncoro (provided the threads and coroutines don't share variables, but use coroutine interface functions to send messages, etc.).

Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (it has to be processes; threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without spawning additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, each has to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sounds almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded Python web server module using epoll, and it has pre-forking built in (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the technology created to power FriendFeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for real-time updates.
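For illustration, a minimal pre-forking Tornado app looks something like the sketch below; start(0) forks one worker process per CPU core, and the IOLoop.instance() call matches the Tornado API of that era:

import tornado.ioloop
import tornado.web
from tornado.httpserver import HTTPServer

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("hello")

app = tornado.web.Application([(r"/", MainHandler)])
server = HTTPServer(app)
server.bind(8888)
server.start(0)  # 0 means: fork one worker process per CPU core
tornado.ioloop.IOLoop.instance().start()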
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
(P.S. I approve of your avatar. <3)
The epoll function (and the other functions in the same family, poll and select) allow you to write single-threaded networking code that manages multiple connections. Since there is no threading, there is no need for the synchronisation that a multi-threaded program would require (which can be difficult to get right).
On the other hand, you need an explicit state machine for each connection. In a threaded program, this state machine is implicit.
These functions just offer another way to multiplex multiple connections in a process. Sometimes it is easier not to use threads; other times you're already using threads, and thus it is easier to just use blocking sockets (which release the GIL in Python).
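To make the state-machine point concrete, here is a minimal (Linux-only) epoll echo server sketch; a real server would also buffer partial writes and register for EPOLLOUT:

import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8000))
server.listen(128)
server.setblocking(0)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
connections = {}

while True:
    for fd, event in ep.poll():  # one thread waits on all sockets at once
        if fd == server.fileno():  # new incoming connection
            conn, addr = server.accept()
            conn.setblocking(0)
            ep.register(conn.fileno(), select.EPOLLIN)
            connections[conn.fileno()] = conn
        else:  # data, EOF, or hangup on an existing connection
            conn = connections[fd]
            data = conn.recv(4096)
            if data:
                conn.send(data)  # naive echo; may not write everything
            else:  # peer closed the connection
                ep.unregister(fd)
                connections.pop(fd).close()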

How to implement threaded socket.recv() in python?

I have a number of devices from which I need to get status updates. A socket object is all I have, and socket.recv() is all I need to get the status. Put into a single-threaded application, no problems occur:
class Device:
    def receive(self):
        log.debug("receive waiting: %r", self.device_id)
        try:
            packet = self.socket.recv(255)
        except Exception as e:
            self.report_socket_error(e)
            self.reconnect()
        log.debug("received response: %r", self.device_id)

d = Device()
d.connect()
while True:
    d.receive()
However, the same code wrapped in a threading.Thread class causes deadlocks and funny behaviour. Wrapping it with locks didn't change anything. I traced the problem down to the socket.recv() call. So, how do I implement multiple threads where each thread exclusively owns one socket, all able to wait for data simultaneously?
Thanks in advance
I know this does not answer your question of how to fix your deadlock problem, but it appears that threads are unnecessary overhead in your case:
You can just use one thread in which you use select() to find out which socket has data available and then handle the reported data. Unless the handling takes long or your protocol is more complicated, select should be just fine and avoids all threading issues.
Have a look at http://docs.python.org/howto/sockets.html#non-blocking-sockets for more details.
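Applied to the Device objects from the question, the single-threaded select() loop could look roughly like this; handle_packet is a hypothetical per-device handler:

import select

def receive_loop(devices):
    """Single-threaded replacement for one-thread-per-device reading.
    `devices` are assumed to be connected Device objects as in the question."""
    sock_to_device = dict((d.socket, d) for d in devices)
    while True:
        readable, _, _ = select.select(list(sock_to_device), [], [])
        for sock in readable:
            device = sock_to_device[sock]
            packet = sock.recv(255)  # won't block: select reported data is ready
            handle_packet(device, packet)  # hypothetical handler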
How many different sockets do you have to read from?
If the answer is "just one", then use just one thread. Adding another helps you in no way and only complicates your life, as you found out.
If the answer is "several", then one way to organize this is indeed to have a thread per socket. recv is a blocking operation, which makes a thread an attractive way to organize the code. Each thread owns a separate socket and reads from it at its leisure. You should have no problems or deadlocks with this.
Locks are unnecessary as long as no resources are shared. Even if you do share resources (logging, some data store, etc.) don't just use simple locks - Python has higher-level utilities for that like the Queue module.

A programming strategy to bypass the os thread limit?

The scenario: we have a Python script that checks thousands of proxies simultaneously.
The program uses threads, one per proxy, to speed up the process. When it reaches the 1007th thread, the script crashes because of the OS thread limit.
My solution is: A global variable that gets incremented when a thread spawns and decrements when a thread finishes. The function which spawns the threads monitors the variable so that the limit is not reached.
What will your solution be, friends?
Thanks for the answers.
You want to do non-blocking I/O with the select module.
There are a couple of different specific techniques. select.select should work on every major platform. There are other variations that are more efficient (which could matter if you are checking tens of thousands of connections simultaneously), but you will then need to write code for your specific platform.
I've run into this situation before. Just make a pool of Tasks, and spawn a fixed number of threads that run an endless loop: grab a Task from the pool, run it, and repeat. Essentially you're implementing your own thread abstraction and using the OS threads to implement it.
This does have drawbacks, the major one being that if your Tasks block for long periods of time they can prevent the execution of other Tasks. But it does let you create an unbounded number of Tasks, limited only by memory.
Does Python have any sort of asynchronous IO functionality? That would be the preferred answer IMO - spawning an extra thread for each outbound connection isn't as neat as having a single thread which is effectively event-driven.
Use different processes, with pipes to transfer data. Using threads in Python is pretty lame; from what I've heard, they don't actually run in parallel, even if you have a multi-core processor... but maybe that was fixed in Python 3.
My solution is: A global variable that gets incremented when a thread spawns and decrements when a thread finishes. The function which spawns the threads monitors the variable so that the limit is not reached.
The standard way is to have each thread get next tasks in a loop instead of dying after processing just one. This way you don't have to keep track of the number of threads, since you just fire a fixed number of them. As a bonus, you save on thread creation/destruction.
A counting semaphore should do the trick.
from socket import socket
from threading import Thread, Semaphore

maxthreads = 1000
threads_sem = Semaphore(maxthreads)

class MyThread(Thread):
    def __init__(self, conn, addr):
        Thread.__init__(self)
        self.conn = conn
        self.addr = addr

    def run(self):
        try:
            read = self.conn.recv(4096)
            if read == 'go away\n':
                global running
                running = False
            self.conn.close()
        finally:
            threads_sem.release()  # free a slot for the next thread

sock = socket()
sock.bind(('0.0.0.0', 2323))
sock.listen(1)

running = True
while running:
    conn, addr = sock.accept()
    threads_sem.acquire()  # blocks when maxthreads threads are already running
    MyThread(conn, addr).start()
Make sure your threads get destroyed properly after they've been used, or use a thread pool, although from what I've seen thread pools are not that effective in Python;
see here:
http://code.activestate.com/recipes/203871/
Using the select module or a similar library would most probably be a more efficient solution, but that would require bigger architectural changes.
If you just want to limit the number of threads, a global counter should be fine, as long as you access it in a thread-safe way.
Be careful to minimize the default thread stack size. At least on Linux, the default limit puts severe restrictions on the number of threads you can create: Linux allocates a chunk of the process's virtual address space to each thread's stack (usually 10 MB). With 300 threads, 300 × 10 MB = 3 GB of virtual address space dedicated to stacks, and on a 32-bit system 3 GB is the whole limit. You can probably get away with much less.
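With threading.stack_size() you can shrink the per-thread stack before spawning; a minimal sketch, assuming 512 KB is enough for a simple I/O worker (the platform may round the value up or impose a minimum):

import threading

threading.stack_size(512 * 1024)  # must be set before the threads are created

def check_proxy():
    pass  # the real per-proxy check goes here

threads = [threading.Thread(target=check_proxy) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()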
Twisted is a perfect fit for this problem. See http://twistedmatrix.com/documents/current/core/howto/clients.html for a tutorial on writing a client.
If you don't mind using alternate Python implementations, Stackless has light-weight (non-native) threads. The only company I know doing much with it, though, is CCP; they use it for tasklets in their game, on both the client and the server. You still need to do async I/O with Stackless, because if a thread blocks, the whole process blocks.
As mentioned in another thread, why spawn a new thread for each single operation? This is a classical producer-consumer problem, isn't it? Depending a bit on how you look at it, the proxy checkers might be consumers or producers.
Anyway, the solution is to make a queue of tasks to process, and have the threads loop: check whether there are more tasks to perform in the queue, and if there aren't, wait a predefined interval and check again.
You should protect your queue with some locking mechanism, e.g. a semaphore, to prevent race conditions.
It's really not that difficult, but it requires a bit of thinking to get right. Good luck!
