Using celery to process huge text files

Using celery to process huge text files - python

Background
I'm looking into using celery (3.1.8) to process huge text files (~30GB) each. These files are in fastq format and contain about 118M sequencing "reads", which are essentially each a combination of header, DNA sequence, and quality string). Also, these sequences are from a paired-end sequencing run, so I'm iterating two files simultaneously (via itertools.izip). What I'd like to be able to do is take each pair of reads, send them to a queue, and have them be processed on one of the machines in our cluster (don't care which) to return a cleaned-up version of the read, if cleaning needs to happen (e.g., based on quality).
I've set up celery and rabbitmq, and my workers are launched as follows:
celery worker -A tasks --autoreload -Q transient
and configured like:
from kombu import Queue
BROKER_URL = 'amqp://guest#godel97'
CELERY_RESULT_BACKEND = 'rpc'
CELERY_TASK_SERIALIZER = 'pickle'
CELERY_RESULT_SERIALIZER = 'pickle'
CELERY_ACCEPT_CONTENT=['pickle', 'json']
CELERY_TIMEZONE = 'America/New York'
CELERY_ENABLE_UTC = True
CELERYD_PREFETCH_MULTIPLIER = 500
CELERY_QUEUES = (
Queue('celery', routing_key='celery'),
Queue('transient', routing_key='transient',delivery_mode=1),
)
I've chosen to use an rpc backend and pickle serialization for performance, as well as not
writing anything to disk in the 'transient' queue (via delivery_mode).
Celery startup
To set up the celery framework, I first launch the rabbitmq server (3.2.3, Erlang R16B03-1) on a 64-way box, writing log files to a fast /tmp disk. Worker processes (as above) are launched on each node on the cluster (about 34 of them) ranging anywhere from 8-way to 64-way SMP for a total of 688 cores. So, I have a ton of available CPUs for the workers to use to process of the queue.
Job submission/performance
Once celery is up and running, I submit the jobs via an ipython notebook as below:
files = [foo, bar]
f1 = open(files[0])
f2 = open(files[1])
res = []
count = 0
for r1, r2 in izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)):
count += 1
res.append(tasks.process_read_pair.s(r1, r2))
if count == 10000:
break
t.stop()
g = group(res)
for task in g.tasks:
task.set(queue="transient")
This takes about a 1.5s for 10000 pairs of reads. Then, I call delay on the group to submit to the workers, which takes about 20s, as below:
result = g.delay()
Monitoring with rabbitmq console, I see that I'm doing OK, but not nearly fast enough.
Question
So, is there any way to speed this up? I mean, I'd like to see at least 50,000 read pairs processed every second rather than 500. Is there anything obvious that I'm missing in my celery configuration? My worker and rabbit logs are essentially empty. Would love some advice on how to get my performance up. Each individual read pair processes pretty quickly, too:
[2014-01-29 13:13:06,352: INFO/Worker-1] tasks.process_read_pair[95ec7f2f-0143-455a-a23b-c032998951b8]: HWI-ST425:143:C04A5ACXX:3:1101:13938:2894 1:N:0:ACAGTG HWI-ST425:143:C04A5ACXX:3:1101:13938:2894 2:N:0:ACAGTG 0.00840497016907 sec
Up to this point
So up to this point, I've googled all I can think of with celery, performance, routing, rabbitmq, etc. I've been through the celery website and docs. If I can't get the performance higher, I'll have to abandon this method in favor of another solution (basically dividing up the work into many smaller physical files and processing them directly on each compute node with multiprocessing or something). It would be a shame to not be able to spread this load out over the cluster, though. Plus, this seems like an exquisitely elegant solution.
Thanks in advance for any help!

Not an answer but too long for a comment.
Let's narrow the problem down a little...
Firstly, try skipping all your normal logic/message preparation and just do the tightest possible publishing loop with your current library. See what rate you get. This will identify if it's a problem with your non-queue-related code.
If it's still slow, set up a new python script but use amqplib instead of celery. I've managed to get it publishing at over 6000/s while doing useful work (and json encoding) on a mid-range desktop, so I know that it's performant. This will identify if the problem is with the celery library. (To save you time, I've snipped the following from a project of mine and hopefully not broken it when simplifying...)
from amqplib import client_0_8 as amqp
try:
lConnection = amqp.Connection(
host=###,
userid=###,
password=###,
virtual_host=###,
insist=False)
lChannel = lConnection.channel()
Exchange = ###
for i in range(100000):
lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
lMessage.properties["delivery_mode"] = 2
lChannel.basic_publish(lMessage, exchange=Exchange)
lChannel.close()
lConnection.close()
except Exception as e:
#Fail
Between the two approaches above you should be able to track down the problem to one of the Queue, the Library or your code.

Reusing the producer instance should give you some performance improvement:
with app.producer_or_acquire() as producer:
task.apply_async(producer=producer)
Also the task may be a proxy object and if so must be evaluated for every invocation:
task = task._get_current_object()
Using group will automatically reuse the producer and is usually what you would
do in a loop like this:
process_read_pair = tasks.process_read_pair.s
g = group(
process_read_pair(r1, r2)
for r1, r2 in islice(
izip(FastGeneralIterator(f1), FastGeneralIterator(f2)), 0, 1000)
)
result = g.delay()
You can also consider installing the librabbitmq module which is written in C.
The amqp:// transport will automatically use it if available (or can be specified manually using librabbitmq://:
pip install librabbitmq
Publishing messages directly using the underlying library may be faster
since it will bypass the celery routing helpers and so on, but I would not
think it was that much slower. If so there is definitely room for optimization in Celery,
as I have mostly focused on optimizing the consumer side so far.
Note also that you may want to process multiple DNA pairs in the same task,
as using coarser task granularity may be beneficial for CPU/memory caches and so on,
and it will often saturate parallelization anyway since that is a finite resource.
NOTE: The transient queue should be durable=False

One solution you have is that the reads are highly compressible so replacing the following
res.append(tasks.process_read_pair.s(r1, r2))
by
res.append(tasks.process_bytes(zlib.compress(pickle.dumps((r1, r2))),
protocol = pickle.HIGHEST_PROTOCOL),
level=1))
and calling a pickle.loads(zlib.decompress(obj)) on the other side.
It should win a factor around big factor for long enough DNA sequence if they are not long enough you can grouping them by chunk in an array which you dumps and compress.
another win can be to use zeroMQ for transport if you don't do yet.
I'm not sure what process_byte should be

Again, not an answer, but too long for comments. Per Basic's comments/answer below, I set up the following test using the same exchange and routing as my application:
from amqplib import client_0_8 as amqp
try:
lConnection = amqp.Connection()
lChannel = lConnection.channel()
Exchange = 'celery'
for i in xrange(1000000):
lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
lMessage.properties["delivery_mode"] = 1
lChannel.basic_publish(lMessage, exchange=Exchange, routing_key='transient')
lChannel.close()
lConnection.close()
except Exception as e:
print e
You can see that it's rocking right along.
I guess now it's up to finding out the difference between this and what's going on inside of celery

I added amqp into my logic, and it's fast. FML.
from amqplib import client_0_8 as amqp
try:
import stopwatch
lConnection = amqp.Connection()
lChannel = lConnection.channel()
Exchange = 'celery'
t = stopwatch.Timer()
files = [foo, bar]
f1 = open(files[0])
f2 = open(files[1])
res = []
count = 0
for r1, r2 in izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)):
count += 1
#res.append(tasks.process_read_pair.s(args=(r1, r2)))
#lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
lMessage = amqp.Message(" ".join(r1) + " ".join(r2))
res.append(lMessage)
lMessage.properties["delivery_mode"] = 1
lChannel.basic_publish(lMessage, exchange=Exchange, routing_key='transient')
if count == 1000000:
break
t.stop()
print "added %d tasks in %s" % (count, t)
lChannel.close()
lConnection.close()
except Exception as e:
print e
So, I made a change to submit an async task to celery in the loop, as below:
res.append(tasks.speed.apply_async(args=("FML",), queue="transient"))
The speed method is just this:
#app.task()
def speed(s):
return s
Submitting the tasks I'm slow again!
So, it doesn't appear to have anything to do with:
How I'm iterating to submit to the queue
The message that I'm submitting
but rather, it has to do with the queueing of the function?!?! I'm confused.

Again, not an answer, but more of an observation. By simply changing my backend from rpc to redis, I more than triple my throughput:

Related

Python 3.7 MultiThreading strategy

I have a work load that consist of a very slow query that returns a HUGE amount of data that has to be parsed and calculated, all that on a loop. Basically, it looks like this:
for x in lastTenYears
myData = DownloadData(x) # takes about ~40-50 [sec]
parsedData.append(ParseData(myData)) # takes another +30-60 [sec]
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
How can I achieve this parallelism of operations?
Ideally speaking, I would like to have 1 thread always downloading, and N threads doing the parsing. The download part is actually a query against a database, so it's not good to have a bunch o parallel of them...
Details:
The parsing of the data is a heavily CPU bound, and consists of raw math calculations and nothing else.
Using Python 3.7.4

1) Use a threadsafe queue. Queue.FIFOQueue. At the top level define
my_queue = Queue.FIFOQueue()
parsedData = []
2) On the first thread, kick off the data loading
my_queue.put(DownloadData(x))
On the second thread
if not (my_queue.empty()):
myData = my_queue.get()
parsedData.append(ParseData(myData))

If your program is CPU bound you will have hard times to do anything else in other threads due to the GIL (global interpreter lock).
Here is a link to an article which might help you to understand the topic: https://opensource.com/article/17/4/grok-gil
Downloading the data in a sub-process is most likely the best approach.

It's hard to say if and how much this will actually help (as I have nothing to test...), but you might try a multiprocessing.Pool. It handles all the dirty work for you and you can customize number of processes, chunk size etc.
from multiprocessing import Pool
def worker(x):
myData = DownloadData(x)
return ParseData(myData)
if __name__ == "__main__":
processes = None # defaults to os.cpu_count()
chunksize = 1
with Pool(processes) as pool:
parsedData = pool.map(worker, lastTenYears, chunksize)
Here for the example I use the map method, but according to your needs you might want to use imap or map_async.

Q : How can I achieve this parallelism of operations?
The step number one is to realise, the above requested use-case is not a [PARALLEL] code-execution, but an un-ordered batch of resources-use policy limited execution of a strict sequence of pairs of :
First-a-remote-[DB-Query](returning (cit.) HUGE amount of data)
Next-a-local-[CPU-process]( of (cit.) HUGE amount of data just returned here)
The latency of the first could be masked( if it were permitted, but it is not permitted - due to a will not to overload the DB-host ),the latency for the second not( can start but a next I/O-bound DB-Query, yet only if not violating the rule of keeping the DB-machine but under a mild workload ).
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
It is high time to make thing clear and sound :
Facts :
A )
The CPU-bound tasks will never run faster in whatever number N of threads in python-GIL-lock controlled ecosystem( since ever and forever, as Guido ROSSUM has expressed ),as the GIL-lock enforces a re-[SERIAL]-isation, so the more threads "work", the more threads actually wait for acquiring the GIL-lock, before they "get" it but for a 1 / ( N + 1 )-th fraction of time of the resulting, thanks to the GIL-lock policing again pure-[SERIAL], duration of N * ( 30 - 60 ) [sec]
B )
The I/O-bound task makes no sense to off-load into a full process-based, concurrent execution, as the full-copy of the python process ( in Windows also with duplicating the whole python interpreter state with all data, during the sub-process instantiation ) makes no sense, as there are smarter techniques for I/O-bound processing ( where GIL-lock does not hurt so much.
C )
The whole concept of N-parsing : 1-querying is principally wrong - the maximum achievable goal is to mask the latency of the I/O-process ( where making sense ), yet here each one and every query takes those said ~ 40-50 [sec] so no second pack-of-data to parse will ever be present here before running those said ~ 40-50 [sec] next time, sono second worker will ever get anything to parse anytime before T0 + ~ 80~100 [sec] - so one could dream a wish to have N-(unbound)-workers working ( yet have 'em but actually waiting for data ) is possible, but awfully anti-productive ( the worse for N-(GIL-MUTEX-ed)-"waiting"-agents ).

How to apply parallel or asynchronous I/O file writing on a python piece of code

To begin with, we're given the following piece of code:
from validate_email import validate_email
import time
import os
def verify_emails(email_path, good_filepath, bad_filepath):
good_emails = open(good_filepath, 'w+')
bad_emails = open(bad_filepath, 'w+')
emails = set()
with open(email_path) as f:
for email in f:
email = email.strip()
if email in emails:
continue
emails.add(email)
if validate_email(email, verify=True):
good_emails.write(email + '\n')
else:
bad_emails.write(email + '\n')
if __name__ == "__main__":
os.system('cls')
verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
I expect contacting SMTP servers to be the most expensive part by far from my program when emails.txt contains large amount of lines (>1k). Using some form of parallel or asynchronous I/O should speed this up a lot, since I can wait for multiple servers to respond instead of waiting sequentially.
As far as I have read:
Asynchronous I/O operates by queuing a request for I/O to the file
descriptor, tracked independently of the calling process. For a file
descriptor that supports asynchronous I/O (raw disk devcies
typically), a process can call aio_read() (for instance) to request a
number of bytes be read from the file descriptor. The system call
returns immediately, whether or not the I/O has completed. Some time
later, the process then polls the operating system for the completion
of the I/O (that is, buffer is filled with data).
To be sincere, I didn't quite understand how to implement async I/O on my program. Can anybody take a little time and explain me the whole process ?
EDIT as per PArakleta suggested:
from validate_email import validate_email
import time
import os
from multiprocessing import Pool
import itertools
def validate_map(e):
return (validate_email(e.strip(), verify=True), e)
seen_emails = set()
def unique(e):
if e in seen_emails:
return False
seen_emails.add(e)
return True
def verify_emails(email_path, good_filepath, bad_filepath):
good_emails = open(good_filepath, 'w+')
bad_emails = open(bad_filepath, 'w+')
with open(email_path, "r") as f:
for result in Pool().imap_unordered(validate_map,
itertools.ifilter(unique, f):
(good, email) = result
if good:
good_emails.write(email)
else:
bad_emails.write(email)
good_emails.close()
bad_emails.close()
if __name__ == "__main__":
os.system('cls')
verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")

You're asking the wrong question
Having looked at the validate_email package your real problem is that you're not efficiently batching your results. You should be only doing the MX lookup once per domain and then only connect to each MX server once, go through the handshake, and then check all of the addresses for that server in a single batch. Thankfully the validate_email package does the MX result caching for you, but you still need to be group the email addresses by server to batch the query to the server itself.
You need to edit the validate_email package to implement batching, and then probably give a thread to each domain using the actual threading library rather than multiprocessing.
It's always important to profile your program if it's slow and figure out where it is actually spending the time rather than trying to apply optimisation tricks blindly.
The requested solution
IO is already asynchronous if you are using buffered IO and your use case fits with the OS buffering. The only place you could potentially get some advantage is in read-ahead but Python already does this if you use the iterator access to a file (which you are doing). AsyncIO is an advantage to programs that are moving large amounts of data and have disabled the OS buffers to prevent copying the data twice.
You need to actually profile/benchmark your program to see if it has any room for improvement. If your disks aren't already throughput bound then there is a chance to improve the performance by parallel execution of the processing of each email (address?). The easiest way to check this is probably to check to see if the core running your program is maxed out (i.e. you are CPU bound and not IO bound).
If you are CPU bound then you need to look at threading. Unfortunately Python threading doesn't work in parallel unless you have non-Python work to be done so instead you'll have to use multiprocessing (I'm assuming validate_email is a Python function).
How exactly you proceed depends on where the bottleneck's in your program are and how much of a speed up you need to get to the point where you are IO bound (since you cannot actually go any faster than that you can stop optimising when you hit that point).
The emails set object is hard to share because you'll need to lock around it so it's probably best that you keep that in one thread. Looking at the multiprocessing library the easiest mechanism to use is probably Process Pools.
Using this you would need to wrap your file iterable in an itertools.ifilter which discards duplicates, and then feed this into a Pool.imap_unordered and then iterate that result and write into your two output files.
Something like:
with open(email_path) as f:
for result in Pool().imap_unordered(validate_map,
itertools.ifilter(unique, f):
(good, email) = result
if good:
good_emails.write(email)
else:
bad_emails.write(email)
The validate_map function should be something simple like:
def validate_map(e):
return (validate_email(e.strip(), verify=True), e)
The unique function should be something like:
seen_emails = set()
def unique(e):
if e in seen_emails:
return False
seen_emails.add(e)
return True
ETA: I just realised that validate_email is a library which actually contacts SMTP servers. Given that it's not busy in Python code you can use threading. The threading API though is not as convenient as the multiprocessing library but you can use multiprocessing.dummy to have a thread based Pool.
If you are CPU bound then it's not really worth having more threads/processes than cores but since your bottleneck is network IO you can benefit from many more threads/processes. Since processes are expensive you want to swap to threads and then crank up the number running in parallel (although you should be polite not to DOS-attack the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool and then call ThreadPool(processes=32).imap_unordered().

Troubleshooting data inconsistencies with Python multiprocessing/threading

TL;DR: Getting different results after running code with threading and multiprocessing and single threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story; I have a bunch of data indexed into a Solr Collection (~250m items), all items in that collection have a sessionid. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day would be processed by a separate process, the proc_session function can be ran with threads. Below is the code I am using for threading/multiprocessing below. It accepts a function, a list of arguments and number of threads / multiprocesses. It will then create a queue based on input args, then create processes/threads and let them go through it. I am not posting the actual code, since it generally runs fine single threaded without any issues, but can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue,JoinableQueue
import time
import multiprocessing
import os
def proc_proc(func,data,threads,delay=10):
if threads < 0:
return
q = JoinableQueue()
procs = []
for i in range(threads):
thread = Process(target=proc_exec,args=(func,q))
thread.daemon = True;
thread.start()
procs.append(thread)
for item in data:
q.put(item)
logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
s = q.qsize()
while s > 0:
logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
s = q.qsize()
time.sleep(delay)
for p in procs:
logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
p.join(1)
logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
q.join()
logging.debug(str(os.getpid()) + ' - *** Done')
def proc_exec(func,q):
p = multiprocessing.current_process()
logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
while True:
d = q.get()
try:
logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
func(d)
sys.stdout.flush()
logging.debug(str(os.getpid()) + " - Marking Task as Done")
q.task_done()
except:
logging.error(str(os.getpid()) + " - Exception in subprocess execution")
logging.error(sys.exc_info()[0])
logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue
def thread_proc(func,data,threads):
if threads < 0:
return "Thead Count not specified"
q = Queue()
for i in range(threads):
thread = threading.Thread(target=thread_exec,args=(func,q))
thread.daemon = True
thread.start()
for item in data:
q.put(item)
logging.debug('*** Main thread waiting')
s = q.qsize()
while s > 0:
logging.debug("Queue Size is:" + str(s))
s = q.qsize()
time.sleep(1)
logging.debug('*** Main thread waiting')
q.join()
logging.debug('*** Done')
def thread_exec(func,q):
while True:
d = q.get()
#logging.debug("Working...")
try:
func(d)
except:
pass
q.task_done()
I am running into problems with validating data after python runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only with multiprocessing - 10 procs:
Days Processed 30
Sessions Found 3,507,475
Sessions Processed 3,514,496
Files 162,140
Data Output: 1.9G
multiprocessing and multithreading - 10 procs 10 threads
Days Processed 30
Sessions Found 3,356,362
Sessions Processed 3,272,402
Files 424,005
Data Output: 2.2GB
just threading - 10 threads
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 733,664
Data Output: 3.3GB
Single process/ no threading
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counties entries in the log files (1 per main process). The first thing that jumps out is that days processed doesn't match. However, I manually checked the log files and it looks like a log entry was missing, there are follow on log entries to indicate that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code, just seems like a terrible waste of time, is there any alternative?

I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all code of relevance.
The issue might very well be
in the method you are using to read from solr or in preparing read data before feeding it to your workers.
in the architecture you have come up with for distributing the work among multiple processes.
in your logging infrastructure (as you have pointed out yourself).
in your analysis approach.
You have to go through all of these points, and as of the complexity of the issue surely nobody here will be able to identify the exact issues for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is your way to wait for the workers to finish. You seem not to be sure by which method you should wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
p.join(1)
Why are you defining a timeout of one second here and do not respond to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated or you specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically implicate the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.

I figured it out, I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded process. Turned out that an object that does all these things with the data from Solr was shared among threads. I did not have any locking mechanisms, so in a case it had mixed data from multiple sessions which caused inconsistent output. I validated this by instantiating a new object for every thread and the counts matched up. It is a bit slower, but still workable.
Thanks

Show a progress bar for my multithreaded process

I have a simple Flask web app that make many HTTP requests to an external service when a user push a button. On the client side I have an angularjs app.
The server side of the code look like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
#app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? I working in the right direction?
I'have seen similar questions on SO like this one but I'm not able to adapt the answer to my case.
Thank you.

For interprocess communication you can use a multiprocessiong.Queue and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue, I'd use collections.namedtuples for more structured messages. On tasks with many steps, this enables you to raise the resolution of you progress report, and report more to the user.

In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
done = 0
errors = 0
for r in result_objs:
if r.ready():
done += 1
if not r.successful():
errors += 1
return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None
#app.route('/')
def index():
global results
results = [pool.apply_async(do_work) for n in range(20)]
return render_template('index.html')
#app.route('/api/v1.0/progress')
def progress():
global results
total = len(results)
done, errored = get_progress(results)
return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!

I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value-=1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards

error: can't start new thread

I have a site that runs with follow configuration:
Django + mod-wsgi + apache
In one of user's request, I send another HTTP request to another service, and solve this by httplib library of python.
But sometimes this service don't get answer too long, and timeout for httplib doesn't work. So I creating thread, in this thread I send request to service, and join it after 20 sec (20 sec - is a timeout of request). This is how it works:
class HttpGetTimeOut(threading.Thread):
def __init__(self,**kwargs):
self.config = kwargs
self.resp_data = None
self.exception = None
super(HttpGetTimeOut,self).__init__()
def run(self):
h = httplib.HTTPSConnection(self.config['server'])
h.connect()
sended_data = self.config['sended_data']
h.putrequest("POST", self.config['path'])
h.putheader("Content-Length", str(len(sended_data)))
h.putheader("Content-Type", 'text/xml; charset="utf-8"')
if 'base_auth' in self.config:
base64string = base64.encodestring('%s:%s' % self.config['base_auth'])[:-1]
h.putheader("Authorization", "Basic %s" % base64string)
h.endheaders()
try:
h.send(sended_data)
self.resp_data = h.getresponse()
except httplib.HTTPException,e:
self.exception = e
except Exception,e:
self.exception = e
something like this...
And use it by this function:
getting = HttpGetTimeOut(**req_config)
getting.start()
getting.join(COOPERATION_TIMEOUT)
if getting.isAlive(): #maybe need some block
getting._Thread__stop()
raise ValueError('Timeout')
else:
if getting.resp_data:
r = getting.resp_data
else:
if getting.exception:
raise ValueError('REquest Exception')
else:
raise ValueError('Undefined exception')
And all works fine, but sometime I start catching this exception:
error: can't start new thread
at the line of starting new thread:
getting.start()
and the next and the final line of traceback is
File "/usr/lib/python2.5/threading.py", line 440, in start
_start_new_thread(self.__bootstrap, ())
And the answer is: What's happen?
Thank's for all, and sorry for my pure English. :)

The "can't start new thread" error almost certainly due to the fact that you have already have too many threads running within your python process, and due to a resource limit of some kind the request to create a new thread is refused.
You should probably look at the number of threads you're creating; the maximum number you will be able to create will be determined by your environment, but it should be in the order of hundreds at least.
It would probably be a good idea to re-think your architecture here; seeing as this is running asynchronously anyhow, perhaps you could use a pool of threads to fetch resources from another site instead of always starting up a thread for every request.
Another improvement to consider is your use of Thread.join and Thread.stop; this would probably be better accomplished by providing a timeout value to the constructor of HTTPSConnection.

You are starting more threads than can be handled by your system. There is a limit to the number of threads that can be active for one process.
Your application is starting threads faster than the threads are running to completion. If you need to start many threads you need to do it in a more controlled manner I would suggest using a thread pool.

I was running on a similar situation, but my process needed a lot of threads running to take care of a lot of connections.
I counted the number of threads with the command:
ps -fLu user | wc -l
It displayed 4098.
I switched to the user and looked to system limits:
sudo -u myuser -s /bin/bash
ulimit -u
Got 4096 as response.
So, I edited /etc/security/limits.d/30-myuser.conf and added the lines:
myuser hard nproc 16384
myuser soft nproc 16384
Restarted the service and now it's running with 7017 threads.
Ps. I have a 32 cores server and I'm handling 18k simultaneous connections with this configuration.

I think the best way in your case is to set socket timeout instead of spawning thread:
h = httplib.HTTPSConnection(self.config['server'],
timeout=self.config['timeout'])
Also you can set global default timeout with socket.setdefaulttimeout() function.
Update: See answers to Is there any way to kill a Thread in Python? question (there are several quite informative) to understand why. Thread.__stop() doesn't terminate thread, but rather set internal flag so that it's considered already stopped.

I completely rewrite code from httplib to pycurl.
c = pycurl.Curl()
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, CONNECTION_TIMEOUT)
c.setopt(pycurl.TIMEOUT, COOPERATION_TIMEOUT)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.POST, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 0)
c.setopt(pycurl.SSL_VERIFYPEER, 0)
c.setopt(pycurl.URL, "https://"+server+path)
c.setopt(pycurl.POSTFIELDS,sended_data)
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
something like that.
And I testing it now. Thanks all of you for help.

If you are tying to set timeout why don't you use urllib2.

I'm running a python script on my machine only to copy and convert some files from one format to another, I want to maximize the number of running threads to finish as quickly as possible.
Note: It is not a good workaround from an architecture perspective If you aren't using it for a quick script on a specific machine.
In my case, I checked the max number of running threads that my machine can run before I got the error, It was 150
I added this code before starting a new thread. which checks if the max limit of running threads is reached then the app will wait until some of the running threads finish, then it will start new threads
while threading.active_count()>150 :
time.sleep(5)
mythread.start()

If you are using a ThreadPoolExecutor, the problem may be that your max_workers is higher than the threads allowed by your OS.
It seems that the executor keeps the information of the last executed threads in the process table, even if the threads are already done. This means that when your application has been running for a long time, eventually it will register in the process table as many threads as ThreadPoolExecutor.max_workers

As far as I can tell it's not a python problem. Your system somehow cannot create another thread (I had the same problem and couldn't start htop on another cli via ssh).
The answer of Fernando Ulisses dos Santos is really good. I just want to add, that there are other tools limiting the number of processes and memory usage "from the outside". It's pretty common for virtual servers. Starting point is the interface of your vendor or you might have luck finding some information in files like
/proc/user_beancounters

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.