Python: Is there a way to join threads while using semaphores?

Background:
I have an inventory application that scrapes data from our various IT resources (VMware, storage, backups, etc.). We have a vCenter that has over 2000 VMs registered to it. I have code that goes in and pulls details for each VM in its own thread to parallelize the collection.
I have them joined to a parent thread so that the different sections will complete before it moves onto the next area. I also have it set to timeout after 10 minutes so that the collection isn't held up by a single object thread that just gets stuck. What I've found though is that when I try to pull data for more than about 1000 objects at once, it overloads the vCenter and it kills my connection, and almost all of the child threads die.
I'm quite sure that it's partially related to vCenter versions that are below 7.0 (we're using 6.7 in a lot of places). But we're stuck using the current versions due to older hardware.
What I would like to do is limit the number of threads spun up using semaphores, but also have them joined to the parent thread when they are spun up. All of the ways I've thought of to do this either end up serializing the collection, or end up having the join timeout after 10 minutes.
Is there a way to pull this off? The part that gets me stuck is joining the threads, because the join blocks the rest of the operations. Once I'm blocked joining threads, I can't start or join any others.
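For illustration, the closest shape I've come up with looks roughly like this (a simplified sketch; VMCollector, collect_details and MAX_CONCURRENT are placeholders, not my real code), but with a big enough backlog the joins still hit the timeout:
import threading

MAX_CONCURRENT = 100  # cap on simultaneous pulls against vCenter
limiter = threading.BoundedSemaphore(MAX_CONCURRENT)
objects = list(range(2000))  # placeholder for the real VM objects

def collect_details(obj):
    pass  # placeholder for the real per-VM collection logic

class VMCollector(threading.Thread):
    def __init__(self, obj, limiter):
        super().__init__()
        self.obj = obj
        self.limiter = limiter

    def run(self):
        # Only MAX_CONCURRENT threads run the collection at any given moment;
        # the rest block here instead of hammering vCenter all at once.
        with self.limiter:
            collect_details(self.obj)

threads = [VMCollector(obj, limiter) for obj in objects]
for t in threads:
    t.start()
for t in threads:
    t.join(600)  # still joined to the parent, with the same 10 minute timeout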
Code sample:
try:
    objects = vsphere_client.vcenter.VM.list()  # try the newer REST API operation
    old_objects = container_view.view  # old pyVmomi objects
    rest_api = True
except UnableToAllocateResource:
    # too many objects for the REST API to return (happens at 1000 on vCenter 6.7 and 4000 on 7.0)
    objects = container_view.view
    old_objects = None
    rest_api = False
except OperationNotFound:  # a different error occurred
    objects = container_view.view
    old_objects = None
    rest_api = False

threads = []
for obj in objects:
    thread = RESTVMDetail(vsphere_client, db_vcenter, obj, old_objects, rest_api, db_vms, db_hosts,
                          db_datastores, db_networks, db_vm_disks, db_vm_os_disks, db_vm_nics, db_vm_cdroms,
                          db_vm_floppies, db_vm_scsis, db_regions, db_sites, db_environments, db_platforms,
                          db_applications, db_functions, db_costs, db_vm_snapshots, api_limiter)
    threads.append(thread)

for thread in threads:
    thread.start()
for thread in threads:
    thread.join(600)

I had to switch this to a consumer/producer implementation utilizing a queue. That allowed me to limit the number of collections that would be kicked off simultaneously.
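Roughly, the shape ended up like this (a simplified sketch; collect_details, NUM_WORKERS and the dummy objects list stand in for my real collection code and limits):
import queue
import threading

NUM_WORKERS = 50  # illustrative cap on simultaneous collections

def collect_details(obj):
    pass  # placeholder for the real per-VM collection logic

def consumer(work_queue):
    while True:
        obj = work_queue.get()
        if obj is None:            # sentinel: no more work
            work_queue.task_done()
            break
        try:
            collect_details(obj)
        finally:
            work_queue.task_done()

objects = list(range(2000))        # placeholder for the real VM objects
work_queue = queue.Queue()
workers = [threading.Thread(target=consumer, args=(work_queue,)) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for obj in objects:                # producer: enqueue every VM
    work_queue.put(obj)
for _ in workers:                  # one sentinel per worker so they all exit
    work_queue.put(None)

work_queue.join()                  # block until every item has been processed
for w in workers:
    w.join()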

Related

How to get rid of zombie processes using torch.multiprocessing.Pool (Python)

I am using torch.multiprocessing.Pool to speed up my NN in inference, like this:
import functools

import torch.multiprocessing
from tqdm import tqdm

mp = torch.multiprocessing.get_context('forkserver')

def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    pool = mp.Pool(args.num_workers, maxtasksperchild=1)
    out = pool.imap(
        func=functools.partial(predict_func, args=args),
        iterable=sequences,
        chunksize=1)
    for item in tqdm(out, total=len(sequences), ncols=85):
        predicted_cluster_ids.append(item)
    pool.close()
    pool.terminate()
    pool.join()
    return predicted_cluster_ids
Note 1) I am using imap because I want to be able to show a progress bar with tqdm.
Note 2) I tried with both forkserver and spawn but no luck. I cannot use other methods because of how they interact (poorly) with CUDA.
Note 3) I am using maxtasksperchild=1 and chunksize=1 so for each sequence in sequences it spawns a new process.
Note 4) Adding or removing pool.terminate() and pool.join() makes no difference.
Note 5) predict_func is a method of a class I created. I could also pass the whole model to parallel_predict but it does not change anything.
Everything works fine except that after a while I run out of memory on the CPU (while on the GPU everything works as expected). Using htop to monitor memory usage, I notice that for every process I spawn with the pool I get a zombie that uses 0.4% of the memory. They don't get cleared, so they keep using space. Still, parallel_predict does return the correct result and the computation goes on. My script is structured so that it runs validation multiple times, so the next time parallel_predict is called the zombies add up.
This is what I see in htop (screenshot not reproduced here).
Usually, these zombies get cleared after ctrl-c but in some rare cases I need to killall.
Is there some way I can force the Pool to close them?
UPDATE:
I tried to kill the zombie processes using this:
import os

def kill(pool):
    import multiprocessing
    import signal
    # stop repopulating new children
    pool._state = multiprocessing.pool.TERMINATE
    pool._worker_handler._state = multiprocessing.pool.TERMINATE
    for p in pool._pool:
        os.kill(p.pid, signal.SIGKILL)
    # .is_alive() will reap dead processes
    while any(p.is_alive() for p in pool._pool):
        pass
    pool.terminate()
But it does not work. It gets stuck at pool.terminate()
UPDATE2:
I tried to use the initializer arg of the Pool to catch signals like this:
def process_initializer():
    def handler(_signal, frame):
        print('exiting')
        exit(0)
    signal.signal(signal.SIGTERM, handler)

def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    with mp.Pool(args.num_workers, initializer=process_initializer, maxtasksperchild=1) as pool:
        out = pool.imap(
            func=functools.partial(predict_func, args=args),
            iterable=sequences,
            chunksize=1)
        for item in tqdm(out, total=len(sequences), ncols=85):
            predicted_cluster_ids.append(item)
        for p in pool._pool:
            os.kill(p.pid, signal.SIGTERM)
        pool.close()
        pool.terminate()
        pool.join()
    return predicted_cluster_ids
but again it does not free memory.
Ok, I have more insights to share with you. Indeed this is not a bug, it is actually the "supposed" behavior for the multiprocessing module in Python (torch.multiprocessing wraps it). What happens is that, although the Pool terminates all the processes, the memory is not released (given back to the OS). This is also stated in the documentation, though in a very confusing way.
In the documentation it says that
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue
but also:
A frequent pattern found in other systems (such as Apache, mod_wsgi, etc) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before being exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user
but the "clean up" does NOT happen.
To make things worse, I found this post in which they recommend using maxtasksperchild=1. This makes the leak worse, because the number of zombies then grows with the number of data points to be predicted, and since pool.close() does not free the memory, they add up.
This is very bad if you are using multiprocessing for example in validation. For every validation step I was reinitializing the pool but the memory didn't get freed from the previous iteration.
The SOLUTION here is to move pool = mp.Pool(args.num_workers) outside the training loop, so the pool does not get closed and reopened, and therefore it always reuses the same processes. NOTE: again remember to remove maxtasksperchild=1 and chunksize=1.
I think this should be included in the best practices page.
BTW, in my opinion this behavior of the multiprocessing library should be considered a bug and should be fixed on the Python side (not the PyTorch side).
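In code, the change looks roughly like this (a sketch only; num_epochs, validate_every, train_one_epoch, predict_func, sequences and args are placeholders for the real training-loop pieces):
import functools

import torch.multiprocessing
from tqdm import tqdm

mp = torch.multiprocessing.get_context('forkserver')

def parallel_predict(pool, predict_func, sequences, args):
    # Reuse the long-lived pool instead of creating one per validation step.
    out = pool.imap(functools.partial(predict_func, args=args), sequences)
    return [item for item in tqdm(out, total=len(sequences), ncols=85)]

if __name__ == '__main__':
    pool = mp.Pool(args.num_workers)      # created once (no maxtasksperchild=1)
    try:
        for epoch in range(num_epochs):
            train_one_epoch()             # placeholder for the training step
            if epoch % validate_every == 0:
                ids = parallel_predict(pool, predict_func, sequences, args)
    finally:
        pool.close()
        pool.join()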

Passing updated args to multiple threads periodically in python

I have three base stations. They have to work in parallel, and every 10 seconds they will receive a list containing information about their cluster. I want to run this code for about 10 minutes. So, every 10 seconds my three threads have to call the target method with new arguments, and this process should last for 10 minutes. I don't know how to do this, but I came up with the idea below, which doesn't seem like a good one, so I'd appreciate any help.
I have a list named base_centroid_assign, and I want to pass each item of it to a distinct thread. The list content will be updated frequently (say, every 10 seconds), so I want to reuse my existing threads and give the updated items to them.
In the code below, the list contains three items, each of which itself holds multiple items (it's nested). I want the three threads to stop after executing the quite simple target function and then be called again with the updated item; however, when I run the code below, I end up with 30 threads! (The run_time variable is 10 and the list's length is 3.)
How can I implement the idea described above?
import threading
import time

run_time = 10

def cluster_status_broadcasting(info_base_cent_avr):
    print(threading.current_thread().name)
    info_base_cent_avr.sort(key=lambda item: item[2], reverse=True)

start = time.time()
while run_time > 0:
    for item in base_centroid_assign:
        t = threading.Thread(target=cluster_status_broadcasting, args=(item,))
        t.daemon = True
        t.start()
    print('Entire job took:', time.time() - start)
    run_time -= 1
Welcome to Stack Overflow.
Problems with thread synchronisation can be so tricky to handle that Python already has some very useful libraries specifically for such tasks. The primary one here is queue.Queue in Python 3. The idea is to have a queue for each "worker" thread. The main thread collects new data and puts it onto the queues, and the subsidiary threads get the data from their queue.
When you call a Queue's get method its normal action is to block the thread until something is available, but presumably you want the threads to continue working on the current inputs until new ones are available, in which case it would make more sense to poll the queue and continue with the current data if there is nothing from the main thread.
I outline such an approach in my answer to this question, though in that case the worker threads are actually sending return values back on another queue.
The structure of your worker threads' run method would then need to be something like the following pseudo-code:
def run(self):
    request_data = self.inq.get()  # Wait for first item
    while True:
        process_with(request_data)
        try:
            request_data = self.inq.get(block=False)
        except queue.Empty:
            continue
You might like to add logic to terminate the thread cleanly when a sentinel value such as None is received.
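For example (a rough sketch along the lines of the pseudo-code above, with process_with and self.inq still placeholders), the loop could treat None as the shutdown sentinel:
def run(self):
    request_data = self.inq.get()          # wait for the first item
    while request_data is not None:        # None acts as the shutdown sentinel
        process_with(request_data)
        try:
            # Non-blocking get: keep the current data if nothing new has arrived.
            request_data = self.inq.get(block=False)
        except queue.Empty:
            pass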

Troubleshooting data inconsistencies with Python multiprocessing/threading

TL;DR: Getting different results after running code with threading and multiprocessing and single threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story; I have a bunch of data indexed into a Solr Collection (~250m items), all items in that collection have a sessionid. Some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
proc_session - does everything that needs to happen for a single session
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments, and a number of threads/processes. It then creates a queue from the input args, creates the processes/threads, and lets them work through it. I am not posting the actual processing code, since it generally runs fine single threaded without any issues, but I can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue, JoinableQueue
import time
import multiprocessing
import os

def proc_proc(func, data, threads, delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec, args=(func, q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')

def proc_exec(func, q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue

def thread_proc(func, data, threads):
    if threads < 0:
        return "Thread count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec, args=(func, q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')

def thread_exec(func, q):
    while True:
        d = q.get()
        # logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems with validating data after python runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only multiprocessing - 10 procs:
Days Processed: 30
Sessions Found: 3,507,475
Sessions Processed: 3,514,496
Files: 162,140
Data Output: 1.9GB

Multiprocessing and multithreading - 10 procs, 10 threads:
Days Processed: 30
Sessions Found: 3,356,362
Sessions Processed: 3,272,402
Files: 424,005
Data Output: 2.2GB

Just threading - 10 threads:
Days Processed: 31
Sessions Found: 3,595,263
Sessions Processed: 3,595,263
Files: 733,664
Data Output: 3.3GB

Single process, no threading:
Days Processed: 31
Sessions Found: 3,595,263
Sessions Processed: 3,595,263
Files: 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (one per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files, and it looks like a log entry was simply missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code just to validate this code; that seems like a terrible waste of time. Is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all of the relevant code.
The issue might very well be:
(1) in the method you are using to read from Solr, or in preparing the read data before feeding it to your workers;
(2) in the architecture you have come up with for distributing the work among multiple processes;
(3) in your logging infrastructure (as you have pointed out yourself);
(4) in your analysis approach.
You have to go through all of these points, and given the complexity of the issue, surely nobody here will be able to identify the exact problems for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is your way to wait for the workers to finish. You seem not to be sure by which method you should wait for your workers, so you apply three different methods:
First, you are monitoring the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach, which might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why do you specify a timeout of one second here, and then never check whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically imply the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
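For what it's worth, a single-criterion version of the wait could look something like this (a deliberately simplified sketch without your logging or error handling, not a drop-in replacement for your code): workers exit when they see a sentinel, and the parent blocks on join() with no timeout, so "all workers have terminated" is the only condition that matters.
from multiprocessing import Process, JoinableQueue

SENTINEL = None

def proc_exec(func, q):
    while True:
        d = q.get()
        if d is SENTINEL:
            q.task_done()
            break
        try:
            func(d)
        finally:
            q.task_done()

def proc_proc(func, data, workers):
    q = JoinableQueue()
    procs = [Process(target=proc_exec, args=(func, q)) for _ in range(workers)]
    for p in procs:
        p.start()
    for item in data:
        q.put(item)
    for _ in procs:              # one sentinel per worker
        q.put(SENTINEL)
    q.join()                     # every item (and sentinel) has been task_done()'d
    for p in procs:
        p.join()                 # no timeout: wait until each worker really exits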
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all of this work on the data from Solr was shared among threads. I did not have any locking mechanisms, so in some cases it mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread, and the counts matched up. It is a bit slower, but still workable.
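For reference, the shape of the change was roughly this (SessionProcessor is just a stand-in for my real Solr-processing object, not the actual class name):
from queue import Queue
import threading

class SessionProcessor(object):
    # Hypothetical stand-in for the object that massages the Solr data.
    def proc_session(self, session):
        pass  # the real per-session work goes here

def thread_exec(q):
    processor = SessionProcessor()   # one instance per thread, never shared
    while True:
        d = q.get()
        try:
            processor.proc_session(d)
        finally:
            q.task_done()

q = Queue()
for i in range(10):
    t = threading.Thread(target=thread_exec, args=(q,))
    t.daemon = True
    t.start()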
Thanks

Closing SSE connection in browser causes segfault in uWSGI when multiple sets of child processes are running concurrently

I'm building a web application for processing ~60,000 (and growing) large files, performing some analysis, and returning a "best guess" that needs to be verified by a user. The files will be filtered by category to avoid loading every file, but I'm still left with a scenario where I might have to process 1000+ files at a time.
These are large files that can take up to 8-9 seconds each to process, and in a 1000+ file situation it is impractical to have a user wait 8 seconds between reviews, or 2+ hours while the files are processed beforehand.
To overcome this, I've decided to use multiprocessing to spawn several workers, each of which will pick from a queue of files, process them and insert into an output queue. I have another method that basically polls the output queue for items and then streams them to the client when one becomes available.
We're using gevent with uWSGI and Django in our environment and I'm aware that child process creation via multiprocessing in the context of gevent yields an undesired event loop state in the child. Greenlets spawned before forking are duplicated in the child. Therefore, I've decided to use lets to assist in the handling of the child processes.
This all works beautifully while uninterrupted. However, if a user were to switch categories anytime while files are still being processed, I close the SSE connection in the browser and open another causing a new set of child processes to spawn and killing the existing processes (or attempting to). This causes one of two things to happen.
When I yield out the results to the client, I get an IOError from uWSGI because the connection has closed. I wrapped the entire function in a try...finally and kill all the workers before the function exits.
I can either block while the processes are killed or do it in the background. Each method has different consequences. When trying to kill without blocking, the original processes are never killed, the new processes stop yielding, and any request (from any page) to the server hangs until I manually kill all uWSGI processes.
When blocking, uWSGI reports a segmentation fault, the main worker is killed and restarted, killing all child processes - new and old.
An example of the JavaScript used to open/close the connection:
var state = {};

function analyze(){
    // If a connection exists, close it.
    if (state.evtSource) {
        state.evtSource.close();
    }

    // Create a new connection to the server.
    evtSource = state.evtSource = new EventSource('?myarg=myval');

    evtSource.onmessage = function(message){
        // do stuff
    };
}
Server-side example code:
from item import Item

import lets
import multiprocessing
import time

MAX_WORKERS = 10

# Worker is outside of ProcessFiles because ``lets``
# pickles the target.
def worker(item):
    return item.process()

class ProcessFiles(object):

    def __init__(self):
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        self.file_count = 0
        self.pool = lets.ProcessPool(MAX_WORKERS)

    def query_for_results(self):
        # Query db for records of files to process.
        # Return results and set self.file_count equal to
        # the number of records returned.
        pass

    def start(self):
        # Queue up files to process.
        for result in self.query_for_results():
            item = Item(result)
            self.input_queue.append(item)

        # Spawn up to MAX_WORKERS child processes to analyze
        # all of the items in the input queue. Append processed file
        # to output queue.
        for item in self.input_queue:
            self.pool.apply_async(worker, args=(item,), callback=self.callback)

        # Poll for items to send to client.
        return self.get_processed_items()

    def callback(self, processed):
        self.output_queue.put(processed)

    def get_processed_items(self):
        # Wait for the output queue to hold at least 1 item.
        # When an item becomes available, yield it to client.
        try:
            count = 0
            while count != self.file_count:
                try:
                    item = self.output_queue.get(timeout=1)
                except:
                    # Queue is empty. Wait and retry.
                    time.sleep(1)
                    continue
                count += 1
                yield item
            yield 'end'
        finally:
            # Kill all child processes.
            self.pool.kill(block=True)    # <- Causes segfault.
            # self.pool.kill(block=False)  # <- Silently fails.
This only happens when a user makes a selection, and while processing those files, makes another selection, effectively closing the current connection and creating a new one, creating two different sets of child processes.
Why does blocking cause a segmentation fault? Why is the behavior different when blocking vs not blocking? What can I do to kill all the original processes?

Using celery to process huge text files

Background
I'm looking into using celery (3.1.8) to process huge text files (~30GB each). These files are in fastq format and contain about 118M sequencing "reads", each of which is essentially a combination of a header, a DNA sequence, and a quality string. Also, these sequences are from a paired-end sequencing run, so I'm iterating over two files simultaneously (via itertools.izip). What I'd like to be able to do is take each pair of reads, send it to a queue, and have it processed on one of the machines in our cluster (don't care which) to return a cleaned-up version of the read, if cleaning needs to happen (e.g., based on quality).
I've set up celery and rabbitmq, and my workers are launched as follows:
celery worker -A tasks --autoreload -Q transient
and configured like:
from kombu import Queue

BROKER_URL = 'amqp://guest@godel97'
CELERY_RESULT_BACKEND = 'rpc'
CELERY_TASK_SERIALIZER = 'pickle'
CELERY_RESULT_SERIALIZER = 'pickle'
CELERY_ACCEPT_CONTENT = ['pickle', 'json']
CELERY_TIMEZONE = 'America/New_York'
CELERY_ENABLE_UTC = True
CELERYD_PREFETCH_MULTIPLIER = 500

CELERY_QUEUES = (
    Queue('celery', routing_key='celery'),
    Queue('transient', routing_key='transient', delivery_mode=1),
)
I've chosen to use an rpc backend and pickle serialization for performance, as well as not
writing anything to disk in the 'transient' queue (via delivery_mode).
Celery startup
To set up the celery framework, I first launch the rabbitmq server (3.2.3, Erlang R16B03-1) on a 64-way box, writing log files to a fast /tmp disk. Worker processes (as above) are launched on each node of the cluster (about 34 of them), ranging anywhere from 8-way to 64-way SMP, for a total of 688 cores. So, I have a ton of available CPUs for the workers to use to process the queue.
Job submission/performance
Once celery is up and running, I submit the jobs via an ipython notebook as below:
files = [foo, bar]
f1 = open(files[0])
f2 = open(files[1])
res = []
count = 0
for r1, r2 in izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)):
    count += 1
    res.append(tasks.process_read_pair.s(r1, r2))
    if count == 10000:
        break
t.stop()
g = group(res)
for task in g.tasks:
    task.set(queue="transient")
This takes about 1.5s for 10000 pairs of reads. Then, I call delay on the group to submit to the workers, which takes about 20s, as below:
result = g.delay()
Monitoring with rabbitmq console, I see that I'm doing OK, but not nearly fast enough.
Question
So, is there any way to speed this up? I mean, I'd like to see at least 50,000 read pairs processed every second rather than 500. Is there anything obvious that I'm missing in my celery configuration? My worker and rabbit logs are essentially empty. Would love some advice on how to get my performance up. Each individual read pair processes pretty quickly, too:
[2014-01-29 13:13:06,352: INFO/Worker-1] tasks.process_read_pair[95ec7f2f-0143-455a-a23b-c032998951b8]: HWI-ST425:143:C04A5ACXX:3:1101:13938:2894 1:N:0:ACAGTG HWI-ST425:143:C04A5ACXX:3:1101:13938:2894 2:N:0:ACAGTG 0.00840497016907 sec
Up to this point
So up to this point, I've googled all I can think of with celery, performance, routing, rabbitmq, etc. I've been through the celery website and docs. If I can't get the performance higher, I'll have to abandon this method in favor of another solution (basically dividing up the work into many smaller physical files and processing them directly on each compute node with multiprocessing or something). It would be a shame to not be able to spread this load out over the cluster, though. Plus, this seems like an exquisitely elegant solution.
Thanks in advance for any help!
Not an answer but too long for a comment.
Let's narrow the problem down a little...
Firstly, try skipping all your normal logic/message preparation and just do the tightest possible publishing loop with your current library. See what rate you get. This will identify if it's a problem with your non-queue-related code.
If it's still slow, set up a new python script but use amqplib instead of celery. I've managed to get it publishing at over 6000/s while doing useful work (and json encoding) on a mid-range desktop, so I know that it's performant. This will identify if the problem is with the celery library. (To save you time, I've snipped the following from a project of mine and hopefully not broken it when simplifying...)
from amqplib import client_0_8 as amqp

try:
    lConnection = amqp.Connection(
        host=###,
        userid=###,
        password=###,
        virtual_host=###,
        insist=False)
    lChannel = lConnection.channel()
    Exchange = ###

    for i in range(100000):
        lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
        lMessage.properties["delivery_mode"] = 2
        lChannel.basic_publish(lMessage, exchange=Exchange)

    lChannel.close()
    lConnection.close()
except Exception as e:
    # Fail
    pass
Between the two approaches above you should be able to track down the problem to one of the Queue, the Library or your code.
Reusing the producer instance should give you some performance improvement:
with app.producer_or_acquire() as producer:
    task.apply_async(producer=producer)
Also the task may be a proxy object and if so must be evaluated for every invocation:
task = task._get_current_object()
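Put together, for the submission loop in the question this might look roughly like the following (an untested sketch; tasks, app, f1, f2 and FastqGeneralIterator are as defined in the question):
process_read_pair = tasks.process_read_pair._get_current_object()
with app.producer_or_acquire() as producer:
    for r1, r2 in izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)):
        process_read_pair.apply_async(
            args=(r1, r2),
            queue='transient',
            producer=producer,   # one connection/channel reused for every publish
        )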
Using group will automatically reuse the producer and is usually what you would
do in a loop like this:
process_read_pair = tasks.process_read_pair.s
g = group(
    process_read_pair(r1, r2)
    for r1, r2 in islice(
        izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)), 0, 1000)
)
result = g.delay()
You can also consider installing the librabbitmq module, which is written in C.
The amqp:// transport will automatically use it if available (or it can be specified manually using librabbitmq://):
pip install librabbitmq
Publishing messages directly using the underlying library may be faster
since it will bypass the celery routing helpers and so on, but I would not
think it was that much slower. If so there is definitely room for optimization in Celery,
as I have mostly focused on optimizing the consumer side so far.
Note also that you may want to process multiple DNA pairs in the same task,
as using coarser task granularity may be beneficial for CPU/memory caches and so on,
and it will often saturate parallelization anyway since that is a finite resource.
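A sketch of what coarser granularity could look like, assuming a hypothetical process_read_chunk task (process_one_pair stands in for the existing per-pair logic; app, group, FastqGeneralIterator, f1 and f2 are as in the question):
from itertools import islice, izip

CHUNK_SIZE = 1000  # number of read pairs per task; tune to taste

@app.task()
def process_read_chunk(pairs):
    # One invocation handles a whole chunk of read pairs.
    return [process_one_pair(r1, r2) for r1, r2 in pairs]

def chunked(iterable, size):
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            break
        yield block

pairs = izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2))
g = group(process_read_chunk.s(block) for block in chunked(pairs, CHUNK_SIZE))
result = g.delay()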
NOTE: The transient queue should be durable=False
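That is, a possible tweak to the CELERY_QUEUES definition shown in the question (keeping the existing delivery_mode argument):
from kombu import Queue

CELERY_QUEUES = (
    Queue('celery', routing_key='celery'),
    # durable=False: the broker does not persist the queue definition itself;
    # delivery_mode=1 keeps individual messages transient, as before.
    Queue('transient', routing_key='transient', durable=False, delivery_mode=1),
)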
Another option: the reads are highly compressible, so you could replace
res.append(tasks.process_read_pair.s(r1, r2))
with
res.append(tasks.process_bytes.s(
    zlib.compress(pickle.dumps((r1, r2), pickle.HIGHEST_PROTOCOL), 1)))  # compression level 1
and call pickle.loads(zlib.decompress(obj)) on the other side.
This should win you a big factor for sufficiently long DNA sequences; if they are not long enough, you can group them into chunks in an array, which you then dump and compress.
Another win could come from using zeroMQ for transport, if you aren't already.
I'm not sure what process_bytes should be.
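Roughly, it would just be a task that unpacks the blob and delegates to the existing per-pair code, something like this (a sketch; process_one_pair is a placeholder for that code):
import pickle
import zlib

@app.task()
def process_bytes(blob):
    # Decompress and unpickle the (r1, r2) tuple, then hand it to the
    # existing per-pair logic.
    r1, r2 = pickle.loads(zlib.decompress(blob))
    return process_one_pair(r1, r2)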
Again, not an answer, but too long for comments. Per Basic's comments/answer below, I set up the following test using the same exchange and routing as my application:
from amqplib import client_0_8 as amqp

try:
    lConnection = amqp.Connection()
    lChannel = lConnection.channel()
    Exchange = 'celery'

    for i in xrange(1000000):
        lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
        lMessage.properties["delivery_mode"] = 1
        lChannel.basic_publish(lMessage, exchange=Exchange, routing_key='transient')

    lChannel.close()
    lConnection.close()
except Exception as e:
    print e
You can see that it's rocking right along.
I guess now it's down to finding out the difference between this and what's going on inside of celery.
I added amqp into my logic, and it's fast. FML.
from amqplib import client_0_8 as amqp

try:
    import stopwatch

    lConnection = amqp.Connection()
    lChannel = lConnection.channel()
    Exchange = 'celery'

    t = stopwatch.Timer()

    files = [foo, bar]
    f1 = open(files[0])
    f2 = open(files[1])

    res = []
    count = 0
    for r1, r2 in izip(FastqGeneralIterator(f1), FastqGeneralIterator(f2)):
        count += 1
        # res.append(tasks.process_read_pair.s(args=(r1, r2)))
        # lMessage = amqp.Message("~130 bytes of test data..........................................................................................................")
        lMessage = amqp.Message(" ".join(r1) + " ".join(r2))
        res.append(lMessage)
        lMessage.properties["delivery_mode"] = 1
        lChannel.basic_publish(lMessage, exchange=Exchange, routing_key='transient')
        if count == 1000000:
            break

    t.stop()
    print "added %d tasks in %s" % (count, t)

    lChannel.close()
    lConnection.close()
except Exception as e:
    print e
So, I made a change to submit an async task to celery in the loop, as below:
res.append(tasks.speed.apply_async(args=("FML",), queue="transient"))
The speed method is just this:
@app.task()
def speed(s):
    return s
Submitting the tasks this way, I'm slow again!
So, it doesn't appear to have anything to do with:
How I'm iterating to submit to the queue
The message that I'm submitting
but rather, it has to do with the queueing of the function?!?! I'm confused.
Again, not an answer, but more of an observation. By simply changing my backend from rpc to redis, I more than triple my throughput:
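(For reference, that is just swapping the result backend setting, along these lines; the exact URL depends on your redis host, port and database.)
# in the celery config, replacing CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'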
