Threading in python - processing multiple large files concurrently - python

I'm new to python and I'm having trouble understanding how threading works. From skimming the documentation, my understanding is that calling join() on a thread is the recommended way of blocking until it completes.
To give a bit of background, I have 48 large csv files (multiple GB) which I am trying to parse in order to find inconsistencies. The threads share no state. This can be done single-threaded in a reasonable amount of time as a one-off, but I am trying to do it concurrently as an exercise.
Here's a skeleton of the file processing:
def process_file(data_file):
    with open(data_file) as f:
        print "Start processing {0}".format(data_file)
        line = f.readline()
        while line:
            # logic omitted for brevity; can post if required
            # pretty certain it works as expected, single 'thread' works fine
            line = f.readline()
        print "Finished processing file {0} with {1} errors".format(data_file, error_count)

def process_file_callable(data_file):
    try:
        process_file(data_file)
    except:
        print >> sys.stderr, "Error processing file {0}".format(data_file)
And the concurrent bit:
def partition_list(l, n):
    """ Yield successive n-sized partitions from a list.
    """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

partitions = list(partition_list(data_files, 4))
for partition in partitions:
    threads = []
    for data_file in partition:
        print "Processing file {0}".format(data_file)
        t = Thread(name=data_file, target=process_file_callable, args=(data_file,))
        threads.append(t)
        t.start()
    for t in threads:
        print "Joining {0}".format(t.getName())
        t.join(5)
    print "Joined the first chunk of {0}".format(map(lambda t: t.getName(), threads))
I run this as:
python -u datautils/cleaner.py > cleaner.out 2> cleaner.err
My understanding is that join() should block the calling thread waiting for the thread it's called on to finish, however the behaviour I'm observing is inconsistent with my expectation.
I never see errors in the error file, but I also never see the expected log messages on stdout.
The parent process does not terminate unless I explicitly kill it from the shell. If I check how many prints I have for Finished ... it's never the expected 48, but somewhere between 12 and 15. However, having also run this single-threaded, I can confirm that the multithreaded run is actually processing everything and doing all the expected validation; it just does not seem to terminate cleanly.
I know I must be doing something wrong, but I would really appreciate if you can point me in the right direction.

I can't see where the mistake in your code is, but I can recommend refactoring it a little.
First of all, threading in Python is not truly parallel: because of the Global Interpreter Lock, only one thread executes Python bytecode at a time. That's why I recommend using the multiprocessing module:
from multiprocessing import Pool, cpu_count

pool = Pool(cpu_count())
for partition in partition_list(data_files, 4):
    res = pool.map(process_file_callable, partition)
    print res
Secondly, you are not reading the file in a Pythonic way:
with open(...) as f:
    line = f.readline()
    while line:
        ...  # do(line)
        line = f.readline()
Here is the Pythonic way:
with open(...) as f:
    for line in f:
        ...  # do(line)
This is memory efficient, fast, and leads to simple code. (from the Python docs)
By the way, I have only one hypothesis about what can happen to your program when multithreaded: the app may become slower, because random access to the hard disk drive is significantly slower than sequential access. You can try to check this hypothesis using iostat or htop if you are using Linux.
If your app does not finish its work and doesn't appear to do anything in a process monitor (CPU and disk are idle), it means you have some kind of deadlock or blocked access to the same resource.
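If the app seems hung, a minimal sketch (not from the question's code) that dumps every thread's current stack can show where each worker is blocked:
import sys
import threading
import traceback

def dump_thread_stacks():
    """Print the current stack of every live thread to stderr."""
    names = dict((t.ident, t.getName()) for t in threading.enumerate())
    for thread_id, frame in sys._current_frames().items():
        print >> sys.stderr, "--- thread %s (%s) ---" % (thread_id, names.get(thread_id, "?"))
        traceback.print_stack(frame, file=sys.stderr)

# Call dump_thread_stacks() from the main thread (e.g. after a join timeout)
# to see which line each worker is currently blocked on.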

Thanks everybody for your input and sorry for not replying sooner - I'm working on this on and off as a hobby project.
I've managed to write a simple example that proves it was my bad:
from itertools import groupby
from threading import Thread
from random import randint
from time import sleep

for key, partition in groupby(range(1, 50), lambda k: k // 10):
    threads = []
    for idx in list(partition):
        thread_name = 'thread-%d' % idx
        t = Thread(name=thread_name, target=sleep, args=(randint(1, 5),))
        threads.append(t)
        print 'Starting %s' % t.getName()
        t.start()
    for t in threads:
        print 'Joining %s' % t.getName()
        t.join()
    print 'Joined the first group of %s' % map(lambda t: t.getName(), threads)
The reason it was failing initially was the while loop: the 'logic omitted for brevity' worked fine, but some of the input files being fed in were corrupted (had jumbled lines) and the logic went into an infinite loop on them. This is why some threads were never joined. The timeout on the join made sure they were all started, but some never finished, hence the inconsistency between 'starting' and 'joining'. The other fun fact was that the corruption was on the last line, so all the expected data was still being processed.
Thanks again for your advice - the comment about processing files with a while loop instead of the Pythonic way pointed me in the right direction, and yes, threading behaves as expected.
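As a side note, join() with a timeout returns quietly even if the thread is still running; a minimal sketch (hypothetical worker, not the original code) of how to detect that case with is_alive():
from threading import Thread
from time import sleep

def worker():
    sleep(10)  # stands in for a long (or stuck) file-processing loop

t = Thread(name='worker-1', target=worker)
t.start()

t.join(5)  # returns after at most 5 seconds, even if the thread is still running
if t.is_alive():
    print 'Thread %s did not finish within the timeout' % t.getName()
else:
    print 'Thread %s finished' % t.getName()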

Related

Why doesn't multithreading speed up my program?

I have a big text file that needs to be processed. I first read all the text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel() and get_relations().
I am on a Mac, and my observations show that it doesn't really speed up the processing (a CPU with 8 cores, but only 15% CPU used). If there is a performance bottleneck in either is_channel() or get_relations(), then multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed things up instead of multithreading?
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)
    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)
    for index, entities_relations_list in enumerate(all_results):
        # print out results
        pass

def process_text(text, channel):
    global channel_text
    global non_channel_text
    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Print the results (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to the terminal and are just piping the output to a file or something else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
by x >> y I mean x is significantly greater than y.
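A minimal sketch of how to measure T1, T2 and T3 before deciding on an approach (process_text and channel are the names from the question; results.txt is a placeholder output path):
import time

def measure(file_name):
    t0 = time.perf_counter()
    with open(file_name, 'r', encoding='utf8') as f:
        all_lines = [line.strip() for line in f]          # T1: read
    t1 = time.perf_counter()

    all_results = [process_text(line, channel) for line in all_lines]  # T2: process
    t2 = time.perf_counter()

    with open('results.txt', 'w', encoding='utf8') as out:
        for result in all_results:                        # T3: print/write
            out.write(repr(result) + '\n')
    t3 = time.perf_counter()

    print('T1 (read) = %.2fs, T2 (process) = %.2fs, T3 (write) = %.2fs'
          % (t1 - t0, t2 - t1, t3 - t2))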
Based on the above and the file size, you can try a few approaches:
Threading based
Even this can be done in 2 ways; which one works faster can be found out, again, by benchmarking/looking at the timings.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code to read the file itself in a thread and let it push the lines to a queue instead of the list.
This thread inserts a None at the end when it is done reading from the file. This is important to tell the workers that they can stop.
Now run the processing workers and pass them the queue
The workers keep reading from the queue in a loop and processing what they get. Similar to the reader thread, these workers put their results in a queue.
Once a thread encounters a None, it stops the loop and re-inserts the None into the queue (so that other threads can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer and multiple consumer threads; a sketch follows below.
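A minimal sketch of Approach-1 (process_text and channel are the question's names and stand in for the real work; everything else is illustrative):
import threading
import queue

def reader(file_name, line_q):
    # Producer: push lines to the queue as they are read.
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            line_q.put(line.strip())
    line_q.put(None)  # sentinel: nothing more to read

def worker(line_q, result_q):
    # Consumer: process lines until the sentinel is seen.
    while True:
        line = line_q.get()
        if line is None:
            line_q.put(None)  # re-insert so the other workers can stop too
            break
        result_q.put(process_text(line, channel))  # placeholders from the question
    result_q.put(None)  # one "done" marker per worker

def run(file_name, num_workers=10):
    line_q, result_q = queue.Queue(maxsize=1000), queue.Queue()
    threads = [threading.Thread(target=reader, args=(file_name, line_q))]
    threads += [threading.Thread(target=worker, args=(line_q, result_q)) for _ in range(num_workers)]
    for t in threads:
        t.start()

    finished = 0
    while finished < num_workers:    # drain results; this could also run in its own thread
        result = result_q.get()
        if result is None:
            finished += 1
        else:
            print(result)
    for t in threads:
        t.join()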
Approach-2 (This is just another way of doing what is being already done by the code snippet in the question)
Read the entire file into a list.
Divide the list into index ranges based on no. of threads.
Example: if the file has 100 lines in total and we use 10 threads
then 0-9, 10-19, .... 90-99 are the index ranges
Pass the complete list and these index ranges to the threads so each one processes its own set. Since you are not modifying the original list, this works.
This approach can give better results than running a worker for each individual line; a sketch follows below.
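A minimal sketch of Approach-2, again using the question's process_text and channel as placeholders:
from concurrent.futures import ThreadPoolExecutor

def process_range(all_lines, start, end):
    # Each worker reads only its own slice of the shared list; nothing is modified in place.
    return [process_text(line, channel) for line in all_lines[start:end]]

def run_chunked(all_lines, num_threads=10):
    chunk = max(1, (len(all_lines) + num_threads - 1) // num_threads)  # ceiling division
    ranges = [(i, min(i + chunk, len(all_lines))) for i in range(0, len(all_lines), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(process_range, all_lines, start, end) for start, end in ranges]
        # Flatten the per-range result lists, preserving the original line order.
        return [r for f in futures for r in f.result()]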
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process
This requires an additional step of combining all results/files at the end.
The process creation part can be done from within Python using the multiprocessing module (a sketch follows below),
or from a driver script, like a shell script, that spawns a Python process for each file.
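A minimal multiprocessing sketch under the same assumptions (the input has already been split into chunk files; process_text and channel are the question's placeholders, and the final merge step is left out):
from multiprocessing import Pool, cpu_count

def process_one_file(file_path):
    # Runs in its own process, so CPU-bound work is not limited by the GIL.
    out_path = file_path + '.out'
    with open(file_path, 'r', encoding='utf8') as f, open(out_path, 'w', encoding='utf8') as out:
        for line in f:
            result = process_text(line.strip(), channel)  # placeholders from the question
            out.write(repr(result) + '\n')
    return out_path

if __name__ == '__main__':
    chunk_files = ['part_000.txt', 'part_001.txt', 'part_002.txt']  # produced by the split step
    pool = Pool(cpu_count())
    out_files = pool.map(process_one_file, chunk_files)
    pool.close()
    pool.join()
    # combine out_files into the final result here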
Just by looking at the code, it seems to be CPU bound. Hence I would prefer multiprocessing for this. I have used both approaches in practice.
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).

How to put() and get() from a multiprocessing.Queue() at the same time?

I'm working on a python 2.7 program that performs these actions in parallel using multiprocessing:
reads a line from file 1 and file 2 at the same time
applies function(line_1, line_2)
writes the function output to a file
I am new to multiprocessing and not an expert with Python in general. I have therefore read a lot of existing questions and tutorials: I feel close, but I am probably missing something that I can't really spot.
The code is structured like this:
from itertools import izip
from multiprocessing import Queue, Process, Lock
import multiprocessing as mp

nthreads = int(mp.cpu_count())
outq = Queue(nthreads)
l = Lock()

def func(record_1, record_2):
    result = # do stuff
    outq.put(result)

OUT = open("outputfile.txt", "w")
IN1 = open("infile_1.txt", "r")
IN2 = open("infile_2.txt", "r")

processes = []
for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2))
    processes.append(proc)
    proc.start()

for proc in processes:
    proc.join()

while (not outq.empty()):
    l.acquire()
    item = outq.get()
    OUT.write(item)
    l.release()

OUT.close()
IN1.close()
IN2.close()
To my understanding (so far) of the multiprocessing package, what I'm doing is:
creating a queue for the results of the function that has a size limit compatible with the number of cores of the machine.
filling this queue with the results of func().
reading the queue items until the queue is empty, writing them to the output file.
Now, my problem is that when I run this script it immediately becomes a zombie process. I know that the function works because without the multiprocessing implementation I had the results I wanted.
I'd like to read from the two files and write to output at the same time, to avoid generating a huge list from my input files and then reading it (input files are huge). Do you see anything gross, completely wrong or improvable?
The biggest issue I see is that you should pass the queue object through to the process instead of trying to use it as a global in your function.
def func(record_1, record_2, queue):
    result = # do stuff
    queue.put(result)

for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2, outq))
Also, as currently written, you would still be pulling all that information into memory (i.e. the queue) and waiting for the reading to finish before writing to the output file. You need to move the proc.join loop to after you have read through the queue, and instead of putting all the information into the queue at the end of func, it should fill the queue with chunks in a loop over time, or else it's the same as just reading it all into memory.
You also don't need a lock unless you are using it in the worker function func, and if you do, you will again want to pass it through. A rough sketch of that restructuring follows.
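This is not the asker's actual logic, just a minimal Python 2 sketch of the restructuring: the queue is passed through, the output is drained while the workers are still running, and the join happens only after the queue has been read (do_stuff is a placeholder, and the one-process-per-record structure from the question is kept for clarity; a pool would be better in practice):
from itertools import izip
from multiprocessing import Process, Queue

def do_stuff(record_1, record_2):
    return record_1.strip() + "\t" + record_2.strip() + "\n"  # placeholder for the real function

def func(record_1, record_2, queue):
    queue.put(do_stuff(record_1, record_2))

if __name__ == "__main__":
    outq = Queue()
    processes = []
    with open("infile_1.txt") as IN1, open("infile_2.txt") as IN2:
        for record_1, record_2 in izip(IN1, IN2):
            proc = Process(target=func, args=(record_1, record_2, outq))
            processes.append(proc)
            proc.start()

    with open("outputfile.txt", "w") as OUT:
        for _ in processes:        # one result per worker, read while the workers are running
            OUT.write(outq.get())

    for proc in processes:         # join only after the queue has been drained
        proc.join()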
If you want to avoid reading/storing a lot in memory, I would write out at the same time as iterating through the input files. Here is a basic example of combining each line of the two files:
with open("infile_1.txt") as infile1, open("infile_2.txt") as infile2, open("out", "w") as outfile:
for line1, line2 in zip(infile1, infile2):
outfile.write(line1 + line2)
I don't want to write too much about all of these, just trying to give you ideas. Let me know if you want more detail about something. Hope it helps!

Troubleshooting data inconsistencies with Python multiprocessing/threading

TL;DR: Getting different results after running code with threading and multiprocessing and single threaded. Need guidance on troubleshooting.
Hello, I apologize in advance if this may be a bit too generic, but I need a bit of help troubleshooting an issue and I am not sure how best to proceed.
Here is the story: I have a bunch of data indexed into a Solr collection (~250m items); all items in that collection have a sessionid, and some items can share the same session id. I am combing through the collection to extract all items that have the same session, massage the data a bit, and spit out another JSON file for indexing later.
The code has two main functions:
proc_day - accepts a day and processes all the sessions for that day
and
proc_session - does everything that needs to happen for a single session.
Multiprocessing is implemented on proc_day, so each day is processed by a separate process; the proc_session function can be run with threads. Below is the code I am using for threading/multiprocessing. It accepts a function, a list of arguments and a number of threads/processes. It will then create a queue based on the input args, then create processes/threads and let them work through it. I am not posting the actual processing code, since it generally runs fine single-threaded without any issues, but can post it if needed.
autoprocs.py
import sys
import logging
from multiprocessing import Process, Queue, JoinableQueue
import time
import multiprocessing
import os

def proc_proc(func, data, threads, delay=10):
    if threads < 0:
        return
    q = JoinableQueue()
    procs = []
    for i in range(threads):
        thread = Process(target=proc_exec, args=(func, q))
        thread.daemon = True
        thread.start()
        procs.append(thread)
    for item in data:
        q.put(item)
    logging.debug(str(os.getpid()) + ' *** Processes started and data loaded into queue waiting')
    s = q.qsize()
    while s > 0:
        logging.info(str(os.getpid()) + " - Proc Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(delay)
    for p in procs:
        logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
        p.join(1)
    logging.debug(str(os.getpid()) + ' - *** Main Proc waiting')
    q.join()
    logging.debug(str(os.getpid()) + ' - *** Done')

def proc_exec(func, q):
    p = multiprocessing.current_process()
    logging.debug(str(os.getpid()) + ' - Starting:{},{}'.format(p.name, p.pid))
    while True:
        d = q.get()
        try:
            logging.debug(str(os.getpid()) + " - Starting to Process {}".format(d))
            func(d)
            sys.stdout.flush()
            logging.debug(str(os.getpid()) + " - Marking Task as Done")
            q.task_done()
        except:
            logging.error(str(os.getpid()) + " - Exception in subprocess execution")
            logging.error(sys.exc_info()[0])
    logging.debug(str(os.getpid()) + 'Ending:{},{}'.format(p.name, p.pid))
autothreads.py:
import threading
import logging
import time
from queue import Queue

def thread_proc(func, data, threads):
    if threads < 0:
        return "Thread count not specified"
    q = Queue()
    for i in range(threads):
        thread = threading.Thread(target=thread_exec, args=(func, q))
        thread.daemon = True
        thread.start()
    for item in data:
        q.put(item)
    logging.debug('*** Main thread waiting')
    s = q.qsize()
    while s > 0:
        logging.debug("Queue Size is:" + str(s))
        s = q.qsize()
        time.sleep(1)
    logging.debug('*** Main thread waiting')
    q.join()
    logging.debug('*** Done')

def thread_exec(func, q):
    while True:
        d = q.get()
        # logging.debug("Working...")
        try:
            func(d)
        except:
            pass
        q.task_done()
I am running into problems with validating data after python runs under different multiprocessing/threading configs. There is a lot of data, so I really need to get multiprocessing working. Here are the results of my test yesterday.
Only multiprocessing - 10 procs:
Days Processed 30
Sessions Found 3,507,475
Sessions Processed 3,514,496
Files 162,140
Data Output: 1.9GB
Multiprocessing and multithreading - 10 procs, 10 threads:
Days Processed 30
Sessions Found 3,356,362
Sessions Processed 3,272,402
Files 424,005
Data Output: 2.2GB
Just threading - 10 threads:
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 733,664
Data Output: 3.3GB
Single process / no threading:
Days Processed 31
Sessions Found 3,595,263
Sessions Processed 3,595,263
Files 162,190
Data Output: 1.9GB
These counts were gathered by grepping and counting entries in the log files (1 per main process). The first thing that jumps out is that the days processed don't match. However, I manually checked the log files and it looks like a log entry was missing; there are follow-on log entries indicating that the day was actually processed. I have no idea why it was omitted.
I really don't want to write more code to validate this code; it just seems like a terrible waste of time. Is there any alternative?
I gave some general hints in the comments above. I think there are multiple problems with your approach, at very different levels of abstraction. You are also not showing all the relevant code.
The issue might very well be
in the method you are using to read from solr or in preparing read data before feeding it to your workers.
in the architecture you have come up with for distributing the work among multiple processes.
in your logging infrastructure (as you have pointed out yourself).
in your analysis approach.
You have to go through all of these points, and given the complexity of the issue surely nobody here will be able to identify the exact issues for you.
Regarding points (3) and (4):
If you are not sure about the completeness of your log files, you should perform the analysis based on the payload output of your processing engine. What I am trying to say: the log files probably are just a side product of your data processing. The primary product is the thing you should analyze. Of course it is also important to get your logs right. But these two problems should be treated independently.
My contribution regarding point (2) in the list above:
What is especially suspicious about your multiprocessing-based solution is the way you wait for the workers to finish. You do not seem to be sure which method you should use to wait for your workers, so you apply three different methods:
First, you monitor the size of the queue in a while loop and wait for it to become 0. This is a non-canonical approach that might actually work.
Secondly, you join() your processes in a weird way:
for p in procs:
    logging.debug(str(os.getpid()) + " - Joining Process {}".format(p))
    p.join(1)
Why do you define a timeout of one second here and not respond to whether the process actually terminated within that time frame? You should either really join a process, i.e. wait until it has terminated, or specify a timeout and, if that timeout expires before the process finishes, treat that situation specially. Your code does not distinguish these situations, so p.join(1) is like writing time.sleep(1) instead.
Thirdly, you join the queue.
So, after making sure that q.qsize() returns 0 and after waiting for another second, do you really think that joining the queue is important? Does it make any difference? One of these approaches should be enough, and you need to think about which of these criteria is most important to your problem. That is, one of these conditions should deterministically imply the other two.
All this looks like a quick & dirty hack of a multiprocessing solution, whereas you yourself are not really sure how that solution should behave. One of the most important insights I have obtained while working on concurrency architectures: You, the architect, must be 100 % aware of how the communication and control flow works in your system. Not properly monitoring and controlling the state of your worker processes may very well be the source of the issues you are observing.
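As an illustration of the join-with-timeout point above, a minimal sketch (not from the original code) that actually reacts when a worker does not finish in time:
import logging

def join_workers(procs, timeout=30):
    # Wait for each multiprocessing.Process, but react if a worker does not finish in time.
    for p in procs:
        p.join(timeout)
        if p.is_alive():
            logging.error("Process %s (pid %s) did not finish within %ss", p.name, p.pid, timeout)
            # Decide explicitly what to do: keep waiting, terminate, or flag the run as incomplete.
            p.terminate()
            p.join()
        else:
            logging.debug("Process %s exited with code %s", p.name, p.exitcode)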
I figured it out. I followed Jan-Philip's advice and started examining the output data of the multiprocess/multithreaded runs. It turned out that an object that does all these things with the data from Solr was shared among threads. I did not have any locking mechanisms, so in some cases it mixed data from multiple sessions, which caused inconsistent output. I validated this by instantiating a new object for every thread (roughly like the sketch below), and the counts matched up. It is a bit slower, but still workable.
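A minimal sketch of that fix (SolrSessionProcessor and process_session are hypothetical names, not the actual classes): each worker builds its own processor instead of sharing one object across threads.
import logging
import threading
from queue import Queue

def thread_exec(processor_factory, q):
    # Each worker creates its own processor, so no mutable state is shared between workers.
    processor = processor_factory()       # e.g. lambda: SolrSessionProcessor(...), hypothetical
    while True:
        d = q.get()
        try:
            processor.process_session(d)  # hypothetical per-session method
        except Exception:
            logging.exception("Worker failed on %s", d)
        finally:
            q.task_done()

def thread_proc(processor_factory, data, threads):
    q = Queue()
    for _ in range(threads):
        t = threading.Thread(target=thread_exec, args=(processor_factory, q), daemon=True)
        t.start()
    for item in data:
        q.put(item)
    q.join()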
Thanks

Strange python queue behavior. Crashes if queue isn't named "queue"

The name kind of says it all. I'm writing this program in python 2.7, and I'm trying to take advantage of threaded queues to make a whole bunch of web requests. Here's the problem: I would like to have two different queues, one to handle the threaded requests and a separate one to handle the responses. If I have a queue in my program that isn't named "queue", for example if I want the initial queue to be named "input_q", then the program crashes and just refuses to work. This makes absolutely no sense to me. In the code below, all of the imported custom modules work just fine (at least, they did independently and passed all unit tests, and I don't see any reason they could be the source of the problem).
Also, via diagnostic statements, I have determined that it crashes just before it spawns the thread pool.
Thanks in advance.
EDIT: Crash may be the wrong term here. It actually just stops. Even after waiting half an hour for it to complete, when the original program ran in under thirty seconds, the program wouldn't finish. When I told it to print out toCheck, it would only make it part way through the list, stop in the middle of an entry, and do nothing.
EDIT2: Sorry for wasting everyone's time, I forgot about this post. Someone had changed one of my custom modules (threadCheck). It looks like the program was initializing the module, then running along its merry way with the rest of the program. threadCheck was crashing after initialization, while the program was in the middle of computations, and that crash was taking the whole thing down with it.
code:
from binMod import binExtract
from grabZip import grabZip
import random
import Queue
import time
import threading
import urllib2
from threadCheck import threadUrl
import datetime

queue = Queue.Queue()
#output_q = Queue.Queue()
#input_q = Queue.Queue()
#output = queue

p = 90
qb = 22130167533
url = grabZip(qb)
logFile = "log.txt"
metaC = url.grabMetacell()
toCheck = []

print metaC[0]['images']
print "beginning random selection"
for i in range(4):
    if (len(metaC[i]['images']) > 0):
        print metaC[i]['images'][0]
        for j in range(len(metaC[i]['images'])):
            chance = random.randint(0, 100)
            if chance <= p:
                toCheck.append(metaC[i]['images'][j]['resolution 7 url'])

print "Spawning threads..."
for i in range(20):
    t = threadUrl(queue)
    t.setDaemon(True)
    t.start()

print "initializing queue..."
for i in range(len(toCheck)):
    queue.put(toCheck[i])

queue.join()
#input_q.join()

output = open(logFile, 'a')
done = datetime.datetime.now()
results = "\n %s \t %s \t %s \t %s" % (done, qb, good, bad)
output.write(results)
What the names are is irrelevant to Python -- Python doesn't care, and the objects themselves (for the most part) don't even know the names they have been assigned to. So the problem has to be somewhere else.
As has been suggested in the comments, carefully check your renames of queue.
Also, try it without daemon mode.

Using the Queue class in Python 2.6

Let's assume I'm stuck using Python 2.6, and can't upgrade (even if that would help). I've written a program that uses the Queue class. My producer is a simple directory listing. My consumer threads pull a file from the queue, and do stuff with it. If the file has already been processed, I skip it. The processed list is generated before all of the threads are started, so it isn't empty.
Here's some pseudo-code.
import os, Queue, sys, threading

processed = []

def consumer():
    while True:
        file = dirlist.get(block=True)
        if file in processed:
            print "Ignoring %s" % file
        else:
            pass  # do stuff here
        dirlist.task_done()

dirlist = Queue.Queue()
for f in os.listdir("/some/dir"):
    dirlist.put(f)

max_threads = 8
for i in range(max_threads):
    thr = threading.Thread(target=consumer)
    thr.start()

dirlist.join()
The strange behavior I'm getting is that if a thread encounters a file that's already been processed, the thread stalls out and waits until the entire program ends. I've done a little bit of testing, and the first 7 threads (assuming 8 is the max) stop, while the 8th thread keeps processing, one file at a time. But, by doing that, I'm losing the entire reason for threading the application.
Am I doing something wrong, or is this the expected behavior of the Queue/threading classes in Python 2.6?
I tried running your code, and did not see the behavior you describe. However, the program never exits. I recommend changing the .get() call as follows:
try:
    file = dirlist.get(True, 1)
except Queue.Empty:
    return
If you want to know which thread is currently executing, you can import the thread module and print thread.get_ident().
I added the following line after the .get():
print file, thread.get_ident()
and got the following output:
bin 7116328
cygdrive 7116328
cygwin.bat 7149424
cygwin.ico 7116328
dev etc7598568
7149424
fix 7331000
home 7116328lib
7598568sbin
7149424Thumbs.db
7331000
tmp 7107008
usr 7116328
var 7598568proc
7441800
The output is messy because the threads are writing to stdout at the same time. The variety of thread identifiers further confirms that all of the threads are running.
Perhaps something is wrong in the real code or your test methodology, but not in the code you posted?
Since this problem only manifests itself when a thread finds a file that's already been processed, it seems like it is something to do with the processed list itself. Have you tried implementing a simple lock? For example:
processed = []
processed_lock = threading.Lock()

def consumer():
    while True:
        with processed_lock:
            fileInList = file in processed
        if fileInList:
            # ... et cetera
Threading tends to cause the strangest bugs, even if they seem like they "shouldn't" happen. Using locks on shared variables is the first step to make sure you don't end up with some kind of race condition that could cause threads to deadlock.
Of course, if what you're doing under # do stuff here is CPU-intensive, then Python will only run code from one thread at a time anyway, due to the Global Interpreter Lock. In that case, you may want to switch to the multiprocessing module - it's very similar to threading, though you will need to replace shared variables with another solution (see the multiprocessing documentation for details).
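As a rough sketch of that switch (not the original code; do_stuff is a placeholder for the per-file work), a multiprocessing.Pool keeps the skip-if-processed check in the parent and only ships the real work to worker processes:
import os
import multiprocessing

def do_stuff(filename):
    # Placeholder for the CPU-intensive per-file work; runs in a worker process.
    return filename

if __name__ == '__main__':
    processed = []  # filled in before the workers start, as in the question
    files = [f for f in os.listdir("/some/dir") if f not in processed]

    pool = multiprocessing.Pool(processes=8)
    results = pool.map(do_stuff, files)  # each file is handled in a separate process
    pool.close()
    pool.join()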
