Methods for passing large objects in python multiprocessing - python

I'm doing something like this:
from multiprocessing import Process, Queue
def func(queue):
    # do stuff to build up sub_dict
    queue.put(sub_dict)
main_dict = {}
num_processes = 16
processes = []
queue = Queue()
for i in range(num_processes):
    proc = Process(target=func, args=(queue,))  # func needs the queue passed in
    processes.append(proc)
    proc.start()
for proc in processes:
    main_dict.update(queue.get())
for proc in processes:
    proc.join()
The sub_dicts are something like 62,500 keys long, and each value is a several page document of words split into a numpy array.
What I've found is that the whole script tends to get stuck a lot towards the end of the executions of func. func takes about 25 minutes to run in each process (and I have 16 cores), but then I need to wait another hour before everything is done.
On another post commenters suggested that it's probably because of the overhead of the multiprocessing. That is, those huge sub_dicts need to be pickled and unpickled to rejoin the main process.
Apart from me coming up with my own data compression scheme, are there any handy ways to get around this problem?
More context
What I'm doing here is chunking a really large array of file names into 16 pieces and sending them to func. Then func opens those files, extracts the content, preprocesses it, and puts it in a sub_dict of the form {filename: content}. Then that sub_dict comes back to the main process to be added to main_dict. It's not the pickling of the original array chunks that's expensive; it's the pickling of the incoming sub_dicts.
EDIT
Doesn't solve the actual question here, but I found out what my real issue was. I was running into swap memory because I underestimated the usage as compared to the relatively smaller disk space of the dataset I was processing. Doubling the memory on my VM sorted the main issue.

Related

Why does not multithreading speed up my program?

I have a big text file that needs to be processed. I first read all text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on Mac and my observations show that it doesn't really speed up the processing (cpu with 8 cores, only 15% cpu is used). If there is a performance bottleneck in either the function is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed up instead of multithreading?
import itertools
from concurrent.futures import ThreadPoolExecutor
def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)
    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)
    for index, entities_relations_list in enumerate(all_results):
        # print out results
def process_text(text, channel):
    global channel_text
    global non_channel_text
    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing to do is find out how much time each stage takes:
1. Reading the file into memory (T1)
2. Doing all the processing (T2)
3. Printing the results (T3)
The third stage (printing), if you are really doing it, can slow things down. It's fine as long as you are piping the output to a file or something else rather than printing to the terminal.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
By x >> y I mean that x is significantly greater than y.
Based on the above and the file size, you can try a few approaches:
Threading based
Even this can be done in two ways; which one is faster can again be found by benchmarking and looking at the timings.
Approach-1 (T1 >> T2, or even when T1 and T2 are similar)
Run the code that reads the file in its own thread, and let it push the lines onto a queue instead of into the list.
This thread inserts a None at the end when it is done reading the file. This is important because it tells the workers that they can stop.
Now run the processing workers and pass them the queue.
The workers keep reading from the queue in a loop and processing the lines. Like the reader thread, these workers put their results on a queue.
Once a worker encounters a None, it stops its loop and re-inserts the None into the queue (so that the other workers can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer with multiple consumer threads.
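The steps of Approach-1 can be sketched roughly as follows, assuming Python 3. The function names, the `process_line` callback, and the queue size of 1000 are illustrative assumptions; the printing thread is left out, and results are simply collected once every thread has finished.

```python
import queue
import threading


def reader(lines, line_q):
    """Producer: push every line, then a single None to signal the end."""
    for line in lines:
        line_q.put(line)
    line_q.put(None)


def worker(line_q, result_q, process_line):
    """Consumer: process lines until the None sentinel shows up."""
    while True:
        line = line_q.get()
        if line is None:
            line_q.put(None)  # re-insert so the other workers stop too
            break
        result_q.put(process_line(line))


def run_pipeline(lines, process_line, num_workers=4):
    line_q = queue.Queue(maxsize=1000)  # bounded, so the reader cannot run far ahead
    result_q = queue.Queue()
    threads = [threading.Thread(target=reader, args=(lines, line_q))]
    threads += [
        threading.Thread(target=worker, args=(line_q, result_q, process_line))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every thread has finished, so the result queue can be drained safely.
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return results
```

`lines` can be an open file object, so the file is never held in memory as a whole. Results come back in completion order, not in file order.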
Approach-2 (this is just another way of doing what the code snippet in the question already does)
Read the entire file into a list.
Divide the list into index ranges based on the number of threads.
Example: if the file has 100 lines in total and we use 10 threads, then 0-9, 10-19, ..., 90-99 are the index ranges.
Pass the complete list and these index ranges to the threads so that each one processes its own range. This works because you are not modifying the original list.
This approach can give better results than running a worker for each individual line.
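A rough sketch of Approach-2, assuming Python 3. The helper names and the `process_line` callback are illustrative assumptions. Note that the ranges here are half-open `(start, stop)` pairs, so the 0-9 range from the example appears as `(0, 10)`.

```python
from concurrent.futures import ThreadPoolExecutor


def index_ranges(total, parts):
    """Split range(total) into at most `parts` contiguous (start, stop) pairs."""
    size, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:
            ranges.append((start, stop))
        start = stop
    return ranges


def process_range(lines, start, stop, process_line):
    # Only reads from the shared list, so no locking is needed.
    return [process_line(lines[i]) for i in range(start, stop)]


def process_in_chunks(lines, process_line, num_threads=10):
    ranges = index_ranges(len(lines), num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(process_range, lines, start, stop, process_line)
            for start, stop in ranges
        ]
        results = []
        for future in futures:  # submission order keeps results in line order
            results.extend(future.result())
    return results
```

Each thread receives one task covering a whole range, instead of one task per line as with `executor.map` over individual lines.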
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires an additional step of combining all the results/files at the end.
The processes can be created from within Python using the multiprocessing module, or from a driver script (such as a shell script) that spawns a Python process for each file.
Just by looking at the code, it seems to be CPU bound, so I would prefer multiprocessing here. I have used both approaches in practice:
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).

How to put() and get() from a multiprocessing.Queue() at the same time?

I'm working on a python 2.7 program that performs these actions in parallel using multiprocessing:
reads a line from file 1 and file 2 at the same time
applies function(line_1, line_2)
writes the function output to a file
I am new to multiprocessing and I'm not extremely expert with python in general. Therefore, I read a lot of already asked questions and tutorials: I feel close to the point but I am now probably missing something that I can't really spot.
The code is structured like this:
from itertools import izip
import multiprocessing as mp
from multiprocessing import Queue, Process, Lock
nthreads = int(mp.cpu_count())
outq = Queue(nthreads)
l = Lock()
def func(record_1, record_2):
    result = # do stuff
    outq.put(result)
OUT = open("outputfile.txt", "w")
IN1 = open("infile_1.txt", "r")
IN2 = open("infile_2.txt", "r")
processes = []
for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2))
    processes.append(proc)
    proc.start()
for proc in processes:
    proc.join()
while (not outq.empty()):
    l.acquire()
    item = outq.get()
    OUT.write(item)
    l.release()
OUT.close()
IN1.close()
IN2.close()
To my understanding (so far) of multiprocessing as package, what I'm doing is:
creating a queue for the results of the function that has a size limit compatible with the number of cores of the machine.
filling this queue with the results of func().
reading the queue items until the queue is empty, writing them to the output file.
Now, my problem is that when I run this script it immediately becomes a zombie process. I know that the function works because without the multiprocessing implementation I had the results I wanted.
I'd like to read from the two files and write to output at the same time, to avoid generating a huge list from my input files and then reading it (input files are huge). Do you see anything gross, completely wrong or improvable?
The biggest issue I see is that you should pass the queue object to the process as an argument instead of trying to use it as a global in your function.
def func(record_1, record_2, queue):
    result = # do stuff
    queue.put(result)
for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2, outq))
Also, as currently written, you would still be pulling all that information into memory (the queue) and waiting for the reads to finish before writing to the output file. You need to move the proc.join loop to after the loop that reads from the queue. And instead of putting all the information in the queue at the end of func, func should fill the queue in chunks over time; otherwise it's the same as just reading everything into memory.
You also don't need a lock unless you are using it in the worker function func, and if you do, you will again want to pass it through.
If you don't want to read or store a lot in memory, I would write the output at the same time as I iterate through the input files. Here is a basic example that combines each line of the two files.
with open("infile_1.txt") as infile1, open("infile_2.txt") as infile2, open("out", "w") as outfile:
    for line1, line2 in zip(infile1, infile2):
        outfile.write(line1 + line2)
I don't want to write too much about all of these; I'm just trying to give you ideas. Let me know if you want more detail about something. Hope it helps!

Multiprocessing queue to get the data to handle for process

I have a list of filenames of archives that need to be extracted, and I have a function which extracts these files. Since it is a mostly CPU-bound task, it would be nice to spread it across multiple processes to utilize multiple CPUs.
Right now my code looks like this:
import multiprocessing
from multiprocessing import Process
def unpack(files):
    for f in files:
        Archive(f).extractall('\\path\\to\\destination\\')
n_cpu = multiprocessing.cpu_count()
chunks = split(cabs_to_unpack, n_cpu) # just splits array into n equal chunks
for i in range(n_cpu):
    p = Process(target=unpack, args=(chunks[i],))
    p.start()
p.join()
But the files to handle vary a lot in size. Some files are 1 KB, most are about 300 KB, and a few are about 1.5 GB.
So my approach doesn't work perfectly: 5 processes handle their portion of the files very quickly and exit, while the other three work hard on some large files and a bunch of small ones. It would be nice to make the fast processes not exit, but handle those small files too.
It looks like it would be nice to use some Queue holding the list of files here, one that works correctly with multiple processes. My unpack function would then look like this:
def unpack(queue):
    while queue.not_empty():
        f = queue.get()
        Archive(f).extractall('\\path\\to\\destination\\')
But I can't find such a Queue in the multiprocessing module. The only one, multiprocessing.Queue, doesn't take a list of objects to initialize, and it looks like it should be used as a container that processes push data into, not as a container to get data from.
So my question is simple and maybe stupid (I'm new to multiprocessing): which object/class should I use as a container for the data to handle?
I'd recommend a multiprocessing.Pool.
from multiprocessing import Pool
def unpack(file_path):
    Archive(file_path).extractall('\\path\\to\\destination\\')
pool = Pool()
pool.map(unpack, list_of_files)
It already deals with chunk size, re-use of the worker processes and process handling logic.

Python - weird behavior with multiprocessing - join does not execute

I am using the multiprocessing python module. I have about 20-25 tasks to run simultaneously. Each task will create a pandas.DataFrame object of ~20k rows. Problem is, all tasks execute well, but when it comes to "joining" the processes, it just stops. I've tried with "small" DataFrames and it works very well. To illustrate my point, I created the code below.
import pandas
import multiprocessing as mp
def task(arg, queue):
    DF = pandas.DataFrame({"hello":range(10)}) # try range(1000) or range(10000)
    queue.put(DF)
    print("DF %d stored" %arg)
listArgs = range(20)
queue = mp.Queue()
processes = [mp.Process(target=task,args=(arg,queue)) for arg in listArgs]
for p in processes:
    p.start()
for i,p in enumerate(processes):
    print("joining %d" %i)
    p.join()
results = [queue.get() for p in processes]
EDIT:
With DF = pandas.DataFrame({"hello":range(10)}) I have everything correct: "DF 0 stored" up to "DF 19 stored", same with "joining 0" to "joining 19".
However with DF = pandas.DataFrame({"hello":range(1000)}) the issue arises: while it is storing the DF, the joining step stops after "joining 3".
Thanks for the useful tips :)
This problem is explained in the docs, under Pipes and Queues:
Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
Using a manager would work, but there are a lot of easier ways to solve this:
1. Read the data off the queue first, then join the processes, instead of the other way around.
2. Manage the Queue manually (e.g., using a JoinableQueue and task_done).
3. Just use Pool.map instead of reinventing the wheel. (Yes, much of what Pool does isn't necessary for your use case, but it also isn't going to get in the way, and the nice thing is, you already know it works.)
I won't show the implementation for #1 because it's so trivial, or for #2 because it's such a pain, but for #3:
def task(arg):
    DF = pandas.DataFrame({"hello":range(1000)}) # try range(1000) or range(10000)
    return DF
with mp.Pool(processes=20) as p:
    results = p.map(task, range(20), chunksize=1)
(In 2.7, Pool may not work in a with statement; you can install the port of the later version of multiprocessing back to 2.7 off PyPI, or you can just manually create the pool, then close it in a try/finally, just as you would handle a file if it didn't work in a with statement...)
You may ask yourself, why exactly does it fail at this point, but work with smaller numbers—even just a little bit smaller?
A pickle of that DataFrame is just over 16K. (The list by itself is a little smaller, but if you try it with 10000 instead of 1000 you should see the same thing without Pandas.)
So, the first child writes 16K, then blocks until there's room to write the last few hundred bytes. But you're not pulling anything off the pipe (by calling queue.get) until after the join, and you can't join until they exit, which they can't do until you unblock the pipe, so it's a classic deadlock. There's enough room for the first 4 to get through, but no room for 5. Because you have 4 cores, most of the time, the first 4 that get through will be the first 4. But occasionally #4 will beat #3 or something, and then you'll fail to join #3. That would happen more often with an 8-core machine.

Multiprocessing queue - Why does the memory consumption increase?

The following script generates 100 random dictionaries of size 100000, feeds each (key, value) tuple into a queue, while one separate process reads from the queue:
import multiprocessing as mp
import numpy.random as nr
def get_random_dict(_dummy):
    return dict((k, v) for k, v in enumerate(nr.randint(pow(10, 9), pow(10, 10), pow(10, 5))))
def consumer(q):
    for (k, v) in iter(q.get, 'STOP'):
        pass
q = mp.Queue()
p = mp.Process(target=consumer, args=(q,))
p.start()
for d in mp.Pool(1).imap_unordered(get_random_dict, xrange(100)):
    for k, v in d.iteritems():
        q.put((k, v))
q.put('STOP')
p.join()
I was expecting the memory usage to be constant because the consumer process pulls data from the queue as the main process feeds it. I verified that data doesn't accumulate in the queue.
However, I monitored the memory consumption and it keeps increasing as the script runs. If I replace imap_unordered by for _ in xrange(100): d = get_random_dict(), then the memory consumption is constant. What is the explanation?
Pool.imap is not literally identical to imap. It is the same in that it can be used like imap and that it returns an iterator. However, the implementation is entirely different. The backing pool will work as hard as it can to complete all the jobs given to it as quickly as possible, regardless of how quickly the iterator is being consumed. If you only wanted a job to be processed when requested, there would be no point in using multiprocessing; you might as well just use itertools.imap and be done with it.
Your memory consumption is therefore increasing because the pool is creating dictionaries faster than your consumer process is consuming them. This is because the way a pool retrieves results from a worker process is uni-directional (one process writes and one process reads), so no explicit synchronisation mechanism is needed. A Queue, on the other hand, is bidirectional: both processes can read from and write to the queue. This means there needs to be explicit synchronisation between the processes using a queue, to make sure they aren't competing to add the next item to the queue or remove an item from it (thus leaving the queue in an inconsistent state).
I think that the main problem is using multiprocessing.Pool to collect the dictionaries created in one process (Pool process), and then put them in the queue in main process. I think (I may be wrong) that Pool creates some queues of its own, and those are probably the ones in which the data accumulates.
You can see that clearly if you put some debugging prints like this:
...
def get_random_dict(_dummy):
    print 'generating dict'
    ...
...
for d in mp.Pool(1).imap_unordered(get_random_dict, xrange(100)):
    print 'next d'
    ...
You'll then see something like this:
generating dict
generating dict
next d
generating dict
generating dict
generating dict
generating dict
generating dict
next d
...
This clearly shows that the generated dicts are accumulating somewhere (probably in the inner tubing of Pool).
I think a much better solution would be to put the data from get_random_dict directly into the queue and abandon the *map functions from Pool.
