Create a set with multiprocessing - python

I have a big list of items and some auxiliary data. For each item in the list and each element in the data, I compute a thing and add all the things to an output set (there may be many duplicates). In code:
def process_list(myList, data):
    ret = set()
    for item in myList:
        for foo in data:
            thing = compute(item, foo)
            ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = process_list(myList, data)
Because myList is big and compute(item, foo) is costly, I need to use multiprocessing. For now this is what I have:
from multiprocessing import Pool

def initialize_worker(bar):
    global data
    data = bar

def process_item(item):
    ret = set()
    for foo in data:
        thing = compute(item, foo)
        ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initargs=(data))
    ret = p.map(process_item, myList)
    what_I_Want = set().union(*ret)
What I do not like about that is that ret can be big. I am thinking about 3 options:
1) Chop myList into chunks and pass them to the workers, who will use process_list on each chunk (hence some duplicates will be removed at that step), and then union all the resulting sets to remove the remaining duplicates.
question: Is there an elegant way of doing that? Can we specify to Pool.map that it should pass the chunks to the workers instead of each item in the chunks? I know I could chop the list by myself, but this is damn ugly.
2) Have a shared set between all processes.
question: Why doesn't multiprocessing.Manager feature a set()? (I know it has dict(), but still...) If I use a manager.dict(), won't the communication between the processes and the manager slow things down considerably?
3) Have a shared multiprocessing.Queue(). Each worker puts the things it computes into the queue. Another worker does the unioning until some stopItem is found (which we put in the queue after the p.map)
question: Is this a stupid idea? Is communication between processes and a multiprocessing.Queue faster than with, say, a manager.dict()? Also, how could I get back the set computed by the worker doing the unioning?
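To make option 3 concrete, here is a rough, untested sketch of what I have in mind (using None as the stopItem; I'm not sure sending the finished set back over a second queue, as sketched here, is the right way to get it back):

from multiprocessing import Process, Queue

STOP = None  # sentinel marking the end of the stream of things

def unioning_worker(in_q, out_q):
    ret = set()
    for thing in iter(in_q.get, STOP):   # keep unioning until the sentinel arrives
        ret.add(thing)
    out_q.put(ret)                       # ship the finished set back to the parent

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    c = Process(target=unioning_worker, args=(in_q, out_q))
    c.start()
    # ... the other workers call in_q.put(thing) for every thing they compute ...
    in_q.put(STOP)              # after all the other workers have finished
    what_I_Want = out_q.get()
    c.join()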

A minor thing: initargs takes a tuple, so it should be initargs=(data,).
If you want to avoid creating all the results before reducing them into a set, you can use Pool.imap_unordered() with some chunk size. That will yield results from each worker, chunk by chunk, as they become available.
If you want to change process_item to accept chunks directly, you have to do it manually. toolz.partition_all could be used to partition the initial dataset.
Finally, the managed data structures are bound to have much higher synchronization overhead. I'd avoid them as much as possible.
Go with imap_unordered and see if that's good enough; if not, then partition; if you cannot avoid producing a lot of duplicates overall, use a managed dict.
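A hedged sketch of those two suggestions combined, reusing compute, create_data, create_list and nb_proc from the question (the chunk length of 256 is just a guess to tune; toolz is a third-party package):

from multiprocessing import Pool
from toolz import partition_all

def initialize_worker(bar):
    global data
    data = bar

def process_chunk(chunk):
    # each worker call deduplicates its own chunk before returning it
    return {compute(item, foo) for item in chunk for foo in data}

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initargs=(data,))
    what_I_Want = set()
    # imap_unordered hands back each chunk's set as soon as it is ready,
    # so we union incrementally instead of keeping every partial result around
    for partial in p.imap_unordered(process_chunk, partition_all(256, myList)):
        what_I_Want |= partial
    p.close()
    p.join()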

Related

Unable To Display Result Array In Python Multiprocessing

The result array is displayed as empty after trying to append values to it.
I have even declared the result as global inside the function.
Any suggestions?
try this
import multiprocessing

res = []
inputData = [a, b, c, d]

def function(data):
    values = [some_Number_1, some_Number_2]
    return values

def parallel_run(function, inputData):
    cpu_no = 4
    if len(inputData) < cpu_no:
        cpu_no = len(inputData)
    p = multiprocessing.Pool(cpu_no)
    global resultsAr
    resultsAr = p.map(function, inputData, chunksize=1)
    p.close()
    p.join()
    print('res = ', res)
This happens since you're misunderstanding the basic point of multiprocessing: the child process spawned by multiprocessing.Process is separate from the parent process, and thus any modifications to data (including global variables) in the child process(es) are not propagated into the parent.
You will need to use multiprocessing-specific data types (queues and pipes), or the higher-level APIs provided by e.g. multiprocessing.Pool, to get data out of the child process(es).
For your application, the high-level recipe would be
import multiprocessing

def square(v):
    return v * v

def main():
    arr = [1, 2, 3, 4, 5]
    with multiprocessing.Pool() as p:
        squared = p.map(square, arr)
    print(squared)

if __name__ == "__main__":
    main()
– however you'll likely find that this is massively slower than not using multiprocessing due to the overheads involved in such a small task.
Welcome to StackOverflow, Suyash!
The problem is that multiprocessing.Process is, as its name says, a separate process. You can imagine it almost as if you're running your script again from the terminal, with very little connection to the mother script.
Therefore, it has its own copy of the result array, which it modifies and prints.
The result in the "main" process is unmodified.
To convince yourself of this, try to print id(res) in both __main__ and in square(). You'll see they are different.
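A minimal sketch (my own, using the standard multiprocessing module) that makes the separation visible: the child appends to its own copy of res, while the parent's list stays empty.

import multiprocessing
import os

res = []

def worker():
    res.append(1)  # modifies the child's copy only
    print("child pid", os.getpid(), "sees res =", res)

if __name__ == "__main__":
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
    print("parent pid", os.getpid(), "sees res =", res)  # still []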

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and then again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item? Once parallel_request is filled, execute the parallel function, and then resume the loop for each item, restoring the previously saved context (local variables).
EDIT: I think one solution would be to use a function instead of a loop, and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset, last=False):
    global index
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset))
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect that to achieve this you'd need some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(data, parallel_requests, final_output):
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    # (assumes process_requests returns a mapping from each request to its result)
    final_data = merge(subdata, new_data[subdata], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
funcs = [process_data(datum, parallel_requests, final_output) for datum in dataset]
[next(f) for f in funcs]
new_data = process_requests(parallel_requests)
[next(f, None) for f in funcs]  # resume each generator past its yield
The output list and generator calls are general enough that you can abstract these lines away in a helper function that sets things up and calls the generators for you, leading to a very clean result: the code overhead is one line for the function definition and one line to call the helper.
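A minimal sketch of such a helper (my own illustration with a hypothetical name; it sends the parallel results back into each generator, so process_data would write new_data = yield instead of a bare yield):

def run_serial_parallel_serial(dataset, make_generator, process_requests):
    parallel_requests, final_output = [], []
    gens = [make_generator(datum, parallel_requests, final_output) for datum in dataset]
    for g in gens:
        next(g)                  # run each item up to its yield, collecting requests
    new_data = process_requests(parallel_requests)
    for g in gens:
        try:
            g.send(new_data)     # resume after the yield with the parallel results
        except StopIteration:
            pass
    return final_output

# usage: final_output = run_serial_parallel_serial(dataset, process_data, process_requests)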

Optimize parsing of GB sized files in parallel

I have several compressed files with sizes on the order of 2GB compressed. The beginning of each file has a set of headers which I parse and extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]) for 0 <= i < len(pointers), I
seek to pointers[i]
read pointers[i+1] - pointers[i] bytes
decompress it
do a single pass operation on that data and update a dictionary with what I find.
The issue is, I can only process roughly 30 pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming splitting up the pointers list among multiple processes doesn't hurt performance (due to each process looking at the same file, though different non-overlapping parts), how can I use multiprocessing to speed this up?
My single threaded operation looks like this:
from StringIO import StringIO  # Python 2

def search_clusters(pointers, filepath, automaton, counter):
    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        fh.seek(first_pointer)
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)
        # skipping details, ultimately the counter dict is
        # modified passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)
    return counter

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()
search_clusters(pointers, infile, automaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip
import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

def chunked(pointers, chunksize=1024):
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i + chunksize + 1])

def search_wrapper(args):
    return search_clusters(*args)

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serial version of map, without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?

Fill up a dictionary in parallel with multiprocessing

Yesterday I asked a question: Reading data in parallel with multiprocess
I got very good answers, and I implemented the solution mentioned in the answer I marked as correct.
import os
import multiprocessing
import pandas as pd

def read_energies(motif):
    os.chdir("blabla/working_directory")
    complx_ener = pd.DataFrame()
    # complex function to fill that dataframe
    lig_ener = pd.DataFrame()
    # complex function to fill that dataframe
    return motif, complx_ener, lig_ener

COMPLEX_ENERGIS = {}
LIGAND_ENERGIES = {}
p = multiprocessing.Pool(processes=CPU)
for x in p.imap_unordered(read_energies, peptide_kd.keys()):
    COMPLEX_ENERGIS[x[0]] = x[1]
    LIGAND_ENERGIES[x[0]] = x[2]
However, this solution takes the same amount of time as if I just iterated over peptide_kd.keys() and filled up the DataFrames one by one. Why is that so? Is there a way to fill up the desired dicts in parallel and actually get a speed increase? I am running it on a 48-core HPC.
You are incurring a good amount of overhead in (1) starting up each process, and (2) having to copy the pandas.DataFrame (etc.) across several processes. If you just need to have a dict filled in parallel, I'd suggest using a shared dict (via a Manager). If no key will be overwritten, then it's easy and you don't have to worry about locks.
(Note I'm using multiprocess below, which is a fork of multiprocessing -- but only so I can demonstrate from the interpreter, otherwise, you'd have to do the below from __main__).
>>> from multiprocess import Process, Manager
>>>
>>> def f(d, x):
...     d[x] = x**2
...
>>> manager = Manager()
>>> d = manager.dict()
>>> job = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in job]
>>> _ = [p.join() for p in job]
>>> print d
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
This solution doesn't make copies of the dict to share across processes, so that part of the overhead is reduced. For large objects like a pandas.DataFrame, it can be significant compared to the cost of a simple operation like x**2. Similarly, spawning a Process can take time, and you may be able to do the above even faster (for lightweight objects) by using threads (i.e. multiprocess.dummy instead of multiprocess, for either your originally posted solution or mine above).
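For instance, a minimal sketch of the thread-backed variant applied to the loop from the question (here via the standard library's multiprocessing.dummy; CPU, read_energies and peptide_kd are the names from the question):

from multiprocessing.dummy import Pool  # thread pool exposing the same Pool API

pool = Pool(processes=CPU)
COMPLEX_ENERGIS, LIGAND_ENERGIES = {}, {}
for motif, complx_ener, lig_ener in pool.imap_unordered(read_energies, peptide_kd.keys()):
    COMPLEX_ENERGIS[motif] = complx_ener
    LIGAND_ENERGIES[motif] = lig_ener
pool.close()
pool.join()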
If you do need to share DataFrames (as your code suggests, rather than what the question asks), you might be able to do it by creating a shared-memory numpy.ndarray.
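A rough sketch of that idea (my own, using the standard multiprocessing module): back a numpy array with a shared multiprocessing.Array so workers can fill rows in place without copying.

import numpy as np
from multiprocessing import Process, Array

def fill_row(shared, shape, i):
    arr = np.frombuffer(shared.get_obj()).reshape(shape)  # re-wrap the shared buffer
    arr[i, :] = i  # stand-in for the real per-motif computation

if __name__ == "__main__":
    shape = (4, 3)
    shared = Array('d', shape[0] * shape[1])  # doubles, zero-initialized, in shared memory
    jobs = [Process(target=fill_row, args=(shared, shape, i)) for i in range(shape[0])]
    [p.start() for p in jobs]
    [p.join() for p in jobs]
    print(np.frombuffer(shared.get_obj()).reshape(shape))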

How to stream results from Multiprocessing.Pool to csv?

I have a python process (2.7) that takes a key, does a bunch of calculations and returns a list of results. Here is a very simplified version.
I am using multiprocessing to create threads so this can be processed faster. However, my production data has several million rows and each loop takes progressively longer to complete. The last time I ran this, each loop took over 6 minutes to complete, while at the start it took a second or less. I think this is because all the threads are adding results into resultset, which keeps growing until it contains all the records.
Is it possible to use multiprocessing to stream the results of each thread (a list) into a csv or batch resultset so it writes to the csv after a set number of rows?
Any other suggestions for speeding up or optimizing the approach would be appreciated.
import numpy as np
import pandas as pd
import csv
import os
import multiprocessing
from multiprocessing import Pool

global keys
keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, (key for key in keys))
        loaddata = []
        for sublist in resultset:
            loaddata.append(sublist)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in loaddata:
                writer.writerow(listitem)
        file.close
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise
Here is an answer consolidating the suggestions Eevee and I made
import numpy as np
import pandas as pd
import csv
from multiprocessing import Pool

keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, keys, chunksize=200)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in resultset:
                writer.writerow(listitem)
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise
Again, the changes here are
Iterate over resultset directly, rather than needlessly copying it to a list first.
Feed the keys list directly to pool.imap instead of creating a generator comprehension out of it.
Providing a larger chunksize to imap than the default of 1. The larger chunksize reduces the cost of the inter-process communication required to pass the values inside keys to the sub-processes in your pool, which can give big performance boosts when keys is very large (as it is in your case). You should experiment with different values for chunksize (try something considerably larger than 200, like 5000, etc.) and see how it affects performance. I'm making a wild guess with 200, though it should definitely do better than 1.
The following very simple code collects many worker's data into a single CSV file. A worker takes a key and returns a list of rows. The parent processes several keys at a time, using several workers. When each key is done, the parent writes output rows, in order, to a CSV file.
Be careful about order. If each worker writes to the CSV file directly, the rows will be out of order or the workers will stomp on each other. Having each worker write to its own CSV file will be fast, but will require merging all the data files together afterward.
source
import csv, multiprocessing, sys

def worker(key):
    return [ [key, 0], [key+1, 1] ]

pool = multiprocessing.Pool()   # default 1 proc per CPU
writer = csv.writer(sys.stdout)

for resultset in pool.imap(worker, [1,2,3,4]):
    for row in resultset:
        writer.writerow(row)
output
1,0
2,1
2,0
3,1
3,0
4,1
4,0
5,1
My bet would be that accumulating the whole large structure by appending is what makes it slow. What I usually do is open as many files as there are cores and use modulo to write to each file immediately; that way the streams don't interfere (as they would if you directed them all into the same file) and you never hold the huge data in memory. Probably not the best solution, but really quite easy. In the end you just merge the results back together.
Define at start of the run:
num_cores = 8
file_sep = ","
outFiles = [open('out' + str(x) + ".csv", "a") for x in range(num_cores)]
Then in the key_loop function:
def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    outFiles[key % num_cores].write(file_sep.join([str(x) for x in test_list])
                                    + "\n")
Afterwards, don't forget to close: [x.close() for x in outFiles]
Improvements:
Iterate over blocks, as mentioned in the comments. Writing/processing one line at a time is going to be much slower than writing blocks (see the sketch after this list).
Handling errors (closing of files)
IMPORTANT: I'm not sure of the meaning of the "keys" variable, but the numbers there will not allow modulo to give each process its own individual stream (with 12 keys and modulo 8, some files will receive writes for two keys)
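A minimal sketch of the block-writing idea (my own, not from the answer; the batch size of 1000 is just a guess to tune):

import csv
from multiprocessing import Pool

def key_loop(key):
    return [key, key * 2, key * 3]  # stand-in for the real per-key result row

if __name__ == "__main__":
    pool = Pool(processes=8)
    batch, batch_size = [], 1000
    with open("mp_streaming_test.csv", "w") as fh:
        writer = csv.writer(fh)
        for row in pool.imap(key_loop, range(100000), chunksize=200):
            batch.append(row)
            if len(batch) >= batch_size:
                writer.writerows(batch)   # one writerows call per block of rows
                batch = []
        if batch:
            writer.writerows(batch)       # flush the final partial block
    pool.close()
    pool.join()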
