Multiprocessing on a list being passed to a function - python

I have a function that processes one url at a time:
def sanity(url):
    try:
        if 'media' in url[:10]:
            url = "http://dummy.s3.amazonaws.com" + url
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        ret = urllib.request.urlopen(req)
        allurls.append(url)
        return 1
    except (urllib.request.HTTPError, urllib.request.URLError, http.client.HTTPException, ValueError) as e:
        print(e, url)
        allurls.append(url)
        errors.append(url)
        return 0
In the main function, I have a list of URLs that need to be processed by the above function. I tried the following, but it doesn't work:
start = 0
allurls = []
errors = []
# arr = [0, 100, 200, ...]
for i in arr:
    p = Process(target=sanity, args=(urls[start:i],))
    p.start()
    p.join()
The above code is supposed to process the URLs in batches of 100, but it doesn't work. I know it isn't working because I write the lists allurls and errors to two different files, and they are empty when they should not be. I don't understand this behavior.

If I understand you correctly, you want to process chunks of a list at a time, but process those chunks in parallel? Secondly, you want to store the answers in a global variable. The problem is that processes are not threads, so sharing memory between them is much more involved (a sketch of one such approach is shown right below).
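For completeness, here is a minimal sketch of that shared-memory route, assuming you keep the original sanity logic; it uses multiprocessing.Manager lists, whose proxies forward appends back to the parent process, and it starts one process per URL only to keep the sketch short (the URL list is hypothetical):

import http.client
import urllib.error
import urllib.request
from multiprocessing import Manager, Process

def sanity(url, allurls, errors):
    # Same logic as the original function, but the shared lists are
    # passed in explicitly instead of being module-level globals.
    try:
        if 'media' in url[:10]:
            url = "http://dummy.s3.amazonaws.com" + url
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        urllib.request.urlopen(req)
        allurls.append(url)
        return 1
    except (urllib.error.HTTPError, urllib.error.URLError,
            http.client.HTTPException, ValueError) as e:
        print(e, url)
        allurls.append(url)
        errors.append(url)
        return 0

if __name__ == '__main__':
    urls = ["http://example.com", "/media/foo.png"]  # hypothetical input
    with Manager() as manager:
        allurls = manager.list()   # proxies that child processes can append to
        errors = manager.list()
        procs = [Process(target=sanity, args=(u, allurls, errors)) for u in urls]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(list(allurls), list(errors))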
The simpler alternative is to have each worker return its answers, and the code below helps you do just that. First you convert your list into a list of lists, each inner list containing the data you want to process in that chunk. You then pass that list of lists to a function that processes each chunk. The output for each chunk is a list of answers and a list of errors (I'd recommend converting this to a dict so you can keep track of which input threw an error). After the processes return, you can untangle the list of lists to create your list of answers and list of errors.
Here is the code that would achieve the above:
from multiprocessing import Pool

def f(x):
    try:
        return [x * x, None]  # [result, no error]
    except Exception as e:
        return [None, e]      # [no result, error]

def chunk_f(x):
    output = []
    errors = []
    for xi in x:
        ans, err = f(xi)
        if err is None:       # check the error slot so a result of 0 is kept too
            output.append(ans)
        else:
            errors.append(err)
    return [output, errors]

n = 10                    # chunk size
data = list(range(95))    # test data
data.extend(['a', 'b'])   # two inputs that will raise an error
l = [data[k * n:(k + 1) * n] for k in range(int(len(data) / n + 1))]

p = Pool(8)
d = p.map(chunk_f, l)

new_data = []
all_errors = []
for da, de in d:
    new_data.extend(da)
    all_errors.extend(de)

print(new_data)
print(all_errors)
You can also look at this Stack Overflow answer for different methods of chunking your data.
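As one example of such a chunking method, here is a small generator-based helper (a sketch, not taken from the linked answer) that avoids the int(len(data) / n + 1) arithmetic above:

def chunks(seq, size):
    # Yield successive slices of `seq` that are at most `size` items long.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

l = list(chunks(data, n))   # equivalent to the list of lists built above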

Related

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return value is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and then again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item? Once parallel_request is filled, the parallel function would execute, and then the loop would resume for each item again, restoring the previously saved context (local variables).
EDIT: I think one solution would be to use a function instead of a loop and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0
new_data = None

def process_data(dataset, last=False):
    global index, new_data
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    my_index = index   # remember this item's position before advancing
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset) - 1)
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[my_index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset, last=(len(original_dataset) == 1))
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect to achieve this you'd need some specific Python facility that enables this.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(index, data, parallel_requests, final_output):
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
funcs = [process_data(i, datum, parallel_requests, final_output)
         for i, datum in enumerate(dataset)]
[next(f) for f in funcs]           # run every item up to its yield
new_data = process_requests(parallel_requests)   # the generators read this global when resumed
[next(f, None) for f in funcs]     # resume every item; the default swallows StopIteration
The output list and generator calls are general enough that you can abstract these lines away into a helper function that sets up and calls the generators for you, leading to a very clean result: the code overhead is one line for the function definition and one line to call the helper.
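That helper is not spelled out above; a minimal sketch of one, assuming the per-item generator is adjusted to take the shared results list as a parameter instead of reading a global, might look like this:

def serial_parallel_serial(items, item_gen, parallel_function):
    # item_gen(index, item, requests, results, output) must be a generator
    # that appends its request, yields once, and then reads results[index].
    requests, results, output = [], [], []
    gens = [item_gen(i, item, requests, results, output)
            for i, item in enumerate(items)]
    for g in gens:
        next(g)                                  # phase 1: build the requests serially
    results.extend(parallel_function(requests))  # phase 2: run the parallel step
    for g in gens:
        next(g, None)                            # phase 3: consume the results serially
    return output

final_output = serial_parallel_serial(dataset, process_data, process_requests)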

Python multiprocessing - Return a dict

I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain it in the final result. I have tried:
def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    y = {ht: keys}
    return y
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
token = pd.read_csv('/path', sep=",", header = None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])
result = pd.DataFrame(index = hashtag['hashtag'], columns = range(0, 21))
result = result.fillna(0)
final_result = []
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
The toParallel function should return a dict with the hashtag as the key and a list of keys (where the keys are ints) as the value. But if I try to print final_result, I obtain only:
<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>
How can I do it?
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
You can either use Pool.apply() and get the result right away (in which case you do not really need multiprocessing, the function is just there for completeness), or use Pool.apply_async() followed by ApplyResult.get(). Pool.apply_async() is asynchronous.
Something like this:
workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
Alternatively, you can also use Pool.map() which will do all this for you.
Either way, I recommend you read the documentation carefully.
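As a rough sketch of that Pool.map route, assuming toParallel keeps its (ht, token) signature and returns the {ht: keys} dict as described, it could look like this:

from functools import partial
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()                    # defaults to cpu_count() workers
    # fix the `token` argument so each call only needs the hashtag
    work = partial(toParallel, token=token)
    dicts = pool.map(work, hashtag['hashtag'])   # one {ht: keys} dict per hashtag
    pool.close()
    pool.join()
    # merge the per-hashtag dicts into one result dict
    final_result = {ht: keys for d in dicts for ht, keys in d.items()}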
Addendum: When answering this question I presumed the OP is using some Unix operating system like Linux or OSX. If you are using Windows, you must not forget to safeguard your parent/worker processes using if __name__ == '__main__'. This is because Windows lacks fork() and so the child process starts at the beginning of the file, and not at the point of forking like in Unix, so you must use an if condition to guide it. See here.
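A minimal sketch of that guard, so the pool is only created in the parent process:

import multiprocessing

def work(x):
    return x * x

if __name__ == '__main__':           # required on Windows (spawn start method)
    with multiprocessing.Pool() as pool:
        print(pool.map(work, range(10)))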
ps: this is unnecessary:
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
If you call multiprocessing.Pool() without arguments (or with None), it already creates a pool with as many workers as your CPU count.

Ordering data from returned pool.apply_async

I am currently writing a steganography program. I have the majority of what I want working. However, I want to rebuild my message using multiple processes, which obviously means the bits returned from the processes need to be ordered. So currently I have:
OK, I'm home now; I will put some actual code up.
def message_unhide(data):
    inp = cv.LoadImage(data[0])  # data[0] = path to image
    steg = LSBSteg(inp)
    bin = steg.unhideBin()
    return bin

# code in main program underneath
count = 0
f = open(files[2], "wb")  # files[2] = name of file to rebuild
fat = open("fat.txt", 'w+')
inp = cv.LoadImage(files[0][count])  # files[0] = directory path of images
steg = LSBSteg(inp)
bin = steg.unhideBin()
fat.write(bin)
fat.close()

fat = open("fat.txt", 'rb')
num_files = fat.read()  # amount of images the message is hidden across
fat.close()
count += 1

pool = Pool(5)
binary = []

''' Just something I was testing
for x in range(int(num_files)):
    binary.append(0)
print(binary)
'''

while count <= int(num_files):
    data = [files[0][count], count]
    #f.write(pool.apply(message_unhide, args=(data, )))
    #binary[count - 1] = [pool.apply_async(message_unhide, (data, ))]
    # again, just another few ways I was trying to overcome this
    binary = [pool.apply_async(message_unhide, (data, ))]
    count += 1

pool.close()
pool.join()

bits = [b.get() for b in binary]
print(binary)
#for b in bits:
#    f.write(b)
f.close()
This method just overwrites binary
binary = [pool.apply_async(message_unhide, (data, ))]
This method fills the entire binary list, but I lose the .get():
binary[count - 1] = [pool.apply_async(message_unhide, (data, ))]
Sorry for the sloppy coding, I am certainly no expert.
Your main issue has to do with overwriting binary in the loop. You only have one item in the list because you're throwing away the previous list and recreating it each time. Instead, you should use append to modify the existing list:
binary.append(pool.apply_async(message_unhide, (data, )))
But you might have a much nicer time if you use pool.map instead of rolling your own version. It expects an iterable yielding a single argument to pass to the function on each iteration, and it returns a list of the return values. The map call blocks until all the values are ready, so you don't need any other synchronization logic.
Here's an implementation using a generator expression to build the data argument items on the fly. You could simplify things and just pass files[0] to map if you rewrote message_unhide to accept the filename as its argument directly, without indexing a list (you never use the index, it seems):
# no loop this time
binary = pool.map(message_unhide, ([file, i] for i, file in enumerate(files[0])))
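For example, that simplified version (a sketch, assuming files[0] contains only the hidden-message images and reusing the pool from above) could look like this:

def message_unhide(path):
    # take the image path directly instead of a [path, count] pair
    inp = cv.LoadImage(path)
    steg = LSBSteg(inp)
    return steg.unhideBin()

# results come back in the same order as files[0], so the message stays ordered
binary = pool.map(message_unhide, files[0])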

Merging lists obtained by a loop

I've only started Python recently, but am stuck on a problem.
# function that tells how to read the urls and how to process the data the
# way I need it.
def htmlreader(i):
    # makes variable websites because it is used in a loop.
    pricedata = urllib2.urlopen(
        "http://website.com/" + (",".join(priceids.split(",")[i:i + 200]))).read()
    # here my information processing begins but that is fine.
    pricewebstring = pricedata.split("},{")
    # results in [[1234,2345,3456],[3456,4567,5678]] for example.
    array1 = [re.findall(r"\d+", a) for a in pricewebstring]
    # writes obtained array to my text file
    itemtxt2.write(str(array1) + '\n')
i = 0
while i <= totalitemnumber:
    htmlreader(i)
    i = i + 200
See the comments in the script as well.
This runs in a loop and gives me an array (array1) each time.
Because I write this to a txt file, the file ends up containing separate arrays.
I need one big array, so the results of htmlreader(i) need to be merged.
So my output is something like:
[[1234,2345,3456],[3456,4567,5678]]
[[6789,4567,2345],[3565,1234,2345]]
But I want:
[[1234,2345,3456],[3456,4567,5678],[6789,4567,2345],[3565,1234,2345]]
Any ideas how I can approach this?
Since you want to gather all the elements in a single list, you can simply collect them in a shared result list, extending it like this:
def htmlreader(i, result):
    ...
    result.extend([re.findall(r"\d+", a) for a in pricewebstring])

i, result = 0, []
while i <= totalitemnumber:
    htmlreader(i, result)
    i = i + 200

itemtxt2.write(str(result) + '\n')
In this case, the result created by re.findall (a list) is added to the result list. Finally, you are writing the entire list as a whole to the file.
If the method shown above is confusing, then change it like this:
def htmlreader(i):
    ...
    return [re.findall(r"\d+", a) for a in pricewebstring]

i, result = 0, []
while i <= totalitemnumber:
    result.extend(htmlreader(i))
    i = i + 200
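As a side note (not part of the original answer), if you prefer to collect the per-call lists first, you can flatten them at the end with itertools.chain, assuming the second htmlreader version above:

import itertools

# same i-values as the while loop: 0, 200, 400, ... up to totalitemnumber
chunks = [htmlreader(i) for i in range(0, totalitemnumber + 1, 200)]
result = list(itertools.chain.from_iterable(chunks))
itemtxt2.write(str(result) + '\n')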

Python 3 multiprocessing: internal and timeout error handling and callbacks

I am trying to use Pool.starmap_async to run some code that takes multiple parameters as inputs, in order to quickly sweep through a parameter space. The code runs a linalg function that sometimes does not converge, and instead throws a np.linalg.LinAlgError. In this case I'd like my code to return np.nan, and carry on its merry way. I would also, ideally, like to specify a timeout so that the code gives up after a set number of seconds and continues on to a different parameter combination.
# imports needed for this snippet
import itertools
import pickle
import multiprocessing as mp
import numpy as np
from multiprocessing import Pool

# This is actually some long function that sometimes raises a linalg error
def run_solver(A, B):
    return A + B

if __name__ == '__main__':
    # Parameters
    Asearch = np.arange(4, 8, 1)
    Bsearch = np.arange(0.2, 2, 0.2)

    # Search all combinations of Asearch and Bsearch
    AB = np.array(list(itertools.product(Asearch, Bsearch)))
    A = AB[:, 0]
    B = AB[:, 1]

    result = {}
    with Pool(processes=15) as pool:
        def cb(r):
            print("callback")
            result[params] = r

        def ec(r):
            result[params] = np.nan
            print("error callback")
            raise np.linalg.LinAlgError

        try:
            params = zip(A, B)
            r = pool.starmap_async(run_solver, params, callback=cb, error_callback=ec)
            print(r.get(timeout=10))
        except np.linalg.LinAlgError:
            print("parameters did not converge")
        except mp.context.TimeoutError:
            print("Timeout error. Continuing...")

    pickle.dump(result, open("result.p", "wb"))
    print("pickling output:", result)
I have tried to catch the TimeoutError as an exception so that the code will continue, and I'm purposefully raising the LinAlgError because I'm trying to pick apart when the code runs out of time vs fails to converge in time -- I realize that's redundant. For one thing, the result dictionary does not end up being how I intended: is there a way to query the current process's parameters and use those as the dictionary keys? Also, if a Timeout error occurs I would ideally flag those parameters in some way -- what's the best way to do this?
Finally, why is the callback only called once in this code? Shouldn't it be called as each process successfully completes? The code returns a dictionary where all of the parameters are crammed into a single key (as a zip object) and all of the answers are in a list as that key's value.
I don't think I'm fully understanding the problem here, but what if you simplified it down to something like this, where you catch the LinAlgError in the calculation function?
Here apply_async is used to get a result object for each task sent to the pool. This allows you to easily apply a timeout to the result objects.
import numpy as np
from multiprocessing import Pool, context

def run_solver(A, B):
    # stand-in for the real solver that can raise LinAlgError
    try:
        result = A + B
    except np.linalg.LinAlgError:
        result = np.nan
    return result

# A and B are the parameter arrays from the question
results = []
with Pool(processes=15) as pool:
    params = zip(A, B)
    result_pool = [pool.apply_async(run_solver, args) for args in params]
    for result in result_pool:
        try:
            # wait at most 15 seconds for each individual task
            results.append(result.get(15))
        except context.TimeoutError:
            # do desired action on timeout
            results.append(None)
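To address the question about using the parameters as dictionary keys: apply_async preserves the order in which tasks are submitted, so you can keep each parameter tuple alongside its result object and build the dict from that. A sketch, reusing the names above, with a sentinel string to flag timed-out combinations:

results = {}
with Pool(processes=15) as pool:
    tasks = [((a, b), pool.apply_async(run_solver, (a, b))) for a, b in zip(A, B)]
    for (a, b), task in tasks:
        try:
            results[(a, b)] = task.get(15)
        except context.TimeoutError:
            results[(a, b)] = 'timeout'   # flag timed-out parameter combinations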
