I have the following function that I would like to run using multiprocessing:
def bruteForcePaths3(paths, availableNodes):
results = []
#start by taking each combination 2 at a time, then 3, etc
for i in range(1,len(availableNodes)+1):
print "combo number: %d" % i
currentCombos = combinations(availableNodes, i)
for combo in currentCombos:
            #get a fresh copy of paths for this combination
currentPaths = list(paths)
currentRemainingPaths = []
# print combo
for node in combo:
#determine better way to remove nodes, for now- if it's in, we remove
currentRemainingPaths = [path for path in currentPaths if not (node in path)]
currentPaths = currentRemainingPaths
#if there are no paths left
if len(currentRemainingPaths) == 0:
#save this combination
print combo
results.append(frozenset(combo))
return results
Based on a few other posts (Combining itertools and multiprocessing?), I tried to multiprocess this as follows:
def grouper_nofill(n, iterable):
it=iter(iterable)
def take():
while 1: yield list(islice(it,n))
return iter(take().next,[])
def mp_bruteForcePaths(paths, availableNodes):
pool = multiprocessing.Pool(4)
chunksize=256
async_results=[]
    def worker(paths, combos, out_q):
        """ The worker function, invoked in a process. 'combos' is a
        chunk of node combinations to test against 'paths'. The results
        are placed in a queue via out_q.
        """
results = bruteForcePaths2(paths, combos)
print results
out_q.put(results)
for i in range(1,len(availableNodes)+1):
currentCombos = combinations(availableNodes, i)
for finput in grouper_nofill(chunksize,currentCombos):
args = (paths, finput)
async_results.extend(pool.map_async(bruteForcePaths2, args).get())
print async_results
def bruteForcePaths2(args):
paths, combos = args
results = []
for combo in combos:
        #get a fresh copy of paths for this combination
currentPaths = list(paths)
currentRemainingPaths = []
# print combo
for node in combo:
#determine better way to remove nodes, for now- if it's in, we remove
            currentRemainingPaths = [path for path in currentPaths if not (node in path)]
currentPaths = currentRemainingPaths
#if there are no paths left
if len(currentRemainingPaths) == 0:
#save this combination
print combo
results.append(frozenset(combo))
return results
I need to be able to pass two arguments to the brute-force function. I'm getting the error:
"too many values to unpack"
So, a three-part question:
How can I multiprocess the brute-force function over nproc CPUs, splitting the combinations iterator?
How can I pass in the two arguments, paths and combinations?
How do I get the result (I think map_async should do that for me)?
Thanks.
This
args = (paths, finput)
pool.map_async(bruteForcePaths2, args)
makes these two calls, which is not your intent:
bruteForcePaths2(paths)
bruteForcePaths2(finput)
You can use apply_async instead to submit single function calls to the pool. Note also that if you call get immediately it will wait for the result, and you don't get any advantage from multiprocessing.
You could do it like this:
for i in range(1,len(availableNodes)+1):
currentCombos = combinations(availableNodes, i)
for finput in grouper_nofill(chunksize,currentCombos):
args = (paths, finput)
async_results.append(pool.apply_async(bruteForcePaths2, args))
results = [x.get() for x in async_results]
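Note that with apply_async the items of args are passed as separate positional arguments, so for this version bruteForcePaths2 would take two parameters rather than a single packed tuple, and since each call returns one list per chunk you probably want to flatten what you collect. A minimal sketch of both adjustments, continuing the snippet above (the condensed filtering body is only illustrative):
from itertools import chain

def bruteForcePaths2(paths, combos):
    # two parameters, matching apply_async(bruteForcePaths2, (paths, finput))
    results = []
    for combo in combos:
        # keep only the paths that contain none of the nodes in this combo
        remaining = [path for path in paths if not any(node in path for node in combo)]
        if len(remaining) == 0:
            results.append(frozenset(combo))
    return results

# after every chunk has been submitted:
chunk_results = [r.get() for r in async_results]         # one list per chunk
all_results = list(chain.from_iterable(chunk_results))   # flat list of frozensets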
How can I implement yield from in my recursion? I am trying to understand how to implement it but failing:
# some data
init_parent = [1020253]
df = pd.DataFrame({'parent': [1020253, 1020253],
'id': [1101941, 1101945]})
# look for parent child
def recur1(df, parents, parentChild=None, step=0):
if len(parents) != 0:
yield parents, parentChild
else:
parents = df.loc[df['parent'].isin(parents)][['id', 'parent']]
parentChild = parents['parent'].to_numpy()
parents = parents['id'].to_numpy()
yield from recur1(df=df, parents=parents, parentChild=parentChild, step=step+1)
# exec / only printing results atm
out = recur1(df, init_parent, step=0)
[x for x in out]
I'd say your biggest issue here is that recur1 isn't always guaranteed to return a generator. For example, suppose your stack calls into the else branch three times before calling into the if branch. In this case, the top three frames would be returning a generator received from the lower frame, but the lowest frame would be returning from this:
yield parents, parentChild
So, then, there is a really simple way you can fix this code to ensure that yield from works. Simply transform your return from a tuple to a generator-compatible type by enclosing it in a list:
yield [(parents, parentChild)]
Then, when you call yield from recur1(df=df, parents=parents, parentChild=parentChild, step=step+1) you'll always be working with something for which yield from makes sense.
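As a quick illustration of the delegation mechanics, detached from the DataFrame logic above, here is a hypothetical countdown generator: because both branches yield values of the same shape, yield from simply forwards every item produced by the inner call:
def countdown(n):
    # base case and recursive case both yield plain integers,
    # so the caller always sees one uniform stream of values
    if n == 0:
        yield 0
    else:
        yield n
        yield from countdown(n - 1)

print(list(countdown(3)))  # [3, 2, 1, 0]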
I am currently taking 6.00.2x from MITx, and there is a line from a search tree algorithm that confuses me; could anyone help, please?
val, taken = maxVal(foods, maxUnits)
This syntax does not make sense to me. maxVal is a function, so presumably foods and maxUnits are inputs. But what are val and taken, and what does this line do? Nowhere in the code are variables instantiated with those names, so I am just not sure what they are (or what this line of syntax means).
PS: The complete code is as follows. The aforementioned syntax occurs on the 3rd line of the function testMaxVal. foods is a list built from 1) food names, 2) values, and 3) calories.
def maxVal(toConsider, avail):
"""Assumes toConsider a list of items, avail a weight
Returns a tuple of the total value of a solution to the
0/1 knapsack problem and the items of that solution"""
if toConsider == [] or avail == 0:
result = (0, ())
elif toConsider[0].getCost() > avail:
#Explore right branch only
result = maxVal(toConsider[1:], avail)
else:
nextItem = toConsider[0]
#Explore left branch
withVal, withToTake = maxVal(toConsider[1:],
avail - nextItem.getCost())
withVal += nextItem.getValue()
#Explore right branch
withoutVal, withoutToTake = maxVal(toConsider[1:], avail)
#Choose better branch
if withVal > withoutVal:
result = (withVal, withToTake + (nextItem,))
else:
result = (withoutVal, withoutToTake)
return result
def testMaxVal(foods, maxUnits, printItems = True):
print('Use search tree to allocate', maxUnits,
'calories')
val, taken = maxVal(foods, maxUnits)
print('Total value of items taken =', val)
if printItems:
for item in taken:
print(' ', item)
testMaxVal(foods, 750)
As you can see, maxVal can return two outputs at the same time, as in the line:
result = (withoutVal, withoutToTake)
Unpacking these two outputs into the two variables val and taken is done by the line:
val, taken = maxVal(foods, maxUnits)
The function maxVal returns a tuple. You can return multiple values from a function in Python in the form of a tuple.
Example:
def connect():
connection = _connect()
message = "Connected"
if not connection:
message = "Not connected"
return connection, message
connection, message = connect()
maxVal returns a pair.
You can "deconstruct" any tuple by assigning its elements to the appropriate number of variables simultaneously.
Example:
>>> a,b,c = (1,2, "hello")
>>> a
1
>>> b
2
>>> c
'hello'
I want to use multi process to stack many images. Each stack consists of 5 images, which means I have a list of images with a sublist of the images which should be combined:
img_lst = [[01_A, 01_B, 01_C, 01_D, 01_E], [02_A, 02_B, 02_C, 02_D, 02_E], [03_A, 03_B, 03_C, 03_D, 03_E]]
At the moment I call my function do_stacking(sub_lst) in a loop:
for sub_lst in img_lst:
# example: do_stacking([01_A, 01_B, 01_C, 01_D, 01_E])
do_stacking(sub_lst)
I want to speed this up with multiprocessing, but I am not sure how to call the pool.map function:
if __name__ == '__main__':
from multiprocessing import Pool
# I store my lists in a file
f_in = open(stacking_path + "stacks.txt", 'r')
f_stack = f_in.readlines()
for data in f_stack:
data = data.strip()
data = data.split('\t')
# data is now my sub_lst
# Not sure what to do here, set the sublist, f_stack?
pool = Pool()
pool.map(do_stacking, ???)
pool.close()
pool.join()
Edit:
I have a list of lists:
[
[01_A, 01_B, 01_C, 01_D, 01_E],
[02_A, 02_B, 02_C, 02_D, 02_E],
[03_A, 03_B, 03_C, 03_D, 03_E]
]
Each sublist should be passed to a function called do_stacking(sublist). I only want to process one sublist at a time, not the entire list.
My question is how to handle the loop over the list (for x in img_lst). Should I create a loop for each Pool?
Pool.map works like the built-in map function. It fetches one element at a time from the iterable given as the second argument and passes it to the function given as the first argument.
if __name__ == '__main__':
from multiprocessing import Pool
# I store my lists in a file
f_in = open(stacking_path + "stacks.txt", 'r')
f_stack = f_in.readlines()
img_list = []
for data in f_stack:
data = data.strip()
data = data.split('\t')
# data is now my sub_lst
img_list.append(data)
print img_list # check if the img_list is right?
# Not sure what to do here, set the sublist, f_stack?
pool = Pool()
pool.map(do_stacking, img_list)
pool.close()
pool.join()
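If it helps, here is a slightly tidier variant of the same idea, with the file handled by a with-block and a comprehension building the list of sublists (stacking_path and do_stacking are placeholders standing in for the question's own):
import multiprocessing

stacking_path = './'   # placeholder; point this at your own directory

def do_stacking(sub_lst):
    # placeholder for the real stacking work on one group of five images
    print(sub_lst)

if __name__ == '__main__':
    # stacks.txt is assumed to hold one tab-separated group of image names per line
    with open(stacking_path + "stacks.txt", 'r') as f_in:
        img_list = [line.strip().split('\t') for line in f_in]

    pool = multiprocessing.Pool()      # defaults to one worker per CPU core
    pool.map(do_stacking, img_list)    # each call receives exactly one sublist
    pool.close()
    pool.join()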
I have the following code:
import sys
from pyspark import SparkContext
def mapper(array):
aux = []
array = str(array)
aux = array.split(' | ')
return {(aux[0][:-1],aux[1][:-1]): [(aux[0][1:],aux[1][1:])]}
def reducer(d1, d2):
for k in d1.keys():
if d2.has_key(k):
d1[k] = d1[k] + d2[k]
d2.pop(k)
d1.update(d2)
return d1
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: bruijn <file>")
exit(-1)
sc = SparkContext(appName="Assembler")
kd = sys.argv[1].lstrip('k').rstrip('mer.txt').split('d')
k, d = int(kd[0]), int(kd[1])
dic = sc.textFile(sys.argv[1],False).map(mapper).reduce(reducer)
filepath = open('DeBruijn.txt', 'w')
for key in sorted(dic):
filepath.write(str(key) + ' -> ' + str(dic[key]) + '\n')
filepath.close()
print('De Bruijn graph successfully generated!')
sc.stop()
I would like to create an empty list called vertexes inside the main block and make the mapper append elements to it. However, using the keyword global does not work. I have tried using an accumulator, but accumulators' values cannot be accessed inside tasks.
I figured out how to do it by creating a custom type of Accumulator that works with lists. In my code all I had to do was insert the following import and implement the following class:
from pyspark.accumulators import AccumulatorParam
class VectorAccumulatorParam(AccumulatorParam):
def zero(self, value):
return []
def addInPlace(self, val1, val2):
        # without this check the result would be a list with all the tuples nested inside another list
        return val1 + [val2] if type(val2) != list else val2
My mapper function would be like this:
def mapper(array):
global vertexes
aux = []
array = str(array)
aux = array.split(' | ')
vertexes += (aux[0][:-1], aux[1][:-1]) #Adding a tuple into accumulator
vertexes += (aux[0][1:], aux[1][1:]) #Adding a tuple into accumulator
    return {(aux[0][:-1], aux[1][:-1]): [(aux[0][1:], aux[1][1:])]}
And inside the main function before calling the mapper function I created the accumulator:
vertexes = sc.accumulator([],VectorAccumulatorParam())
After the mapper/reducer function calls, I could get the result:
vertexes = list(set(vertexes.value))
Herio Sousa's VectorAccumulatorParam is a good idea. However, you can actually use the built-in class AddingAccumulatorParam, which is basically the same as VectorAccumulatorParam.
Check out the original code here https://github.com/apache/spark/blob/41afa16500e682475eaa80e31c0434b7ab66abcb/python/pyspark/accumulators.py#L197-L213
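For reference, a minimal sketch of swapping in the built-in class (sc is the SparkContext from the question's main block; since AddingAccumulatorParam combines values with +=, each tuple you add should itself be wrapped in a list):
from pyspark.accumulators import AddingAccumulatorParam

# zero value is an empty list; addInPlace uses +=, so wrap each tuple in a list
vertexes = sc.accumulator([], AddingAccumulatorParam([]))

# inside the mapper, instead of the custom VectorAccumulatorParam:
# vertexes += [(aux[0][:-1], aux[1][:-1])]
# vertexes += [(aux[0][1:], aux[1][1:])]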
As you've noticed, you can't append elements inside of the mapper (or rather, you can append elements inside of the mapper, but the change is not propagated to any of the other mappers or to your main function). As you've also noticed, accumulators do allow you to append elements, but they can only be read in the driver program and written to in the executors. You could have another mapper output the keys and call distinct on it if you want the distinct keys. You might also want to look at reduceByKey instead of the reduce you are using.
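A rough sketch of those last two suggestions, reusing sc and the ' | ' line format from the question (parse is a hypothetical helper mirroring the original mapper):
def parse(line):
    # split one 'left | right' record into its prefix and suffix pairs,
    # mirroring what the original mapper computes
    left, right = str(line).split(' | ')
    prefix = (left[:-1], right[:-1])
    suffix = (left[1:], right[1:])
    return prefix, suffix

pairs = sc.textFile(sys.argv[1]).map(parse)

# distinct vertexes collected on the driver, with no accumulator involved
vertexes = pairs.flatMap(lambda p: [p[0], p[1]]).distinct().collect()

# adjacency lists grouped per key with reduceByKey instead of one big reduce
graph = pairs.map(lambda p: (p[0], [p[1]])).reduceByKey(lambda a, b: a + b).collectAsMap()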
Here's the simplest multiprocessing example I have found so far:
import multiprocessing
import subprocess
def calculate(value):
return value * 10
if __name__ == '__main__':
pool = multiprocessing.Pool(None)
tasks = range(10000)
results = []
r = pool.map_async(calculate, tasks, callback=results.append)
r.wait() # Wait on the results
print results
I have two lists and one index to access the elements in each list. The ith position on the first list is related to the ith position on the second. I didn't use a dict because the lists are ordered.
What I was doing was something like:
for i in xrange(len(first_list)):
# do something with first_list[i] and second_list[i]
So, using that example, I think I can make a function sort of like this:
#global variables first_list, second_list, i
first_list, second_list, i = None, None, 0
#initialize the lists
...
#have a function to do what the loop did and inside it increment i
def function():
    global i
    #do stuff with first_list[i] and second_list[i]
    i += 1
But that makes i a shared resource, and I'm not sure if that would be safe. It also seems to me that my design does not lend itself well to this multiprocessing approach, but I'm not sure how to fix it.
Here's a working example of what I wanted (edit in an image URL you want to use):
import multiprocessing
import subprocess, shlex
links = ['http://www.example.com/image.jpg']*10 # don't use this URL
names = [str(i) + '.jpg' for i in range(10)]
def download(i):
command = 'wget -O ' + names[i] + ' ' + links[i]
print command
args = shlex.split(command)
return subprocess.call(args, shell=False)
if __name__ == '__main__':
pool = multiprocessing.Pool(None)
tasks = range(10)
r = pool.map_async(download, tasks)
r.wait() # Wait on the results
First off, it might be beneficial to make one list of tuples, for example
new_list[i] = (first_list[i], second_list[i])
That way, as you change i, you ensure that you are always operating on the same items from first_list and second_list.
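For what it's worth, zip builds that pairing in one step; a tiny sketch with throwaway data (on Python 3, wrap the call in list()):
first_list = [10, 20, 30]
second_list = ['a', 'b', 'c']
new_list = zip(first_list, second_list)   # [(10, 'a'), (20, 'b'), (30, 'c')] on Python 2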
Secondly, assuming there are no dependencies between the i and i-1 entries in your lists, you can use your function to operate on one given i value and let the pool hand each i value to a separate worker. Consider:
indices = range(len(new_list))
results = []
r = pool.map_async(your_function, indices, callback=results.append)
r.wait() # Wait on the results
This should give you what you want.
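One way to put the two suggestions together is to pass the pairs themselves to the workers instead of indices; a minimal sketch in the question's Python 2 style, with your_function as a placeholder:
import multiprocessing

first_list = [1, 2, 3, 4]
second_list = ['a', 'b', 'c', 'd']
new_list = zip(first_list, second_list)   # pair up the ith elements once, up front

def your_function(pair):
    # placeholder work on one (first, second) pair
    first_item, second_item = pair
    return (second_item, first_item)

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    results = []
    r = pool.map_async(your_function, new_list, callback=results.extend)
    r.wait()      # wait on the results
    print results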