My Python processes are apparently not independent - python

I have a very specific problem with Python parallelisation; let's see if I can explain it.
I want to execute a function foo() using the multiprocessing library for parallelisation.
# Create the n processes (here n = 4) and start them.
# Note: args must be a tuple, so (i,) rather than (i).
processes = [multiprocessing.Process(target=foo, args=(i,)) for i in range(n)]
for p in processes:
    p.start()
The foo() function is a recursive function that explores a tree in depth until one specific event happens. Depending on how it expands through the tree, this event can occur within a few steps, for example 5, or within millions. Each tree node holds a set of elements, and at each step I select a random element from this set with rand_element = random.sample(node.set_of_elements, 1)[0] and make a recursive call accordingly, i.e., two different random elements lead to different tree paths.
The problem is that, for some unknown reason, the processes apparently do not behave independently. For example, if I run 4 processes in parallel, sometimes they return this result:
1, Number of steps: 5
2, Number of steps: 5
3, Number of steps: 5
4, Number of steps: 5
that is to say, all the processes take the "good path" and end in very few steps. On the other hand, other times they return this:
1, Number of steps: 6516
2, Number of steps: 8463
3, Number of steps: 46114
4, Number of steps: 56312
that is to say, all the processes take "bad paths". I haven't had a single execution in which at least one takes the "good path" and the rest take the "bad path".
If I run foo() multiple times sequentially, more than half of the executions end within 5000 steps, but in concurrency I don't see this proportion: the processes all end either fast or slow.
How is it possible?
Sorry I can't give you more precise details about the program and execution, but it is too big and complex to explain here.

I have found the solution; I post it in case someone finds it helpful.
The problem was that at some point inside foo() I had used my_set.pop() instead of my_set.remove(random.sample(my_set, 1)[0]). The first, my_set.pop(), doesn't actually return a random element. In Python 3.6 a set has a concrete internal order, like a list; the key is that this order is derived from (randomized) hashing, so pop(), which always returns the "first" element, merely looks pseudo-random. The problem was that in my case all the processes shared that order, because they were forked from the same parent interpreter, so my_set.pop() returned the same first element in every one of them.
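To make the difference concrete, here is a minimal standalone sketch (not from the original program) contrasting the two calls. Within one interpreter, two sets built identically from the same elements share the same internal order, so pop() is deterministic; sampling with the RNG and then removing the sampled element really is random:

```python
import random

# Two equal sets built the same way share one internal order,
# so pop() returns the same "first" element from both:
s1 = set(range(100))
s2 = set(range(100))
assert s1.pop() == s2.pop()

# Drawing an element with the RNG and removing it is genuinely random:
s = set(range(100))
elem = random.sample(sorted(s), 1)[0]  # sorted(): sample needs a sequence in 3.11+
s.remove(elem)
assert elem not in s and len(s) == 99
```

Note that random.sample() requires a sequence in Python 3.11 and later, hence the sorted() call.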

You should use collections.OrderedDict (or another ordered data structure) rather than set if your program cares about item order (as random.sample() does, for example). Even in Python 3.7 and later, at the time of this writing, sets are documented as unordered collections, so they should not be used when the order in which items are inserted or enumerated matters to your program.
With set, you should not expect items to be inserted or enumerated in any particular order, not even a pseudorandom one.
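As an illustration (a standalone sketch, not part of the original program), the internal order of a set of strings is tied to the interpreter's hash seed. Fixing PYTHONHASHSEED reproduces the same pop() sequence across interpreter runs, which is exactly why forked processes sharing one seed popped the same elements:

```python
import os
import subprocess
import sys

SNIPPET = "s = set('abcdef'); print([s.pop() for _ in range(6)])"

def pop_order(seed):
    # Run a fresh interpreter with a fixed hash seed and record the
    # order in which pop() empties a small set of strings.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run([sys.executable, "-c", SNIPPET],
                            capture_output=True, text=True, env=env)
    return result.stdout

# The same seed always yields the same "random-looking" pop() order:
assert pop_order("1") == pop_order("1")
```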
See also:
Does Python have an ordered set?
Are dictionaries ordered in Python 3.6+?
https://stackoverflow.com/a/64855489/815724

Related

Python: Better way to write nested for loops and if statements

I am trying to find a more Pythonic way of doing the below.
for employee in get_employees:
    for jobs in employee['jobs']:
        for nemployee in employee_comps:
            if nemployee['employee_id'] == employee['id']:
                for njob in nemployee['hourly_compensations']:
                    if njob['job_id'] == jobs['id']:
                        njob['rate'] = jobs['rate']
It works but seems clunky. I'm new to Python, if there is another thread that will help with this please direct me there!
The main comment I would make about the code is that you are free to change the order of the outer three for loops, because the operation you are performing does not depend on the order in which you loop over them (you are not breaking out of any loop when finding a match). Given that, there is no point in running the jobs loop only to reach an if statement inside it that is independent of the value of jobs. It would be more efficient to put the jobs loop inside the other two, so that it also sits inside the if; the loop is then only performed for those combinations of employee and nemployee where the if condition evaluates True.
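For illustration, here is that rearrangement with minimal made-up sample data (the dict shapes are inferred from the question):

```python
# Hypothetical sample data mirroring the structures in the question:
get_employees = [{'id': 1, 'jobs': [{'id': 10, 'rate': 25.0}]}]
employee_comps = [{'employee_id': 1,
                   'hourly_compensations': [{'job_id': 10, 'rate': None}]}]

for employee in get_employees:
    for nemployee in employee_comps:
        if nemployee['employee_id'] == employee['id']:
            # The jobs loop now runs only for matching employee pairs:
            for jobs in employee['jobs']:
                for njob in nemployee['hourly_compensations']:
                    if njob['job_id'] == jobs['id']:
                        njob['rate'] = jobs['rate']

print(employee_comps[0]['hourly_compensations'][0]['rate'])  # 25.0
```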
Beyond this, and less importantly, after this rearrangement there are consecutive for statements over independent iterables; you could replace each such pair with a single loop over an itertools.product iterator, reducing the nesting from four explicit loops to two:
from itertools import product

for employee, nemployee in product(get_employees, employee_comps):
    if nemployee['employee_id'] == employee['id']:
        for jobs, njob in product(employee['jobs'],
                                  nemployee['hourly_compensations']):
            if njob['job_id'] == jobs['id']:
                njob['rate'] = jobs['rate']
The code you have is very clean and Pythonic; I would suggest staying with that.
If you want it in one line, this should work, but I don't have data to test it on, so I'm not sure.
[[njob.update({'rate': jobs['rate']}) for njob in nemployee['hourly_compensations'] if njob['job_id'] == jobs['id']] for employee in get_employees for jobs in employee['jobs'] for nemployee in employee_comps if nemployee['employee_id'] == employee['id']]

Using heaps for schedulers

On the Python official docs here, the following is mentioned regarding heaps:
A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not “better” than the last 0’th element you extracted. This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the “win” condition means the smallest scheduled
time. When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap.
I can only think of the following simple algorithm to implement a scheduler using a heap:
import heapq

# Priority queue using a heap
pq = []
# The first element of each tuple is the time at which the task should run.
task1 = (1, Task(...))
task2 = (2, Task(...))
heapq.heappush(pq, task1)
heapq.heappush(pq, task2)
# Add a few more root-level tasks ...
while pq:
    run_time, next_task = heapq.heappop(pq)
    next_task.perform()
    for child_task in next_task.get_child_tasks():
        # Add new child tasks (as (time, task) tuples) if available
        heapq.heappush(pq, child_task)
In this, where does sorting even come into the picture?
And even if a future child task had a time in the 'past', this algorithm would still work correctly.
So why is the author warning about child events only being scheduled for the future?
And what does this mean:
you can efficiently insert new items while the sort is going on,
provided that the inserted items are not “better” than the last 0’th
element you extracted.
Heaps are the data structure used for priority queues: in a min-heap the lowest value is always on top (in a max-heap, the highest). Therefore you can always extract the lowest or highest element without searching for it.
You can also insert new elements while the sort is going on; look at how heapsort works. Each time you extract the maximum value from the heap, you put it at the end of the array and decrement heap.length by 1.
If you have already sorted some numbers, [..., 13, 15, 16], and you then insert a new number that is higher than the last element extracted (13, the 0’th element), you will get a wrong result: the new number, say 14, will be extracted but cannot be put in its right place, giving [1, 2, 5, 7, 14, 13, 15, 16]. It ends up before 13 because it is swapped into the heap.length position.
This is obviously wrong, so you can only insert elements that are not “better” than the 0’th element you extracted.
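The same caveat can be shown with Python's heapq, which is a min-heap, so "better" means smaller. This is a standalone sketch, not the poster's code:

```python
import heapq

# Stream sorted output from a min-heap.
heap = [3, 5, 7]
heapq.heapify(heap)
popped = [heapq.heappop(heap)]   # 3 extracted: the "0'th element"

# Inserting an item that is not "better" (not smaller) than 3 is safe:
heapq.heappush(heap, 4)
while heap:
    popped.append(heapq.heappop(heap))
print(popped)                    # [3, 4, 5, 7] -- still sorted

# Inserting an item smaller than an already-extracted value is too late:
heap = [5, 7]
heapq.heapify(heap)
out = [3]                        # 3 was already extracted
heapq.heappush(heap, 1)          # 1 should have come before 3
while heap:
    out.append(heapq.heappop(heap))
print(out)                       # [3, 1, 5, 7] -- no longer sorted
```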

Timeout issues with Python itertools.combinations()

We've got a script that uses itertools.combinations() and it seems to hang with a large input size.
I'm a relatively inexperienced Python programmer, so I'm not sure how to fix this problem. Is there a more suitable library? Or is there a way to enable verbose logging so that I can debug why the method call is hanging?
Any help is much appreciated.
[Edit]
def findsubsets(S, m):
    return set(itertools.combinations(S, m))

for s in AllSearchTerms:
    S.append(itemsize)
    itemsize = itemsize + 1

for i in range(1, 6):
    Subset = findsubsets(S, i)
    for sub in Subset:
        for s in sub:
            sublist.append(AllSearchTerms[s])
        PComb.append(sublist)
        sublist = []
You have two things in your code that will hang for large input sizes.
First, your function findsubsets calls itertools.combinations and then converts the result to a set. The result of itertools.combinations is a lazy iterator, yielding each combination one at a time without calculating or storing them all at once. When you convert that to a set, you force Python to calculate and store all of them at once. Therefore the line return set(itertools.combinations(S, m)) is almost certainly where your program hangs. You can check this by placing print statements (or some other kind of logging) immediately before and after that line; if you see the preceding print and the program hangs before you see the succeeding one, you have found the problem. The solution is not to convert the combinations to a set. Leave it as an iterator, and your program can grab one combination at a time, as needed.
Second, even if you do what I just suggested, your loop for sub in Subset: is a fairly tight loop that touches every combination. If the input size is large, that loop will take a very long time, and the advice in the previous paragraph will not help on its own. You probably should reorganize your program to avoid the large input sizes, or at least show some kind of progress during that loop. The combinations function has a predictable output size, so you can even show the percentage done in a progress bar.
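A sketch of that idea, as a standalone example with made-up sizes: iterate the iterator directly, and use math.comb, which gives the total number of combinations up front, to report progress:

```python
import itertools
import math

items = list(range(20))
m = 3
total = math.comb(len(items), m)      # 1140 combinations, known in advance

done = 0
for combo in itertools.combinations(items, m):
    # ... process one combination at a time; nothing is stored ...
    done += 1
    if done % 500 == 0 or done == total:
        print(f"{done}/{total} ({100 * done // total}%)")
```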
There is no logging inside itertools.combinations since it is not needed when used properly, and there is no logging in the conversion of the generator to a set. You can implement logging in your own tight loop, if needed.

How to try all possible paths?

I need to try all possible paths, branching every time I hit a certain point. There are <128 possible paths for this problem, so no need to worry about exponential scaling.
I have a player that can take steps through a field. The player takes a step, and on a step there could be an encounter.
There are two options when an encounter is found: i) Input 'B' or ii) Input 'G'.
I would like to try both and continue repeating this until the end of the field is reached. The end goal is to have tried all possibilities.
Here is the template, in Python, for what I am talking about (Step object returns the next step using next()):
from row_maker_inlined import Step

def main():
    initial_stats = {'n': 1, 'step': 250, 'o': 13, 'i': 113, 'dng': 0, 'inp': 'Empty'}
    player = Step(initial_stats)
    end_of_field = 128
    # Walk until reaching an encounter:
    while player.step['n'] < end_of_field:
        player.next()
        if player.step['enc']:
            print('An encounter has been reached.')
            # Perform an input on an encounter step:
            player.input = 'B'
            # Make a branch of player?
            # Perform this on the branch:
            # player.input = 'G'
            # Keep doing this, branching on each encounter, until the end is reached.
As you can see, the problem is rather simple; I just have no idea, as a beginner programmer, how to solve it.
I believe I may need to use recursion in order to keep branching, but I don't really understand how one 'makes a branch', with recursion or anything else.
What kind of solution should I be looking at?
You should be looking at search algorithms like breadth-first search (BFS) and depth-first search (DFS).
Wikipedia has this as the pseudo-code implementation of BFS:
procedure BFS(G, v) is
    let Q be a queue
    Q.enqueue(v)
    label v as discovered
    while Q is not empty
        v ← Q.dequeue()
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as discovered
                Q.enqueue(w)
                label w as discovered
Essentially, when you reach an "encounter" you want to add this point to the end of your queue. Then you pick the FIRST element off the queue and explore it, putting all its children into the queue, and so on. It's a non-recursive solution that is simple enough to do what you want.
DFS is similar, but instead of picking the FIRST element from the queue, you pick the LAST (i.e., you use it as a stack). This makes you explore a single path all the way to a dead end before coming back to explore another.
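Since the Step class isn't shown, here is a hedged, self-contained sketch of that idea: a state is (position, inputs so far); at each encounter the state is duplicated once per possible input, and the queue drives the exploration (all names here are made up):

```python
from collections import deque

def explore(start, end, encounters, choices=('B', 'G')):
    """Enumerate every input sequence, branching at each encounter step."""
    paths = []
    queue = deque([(start, ())])          # (position, inputs chosen so far)
    while queue:
        pos, inputs = queue.popleft()     # FIFO -> breadth-first
        if pos >= end:
            paths.append(inputs)
            continue
        if pos in encounters:
            # "Make a branch": one copy of the state per possible input.
            for choice in choices:
                queue.append((pos + 1, inputs + (choice,)))
        else:
            queue.append((pos + 1, inputs))
    return paths

# Two encounters (at steps 1 and 3) -> 2 * 2 = 4 distinct paths:
print(len(explore(0, 4, {1, 3})))  # 4
```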
Good luck!

Plotting time against size of input for Longest Common Subsequence Problem

I wish to plot time against size of input for the longest common subsequence problem, in both the recursive and the dynamic programming approaches. So far I've developed programs for evaluating the lcs functions both ways, a simple random string generator (with help from here), and a program to plot the graph. Now I have to connect all these: the two programs for calculating the lcs should run about 10 times, with output from the random string generator given as command-line arguments to these programs.
The time taken to execute these programs is measured and stored, along with the length of the strings used, in a file like
l=15, r=0.003, c=0.001
This is parsed by the python program to populate the following lists
sequence_lengths = []
recursive_times = []
dynamic_times = []
and then the graph is plotted. I have the following questions regarding the above.
1) How do I pass the output of one C program to another C program as command line arguments?
2) Is there any function to evaluate the time taken to execute a function, in microseconds? Presently the only option I have is the time utility in Unix, and being a command-line utility makes it harder to work with.
Any help would be much appreciated.
If the data being passed from program to program is small and can be converted to character format, you can pass it as one or more command-line arguments. If not, you can write it to a file and pass its name as an argument.
For Python programs, many people use the timeit module's Timer class to measure code execution speed. You can also roll your own using the clock() or time() functions in the time module. The resolution depends on the platform you're running on.
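For the Python side, a minimal timeit sketch (with a toy recursive LCS standing in for the real function being measured):

```python
import timeit

def lcs_recursive(a, b):
    # Toy recursive LCS, a stand-in for the function being timed.
    if not a or not b:
        return 0
    if a[-1] == b[-1]:
        return 1 + lcs_recursive(a[:-1], b[:-1])
    return max(lcs_recursive(a[:-1], b), lcs_recursive(a, b[:-1]))

# Average over 100 calls and report microseconds per call:
seconds = timeit.timeit(lambda: lcs_recursive("ABCBDAB", "BDCABA"), number=100)
print(f"{seconds / 100 * 1e6:.1f} microseconds per call")
```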
1) There are many ways. The simplest is to use system() with a string constructed from the output (or popen() to open it as a pipe if you need to read back its output); or, if you don't need to return to the current program, you can use the various exec*() functions, placing the output in the arguments.
In an sh shell you can also do this with command2 $(command1 args_to_command_1).
2) For timing in C, see clock and getrusage.
