Timeout issues with Python itertools.combinations()

We've got a script that uses itertools.combinations() and it seems to hang with a large input size.
I'm a relatively inexperienced Python programmer so I'm not sure how to fix this problem. Is there a more suitable library? Or is there a way to enable verbose logging so that I can debug why the method call is hanging?
Any help is much appreciated.
[Edit]
import itertools

def findsubsets(S, m):
    return set(itertools.combinations(S, m))

# AllSearchTerms is defined earlier in the script;
# build a list of indices, one per search term
S = []
itemsize = 0
for s in AllSearchTerms:
    S.append(itemsize)
    itemsize = itemsize + 1

sublist = []
PComb = []
for i in range(1, 6):
    Subset = findsubsets(S, i)
    for sub in Subset:
        for s in sub:
            sublist.append(AllSearchTerms[s])
        PComb.append(sublist)
        sublist = []

You have two things in your code that will hang for large input sizes.
First, your function findsubsets calls itertools.combinations and then converts the result to a set. The object returned by itertools.combinations is a lazy iterator, yielding each combination one at a time without storing or calculating them all at once. When you convert that to a set, you force Python to calculate and store them all at once. Therefore the line return set( itertools.combinations(S, m) ) is almost certainly where your program hangs. You can check that by placing print statements (or some other kind of logging statements) immediately before and after that line; if you see the preceding print and the program hangs before you see the succeeding one, you have found the problem. The solution is not to convert the combinations to a set. Leave them as an iterator, and your program can grab one combination at a time, as needed.
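For illustration, a minimal sketch of the lazy approach (search_terms here is a small stand-in for the question's AllSearchTerms):

import itertools

search_terms = ["alpha", "beta", "gamma", "delta"]  # stand-in for AllSearchTerms

PComb = []
for m in range(1, 4):
    # Iterate over the combinations object directly; each combination
    # is computed only when the loop asks for it.
    for sub in itertools.combinations(range(len(search_terms)), m):
        PComb.append([search_terms[i] for i in sub])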
Second, even if you do what I just suggested, your loop for sub in Subset: is a fairly tight loop that uses every combination. If the input size is large, that loop will take a very long time and implementing my previous paragraph will not help. You probably should reorganize your program to avoid the large input sizes, or at least show some kind of progress during that loop. The combinations function has a predictable output size so you can even show the percent done in a progress bar.
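Since the total count C(n, m) is known up front, progress reporting is easy. A minimal sketch, assuming Python 3.8+ for math.comb:

import itertools
import math

S = range(40)                  # example input
m = 3
total = math.comb(len(S), m)   # total number of combinations, known in advance

for count, sub in enumerate(itertools.combinations(S, m), start=1):
    if count % 1000 == 0:
        print(f"{count / total:.1%} done")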
There is no logging inside itertools.combinations since it is not needed when used properly, and there is no logging in the conversion of the generator to a set. You can implement logging in your own tight loop, if needed.

Related

Time to execute function goes down instead of up as x increases - Range()

My bad if this is a bit nooby or I don't understand how this works. I'm trying to get the time of range(1,x) using the code below.
Code
import timeit

def main(x):
    return range(1, x)

def timeThem(x):
    start = timeit.default_timer()
    main(x)
    stop = timeit.default_timer()
    return stop - start

for i in range(5):
    print(timeThem(i))
Now I would expect that, since x is getting larger in range(1,x), the time it takes to execute would get longer. What I'd guess it would look like is something like this.
Expected Output
.01 .02 .03 .04 .05
But no, my time output gets shorter for some reason. As shown below I get something totally different than what I had imagined.
Received Output
8.219999999975469e-07
6.740000000060586e-07
1.0670000000004287e-06
4.939999999967193e-07
4.420000000032731e-07
What am I doing wrong here? Or do I just not understand how this really works?
From the range documentation:
Return an object that produces a sequence of integers from start (inclusive) to stop (exclusive) by step.
range does not produce the actual sequence, so it can run in constant time. Note that iterating over the result is done in linear time.
Furthermore, your values are too small to see any significant difference in the timing even if range ran in linear time. Consequently, you are measuring noise.
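To actually see linear behavior you have to consume the range, with inputs large enough to rise above the noise. A minimal sketch using timeit:

import timeit

for x in (10_000, 100_000, 1_000_000):
    create = timeit.timeit(lambda: range(1, x), number=1000)
    consume = timeit.timeit(lambda: sum(range(1, x)), number=1000)
    print(f"x={x:>9}: create={create:.4f}s  consume={consume:.4f}s")

Creation time stays roughly flat as x grows, while the consuming version scales linearly.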
Your main function only returns a lazy range object:
def main(x):
    return range(1, x)
Basically, a range object is not evaluated right away; it just stores its start and stop values, with no evaluation of the sequence yet. So it does not matter whether you pass x=1, x=100 or x=1000000. From a performance point of view it is basically like returning a tuple:

def main(x):
    return (1, x)

This is because a range is only evaluated when you iterate over it. E.g. list(range(0, <infinity>)) would break your memory, but for i in range(0, <infinity>): print(i) would just take forever to compute. So range(1, x) just creates one small object - it does not evaluate the sequence.
Please also be aware that Python has different coding standards from languages like Java or JavaScript, where timeThem would be a proper name; in Python we follow PEP 8, which says one should use snake_case, like time_them.
Personally, I would recommend using something like time_function to be even more explicit about what your function is supposed to do.

Python multiprocessing pool.map within an apply loop causing resets and strange behavior

I am experiencing some really weird behavior with pool.starmap in the context of a groupby apply function. Without getting into the specifics, my code is something like this:
def groupby_apply_func(input_df):
    print('Processing group # x (I determine x from the df)')
    output_df = pool.starmap(another_func, zip(some fields in input_df))
    return output_df

result = some_df.groupby(groupby_fields).apply(groupby_apply_func)
In words, this takes a dataframe, forms a groupby on it, sends these groups to groupby_apply_func, which does some processing asynchronously using starmap and returns the results, which are concatenated into a final df. pool is a worker pool made from the multiprocessing library.
This code works for smaller datasets without problem. So there are no syntax errors or anything. The computer will loop through all of the groups formed by groupby, send them to groupby_apply_func (I can see the progress from the print statement), and come back fine.
The weird behavior is: on large datasets, it starts looping through the groups. Then, halfway through, or 3/4 way through (which in real time might be 12 hours), it starts completely over at the beginning of the groupbys! It resets the loop and begins again. Then, sometimes, the second loop resets also and so on... and it gets stuck in an infinite loop. Again, this is only with large datasets, it works as intended on small ones.
Could there be something in the apply functionality that, upon running out of memory, for example, decides to start re-processing all the groups? Seems unlikely to me, but I did read that the apply function will actually process the first group multiple times in order to optimize code paths, so I know that there is "meta" functionality in there - and some logic to handle the processing - and it's not just a straight loop.
Hope all that made sense. Does anyone know the inner workings of groupby.apply and if so if anything in there could possibly be causing this?
thx
EDIT: it appears to reset the loop at this point in pandas' ops.py ... it gets to this except clause and then proceeds to line 195, which is for key, (i, group) in zip(group_keys, splitter):, which starts the entire loop over again. Does this mean anything to anybody?
except libreduction.InvalidApply as err:
    # This Exception is raised if `f` triggers an exception
    # but it is preferable to raise the exception in Python.
    if "Let this error raise above us" not in str(err):
        # TODO: can we infer anything about whether this is
        # worth-retrying in pure-python?
        raise
I would use a list of the group dataframes as the argument to map (I don't think you need starmap here), rather than hiding the multiprocessing in the function to be applied.
import multiprocessing as mp

def func(df):
    # do something
    return df.apply(func2)

with mp.Pool(mp.cpu_count()) as p:
    groupby = some_df.groupby(groupby_fields)
    groups = [groupby.get_group(group) for group in groupby.groups]
    result = p.map(func, groups)
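For completeness, a fuller sketch of that approach with placeholder data (the DataFrame contents and the per-group work are made up for illustration), reassembling the per-group results with pd.concat:

import multiprocessing as mp
import pandas as pd

def func(df):
    # placeholder per-group work; the real code would call another_func, etc.
    return df.assign(doubled=df["v"] * 2)

if __name__ == "__main__":
    some_df = pd.DataFrame({"g": [1, 1, 2, 2], "v": [10, 20, 30, 40]})
    groupby = some_df.groupby("g")
    groups = [groupby.get_group(g) for g in groupby.groups]
    with mp.Pool(mp.cpu_count()) as p:
        results = p.map(func, groups)
    result = pd.concat(results)  # one frame again, same rows as some_df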
OK so I figured it out. It doesn't have anything to do with starmap. It is due to the groupby apply function. This function tries to call fast_apply on the groupby prior to running "normal" apply. If anything causes an error in that fast_apply loop (in my case it was an out-of-memory error), it then re-runs everything using "normal" apply. However, it does not print the exception / error; it just catches all errors.
Not sure if any Python people will read this but I'd humbly suggest that:
if an error really occurs in the fast_apply loop, maybe print it out rather than catching everything; this could make debugging cases like this much easier
the logic to re-run the entire loop if fast_apply fails seems a little weird to me. It's probably not a big deal for small apply operations, but in my case I had a huge one and I really don't want it re-running the entire thing. Perhaps give the user an option to NOT use fast_apply, to avoid the whole fast_apply optimization? I don't know its inner workings and I'm sure it's in there for a good reason, but it does add complexity, and in my case it created a very confusing situation which took hours to figure out.

sys.stdin.readline() vs. input() [duplicate]

I'm trying to decide which one to use when I need to read lines of input from STDIN, so I wonder how to choose between them in different situations.
I found a previous post (https://codereview.stackexchange.com/questions/23981/how-to-optimize-this-simple-python-program) saying that:
How can I optimize this code in terms of time and memory used? Note that I'm using different function to read the input, as sys.stdin.readline() is the fastest one when reading strings and input() when reading integers.
Is that statement true?
The builtin input and sys.stdin.readline functions don't do exactly the same thing, and which one is faster may depend on the details of exactly what you're doing. As aruisdante commented, the difference is smaller in Python 3 than it was in Python 2, which is when the quote you provided was written, but there are still some differences.
The first difference is that input has an optional prompt parameter that will be displayed if the interpreter is running interactively. This leads to some overhead, even if the prompt is empty (the default). On the other hand, it may be faster than doing a print before each readline call, if you do want a prompt.
The next difference is that input strips off any newline from the end of the input. If you're going to strip that anyway, it may be faster to let input do it for you, rather than doing sys.stdin.readline().strip().
A final difference is how the end of the input is indicated. input will raise an EOFError when you call it if there is no more input (stdin has been closed on the other end). sys.stdin.readline, on the other hand, will return an empty string at EOF, which you need to remember to check for.
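For illustration, a sketch of the two EOF conventions (each loop shown separately; only one of them would consume a given stream):

import sys

# input() raises EOFError once the input is exhausted
try:
    while True:
        line = input()
except EOFError:
    pass

# sys.stdin.readline() returns an empty string at EOF instead
while True:
    line = sys.stdin.readline()
    if line == "":
        break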
There's also a third option, using the file iteration protocol on sys.stdin. This is likely to behave much like calling readline, but with perhaps nicer logic to it.
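That style looks like this; the loop simply ends at EOF, and each line keeps its trailing newline unless you strip it:

import sys

for line in sys.stdin:
    line = line.rstrip("\n")  # lines arrive with their newline; strip it here
    print(line)               # stand-in for real per-line handling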
I suspect that while differences in performance between your various options may exist, they're likely to be smaller than the time cost of simply reading the file from the disk (if it is large) and doing whatever you are doing with it. I suggest that you avoid the trap of premature optimization and just do what is most natural for your problem, and if the program is too slow (where "too slow" is very subjective), you do some profiling to see what is taking the most time. Don't put a whole lot of effort into deciding between the different ways of taking input unless it actually matters.
As Linn1024 says, for reading large amounts of data input() is much slower.
A simple example is this:
import sys

for i in range(int(sys.argv[1])):
    sys.stdin.readline()
This takes about 0.25μs per iteration:
$ time yes | py readline.py 1000000
yes 0.05s user 0.00s system 22% cpu 0.252 total
Changing that to sys.stdin.readline().strip() takes that to about 0.31μs.
Changing readline() to input() is about 10 times slower:
$ time yes | py input.py 1000000
yes 0.05s user 0.00s system 1% cpu 2.855 total
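where input.py is, presumably, the same loop with input() substituted for the readline call:

import sys

for i in range(int(sys.argv[1])):
    input()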
Notice that it's still pretty fast though, so you only really need to worry when you are reading thousands of entries like above.
input() checks whether stdin is a TTY every time it runs (via a syscall), and that makes it much slower than sys.stdin.readline(). See the CPython source:
https://github.com/python/cpython/blob/af2f5b1723b95e45e1f15b5bd52102b7de560f7c/Python/bltinmodule.c#L1981
import sys

def solve(N, A):
    for _ in range(N):
        A.append(int(sys.stdin.readline().strip()))
    return A

def main():
    N = int(sys.stdin.readline().strip())
    A = []
    result = solve(N, A)
    print(result)

main()

my Python multiprocesses are apparently not independent

I have a very specific problem with Python parallelisation; let's see if I can explain it.
I want to execute a function foo() using the multiprocessing library for parallelisation.
import multiprocessing

n = 4

# Creation of the n processes, in this case 4, and start them
# (note args must be a tuple, hence the trailing comma)
threads = [multiprocessing.Process(target=foo, args=(i,)) for i in range(n)]
for th in threads:
    th.start()
The foo() function is a recursive function that explores a tree in depth until one specific event happens. Depending on how it expands through the tree, this event can occur in a few steps, for example 5, or even in millions. Each tree node holds a set of elements, and in each step I select a random element from this set with rand_element = random.sample(node.set_of_elements, 1)[0] and make a recursive call accordingly, i.e., two different random elements lead to different tree paths.
The problem is that, for some unknown reason, the processes apparently do not behave independently. For example, if I run 4 processes in parallel, sometimes they return this result.
1, Number of steps: 5
2, Number of steps: 5
3, Number of steps: 5
4, Number of steps: 5
that is to say, all the processes take the "good path" and end in very few steps. On the other hand, other times it returns this.
1, Number of steps: 6516
2, Number of steps: 8463
3, Number of steps: 46114
4, Number of steps: 56312
that is to say, all the processes take "bad paths". I haven't had a single execution in which at least one takes the "good path" and the rest the "bad path".
If I run foo() multiple times sequentially, more than half of the executions end with fewer than 5000 steps, but in concurrency I don't see this proportion; all the processes end either fast or slow.
How is it possible?
Sorry if I can't give you more precise details about the program and execution, but it is too big and complex to explain here.
I have found the solution; I post it in case someone finds it helpful.
The problem was that at some point inside foo() I had used my_set.pop() instead of my_set.remove(random.sample(my_set, 1)[0]). The first one, my_set.pop(), doesn't actually return a random element. CPython sets keep their elements in a concrete internal order; the key is that this order is derived from (randomized) hashing, so it merely looks random, and my_set.pop() always returns the first element in that internal order. The problem was that in my case all the processes shared that order, so my_set.pop() returned the same first element in all of them.
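A sketch of that fix in isolation. Note that on newer Pythons you must convert the set to a sequence before sampling (random.sample() on a set was deprecated in 3.9 and removed in 3.11):

import random

my_set = {"a", "b", "c", "d"}

# Deterministic: pop() removes whichever element is "first" in the
# set's internal order, which forked worker processes all share.
# element = my_set.pop()

# Pseudorandom: sample explicitly, then remove.
element = random.sample(sorted(my_set), 1)[0]
my_set.remove(element)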
You should use collections.OrderedDict (or another ordered data structure) rather than set if your program cares about item order (as random.sample() does, for example). Even in Python 3.7 and later, at the time of this writing, sets are documented as unordered collections, so they should not be used if the order in which items are inserted or enumerated is important to your program.
With set, you should not expect items to be inserted or enumerated in any particular order, even in a pseudorandom order.
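A sketch of that suggestion, using dict keys as an insertion-ordered set (plain dicts preserve insertion order in Python 3.7+; collections.OrderedDict behaves the same way here):

import random

ordered = dict.fromkeys(["a", "b", "c", "d"])  # values unused; keys act as an ordered set

element = random.sample(list(ordered), 1)[0]   # sample from a well-defined order
del ordered[element]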
See also:
Does Python have an ordered set?
Are dictionaries ordered in Python 3.6+?
https://stackoverflow.com/a/64855489/815724

How to make a recursive program run for a long time without getting RuntimeError in Python

This code is the recursive factorial function.
The problem is that if I want to calculate a very large number, it generates this error:
RuntimeError: maximum recursion depth exceeded
import time

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

print " The factorial of the number is: ", factorial(1500)
time.sleep(3600)
The goal is to have the recursive function keep calculating for a maximum of one hour.
This is a really bad idea. Python is not at all well-suited for recursing that many times. I'd strongly recommend you switch this to a loop which checks a timer and stops when it reaches the limit.
But, if you're seriously interested in increasing the recursion limit in CPython (the default depth is 1000), there's a sys setting for that, sys.setrecursionlimit. Note, as it says in the documentation, that "the highest possible limit is platform-dependent" - meaning there's no way to know when your program will fail. Nor is there any way you, I or CPython could ever tell whether your program will recurse for something as irrelevant to the actual execution of your code as "an hour." (Just for fun, I tried this with a method that passes an int counting how many times it's already recursed, and I got to 9755 before IDLE totally restarted itself.)
Here's an example of a way I think you should do this:
# be sure to import time
start_time = time.time()
counter = 1

# will execute for an hour
while time.time() < start_time + 3600:
    factorial(counter)  # presumably you'd want to do something with the return value here
    counter += 1
You should also keep in mind that regardless of whether you use iteration or recursion (unless you're using a separate thread), you're still going to be blocking the entire program for the entirety of the hour.
Don't do that. There is an upper limit on how deep your recursion can get. Instead, do something like this:
def factorial(n):
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result
Any recursive function can be rewritten as an iterative function. If your code is fancier than this, show us the actual code and we'll help you rewrite it.
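For what it's worth, the standard library already provides an iterative implementation, so for plain factorials you need neither version:

import math

print(math.factorial(1500))  # no recursion limit involved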
A few things to note here:
You can increase the recursion limit with:

import sys

sys.setrecursionlimit(someNumber)  # maybe 20000 or bigger

which will basically just raise your recursion limit. Note that in order for the program to run for one hour, this number would have to be so unreasonably big that it is practically impossible. This is one of the problems with recursion, and it is why people turn to iterative programs.
So basically what you want is practically impossible with recursion; you should write it with a loop/while approach instead.
Moreover, your sleep call does not do what you want: time.sleep just forces the program to wait additional time (freezing it).
It is a guard against a stack overflow. You can change the recursion limit with sys.setrecursionlimit(newLimit), where newLimit is an integer.
Python isn't a functional language. Rewriting the algorithm iteratively, if possible, is generally a better idea.
