Python: Better way to write nested for loops and if statements

I am trying to find a more Pythonic way of doing the below.
for employee in get_employees:
    for jobs in employee['jobs']:
        for nemployee in employee_comps:
            if nemployee['employee_id'] == employee['id']:
                for njob in nemployee['hourly_compensations']:
                    if njob['job_id'] == jobs['id']:
                        njob['rate'] = jobs['rate']
It works but seems clunky. I'm new to Python, if there is another thread that will help with this please direct me there!

The main comment I would make about the code is that you are free to reorder the outer three for loops: the operation does not depend on the order in which you iterate, since you never break out of a loop on finding a match. Given that, there is no point in running the jobs loop only to reach an if statement inside it that does not depend on the value of jobs. It would be more efficient to move the jobs loop inside the other two, so that it also sits inside the if and is only performed for those combinations of employee and nemployee where the condition evaluates True.
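A minimal sketch of that rearrangement, run on made-up sample data. The dictionaries below are assumptions that only mirror the keys used in the question, and apply_rates is a hypothetical helper name:

```python
def apply_rates(get_employees, employee_comps):
    """Copy each job's rate onto the matching hourly compensation entry."""
    for employee in get_employees:
        for nemployee in employee_comps:
            # Test the employee match first, so the job loops only run
            # for pairs that actually belong together.
            if nemployee['employee_id'] == employee['id']:
                for jobs in employee['jobs']:
                    for njob in nemployee['hourly_compensations']:
                        if njob['job_id'] == jobs['id']:
                            njob['rate'] = jobs['rate']


# Hypothetical sample data, shaped like the structures in the question.
get_employees = [
    {'id': 1, 'jobs': [{'id': 10, 'rate': 15.0}, {'id': 11, 'rate': 18.5}]},
    {'id': 2, 'jobs': [{'id': 10, 'rate': 16.0}]},
]
employee_comps = [
    {'employee_id': 1, 'hourly_compensations': [{'job_id': 10, 'rate': None},
                                                {'job_id': 11, 'rate': None}]},
    {'employee_id': 2, 'hourly_compensations': [{'job_id': 10, 'rate': None}]},
]

apply_rates(get_employees, employee_comps)
print(employee_comps[0]['hourly_compensations'])
```

The result is the same as with the original loop order; only the amount of wasted iteration changes.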
Beyond this but less importantly, where there are consecutive for statements (over independent iterables) after doing this rearrangement, you could replace them with a single loop over an itertools.product iterator to reduce the depth of nesting of for loops if you wish (reducing it from four to two explicit loops):
from itertools import product
for employee, nemployee in product(get_employees, employee_comps):
    if nemployee['employee_id'] == employee['id']:
        for jobs, njob in product(employee['jobs'],
                                  nemployee['hourly_compensations']):
            if njob['job_id'] == jobs['id']:
                njob['rate'] = jobs['rate']

The code you have is very clean and pythonic, I would suggest staying with that.
If you want it in one line, this should work, but I don't have data to test it on, so I'm not sure.
[[njob.update({'rate': jobs['rate']}) for njob in nemployee['hourly_compensations'] if njob['job_id'] == jobs['id']] for employee in get_employees for jobs in employee['jobs'] for nemployee in employee_comps if nemployee['employee_id'] == employee['id']]

Related

Best way in python to check if a loop is not executed

The title might be misleading, so here is a better explanation.
Consider the following code:
def minimum_working_environment(r):
    trial=np.arange(0,6)[::-1]
    for i in range(len(trial)):
        if r>trial[i]:
            return i
    return len(trial)
We see that if r is smaller than the smallest element of trial, the if clause inside the loop is never executed. Therefore, the function never returns anything in the loop and returns something in the last line. If the if clause inside the loop is executed, return terminates the code, so the last line is never executed.
I want to implement something similar, but without return, i.e.,
def minimum_working_environment(self,r):
    self.trial=np.arange(0,6)[::-1]
    for i in range(len(self.trial)):
        if r>self.trial[i]:
            self.some_aspect=i
            break
    self.some_aspect=len(self.trial)
Here, break disrupts the loop but the function is not terminated.
The solutions I can think of are:
Replace break with return 0 and not check the return value of the function.
Use a flag variable.
Expand the self.trial array with a very small negative number, like -1e99.
First method looks good, I will probably implement it if I don't get any answer. The second one is very boring. The third one is not just boring but also might cause performance problems.
My questions are:
Is there a reserved word like return that would work in the way that I want, i.e., terminate the function?
If not, what is the best solution to this?
Thanks!
You can check that a for loop did not run into a break with else, which seems to be what you're after.
import numpy as np
def minimum_working_environment(r):
    trial = np.arange(0, 6)[::-1]
    for i in range(len(trial)):
        if r > trial[i]:
            return i
    return len(trial)
def alternative(r):
    trial = np.arange(0, 6)[::-1]
    for i in range(len(trial)):
        if r > trial[i]:
            break
    else:  # belongs to the for loop: runs only if no break occurred
        i = len(trial)
    return i
print(minimum_working_environment(3))
print(minimum_working_environment(-3))
print(alternative(3))
print(alternative(-3))
Result:
3
6
3
6
This works because the loop controlling variable i will still have the last value it had in the loop after the break and the else will only be executed if the break never executes.
However, if you just want to terminate a function, you should use return. The example I provided is mainly useful if you do indeed need to know if a loop completed fully (i.e. without breaking) or if it terminated early. It works for your example, which I assume was exactly that, just an example.
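Applied to the method from the question, the same for/else pattern might look like the sketch below. The class name Environment is hypothetical, and a plain list stands in for the NumPy array so that the example has no third-party dependency:

```python
class Environment:
    """Hypothetical class standing in for the one in the question."""

    def __init__(self):
        # A plain list is used instead of np.arange(0, 6)[::-1];
        # the values are the same: [5, 4, 3, 2, 1, 0].
        self.trial = list(range(5, -1, -1))
        self.some_aspect = None

    def minimum_working_environment(self, r):
        for i, threshold in enumerate(self.trial):
            if r > threshold:
                self.some_aspect = i
                break
        else:
            # Reached only when the loop ended without hitting break.
            self.some_aspect = len(self.trial)


env = Environment()
env.minimum_working_environment(3)
print(env.some_aspect)   # 3
env.minimum_working_environment(-3)
print(env.some_aspect)   # 6
```

Because the fallback assignment lives in the else branch, it no longer overwrites the value set just before break.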

Python multiprocessing pool.map within an apply loop causing resets and strange behavior

I am experiencing some really weird behavior with pool.starmap in the context of a groupby apply function. Without getting into the specifics, my code is something like this:
def groupby_apply_func(input_df):
    print('Processing group # x (I determine x from the df)')
    output_df = pool.starmap(another_func, zip(some fields in input_df))
    return output_df
result = some_df.groupby(groupby_fields).apply(groupby_apply_func)
In words, this takes a dataframe, forms a groupby on it, sends these groups to groupby_apply_func, which does some processing asynchronously using starmap and returns the results, which are concatenated into a final df. pool is a worker pool made from the multiprocessing library.
This code works for smaller datasets without problem. So there are no syntax errors or anything. The computer will loop through all of the groups formed by groupby, send them to groupby_apply_func (I can see the progress from the print statement), and come back fine.
The weird behavior is: on large datasets, it starts looping through the groups. Then, halfway through, or 3/4 way through (which in real time might be 12 hours), it starts completely over at the beginning of the groupbys! It resets the loop and begins again. Then, sometimes, the second loop resets also and so on... and it gets stuck in an infinite loop. Again, this is only with large datasets, it works as intended on small ones.
Could there be something in the apply functionality that, upon running out of memory, for example, decides to start re-processing all the groups? Seems unlikely to me, but I did read that the apply function will actually process the first group multiple times in order to optimize code paths, so I know that there is "meta" functionality in there - and some logic to handle the processing - and it's not just a straight loop.
Hope all that made sense. Does anyone know the inner workings of groupby.apply and if so if anything in there could possibly be causing this?
thx
EDIT: IT APPEARS TO RESET THE LOOP at this point in ops.py ... it gets to this except clause and then proceeds to line 195 which is for key, (i, group) in zip(group_keys, splitter): which starts the entire loop over again. Does this mean anything to anybody?
except libreduction.InvalidApply as err:
    # This Exception is raised if `f` triggers an exception
    # but it is preferable to raise the exception in Python.
    if "Let this error raise above us" not in str(err):
        # TODO: can we infer anything about whether this is
        # worth-retrying in pure-python?
        raise
I would use a list of the group dataframes as the argument to map (I don't think you need starmap here), rather than hiding the multiprocessing in the function to be applied.
import multiprocessing as mp
def func(df):
    # do something
    return df.apply(func2)
with mp.Pool(mp.cpu_count()) as p:
    groupby = some_df.groupby(groupby_fields)
    groups = [groupby.get_group(group) for group in groupby.groups]
    result = p.map(func, groups)
OK so I figured it out. Doesn't have anything to do with starmap. It is due to the groupby apply function. This function tries to call fast_apply over the groupby prior to running "normal" apply. If anything causes an error in that fast_apply loop (in my case it was an out of memory error) it then tries to re-run using "normal" apply. However, it does not print the exception / error and just catches all errors.
Not sure if any Python people will read this but I'd humbly suggest that:
if an error really occurs in the fast_apply loop, maybe print it out rather than catching everything; this could make debugging things like this much easier
the logic to re-run the entire loop if fast_apply fails seems a little weird to me. It is probably not a big deal for small apply operations, but in my case I had a huge one and I really don't want it re-running the entire thing. Perhaps give the user an option to NOT use fast_apply, to avoid the whole optimization? I don't know the inner workings of it and I'm sure it's in there for a good reason, but it does add complexity, and in my case it created a very confusing situation which took hours to figure out.

How to run a for loop with variable range?

I want to have a Python program like--
for i in range(r):
    if (i==2):
        #change r in some way
which will run the loop for the new range r, after it gets modified in the if statement.
Even if I change r inside the if statement, the for loop runs for the initial r I gave. This must be happening because range(r) gets fixed in the for statement in the first line itself, and is not affected by changes to r later on.
Is there a "simple way" to bypass this?
By "simple" I mean that I don't want to add a counter which counts how many times the loop has already run and how many times it needs to run again after changing (specifically increasing) r, or to replace the for loop with a while loop.
When you say:
for i in range(r):
range(r) creates a range object. It does this only once, when the loop is set up. Therefore, any changes you make to r inside the loop have no effect on how many times the loop runs, since it's the range object that dictates the number of iterations (it just happens to be initialized with r).
Rule of thumb: If you know how many iterations you need in advance, use a for loop. If you don't know how many iterations you need, use a while loop.
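A small sketch that shows this directly (the variable name visited is only for illustration):

```python
r = 3
visited = []
for i in range(r):      # range(3) is built here, once
    r = 10              # rebinding r does not touch the existing range object
    visited.append(i)

print(visited)  # [0, 1, 2]
print(r)        # 10
```

The loop still runs three times even though r is 10 by the end.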
I don't believe this is possible. However, using a while loop and a manual counter is itself an extremely simple way to do this.
The code will look something like this:
i = 0
while i < r:
    if i == 2:
        pass  # Change r in some way
    i += 1
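As a runnable sketch of the same idea, here is a version that raises the bound part-way through. The function run_until and its new_r parameter are hypothetical, standing in for whatever "change r in some way" means in your program:

```python
def run_until(r, new_r):
    """Loop up to r, but switch the bound to new_r once i reaches 2."""
    visited = []
    i = 0
    while i < r:          # the condition is re-evaluated on every pass
        if i == 2:
            r = new_r     # stand-in for "change r in some way"
        visited.append(i)
        i += 1
    return visited


print(run_until(5, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Unlike range(r), the while condition reads the current value of r each time, so the change takes effect immediately.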

Timeout issues with Python itertools.combinations()

We've got a script that uses itertools.combinations() and it seems to hang with a large input size.
I'm a relatively inexperienced Python programmer so I'm not sure how to fix this problem. Is there a more suitable library? Or is there a way to enable verbose logging so that I can debug why the method call is hanging?
Any help is much appreciated.
[Edit]
def findsubsets(S,m):
    return set( itertools.combinations(S, m) )
for s in AllSearchTerms:
    S.append(itemsize)
    itemsize = itemsize + 1
for i in range (1,6):
    Subset = findsubsets(S,i)
    for sub in Subset:
        for s in sub:
            sublist.append(AllSearchTerms[s])
        PComb.append(sublist)
        sublist = []
You have two things in your code that will hang for large input sizes.
First, your function findsubsets calls itertools.combinations and then converts the result to a set. The result of itertools.combinations is a lazy iterator: it yields each combination one at a time, without calculating or storing them all at once. When you convert that to a set, you force Python to calculate and store them all at once. Therefore the line return set( itertools.combinations(S, m) ) is almost certainly where your program hangs. You can check that by placing print statements (or some other kind of logging statements) immediately before and after that line; if you see the preceding print and the program hangs before you see the succeeding one, you have found the problem. The solution is not to convert the combinations to a set. Leave the result as an iterator, and your program can grab one combination at a time, as needed.
Second, even if you do what I just suggested, your loop for sub in Subset: is a fairly tight loop that uses every combination. If the input size is large, that loop will take a very long time and implementing my previous paragraph will not help. You probably should reorganize your program to avoid the large input sizes, or at least show some kind of progress during that loop. The combinations function has a predictable output size so you can even show the percent done in a progress bar.
There is no logging inside itertools.combinations since it is not needed when used properly, and there is no logging in the conversion of the generator to a set. You can implement logging in your own tight loop, if needed.
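A rough sketch of both suggestions, assuming Python 3.8+ for math.comb. The helper names and the sample terms are hypothetical, and it combines the terms directly rather than going through indices as the original code does:

```python
import itertools
import math


def iter_term_combinations(terms, max_size):
    """Yield every combination of 1..max_size terms, one at a time."""
    for size in range(1, max_size + 1):
        # combinations() is lazy: nothing is materialised up front.
        yield from itertools.combinations(terms, size)


def count_term_combinations(n_terms, max_size):
    """Number of combinations the generator above will yield."""
    return sum(math.comb(n_terms, size) for size in range(1, max_size + 1))


# Hypothetical search terms standing in for AllSearchTerms.
terms = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta']
total = count_term_combinations(len(terms), 5)

for done, combo in enumerate(iter_term_combinations(terms, 5), start=1):
    # Report progress every 20 combinations and at the very end.
    if done % 20 == 0 or done == total:
        print(f'{done}/{total} combinations processed')
```

Computing the total first is cheap, so it is also a quick way to see whether the input size is feasible before starting the loop at all.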

Check statement for a loop only once

Let’s say I have following simple code:
useText = True
for i in range(20):
    if useText:
        print("The square is "+ str(i**2))
    else:
        print(i**2)
I use the variable useText to control which way to print the squares. It doesn’t change while running the loop, so it seems inefficient to me to check it every time the loop runs. Is there any way to check useText only once, before the loop, and then always print out according to that result?
This question occurs to me quite often. In this simple case of course it doesn’t matter but I could imagine this leading to slower performance in more complex cases.
The only difference that useText accomplishes here is the formatting string. So move that out of the loop.
fs = '{}'
if useText:
    fs = "The square is {}"
for i in range(20):
    print(fs.format(i**2))
(This assumes that useText doesn't change during the loop! In a multithreaded program that might not be true.)
The general structure of your program is to loop through a sequence and print the result in some manner.
In code, this becomes
for i in range(20):
    print_square(i)
Before the loop runs, set print_square appropriately depending on the useText variable.
if useText:
    print_square = lambda x: print("The square is " + str(x**2))
else:
    print_square = lambda x: print(x**2)
for i in range(20):
    print_square(i)
This has the advantage of not repeating the loop structure or the check for useText and could easily be extended to support other methods of printing the results inside the loop.
If you are not going to change the value of useText inside the loop, you can move it outside of for:
if useText:
    for i in range(20):
        print("The square is "+ str(i**2))
else:
    for i in range(20):
        print(i**2)
We can move if outside of for since you mentioned useText is not changing.
If you write something like this, you're checking the condition, running code, moving to the next iteration, and repeating, checking the condition each time, because you're running the entire body of the for loop, including the if statement, on each iteration:
for i in a_list:
    if condition:
        code()
If you write something like this, with the for loop inside the if statement, you're checking the condition once and running the entire for loop only if the condition is true:
if condition:
    for i in a_list:
        code()
I think you want the second one, because that one only checks the condition once, at the start. It does that because the if statement isn't inside the loop. Remember that everything inside the loop is run on each iteration.
