Surprising challenge generating comprehensive list - python

I am facing a surprising challenge with Python.
I am a physicist generating a series of simulations of layers at an optical interface. The details of the simulations are not especially important, but what is crucial is that all possible cases are generated: different materials within a range of thicknesses and layer orders.
I have been writing code to generate a comprehensive and unique list, but I am staggered at how long it takes to compute even relatively simple systems! Surely Python and a reasonable computer should handle this without excessive stress. Suggestions would be greatly appreciated.
Thank you
from itertools import permutations, combinations_with_replacement

def join_adjacent_repeated_materials(potential_structure):
    """
    Merge adjacent layers of the same material into a single layer.
    """
    #print potential_structure
    new_layers = []  # list to hold the re-cast structure
    for layer in potential_structure:
        if len(new_layers) > 0:  # not the first item in the list of layers
            last_layer = new_layers[-1]  # last element of the existing layer list
            if layer[0] == last_layer[0]:  # true if the two layers are the same material
                combined_layer = (layer[0], layer[1] + last_layer[1])
                new_layers[len(new_layers) - 1] = combined_layer
            else:  # adjacent layers are different materials, so no combination is possible
                new_layers.append(layer)
        else:  # the first layer
            new_layers.append(layer)
    return tuple(new_layers)
def calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,
                                       maximum_individual_layer_thicknesses,
                                       maximum_total_material_thicknesses):
    """
    Create a set of all possible multilayer combinations.

    thicknesses : if this contains 0 the total number of layers will vary
                  from 0 to maximum_number_of_layers; otherwise the total
                  number of layers will always be maximum_number_of_layers
                  e.g. arange(0, 100, 5)
    materials : list of materials used
                  e.g. ['Metal', 'Dielectric']
    maximum_number_of_layers : pretty self-explanatory
                  e.g. 5
    maximum_individual_layer_thicknesses : filters the created multilayer structures,
                  preventing the inclusion of layers that are too thick
                  - this is important after the joining of adjacent materials
                  e.g. (('Metal', 30), ('Dielectric', 20))
    maximum_total_material_thicknesses : similar to the above, but filters structures where the
                  total amount of a particular material is exceeded
                  e.g. (('Metal', 50), ('Dielectric', 100))
    """
    # generate all possible thickness combinations and material combinations
    all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
    all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))

    first_set = set()  # create set object (collection of unique elements, no repeats)
    for layer_material_order in all_possible_layer_material_orders:
        for layer_thickness_set in all_possible_thickness_sets:
            potential_structure = []  # list to hold this structure
            for layer, thickness in zip(layer_material_order[0], layer_thickness_set[0]):  # combine each layer thickness with its material
                if thickness != 0:  # layers of zero thickness are not added to potential_structure
                    potential_structure.append((layer, thickness))
            first_set.add(tuple(potential_structure))  # add this potential_structure to the first_set set

    #print('first_set')
    #for struct in first_set:
    #    print struct

    ## join adjacent repeated materials
    second_set = set()  # create new set
    for potential_structure in first_set:
        second_set.add(join_adjacent_repeated_materials(potential_structure))

    ## remove structures where a layer is too thick
    third_set = set()
    for potential_structure in second_set:  # check all the structures in the set
        conditions_satisfied = True  # default
        for max_condition in maximum_individual_layer_thicknesses:  # check this structure against each condition
            for layer in potential_structure:  # examine each layer
                if layer[0] == max_condition[0]:  # match condition with material
                    if layer[1] > max_condition[1]:  # test thickness condition
                        conditions_satisfied = False
        if conditions_satisfied:
            third_set.add(potential_structure)

    ## remove structures that contain too much of a certain material
    fourth_set = set()
    for potential_structure in second_set:  # check all the structures in the set
        conditions_satisfied = True  # default
        for max_condition in maximum_total_material_thicknesses:  # check this structure against each condition
            amount_of_material_in_this_structure = 0  # initialise a counter
            for layer in potential_structure:  # examine each layer
                if layer[0] == max_condition[0]:  # match condition with material
                    amount_of_material_in_this_structure += layer[1]
                    if amount_of_material_in_this_structure > max_condition[1]:  # test thickness condition
                        conditions_satisfied = False
        if conditions_satisfied:
            fourth_set.add(potential_structure)

    return fourth_set


thicknesses = [0, 1, 2]
materials = ('A', 'B')  # tuple cannot be accidentally appended to later
maximum_number_of_layers = 3
maximum_individual_layer_thicknesses = (('A', 30), ('B', 20))
maximum_total_material_thicknesses = (('A', 20), ('B', 15))

calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,
                                   maximum_individual_layer_thicknesses=maximum_individual_layer_thicknesses,
                                   maximum_total_material_thicknesses=maximum_total_material_thicknesses)

all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))
Holy crap! These sets are going to be huge! Let's give an example. If thicknesses has 6 things in it and maximum_number_of_layers is 3, then combinations_with_replacement yields 56 tuples, and the first set is every ordering of those 56 tuples: 56! of them, roughly 7 × 10^74. Why are you doing this? If these are really the sets you want to use, you're going to need to find an algorithm that doesn't need to build these sets, because it's never going to happen. I suspect these aren't the sets you want; perhaps you wanted itertools.product?
all_possible_thickness_sets = set(product(thicknesses, repeat=maximum_number_of_layers))
Here's an example of what itertools.product does:
>>> for x in product([1, 2, 3], repeat=2):
... print x
...
(1, 1)
(1, 2)
(1, 3)
(2, 1)
(2, 2)
(2, 3)
(3, 1)
(3, 2)
(3, 3)
Does that look like what you need?
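For scale, here is a quick arithmetic check of the size of the original construction, assuming 6 thicknesses and maximum_number_of_layers = 3 as in the example above:
from math import comb, factorial

# combinations_with_replacement over 6 thicknesses taken 3 at a time yields
# C(6+3-1, 3) = 56 tuples; permutations() then produces every ordering of them.
n_multisets = comb(6 + 3 - 1, 3)           # 56
print(n_multisets, factorial(n_multisets))  # 56 and roughly 7.1e74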

So one thing you do really often is add something to a set. If you look at the runtime behaviour of sets in the Python documentation, it says, near the bottom, about worst cases: "Individual actions may take surprisingly long, depending on the history of the container". I think memory reallocation may bite you if you add a lot of elements, because Python has no way of knowing how much memory to reserve when you start.
The more I look at it, the more I think you're reserving more memory than you have to. third_set, for example, doesn't even get used. second_set could be replaced by first_set if you'd just call join_adjacent_repeated_materials directly. And if I read it correctly, even first_set could go away and you could just create the candidates as you construct fourth_set.
Of course the code may become less readable if you put everything into a single bunch of nested loops. However, there are ways to structure your code that don't create unnecessary objects just for readability - you could, for example, create a generator function that builds candidates and returns each result via yield, as sketched below.
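To illustrate, here is a rough sketch of that idea. It uses itertools.product as suggested above rather than the original permutations-of-combinations construction, and generate_candidate_structures is just a made-up name:
from itertools import product

def generate_candidate_structures(thicknesses, materials, maximum_number_of_layers):
    """Yield candidate structures one at a time instead of materialising huge intermediate sets."""
    for material_order in product(materials, repeat=maximum_number_of_layers):
        for thickness_set in product(thicknesses, repeat=maximum_number_of_layers):
            # drop zero-thickness layers, as in the original code
            yield tuple((m, t) for m, t in zip(material_order, thickness_set) if t != 0)

# The joining/filtering stages can then consume the generator directly, e.g.:
# unique_structures = {join_adjacent_repeated_materials(s)
#                      for s in generate_candidate_structures([0, 1, 2], ('A', 'B'), 3)}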

FWIW I instrumented your code to enable profiling. Here are the results:
Output:
Sun May 25 16:06:31 2014 surprising-challenge-generating-comprehensive-python-list.stats
348,365,046 function calls in 1,538.413 seconds
Ordered by: cumulative time, internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1052.933 1052.933 1538.413 1538.413 surprising-challenge-generating-comprehensive-python-list.py:34(calculate_unique_structure_lengths)
87091200 261.764 0.000 261.764 0.000 {zip}
87091274 130.492 0.000 130.492 0.000 {method 'add' of 'set' objects}
174182440 93.223 0.000 93.223 0.000 {method 'append' of 'list' objects}
30 0.000 0.000 0.000 0.000 surprising-challenge-generating-comprehensive-python-list.py:14(join_adjacent_repeated_materials)
100 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
To get an even finer-grained picture of where the code was spending its time, I used the line_profiler module on an almost verbatim copy of your code and got the following results for each of your functions:
> python "C:\Python27\Scripts\kernprof.py" -l -v surprising-challenge-generating-comprehensive-python-list.py
Wrote profile results to example.py.lprof
Timer unit: 3.2079e-07 s
File: surprising-challenge-generating-comprehensive-python-list.py
Function: join_adjacent_repeated_materials at line 3
Total time: 0.000805183 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 #profile
4 def join_adjacent_repeated_materials(potential_structure):
5 """
6 Self-explanitory...
7 """
8 #print potential_structure
9
10 30 175 5.8 7.0 new_layers = [] # List to hold re-cast structure
11 100 544 5.4 21.7 for layer in potential_structure:
12 70 416 5.9 16.6 if len(new_layers) > 0: # if not the first item in the list of layers
13 41 221 5.4 8.8 last_layer=new_layers[-1] # last element of existing layer list
14 41 248 6.0 9.9 if layer[0] == last_layer[0]: # true is the two layers are the same material
15 30 195 6.5 7.8 combined_layer = (layer[0], layer[1] + last_layer[1])
16 30 203 6.8 8.1 new_layers[len(new_layers)-1] = combined_layer
17 else: # adjcent layers are different material so no comibantion is possible
18 11 68 6.2 2.7 new_layers.append(layer)
19 else: # for the first layer
20 29 219 7.6 8.7 new_layers.append(layer)
21
22 30 221 7.4 8.8 return tuple(new_layers)
File: surprising-challenge-generating-comprehensive-python-list.py
Function: calculate_unique_structure_lengths at line 24
Total time: 3767.94 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
24 #profile
25 def calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,\
26 maximum_individual_layer_thicknesses, \
27 maximum_total_material_thicknesses):
28 """
29 Create a set on all possible multilayer combinations.
30
31 thicknesses : if this contains '0' the total number of layers will vary
32 from 0 to maximum_number_of_layers, otherwise, the
33 number total number layers will always be maximum_number_of_layers
34 e.g. arange(0 , 100, 5)
35
36 materials : list of materials used
37 e.g. ['Metal', 'Dielectric']
38
39 maximum_number_of_layers : pretty self-explanitory...
40 e.g. 5
41
42 maximum_individual_layer_thicknesses : filters the created the multilayer structures
43 preventing the inclusion layers that are too thick
44 - this is important after the joining of
45 adjacent materials
46 e.g. (('Metal',30),('Dielectric',20))
47
48 maximum_total_material_thicknesses : similar to the above but filters structures where the total
49 amount of a particular material is exceeded
50 e.g. (('Metal',50),('Dielectric',100))
51
52
53 """
54 # generate all possible thickness combinations and material combinations
55 1 20305240 20305240.0 0.2 all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
56 1 245 245.0 0.0 all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))
57
58
59 1 13 13.0 0.0 first_set = set() # Create set object (list of unique elements, no repeats)
60 25 235 9.4 0.0 for layer_material_order in all_possible_layer_material_orders:
61 87091224 896927052 10.3 7.6 for layer_thickness_set in all_possible_thickness_sets:
62 87091200 920048586 10.6 7.8 potential_structure = [] # list to hold this structure
63 348364800 4160332176 11.9 35.4 for layer, thickness in zip(layer_material_order[0], layer_thickness_set[0]): # combine the layer thickness with its material
64 261273600 2334038439 8.9 19.9 if thickness != 0: # layers of zero thickness are not added to potential_structure
65 174182400 2003639625 11.5 17.1 potential_structure.append((layer, thickness))
66 87091200 1410517427 16.2 12.0 first_set.add(tuple(potential_structure)) # add this potential_structure to the first_set set
67
68 #print('first_set')
69 #for struct in first_set:
70 # print struct
71
72 ## join adjacent repeated materials
73 1 14 14.0 0.0 second_set = set() # create new set
74 31 274 8.8 0.0 for potential_structure in first_set:
75 30 5737 191.2 0.0 second_set.add(join_adjacent_repeated_materials(potential_structure))
76
77 ## remove structures where a layer is too thick
78 1 10 10.0 0.0 third_set = set()
79 23 171 7.4 0.0 for potential_structure in second_set: # check all the structures in the set
80 22 164 7.5 0.0 conditions_satisfied=True # default
81 66 472 7.2 0.0 for max_condition in maximum_individual_layer_thicknesses: # check this structure using each condition
82 104 743 7.1 0.0 for layer in potential_structure: # examine each layer
83 60 472 7.9 0.0 if layer[0] == max_condition[0]: # match condition with material
84 30 239 8.0 0.0 if layer[1] > max_condition[1]: # test thickness condition
85 conditions_satisfied=False
86 22 149 6.8 0.0 if conditions_satisfied:
87 22 203 9.2 0.0 third_set.add(potential_structure)
88
89 ##remove structures that contain too much of a certain material
90 1 10 10.0 0.0 fourth_set = set()
91 23 178 7.7 0.0 for potential_structure in second_set: # check all the structures in the set
92 22 158 7.2 0.0 conditions_satisfied=True # default
93 66 489 7.4 0.0 for max_condition in maximum_total_material_thicknesses: # check this structure using each condition
94 44 300 6.8 0.0 amount_of_material_in_this_structure = 0 # initialise a counter
95 104 850 8.2 0.0 for layer in potential_structure: # examine each layer
96 60 2961 49.4 0.0 if layer[0] == max_condition[0]: # match condition with material
97 30 271 9.0 0.0 amount_of_material_in_this_structure += layer[1]
98 30 255 8.5 0.0 if amount_of_material_in_this_structure > max_condition[1]: # test thickness condition
99 conditions_satisfied=False
100 22 160 7.3 0.0 if conditions_satisfied:
101 22 259 11.8 0.0 fourth_set.add(potential_structure)
102
103 1 16 16.0 0.0 return fourth_set
As you can see, constructing the first_set in calculate_unique_structure_lengths() is by far the most time-consuming step.

Related

Why is checking for a variable's existence taking more time than copying an array, which should be an O(1) vs O(n) operation?

These numbers don't make sense to me.
Why does checking for list existence, or checking the len() of a list, take longer than a copy()?
These are O(1) vs O(n) operations.
Total time: 3.01392 s
File: all_combinations.py
Function: recurse at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 #profile
16 def recurse(in_arr, result=[]):
17 nonlocal count
18
19 1048576 311204.0 0.3 10.3 if not in_arr:
20 524288 141102.0 0.3 4.7 return
21
22 524288 193554.0 0.4 6.4 in_arr = in_arr.copy() # Note: this adds a O(n) operation
23
24 1572863 619102.0 0.4 20.5 for i in range(len(in_arr)):
25 1048575 541166.0 0.5 18.0 next = result + [in_arr.pop(0)]
26 1048575 854453.0 0.8 28.4 recurse(in_arr, next)
27 1048575 353342.0 0.3 11.7 count += 1
Total time: 2.84882 s
File: all_combinations.py
Function: recurse at line 38
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 #profile
39 def recurse(result=[], index=0):
40 nonlocal count
41 nonlocal in_arr
42
43 # base
44 1048576 374126.0 0.4 13.1 if index > len(in_arr):
45 return
46
47 # recur
48 2097151 846711.0 0.4 29.7 for i in range(index, len(in_arr)):
49 1048575 454619.0 0.4 16.0 next_result = result + [in_arr[i]]
50 1048575 838434.0 0.8 29.4 recurse(next_result, i + 1)
51 1048575 334930.0 0.3 11.8 count = count + 1
It's not that making the copy takes longer by itself than the O(1) operations you mentioned.
But remember that your base case is running far more often than the recursive case.
I'm not sure what "it" refers to in your question. The generic ("royal") "it" does not take longer; your implementation is what takes longer. In most language implementations, len is O(1), because the length is an instance attribute maintained along with any changes to the object. The existence-checking implementation is slower because it recurses instead of simply iterating through the list, although it's still O(N).
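To put numbers on that, here is a quick counting sketch (not the original benchmark) that mirrors the first recurse above for an input of length 20:
calls = base_hits = 0

def recurse(in_arr, result=None):
    global calls, base_hits
    calls += 1
    if not in_arr:                 # base case: empty list
        base_hits += 1
        return
    in_arr = in_arr.copy()         # the O(n) copy only happens on non-base calls
    for _ in range(len(in_arr)):
        recurse(in_arr, (result or []) + [in_arr.pop(0)])

recurse(list(range(20)))
print(calls, base_hits)            # 1048576 524288, matching the hit counts in the profile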

DataFrame max() not return max

Real beginner question here, but it is so simple, I'm genuinely stumped. Python/DataFrame newbie.
I've loaded a DataFrame from a Google Sheet, however any graphing or attempts at calculations are generating bogus results. Loading code:
# Setup
!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
This seems to work fine and the data looks to be correctly loaded when I print out the DataFrame but running max() returns obviously false results. For example:
print(df[0])
print(df[0].max())
Will output:
0 3.385
1 0.48
2 1.35
3 465
4 36.33
5 27.66
6 14.83
7 1.04
8 4.19
9 0.425
10 0.101
11 0.92
12 1
13 0.005
14 0.06
15 3.5
16 2
17 1.7
18 2547
19 0.023
20 187.1
21 521
22 0.785
23 10
24 3.3
25 0.2
26 1.41
27 529
28 207
29 85
...
32 6654
33 3.5
34 6.8
35 35
36 4.05
37 0.12
38 0.023
39 0.01
40 1.4
41 250
42 2.5
43 55.5
44 100
45 52.16
46 10.55
47 0.55
48 60
49 3.6
50 4.288
51 0.28
52 0.075
53 0.122
54 0.048
55 192
56 3
57 160
58 0.9
59 1.62
60 0.104
61 4.235
Name: 0, Length: 62, dtype: object
Max: 85
Obviously, the maximum value is way out -- it should be 6654, not 85.
What on earth am I doing wrong?
First StackOverflow post, so thanks in advance.
If you look at the end of your print() output, you'll see dtype: object. You'll also notice the Series mixes what look like "int" values and "float" values (e.g. you have 6654 and 3.5 in the same Series).
These are good hints that you actually have a series of strings, and max here is comparing them as strings (lexicographically). What you want is a series of numbers (specifically floats) compared numerically.
Check the following reproducible example:
>>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
>>> df.col.max()
'9'
You can check that because
>>> '9' > '85'
True
You want these values to be considered floats instead. Use pd.to_numeric
>>> df['col'] = pd.to_numeric(df.col)
>>> df.col.max()
85
For more on str and int comparison, check this question
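Applied column-wise, the same fix looks like this (a small self-contained sketch with a stand-in frame of strings, since the sheet-loaded df isn't reproduced here):
import pandas as pd

# Stand-in for the string-typed frame returned by from_records(rows);
# errors='coerce' turns any non-numeric cell (e.g. a stray header row) into NaN.
df = pd.DataFrame({0: ['3.385', '0.48', '6654', '85']})
df = df.apply(pd.to_numeric, errors='coerce')
print(df[0].max())   # 6654.0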

Calculate mean values from pandas dataframe

I am trying to find a good way to calculate mean values from values in a dataframe. It contains measured data from an experiment and is imported from an excel sheet. The columns contain the time passed by, electric current and the corresponding voltage.
The current is changed in steps and then held for some time (the current values vary a little bit, so they are not exactly the same for each step). Now I want to calculate the mean voltage for each current step. Since it takes some time after the voltage gets stable after a step, I also want to leave out the first few voltage values after a step.
Currently I am doing this with loops, but I was wondering whether there is a nicer way using the groupby function (or others maybe).
Just say if you need more details or clarification.
Example of data:
s [A] [V]
0 6.0 -0.001420 0.780122
1 12.0 -0.002484 0.783297
2 18.0 -0.001478 0.785870
3 24.0 -0.001256 0.793559
4 30.0 -0.001167 0.806086
5 36.0 -0.000982 0.815364
6 42.0 -0.003038 0.825018
7 48.0 -0.001174 0.831739
8 54.0 0.000478 0.838861
9 60.0 -0.001330 0.846086
10 66.0 -0.001456 0.851556
11 72.0 0.000764 0.855950
12 78.0 -0.000916 0.859778
13 84.0 -0.000916 0.859778
14 90.0 -0.001445 0.863569
15 96.0 -0.000287 0.864303
16 102.0 0.000056 0.865080
17 108.0 -0.001119 0.865642
18 114.0 -0.000843 0.866434
19 120.0 -0.000997 0.866809
20 126.0 -0.001243 0.866964
21 132.0 -0.002238 0.867180
22 138.0 -0.001015 0.867177
23 144.0 -0.000604 0.867505
24 150.0 0.000507 0.867571
25 156.0 -0.001569 0.867525
26 162.0 -0.001569 0.867525
27 168.0 -0.001131 0.866756
28 174.0 -0.001567 0.866884
29 180.0 -0.002645 0.867240
.. ... ... ...
242 1708.0 24.703866 0.288902
243 1714.0 26.469208 0.219226
244 1720.0 26.468838 0.250437
245 1726.0 26.468681 0.254972
246 1732.0 26.468173 0.271525
247 1738.0 26.468260 0.247282
248 1744.0 26.467666 0.296894
249 1750.0 26.468085 0.247300
250 1756.0 26.468085 0.247300
251 1762.0 26.467808 0.261096
252 1768.0 26.467958 0.259615
253 1774.0 26.467828 0.260871
254 1780.0 28.232325 0.185291
255 1786.0 28.231697 0.197642
256 1792.0 28.231170 0.172802
257 1798.0 28.231103 0.170685
258 1804.0 28.229453 0.184009
259 1810.0 28.230816 0.181833
260 1816.0 28.230913 0.188348
261 1822.0 28.230609 0.178440
262 1828.0 28.231144 0.168507
263 1834.0 28.231144 0.168507
264 1840.0 8.813723 0.641954
265 1846.0 8.814301 0.652373
266 1852.0 8.818517 0.651234
267 1858.0 8.820255 0.637536
268 1864.0 8.821443 0.628136
269 1870.0 8.823643 0.636616
270 1876.0 8.823297 0.635422
271 1882.0 8.823575 0.622253
Output:
s [A] [V]
0 303.000000 -0.000982 0.857416
1 636.000000 0.879220 0.792504
2 699.000000 1.759356 0.752446
3 759.000000 3.519479 0.707161
4 816.000000 5.278372 0.669020
5 876.000000 7.064800 0.637848
6 939.000000 8.828799 0.611196
7 999.000000 10.593054 0.584402
8 1115.333333 12.357359 0.556127
9 1352.000000 14.117167 0.528826
10 1382.000000 15.882287 0.498577
11 1439.000000 17.646748 0.468379
12 1502.000000 19.410817 0.437342
13 1562.666667 21.175572 0.402381
14 1621.000000 22.939826 0.365724
15 1681.000000 24.704600 0.317134
16 1744.000000 26.468235 0.256047
17 1807.000000 28.231037 0.179606
18 1861.000000 8.819844 0.638190
The current approach:
df = df[['s', '[A]', '[V]']]

# Looping over the rows to separate current points
b = df['[A]'].iloc[0]
start = 0
list = []
for index, row in df.iterrows():
    if not math.isclose(row['[A]'], b, abs_tol=1e-02):
        b = row['[A]']
        list.append(df.iloc[start:index])
        start = index
list.append(df.iloc[start:])

# Deleting the first few points after each current change
list_b = []
for l in list:
    list_b.append(l.iloc[3:])

# Calculating mean values for each current point
list_c = []
for l in list_b:
    list_c.append(l.mean())

result = pd.DataFrame(list_c)
Does this help?
df.groupby(['Columnname', 'Columnname2']).mean()
You may need to create intermediate dataframes for each step. Can you provide an example of the output you want?
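One way to avoid the explicit loops is to label each current step and then group on that label. A minimal sketch, assuming the column names from the question, the same 0.01 A tolerance, and a made-up helper name mean_per_step:
import pandas as pd

def mean_per_step(df, tol=1e-2, skip=3):
    """Mean of each column per current step, skipping the first `skip` settling rows."""
    df = df.copy()
    # A new step starts wherever the current changes by more than `tol` from the
    # previous row; cumsum turns those jump flags into step labels 0, 1, 2, ...
    df['step'] = df['[A]'].diff().abs().gt(tol).cumsum()
    # Keep only rows whose position within their step is >= skip (settling time).
    settled = df[df.groupby('step').cumcount() >= skip]
    return settled.groupby('step').mean()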

How to write code in a vectorized way instead of using loops?

I would like to write the following code in a vectorized way as the current code is pretty slow (and would like to learn Python best practices). Basically, the code is saying that if today's value is within 10% of yesterday's value, then today's value (in a new column) is the same as yesterday's value. Otherwise, today's value is unchanged:
def test(df):
    df['OldCol'] = (100, 115, 101, 100, 99, 70, 72, 75, 78, 80, 110)
    df['NewCol'] = df['OldCol']
    for i in range(1, len(df) - 1):
        if df['OldCol'][i] / df['OldCol'][i - 1] > 0.9 and df['OldCol'][i] / df['OldCol'][i - 1] < 1.1:
            df['NewCol'][i] = df['NewCol'][i - 1]
        else:
            df['NewCol'][i] = df['OldCol'][i]
    return df['NewCol']
The output should be the following:
OldCol NewCol
0 100 100
1 115 115
2 101 101
3 100 101
4 99 101
5 70 70
6 72 70
7 75 70
8 78 70
9 80 70
10 110 110
Can you please help?
I would like to use something like this but I did not manage to solve my issue:
def test(df):
    df['NewCol'] = df['OldCol']
    cond = np.where((df['OldCol'].shift(1) / df['OldCol'] > 0.9) & (df['OldCol'].shift(1) / df['OldCol'] < 1.1))
    df['NewCol'][cond[0]] = df['NewCol'][cond[0] - 1]
    return df
A solution in three steps:
df['variation']=(df.OldCol/df.OldCol.shift())
df['gap']=~df.variation.between(0.9,1.1)
df['NewCol']=df.OldCol.where(df.gap).fillna(method='ffill')
For:
OldCol variation gap NewCol
0 100 nan True 100
1 115 1.15 True 115
2 101 0.88 True 101
3 100 0.99 False 101
4 99 0.99 False 101
5 70 0.71 True 70
6 72 1.03 False 70
7 75 1.04 False 70
8 78 1.04 False 70
9 80 1.03 False 70
10 110 1.38 True 110
It seems to be 30x faster than loops on this example.
In one line:
x=df.OldCol;df['NewCol']=x.where(~(x/x.shift()).between(0.9,1.1)).fillna(method='ffill')
You should boolean mask your original dataframe:
df[(0.9 <= df['NewCol']/df['OldCol']) & (df['NewCol']/df['OldCol'] <= 1.1)]
will give you all rows where NewCol is within 10% of OldCol.
So to set the NewCol field in these rows:
within_10 = df[(0.9 <= df['NewCol']/df['OldCol']) & (df['NewCol']/df['OldCol'] <= 1.1)]
within_10['NewCol'] = within_10['OldCol']
Since you seem to be well on the way to finding the "jump" days yourself, I'll only show the trickier bit. So let's assume you have a numpy array old of length N and a boolean numpy array jumps of the same size. As a matter of convention, the zeroth element of jumps is set to True. Then you can first calculate the numbers of repeats between jumps:
jump_indices = np.where(jumps)[0]
repeats = np.diff(np.r_[jump_indices, [N]])
once you have these you can use np.repeat:
new = np.repeat(old[jump_indices], repeats)
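As a quick check, applying that to the OldCol values from the question (a sketch only; the jump test here is just the complement of the "within 10%" condition):
import numpy as np

old = np.array([100, 115, 101, 100, 99, 70, 72, 75, 78, 80, 110], dtype=float)
ratio = old[1:] / old[:-1]
# A "jump" day is one that is NOT within 10% of the previous value;
# by convention the first element always counts as a jump.
jumps = np.r_[True, (ratio <= 0.9) | (ratio >= 1.1)]

jump_indices = np.where(jumps)[0]
repeats = np.diff(np.r_[jump_indices, [len(old)]])
new = np.repeat(old[jump_indices], repeats)
print(new)  # [100. 115. 101. 101. 101.  70.  70.  70.  70.  70. 110.]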

Speed up numpy.where for extracting integer segments?

I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from lineprofiler is below, and this shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))

    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim + 6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop + 6,
                                None) for dim in slice_)
        indices = seg_image_view == j + 1

        obj_avg = hot_image_view[indices].mean()

        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated

        border_avg = hot_image_padded[new_slice][border == 1].mean()

        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))

    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using an np.unique-based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result, as in the small sketch below (it might also be good to separate out that piece for profiling purposes).
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd try and look for a speedup.
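To make the seg_mask suggestion concrete, here is a minimal sketch with made-up toy arrays standing in for hot_image and seg_image:
import numpy as np

# Toy stand-ins, just to illustrate reusing the mask.
rng = np.random.default_rng(0)
seg_image = rng.integers(0, 6, size=(100, 100))
hot_image = rng.random((100, 100))
new_hot = hot_image.copy()

for i in range(1, 6):
    seg_mask = seg_image == i            # computed once per object...
    ind_y, ind_x = np.where(seg_mask)    # ...reused here for the coordinates
    obj_avg = hot_image[ind_y, ind_x].mean()
    new_hot[seg_mask] = obj_avg          # ...and reused here instead of a second seg_image == i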
