Pyomo: Abstract model creation for power distribution - python

I am a new Pyomo/Python user. In the process of learning pyomo, I need to solve some optimization problems.
I need to make a balance between the total generation and demand (800 MW).
I want to find the minimum of sum(a*P^2 + b*P + c) over all generators, subject to that balance; this is the objective of the mathematical model of the problem.
How can I construct an abstract model so that the values of a, b and c for each generator are taken from the same row of the table below? If I define the sets individually, the values of a, b and c come from different rows, which does not satisfy the formula. And how do I restrict each P to a value between its Pmin and Pmax? Do I use two constraints to limit it?
It really makes me confused.
from pyomo.environ import *
import matplotlib.pyplot as plt
import numpy as np
import random

model = AbstractModel()
model.J = Set()
model.A = Param(model.J)
model.B = Param(model.J)
model.C = Param(model.J)
model.P_min = Param(model.J)
model.P_max = Param(model.J)
model.P = Var(model.J, domain=NonNegativeReals)

# Objective: total generation cost
def obj_expression(model):
    return sum(model.A[j] * model.P[j]**2 + model.B[j] * model.P[j] + model.C[j] for j in model.J)
model.OBJ = Objective(rule=obj_expression, sense=minimize)

# Upper bounds rule
def upper_bounds_rule(model, j):
    return model.P[j] <= model.P_max[j]
model.upper = Constraint(model.J, rule=upper_bounds_rule)

# Lower bounds rule
def lower_bounds_rule(model, j):
    return model.P[j] >= model.P_min[j]
model.lower = Constraint(model.J, rule=lower_bounds_rule)

# Demand balance: total generation must cover the 800 MW demand
def rule_eq1(model):
    return sum(model.P[j] for j in model.J) >= 800
model.eq1 = Constraint(rule=rule_eq1)

opt = SolverFactory('ipopt')
instance = model.create_instance("E:/pycharm_project/PD/pd.dat")
results = opt.solve(instance)  # solves and updates instance
#data file
# Creating the set J:
set J := g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 ;
# Creating Parameters A, B, C, P_min, P_max:
param : A B C P_min P_max:=
g1 0.0148 12.1 82 80 200
g2 0.0289 12.6 49 120 320
g3 0.0135 13.2 100 50 150
g4 0.0127 13.9 105 250 520
g5 0.0261 13.5 72 80 280
g6 0.0212 15.4 29 50 150
g7 0.0382 14.0 32 30 120
g8 0.0393 13.5 40 30 110
g9 0.0396 15.0 25 20 80
g10 0.0510 14.3 15 20 60
;
Can you help me with this? Thanks!
Vivi

I think you should structure your .dat files in accordance with what is described in the Pyomo documentation here.
I believe something like this would work for you for parameters A, B, C:
# Creating the set J:
set J := g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 ;
# Creating Parameters A, B, C:
param : A B C :=
g1 0.0148 12.1 82
g2 0.0289 12.6 49
g3 0.0135 13.2 100
g4 0.0127 13.9 105
g5 0.0261 13.5 72
g6 0.0212 15.4 29
g7 0.0382 14.0 32
g8 0.0393 13.5 40
g9 0.0396 15.0 25
g10 0.0510 14.3 15
;
Now I am not sure about Pmin and Pmax, since in your model they seem to be 2-dimensional, while in your data, they seem to only be 1-dimensional. But generally, you can follow the instructions in the link above to create your .dat files.
As for your second question, I am not sure I understand you correctly, but you refer to the P_min <= P <= P_max constraint in the mathematical model description, correct?
Then, for the P_min <= P part you need to slightly change your current constraint:
def lower_bounds_rule(model, i, j):
    return model.P[i, j] >= model.P_min[i, j]
model.lower = Constraint(model.i, model.j, rule=lower_bounds_rule)
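If P, P_min and P_max are in fact one-dimensional (indexed only by the generator set J, as in the model posted in the question), a lighter alternative is to pass the limits through the bounds argument of Var instead of writing two separate constraint rules, and to add the 800 MW balance as a single, non-indexed constraint. The following is only a sketch based on the question's model and is not tested against the original data file:
from pyomo.environ import *

model = AbstractModel()
model.J = Set()
model.A = Param(model.J)
model.B = Param(model.J)
model.C = Param(model.J)
model.P_min = Param(model.J)
model.P_max = Param(model.J)

# Enforce P_min[j] <= P[j] <= P_max[j] directly on the variable
def p_bounds(model, j):
    return (model.P_min[j], model.P_max[j])
model.P = Var(model.J, bounds=p_bounds, domain=NonNegativeReals)

# Total generation must meet the 800 MW demand
def demand_rule(model):
    return sum(model.P[j] for j in model.J) == 800
model.demand = Constraint(rule=demand_rule)

# Minimize sum of a*P^2 + b*P + c over all generators
def obj_rule(model):
    return sum(model.A[j] * model.P[j]**2 + model.B[j] * model.P[j] + model.C[j]
               for j in model.J)
model.OBJ = Objective(rule=obj_rule, sense=minimize)
There is also no need to pick a "random" starting value between Pmin and Pmax yourself: the solver only needs a feasible starting point, and if you want to supply one you can do it through the initialize argument of Var (for example a rule returning random.uniform(model.P_min[j], model.P_max[j])).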

Related

How to solve multiple multivariate equation systems with constraints

I am trying to solve a blending problem with a system of 3 equations, and I have 3 target values to reach, or at least to get as close as possible to, for all three of them:
The equations are:
def sat(c, s, a, f):
    return (100*c)/(2.8*s + 1.18*a + 0.65*f)  # For this I need sat = 98.5

def ms(s, a, f):
    return s/(a + f)  # For this I need ms = 2.5

def ma(a, f):
    return a/f  # For this I need ma = 1.3

# The total mix ratio:
# r1 + r2 + r3 + r4 + r5 + r6 == 1
material_1:
c = 51.29
s = 4.16
a = 0.97
f = 0.38
material_2:
c = 51.42
s = 4.16
a = 0.95
f = 0.37
material_3:
c = 6.88
s = 63.36
a = 13.58
f = 3.06
material_4:
c = 32.05
s = 1.94
a = 0.0
f = 0.0
material_5:
c = 4.56
s = 21.43
a = 3.82
f = 52.28
material_6:
c = 0.19
s = 7.45
a = 4.58
f = 0.42
#The approximate values I am trying to find are around:
0.300 <= r1 <= 0.370
0.300 <= r2 <= 0.370
0.070 <= r3 <= 0.130
0.005 <= r4 <= 0.015
0.010 <= r5 <= 0.030
0.110 <= r6 <= 0.130
So how can I calculate the value of every ratio "r" in order to get the closest possible values to the objectives for the 3 equations?
I looked at some optimizers, but as I am new to them I still cannot understand how to set up the problem, the equations and the constraints in them.
I think I made it work; of course the code is awful, but I will try to make it look better later.
I added the cost of the components so I can give the solver a function to minimize; this works because I know the approximate material ratios, so the cost guides the solver towards them.
I will post the code for it:
import cvxpy as cp

c1 = 51.42
c2 = 51.42
c3 = 6.88
c5 = 32.05
c6 = 4.56
c7 = 0.19
s1 = 4.16
s2 = 4.16
s3 = 63.36
s5 = 1.94
s6 = 21.43
s7 = 7.45
a1 = 0.97
a2 = 0.95
a3 = 13.58
a5 = 0.0
a6 = 3.82
a7 = 4.58
f1 = 0.38
f2 = 0.37
f3 = 3.06
f5 = 0.0
f6 = 52.28
f7 = 0.42
r7 = 0.125
r1 = cp.Variable()
r2 = cp.Variable()
r3 = cp.Variable()
r5 = cp.Variable()
r6 = cp.Variable()
# Costs (caliza = limestone, arcilla = clay, hierro = iron, yeso = gypsum)
caliza = 10
arcilla = 20
hierro = 170
yeso = 80
objective = cp.Minimize(r1*caliza+r2*caliza+r3*arcilla+r5*yeso+r6*hierro)
constraints = [
r1-r2 == 0,
r1>= 0.20,
r1<= 0.40,
r3<=0.14,
r3>=0.06,
r5>=0.001,
r5<=0.008,
r6>=0.01,
r6<=0.03,
2.5*((r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7)+(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7))-(r1*s1+r2*s2+r3*s3+r5*s5+r6*s6+r7*s7)==0,
(98.5*(2.8*(r1*s1+r2*s2+r3*s3+r5*s5+r6*s6+r7*s7)+1.18*(r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7)+0.65*(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7))-100*(r1*c1+r2*c2+r3*c3+r5*c5+r6*c6+r7*c7)) == 0,
#1.3*(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7)-(r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7) == 0,
r1+r2+r3+r5+r6+r7 == 1]
problem = cp.Problem(objective,constraints)
problem.solve()
print(r1.value,r2.value,r3.value,r5.value,r6.value)
print(problem.status)
This gives me the result:
0.3644382497863931 0.3644382497863931 0.12287226775076901 0.0009999999955268117 0.022251232680917873
optimal
Anyway, the only way to get a feasible result is to consider only 2 of the three constraint functions, because the components can't satisfy all 3 of them; this indicates that I need to check the material components before I try to meet the 3 constraints (which were sat, ma and ms).
Now I will try to improve the code using pandas so I can get the material components with some kind of for loop, and also use it for the ratios.
Thank you so much for your help👍.
So this is a simple/trivial example to show the intent mentioned in a comment: minimize the square of the errors. Instead of using a constraint to pin a value to an exact outcome, we let the solver find the best outcome that minimizes the square of the error, where error = value - target. I think what I've written below is fairly clear. CVXPY likes to work in the linear algebra realm, and I'm sure this could be converted into vector/matrix format, but the concept is to remove constraints and let the solver figure out the best combination. Obviously, if there are hard constraints, those need to be added, but note that I've just made an example with 2 of your 3 targets (with some trivial math) and moved them into the objective.
Your problem with 3 constraints that aren't simultaneously satisfiable is probably a candidate for a conversion like this...
import cvxpy as cp
r1 = cp.Variable()
r2 = cp.Variable()
ma = 2.5
ms = 3.4
delta_1 = (r1 + r2 - ma)**2 # diff from r1 + r2 and ma
delta_2 = (3*r1 + 2*r2 - ms)**2 # diff from 3r1 + 2r2 and ms
prob = cp.Problem(cp.Minimize(delta_1 + delta_2))
prob.solve()
print(prob.value)
print(r1.value, r2.value)
Output
9.860761315262648e-31
-1.6000000000000014 4.100000000000002
OK, this is what I have done and it works fine:
#I call the values from a pandas DF:
c1 = df.at[0, 'MAX']
c2 = df.at[4, 'MAX']
c3 = df.at[8, 'MAX']
c5 = df.at[12, 'MAX']
c6 = df.at[16, 'MAX']
c7 = df.at[20, 'MAX']
s1 = df.at[1, 'MAX']
s2 = df.at[5, 'MAX']
s3 = df.at[9, 'MAX']
s5 = df.at[13, 'MAX']
s6 = df.at[17, 'MAX']
s7 = df.at[21, 'MAX']
a1 = df.at[2, 'MAX']
a2 = df.at[6, 'MAX']
a3 = df.at[10, 'MAX']
a5 = df.at[14, 'MAX']
a6 = df.at[18, 'MAX']
a7 = df.at[22, 'MAX']
f1 = df.at[3, 'MAX']
f2 = df.at[7, 'MAX']
f3 = df.at[11, 'MAX']
f5 = df.at[15, 'MAX']
f6 = df.at[19, 'MAX']
f7 = df.at[23, 'MAX']
r1 = cp.Variable()
r2 = cp.Variable()
r3 = cp.Variable()
r5 = cp.Variable()
r6 = cp.Variable()
r7 = 0.125  # fixed ratio, as in the first version of the code
#Objectives
ma = 1.3
ms = 2.50
lsf = 98.5
delta1 =(ms*((r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7)+(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7))-(r1*s1+r2*s2+r3*s3+r5*s5+r6*s6+r7*s7))**2
delta2 =(ma*(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7)-(r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7))**2
delta3 =((lsf*(2.8*(r1*s1+r2*s2+r3*s3+r5*s5+r6*s6+r7*s7)+1.18*(r1*a1+r2*a2+r3*a3+r5*a5+r6*a6+r7*a7)+0.65*(f1*r1+f2*r2+f3*r3+f5*r5+f6*r6+f7*r7))-100*(r1*c1+r2*c2+r3*c3+r5*c5+r6*c6+r7*c7)))**2
objective = cp.Minimize(delta1+delta2+delta3)
constraints = [r1-r2 == 0, #I added this to make r1=r2.
r1>= 0.20,
r3>=0, #I added these to make it non negative.
r5>=0,
r5<=0.008,
r6>=0,
r1+r2+r3+r5+r6+r7 == 1]
problem = cp.Problem(objective,constraints)
problem.solve()
print(r1.value,r2.value,r3.value,r5.value,r6.value)
print(problem.status)
Once again I want to thank you for your help, guys.
Maybe you know how I can improve the code to get the variable values; perhaps there is an example of using a for loop to read the values instead of pulling each one directly from the DF. The DF looks like this:
DATO MAX
0 c1 51.95000
1 s1 3.07000
2 a1 0.83000
3 f1 0.31000
4 c2 52.26000
5 s2 2.82000
6 a2 0.75000
...
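To avoid the long block of individual df.at[...] assignments above, one option (just a sketch, assuming the two-column DATO/MAX layout shown in the sample) is to turn the DataFrame into a dictionary and build the cvxpy variables in a loop:
import pandas as pd
import cvxpy as cp

# Small stand-in for the real 24-row DataFrame (c1..f7 in DATO, values in MAX)
df = pd.DataFrame({'DATO': ['c1', 's1', 'a1', 'f1'],
                   'MAX':  [51.95, 3.07, 0.83, 0.31]})

# Map 'c1' -> 51.95, 's1' -> 3.07, ... instead of one df.at[...] line per value
coef = dict(zip(df['DATO'], df['MAX']))

# One cvxpy variable per free ratio; r7 stays fixed as in the original code
ratios = {name: cp.Variable() for name in ['r1', 'r2', 'r3', 'r5', 'r6']}
ratios['r7'] = 0.125

def blend(component, indices=(1, 2, 3, 5, 6, 7)):
    # Weighted sum of one component ('c', 's', 'a' or 'f') over the materials
    return sum(coef['%s%d' % (component, i)] * ratios['r%d' % i] for i in indices)
The three quality expressions (sat, ms and ma) can then be written once in terms of blend('c'), blend('s'), blend('a') and blend('f') instead of repeating the long sums.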

Group or bin rows by bp position values in defined intervals and sum their counts (in R or python)

I have a file of SNPs for two mouse strains (B6 and CAST) in this format:
Chr Position B6.SNP CAST.SNP B6.Count CAST.Count
chr17 1000 A G 102 82
chr17 2000 A G 91 76
chr17 5000 C T 55 87
chr17 7500 G A 70 36
chr17 10200 A G 83 45
chr17 17000 C T 34 91
chr17 20000 G A 95 46
I would like to group the SNPs (i.e. rows) by 'Chr' and 'Position' into 10,000 bp bins along the chromosome; in other words, group all SNPs that fall within 0-10,000 bp, then 10,001-20,000 bp, and so forth. In addition, for all the SNPs that fall within each bin, I want to sum their B6 and CAST counts, as well as create a new column with the number of SNPs that fell within each bin.
So perhaps an output file (using the example above):
Chr Start End SNP.Count B6.Count.Sum CAST.Count.Sum
chr17 0 10000 4 318 281
chr17 10001 20000 3 212 182
Thanks in advance.
Here is a start using GenomicRanges, TxDb, and plyranges in R:
## Read snp data
snps <- read.table(text = "
Chr Position B6.SNP CAST.SNP B6.Count CAST.Count
chr17 1000 A G 102 82
chr17 2000 A G 91 76
chr17 5000 C T 55 87
chr17 7500 G A 70 36
chr17 10200 A G 83 45
chr17 17000 C T 34 91
chr17 20000 G A 95 46
", header = T)
## Convert to GRanges object
library(GenomicRanges)
gr <- GRanges(seqnames = snps$Chr,
              ranges = IRanges(start = snps$Position, end = snps$Position))
mcols(gr) <- snps[,3:6]
## Use txdb to assign seqinfo
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
names_txdb <- seqnames(seqinfo(txdb))
names_gr <- seqnames(seqinfo(gr))
seqinfo(gr) <- seqinfo(txdb)[names_txdb[names_txdb %in% names_gr]]
## Make windows
windows <- tileGenome(seqinfo(gr), tilewidth = 10000, cut.last.tile.in.chrom = TRUE)
## Use plyranges to summarize by group overlap
library(plyranges)
df <- windows %>%
  group_by_overlaps(gr) %>%
  summarise(B6.Count.Sum = sum(B6.Count),
            CAST.Count.Sum = sum(CAST.Count),
            SNP.Count = n())
binnedGR <- windows[df$query] %>% `mcols<-`(value = df[-1])
## Result as GRanges or as data.frame
binnedGR
as.data.frame(binnedGR)
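Since the question also allows a Python answer, here is a rough pandas-only sketch (no genome annotation, just fixed 10,000 bp bins per chromosome; the bin-edge convention is an assumption based on the example output):
import pandas as pd
from io import StringIO

data = """Chr Position B6.SNP CAST.SNP B6.Count CAST.Count
chr17 1000 A G 102 82
chr17 2000 A G 91 76
chr17 5000 C T 55 87
chr17 7500 G A 70 36
chr17 10200 A G 83 45
chr17 17000 C T 34 91
chr17 20000 G A 95 46"""
snps = pd.read_csv(StringIO(data), sep=r"\s+")

bin_size = 10000
# Bin 0 covers positions 1-10000, bin 1 covers 10001-20000, and so on
snps["bin"] = (snps["Position"] - 1) // bin_size

out = (snps.groupby(["Chr", "bin"])
            .agg(SNP_Count=("Position", "size"),
                 B6_Count_Sum=("B6.Count", "sum"),
                 CAST_Count_Sum=("CAST.Count", "sum"))
            .reset_index())
out["Start"] = out["bin"] * bin_size + 1
out["End"] = (out["bin"] + 1) * bin_size
print(out[["Chr", "Start", "End", "SNP_Count", "B6_Count_Sum", "CAST_Count_Sum"]])
On the sample above this reproduces the requested sums (4 SNPs / 318 / 281 in the first bin, 3 SNPs / 212 / 182 in the second).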

Multivariate 'quadratic' regression in python (like fitlm function in matlab)

I wanted to ask if anyone can help me out.
I want to create a 'quadratic' regression of 5 input variables in python and obtain a regression quadratic equation.
In matlab I can use the function
fitlm(ds,'quadratic')
ds is a nx5 array.
The output is (example):
Linear regression model:
x6 ~ [Linear formula with 21 terms in 5 predictors]
Estimated Coefficients:
Estimate SE tStat pValue
___________ __________ __________ __________
(Intercept) 3.8574 0.60766 6.348 2.296e-08
x1 0.2847 0.26311 1.0821 0.28316
x2 0.0022534 0.0046868 0.48079 0.63226
x3 -0.0092632 0.010228 -0.9057 0.36839
x4 -0.0039061 0.00043497 -8.9802 4.7159e-13
x5 0.0014984 0.00061604 2.4323 0.017722
x1:x2 -0.004019 0.0014052 -2.8602 0.0056639
x1:x3 -1.1981e-05 0.0021956 -0.0054568 0.99566
x1:x4 0.00011539 0.00011732 0.98356 0.32893
x1:x5 0.00011744 0.00017357 0.67661 0.50102
x2:x3 1.6354e-06 4.3911e-05 0.037243 0.9704
x2:x4 2.9589e-06 2.3464e-06 1.2611 0.21173
x2:x5 3.0621e-06 3.4713e-06 0.8821 0.38092
x3:x4 2.2725e-06 3.6662e-06 0.61986 0.53749
x3:x5 -1.4034e-05 5.7374e-06 -2.446 0.017117
x4:x5 2.5923e-06 2.8928e-07 8.9614 5.0922e-13
x1^2 -0.14307 0.052186 -2.7415 0.0078616
x2^2 -4.5755e-05 2.2194e-05 -2.0616 0.043186
x3^2 2.5903e-05 5.4432e-05 0.47587 0.63574
x4^2 1.1868e-06 1.4496e-07 8.1874 1.2233e-11
x5^2 -2.1103e-05 6.8098e-07 -30.989 4.7528e-41
How can I do the same thing in python?
I tried to use linear_model.LinearRegression() and PolynomialFeatures() from sklearn, but so far it has only returned the 5 linear terms.
I attach some example values.
Columns 1-5 (x1-x5) contain the predictors; column 6 (x6) contains the target.
x1 x2 x3 x4 x5 x6
1.75 -2.5 76 1050 0 0.99
1 10 84 900 0 1.1598
1.5 10 84 900 100 1.2034
1.5 10 68 900 100 1.3544
1.5 10 84 900 200 0.8591
1.5 10 84 900 200 0.8595
1.25 -2.5 76 1050 100 1.072
1.25 22.5 76 750 200 1.0426
1 10 84 900 200 0.8588
1.25 -2.5 92 750 100 1.3811
1.25 22.5 92 1050 100 1.0213
2 10 84 900 0 1.0336
Thank you very much in advance!
Regards!
AF
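No answer is reproduced here, but as a sketch of one common approach (not verified against the original data): PolynomialFeatures(degree=2) expands the 5 predictors into the full 21 terms (intercept, 5 linear, 10 interactions, 5 squares), and statsmodels' OLS then reports estimates, standard errors, t-statistics and p-values much like fitlm(ds,'quadratic'). The data below is randomly generated just to make the snippet self-contained; a full quadratic fit needs at least 21 observations.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: replace X (n x 5) and y (n,) with the real values
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 1.0 + X[:, 0] - 0.5 * X[:, 3]**2 + 0.2 * X[:, 1] * X[:, 4] + rng.normal(scale=0.1, size=40)

# Expand to intercept + linear + interaction + squared terms (21 columns)
poly = PolynomialFeatures(degree=2, include_bias=True)
X_quad = poly.fit_transform(X)
names = poly.get_feature_names_out(["x1", "x2", "x3", "x4", "x5"])  # get_feature_names in older sklearn

# OLS gives coefficient estimates plus SE, tStat and pValue, like fitlm
fit = sm.OLS(y, X_quad).fit()
print(pd.DataFrame({"term": names, "Estimate": fit.params, "SE": fit.bse,
                    "tStat": fit.tvalues, "pValue": fit.pvalues}))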

Surprising challenge generating comprehensive list

I am facing a surprising challenge with Python.
I am a Physicist generating a series of simulations of layers at an optical interface. The details of the simulations are not especially important; what is crucial is that all possible cases are generated - different materials within a range of thicknesses, and different layer orders.
I have been writing code to generate a comprehensive list of unique structures, but I am staggered at how long it takes to compute even relatively simple systems! Surely Python and a reasonable computer should handle this without excessive stress. Suggestions would be greatly appreciated.
Thank you
from itertools import permutations, combinations_with_replacement

def join_adjacent_repeated_materials(potential_structure):
    """
    Merge adjacent layers of the same material into a single layer.
    """
    #print potential_structure
    new_layers = []  # List to hold re-cast structure
    for layer in potential_structure:
        if len(new_layers) > 0:  # if not the first item in the list of layers
            last_layer = new_layers[-1]  # last element of existing layer list
            if layer[0] == last_layer[0]:  # true if the two layers are the same material
                combined_layer = (layer[0], layer[1] + last_layer[1])
                new_layers[len(new_layers)-1] = combined_layer
            else:  # adjacent layers are different materials so no combination is possible
                new_layers.append(layer)
        else:  # for the first layer
            new_layers.append(layer)
    return tuple(new_layers)

def calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,
                                       maximum_individual_layer_thicknesses,
                                       maximum_total_material_thicknesses):
    """
    Create a set of all possible multilayer combinations.

    thicknesses : if this contains '0' the total number of layers will vary
                  from 0 to maximum_number_of_layers, otherwise the
                  total number of layers will always be maximum_number_of_layers
                  e.g. arange(0, 100, 5)
    materials : list of materials used
                e.g. ['Metal', 'Dielectric']
    maximum_number_of_layers : pretty self-explanatory...
                               e.g. 5
    maximum_individual_layer_thicknesses : filters the created multilayer structures,
                                           preventing the inclusion of layers that are too thick
                                           - this is important after the joining of adjacent materials
                                           e.g. (('Metal',30),('Dielectric',20))
    maximum_total_material_thicknesses : similar to the above, but filters structures where the total
                                         amount of a particular material is exceeded
                                         e.g. (('Metal',50),('Dielectric',100))
    """
    # generate all possible thickness combinations and material combinations
    all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
    all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))

    first_set = set()  # Create set object (list of unique elements, no repeats)
    for layer_material_order in all_possible_layer_material_orders:
        for layer_thickness_set in all_possible_thickness_sets:
            potential_structure = []  # list to hold this structure
            for layer, thickness in zip(layer_material_order[0], layer_thickness_set[0]):  # combine the layer thickness with its material
                if thickness != 0:  # layers of zero thickness are not added to potential_structure
                    potential_structure.append((layer, thickness))
            first_set.add(tuple(potential_structure))  # add this potential_structure to the first_set set

    #print('first_set')
    #for struct in first_set:
    #    print struct

    ## join adjacent repeated materials
    second_set = set()  # create new set
    for potential_structure in first_set:
        second_set.add(join_adjacent_repeated_materials(potential_structure))

    ## remove structures where a layer is too thick
    third_set = set()
    for potential_structure in second_set:  # check all the structures in the set
        conditions_satisfied = True  # default
        for max_condition in maximum_individual_layer_thicknesses:  # check this structure using each condition
            for layer in potential_structure:  # examine each layer
                if layer[0] == max_condition[0]:  # match condition with material
                    if layer[1] > max_condition[1]:  # test thickness condition
                        conditions_satisfied = False
        if conditions_satisfied:
            third_set.add(potential_structure)

    ## remove structures that contain too much of a certain material
    fourth_set = set()
    for potential_structure in second_set:  # check all the structures in the set
        conditions_satisfied = True  # default
        for max_condition in maximum_total_material_thicknesses:  # check this structure using each condition
            amount_of_material_in_this_structure = 0  # initialise a counter
            for layer in potential_structure:  # examine each layer
                if layer[0] == max_condition[0]:  # match condition with material
                    amount_of_material_in_this_structure += layer[1]
                    if amount_of_material_in_this_structure > max_condition[1]:  # test thickness condition
                        conditions_satisfied = False
        if conditions_satisfied:
            fourth_set.add(potential_structure)

    return fourth_set

thicknesses = [0, 1, 2]
materials = ('A', 'B')  # Tuple cannot be accidentally appended to later
maximum_number_of_layers = 3
maximum_individual_layer_thicknesses = (('A', 30), ('B', 20))
maximum_total_material_thicknesses = (('A', 20), ('B', 15))

calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,
                                   maximum_individual_layer_thicknesses=maximum_individual_layer_thicknesses,
                                   maximum_total_material_thicknesses=maximum_total_material_thicknesses)
all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))
Holy crap! These sets are going to be huge! Let's give an example. If thicknesses has 6 things in it and maximum_number_of_layers is 3, then the first set is going to have about 2 quintillion things in it. Why are you doing this? If these are really the sets you want to use, you're going to need to find an algorithm that doesn't need to build these sets, because it's never going to happen. I suspect these aren't the sets you want; perhaps you wanted itertools.product?
all_possible_thickness_sets = set(product(thicknesses, repeat=maximum_number_of_layers))
Here's an example of what itertools.product does:
>>> for x in product([1, 2, 3], repeat=2):
... print x
...
(1, 1)
(1, 2)
(1, 3)
(2, 1)
(2, 2)
(2, 3)
(3, 1)
(3, 2)
(3, 3)
Does that look like what you need?
So one thing you do really often is to add something to a set. If you look at the runtime behaviour of sets in the Python documentation, it says, near the bottom, about worst cases: "Individual actions may take surprisingly long, depending on the history of the container". I think memory reallocation may bite you if you add a lot of elements, because Python has no way of knowing how much memory to reserve when you start.
The more I look at it, the more I think you're reserving more memory than you have to. third_set, for example, doesn't even get used. second_set could be replaced by first_set if you just called join_adjacent_repeated_materials directly. And if I read it correctly, even first_set could go away and you could just create the candidates as you construct fourth_set.
Of course the code may become less readable if you put everything into a single bunch of nested loops. However, there are ways to structure your code that don't create unnecessary objects just for readability - you could for example create a function which generates candidates and return each result via yield.
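As a rough sketch of that restructuring (reusing join_adjacent_repeated_materials from the question, with the two filters folded into the loop so that no huge intermediate set is ever built), a generator over itertools.product yields one candidate at a time and only the surviving structures are stored:
from itertools import product

def candidate_structures(thicknesses, materials, maximum_number_of_layers):
    # Yield one merged (material, thickness) structure at a time
    for mats in product(materials, repeat=maximum_number_of_layers):
        for thicks in product(thicknesses, repeat=maximum_number_of_layers):
            layers = [(m, t) for m, t in zip(mats, thicks) if t != 0]
            yield join_adjacent_repeated_materials(tuple(layers))

def calculate_unique_structure_lengths_gen(thicknesses, materials, maximum_number_of_layers,
                                           maximum_individual_layer_thicknesses,
                                           maximum_total_material_thicknesses):
    max_individual = dict(maximum_individual_layer_thicknesses)
    max_total = dict(maximum_total_material_thicknesses)
    results = set()
    for structure in candidate_structures(thicknesses, materials, maximum_number_of_layers):
        # discard structures with any single layer that is too thick
        if any(t > max_individual[m] for m, t in structure):
            continue
        # discard structures that use too much of any material in total
        totals = {}
        for m, t in structure:
            totals[m] = totals.get(m, 0) + t
        if any(amount > max_total[m] for m, amount in totals.items()):
            continue
        results.add(structure)
    return results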
FWIW I instrumented your code to enable profiling. Here are the results:
Output:
Sun May 25 16:06:31 2014 surprising-challenge-generating-comprehensive-python-list.stats
348,365,046 function calls in 1,538.413 seconds
Ordered by: cumulative time, internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1052.933 1052.933 1538.413 1538.413 surprising-challenge-generating-comprehensive-python-list.py:34(calculate_unique_structure_lengths)
87091200 261.764 0.000 261.764 0.000 {zip}
87091274 130.492 0.000 130.492 0.000 {method 'add' of 'set' objects}
174182440 93.223 0.000 93.223 0.000 {method 'append' of 'list' objects}
30 0.000 0.000 0.000 0.000 surprising-challenge-generating-comprehensive-python-list.py:14(join_adjacent_repeated_materials)
100 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
To get an even finer grained picture of where the code was spending it's time, I used the line_profiler module on an almost verbatim copy of your code and got the following results for each of your functions:
> python "C:\Python27\Scripts\kernprof.py" -l -v surprising-challenge-generating-comprehensive-python-list.py
Wrote profile results to example.py.lprof
Timer unit: 3.2079e-07 s
File: surprising-challenge-generating-comprehensive-python-list.py
Function: join_adjacent_repeated_materials at line 3
Total time: 0.000805183 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 #profile
4 def join_adjacent_repeated_materials(potential_structure):
5 """
6 Self-explanitory...
7 """
8 #print potential_structure
9
10 30 175 5.8 7.0 new_layers = [] # List to hold re-cast structure
11 100 544 5.4 21.7 for layer in potential_structure:
12 70 416 5.9 16.6 if len(new_layers) > 0: # if not the first item in the list of layers
13 41 221 5.4 8.8 last_layer=new_layers[-1] # last element of existing layer list
14 41 248 6.0 9.9 if layer[0] == last_layer[0]: # true is the two layers are the same material
15 30 195 6.5 7.8 combined_layer = (layer[0], layer[1] + last_layer[1])
16 30 203 6.8 8.1 new_layers[len(new_layers)-1] = combined_layer
17 else: # adjcent layers are different material so no comibantion is possible
18 11 68 6.2 2.7 new_layers.append(layer)
19 else: # for the first layer
20 29 219 7.6 8.7 new_layers.append(layer)
21
22 30 221 7.4 8.8 return tuple(new_layers)
File: surprising-challenge-generating-comprehensive-python-list.py
Function: calculate_unique_structure_lengths at line 24
Total time: 3767.94 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
24 #profile
25 def calculate_unique_structure_lengths(thicknesses, materials, maximum_number_of_layers,\
26 maximum_individual_layer_thicknesses, \
27 maximum_total_material_thicknesses):
28 """
29 Create a set on all possible multilayer combinations.
30
31 thicknesses : if this contains '0' the total number of layers will vary
32 from 0 to maximum_number_of_layers, otherwise, the
33 number total number layers will always be maximum_number_of_layers
34 e.g. arange(0 , 100, 5)
35
36 materials : list of materials used
37 e.g. ['Metal', 'Dielectric']
38
39 maximum_number_of_layers : pretty self-explanitory...
40 e.g. 5
41
42 maximum_individual_layer_thicknesses : filters the created the multilayer structures
43 preventing the inclusion layers that are too thick
44 - this is important after the joining of
45 adjacent materials
46 e.g. (('Metal',30),('Dielectric',20))
47
48 maximum_total_material_thicknesses : similar to the above but filters structures where the total
49 amount of a particular material is exceeded
50 e.g. (('Metal',50),('Dielectric',100))
51
52
53 """
54 # generate all possible thickness combinations and material combinations
55 1 20305240 20305240.0 0.2 all_possible_thickness_sets = set(permutations(combinations_with_replacement(thicknesses, maximum_number_of_layers)))
56 1 245 245.0 0.0 all_possible_layer_material_orders = set(permutations(combinations_with_replacement(materials, maximum_number_of_layers)))
57
58
59 1 13 13.0 0.0 first_set = set() # Create set object (list of unique elements, no repeats)
60 25 235 9.4 0.0 for layer_material_order in all_possible_layer_material_orders:
61 87091224 896927052 10.3 7.6 for layer_thickness_set in all_possible_thickness_sets:
62 87091200 920048586 10.6 7.8 potential_structure = [] # list to hold this structure
63 348364800 4160332176 11.9 35.4 for layer, thickness in zip(layer_material_order[0], layer_thickness_set[0]): # combine the layer thickness with its material
64 261273600 2334038439 8.9 19.9 if thickness != 0: # layers of zero thickness are not added to potential_structure
65 174182400 2003639625 11.5 17.1 potential_structure.append((layer, thickness))
66 87091200 1410517427 16.2 12.0 first_set.add(tuple(potential_structure)) # add this potential_structure to the first_set set
67
68 #print('first_set')
69 #for struct in first_set:
70 # print struct
71
72 ## join adjacent repeated materials
73 1 14 14.0 0.0 second_set = set() # create new set
74 31 274 8.8 0.0 for potential_structure in first_set:
75 30 5737 191.2 0.0 second_set.add(join_adjacent_repeated_materials(potential_structure))
76
77 ## remove structures where a layer is too thick
78 1 10 10.0 0.0 third_set = set()
79 23 171 7.4 0.0 for potential_structure in second_set: # check all the structures in the set
80 22 164 7.5 0.0 conditions_satisfied=True # default
81 66 472 7.2 0.0 for max_condition in maximum_individual_layer_thicknesses: # check this structure using each condition
82 104 743 7.1 0.0 for layer in potential_structure: # examine each layer
83 60 472 7.9 0.0 if layer[0] == max_condition[0]: # match condition with material
84 30 239 8.0 0.0 if layer[1] > max_condition[1]: # test thickness condition
85 conditions_satisfied=False
86 22 149 6.8 0.0 if conditions_satisfied:
87 22 203 9.2 0.0 third_set.add(potential_structure)
88
89 ##remove structures that contain too much of a certain material
90 1 10 10.0 0.0 fourth_set = set()
91 23 178 7.7 0.0 for potential_structure in second_set: # check all the structures in the set
92 22 158 7.2 0.0 conditions_satisfied=True # default
93 66 489 7.4 0.0 for max_condition in maximum_total_material_thicknesses: # check this structure using each condition
94 44 300 6.8 0.0 amount_of_material_in_this_structure = 0 # initialise a counter
95 104 850 8.2 0.0 for layer in potential_structure: # examine each layer
96 60 2961 49.4 0.0 if layer[0] == max_condition[0]: # match condition with material
97 30 271 9.0 0.0 amount_of_material_in_this_structure += layer[1]
98 30 255 8.5 0.0 if amount_of_material_in_this_structure > max_condition[1]: # test thickness condition
99 conditions_satisfied=False
100 22 160 7.3 0.0 if conditions_satisfied:
101 22 259 11.8 0.0 fourth_set.add(potential_structure)
102
103 1 16 16.0 0.0 return fourth_set
As you can see, constructing the first_set in calculate_unique_structure_lengths() is by far the most time-consuming step.

Speed up numpy.where for extracting integer segments?

I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from line_profiler is below, and it shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))
    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim+6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop+6,
                                None) for dim in slice_)
        indices = seg_image_view == j+1
        obj_avg = hot_image_view[indices].mean()
        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated
        border_avg = hot_image_padded[new_slice][border == 1].mean()
        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to store the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm that accomplishes the same thing, but that's the first place I'd look for a speedup.
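Applied to the loop in the question, that suggestion is just a small change; below is a sketch with a toy image (since the original arrays aren't available here) showing the mask being computed once and indexed with in both places:
import numpy as np

hot_image = np.random.rand(100, 100)
seg_image = np.random.randint(0, 6, size=(100, 100))
new_hot = hot_image.copy()

for i in range(1, 6):
    seg_mask = seg_image == i            # computed once per object...
    obj_avg = hot_image[seg_mask].mean()
    # ... rest of the per-object processing as in the original loop ...
    new_hot[seg_mask] += 0.1             # placeholder update; the point is reusing seg_mask instead of recomputing seg_image == i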
