I am trying to implement a simple quadratic program using CPLEX's Python API. The sample file qpex1 provided with CPLEX covers this. The problem, as given in qpex.lp, is:
Maximize
obj: x1 + 2 x2 + 3 x3 + [ - 33 x1 ^2 + 12 x1 * x2 - 22 x2 ^2 + 23 x2 * x3
- 11 x3 ^2 ] / 2
Subject To
c1: - x1 + x2 + x3 <= 20
c2: x1 - 3 x2 + x3 <= 30
Bounds
0 <= x1 <= 40
End
When the problem is implemented in Python, the quadratic portion of the objective function is passed as a matrix qmat:
qmat = [[[0, 1], [-33.0, 6.0]],
[[0, 1, 2], [6.0, -22.0, 11.5]],
[[1, 2], [11.5, -11.0]]]
p.objective.set_quadratic(qmat)
Can someone explain the structure of this matrix? What are the parts of the data structure being used, what are the components, and so on?
The first list is the set of indices, the second list the set of corresponding values, so the qmat matrix is:
-33 6 0
6 -22 11.5
0 11.5 -11
that results in:
| -33 6 0 | x1
x1 x2 x3 | 6 -22 11.5 | x2 = - 33 x1 ^2 + 12 x1 * x2 - 22 x2 ^2 + 23 x2 * x3 - 11 x3 ^2
| 0 11.5 -11 | x3
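For completeness, here is a minimal from-memory sketch (in the spirit of the qpex1 sample, not a verbatim copy of it) of how qmat plugs into a model built with the Python API; the variable names, bounds, and constraints follow the LP file above:
import cplex

p = cplex.Cplex()
p.objective.set_sense(p.objective.sense.maximize)

# linear objective terms x1 + 2 x2 + 3 x3; only x1 has a finite upper bound
p.variables.add(obj=[1.0, 2.0, 3.0],
                ub=[40.0, cplex.infinity, cplex.infinity],
                names=["x1", "x2", "x3"])

# c1: -x1 + x2 + x3 <= 20,  c2: x1 - 3 x2 + x3 <= 30
p.linear_constraints.add(
    lin_expr=[cplex.SparsePair(ind=["x1", "x2", "x3"], val=[-1.0, 1.0, 1.0]),
              cplex.SparsePair(ind=["x1", "x2", "x3"], val=[1.0, -3.0, 1.0])],
    senses="LL", rhs=[20.0, 30.0])

# one [indices, values] pair per column of the symmetric matrix Q;
# CPLEX adds (1/2) x'Qx, which matches the [ ... ] / 2 term in the LP file
qmat = [[[0, 1], [-33.0, 6.0]],
        [[0, 1, 2], [6.0, -22.0, 11.5]],
        [[1, 2], [11.5, -11.0]]]
p.objective.set_quadratic(qmat)

p.solve()
print(p.solution.get_values())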
I have a large matrix with 12 columns and approximately 1,000,000 rows. Each column represents the money spent by a client in a given month, so with the 12 columns I have information for one full year. Each row represents one client.
I need to divide the people into groups based on how much money they spent each month, and I consider the following intervals:
money=0
0<money<=25
25<money<=50
50<money<=75
So, for example, group1 would be formed by clients that spent $0 each month for the whole year, group2 would be clients who spent between $0 and $25 the first month and $0 the rest of the months, and so on. In the end I have 12 months and 4 intervals, so I need to divide the data into 4^12 = 16,777,216 groups (I know this yields more groups than observations and that many of the groups will be empty or contain very few clients, but that is another problem; so far I am interested in doing this division into groups).
I am currently working in R, although I could also switch to Python if required (those are the programming languages I know best), and so far my only idea has been to use nested for loops, one for each month. But this is very, very slow.
So my question is: is there a faster way to do this?
Here I provide a small example with fake data: 10 observations (instead of the 1,000,000), 5 columns (instead of 12), and a simplified version of my current code for doing the grouping.
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50), nrow=10, ncol=5))
intervals = c(-4, -1, 0, 1, 4)
id_list = c()
group_list = c()
group_idx = 0
for(idx1 in 1:(length(intervals)-1))
{
  data1 = data[(data[, 2] >= intervals[idx1]) & (data[, 2] < intervals[idx1+1]),]
  for(idx2 in 1:(length(intervals)-1))
  {
    data2 = data1[(data1[, 3] >= intervals[idx2]) & (data1[, 3] < intervals[idx2+1]),]
    for(idx3 in 1:(length(intervals)-1))
    {
      data3 = data2[(data2[, 4] >= intervals[idx3]) & (data2[, 4] < intervals[idx3+1]),]
      for(idx4 in 1:(length(intervals)-1))
      {
        data4 = data3[(data3[, 5] >= intervals[idx4]) & (data3[, 5] < intervals[idx4+1]),]
        for(idx5 in 1:(length(intervals)-1))
        {
          data5 = data4[(data4[, 6] >= intervals[idx5]) & (data4[, 6] < intervals[idx5+1]),]
          group_idx = group_idx + 1
          id_list = c(id_list, data5$id)
          group_list = c(group_list, rep(group_idx, nrow(data5)))
        }
      }
    }
  }
}
If you do need to do this--which I certainly have my doubts about--I would suggest creating a matrix with the classification for each cell of the original data, and then pasting them together to make a group label.
Doing this we can set the group labels to be human readable, which might be nice.
I would recommend simply adding this grouping column to the original data and then using dplyr or data.table to do grouped operations for your next steps, but if you really want separate data frames for each you can then split the original data based on these group labels.
## I redid your sample data to put it on the same general scale as
## your actual data
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50, mean = 50, sd = 20), nrow=10, ncol=5))
my_breaks = c(0, 25 * 1:3, Inf)
## optional human-readable labels; pass `labels = my_labs` to cut() to use
## them (the output below uses cut()'s default interval labels)
my_labs = c("Low", "Med", "High", "Extreme")
## classify each value from the data
grouping = vapply(
  data[-1], \(x) as.character(cut(x, breaks = my_breaks)),
  FUN.VALUE = character(nrow(data))
)
## create one label per client by pasting that row's classifications together
group_labels = apply(grouping, 1, \(x) paste(1:(ncol(data) - 1), x, sep = ":", collapse = " | "))
## split the data based on these group labels ...
result = split(data, group_labels)
## ... and/or add the grouping column to the original data
data$group = group_labels
result
# $`1:(0,25] | 2:(25,50] | 3:(75,Inf] | 4:(75,Inf] | 5:(25,50]`
#   id       X1        X2       X3        X4       X5
# 3  3 24.89016 28.392148 79.35924 94.309211 48.50842
#
# $`1:(25,50] | 2:(0,25] | 3:(75,Inf] | 4:(0,25] | 5:(25,50]`
#   id       X1       X2       X3       X4       X5
# 8  8 37.29257 6.320665 79.97548 9.990545 40.79511
#
# ... (the eight remaining groups are omitted; with only 10 observations,
# every client happens to fall in its own single-row group)
Using findInterval, a group ID can be added in a fraction of a second on a 1M row table:
library(data.table)
set.seed(538924142)
data <- data.frame(id = 1:1e6, matrix(runif(12e6, 0, 75)*sample(0:1, 12e6, TRUE, c(0.25, 0.75)), 1e6, 12))
system.time({
  setDT(data)[
    , grp := colSums(
        matrix(
          findInterval(
            t(as.matrix(.SD)),
            c(0, 25, 50, 75),
            left.open = TRUE
          ),
          12, 1e6
        )*4^(0:11)
      ),
    .SDcols = 2:13
  ]
})
#> user system elapsed
#> 0.26 0.05 0.31
head(data)
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 1 0.00000 67.680617 26.178075 65.66532 0.00000 55.2356394 5.438976 72.20526839 70.47368 0.000000 0.00000 29.17494 8641772
#> 2: 2 0.00000 8.193552 10.482581 19.15885 30.28639 44.3917749 1.876230 11.19145219 55.22776 48.725632 17.18597 74.58265 14375508
#> 3: 3 0.00000 63.301921 0.000000 61.50508 0.00000 0.5755531 52.139676 51.46551228 58.90514 60.098006 12.90056 0.00000 2094284
#> 4: 4 18.06334 34.970526 9.599701 38.64339 57.00753 62.3455201 30.377876 73.73237960 0.00000 18.706219 0.00000 25.57064 8712089
#> 5: 5 27.49489 8.770596 0.000000 67.30562 58.43427 26.2856874 65.784429 36.96939287 54.65132 3.676736 29.51849 25.35926 10992582
#> 6: 6 59.27949 14.830172 2.233060 13.27291 16.63301 2.5727847 0.000000 0.05254523 23.44611 29.529823 0.00000 63.00820 13190487
data[which.min(data$grp)]
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 189293 4.801804 0 26.7038 0 0 0 0 0 0 0 0 0 33
data[which.max(data$grp)]
#> id X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 637400 0 0 69.10316 56.61781 52.88433 62.50076 72.81748 57.27957 70.34022 72.01065 53.4228 56.72517 16777200
Then proceed with data.table subsetting and grouping operations. If you really want it split:
group_list <- split(data, by = "grp")
But be aware that processing a list of data.tables will generally be much slower than manipulation in a single data.table by group using the by argument.
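If you do switch to Python, the same vectorized idea carries over: np.digitize plays the role of findInterval, and the 12 per-month bin indices combine into one base-4 group ID. A rough sketch on uniform fake data (simplified: no zero-spending mask as in the R example), where right=True makes the bins left-open like left.open = TRUE:
import numpy as np
import pandas as pd

rng = np.random.default_rng(538924142)
data = pd.DataFrame(rng.uniform(0, 75, size=(1_000_000, 12)),
                    columns=[f"X{i}" for i in range(1, 13)])

# bin each cell: 0 -> money == 0, 1 -> (0,25], 2 -> (25,50], 3 -> (50,75]
bins = np.digitize(data.to_numpy(), [0, 25, 50, 75], right=True)

# weight the 12 per-month bins by powers of 4 to get one group ID per row
data["grp"] = bins @ (4 ** np.arange(12))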
I have a matrix with n rows and n+1 columns and need to construct a system of XOR equations from it.
For example matrix is
x4 x3 x2 x1 result
1 1 0 1 0
1 0 1 0 1
0 1 0 1 1
1 0 1 1 0
Then the equations will be (+ is XOR):
x4+x3+x1=0
x4+x2=1
x3+x1=1
x4+x2+x1=0
I need to return the answer as a list of x1, ..., xn.
How can we do this in Python?
You could make use of the Python bindings of Microsoft's Z3 solver (the z3-solver package on PyPI):
from z3 import *

def xor2(a, b):
    return Xor(a, b)

def xor3(a, b, c):
    return Xor(a, Xor(b, c))
# define Boolean variables
x1 = Bool('x1')
x2 = Bool('x2')
x3 = Bool('x3')
x4 = Bool('x4')
s = Solver()
# every equation is expressed as one constraint
s.add(Not(xor3(x4, x3, x1)))
s.add(xor2(x4, x2))
s.add(xor2(x3, x1))
s.add(Not(xor3(x4, x2, x1)))
# solve and output results
print(s.check())
print(s.model())
Result:
sat
[x3 = False, x2 = False, x1 = True, x4 = True]
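Since the question asks for the answer as a plain list, a possible follow-up: you can read the model back into 0/1 values with is_true:
m = s.model()
print([1 if is_true(m[v]) else 0 for v in (x1, x2, x3, x4)])
# [1, 0, 0, 1]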
Alternatively, learn Gaussian elimination, which also works over GF(2) (XOR takes the role of addition), and write a small Gaussian-elimination program in Python, as sketched below.
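A minimal sketch of that idea; solve_xor_system is just an illustrative name, and it assumes the system has a unique solution (no handling of inconsistent or underdetermined systems):
def solve_xor_system(rows, n):
    """rows: augmented 0/1 matrix, each row of length n+1 ([coefficients | rhs])."""
    rows = [row[:] for row in rows]      # work on a copy
    for col in range(n):
        # find a row with a 1 in this column and swap it into the pivot position
        pivot = next(i for i in range(col, len(rows)) if rows[i][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # XOR the pivot row into every other row that has a 1 in this column
        for i in range(len(rows)):
            if i != col and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[col])]
    return [rows[i][n] for i in range(n)]

# the system from the question; columns are ordered x4, x3, x2, x1 | result
eqs = [[1, 1, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 1, 1],
       [1, 0, 1, 1, 0]]
x4, x3, x2, x1 = solve_xor_system(eqs, 4)
print([x1, x2, x3, x4])   # [1, 0, 0, 1], matching the Z3 answer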
Can anyone suggest an efficient way of reshaping a column (in a Python pandas DataFrame) into multiple columns, with alternating column assignment? I could do this with a loop, but I am wondering if there is a more elegant way. Consider the following example:
Added: does anyone have a solution that will reshape every n values in a single column into n separate columns, i.e. reshaping from a single column holding n interleaved variables to n columns?
Col
1 x1
2 y1
3 z1
4 x2
5 y2
6 z2
7 x3
8 y3
9 z3
..
to
x y z
1 x1 y1 z1
2 x2 y2 z2
3 x3 y3 z3
...
You can just reshape the underlying values, assuming that you have the correct number of values for the given shape and that you only care about the positional ordering of the values, not about the values themselves:
s
Col
1 x1
2 y1
3 z1
4 x2
5 y2
6 z2
7 x3
8 y3
9 z3
pd.DataFrame(s.to_numpy().reshape(3, 3))
0 1 2
0 x1 y1 z1
1 x2 y2 z2
2 x3 y3 z3
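For the "every n values" case added to the question, you can let NumPy infer the row count with reshape(-1, n); this still assumes the column's length is an exact multiple of n. A small hypothetical helper:
import pandas as pd

def fold_column(s, n, columns=None):
    """Fold a Series of interleaved values into n columns, row-major."""
    # len(s) must be an exact multiple of n; -1 lets NumPy infer the row count
    return pd.DataFrame(s.to_numpy().reshape(-1, n), columns=columns)

# e.g. fold_column(s['Col'], 3, columns=['x', 'y', 'z'])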
You can use:
df_final = pd.DataFrame(df.groupby(df.Col.str[-1])['Col'].apply(list)
                          .values.tolist(), columns=['x', 'y', 'z'])
x y z
0 x1 y1 z1
1 x2 y2 z2
2 x3 y3 z3
You can use auxiliary variables to serve as the row and column index, then apply df.pivot:
df1['aux'] = df1.Col.str[:-1]
df1['aux_idx'] = df1.Col.str[-1:]
print(df1.pivot(index= 'aux_idx', columns='aux', values='Col'))
Output:
aux x y z
aux_idx
1 x1 y1 z1
2 x2 y2 z2
3 x3 y3 z3
For the same result obtained by just counting elements, use df.index (made 0-based, then floor-divided by n) as the key:
df1['aux_idx'] = (df1.index-1)// 3
df1['aux'] = df1.Col.str[:-1]
print(df1.pivot(index= 'aux_idx', columns='aux', values='Col'))
Output:
aux x y z
aux_idx
0 x1 y1 z1
1 x2 y2 z2
2 x3 y3 z3
I'm trying to solve a linear systems of inequations in Python.
My linear system looks something like this:
3 * x1 + 2 * x2 + 4 * x3 > 0
x1 - 4 * x2 - 7 * x3 > 0
I've tried to use NumPy, but linalg.solve(a, b) is designed for equations (=) and I have inequations (>).
I've thought about adding variables to my problem to transform the inequations into equations, like:
3 * x1 + 2 * x2 + 4 * x3 - x4 + 0 * x5 = 0
x1 + 4 * x2 + 7 * x3 + 0 * x4 - x5 = 0
with x4 and x5 being > 0.
But I don't know how many constraints I'm going to have, and I don't know if linalg.solve gives only strictly positive values to the variables.
I've also looked into SciPy's linprog.
I could add an objective function like x1 + x2 + x3; that wouldn't be a problem.
But with linprog I can only express inequations as <=, and I want to exclude the value 0. It would be okay if I could have < 0.
I hope my problem is clear.
I've asked Google for some help but found nothing. I guess I'm missing something, since I can't be the only one with this problem.
Thank you for your help.
I would suggest introducing a tolerance that defines how close you can come to zero, and perhaps iterating on the tolerance value.
I.e. Rewrite
3 * x1 + 2 * x2 + 4 * x3 > 0
x1 + 4 * x2 + 7 * x3 > 0
as
3 * x1 + 2 * x2 + 4 * x3 >= t
x1 + 4 * x2 + 7 * x3 >= t
where t > 0.
Now you can use scipy.optimize.linprog to solve this.
Perhaps t >= 0.01 is acceptable as a starting point.
Then iterate on t in [0.01, 0.001, 0.0001, ... ]
At some point, your solution might start changing by less than your precision requirement.
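For concreteness, a minimal sketch with scipy.optimize.linprog for a fixed t. The objective is zero, so this is a pure feasibility check; the coefficients are those of the system above, with signs flipped because linprog expects A_ub @ x <= b_ub:
import numpy as np
from scipy.optimize import linprog

A = np.array([[3.0, 2.0, 4.0],
              [1.0, 4.0, 7.0]])
t = 0.01  # tolerance standing in for the strict "> 0"

res = linprog(c=np.zeros(3),                  # zero objective: feasibility only
              A_ub=-A, b_ub=-t * np.ones(2),  # A @ x >= t  <=>  -A @ x <= -t
              bounds=[(None, None)] * 3)      # x is free, not restricted to >= 0
print(res.status, res.x)                      # status 0 means a feasible point was found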
Hope this helps.
I'd like to minimize a set of equations where the variables are known with their uncertainties. In essence I'd like to test the hypothesis that the given measured variables conform to the formula constraints given by the equations. This seems like something I should be able to do with scipy-optimize. For example I have three equations:
8 = 0.5 * x1 + 1.0 * x2 + 1.5 * x3 + 2.0 * x4
4 = 0.0 * x1 + 0.0 * x2 + 1.0 * x3 + 1.0 * x4
1 = 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + 0.0 * x4
And four measured unknowns with their 1-sigma uncertainty:
x1 = 0.246 ± 0.007
x2 = 0.749 ± 0.010
x3 = 1.738 ± 0.009
x4 = 2.248 ± 0.007
Looking for any pointers in the right direction.
This is my approach: assuming x1-x4 are approximately normally distributed around each mean (1-sigma uncertainty), the problem turns into one of minimizing the sum of squared errors, subject to 3 linear constraint functions. Therefore, we can attack it using scipy.optimize.fmin_slsqp():
In [19]:
import numpy as np
import scipy.optimize as so

def eq_f1(x):
    return (x*np.array([0.5, 1.0, 1.5, 2.0])).sum() - 8

def eq_f2(x):
    return (x*np.array([0.0, 0.0, 1.0, 1.0])).sum() - 4

def eq_f3(x):
    return (x*np.array([1.0, 1.0, 0.0, 0.0])).sum() - 1

def error_f(x):
    error = (x - np.array([0.246, 0.749, 1.738, 2.248]))/np.array([0.007, 0.010, 0.009, 0.007])
    return (error*error).sum()
In [20]:
so.fmin_slsqp(error_f, np.array([0.246, 0.749, 1.738, 2.248]), eqcons=[eq_f1, eq_f2, eq_f3])
Optimization terminated successfully. (Exit mode 0)
Current function value: 2.17576389592
Iterations: 4
Function evaluations: 32
Gradient evaluations: 4
Out[20]:
array([ 0.25056582, 0.74943418, 1.74943418, 2.25056582])
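As a quick, self-contained sanity check (numbers copied from the output above), the returned point satisfies the three equality constraints to within rounding:
import numpy as np

x_opt = np.array([0.25056582, 0.74943418, 1.74943418, 2.25056582])
A = np.array([[0.5, 1.0, 1.5, 2.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
b = np.array([8.0, 4.0, 1.0])
print(A @ x_opt - b)  # residuals of the three constraints, all ~0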
It appears to me that I have a very similar problem. I am relatively new to Python and have used it mostly to sort and reduce data with pandas.
I have a set of linear equations where I want to find the best-fit parameters. However, the dataset has known uncertainties that need to be considered (given in parentheses):
x1*99(1)+x2*45(1)=52(0.2)
x1*1(0.5)+x2*16(1)=15(0.1)
Moreover there are constraints:
x1>=0
x2>=0
x1+x2=1
My approach would be to treat the equations as constraints and minimize the sum of squared residuals, as shown in the example above.
Solving this without uncertainties is not the issue. I am asking for a hint on how to account for the uncertainties while finding the best-fit parameters.
As given, the problem has no solution. This is because if the inputs x1, x2, x3 and x4 are gaussian, then the outputs:
y1 = 0.5 * x1 + 1.0 * x2 + 1.5 * x3 + 2.0 * x4 - 8.0
y2 = 0.0 * x1 + 0.0 * x2 + 1.0 * x3 + 1.0 * x4 - 4.0
y3 = 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + 0.0 * x4 - 1.0
are also gaussian.
Assuming that x1, x2, x3 and x4 are independent random variables, this is easy to see with OpenTURNS:
import openturns as ot
x1 = ot.Normal(0.246, 0.007)
x2 = ot.Normal(0.749, 0.010)
x3 = ot.Normal(1.738, 0.009)
x4 = ot.Normal(2.248, 0.007)
y1 = 0.5 * x1 + 1.0 * x2 + 1.5 * x3 + 2.0 * x4 - 8.0
y2 = 0.0 * x1 + 0.0 * x2 + 1.0 * x3 + 1.0 * x4 - 4.0
y3 = 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + 0.0 * x4 - 1.0
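As a small addition (not in the original script), printing each output's mean and standard deviation makes the same point numerically; with the values above, zero lies within roughly one standard deviation of each mean:
# assumes x1..x4 and y1..y3 defined as above
for name, y in [("y1", y1), ("y2", y2), ("y3", y3)]:
    print(name, y.getMean(), y.getStandardDeviation())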
The following script produces the graph:
graph1 = y1.drawPDF()
graph1.setLegends(["y1"])
graph2 = y2.drawPDF()
graph2.setLegends(["y2"])
graph3 = y3.drawPDF()
graph3.setLegends(["y3"])
graph1.add(graph2)
graph1.add(graph3)
graph1.setColors(["dodgerblue3",
"darkorange1",
"forestgreen"])
graph1.setXTitle("Y")
The previous script produces the following figure.
[Figure: overlaid PDFs of y1, y2 and y3]
Given where 0.0 falls in these distributions, I would say that solving the equations exactly is mathematically impossible, although the observations are physically consistent with them.
Actually, I guess that the gaussian distributions you gave for x1, ..., x4 are estimated from data. So I would rather reformulate the problem as follows:
Given a sample of observed values of x1, x2, x3, x4, what are the values of e1, e2, e3 such that:
y1 = 0.5 * x1 + 1.0 * x2 + 1.5 * x3 + 2.0 * x4 - 8 + e1 = 0
y2 = 0.0 * x1 + 0.0 * x2 + 1.0 * x3 + 1.0 * x4 - 4 + e2 = 0
y3 = 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + 0.0 * x4 - 1 + e3 = 0
This turns the problem into an inversion problem, which can be solved by calibrating e1, e2, e3. Furthermore, given the finite sample size of x1, ..., x4, we might want to produce the distribution of e1, e2, e3. This can be done by bootstrapping the input/output pairs (x, y): the distribution of e1, e2, e3 then reflects the variability of these parameters depending on the sample at hand.
First, we have to generate a sample from the distribution (I suppose that you have this sample, but did not publish it so far):
distribution = ot.ComposedDistribution([x1, x2, x3, x4])
sampleSize = 10
xobs = distribution.getSample(sampleSize)
Then we define the model:
formulas = [
"y1 := 0.5 * x1 + 1.0 * x2 + 1.5 * x3 + 2.0 * x4 + e1 - 8.0",
"y2 := 0.0 * x1 + 0.0 * x2 + 1.0 * x3 + 1.0 * x4 + e2 - 4.0",
"y3 := 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + 0.0 * x4 + e3 - 1.0"
]
program = ";".join(formulas)
g = ot.SymbolicFunction(["x1", "x2", "x3", "x4", "e1", "e2", "e3"],
["y1", "y2", "y3"],
program)
And set the observed outputs, which form a sample of zeros:
yobs = ot.Sample(sampleSize, 3)
We start with initial values equal to zero, and define the function to calibrate:
e1Initial = 0.0
e2Initial = 0.0
e3Initial = 0.0
thetaPrior = ot.Point([e1Initial,e2Initial,e3Initial])
calibratedIndices = [4, 5, 6]
mycf = ot.ParametricFunction(g, calibratedIndices, thetaPrior)
Then we can calibrate the model:
algo = ot.NonLinearLeastSquaresCalibration(mycf, xobs, yobs, thetaPrior)
algo.run()
calibrationResult = algo.getResult()
print(calibrationResult.getParameterMAP())
This prints:
[0.0265988,0.0153057,0.00495758]
This means that the errors e1, e2, e3 are rather small.
We can compute a confidence interval:
thetaPosterior = calibrationResult.getParameterPosterior()
print(thetaPosterior.computeBilateralConfidenceIntervalWithMarginalProbability(0.95)[0])
This prints:
[0.0110046, 0.0404756]
[0.00921992, 0.0210059]
[-0.00601084, 0.0156665]
The third parameter e3 might be zero, but e1 and e2 cannot.
Finally, we can get the distribution of the errors:
thetaPosterior = calibrationResult.getParameterPosterior()
and draw it:
graph1 = thetaPosterior.getMarginal(0).drawPDF()
graph2 = thetaPosterior.getMarginal(1).drawPDF()
graph3 = thetaPosterior.getMarginal(2).drawPDF()
graph1.add(graph2)
graph1.add(graph3)
graph1.setColors(["dodgerblue3",
"darkorange1",
"forestgreen"])
graph1
This produces the following figure.
[Figure: posterior PDFs of e1, e2 and e3]
This shows that e3 might be zero given the variability in the observed inputs x1, ..., x4, but e1 and e2 cannot be zero. The conclusion for this sample is that the third equation is approximately solved by the observed values of x1, ..., x4, but not the first two equations.