How to access solution for dual simplex solver? - python

I have an objective function with several hundred quadratic terms which I would like to minimize; in this case I try to minimize the absolute distance between several variables. So the structure of my problem looks like this (highly simplified):
Minimize
obj: [ a^2 - 2 a * b + b^2 ] / 2
Subject To
c1: a + b >= 10
c2: a <= 100
End
I use the Python API to solve the problem in the following way:
import cplex
cpx = cplex.Cplex()
cpx.read('quadratic_obj_so.lp')
# use the dual simplex
cpx.parameters.lpmethod.set(cpx.parameters.lpmethod.values.dual)
cpx.solve()
print(cpx.solution.get_values()[0:15])
print(cpx.solution.status[cpx.solution.get_status()])
print(cpx.solution.get_objective_value())
And for the above example I then receive (showing only iterations 16-18):
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
16 1.4492800e-19 -1.0579911e-07 3.81e-14 7.11e-15 5.17e-25
17 9.0580247e-21 -2.6449779e-08 1.91e-14 3.55e-15 2.33e-25
18 5.6612645e-22 -6.6124446e-09 5.45e-14 7.11e-15 6.46e-27
[73.11695794600045, 73.11695794603409]
optimal
0.0
So a and b are (essentially) equal, which makes sense since I try to minimize their distance, and the constraints are clearly fulfilled.
However, my actual problem is far more complex and I receive:
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
92 1.4468496e+06 1.2138985e+06 1.80e+02 2.64e-12 5.17e-02
93 1.4468523e+06 1.2138969e+06 2.23e+02 2.17e-12 1.08e-02
94 1.4468541e+06 1.2138945e+06 2.93e+02 2.31e-12 5.62e-02
* 1.4457132e+06 1.2138598e+06 7.75e+00 7.61e-09 2.76e-02
num_best
1445714.46525
I now have several questions regarding the output, which are closely connected:
1) Clearly, the objective value that is printed is not the one from the dual simplex. Why is that, given that I set the solver to the dual simplex?!
2) How do I now access the results of the dual simplex? Since that objective value is smaller, I would be more interested in those results.
3) Does the num_best status guarantee that all the constraints are met, i.e. is the solution valid but just not guaranteed to be optimal?
4) Primal Obj and Dual Obj differ quite a lot. Is there any strategy to minimize their difference?

To the best of my knowledge, get_objective_value always returns the best primal bound (regardless of the lpmethod).
Information about the dual solution can be retrieved with get_dual_values.
The num_best solution status means that a solution is available, but there is no proof of optimality (see here). This is probably the most important point with regards to the rest of the questions here.
You could try turning on the numerical emphasis parameter to see if that helps. There are also various tolerances you can adjust (e.g., optimality tolerance).
Note that all of the links I've used above are for the C Callable Library (which the Python API calls internally) for CPLEX 12.6.3.
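For reference, here is a minimal sketch of how these pieces could fit together in the Python API; the parameter paths for numerical emphasis and the optimality tolerance are my assumption and should be checked against your CPLEX version:
import cplex

cpx = cplex.Cplex()
cpx.read('quadratic_obj_so.lp')

# Ask for the dual simplex, as in the question.
cpx.parameters.lpmethod.set(cpx.parameters.lpmethod.values.dual)

# Possibly helpful when primal and dual objectives disagree
# (parameter paths assumed, verify against your CPLEX version):
cpx.parameters.emphasis.numerical.set(1)
cpx.parameters.simplex.tolerances.optimality.set(1e-9)

cpx.solve()

print(cpx.solution.status[cpx.solution.get_status()])  # e.g. 'optimal' or 'num_best'
print(cpx.solution.get_objective_value())              # best primal objective
print(cpx.solution.get_dual_values())                  # dual values, one per constraint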

Related

Variation on linear programming?

I'm trying to find an existing algorithm for the following problem:
E.g., let's say we have 3 variables, x, y, z (all must be integers).
I want to find values for all variables that MUST match some constraints, such as x+y<4, x<=50, z>x, etc.
In addition, there are extra POSSIBLE constraints, like y>=20, etc. (same as before).
The objective function (whose value I'm interested in maximizing) is the number of EXTRA constraints that are met in the optimal solution (the "must" constraints, plus the requirement that all values are integers, are hard demands; without them there is no valid solution).
If you are using OR-Tools, then since the model is integral I would recommend CP-SAT, as it offers indicator constraints with a nice API.
The API would be:
b = model.NewBoolVar('indicator variable')
model.Add(x + 2 * y >= 5).OnlyEnforceIf(b)
...
model.Maximize(sum(indicator_variables))
To get maximal performance, I would recommend using parallelism.
solver = cp_model.CpSolver()
solver.parameters.log_search_progress = True
solver.parameters.num_search_workers = 8 # or more on a bigger computer
status = solver.Solve(model)
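Putting those pieces together, a self-contained sketch could look like the following (the concrete variables and the two extra constraints are made up purely for illustration):
from ortools.sat.python import cp_model

model = cp_model.CpModel()

# Hard ("must") constraints on the integer variables.
x = model.NewIntVar(0, 50, 'x')
y = model.NewIntVar(0, 100, 'y')
z = model.NewIntVar(0, 100, 'z')
model.Add(x + y < 4)
model.Add(z > x)

# "Extra" constraints: each one is guarded by an indicator variable and only
# enforced if that indicator is true; the objective counts how many of them
# the solver manages to satisfy. (These extra constraints are made up.)
extras = [y >= 20, z >= 30]
indicators = []
for i, ct in enumerate(extras):
    b = model.NewBoolVar('extra_%d' % i)
    model.Add(ct).OnlyEnforceIf(b)
    indicators.append(b)
model.Maximize(sum(indicators))

solver = cp_model.CpSolver()
solver.parameters.num_search_workers = 8
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.Value(x), solver.Value(y), solver.Value(z))
    print('extras satisfied:', [solver.Value(b) for b in indicators])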

Compare decision variables to floating values

I'm currently working on a trajectory optimization problem that involves binary actuators. In order to avoid solving an MINLP I do not simply optimize over the states and control inputs but instead assume that each of the binary actuators alternates between the states "on" and "off" and optimize over the durations of those intervals. I will call the array containing these decision variables h (an N by 2 matrix in the particular case below).
A minimal example using a double integrator that has two control inputs that enact a positive or negative force on the system respectively would be:
Here I model the state trajectory as a train of 3rd-order polynomials.
I particularly do not want to merge these actuators into one with the states -1,0,1 since the more general system I'd like to apply this to also has more binary actuators. I add some additional constraints such as connecting the polynomials continuously and differentiably; enforce that the sum of all intervals is equal to the desired final time; enforce initial and final state constraints and finally enforce the dynamics of the system.
My initial idea was to then enforce the dynamics at constant intervals, i.e.:
However, the issue here is that any of the actuators could be in any arbitrary interval for some time t. Since the intervals can shrink to duration zero one actuator might be in the last interval while the other one remains in the first. I.e. the value of a decision variable (time duration) changes which decision variables are dependent on each other. This is more or less manifested in drake by the fact that I cannot do a comparison such as Tau < t if Tau is a drake expression and t is some number. The code snippet is:
# Enforce dynamics at the end of each control interval
for t in np.arange(0, Tf, dt_dyn):
    # Find the index of the interval that is active for each actuator
    t_ctrl = np.cumsum(h, axis=0)
    intervals = (t_ctrl < t)
    idxs = np.sum(intervals, axis=0)
    # If the idx is even the actuator is off, otherwise it's on
    prog.AddConstraint(eq(qdd(q_a, t, dt_state),
                          continuous_dynamics(q(q_a, t, dt_state),
                                              qd(q_a, t, dt_state),
                                              [idxs[0] % 2, idxs[1] % 2])))
and the resulting error message:
Traceback (most recent call last):
  File "test.py", line 92, in <module>
    intervals = (t_ctrl < t)
RuntimeError: You should not call `__bool__` / `__nonzero__` on `Formula`. If you are trying to make a map with `Variable`, `Expression`, or `Polynomial` as keys (and then access the map in Python), please use `pydrake.common.containers.EqualToDict`.
In the end, my question is more conceptual than technical: does Drake support this "dependence of the dependency structure on the decision-variable values" in some other way? Or is there a different way I can transcribe my problem to avoid the ambiguity about which interval each actuator is in at a given time?
I've also linked to the overall script that implements the minimal example here
The immediate problem is that intervals = (t_ctrl < t) is a vector of dtype Formula, not a vector of dtype Variable(type=BINARY), so you can't actually sum it up. To do an arithmetic sum, you'd need to change that line to an np.vectorize-wrapped function that calls something like if_then_else(t_argument < t_constant, 0.0, 1.0) in order to have it be an Expression-valued vector, at which point it would be summable.
That won't actually help you, though, since you cannot do % (modular) arithmetic on symbolic expressions anyway, so the % 2.0 == 0.0 stuff will raise an exception once you make it that far.
I suspect that you'll need a different approach to encoding the problem into variables, but unfortunately an answer there is beyond my skill level at the moment.
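For completeness, a rough sketch of the np.vectorize / if_then_else rewrite described above (the helper name before is made up, the indicator is written so that it is 1.0 exactly where the original comparison would be True, and the % arithmetic further down would still fail):
import numpy as np
from pydrake.symbolic import Expression, if_then_else

def before(tau, t_const):
    # 1.0 if the (symbolic) cumulative time tau is below the fixed time t_const,
    # 0.0 otherwise -- an Expression, not a Formula, so sums are allowed.
    return if_then_else(tau < t_const, Expression(1.0), Expression(0.0))

before_vec = np.vectorize(before)

# In place of `intervals = (t_ctrl < t)` one would then write:
# intervals = before_vec(t_ctrl, t)   # Expression-valued array
# idxs = np.sum(intervals, axis=0)    # summable, unlike a Formula array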

Problem with roots of a non-linear equation

I have a hyperbolic function and I need to find its zero. I have tried various classical methods (bisection, Newton and so on).
Second derivatives are continuous but not accessible analytically, so I have to exclude methods using them.
For the purposes of my application, Newton's method is the only one providing sufficient speed, but it's relatively unstable if I'm not close enough to the actual zero. Here is a simple screenshot:
The zero is somewhere around 0.05, and since the function diverges at 0, if I take an initial guess greater than the minimum location by a certain extent, then I obviously have problems with the asymptote.
Is there a more stable method in this case that would eventually offer speeds comparable to Newton?
I also thought of transforming the function into an equivalent, better-behaved function with the same zero and only then applying Newton, but I don't really know which transformations I could use.
Any help would be appreciated.
Dekker's or Brent's method should be almost as fast as Newton. If you want something simple to implement yourself, the Illinois variant of the regula-falsi method is also reasonably fast. These are all bracketing methods, so should not leave the domain if the initial interval is inside the domain.
import numpy as np

def illinois(f, a, b, tol=1e-8):
    '''regula falsi resp. false position method with
    the Illinois anti-stalling variation'''
    fa = f(a)
    fb = f(b)
    if abs(fa) < abs(fb): a, fa, b, fb = b, fb, a, fa
    while np.abs(b - a) > tol:
        c = (a*fb - b*fa)/(fb - fa)
        fc = f(c)
        if fa*fc < 0:
            fa *= 0.5
        else:
            a, fa = b, fb
        b, fb = c, fc
    return b, fb
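Continuing from the function above, a usage example on a toy stand-in for the described function (diverging at 0, root near 0.05; the bracket is chosen so that f changes sign):
f = lambda x: 1.0 / x - 20.0          # toy stand-in, not the actual function
root, froot = illinois(f, 0.01, 1.0)  # f(0.01) > 0, f(1.0) < 0
print(root, froot)                    # root close to 0.05, residual close to 0

# SciPy's built-in Brent implementation can be used on the same bracket:
from scipy.optimize import brentq
print(brentq(f, 0.01, 1.0))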
How about using log(x) instead of x?
For your case, @sams-studio's answer might work, and I would try that first. In a similar situation - also in a multi-variate context - I used Newton-homotopy methods.
Basically, you limit the Newton step until the absolute value of y is decreasing.
The cheapest way to implement this is to halve the Newton step whenever y increases compared to the last step. After a few steps, you're back at Newton with full second-order convergence.
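A minimal sketch of that step-halving idea, assuming a first derivative df is available (the names here are illustrative):
def damped_newton(f, df, x0, tol=1e-10, max_iter=100):
    x, fx = x0, f(x0)
    for _ in range(max_iter):
        step = fx / df(x)
        lam = 1.0
        # halve the step until |f| actually decreases (with a safety floor)
        while abs(f(x - lam * step)) >= abs(fx) and lam > 1e-8:
            lam *= 0.5
        x -= lam * step
        fx = f(x)
        if abs(fx) < tol:
            break
    return x

# e.g. damped_newton(lambda x: 1.0/x - 20.0, lambda x: -1.0/x**2, 0.5) -> ~0.05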
Disclaimer: If you can bound your solution (you know a maximal x), the answer from @Lutz Lehmann would also be my first choice.

Likeliness of "A" being better than "B" using Poisson distribution

Background
I'm running an A-B test for two campaigns.
I got three step funnels set up for both campaigns.
So far B seems to be better than A, but how do I know when I have gathered enough data points?
Funnel steps
In the data below, there are three steps. Step_1 is the number of users that reached our sign-up page.
Step_2 is the number of users that filled in our sign-up form.
Step_3 is the number of users that confirmed their email address.
Question
How can I calculate the likelihood that A is better than B, or vice versa?
Or more eloquently:
Given an "infinate amount of cases" where we have A:6 and B:8 observations in Step_3 and a conversion rate from Step_1 of A:12.5% and B:13.333...%. In how many of these cases does A end up with a higher conversion rate than B and vice versa?
   Step_1  Step_2  Step_3
A   144.0      18       6
B   135.0      18       8
Rationale
Each user going through the funnel is unaffected by other users.
Each user cannot reach the next step without going through the earlier.
Each user will either stop at a step or continue to the next, giving only two options for each independent observation.
This means a binomial distribution can be used to predict the likeliness of a user converting to the next step.
What I tried so far
So far I've tried using a poisson distribution
from scipy.stats.distributions import poisson
Using poisson.ppf I should somehow be able to say: "The likeliness of A being better than B is 5%; the likeliness of B being better than A is 25%."
Of course I can just plug in some values to the function and go "Hey, this looks great" but I feel like I need to call upon the vast knowledge of the Stacked Oracles of Stack Overflow to make sure I'm doing something statistically sound.
Why Poisson
In my humble understanding of distributions:
The Poisson distribution is a lot like the binomial distribution (scipy.stats.binom), but better suited for predictions involving few observations than its binom big brother.
The Poisson distribution is a binomial distribution, because it asserts two possible outcomes.
The reason binomial distributions are what I want to use is because there are two outcomes in my simulated scenario: either the user proceeds down the funnel, or the user exits. This is the bi in binomial.
The Poisson distribution is based on the assumption that two observations cannot affect each other. So whether user_1 makes it to step_3, step_2 or just to step_1 does not matter for user_2. This is very much the case; they do not know of each other's existence.
Mathematically speaking Binomial is more precise in this case than Poisson. For example, using Poisson you'll get a positive probability of more than 18 of your 18 candidates making the conversion. Poisson owes its popularity to being easier to compute.
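To make that concrete, a quick check with scipy on A's Step_2 -> Step_3 numbers (6 conversions out of 18 candidates):
from scipy.stats import binom, poisson

n, p = 18, 6.0 / 18.0
print(binom(n, p).sf(n))     # 0.0: a binomial with 18 trials can never exceed 18
print(poisson(n * p).sf(n))  # small but positive: Poisson puts mass above 18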
The result also depends on your prior knowledge. For example, if both your outcomes look very high compared to typical conversion rates, then, all else being equal, the difference you see is more significant.
Assuming no prior knowledge, i.e. assuming that every conversion rate between 0 and 1 is equally likely if you know nothing else, the probability of a given conversion rate r once you take into account your observation of 6 out of 18 possible conversions is given by the Beta distribution, in this case Beta(r; 6+1, 18-6+1)
Technically speaking this is not a probability but a likelihood. The difference is the following: a probability describes how often you will observe different outcomes if you compare "parallel universes" that are identical, even though reputable statisticians probably wouldn't use that terminology. A likelihood is the other way round: given a fixed outcome, comparing different universes, how often will you observe a particular kind of universe. (To be even more technical, this description is only fully correct if, as we did, a "flat prior" is assumed.) In your case there are two kinds of universe, one where A is better than B and one where B is better than A.
The probability of B being better than A is then
integral_0^1 dr Beta_cdf(r; 6+1, 18-6+1) x Beta_pdf(r; 8+1, 18-8+1)
You can use scipy.stats.beta and scipy.integrate.quad to calculate that and you'll get a 0.746 probability of B being better than A:
from scipy.stats import beta
from scipy.integrate import quad

quad(lambda r: beta(7, 13).cdf(r) * beta(9, 11).pdf(r), 0, 1)
# (0.7461608994979401, 1.3388378385104094e-08)
To conclude, by this measure the evidence for B being better than A is not very strong.
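As a sanity check, the same number can be reproduced by drawing directly from the two Beta distributions (the same Monte Carlo idea as in the update below):
from scipy.stats import beta

N = 10**6
print((beta(9, 11).rvs(N) > beta(7, 13).rvs(N)).mean())  # ~0.746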
UPDATE:
The two-step case can be solved conceptually similarly, but is a bit more challenging to compute.
We have two steps: 135 / 144 -> 18 -> 8 / 6. Given these numbers, how are the conversion rates for A and B at step 1 and step 2 distributed? Ultimately we are interested in the product of step 1 and step 2 for A and for B. Since I couldn't get scipy to solve the integrals in reasonable time, I fell back to a Monte Carlo scheme. Just draw the conversion rates with the appropriate probabilities N = 10^7 times and count how often B is better than A:
N = 10**7
(beta(9, 11).rvs(N) * beta(19, 118).rvs(N)
 > beta(7, 13).rvs(N) * beta(19, 127).rvs(N)).mean()
The result is very similar to the single step one: 0.742 in favour of B. Again, not very strong evidence.

How to find values below (or above) average

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.
from io import StringIO
myst="""01/01/2016 8781262
01/02/2016 8958598
01/03/2016 8787628
01/04/2016 9770861
01/05/2016 8409410
01/06/2016 8924784
01/07/2016 8597500
01/08/2016 6436862
01/09/2016 1542677
"""
u_cols=['month', 'count']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep='\t', names = u_cols)
Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?
This is easy if I define a limit (e.g. 9 or 10%). But I want the script to decide that for me and return the values if the difference between the lowest and the second-lowest value is more than 5% overall. In this case the September count should be returned.
A very common approach to filtering outliers is to use standard deviation. In this case, we will calculate a zscore which will quickly identify how many standard deviations away from the mean each observation is. We can then filter those observations that are greater than 2 standard deviations. For normally distributed random variables, this should happen approximately 5% of the time.
Define a zscore function
import numpy as np

def zscore(s):
    return (s - np.mean(s)) / np.std(s)
Apply it to the count column
zscore(df['count'])
0 0.414005
1 0.488906
2 0.416694
3 0.831981
4 0.256946
5 0.474624
6 0.336390
7 -0.576197
8 -2.643349
Name: count, dtype: float64
Notice that the September observation is 2.6 standard deviations away.
Use abs and gt to identify outliers
zscore(df['count']).abs().gt(2)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
Name: count, dtype: bool
Again, September comes back true.
Tie it all together to filter your original dataframe
df[zscore(df['count']).abs().gt(2)]
filter the other way
df[zscore(df['count']).abs().le(2)]
First of all, the "way below or too high" concept you refer to is known as an outlier, and quoting Wikipedia (not the best source),
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
But on the other side:
In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.
So in my opinion this boils down to the question of whether it is possible to make assumptions about the nature of your data, to be able to automate such decisions.
STRAIGHTFORWARD APPROACH
If you are lucky enough to have a relatively big sample size, and your different samples aren't correlated, you can apply the central limit theorem, which states that your values will follow a normal distribution (see this for a python-related explanation).
In this context, you may be able to quickly get the mean value and standard deviation of the given dataset. And by applying the corresponding function (with these two parameters) to each given value you can calculate its probability of belonging to the "cluster" (see this stackoverflow post for a possible python solution).
Then you do have to put a lower bound, since this distribution returns a 0% probability only when a point is infinitely far away from the mean value. But the good thing is that (if the assumptions are true) this bound will nicely adapt to each different dataset, because of its exponential, normalized nature. This bound is typically expressed in sigma units, and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, dedicated to the discovery of the Higgs boson, was granted after a 5-sigma level was reached; quoting the link:
High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
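As a rough sketch of that sigma-bound idea on the df from the question (assuming the normal model is reasonable, which is a strong assumption for only nine data points):
import numpy as np
from scipy.stats import norm

counts = df['count']
mu, sigma = counts.mean(), counts.std()

# two-sided tail probability of each month under the fitted normal
p = 2 * norm.sf(np.abs(counts - mu) / sigma)

# apply a 2-sigma bound: anything rarer than ~4.6% is flagged
print(df[p < 2 * norm.sf(2)])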
ALTERNATIVES
If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This approach is a core feature of most machine learning algorithms, which can nicely adapt to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose, which can even fit in a small script (the one I know best is TensorFlow from Google).
In this case I would consider two different approaches, depending again on what your data looks like:
Supervised learning: In case you have a training set at your disposal that states which samples belong and which ones don't (known as labeled data), there are algorithms such as the support vector machine that, although lightweight, can adapt to highly non-linear boundaries amazingly well.
Unsupervised learning: This is probably what I would try first: when you simply have an unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and thus can be highly tweaked and customized to also account for correlations in an effectively infinite number of dimensions, thanks to the kernel trick. To understand the motivations and approach of an ML-based anomaly detector, I would suggest taking a look at Andrew Ng's videos on the matter.
I hope it helps!
Cheers
One way to filter outliers is the interquartile range (IQR, wikipedia), which is the difference between 75% (Q3) and 25% quartile (Q1).
Outliers are defined as data falling below Q1 - k * IQR or above Q3 + k * IQR.
You can select the constant k based on your domain knowledge (a common choice is 1.5).
Given the data, a filter in pandas could look like this:
iqr_filter = pd.DataFrame(df["count"].quantile([0.25, 0.75])).T
iqr_filter["iqr"] = iqr_filter[0.75]-iqr_filter[0.25]
iqr_filter["lo"] = iqr_filter[0.25] - 1.5*iqr_filter["iqr"]
iqr_filter["up"] = iqr_filter[0.75] + 1.5*iqr_filter["iqr"]
df_filtered = df.loc[(df["count"] > iqr_filter["lo"][0]) & (df["count"] < iqr_filter["up"][0]), :]
