A problem I'm currently working on requires me to optimize some dimension parameters for a structure in order to prevent buckling while not being over-engineered. I've been able to solve it using iterative (semi-brute-force) methods; however, I'm wondering if there is a way to implement a gradient descent method to optimize the parameters. More background is given below:
Let's say we are trying to optimize three length/thickness parameters, (t1, t2, t3).
We initialize these parameters with some random guess (t1, t2, t3)_g. Through some transformation of each of these parameters (weights and biases), the aim is to obtain (t1, t2, t3)_ideal such that three main criteria (R1, R2, R3)_ideal are met. The criteria are calculated by using (t1, t2, t3)_i as inputs to some structural equations, where the subscript i denotes the values at iteration i. Following this, some kind of loss function could be implemented to calculate the error, (R1, R2, R3)_i - (R1, R2, R3)_ideal.
My confusion lies in the fact that traditionally, (t1, t2, t3)_ideal would be known, the cost would be a function of the error between (t1, t2, t3)_ideal and (t1, t2, t3)_i, and subsequent iterations would follow. However, in a case where (t1, t2, t3)_ideal is unknown and the known targets (R1, R2, R3)_ideal are an indirect function of the inputs, how would gradient descent be implemented? How would minimizing the cost relate to the step change in (t1, t2, t3)_i?
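To make my thinking concrete, here is a rough sketch of what I have in mind; the structural equations below are just a stand-in, and the gradient is estimated numerically since I don't have analytic derivatives:

import numpy as np

R_ideal = np.array([1.0, 2.0, 3.0])   # known target criteria (illustrative values)

def structural_response(t):
    # stand-in for the real structural equations mapping (t1, t2, t3) -> (R1, R2, R3)
    return np.array([t[0] * t[1], t[1] + t[2] ** 2, t[0] + t[2]])

def loss(t):
    # squared error between the computed criteria and the targets
    return np.sum((structural_response(t) - R_ideal) ** 2)

def grad(t, h=1e-6):
    # central-difference estimate of dLoss/dt
    g = np.zeros_like(t)
    for j in range(len(t)):
        step = np.zeros_like(t)
        step[j] = h
        g[j] = (loss(t + step) - loss(t - step)) / (2 * h)
    return g

t = np.array([0.5, 0.5, 0.5])          # initial guess (t1, t2, t3)_g
for _ in range(1000):
    t = t - 0.01 * grad(t)             # gradient-descent step on the loss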
P.S.: Sorry about the formatting; I cannot embed LaTeX images until my reputation is higher.
I'm having some difficulty understanding how the constraints you're describing are calculated. I'd imagine the quantity you're trying to minimize is the total material used or the cost of construction, not the "error" you describe?
I don't know the details of your specific problem, but it's probably a safe bet that the cost function isn't convex. Any gradient-based optimization algorithm carries the risk of getting stuck in a local minimum. If the cost function isn't computationally intensive to evaluate then I'd recommend you use an algorithm like differential evolution that starts with a population of initial guesses scattered throughout the parameter space. SciPy has a nice implementation of it that allows for constraints (and includes a final gradient-based "polishing" step).
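For instance, a minimal sketch along those lines; the objective and constraint are placeholders for your real structural equations:

import numpy as np
from scipy.optimize import differential_evolution, NonlinearConstraint

def material_cost(t):
    # placeholder objective: total material used as a function of (t1, t2, t3)
    return t[0] + t[1] + t[2]

def buckling_margin(t):
    # placeholder safety margin: must stay >= 0 for the structure not to buckle
    return t[0] * t[1] * t[2] - 1.0

bounds = [(0.1, 10.0)] * 3
no_buckling = NonlinearConstraint(buckling_margin, 0.0, np.inf)

# polish=True (the default) runs a final gradient-based local minimization
result = differential_evolution(material_cost, bounds,
                                constraints=(no_buckling,),
                                polish=True, seed=0)
print(result.x, result.fun)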
I implemented the softmax function and later discovered that it has to be stabilized in order to be numerically stable (duh). And now it is again not stable, because even after subtracting max(x) from my vector, the given vector values are still too large to be used as powers of e. Here is the picture of the code I used to pinpoint the bug; vector here is a sample output vector from forward propagation:
We can clearly see that the values are too large, and instead of probabilities I get really small numbers, which leads to a small error, which leads to vanishing gradients, and finally makes the network unable to learn.
You are completely right: just translating the mathematical definition of softmax can make it unstable, which is why you have to subtract the maximum of x before doing any computation.
Your implementation is correct, and vanishing/exploding gradients are an independent problem that you might encounter depending on what kind of neural network you intend to use.
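For reference, a minimal sketch of the stabilized version, assuming a 1-D NumPy vector:

import numpy as np

def stable_softmax(x):
    # shifting by max(x) leaves the result unchanged but keeps every exponent <= 0,
    # so np.exp can never overflow (it can only underflow harmlessly to 0)
    z = x - np.max(x)
    exps = np.exp(z)
    return exps / np.sum(exps)

print(stable_softmax(np.array([1000.0, 2000.0, 3000.0])))  # [0., 0., 1.]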
I am using the lifelines package to do Cox regression. After trying to fit the model, I checked the CPH assumptions for any possible violations, and it returned some problematic variables along with suggested solutions.
One of the solutions I would like to try is the one suggested here:
https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
However, the example written here uses CoxTimeVaryingFitter which, unlike CoxPHFitter, does not have a concordance score, which would help me gauge the model's performance. Additionally, CoxTimeVaryingFitter does not have the check_assumptions feature. Does this mean that by putting the data into episodic format, all the assumptions are automatically satisfied?
Alternatively, after reading a SAS textbook on survival analysis, it seems their solution is to create the interaction term directly (multiplying the problematic variable by the survival time) without changing to episodic format (as shown in the link). This way, I was hoping to keep using CoxPHFitter due to its model-scoring capability.
However, after trying this alternative, when I call check_assumptions again on the model with the time-interaction variable, the CPH assumption on the time-interaction variable is violated.
Now I am torn between:
Using CoxTimeVaryingFitter without knowing what the model performance is (seems like a bad idea)
Using CoxPHFitter, but the assumption is violated on the time-interaction variable (which inherently does not seem to fix the problem)
Any help in resolving this confusion is greatly appreciated.
Here is one suggestion:
If you choose the CoxTimeVaryingFitter, then you need to somehow evaluate the quality of your model. Here is one way. Use the regression coefficients B and write down your model. I'll write it as S(t;x;B), where S is an estimator of the survival, t is the time, and x is a vector of covariates (age, wage, education, etc.). Now, for every individual i, you have a vector of covariates x_i. Thus, you have the survival function for each individual. Consequently, you can predict which individual will 'fail' first, which 'second', and so on. This produces a (predicted) ranking of survival. However, you know the real ranking of survival since you know the failure times or times-to-event. Now, quantify how many pairs (predicted survival, true survival) share the same ranking. In essence, you would be estimating the concordance.
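A minimal sketch of that calculation with lifelines' own utility; the column names and the risk scores are assumptions about your data:

import pandas as pd
from lifelines.utils import concordance_index

# hypothetical data: observed times, a risk score from your fitted model, event flags
df = pd.DataFrame({
    "duration": [5.0, 8.0, 2.0, 11.0],
    "risk":     [0.7, 0.2, 1.5, -0.3],   # e.g. the linear predictor X @ B
    "event":    [1, 1, 0, 1],
})

# higher risk should mean shorter survival, so negate it: concordance_index
# expects scores where larger values correspond to longer survival
c_index = concordance_index(df["duration"], -df["risk"], df["event"])
print(c_index)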
If you opt to use CoxPHFitter, I don't think it was meant to be used with time-varying covariates. Instead, you could use two other approaches. One is to stratify the variable, i.e., cph.fit(dataframe, time_column, event_column, strata=['your variable to stratify']). The downside is that you no longer obtain a hazard ratio for that variable. The other approach is to use splines. Both of these methods are explained here.
I have come upon the following optimization problem:
The target function is a multivariate, non-differentiable function that takes a list of scalars as its argument and returns a scalar. It is non-differentiable in the sense that the computation within the function is based on pandas and a series of rolling, std, etc. operations.
The pseudo code is below:
def target_function(x: list) -> float:
# calculations
return output
In addition, each component of the x argument has its own bounds, defined as a tuple (min, max). How should I use the scipy.optimize library to find the global minimum of this function? Could any other libraries help?
I have already tried scipy.optimize.brute, which took forever, and scipy.optimize.minimize, which never produced a seemingly correct answer.
basinhopping, brute, and differential_evolution are the methods available for global optimization. As you've already discovered, brute-force global optimization is not going to be particularly efficient.
Differential evolution is a stochastic method that should do better than brute-force, but may still require a large number of objective function evaluations. If you want to use it, you should play with the parameters and see what will work best for your problem. This tends to work better than other methods if you know that your objective function is not "smooth": there could be discontinuities in the function or its derivatives.
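For example, a minimal sketch; the objective body is a stand-in for your pandas-based computation:

import numpy as np
import pandas as pd
from scipy.optimize import differential_evolution

def target_function(x):
    # stand-in for the real pandas-based calculation (rolling, std, etc.)
    series = pd.Series(np.sin(np.linspace(x[0], x[1], 200)))
    return float(series.rolling(20).std().mean())

bounds = [(0.0, 5.0), (5.0, 15.0)]   # one (min, max) tuple per component of x

# increase popsize / loosen tol as needed for a noisy, non-smooth objective
result = differential_evolution(target_function, bounds,
                                popsize=25, tol=1e-6, seed=0)
print(result.x, result.fun)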
Basin-hopping, on the other hand, makes stochastic jumps but also uses local relaxation after each jump. This is useful if your objective function has many local minima, but due to the local relaxation used, the function should be smooth. If you can't easily get at the gradient of your function, you could still try basin-hopping with one of the local minimizers which doesn't require this information.
The advantage of the scipy.optimize.basinhopping routine is that it is very customizable. You can use take_step to define a custom random jump, accept_test to override the test used for deciding whether to proceed with or discard the results of a random jump and relaxation, and minimizer_kwargs to adjust the local minimization behavior. For example, you might override take_step to stay within your bounds, and then select perhaps the L-BFGS-B minimizer, which can numerically estimate your function's gradient as well as take bounds on the parameters. L-BFGS-B does work better if you give it a gradient, but I've used it without one and it still is able to minimize well. Be sure to read about all of the parameters on the local and global optimization routines and adjust things like tolerances as acceptable to improve performance.
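Putting those pieces together, a minimal sketch; the objective and bounds are illustrative:

import numpy as np
from scipy.optimize import basinhopping

def target_function(x):
    # illustrative multimodal objective; replace with your own
    return np.sum(x ** 2) + np.sum(np.sin(5.0 * x))

bounds = [(-2.0, 2.0), (-2.0, 2.0)]

class BoundedStep:
    # custom take_step that keeps the random jumps inside the bounds
    def __init__(self, bounds, stepsize=0.5):
        self.limits = np.array(bounds)
        self.stepsize = stepsize
    def __call__(self, x):
        x = x + np.random.uniform(-self.stepsize, self.stepsize, x.shape)
        return np.clip(x, self.limits[:, 0], self.limits[:, 1])

# L-BFGS-B estimates the gradient numerically and respects the bounds
result = basinhopping(target_function, x0=np.zeros(2), niter=100,
                      take_step=BoundedStep(bounds),
                      minimizer_kwargs={"method": "L-BFGS-B", "bounds": bounds})
print(result.x, result.fun)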
I've been running some linear/logistic regression models recently, and I wanted to know how you can output the cost function for each iteration. One of the parameters in scikit-learn's LinearRegression is max_iter, but in reality you need to see cost vs. iteration to find out what this value really needs to be, i.e., whether the benefit is worth the computational time of running more iterations, etc.
I'm sure I'm missing something, but I would have thought there was a method that outputted this information?
Thanks in advance!
When fitting any estimator, one has to understand whether there is any iteration (implying the computation of a cost function) or an analytical exact solution.
Linear Regression
In fact, linear regression, i.e., minimization of the ordinary least squares, is not an algorithm but a minimization problem that can be solved using different techniques. Without getting into the details of the statistical part described here:
There are at least three methods used in practice for computing least-squares solutions: the normal equations, QR decomposition, and singular value decomposition.
As far as I got into the details of the code, it seems that the computational time is spent obtaining the analytical exact solution, not iterating over a cost function. But I bet this depends on your system being under-, well-, or over-determined, as well as on the language and library you are using.
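To illustrate the difference, a minimal sketch of such a direct (non-iterative) solve with NumPy, on synthetic data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # design matrix
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# closed-form least squares: no iterations, hence no cost-per-iteration trace
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(coef)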
Logistic Regression
Like linear regression, logistic regression is a minimization problem that can be solved using different techniques, which for scikit-learn are: newton-cg, lbfgs, liblinear, and sag.
As you mentioned, sklearn.linear_model.LogisticRegression includes the max_iter argument, meaning it involves iterations*. Those are controlled either because the updated argument no longer changes (up to a certain epsilon value) or because the maximum number of iterations is reached.
*As mentioned in the doc, it involves iterations only for some of the solvers:
Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.
In fact, each solver involves its own implementation, such as here for the liblinear solver.
I would recommend using the verbose argument, maybe set to 2 or 3 to get the most output. Depending on the solver, it might print the cost function error. However, I don't understand how you are planning to use this information.
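For example, a minimal sketch; the solver progress is printed to the console:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# verbose > 0 prints solver progress for the liblinear and lbfgs solvers
clf = LogisticRegression(solver="lbfgs", max_iter=200, verbose=2)
clf.fit(X, y)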
Another solution might be to code your own solver and print the cost function at each iteration.
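A minimal sketch of that idea, using plain batch gradient descent on the logistic loss; everything here is illustrative, not scikit-learn's actual solver:

import numpy as np

def logistic_gd(X, y, lr=0.1, n_iter=50):
    # batch gradient descent on the cross-entropy loss, printing the cost each iteration
    w = np.zeros(X.shape[1])
    for i in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted probabilities
        p = np.clip(p, 1e-12, 1.0 - 1e-12)           # guard against log(0)
        cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(f"iteration {i}: cost = {cost:.6f}")
        w -= lr * (X.T @ (p - y)) / len(y)           # gradient step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = logistic_gd(X, y)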
Curiosity killed the cat, but I checked the scikit-learn source code, and it involves much more.
First, sklearn.linear_model.LinearRegression uses fit to train its parameters.
Then, in the source code of fit, it uses NumPy's ordinary least squares (source).
Finally, NumPy's least-squares function uses scipy.linalg.lapack.dgelsd, a wrapper to the LAPACK (Linear Algebra PACKage) routine DGELSD, written in Fortran (source).
That is to say, getting at the error calculation, if there is any, is not easy for scikit-learn developers. However, for the various uses of LinearRegression (and many more) I have had, the trade-off between cost function and iteration time is well addressed.
I'm trying to model a monotonic function that is bounded by y_min and y_max and satisfies two value pairs (x0, y0), (x1, y1) within that range.
Is there some kind of package in Python that might help in solving for the parameters of such a function?
Alternatively, if someone knows of a good source or paper on modeling such a function, I'd be much obliged...
(For anyone interested, I'm actually trying to model a market response curve of the number of buyers vs. the price of a product; this is bounded by zero and the maximal demand.)
Well, it's still not clear to me what you want, but the classic function of that form is the sigmoid (it's used in neural nets, for example). In a more flexible form, that becomes the generalized logistic function, which will fit your (x, y) constraints with suitable parameters; see the sketch at the end of this answer. There's also the Gompertz curve, which is similar.
However, those are all defined over an open domain, and I doubt you have negative numbers of buyers. If that's an issue (you may not care, as they get very close to zero), you could try transforming the number of buyers (taking the log kind of works, but only if you can have a fraction of a buyer...).
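For what it's worth, here is a minimal sketch of solving for the generalized logistic's two free parameters once y_min and y_max are fixed; the numbers are made up:

import numpy as np
from scipy.optimize import fsolve

y_min, y_max = 0.0, 100.0                          # zero and the maximal demand
(xa, ya), (xb, yb) = (10.0, 80.0), (20.0, 30.0)    # the two known (x, y) pairs

def curve(x, k, m):
    # monotonic and bounded between y_min and y_max; k < 0 gives a decreasing curve
    return y_min + (y_max - y_min) / (1.0 + np.exp(-k * (x - m)))

def residuals(params):
    k, m = params
    return [curve(xa, k, m) - ya, curve(xb, k, m) - yb]

k, m = fsolve(residuals, [-0.1, (xa + xb) / 2.0])
print(k, m)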