I've been running some linear/logistic regression models recently, and I wanted to know how you can output the cost function for each iteration. One of the parameters in scikit-learn's LinearRegression is 'max_iter', but in practice you need to see cost vs. iteration to find out what this value really needs to be, i.e. whether the benefit is worth the computational time of running more iterations.
I'm sure I'm missing something, but I would have thought there was a method that outputs this information?
Thanks in advance!
When fitting any estimator, one first has to understand whether it actually iterates (which implies evaluating a cost function) or computes an exact analytical solution.
Linear Regression
In fact, Linear Regression - i.e. minimization of the ordinary least squares objective - is not an algorithm but a minimization problem that can be solved using different techniques. Without getting into the details of the statistical part described here, those techniques are:
There are at least three methods used in practice for computing least-squares solutions: the normal equations, QR decomposition, and singular value decomposition.
As far as I got into the details of the code, it seems that the computational time is spent computing the exact analytical solution, not iterating over a cost function. But I bet the details depend on whether your system is under-, well- or over-determined, as well as on the language and library you are using.
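To make the "no iterations" point concrete, here is a minimal sketch (toy data and made-up coefficients of my own) showing that the normal equations and NumPy's SVD-based lstsq are just two routes to the same closed-form solution:
import numpy as np
# Toy data: 100 samples, 3 features plus an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=100)
# Route 1: the normal equations, solved directly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Route 2: SVD-based least squares, which is what numpy.linalg.lstsq does.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal, beta_svd))  # True for a well-conditioned X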
Logistic Regression
Like Linear Regression, Logistic Regression is a minimization problem that can be solved using different techniques which, for scikit-learn, are: newton-cg, lbfgs, liblinear and sag.
As you mentioned, sklearn.linear_model.LogisticRegression includes the max_iter argument, meaning it does involve iterations*. These stop either because the updated parameters no longer change - up to a certain epsilon value - or because the maximum number of iterations has been reached.
*As mentioned in the doc, it involves iterations only for some of the solvers:
Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.
In fact, each solver has its own implementation, such as here for the liblinear solver.
I would recommend using the verbose argument, set to 2 or 3 to get the most output. Depending on the solver, it may print the cost function error at each iteration. However, I don't understand how you are planning to use this information.
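For instance, something along these lines (toy data; how much actually gets printed is solver-specific, and n_iter_ only tells you how many iterations were used, not the cost at each one):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# verbose output goes to stdout and its format depends on the solver
clf = LogisticRegression(solver='lbfgs', max_iter=200, verbose=2)
clf.fit(X, y)
print(clf.n_iter_)  # number of iterations actually taken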
Another solution might be to code your own solver and print the cost function at each iteration.
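A minimal sketch of that idea for logistic regression - plain batch gradient descent, not a re-implementation of any scikit-learn solver - printing the log-loss at every iteration:
import numpy as np
def logistic_gd(X, y, lr=0.1, max_iter=100):
    # Plain batch gradient descent for logistic regression; prints the cost
    # (log-loss) at every iteration.
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        p = np.clip(p, 1e-12, 1 - 1e-12)    # avoid log(0)
        cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(f"iter {i:3d}  cost {cost:.6f}")
        w -= lr * X.T @ (p - y) / len(y)    # gradient step
    return w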
Curiosity killed the cat, but I checked the source code of scikit-learn, which involves several more layers.
First, sklearn.linear_model.LinearRegression uses a fit method to train its parameters.
Then, in the source code of fit, it uses NumPy's ordinary least squares routine (source).
Finally, NumPy's least squares function uses the function scipy.linalg.lapack.dgelsd, a wrapper to the LAPACK (Linear Algebra PACKage) routine DGELSD written in Fortran (source).
That is to say, getting into the error calculation, if there is any, is not easy even for scikit-learn developers. However, for the various uses of LinearRegression (and other estimators) I have had, the trade-off between cost function and computation time is well addressed.
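As a quick sanity check of that chain (toy data of my own), all three layers agree on the coefficients:
import numpy as np
from scipy.linalg import lstsq as scipy_lstsq
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.05, size=50)
sk_coef = LinearRegression(fit_intercept=False).fit(X, y).coef_  # scikit-learn layer
np_coef, *_ = np.linalg.lstsq(X, y, rcond=None)                  # NumPy layer
sp_coef, *_ = scipy_lstsq(X, y, lapack_driver='gelsd')           # SciPy wrapper around LAPACK's *gelsd
print(np.allclose(sk_coef, np_coef), np.allclose(np_coef, sp_coef))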
Related
A problem I'm currently working on requires me to optimize some dimension parameters for a structure in order to prevent buckling while still not being over-engineered. I've been able to solve it using iterative (semi-brute-force) methods; however, I'm wondering if there is a way to implement a gradient descent method to optimize the parameters. More background is given below:
Let's say we are trying to optimize three length/thickness parameters, (t1, t2, t3).
We initialize these parameters with some random guess (t1, t2, t3)_g. Through some transformation of each of these parameters (weights and biases), the aim is to obtain (t1, t2, t3)_ideal such that three main criteria (R1, R2, R3)_ideal are met. The criteria are calculated by using (t1, t2, t3)_i as inputs to some structural equations, where i represents the inputs after the first iteration. Following this, some kind of loss function could be implemented to calculate the error, (R1, R2, R3)_i - (R1, R2, R3)_ideal.
My confusion lies in the fact that traditionally, (t1, t2, t3)_ideal would be known and the cost would be a function of the error between (t1, t2, t3)_ideal and (t1, t2, t3)_i, and subsequent iterations would follow. However, in a case where (t1, t2, t3)_ideal is unknown and the targets (R1, R2, R3)_ideal (known) are an indirect function of the inputs, how would gradient descent be implemented? How would minimizing the cost relate to the step change in (t1, t2, t3)_i?
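In case it helps to make the setup concrete, here is a rough sketch of the loop I have in mind, with a made-up placeholder for the structural equations and a purely numerical gradient:
import numpy as np
def structural_response(t):
    # Placeholder for the real structural equations: maps (t1, t2, t3)
    # to the three criteria (R1, R2, R3).
    t1, t2, t3 = t
    return np.array([t1 * t2, t2 + t3**2, t1 / (t3 + 1e-9)])
R_ideal = np.array([2.0, 5.0, 0.5])          # known targets
def loss(t):
    return np.sum((structural_response(t) - R_ideal) ** 2)
def numerical_grad(f, t, eps=1e-6):
    g = np.zeros_like(t)
    for k in range(len(t)):
        dt = np.zeros_like(t)
        dt[k] = eps
        g[k] = (f(t + dt) - f(t - dt)) / (2 * eps)
    return g
t = np.array([1.0, 1.0, 1.0])                # initial guess (t1, t2, t3)_g
for _ in range(500):
    t = t - 0.01 * numerical_grad(loss, t)   # step in t, driven by the error in R
print(t, structural_response(t))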
P.S: Sorry about the formatting, I cannot embed latex images until my reputation is higher.
I'm having some difficulty understanding how the constraints you're describing are calculated. I'd imagine the quantity you're trying to minimize is the total material used or the cost of construction, not the "error" you describe?
I don't know the details of your specific problem, but it's probably a safe bet that the cost function isn't convex. Any gradient-based optimization algorithm carries the risk of getting stuck in a local minimum. If the cost function isn't computationally intensive to evaluate then I'd recommend you use an algorithm like differential evolution that starts with a population of initial guesses scattered throughout the parameter space. SciPy has a nice implementation of it that allows for constraints (and includes a final gradient-based "polishing" step).
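A minimal sketch of that suggestion; the objective and the buckling constraint below are placeholders of my own, not your structural equations:
import numpy as np
from scipy.optimize import differential_evolution, NonlinearConstraint
def material_used(t):
    # placeholder objective: total "volume" of the three members
    return np.sum(t)
def buckling_margin(t):
    # placeholder constraint: must stay >= 1.0 to avoid buckling
    t1, t2, t3 = t
    return t1 * t2**2 / (1.0 + t3)
constraint = NonlinearConstraint(buckling_margin, 1.0, np.inf)
bounds = [(0.1, 10.0)] * 3   # search range for each of t1, t2, t3
result = differential_evolution(material_used, bounds,
                                constraints=(constraint,),
                                polish=True, seed=0)
print(result.x, result.fun)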
Is there any place with a brief description of each of the algorithms available for the method parameter of the minimize function in the lmfit package? Neither there nor in the SciPy documentation is there an explanation of the details of each algorithm. Right now I know I can choose between them, but I don't know which one to choose...
My current problem
I am using lmfit in Python to minimize a function. I want to minimize the function within a finite and predefined range where the function has the following characteristics:
It is almost zero everywhere, which makes it numerically identical to zero almost everywhere.
It has a very, very sharp peak in some point.
The peak can be anywhere within the region.
This makes many minimization algorithms fail. Right now I am using a combination of the brute-force method (method="brute") to find a point close to the peak and then feeding this value to the Nelder-Mead algorithm (method="nelder") to finally perform the minimization. It works approximately 50% of the time, and the other 50% of the time it fails to find the minimum. I wonder if there are better algorithms for cases like this one...
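In plain SciPy terms (lmfit's brute and nelder methods are built on the same scipy.optimize machinery), the combination I am describing looks roughly like this, with a toy stand-in for my function:
import numpy as np
from scipy.optimize import brute
def objective(x):
    # toy stand-in: essentially zero everywhere except one very narrow dip
    return -np.exp(-((x[0] - 3.7) / 0.01) ** 2)
# Coarse grid over the predefined range; brute's default finish function
# (scipy.optimize.fmin, i.e. the Nelder-Mead simplex) then polishes the
# best grid point.
x_min = brute(objective, ranges=[(0.0, 10.0)], Ns=2001)
print(x_min)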
I think it is a fair point that docs for lmfit (such as https://lmfit.github.io/lmfit-py/fitting.html#fit-methods-table) and scipy.optimize (such as https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#optimization-scipy-optimize) do not give detailed mathematical descriptions of the algorithms.
Then again, most of the docs for scipy, numpy, and related libraries describe how to use the methods, but do not describe in much mathematical detail how the algorithms work.
In fairness, the different optimization algorithms share many features and the differences between them can get pretty technical. All of these methods try to minimize some metric (often called "cost" or "residual") by changing the values of parameters for the supplied function.
It sort of takes a textbook (or at least a Wikipedia page) to establish the concepts and mathematical terms used for these methods, and then a paper (or at least a Wikipedia page) to describe how each method differs from the others. So I think the basic answer is to look up the different methods.
I am comparing the numerical results of C++ and Python computations. In C++, I make use of LAPACK's sgels function to compute the coefficients of a linear regression problem. In Python, I use Numpy's linalg.lstsq function for a similar task.
What is the mathematical difference between the methods used by sgels and linalg.lstsq?
What is the expected error (e.g. 6 significant digits) when comparing the results (i.e. the regression coefficients) numerically?
FYI: I am by no means a C++ or Python expert, which makes it difficult to understand what is going on inside the functions.
Taking a look at the source of NumPy, in the file linalg.py, lstsq relies on LAPACK's zgelsd() for complex and dgelsd() for real input. Here are the differences from sgels():
dgelsd() is for double precision while sgels() is for single precision, so there is a difference in precision...
dgels() makes use of the QR factorization of the matrix A and assumes that A has full rank. The condition number of the matrix must be reasonable to get a significant result. See this course for the logic of the method. On the other hand, dgelsd() makes use of the singular value decomposition of A. In particular, A can be rank-deficient, and small singular values are discarded depending on the additional argument rcond or on machine precision. Notice that NumPy's default value for rcond is -1: negative values refer to machine precision. See this course for the logic.
According to LAPACK's benchmark, one can expect dgels() to be about 5 times faster than dgelsd().
You may see significant differences between the results of sgels() and dgelsd() if the matrix is ill-conditioned. Indeed, there is a bound on the error of the linear regression which depends on the algorithm and the value of rcond that is used. See LAPACK's user guide, Error Bounds for Linear Least Squares Problems, for estimates of the errors, and Further Details: Error Bounds for Linear Least Squares Problems for technical details.
As a conclusion, sgels() and dgels() can be used if the measurements in b are accurate and easily related to the explanatory variables. For instance, if sensors are placed at the exits of exhaust pipes, it's easy to guess which motors are running. But sometimes the linear link between the sources and the measurements is not precisely known (uncertainty in the terms of A), or discriminating polluters on the basis of the measurements becomes more difficult (some polluters are far from the set of sensors and A is ill-conditioned). In this kind of situation, dgelsd() and tuning the rcond argument can help. Whenever in doubt, use dgelsd() and estimate the error on the estimated x according to LAPACK's user guide.
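If you want to experiment with this from Python, scipy.linalg.lstsq exposes a lapack_driver argument. Note that 'gelsy' is the QR-with-pivoting routine rather than exactly sgels/dgels, so treat the following only as a rough illustration of how the driver and the cond (rcond) cutoff matter on an ill-conditioned system:
import numpy as np
from scipy.linalg import hilbert, lstsq
# A deliberately ill-conditioned system (Hilbert matrix).
A = hilbert(12)
x_true = np.ones(12)
b = A @ x_true
# SVD-based driver (what numpy.linalg.lstsq uses); cond plays the role of rcond.
x_svd, _, rank_svd, _ = lstsq(A, b, cond=1e-10, lapack_driver='gelsd')
# QR-with-pivoting driver, closer in spirit to dgels/sgels.
x_qr, _, rank_qr, _ = lstsq(A, b, cond=1e-10, lapack_driver='gelsy')
print(rank_svd, rank_qr)
print(np.linalg.norm(x_svd - x_true), np.linalg.norm(x_qr - x_true))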
I have a theoretical question. I am using scipy.optimize.minimize as the optimisation method, specifically the Nelder-Mead method, and I have noticed that the results change between runs. I am using it to minimise the sum of squared differences (SSD) between two images. Running it at different times returns different results, and sometimes the difference is more than 50%, e.g. the SSD after the first optimisation is 150 and after the second it is 450.
Theoretically, this could be down to the heuristic search strategy of the algorithm and its derivative-free nature.
Is such a big difference normal?
I am sorry about my theoretical question, but I am very curious to know/discuss.
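A toy sketch of that hypothesis: with a multimodal function (a stand-in for an image SSD surface), Nelder-Mead started from different points converges to different local minima with very different cost values:
import numpy as np
from scipy.optimize import minimize
# Toy multimodal cost surface standing in for an image SSD.
def ssd_like(x):
    return np.sin(3 * x[0]) + 0.1 * x[0] ** 2
# Nelder-Mead is deterministic for a fixed start, but different starting
# points end up in different local minima.
for x0 in (-4.0, 0.0, 4.0):
    res = minimize(ssd_like, [x0], method='Nelder-Mead')
    print(f"start {x0:+.1f} -> x = {res.x[0]:+.3f}, cost = {res.fun:.3f}")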
I have a follow-up question to the post written a couple of days ago; thank you for the previous feedback:
Finding complex roots from set of non-linear equations in python
I have gotten the set of non-linear equations set up in Python now so that fsolve will handle the real and imaginary parts independently. However, there are still problems with Python's fsolve converging to the correct solution. I have exactly the same inputs that are used in Matlab, and after double-checking, the set of equations is exactly the same as well. Matlab, no matter how I set the initial values, will always converge to the correct solution. With Python, however, every initial condition produces a different result, and never the correct one. After a fraction of a second, the following warning appears in Python:
/opt/local/Library/Frameworks/Python.framework/Versions/Current/lib/python2.7/site-packages/scipy/optimize/minpack.py:227:
RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
I was wondering if there are some known differences between the fsolve in python and Matlab, and if there are some known methods to optimize the performance in python.
Thank you very much
I don't think that you should rely on the fact that the names are the same. I see from your other question that you are specifying that Matlab's fsolve uses the 'levenberg-marquardt' algorithm rather than the default. Python's scipy.optimize.fsolve uses MINPACK's hybrd algorithms. Levenberg-Marquardt finds roots approximately by minimizing the sum of squares of the function and is quite robust. It is not a true root-finding method like the default 'trust-region-dogleg' algorithm. I don't know how the hybrd schemes work, but they claim to be a modification of Powell's method.
If you want something similar to what you're doing in Matlab, I'd look for an optimization scheme that implements Levenberg-Marquardt, such as scipy.optimize.root, which you were also using in your previous question. Is there a reason why you're not using that?
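A minimal sketch of that suggestion; the system below is a made-up complex equation, split into real and imaginary parts the way you describe:
import numpy as np
from scipy.optimize import root
# Toy complex system: find z with z**2 + 1 = 0, packing z = x + i*y
# into two real unknowns [x, y].
def residual(v):
    z = v[0] + 1j * v[1]
    f = z**2 + 1
    return [f.real, f.imag]
# method='lm' is Levenberg-Marquardt, the algorithm selected in the Matlab setup
sol = root(residual, [0.5, 0.5], method='lm')
print(sol.x, sol.success)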