numpy: code to update least squares with more observations - python

I am looking for a numpy-based implementation of ordinary least squares that would allow the fit to be updated with more observations. Something along the lines of Applied Statistics algorithm AS 274 or R's biglm.
Failing that, a routine for updating a QR decomposition with new rows would also be of interest.
Any pointers?

scikits.statsmodels has an recursive OLS that updates the inverse X'X in the sandbox that could be used for this. (used only to calculate recursive OLS residuals.)
Nathaniel Smith posted his code for OLS when the data is too large to fit in memory to the scipy-user mailing list. The main code updates X'X.
I think econpy also has a function for this.
Pandas has an expanding OLS, but it may not be easy to use in an online fashion.
Nathaniels code might be the closest to biglm. I don't think there is anything for general linear model (error covariance different from identity).
All need some work before they can be used for this. I don't know of any python(-wrapped) code that would update QR.
update:
see http://mail.scipy.org/pipermail/scipy-dev/2010-February/013853.html
there is incremental qr and cholesky in cholmod available, but I didn't try it, either license or compilation on windows problems, and I don't think I tried to get incremental_qr to work
see attachements
http://mail.scipy.org/pipermail/scipy-dev/2010-February/013844.html

You might try the pythonequations project at http://code.google.com/p/pythonequations/downloads/list, though it may be more than you need it does use scipy and numpy. That code is the middleware for the http://zunzun.com online curve and surface fitting web site (I'm the author). The source code comes with many examples. Alternatively, the web site alone may be sufficient - please give it a try.
James Phillips
2548 Vera Cruz Drive
Birmingham, AL 35235 USA
zunzun#zunzun.com

This is not a detailed answer yet, but:
AFAIK, the QR update like this is not implemented in numpy, but anyway I'll like ask you to specify a more detailed manner what you are actually aiming for.
Especially, why it would not be acceptable to just calculate new estimate for x (of Ax= b) with k latest observations, when (bunch of) new observations arrives (and with modern hardware, k indeed can be quite large one)?

The LSQ.F90 part of the file compiles easily enough with,
gfortran-4.4 -shared -fPIC -g -o lsq.so LSQ.F90
and this works in Python,
from ctypes import cdll
lsq = cdll.LoadLibrary('./lsq.so')
As soon as I figure out the function call I'll include it in this answer.

Related

How to implement multiple testing for scipy.stats tests

I have a dataframe of values from various samples from two groups. I performed a scipy.stats.ttest on these, which works perfectly, but I am a bit concerned here with the fact that so much testing may yield multiple testing error.
And I wonder how to implement MTC (multiple testing correction) with this. I mean, is there some function in scipy or statsmodels which would perform directly the tests and apply MTC on the output series of p-value, or can I apply an MTC function on a list of p-value without problem?
I know that statsmodels may comprise such functions, but what it has in power, it lacks greatly in manageability and documentation, unhappily (indeed, that's not the fault of the developers, they are three for such huge project). Anyway, I am a little stuck here, so I'll gladly take any suggesting. I didn't ask this in CrossValidated, because it is more related to the implementation part than the statistical part.
Edit 9th Oct 2019:
this link works as of today
https://www.statsmodels.org/stable/generated/statsmodels.stats.multitest.multipletests.html
original answer (returns 404 now)
statsmodels.sandbox.stats.multicomp.multipletests takes an array of p-values and returns the adjusted p-values. The documentation is pretty clear.

PCA with missing values in Python

I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries. You have
Package PPCA on PyPI, which is called PCA-magic on github
Package PyPPCA, having the same name on PyPI and github
Since the packages are in low maintenance, you might want to implement it yourself instead. The code above build on theory presented in the well quoted (and well written!) paper by Tipping and Bishop 1999. It is available on Tippings home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on TippingBishop1999, but they have not chosen to implement it in such a way that it handles missing values.
EDIT: both the libraries above had issues so I could not use them directly myself. I forked PyPPCA and bug fixed it. Available on github.
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this function you can automatically replace the missing values for the mean, median or most frequent value. Which of this options is the best is hard to tell, it depends on many factors such as how the data looks like.
By the way, you can also use PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many others statistical functions and machine learning tecniques.

Does Matlab's fminimax apply Pareto optimality?

I am working on multi-objective optimization in Matlab, and am using the fiminimax in the Optimization toolbox. I want to know if fminimax applies Pareto optimization, and if not, why? Also, can you suggest a multi-objective optimization package in Matlab or Python that does use Pareto?
For python, DEAP may be the one you're looking for. Extensive documentation with a lot of real life examples, and a really helpful Google Groups forum. It implements two robust MO algorithms: NSGA-II and SPEA-II.
Edit (as requested)
I am using DEAP for my MSc thesis, so I will let you know how we are using Pareto optimality. Setting DEAP up is pretty straight-forward, as you will see in the examples. Use this one as a starting point. This is the short version, which uses the built-in algorithms and operators. Read both and then follow these guidelines.
As the OneMax example is single-objective, it doesn't use MO algorithms. However, it's easy to implement them:
Change your evaluation function so it returns a n-tuple with the desired scores. If you want to minimize standard deviation too, something like return sum(individual), numpy.std(individual) would work.
Also, modify the weights parameter of the base.Fitness object so it matches that returned n-tuple. A positive float means maximization, while a negative one means minimization. You can use any real number, but I would stick with 1.0 and -1.0 for the sake of simplicity.
Change your genetic operators to cxSimulatedBinaryBounded(), mutPolynomialBounded() and selNSGA2(), for crossover, mutation and selection operations, respectively. These are the suggested methods, as they were developed by the NSGA-II authors.
If you want to use one of the embedded ready-to-go algorithms in DEAP, choose MuPlusLambda().
When calling the algorithm, remember to change the halloffame parameter from HallOfFame() to ParetoFront(). This will return all non-dominated individuals, instead of the best lexicographically sorted "best individuals in all generations". Then you can resolve your Pareto Front as desired: weighted sum, custom lexicographic sorting, etc.
I hope that helps. Take into account that there's also a full, somehow more advanced, NSGA2 example available here.
For fminimax and fgoalattain it looks like the answer is no. However, the genetic algorithm solver, gamultiobj, is Pareto set-based, though I'm not sure if it's the kind of multi-objective optimization function you want to use. gamultiobj implements the NGSA-II evolutionary algorithm. There's also this package that implements the Strengthen Pareto Evolutionary Algorithm 2 (SPEA-II) in C with a Matlab mex interface. It's a bit old so you might want to recompile it (you'll need to anyways if you're not on Windows 32-bit).

Nash equilibrium in Python

Is there a Python library out there that solves for the Nash equilibrium of two-person zero-games? I know the solution can be written down in terms of linear constraints and, in theory, scipy should be able to optimize it. However, for two-person zero-games the solution is exact and unique, but some of the solvers fail to converge for certain problems.
Rather than listing any of the libraries on Linear programing on the Python website, I would like to know what library would be most effective in terms of ease of use and speed.
Raymond Hettinger wrote a recipe for solving zero-sum payoff matrices. It should serve your purposes alright.
As for a more general library for solving game theory, there's nothing specifically designed for that. But, like you said, scipy can tackle optimization problems like this. You might be able to do something with GarlicSim, which claims to be for "any kind of simulation: Physics, game theory..." but I've never used it before so I can't recommend it.
There is Gambit, which is a little difficult to set up, but has a python API.
I've just started putting together some game theory python code: http://drvinceknight.github.com/Gamepy/
There's code which:
solves matching games,
calculates shapley values in cooperative games,
runs agent based simulations to identify emergent behaviour in normal form games,
(clumsily - my python foo is still growing) uses the lrs library (written in C: http://cgm.cs.mcgill.ca/~avis/C/lrs.html) to calculate the solutions to normal form games (this is I believe what you want).
The code is all available on github and that site (the first link at the beginning of this answer) explains how the code works and gives user examples.
You might also want to check out 'Gambit' which I've never used.

Integrate stiff ODEs with Python

I'm looking for a good library that will integrate stiff ODEs in Python. The issue is, scipy's odeint gives me good solutions sometimes, but the slightest change in the initial conditions causes it to fall down and give up. The same problem is solved quite happily by MATLAB's stiff solvers (ode15s and ode23s), but I can't use it (even from Python, because none of the Python bindings for the MATLAB C API implement callbacks, and I need to pass a function to the ODE solver). I'm trying PyGSL, but it's horrendously complex. Any suggestions would be greatly appreciated.
EDIT: The specific problem I'm having with PyGSL is choosing the right step function. There are several of them, but no direct analogues to ode15s or ode23s (bdf formula and modified Rosenbrock if that makes sense). So what is a good step function to choose for a stiff system? I have to solve this system for a really long time to ensure that it reaches steady-state, and the GSL solvers either choose a miniscule time-step or one that's too large.
If you can solve your problem with Matlab's ode15s, you should be able to solve it with the vode solver of scipy. To simulate ode15s, I use the following settings:
ode15s = scipy.integrate.ode(f)
ode15s.set_integrator('vode', method='bdf', order=15, nsteps=3000)
ode15s.set_initial_value(u0, t0)
and then you can happily solve your problem with ode15s.integrate(t_final). It should work pretty well on a stiff problem.
(See also Link)
Python can call C. The industry standard is LSODE in ODEPACK. It is public-domain. You can download the C version. These solvers are extremely tricky, so it's best to use some well-tested code.
Added: Be sure you really have a stiff system, i.e. if the rates (eigenvalues) differ by more than 2 or 3 orders of magnitude. Also, if the system is stiff, but you are only looking for a steady-state solution, these solvers give you the option of solving some of the equations algebraically. Otherwise, a good Runge-Kutta solver like DVERK will be a good, and much simpler, solution.
Added here because it would not fit in a comment: This is from the DLSODE header doc:
C T :INOUT Value of the independent variable. On return it
C will be the current value of t (normally TOUT).
C
C TOUT :IN Next point where output is desired (.NE. T).
Also, yes Michaelis-Menten kinetics is nonlinear. The Aitken acceleration works with it, though. (If you want a short explanation, first consider the simple case of Y being a scalar. You run the system to get 3 Y(T) points. Fit an exponential curve through them (simple algebra). Then set Y to the asymptote and repeat. Now just generalize to Y being a vector. Assume the 3 points are in a plane - it's OK if they're not.) Besides, unless you have a forcing function (like a constant IV drip), the MM elimination will decay away and the system will approach linearity. Hope that helps.
PyDSTool wraps the Radau solver, which is an excellent implicit stiff integrator. This has more setup than odeint, but a lot less than PyGSL. The greatest benefit is that your RHS function is specified as a string (typically, although you can build a system using symbolic manipulations) and is converted into C, so there are no slow python callbacks and the whole thing will be very fast.
I am currently studying a bit of ODE and its solvers, so your question is very interesting to me...
From what I have heard and read, for stiff problems the right way to go is to choose an implicit method as a step function (correct me if I am wrong, I am still learning the misteries of ODE solvers). I cannot cite you where I read this, because I don't remember, but here is a thread from gsl-help where a similar question was asked.
So, in short, seems like the bsimp method is worth taking a shot, although it requires a jacobian function. If you cannot calculate the Jacobian, I will try with rk2imp, rk4imp, or any of the gear methods.

Categories