Migrating from Stata to Python - python

Some coworkers who have been struggling with Stata 11 are asking for my help to try to automate their laborious work. They mainly use 3 commands in Stata:
tsset (sets a time series analysis)
as in: tsset year_column, yearly
varsoc (Obtain lag-order selection statistics for VARs)
as in: varsoc column_a column_b
vec (vector error-correction model)
as in: vec column_a column_b, trend(con) lags(1) noetable
Does anyone know any scientific library that I can use through python for this same functionality?

I believe both scikits.timeseries and econpy / pytrix implement vector autoregression methods, but I haven't put either through their paces.

scikits.timeseries is mainly for data handling and has only some statistical, econometric analysis and no vectorautoregression. pytrix has some econometrics functions but also no VAR.
(At least last time I looked.)
scikits.statsmodels and pandas both have VAR, pandas also does the data handling for time series. I haven't seen any vector error correction models in python yet, but scikits.statsmodels is getting close.
http://groups.google.ca/group/pystatsmodels?hl=en&pli=1

Check out scikits.statsmodels.tsa.api.VAR (may need to get the latest development version--use Google) and, in check out the documentation for it:
http://statsmodels.sourceforge.net/devel/vector_ar.html#var
These models integrate with pandas also. I'll be working in the coming months to improve integration of pandas with the rest of statsmodels
Vector Error Correction Models have not been implemented yet but are on the TODO list!

Use Rpy2 and call the R var package.

I have absolutely no clue what any of those do, but NumPy and SciPy. Maybe Sage or SymPy.

Related

Do we have an implementation of Bayesian structural time series in Python?

We are looking for a close pythonian implementation of the r library bsts.
To be precise, I'm looking for something that allows me to emulate the functionality of 'add_regressor' from fbprophet.
Have already tried Pybsts (the kernel kept dying), and
According to a thread on tensorflow_probability Github account, it doesn't support multivariate mode yet.
Any help would be appreciated.
Thanks
This blog post from Tensorflow Probability shows how to add an exogenous regressor with the TFP structural time series tools. In particular, check out the usage of the temperature_effect variable in the Example: Forecasting Demand for Electricity section!
I recently wrote a version of R's bsts package in Python. It doesn't have all of bsts's features, but it does have options for level, trend, seasonality, and regression. The syntax closely follows statsmodels' UnobservedComponents module. You can find the code and description of the package here: https://github.com/devindg/pybuc.

Welch's ANOVA in Python

I'm not sure if stackoverflow is the best forum for this, but anyway...
Scipy implements ANOVA using stats.f_oneway, which assumes equal variances. It says in the docs that if the variances are unequal, one could consider the Kruskal-Wallis test instead.
However, what I want is Welch's ANOVA. Scipy has a Welch t-test, but of course this doesn't work if I have more than two groups.
What I find interesting is that scipy used to have stats.oneway which allowed for an equal variance setting. However, it has been deprecated.
Is there an easy way to implement Welch's ANOVA in Python?
Just required the same thing. I had to copy code from R package. Also requested scipy.stats to add this feature. Here is the ~10 lines of code for the implementation
https://github.com/scipy/scipy/issues/11122
The pingouin package has a Welch's ANOVA implemented. You can find the documentation for it at https://pingouin-stats.org/generated/pingouin.welch_anova.html.

PCA with missing values in Python

I'm trying to do a PCA analysis on a masked array. From what I can tell, matplotlib.mlab.PCA doesn't work if the original 2D matrix has missing values. Does anyone have recommendations for doing a PCA with missing values in Python?
Thanks.
Imputing data will skew the result in ways that might bias the PCA estimates. A better approach is to use a PPCA algorithm, which gives the same result as PCA, but in some implementations can deal with missing data more robustly.
I have found two libraries. You have
Package PPCA on PyPI, which is called PCA-magic on github
Package PyPPCA, having the same name on PyPI and github
Since the packages are in low maintenance, you might want to implement it yourself instead. The code above build on theory presented in the well quoted (and well written!) paper by Tipping and Bishop 1999. It is available on Tippings home page if you want guidance on how to implement PPCA properly.
As an aside, the sklearn implementation of PCA is actually a PPCA implementation based on TippingBishop1999, but they have not chosen to implement it in such a way that it handles missing values.
EDIT: both the libraries above had issues so I could not use them directly myself. I forked PyPPCA and bug fixed it. Available on github.
I think you will probably need to do some preprocessing of the data before doing PCA.
You can use:
sklearn.impute.SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
With this function you can automatically replace the missing values for the mean, median or most frequent value. Which of this options is the best is hard to tell, it depends on many factors such as how the data looks like.
By the way, you can also use PCA using the same library with:
sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
And many others statistical functions and machine learning tecniques.

Calculating adjusted p-values in Python

So, I've been spending some time looking for a way to get adjusted p-values (aka corrected p-values, q-values, FDR) in Python, but I haven't really found anything. There's the R function p.adjust, but I would like to stick to Python coding, if possible. Is there anything similar for Python?
If this is somehow a bad question, sorry in advance! I did search for answers first, but found none (except a Matlab version)... Any help is appreciated!
It is available in statsmodels.
http://statsmodels.sourceforge.net/devel/stats.html#multiple-tests-and-multiple-comparison-procedures
http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.stats.multicomp.multipletests.html
and some explanations, examples and Monte Carlo
http://jpktd.blogspot.com/2013/04/multiple-testing-p-value-corrections-in.html
According to the biostathandbook, the BH is easy to compute.
def fdr(p_vals):
from scipy.stats import rankdata
ranked_p_values = rankdata(p_vals)
fdr = p_vals * len(p_vals) / ranked_p_values
fdr[fdr > 1] = 1
return fdr
You can try the module rpy2 that allows you to import R functions (b.t.w., a basic search returns How to implement R's p.adjust in Python).
Another possibility is to look at the maths an redo it yourself, because it is still relatively easy.
Apparently there is an ongoing implementation in scipy: http://statsmodels.sourceforge.net/ipdirective/_modules/scikits/statsmodels/sandbox/stats/multicomp.html . Maybe it is already usable.
You mentioned in your question q-values and no answer provided a link which addresses this. I believe this package (at least it seems so from the documentation) calculates q-values in python
https://puolival.github.io/multipy/
and also this one
https://github.com/nfusi/qvalue

numpy: code to update least squares with more observations

I am looking for a numpy-based implementation of ordinary least squares that would allow the fit to be updated with more observations. Something along the lines of Applied Statistics algorithm AS 274 or R's biglm.
Failing that, a routine for updating a QR decomposition with new rows would also be of interest.
Any pointers?
scikits.statsmodels has an recursive OLS that updates the inverse X'X in the sandbox that could be used for this. (used only to calculate recursive OLS residuals.)
Nathaniel Smith posted his code for OLS when the data is too large to fit in memory to the scipy-user mailing list. The main code updates X'X.
I think econpy also has a function for this.
Pandas has an expanding OLS, but it may not be easy to use in an online fashion.
Nathaniels code might be the closest to biglm. I don't think there is anything for general linear model (error covariance different from identity).
All need some work before they can be used for this. I don't know of any python(-wrapped) code that would update QR.
update:
see http://mail.scipy.org/pipermail/scipy-dev/2010-February/013853.html
there is incremental qr and cholesky in cholmod available, but I didn't try it, either license or compilation on windows problems, and I don't think I tried to get incremental_qr to work
see attachements
http://mail.scipy.org/pipermail/scipy-dev/2010-February/013844.html
You might try the pythonequations project at http://code.google.com/p/pythonequations/downloads/list, though it may be more than you need it does use scipy and numpy. That code is the middleware for the http://zunzun.com online curve and surface fitting web site (I'm the author). The source code comes with many examples. Alternatively, the web site alone may be sufficient - please give it a try.
James Phillips
2548 Vera Cruz Drive
Birmingham, AL 35235 USA
zunzun#zunzun.com
This is not a detailed answer yet, but:
AFAIK, the QR update like this is not implemented in numpy, but anyway I'll like ask you to specify a more detailed manner what you are actually aiming for.
Especially, why it would not be acceptable to just calculate new estimate for x (of Ax= b) with k latest observations, when (bunch of) new observations arrives (and with modern hardware, k indeed can be quite large one)?
The LSQ.F90 part of the file compiles easily enough with,
gfortran-4.4 -shared -fPIC -g -o lsq.so LSQ.F90
and this works in Python,
from ctypes import cdll
lsq = cdll.LoadLibrary('./lsq.so')
As soon as I figure out the function call I'll include it in this answer.

Categories