patsy formula - adding powers of a factor - python

I use patsy to build design matrix. I need to include powers of the original factors. For example, with the regression , I want to be able to write
patsy.dmatrix('y~x1 + x1**2 + x2 + x2**2 + x2**3', data)
where data is a dataframe that contains column y, x1, x2. But it does not seem to work at all. Any solutions?

Patsy has a special interpretation of ** that it inherited from R. I've considered making it automatically do the right thing when applied to numeric factors, but haven't actually implemented it... in the mean time, there's a general method for telling patsy to switch to using the Python interpretation of operators, instead of the Patsy interpretation: you wrap your expression in I(...). So:
patsy.dmatrix('y~x1 + I(x1**2) + x2 + I(x2**2) + I(x2**3)', data)
(More detailed explanation here)

Patsy does not seem to manage power representation (yet?). A way to get around can be found here: python stats models - quadratic term in regression

Related

Building a filter with Python & MATLAB, results are not the same

I want to translate this MATLAB code into Python, I guess I did everything right, even though I didn't get the same results.
MATLAB script:
n=2 %Filter_Order
Wn=[0.4 0.6] %# Normalized cutoff frequencies
[b,a] = butter(n,Wn,'bandpass') % Transfer function coefficients of the filter
Python script:
import numpy as np
from scipy import signal
n=2 #Filter_Order
Wn=np.array([0.4,0.6]) # Normalized cutoff frequencies
b, a = signal.butter(n, Wn, btype='band') #Transfer function coefficients of the filter
a coefficients in MATLAB: 1, -5.55e-16, 1.14, -1.66e-16, 0.41
a coefficients in Python: 1, -2.77e-16, 1.14, -1.94e-16, 0.41
Could it just be a question of precision, since the two different values (the 2nd and 4th) are both on the order of 10^(-16)?!
The b coefficients are the same on the other hand.
You machine precision is about 1e-16 (in MATLAB this can be checked easily with eps(), I presume about the same in Python). The 'error' you are dealing with is thus on the order of machine precision, i.e. not actually calculable within fitting precision.
Also of note is that MATLAB ~= Python (or != in Python), thus the implementations of butter() on one hand and signal.butter() on the other will be slightly different, even if you use the exact same numbers, due to the way both languages are translated to machine code.
It rarely matters to have coefficients differing 16 orders of magnitude; the smaller ones would be essentially neglected. In case you do need exact values, consider using either symbolic math, or some kind of Variable Precision Arithmetic (vpa() in MATLAB), but I guess that in your case the difference is irrelevant.

Converting from R to Python, trying to understand a line

I have a fairly simple question. I have been converting some statistical analysis code from R to Python. Up until now, I have been doing just fine, but I have gotten stuck on this particular line:
nlsfit <- nls(N~pnorm(m, mean=mean, sd=sd),data=data4fit,start=list(mean=mu, sd=sig), control=list(maxiter=100,warnOnly = TRUE))
Essentially, the program is calculating the non-linear least-squares fit for a set of data, the "nls" command. In the original text, the "tilde" looks like an "enye", I'm not sure if that is significant.
As I understand the equivalent of pnorm in Python is norm.cdf from from scipy.stats. What I want to know is, what does the "tilde/enye" do before the pnorm function is invoked. "m" is a predefined variable, while "mean" and "sd" are not.
I also found some code, essentially reproducing nls in Python: nls Python code, however, because of the date of the post (2013), I was wondering if there are any more recent equivalents, preferably written in Pyton 3.
Any advice is appreiated, thanks!
As you can see from ?nls: the first argument in nsl is formula:
formula: a nonlinear model formula including variables and parameters.
Will be coerced to a formula if necessary
Now, if you do ?formula, we can read this:
The models fit by, e.g., the lm and glm functions are specified in a
compact symbolic form. The ~ operator is basic in the formation of
such models. An expression of the form y ~ model is interpreted as a
specification that the response y is modelled by a linear predictor
specified symbolically by model
Therefore, the ~ in your case nls join the response/dependent/regressand variable in the left with the regressors/explanatory variables in the right part of your nonlinear least squares.
Best!
This minimizes
sum((N - pnorm(m, mean=mean, sd=sd))^2)
using starting values for mean and sd specified in start. It will perform a maximum of 100 iterations and it will return instead of signalling an error in the case of termination before convergence.
The first argument to nls is an R formula which specifies the regression where the left hand side of the tilde (N) is the dependent variable and the right side is the function of the parameters (mean, sd) and data (m) used to predict it.
Note that formula objects do not have a fixed meaning in R but rather each function can interpret them in any way it likes. For example, formula objects used by nls are interpreted differently than formula objects used by lm. In nls the formula y ~ a + b * x would be used to specify a linear regression but in lm the same regression would be expressed as y ~ x .
See ?pnorm, ?nls, ?nls.control and ?formula .

Patsy formula when variable has a hypthen

I am trying to use the statsmodel linear regression functions with formulas. My sample data is coming from a Pandas data frame. I am having a slight problem with column names within the formula. Due to the downstream processes, I have hyphens within my column names. For example:
+------+-------+-------+
+ VOLT + B-NN + B-IDW +
+------+-------+-------+
Now, one of the reasons for keeping the hyphen as it allows python to split the string for other analysis, so I have to keep it. As you can see, when I want to regress VOLT with B-NN using VOLT ~ B-NN, I encounter a problem as the patsy formula cannot find B.
Is there a way to tell Patsy that B-NN is a variable name and not B minus NN?
Thanks.
BJR
patsy uses Q for quoting names, e.g. Q('B-IDW')
http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q
my_fit_function("y ~ Q('weight.in.kg')", ...)

the solution of linear combination with integer variables = fixed value

I plan to use python for the solution of next task.
There is an equation:
E=(n[1])*W[1]+ (n[2])*W[2]+..+ (n[N])*W[N]. The W[i],
E are known and are fixed values,
n[i] are integer variables.
I need to find all combinations of n[i] and write them.
Howe can I do it using numpy python?
Looks like a Diophantine equation.
There is no support for this in numpy/scipy and the usual suspect Integer-programming (which can be used to solve this) is also not available within scipy!
The general case is NP-hard!

What algorithm does Pandas use for computing variance?

Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111,2000000)).var()
12.579462289731145
Obviously due to some numeric instability. However, in R we get:
var(rep(500111,2000000))
0
I wasn't able to make enough sense of the Pandas source-code to figure out what algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update: To summarize the comments below - If the Python bottleneck package for fast NumPy array functions is installed, a stabler two-pass algorithm similar to np.sqrt(((arr - arr.mean())**2).mean()) is used and gives 0.0 (as indicated by #Jeff); whereas if it is not installed, the naive implementation indicated by #BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
The behavior you're seeing appears to be due to the sum of squared values overflowing the numpy datatypes. It's not an issue of how the variance is calculated per se.
I don't know the answer, but it seems related to how Series are stored, not necessarily the var function.
np.var(pd.Series(repeat(100000000,100000)))
26848.788479999999
np.var(repeat(100000000,100000))
0.0
Using Pandas 0.11.0.

Categories