Patsy formula when variable has a hyphen - python

I am trying to use the statsmodels linear regression functions with formulas. My sample data comes from a pandas DataFrame. I am having a slight problem with column names within the formula: because of downstream processes, my column names contain hyphens. For example:
+------+------+-------+
| VOLT | B-NN | B-IDW |
+------+------+-------+
I have to keep the hyphen, because it lets Python split the string for other analysis. As you can see, when I want to regress VOLT on B-NN using VOLT ~ B-NN, I run into a problem: patsy reads the formula as B minus NN and cannot find B.
Is there a way to tell Patsy that B-NN is a variable name and not B minus NN?
Thanks.
BJR

patsy uses Q for quoting names, e.g. Q('B-IDW')
http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q
my_fit_function("y ~ Q('weight.in.kg')", ...)
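Applied to the column names from the question, this could look like the sketch below. The data values are made up for illustration, and patsy's dmatrices is used directly to show how the formula is parsed; the same formula string should work when passed to the statsmodels formula functions.

```python
import pandas as pd
from patsy import dmatrices

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "VOLT": [1.0, 2.1, 2.9, 4.2, 5.1],
    "B-NN": [0.5, 1.0, 1.5, 2.0, 2.5],
    "B-IDW": [0.4, 1.1, 1.4, 2.2, 2.4],
})

# Q('B-NN') looks the column up by its literal name, so the hyphen
# is never parsed as a minus sign.
y, X = dmatrices("VOLT ~ Q('B-NN') + Q('B-IDW')", data=df)

# Intercept plus one column per quoted variable.
print(X.design_info.column_names)
print(X.shape)
```

The quoted names also show up in the design matrix column names, so downstream code that splits on the hyphen has to strip the Q('...') wrapper first.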

Related

How to make variables get declared automatically for an optimization model (scipy.optimize - Python)

I have a dataset like this, and I have to build an optimization model (I've been trying to use scipy.optimize in Python). The objective function looks like this:
(variable for row 1)*(row1['KPI 1']+row1['KPI 2']+row1['KPI 3']) +
(variable for row 2)*(row2['KPI 1']+row2['KPI 2']+row2['KPI 3']) +
(variable for row 3)*(row3['KPI 1']+row3['KPI 2']+row3['KPI 3']) + ... and so on until the last row.
The problem is that the real dataset will have at least 200 rows, and the model has to work for any dataset, whatever its number of rows. So I need a way to create the variables automatically to match the number of rows.
I already tried this way.
x = []
x = [2 for i in range(len(data1))]
I think this can be used to create the variables, but in the end it doesn't work with the minimize function.
I cannot find a single source on the internet that shows how to create variables automatically with scipy.optimize. Is it not possible with scipy? If not, please suggest another method where it is possible.
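One possible direction, as a rough sketch: scipy.optimize.minimize does not need individually declared variables. It optimizes a single 1-D array whose length comes from the starting guess x0, so that array can be sized from the number of rows. The data values, the bounds (0 to 1) and the "variables sum to 1" constraint below are all assumptions added to make the example bounded; they are not part of the question.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Hypothetical stand-in for the real dataset; any number of rows works.
data1 = pd.DataFrame({
    "KPI 1": [3.0, 1.0, 4.0, 2.0],
    "KPI 2": [2.0, 5.0, 1.0, 2.0],
    "KPI 3": [1.0, 2.0, 3.0, 0.5],
})

# One coefficient per row: KPI 1 + KPI 2 + KPI 3.
kpi_sum = data1[["KPI 1", "KPI 2", "KPI 3"]].sum(axis=1).to_numpy()
n = len(data1)

def objective(x):
    # x[0]*kpi_sum[0] + x[1]*kpi_sum[1] + ... for however many rows exist.
    return float(np.dot(x, kpi_sum))

x0 = np.full(n, 1.0 / n)      # starting guess, one entry per row
bounds = [(0.0, 1.0)] * n     # assumed bounds, not from the question
# Assumed constraint (not from the question): the variables sum to 1.
constraints = [{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}]

result = minimize(objective, x0, method="SLSQP",
                  bounds=bounds, constraints=constraints)
print(result.x)
print(result.fun)
```

Because x0, bounds and kpi_sum are all derived from len(data1), the same code runs unchanged on a dataset with 200 rows or more.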

Converting from R to Python, trying to understand a line

I have a fairly simple question. I have been converting some statistical analysis code from R to Python. Up until now, I have been doing just fine, but I have gotten stuck on this particular line:
nlsfit <- nls(N~pnorm(m, mean=mean, sd=sd),data=data4fit,start=list(mean=mu, sd=sig), control=list(maxiter=100,warnOnly = TRUE))
Essentially, the program is calculating the non-linear least-squares fit for a set of data, the "nls" command. In the original text, the "tilde" looks like an "enye", I'm not sure if that is significant.
As I understand it, the equivalent of pnorm in Python is norm.cdf from scipy.stats. What I want to know is what the "tilde/enye" does before the pnorm function is invoked. "m" is a predefined variable, while "mean" and "sd" are not.
I also found some code, essentially reproducing nls in Python: nls Python code, however, because of the date of the post (2013), I was wondering if there are any more recent equivalents, preferably written in Python 3.
Any advice is appreciated, thanks!
As you can see from ?nls, the first argument to nls is formula:
formula: a nonlinear model formula including variables and parameters.
Will be coerced to a formula if necessary
Now, if you do ?formula, we can read this:
The models fit by, e.g., the lm and glm functions are specified in a
compact symbolic form. The ~ operator is basic in the formation of
such models. An expression of the form y ~ model is interpreted as a
specification that the response y is modelled by a linear predictor
specified symbolically by model
Therefore, in your nls call the ~ joins the response/dependent/regressand variable on the left with the regressors/explanatory variables on the right of your nonlinear least squares model.
Best!
This minimizes
sum((N - pnorm(m, mean=mean, sd=sd))^2)
using starting values for mean and sd specified in start. It will perform a maximum of 100 iterations and it will return instead of signalling an error in the case of termination before convergence.
The first argument to nls is an R formula which specifies the regression where the left hand side of the tilde (N) is the dependent variable and the right side is the function of the parameters (mean, sd) and data (m) used to predict it.
Note that formula objects do not have a fixed meaning in R but rather each function can interpret them in any way it likes. For example, formula objects used by nls are interpreted differently than formula objects used by lm. In nls the formula y ~ a + b * x would be used to specify a linear regression but in lm the same regression would be expressed as y ~ x.
See ?pnorm, ?nls, ?nls.control and ?formula.
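For the Python side of the question, one way to minimize the same sum of squares is scipy.optimize.curve_fit with norm.cdf as the model. This is a sketch, not a drop-in replacement for nls: the data below is synthetic (the true values 2.0 and 1.5 are invented), and the R control options are not reproduced. In particular, curve_fit raises an error when the fit fails instead of returning with a warning as warnOnly = TRUE does.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def model(m, mean, sd):
    # Python counterpart of R's pnorm(m, mean=mean, sd=sd).
    return norm.cdf(m, loc=mean, scale=sd)

# Synthetic data standing in for data4fit (columns m and N).
rng = np.random.default_rng(0)
m = np.linspace(-2.0, 6.0, 40)
N = model(m, 2.0, 1.5) + rng.normal(scale=0.01, size=m.size)

# Starting values, playing the role of start=list(mean=mu, sd=sig).
mu, sig = 1.0, 1.0

# Least-squares fit of N against model(m, mean, sd).
params, cov = curve_fit(model, m, N, p0=[mu, sig])
mean_hat, sd_hat = params
print(mean_hat, sd_hat)
```

The first argument of the model function is the data (m) and the remaining arguments are the parameters being fitted, which mirrors the split between data and parameters on the right-hand side of the R formula.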

Why doesn't Statsmodels OLS support reading in columns with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.
I've noticed that when I specify which columns I want to use for lmplot, I can specify a column even if it has multiple words for it:
import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv
sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>
However, if I try to use ols, I get an error when passing "Count of Specific Strands" as my dependent variable (only the last few lines of the error are shown):
import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()
File "<unknown>", line 1
Count of Specific Strands
^
SyntaxError: invalid syntax
Conversely, if I specify the "Count of Specific Strands" column as shown below, the regression works:
test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()
Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative to specify the dependent variable for regression analysis that doesn't involve iloc or loc?
This is due to the way the formula parser patsy is written: see this link for more information
The authors of patsy have, however, thought of this problem: (quoted from here)
This flexibility does create problems in one case, though – because we
interpret whatever you write in-between the + signs as Python code,
you do in fact have to write valid Python code. And this can be tricky
if your variable names have funny characters in them, like whitespace
or punctuation. Fortunately, patsy has a builtin “transformation”
called Q() that lets you “quote” such variables
Therefore, in your case, you should be able to write:
smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()

Naming columns by mathematical symbols in pandas dataframe

I want to add the units of my parameters next to each parameter as the name of a column in my dataframe. I also need to use statistical symbols for some column names such as μ and σ2.
I tried the following code, based on the r"$...$" syntax for mathematical symbols in Python, but it does not work for a dataframe:
P[r"Infiltration rate ($1/\h^-1$)"]=r['ACH_Base']
in order to give (1/h^-1) unit to Infiltration rate parameter.
In my code I have already created a new dataframe "P" and I am adding the ACH_Base column in "r" dataframe to P.
How can I add mathematical symbols for naming the columns in dataframes?
Thanks!!
It should work, but it depends on the backend used to display the dataframe. For instance, matplotlib has support to render LaTeX in plots.
Here is an example:
https://matplotlib.org/users/usetex.html#text-rendering-with-latex
LaTeX can also be rendered in jupyter notebooks, but this does not apply to Python code, only for markdown cells:
http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html?highlight=latex#LaTeX-equations
"\h" is an unknown symbol.
Does P[r"Infiltration rate ($1/h^-1$)"]=r['ACH_Base'] work to display what you want?
What unit do you wish to display? You can refer to https://matplotlib.org/users/mathtext.html and https://matplotlib.org/users/usetex.html#usetex-tutorial for more information on how to render text with LaTeX.
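If plain symbols are enough, another option is to put Unicode characters directly in the column names, since pandas column labels are ordinary Python strings. A minimal sketch, assuming the intended unit is h⁻¹ and using made-up values for the "r" dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the "r" dataframe from the question.
r = pd.DataFrame({"ACH_Base": [0.35, 0.50, 0.42]})

P = pd.DataFrame()
# Column labels are plain strings, so Unicode symbols are stored as-is.
P["Infiltration rate (h⁻¹)"] = r["ACH_Base"]
P["μ"] = r["ACH_Base"].mean()    # scalar is broadcast to every row
P["σ²"] = r["ACH_Base"].var()

print(P.columns.tolist())
print(P["μ"])
```

These names display the same way in any backend, whereas a label such as r"$\mu$" only turns into a symbol where something renders it, for example a matplotlib axis label.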

patsy formula - adding powers of a factor

I use patsy to build a design matrix. I need to include powers of the original factors. For example, I want to be able to write
patsy.dmatrix('y~x1 + x1**2 + x2 + x2**2 + x2**3', data)
where data is a dataframe that contains columns y, x1, x2. But it does not seem to work at all. Any solutions?
Patsy has a special interpretation of ** that it inherited from R. I've considered making it automatically do the right thing when applied to numeric factors, but haven't actually implemented it... in the meantime, there's a general method for telling patsy to switch to using the Python interpretation of operators, instead of the Patsy interpretation: you wrap your expression in I(...). So:
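patsy.dmatrices('y~x1 + I(x1**2) + x2 + I(x2**2) + I(x2**3)', data)
(dmatrices rather than dmatrix, because dmatrix raises an error for a formula that has a left-hand side.)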
(More detailed explanation here)
Patsy does not seem to handle powers in formulas (yet?). A workaround can be found here: python stats models - quadratic term in regression
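To see the difference I(...) makes, here is a short sketch with made-up data. It uses dmatrix with only the right-hand side of the formula, since dmatrix builds a single matrix, and compares a formula with a bare ** against one where the powers are wrapped in I().

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Made-up data with the predictor columns from the question.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 1.5, 2.5, 3.5],
})

# Without I(): ** is read with patsy's formula meaning, not as a power.
plain = dmatrix("x1 + x1**2", data)

# With I(): the expression inside is evaluated as ordinary Python arithmetic.
powers = dmatrix("x1 + I(x1**2) + x2 + I(x2**2) + I(x2**3)", data)

print(plain.design_info.column_names)
print(powers.design_info.column_names)

# Plain ndarray view of the design matrix, ready for a regression routine.
M = np.asarray(powers)
print(M.shape)
```

Comparing the two printed lists of column names shows which terms patsy actually created in each case.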
