When I want to fit a model in Python,
I often use the fit() method in statsmodels.
In some cases I write a script to automate the fitting:
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('mydata.csv')  # contains columns x and y
fitted = smf.poisson('y ~ x', df).fit()
My question is how to silence the fit() method.
In my environment it prints information about the fit to standard output, like:
Optimization terminated successfully.
Current function value: 2.397867
Iterations 11
but I don't need it.
I couldn't find an argument that controls this printing.
How can I silence the fit() method?
Python 3.3.4, IPython 2.0.0, pandas 0.13.1, statsmodels 0.5.0.
Use the disp argument to fit. It controls the verbosity of the optimizers in scipy.
mod.fit(disp=0)
See the documentation for fit.
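Applied to the script from the question, that would look like this (same df as above; disp=0 simply suppresses the optimizer output):
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('mydata.csv')  # contains columns x and y
fitted = smf.poisson('y ~ x', df).fit(disp=0)  # no convergence messages printed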
Related
I'm going through a tutorial on mixed-effects models in Python.
I'm building a model where litter is the random effect. In the tutorial, the output contains the variance across the litter intercepts. However, in Bayesian hierarchical modeling, I'm also able to see the intercepts for every level of the random effect variable.
How would I see that here?
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import statsmodels.formula.api as smf
df = pd.read_csv("http://www-personal.umich.edu/~bwest/rat_pup.dat", sep = "\t")
model = smf.mixedlm("weight ~ litsize + C(treatment) + C(sex, Treatment('Male')) + C(treatment):C(sex, Treatment('Male'))",
                    df, groups="litter").fit()
model.summary()
I would also ideally like to see the estimate of the intercept across all litters. Then, how would I interpret that overall intercept compared to the intercept for each single litter?
If there's a better Python package for what I'm striving for, please suggest.
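For reference, a minimal sketch of how the per-group estimates can be pulled out of the fitted statsmodels MixedLM results object (using the model variable from above):
# fixed-effects estimates, including the overall intercept across all litters
print(model.fe_params)
# dictionary mapping each litter to its estimated random effect (intercept deviation)
print(model.random_effects)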
I'm working on some forecasts using statsmodels' arima model.
This used to work well with
model_result = model.fit(disp = -1)
but disp no longer seems to be working -
https://github.com/biolab/orange3-timeseries/blob/a9fb2ab04dffdc8c17cb4020e94a93538099c285/orangecontrib/timeseries/models.py#L305-L306
Has anyone run into the same problem and found an alternative to disp?
I can't reasonably continue without this.
BR and thank you!
I also ran into the same problem.
Two solutions:
1) Use an older version of statsmodels, where disp is still supported; you can do so by installing version 0.12.2:
$ pip install statsmodels==0.12.2
disp is an optional argument: if disp=True or disp > 0, convergence information is printed; if disp=False or disp < 0, there is no output.
You can get rid of the warnings by using this in your code:
import warnings
warnings.filterwarnings("ignore")
2) Use the newer version of statsmodels, where disp is no longer supported, so you cannot set it. Use the following code:
import statsmodels.api as smapi
model = smapi.tsa.arima.ARIMA(train_data, order=(1,1,2))
result = model.fit()
Personally, I think the updated version of statsmodels is better.
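If the newer version still emits convergence warnings during fitting, one option (a sketch, assuming statsmodels' ConvergenceWarning class) is to filter just that warning instead of silencing all warnings:
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
# silence only statsmodels' convergence warnings, not every warning
warnings.simplefilter('ignore', ConvergenceWarning)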
I'm running linear regressions with statsmodels and, because I tend to distrust my results, I also ran the same regression with scipy. The underlying dataset has about 80,000 observations. Unfortunately, I cannot provide the data for you to reproduce the errors.
I run two rounds of regressions: first a simple OLS, second a simple OLS with standardized variables.
Surprisingly, the results differ a lot. While R² and p-value seem to be the same, coefficients, intercept and standard error are all over the place. Interestingly, after standardizing the results align more. Now, there is only a slight difference in the constant, which I am happy to attribute to rounding issues.
The exact numbers can be found in the appended screenshots.
Any idea, where these differences come from and why they disappear after standardizing? What did I do wrong? Do I have to be extra worried, since I run most of my regressions with sklearn (only swapped to statsmodels since I needed some p-values) and even more differences may occur?
Thanks for your help! If you need any additional information, feel free to ask. Code and screenshots are provided below.
My code in short looks like this:
# package import
import numpy as np
from scipy.stats import linregress
from scipy.stats.mstats import zscore
import statsmodels.api as sma
import statsmodels.formula.api as smf
# adding constant
train_IV_cons = sma.add_constant(train_IV)
# run regression
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV[:,0], train_DV)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV, train_IV_cons[:,[0,1]])
model_results = est.fit()
print(model_results.summary())
# normalize variables
train_IV_norm = train_IV
train_IV_norm[:,0] = zscore(train_IV_norm[:,0])
train_IV_norm_cons = sma.add_constant(train_IV_norm)
# run regressions
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV_norm[:,0], train_DV_norm)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV_norm, train_IV_norm_cons[:,[0,1]])
model_results = est.fit()
print(model_results.summary())
First regression (not standardized data):
Second regression (standardized data):
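As a quick sanity check of the two APIs (a sketch on synthetic data, not the original dataset), linregress and OLS with an explicit constant column should report matching slope and intercept:
import numpy as np
from scipy.stats import linregress
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

slope, intercept, rvalue, pvalue, stderr = linregress(x, y)
res = sm.OLS(y, sm.add_constant(x)).fit()
print(slope, intercept)   # scipy estimates
print(res.params)         # statsmodels estimates: [const, x]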
I am currently using from pandas.stats.plm import PanelOLS to run panel regressions. I need to switch to statsmodels so that I can output heteroskedasticity-robust results. I have been unable to find documentation on calling a panel regression in statsmodels. In general, I find the statsmodels documentation not very user friendly. Is anyone familiar with panel regression syntax in statsmodels?
The linearmodels package was created to extend statsmodels with panel OLS (see https://github.com/bashtage/linearmodels). Here is the example from the package docs:
import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
# MultiIndex, entity - time
data = data.set_index(['firm','year'])
from linearmodels import PanelOLS
mod = PanelOLS(data.invest, data[['value','capital']], entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
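Since the original goal was heteroskedasticity-robust output, note that fit also accepts a plain robust covariance estimator (a sketch using the mod object from above):
# White-style heteroskedasticity-robust standard errors instead of clustering
res_robust = mod.fit(cov_type='robust')
print(res_robust)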
Best Daniel
I want to apply sklearn.preprocessing.scale, which scikit-learn offers for centering a dataset, to the data I will use to train an SVM classifier.
How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?
I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?
I think that the best way is to pickle it post fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
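A minimal sketch of the pickling approach (using Python's built-in pickle; joblib works analogously):
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.array([[1., 2, 3, 4]]).T)

# persist the fitted scaler once ...
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# ... and reload it later without refitting
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)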
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:
scale_ : ndarray, shape (n_features,)
Per feature relative scaling of the data.
New in version 0.17: scale_ is recommended instead of deprecated std_.
mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
The following short snippet illustrates this:
from sklearn import preprocessing
import numpy as np
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)
>>> s.mean_, s.scale_
(array([ 2.5]), array([ 1.11803399]))
Scale with standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
Save mean_ and var_ for later use
means = scaler.mean_
vars = scaler.var_
(you can print and copy-paste means and vars, or save them to disk with np.save, etc.)
Later use of saved parameters
def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds
scale_new_data = scale_data(new_data)
You can use the joblib module to store the parameters of your scaler.
from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')
Later you can load the scaler.
from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)
Pickle introduces a security vulnerability: it allows attackers to execute arbitrary code on your servers. The preconditions are:
the attacker can replace the pickle file with another pickle file on the server (if no auditing of the pickle is performed, e.g. signature validation or hash comparison)
the same on a developer PC (the attacker has compromised some dev machine)
If your server-side applications are executed as root (or as root inside Docker containers), then this is definitely worth your attention.
Possible solutions:
Model training should be done in a secure environment
Trained models should be signed by the key from another secure environment, which is not loaded to the gpg-agent (otherwise the attacker can quite easily replace the signature)
CI should test the models in an isolated environment (quarantine)
Use Python 3.8 or later, which added security hooks that help prevent code-injection techniques
or just avoid pickle :)
Some links:
https://docs.python.org/3/library/pickle.html
Python: can I safely unpickle untrusted data?
https://github.com/pytorch/pytorch/issues/52596
https://www.python.org/dev/peps/pep-0578/
Possible approach to avoid pickling:
# scaler is fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", allow_pickle=False, scaler_data_)
#some not scaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])
scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])
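The same non-pickle idea works for the StandardScaler from the original question; a sketch (the file name std_scaler.npy is just an example):
import numpy as np
from sklearn.preprocessing import StandardScaler

# fit on some training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X_train)

# save only the fitted parameters, without pickling
np.save("std_scaler.npy", np.array([scaler.mean_, scaler.scale_]), allow_pickle=False)

# later: reload the parameters and standardize new data manually
mean_, scale_ = np.load("std_scaler.npy")
X_new = np.array([[2.5], [10.0]])
print((X_new - mean_) / scale_)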