QQ-Plot in Python using Plotnine

I want to plot an array of values against a theoretical distribution using a QQ-Plot in Python. Ideally, I want to create the plot using the plotnine library.
But when I try to create the plot, I'm getting error messages... here's my code with example data:
from scipy.stats import beta
from plotnine import *
import statsmodels.api as sm
import numpy as np
n = 207
values = -1 + np.random.beta(n/2-1, n/2-1, 100) * 2 # my data
dist = beta(n/2-1, n/2-1, loc = -1, scale = 2) # theoretical distribution
# 1. try:
ggplot(aes(sample = values)) + stat_qq(distribution = dist)
# gives ValueError: Unknown continuous distribution '<scipy.stats._distn_infrastructure.rv_frozen object at 0x0000029755C5C070>'
# 2. try:
params = {'a':n/2-1, 'b':n/2-1, 'loc':-1, 'scale':2}
ggplot(aes(sample = values)) + stat_qq(distribution = 'beta', dparams = params)
# gives TypeError: '>' not supported between instances of 'numpy.ndarray' and 'int'
Does anyone know what I'm doing wrong?
When I try to plot using statsmodels, it seems to work fine:
sm.qqplot(values, dist, line = '45')
As always, any help is highly appreciated!

This is a bug in plotnine; until it is fixed, you can pass the arguments as a tuple instead of a dict. However, be careful about the positional matching of the arguments, as in the sketch below.
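A minimal sketch of the workaround, assuming dparams accepts the beta parameters positionally in the order (a, b, loc, scale):
params = (n/2 - 1, n/2 - 1, -1, 2)  # positional order: a, b, loc, scale
ggplot(aes(sample = values)) + stat_qq(distribution = 'beta', dparams = params)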
Edit
The bug is fixed in the current development version of plotnine, so you can use a dict to pass the arguments.

Related

statsmodels glm fit_constrained 'unrecognized token'

Trying to use statsmodels.glm with constraints and unable to get (what I think should be) a simple requirement to work...
import numpy as np
import pandas as pd
from statsmodels.tools import add_constant
from statsmodels.formula.api import glm
test = np.array([[10,29,30],[32,26,23],[34,39,46]])
exog = add_constant(test)
y = np.array([11,23,27])
formula = 'y ~ x + q'
formula_df = pd.DataFrame({'y': y, 'x': test[:,0], 'q': test[:,1], 'r': test[:,2], 'int': exog[:,3]})
glmm = glm(formula=formula,data=formula_df)
glmm_results = glmm.fit_constrained(constraints='q = 0')
print(glmm_results.summary())
From the docs for fit_constrained:
The constraints are of the form R params = q where R is the constraint_matrix and q is the vector of constraint_values.
The estimation creates a new model with transformed design matrix, exog, and converts the results back to the original parameterization.
Parameters
constraints : formula expression or tuple
If it is a tuple, then the constraint needs to be given by two arrays (constraint_matrix, constraint_value), i.e. (R, q). Otherwise, the constraints can be given as strings or a list of strings. See t_test for details.
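For illustration, a sketch of the tuple form for the equality constraint q = 0, assuming the fitted parameter order is (Intercept, x, q):
R = np.array([[0., 0., 1.]])  # picks out the coefficient on q; the constraint reads R @ params = q_vec
q_vec = np.array([0.])
glmm_results = glmm.fit_constrained(constraints=(R, q_vec))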
However, if I try to simply ensure that q is > 0 by changing the constraint to:
glm_results = glmm.fit_constrained(constraints='q > 0')
then I get an unrecognized-token error:
PatsyError: unrecognized token in constraint
q > 0
^
I've tried '>>' as well. There isn't much documentation I can find on writing statsmodels constraints beyond what I copied above. How do I get this to work?
An alternate question would be: how do I write the four simplest types of constraints on coefficients in the (R, q) format (x = 0, x < 0, x > 0, a < x < b)?

Statsmodels: vector_ar and IRAnalysis

I'm trying to estimate impulse response functions for a -1 standard-deviation shock to a 3-dimensional VAR using statsmodels.tsa; however, I'm currently having issues with setting the shock magnitude.
This gives me the IRFs for a 1 s.d. shock, the default:
import numpy as np
import statsmodels.tsa as sm
model = sm.vector_ar.var_model.VAR(endog=data)  # data: an (nobs x 3) array
fitted = model.fit()
shock = -1 * fitted.sigma_u
irf = sm.vector_ar.irf.IRAnalysis(model=fitted)
The IRAnalysis function takes an argument P, an upper-triangular matrix that sets the shocks; I found this by looking at the source code. However, passing P as shown below doesn't seem to do anything:
irf = sm.vector_ar.irf.IRAnalysis(model=fitted, P=-np.linalg.cholesky(fitted.sigma_u))
I would really appreciate some help.
Thanks in advance.
I have had the same question and finally found something that works on my end.
Instead of using IRAnalysis explicitly, I found that transforming the VAR model into its MA representation was the best way to adjust the size of the shock.
from statsmodels.tsa.vector_ar.irf import IRAnalysis
T = 20  # horizon (number of periods); pick whatever you need
J = fitted.ma_rep(T)  # MA coefficient matrices
J = shock * np.array(J)
This will give you the IRFs for T periods.
I also wanted the standard error bands on my plots, so I used the corresponding Monte Carlo error-band function as well:
G, H = fitted.irf_errband_mc(orth=False, repl=1000, steps=T, signif=0.05, seed=None, burn=100, cum=False)  # lower and upper bands
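For reference, a rough plotting sketch under the assumptions above (J scaled by the shock; G and H taken as the lower and upper bands, with indexing [period, response, impulse] assumed):
import matplotlib.pyplot as plt
# Response of variable 0 to a shock in variable 1 over T periods,
# with the Monte Carlo error band shaded
resp = J[:T, 0, 1]
plt.plot(range(T), resp, label='IRF')
plt.fill_between(range(T), G[:T, 0, 1], H[:T, 0, 1], alpha=0.3, label='95% band')
plt.legend()
plt.show()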
Hope this helps

Issues with using parameters for a K-S test and understanding the result

I'm trying to run a K-S test on some data. I have the code working, but I'm not sure I understand what's going on, and I also get an error when trying to set the loc. Essentially I get both the K-S statistic and the p-value, but I'm not sure I grasp them well enough to use the result.
I'm using scipy.stats.ks_2samp.
This is the code I am running:
import numpy as np
from scipy import stats
np.random.seed(12345678)  # fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
Which gives this:
K-S Statistics 0.04507948306145837
P-value 0.8362207851676332
Now, in the examples I've seen, the loc is added like this:
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, loc=0.5, scale=scale)
If I do that, however, I get this error:
Traceback (most recent call last):
File "<ipython-input-342-aa890a947919>", line 13, in <module>
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
File "/home/kongstad/anaconda3/envs/tensorflow/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 937, in rvs
args, loc, scale, size = self._parse_args_rvs(*args, **kwds)
TypeError: _parse_args_rvs() got multiple values for argument 'loc'
Here is a snapshot showing the contents of the two datasets being used, low_ni_sample and high_ni_sample.
So my questions are:
Why can't I add a loc value, and what does it represent?
Changing the scale changes the result significantly; why, and what should I go by?
How would I plot this in a way that makes sense?
After running Silma's suggestion, I stumbled upon a new error.
import numpy as np
from scipy import stats
np.random.seed(12345678)  # fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
ndist = stats.norm(loc=0., scale=scale)
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
rvs2 = ndist.rvs(high_ni_sample[:,0],size=n2)
#rvs1 = stats.norm.rvs(low_ni_sample[:,2], size=n1, scale=scale)
#rvs2 = stats.norm.rvs(high_ni_sample[:,2], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
With this error message:
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
TypeError: rvs() got multiple values for argument 'size'
The error comes from the fact that you should first create an instance of the normal distribution before using it:
ndist = stats.norm(loc=0., scale=scale)
then do
rvs1 = ndist.rvs(size=n1)
to generate n1 samples drawn from a normal distribution centered on 0 and with a standard deviation scale.
The location is therefore the mean of your distribution.
Changing the scale changes the variance of your distribution (you get more variability), so this obviously impacts the KS test...
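To see the effect, here is a small self-contained sketch (with made-up sizes and scales) comparing two normal samples that share a mean but differ in scale:
import numpy as np
from scipy import stats
np.random.seed(12345678)
a = stats.norm(loc=0., scale=1.).rvs(size=500)
b = stats.norm(loc=0., scale=3.).rvs(size=500)
ks_stat, p_val = stats.ks_2samp(a, b)
print(ks_stat, p_val)  # large statistic, tiny p-value: the samples look different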
As for the plot, I'm not sure I see what you mean... if you want to plot the histograms, then do
import matplotlib.pyplot as plt
plt.hist(rvs1)
plt.show()
Or even better, install seaborn and use its distplot function, for instance with a KDE overlay.
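For instance, something along these lines (note that distplot has since been deprecated in newer seaborn releases in favor of histplot/displot):
import seaborn as sns
import matplotlib.pyplot as plt
sns.distplot(rvs1)  # histogram with a KDE overlay
plt.show()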
Overall, I would advise you to read a little more on distributions and K-S tests before you go any further; see for instance the Wikipedia page.
EDIT
The code shown above generates random samples from a given distribution (which I assumed was your goal, to compare against your samples).
If what you want to do is directly compare your two samples, then all you need is:
ksresult = stats.ks_2samp(low_ni_sample[:,0], high_ni_sample[:,0])
Again, this assumes that low_ni_sample[:,0] and high_ni_sample[:,0] are 1D arrays containing many measurements of the quantity of interest; cf. the ks_2samp documentation.

rpy2 Dynamic Time Warping (dtw) in Python - windowing does not work

A now-closed discussion shows how to use the R dtw package from Python. This is a little clumsy, but the R dtw package is great and better than the currently available Python dtw implementations. Unfortunately, the windowing functions like the Sakoe-Chiba band do not work when specifying a "window.size". There appears to be an issue with the argument mapping: a "." in R argument names is supposed to be replaced with "_" when using rpy2, but following this convention, the argument is not picked up for some reason.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=5)
>>> RRuntimeError: Error in window.function(row(wm), col(wm), query.size= n, reference.size = m, :
argument "window.size" is missing, with no default
You can see that the error states "window.size" is missing, despite "window_size" clearly being specified in the rpy2 fashion.
Just a note from the future: this question is now superseded by the feature-equivalent dtw-python package (also found on PyPI). The rpy2-R-dtw bridge should no longer be necessary.
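For completeness, a rough equivalent using dtw-python (a sketch; it assumes the package is installed via pip install dtw-python and that its window_args API mirrors the R package's window.size):
import numpy as np
from dtw import dtw
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.random.uniform(size=100) / 10
alignment = dtw(query, template, keep_internals=True,
                window_type='sakoechiba', window_args={'window_size': 10})
print(alignment.distance)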
Answering my own question in case anyone ever has the same issue: the problem is the argument mapping and the R three-dots ellipsis '...'. This can be fixed by specifying the mapping manually.
from rpy2.robjects.functions import SignatureTranslatedFunction
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
So with this specification the window_size argument is used correctly.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.functions import SignatureTranslatedFunction
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=10)
dist = alignment.rx('distance')[0][0]
print(dist)
>>> 117.348292359

SciPy: generating custom random variable from PMF

I'm trying to generate random variables according to a certain ugly distribution, in Python. I have an explicit expression for the PMF, but it involves some products which make it unpleasant to obtain and invert the CDF (see the code below for the explicit form of the PMF).
In essence, I'm trying to define a random variable in Python by its PMF and then have built-in code do the hard work of sampling from the distribution. I know how to do this if the support of the RV is finite, but here the support is countably infinite.
The code I am currently trying to run, as per @askewchan's advice below, is:
import numpy as np
from scipy import stats

class x_gen(stats.rv_discrete):
    def _pmf(self, k, param):
        num = np.arange(1 + param, k + param, 1)
        denom = np.arange(3 + 2*param, k + 3 + 2*param, 1)
        p = (2 + param) * (np.prod(num) / np.prod(denom))
        return p

pa_limit = x_gen()
print(pa_limit.rvs(alpha, size=1))  # alpha is the distribution parameter, defined elsewhere
However, running this raises the following error:
File "limiting_sim.py", line 42, in _pmf
num = np.arange(1+param, k+param, 1)
TypeError: only length-1 arrays can be converted to Python scalars
Basically, it seems that the np.arange() call isn't working somehow inside the _pmf() function. I'm at a loss to see why. Can anyone enlighten me and/or point out a fix?
EDIT 1: Cleared up some questions from askewchan; the edits are reflected above.
EDIT 2: askewchan suggested an interesting approximation using the factorial function, but I'm looking for an exact solution such as the one I'm trying to get working with np.arange.
You should be able to subclass rv_discrete like so:
class mydist_gen(rv_discrete):
    def _pmf(self, n, param):
        return yourpmf(n, param)
Then you can create a distribution instance with:
mydist = mydist_gen()
And generate samples with:
mydist.rvs(param, size=1000)
Or you can then create a frozen distribution object with:
mydistp = mydist(param)
And finally generate samples with:
mydistp.rvs(1000)
With your example, this should work, since factorial broadcasts over arrays (rv_discrete calls _pmf with an array of k values, which is why the np.arange version fails). But it might fail for large enough alpha, where the factorials overflow:
import numpy as np
from scipy import stats
from scipy.misc import factorial

class limitrv_gen(stats.rv_discrete):
    def _pmf(self, k, alpha):
        # num = np.prod(np.arange(1 + alpha, k + alpha))
        num = factorial(k + alpha - 1) / factorial(alpha)
        # denom = np.prod(np.arange(3 + 2*alpha, k + 3 + 2*alpha))
        denom = factorial(k + 2 + 2*alpha) / factorial(2 + 2*alpha)
        return (2 + alpha) * num / denom

pa_limit = limitrv_gen()
alpha = 100
pa_limit.rvs(alpha, size=10)
