Can scipy.stats calculate the pmf automatically? - python

I have two questions.
1) I have an array like [1,2,3,4,5,5,3,1], and I don't know which distribution it follows. Can I use scipy.stats to calculate the pmf and cdf automatically?
2) Is scipy.stats just a library of distributions? If I want to analyze data, do I have to find a matching distribution or define one myself, and then calculate quantities such as the pmf manually? Am I understanding this correctly?

scipy.stats does not identify the distribution of your data and compute the pmf and cdf for you automatically. It is a library that eases the work of estimating the probability distribution. You have to explore your data and find the distribution that fits it with the least error; that is the real task, and scipy.stats helps you with it, so you don't have to reinvent the wheel by writing all the mathematical functions yourself.
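If all you need is the empirical pmf and cdf of the observed values (without assuming any named distribution), plain NumPy is enough. A minimal sketch using the array from the question; the variable names are only illustrative:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 5, 3, 1])

# Distinct values and how often each one occurs.
values, counts = np.unique(data, return_counts=True)

# Empirical pmf: relative frequency of each distinct value.
pmf = counts / counts.sum()

# Empirical cdf: running sum of the pmf over the sorted values.
cdf = np.cumsum(pmf)

for v, p, c in zip(values, pmf, cdf):
    print(f"x={v}: pmf={p:.3f}, cdf={c:.3f}")
```

This only describes the sample itself; it says nothing about which theoretical distribution generated it.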
To answer your question in the comment: suppose you have a dataset. To get some insight and a starting point for your analysis, plot the data as a histogram (which also shows how the data are distributed). You can then overlay different distributions from scipy.stats on the same plot to get a feel for the best fit.
Check out this answer, it might help:
https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best
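The "try several distributions and compare" idea can be sketched roughly as follows. The data here is synthetic, the three candidate distributions are an arbitrary choice, and the sum of squared errors against the histogram is just one possible (fairly crude) goodness-of-fit score, so treat all of these as assumptions rather than a recommended procedure:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for real data (assumption: continuous, positive values).
rng = np.random.default_rng(42)
data = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Arbitrary shortlist of candidate distributions from scipy.stats.
candidates = {"norm": stats.norm, "gamma": stats.gamma, "lognorm": stats.lognorm}

# Normalised histogram, evaluated at the bin centres.
density, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

sse = {}
for name, dist in candidates.items():
    params = dist.fit(data)               # maximum-likelihood parameter estimates
    fitted = dist.pdf(centers, *params)   # fitted pdf at the bin centres
    sse[name] = float(np.sum((density - fitted) ** 2))

best = min(sse, key=sse.get)
print(sse, "-> smallest error:", best)
```

Plotting each fitted pdf over the histogram, as described above, is still worth doing, since a single error number can hide a poor fit in the tails.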

Related

Is there a Python package for monotonic splines?

I am trying to find a procedure to fit data monotonically in Python.
The data won't necessarily be monotonic, but the fit must be because of theoretical assumptions: the signal must be monotonic, but the measurements are taken with noise.
I imagine that a way of doing that would be to run an isotonic regression and then interpolate using a cubic spline. Are there easier alternatives?
In R, for example, I would use the cobs package for constrained splines. Does anything similar exist in Python?
Other ways of achieving the same result would also be fine if effective (e.g. fitting curves on monotonic transformations of the data that would maintain the overall shape of the relationship). I already know there are ways of achieving a similar result with GBM but I am looking for an alternative.
Thank you
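For reference, the two-step idea from the question (isotonic regression followed by a smooth interpolation) could look roughly like this. It assumes scikit-learn is installed, and it swaps the plain cubic spline for SciPy's PchipInterpolator, because an ordinary cubic spline can overshoot between points while PCHIP preserves the monotonicity of the points it interpolates. The data is synthetic:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from sklearn.isotonic import IsotonicRegression

# Synthetic noisy measurements of a monotonically increasing signal.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = np.log1p(x) + rng.normal(scale=0.15, size=x.size)

# Step 1: isotonic regression gives a non-decreasing step-like fit.
iso = IsotonicRegression(increasing=True)
y_iso = iso.fit_transform(x, y)

# Step 2: shape-preserving interpolation through the isotonic fit.
smooth = PchipInterpolator(x, y_iso)
x_fine = np.linspace(x.min(), x.max(), 500)
y_fine = smooth(x_fine)

print("monotone:", bool(np.all(np.diff(y_fine) >= -1e-9)))
```

The result is monotone but still has flat stretches wherever the isotonic fit pooled neighbouring points, so it is not a substitute for a true constrained spline like cobs.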

Adjusted Boxplot in Python

For my thesis, I am trying to identify outliers in my data set. The data set consists of 160000 time points of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of only "expert opinion".
I've read about the IQR method for flagging possible outliers when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, inverse gamma and lognormal were the best fit.
So, during my search for methods for non-symmetric distributions, I found this topic on Cross Validated, where user603's answer is particularly interesting: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this:
(There is an R implementation of this (robustbase::adjbox()) as well as a matlab one (in a library called libra))
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see the paper in user603's answer) with Python?
I would really like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
With this variable you can calculate the interval beyond which outliers are defined.
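A rough sketch of that calculation, assuming the fence formulas from the adjusted boxplot paper by Hubert and Vandervieren (the exponents -4/3 for non-negative medcouple and -3/4 for negative medcouple). The data is a synthetic right-skewed sample, not the thesis data. Note that, as far as I know, the statsmodels implementation needs memory on the order of N squared, so for 160000 points you would probably have to compute the medcouple on a random subsample:

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

# Synthetic right-skewed sample (assumption: stands in for the real data).
rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.75, size=2000)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mc = float(medcouple(data))  # robust skewness measure in [-1, 1]

# Adjusted fences: the whisker on the long-tail side is stretched,
# the one on the short-tail side is shortened.
if mc >= 0:
    lower = q1 - 1.5 * np.exp(-4.0 * mc) * iqr
    upper = q3 + 1.5 * np.exp(3.0 * mc) * iqr
else:
    lower = q1 - 1.5 * np.exp(-3.0 * mc) * iqr
    upper = q3 + 1.5 * np.exp(4.0 * mc) * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"MC={mc:.3f}, fences=({lower:.3f}, {upper:.3f}), flagged={outliers.size}")
```

With a medcouple of zero these fences reduce to the usual 1.5 * IQR rule, which is a quick sanity check on the formulas.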

How to use statsmodels to fit data

I have a dataset which I need to fit to a GEV distribution. The data is one-dimensional and is stored in a numpy array. Currently, I am using scipy.stats.genextreme.fit(data), which runs fine but gives totally inaccurate results (obvious from plotting the pdf). After some investigation it turns out that my data does not fit well in log space, which scipy uses in its MLE fitting algorithm, so I need to try something like GMM instead, which is only available in statsmodels. The problem is that I can't find anything that looks like scipy's fit function. All the examples I've found seem to deal with far more complicated data than I have. Also, statsmodels requires endog and exog parameters for everything, and I have no idea what these are.
This should be really simple, so I'm sure I'm missing something obvious. Has anyone used statsmodels in this way, and if so, any pointers as to how to do it?
I'm guessing you want Gaussian Mixture Model (GMM) and not Generalized Method of Moments (GMM). The former GMM is available in scikit-learn here. The latter has code in statsmodels, but it's a work in progress.
EDIT: Actually, it's not clear to me that you want GMM. Maybe you just want a kernel density estimator (KDE). This is available in statsmodels here, with an example.
Hmm, if you do want to use (Generalized) Method of Moments to fit some kind of probability weighted GEV, then you need to specify the moment conditions, but I don't have a ready example for (G)MM in statsmodels for how you specify the moment conditions. You might be better off asking on the mailing list.
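For the KDE route mentioned above, a minimal sketch with statsmodels' KDEUnivariate might look like this. The data is synthetic GEV noise with made-up parameters, and the comparison against scipy's own genextreme fit is only there as a sanity check, not as a fix for the MLE problem described in the question:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid
from statsmodels.nonparametric.kde import KDEUnivariate

# Synthetic one-dimensional data (assumption: parameters chosen arbitrarily).
rng = np.random.default_rng(3)
data = stats.genextreme.rvs(c=-0.1, loc=10.0, scale=2.0, size=500, random_state=rng)

# Nonparametric density estimate: no endog/exog, just the 1-D array.
kde = KDEUnivariate(data)
kde.fit()  # Gaussian kernel with the default bandwidth rule
area = trapezoid(kde.density, kde.support)  # should be close to 1

# Parametric GEV fit from scipy, evaluated on the same grid for comparison.
shape, loc, scale = stats.genextreme.fit(data)
gev_pdf = stats.genextreme.pdf(kde.support, shape, loc=loc, scale=scale)
max_gap = float(np.max(np.abs(gev_pdf - kde.density)))

print(f"KDE area={area:.3f}, largest pdf difference={max_gap:.4f}")
```

A large gap between the two curves is a hint that the parametric fit has gone wrong, which is easier to judge from this than from the histogram alone.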

Multivariate distributions with Python

Problem
I have computed a probability density function that depends on two variables. I want to use this multivariate distribution to generate some random numbers that occur with a probability proportional to the PDF.
As it seems, SciPy currently only supports univariate distributions. Are there any simple methods or easy-to-use packages that allow 2d-distributions?
As a workaround, I might try creating random numbers on the domain of interest and throwing them away or keeping them with a chance related to my PDF, but still there might be other options. The random number generation does not have to be fast.
Thanks for your help!
Here's a possible solution
Based on the answers (thanks a lot!), I hacked together some code that you may find in this gist. If you run this example with a sin^2*Gauss PDF, 2000 random variates that fulfil a given condition (being inside a circle) will be plotted over the PDF. Maybe that's helpful for others, too.
So you have a PDF F(x,y) and you want to generate the pairs of x and y distributed according to this PDF?
I'd say unless you can use the multivariate version of the inversion technique (wiki), the rejection sampling is the way to go.
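A bare-bones version of rejection sampling in 2D, as a sketch only. The example density (sin^2 times a Gaussian, loosely following the one mentioned in the question), the sampling box, and the upper bound of 1.0 are all assumptions you would replace with your own PDF and a valid bound on its maximum:

```python
import numpy as np

def pdf(x, y):
    """Unnormalised example density: sin^2(x) times a 2D Gaussian."""
    return np.sin(x) ** 2 * np.exp(-0.5 * (x ** 2 + y ** 2))

def rejection_sample(pdf, n, bounds, pdf_max, rng):
    """Draw n (x, y) pairs with probability proportional to pdf inside bounds.

    pdf_max must be an upper bound on pdf over the whole box.
    """
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    accepted = []
    total = 0
    while total < n:
        # Propose points uniformly in the box, with a uniform "height" each.
        x = rng.uniform(x_lo, x_hi, size=n)
        y = rng.uniform(y_lo, y_hi, size=n)
        u = rng.uniform(0.0, pdf_max, size=n)
        keep = u < pdf(x, y)  # accept points that fall under the pdf surface
        batch = np.column_stack([x[keep], y[keep]])
        accepted.append(batch)
        total += len(batch)
    return np.concatenate(accepted)[:n]

rng = np.random.default_rng(0)
samples = rejection_sample(pdf, 2000, ((-4, 4), (-4, 4)), pdf_max=1.0, rng=rng)
print(samples.shape)
```

The PDF does not need to be normalised for this to work. The tighter the bound and the box, the fewer proposals are thrown away, but since speed is not a concern here a loose bound is fine.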
For variables X and Y, couldn't you separate it into sampling two univariate distributions by just generating an x with the independent distribution of X, and a y with the distribution of Y given x?

Using PyMC to perform double integration

I need to perform double integration using an MCMC method. I have already done it using romberg and doublequad integrations with correct results. I also need to use MCMC integration to compare the results. I found it difficult to understand PyMC.
The outline is this: I have some timeseries data and I need to find out which distribution fits it. I have a set of equations that tells me what to do that involves Double Integration.
Hoping for some guidance.
I'd suggest you start with a simple direct-sampling MC and do a trivial 2D integral for which you can obtain the answer with paper and pencil. Then move on to an MCMC for the same integral.
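As a starting point for the first step, here is a sketch of direct-sampling Monte Carlo for a trivial double integral. The integrand x^2 + y^2 over the unit square is an arbitrary choice whose exact value, 2/3, is easy to check by hand; it uses only NumPy, not PyMC:

```python
import numpy as np

def f(x, y):
    """Integrand with a known answer: integral over [0,1]x[0,1] is 2/3."""
    return x ** 2 + y ** 2

rng = np.random.default_rng(0)
n = 200_000

# Sample points uniformly over the integration domain.
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, 1.0, size=n)
values = f(x, y)

# Integral is approximately (area of domain) * (mean of f over the samples).
area = 1.0
estimate = area * values.mean()
std_error = area * values.std(ddof=1) / np.sqrt(n)
exact = 2.0 / 3.0

print(f"estimate={estimate:.5f} +/- {std_error:.5f}, exact={exact:.5f}")
```

The standard error shrinks like 1/sqrt(n), which gives you a yardstick for comparing against the romberg and quadrature results before moving on to MCMC.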
