ks_2samp gives unexpected p-values - python

I want to compare two distributions using the two-sample K-S test.
I'm using Python's (2.7) ks_2samp, but I'm having some trouble.
First of all, I'm not sure whether I should pass just the arrays with my data as parameters or build cumulative distributions from them first. I'm guessing the former...
Secondly, when I run ks_2samp on my data, the p-values it returns don't look realistic.
For example, for a pair of distributions that look like this:
Image: CDF of the two datasets
ks_2samp returns:
D-value = 0.038629201101928384
P-value = 0.0
That would mean (roughly speaking) that the two samples do not come from the same underlying distribution, which seems very strange for these data. A result of exactly 0.0 also looks odd, since the function usually returns values with many decimal places.
With similar input data I get, for example, p-value = 6.65e-136, which also seems very strange.
What could be the problem? Or is everything actually fine?
My arrays contain many NaNs, but I also ran ks_2samp on data with the NaNs masked out and got the same result, so I don't think that is the cause.
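For reference, ks_2samp expects the raw observation arrays rather than pre-computed CDFs. Here is a minimal sketch of the kind of call described above, with synthetic arrays a and b standing in for the real data and NaNs dropped before the test:

import numpy as np
from scipy.stats import ks_2samp

# synthetic stand-ins for the real data, each with some NaNs mixed in
a = np.concatenate([np.random.normal(0.0, 1.0, 50000), [np.nan] * 50])
b = np.concatenate([np.random.normal(0.0, 1.0, 70000), [np.nan] * 30])

# drop the NaNs before testing
a = a[~np.isnan(a)]
b = b[~np.isnan(b)]

D, p = ks_2samp(a, b)
print(D, p)

Note also that with samples this large, even a small D statistic can yield an extremely small, or numerically zero, p-value.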
Thank you very much in advance!

Related

Is there a way to sample from a value range in a GaussianMixture model in Python

I am using the sklearn GaussianMixture library. I have my GMM fitted and working fine; now I would like to sample from it. According to the docs I can use the model.sample(n) function to get n samples. This works, but I would like to sample from a certain range of values, based on the fitted probability, rather than from all values.
To be precise, I want to split the cumulative probability function into k equally probable ranges, like here:
And then be able to say I want to sample from range 5 (roughly 0.57 to 0.71 in the picture).
Is there a way to do this?
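One possible approach, sketched here with synthetic data and made-up parameter values, is to estimate the quantile boundaries of the fitted mixture from a large reference sample and then keep only the draws that fall inside the desired range (simple rejection sampling); this is not a built-in feature of GaussianMixture:

import numpy as np
from sklearn.mixture import GaussianMixture

# fit a 1-D GMM to synthetic data standing in for the real data set
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 0.5, 1000), rng.normal(3, 1.0, 1000)])
gmm = GaussianMixture(n_components=2, random_state=0).fit(data.reshape(-1, 1))

k = 7        # number of equally probable ranges
which = 5    # index of the range to sample from (1-based)

# estimate the CDF boundaries empirically from a large reference sample
ref, _ = gmm.sample(200000)
edges = np.quantile(ref, np.linspace(0, 1, k + 1))
lo, hi = edges[which - 1], edges[which]

# rejection sampling: draw from the GMM, keep only values inside the chosen range
draws, _ = gmm.sample(100000)
in_range = draws[(draws >= lo) & (draws <= hi)]
print(len(in_range), "samples between", lo, "and", hi)

If exact boundaries are needed, the mixture CDF could instead be computed and inverted numerically, but for equally probable ranges the empirical quantiles above are usually close enough.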

How do I apply a mean filter on a data set in Python?

I have a data set of 15,497 sets of values. The first graph shows the raw data (pendulum angle vs. sample number), which, obviously, looks awful. It should look like the second picture (the filtered data). Part of the assignment is to apply a mean filter to smooth the data so that it looks like the second graph. The data is stored in NumPy arrays, but I can't figure out how to apply a mean filter to them.
I'm interested in applying a mean filter to theta (see the screenshot of the Python code), since theta holds the values on the y-axis of the plots. The code is included so you can easily see how the data file is read in.
There is a whole world of filtering techniques, and there is not a single unique 'mean filter'. Moreover, there are causal and non-causal filters (i.e. the difference between not using future values in the filter vs. using them). I'm going to assume you want a mean filter of size N, as that is pretty standard. Then, to apply this filter, convolve your 'theta' vector with a mean kernel.
I suggest printing the mean kernel and studying how it looks for different N; then you may understand how it averages the values in the signal. I also urge you to think about why convolution applies this filter to theta. I'll help you by telling you to think about the equivalent multiplication in the frequency domain. Also, investigate the different modes of the convolution function, as one of them may be better tailored to the specific solution you want.
import numpy as np

N = 2  # filter length; larger N averages over more samples and smooths more
mean_kernel = np.ones(N) / N  # each of the N taps weighs 1/N
filtered_sig = np.convolve(sig, mean_kernel, mode='same')  # 'same' keeps the output the same length as sig
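To make the causal vs. non-causal distinction above concrete, here is a small self-contained sketch (a synthetic noisy signal stands in for theta); the 'same' mode above is the non-causal, centred variant, while truncating a 'full' convolution gives the causal one:

import numpy as np

# synthetic noisy signal standing in for theta
sig = np.sin(np.linspace(0, 10, 500)) + 0.3 * np.random.randn(500)

N = 25
mean_kernel = np.ones(N) / N

# non-causal (centred) moving average: uses samples on both sides of each point
centred = np.convolve(sig, mean_kernel, mode='same')

# causal moving average: each output uses only the current and previous N-1 samples
causal = np.convolve(sig, mean_kernel, mode='full')[:len(sig)]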

Anomaly detection in time series data using Python

I am trying to write Python code that detects anomalies in time series data. My input data looks something like this:
Here, the regions marked in red are anomalies. I want to get the x-coordinates of the data points that are anomalous. So far I have tried a basic if-condition (i.e. if rate < 100, the data point is anomalous) and various statistical techniques such as the mean, the standard deviation, and rolling averages with different window sizes, but none of them have worked well. Is there a way to achieve what I want using statistical methods? If there is no simple way to do this, I understand that I have to look at machine learning algorithms; in that case, which algorithm would be suitable for my dataset? Thank you.
It looks as if your data comes in lumps. If you are able to distinguish between the lumps (perhaps by a certain delay between two samples), you can look at the distribution of the samples within each lump. If you know that your rate will never drop below 100, I would start with that to clean things up a bit, and then look at the remaining distribution. The mode should help identify the "middle", most frequently occurring value. Cutting off everything beyond a certain number of standard deviations might work to get clean data, but there is no guarantee that you won't also cut off data you need.
Edit: you'd have to bin your data before taking the mode.
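A rough sketch of this recipe (thresholding at 100, binning to find the mode, then cutting off points more than a few standard deviations away), using a synthetic rate series with hypothetical names in place of the real data:

import numpy as np

# synthetic "rate" series standing in for the real data: mostly around 500,
# with a few injected dips playing the role of anomalies
rng = np.random.RandomState(0)
rate = rng.normal(500, 20, 1000)
rate[::97] = rng.uniform(0, 90, len(rate[::97]))
x = np.arange(len(rate))

# step 1: hard threshold, since the rate should never drop below 100
hard_anomalies = x[rate < 100]

# step 2: bin the remaining data and take the centre of the fullest bin as the mode
clean = rate[rate >= 100]
counts, edges = np.histogram(clean, bins=50)
i = np.argmax(counts)
mode = 0.5 * (edges[i] + edges[i + 1])

# step 3: flag everything further than a few standard deviations from the mode
soft_anomalies = x[np.abs(rate - mode) > 3 * clean.std()]

anomalous_x = np.union1d(hard_anomalies, soft_anomalies)
print(anomalous_x)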

Comparing models fitting multivariate data

I am having trouble using WAIC (the widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to a multivariate Dirichlet distribution. I try to fit the data by assuming that the marginal distributions are beta distributions in one case and log-normal distributions in the other. As expected, in the first case I get a lower (better) WAIC value than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that the log-normal fit is bad. This is easily seen by the naked eye, but I would like a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm

# generate the data
df = pd.DataFrame(np.random.dirichlet([10, 10, 10], size=2000))

# fit the first case (assuming beta marginal distributions)
betaModel = pm.Model()
with betaModel:
    alpha = pm.Uniform("alpha", lower=0, upper=20, shape=3)
    beta = pm.Uniform("beta", lower=0, upper=40, shape=3)
    observed = pm.Beta("obs", alpha=alpha, beta=beta, observed=df.values, shape=df.shape)
    betaTrace = pm.sample()

# fit the second case (assuming log-normal marginal distributions)
lognormalModel = pm.Model()
with lognormalModel:
    mu = pm.Normal("mu", mu=0, sd=3, shape=3)
    sd = pm.HalfNormal("sd", sd=3, shape=3)
    observed = pm.Lognormal("obs", mu=mu, sd=sd, observed=df.values, shape=df.shape)
    lognormalTrace = pm.sample()

# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel = pm.Model()
with dirichletModel:
    alpha = pm.HalfNormal("alpha", sd=3, shape=3)
    observed = pm.Dirichlet("obs", a=alpha, observed=df.values, shape=df.shape)
    dirichletTrace = pm.sample()

# compare WAIC
print(pm.waic(betaTrace, betaModel))
print(pm.waic(lognormalTrace, lognormalModel))
print(pm.waic(dirichletTrace, dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions on how to obtain a reasonable comparison?
Edit
After thinking about the problem, I now believe it is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared; if the models are too dissimilar, you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components x 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that the components are independent, adding up the log-probabilities does not change anything; the WAIC values remain the same.
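To illustrate what I mean by "adding up the log-probabilities": under the independence assumption, the 6000-column pointwise log-likelihood matrix could in principle be collapsed to 2000 columns, one per vector, assuming the columns are ordered observation by observation (random numbers stand in for the real log-likelihoods here):

import numpy as np

# hypothetical pointwise log-likelihood matrix from one of the marginal models:
# one row per posterior draw, one column per scalar observation
# (3 components x 2000 vectors = 6000 columns)
n_draws, n_vectors, n_components = 100, 2000, 3
logp_pointwise = np.random.randn(n_draws, n_vectors * n_components)

# under the independence assumption, the log-probability of each 3-component
# vector is just the sum of its components' log-probabilities
logp_per_vector = logp_pointwise.reshape(n_draws, n_vectors, n_components).sum(axis=-1)
print(logp_per_vector.shape)  # (100, 2000)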
Currently I think that a small cheat is possible: fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted the beta distributions. This gives the expected result - the WAIC for the Dirichlet fit is slightly larger than the WAIC for the beta fit, but smaller than the WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp

def cheat_logp(tracePoint, model):
    values = model.obs.eval()
    _, components = values.shape
    cb = [None] * components
    beta = np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta = pm.Beta.dist(alpha=tracePoint["alpha"][i], beta=beta - tracePoint["alpha"][i])
        cb[i] = cheatBeta.logp(values[:, i]).eval()
    return np.array(cb).T

def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py,
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt, model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.

Scipy.optimize.minimize only iterates some variables.

I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum((np.abs(modelbeam / np.max(modelbeam) - myresult)) ** 2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just a scalar which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") that scipy.optimize.minimize is designed to solve (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html).
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'. That is, all but four of the values are returned unchanged from my initial guess. To illustrate, I plot my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of step sizes eps. The bounds I am using for my parameters are all (-4000, 4000). I only have SciPy version 0.11, so I haven't tested the basinhopping routine to find the global minimum (that needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet; I thought I'd check whether anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.
