Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(-\lambda y), \quad y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop: if $z$ is drawn uniformly from $(0, 1)$, then $y = -\lambda^{-1} \ln(1 - z)$ follows the exponential distribution above.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
for the above distribution. We then use the sample mean
$$b_y = \frac{1}{L} \sum_{\ell=1}^{L} y^{(\ell)}$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute difference
$$|\mu_y - b_y|$$
between the true mean $\mu_y$ and the estimate $b_y$,
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?
There are a few shortcuts you can take. Python has some built-in methods to do sampling, mainly in the SciPy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.
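In the meantime, here is a minimal sketch of the experiment described above, using only NumPy and Matplotlib; the rate lam and the random seed are arbitrary choices:
import numpy as np
import matplotlib.pyplot as plt

lam = 2.0                    # rate parameter lambda (arbitrary choice)
true_mean = 1.0 / lam
Ls = [10, 100, 1000]
reps = 1000                  # repetitions per sample size
rng = np.random.default_rng(0)

mean_abs_dev = []
for L in Ls:
    z = rng.uniform(size=(reps, L))   # uniform samples on (0, 1)
    y = -np.log(1.0 - z) / lam        # inverse-CDF transform to exponential
    est = y.mean(axis=1)              # sample mean b_y of each repetition
    mean_abs_dev.append(np.mean(np.abs(est - true_mean)))

plt.loglog(Ls, mean_abs_dev, "o-")
plt.xlabel("L")
plt.ylabel("expected absolute deviation")
plt.show()
On log-log axes the points fall on a roughly straight line with slope -1/2, which answers the last part of the question: the Monte Carlo error shrinks like $1/\sqrt{L}$.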
I'm working with genetic data in which alleles were observed n times in t number of chromosomes sequenced. In other words, n successes in t trials.
I want to include an estimate of each allele's frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I want to represent the confidence of that point estimate -- i.e. something about the likelihood of that estimate.
Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but:
How can I estimate the parameters of the distribution in Python?
What representation of the distribution would be ideal as a feature for classical (non-NN) machine learning? A conservative estimate might be the 95% CI upper bound, but how would I calculate that, and is there a better way to featurize the distribution than just taking that one value?
Thanks!
I suppose all of the required information you need can be calculated by means of standard statistical methods, without applying machine learning.
The MLE of the parameter p of your binomial distribution Bin(t, p) is just n/t, as you rightly suggested. If you want a confidence interval instead of a point estimate, one way to get it is the Wald method:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{t}}$$
where z is the $1 - 0.5\alpha$ quantile of a standard normal distribution. You can find more possibilities, depending on your modelling assumptions, via the following link: Binomial confidence intervals.
The 95% CI for $\hat{p}$ can be calculated as indicated above with z = 1.96.
As for the feature engineering for the machine learning algorithm: since your parametric distribution depends on only one estimated parameter p (t being given), you can use p directly as a feature; it uniquely determines the distribution. It is of course also possible to add the CI bounds or the variance as additional features. Everything depends on what exactly you are going to learn and what your final objective/criterion is.
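For illustration, a minimal sketch of the Wald interval in Python (the counts n and t below are made up):
import numpy as np
from scipy import stats

n, t = 30, 100                     # successes, trials (made-up numbers)
p_hat = n / t                      # MLE point estimate
z = stats.norm.ppf(1 - 0.05 / 2)   # ~1.96 for a 95% interval
half_width = z * np.sqrt(p_hat * (1 - p_hat) / t)
print(p_hat - half_width, p_hat + half_width)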
Binoculars implements many methods for calculating binomial confidence intervals (PS: I am the author of Binoculars).
pip install binoculars
If N=(total chromosomes sequenced) and p=(number of times allele is observed / N), you can estimate the confidence interval straightforwardly:
from binoculars import binomial_confidence
N, p = 100, 0.2
binomial_confidence(p, N)
# (0.1307892803998113, 0.28628125447599173)
I am trying to solve a statistics-related real-world problem with Python and am looking for input on my ideas: I have N random vectors from an m-dimensional normal distribution. I have no information about the means and the covariance matrix of the underlying distribution; in fact, even that it is a normal distribution is only an assumption, though a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, the dimensionality of the normal distribution is somewhere between 2 and 5, and the calculation should ideally not exceed 1 minute on a standard home computer.
I am currently thinking about three approaches and am happy about any suggestions for other approaches or preferences between these three:
Fitting: Make a multidimensional histogram of all random vectors and fit a multidimensional normal distribution to the histogram. Problem with this approach: the covariance matrix has many entries; could this be a problem for the fitting process?
Inverting the cumulative distribution function: Make a multidimensional histogram as an approximation of the density of the random vectors, then integrate it to get a multidimensional cumulative distribution function. In one dimension this is invertible, and one could use the CDF to distribute random numbers as in the original distribution. Problem: for the multi-dimensional case the CDF is not invertible(?), and I don't know whether this approach still works then.
Bayesian: Use Bayesian statistics with some normal distribution as prior and update for every observation. The result should always be a normal distribution again. Problem: I think this is computationally expensive? Also, I don't want the later updates to have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library which has this task already implemented? I did not find exactly this in NumPy or SciPy; maybe someone has an idea where else to look?
If the simple estimates described in the section Parameter estimation of the wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.
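For concreteness, a minimal sketch (the synthetic 2-D data below merely stands in for your N observed vectors):
import numpy as np

rng = np.random.default_rng(0)
# stand-in for your (N, m) array of observed random vectors
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=200)

mu_hat = X.mean(axis=0)               # sample mean vector, shape (m,)
sigma_hat = np.cov(X, rowvar=False)   # sample covariance matrix, shape (m, m)
For N in the hundreds and m up to 5, both calls run well under a second, comfortably within the stated time budget.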
I have fit a series of SciPy continuous distributions for a Monte Carlo simulation and am looking to take a large number of samples from these distributions. However, I would like the samples to be correlated, such that the ith sample takes, e.g., the 90th percentile from each of the distributions.
In doing this, I've found a quirk in SciPy performance:
import numpy as np
import scipy as sp
import scipy.stats

# distro_props is a list of fitted (shape, loc, scale) tuples; n is the sample length

# very fast way to take many uncorrelated samples of length n
for shape, loc, scale in distro_props:
    sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)

# verrrrryyyyy slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale in distro_props:
    sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)
Most of the results about this claim that the slowness of these SciPy distros is due to the type-checking etc. wrappers. However, when I profiled the code, the vast bulk of the time was spent in the underlying math function, _continuous_distns.py:179(_norm_pdf). Furthermore, it scales with n, implying that it's looping through every element internally.
The SciPy docs on rv_continuous almost seem to suggest that the subclass should override this for performance, but it seems bizarre that I would monkeypatch into SciPy to speed up their ppf. I would just compute this for the normal from the ppf formula, but I also use lognormal and skewed normal, which are more of a pain to implement.
So, what is the best way in Python to compute a fast ppf for normal, lognormal, and skewed normal distributions? Or more broadly, to take correlated samples from several such distributions?
If you need just the normal ppf, it is indeed puzzling that it is so slow, but you can use scipy.special.erfinv instead:
import numpy as np
from scipy import special, stats
from timeit import timeit

x = np.random.uniform(0, 1, 100)
np.allclose(special.erfinv(2*x - 1)*np.sqrt(2), stats.norm().ppf(x))
# True
timeit(lambda: stats.norm().ppf(x), number=1000)
# 0.7717257660115138
timeit(lambda: special.erfinv(2*x - 1)*np.sqrt(2), number=1000)
# 0.015020604943856597
EDIT:
The lognormal and triangular distributions are also straightforward:
c = np.random.uniform()
np.allclose(np.exp(c*special.erfinv(2*x-1)*np.sqrt(2)),stats.lognorm(c).ppf(x))
# True
np.allclose(((1-np.sqrt(1-(x-c)/((x>c)-c)))*((x>c)-c))+c,stats.triang(c).ppf(x))
# True
With the skew normal I'm not familiar enough, unfortunately.
Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew normal does not have a closed-form analytic definition, so to compute the ppf, SciPy fell back to rv_continuous's generic numeric approximation, which iteratively computes the cdf and uses it to zero in on the ppf value. The skew-normal pdf is the product of the normal pdf and the normal cdf, so this numeric approximation called the normal's pdf and cdf many, many times. This is why, when I profiled the code, it looked like the normal distribution was the problem, not the skew normal. The other answer to this question achieved time savings by skipping type-checking, but it didn't change the growth of the run time with n, just the small-n runtimes.
To solve this problem, I have replaced the skew-normal distribution with the Johnson SU distribution. It has 2 more free parameters than a normal distribution, so it can fit different types of skew and kurtosis effectively. It is supported on all real numbers, and it has a closed-form ppf with a fast implementation in SciPy. (The original post showed example Johnson SU distributions fitted from the 10th, 50th, and 90th percentiles.)
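A rough sketch of what that looks like in SciPy (the shape parameters a and b here are arbitrary stand-ins, not the fitted values):
import numpy as np
from scipy import stats

a, b = 2.5, 1.5                          # Johnson SU shape parameters (arbitrary)
u = np.random.uniform(size=1_000_000)
samples = stats.johnsonsu.ppf(u, a, b)   # closed-form ppf, fast even for large n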
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. From my research, RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the follow way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

# item is a categorical variable
modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary shows a z statistic, and the coefficient significance tests seem to be based on it rather than on a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases each item has 40 or more observations, though some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample environment? The line it produces must be the best-fit line after down-weighting outliers, but is the z-test effective for samples of this size? That is, is there a reason to believe the confidence interval produced by smf.rlm() does not achieve 95% coverage? I know this can potentially be an issue for t-tests.
Thanks!
I have mostly only a general answer, I never read any small sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except for maybe a few special cases. Asymptotic results provide conditions under which the estimator is normally distributed. Given this, statsmodels defaults to using the normal distribution for all models outside the linear regression model (OLS) and similar, and the chi-square instead of the F distribution for Wald tests of joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version, of statsmodels users can choose to use the t and F distribution for the results, instead of the normal and chisquare distribution. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t-distribution should be used, and with which small-sample degrees of freedom. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and of the choice of robust estimator. Another cross-check: does OLS with dropped outliers (observations with small weights in the RLM estimate) give a similar answer?
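A sketch of that cross-check, reusing results, modelspec, and dataset from the question and assuming dataset is a pandas DataFrame; the 0.5 cutoff on the robust weights is an arbitrary choice:
# refit OLS on the observations the RLM did not heavily down-weight
keep = results.weights > 0.5
ols_results = smf.ols(modelspec, data=dataset[keep]).fit()
print(ols_results.summary())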
Finally as general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?
I have a list of many float numbers, representing the duration of an operation performed several times.
For each type of operation, I have a different trend in the numbers.
I'm aware of the many random generators available in some Python modules, like numpy.random: for example binomial, exponential, normal, Weibull, and so on.
I'd like to know if there's a way to find the random generator (with its params) that best fits a given list of values, i.e., the generator that best fits the trend of the numbers in each list I have.
That's because I'd like to automate the generation of time durations for each operation, so that I can simulate it over n years, without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers, and I'm trying to find the probability distribution that best fits it. The only problem I see is that each probability distribution has input params that may affect the result, so I'll have to figure out how to choose these params automatically to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
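A minimal sketch of such a Q-Q plot against an exponential candidate (the synthetic data array stands in for your list of durations):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # stands in for your measured durations

# simulated sample of the same size from the candidate distribution
sim = rng.exponential(scale=data.mean(), size=data.size)

plt.plot(np.sort(sim), np.sort(data), "o", markersize=3)
lims = [0.0, max(sim.max(), data.max())]
plt.plot(lims, lims, "k--")                  # reference line y = x
plt.xlabel("simulated quantiles")
plt.ylabel("sample quantiles")
plt.show()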
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
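For the maximum likelihood route, SciPy's continuous distributions expose a fit method; a brief sketch, reusing the data array from the previous snippet:
from scipy import stats

loc, scale = stats.expon.fit(data, floc=0)               # exponential MLE; rate = 1/scale
c, loc_w, scale_w = stats.weibull_min.fit(data, floc=0)  # Weibull shape, loc, scale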
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: adding a reference to a numpy least-squares fitting example.
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, or the gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the maximum likelihood procedure. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the negative curvature (Hessian) matrix at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is also easy to work out by hand what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the derivative of the log likelihood for the whole data set is $\sum_i (1/\text{la} - x_i)$; setting it to zero gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
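Since the question asks about Python, the same estimate is a one-liner there; a sketch with synthetic data standing in for your measurements:
import numpy as np

data = np.random.exponential(scale=1.0, size=10)  # stand-in sample
lam_hat = 1.0 / data.mean()   # MLE of the exponential rate parameter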