I would like to investigate whether there are significant differences between three different groups. There are about 20 numerical attributes for these groups, and for each attribute there are about a thousand observations.
My first thought was to calculate a MANOVA. Unfortunately, the data are not normally distributed (tested with the Anderson-Darling test). From just looking at the data, the distribution is too narrow around the mean and has no tail at all.
When I calculate the MANOVA anyway, I get highly significant results that are completely against my expectations.
Therefore, I would like to calculate a multivariate Kruskal-Wallis test next. So far I have found scipy.stats.kruskal; unfortunately, it only compares individual data series with each other. Is there already a Python implementation similar to a MANOVA, where you read in all attributes and all three groups and then get a single result?
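A minimal sketch of the per-attribute fallback with scipy.stats.kruskal (synthetic data stands in for the real attributes; note this is not a true multivariate test, and a Bonferroni correction is applied because each of the 20 attributes is tested separately):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups, 20 attributes, 1000 observations each.
groups = [rng.normal(size=(1000, 20)) for _ in range(3)]

n_attributes = groups[0].shape[1]
for j in range(n_attributes):
    # Kruskal-Wallis on attribute j across the three groups.
    stat, p = stats.kruskal(*(g[:, j] for g in groups))
    p_corrected = min(p * n_attributes, 1.0)  # Bonferroni correction
    print(f"attribute {j}: H = {stat:.2f}, corrected p = {p_corrected:.4f}")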
If you need more information, please let me know.
Thanks a lot! :)
I have a simple yet broad question regarding two methods:
scipy.stats.randint
and
numpy.random.randint
After reading the API for both methods I'm a bit confused as to when it is best to use each method; therefore, I was wondering if someone could outline the differences between the two and possibly offer some examples of when one method would be preferable to use over the other. Thanks!
Edit: Links to each method's documentation -> numpy.random.randint, scipy.stats.randint
The major difference is that scipy.stats.randint gives you a full distribution object rather than just a sampler: besides drawing random values, it exposes the pmf, cdf, survival function, percent-point function (inverse cdf), and tail probabilities (see the methods section of the scipy.stats.randint documentation). It's therefore much more useful if you want to work with the discrete uniform distribution itself, not just sample from it.
If you really just want to draw a random integer that falls within a certain range, with no requirements beyond that, then numpy.random.randint is more straightforward. The values are drawn directly from a discrete uniform distribution, with no built-in option to modify that.
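A minimal sketch of the difference (the specific bounds and sizes here are just illustrative):

import numpy as np
from scipy import stats

# numpy.random.randint: just draws integers uniformly from [low, high).
samples_np = np.random.randint(0, 10, size=5)

# scipy.stats.randint: a discrete-uniform distribution object on [low, high),
# so you also get pmf, cdf, ppf, etc., not just sampling.
dist = stats.randint(0, 10)
samples_sp = dist.rvs(size=5, random_state=0)
print(dist.pmf(3))    # P(X = 3)  -> 0.1
print(dist.cdf(4))    # P(X <= 4) -> 0.5
print(dist.ppf(0.9))  # 90th percentile -> 8.0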
I need to make a randomizing function in Python returning values using long tail distribution. Unfortunately, my math skills are nowhere near my programming skills so I'm stuck.
This is the kind of distribution I'm looking for:
[image of a long-tailed distribution peaking near the left of the x axis; source: danvk.org]
The returned value must be between 0 and 1, and it must be possible to specify a peak value (the position along the x axis where the curve peaks, i.e. the mode), which would also be a number between 0 and 1.
Example usage:
def random_long_tail(peak):
    # magic happens here
    ...

value = random_long_tail(0.2)
print(value)  # outputs e.g. 0.345811242
I will be incredibly grateful for any help in solving this issue. Thank you!
There are quite a few distributions with a single peak and some tail: log-normal, gamma, and chi-squared, to name a few.
Typically, one can look through numpy's random module to see what's available and how it fits your problem. Link: http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.random.html
You might look into Numpy's random sampling options. See which ones are included in the list of common heavy-tailed distributions.
The log-normal is a nice example. Numpy allows you to specify the underlying mean and standard deviation, but you will have to do a bit of algebra to "assign a peak value". This peak is called the "mode"; for a log-normal with parameters mu and sigma it sits at exp(mu - sigma^2).
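A minimal sketch along those lines (the choice of sigma is an assumption that controls the tail weight, and samples above 1 are rejected to keep the result in the required range):

import numpy as np

def random_long_tail(peak, sigma=0.5, rng=None):
    """Draw one value in (0, 1) from a log-normal whose mode is `peak`.

    Since the log-normal mode is exp(mu - sigma**2), we solve for mu.
    `sigma` is an assumed knob for tail heaviness; samples >= 1 are
    rejected so the result stays within (0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.log(peak) + sigma ** 2  # places the mode at `peak`
    while True:
        value = rng.lognormal(mean=mu, sigma=sigma)
        if value < 1.0:
            return value

print(random_long_tail(0.2))  # e.g. 0.345811242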
I have two questions.
1) I have an array like [1, 2, 3, 4, 5, 5, 3, 1], and I don't know which distribution it follows. Can I use scipy.stats to calculate the pmf and cdf automatically?
2) Is scipy.stats just a library of distributions? If I want to analyze data, do I have to pick a distribution or define one, and then manually calculate some things, like the pmf? Am I understanding this correctly?
Well, scipy.stats is not a library that tells you the distribution of your data and calculates the pmf and cdf automatically. It's a library for easing your tasks while estimating the probability distribution: you still have to explore your data and find which distribution fits it with the least error, which is the ultimate task, and scipy.stats helps you achieve this. You don't have to reinvent the wheel, as they say, by writing all the mathematical functions again and again.
Well, to answer the question in your comment: let's suppose you have a dataset. To get some insight and a starting point for your analysis, plot the data as a histogram (which shows the distribution of the data); you can then plot different candidate distributions on the same plot using scipy.stats to get a feel for the best fit, as in the sketch below.
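A minimal sketch of that histogram-plus-overlay idea (the gamma-distributed sample and the two candidate distributions are just placeholders for your own data and choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder sample; in practice use your own array.
data = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=1000)

# Histogram of the data, normalized so it is comparable to a pdf.
plt.hist(data, bins=40, density=True, alpha=0.5, label="data")

# Fit a couple of candidate distributions and overlay their pdfs.
x = np.linspace(data.min(), data.max(), 200)
for dist in (stats.gamma, stats.lognorm):
    params = dist.fit(data)  # maximum-likelihood fit of the parameters
    plt.plot(x, dist.pdf(x, *params), label=dist.name)

plt.legend()
plt.show()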
Check out this answer, it might help ya...
https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best
I have a list of many floating-point numbers, representing the duration of an operation performed many times.
For each type of operation, I have a different trend in numbers.
I'm aware of the many random generators provided by some Python modules, like numpy.random.
For example, there are binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find the random generator that best fits a given list of values.
I.e., the generator (with its params) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of the time lengths of each operation, so that I can simulate it over n years, without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers, and I'm trying to find the probability distribution that best fits it. The only problem I see is that each probability distribution has input params that may affect the result, so I'll have to figure out how to choose these params automatically to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it is a Q-Q plot: using the random number generators, create a sample of the same size as your data, sort both, and plot them against one another. If the distributions are the same, you should get a straight line; see the sketch below.
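A minimal sketch of that sort-and-plot procedure (the gamma data and the exponential candidate are placeholders for your own numbers and candidate generator):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder observed data; substitute your own list.
data = np.sort(rng.gamma(shape=2.0, scale=1.5, size=300))

# Reference sample of the same size drawn from a candidate generator.
candidate = np.sort(rng.exponential(scale=data.mean(), size=data.size))

# Q-Q plot: points near the straight line suggest the candidate fits.
plt.plot(candidate, data, "o", markersize=3)
lo, hi = data.min(), data.max()
plt.plot([lo, hi], [lo, hi], "--")  # reference line y = x
plt.xlabel("candidate quantiles")
plt.ylabel("data quantiles")
plt.show()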
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how much data you have and the precision you require, you may well find that just playing with the parameters by hand gives you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least-squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: Adding a reference to a numpy least-squares fitting example.
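A rough sketch of the binned goodness-of-fit check described above (it uses scipy's built-in maximum-likelihood fit rather than least squares for brevity, and the exponential sample is a placeholder):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=1000)  # placeholder sample

# Fit a candidate distribution, then compare binned observed counts
# against the counts the fitted model predicts.
loc, scale = stats.expon.fit(data)
edges = np.quantile(data, np.linspace(0, 1, 11))  # 10 equal-count bins
observed, _ = np.histogram(data, bins=edges)
expected = data.size * np.diff(stats.expon.cdf(edges, loc=loc, scale=scale))
expected *= observed.sum() / expected.sum()  # chisquare needs matching totals
# ddof accounts for the two fitted parameters (loc and scale).
chi2, p = stats.chisquare(observed, expected, ddof=2)
print(chi2, p)  # a small p-value would argue against this candidate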
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, the gamma on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the maximum likelihood procedure. It is not a least-squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results yourself: express the log-likelihood of your data set in terms of the parameters, set its first derivative to zero to maximize it, and use the inverse of the negative curvature (Hessian) matrix at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log-likelihood.
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to figure out what value the maximum likelihood method requires the parameter to take.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood condition for the exponential distribution is sum_i (1/la - x_i) = 0, which gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
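For instance, a minimal Python sketch of the exponential case (the sample values are taken from the Mathematica session above, so the result can be checked against EstimatedDistribution):

import numpy as np

def fit_exponential_mle(data):
    """Maximum likelihood estimate of the exponential rate `la`.

    Setting d/dla of the log-likelihood sum(log(la) - la * x_i) to
    zero gives sum(1/la - x_i) = 0, i.e. la = 1 / mean(data)."""
    return 1.0 / np.asarray(data, dtype=float).mean()

data = [1.55598, 0.375999, 0.0878202, 1.58705, 0.874423,
        2.17905, 0.247473, 0.599993, 0.404341, 0.31505]
print(fit_exponential_mle(data))  # ~1.21548, matching Out[44] above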