Python - calculate normal distribution

I'm quite new to the Python world, and I'm not a statistician. I need to implement mathematical models developed by mathematicians in a programming language, and after some research I've chosen Python. I'm comfortable with programming as such (PHP/HTML/JavaScript).
I have a column of values that I've extracted from a MySQL database and need to calculate the following:
The normal distribution of the data (I don't have the sigma and mu values; these apparently need to be calculated too).
A mixture of normal distributions.
An estimate of the density of the normal distribution.
The 'Z' score.
The array of values looks similar to the one below (I've populated sample data):
data = [3,3,3,3,3,3,3,9,12,6,3,3,3,3,9,21,3,12,3,6,3,30,12,6,3,3,24,30,3,3,3,12,3,3,3,3,3,3,3,6,9,3,3,3,3,3,3,3,3,3,3,3,3,33,3,3,3,6,3,3,6,6,15,3,3,3,3,6,3,3,3,3,3,3,3,3,12,12,3,3,3,3,3,3,78,9,12,3,6,3,15,6,3,3,3,30,3,6,78,3,9,9,3,78,3,3,3,3,3,12,15,3,3,78,3,3,33,78,15,9,3,3,21,6,3,6,30,6,6,3,3,3,3,12,3,3,3,3,3,12,3,3,3,3,3,3,3,3,3,3,3,3,12,6,3,3,9,3,3,12,3,3,3,3,6,3,3,6,3,3,18,6,3,3,3,3,3,6,3,3,3,3,3,3,3,3,9,21,3,9,3,3,12,12,3,3,15,30,3,12,3,3,6,3,3,3,9,9,6,6,3,3,27,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,12,6,3,3,3,3,30,3,3,3,3,6,18,24,6,3,3,42,3,3,6,3,15,3,3,3,3,9,3,60,81,54,3,9,3,3,6,3,6,3,3,3,3,6,3,3,3,33,24,3,3,3,3,3,3,3,3,3,3,3,3,3,93,3,3,21,3,3,3,3,6,6,30,3,3,3,3,6,3,9,3,3,6,3,6,3,3,3,39,9,30,6,45,3,3,3,3,3,24,12,3,6,3,78,3,3,3,3,3,3,3,3,3,3,3,9,6,3,3,3,6,15,3,78,3,3,30,3,3,3,33,24,3,3,6,3,3,3,6,3,3,3,12,15,3,3,3,21,3,3,3,3,9,6,3,6,3,3,3,3,6,6,3,15,6,9,3,3,18,3,3,3,3,3,3,3,3,21,3,3,6,3,3,3,3,3,3,12,3,3,3,3,3,3,6,21,12,3,6,9,3,3,3,3,9,15,3,6,78,6,6,3,9,3,9,3,6,3,3,3,24,3,3,6,3,3,27,3,6,3,3,3,3,3,3,3,3,3,3,3,3,21,3,9,6,6,9,27,30,3,3,9,12,6,3,3,12,9,3,21,3,6,9,9,3,3,3,3,9,6,3,3,6,3,3,3,3,3,6,3,6,3,3,3,24,6,3,3,3,3,3,3,3,3,3,3,18,3,3,3,3,3,9,6,3,3,3,18,3,9,3,3,15,9,12,3,18,3,6,3,3,3,6,3,3,3,3,3,3,3,21,9,15,3,3,3,21,3,3,3,3,3,6,9,3,3,21,6,3,3,15,3,18,3,3,21,3,21,3,9,3,6,21,3,9,15,3,69,21,3,3,3,9,3,3,3,12,3,3,9,3,3,27,3,3,9,3,9,3,3,3,3,3,30,3,12,21,18,27,3,3,12,3,6,3,30,3,21,9,15,6,3,3,3,15,9,12,12,33,3,3,30,3,6,6,21,3,3,12,3,3,6,51,3,3,3,3,12,3,6,3,9,78,21,3,3,21,18,6,12,3,3,3,21,9,6,3,3,3,3,3,3,6,3,6,27,3,3,3,3,3,3,12,3,3,3,3,6,3,18,3,3,15,3,3,18,9,6,3,3,24,3,6,12,30,3,12,24,3,3,3,9,3,12,27,3,3,6,3,9,3,9,3,15,3,6,3,3,9,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,6,3,3,3,9,15,3,3,3,3,9,3,6,3,3,3,3,27,3,6,3,3,3,3,3,3,3,3,3,3,9,3,3,3,12,3,3,3,27,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,9,3,3,3,3,3,3,15,3,3,3,3,3,3,12,3,6,6,3,3,3,3,6,3,3,6,3,3,3,3,3,6,3,3,3,3,6,12,6,3,3,3,3,6,3,3,3,3,3,3,3,3,3,6,3,6,3,3,6,3,3,6,3,3,3,6,6,6,3,3,27,3,3,3,3,3,3,3,27,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,6,3,3,3,6,3,54,75,3,57,3,6,27,18,3,3,3,3,27,3,3,3,3,3,9,3,27,3,3,6,6,30,3,3,6,3,3,3,6,15,3,6,3,3,6,3,3,3,3,6,3,3,27,9,3,18,3,3,6,6,3,9,3,3,3,6,3,3,3,3,3,3,3,3,6,3,3,3,6,3,3,6,3,3,3,3,6,6,3,3,3,6,6,3,3,3,3,3,3,3,6,3,3,6,3,3,3,3,3,6,3,18,3,3,6,3,6,3,3,3,3,3,3,3,3,6,15,3,6,15,6,3,3,3,3,3,3,3,3,3,3,3,3,6,3,6,3,3,6,12,3,3,6,3,3,6,3,3,3,3,3,27,3,3,3,3,9,3,27,3,3,27,3,3,3,3,3,3,9,6,3,9,3,6,3,3,6,3,6,3,3,3,6,3,3,6,3,18,3,3,3,9,6,3,3,3,3,3,6,3,6,6,3,18,27,3,3,3,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,21,3,3,3,3,6,9,3,3,3,3,3,3,6,3,6,3,3,3,3,3,6,3,6,3,3,3,3,3,18,3,3,18,3,3,3,3,6,3,3,3,18,6,3,3,3,3,3,3,3,6,3,3,3,6,3,3,3,3,3,3,6,3,3,3,3,3,3,6,3,3,6,3,6,3,3,3,6,3,3,6,3,3,3,3,6,3,3,3,6,3,3,3,3,3,3,3,6,6,3,3,3,3,3,6,3,6,3,54,3,6,3,6,6,6,3,3,3,3,3,3,6,3,3,6,3,3,6,3,3,9,12,3,6,3,3,3,3,3,6,6,3,3,3,3,6,3,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,6,3,3,3,3,3,12,3,3,6,9,27,21,3,3,3,3,3,21,6,3,3,3,3,3,3,3,3,3,3,3,6,3,3,12,3,3,3,3,3,3,3,3,3,3,3,6,3,3,6,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,9,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,6,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,6,3,3,3,3,6,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,6,3,3,3,3,3,3,6,6,3,3,3,3,3,3,6,3,3,6,3,3,3,6,3,3,3,3,6,6,3,6,3,6,6,3,9,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,6,3,3,3,9,9,3,3,3,3,3,6,3,3,3,3,6,3,3,3,3,6,3,3,3,3,3,6,3,6,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,6,3,3,6,3,3,3,3,3,3,3,6,3,3,3,135,3,9,3,3,6,9,3,3,3,6,3,3,3,3,6,3,3,6,6,3,3,3,3,3,3,3,3,3,3,3,3,6,6,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3
,6,3,3,3,3,3,3,3,3,6,3,3,3,135,3,3,3,6,3,3,3,3,6,6,3,3,69,87,57,9,3,3,3,12,3,6,3,3,3,6,3,3,3,3,3,3,3,3,3,3,6,9,12,3,3,3,3,3,3,3,3,6,3,3,9,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,9,3,3,3,3,12,3,3,33,3,6,3,3,3,3,3,3,6,3,6,3,3,6,3,3,3,6,3,6,3,3,6,3,3,3,6,3,3,6,3,3,3,6,3,3,3,3,9,3,3,6,6,3,3,3,6,6,3,3,3,3,3,3,6,3,3,3,3,6,3,3,3,6,3,18,3,6,3,3,3,3,9,3,3,3,3,3,3,6,3,3,6,3,3,3,3,3,135,3,9,3,3,3,3,3,3,3,3,6,6,3,6,6,3,3,6,3,3,3,6,6,3,3,3,3,6,9,3,3,3,3,3,3,6,6,3,3,3,3,3,3,135,3,3,3,6,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,6,6,6,3,3,3,6,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,9,6,3,3,3,9,3,3,3,3,9,3,3,3,3,3,3,3,3,3,9,3,6,6,3,6,3,3,6,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,9,3,24,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,6,3,3,3,3,3,3,6,3,135,3,3,3,3,3,3,6,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,9,6,3,3,3,9,3,3,3,3,3,3,6,3,3,6,3,9,3,3,3,6,3,3,3,6,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,9,3,3,3,3,3,9,6,3,9,3,6,3,3,21,9,3,3,3,6,3,3,3,3,6,3,3,3,3,9,3,3,3,3,3,3,3,135,3,6,6,6,3,6,3,3,9,6,6,3,3,3,3,3,3,9,3,6,3,3,3,3,3,3,3,6,9,6,3,3,6,3,6,6,3,3,3,3,6,3,6,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,6,3,6,3,12,3,24,3,3,3,3,3,3,21,3,3,3,3,3,3,3,6,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,15,3,3,3,3,3,3,3,6,3,3,6,6,3,3,9,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,9,3,3,3,6,3,3,3,6,3,6,3,3,3,3,3,3,3,3,3,12,3,3,3,3,3,3,6,3,6,6,3,3,3,6,3,3,6,3,3,3,3,9,6,3,3,3,6,9,3,3,3,6,9,3,6,3,3,3,3,3,3,6,3,3,3,3,6,6,3,3,3,3,3,3,3,3,3,3,9,15,3,3,3,6,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,12,3,3,3,6,6,6,3,3,3,6,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,12,12,6,3,3,3,3,3,3,3,3,3,9,6,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,6,3,3,3,3,6,3,3,3,6,3,3,3,3,3,3,3,6,3,3,3,6,3,3,6,3,3,12,3,3,3,6,3,3,3,3,564,84,3,60,6,15,3,3,3,3,3,6,3,3,3,3,3,3,3,9,3,3,3,3,3,3,3,3,3,3,3,6,9,3,3,3,3,3,9,3,3,3,3,3,12,6,3,3,3,3,3,3,3,3,6,3,3,3,3,9,57,3,6,3,6,3,3,6,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,9,3,3,3,3,6,3,3,3,6,12,3,6,3,3,3,3,3,3,3,3,6,3,6,3,3,3,6,3,3,6,3,3,36,3,3,6,6,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,12,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,6,3,3,6,3,6,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,3,3,3,3,12,6,3,3,3,3,3,3,3,12,3,3,3,6,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,9,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,12,3,3,3,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,9,3,3,3,3,3,3,3,9,3,3,3,3,3,3,3,3,3,6,3,3,3,3,3,3,3,3,3,3,6,3,3,3,27,3,3,6,3,3,3,3,3,6,3,3,3,3,6,3,3,9,3,3,3,12,3,3,3,3,3,6,9,3,6,3,3]
I've looked around and found quite a bit about the cumulative distribution (those examples already have the mu and sigma values, which isn't the case in my scenario). I'm not sure whether the cumulative normal distribution and the normal distribution are the same thing. Could I please get some pointers on how to get started with this?
I'd very much appreciate any help here.

A distribution and the cumulative distribution are not the same - the latter is the integral of the former. If the normal distribution looks like a "bell", the cumulative normal distribution looks like a gentle "step" function.
E.g., for a few example "bells" you'd get the corresponding "steps" (plots omitted).
If you have an array data, the following will fit it to a normal distribution using scipy.stats.norm:
import numpy as np
from scipy.stats import norm

# maximum-likelihood fit of a normal distribution to the data
mu, std = norm.fit(data)
This will return the mean and standard deviation, the combination of which define a normal distribution.
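For example, a minimal sketch (variable names beyond data are mine, not from the original answer) that uses the fitted parameters to evaluate the density and compute Z scores, two of the items in the question:

import numpy as np
from scipy.stats import norm

mu, std = norm.fit(data)                      # estimated mean and standard deviation

x = np.linspace(min(data), max(data), 200)    # grid over the observed range
density = norm.pdf(x, loc=mu, scale=std)      # fitted normal density on the grid

z_scores = (np.asarray(data) - mu) / std      # Z score of every observation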

Normal and cumulative distributions are not the same. I'll leave that bit of research to you.
The formula for the normal density is easy if you have the mean and standard deviation:

f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
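A direct translation of that formula into Python (a sketch; scipy.stats.norm.pdf computes the same thing):

import math

def normal_pdf(x, mu, sigma):
    # density of a normal distribution with mean mu and standard deviation sigma
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))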

The thing you may want to look at is the normal distribution, not the cumulative normal distribution. You can calculate the frequency of each element that occurs in the array and plot it to visualize the distribution.
Then you can use numpy to calculate the mean as mean = numpy.mean(array) and the standard deviation as std = numpy.std(array).
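A short sketch of that suggestion (the use of matplotlib is my assumption, not part of the original answer):

import numpy as np
import matplotlib.pyplot as plt

array = np.asarray(data)                               # the values from the question
values, counts = np.unique(array, return_counts=True)  # frequency of each element

mean = np.mean(array)
std = np.std(array)

plt.bar(values, counts)                                # visualize the distribution
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()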
Hope this helps.

Related

How can I create a distribution in Python using Min, Max, Mean and Standard Deviation?

As the title states, I am trying to create a distribution in Python to be used in Monte Carlo simulations. I have the Minimum and Maximum values, as well as the most likely value and the Standard Deviation. Is there a way to model the distribution using all of these values?
The closest method I have found so far is splitting it so that I create the distribution like so:
import numpy as np

GCH = np.arange(GCHmin, GCHmax, 0.0001)            # grid from min to max
GCH = np.random.normal(GCHavg, GCHstdv, num_reps)  # overwrites the grid with normal draws
This method does work, but I feel like there must be a better way to do it: simulating this way matches the results I am trying to recreate at the 50th percentile, but at the 10th and 90th percentiles there is a bit of difference from the desired result.
Any help is greatly appreciated.
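One way to use all four values at once (a hedged sketch, not from the original thread; the stand-in numbers are hypothetical) is a truncated normal, which respects the min/max bounds directly:

import numpy as np
from scipy.stats import truncnorm

low, high, avg, stdv = 0.0, 10.0, 4.0, 2.0   # stand-ins for GCHmin, GCHmax, GCHavg, GCHstdv
num_reps = 100000

# truncnorm expects the bounds in standard-deviation units around the mean
a, b = (low - avg) / stdv, (high - avg) / stdv
GCH = truncnorm.rvs(a, b, loc=avg, scale=stdv, size=num_reps)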

Find underlying normal distribution of random vectors

I am trying to solve a statistics-related real-world problem with Python and am looking for input on my ideas: I have N random vectors from an m-dimensional normal distribution. I have no information about the means and the covariance matrix of the underlying distribution; in fact, that it is a normal distribution at all is only an assumption, albeit a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, and the dimensionality of the normal distribution is somewhere between 2 and 5. The calculation should ideally not take more than 1 minute on a standard home computer.
I am currently thinking about three approaches and would be happy about suggestions for other approaches or preferences among these three:
Fitting: Make a multidimensional histogram of all random vectors and fit a multidimensional normal distribution to the histogram. The problem with this approach: the covariance matrix has many entries, which could be a problem for the fitting process?
Invert the cumulative distribution function: Make a multidimensional histogram as an approximation of the density function of the random vectors, then integrate it to get a multidimensional cumulative distribution function. In one dimension this is invertible, and one could use the CDF to distribute random numbers like in the original distribution. The problem: in the multidimensional case the CDF is not invertible(?), and I don't know if this approach still works.
Bayesian: Use Bayesian statistics with some normal distribution as the prior and update for every observation. The result should always again be a normal distribution. The problem: I think this is computationally expensive? Also, I don't want the later updates to have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library which has this task already implemented? I did not find exactly this in Numpy or Scipy, maybe someone has an idea where else to look?
If the simple estimates described in the section Parameter estimation of the wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.
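Concretely, assuming the N vectors are stacked as the rows of an (N, m) array (the synthetic data below is only for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]],
                            size=200)                # 200 vectors, m = 2

mean_vec = np.mean(X, axis=0)                        # sample mean vector, shape (m,)
cov_mat = np.cov(X, rowvar=False)                    # sample covariance matrix, shape (m, m)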

Estimating the mode from a list of values sampled from a continuous distribution

I have values sampled from a continuous distribution, for example:
import numpy as np
values = np.random.normal(loc=0.4, scale=0.1, size=1000)
How can I estimate the mode based on those values?
The mean and median are easy to compute: np.mean(values) and np.median(values); but for the mode I don't know how to estimate it, since the values are continuous.
Note that using something like scipy.stats.mode would not work because I have a finite set of values sampled from the continuous distribution.
If you have a known, underlying parametric model, life is easy. Fit your data (using MLE or whatever) and take the mode of the fitted distribution. If you don't have a good parametric model, life is harder. There are a number of things I've seen in the literature, but I don't know if any sort of consensus has been reached on this. When I had to do this (~20 years ago) I used an algorithm I found in Numerical Recipes in C. I have no idea if that was the best choice or not.
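A sketch of both routes on the sample data from the question (the KDE grid search is one common non-parametric choice, not the Numerical Recipes algorithm mentioned above):

import numpy as np
from scipy.stats import gaussian_kde, norm

values = np.random.normal(loc=0.4, scale=0.1, size=1000)

# parametric route: for a fitted normal, the mode equals the fitted mean
mu, std = norm.fit(values)

# non-parametric route: take the peak of a kernel density estimate
kde = gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 1000)
mode_estimate = grid[np.argmax(kde(grid))]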

Studentized range statistic (q*) in Python Scipy

I am wondering if it is possible to find the Studentized range statistic (q*) in Python Scipy lib as an input into Tukey's HSD calculation, similar to interpolating a table such as this (http://cse.niaes.affrc.go.jp/miwa/probcalc/s-range/srng_tbl.html#fivepercent) or pulling from a continuous distribution.
I have found some guidance here (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tukeylambda.html#scipy.stats.tukeylambda), but I am lost on how to input the df (degrees of freedom) or k (number of sample groups).
I am looking for something like the critical F or critical t statistic, which can be obtained via
scipy.stats.f.isf(alpha, df-between, df-within)
or
scipy.stats.t.isf(alpha, df).
from statsmodels.stats.libqsturng import psturng, qsturng
provides the CDF (or tail probabilities) and the quantile function (the inverse of the CDF or of the survival function, I don't remember which).
It was written by Roger Lew as a package for interpolating the distribution of the studentized range statistic and was incorporated into statsmodels for use in tukeyhsd.
Until now it has only been used internally in statsmodels, so you would have to check the limitations and explanation in libqsturng.
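A short sketch of how the pair can be used (the parameter values are hypothetical; qsturng takes a quantile, the number of groups, and the degrees of freedom):

from statsmodels.stats.libqsturng import psturng, qsturng

alpha = 0.05
k = 3      # number of sample groups
df = 20    # degrees of freedom of the within-group error

q_crit = qsturng(1 - alpha, k, df)   # critical value q* with upper-tail probability alpha
p_tail = psturng(q_crit, k, df)      # tail probability; should be close to alpha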
As reference, statsmodels has a tukeyhsd function and a MultipleComparison class.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.multicomp.pairwise_tukeyhsd.html

python scipy.stats pdf and expect functions

I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused about what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that I can get the probability that any area is between inf and sup.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution lies between them.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
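A sketch of all three approaches side by side (the lognormal sample is a hypothetical stand-in for the real area values):

import numpy as np
from scipy.stats import gaussian_kde, lognorm

rng = np.random.default_rng(0)
data = rng.lognormal(mean=1.0, sigma=0.5, size=16383)   # stand-in for the areas
inf, sup = 2.0, 5.0

# parametric: fit a lognormal and subtract CDF values
params = lognorm.fit(data, floc=0)
p_fit = lognorm.cdf(sup, *params) - lognorm.cdf(inf, *params)

# non-parametric: Gaussian kernel density estimate
p_kde = gaussian_kde(data).integrate_box_1d(inf, sup)

# empirical: fraction of data points inside the interval
p_emp = np.mean((data > inf) & (data < sup))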
The cumulative distribution function (CDF) might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
They are all related. The pdf is the "density" of the probabilities: it must be non-negative and integrate to 1 (for a discrete variable, the probabilities sum to 1). I think of it as indicating how likely something is. The expectation is a generalisation of the idea of an average:
E[X] = sum(x * P(x)) for a discrete variable, or the integral of x * f(x) dx for a continuous one.
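Tying this back to the original question: since P(inf < area < sup) is the expectation of a function that is 1 on that interval, expect can compute it directly (a sketch using a frozen normal as the fitted model):

from scipy.stats import norm

rv = norm(loc=0.0, scale=1.0)                  # frozen distribution; fit to your data in practice

inf, sup = -1.0, 1.0
p = rv.expect(lambda x: 1.0, lb=inf, ub=sup)   # equals rv.cdf(sup) - rv.cdf(inf)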
