I'm looking for a two-dimensional analog to the numpy.random.normal routine, i.e. numpy.random.normal generates a one-dimensional array with a mean, standard deviation and sample number as input, and what I'm looking for is a way to generate points in two-dimensional space with those same input parameters.
Looks like numpy.random.multivariate_normal can do this, but I don't quite understand what the cov parameter is supposed to be. The following excerpt describes this parameter in more detail and is from the scipy docs:
Covariance matrix of the distribution. Must be symmetric and
positive-semidefinite for “physically meaningful” results.
Later in the page, in the examples section, a sample cov value is given:
cov = [[1,0],[0,100]] # diagonal covariance, points lie on x or y-axis
The concept is still quite opaque to me, however.
If someone could clarify what the cov should be or suggest another way to generate points in two-dimensional space given a mean and standard deviation using python I would appreciate it.
If you pass size=[1, 2] to the normal() function, you get a 2D array holding one point with two independent coordinates (use size=[n, 2] for n points), which is actually what you're looking for:
>>> numpy.random.normal(size=[1, 2])
array([[-1.4734477 , -1.50257962]])
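To tie this back to the cov parameter: when x and y are uncorrelated and share one standard deviation sigma, the covariance matrix is just sigma**2 times the identity. Here is a minimal sketch under that assumption; the names mean, sigma and n and their values are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

mean = [2.0, -1.0]   # illustrative centre of the point cloud
sigma = 0.5          # illustrative standard deviation, shared by x and y
n = 1000             # number of points to draw

# Diagonal entries are variances (sigma**2); the zero off-diagonal
# entries mean x and y are uncorrelated.
cov = sigma**2 * np.eye(2)
points = rng.multivariate_normal(mean, cov, size=n)   # shape (n, 2)

# Equivalent for this uncorrelated case: draw each coordinate independently.
points_alt = rng.normal(loc=mean, scale=sigma, size=(n, 2))
```

A non-zero off-diagonal entry would tilt the cloud, which is the one thing plain normal() cannot do.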
I was just wondering how to go from mvnrnd([4 3], [.4 1.2], 300); in MATLAB code to np.random.multivariate_normal([4,3], [[x_1 x_2],[x_3 x_4]], 300) in Python.
My doubt lies mainly in the sigma parameter: in MATLAB, a 2-element vector is used to specify the covariance, whereas in Python a matrix must be used.
What is the theoretical meaning of that, and what is the practical approach to go from one to the other, for instance in this case? Also, is there a quick, mechanical way?
Thanks for reading.
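Python expects the full covariance matrix, which is square and symmetric.
In the 2x2 case, a symmetric matrix has mirrored off-diagonal elements.
I believe a vector sigma in MATLAB is shorthand for the diagonal of that matrix (the variances, with zero covariances), so in Python it should look like [[.4, 0], [0, 1.2]]. Note that [[.4, 1.2], [1.2, .4]] would not work: it has a negative eigenvalue, so it is not positive-semidefinite.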
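As far as I know, MATLAB's mvnrnd accepts a 1-by-d vector for sigma as shorthand for the diagonal of the covariance matrix. Under that assumption, the mechanical translation is np.diag; the sketch below also checks the eigenvalues, which is a quick way to see whether a candidate matrix is a valid covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = [4, 3]
variances = [0.4, 1.2]        # the vector passed to mvnrnd in the question

cov = np.diag(variances)      # [[0.4, 0.0], [0.0, 1.2]]
samples = rng.multivariate_normal(mu, cov, size=300)   # shape (300, 2)

# A valid covariance matrix has no negative eigenvalues.
eigenvalues = np.linalg.eigvalsh(cov)
```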
I am trying to solve a statistics-related real-world problem with Python and am looking for input on my ideas: I have N random vectors from an m-dimensional normal distribution. I have no information about the mean or the covariance matrix of the underlying distribution; in fact, even normality is only an assumption, though a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, and the dimensionality of the normal distribution is somewhere between 2 and 5. The calculation should ideally take no more than 1 minute on a standard home computer.
I am currently thinking about three approaches and am happy about all suggestions for other approaches or preferences between those three:
Fitting: Make a multi dimensional histogram of all random vectors and fit a multi dimensional normal distribution to the histogram. Problem about that approach: The covariance matrix has many entries, this could possibly be a problem for the fitting process?
Invert cumulative distribution function: Make a multi dimensional histogram as approximation of the density function of the random vectors. Then integrate this to get a multi dimensional cumulative distribution function. For one dimension, this is invertible and one could use the cum-dist function to distribute random numbers like in the original distribution. Problem: For the multi-dimensional case the cum-dist function is not invertible(?) and I don't know if this approach still works then?
Bayesian: Use Bayesian statistics with some normal distribution as prior and update for every observation. The result should always be a normal distribution again. Problem: I think this is computationally expensive? Also, I don't want the later updates to have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library which has this task already implemented? I did not find exactly this in Numpy or Scipy, maybe someone has an idea where else to look?
If the simple estimates described in the "Parameter estimation" section of the Wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.
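A minimal sketch of those two calls, using synthetic data in place of the real vectors (true_mean and true_cov are made up for the demonstration). One thing to watch: numpy.cov treats rows as variables by default, so pass rowvar=False when each row is one observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the observed vectors: N = 300 rows, m = 3 columns.
true_mean = np.array([1.0, -2.0, 0.5])
true_cov = np.array([[2.0, 0.3, 0.0],
                     [0.3, 1.0, 0.2],
                     [0.0, 0.2, 0.5]])
samples = rng.multivariate_normal(true_mean, true_cov, size=300)

mean_est = np.mean(samples, axis=0)        # shape (m,)
cov_est = np.cov(samples, rowvar=False)    # shape (m, m), divides by N - 1
```

For a few hundred vectors in 2 to 5 dimensions this runs in well under a second.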
I currently have a collection of n-dimensional data points, each with a value associated with it (n typically will range from 2 to 4).
I would like to employ some form of non-linear interpolation on the data points I am supplied with so that I can try and minimise this value. Of course, I am open to better methods of minimising the value.
At the moment, I have code that works for 1D and 2D arrays
mesh = np.meshgrid(*[i['grid2'] for i in self.cambParams], indexing='ij')
chi2 = griddata(data[:,:-1], data[:,-1], tuple(mesh), method='cubic')
However, scipy.interpolate.griddata only supports linear interpolation in more than two dimensions, which makes interpolation useless here: the minimum of a piecewise-linear interpolant always falls on one of the existing data points. Does anyone know of an alternative interpolation method that might work, or a better way of solving the problem in general?
Cheers
Received a tip from an external source that worked, so posting the answer in case it helps anyone in the future.
SciPy has an Rbf interpolation method (radial basis function) which allows better than linear interpolation at arbitrary dimensions.
Taking a variable data with rows of (x1,x2,x3...,xn,v) values, the following code modification to the original post allows for interpolation:
rbfi = Rbf(*data.T)
mesh = np.meshgrid(*[i['grid2'] for i in self.cambParams], indexing='ij')
chi2 = rbfi(*mesh)
The documentation here is useful, and there is a simple and easy to follow example here, which will make more sense than the code snippet above.
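For anyone who wants something runnable without the surrounding class, here is a self-contained sketch of the same idea. The data is synthetic (a smooth bowl with its minimum at 0.2 in every coordinate) and the grid is made up; only the Rbf and meshgrid calls mirror the snippet above.

```python
import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(0)

# Synthetic stand-in for `data`: rows of (x1, x2, x3, v).
coords = rng.uniform(-1.0, 1.0, size=(200, 3))
values = np.sum((coords - 0.2) ** 2, axis=1)   # smooth bowl, minimum at 0.2
data = np.column_stack([coords, values])

rbfi = Rbf(*data.T)   # the last argument is taken as the values

# Evaluate on a regular grid and pick the grid point with the smallest value.
axes = [np.linspace(-1.0, 1.0, 21)] * 3
mesh = np.meshgrid(*axes, indexing='ij')
chi2 = rbfi(*mesh)                              # shape (21, 21, 21)
best = np.unravel_index(np.argmin(chi2), chi2.shape)
best_point = [axis[i] for axis, i in zip(axes, best)]
```

Newer SciPy releases also provide scipy.interpolate.RBFInterpolator, which may be worth a look for larger data sets.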
I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, called "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that I can get the probability that any area is between inf and sup.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution falls between them.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
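The three options can be compared side by side. This is only a sketch: the data is synthetic lognormal noise standing in for the measured areas, and the bounds are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the 16383 measured areas.
data = rng.lognormal(mean=1.0, sigma=0.5, size=16383)
inf, sup = 2.0, 4.0   # illustrative bounds

# 1. Parametric: fit a lognormal, then difference the CDF.
params = stats.lognorm.fit(data, floc=0)
p_fit = stats.lognorm.cdf(sup, *params) - stats.lognorm.cdf(inf, *params)

# 2. Non-parametric: integrate a Gaussian kernel density estimate.
p_kde = stats.gaussian_kde(data).integrate_box_1d(inf, sup)

# 3. Empirical: fraction of points that fall inside the interval.
p_count = np.mean((data > inf) & (data < sup))
```

With this many points the three numbers should agree closely; large disagreement usually means the parametric family is a poor fit or the bounds sit far out in a tail.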
The cumulative distribution function (CDF) might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
There's a tutorial about probabilities here and here
They are all related. The pdf is the "density" of the probabilities. It must be non-negative everywhere and integrate to 1. I think of it as indicating how likely something is. The expectation is a generalisation of the idea of average.
E[x] = sum(x.P(x))
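To make pdf and expect concrete, here is a small sketch on a frozen normal distribution (the loc, scale and bounds are arbitrary choices for illustration). Both methods are called on a concrete distribution such as scipy.stats.norm, not on rv_continuous itself.

```python
from scipy import stats

dist = stats.norm(loc=10.0, scale=2.0)   # illustrative frozen distribution
inf, sup = 8.0, 13.0                     # illustrative bounds

density = dist.pdf(9.0)        # height of the density at 9.0, not a probability
mean = dist.expect()           # E[x], the integral of x * pdf(x)

# Integrating the constant 1 against the pdf over [inf, sup] gives
# P(inf < x < sup), which matches the CDF difference.
p_expect = dist.expect(lambda x: 1.0, lb=inf, ub=sup)
p_cdf = dist.cdf(sup) - dist.cdf(inf)
```

In practice the cdf difference is the simpler route; expect is more useful when you need the average of some function of the variable.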
I'm translating some code from MATLAB to Python and I'm stuck with the corrmtx() MATLAB function. Is there any similar function in Python, or how could I replace it?
The spectrum package has such a function.
How about:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.toeplitz.html
The matlab docs for corrmtx state:
X = corrmtx(x,m) returns an (n+m)-by-(m+1) rectangular Toeplitz matrix
X, such that X'X is a (biased) estimate of the autocorrelation matrix
for the length n data vector x.
The scipy function gives the Toeplitz matrix, although I'm not sure if the implementations are identical.
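Going only by the quoted description, a rough equivalent can be built from scipy.linalg.toeplitz. This is a sketch, not a verified port: corrmtx_autocorrelation is a hypothetical helper name, it handles real-valued x only, and the 1/sqrt(n) scaling is my assumption, chosen so that X'X comes out as the biased autocorrelation estimate. I have not compared it against MATLAB output.

```python
import numpy as np
from scipy.linalg import toeplitz

def corrmtx_autocorrelation(x, m):
    """Hypothetical helper: (n+m)-by-(m+1) Toeplitz matrix built from x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    first_col = np.concatenate([x, np.zeros(m)])       # x padded with m zeros
    first_row = np.concatenate([x[:1], np.zeros(m)])   # x[0] followed by zeros
    # Assumed scaling: dividing by sqrt(n) makes X.T @ X equal the sums of
    # lagged products divided by n, i.e. the biased autocorrelation estimate.
    return toeplitz(first_col, first_row) / np.sqrt(n)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = corrmtx_autocorrelation(x, m=2)     # shape (7, 3)
R = X.T @ X                             # (m+1)-by-(m+1) autocorrelation estimate
```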
Here is a list of alternatives that can help you in translating your code:
scipy (scipy.linalg.toeplitz, as a building block)
spectrum (corrmtx)
The following is a link to another post that tells you how to use numpy for the autocorrelation, since that seems to be the default functionality of corrmtx.
Additional Information:
Finding the correlation matrix in Python
Unbiased Estimation of Covariance Matrix