Generate a random symmetric tensor in python - python

I want to generate a random (Gaussian) tensor that is symmetric with respect to all permutations of the axes. In the end I want all the entries to have the same distribution, so tricks like summing over all the permutations and rescaling by sqrt(k!), where k is the order of my tensor, don't work. E.g.:
import numpy as np
from itertools import permutations

n = 10  # example size; n is assumed to be defined elsewhere in the real code
noise_buffer = np.random.normal(size=n * n * n).reshape(n, n, n) / np.sqrt(6)
noise = np.zeros([n, n, n])
for i in permutations([0, 1, 2]):
    noise += np.transpose(noise_buffer, axes=list(i))
I could loop over all the coordinates (-1) and rescale appropriately, but this is time consuming.
Do you know of any library where this is implemented, or of any fast implementation?

Actually, the method you present works; it just needs a small modification.
Use the fact that a sum of independent normal random variables is again normal, with the variances summed; e.g. see here.
The problem, it seems, is that you want the entries of noise to come from a certain distribution, but if you take noise_buffer from that distribution, then noise will have a different distribution.
The solution is to use a different distribution for noise_buffer (specifically, normal with standard deviation reduced by a factor of 1/sqrt(k!)); then noise will have the right distribution.
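For illustration, a minimal sketch of that modification (the size n and the final variance check are made up for the example; entries whose indices are all distinct come out approximately standard normal):
import numpy as np
from itertools import permutations

n = 30         # illustrative size; the question leaves n unspecified
k_fact = 6     # k! for a third-order tensor (k = 3)

# Draw the buffer with standard deviation 1/sqrt(k!), so that the sum of the
# k! permuted copies has unit variance (variances of independent normals add).
noise_buffer = np.random.normal(scale=1/np.sqrt(k_fact), size=(n, n, n))
noise = np.zeros((n, n, n))
for perm in permutations([0, 1, 2]):
    noise += np.transpose(noise_buffer, axes=perm)

# Quick check: entries with all indices distinct should be roughly N(0, 1).
i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
print(noise[(i < j) & (j < k)].std())   # close to 1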


Why divide the output of NumPy FFT by N?

In many tutorials/blogs I've seen the output of np.fft.fft(signal) divided by the number of sample points N.
I understand that in some implementations the transform is scaled/normalized by some factor, such as multiplying by N. However, I just read the docs, and by default the output of fft.fft() is unscaled. Yet I still see the output divided by N everywhere.
Why is this?
I have noticed that by scaling the output by 1/N I get back the correct amplitudes of the contributing wave signals. So obviously it is necessary, but I'd like to understand what the pure output is as compared to the scaled output.
For the DFT to be reversible (x == IDFT(DFT(x))), you need to divide by N somewhere. In signal processing this normalization is typically done in the inverse transform. For example Wikipedia shows it this way.
In other fields it is more often done in the forward transform. In physics I have seen half the normalization (1/sqrt(N)) applied to each transform, making them symmetric.
When the forward transform normalizes, then the values it returns are independent of the signal length (for example the zero frequency is the mean of all signal values). This is therefore the more useful variant when studying signal power.
When the normalization is instead applied in the inverse transform (as is commonly done in signal processing software, such as np.fft.fft() and MATLAB's fft), computing a convolution by multiplication in the frequency domain is easiest: one can directly write g = IDFT(DFT(f)*DFT(h)). If the normalization is applied elsewhere, it must be partly undone to obtain a correctly scaled result.
Other software, for example the FFTW library, does not normalize the transform at all, leaving that up to the user. This avoids unnecessary multiplications if the user wants a different normalization variant than what the library chooses.
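For illustration of the amplitude point from the question, here is a small sketch (the 50-cycle sine with amplitude 3 is made up for the example):
import numpy as np

N = 1000
t = np.arange(N) / N
signal = 3.0 * np.sin(2 * np.pi * 50 * t)   # amplitude 3, exactly 50 cycles

X = np.fft.fft(signal)                      # unscaled, as the docs say
print(np.abs(X[50]))                        # ~ N * 3 / 2 = 1500
print(2 * np.abs(X[50]) / N)                # ~ 3.0, the original amplitude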
Based on Parseval's theorem,
sum_{n=0..N-1} |x[n]|^2 = (1/N) * sum_{k=0..N-1} |X[k]|^2,
the energies in the time and frequency domains are the same. It means the magnitude |X[k]| of each frequency bin k is contributed to by N samples. In order to find the average contribution by each sample, the magnitude is normalized as |X[k]|/N, which leads to
(1/N) * sum_{n=0..N-1} |x[n]|^2 = sum_{k=0..N-1} |X[k]/N|^2,
where the LHS is the average power of the signal.
However, such a scale normally doesn't matter unless you care about the unit of the magnitude, like in the case of the Sound Pressure Level (SPL) spectrum.
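As a quick numerical check of that identity against NumPy's unscaled FFT (the random test signal is just for illustration):
import numpy as np

x = np.random.default_rng(0).standard_normal(256)
X = np.fft.fft(x)
# Energy in the time domain equals energy in the frequency domain divided by N.
print(np.sum(np.abs(x)**2), np.sum(np.abs(X)**2) / x.size)   # equal up to rounding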

Getting random numbers from a truncated Maxwell-Boltzmann distribution

I would like to generate random numbers using a truncated Maxwell-Boltzmann distribution. I know that scipy has built-in Maxwell random variables, but there is no truncated version of it (I am also aware of a truncated normal distribution, which is irrelevant here). I have tried to write my own random variables using rv_continuous:
import numpy as np
import scipy.stats as st
from scipy.special import erf

class maxwell_boltzmann_pdf(st.rv_continuous):
    def _pdf(self, x):
        # v_0 and v_esc are assumed to be defined globally
        n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
        return (1/n_0)*(4*np.pi*np.square(x))*np.exp(-np.square(x/v_0))*np.heaviside(v_esc-x, 0)

maxwell_boltzmann_cv = maxwell_boltzmann_pdf(a=0, b=v_esc, name='maxwell_boltzmann_pdf')
This does exactly what I want, but it is way too slow for my purpose (I am doing Monte Carlo simulations), even if I draw all the random velocities outside of all the loops. I have also thought of using Inverse transform sampling method, but the inverse of the CDF does not have an analytic form and I will need to do a bisection for every number I draw. It would be great if there is a convenient way for me to generate random numbers from a truncated Maxwell-Boltzmann distribution with decent speed.
There are several things you can do here.
For fixed parameters v_esc and v_0, n_0 is a constant, so it doesn't need to be calculated in the pdf method.
If you define only a PDF for a SciPy rv_continuous subclass, then the class's rvs, mean, and so on will be very slow, presumably because the method needs to integrate the PDF every time it needs to generate a random variate or calculate a statistic. If speed is at a premium, you will thus need to add to maxwell_boltzmann_pdf an _rvs method that uses its own sampler. (See also this question.) One possible method is rejection sampling: draw a point uniformly in a bounding box until it falls under the PDF curve. It works for any bounded PDF with a finite domain, as long as you know what the domain and the bound are (the bound is the maximum value of the PDF on the domain). See this question for example Python code.
If you know the distribution's CDF, then there are some additional tricks. One of them is the relatively new k-vector sampling method for sampling a continuous distribution. There are two phases: a setup phase and a sampling phase. The setup phase involves approximating the CDF's inverse via root finding, and the sampling phase uses this approximation to generate random numbers that follow the distribution in a very fast way without having to further evaluate the CDF. For a fixed distribution like this one, if you show me the CDF, I can precalculate the necessary data and the code needed to sample the distribution using that data. Essentially, the only non-trivial part of k-vector sampling is the root-finding step.
More information on sampling from an arbitrary distribution is found on my sampling methods page.
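For illustration, here is a rough sketch of the rejection-sampling suggestion above, reusing the question's f_MB and parameter names; the bound relies on the fact that this PDF peaks at v = v_0, and everything else is made up for the example:
import numpy as np
from scipy.special import erf

v_0, v_esc = 220.0, 550.0   # km/s, as in the question

def f_MB(v):
    n_0 = np.pi**1.5 * np.square(v_0) * (v_0*erf(v_esc/v_0)
          - (2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0) * 4*np.pi*np.square(v) * np.exp(-np.square(v/v_0)) * (v <= v_esc)

def rejection_sample(n, pdf, lo, hi, pdf_max):
    """Accept uniform points from the box [lo, hi] x [0, pdf_max] that fall under the PDF."""
    out = np.empty(0)
    while out.size < n:
        x = np.random.uniform(lo, hi, size=2*n)
        y = np.random.uniform(0, pdf_max, size=2*n)
        out = np.concatenate([out, x[y < pdf(x)]])
    return out[:n]

samples = rejection_sample(10000, f_MB, 0.0, v_esc, f_MB(v_0))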
It turns out that there is a way to generate a truncated Maxwell-Boltzmann distribution with the inverse transform sampling method using the ppf feature of scipy. I am posting the code here for future reference.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
from scipy.stats import maxwell
# parameters
v_0 = 220 #km/s
v_esc = 550 #km/s
N = 10000
# CDF(v_esc)
cdf_v_esc = maxwell.cdf(v_esc,scale=v_0/np.sqrt(2))
# pdf for the distribution
def f_MB(v_mag):
    n_0 = np.power(np.pi,3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0)*(4*np.pi*np.square(v_mag))*np.exp(-np.square(v_mag/v_0))*np.heaviside(v_esc-v_mag,0)
# plot the pdf
x = np.arange(600)
y = [f_MB(i) for i in x]
plt.plot(x,y,label='pdf')
# use inverse transform sampling to get the truncated Maxwell-Boltzmann distribution
sample = maxwell.ppf(np.random.rand(N)*cdf_v_esc,scale=v_0/np.sqrt(2))
# plot the histogram of the samples
plt.hist(sample,bins=100,histtype='step',density=True,label='histogram')
plt.xlabel('v_mag')
plt.legend()
plt.show()
This code generates the required random variables and compares their histogram with the analytic form of the pdf; the two match pretty well.
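As a side note, one way to sanity-check the scale=v_0/np.sqrt(2) choice (a small sketch reusing the variables from the snippet above): below v_esc, the hand-written truncated PDF should equal scipy's Maxwell PDF divided by the CDF at the cutoff.
v = np.linspace(1, v_esc - 1, 50)
# Truncated pdf = untruncated pdf / CDF(v_esc)
assert np.allclose(f_MB(v), maxwell.pdf(v, scale=v_0/np.sqrt(2)) / cdf_v_esc)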

Find underlying normal distribution of random vectors

I am trying to solve a statistics-related real-world problem with Python and am looking for input on my ideas: I have N random vectors from an m-dimensional normal distribution. I have no information about the mean vector or the covariance matrix of the underlying distribution; in fact, even that it is a normal distribution is only an assumption, though a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, and the dimensionality of the normal distribution is somewhere between 2 and 5. The calculation should ideally take no more than 1 minute on a standard home computer.
I am currently thinking about three approaches and am happy about all suggestions for other approaches or preferences between those three:
Fitting: Make a multi-dimensional histogram of all random vectors and fit a multi-dimensional normal distribution to the histogram. Problem with that approach: the covariance matrix has many entries, which could be a problem for the fitting process?
Invert cumulative distribution function: Make a multi-dimensional histogram as an approximation of the density function of the random vectors, then integrate it to get a multi-dimensional cumulative distribution function. In one dimension this is invertible, and one could use the CDF to distribute random numbers like in the original distribution. Problem: for the multi-dimensional case the CDF is not invertible(?), and I don't know if this approach still works then?
Bayesian: Use Bayesian statistics with some normal distribution as prior and update for every observation. The result should always again be a normal distribution. Problem: I think this is computationally expensive? Also, I don't want the later updates to have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library which has this task already implemented? I did not find exactly this in Numpy or Scipy, maybe someone has an idea where else to look?
If the simple estimates described in the section Parameter estimation of the wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.
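For illustration, a minimal sketch of those estimates (the sample data here is made up; your N x m array of vectors takes its place):
import numpy as np

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                                  cov=[[2.0, 0.3, 0.0],
                                       [0.3, 1.0, 0.2],
                                       [0.0, 0.2, 0.5]],
                                  size=200)        # stand-in for your N x m data

mean_est = np.mean(samples, axis=0)                # estimated mean vector, shape (m,)
cov_est = np.cov(samples, rowvar=False)            # sample covariance matrix, shape (m, m)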

Checking for randomness using the Chi-Square Test

I'm running a simulation for a class project that relies heavily on random number generators, and as a result we're asked to test the random number generator to see just how "random" it is using the Chi-Square statistic. After looking through some posts here, I used the following code to find the answer:
from random import randint
import numpy as np
from scipy.stats import chisquare

numIterations = 1000  # I've run it with other numbers as well
observed = []
for i in range(0, numIterations):
    observed.append(randint(0, 100))
data = np.array(observed)
print("(chi squared statistic, p-value) with", numIterations, "samples generated: ", chisquare(data))
However, I'm getting a p-value of zero when numIterations is greater than 10, which doesn't really make sense considering the null hypothesis is that the data is uniform. Am I misinterpreting the results? Or is my code simply wrong?
A chi-square test checks how many items you observed in a bin vs how many you expected to have in that bin. It does so by summing the squared deviations between observed and expected across all bins. You can't just feed it raw data, you need to bin it first using something like scipy.stats.histogram.
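As a rough sketch of that binning step for the question's setup (using np.histogram here rather than scipy.stats.histogram, but the idea is the same; the sample size is made up):
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng()
n_iterations = 100000
draws = rng.integers(0, 101, size=n_iterations)    # like randint(0, 100): integers in [0, 100]

# Bin first: one bin per possible value, then compare the observed counts
# against the uniform expectation.
observed_counts, _ = np.histogram(draws, bins=np.arange(102))
expected_counts = np.full(101, n_iterations / 101)
stat, p = chisquare(observed_counts, f_exp=expected_counts)
print(stat, p)   # p should usually be well above 0.05 for a healthy generator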
You can test against whatever distribution you are aiming for; remember that more samples approximate the distribution better (with an infinite number of samples you would recover the actual distribution). Since in real life we can't run our number generators an infinite number of times, we only deal with approximations, so we bin the distribution and count how many numbers fall into each bin (see http://en.wikipedia.org/wiki/Bean_machine). If you ran your bean machine and found that one of the bins was significantly higher than the expected distribution (Gaussian, in that case), you would say that the process is not Gaussian. The chi-squared test works the same way, except the reference shape is different from a Gaussian, because you are summing squared deviations from several normal (special case Gaussian) distributions. If you want to find out whether your data is normal/Gaussian (think of shapes; the shapes are determined by the distribution's parameters, i.e. mean, std, kurtosis), here is an example of how to do that: http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/chi-square-test-for-normality/
I don't know what your data is, so I can't really tell you what to look for. All in all, you need to know what statistical data you are given, fit it to a model, and then ask whether it matches that model (you are probably trying to find out whether it is uniform or Gaussian/normal, which you can do with the chi-squared test). You should read up on the chi-squared test and the Gaussian/normal distribution.

Find a random method that best fits a list of values

I have a list of many float numbers, representing the duration of an operation performed several times.
For each type of operation, I have a different trend in numbers.
I'm aware of many random generators presented in some python modules, like in numpy.random
For example, I have binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find, given a list of values, the random generator that best fits each list of numbers that I have.
I.e., the generator (with its params) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of time lengths for each operation, so that I can simulate it over n years, without having to find by hand which method best fits each list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input params that may affect the result. So I'll have to figure out how to choose these params automatically, trying to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
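Here is a rough sketch of such a Q-Q plot (the sample durations and the exponential reference are made up for the example):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
durations = rng.exponential(scale=4.0, size=300)        # stand-in for the real data
reference = rng.exponential(scale=4.0, size=durations.size)

# Sort both samples and plot them against each other.
plt.scatter(np.sort(reference), np.sort(durations), s=10)
lims = [0, max(durations.max(), reference.max())]
plt.plot(lims, lims, 'k--')                             # points near this line mean similar distributions
plt.xlabel('sorted reference sample')
plt.ylabel('sorted data')
plt.show()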
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: Adding a reference to a numpy least-squares fitting example.
Given a parameterized univariate distribution (e.g. exponential depends on lambda, or gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the maximum likelihood procedure. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivatives to zero to maximize it, and using the inverse of the curvature (Hessian) matrix at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
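For illustration, SciPy exposes this maximum-likelihood fit through each distribution's fit method; here is a rough sketch of fitting a few candidates and comparing their log likelihoods (the data is made up for the example):
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
durations = rng.gamma(shape=2.0, scale=3.0, size=500)   # stand-in for the real list of times

candidates = {'expon': st.expon, 'gamma': st.gamma, 'weibull_min': st.weibull_min, 'norm': st.norm}
for name, dist in candidates.items():
    params = dist.fit(durations)                        # maximum likelihood estimates of shape/loc/scale
    loglik = np.sum(dist.logpdf(durations, *params))    # log likelihood of the fitted model
    print(name, np.round(params, 3), round(loglik, 1))  # larger log likelihood = better fit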
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to work out by hand what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the estimating equation for the exponential distribution is sum_i (1/la - x_i) = 0, from which la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
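For reference, a sketch of the same closed-form result in Python (scipy parameterizes the exponential by scale = 1/la):
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=10)   # rate la = 1, analogous to the Mathematica example

la_hat = 1.0 / np.mean(data)                 # closed-form maximum likelihood estimate

# Cross-check with scipy's fit (location fixed at 0); scipy uses scale = 1/la.
loc, scale = st.expon.fit(data, floc=0)
print(la_hat, 1.0 / scale)                   # the two estimates agree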
