I have a number of data sets, each containing x, y, and y_error values, and I'm simply trying to calculate the average value of y at each x across these data sets. However, the data sets are not all the same length. I thought the best way to get them to an equal length would be to use scipy's interpolate.interp1d on each data set. However, I still need to be able to calculate the error on each of these averaged values, and I'm quite lost on how to accomplish that after doing an interpolation.
I'm pretty new to Python and coding in general, so I appreciate your help!
As long as you can assume that your errors represent one-sigma intervals of normal distributions, you can always generate synthetic datasets, resample and interpolate those, and compute the 1-sigma errors of the results.
Or just interpolate values+err and values-err, if all you need is a quick and dirty rough estimate.
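For instance, a minimal sketch of that resampling approach (the function name, the toy data sets, and the common grid below are just illustrative assumptions):

import numpy as np
from scipy.interpolate import interp1d

def average_with_errors(datasets, x_common, n_draws=1000):
    """Monte-Carlo average of several (x, y, yerr) data sets on a common grid.

    datasets : list of (x, y, yerr) tuples of 1-D arrays
    x_common : grid on which to average (must lie inside every x range)
    """
    rng = np.random.default_rng(0)
    draws = []
    for _ in range(n_draws):
        per_set = []
        for x, y, yerr in datasets:
            y_synth = rng.normal(y, yerr)            # one synthetic realisation
            per_set.append(interp1d(x, y_synth)(x_common))
        draws.append(np.mean(per_set, axis=0))       # average across data sets
    draws = np.array(draws)
    return draws.mean(axis=0), draws.std(axis=0)     # mean and 1-sigma error

# Example with two fake data sets of different lengths:
x1, x2 = np.linspace(0, 10, 50), np.linspace(0, 10, 40)
d1 = (x1, np.sin(x1), np.full_like(x1, 0.1))
d2 = (x2, np.sin(x2) + 0.05, np.full_like(x2, 0.2))
mean, err = average_with_errors([d1, d2], np.linspace(0.1, 9.9, 30))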
I have values sampled from a continuous distribution, for example:
import numpy as np
values = np.random.normal(loc=0.4, scale=0.1, size=1000)
How can I estimate the mode based on those values?
The mean and median are easy to compute with np.mean(values) and np.median(values), but I don't know how to estimate the mode, since the values are continuous.
Note that using something like scipy.stats.mode would not work because I have a finite set of values sampled from the continuous distribution.
If you have a known, underlying parametric model, life is easy. Fit your data (using MLE or whatever) and take the mode of the fitted distribution. If you don't have a good parametric model, life is harder. There are a number of things I've seen in the literature, but I don't know if any sort of consensus has been reached on this. When I had to do this (~20 years ago) I used an algorithm I found in Numerical Recipes in C. I have no idea if that was the best choice or not.
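For the parametric route, a rough sketch with scipy (the normal model is only an example; the KDE at the end is one common non-parametric fallback, not necessarily the best choice):

import numpy as np
from scipy import stats

values = np.random.normal(loc=0.4, scale=0.1, size=1000)

# Fit a parametric model by maximum likelihood; for a normal
# distribution the mode coincides with the fitted location.
loc, scale = stats.norm.fit(values)
mode_parametric = loc

# Without a good parametric model, a kernel density estimate is a
# common fallback: take the argmax of the estimated density.
kde = stats.gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 1000)
mode_kde = grid[np.argmax(kde(grid))]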
I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R = np.sum(np.abs(model/np.max(model) - myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings, wts, bounds=bnds, method='SLSQP', options={'disp': True, 'eps': 100})
Where my objective function is:
def get_best_weightings(wts):
    # first 16 entries are the real weights, last 16 the imaginary ones
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    # least-squares mismatch between the normalised model and the result
    R = np.sum((np.abs(modelbeam/np.max(modelbeam) - myresult))**2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html ).
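For reference, here is a stripped-down, self-contained version of the kind of call I'm making (the quadratic objective and the numbers are made up purely to illustrate the structure, not my real problem):

import numpy as np
from scipy.optimize import minimize

target = np.arange(32, dtype=float)          # stand-in for the model

def objective(wts):
    # least-squares mismatch between the parameters and the target
    return np.sum(np.abs(target - wts)**2)

wts0 = np.zeros(32)                          # initial guess
bnds = [(-4000, 4000)] * 32
res = minimize(objective, wts0, bounds=bnds, method='SLSQP',
               options={'disp': True})
print(res.x[:5])                             # should approach the target values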
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'. That is, all but four of the values are returned unchanged from my initial guess. To illustrate, I plot the values of my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested the basinhopping routine to find the global minimum (that needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet - thought I'd check if anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.
I have a list of many float numbers, representing the duration of an operation performed several times.
For each type of operation, I have a different trend in the numbers.
I'm aware of the many random generators provided in some Python modules, like numpy.random.
For example, there are binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find, given a list of values, the random generator (with its parameters) that best fits each list of numbers that I have.
That is, the generator whose output best matches the trend of the numbers in the list.
That's because I'd like to automate the generation of durations for each operation, so that I can simulate it over n years, without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input parameters that may affect the result, so I'll have to figure out how to set those parameters automatically to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
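For example, a rough sketch of that approach with scipy (the exponential model and the synthetic data are just placeholders for your own numbers):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.exponential(scale=2.0, size=500)   # stand-in for your numbers

# Maximum-likelihood fit of one candidate distribution
loc, scale = stats.expon.fit(data)

# Q-Q plot: sorted data against sorted draws from the fitted distribution
synthetic = stats.expon.rvs(loc=loc, scale=scale, size=len(data))
plt.plot(np.sort(synthetic), np.sort(data), '.')
plt.plot([data.min(), data.max()], [data.min(), data.max()])  # reference line
plt.xlabel('fitted distribution quantiles')
plt.ylabel('data quantiles')
plt.show()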
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: Adding a reference to a numpy least squares fitting example.
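A hedged sketch of that least-squares route (bin the data, fit a candidate PDF to the histogram, then compare with a chi-squared test); the exponential model and all names here are illustrative:

import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

data = np.random.exponential(scale=2.0, size=1000)   # stand-in for your numbers

# Histogram of the data (density=True so it is comparable to a PDF)
counts, edges = np.histogram(data, bins=30, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Least-squares fit of an exponential PDF to the binned data
def expon_pdf(x, lam):
    return lam * np.exp(-lam * x)

(lam_fit,), _ = curve_fit(expon_pdf, centres, counts, p0=[1.0])

# Chi-squared comparison of observed vs expected bin counts
width = edges[1] - edges[0]
expected = expon_pdf(centres, lam_fit) * width * len(data)
observed, _ = np.histogram(data, bins=edges)
chi2, p = stats.chisquare(observed, expected * observed.sum() / expected.sum(),
                          ddof=1)   # rescale so totals match; 1 fitted parameter
print(lam_fit, chi2, p)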
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, the gamma on theta and k), the procedure for finding the parameter values that best fit a given sample of numbers is maximum likelihood estimation. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results yourself by expressing the log-likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the curvature (Hessian) matrix of the negative log-likelihood at its minimum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log-likelihood.
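A small sketch of that comparison with scipy (the two candidate families and the synthetic sample are arbitrary examples):

import numpy as np
from scipy import stats

data = np.random.gamma(shape=2.0, scale=1.5, size=1000)   # stand-in sample

candidates = {
    'exponential': stats.expon,
    'gamma': stats.gamma,
}

for name, dist in candidates.items():
    params = dist.fit(data)                       # maximum-likelihood fit
    loglike = np.sum(dist.logpdf(data, *params))  # log-likelihood of the fit
    print(name, params, loglike)
# The gamma fit should come out with the larger log-likelihood here.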
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to figure out what the maximum likelihood method says the parameter should be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum-likelihood estimate for the exponential distribution solves Sum[1/la - x_i] == 0, which gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
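If you don't have Mathematica, a rough Python equivalent of the exponential example (scipy's built-in MLE fit, checked against the 1/mean result derived above):

import numpy as np
from scipy import stats

data = np.random.exponential(scale=1.0, size=10)

# MLE fit; floc=0 fixes the location so only the scale (1/lambda) is estimated
loc, scale = stats.expon.fit(data, floc=0)
print('lambda from MLE fit :', 1.0 / scale)
print('lambda = 1/mean     :', 1.0 / np.mean(data))   # same value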
For data that is known to have seasonal or daily patterns, I'd like to use Fourier analysis to make predictions. After running an FFT on the time series data, I obtain coefficients. How can I use these coefficients for prediction?
I believe the FFT assumes all the data it receives constitutes one period. Then, if I simply regenerate the data using the ifft, am I also regenerating the continuation of my function, and can I use those values for future values?
Simply put: if I run the fft for t = 0, 1, 2, ..., 10, can I then use the ifft on the coefficients to regenerate the time series for t = 11, 12, ..., 20?
I'm aware that this question may no longer be relevant for you, but for others looking for answers I wrote a very simple example of Fourier extrapolation in Python: https://gist.github.com/tartakynov/83f3cd8f44208a1856ce
Before you run the script make sure that you have all dependencies installed (numpy, matplotlib). Feel free to experiment with it.
P.S. Locally Stationary Wavelets (LSW) may work better than Fourier extrapolation; LSW is commonly used for predicting time series. The main disadvantage of Fourier extrapolation is that it just repeats your series with period N, where N is the length of your time series.
It sounds like you want a combination of extrapolation and denoising.
You say you want to repeat the observed data over multiple periods. Well, then just repeat the observed data. No need for Fourier analysis.
But you also want to find "patterns". I assume that means finding the dominant frequency components in the observed data. Then yes, take the Fourier transform, preserve the largest coefficients, and eliminate the rest.
X = np.fft.fft(x)                          # transform to the frequency domain
Y = np.zeros(len(X), dtype=complex)        # complex so the coefficients aren't truncated
Y[important_frequencies] = X[important_frequencies]   # indices of the dominant components
As for periodic repetition: let z = [x, x], i.e., two periods of the signal x. With numpy's unnormalized forward transform, Z[2k] = 2·X[k] for all k in {0, 1, ..., N-1}, and the odd-indexed coefficients are zero.
Z = np.zeros(2*len(X), dtype=complex)
Z[::2] = 2*X                               # factor of 2, see above
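Putting the two pieces together, a minimal end-to-end sketch (the toy signal and the number of kept coefficients are made up for illustration):

import numpy as np

# A noisy periodic toy signal standing in for the observed data
t = np.arange(128)
x = np.sin(2*np.pi*t/32) + 0.5*np.sin(2*np.pi*t/8) + 0.2*np.random.randn(len(t))

# Denoise: keep only the strongest Fourier coefficients
X = np.fft.fft(x)
n_keep = 6
important = np.argsort(np.abs(X))[-n_keep:]
Y = np.zeros_like(X)
Y[important] = X[important]
x_denoised = np.fft.ifft(Y).real   # only the dominant periodic components remain

# Periodic repetition: zero-stuff the spectrum to get two periods
Z = np.zeros(2*len(X), dtype=complex)
Z[::2] = 2*X
z = np.fft.ifft(Z).real
print(np.allclose(z, np.tile(x, 2)))   # True: z is just x repeated twice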
When you run an FFT on time series data, you transform it into the frequency domain. The coefficients multiply the terms in the series (sines and cosines or complex exponentials), each with a different frequency.
Extrapolation is always a dangerous thing, but you're welcome to try it. You're using past information to predict the future when you do this: "Predict tomorrow's weather by looking at today." Just be aware of the risks.
I'd recommend reading "The Black Swan".
You can use the library that @tartakynov posted and, so as not to repeat exactly the same time series in the forecast (overfitting), you can add a new parameter to the function called n_param and set a lower bound h on the amplitudes of the frequencies that are kept.
def fourierExtrapolation(x, n_predict, n_param):
Usually you will find that, in a signal, some frequencies have significantly higher amplitudes than others, so if you select those frequencies you will be able to isolate the periodic nature of the signal.
You can add these two lines, which are determined by the chosen number n_param:
h = np.sort(np.absolute(x_freqdom))[-n_param]
x_freqdom = [x_freqdom[i] if np.absolute(x_freqdom[i]) >= h else 0 for i in range(len(x_freqdom))]
Just by adding this, you will be able to produce a nice, smooth forecast.
Another useful article about FFT forecasting:
forecast FFT in R
Given a 1D array of values, what is the simplest way to find the best-fit bimodal distribution, where each 'mode' is a normal distribution? In other words, how can you find the combination of two normal distributions that best reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!
What you are trying to do is fit a Gaussian mixture model. The standard approach to solving this is Expectation Maximization; scipy svn includes a section on machine learning and EM called scikits. I use it a fair bit.
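As a sketch of the same idea with a current library, scikit-learn's GaussianMixture fits such a model by EM (the synthetic data below is only illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic bimodal sample: two normal components
values = np.concatenate([np.random.normal(-2.0, 0.5, 400),
                         np.random.normal(1.0, 1.0, 600)])

# Fit a two-component Gaussian mixture with EM
gmm = GaussianMixture(n_components=2).fit(values.reshape(-1, 1))
print(gmm.means_.ravel())                  # estimated means of the two modes
print(np.sqrt(gmm.covariances_).ravel())   # estimated standard deviations
print(gmm.weights_)                        # mixing proportions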
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least-squares fit or something along those lines (a rough sketch of such a fit follows the list below).
Here are a few problems you will run into:
Noise larger than the second peak, or than both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the width of the peaks is smaller than the spacing of your sampled data.
It isn't normal - you'll get some result anyway ...
Overlap - if the peaks overlap, you'll often find that one peak is fitted correctly but the second approaches zero...
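Here is a rough sketch of the kind of least-squares fit being cautioned about, using scipy.optimize.curve_fit on a histogram of the values (all names and starting values are illustrative, and the caveats above still apply):

import numpy as np
from scipy.optimize import curve_fit

# Synthetic bimodal sample
values = np.concatenate([np.random.normal(-2.0, 0.5, 400),
                         np.random.normal(1.0, 1.0, 600)])

def two_gaussians(x, a1, mu1, s1, a2, mu2, s2):
    # sum of two Gaussian bumps
    return (a1 * np.exp(-(x - mu1)**2 / (2 * s1**2)) +
            a2 * np.exp(-(x - mu2)**2 / (2 * s2**2)))

counts, edges = np.histogram(values, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Reasonable starting guesses matter a lot here (see the caveats above)
p0 = [0.3, -2.0, 0.5, 0.2, 1.0, 1.0]
params, _ = curve_fit(two_gaussians, centres, counts, p0=p0)
print(params)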