Good afternoon, I am trying to build an estimate of a data histogram.
To do this I estimate the number of Gaussians to fit by using k-means.
Then I divide the data histogram into segments using the min and max of the data along with
the total number of clusters (centroids) from k-means.
Then I take these segments of the original data and make them into separate lists
so I can operate on them separately, calculating the mean and standard deviation for each.
When displayed with the original data, the means fit and the standard deviations look OK.
The peaks of these Gaussians are, however, much larger than the region of the original data they came from.
I think I want to do some kind of normalization here so that the total probability from the multiple Gaussians matches the original data.
Can anyone with better math skills suggest a plan of attack here?
It seems like the cumulative probability for each new Gaussian will sum to one over the interval
from which it was extracted, and I need to use the z-scores of the original data histogram to make some kind of normalization over just the interval of each new data set I constructed.
The complication is that the y axis is in frequency (density), not in number of observations, maybe...
I appreciate any help with this ...
I have tried many things but the formulation of my approach is likely flawed.
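In case it helps to make the intended normalization concrete, here is a rough sketch (not a verified solution, and all names are placeholders) of weighting each per-segment Gaussian by the fraction of points in its segment, so the weighted components integrate to 1 and match a density-normalised histogram:

import numpy as np

data = np.random.normal(5.0, 1.0, 2000)                  # placeholder for the real data
edges = np.linspace(data.min(), data.max(), 4)           # 3 segments, as from k-means
idx = np.digitize(data, edges[1:-1])                     # segment index for each point
segments = [data[idx == i] for i in range(len(edges) - 1)]

x = np.linspace(data.min(), data.max(), 500)
mixture = np.zeros_like(x)
for seg in segments:
    if len(seg) == 0:
        continue
    w = len(seg) / len(data)                             # mixture weight for this segment
    mu, sigma = seg.mean(), seg.std()
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mixture += w * pdf                                    # weighted Gaussian component

To overlay the result on a counts histogram instead of a density one, multiply mixture by len(data) times the bin width.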
This article https://www.bitweenie.com/listings/fft-zero-padding/ gives a simple relation between time-length of the input data to the FFT and the minimum distance between two frequencies that can be distinguished in the FFT. The article calls this Waveform frequency resolution.
In other words: if two input frequencies are closer together than 1/(time length of the input data), they will show as only one peak in the FFT plot.
My question is: is there a way to increase this Waveform frequency resolution? I am finding it difficult to work with rather short data-series due to this limitation.
As an example, if I use a combination of sine series with periods 9.5, 10, and 11 over 240 datapoints I cannot distinguish between the different frequencies.
To have good frequency resolution you need a long time series.
This is a fundamental limitation, known as the uncertainty principle, and it cannot be overcome within Fourier analysis (Fourier transform, DFT, short-time Fourier transform and so on).
Also note that zero padding will not overcome this issue.
It gives more points in the frequency domain, in the sense that the same spectral information is sampled more densely, but it will not make peaks sharper or more separated.
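As a rough illustration of that point (the specific frequencies below are just examples close to the ones in the question): two tones about 0.005 Hz apart in a 240-sample record sampled at 1 Hz, where the waveform frequency resolution is 1/240 ≈ 0.004 Hz, effectively merge into one broad peak, and zero padding only samples that same blurred spectrum more densely.

import numpy as np

fs = 1.0
t = np.arange(240) / fs
x = np.sin(2 * np.pi * 0.100 * t) + np.sin(2 * np.pi * 0.105 * t)

spec_plain = np.abs(np.fft.rfft(x))                  # 240-point FFT
spec_padded = np.abs(np.fft.rfft(x, n=4096))         # heavily zero padded
f_plain = np.fft.rfftfreq(240, d=1 / fs)
f_padded = np.fft.rfftfreq(4096, d=1 / fs)
# Plotting both spectra shows essentially one broad peak in each case; the
# padded version is just a smoother interpolation of the same shape.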
The only way to overcome the uncertainty principle is to make further assumptions on the data.
If for example you know that there is only a single frequency component, it is possible to determine its frequency more accurately than the uncertainty principle predicts.
You can also use transforms such as the Wigner-Ville transform. It is not bound by the uncertainty principle, but it generates "cross-terms", i.e. frequency-component artifacts. However, when you only have a few frequency components this might be acceptable; it depends on the use case.
I have values sampled from a continuous distribution, for example:
import numpy as np
values = np.random.normal(loc=0.4, scale=0.1, size=1000)
How can I estimate the mode based on those values?
The mean and median are easy to compute: np.mean(values) and np.median(values); but I don't know how to estimate the mode, since the values are continuous.
Note that using something like scipy.stats.mode would not work because I have a finite set of values sampled from the continuous distribution.
If you have a known, underlying parametric model, life is easy. Fit your data (using MLE or whatever) and take the mode of the fitted distribution. If you don't have a good parametric model, life is harder. There are a number of things I've seen in the literature, but I don't know if any sort of consensus has been reached on this. When I had to do this (~20 years ago) I used an algorithm I found in Numerical Recipes in C. I have no idea if that was the best choice or not.
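To make that concrete, here is a small sketch (just an illustration, not necessarily the method the answer above had in mind) of both routes, using the same sampled values as in the question. The parametric route fits a model and takes its mode; the non-parametric route takes the argmax of a kernel density estimate.

import numpy as np
from scipy.stats import norm, gaussian_kde

values = np.random.normal(loc=0.4, scale=0.1, size=1000)

# Parametric route: fit a normal and take its mode (for a normal, the mode is loc).
loc, scale = norm.fit(values)
mode_parametric = loc

# Non-parametric route: evaluate a Gaussian KDE on a grid and take its argmax.
kde = gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 1000)
mode_kde = grid[np.argmax(kde(grid))]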
I have a number of data sets, each containing x, y, and y_error values, and I'm simply trying to calculate the average value of y at each x across these data sets. However, the data sets are not all the same length. I thought the best way to get them to an equal length would be to use scipy's interpolate.interp1d for each data set. However, I still need to be able to calculate the error on each of these averaged values, and I'm quite lost on how to accomplish that after doing an interpolation.
I'm pretty new to Python and coding in general, so I appreciate your help!
As long as you can assume that your errors represent one-sigma intervals of normal distributions, you can always generate synthetic datasets, resample and interpolate those, and compute the 1-sigma errors of the results.
Or just interpolate values+err and values-err, if all you need is a quick and dirty rough estimate.
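A sketch of that resampling idea, assuming each y_error is a 1-sigma Gaussian uncertainty; datasets below is a hypothetical list of (x, y, y_err) arrays and x_common is whatever common grid suits the data:

import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)
x_common = np.linspace(0.0, 10.0, 50)          # common grid (choose to suit the data)
n_draws = 1000

means, stds = [], []
for x, y, y_err in datasets:
    # Draw synthetic realisations of this dataset and interpolate each one.
    draws = rng.normal(y, y_err, size=(n_draws, len(y)))
    interped = np.array([interp1d(x, d, bounds_error=False)(x_common) for d in draws])
    means.append(np.nanmean(interped, axis=0))
    stds.append(np.nanstd(interped, axis=0))   # 1-sigma spread of the interpolated curve

# Average across datasets, combining the per-dataset spreads in quadrature.
y_avg = np.mean(means, axis=0)
y_avg_err = np.sqrt(np.sum(np.square(stds), axis=0)) / len(stds)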
I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task; it is quite simple in theory, but I am still confused about what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that I can get the probability that any area is between inf and sup.
Can anyone help me by explaining in a simple way what the functions do, and give me a hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution lies between them.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
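For concreteness, here is a quick sketch of all three approaches side by side; areas stands in for your 16383 values and inf, sup for your bounds (the placeholder data is only there to make the snippet runnable):

import numpy as np
from scipy import stats

areas = np.random.lognormal(mean=0.0, sigma=0.5, size=16383)   # placeholder data
inf, sup = 0.8, 1.5

# 1) Parametric fit (a lognormal here; only sensible if it actually suits the data).
shape, loc, scale = stats.lognorm.fit(areas, floc=0)
p_param = stats.lognorm.cdf(sup, shape, loc, scale) - stats.lognorm.cdf(inf, shape, loc, scale)

# 2) Gaussian kernel density estimate.
p_kde = stats.gaussian_kde(areas).integrate_box_1d(inf, sup)

# 3) Simple empirical count.
p_empirical = np.mean((areas > inf) & (areas < sup))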
The cumulative distribution function (CDF) might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
They are all related. The pdf is the "density" of the probabilities: it must be non-negative and integrate to 1. I think of it as indicating how likely each value is. The expectation is a generalisation of the idea of an average:
E[X] = Σ x·P(x) in the discrete case, or ∫ x·f(x) dx for a continuous pdf f.
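As a small illustration of pdf and expect on a concrete distribution (a standard normal, purely as an example): expect integrates func(x) times the pdf, so with func returning 1 over [lb, ub] it gives the probability mass in that interval, matching cdf(ub) - cdf(lb).

from scipy import stats

dist = stats.norm(loc=0.0, scale=1.0)
density_at_zero = dist.pdf(0.0)                           # about 0.3989
p_expect = dist.expect(lambda x: 1.0, lb=-1.0, ub=1.0)    # about 0.6827
p_cdf = dist.cdf(1.0) - dist.cdf(-1.0)                    # same value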
Does anyone know if it is possible to find the power spectral density of a signal with gaps in it? For example (in MATLAB syntax, because that is what I'm familiar with):
ta=1:1000;
tb=1200:3000;
t=[ta tb]; % this is the timebase
signal=randn(size(t)); % this is the signal
figure(101)
plot(t,signal,'.')
I'd like to be able to determine frequencies on a longer time base than just the individual sections of data. Obviously I could just take the PSD of the individual sections, but that would limit the lowest frequency. I could interpolate the data, but this would colour the PSD.
Any thoughts would be much appreciated.
The Lomb-Scargle periodogram algorithm is usually used to perform analysis on unevenly spaced data (sampled at arbitrary time points) or when a proportion of the data is missing.
Here are a couple of MATLAB implementations:
lombscargle.m (FEX)
Lomb (Lomb-Scargle) Periodogram (FEX)
lomb.m - ECG tools by Gari Clifford
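(The links above are MATLAB implementations; since other snippets in this thread use numpy/scipy, here is a quick Python sketch of the same idea with scipy.signal.lombscargle, using a gapped timebase like the one in the question and an artificial test tone so there is something to detect.)

import numpy as np
from scipy.signal import lombscargle

t = np.concatenate([np.arange(1, 1001), np.arange(1200, 3001)]).astype(float)
signal = np.sin(2 * np.pi * 0.01 * t) + 0.5 * np.random.randn(t.size)

freqs = np.linspace(0.001, 0.1, 2000)                 # ordinary frequencies to scan
# lombscargle expects angular frequencies and (ideally) zero-mean data.
pgram = lombscargle(t, signal - signal.mean(), 2 * np.pi * freqs)
# pgram should show a clear peak near 0.01 despite the gap in t.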
I found this Non Uniform FFT, but I'm not sure that it's exactly what I need, as it might really be for data that is mostly sampled on an uneven time base, rather than evenly spaced data with significant gaps. I'll give it a go!
Summing the Fourier basis vectors over only the available samples (i.e. leaving out the gap segments) gives exactly the same FT, and thus the same PSD, as using the complete basis with zeros filled into the signal gaps.
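A scaled-down numerical check of that statement (sizes shrunk from the question's for speed): an FFT of the record with zeros filled into the gap equals the DFT summed only over the available samples.

import numpy as np

rng = np.random.default_rng(1)
N = 300                                               # full timebase length
t_avail = np.concatenate([np.arange(0, 100), np.arange(120, 300)])
x_avail = rng.standard_normal(t_avail.size)

full = np.zeros(N)
full[t_avail] = x_avail                               # zero-filled record
fft_zero_filled = np.fft.fft(full)

k = np.arange(N)
dft_gapped = np.exp(-2j * np.pi * np.outer(k, t_avail) / N) @ x_avail

print(np.allclose(fft_zero_filled, dft_gapped))       # True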