Suppose I have a process where I push a button, and after a certain amount of time (from 1 to 30 minutes), an event occurs. I then run a very large number of trials, and record how long it takes the event to occur for each trial. This raw data is then reduced to a set of 30 data points where the x value is the number of minutes it took for the event to occur, and the y value is the percentage of trials which fell into that bucket. I do not have access to the original data.
How can I use this set of 30 points to identify an appropriate probability distribution which I can then use to generate representative random samples?
I feel like scipy.stats has all the tools I need built in, but for the life of me I can't figure out how to go about it. Any tips?
If you have no prior information about the function underlying the data, I suggest using numpy.polyfit, which fits a polynomial of a given degree.
import matplotlib.pyplot as plt
import numpy as np
y = np.array([ 0.005995184, ...]) # your array
x = np.arange(len(y))
f = np.poly1d(np.polyfit(x, y, 10))
x_new = np.linspace(x[0], x[-1], 30)
y_new = f(x_new)
plt.plot(x,y,'o', x_new, y_new)
plt.xlim([x[0]-1, x[-1] + 1 ])
Here is an example for degree = 10.
In order to get an unknown value from the produced polynomial distribution, you simply:
which in this case gives:
You can also use the histogram, piecewise uniform distribution directly, then you get exactly the corresponding random numbers instead of an approximation.
The inverse cdf, ppf, is piecewise linear and linear interpolation can be used to transform uniform random numbers appropriately.
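A minimal sketch of that piecewise-uniform approach, using only NumPy. The bucket probabilities below are made up for illustration, and the sketch assumes that bucket k covers the interval from k - 1 to k minutes; adjust the edges if your buckets are defined differently.

```python
import numpy as np

# Hypothetical bucket probabilities for minutes 1..30 (illustrative shape only).
minutes = np.arange(1, 31)
weights = minutes * np.exp(-minutes / 5.0)
probs = weights / weights.sum()

# Assumption: bucket k covers the interval from k - 1 to k minutes (31 edges).
edges = np.arange(0, 31, dtype=float)

# CDF evaluated at the bucket edges; it is piecewise linear in between.
cdf = np.concatenate(([0.0], np.cumsum(probs)))
cdf[-1] = 1.0  # guard against floating-point round-off


def sample(size, rng):
    """Push uniform random numbers through the piecewise-linear inverse CDF."""
    u = rng.random(size)
    return np.interp(u, cdf, edges)


rng = np.random.default_rng(0)
samples = sample(100_000, rng)

# The empirical bucket frequencies should be close to the input probabilities.
counts, _ = np.histogram(samples, bins=edges)
empirical = counts / counts.sum()
```

If some buckets have zero probability the CDF has flat segments, and it may be safer to drop those buckets before interpolating. scipy.stats.rv_histogram wraps the same idea in a distribution object, if you prefer the scipy.stats interface.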
I was able to come up with a solution, but it doesn't feel like a very elegant one. Basically, take the percentage value (y value) for each x value, multiply by some large number (say, 10,000), then add that many copies of x to an array. Continue through all values of x, ending up with a single giant array. This array can then be fed into the .fit() methods of the continuous distributions in scipy.stats (the rv_continuous subclasses; rv_discrete has no .fit() method). I'll leave the question open for now as I feel like there must be a better way.
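import matplotlib.pyplot as plt
import scipy.stats
import numpy as np
xRange = 30
x = np.arange(0, xRange + 1)  # scipy.arange was a deprecated alias of the NumPy function
data = [ ... ]  # the bucket fractions (values elided in the original post)
# expand the fractions into one large array of repeated x values
# (loop body reconstructed from the description above)
y = []
for i in range(len(data)):
    for j in range(int(data[i] * 10000)):
        y.append(x[i])
# creating the histogram
h = plt.hist(y, bins=x, density=True)  # 'normed' has been removed from matplotlib
dist_names = ['burr', 'f', 'rayleigh']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)  # shape parameters first, then loc and scale
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    plt.plot(pdf_fitted, label=dist_name, lw=4)
plt.legend(loc='upper right')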
How can I get the coordinates of the point in space with the greatest density?
I have this code to generate random points and run a density analysis on them.
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)
x_data = random_data(50)
y_data = random_data(50)
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]), bw_method=0.05)
b = plt.plot(x_data, y_data, 'ro')
df = pd.DataFrame({"x":x_data,"y":y_data})
p = sns.jointplot(data=df,x='x', y='y',kind='kde')
Thank you for help. :)
For starters, let me state the obvious by saying that sns.jointplot computes the kernel density on its own, so your kernel variable is as of yet unused.
Here's what sns.jointplot generated for me with a random sample:
There's a nice maximum at around (7, 5.4).
Here's what your kernel corresponds to:
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
This will clearly not do: the density consists of narrow peaks centered on your input points, so you will never get an estimate similar to what sns.jointplot gave you.
We can easily fix this: you just have to drop the custom bw_method argument in the call to gaussian_kde:
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]))
x,y = np.mgrid[:10:100j, :10:100j] # 100 x 100 grid for plotting
z = kernel.pdf(np.array([x.ravel(),y.ravel()])).reshape(x.shape)
fig,ax = plt.subplots()
ax.contourf(x, y, z, levels=10)
This looks just the way you want it:
Now you know that this kernel.pdf is a bivariate function for which you're looking for the maximum.
And to find the maximum you should probably use something from scipy.optimize, for instance scipy.optimize.minimize (the trick is to look at the negative of your function, which turns maxima into minima).
Since your function will probably have a few local maxima, finding the global maximum reliably is not trivial. I would either use the aforementioned minimize, but first use a sparse mesh over the relevant domain and find the best maximum candidate first, or use a heavy-weight solver such as differential_evolution which is a stochastic solver that's supposed to be good at finding the true global minimum of a function.
Root finding and minimization is always fickle business, so you will have to play around with your real data and available methods to find a reliable workflow that gives you your maximum.
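Here is a rough sketch of the "coarse mesh first, then minimize" idea. The sample data (a cluster near (7, 5) on top of a uniform background), the 50 x 50 mesh and the [0, 10] bounds are all assumptions made for illustration; with real data the mesh resolution and the solver will need tuning.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Hypothetical sample: a tight cluster near (7, 5) plus a uniform background.
cluster = rng.normal(loc=[7.0, 5.0], scale=0.5, size=(150, 2))
background = rng.uniform(0.0, 10.0, size=(50, 2))
points = np.vstack([cluster, background]).T  # shape (2, N), as gaussian_kde expects

kernel = stats.gaussian_kde(points)  # default bandwidth, no bw_method

# Step 1: evaluate the density on a coarse mesh and keep the best candidate.
gx, gy = np.mgrid[0:10:50j, 0:10:50j]
grid = np.vstack([gx.ravel(), gy.ravel()])
start = grid[:, kernel(grid).argmax()]


# Step 2: refine locally; minimizing the negative density maximizes the density.
def negative_density(p):
    return -kernel(np.reshape(p, (2, 1)))[0]


result = optimize.minimize(negative_density, start, bounds=[(0, 10), (0, 10)])
peak = result.x  # coordinates of the (local) density maximum
```

The mesh step only guards against starting the local solver in the wrong basin; if the density has several peaks of similar height, a global solver such as scipy.optimize.differential_evolution may be the more robust choice.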
I have a function which is an interpolation of a relatively large set of data. I use linear interpolation (interp1d), so there are a lot of non-smooth sharp points like this. The quad function from scipy gives a warning because of the sharp points. How can I do the integration without the warning?
Thank you!
Thanks for all the answers. Here I summarize the solutions in case some others run into the same problem:
Just like @Stelios did, use points to avoid warnings and to get a more accurate result.
In practice the number of points is usually larger than the default limit (limit=50) of quad, so I chose quad(f_interp, a, b, limit=2*p.shape[0], points=p) to avoid all those warnings.
If a and b are not the start and end points of the data set x, the points p can be chosen by p = x[(x >= a) & (x <= b)]. Note the element-wise &; Python's "and" does not work on arrays.
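The summary above can be sketched as follows. The data set, the limits a and b, and the names are hypothetical; the trapezoid sum at the end is only there as a cross-check, since the integral of a piecewise-linear function is known exactly.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)

# Hypothetical data set with more knots than quad's default limit of 50.
x = np.linspace(0.0, 10.0, 201)
y = rng.random(x.size)


def f_interp(t):
    """Linear interpolation of the data set."""
    return np.interp(t, x, y)


a, b = 2.03, 7.51  # integration limits that do not coincide with data points

# Break points: the knots that lie strictly inside the integration interval.
p = x[(x > a) & (x < b)]
value, abserr = quad(f_interp, a, b, points=p, limit=2 * p.size)

# Cross-check: the trapezoid rule on the knots plus the two end points is
# exact for a piecewise-linear integrand.
xs = np.concatenate(([a], p, [b]))
ys = f_interp(xs)
exact = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))
```

If all you need is the integral of the linear interpolant itself, the trapezoid sum already gives it without calling quad at all.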
quad accepts an optional argument, called points. According to the documentation:
points : (sequence of floats,ints), optional
A sequence of break points in the bounded integration interval where
local difficulties of the integrand may occur (e.g., singularities,
discontinuities). The sequence does not have to be sorted.
In your case, the "difficult" points are exactly the x-coordinates of the data points. Here is an example:
import numpy as np
from scipy.integrate import quad
# generate random data set
x = np.arange(0,10)
y = np.random.rand(10)
# construct a linear interpolation function of the data set
f_interp = lambda xx: np.interp(xx, x, y)
Here is a plot of the data points and f_interp:
Now calling quad as
quad(f_interp, 0, 9)
returns a series of warnings along with
(4.89770017785734, 1.3762838395159349e-05)
If you provide the points argument, i.e.,
quad(f_interp,0,9, points = x)
it issues no warnings and the result is
(4.8977001778573435, 5.437539505167948e-14)
which also implies a much greater accuracy of the result compared to the previous call.
Instead of interp1d, you could use scipy.interpolate.InterpolatedUnivariateSpline. That interpolator has the method integral(a, b) that computes the definite integral.
Here's an example:
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
# Create some test data.
x = np.linspace(0, np.pi, 21)
y = np.sin(1.5*x) + np.random.laplace(scale=0.35, size=len(x))**3
# Create the interpolator. Use k=1 for linear interpolation.
finterp = InterpolatedUnivariateSpline(x, y, k=1)
# Create a finer mesh of points on which to compute the integral.
xx = np.linspace(x[0], x[-1], 5*len(x))
# Use the interpolator to compute the integral from 0 to t for each
# t in xx.
qq = [finterp.integral(0, t) for t in xx]
# Plot stuff
p = plt.plot(x, y, '.', label='data')
plt.plot(x, y, '-', color=p[0].get_color(), label='linear interpolation')
plt.plot(xx, qq, label='integral of linear interpolation')
plt.legend(framealpha=1, shadow=True)
The plot:
What I'm trying to do is, from a list of x-y points that has a periodic pattern, calculate the period. With my limited mathematics knowledge I know that Fourier Transformation can do this sort of thing.
I'm writing Python code.
I found a related answer here, but it uses an evenly-distributed x axis, i.e. dt is fixed, which isn't the case for me. Since I don't really understand the math behind it, I'm not sure if it would work properly in my code.
My question is, does it work? Or, is there some method in numpy that already does my work? Or, how can I do it?
EDIT: All values are Pythonic float (i.e. double-precision)
For samples that are not evenly spaced, you can use scipy.signal.lombscargle to compute the Lomb-Scargle periodogram. Here's an example, with a signal whose dominant frequency is 2.5 rad/s.
from __future__ import division
import numpy as np
from scipy.signal import lombscargle
import matplotlib.pyplot as plt
n = 100
x = np.sort(10*np.random.rand(n))
# Dominant periodic signal
y = np.sin(2.5*x)
# Add some smaller periodic components
y += 0.15*np.cos(0.75*x) + 0.2*np.sin(4*x+.1)
# Add some noise
y += 0.2*np.random.randn(x.size)
plt.plot(x, y, 'b')
dxmin = np.diff(x).min()
duration = np.ptp(x)  # the ndarray.ptp() method was removed in NumPy 2.0
freqs = np.linspace(1/duration, n/duration, 5*n)
periodogram = lombscargle(x, y, freqs)
kmax = periodogram.argmax()
print("%8.3f" % (freqs[kmax],))
plt.plot(freqs, np.sqrt(4*periodogram/(5*n)))
plt.xlabel('Frequency (rad/s)')
plt.axvline(freqs[kmax], color='r', alpha=0.25)
The script prints 2.497 and generates the following plots:
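Since the question asks for the period rather than the frequency, here is a small sketch of the last step. lombscargle works with angular frequencies, so the period is 2*pi divided by the peak frequency. The test signal and the frequency grid are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)

# Hypothetical unevenly sampled signal with angular frequency 2.5 rad/s.
n = 200
t = np.sort(10 * rng.random(n))
y = np.sin(2.5 * t) + 0.2 * rng.standard_normal(n)
y = y - y.mean()  # remove the mean before computing the periodogram

# Angular frequencies (rad/s) at which to evaluate the periodogram.
omegas = np.linspace(0.1, 10.0, 2000)
periodogram = lombscargle(t, y, omegas)

omega_peak = omegas[periodogram.argmax()]
period = 2 * np.pi / omega_peak  # same time units as t
```

The frequency grid has to cover the frequency you are looking for; a peak that sits at either end of the grid is a hint that the range should be widened.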
As starting point:
(I assume all coordinates are positive and integer, otherwise map them to reasonable range like 0..4095)
find max coordinates xMax, yMax in list
make 2D array with dimensions yMax + 1, xMax + 1 (so that the maximum coordinates are valid indices)
fill it with zeros
walk through your list, set array elements, corresponding to coordinates, to 1
make 2D Fourier transform
look for peculiarities (peaks) in FT result
This page from Scipy shows you basic knowledge of how Discrete Fourier Transform works:
They also provide API for using DFT. For your case, you should look at how to use fft2.
I have a question concerning fitting and getting random numbers.
Situation is as such:
Firstly I have a histogram from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(datapoints,bins,density=True)
I would like to interpret this histogram as probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers AND also I would like to use that function to fit another histogram.
Thanks for your help
You can use a cumulative distribution function (CDF) to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth CDF is not entirely trivial. You can interpolate, for example with scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), use any other form of tail appropriate to your problem, or simply truncate the distribution.
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len(b)):
    u = random.uniform(x[0], x[-1])
    b[i] = inverse_density_function(u)
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you had mentioned in the question) but the above non-parametric approach is more general-purpose.
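For the parametric route mentioned in the P.S., one possible sketch is to fit a two-parameter normal density to the histogram heights with scipy.optimize.curve_fit and then sample from the fitted distribution. The data, the bin range and the choice of a normal model are assumptions; because the histogram is normalized over [0, 20] only, the model pdf is renormalized to that range.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical data, histogrammed on a fixed range as in the question.
datapoints = rng.normal(10.0, 5.0, 20000)
lo, hi = 0.0, 20.0
H, edges = np.histogram(datapoints, bins=np.linspace(lo, hi, 21), density=True)
centers = 0.5 * (edges[1:] + edges[:-1])


def model(x, mu, sigma):
    """Normal pdf renormalized to [lo, hi], matching density=True on that range."""
    mass = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    return norm.pdf(x, mu, sigma) / mass


# Starting values from the histogram's own weighted mean and spread.
mu0 = np.sum(centers * H) / np.sum(H)
sigma0 = np.sqrt(np.sum(H * (centers - mu0) ** 2) / np.sum(H))

(mu_fit, sigma_fit), _ = curve_fit(
    model, centers, H, p0=(mu0, sigma0),
    bounds=([-np.inf, 1e-6], [np.inf, np.inf]),  # keep sigma positive
)

# Draw new random numbers from the fitted two-parameter distribution.
new_samples = rng.normal(mu_fit, sigma_fit, 1000)
```

The same model function can be passed to curve_fit again with the bin centers and heights of another histogram, which covers the second half of the question.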
I need to (numerically) calculate the first and second derivatives of a function. I've attempted to use both splrep and UnivariateSpline to create splines that interpolate the function so that I can take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and which are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin
k = linspace(0, 6.*pi, num=10000) #interval (0,6*pi) in 10'000 steps
A = 1.e0 # Amplitude of sine function
for i in range(len(k)):
tck =interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
Below are the results for M for A = 1.e0 and A = 1.e-2:
[Figures: Amplitude = 1; Amplitude = 1/100]
Clearly the interpolated function created by the splines is totally incorrect! The 2nd graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Never mind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting s=0 in instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100) #interval (0,6*pi) in 100 points
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo',label='Original')
    ax.set_title(title + ' Smoothing')
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
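To tie this back to the original goal of taking derivatives, here is a small sketch under assumed values (amplitude 1e-2, 200 sample points, 5% noise). It first uses s=0 on clean data and differentiates the spline, then uses s = m * std**2 on noisy data, so that s follows the scale of the data instead of being a fixed number like 2.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

A = 1.0e-2  # small amplitude, as in the question
x = np.linspace(0.0, 6.0 * np.pi, 200)
y = A * np.sin(x)

# Clean data: s=0 forces the spline through every point; k=5 keeps the
# second derivative smooth.
spline = UnivariateSpline(x, y, k=5, s=0)
d1 = spline.derivative(1)(x)
d2 = spline.derivative(2)(x)

# Relative errors against the analytic derivatives A*cos(x) and -A*sin(x).
err1 = np.max(np.abs(d1 - A * np.cos(x))) / A
err2 = np.max(np.abs(d2 + A * np.sin(x))) / A

# Noisy data: scale s with the number of points m and the noise level std.
rng = np.random.default_rng(0)
std = 0.05 * A
y_noisy = y + std * rng.standard_normal(x.size)
smooth = UnivariateSpline(x, y_noisy, k=5, s=x.size * std**2)

# RMS distance between the smoothing spline and the noisy data; with this
# choice of s it should be about the size of the noise.
residual_rms = np.sqrt(np.mean((smooth(x) - y_noisy) ** 2))
```

With the fixed s=2 from the question, the allowed sum of squared residuals is far larger than the whole signal at this amplitude, which is why the spline flattened out.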