I want to test how close the distribution of a data set is to a Gaussian with mean=0 and variance=1.
The kuiper test from astropy.stats has a cdf parameter, from the documentation: "A callable to evaluate the CDF of the distribution being tested against. Will be called with a vector of all values at once. The default is a uniform distribution", but I don't know how to use this to test against a normal distribution. What if I want e.g. a normal distribution with mean 0.2 and variance 2?
So I used kuiper_two, also from astropy, and generated a random sample from a normal distribution to compare against. See the example below.
The problem I see with this is that it depends on the number of data points I generate to compare against. If I had used 100 instead of 10000 data points, the probability (fpp) would have risen to 43%.
I guess the question is, how do I do this properly? Also, how do I interpret the D number?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from astropy.stats import kuiper_two

# create data and its empirical cdf
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
data_sort = np.sort(data)
data_cdf = [x/len(data) for x in range(0, len(data))]
# create the normal data with mean 0 and variance 1
xx = np.random.normal(loc=0, scale=1, size=10000)
xx_sort = np.sort(xx)
xx_cdf = [x/len(xx) for x in range(0, len(xx))]
# compute the pdf for a plot
x = np.linspace(-4, 4, 50)
x_pdf = stats.norm.pdf(x, 0, 1)
# we can see it all in a plot
fig, ax = plt.subplots(figsize=(8, 6))
plt.hist(xx, bins=20, density=True, stacked=True, histtype='stepfilled', alpha=0.6)
plt.hist(data, density=True, stacked=True, histtype='step', lw=3)
plt.plot(x, x_pdf, lw=3, label=r'G($\mu=0$, $\sigma^2=1$)')
ax2 = ax.twinx()
ax2.plot(xx_sort, xx_cdf, marker='o', ms=8, mec='green', mfc='green', ls='None')
ax2.plot(data_sort, data_cdf, marker='^', ms=8, mec='orange', mfc='orange', ls='None')
# Kuiper test
D, fpp = kuiper_two(data_sort, xx_sort)
print('# D number =', round(D, 5))
print('# fpp =', round(fpp, 5))
# Which resulted in:
# D number = 0.211
# fpp = 0.14802
astropy.stats.kuiper expects as its first argument a sample from the distribution you want to test, and as its second argument the CDF of the distribution you want to test against.
This argument is a callable that itself takes one or more sample values and returns the value(s) of the cumulative distribution function at those points. You can use scipy.stats' CDFs for that, and with functools.partial we can fix any parameters.
from scipy import stats
from scipy.stats import norm
from astropy.stats import kuiper
from functools import partial
from random import shuffle
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
# Output: (0.2252118027033838, 0.08776036566607946)
# The data does not have to be sorted, in case you wondered:
shuffle(data)
print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
# Output: (0.2252118027033838, 0.08776036566607946)
The diagram in the Wikipedia article about this test gives an idea of what the Kuiper statistic V measures: it is the sum of the largest distance of the empirical CDF above the comparison CDF and the largest distance below it.
If the parameters of the comparison distribution match those used to generate the data, the distance is lower and the estimated probability that the respective underlying CDFs are identical rises:
print(kuiper(data, partial(norm.cdf, loc=0.02, scale=1.15)))
# Output: (0.14926352419821276, 0.68365004302431)
The function astropy.stats.kuiper_two, in contrast, expects two samples of empirical data to compare with one another. So if you want to compare against a distribution with a tractable CDF, it is preferable to use the CDF directly (with kuiper) instead of sampling from the comparison distribution (and using kuiper_two).
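For illustration, here is a minimal sketch (reusing the data from the question) contrasting the two approaches; the kuiper_two result drifts with the size of the comparison sample, while kuiper against the analytic CDF needs no second sample at all:
import numpy as np
from scipy.stats import norm
from astropy.stats import kuiper, kuiper_two
from functools import partial

np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)

# Test directly against the analytic N(0, 1) CDF
print(kuiper(data, partial(norm.cdf, loc=0.0, scale=1.0)))

# Test against finite samples drawn from N(0, 1): the fpp depends on the sample size
for n in (100, 10000):
    comparison = np.random.normal(loc=0.0, scale=1.0, size=n)
    print(n, kuiper_two(data, comparison))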
And one nit-pick: apart from using "cdf" in names of variables that are not a CDF, this formulation is much more readable than the list comprehensions above:
data_cdf = np.linspace(0.0, 1.0, len(data), endpoint=False)
xx_cdf = np.linspace(0.0, 1.0, len(xx), endpoint=False)
I really like the look of Seaborn's KDE plot:
I was wondering how I can replicate this look for a line plot.
In my case I actually have the function that generates the density, instead of samples of the data.
So assuming I have the data in a data frame:
x - The value of x per sample.
y - The value of the density function at x.
μσ - Categorical variable to group data from the same density (In the code, I use the mean and standard deviation of a normal distribution).
Using Seaborn's lineplot I can get what I want, but without the filled area below the curve as in the image above.
I'm after achieving that look for the data I have.
Is there a way to replicate this theme, area under the curve included, for lineplot?
The code below shows what I got so far:
import numpy as np
import scipy as sp
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
num_grid_pts = 1000
val_μ = [0, -1, 1, 0]
val_σ = [1, 2, 3, 4]
num_var = len(val_μ) # variations
x = np.linspace(-10, 10, num_grid_pts)
P = np.zeros((num_grid_pts, num_var)) # PDF
μσ = [f'μ = {μ}, σ = {σ}' for μ, σ in zip(val_μ, val_σ)]
for ii, (μ, σ) in enumerate(zip(val_μ, val_σ)):
randVar = norm(μ, σ)
P[:, ii] = randVar.pdf(x)
df_P = pd.DataFrame(data = {'x': np.tile(x, num_var), 'PDF': P.flatten('F'), 'μσ': np.repeat(μσ, len(x))})
f, ax = plt.subplots(figsize=(15, 10))
sns.lineplot(data=df_P, x='x', y='PDF', hue='μσ', ax=ax)
plot_lines = ax.get_lines()
for ii in range(num_var):
ax.fill_between(x=plot_lines[ii].get_xdata(), y1=plot_lines[ii].get_ydata(), alpha=0.25, color=plot_lines[ii].get_color())
ax.set_title('Normal Distribution')
ax.set_xlabel('Value')
ax.set_ylabel('Probability')
plt.show()
I used the lineplot to create the lines and then created the fills. But this is a hack; I was wondering whether I can do it more naturally within Seaborn.
I found a way to manually play with the plot elements to do this, using the Area object from seaborn.objects:
(
so.Plot(healthexp, "Year", "Spending_USD", color="Country")
.add(so.Area(alpha=.7), so.Stack())
)
The result is:
Yet for some reason the example code doesn't work.
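For what it's worth, here is a sketch of how the same objects interface might be applied to the df_P dataframe from the question (assuming seaborn >= 0.12; so.Stack from the docs example is dropped since the densities should overlap rather than stack):
import seaborn.objects as so

(
    so.Plot(df_P, x='x', y='PDF', color='μσ')
    .add(so.Area(alpha=0.25, edgewidth=2))
    .show()
)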
What I did was use Seaborn's lineplot() and then manually add a fill_between() polygon for each line:
ax = sns.lineplot(data=data_frame, x='data_x', y='data_y', hue='data_color')
plot_lines = ax.get_lines()
for i in range(num_unique_colors):
ax.fill_between(x=plot_lines[i].get_xdata(), y1=plot_lines[i].get_ydata(), alpha=0.25, color=plot_lines[i].get_color())
I am attempting to fit multiple data sets to the same equation and to find the value of the fitting parameters between them. There are two independent variables, which I think I've dealt with. I ended up with something that works as expected for a single data set, but not one that works for multiple data sets. The code itself works, but the fit looks like a bow (a straight line and curved line connected at the end) instead of just a curve. I want separate curves per data set, with shared values for the parameters. I know I need to break up the data somehow, maybe by having my data stacked and adjusting the function with indexes, but I'm getting confused by the examples I've found and am not sure how to execute them here. Below is the code:
#import things
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##set-up data##
#Have x-data as numpy array
xfrac = [1., 0.85,0.725,0.6,0.5,0.4,0.]
x = np.concatenate((xfrac,xfrac))
#Write function to generate and populate arrays using ideal values
#data sets (I have pasted the values instead of posting the code used to calculate them)
mix_850 = [1.701, 3.642865, 4.6762, 5.0739, 5.5177, 5.9923, 6.9408]
mix_1000 = [1.651185, 3.53359, 4.4854, 4.8978, 5.32525, 5.7388, 6.792]
dat = np.concatenate((mix_850,mix_1000))
#Temperature values
c = np.repeat(850., 7)
d = np.repeat(1000., 7)
Temp = np.concatenate((c,d))
#Define function
def f(Z, a1, b1, a2, b2):
x1,T= Z
x2= 1.-x1
excess = a1+b1+(a2+b2*T)*(x1-(1.-x1)*(x1*(1.-x1)))
ideal = ((x1*25.939)+((1.0-x1)*314.02))/(((x1*25.939)/1.701-0.3321e-3*T)+(((1.0-x1)*314.02)/7.784-0.9920e-3*T))
mix = excess + ideal
return mix
#Fitting
popt,_ = curve_fit(f,(x,Temp),dat)
fit_a1 = popt[0]
fit_b1 = popt[1]
fit_a2 = popt[2]
fit_b2 = popt[3]
Define xfrac as np.array:
xfrac = np.array([1., 0.85, 0.725, 0.6, 0.5, 0.4, 0.])
Use xfrac in the plot instead of the concatenated x:
# plotting
mix1 = f((xfrac, c), *popt)
mix2 = f((xfrac, d), *popt)
# temperature 1
plt.plot(xfrac, mix1, label=c[0], c='blue')
plt.plot(xfrac, mix_850, linestyle='', c='blue',
marker='o', label='Data {}'.format(c[0]))
# temperature 2
plt.plot(xfrac, mix2, label=d[0], c='red')
plt.plot(xfrac, mix_1000, linestyle='', c='red',
marker='o', label='Data {}'.format(d[0]))
plt.xlabel('xfrac')
plt.ylabel('mix')
plt.legend()
I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:
train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
I understand that adding noise has a regularization effect on the data. Reading the documentation tells me that it adds Gaussian noise. However, in the above code, I could not understand what it means to add 0.05 noise to the data. How would this affect the data mathematically?
I tried the code below. I could see the values changing, but could not figure out how, for example, the first row of x in the first array relates to the corresponding row of x_1 in the second array after adding noise=.05.
np.random.seed(0)
x,y = sklearn.datasets.make_circles()
print(x[:5,:])
x_1,y_1 = sklearn.datasets.make_circles(noise= .05)
print(x_1[:5,:])
Output:
[[-9.92114701e-01 -1.25333234e-01]
[-1.49905052e-01 -7.85829801e-01]
[ 9.68583161e-01 2.48689887e-01]
[ 6.47213595e-01 4.70228202e-01]
[-8.00000000e-01 -2.57299624e-16]]
[[-0.66187208 0.75151712]
[-0.86331995 -0.56582111]
[-0.19574479 0.7798686 ]
[ 0.40634757 -0.78263011]
[-0.7433193 0.26658851]]
According to the documentation:
sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.
noise: double or None (default=None)
Standard deviation of Gaussian noise added to the data.
The statement make_circles(noise=0.05) means that it is creating random circles with a little bit of variation following a Gaussian distribution, also known as a normal distribution. A Gaussian distribution is characterized by its mean and standard deviation; in this case, the call make_circles(noise=0.05) means that the added noise has a standard deviation of 0.05.
Let's invoke the function, check out its output, and see what's the effect of changing the parameter noise. I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.
Let's first call make_circles() with noise=0.0 and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd
n_samples = 100
noise = 0.00
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
# x y label
# 0 -0.050232 0.798421 1
# 1 0.968583 0.248690 0
# 2 -0.809017 0.587785 0
# 3 -0.535827 0.844328 0
# 4 0.425779 -0.904827 0
You can see that make_circles returns data instances where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.
# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
So it looks like it's creating two concentric circles, each with a different label.
Let's increase the noise to noise=0.05 and see the result:
n_samples = 100
noise = 0.05 # <--- The only change
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
It looks like the noise is added to each of the x, y coordinates to make each point shift around a little bit. When we inspect the code for make_circles() we see that the implementation does exactly that:
def make_circles( ..., noise=None, ...):
...
if noise is not None:
X += generator.normal(scale=noise, size=X.shape)
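To make that line concrete, here is a tiny sketch of what it does to a single noise-free point (the numbers are illustrative, not taken from the output above):
import numpy as np

rng = np.random.default_rng(0)
point = np.array([0.968583, 0.248690])            # a noise-free point on the outer circle
noisy = point + rng.normal(scale=0.05, size=2)    # each coordinate gets an independent N(0, 0.05) draw
print(point, noisy)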
So now we've seen two visualizations of the dataset with two values of noise. But two visualizations isn't cool. You know what's cool? Five visualizations with the noise increasing progressively by 10x. Here's a function that does it:
def make_circles_plot(n_samples, noise):
assert n_samples > 0
assert noise >= 0
# Use make_circles() to generate random data points with noise.
features, labels = make_circles(n_samples=n_samples, noise=noise)
# Create a dataframe for later plotting.
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(5, 5))
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points with noise=%f' % noise)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.tight_layout()
plt.show()
Calling the above function with different values of noise, it can clearly be seen that increasing this value makes the points move around more, i.e. it makes them more "noisy", exactly as we should expect intuitively.
for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
make_circles_plot(500, noise)
Suppose I create a histogram using scipy/numpy, so I have two arrays: one for the bin counts, and one for the bin edges. If I use the histogram to represent a probability distribution function, how can I efficiently generate random numbers from that distribution?
It's probably what np.random.choice does in @Ophion's answer, but you can construct a normalized cumulative distribution function, then choose based on a uniform random number:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)
bin_midpoints = bins[:-1] + np.diff(bins)/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]
plt.subplot(121)
plt.hist(data, 50)
plt.subplot(122)
plt.hist(random_from_cdf, 50)
plt.show()
A 2D case can be done as follows:
data = np.column_stack((np.random.normal(scale=10, size=1000),
np.random.normal(scale=20, size=1000)))
x, y = data.T
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(50, 50))
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T
plt.subplot(121, aspect='equal')
plt.hist2d(x, y, bins=(50, 50))
plt.subplot(122, aspect='equal')
plt.hist2d(new_x, new_y, bins=(50, 50))
plt.show()
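As an aside, recent SciPy versions (0.19+, if I recall correctly) ship scipy.stats.rv_histogram, which wraps exactly this inverse-CDF construction for the 1D case:
import numpy as np
from scipy.stats import rv_histogram

data = np.random.normal(size=1000)
hist = np.histogram(data, bins=50)        # (counts, bin_edges) as returned by np.histogram
dist = rv_histogram(hist)                 # continuous distribution built from the histogram
random_from_hist = dist.rvs(size=10000)   # draw new samples from it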
@Jaime's solution is great, but you should consider using a KDE (kernel density estimate) of the histogram. A great explanation of why it is problematic to do statistics over a histogram, and why you should use a KDE instead, can be found here.
I edited @Jaime's code to show how to use the KDE from scipy. It looks almost the same, but better captures the underlying distribution that generated the histogram.
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
def run():
data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)
x_grid = np.linspace(min(data), max(data), 1000)
kdepdf = kde(data, x_grid, bandwidth=0.1)
random_from_kde = generate_rand_from_pdf(kdepdf, x_grid)
bin_midpoints = bins[:-1] + np.diff(bins) / 2
random_from_cdf = generate_rand_from_pdf(hist, bin_midpoints)
plt.subplot(121)
plt.hist(data, 50, density=True, alpha=0.5, label='hist')
plt.plot(x_grid, kdepdf, color='r', alpha=0.5, lw=3, label='kde')
plt.legend()
plt.subplot(122)
plt.hist(random_from_cdf, 50, alpha=0.5, label='from hist')
plt.hist(random_from_kde, 50, alpha=0.5, label='from kde')
plt.legend()
plt.show()
def kde(x, x_grid, bandwidth=0.2, **kwargs):
"""Kernel Density Estimation with Scipy"""
kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
return kde.evaluate(x_grid)
def generate_rand_from_pdf(pdf, x_grid):
cdf = np.cumsum(pdf)
cdf = cdf / cdf[-1]
values = np.random.rand(1000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = x_grid[value_bins]
return random_from_cdf
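To actually produce the figures, call the entry point, e.g. when the snippet is run as a script:
if __name__ == '__main__':
    run()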
Perhaps something like this. It uses the histogram counts as weights and chooses among the bin edges (indices[1:]) according to those weights.
import numpy as np
initial=np.random.rand(1000)
values,indices=np.histogram(initial,bins=20)
values=values.astype(np.float32)
weights=values/np.sum(values)
#Below, 5 is the dimension of the returned array.
new_random = np.random.choice(indices[1:], 5, p=weights)
print(new_random)
#[ 0.55141614 0.30226256 0.25243184 0.90023117 0.55141614]
I had the same problem as the OP and I would like to share my approach to this problem.
Following Jaime's answer and Noam Peled's answer, I've built a solution for a 2D problem using Kernel Density Estimation (KDE).
First, let's generate some random data and then calculate its Probability Density Function (PDF) from the KDE. I will use the example available in SciPy for that.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def measure(n):
"Measurement model, return two coupled measurements."
m1 = np.random.normal(size=n)
m2 = np.random.normal(scale=0.5, size=n)
return m1+m2, m1-m2
m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
And the plot is:
Now, we obtain random data from the PDF obtained from the KDE, which is the variable Z.
# Generate the bins for each axis
x_bins = np.linspace(xmin, xmax, Z.shape[0]+1)
y_bins = np.linspace(ymin, ymax, Z.shape[1]+1)
# Find the middle point for each bin
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
# Calculate the Cumulative Distribution Function (CDF) from the PDF
cdf = np.cumsum(Z.ravel())
cdf = cdf / cdf[-1]  # normalization
# Create random data
values = np.random.rand(10000)
# Find the data position
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
# Create the new data
new_data = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = new_data.T
And we can calculate the KDE from this new data and then plot it.
kernel = stats.gaussian_kde(new_data.T)
new_Z = np.reshape(kernel(positions).T, X.shape)
fig, ax = plt.subplots()
ax.imshow(np.rot90(new_Z), cmap=plt.cm.gist_earth_r,
extent=[xmin, xmax, ymin, ymax])
ax.plot(new_x, new_y, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
Here is a solution that returns data points that are uniformly distributed within each bin instead of at the bin center:
import numpy as np

def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
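A short usage sketch for the function above (assuming hist and bins come straight from np.histogram):
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)
samples = draw_from_hist(hist, bins, nsamples=10000)
plt.hist(samples, bins=50)
plt.show()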
A few things do not work well in the solutions suggested by @daniel, @arco-bast, et al.
Taking the last example:
def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
This assumes that at least the first bin has zero content, which may or may not be true. Secondly, this assumes that the value of the PDF is at the upper bound of the bins, which it isn't - it's mostly in the centre of the bin.
Here's another solution done in two parts
def init_cdf(hist,bins):
"""Initialize CDF from histogram
Parameters
----------
hist : array-like, float of size N
Histogram height
bins : array-like, float of size N+1
Histogram bin boundaries
Returns:
--------
cdf : array-like, float of size N+1
"""
from numpy import concatenate, diff,cumsum
# Calculate half bin sizes
steps = diff(bins) / 2 # Half bin size
# Calculate slope between bin centres
slopes = diff(hist) / (steps[:-1]+steps[1:])
# Find height of end points by linear interpolation
# - First part is linear interpolation from second over first
# point to lowest bin edge
# - Second part is linear interpolation left neighbor to
# right neighbor up to but not including last point
# - Third part is linear interpolation from second to last point
# over last point to highest bin edge
# Can probably be done more elegant
ends = concatenate(([hist[0] - steps[0] * slopes[0]],
hist[:-1] + steps[:-1] * slopes,
[hist[-1] + steps[-1] * slopes[-1]]))
# Calculate cumulative sum
sum = cumsum(ends)
# Subtract off lower bound and scale by upper bound
sum -= sum[0]
sum /= sum[-1]
# Return the CDF
return sum
def sample_cdf(cdf,bins,size):
"""Sample a CDF defined at specific points.
Linear interpolation between defined points
Parameters
----------
cdf : array-like, float, size N
CDF evaluated at all points of bins. First and
last point of bins are assumed to define the domain
over which the CDF is normalized.
bins : array-like, float, size N
Points where the CDF is evaluated. First and last points
are assumed to define the end-points of the CDF's domain
size : integer, non-zero
Number of samples to draw
Returns
-------
sample : array-like, float, of size ``size``
Random sample
"""
from numpy import interp
from numpy.random import random
return interp(random(size), cdf, bins)
# Begin example code
import numpy as np
import matplotlib.pyplot as plt
# initial histogram, coarse binning
hist,bins = np.histogram(np.random.normal(size=1000),np.linspace(-2,2,21))
# Calculate CDF, make sample, and new histogram w/finer binning
cdf = init_cdf(hist,bins)
sample = sample_cdf(cdf,bins,1000)
hist2,bins2 = np.histogram(sample,np.linspace(-3,3,61))
# Calculate bin centres and widths
mx = (bins[1:]+bins[:-1])/2
dx = np.diff(bins)
mx2 = (bins2[1:]+bins2[:-1])/2
dx2 = np.diff(bins2)
# Plot, taking care to show uncertainties and so on
plt.errorbar(mx,hist/dx,np.sqrt(hist)/dx,dx/2,'.',label='original')
plt.errorbar(mx2,hist2/dx2,np.sqrt(hist2)/dx2,dx2/2,'.',label='new')
plt.legend()
Sorry, I don't know how to get this to show up in StackOverflow, so copy'n'paste and run to see the point.
I stumbled upon this question when I was looking for a way to generate a random array based on the distribution of another array. If this were in numpy, I would call it a random_like() function.
Then I realized that I had written a package, Redistributor, which might do this for me, even though the package was created with a slightly different motivation (an sklearn transformer capable of transforming data from an arbitrary distribution to an arbitrary known distribution for machine learning purposes). Of course I understand that unnecessary dependencies are not desired, but at least knowing about this package might be useful to you someday. The thing the OP asked about is basically done under the hood here.
WARNING: under the hood, everything is done in 1D. The package also implements a multidimensional wrapper, but I have not written this example using it, as I find it too niche.
Installation:
pip install git+https://gitlab.com/paloha/redistributor
Implementation:
import numpy as np
import matplotlib.pyplot as plt
def random_like(source, bins=0, seed=None):
from redistributor import Redistributor
np.random.seed(seed)
noise = np.random.uniform(source.min(), source.max(), size=source.shape)
s = Redistributor(bins=bins, bbox=[source.min(), source.max()]).fit(source.ravel())
s.cdf, s.ppf = s.source_cdf, s.source_ppf
r = Redistributor(target=s, bbox=[noise.min(), noise.max()]).fit(noise.ravel())
return r.transform(noise.ravel()).reshape(noise.shape)
source = np.random.normal(loc=0, scale=1, size=(100,100))
t = random_like(source, bins=80) # More bins more precision (0 = automatic)
# Plotting
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title(f'Distribution of source data, shape: {source.shape}')
plt.hist(source.ravel(), bins=100)
plt.subplot(122); plt.title(f'Distribution of generated data, shape: {t.shape}')
plt.hist(t.ravel(), bins=100); plt.show()
Explanation:
import numpy as np
import matplotlib.pyplot as plt
from redistributor import Redistributor
from sklearn.metrics import mean_squared_error
# We have some source array with "some unknown" distribution (e.g. an image)
# For the sake of example we just generate a random gaussian matrix
source = np.random.normal(loc=0, scale=1, size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Source data'); plt.imshow(source, origin='lower')
plt.subplot(122); plt.title('Source data hist'); plt.hist(source.ravel(), bins=100); plt.show()
# We want to generate a random matrix from the distribution of the source
# So we create a random uniformly distributed array called noise
noise = np.random.uniform(source.min(), source.max(), size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Uniform noise'); plt.imshow(noise, origin='lower')
plt.subplot(122); plt.title('Uniform noise hist'); plt.hist(noise.ravel(), bins=100); plt.show()
# Then we fit (approximate) the source distribution using Redistributor
# This step internally approximates the cdf and ppf functions.
s = Redistributor(bins=200, bbox=[source.min(), source.max()]).fit(source.ravel())
# A little naming workaround to make obj s work as a target distribution
s.cdf = s.source_cdf
s.ppf = s.source_ppf
# Here we create another Redistributor but now we use the fitted Redistributor s as a target
r = Redistributor(target=s, bbox=[noise.min(), noise.max()])
# Here we fit the Redistributor r to the noise array's distribution
r.fit(noise.ravel())
# And finally, we transform the noise into the source's distribution
t = r.transform(noise.ravel()).reshape(noise.shape)
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Transformed noise'); plt.imshow(t, origin='lower')
plt.subplot(122); plt.title('Transformed noise hist'); plt.hist(t.ravel(), bins=100); plt.show()
# Computing the difference between the two arrays
print('Mean Squared Error between source and transformed: ', mean_squared_error(source, t))
Mean Squared Error between source and transformed: 2.0574123162302143
I have a disordered list named d that looks like:
[0.0000, 123.9877,0.0000,9870.9876, ...]
I simply want to plot a CDF graph based on this list using Matplotlib in Python, but I don't know if there's any function I can use.
d = []
d_sorted = []
for line in fd.readlines():
(addr, videoid, userag, usertp, timeinterval) = line.split()
d.append(float(timeinterval))
d_sorted = sorted(d)
class discrete_cdf:
def __init__(data):
self._data = data # must be sorted
self._data_len = float(len(data))
def __call__(point):
return (len(self._data[:bisect_left(self._data, point)]) /
self._data_len)
cdf = discrete_cdf(d_sorted)
xvalues = range(0, max(d_sorted))
yvalues = [cdf(point) for point in xvalues]
plt.plot(xvalues, yvalues)
Now I am using this code, but the error message is:
Traceback (most recent call last):
File "hitratioparea_0117.py", line 43, in <module>
cdf = discrete_cdf(d_sorted)
TypeError: __init__() takes exactly 1 argument (2 given)
I know I'm late to the party, but there is a simpler way if you just want the CDF for your plot and not for future calculations:
plt.hist(put_data_here, density=True, cumulative=True, label='CDF',
         histtype='step', alpha=0.8, color='k')
As an example,
plt.hist(dataset, bins=bins, density=True, cumulative=True, label='CDF DATA',
         histtype='step', alpha=0.55, color='purple')
# bins and (lognormal / normal) datasets are pre-defined
EDIT: This example from the matplotlib docs may be more helpful.
As mentioned, cumsum from numpy works well. Make sure that your data is a proper PDF (i.e., it sums to one), otherwise the CDF won't end at unity as it should. Here is a minimal working example:
import numpy as np
from pylab import *
# Create some test data
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)
# Normalize the data to a proper PDF
Y /= (dx * Y).sum()
# Compute the CDF
CY = np.cumsum(Y * dx)
# Plot both
plot(X, Y)
plot(X, CY, 'r--')
show()
The numpy function for computing cumulative sums, cumsum, can be useful here:
In [1]: from numpy import cumsum
In [2]: cumsum([.2, .2, .2, .2, .2])
Out[2]: array([ 0.2, 0.4, 0.6, 0.8, 1. ])
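Applied to the question, a quick sketch (assuming the timing values are already collected in the list d, as in the question's code):
import numpy as np
import matplotlib.pyplot as plt

counts, bin_edges = np.histogram(d, bins=50, density=True)
cdf = np.cumsum(counts * np.diff(bin_edges))   # integrate the normalized histogram bin by bin
plt.plot(bin_edges[1:], cdf)
plt.show()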
Nowadays, you can just use seaborn's kdeplot function with cumulative=True to generate a CDF.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
X1 = np.arange(100)
X2 = (X1 ** 2) / 100
sns.kdeplot(data = X1, cumulative = True, label = "X1")
sns.kdeplot(data = X2, cumulative = True, label = "X2")
plt.legend()
plt.show()
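Note that a KDE-based curve is smoothed; if an exact empirical CDF is preferred, newer seaborn versions (0.11+, if I remember correctly) also provide ecdfplot:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

X1 = np.arange(100)
sns.ecdfplot(data=X1, label="X1")
plt.legend()
plt.show()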
For an arbitrary collection of values, x:
import numpy as np
import matplotlib.pyplot as plt

def cdf(x, plot=True, *args, **kwargs):
    x, y = sorted(x), np.arange(len(x)) / len(x)
    return plt.plot(x, y, *args, **kwargs) if plot else (x, y)
(If you're new to Python, *args and **kwargs allow you to pass positional and keyword arguments without declaring and managing them explicitly.)
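A short usage example (with np and plt imported as above):
cdf(np.random.randn(1000), label='empirical CDF')
plt.legend()
plt.show()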
What works best for me is the quantile function of pandas.
Say I have 71 participants. Each participant has a certain number of interruptions. I want to compute the CDF plot of #interruptions across participants. The goal is to be able to tell what percentage of participants have at least 30 interventions.
step=0.05
indices = np.arange(0,1+step,step)
num_interruptions_per_participant = [32,70,52,52,39,20,37,31,60,57,31,71,24,23,38,4,77,37,79,43,63,43,75,13
,45,31,57,28,61,29,30,52,65,11,76,37,65,28,33,73,65,43,50,33,45,40,50,44
,33,49,24,69,55,47,22,45,54,11,30,13,32,52,31,50,10,46,10,25,47,51,83]
CDF = pd.DataFrame({'dummy':num_interruptions_per_participant})['dummy'].quantile(indices)
plt.plot(CDF,indices,linewidth=9, label='#interventions', color='blue')
According to the graph, almost 25% of the participants have fewer than 30 interventions.
You can use this statistic for further analysis. For instance, in my case I need at least 30 interventions from each participant in order to meet the minimum sample requirement needed for leave-one-subject-out evaluation. The CDF tells me that I have a problem with 25% of the participants.
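To read that number off directly rather than from the graph, the empirical fraction can also be computed in one line (a quick sketch using the list from the code above):
import numpy as np

arr = np.array(num_interruptions_per_participant)
print((arr < 30).mean())   # fraction of participants with fewer than 30 interventions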
import matplotlib.pyplot as plt

X = sorted(data)            # data: your collection of values
Y = []
l = len(X)
Y.append(1.0 / l)
for i in range(2, l + 1):
    Y.append(1.0 / l + Y[i - 2])
plt.plot(X, Y, color='blue', marker='o', label='xyz')
I guess this would do; for the procedure, refer to http://www.youtube.com/watch?v=vcoCVVs0fRI