I plotted two normal distribution curves on the graph as seen below:
Using the below code:
import random
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
d1 = np.array([random.randrange(0,100) for i in range(100)])
d2 = np.array([random.randrange(0,50) for i in range(100)])
d1,d2 = np.sort(d1),np.sort(d2)
d1_mean,d1_std = np.mean(d1),np.std(d1)
d2_mean,d2_std = np.mean(d2),np.std(d2)
# y = np.repeat(15,100)
plt.figure(figsize=(8,6))
plt.plot(d1,stats.norm.pdf(d1,d1_mean,d1_std)*1000,'r')
plt.plot(d2,stats.norm.pdf(d2,d2_mean,d2_std)*1000,'b')
# plt.hlines(y,d1.min(),d1.max(),'g')
plt.title("Flattening the curve")
plt.show()
What I want to achieve is this:
The second graph is easier to plot because the means of both normal distribution curves in it are the same. However, I am unable to achieve the same with the graph at the very top. When I set the means to be similar, this happens:
Any tips on how to achieve this? Any help is much appreciated. Thank you for reading.
I think the easiest way to generate the two Gaussian curves would be to plug x-values in the range [-20, 20] into the Gaussian function with two different values of sigma. matplotlib will then set the plot boundaries to [-20, 20], and the curves will be centered at 0.
import numpy as np
import matplotlib.pyplot as plt

def gaussian(x, mu, sig):
    return 1. / (np.sqrt(2. * np.pi) * sig) * np.exp(-np.power((x - mu) / sig, 2.) / 2)
plt.figure(figsize=(8,6))
x_values = np.linspace(-20, 20, 200)
# generate gaussian curves with mu = 0, sigma = 5, 10
plt.plot(x_values, gaussian(x_values, 0, 5), color = 'blue')
plt.plot(x_values, gaussian(x_values, 0, 10), color = 'red')
plt.title("Flattening the curve")
plt.show()
What I am trying to do is play around with some random distributions. I don't want them to be normal, but for the time being normal is easier.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

ws = norm.rvs(4.0, 1.5, size=100)
density, bins = np.histogram(ws, 50, density=True)
unity_density = density / density.sum()

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(12, 6))
widths = bins[1:] - bins[:-1]
ax1.bar(bins[:-1], unity_density, width=widths, align='edge')
ax2.bar(bins[:-1], unity_density.cumsum(), width=widths, align='edge')
fig.tight_layout()
Then I can visualize the CDF in terms of points:
density1 = unity_density.cumsum()
x = bins[:-1]
plt.plot(x, density1, 'o')
What I have been trying to do is use the np.interp function on the output of np.histogram to obtain a smooth curve representing the CDF, and then extract the percent points to plot them. Ideally, I need to do it both manually and with the ppf function from scipy.
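For what it's worth, here is a minimal sketch of that manual approach (my own illustration under the setup above, not a confirmed solution): it inverts the empirical CDF with np.interp by swapping the x and y roles, and checks the result against scipy's ppf.

import numpy as np
from scipy.stats import norm

ws = norm.rvs(4.0, 1.5, size=100)
density, bins = np.histogram(ws, 50, density=True)
unity_density = density / density.sum()
cdf = unity_density.cumsum()
bin_centers = (bins[:-1] + bins[1:]) / 2

# np.interp expects increasing x-values; a CDF is monotonic,
# so invert it by swapping the roles of x and y.
quantiles = np.array([0.25, 0.5, 0.75])
manual_pp = np.interp(quantiles, cdf, bin_centers)

# Compare with the analytic percent-point function.
analytic_pp = norm(loc=4.0, scale=1.5).ppf(quantiles)
print(manual_pp)
print(analytic_pp)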
I have always struggled with statistics as an undergraduate. I am in grad school now and try to put myself through as many exercises like this as possible in order to get a deeper understanding of what is happening. I've reached a point of desperation with this task.
Thank you!
One possibility to get smoother results is to use more samples: with 10^5 samples and 100 bins I get the following images:
ws = norm.rvs(loc=4.0, scale=1.5, size=100000)
density, bins = np.histogram(ws, bins=100, density=True)
In general you could use scipy's interpolation module to smooth your CDF.
For 100 samples and a smoothing factor of s=0.01 I get:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splev, splrep
# the empirical CDF from the question's histogram
density1 = unity_density.cumsum()
x = bins[:-1]
y = density1
# Interpolation
spl = splrep(x, y, s=0.01, per=False)
x2 = np.linspace(x[0], x[-1], 200)
y2 = splev(x2, spl)
# Plotting
fig, ax = plt.subplots()
plt.plot(x, density1, 'o')
plt.plot(x2, y2, 'r-')
A third possibility is to calculate the CDF analytically. If you generate the samples yourself with a numpy/scipy function, most of the time an implementation of the CDF is already available; otherwise you should find it on Wikipedia. If your samples come from measurements, that is of course a different story.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = np.linspace(-2, 10)
y = norm(loc=4.0, scale=1.5).cdf(x)
ax.plot(x, y, 'bo-')
As a minimal reproducible example, suppose I have the following multivariate normal distribution:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
# Evaluate the pdf on the whole grid at once (the last axis holds the (x, y) pairs)
Z = rv.pdf(np.dstack((X, Y)))
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
plt.show()
This gives the following surface plot:
My goal is to marginalize this and use Kernel Density Estimation to get a nice and smooth 1D Gaussian. I am running into 2 problems:
Not sure my marginalization technique makes sense.
After marginalizing I am left with a barplot, but gaussian_kde requires actual data (not frequencies of it) in order to fit KDE, so I am unable to use this function.
Here is how I marginalize it:
# find marginal distribution over y by summing over all x
y_distribution = Z.sum(axis=1) / Z.sum() # Do I need to normalize?
# plot bars
plt.bar(y, y_distribution)
plt.show()
and this is the barplot that I obtain:
Next, I follow this StackOverflow question to find the KDE only from "histogram" data. To do this, we resample the histogram and fit KDE on the resamples:
# sample the histogram
resamples = np.random.choice(y, size=1000, p=y_distribution)
kde = gaussian_kde(resamples)
# plot bars
fig, ax = plt.subplots(nrows=1, ncols=2)
ax[0].bar(y, y_distribution)
ax[1].plot(y, kde.pdf(y))
plt.show()
This produces the following plot:
which looks "okay-ish" but the two plots are clearly not on the same scale.
Coding Issue
How come the KDE is coming out on a different scale? Or rather, why is the barplot on a different scale than the KDE?
To further highlight this, I've set the variance-covariance matrix so that we know the marginal distribution over y is a normal distribution centered at 0 with variance 3. At this point we can compare the KDE with the actual normal distribution as follows:
from scipy.stats import norm

plt.plot(y, norm.pdf(y, loc=0, scale=np.sqrt(3)), label='norm')
plt.plot(y, kde.pdf(y), label='kde')
plt.legend()
plt.show()
This gives:
This means the bar plot is on the wrong scale. What coding issue put the barplot on the wrong scale?
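For reference, a minimal sketch of one way to reconcile the two scales (an assumption on my part, reusing the variables from the snippets above, not taken from the post): y_distribution sums to 1, so it is a probability mass per grid point, and dividing it by the grid spacing turns it into a density comparable to the KDE.

# y_distribution sums to 1 over the grid, i.e. it is a probability
# mass per grid point; divide by the grid spacing to get a density.
dy = y[1] - y[0]
plt.bar(y, y_distribution / dy, width=dy, alpha=0.5, label='marginal (as density)')
plt.plot(y, kde.pdf(y), 'r-', label='kde')
plt.legend()
plt.show()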
When estimating the pdf of values in [0, 1] using stats.gaussian_kde, and the values are uniformly distributed, the estimate is very poor close to the boundaries (near 0 and near 1). See the code and picture below.
Is there any way to deal with this poor estimation close to the boundaries?
import numpy as np
import random
import scipy.stats as stats
import matplotlib.pyplot as plt
X = [random.uniform(0,1) for _ in range(10000)]
linsp = np.linspace(0, 1, 1000)
nparam_density = stats.gaussian_kde(X)
nparam_density = nparam_density(linsp)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(X, bins=30, density=True)
ax.plot(linsp, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.legend(loc='best')
fig.savefig('pdf.png')
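One common remedy for this boundary bias is the reflection trick; here is a minimal sketch of it (my own suggestion, not a definitive fix): mirror the samples about both boundaries, fit the KDE on the augmented sample, and renormalize on [0, 1].

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

X = np.random.uniform(0, 1, 10000)
# Reflect the samples about 0 and 1 so the kernels no longer
# leak probability mass outside the support.
X_aug = np.concatenate([X, -X, 2 - X])
kde = stats.gaussian_kde(X_aug)
linsp = np.linspace(0, 1, 1000)
# Only a third of the augmented mass lies in [0, 1], hence the factor 3.
density = 3 * kde(linsp)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(X, bins=30, density=True, alpha=0.3)
ax.plot(linsp, density, 'r-', label='boundary-corrected KDE (reflection)')
ax.legend(loc='best')
fig.savefig('pdf_reflected.png')  # filename is arbitrary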
I'm using Python and some of its extensions to get and plot the probability density function. While I manage to plot it, at least in its general form, I don't manage to scale the axes.
import decimal
import numpy as np
import scipy.stats as stats
import pylab as pl
import matplotlib.pyplot as plt
from decimal import *
from scipy.stats import norm
lines=[]
fig, ax = plt.subplots(1, 1)
mean, var, skew, kurt = norm.stats(moments='mvsk')
# (the lines that fill the `lines` list with values are omitted here)
Long = len(lines)
Maxim = max(lines) #MaxValue
Minim = min(lines) #MinValue
av = np.mean(lines) #Average
StDev = np.std(lines) #Standard Dev.
x = np.linspace(Minim, Maxim, Long)
ax.plot(x, norm.pdf(x, av, StDev),'r-', lw=3, alpha=0.9, label='norm pdf')
weights = np.ones_like(lines)/len(lines)
ax.hist(lines, weights=weights, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()
The result is
While I would like to have it expressed
- On the x-axis, centered at 0 and expressed in multiples of the standard deviation
- On the y-axis, related to the histogram and the percentages (normalized to 1)
For the x-axis, as the image below
And like this last image for the y-axis
I've managed to scale the y-axis of a standalone histogram by passing weights = weights into the plot, but I can't do it here; I include it in the code above, but it actually does nothing in this case.
Any help would be appreciated
The y-axis is normalized so that the area under the curve is one.
And adding equal weights for every data point makes no sense if you normalize anyway with density=True.
First you need to shift your data to 0:
lines -= np.mean(lines)
Then plot it.
This should be a working minimal example:
import numpy as np
from numpy.random import normal
import matplotlib.pyplot as plt
from scipy.stats import norm
# gaussian distributed random numbers with mu =4 and sigma=2
x = normal(4, 2, 10000)
mean = np.mean(x)
sigma = np.std(x)
x -= mean
x_plot = np.linspace(min(x), max(x), 1000)
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(x, bins=50, density=True, label="data")
# the data was shifted to mean 0, so center the pdf at 0 as well
ax.plot(x_plot, norm.pdf(x_plot, 0, sigma), 'r-', label="pdf")
ax.legend(loc='best')
x_ticks = np.arange(-4*sigma, 4.1*sigma, sigma)
x_labels = [r"${} \sigma$".format(i) for i in range(-4,5)]
ax.set_xticks(x_ticks)
ax.set_xticklabels(x_labels)
plt.show()
The output image is this:
Also, you have too many imports: you import decimal twice, once even with *, and numpy, pyplot, and scipy are already included in pylab. Why import the whole scipy.stats and then import just norm from it again?
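For illustration, a trimmed-down import block for the original snippet might look like this (a sketch using explicit imports instead of pylab, which is the style the matplotlib docs recommend):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm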
I have some points (about 3000) and edges (about 6000) in this type of format:
points = numpy.array([[1,2],[4,5],[2,7],[3,9],[9,2]])
edges = numpy.array([[0,1],[3,4],[3,2],[2,4]])
where the edges are indices into points, so that the start and end coordinates of each edge are given by:
points[edges]
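(An editorial note, not from the original post: points[edges] produces an array of shape (n_edges, 2, 2), one (start, end) coordinate pair per edge, which happens to be exactly the segment format matplotlib's LineCollection accepts later on. A quick check, reusing the arrays above:)

segments = points[edges]
print(segments.shape)  # (4, 2, 2): one (start, end) pair per edge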
I am looking for a faster / better way to plot them. Currently I have:
from matplotlib import pyplot as plt
x = points[:,0].flatten()
y = points[:,1].flatten()
plt.plot(x[edges.T], y[edges.T], 'y-') # Edges
plt.plot(x, y, 'ro') # Points
plt.savefig('figure.png')
I read about lineCollections, but am unsure how to use them. Is there a way to use my data more directly? What is the bottleneck here?
Some more realistic test data; with this, plotting takes about 132 seconds:
points = numpy.random.randint(0, 100, (3000, 2))
edges = numpy.random.randint(0, 3000, (6000, 2))
Well, I found the following which is much faster:
from matplotlib import pyplot as plt
from matplotlib.collections import LineCollection
lc = LineCollection(points[edges])
fig = plt.figure()
plt.gca().add_collection(lc)
plt.xlim(points[:,0].min(), points[:,0].max())
plt.ylim(points[:,1].min(), points[:,1].max())
plt.plot(points[:,0], points[:,1], 'ro')
fig.savefig('full_figure.png')
Is it still possible to do it faster?
You could also just do it in a single plot call, which is considerably faster than two (though probably essentially the same as adding a LineCollection).
import numpy
import matplotlib.pyplot as plt
points = numpy.array([[1,2],[4,5],[2,7],[3,9],[9,2]])
edges = numpy.array([[0,1],[3,4],[3,2],[2,4]])
x = points[:,0].flatten()
y = points[:,1].flatten()
plt.plot(x[edges.T], y[edges.T], linestyle='-', color='y',
         markerfacecolor='red', marker='o')
plt.show()
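Another option sometimes used for many short segments (an assumption on my part, not benchmarked here): interleave NaN rows so that all edges live in one single Line2D, keeping the artist count at one.

import numpy
import matplotlib.pyplot as plt

points = numpy.random.randint(0, 100, (3000, 2)).astype(float)
edges = numpy.random.randint(0, 3000, (6000, 2))

# Build one long (x, y) sequence with a NaN break after each edge;
# matplotlib does not draw across NaNs, so all segments render
# as a single Line2D artist.
segments = points[edges]                           # (n_edges, 2, 2)
breaks = numpy.full((len(edges), 1, 2), numpy.nan)
xy = numpy.concatenate([segments, breaks], axis=1).reshape(-1, 2)

plt.plot(xy[:, 0], xy[:, 1], 'y-')
plt.plot(points[:, 0], points[:, 1], 'ro', markersize=2)
plt.show()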