Turning a scatter plot into a histogram in Python

I need to plot a histogram from some data in another file.
At the moment I have code that plots a scatter plot and fits a Gaussian.
The x-value is whatever the number is on the corresponding line of the data file it's reading in (after the first 12 lines of other information, i.e. line 13 is the first event), and the y-value is the number of the line multiplied by a value.
The code then plots and fits the scatter, but I need to be able to plot it as a histogram and to change the bin width / number (i.e. add bins 1, 2, 3 and 4 together to get a quarter as many bins overall, each with four times the number of events, so I guess adding together multiple lines from the data), which is where I am stuck.
How would I go about making this into the histogram and adjust the width / numbers?
Code below, didn't know how to make it pretty. Let me know if I can make it a bit easier to read.
import numpy as np
import matplotlib.pyplot as plt
from numpy import exp, loadtxt, pi, sqrt
from lmfit import Model

## Define gaussian
def gaussian(x, amp, cen, wid):
    """1-d gaussian: gaussian(x, amp, cen, wid)"""
    return (amp / (sqrt(2*pi) * wid)) * exp(-(x-cen)**2 / (2*wid**2))

## Define constants
stderrThreshold = 10
minimumAmplitude = 0.1
approxcen = 780
MaestroT = 53

## Define parameters
amps = []; ampserr = []; ts = []
# baseFolder, fileToRun, prefix and numfiles are defined in the full version of the code
folderToAnalyze = baseFolder + fileToRun + '\\'

## Generate the time array
for n in range(0, numfiles):
    ## Load text file
    x = np.linspace(0, 8191, 8192)
    fullprefix = folderToAnalyze + prefix + str(n).zfill(3)
    y = loadtxt(fullprefix + ".Spe", skiprows=12, max_rows=8192)

    ## Make figure
    fig, ax = plt.subplots(figsize=(15, 8))
    fig.suptitle('Coincidence Detections', fontsize=20)
    plt.xlabel('Bins', fontsize=14)
    plt.ylabel('Counts', fontsize=14)

    ## Plot data
    ax.plot(x, y, 'bo')
    ax.set_xlim(600, 1000)

    ## Fit data to Gaussian
    gmodel = Model(gaussian)
    result = gmodel.fit(y, x=x, amp=8, cen=approxcen, wid=1)

    ## Plot results and save figure
    ax.plot(x, result.best_fit, 'r-', label='best fit')
    ax.legend(loc='best')
    texttoplot = result.fit_report()
    ax.text(0.02, 0.5, texttoplot, transform=ax.transAxes)
    plt.close()
    fig.savefig(fullprefix + ".png", pad_inches='0.5')
Current Output: Scatter plots that do show the expected distribution and fit of the data (they do, however, have a crappy reduced chi^2, but one problem at a time)
Expected Output: Histogram plots of the same data, with the same distribution and fitting, when each event is plotted as a separate bin, and hopefully the possibility to add these bins together to reduce error bars
Errors: N/A
Data: It's basically a standard distribution over 8192 lines. Full data for 1 file is here. Also the original .Spe file, the scatter plot and the full version of the code
2020-11-23 Update From Answer Comments:
Hi, I've been trying to implement this for a little while and have come unstuck. I've tried to follow your example closely, however I'm still getting a histogram with a bin width of 1 (i.e. the bins are not being added together). I also get a second, blank graph in the printouts, and the fit reports only print in the IDE (though I am working on that one, and reckon I will have it soon). Also, for some reason it seems to stop after 3 out of 50 iterations of the loop.
This is the code in its current state:
This is the output I'm getting:
This is the original output:
And just in case it's useful, this is the raw data. I seem to be having trouble replicating your last 2 figures.
Ideally, I would just be able to alter the constant on line 30 to whatever the desired bin width is, and have it run with that bin width on that occasion.

The scatter plot, in this case, is a histogram, except with dots instead of bars.
.Spe is a bin count for each event.
x = np.linspace(0, 8191, 8192) defines the bins, and the bin width is 1.
Construct a bar plot instead of a scatter plot
ax.bar(x, y) instead of ax.plot(x, y, 'bo')
Given the existing data, the resulting plot is a histogram with a very wide distribution.
There are values ranging from 321 to 1585
ax.set_xlim(300, 1800)
The benefit of this data is, it's easy to recreate the raw distribution based on x, a bin size of 1, and y being the respective count for each x.
np.repeat can create an array with repeat elements
import numpy as np
import matplotlib.pyplot as plt

# given x and y from the loop
# set the type as int
y = y.astype(int)
x = x.astype(int)

# create the data: repeat each x value y times
data = np.repeat(x, y)

# determine the range of x
x_range = range(min(data), max(data)+1)

# determine the length of x
x_len = len(x_range)

# plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len)  # outliers are not plotted
ax2.boxplot(data, vert=False)
plt.show()
Given data, you can now perform whatever analysis is required.
SO: Fit gaussian to noisy data with lmfit
LMFIT Docs
Cross Validated might be a better site for diving into the model
All of the error calculation parameters come from the model result. If you calculate new x and y from np.histogram for different bin widths, that may affect the errors.
approxcen = 780 is also an input to result
# given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
# determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
# Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
# result
print(result.fit_report())
[out]:
[[Model]]
    Model(gaussian)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 314
    # data points      = 158
    # variables        = 3
    chi-square         = 397.702574
    reduced chi-square = 2.56582306
    Akaike info crit   = 151.851284
    Bayesian info crit = 161.039069
[[Variables]]
    amp:  1174.80608 +/- 37.1663147 (3.16%) (init = 8)
    cen:  775.535731 +/- 0.46232727 (0.06%) (init = 780)
    wid:  12.6563219 +/- 0.46232727 (3.65%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
    C(amp, wid) =  0.577
# plot
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.figure(figsize=(20, 8))
plt.bar(x[:-1], y)
plt.xlim(700, 850)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.grid()
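If you'd rather merge adjacent bins directly instead of rebuilding the raw data with np.repeat, you can sum groups of counts with a reshape. A minimal sketch, assuming y and x here are the original 8192-element arrays loaded in the question's loop (i.e. before the np.histogram call above):
import numpy as np

width = 8  # desired bin width

# trim so the length is a multiple of width, then sum each group of `width` bins
n_full = (len(y) // width) * width
y_rebinned = y[:n_full].reshape(-1, width).sum(axis=1)

# take the centre of each merged group as its new x value
x_rebinned = x[:n_full].reshape(-1, width).mean(axis=1)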
As we can see from the next code block, the error is related to the following parameters
stderrThreshold = 10
minimumAmplitude = 0.1
MaestroT = 53
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)

Related

How to add vertical scatter with specific values on a basic plot?

I'm trying to reproduce the following plot from this paper:
The plot shows the mean accuracy across five runs, and the vertical bars show the min and max accuracy.
How can I add those vertical bars with specific values?
My current code:
from typing import Dict, List
import matplotlib.pyplot as plt

def plot_losses(losses: Dict[float, Dict[float, List[float]]]) -> None:
    """
    Plot the evolution of the loss regarding the sparsity level and iteration step

    Args:
        losses (Dict[float, Dict[float, List[float]]]): Dict containing the losses regarding the sparsity level and iteration step
    """
    plt.clf()
    plt.figure(figsize=(20, 10))
    plt.tight_layout()
    sparsity_levels = [round(sparsity_level, 2) for sparsity_level in losses.keys()]
    for sparsity_level, key in zip(sparsity_levels, losses.keys()):
        plt.plot(list(losses[key].keys()), list(losses[key].values()), '+--', label=f"{100 - sparsity_level:.2f}%")
    plt.show()
Prefer plt.errorbar (over plt.plot inside the for loop of plot_losses) and use the argument yerr to add the vertical bars with min and max values.
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Generate data
x = np.arange(10) + 1
y1 = x/20
y2 = x/25
# Generate data for pseudo-errorbars
y1_err = np.array([y1[0::2]/20, y1[1::2]/7]).reshape(1, 10)
y2_err = np.array([y2[0::2]/30, y2[1::2]/13]).reshape(1, 10)
# Plot data
plt.errorbar(x, y1, yerr=y1_err, label="100", capsize=3, capthick=3.5)
plt.errorbar(x, y2, yerr=y2_err, label="51.3", capsize=3, capthick=3.5)
plt.legend(bbox_to_anchor=(0.95, 1.1), ncol=3)
plt.show()
This gives:
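To reproduce the paper's plot from real runs, the asymmetric yerr can be built from the per-run minima and maxima relative to the mean. A hedged sketch (the runs array and its layout are assumptions, not from the question):
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data: 5 runs x 10 iteration steps of accuracy values
runs = 0.8 + 0.1 * np.random.rand(5, 10)
steps = np.arange(10)

mean = runs.mean(axis=0)
# rows of yerr are the distances below and above the mean
yerr = np.vstack([mean - runs.min(axis=0), runs.max(axis=0) - mean])

plt.errorbar(steps, mean, yerr=yerr, fmt='+--', capsize=3, label="mean with min/max")
plt.legend()
plt.show()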

Plotting histograms in Python using Matplotlib or Pandas

I have gone through different posts on this forum, but I cannot find an answer to the behaviour I am seeing.
I have a csv file whose header has many entries, each column holding 300 points.
For each field (column of the csv file) I would like to plot a histogram. The x-axis contains the elements of that column, and the y-axis should have the number of samples that fall inside each bin.
As I have 300 points, the total number of samples in all bins added together should be 300, so the y-axis should go from 0 to, let's say, 50 (just an example). However, the values are gigantic (400e8), which makes no sense.
Sample of the table:
point | mydata
----- | ----------
1     | 250.23e-9
2     | 250.123e-9
...   | ...
300   | 251.34e-9
Please check my code, below. I am using pandas to open the csv and Matplotlib for the rest.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab  # NB: mlab.normpdf was removed in matplotlib 3.1; scipy.stats.norm.pdf is the modern equivalent

df = pd.read_csv("/home/pcardoso/raw_data/myData.csv")

# Figure parameters
figPath = '/home/pcardoso/scripts/python/matplotlib/figures/'
figPrefix = 'hist_'       # Prefix to the name of the file.
figSuffix = '_something'  # Suffix to the name of the file.
figString = ''            # Full string passed as the figure name to be saved
precision = 3
num_bins = 50

columns = list(df)
for fieldName in columns:
    vectorData = df[fieldName]

    # Statistical data
    mu = np.mean(vectorData)    # mean of distribution
    sigma = np.std(vectorData)  # standard deviation of distribution

    # Create plot instance
    fig, ax = plt.subplots()

    # Histogram
    n, bins, patches = ax.hist(vectorData, num_bins, density='True', alpha=0.75, rwidth=0.9, label=fieldName)
    ax.legend()

    # Best-fit curve
    y = mlab.normpdf(bins, mu, sigma)
    ax.plot(bins, y, '--')

    # Setting axis names, grid and title
    ax.set_xlabel(fieldName)
    ax.set_ylabel('Number of points')
    # eng_notation is a helper defined elsewhere in the asker's script
    ax.set_title(fieldName + ': $\mu=$' + eng_notation(mu, precision) + ', $\sigma=$' + eng_notation(sigma, precision))
    ax.grid(True, alpha=0.2)
    fig.tight_layout()  # Tweak spacing to prevent clipping of ylabel

    # Saving figure
    figString = figPrefix + fieldName + figSuffix
    fig.savefig(figPath + figString)
    plt.show()
    plt.close(fig)
In summary, I would like to know how to have the y-axis values right.
Edit (8 June 2020):
I would like the density estimator to follow the plot like this:
Thanks in advance.
Best regards,
Pedro
Don't use density='True': with that option, the value displayed for each bin is the number of members in the bin divided by the total number of points times the width of the bin. If that width is small (as in your case of rather small x-values), the values become large. (Note also that density expects a boolean, not the string 'True'; a non-empty string merely happens to be truthy.)
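A rough back-of-the-envelope check shows why the numbers are so large (the counts here are illustrative, based on the sample table above):
n_points = 300
# ~300 points spanning roughly 250.1e-9 to 251.3e-9, split into 50 bins
bin_width = (251.34e-9 - 250.123e-9) / 50  # ~2.4e-11

# a bin holding ~30 of the points displays a density of:
print(30 / (n_points * bin_width))          # ~4e9, the same ballpark as the 400e8 observed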
Edit: OK, to un-norm the normed curve, you need to multiply it by the number of points and the width of one bin. I made a more reduced example:
from numpy.random import normal
from scipy.stats import norm
import pylab

N = 300
sigma = 10.0
B = 30

def main():
    x = normal(0, sigma, N)
    h, bins, _ = pylab.hist(x, bins=B, rwidth=0.8)
    bin_width = bins[1] - bins[0]
    # scale the pdf back up by the number of points times the bin width
    h_n = norm.pdf(bins[:-1], 0, sigma) * N * bin_width
    pylab.plot(bins[:-1], h_n)

if __name__ == "__main__":
    main()

How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?

I'm trying to generate random samples from a lognormal distribution in Python, the application is for simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is, when I calculate the 10th and 90th percentiles at the end, I get completely the wrong numbers.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)
# Generate random uniform data
input = np.random.uniform(size=10000)
# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, log-normal is defined in terms of the base-e logarithm (aka natural log), which means 320 = 10^2.5 = e^5.77.
Second, I would approach the problem in a different way. You need m and s to sample from a Log-Normal.
If you look at the wiki article on the log-normal distribution, you can see that it is a two-parameter distribution, and you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% of samples in [100, 1000] => CDF(1000, m, s) - CDF(100, m, s) = 0.8
where the CDF is expressed via the error function (a common function found in pretty much any library).
So: two non-linear equations for two parameters. Solve them, find m and s, and put them into any standard log-normal sampling routine.
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it's quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math

target_modal_value = 320

# Define function to find roots of
def equation(s):
    s = float(s)  # fsolve passes a length-1 array
    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2
    # Get probability mass from CDF at 100 and 1000, should equal 0.8.
    # Rearrange the equation so that it equals 0, to find the root (value of s)
    return (lognorm.cdf(1000, s=s, scale=math.exp(m)) - lognorm.cdf(100, s=s, scale=math.exp(m)) - 0.8)

# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)[0]  # fsolve returns an array; take the scalar root

# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))

# Plot
x = np.arange(0, 2000, 1)
y = lognorm.pdf(x, s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100, 100), (0, 1), 'k--')
plt.plot((320, 320), (0, 1), 'k-.')
plt.plot((1000, 1000), (0, 1), 'k--')
plt.ylim(0, 0.0014)
plt.savefig('lognormal_100_320_1000.png')
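The final step the answer mentions, actually sampling, isn't shown above. A minimal sketch using the fitted m and s from the code above:
# draw samples from the fitted log-normal
samples = lognorm.rvs(s=s, scale=math.exp(m), size=100000)

# the fitted condition: ~80% of the mass should lie in [100, 1000]
print(np.mean((samples >= 100) & (samples <= 1000)))

# the percentile check from the question
print(np.percentile(samples, 10), np.percentile(samples, 90))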

How do I plot a spectrogram the same way that pylab's specgram() does?

In Pylab, the specgram() function creates a spectrogram for a given list of amplitudes and automatically creates a window for the spectrogram.
I would like to generate the spectrogram (instantaneous power is given by Pxx), modify it by running an edge detector on it, and then plot the result.
(Pxx, freqs, bins, im) = pylab.specgram( self.data, Fs=self.rate, ...... )
The problem is that whenever I try to plot the modified Pxx using imshow or even NonUniformImage, I run into the error message below.
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/image.py:336: UserWarning: Images are not supported on non-linear axes.
warnings.warn("Images are not supported on non-linear axes.")
For example, part of the code I'm working on right now is below.
# how many instantaneous spectra did we calculate
(numBins, numSpectra) = Pxx.shape
# how many seconds in entire audio recording
numSeconds = float(self.data.size) / self.rate
ax = fig.add_subplot(212)
im = NonUniformImage(ax, interpolation='bilinear')
x = np.arange(0, numSpectra)
y = np.arange(0, numBins)
z = Pxx
im.set_data(x, y, z)
ax.images.append(im)
ax.set_xlim(0, numSpectra)
ax.set_ylim(0, numBins)
ax.set_yscale('symlog') # see http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_yscale
ax.set_title('Spectrogram 2')
Actual Question
How do you plot image-like data with a logarithmic y axis with matplotlib/pylab?
Use pcolor or pcolormesh. pcolormesh is much faster, but is limited to rectilinear grids, whereas pcolor can handle arbitrarily shaped cells. specgram uses pcolormesh, if I recall correctly. (Edit: it actually uses imshow.)
As a quick example:
import numpy as np
import matplotlib.pyplot as plt
z = np.random.random((11,11))
x, y = np.mgrid[:11, :11]
fig, ax = plt.subplots()
ax.set_yscale('symlog')
ax.pcolormesh(x, y, z)
plt.show()
The differences you're seeing are due to plotting the "raw" values that specgram returns. What specgram actually plots is a scaled version.
import matplotlib.pyplot as plt
import numpy as np
x = np.cumsum(np.random.random(1000) - 0.5)
fig, (ax1, ax2) = plt.subplots(nrows=2)
data, freqs, bins, im = ax1.specgram(x)
ax1.axis('tight')
# "specgram" actually plots 10 * log10(data)...
ax2.pcolormesh(bins, freqs, 10 * np.log10(data))
ax2.axis('tight')
plt.show()
Notice that when we plot things using pcolormesh, there's no interpolation. (That's part of the point of pcolormesh--it's just vector rectangles instead of an image.)
If you want things on a log scale, you can use pcolormesh with it:
import matplotlib.pyplot as plt
import numpy as np
x = np.cumsum(np.random.random(1000) - 0.5)
fig, (ax1, ax2) = plt.subplots(nrows=2)
data, freqs, bins, im = ax1.specgram(x)
ax1.axis('tight')
# We need to explicitly set the linear threshold in this case...
# Ideally you should calculate this from your bin size...
ax2.set_yscale('symlog', linthreshy=0.01)
ax2.pcolormesh(bins, freqs, 10 * np.log10(data))
ax2.axis('tight')
plt.show()
Just to add to Joe's answer...
I was getting small differences between the visual output of specgram compared to pcolormesh (as noisygecko also was) that were bugging me.
Turns out that if you pass frequency and time bins returned from specgram to pcolormesh, it treats these values as values on which to centre the rectangles rather than edges of them.
A bit of fiddling gets them to align better (though still not 100% perfect). The colours are identical now also.
import numpy as np
import matplotlib.pyplot as plt
import pylab

x = np.cumsum(np.random.random(1024) - 0.2)
overlap_frac = 0

plt.subplot(3, 1, 1)
data, freqs, bins, im = pylab.specgram(x, NFFT=128, Fs=44100, noverlap=128*overlap_frac, cmap='plasma')
plt.title("specgram plot")

plt.subplot(3, 1, 2)
plt.pcolormesh(bins, freqs, 20 * np.log10(data), cmap='plasma')
plt.title("pcolormesh no adj.")

# bins actually returns the middle value of each chunk,
# so we need to add an extra element at zero, and then add the first value to all
bins = bins + (bins[0] * (1 - overlap_frac))
bins = np.concatenate((np.zeros(1), bins))
max_freq = freqs.max()
diff = (max_freq / freqs.shape[0]) - (max_freq / (freqs.shape[0] - 1))
temp_vec = np.arange(freqs.shape[0])
freqs = freqs + (temp_vec * diff)
freqs = np.concatenate((freqs, np.ones(1) * max_freq))

plt.subplot(3, 1, 3)
plt.pcolormesh(bins, freqs, 20 * np.log10(data), cmap='plasma')
plt.title("pcolormesh post adj.")

cumulative distribution plots python

I am doing a project using python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a less-than plot, i.e. at (x, y), y points in pc must have a value less than x. For pnc it is to be a more-than plot, i.e. at (x, y), y points in pnc must have a value more than x.
I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted on a logarithmic scale on the x-axis.
You were close. Instead of plt.hist, use numpy.histogram: it gives you both the values and the bins, and then you can plot the cumulative with ease:
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
#plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')
plt.show()
Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can just sort all the x values: the index of each value is the number of values that are smaller. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size)) # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size)) # From the number of data points-1 to 0
plt.show()
Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.
The result is:
You can see that it is more ragged than the output of EnricoGiampieri's answer, but this one is the real histogram (instead of being an approximate, fuzzier version of it).
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size+1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size+1))
There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
After a conclusive discussion with @EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:
import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt

def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density
    function (PDF).
    """
    term1 = 1.0 / ( sqrt(2*np.pi) * sigma )
    term2 = np.exp( -0.5 * ( (x-mu)/sigma )**2 )
    return term1 * term2

# Drawing sample data points
##################################################
# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort(), data2.sort()

# note: min()/max() over both arrays, not elementwise addition
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))
##################################################

fig = plt.gcf()
fig.set_size_inches(12, 11)

# Cumulative distributions, stepwise:
plt.subplot(2, 2, 1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()

# Cumulative distributions, smooth:
plt.subplot(2, 2, 2)
plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()

# Probability densities of the sample points
plt.subplot(2, 2, 3)
pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()

# Probability density function
plt.subplot(2, 2, 4)
x = np.arange(min_val, max_val, 0.05)
pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label='$\mu=2, \sigma=7$')
plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()

plt.show()
In order to add my own contribution to the community, here I share my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:
def hist(data, bins, title, labels, range=None):
    fig = plt.figure(figsize=(15, 8))
    ax = plt.axes()
    plt.ylabel("Proportion")
    # note: `normed` was removed from plt.hist in matplotlib 3.1; `density` is the replacement
    values, base, _ = plt.hist(data, bins=bins, density=True, alpha=0.5, color="green", range=range, label="Histogram")
    ax_bis = ax.twinx()
    values = np.append(values, 0)
    ax_bis.plot(base, np.cumsum(values) / np.cumsum(values)[-1], color='darkorange', marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
    plt.xlabel(labels)
    plt.ylabel("Proportion")
    plt.title(title)
    ax_bis.legend()
    ax.legend()
    plt.show()
    return
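A hedged usage example (the data here is made up):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
hist(data, bins=40, title="Gaussian sample", labels="value")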
If anyone wonders what it looks like, please take a look (with seaborn activated):
Also, concerning the double grid (the white lines), I always used to struggle to get a nice double grid. Here is an interesting way to circumvent the problem: How to put grid lines from the secondary axis behind the primary plot?
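For reference, a minimal sketch of that workaround as it would apply to the hist function above (one common approach; the linked question covers others):
# inside hist(), after creating ax_bis:
ax.set_zorder(ax_bis.get_zorder() + 1)  # draw the primary axes on top
ax.patch.set_visible(False)             # let the secondary grid show through
ax_bis.grid(True)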
The simplest way to generate this graph is with seaborn:
import seaborn as sns
sns.ecdfplot(data)
Here is the documentation
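A hedged sketch for this question specifically, with a less-than curve for pc, a more-than (survival) curve for pnc, and a log x-axis; complementary=True requires seaborn 0.11+, and the arrays here are made-up stand-ins:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pc = np.random.lognormal(size=1000)
pnc = np.random.lognormal(size=1000)

sns.ecdfplot(pc, stat="count")                       # less-than plot
sns.ecdfplot(pnc, stat="count", complementary=True)  # more-than plot
plt.xscale('log')
plt.show()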
