How to add vertical scatter with specific values on a basic plot? - python

I'm trying to reproduce the following plot from this paper:
The plot shows the mean accuracy across five runs, and the vertical bars show the min and max accuracy.
How can I add those vertical bars with specific values?
My current code :
def plot_losses(losses: Dict[float, Dict[float, List[float]]]) -> None:
    """
    Plot the evolution of the loss regarding the sparsity level and iteration step

    Args:
        losses (Dict[float, Dict[float, List[float]]]): Dict containing the losses regarding the sparsity level and iteration step
    """
    plt.clf()
    plt.figure(figsize=(20, 10))
    plt.tight_layout()
    sparsity_levels = [round(sparsity_level, 2) for sparsity_level in losses.keys()]
    for sparsity_level, key in zip(sparsity_levels, losses.keys()):
        plt.plot(list(losses[key].keys()), list(losses[key].values()), '+--', label=f"{100 - sparsity_level:.2f}%")
    plt.show()

Prefer plt.errorbar (over plt.plot inside the for loop of plot_losses) and use the argument yerr to add the vertical bars with min and max values.
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Generate data
x = np.arange(10) + 1
y1 = x/20
y2 = x/25
# Generate data for pseudo-errorbars
y1_err = np.array([y1/20, y1/7])   # shape (2, 10): lower and upper error per point
y2_err = np.array([y2/30, y2/13])  # yerr must be scalar, 1D, or a (2, N) array
# Plot data
plt.errorbar(x, y1, yerr=y1_err, label="100", capsize=3, capthick=3.5)
plt.errorbar(x, y2, yerr=y2_err, label="51.3", capsize=3, capthick=3.5)
plt.legend(bbox_to_anchor=(0.95, 1.1), ncol=3)
plt.show()
This gives:
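Applied to the question's plot_losses, a minimal sketch could look like the following (it assumes each inner list holds the per-run values for a given iteration step, so the mean, min and max can be taken across runs; plot_losses_errorbar is just an illustrative name):
from typing import Dict, List
import numpy as np
import matplotlib.pyplot as plt

def plot_losses_errorbar(losses: Dict[float, Dict[float, List[float]]]) -> None:
    """Sketch: mean curve per sparsity level with min/max whiskers across runs."""
    plt.figure(figsize=(20, 10))
    for sparsity_level, steps in losses.items():
        x = list(steps.keys())
        runs = np.array(list(steps.values()))        # shape (n_steps, n_runs)
        mean = runs.mean(axis=1)
        yerr = np.array([mean - runs.min(axis=1),    # distance down to the min
                         runs.max(axis=1) - mean])   # distance up to the max
        plt.errorbar(x, mean, yerr=yerr, fmt='+--', capsize=3,
                     label=f"{100 - round(sparsity_level, 2):.2f}%")
    plt.legend()
    plt.show()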

Related

How to determine the x value on the edge of the violinplot for a mean line

I am trying to draw a mean line on violin plots. Since I was not able to find a way to make sns replace the "median" line that comes from "quartiles", I decided to code it so that the line is drawn on top for each case. I am planning on drawing horizontal lines using plt.plot at the mean value (y value) of each of the three graphs I have.
I have the exact y (height) values where I want my horizontal line to be drawn; however, I am having difficulty figuring out the bound of each violin graph at that specific y value. I know that since it is symmetric the domain is (-x, x), so I need a way to find that "x" value, so that I can add 3 horizontal lines, each bounded by the violin graph it belongs to.
Here is my code. The x value of the plt.plot is -0.37, which is something I found by trial and error; I want Python to find that for me for a given y value.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = [2.57e-05, 4.17e-06, -5.4e-06, -5.05e-06, 1.15e-05, -6.7e-06, 1.01e-05, 5.53e-06, 8.13e-06, 1.27e-05, 1.11e-06, -2.87e-06, -1.38e-06, -1.07e-05, -8.04e-06, 4.77e-06, 3.22e-07, 9.86e-06, 1.38e-05, 1.32e-05, -3.48e-06, -4.69e-06, 8.15e-06, 4.21e-07, 2.71e-06, 7.52e-08, 1.04e-06, -1.92e-06, -4.08e-06, 4.76e-06]
vg = sns.violinplot(data=data, inner="quartile", scale="width")
a = sns.pointplot(data=data, linestyles='-', join=False, ci=None, color='red')
for p in vg.lines:
    p.set_linestyle('-')
    p.set_linewidth(0.8)  # Sets the thickness of the quartile lines
    p.set_color('white')  # Sets the color of the quartile lines
    p.set_alpha(0.8)
for p in vg.lines[1::3]:  # these are the median lines; not means
    p.set_linestyle('-')
    p.set_linewidth(0)    # Set to 0 to hide the median lines
    p.set_color('black')  # Sets the color of the median lines
    p.set_alpha(0.8)
# add a mean line from the edge of the violin plot
plt.plot([-0.37, 0], [np.mean(data), np.mean(data)], 'k-', lw=1)
plt.show()
Refer to the picture where I removed the median point but left the quartile lines; I want to draw mean lines across where the blue dots are visible.
And here is a picture once I draw that plt.plot with the x value I found via trial and error (for case I only):
You can draw a line that is too long, and then clip it with the polygon forming the violin.
Note that inner='quartile' shows the 25%, 50% and 75% lines. The 50% line is also known as the median. This is similar to how boxplots are usually drawn. It is rather confusing to show the mean in a too similar style. That's why seaborn (and many other libraries) prefer to show the mean as a point.
Here is some example code (note that the return value of sns.violinplot is an ax, and naming it something very different makes it rather hard to find your way around the matplotlib and seaborn docs and examples).
import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
import seaborn as sns
import pandas as pd
import numpy as np

tips = sns.load_dataset('tips')
tips['day'] = pd.Categorical(tips['day'])
ax = sns.violinplot(data=tips, x='day', y='total_bill', hue='day', inner='quartile', scale='width', dodge=False)
sns.pointplot(data=tips, x='day', y='total_bill', join=False, ci=None, color='yellow', ax=ax)
ax.legend_.remove()
for p in ax.lines:
    p.set_linestyle('-')
    p.set_linewidth(0.8)  # Sets the thickness of the quartile lines
    p.set_color('white')  # Sets the color of the quartile lines
    p.set_alpha(0.8)
for x, (day, violin) in enumerate(zip(tips['day'].cat.categories, ax.collections)):
    line = ax.hlines(tips[tips['day'] == day]['total_bill'].mean(), x - 0.5, x + 0.5, color='black', ls=':', lw=2)
    patch = PathPatch(violin.get_paths()[0], transform=ax.transData)
    line.set_clip_path(patch)  # clip the line by the form of the violin
plt.show()
Updated to use a list of lists of data:
data = [np.random.randn(10, 7).cumsum(axis=0).ravel() for _ in range(3)]
ax = sns.violinplot(data=data, inner='quartile', scale='width', palette='Set2')
# sns.pointplot(data=data, join=False, ci=None, color='red', ax=ax)  # shows the means
ax.set_xticks(range(len(data)))
ax.set_xticklabels(['I' * (k + 1) for k in range(len(data))])
for p in ax.lines:
    p.set_linestyle('-')
    p.set_linewidth(0.8)  # Sets the thickness of the quartile lines
    p.set_color('white')  # Sets the color of the quartile lines
    p.set_alpha(0.8)
for x, (data_x, violin) in enumerate(zip(data, ax.collections)):
    line = ax.hlines(np.mean(data_x), x - 0.5, x + 0.5, color='black', ls=':', lw=2)
    patch = PathPatch(violin.get_paths()[0], transform=ax.transData)
    line.set_clip_path(patch)
plt.show()
PS: Some further explanation about enumerate(zip(...)):
for data_x in data: would loop through the entries of the list data, first assigning data[0] to data_x, etc.
for x, data_x in enumerate(data): would loop through the entries of the list data and at the same time increment a variable x from 0 to 1 and finally to 2.
for data_x, violin in zip(data, ax.collections): would let data_x loop through the entries of the list data and simultaneously let a variable violin loop through the list stored in ax.collections (this is where matplotlib stores the shapes of the violins).
for x, (data_x, violin) in enumerate(zip(data, ax.collections)): combines the enumeration with zip.
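As a minimal illustration of that pattern (the two lists here are made up just for the example):
letters = ['a', 'b', 'c']
numbers = [10, 20, 30]
for i, (letter, number) in enumerate(zip(letters, numbers)):
    print(i, letter, number)
# 0 a 10
# 1 b 20
# 2 c 30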

Turning a scatter plot into a histogram in python

I need to plot a histogram from some data in another file.
At the moment I have code that plots a scatter plot and fits a Gaussian.
The x-value is whatever the number is on the corresponding line from the data file it's reading in (after the first 12 lines of other information, i.e. Line 13 is the first event), and the y-value is the number of the line multiplied by a value.
The code then plots and fits the scatter, but I need to be able to plot it as a histogram and to change the bin width / number (i.e. add bins 1, 2, 3 and 4 together to have 1/4 of the bins overall, each with 4 times the number of events; so I guess adding together multiple lines from the data), which is where I am stuck.
How would I go about turning this into a histogram and adjusting the bin width / number?
Code below, didn't know how to make it pretty. Let me know if I can make it a bit easier to read.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Define gaussian
def gaussian(x, amp, cen, wid):
    """1-d gaussian: gaussian(x, amp, cen, wid)"""
    return (amp / (sqrt(2*pi) * wid)) * exp(-(x-cen)**2 / (2*wid**2))

## Define constants
stderrThreshold = 10
minimumAmplitude = 0.1
approxcen = 780
MaestroT = 53

## Define parameters
amps = []; ampserr = []; ts = []
folderToAnalyze = baseFolder + fileToRun + '\\'

## Generate the time array
for n in range(0, numfiles):
    ## Load text file
    x = np.linspace(0, 8191, 8192)
    fullprefix = folderToAnalyze + prefix + str(n).zfill(3)
    y = loadtxt(fullprefix + ".Spe", skiprows=12, max_rows=8192)

    ## Make figure
    fig, ax = plt.subplots(figsize=(15, 8))
    fig.suptitle('Coincidence Detections', fontsize=20)
    plt.xlabel('Bins', fontsize=14)
    plt.ylabel('Counts', fontsize=14)

    ## Plot data
    ax.plot(x, y, 'bo')
    ax.set_xlim(600, 1000)

    ## Fit data to Gaussian
    gmodel = Model(gaussian)
    result = gmodel.fit(y, x=x, amp=8, cen=approxcen, wid=1)

    ## Plot results and save figure
    ax.plot(x, result.best_fit, 'r-', label='best fit')
    ax.legend(loc='best')
    texttoplot = result.fit_report()
    ax.text(0.02, 0.5, texttoplot, transform=ax.transAxes)
    plt.close()
    fig.savefig(fullprefix + ".png", pad_inches='0.5')
Current Output: Scatter plots, that do show the expected distribution and plot of the data (they do however have a crappy reduced chi^2, but one problem at a time)
Expected Output: Histogram plots of the same data, with the same distribution and fitting, when each event is plotted as a separate bin, and hopefully the possibility to add these bins together to reduce error bars
Errors: N/A
Data: It's basically a standard distribution over 8192 lines. Full data for 1 file is here. Also the original .Spe file, the scatter plot and the full version of the code
2020-11-23 Update From Answer Comments:
Hi, I've been trying to implement this for a little while and have come unstuck. I've tried to follow your example closely, however the histogram I get out still has a bin width of 1 (i.e. the bins are not being added together). I also get a second blank graph in the printouts, and the reports only print out in the IDE (though I am working on that one and reckon I will have it soon). Also, for some reason, it seems to stop after 3 out of 50 iterations of the loop.
This is the code in its current state:
This is the output I'm getting:
This is the original output:
And just in case it's useful, this is the raw data. I seem to be having trouble replicating your last 2 figures.
The ideal would be to just alter the constant on line 30 to whatever the desired bin width is, and have it run with that bin width on that occasion.
The scatter plot, in this case, is a histogram, except with dots instead of bars.
.Spe is a bin count for each event.
x = np.linspace(0, 8191, 8192) defines the bins, and the bin width is 1.
Construct a bar plot instead of a scatter plot
ax.bar(x, y) instead of ax.plot(x, y, 'bo')
As a result of the existing data, the following plot is a histogram with a very wide distribution.
There are values ranging from 321 to 1585
ax.set_xlim(300, 1800)
The benefit of this data is that it's easy to recreate the raw distribution based on x, a bin size of 1, and y being the respective count for each x.
np.repeat can create an array with repeat elements
import numpy as np
import matplotlib.pyplot as plt
# given x and y from the loop
# set the type as int
y = y.astype(int)
x = x.astype(int)
# create the data
data = np.repeat(x, y)
# determine the range of x
x_range = range(min(data), max(data)+1)
# determine the length of x
x_len = len(x_range)
# plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
ax2.boxplot(data, vert=False)
plt.show()
Given data, you can now perform whatever analysis is required.
SO: Fit gaussian to noisy data with lmfit
LMFIT Docs
Cross Validated might be a better site for diving into the model
All of the error calculation parameters come from the model result. If you calculate new x and y from np.histogram for different bin widths, that may affect the errors.
approxcen = 780 is also an input to result
# given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
# determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
# Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
# result
print(result.fit_report())
[out]:
[[Model]]
Model(gaussian)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 314
# data points = 158
# variables = 3
chi-square = 397.702574
reduced chi-square = 2.56582306
Akaike info crit = 151.851284
Bayesian info crit = 161.039069
[[Variables]]
amp: 1174.80608 +/- 37.1663147 (3.16%) (init = 8)
cen: 775.535731 +/- 0.46232727 (0.06%) (init = 780)
wid: 12.6563219 +/- 0.46232727 (3.65%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
C(amp, wid) = 0.577
# plot
plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.figure(figsize=(20, 8))
plt.bar(x[:-1], y)
plt.xlim(700, 850)
plt.plot(x[:-1], result.best_fit, 'r-', label='best fit')
plt.grid()
As we can see from the next code block, the error is related to the following parameters
stderrThreshold = 10
minimumAmplitude = 0.1
MaestroT = 53
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)

Extract mean and confidence intervals from Seaborn regplot

Given that regplot calculates means in intervals and bootstraps to find confidence intervals for each bin, it seems like a waste to have to recalculate them manually for further study, so:
Question: How do I access the calculated means and confidence intervals of a regplot?
Example: This code produces a nice plot of bin means with CIs:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# just some random numbers to get started
fig, ax = plt.subplots()
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth=4./10
x_bins=np.arange(-2+binwidth/2,2,binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
plt.show()
Result:
Regplot showing binned data w. CIs
Not that calculating the means bin by bin isn't easily doable, but the CIs are computed by bootstrapping with random numbers. It would be nice to have access to exactly the same numbers that are plotted, so how do I access them? There must be some sort of get_*-method I'm overlooking.
Set-up
Setting up as in your MWE:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Random numbers for plotting
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(binwidth/2 - 2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
This gives our starting point as:
Extracting the Confidence Intervals
We can extract the confidence intervals by looping over the plotted lines and extracting the minimum and maximum values (corresponding to the lower and upper CIs respectively):
ax = plt.gca()
lower = [line.get_ydata().min() for line in ax.lines]
upper = [line.get_ydata().max() for line in ax.lines]
As a sanity check we can plot these extracted points on top of our original data (shown here by red crosses):
plt.scatter(x_bins, lower, marker='x', color='C3', zorder=3)
plt.scatter(x_bins, upper, marker='x', color='C3', zorder=3)
Extracting the Means
The values of the means can be extracted from ax.collections as:
means = ax.collections[0].get_offsets()[:, 1]
Again, as a sanity check we can overlay our extracted values on the original plot:
plt.scatter(x_bins, means, color='C1', marker='x', zorder=3)
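To keep the extracted values together for further study, a small follow-up sketch (building on the x_bins, lower, upper and means arrays above, with pandas added just for the table):
import pandas as pd

# Assemble the plotted bin statistics into one table
summary = pd.DataFrame({
    'bin_center': x_bins,
    'mean': np.asarray(means),
    'ci_lower': lower,
    'ci_upper': upper,
})
print(summary)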

How to create a "dot plot" in Matplotlib? (not a scatter plot)

I'd like to create what my statistics book calls a "dot plot" where the number of dots in the plot equals the number of observations. Here's an example from mathisfun.com:
In the example, there are six dots above the 0 value on the X-axis representing the six observations of the value zero.
It seems that a "dot plot" can have several variations. In looking up how to create this with Matplotlib, I only came across what I know of as a scatter plot with a data point representing the relationship between the X and Y value.
Is the type of plot I'm trying to create possible with Matplotlib?
Suppose you have some data that would produce a histogram like the following,
import numpy as np; np.random.seed(13)
import matplotlib.pyplot as plt
data = np.random.randint(0,12,size=72)
plt.hist(data, bins=np.arange(13)-0.5, ec="k")
plt.show()
You may create your dot plot by calculating the histogram and plotting a scatter plot of all possible points, the color of the points being white if they exceed the number given by the histogram.
import numpy as np; np.random.seed(13)
import matplotlib.pyplot as plt
data = np.random.randint(0,12,size=72)
bins = np.arange(13)-0.5
hist, edges = np.histogram(data, bins=bins)
y = np.arange(1,hist.max()+1)
x = np.arange(12)
X,Y = np.meshgrid(x,y)
plt.scatter(X,Y, c=Y<=hist, cmap="Greys")
plt.show()
Alternatively you may set the unwanted points to nan,
Y = Y.astype(float)  # np.float was removed from recent NumPy versions
Y[Y > hist] = np.nan
plt.scatter(X, Y)
This answer is built on the code posted by eyllanesc in his comment to the question as I find it elegant enough to merit an illustrative example. I provide two versions: a simple one where formatting parameters have been set manually and a second version where some of the formatting parameters are set automatically based on the data.
Simple version with manual formatting
import numpy as np # v 1.19.2
import matplotlib.pyplot as plt # v 3.3.2
# Create random data
rng = np.random.default_rng(123) # random number generator
data = rng.integers(0, 13, size=40)
values, counts = np.unique(data, return_counts=True)
# Draw dot plot with appropriate figure size, marker size and y-axis limits
fig, ax = plt.subplots(figsize=(6, 2.25))
for value, count in zip(values, counts):
    ax.plot([value]*count, list(range(count)), 'co', ms=10, linestyle='')
for spine in ['top', 'right', 'left']:
    ax.spines[spine].set_visible(False)
ax.yaxis.set_visible(False)
ax.set_ylim(-1, max(counts))
ax.set_xticks(range(min(values), max(values)+1))
ax.tick_params(axis='x', length=0, pad=8, labelsize=12)
plt.show()
Advanced version with automated formatting
If you plan on using this plot quite often, it can be useful to add some automated formatting parameters to get appropriate figure dimensions and marker size. In the following example, the parameters are defined in a way that works best with the kind of data for which this type of plot is typically useful (integer data with a range of up to a few dozen units and no more than a few hundred data points).
# Create random data
rng = np.random.default_rng(1) # random number generator
data = rng.integers(0, 21, size=100)
values, counts = np.unique(data, return_counts=True)
# Set formatting parameters based on data
data_range = max(values)-min(values)
width = data_range/2 if data_range<30 else 15
height = max(counts)/3 if data_range<50 else max(counts)/4
marker_size = 10 if data_range<50 else np.ceil(30/(data_range//10))
# Create dot plot with appropriate format
fig, ax = plt.subplots(figsize=(width, height))
for value, count in zip(values, counts):
    ax.plot([value]*count, list(range(count)), marker='o', color='tab:blue',
            ms=marker_size, linestyle='')
for spine in ['top', 'right', 'left']:
    ax.spines[spine].set_visible(False)
ax.yaxis.set_visible(False)
ax.set_ylim(-1, max(counts))
ax.set_xticks(range(min(values), max(values)+1))
ax.tick_params(axis='x', length=0, pad=10)
plt.show()
Pass your dataset to this function:
def dot_diagram(dataset):
    values, counts = np.unique(dataset, return_counts=True)
    data_range = max(values) - min(values)
    width = data_range/2 if data_range < 30 else 15
    height = max(counts)/3 if data_range < 50 else max(counts)/4
    marker_size = 10 if data_range < 50 else np.ceil(30 / (data_range//10))
    fig, ax = plt.subplots(figsize=(width, height))
    for value, count in zip(values, counts):
        ax.plot([value]*count, list(range(count)), marker='o', color='tab:blue',
                ms=marker_size, linestyle='')
    for spine in ['top', 'right', 'left']:
        ax.spines[spine].set_visible(False)
    ax.yaxis.set_visible(False)
    ax.set_ylim(-1, max(counts))
    ax.set_xticks(range(min(values), max(values)+1))
    ax.tick_params(axis='x', length=0, pad=10)
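For example (the sample data here is made up, and numpy/matplotlib are assumed to be imported as in the blocks above):
data = [2, 3, 3, 5, 5, 5, 8, 9]
dot_diagram(data)
plt.show()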
Let's say this is my data:
data = [5,8,3,7,1,5,3,2,3,3,8,5]
In order to plot a "dot plot", I will need the data (x-axis) and frequency (y-axis)
pos = []
keys = {}  # this dict will help to keep track ...
# this loop will give us a list of frequencies for each number
for num in data:
    if num not in keys:
        keys[num] = 1
        pos.append(1)
    else:
        keys[num] += 1
        pos.append(keys[num])
print(pos)
[1, 1, 1, 1, 1, 2, 2, 1, 3, 4, 2, 3]
plt.scatter(data, pos)
plt.show()
Recently, I have also come up with something like this. And I have made the following for my case.
Hope this is helpful.
Well, we will first generate the frequency table and then generate points from that to do a scatter plot. That's all! Superbly simple.
For example, in your case, for 0 minutes we have 6 people. This frequency can be converted into
[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6)]
Then, these points simply have to be plotted using pyplot.scatter.
import numpy as np
import matplotlib.pyplot as plt
def generate_points_for_dotplot(arr):
    freq = np.unique(arr, return_counts=True)
    ls = []
    for (value, count) in zip(freq[0], freq[1]):
        ls += [(value, num) for num in range(count)]
    x = [x for (x, y) in ls]
    y = [y for (x, y) in ls]
    return np.array([x, y])
Of course, this function returns an array of two arrays, one for the x coordinates and the other for the y coordinates (just because that's how pyplot needs the points!). Now that we have the function to generate the required points, let us plot them.
arr = np.random.randint(1,21,size=100)
x,y = generate_points_for_dotplot(arr)
# Plotting
fig, ax = plt.subplots(figsize=(max(x)/3, 3))  # feel free to use Patrick's answer to make it more dynamic
ax.scatter(x, y, s=100, facecolors='none', edgecolors='black')
ax.set_xticks(np.unique(x))
ax.yaxis.set_visible(False)
# removing the spines
for spine in ['top', 'right', 'left']:
    ax.spines[spine].set_visible(False)
plt.show()
Output:
If the x ticks become overwhelming, you can rotate them, as sketched below. However, for a larger number of values, that also becomes clumsy.
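A minimal sketch of the rotation (standard matplotlib, using the ax from the code above):
ax.tick_params(axis='x', labelrotation=45)  # rotate the x tick labels
# or: plt.xticks(rotation=45)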

cumulative distribution plots python

I am doing a project using Python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a less-than plot, i.e. at (x, y), y points in pc must have a value less than x. For pnc it is to be a more-than plot, i.e. at (x, y), y points in pnc must have a value more than x.
I have tried using the histogram function - pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted on a logarithmic scale on the x-axis.
You were close. You should not use plt.hist but numpy.histogram, which gives you both the values and the bins; then you can plot the cumulative with ease:
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
#plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')
plt.show()
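The question also asks for a logarithmic x-axis; assuming the data values are positive, that can be added before plt.show() with:
plt.xscale('log')  # logarithmic x-axis (requires positive x values)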
Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can just sort all the x values: the index of each value is the number of values that are smaller. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size)) # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size)) # From the number of data points-1 to 0
plt.show()
Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.
The result is:
You can see that it is more ragged than the output of EnricoGiampieri's answer, but this one is the real histogram (instead of being an approximate, fuzzier version of it).
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size + 1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size + 1))
There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
After conclusive discussion with @EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:
import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt
def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density
    function (PDF).
    """
    term1 = 1.0 / (sqrt(2*np.pi) * sigma)
    term2 = np.exp(-0.5 * ((x-mu)/sigma)**2)
    return term1 * term2

# Drawing sample data points
##################################################
# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort(), data2.sort()
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))
##################################################
fig = plt.gcf()
fig.set_size_inches(12,11)
# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Cumulative distributions, smooth:
plt.subplot(2,2,2)
plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Probability densities of the sample points function
plt.subplot(2,2,3)
pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
# Probability density function
plt.subplot(2,2,4)
x = np.arange(min_val, max_val, 0.05)
pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label='$\mu=2, \sigma=7$')
plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
plt.show()
In order to add my own contribution to the community, here I share my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:
def hist(data, bins, title, labels, range=None):
    fig = plt.figure(figsize=(15, 8))
    ax = plt.axes()
    plt.ylabel("Proportion")
    values, base, _ = plt.hist(data, bins=bins, density=True, alpha=0.5,
                               color="green", range=range, label="Histogram")
    ax_bis = ax.twinx()
    values = np.append(values, 0)
    ax_bis.plot(base, np.cumsum(values) / np.cumsum(values)[-1], color='darkorange',
                marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
    plt.xlabel(labels)
    plt.ylabel("Proportion")
    plt.title(title)
    ax_bis.legend()
    ax.legend()
    plt.show()
    return
If anyone wonders what it looks like, please take a look (with the seaborn style activated):
Also, concerning the double grid (the white lines), I always used to struggle to get a nice double grid. Here is an interesting way to circumvent the problem: How to put grid lines from the secondary axis behind the primary plot?
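One common way to achieve that (a sketch, not the linked answer's exact code) is to raise the primary axes above the twin axes and make its background transparent; inside the hist function above, before plt.show(), this could look like:
ax.set_zorder(ax_bis.get_zorder() + 1)  # draw the primary axes on top of the twin axes
ax.patch.set_visible(False)             # transparent background so ax_bis still shows through
ax_bis.grid(True)                       # the secondary axis grid now ends up behind the primary plot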
The simplest way to generate this graph is with seaborn:
import seaborn as sns
sns.ecdfplot(data)
Here is the documentation
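A minimal sketch tying this back to the question's two arrays (pc and pnc are assumed to be 1D arrays of positive values; complementary= and log_scale= are seaborn >= 0.11 options of ecdfplot):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Fake positive data standing in for the question's pc and pnc arrays
pc = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
pnc = np.random.lognormal(mean=0.5, sigma=1.0, size=1000)

ax = sns.ecdfplot(pc, stat='count', label='pc (less than)', log_scale=True)
sns.ecdfplot(pnc, stat='count', complementary=True, label='pnc (more than)', ax=ax)
ax.legend()
plt.show()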
