Simulate CDF curve for penetration/adoption extrapolation - python

I'd like to be able to plot a line like the cumulative distribution function for the normal distribution, because it's useful for simulating the adoption curve:
Specifically, I'd like to be able to use initial data (percentage adoption of a product) to extrapolate what the rest of that curve would look like, to give a rough estimate of the timeline to each of the phases. So, for example, if we got to 10% penetration by 30 days and 20% penetration by 40 days, and we try to fit this curve, I'd like to know when we're going to get to 80% penetration (vs another population that may have taken 50 days to get to 10% penetration).
So, my question is, how could I go about doing this? I would ideally be able to provide initial data (time and penetration), and use python (e.g. matplotlib) to plot out the rest of the chart for me. But I don't know where to start! Can anyone point me in the right direction?
(Incidentally, I also posted this question on CrossValidated, but I wasn't sure whether it belonged there, as it's a stats question, or here, as it's a python question. Apologies for duplication!)

The CDF can be calculated via scipy.stats.norm.cdf(). Its inverse, norm.ppf(), maps each desired adoption percentage to the corresponding position on the standard normal axis. scipy.interpolate.pchip can then create a function so that the transformation interpolates smoothly.
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
from scipy.interpolate import pchip # monotonic cubic interpolation
from scipy.stats import norm
desired_xy = np.array([(30, 10), (40, 20)]) # (number of days, percentage adoption)
# desired_xy = np.array([(0, 1), (30, 10), (40, 20), (90, 99)])
labels = ['Innovators', 'Early\nAdopters', 'Early\nMajority', 'Late\nMajority', 'Laggards']
xmin, xmax = 0, 90 # minimum and maximum day on the x-axis
px = desired_xy[:, 0]
py = desired_xy[:, 1] / 100
# smooth function that transforms the x-values to the corresponding spots to get the desired y-values
interpfunc = pchip(px, norm.ppf(py))
fig, ax = plt.subplots(figsize=(12, 4))
# ax.scatter(px, py, color='crimson', s=50, zorder=3) # show desired correspondences
x = np.linspace(xmin, xmax, 1000)
ax.plot(x, norm.cdf(interpfunc(x)), lw=4, color='navy', clip_on=False)
label_divs = np.linspace(xmin, xmax, len(labels) + 1)
label_pos = (label_divs[:-1] + label_divs[1:]) / 2
ax.set_xticks(label_pos)
ax.set_xticklabels(labels, size=18, color='navy')
min_alpha, max_alpha = 0.1, 0.4
for p0, p1, alpha in zip(label_divs[:-1], label_divs[1:], np.linspace(min_alpha, max_alpha, len(labels))):
    ax.axvspan(p0, p1, color='navy', alpha=alpha, zorder=-1)
    ax.axvline(p0, color='white', lw=1, zorder=0)
ax.axhline(0, color='navy', lw=2, clip_on=False)
ax.axvline(0, color='navy', lw=2, clip_on=False)
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlim(xmin, xmax)
ax.set_ylim(0, 1)
ax.set_ylabel('Total Adoption', size=18, color='navy')
ax.set_title('Adoption Curve', size=24, color='navy')
for s in ax.spines:
    ax.spines[s].set_visible(False)
ax.tick_params(axis='x', length=0)
ax.tick_params(axis='y', labelcolor='navy')
plt.tight_layout()
plt.show()
Using just two points for desired_xy, the curve is linearly stretched. If more points are given, a smooth transformation is applied. Here is how it looks with [(0, 1), (30, 10), (40, 20), (90, 99)]. Note that 0 % and 100 % will cause problems, as they lie at minus and plus infinity.
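For the original question of when a given penetration level is reached, the two-point case can also be solved in closed form. The following is a minimal sketch, assuming adoption follows a normal CDF in time; the names t_obs, p_obs and days_to_reach are illustrative and not part of the answer above. It fits a straight line through the observations in probit space (norm.ppf) and inverts it.

```python
import numpy as np
from scipy.stats import norm

# Observed (day, adoption fraction) pairs from the question
t_obs = np.array([30.0, 40.0])
p_obs = np.array([0.10, 0.20])

# In probit space a normal CDF is a straight line: ppf(p) = (t - mu) / sigma
slope, intercept = np.polyfit(t_obs, norm.ppf(p_obs), 1)
sigma = 1 / slope          # spread of the adoption curve, in days
mu = -intercept / slope    # day at which 50 % adoption is reached

def days_to_reach(p):
    """Day at which the fitted curve reaches adoption fraction p (0 < p < 1)."""
    return mu + sigma * norm.ppf(p)

t80 = days_to_reach(0.80)
print(f"mu = {mu:.1f} days, sigma = {sigma:.1f} days, 80% at day {t80:.1f}")
```

With more than two observations the same polyfit call gives a least-squares fit. Either way this is only a rough extrapolation, since early data says little about the late phases.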

Related

How to plot Comparative Boxplot with a PDF like KDnuggets Style

While going through the "Understanding Boxplots" article on KDnuggets, I found a detailed plot of a boxplot together with a probability density function (pdf).
I'm trying to plot a comparative boxplot and pdf like the figure shown in the article.
I know how to plot a basic box plot and a pdf individually, but my knowledge of visualization is minimal. I'm not asking for an exact replica of that plot; a similar plot with the same level of detail would be highly appreciated.
I'm open to new ideas and approaches, and wanted to put out some feelers before getting started.
Is it possible to create such a plot with Python? If yes, which package would be used? I would be happy to receive any leads.
Here is an attempt to recreate the graphical elements of the plot. Instead of a perfect normal distribution, some random data is used, so you can plug in your own data. (For a more perfect curve, generate a higher number of samples.)
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
x = np.random.normal(0, 1, 1000)
mean = x.mean()
std = x.std()
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
medianprops = dict(linestyle='-', linewidth=2, color='yellow')
sns.boxplot(x=x, color='lightcoral', saturation=1, medianprops=medianprops,
            flierprops={'markerfacecolor': 'mediumseagreen'}, whis=1.5, ax=ax1)
ticks = [mean + std * i for i in range(-4, 5)]
ticklabels = [f'${i}\\sigma$' for i in range(-4, 5)]
ax1.set_xticks(ticks)
ax1.set_xticklabels(ticklabels)
ax1.set_yticks([])
ax1.tick_params(labelbottom=True)
ax1.set_ylim(-1, 1.5)
ax1.errorbar([q1, q3], [1, 1], yerr=[0.2, 0.2], color='black', lw=1)  # yerr must be non-negative in recent matplotlib
ax1.text(q1, 0.6, 'Q1', ha='center', va='center', color='black')
ax1.text(q3, 0.6, 'Q3', ha='center', va='center', color='black')
ax1.text(median, -0.6, 'median', ha='center', va='center', color='black')
ax1.text(median, 1.2, 'IQR', ha='center', va='center', color='black')
ax1.text(q1 - 1.5*iqr, 0.4, 'Q1 - 1.5*IQR', ha='center', va='center', color='black')
ax1.text(q3 + 1.5*iqr, 0.4, 'Q3 + 1.5*IQR', ha='center', va='center', color='black')
# ax1.vlines([q1 - 1.5*iqr, q1, q3, q3 + 1.5*iqr], 0, -2, color='darkgrey', ls=':', clip_on=False, zorder=0)
sns.kdeplot(x, ax=ax2)
kdeline = ax2.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
ylims = ax2.get_ylim()
ax2.fill_between(xs, 0, ys, color='mediumseagreen')
ax2.fill_between(xs, 0, ys, where=(xs >= q1 - 1.5*iqr) & (xs <= q3 + 1.5*iqr), color='skyblue')
ax2.fill_between(xs, 0, ys, where=(xs >= q1) & (xs <= q3), color='lightcoral')
# ax2.vlines([q1 - 1.5*iqr, q1, q3, q3 + 1.5*iqr], 0, 100, color='darkgrey', ls=':', zorder=0)
ax2.set_ylim(0, ylims[1])
plt.show()
Some remarks:
Often the median and the mean don't coincide, so the 0 sigma might be a bit off from the median line.
Matplotlib draws the whiskers at the data point that is closest to the calculated Q1 - 1.5 IQR and Q3 + 1.5 IQR, so when there aren't a huge number of points, the position of the whisker might be off a bit.
For real data, the distribution seldom looks like a perfect bell curve.
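The remark about the whisker positions can be checked numerically. This is a small sketch on seeded random data (the variable names are illustrative): it computes the fences Q1 - 1.5 IQR and Q3 + 1.5 IQR and then the data points at which the whiskers actually end.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
whisker_low = x[x >= lower_fence].min()
whisker_high = x[x <= upper_fence].max()

# Everything beyond the fences is drawn as an outlier ("flier")
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(f"fences:   {lower_fence:.3f} .. {upper_fence:.3f}")
print(f"whiskers: {whisker_low:.3f} .. {whisker_high:.3f}")
print(f"number of outliers: {outliers.size}")
```

With few samples the gap between fence and whisker can be noticeable; it shrinks as the sample size grows.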
Here is an example for 1 million samples:

How to plot a mean line on a distplot between 0 and the y value of the mean?

I have a distplot and I would like to plot a mean line that goes from 0 up to the height of the curve at the mean, so that the line stops where the distplot curve does. Why isn't there a simple parameter that does this? It would be very useful.
I have some code that gets me almost there:
plt.plot([x.mean(),x.mean()], [0, *what here?*])
This code plots a line just as I'd like except for my desired y-value. What would the correct math be to get the y max to stop at the frequency of the mean in the distplot? An example of one of my distplots is below using 0.6 as the y-max. It would be awesome if there was some math to make it stop at the y-value of the mean. I have tried dividing the mean by the count etc.
Update for the latest versions of matplotlib (3.3.4) and seaborn (0.11.1): the kdeplot with shade=True no longer creates a line object. Setting shade=False still creates the line object, so the same outcome as before can be obtained by filling the curve with ax.fill_between(). The code below is changed accordingly. (Use the revision history to see the older version.)
ax.lines[0] gets the curve of the kde, of which you can extract the x and y data.
np.interp then can find the height of the curve for a given x-value:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.normal(np.tile(np.random.uniform(10, 30, 5), 50), 3)
ax = sns.kdeplot(x, shade=False, color='crimson')
kdeline = ax.lines[0]
mean = x.mean()
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
height = np.interp(mean, xs, ys)
ax.vlines(mean, 0, height, color='crimson', ls=':')
ax.fill_between(xs, 0, ys, facecolor='crimson', alpha=0.2)
plt.show()
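If relying on ax.lines[0] feels fragile across seaborn versions, the height can also be computed without any plot object. The following is a sketch that evaluates scipy.stats.gaussian_kde directly at the mean; it assumes the default bandwidth, so the result should be close to, but not necessarily identical with, the curve seaborn draws.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(20, 3, 500)   # example data, replace with your own

kde = gaussian_kde(x)        # default bandwidth (Scott's rule)
mean = x.mean()
height = kde([mean])[0]      # density of the kde at the mean

print(f"mean = {mean:.2f}, kde height at mean = {height:.4f}")
```

The resulting height can be passed to ax.vlines(mean, 0, height) in the same way as above.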
The same approach can be extended to show the mean together with the standard deviation, or the median and the quartiles:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
x = np.random.normal(np.tile(np.random.uniform(10, 30, 5), 50), 3)
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
for ax in axes:
    sns.kdeplot(x, shade=False, color='crimson', ax=ax)
    kdeline = ax.lines[0]
    xs = kdeline.get_xdata()
    ys = kdeline.get_ydata()
    if ax == axes[0]:
        middle = x.mean()
        sdev = x.std()
        left = middle - sdev
        right = middle + sdev
        ax.set_title('Showing mean and sdev')
    else:
        left, middle, right = np.percentile(x, [25, 50, 75])
        ax.set_title('Showing median and quartiles')
    ax.vlines(middle, 0, np.interp(middle, xs, ys), color='crimson', ls=':')
    ax.fill_between(xs, 0, ys, facecolor='crimson', alpha=0.2)
    ax.fill_between(xs, 0, ys, where=(left <= xs) & (xs <= right), interpolate=True, facecolor='crimson', alpha=0.2)
    # ax.set_ylim(ymin=0)
plt.show()
PS: for the mode of the kde:
mode_idx = np.argmax(ys)
ax.vlines(xs[mode_idx], 0, ys[mode_idx], color='lime', ls='--')
With ax.get_ylim() you can get the limits of the current plot: (bottom, top).
So, in your case, you can extract the actual limits and save them in ylim, then draw the line:
fig, ax = plt.subplots()
ylim = ax.get_ylim()
ax.plot([x.mean(),x.mean()], ax.get_ylim())
ax.set_ylim(ylim)
As ax.plot changes the ylims afterwards, you have to re-set them with ax.set_ylim as above.

How to draw the normal distribution of a barplot with log x axis?

I'd like to draw a lognormal distribution of a given bar plot.
Here's the code
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import numpy as np; np.random.seed(1)
import scipy.stats as stats
import math
inter = 33
x = np.logspace(-2, 1, num=3*inter+1)
yaxis = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.03,0.3,0.75,1.24,1.72,2.2,3.1,3.9,
4.3,4.9,5.3,5.6,5.87,5.96,6.01,5.83,5.42,4.97,4.60,4.15,3.66,3.07,2.58,2.19,1.90,1.54,1.24,1.08,0.85,0.73,
0.84,0.59,0.55,0.53,0.48,0.35,0.29,0.15,0.15,0.14,0.12,0.14,0.15,0.05,0.05,0.05,0.04,0.03,0.03,0.03, 0.02,
0.02,0.03,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0,0]
fig, ax = plt.subplots()
ax.bar(x[:-1], yaxis, width=np.diff(x), align="center", ec='k', color='w')
ax.set_xscale('log')
plt.xlabel('Diameter (mm)', fontsize='12')
plt.ylabel('Percentage of Total Particles (%)', fontsize='12')
plt.ylim(0,8)
plt.xlim(0.01, 10)
fig.set_size_inches(12, 12)
plt.savefig("Test.png", dpi=300, bbox_inches='tight')
Resulting plot:
What I'm trying to do is to draw the Probability Density Function exactly like the one shown in red in the graph below:
One idea is to convert everything to log space, with u = log10(x), then draw the density histogram there and also calculate a kde in the same space. Everything gets drawn as y versus u. With u on a twin axis at the top, x can stay at the bottom. Both axes are aligned by setting the same xlims, converted to log space for the top axis. The top axis can be hidden to get the desired result.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
inter = 33
u = np.linspace(-2, 1, num=3*inter+1)
x = 10**u
us = np.linspace(u[0], u[-1], 500)
yaxis = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.03,0.3,0.75,1.24,1.72,2.2,3.1,3.9,
4.3,4.9,5.3,5.6,5.87,5.96,6.01,5.83,5.42,4.97,4.60,4.15,3.66,3.07,2.58,2.19,1.90,1.54,1.24,1.08,0.85,0.73,
0.84,0.59,0.55,0.53,0.48,0.35,0.29,0.15,0.15,0.14,0.12,0.14,0.15,0.05,0.05,0.05,0.04,0.03,0.03,0.03, 0.02,
0.02,0.03,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0,0]
yaxis = np.array(yaxis)
# reconstruct data from the given frequencies
u_data = np.repeat((u[:-1] + u[1:]) / 2, (yaxis * 100).astype(int))  # np.int was removed in NumPy 1.24
kde = stats.gaussian_kde((u[:-1]+u[1:])/2, weights=yaxis, bw_method=0.2)
total_area = (np.diff(u)*yaxis).sum() # total area of all bars; divide by this area to normalize
fig, ax = plt.subplots()
ax2 = ax.twiny()
ax2.bar(u[:-1], yaxis, width=np.diff(u), align="edge", ec='k', color='w', label='frequencies')
ax2.plot(us, total_area*kde(us), color='crimson', label='kde')
ax2.plot(us, total_area * stats.norm.pdf(us, u_data.mean(), u_data.std()), color='dodgerblue', label='lognormal')
ax2.legend()
ax.set_xscale('log')
ax.set_xlabel('Diameter (mm)', fontsize='12')
ax.set_ylabel('Percentage of Total Particles (%)', fontsize='12')
ax.set_ylim(0,8)
xlim = np.array([0.01,10])
ax.set_xlim(xlim)
ax2.set_xlim(np.log10(xlim))
ax2.set_xticks([]) # hide the ticks at the top
plt.tight_layout()
plt.show()
PS: Apparently this also can be achieved directly without explicitly using u (at the cost of being slightly more cryptic):
x = np.logspace(-2, 1, num=3*inter+1)
xs = np.logspace(-2, 1, 500)
total_area = (np.diff(np.log10(x))*yaxis).sum() # total area of all bars; divide by this area to normalize
kde = stats.gaussian_kde((np.log10(x[:-1])+np.log10(x[1:]))/2, weights=yaxis, bw_method=0.2)
ax.bar(x[:-1], yaxis, width=np.diff(x), align="edge", ec='k', color='w')
ax.plot(xs, total_area*kde(np.log10(xs)), color='crimson')
ax.set_xscale('log')
Note that the bandwidth set for gaussian_kde is a somewhat arbitrary value. Larger values give a smoother curve, smaller values stay closer to the data. Some experimentation can help.
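The mean and standard deviation used for the lognormal curve can also be obtained without reconstructing data via np.repeat, by taking weighted moments of the bin centers in log space. This is a sketch on a small hypothetical histogram (u_edges and weights are made up for illustration), not the data from the question.

```python
import numpy as np

# Hypothetical bin edges in log10(diameter) and percentage of particles per bin
u_edges = np.linspace(-1.25, 1.25, 6)
centers = (u_edges[:-1] + u_edges[1:]) / 2   # bin centers: -1, -0.5, 0, 0.5, 1
weights = np.array([1.0, 2.0, 4.0, 2.0, 1.0])

# Weighted mean and standard deviation in log10 space
mu = np.average(centers, weights=weights)
sigma = np.sqrt(np.average((centers - mu) ** 2, weights=weights))

# Back on the original axis, 10**mu is the median of the fitted lognormal
median_diameter = 10 ** mu

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}, median diameter = {median_diameter:.3f}")
```

These values could then be passed to stats.norm.pdf(us, mu, sigma) in place of u_data.mean() and u_data.std().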

Custom Spider chart --> Display curves instead of lines between point on a polar plot in matplotlib

I have measured the positions of different products in different angles positions (6 values in steps of 60 deg. over a complete rotation). Instead of representing my values on a Cartesian graph where 0 and 360 are the same point, I want to use a polar graph.
With matplotlib, I got a spider chart type graph, but I want to avoid straight lines between points and display interpolated values between them instead. I have a solution that is kind of OK, but I was hoping there is a nice "one liner" I could use to get a more realistic representation, or better tangent handling at some points.
Does anyone have an idea to improve my code below?
# Libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Some data to play with
df = pd.DataFrame({'measure':[10, -5, 15,20,20, 20,15,5,10], 'angle':[0,45,90,135,180, 225, 270, 315,360]})
# The few lines I would like to avoid...
angles = [y/180*np.pi for x in [np.arange(x, x+45,5) for x in df.angle[:-1]] for y in x]
values = [y for x in [np.linspace(x, df.measure[i+1], 10)[:-1] for i, x in enumerate(df.measure[:-1])] for y in x]
angles.append(360/180*np.pi)
values.append(values[0])
# Initialise the spider plot
ax = plt.subplot(polar=True)
# Plot data
ax.plot(df.angle/180*np.pi, df['measure'], linewidth=1, linestyle='solid', label="Spider chart")
ax.plot(angles, values, linewidth=1, linestyle='solid', label='what I want')
ax.legend()
# Fill area
ax.fill(angles, values, 'b', alpha=0.1)
plt.show()
The result is below. I want something similar to the orange line, but with some kind of spline to avoid the sharp corners I currently get.
I have a solution that is a patchwork of other solutions. It needs to be cleaned up and optimized, but it does the job!
Comments and improvements are always welcome, see below.
# https://stackoverflow.com/questions/33962717/interpolating-a-closed-curve-using-scipy
from scipy import interpolate
x=df.measure[:-1] * np.cos(df.angle[:-1]/180*np.pi)
y=df.measure[:-1] * np.sin(df.angle[:-1]/180*np.pi)
x = np.r_[x, x[0]]
y = np.r_[y, y[0]]
# fit splines to x=f(u) and y=g(u), treating both as periodic. also note that s=0
# is needed in order to force the spline fit to pass through all the input points.
tck, u = interpolate.splprep([x, y], s=0, per=True)
# evaluate the spline fits for 1000 evenly spaced distance values
xi, yi = interpolate.splev(np.linspace(0, 1, 1000), tck)
def cart2pol(x, y):
    rho = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)
    return rho, phi
# Initialise the spider plot
plt.figure(figsize=(12,8))
ax = plt.subplot(polar=True)
# Plot data
ax.plot(df.angle/180*np.pi, df['measure'], linewidth=1, linestyle='solid', label="Spider chart")
ax.plot(angles, values, linewidth=1, linestyle='solid', label='Interval linearisation')
ax.plot(cart2pol(xi, yi)[1], cart2pol(xi, yi)[0], linewidth=1, linestyle='solid', label='Smooth interpolation')
ax.legend()
# Fill area
ax.fill(angles, values, 'b', alpha=0.1)
plt.show()
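A different route, closer to the "one liner" asked for, is to interpolate the radius directly as a periodic function of the angle. This is a sketch of that alternative, assuming scipy.interpolate.CubicSpline with bc_type='periodic'; it is not the method used above and gives a somewhat different curve, because it smooths r(angle) rather than the closed curve in Cartesian coordinates.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Same data as in the question; first and last value must match for a periodic spline
angle_deg = np.array([0, 45, 90, 135, 180, 225, 270, 315, 360])
measure = np.array([10, -5, 15, 20, 20, 20, 15, 5, 10])

theta = np.deg2rad(angle_deg)
spline = CubicSpline(theta, measure, bc_type='periodic')

# Evaluate on a fine grid of angles
theta_fine = np.linspace(0, 2 * np.pi, 361)
r_fine = spline(theta_fine)

print(r_fine[:5])
```

The result can be drawn with ax.plot(theta_fine, r_fine) on the polar axes. The periodic boundary condition makes value and slope match at 0 and 360 degrees, so there is no kink at the seam.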

Python manipulate axis (x and y) update

We measure the radius over an entire device (one point per degree, 360 points), which is around 148 mm. It should be between 146 and 150.
If you plot the data with the corresponding limits, you get this:
I would like to change the axis scaling so that the range between -145 and 145 is compressed, while the ranges from 145 to 150 and from -150 to -145 are stretched. That way I can clearly see the measured values between the limits.
Is that possible with python?
import matplotlib.pyplot as plt
import matplotlib.scale as mscale
import pandas as pd
#read CSV
EBRData = pd.read_csv('C://Users/vanderey/Documents/MATLAB/EBRTest2.csv', header = 0)
# Define data
Dates = EBRData['Date']
Rx = EBRData['xCoat']
Ry = EBRData['yCoat']
RLSLx = EBRData['xCoat_LSL']
RLSLy = EBRData['yCoat_LSL']
RUSLx = EBRData['xCoat_USL']
RUSLy = EBRData['yCoat_USL']
#Create plot
my_dpi=96
plt.figure(figsize=(480/my_dpi, 480/my_dpi), dpi=my_dpi)
plt.plot(Rx, Ry, color='blue', marker='.', linewidth=1, alpha=0.4)
plt.plot(RLSLx, RLSLy, color='red', marker='.', linewidth=1, alpha=0.4)
plt.plot(RUSLx, RUSLy, color='red', marker='.', linewidth=1, alpha=0.4)
plt.title('EBR')
plt.show()
If the radius is what you want to show, I'd recommend calculating R from the x and y measurements and plotting it together with the target limits.
You can do so by calculating the complete polar coordinates from your x/y-values (df here is the DataFrame read from the CSV, called EBRData in the question; numpy is imported as np):
phi = np.arctan2(df.yCoat, df.xCoat)
R = pd.DataFrame(np.sqrt(df.xCoat.values**2 + df.yCoat.values**2), columns=['R'], index=phi)
If you would rather plot over the nominal angular values instead of the actual measured angle positions, you could also set phi to e.g.
phi = np.linspace(-np.pi, np.pi, 360, endpoint=False)
However, this can be plotted simply as a normal line plot with two indicated limit lines like
R.plot()
plt.hlines(146, -np.pi, np.pi, 'k')
plt.hlines(150, -np.pi, np.pi, 'k')
or e.g. as a polar plot
f = plt.figure()
ax = f.add_subplot(111, projection='polar')
ax.set_rlim(144, 152)
plt.plot(R, 'b.-')
ax.fill_between(np.linspace(-np.pi, np.pi, 360), 140, 146, color='gray')
ax.fill_between(np.linspace(-np.pi, np.pi, 360), 150, 160, color='gray')
To show samples outside the wanted range, you can simply add e.g.
plt.plot(R[R<146], 'r.')
plt.plot(R[R>150], 'r.')
to immediately see if there's a problem:
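The same check can be done numerically before plotting. The following sketch uses synthetic x/y measurements (the CSV from the question is not available, so x_coat, y_coat and the noise level are made up) and counts the points outside the limits of 146 and 150.

```python
import numpy as np

# Synthetic stand-in for the measured data: radius around 148 with some noise
rng = np.random.default_rng(1)
phi_nominal = np.linspace(-np.pi, np.pi, 360, endpoint=False)
r_true = 148 + rng.normal(0, 0.8, phi_nominal.size)
x_coat = r_true * np.cos(phi_nominal)
y_coat = r_true * np.sin(phi_nominal)

# Convert the x/y measurements to polar coordinates
radius = np.hypot(x_coat, y_coat)
phi = np.arctan2(y_coat, x_coat)

# Flag samples outside the lower and upper specification limits
lsl, usl = 146, 150
out_of_spec = (radius < lsl) | (radius > usl)

print(f"{out_of_spec.sum()} of {radius.size} points outside [{lsl}, {usl}]")
```

The boolean mask out_of_spec can be used to highlight the offending samples, e.g. by plotting phi[out_of_spec] against radius[out_of_spec] in red.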
