Modified Bland–Altman plot in Seaborn - python

My lab uses what our PI calls "modified Bland–Altman plots" to analyze regression quality. The code I wrote using Seaborn only handles discrete data, and I'd like to generalize it.
A Bland–Altman plot compares the difference between two measures to their average. The "modification" is that the x-axis is, instead of the average, the ground truth value. The y-axis is the difference between the predicted and true values. In effect, the modified B–A plot can be seen as the plot of residuals from the line y=x—i.e. the line predicted=truth.
The code to generate this plot, as well as an example, is given below.
import numpy as np
import seaborn as sns

def modified_bland_altman_plot(predicted, truth):
    predicted = np.asarray(predicted)
    truth = np.asarray(truth, dtype=int)  # integer cast is a hack for stripplot (np.int was removed in NumPy 1.24)
    diff = predicted - truth
    ax = sns.stripplot(x=truth, y=diff, jitter=True)
    ax.set(xlabel='truth', ylabel='difference from truth', title="Modified Bland-Altman Plot")
    # Plot a horizontal line at 0
    ax.axhline(0, ls=":", c=".2")
    return ax
Admittedly, this example has terrible bias in its prediction, shown by the downward slope.
I'm curious about two things:
Is there a generally accepted name for these "modified Bland–Altman plots"?
How can one create these for non-discrete data? We use stripplot, which requires discrete data. I know that seaborn has the residplot function, but it doesn't take a custom function for the line from which residuals are measured, e.g. predicted=true. Instead, it measures from the best-fit line it computes.

It seems you're looking for a standard scatter plot here:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(1)

def modified_bland_altman_plot(predicted, truth):
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    diff = predicted - truth
    fig, ax = plt.subplots()
    ax.scatter(truth, diff, s=9, c=truth, cmap="rainbow")
    ax.set_xlabel('truth')
    ax.set_ylabel('difference from truth')
    ax.set_title("Modified Bland-Altman Plot")
    # Plot a horizontal line at 0
    ax.axhline(0, ls=":", c=".2")
    return ax

x = np.random.rayleigh(scale=10, size=201)
y = np.random.normal(size=len(x))+10-x/10.
modified_bland_altman_plot(y, x)
plt.show()
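A classic Bland–Altman plot usually also marks the mean difference (the bias) and the 95% limits of agreement, conventionally the bias ± 1.96 standard deviations of the differences. As a rough sketch, the same summary can be computed for the modified plot with NumPy alone. The helper name `agreement_stats` and the synthetic data are made up for illustration.

```python
import numpy as np

def agreement_stats(predicted, truth):
    """Return (bias, lower, upper) for the differences predicted - truth.

    Hypothetical helper: bias is the mean difference, and the limits are the
    conventional bias +/- 1.96 sample standard deviations.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(truth, dtype=float)
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)
    return bias, bias - spread, bias + spread

# Synthetic data (an assumption for this sketch): a biased prediction
rng = np.random.default_rng(1)
truth = rng.rayleigh(scale=10, size=201)
predicted = truth + rng.normal(size=truth.size) + 10 - truth / 10.0

bias, lower, upper = agreement_stats(predicted, truth)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

The three values can then be drawn as horizontal lines on the returned axes with `ax.axhline`, in the same way as the zero line.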

Related

Making a joint plot of calculated data using Seaborn

I want to create a 2D joint plot from the data below, and from what I've read Seaborn is the best tool for this.
I have completed the desired 1D line plot and attempted to create the joint plot in Seaborn by putting the equation for each plot on the respective axis.
I expected the marginal plot on the x-axis to look like the plot I created with matplotlib, so the joint plot should have vertical lines through the circular region.
However, the Seaborn marginal plot on the x-axis smooths out many of the data points, giving a smooth curve.
From what I've read, Seaborn may not fit my needs for this kind of data. I also tried passing a matrix, but that did not seem to work with Seaborn either.
This is the code I used
#imported as required
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
#Set limits for y values (x - axis)
ymin=-.6
ymax= .6
#Set up an array of angle values between defined y values in mm
angle = np.linspace(np.deg2rad(ymin), np.deg2rad(ymax), 1000)
# Define known values
L = 480
a = 0.09
d = 0.4
lam = 670e-6
# Calculate values for position y, alpha and beta
y = np.tan(angle)*L
alpha = (np.pi*a/lam)*np.sin(angle)
beta = (np.pi*d/lam)*np.sin(angle)
I = ((np.sin(alpha)/alpha)**2)*((np.cos(beta))**2)
# Plot the graph of intensity versus displacement
plt.plot(y, I)
import seaborn as sns
p = ((np.sin(alpha)/alpha)**2)*((np.cos(beta))**2) # Interference term and decaying term
q = (np.sin(alpha)/alpha)**2 # Decaying term
sns.jointplot(x=p, y=q, kind='kde',marginal_kws=dict(bw=0.6),bw=0.8)
plt.show()
You might recognize this as the famous double-slit experiment.
These are the outputs. Note the smooth plot on Seaborn x axis
edit: I have used JointGrid as follows to plot on the axes in an attempt to solve the problem
g = sns.JointGrid(x=p, y=q)
g.plot_joint(sns.kdeplot)
g.plot_marginals(sns.kdeplot)
I am not familiar with Seaborn syntax, so this simple snippet is all I could get to give an output, which had the same problem as my initial attempt.

How to draw distribution plot for discrete variables in seaborn

When I draw a displot for discrete variables, the distribution does not look the way I expect. For example:
There are gaps between the bars, so the KDE curve sits "lower" on the y-axis.
In my work, it was even worse:
I think it may be because the "width" or "weight" of each bar is not 1, but I couldn't find any parameter to adjust it.
I'd like to draw a curve like this (it should be smoother):
One way to deal with this problem might be to adjust the "bandwidth" of the KDE (see the documentation for seaborn.kdeplot())
n = np.round(np.random.normal(5,2,size=(10000,)))
sns.distplot(n, kde_kws={'bw':1})
EDIT Here is an alternative with a different scale for the bars and the KDE
n = np.round(np.random.normal(5,2,size=(10000,)))
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
sns.distplot(n, kde=False, ax=ax1)
sns.distplot(n, hist=False, ax=ax2, kde_kws={'bw':1})
If the problem is that there are some empty bins in the histogram, it probably makes sense to specify the bins to match the data. In this case, use bins=np.arange(0,16) to get one bin for each integer in the data.
import numpy as np; np.random.seed(1)
import matplotlib.pyplot as plt
import seaborn as sns
n = np.random.randint(0,15,10000)
sns.distplot(n, bins=np.arange(0,16), hist_kws=dict(ec="k"))
plt.show()
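To see why integer-aligned bins help, the bin counts can be inspected directly with `np.histogram`, independently of seaborn. In this sketch the choice of 30 bins is an arbitrary stand-in for an automatic bin count, not what seaborn would actually pick.

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.integers(0, 15, size=10000)  # integers 0..14

# 30 equal-width bins over [0, 14] are narrower than 1, so many contain no integer
counts_auto, _ = np.histogram(n, bins=30)

# One bin per integer value: edges at 0, 1, ..., 15
counts_aligned, edges = np.histogram(n, bins=np.arange(0, 16))

print("empty bins with 30 bins:     ", int((counts_auto == 0).sum()))
print("empty bins with aligned bins:", int((counts_aligned == 0).sum()))
```

The empty bins in the first case are what show up as gaps between the bars.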
It seems sns.distplot (or displot https://seaborn.pydata.org/generated/seaborn.displot.html) is for plotting histograms, not bar plots. Both a histogram and a KDE (which is an approximation of the probability density function) make sense only for continuous random variables.
So in your case, since you'd like to plot the distribution of a discrete random variable, you should use a bar plot and plot the probability mass function (PMF) instead.
import numpy as np
import matplotlib.pyplot as plt
array = np.random.randint(15, size=10000)
unique, counts = np.unique(array, return_counts=True)
freq = counts / array.size  # convert counts to relative frequencies
# plotting the points
plt.bar(unique, freq)
# naming the x axis
plt.xlabel('Value')
# naming the y axis
plt.ylabel('Frequency')
#Title
plt.title("Discrete uniform distribution")
# function to show the plot
plt.show()

Plotting KDE with logarithmic x-data in Matplotlib

I want to plot a KDE for data that covers a large range of x-values, so I want to use a logarithmic scale for the x-axis. For plotting I was using seaborn and the solution from Plotting 2D Kernel Density Estimation with Python, both of which fail once I set the xscale to logarithmic. When I take the logarithm of my x-data beforehand, everything looks fine, except that the ticks and tick labels are still linear, with the logarithm of the actual values as the labels. I could manually change the ticks using something like:
labels = np.array(ax.get_xticks().tolist(), dtype=np.float64)
new_labels = [r'$10^{%.1f}$' % (labels[i]) for i in range(len(labels))]
ax.set_xticklabels(new_labels)
but to my eyes that looks wrong and is nowhere close to the axis labels (including the minor ticks) I would get by just using
ax.set_xscale('log')
Is there an easier way to plot a KDE with logarithmic x-data? Or is it possible to change only the tick or label scale without changing the scaling of the data, so that I could plot the logarithmic values of x and change the scaling of the labels afterwards?
Edit:
The plot I want to create looks like this:
The two right columns are what it is supposed to look like. There I used the x data with the logarithm already applied. I don't like the labels on the x-axis, though.
The left column shows the plots when the original data is used for the KDE and all the other plots, and the scale is changed afterwards using
ax.set_xscale('log')
For some reason the KDE does not look the way it should. This is not a result of erroneous data either, since it looks fine when the logarithmic data is used.
Edit 2:
A working example of code is
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = np.random.multivariate_normal((0, 0), [[0.8, 0.05], [0.05, 0.7]], 100)
x = np.power(10, data[:, 0])
y = data[:, 1]
fig, ax = plt.subplots(2, 1)
sns.kdeplot(data=np.log10(x), data2=y, ax=ax[0])
sns.kdeplot(data=x, data2=y, ax=ax[1])
ax[1].set_xscale('log')
plt.show()
The ax[1] plot is not displayed correctly for me (the x-axis is inverted), but the general behavior is the same as for the case described above. I believe the problem lies with the bandwidth of the kde, which should probably account for the logarithmic x-data.
I found an answer that works for me and wanted to post it in case someone else has a similar problem.
Based on the accepted answer from this post, I defined a function that first applies the logarithm to the x-data and after the KDE was performed, transforms the x-values back to the original values. Afterwards I can simply plot the contours and use ax.set_xscale('log')
import numpy as np
import scipy.stats as st

def logx_kde(x, y, xmin, xmax, ymin, ymax):
    # xmin and xmax are in log10 units, because the grid is built in log space
    x = np.log10(x)
    # Perform the kernel density estimate
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)
    return np.power(10, xx), yy, f
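A usage sketch for the idea above, on synthetic data similar to the working example in the question. It repeats the log-space KDE inline so that it runs on its own. The grid limits (-3 to 3) are arbitrary choices for this data and, for x, are in log10 units.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
data = rng.multivariate_normal((0, 0), [[0.8, 0.05], [0.05, 0.7]], 100)
x = np.power(10, data[:, 0])  # x spans several orders of magnitude
y = data[:, 1]

# Estimate the density in (log10(x), y) space on a regular grid
log_x = np.log10(x)
xx, yy = np.mgrid[-3:3:100j, -3:3:100j]
kernel = st.gaussian_kde(np.vstack([log_x, y]))
f = kernel(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# Map the grid back to the original x units for plotting on a log axis
x_grid = np.power(10, xx)
print(x_grid.min(), x_grid.max())  # roughly 1e-3 and 1e3
```

With matplotlib, `ax.contourf(x_grid, yy, f)` followed by `ax.set_xscale('log')` should then give the usual logarithmic ticks, including minor ticks.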

python How to plot scatter and regression line with more than 127 or 128?

I am trying to make a simple scatter and also overlay a simple regression. All the x,y points plot in a scatter form, as expected, no matter what. Great. My problem is that if N is >127 then all the (x,y) points are plotted, but the regression line does not extend from the min(x) to the max(x). The regression line should extend all the way from the left side (to min(x)) all the way to the max(x). What is going on here and how can I fix it?
fig1, ax1 = plt.subplots(1,1)
N=128
x=np.random.rand(N)
y=np.random.rand(N)
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
ya=fit_fn(x)
ax1.plot(x,y, 'bo',x, ya,'-k')
I did notice that if I change the last line to
ax1.plot(x,y, 'bo',x, ya,'-ko')
then all the points plot, but this is not what I want, since this gives me a scatter plot of x,ya instead of a line.
I get it now. I'm not quite sure why it happens, but there's a way around it. Does this produce the same result? (See mine below.)
import matplotlib.pyplot as plt
import numpy as np
fig1, ax1 = plt.subplots(1,1)
#distribute N random points in interval [0,1>
N=300
x=np.random.rand(N)
y=np.random.rand(N)
#get fit information
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
#extend fitted line interval to make sure you
#get min and max on x axis (linspace includes both endpoints)
current = np.linspace(min(x), max(x), 100)
current_fit = fit_fn(current)
#you can even extend it past the data; default color is blue
future = np.linspace(min(x)-0.5, max(x)+0.5, 200)
future_fit = fit_fn(future)
#plot
ax1.plot(x,y, 'bo')
ax1.plot(current, current_fit, "-ko")
ax1.plot(future, future_fit)
plt.show()
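Since a degree-1 fit is a straight line, it is fully determined by two points, so another option is to evaluate it only at the smallest and largest x. A minimal sketch of that idea follows; it sidesteps the problem rather than explaining why the original line was cut short.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300
x = rng.random(N)
y = rng.random(N)

# Fit a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Two endpoints are enough to draw the whole line from min(x) to max(x)
x_line = np.array([x.min(), x.max()])
y_line = slope * x_line + intercept

print(x_line, y_line)
```

These can be drawn with `ax1.plot(x, y, 'bo')` and `ax1.plot(x_line, y_line, '-k')`.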

matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95th percentile and use the range [0, 95th_percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
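To make the modified z-score concrete, here is a small worked example for the 1-D case. It is a simplified sketch of the same test, not a replacement for the function above; the name `modified_z_scores` and the sample values are made up for illustration.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-scores of a 1-D array, based on the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    abs_dev = np.abs(values - np.median(values))
    mad = np.median(abs_dev)
    # Caveat: mad can be zero (e.g. when more than half the values are identical),
    # which this sketch does not handle.
    return 0.6745 * abs_dev / mad

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scores = modified_z_scores(values)

# median = 3, absolute deviations = [2, 1, 0, 1, 97], MAD = 1
print(scores)        # 0.6745 * [2, 1, 0, 1, 97]
print(scores > 3.5)  # only the last point exceeds the threshold
```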
If you aren't concerned with rejecting outliers as Joe describes and your reasons are purely aesthetic, you could just set your plot's x-axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min,max) works to set limits on the y axis also.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through np.clip. If you have a reasonable estimate of the maximum and minimum of your data, just use that. If you don't, the histogram of the clipped data will show you the size of the tails, and if the outliers are really just outliers, the tails should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
You could probably write some code that, based on this, finds good bounds that minimize the size of the tails according to some tolerance.
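One possible way to automate that search, offered only as a sketch: pick the clip bounds as symmetric quantiles so that a chosen fraction `tol` of the data falls in each tail. The function name `clip_bounds` and the default tolerance are assumptions, not part of the answer above.

```python
import numpy as np

def clip_bounds(data, tol=0.001):
    """Return (low, high) so that roughly a fraction `tol` of the data lies beyond each bound."""
    low, high = np.quantile(data, [tol, 1.0 - tol])
    return low, high

rng = np.random.default_rng(0)
data = rng.normal(3, size=100000)

low, high = clip_bounds(data)
clipped = np.clip(data, low, high)

# Fraction of points that were moved by clipping (about 2 * tol)
print(low, high, np.mean(clipped != data))
```

Plotting `plt.hist(clipped, bins=333, density=True)` then shows the clipped tails piled into the two edge bins, as in the manual approach.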
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()
