Extract mean and confidence intervals from Seaborn regplot - python

Given that regplot calculates means in intervals and bootstraps to find confidence intervals for each bin, it seems like a waste to have to recalculate them manually for further study, so:
Question: How do I access the calculated means and confidence intervals of a regplot?
Example: This code produces a nice plot of bin means with CIs:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Just some random numbers to get started
fig, ax = plt.subplots()
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(-2 + binwidth/2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
plt.show()
Result:
[Figure: regplot showing the binned data with confidence intervals]
Calculating the means bin by bin would be easy enough to redo, but the CIs come from bootstrapping with random numbers, so they cannot simply be recomputed to match the plot. It would be nice to have access to exactly the same numbers that are plotted, so how do I access them? There must be some sort of get_*-method I'm overlooking.

Set-up
Setting up as in your MWE:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Random numbers for plotting
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(binwidth/2 - 2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
This reproduces the binned plot from the question as our starting point.
Extracting the Confidence Intervals
We can extract the confidence intervals by looping over the plotted lines and taking the minimum and maximum y-values of each error bar (corresponding to the lower and upper CI bounds respectively):
ax = plt.gca()
lower = [line.get_ydata().min() for line in ax.lines]
upper = [line.get_ydata().max() for line in ax.lines]
As a sanity check we can plot these extracted points on top of our original data (shown here by red crosses):
plt.scatter(x_bins, lower, marker='x', color='C3', zorder=3)
plt.scatter(x_bins, upper, marker='x', color='C3', zorder=3)
Extracting the Means
The values of the means can be extracted from ax.collections as:
means = ax.collections[0].get_offsets()[:, 1]
Again, as a sanity check we can overlay our extracted values on the original plot:
plt.scatter(x_bins, means, color='C1', marker='x', zorder=3)
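For further processing it can be handy to gather everything into one table. Below is a minimal sketch of that, assuming the order of ax.lines and ax.collections[0] follows the bin order; the pandas DataFrame is only for convenience:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Same toy data and binning as above
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
binwidth = 4 / 10
x_bins = np.arange(binwidth/2 - 2, 2, binwidth)
ax = sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
# Means from the scatter collection, CI bounds from the error-bar lines
means = ax.collections[0].get_offsets()[:, 1]
lower = [line.get_ydata().min() for line in ax.lines]
upper = [line.get_ydata().max() for line in ax.lines]
summary = pd.DataFrame({'bin_center': x_bins, 'mean': means,
                        'ci_lower': lower, 'ci_upper': upper})
print(summary)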

Related

Plot a fitted curve on percentage histogram (not the actual data)

I first try to draw my data as percentages as follows:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
plt.hist(data, weights=np.ones(len(data)) / len(data), bins=5)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.grid()
plt.show()
This gives me a histogram with the y-axis formatted as percentages.
Now I used this line to fit a curve on the "percentage data" as follows:
import seaborn as sns
p = sns.displot(data=data, x="Dist", kde=True, bins=5)
Which gives me this:
But this curve was fitted to the raw data, not to the percentages of the 5 bins. If, for example, you had 10 bins, you could see why there is a bump at the end, and that bump is not something we want to see. What I really want is a curve fitted to the binned percentages.
The kde plot approximates the data as a sum of Gaussian bell curves.
An idea could be to regroup the data and place them at the centers of each bar.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
z = [1.83E-05,2.03E-05,3.19E-05,3.39E-05,3.46E-05,3.56E-05,3.63E-05,3.66E-05,4.13E-05,4.29E-05,4.29E-05,4.79E-05,5.01E-05,5.07E-05,5.08E-05,5.21E-05,5.39E-05,5.75E-05,5.91E-05,5.95E-05,5.98E-05,6.00E-05,6.40E-05,6.41E-05,6.67E-05,6.79E-05,6.79E-05,6.92E-05,7.03E-05,7.17E-05,7.45E-05,7.75E-05,7.99E-05,8.03E-05,8.31E-05,8.74E-05,9.69E-05,9.80E-05,9.86E-05,0.000108267,0.000108961,0.000109634,0.000111083,0.000111933,0.00011491,0.000126831,0.000135493,0.000138174,0.000141792,0.000150507,0.000155346,0.000155516,0.000202407,0.000243149,0.000248106,0.00025259,0.000254496,0.000258372,0.000258929,0.000265318,0.000293665,0.000312719,0.000430077]
counts, bin_edges = np.histogram(z, 5)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
regrouped_data = np.repeat(centers, counts)
sns.histplot(data=regrouped_data, kde=True, bins=bin_edges)
Normally, a kdeplot can be extended via the clip= parameter, but unfortunately kde_kws={'clip':bin_edges[[0,-1]]} doesn't work here.
To extend the kde, a trick could be to keep the highest and lowest values of the original data: subtract one from the counts of the first and last bins, and append the lowest and highest original values to the regrouped data.
counts, bin_edges = np.histogram(z, 5)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
counts[[0, -1]] -= 1
regrouped_data = np.concatenate([np.repeat(centers, counts), bin_edges[[0, -1]]])
sns.histplot(data=regrouped_data, kde=True, bins=bin_edges, stat='percent')

How to bin 2D data along the x-axis with Python

I have two arrays of corresponding data (x and y) that I plot as above on a log-log plot. The data is currently too granular and I would like to bin them to get a smoother relationship. Could I get some guidance on how I can bin along the x-axis, in exponential bin sizes, so that it appears linear on the log-log scale?
For example, if the first bin is of range x = 10^0 to 10^1, I want to collect all y-values with corresponding x in that range and average them into one value for that bin. I don't think np.histogram or plt.hist quite does the trick, since they bin by counting occurrences.
Edit: For context, if it helps, the above plot is an assortativity plot that plots the in vs out degree of a certain network.
You may use scipy.stats.binned_statistic to get the mean of the data in each bin. The bins would best be created via numpy.logspace. You may then plot those means e.g. as horizontal lines spanning the bin width or as a scatter at the mean position.
import numpy as np; np.random.seed(42)
from scipy.stats import binned_statistic
import matplotlib.pyplot as plt
x = np.logspace(0,5,300)
y = np.logspace(0,5,300)+np.random.rand(300)*1.e3
fig, ax = plt.subplots()
ax.scatter(x,y, s=9)
s, edges, _ = binned_statistic(x,y, statistic='mean', bins=np.logspace(0,5,6))
ys = np.repeat(s,2)
xs = np.repeat(edges,2)[1:-1]
ax.hlines(s,edges[:-1],edges[1:], color="crimson", )
for e in edges:
    ax.axvline(e, color="grey", linestyle="--")
ax.scatter(edges[:-1]+np.diff(edges)/2, s, c="limegreen", zorder=3)
ax.set_xscale("log")
ax.set_yscale("log")
plt.show()
You can achieve this with pandas. The idea is to assign each X value to an interval using np.digitize. Since you are using a log scale, it makes sense to use np.logspace to choose intervals of exponentially changing lengths. Finally, you can group X values in each interval and compute mean Y values.
import pandas as pd
import numpy as np
x_max = 10
xs = np.exp(x_max * np.random.rand(1000))
ys = np.exp(np.random.rand(1000))
df = pd.DataFrame({
    'X': xs,
    'Y': ys,
})
df['Xbins'] = np.digitize(df.X, np.logspace(0, x_max, 30, base=np.exp(1)))
df['Ymean'] = df.groupby('Xbins').Y.transform('mean')
df.plot(kind='scatter', x='X', y='Ymean')
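Since the log-spaced bins are meant for a log-log view, here is a small follow-up sketch (my own extension, continuing from the df built above) that plots each bin's mean Y against the mean X of the points in that bin on log-log axes:
import matplotlib.pyplot as plt
# One row per bin: mean X position and mean Y value of the points in that bin
per_bin = df.groupby('Xbins').agg(x_center=('X', 'mean'), y_mean=('Y', 'mean'))
fig, ax = plt.subplots()
ax.scatter(df.X, df.Y, s=4, alpha=0.3, label='raw data')
ax.scatter(per_bin.x_center, per_bin.y_mean, color='crimson', zorder=3, label='bin means')
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend()
plt.show()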

Vertical line at the end of a CDF histogram using matplotlib

I'm trying to create a CDF but at the end of the graph, there is a vertical line, shown below:
I've read that this is because matplotlib uses the end of the bins to draw the vertical lines, which makes sense, so I added into my code:
bins = sorted(X) + [np.inf]
where X is the data set I'm using and set the bin size to this when plotting:
plt.hist(X, bins = bins, cumulative = True, histtype = 'step', color = 'b')
This does remove the line at the end and produces the desired effect; however, when I normalise this graph it now produces an error:
ymin = max(ymin*0.9, minimum) if not input_empty else minimum
UnboundLocalError: local variable 'ymin' referenced before assignment
Is there any way to either normalise the data with
bins = sorted(X) + [np.inf]
in my code or is there another way to remove the line on the graph?
An alternative way to plot a CDF would be as follows (in my example, X is a bunch of samples drawn from the unit normal):
import numpy as np
import matplotlib.pyplot as plt
X = np.random.randn(10000)
n = np.arange(1, len(X) + 1) / float(len(X))
Xs = np.sort(X)
fig, ax = plt.subplots()
ax.step(Xs,n)
I needed a solution where I would not need to alter the rest of my code (using plt.hist(...) or, with pandas, dataframe.plot.hist(...)) and that I could reuse easily many times in the same jupyter notebook.
I now use this little helper function to do so:
def fix_hist_step_vertical_line_at_end(ax):
    axpolygons = [poly for poly in ax.get_children() if isinstance(poly, mpl.patches.Polygon)]
    for poly in axpolygons:
        poly.set_xy(poly.get_xy()[:-1])
Which can be used like this (without pandas):
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
X = np.sort(np.random.randn(1000))
fig, ax = plt.subplots()
plt.hist(X, bins=100, cumulative=True, density=True, histtype='step')
fix_hist_step_vertical_line_at_end(ax)
Or like this (with pandas):
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000))
fig, ax = plt.subplots()
ax = df.plot.hist(ax=ax, bins=100, cumulative=True, density=True, histtype='step', legend=False)
fix_hist_step_vertical_line_at_end(ax)
This works well even if you have multiple cumulative density histograms on the same axes.
Warning: this may not lead to the wanted results if your axes contain other patches falling under the mpl.patches.Polygon category. That was not my case so I prefer using this little helper function in my plots.
Assuming that your intentions are purely aesthetic, add a vertical line of the same color as your plot background:
ax.axvline(x = value, color = 'white', linewidth = 2)
Where "value" stands for the right extreme of the rightmost bin.

cumulative distribution plots python

I am doing a project using python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a less than plot i.e. at (x,y), y points in pc must have value less than x. For pnc it is to be a more than plot i.e. at (x,y), y points in pnc must have value more than x.
I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted on a logarithmic scale on the x-axis.
You were close. You should use numpy.histogram instead of plt.hist: it gives you both the values and the bins, and then you can plot the cumulative with ease:
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
#plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')
plt.show()
Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can simply sort all the x values, since the index of each value is then the number of values that are smaller. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size)) # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size)) # From the number of data points-1 to 0
plt.show()
Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.
The result is:
You can see that it is more ragged than the output of EnricoGiampieri's answer, but this one is the real histogram (instead of being an approximate, fuzzier version of it).
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size + 1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size + 1))
There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
After conclusive discussion with #EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:
import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt
def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density
    function (PDF).
    """
    term1 = 1.0 / (sqrt(2 * np.pi) * sigma)
    term2 = np.exp(-0.5 * ((x - mu) / sigma)**2)
    return term1 * term2
# Drawing sample data points
##################################################
# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort()
data2.sort()
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))
##################################################
fig = plt.gcf()
fig.set_size_inches(12,11)
# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Cumulative distributions, smooth:
plt.subplot(2,2,2)
plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label='$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Probability densities of the sample points
plt.subplot(2,2,3)
pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label='$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
# Probability density function
plt.subplot(2,2,4)
x = np.arange(min_val, max_val, 0.05)
pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label='$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label='$\mu=2, \sigma=7$')
plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
plt.show()
In order to add my own contribution to the community, here I share my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:
def hist(data, bins, title, labels, range=None):
    fig = plt.figure(figsize=(15, 8))
    ax = plt.axes()
    plt.ylabel("Proportion")
    # normed= was removed in newer matplotlib; density=True is the equivalent
    values, base, _ = plt.hist(data, bins=bins, density=True, alpha=0.5,
                               color="green", range=range, label="Histogram")
    ax_bis = ax.twinx()
    values = np.append(values, 0)
    ax_bis.plot(base, np.cumsum(values) / np.cumsum(values)[-1], color='darkorange',
                marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
    plt.xlabel(labels)
    plt.ylabel("Proportion")
    plt.title(title)
    ax_bis.legend()
    ax.legend()
    plt.show()
    return
If anyone wonders what the result looks like, please take a look (plotted with the seaborn style activated):
Also, concerning the double grid (the white lines), I always used to struggle to get a nice double grid. Here is an interesting way to circumvent the problem: How to put grid lines from the secondary axis behind the primary plot?
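For reference, one common trick for that problem (my own assumption, not necessarily the exact approach in the linked answer) is to raise the primary axes above the twin axes while hiding its background patch, so the secondary grid shows through from behind:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
fig, ax = plt.subplots()
ax_bis = ax.twinx()
ax.hist(data, bins=30, color='green', alpha=0.5)
ax_bis.plot(np.sort(data), np.linspace(0, 1, data.size), color='darkorange')
ax_bis.grid(True)
ax.set_zorder(ax_bis.get_zorder() + 1)  # draw the primary axes on top of the twin
ax.patch.set_visible(False)             # but keep the primary background transparent
plt.show()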
The simplest way to generate this graph is with seaborn:
import seaborn as sns
sns.ecdfplot()
See the seaborn documentation for sns.ecdfplot for details.
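Sketching this out for the original question (assuming seaborn >= 0.11, where ecdfplot and its complementary= and log_scale= parameters exist), the "less than" plot for pc and the "more than" plot for pnc could look like this; the lognormal arrays are just stand-ins for the real data:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pc = np.random.lognormal(size=1000)   # stand-in for the real pc array
pnc = np.random.lognormal(size=1000)  # stand-in for the real pnc array
ax = sns.ecdfplot(pc, log_scale=True, label='pc (less than)')
sns.ecdfplot(pnc, complementary=True, log_scale=True, ax=ax, label='pnc (more than)')
ax.legend()
plt.show()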

matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95th percentile and have the range [0, 95th percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.
    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.
    Returns:
    --------
        mask : A numobservations-length boolean array.
    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
If you aren't fussed about rejecting outliers as mentioned by Joe, and your reasons for doing this are purely aesthetic, you could just set your plot's x-axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min,max) works to set limits on the y axis also.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through np.clip. If you have a reasonable estimate of the maximum and minimum values of your data, just use those. If you don't have a reasonable estimate, the histogram of clipped data will show you the size of the tails, and if the outliers are really just outliers the tails should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
You could probably write some code that based on this find some good bounds that minimize the sizes of the tails according to some tolerance.
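A minimal sketch of that idea, assuming the tolerance is the fraction of probability mass you are willing to leave in each tail, is to derive the clip bounds from quantiles:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
tol = 0.001                                    # tolerated mass per tail
low, high = np.quantile(data, [tol, 1 - tol])  # data-driven clip bounds
plt.hist(np.clip(data, low, high), bins=333, density=True)
plt.show()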
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()
