I have two arrays of corresponding data (x and y) that I plot as above on a log-log plot. The data is currently too granular and I would like to bin them to get a smoother relationship. Could I get some guidance on how I can bin along the x-axis, in exponential bin sizes, so that it appears linear on the log-log scale?
For example, if the first bin spans x = 10^0 to 10^1, I want to collect all y-values whose corresponding x falls in that range and average them into one value for that bin. I don't think np.histogram or plt.hist quite does the trick, since they bin by counting occurrences.
Edit: For context, if it helps, the above plot is an assortativity plot that plots the in vs out degree of a certain network.
You may use scipy.stats.binned_statistic to get the mean of the data in each bin. The bins would best be created via numpy.logspace. You may then plot those means e.g. as horizontal lines spanning the bin width, or as scatter points at the bin midpoints.
import numpy as np
np.random.seed(42)
from scipy.stats import binned_statistic
import matplotlib.pyplot as plt

# Synthetic data spanning five decades
x = np.logspace(0, 5, 300)
y = np.logspace(0, 5, 300) + np.random.rand(300) * 1.e3

fig, ax = plt.subplots()
ax.scatter(x, y, s=9)
# Mean of y in each of five logarithmically spaced bins
s, edges, _ = binned_statistic(x, y, statistic='mean', bins=np.logspace(0, 5, 6))
# Horizontal lines spanning each bin's width, at the bin mean
ax.hlines(s, edges[:-1], edges[1:], color="crimson")
for e in edges:
    ax.axvline(e, color="grey", linestyle="--")
# Mark each mean at its bin's midpoint
ax.scatter(edges[:-1] + np.diff(edges)/2, s, c="limegreen", zorder=3)
ax.set_xscale("log")
ax.set_yscale("log")
plt.show()
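Two small caveats, in case they matter for your data: binned_statistic returns NaN for bins that contain no samples, and on a log axis the arithmetic midpoint edges[:-1] + np.diff(edges)/2 sits to the right of the visual bin center. A minimal sketch of both adjustments (using the geometric mean of the edges as the center is an assumption that suits log scales):
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers look centered on a log axis
valid = ~np.isnan(s)                       # mask out empty bins, which come back as NaN
ax.scatter(centers[valid], s[valid], c="limegreen", zorder=3)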
You can achieve this with pandas. The idea is to assign each X value to an interval using np.digitize. Since you are using a log scale, it makes sense to use np.logspace to choose intervals of exponentially increasing length. Finally, you can group the rows by interval and compute the mean of the Y values in each one.
import pandas as pd
import numpy as np

x_max = 10
xs = np.exp(x_max * np.random.rand(1000))
ys = np.exp(np.random.rand(1000))
df = pd.DataFrame({
    'X': xs,
    'Y': ys,
})
# Assign each X to one of 30 exponentially spaced intervals
df['Xbins'] = np.digitize(df.X, np.logspace(0, x_max, 30, base=np.e))
# Average Y within each interval
df['Ymean'] = df.groupby('Xbins').Y.transform('mean')
df.plot(kind='scatter', x='X', y='Ymean')
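Note that this plots one Ymean per original sample, so points within a bin overlap. If you would rather plot a single point per bin, you can aggregate the group means directly; a small sketch along the same lines (the geometric bin center is an assumption that suits the log scale, and it assumes every X falls inside the edge range):
import matplotlib.pyplot as plt

edges = np.logspace(0, x_max, 30, base=np.e)
centers = np.sqrt(edges[:-1] * edges[1:])   # geometric center of each bin
means = df.groupby('Xbins').Y.mean()        # one mean per occupied bin
# np.digitize returns 1-based bin indices, so shift by one to index centers
plt.scatter(centers[means.index - 1], means.values)
plt.xscale('log')
plt.show()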
Given that regplot calculates means in intervals and bootstraps to find confidence intervals for each bin, it seems like a waste to have to recalculate them manually for further study, so:
Question: How do I access the calculated means and confidence intervals of a regplot?
Example: This code produces a nice plot of bin means with CIs:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# just some random numbers to get started
fig, ax = plt.subplots()
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(-2 + binwidth/2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
plt.show()
Result:
Regplot showing binned data w. CIs
It's not that calculating the means bin by bin is hard, but the CIs are calculated using random bootstrap resampling, so recomputing them manually won't reproduce the plotted values exactly. It would be nice to have the exact same numbers accessible as are plotted, so how do I access them? There must be some sort of get_*-method I'm overlooking.
Set-up
Setting up as in your MWE:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Random numbers for plotting
x = np.random.uniform(-2, 2, 1000)
y = np.random.normal(x**2, np.abs(x) + 1)
# Manual binning to retain control
binwidth = 4 / 10
x_bins = np.arange(binwidth/2 - 2, 2, binwidth)
sns.regplot(x=x, y=y, x_bins=x_bins, fit_reg=None)
This gives our starting point as:
Extracting the Confidence Intervals
We can extract the confidence intervals by looping over the plotted lines and extracting the minimum and maximum values (corresponding to the lower and upper CIs respectively):
ax = plt.gca()
lower = [line.get_ydata().min() for line in ax.lines]
upper = [line.get_ydata().max() for line in ax.lines]
As a sanity check we can plot these extracted points on top of our original data (shown here by red crosses):
plt.scatter(x_bins, lower, marker='x', color='C3', zorder=3)
plt.scatter(x_bins, upper, marker='x', color='C3', zorder=3)
Extracting the Means
The values of the means can be extracted from ax.collections as:
means = ax.collections[0].get_offsets()[:, 1]
Again, as a sanity check we can overlay our extracted values on the original plot:
plt.scatter(x_bins, means, color='C1', marker='x', zorder=3)
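Since the point was further study, it may help to gather the extracted values into one table. A small sketch, assuming pandas is acceptable and that regplot drew one CI line per bin; the column names are just illustrative:
import pandas as pd

# Collect the extracted means and CI bounds alongside the bin positions
results = pd.DataFrame({
    'bin_center': x_bins,
    'mean': means,
    'ci_lower': lower,
    'ci_upper': upper,
})
print(results)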
Here is the histogram
To generate this plot, I did:
bins = np.array([0.03, 0.3, 2, 100])
plt.hist(m, bins=bins, weights=np.zeros_like(m) + 1. / m.size)
However, as you can see, I want to plot the histogram of the relative frequency of each data point with only 3 bins that have different sizes:
bin1 = 0.03 -> 0.3
bin2 = 0.3 -> 2
bin3 = 2 -> 100
The histogram looks ugly since the size of the last bin is extremely large relative to the other bins. How can I fix the histogram? I want to change the width of the bins but I do not want to change the range of each bin.
As @cel pointed out, this is no longer a histogram, but you can do what you are asking using plt.bar and np.histogram. You then just need to set the xticklabels to strings describing the bin edges. For example:
import numpy as np
import matplotlib.pyplot as plt
bins = [0.03, 0.3, 2, 100]  # your bins
data = [0.04, 0.07, 0.1, 0.2, 0.2, 0.8, 1, 1.5, 4, 5, 7, 8, 43, 45, 54, 56, 99]  # random data
hist, bin_edges = np.histogram(data, bins)  # make the histogram
fig, ax = plt.subplots()
# Plot the histogram heights against integers on the x axis
# (align='edge' keeps each bar starting at an integer; recent matplotlib defaults to 'center')
ax.bar(range(len(hist)), hist, width=1, align='edge')
# Set the ticks to the middle of the bars
ax.set_xticks([0.5 + i for i, j in enumerate(hist)])
# Set the xticklabels to a string that tells us what the bin edges were
ax.set_xticklabels(['{} - {}'.format(bins[i], bins[i+1]) for i, j in enumerate(hist)])
plt.show()
EDIT
If you update to matplotlib v1.5.0, you will find that bar now takes the kwarg tick_label, which makes this plotting even easier:
hist, bin_edges = np.histogram(data, bins)
labels = ['{} - {}'.format(bins[i], bins[i+1]) for i, j in enumerate(hist)]
ax.bar(range(len(hist)), hist, width=1, align='center', tick_label=labels)
If the actual bin values are not important, but you want a histogram of values of completely different orders of magnitude, you can use logarithmic scaling along the x axis. This gives you bars of equal width:
import numpy as np
import matplotlib.pyplot as plt
data = [0.04, 0.07, 0.1, 0.2, 0.2, 0.8, 1, 1.5, 4, 5, 7, 8, 43, 45, 54, 56, 99]
plt.hist(data, bins=10**np.linspace(-2, 2, 5))
plt.xscale('log')
plt.show()
If you have to use your exact bin values, you can do:
import numpy as np
import matplotlib.pyplot as plt
data = [0.04, 0.07, 0.1, 0.2, 0.2, 0.8, 1, 1.5, 4, 5, 7, 8, 43, 45, 54, 56, 99]
bins = [0.03, 0.3, 2, 100]
plt.hist(data, bins=bins)
plt.xscale('log')
plt.show()
However, in this case the widths are not perfectly equal but are still readable. If the widths must be equal and you have to use your bins, I recommend @tom's solution.
For the figure above, how can I draw an enveloping line with a shaded area, similar to the figure below?
Replicating your example is easy, because it's possible to calculate the min and max at each x and fill between them, e.g.:
import matplotlib.pyplot as plt
import numpy as np
#dummy data
y = [np.arange(20) + 3 * i for i in np.random.randn(3, 20)]  # np.arange, not range, so the addition broadcasts
x = list(range(20))
#calculate the min and max series for each x
min_ser = [min(i) for i in np.transpose(y)]
max_ser = [max(i) for i in np.transpose(y)]
#initial plot
fig, axs = plt.subplots()
axs.plot(x, x)
for s in y:
axs.scatter(x, s)
#plot the min and max series over the top
axs.fill_between(x, min_ser, max_ser, alpha=0.2)
giving
For your displayed data, that might prove problematic because the series do not share x values in all cases. If that's the case then you need some statistical technique to smooth the series somehow. One option is to use a package like seaborn, which provides functions to handle all the details for you.
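For example, seaborn's lineplot aggregates repeated x values into a mean line with a shaded band around it. A sketch, under the assumption that your series can be flattened into long-form arrays (recent seaborn uses the errorbar keyword; older versions use ci instead):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Flatten the three series into long-form x/y arrays
y = np.arange(20) + 3 * np.random.randn(3, 20)
x_flat = np.tile(np.arange(20), 3)
y_flat = y.ravel()

# Mean line with a +/- 1 standard deviation band over repeated x values
sns.lineplot(x=x_flat, y=y_flat, errorbar='sd')
plt.show()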
I have several arrays that I'm plotting a histogram of, like so:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(0, .5, 1000)
y = np.random.normal(0, .5, 100000)
plt.hist((x, y), density=True)  # 'normed' in older matplotlib versions
Of course, however, this normalizes both of the arrays individually, so that they both have the same peak. I'm looking to normalize them to the total number of elements, so that the histogram of y will be visibly taller than that of x. Is there a handy way to do this in matplotlib or will I have to mess around in numpy? I haven't found anything about it.
Another way to put it is that if I were instead to make a cumulative plot of the two arrays, they shouldn't both top out at 1, but should add to 1.
Yes, you can compute the histogram with numpy and renormalise it.
x = np.random.normal(0, .5, 1000)
y = np.random.normal(0, .5, 100000)
xhist, xbins = np.histogram(x, density=True)  # 'normed' in older numpy versions
yhist, ybins = np.histogram(y, density=True)
And now, you apply your renormalisation. For example, if you want x to be normalised to 1 and y scaled proportionally:
yhist *= len(y) / len(x)
Now, to plot the histogram:
def plot_histogram(data, edge_bins, **kwargs):
    # Plot each bin's value at the bin's midpoint
    bins = (edge_bins[:-1] + edge_bins[1:]) / 2
    plt.step(bins, data, **kwargs)
plot_histogram(xhist, xbins, c='b')
plot_histogram(yhist, ybins, c='g')
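Alternatively, plt.hist can do the joint normalisation in a single call via per-sample weights, so the two histograms together sum to 1. A sketch of that approach:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(0, .5, 1000)
y = np.random.normal(0, .5, 100000)

# Weight every sample by 1/total so both histograms jointly sum to 1
total = len(x) + len(y)
plt.hist((x, y), weights=(np.ones_like(x) / total, np.ones_like(y) / total))
plt.show()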
I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95 percentile and have the range [0, 95_percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a priori information (e.g. "this parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
If you aren't fussed about rejecting outliers, as Joe mentions, and your reasons for doing this are purely aesthetic, you could just set your plot's x-axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min, max) works the same way for the y axis.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through np.clip. If you have a reasonable estimate of the maximum and minimum values of your data, just use those. If you don't have a reasonable estimate, the histogram of the clipped data will show you the size of the tails, and if the outliers really are just outliers, the tails should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
You could probably write some code that based on this find some good bounds that minimize the sizes of the tails according to some tolerance.
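As a starting point, a percentile-based sketch: pick the bounds so that at most a given fraction of samples is clipped on each side (tol is a tolerance you would tune, and clip_bounds is a hypothetical helper, not a library function):
import numpy as np

def clip_bounds(data, tol=0.005):
    """Return (lo, hi) such that at most a fraction `tol` of samples is clipped per side."""
    return np.percentile(data, [100 * tol, 100 * (1 - tol)])

data = np.random.normal(3, size=100000)
lo, hi = clip_bounds(data)
clipped = np.clip(data, lo, hi)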
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()
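The same idea can be wrapped into a small reusable helper (autoscale_y is a hypothetical name, not part of matplotlib):
import numpy as np

def autoscale_y(ax, ydata, plo=1, phi=99, margin=0.2):
    """Set the y-limits of ax from percentiles of ydata, padded by a fraction of the spread."""
    ybot, ytop = np.percentile(ydata, [plo, phi])
    pad = margin * (ytop - ybot)
    ax.set_ylim(ybot - pad, ytop + pad)

autoscale_y(ax2, ydata)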