Python (numpy) - correlate two binned plots - python

My question is how do I correlate my two binned plots and output a Pearson's correlation coefficient?
I'm not sure how to properly extract the binned arrays necessary for the np.corrcoef function. Here's my script:
import numpy as np
import matplotlib.pyplot as plt
A = np.genfromtxt('data1.txt')
x1 = A[:,1]
y1 = A[:,2]
B=np.genfromtxt('data2.txt')
x2 = B[:,1]
y2 = B[:,2]
fig = plt.figure()
plt.subplots_adjust(hspace=0.5)
plt.subplot(121)
AA = plt.hexbin(x1,y1,cmap='jet',gridsize=500,vmin=0,vmax=450,mincnt=1)
plt.axis([-180,180,-180,180])
cb = plt.colorbar()
plt.title('Data1')
plt.subplot(122)
BB = plt.hexbin(x2,y2,cmap='jet',gridsize=500,vmin=0,vmax=450,mincnt=1)
plt.axis([-180,180,-180,180])
cb = plt.colorbar()
plt.title('Data 2')
array1 = np.ndarray.flatten(AA)
array2 = np.ndarray.flatten(BB)
print np.corrcoef(array1,array2)
plt.show()

The answer can be found in the documentation:
Returns: object
a PolyCollection instance; use get_array() on this PolyCollection to get the counts in each hexagon.
Here's a revised version of your code:
import numpy as np
import matplotlib.pyplot as plt
A = np.genfromtxt('data1.txt')
x1 = A[:,1]
y1 = A[:,2]
B = np.genfromtxt('data2.txt')
x2 = B[:,1]
y2 = B[:,2]
# make figure and axes
fig, (ax1, ax2) = plt.subplots(1, 2)
# define common keyword arguments
hex_params = dict(cmap='jet', gridsize=500, vmin=0, vmax=450, mincnt=1)
# plot and set titles
hex1 = ax1.hexbin(x1, y1, **hex_params)
hex2 = ax2.hexbin(x2, y2, **hex_params)
ax1.set_title('Data 1')
ax2.set_title('Data 2')
# set axes lims
for ax in (ax1, ax2):
    ax.set_xlim(-180, 180)
    ax.set_ylim(-180, 180)
# add single colorbar
fig.subplots_adjust(right=0.8, hspace=0.5)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(hex2, cax=cbar_ax)
# get binned data and corr coeff
binned1 = hex1.get_array()
binned2 = hex2.get_array()
print(np.corrcoef(binned1, binned2))
plt.show()
Two comments, though: are you sure you want the Pearson correlation coefficient? What are you actually trying to show? If you want to show whether the distributions are the same or different, you might want to use a Kolmogorov-Smirnov test.
Also don't use jet as a colormap. Jet is bad.
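One thing to check before trusting the coefficient: with mincnt=1, hexbin appears to drop empty hexagons from the array it returns, and each call derives its grid from its own data range, so binned1 and binned2 may differ in length or refer to different cells. Below is a minimal sketch of one way around this. It swaps in np.histogram2d with shared bin edges in place of hexbin; the synthetic data and the bin width of 10 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the columns read from data1.txt and data2.txt (assumption)
x1, y1 = rng.uniform(-180, 180, size=(2, 5000))
x2, y2 = rng.uniform(-180, 180, size=(2, 5000))

# Shared bin edges guarantee that cell k means the same region in both datasets
edges = np.linspace(-180, 180, 37)  # 36 bins of width 10 (assumed)
counts1, _, _ = np.histogram2d(x1, y1, bins=[edges, edges])
counts2, _, _ = np.histogram2d(x2, y2, bins=[edges, edges])

# Empty cells are kept as zeros, so both flattened arrays have equal length
r = np.corrcoef(counts1.ravel(), counts2.ravel())[0, 1]
print(r)
```

If you prefer to stay with hexagons, passing the same extent=(-180, 180, -180, 180) to both hexbin calls and leaving out mincnt should give aligned arrays as well.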

Related

Visualizing the difference between two numeric arrays

I have two numeric arrays of equal length, with one array always having the element value >= to the corresponding (same index) element in the second array.
I am trying to visualize in a single graph:
i) difference between the corresponding elements,
ii) values of the corresponding elements in the two arrays.
I have tried plotting the CDF as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
arr1 = np.random.uniform(1,20,[25,1])
arr2 = arr1 + np.random.uniform(1,10,[25,1])
df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)
fig, ax = plt.subplots()
sns.kdeplot(df1[0], cumulative=True, color='orange', label='arr1')
sns.kdeplot(df2[0], cumulative=True, color='b', label='arr2')
sns.kdeplot(df2[0]-df1[0], cumulative=True, color='r', label='difference')
plt.show()
which gives the following output:
However, it does not capture the difference, and values of the corresponding elements together. For example, suppose the difference between two elements is 3. The two numbers can be 2 and 5, but they can also be 15 and 18, and this can not be determined from the CDF.
Which kind of plotting can visualize both the difference between the elements and the values of the elements?
I do not want a line plot like the one below, because little statistical insight can be derived from it.
ax.plot(df1[0])
ax.plot(df2[0])
ax.plot(df2[0]-df1[0])
There are many ways to show the difference between two values. The right one depends on your goal for the chart, how quantitative or qualitative you want to be, and whether you want to show the raw data. Here are a few ideas that do not involve simple line plots or density functions. I strongly recommend the book Better Data Visualizations by Jonathan Schwabish, which discusses many considerations for presenting data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
arr1 = np.random.uniform(1,20, size=25)
arr2 = arr1 + np.random.uniform(1,10, size=25)
df = pd.DataFrame({
'col1' : arr1,
'col2' : arr2
})
df['diff'] = df.col2 - df.col1
df['sum'] = df.col1 + df.col2
fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(15,15))
axes = axes.flatten()
# Pyramid chart
df_sorted = df.sort_values(by='sum', ascending=True)
axes[0].barh(
y = np.arange(1,26),
width = -df_sorted.col1
)
axes[0].barh(
y = np.arange(1,26),
width = df_sorted.col2
)
# Style axes[0]
style_func(axes[0], 'Pyramid Chart')
# Dot Plot
axes[1].scatter(df.col1, np.arange(1, 26), label='col1')
axes[1].scatter(df.col2, np.arange(1, 26), label='col2')
axes[1].hlines(
y = np.arange(1, 26),
xmin = df.col1, xmax = df.col2,
zorder=0, linewidth=1.5, color='k'
)
# Style axes[1]
legend = axes[1].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[1], 'Dot Plot')
set_xlim = axes[1].set_xlim(0,25)
# Dot Plot 2
df_sorted = df.sort_values(by=['col1', 'diff'], ascending=False)
axes[2].scatter(df_sorted.col1, np.arange(1, 26), label='col1')
axes[2].scatter(df_sorted.col2, np.arange(1, 26), label='col2')
axes[2].hlines(
y = np.arange(1, 26),
xmin = df_sorted.col1, xmax = df_sorted.col2,
zorder=0, linewidth=1.5, color='k'
)
# Style axes[2]
legend = axes[2].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[2], 'Dot Plot')
set_xlim = axes[2].set_xlim(0,25)
# Dot Plot 3
df_sorted = df.sort_values(by='sum', ascending=True)
axes[3].scatter(-df_sorted.col1, np.arange(1, 26), label='col1')
axes[3].scatter(df_sorted.col2, np.arange(1, 26), label='col2')
axes[3].vlines(x=0, ymin=-1, ymax=27, linewidth=2.5, color='k')
axes[3].hlines(
y = np.arange(1, 26),
xmin = -df_sorted.col1, xmax = df_sorted.col2,
zorder=0, linewidth=2
)
# Style axes[3]
legend = axes[3].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[3], 'Dot Plot')
# Strip plot
axes[4].scatter(df.col1, [4] * 25)
axes[4].scatter(df.col2, [6] * 25)
axes[4].set_ylim(0, 10)
axes[4].vlines(
x = [df.col1.mean(), df.col2.mean()],
ymin = [3.5, 5.5], ymax=[4.5,6.5],
color='black', linewidth =2
)
# Style axes[4]
axes[4].yaxis.set_major_locator(ticker.FixedLocator([4,6]))
axes[4].yaxis.set_major_formatter(ticker.FixedFormatter(['col1','col2']))
hide_spines = [axes[4].spines[x].set_visible(False) for x in ['left','top','right']]
set_title = axes[4].set_title('Strip Plot', fontweight='bold')
tick_params = axes[4].tick_params(axis='y', left=False)
grid = axes[4].grid(axis='y', dashes=(8,3), alpha=0.3, color='gray')
# Slope chart
for i in range(25):
    axes[5].plot([0,1], [df.col1[i], df.col2[i]], color='k')
align = ['left', 'right']
for i in range(1,3):
    axes[5].text(x = i - 1, y = 0, s = 'col' + str(i),
                 fontsize=14, fontweight='bold', ha=align[i-1])
set_title = axes[5].set_title('Slope chart', fontweight='bold')
axes[5].axis('off')
# Helper used by the panels above; define it before the plotting calls
# (e.g. right after the imports), otherwise style_func raises a NameError.
def style_func(ax, title):
    hide_spines = [ax.spines[x].set_visible(False) for x in ['left','top','right']]
    set_title = ax.set_title(title, fontweight='bold')
    set_xlim = ax.set_xlim(-25,25)
    x_locator = ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    y_locator = ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(1,26, 2)))
    spine_width = ax.spines['bottom'].set_linewidth(1.5)
    x_tick_params = ax.tick_params(axis='x', length=8, width=1.5)
    y_tick_params = ax.tick_params(axis='y', left=False)
What about a parallel coordinates plot with plotly? It lets you see the distinct values of each original array, and also whether they converge on the same difference.
https://plot.ly/python/parallel-coordinates-plot/

How to avoid overlapping error bars in matplotlib?

I want to create a plot for two different datasets similar to the one presented in this answer:
In the above image, the author managed to fix the overlapping problem of the error bars by adding some small random scatter in x to the new dataset.
In my problem, I must plot a similar graphic, but having some categorical data in the x axis:
Any ideas on how to slightly shift the error bars of the second dataset when the x axis is categorical? I want to avoid overlap between the bars to make the plot easier to read.
You can shift each errorbar by composing a translation in data space with the default data transform. This works because categories are, in general, one data unit apart.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
from matplotlib.transforms import Affine2D
x = list("ABCDEF")
y1, y2 = np.random.randn(2, len(x))
yerr1, yerr2 = np.random.rand(2, len(x))*4+0.3
fig, ax = plt.subplots()
trans1 = Affine2D().translate(-0.1, 0.0) + ax.transData
trans2 = Affine2D().translate(+0.1, 0.0) + ax.transData
er1 = ax.errorbar(x, y1, yerr=yerr1, marker="o", linestyle="none", transform=trans1)
er2 = ax.errorbar(x, y2, yerr=yerr2, marker="o", linestyle="none", transform=trans2)
plt.show()
Alternatively, you could translate the errorbars after applying the data transform and hence move them in units of points.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
from matplotlib.transforms import ScaledTranslation
x = list("ABCDEF")
y1, y2 = np.random.randn(2, len(x))
yerr1, yerr2 = np.random.rand(2, len(x))*4+0.3
fig, ax = plt.subplots()
trans1 = ax.transData + ScaledTranslation(-5/72, 0, fig.dpi_scale_trans)
trans2 = ax.transData + ScaledTranslation(+5/72, 0, fig.dpi_scale_trans)
er1 = ax.errorbar(x, y1, yerr=yerr1, marker="o", linestyle="none", transform=trans1)
er2 = ax.errorbar(x, y2, yerr=yerr2, marker="o", linestyle="none", transform=trans2)
plt.show()
While results look similar in both cases, they are fundamentally different. You will observe this difference when interactively zooming the axes or changing the figure size.
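To make that difference concrete, here is a small sketch that measures the horizontal shift in pixels for both kinds of transform before and after changing the x limits. The helper name pixel_shifts and the chosen limits are made up for illustration; the transforms are the same ones used above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.transforms import Affine2D, ScaledTranslation

fig, ax = plt.subplots()

def pixel_shifts(ax, fig):
    """Return the horizontal shift (in pixels) produced by each transform."""
    # Shift of 0.1 data units, applied before the data transform
    data_shift = Affine2D().translate(0.1, 0.0) + ax.transData
    # Shift of 5 points (5/72 inch), applied after the data transform
    point_shift = ax.transData + ScaledTranslation(5/72, 0, fig.dpi_scale_trans)
    base_x = ax.transData.transform((1.0, 0.0))[0]
    return (data_shift.transform((1.0, 0.0))[0] - base_x,
            point_shift.transform((1.0, 0.0))[0] - base_x)

ax.set_xlim(0, 5)
narrow = pixel_shifts(ax, fig)
ax.set_xlim(0, 50)  # zoom out by a factor of 10
wide = pixel_shifts(ax, fig)

print("data-space shift (px):", narrow[0], "->", wide[0])
print("point-based shift (px):", narrow[1], "->", wide[1])
```

The data-space shift shrinks with the zoom, while the point-based shift stays the same on screen. Which one you want depends on whether the offset should scale with the data or remain a fixed visual gap.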
Consider the following approach to highlight plots - combination of errorbar and fill_between with non-zero transparency:
import random
import matplotlib.pyplot as plt
# create sample data
N = 8
data_1 = {
'x': list(range(N)),
'y': [10. + random.random() for dummy in range(N)],
'yerr': [.25 + random.random() for dummy in range(N)]}
data_2 = {
'x': list(range(N)),
'y': [10.25 + .5 * random.random() for dummy in range(N)],
'yerr': [.5 * random.random() for dummy in range(N)]}
# plot
plt.figure()
# only errorbar
plt.subplot(211)
for data in [data_1, data_2]:
    plt.errorbar(**data, fmt='o')
# errorbar + fill_between
plt.subplot(212)
for data in [data_1, data_2]:
    plt.errorbar(**data, alpha=.75, fmt=':', capsize=3, capthick=1)
    data = {
        'x': data['x'],
        'y1': [y - e for y, e in zip(data['y'], data['yerr'])],
        'y2': [y + e for y, e in zip(data['y'], data['yerr'])]}
    plt.fill_between(**data, alpha=.25)
Result:
There is an example on the matplotlib site: https://matplotlib.org/stable/gallery/lines_bars_and_markers/errorbar_subsample.html
You need the parameter errorevery=(m, n):
n - how often to draw error bars (every n-th point), m - the offset of the first error bar, in the range 0 to n.
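A minimal sketch of the errorevery idea, assuming two series that share the same numeric x values (the data here is made up): giving the two series different offsets m means their error bars never fall on the same x position.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
fig, ax = plt.subplots()

# Series 1 draws error bars at x[0::2], series 2 at x[1::2]
c1 = ax.errorbar(x, np.sin(x), yerr=0.3, errorevery=(0, 2), fmt="o-", label="series 1")
c2 = ax.errorbar(x, np.cos(x), yerr=0.3, errorevery=(1, 2), fmt="s-", label="series 2")
ax.legend()
# Call plt.show() or fig.savefig(...) to view the result
```

Note that this hides half of the error bars of each series, so it suits densely sampled curves better than a handful of categories.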

python: sns distplot area overlap

How can I get the overlapping area of 2 sns.distplots?
Apart from the difference in mean (as below), I would like to add a number that describes how different the (normalised) distributions are (for example, two distributions could have the same mean but still look very different if they are not normal).
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
x1 = np.random.normal(size=2000)
x2 = np.random.normal(size=1000)+1
sns.distplot(x1, hist=False, kde=True, color="r", norm_hist=True)
sns.distplot(x2, hist=False, kde=True, color="b", norm_hist=True)
m1 = x1.mean()
m2 = x2.mean()
plt.title("m1={:2.2f}, m2={:2.2f} (diffInMean={:2.2f})".format(m1, m2, m1-m2))
plt.show(block=True)
If somebody is interested: I have approximated it now with an integral of the distributions (unfortunately not quite the 1-liner I was searching for):
data1 = np.random.normal(size=9000)
data2 = np.random.normal(size=5000, loc=0.5, scale=1.5)
num_bins = 100
xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
bins = np.linspace(xmin, xmax, num_bins)
weights1 = np.ones_like(data1) / float(len(data1))
weights2 = np.ones_like(data2) / float(len(data2))
hist_1 = np.histogram(data1, bins, weights=weights1)[0]
hist_2 = np.histogram(data2, bins, weights=weights2)[0]
tvd = 0.5*sum(abs(hist_1 - hist_2))
print("overlap: {:2.2f} percent".format((1-tvd)*100))
plt.figure()
ax = plt.gca()
ax.hist(data1, bins, weights=weights1, color='red', edgecolor='white', alpha=0.5)[0]
ax.hist(data2, bins, weights=weights2, color='blue', edgecolor='white', alpha=0.5)[0]
plt.show()
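The quantity 1 - tvd above can also be read directly as the shared area of the two normalised histograms: when both sum to 1, the sum of the bin-wise minima equals one minus half the sum of absolute differences. A short sketch that checks this numerically on synthetic data (the seed and variable names p and q are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
data1 = rng.normal(size=9000)
data2 = rng.normal(loc=0.5, scale=1.5, size=5000)

# Common bin edges covering both samples, as in the code above
bins = np.linspace(min(data1.min(), data2.min()),
                   max(data1.max(), data2.max()), 100)

# Normalise counts so each histogram sums to 1
p = np.histogram(data1, bins)[0] / len(data1)
q = np.histogram(data2, bins)[0] / len(data2)

overlap = np.minimum(p, q).sum()   # shared probability mass
tvd = 0.5 * np.abs(p - q).sum()    # total variation distance
print("overlap: {:2.2f} percent".format(overlap * 100))
```

Both forms depend on the number of bins, so the result is an approximation that is only comparable between runs that use the same binning.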

How to add hierarchical axis across subplots in order to label groups?

I have a set of different time series which can be grouped. E.g. the plot below shows series A, B, C and D. However, A and B are in group G1 and C and D are in group G2.
I would like to reflect that in the plot by adding another axis on the left which spans each group of turbines, and label these axes accordingly.
I've tried a few things so far, but apparently it is not that easy.
Does somebody know how I can do that?
PS: Since I am using panda's plot(subplots=True) on a data frame which has already columns
| G1 | G2 |
|-------|------|
index | A B | C D |
------|-------|------|
it might be that pandas can do that already for me. That's why I am using the pandas tag.
You can create additional axes in the plot, which span each two plots but only have a left y-axis, no ticks and other decorations. Only a ylabel is set. This will make the whole thing look well aligned.
The good thing is that you can work with your existing pandas plot. The drawback is that it takes more than 15 lines of code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
df = pd.DataFrame(np.random.rand(26,4), columns=list("ABCD"))
axes = df.plot(subplots=True)
fig = axes[0].figure
gs = gridspec.GridSpec(4,2)
gs.update(left=0.1, right=0.48, wspace=0.05)
fig.subplots_adjust(left=.2)
for i, ax in enumerate(axes):
    ax.set_subplotspec(gs[i,1])
aux1 = fig.add_subplot(gs[:2,0])
aux2 = fig.add_subplot(gs[2:,0])
aux1.set_ylabel("G1")
aux2.set_ylabel("G2")
for ax in [aux1, aux2]:
    ax.tick_params(size=0)
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_facecolor("none")
    for pos in ["right", "top", "bottom"]:
        ax.spines[pos].set_visible(False)
    ax.spines["left"].set_linewidth(3)
    ax.spines["left"].set_color("crimson")
plt.show()
Here is an example I came up with. Since you did not provide your code, I did it without pandas, because I am not proficient with it.
You basically plot as one would and then create another axis around all your previous ones, remove its axis with ax5.axis('off') and plot the 2 lines and text on it.
from matplotlib import lines
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 4*np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.tan(x)
y4 = np.cos(x)/(x+1)
fig = plt.figure()
fig.subplots_adjust(hspace=.5)
ax1 = plt.subplot(411)
ax1.plot(x, y1)
ax2 = plt.subplot(412)
ax2.plot(x, y2)
ax3 = plt.subplot(413)
ax3.plot(x, y3)
ax4 = plt.subplot(414)
ax4.plot(x, y4)
# new axis around the others with 0-1 limits
ax5 = plt.axes([0, 0, 1, 1])
ax5.axis('off')
line_x1, line_y1 = np.array([[0.05, 0.05], [0.05, 0.5]])
line1 = lines.Line2D(line_x1, line_y1, lw=2., color='k')
ax5.add_line(line1)
line_x2, line_y2 = np.array([[0.05, 0.05], [0.55, 0.9]])
line2 = lines.Line2D(line_x2, line_y2, lw=2., color='k')
ax5.add_line(line2)
ax5.text(0.0, 0.75, "G1")
ax5.text(0.0, 0.25, "G2")
plt.show()
Inspired by How to draw a line outside of an axis in matplotlib (in figure coordinates)?

Arrange plots that have subplots called from functions on grid in matplotlib

I am looking for something similar to arrangeGrob in R:
I have a function (say, function FUN1) that creates a plot with subplots. The number of subplots FUN1 creates may vary and the plot itself is quite complex. I have two other functions FUN2 and FUN3 which also create plots of varying structure.
Is there a simple way to define/arrange an overall GRID, for example a simple 3 rows 1 column style and simply pass
FUN1 --> GRID(row 1, col 1)
FUN2 --> GRID(row 2, col 1)
FUN3 --> GRID(row 3, col 1)
afterwards, such that the complicated plot generated by FUN1 gets plotted in row 1, the plot generated by FUN2 in row 2 and so on, without specifying the subplot criteria in the FUNs beforehand?
The usual way to create plots with matplotlib is to create some axes first and then plot to those axes. The axes can be set up on a grid using plt.subplots, figure.add_subplot, plt.subplot2grid or, for more sophisticated layouts, GridSpec.
Once those axes are created, they can be given to functions, which plot content to the axes. The following would be an example where 6 axes are created and 3 different functions are used to plot to them.
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
def func1(ax, bx, cx):
    x = np.arange(3)
    x2 = np.linspace(-3,3)
    y1 = [1,2,4]
    y2 = [3,2.5,3.4]
    f = lambda x: np.exp(-x**2)
    ax.bar(x-0.5, y1, width=0.4)
    ax.bar(x, y2, width=0.4)
    bx.plot(x,y1, label="lab1")
    bx.scatter(x,y2, label="lab2")
    bx.legend()
    cx.fill_between(x2, f(x2))
def func2(ax, bx):
    x = np.arange(1,18)/1.9
    y = np.arange(1,6)/1.4
    z = np.outer(np.sin(x), -np.sqrt(y)).T
    ax.imshow(z, aspect="auto", cmap="Purples_r")
    X, Y = np.meshgrid(np.linspace(-3,3),np.linspace(-3,3))
    U = -1-X**2+Y
    V = 1+X-Y**2
    bx.streamplot(X, Y, U, V, color=U, linewidth=2, cmap="autumn")
def func3(ax):
    data = [sorted(np.random.normal(0, s, 100)) for s in range(2,5)]
    ax.violinplot(data)
gs = gridspec.GridSpec(3, 4,
                       width_ratios=[1,1.5,0.75,1], height_ratios=[3,2,2] )
ax1 = plt.subplot(gs[0:2,0])
ax2 = plt.subplot(gs[2,0:2])
ax3 = plt.subplot(gs[0,1:3])
ax4 = plt.subplot(gs[1,1])
ax5 = plt.subplot(gs[0,3])
ax6 = plt.subplot(gs[1:,2:])
func1(ax1, ax3, ax5)
func3(ax2)
func2(ax4, ax6)
plt.tight_layout()
plt.show()
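If each function should decide on its own how many subplots it needs, matplotlib's subfigures (available in version 3.4 and later, which is an assumption about your setup) may come closer to arrangeGrob. The sketch below sets up a 3-row, 1-column grid of subfigures and hands one to each function; fun1, fun2 and fun3 and their contents are hypothetical stand-ins for your FUN1 to FUN3.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

def fun1(container):
    # Creates three side-by-side subplots inside whatever container it is given
    axes = container.subplots(1, 3)
    for k, ax in enumerate(axes, start=1):
        ax.plot(x, np.sin(k * x))
    return axes

def fun2(container):
    # Creates two subplots of different kinds
    axes = container.subplots(1, 2)
    axes[0].plot(x, np.cos(x))
    axes[1].scatter(x[::10], np.cos(x[::10]))
    return axes

def fun3(container):
    # Creates a single axes
    ax = container.subplots()
    ax.hist(np.sin(x), bins=20)
    return [ax]

fig = plt.figure(figsize=(8, 9))
subfigs = fig.subfigures(3, 1)  # the overall grid: 3 rows, 1 column
panels = [fun(sf) for fun, sf in zip((fun1, fun2, fun3), subfigs)]
# Call plt.show() or fig.savefig(...) to view the result
```

The functions only need an object with a subplots method, so the same functions also work when passed a plain Figure.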
