pandas - plotting integration with matplotlib

Given this data frame:
xlabel = list('xxxxxxyyyyyyzzzzzz')
fill= list('abc'*6)
val = np.random.rand(18)
df = pd.DataFrame({ 'xlabel':xlabel, 'fill':fill, 'val':val})
This is what I'm aiming at: http://matplotlib.org/mpl_examples/pylab_examples/barchart_demo.png
Applied to my example, Group would be x, y and z, Gender would be a, b and c, and Scores would be val.
I'm aware that pandas' plotting integration with matplotlib is still a work in progress, so is it possible to do this directly in matplotlib?

Is this what you want?
df.groupby(['fill', 'xlabel']).mean().unstack().plot(kind='bar')
or
df.pivot_table(index='fill', columns='xlabel', values='val').plot(kind='bar')
You can break it apart and adjust the labels, columns and title, but I think this basically gives you the plot you wanted.
For the error bars, you currently have to use matplotlib directly:
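# pivot_table takes index=/columns= (these were rows=/cols= in very old pandas)
mean_df = df.pivot_table(index='fill', columns='xlabel',
                         values='val', aggfunc='mean')
err_df = df.pivot_table(index='fill', columns='xlabel',
                        values='val', aggfunc='std')
rows = len(mean_df)
cols = len(mean_df.columns)
ind = np.arange(rows)
width = 0.8 / cols
colors = 'grb'
fig, ax = plt.subplots()
# one set of bars per column, each shifted right by one bar width
for i, col in enumerate(mean_df.columns):
    # align='edge' keeps the left-edge positioning that the tick formula below assumes
    ax.bar(ind + i * width, mean_df[col], width=width, align='edge',
           color=colors[i], yerr=err_df[col], label=col)
# centre the tick labels under each group of bars
ax.set_xticks(ind + cols / 2.0 * width)
ax.set_xticklabels(mean_df.index)
ax.legend()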
But there will be an enhancement, probably in the 0.13: issue 3796
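In later pandas releases DataFrame.plot accepts a yerr argument, so the manual loop is no longer required. A minimal sketch, assuming a reasonably recent pandas with the index=/columns= keywords of pivot_table; the fixed seed and the Agg backend are assumptions added only to make the example reproducible without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same layout as the data frame in the question, with a fixed seed
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'xlabel': list('xxxxxxyyyyyyzzzzzz'),
    'fill': list('abc' * 6),
    'val': rng.random(18),
})

# Means and standard deviations, one row per 'fill', one column per 'xlabel'
mean_df = df.pivot_table(index='fill', columns='xlabel', values='val', aggfunc='mean')
err_df = df.pivot_table(index='fill', columns='xlabel', values='val', aggfunc='std')

# pandas matches the error frame to the bars by index and column label
ax = mean_df.plot(kind='bar', yerr=err_df, capsize=3)
ax.set_ylabel('val')
```

The error frame needs the same index and columns as the frame being plotted, which is why both are built with the same pivot_table call.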

This was the only solution I found for displaying the error bars:
means = df.groupby(['fill', 'xlabel']).mean().unstack()
x_mean,y_mean,z_mean = means.val.x, means.val.y,means.val.z
sems = df.groupby(['fill','xlabel']).aggregate(stats.sem).unstack()
x_sem,y_sem,z_sem = sems.val.x, sems.val.y,sems.val.z
ind = np.array([0,1.5,3])
fig, ax = plt.subplots()
width = 0.35
bar_x = ax.bar(ind, x_mean, width, color='r', yerr=x_sem, ecolor='r')
bar_y = ax.bar(ind+width, y_mean, width, color='g', yerr=y_sem, ecolor='g')
bar_z = ax.bar(ind+width*2, z_mean, width, color='b', yerr=z_sem, ecolor='b')
ax.legend((bar_x[0], bar_y[0], bar_z[0]), ('X','Y','Z'))
I'd be happy to see a neater approach to the problem though, possibly as an extension of Viktor Kerkez's answer.
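One possibly neater variant of the code above: pandas groupby objects have their own sem() aggregation, so scipy.stats is not strictly needed, and a loop over the columns replaces the three hard-coded ax.bar calls. This is only a sketch; the seed, the Agg backend and the variable names are assumptions, not part of the original answers.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'xlabel': list('xxxxxxyyyyyyzzzzzz'),
    'fill': list('abc' * 6),
    'val': rng.random(18),
})

grouped = df.groupby(['fill', 'xlabel'])['val']
means = grouped.mean().unstack('xlabel')  # rows: fill, columns: xlabel
sems = grouped.sem().unstack('xlabel')    # standard error of the mean

ind = np.arange(len(means))
n_cols = len(means.columns)
width = 0.8 / n_cols

fig, ax = plt.subplots()
for i, (col, color) in enumerate(zip(means.columns, 'rgb')):
    # bars are centre-aligned by default, so each series is shifted by i * width
    ax.bar(ind + i * width, means[col], width, color=color,
           yerr=sems[col], ecolor='k', label=col.upper())

# place the tick in the middle of each group of bars
ax.set_xticks(ind + width * (n_cols - 1) / 2)
ax.set_xticklabels(means.index)
ax.legend()
```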

Related

Visualizing the difference between two numeric arrays

I have two numeric arrays of equal length, where each element of one array is always >= the corresponding (same-index) element of the other array.
I am trying to visualize in a single graph:
i) difference between the corresponding elements,
ii) values of the corresponding elements in the two arrays.
I have tried plotting the CDF as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
arr1 = np.random.uniform(1,20,[25,1])
arr2 = arr1 + np.random.uniform(1,10,[25,1])
df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)
fig, ax = plt.subplots()
sns.kdeplot(df1[0], cumulative=True, color='orange', label='arr1')
sns.kdeplot(df2[0], cumulative=True, color='b', label='arr2')
sns.kdeplot(df2[0]-df1[0], cumulative=True, color='r', label='difference')
plt.show()
which gives the following output:
However, it does not capture the difference, and values of the corresponding elements together. For example, suppose the difference between two elements is 3. The two numbers can be 2 and 5, but they can also be 15 and 18, and this can not be determined from the CDF.
Which kind of plotting can visualize both the difference between the elements and the values of the elements?
I do not want a line plot like the one below, because it does not give much statistical insight.
ax.plot(df1[0])
ax.plot(df2[0])
ax.plot(df2[0]-df1[0])

There are lots of ways to show the difference between two values. It really depends on your goal for the chart, how quantitative or qualitative you want to be, and whether you want to show the raw data somehow. Here are a few ideas that do not involve simple line plots or density functions. I strongly recommend the book Better Data Visualizations by Jonathan Schwabish, which discusses interesting considerations regarding data presentation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
arr1 = np.random.uniform(1,20, size=25)
arr2 = arr1 + np.random.uniform(1,10, size=25)
df = pd.DataFrame({
'col1' : arr1,
'col2' : arr2
})
df['diff'] = df.col2 - df.col1
df['sum'] = df.col1 + df.col2
fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(15,15))
axes = axes.flatten()
# Pyramid chart
df_sorted = df.sort_values(by='sum', ascending=True)
axes[0].barh(
y = np.arange(1,26),
width = -df_sorted.col1
)
axes[0].barh(
y = np.arange(1,26),
width = df_sorted.col2
)
# Style axes[0]
style_func(axes[0], 'Pyramid Chart')
# Dot Plot
axes[1].scatter(df.col1, np.arange(1, 26), label='col1')
axes[1].scatter(df.col2, np.arange(1, 26), label='col2')
axes[1].hlines(
y = np.arange(1, 26),
xmin = df.col1, xmax = df.col2,
zorder=0, linewidth=1.5, color='k'
)
# Style axes[1]
legend = axes[1].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[1], 'Dot Plot')
set_xlim = axes[1].set_xlim(0,25)
# Dot Plot 2
df_sorted = df.sort_values(by=['col1', 'diff'], ascending=False)
axes[2].scatter(df_sorted.col1, np.arange(1, 26), label='col1')
axes[2].scatter(df_sorted.col2, np.arange(1, 26), label='col2')
axes[2].hlines(
y = np.arange(1, 26),
xmin = df_sorted.col1, xmax = df_sorted.col2,
zorder=0, linewidth=1.5, color='k'
)
# Style axes[2]
legend = axes[2].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[2], 'Dot Plot')
set_xlim = axes[2].set_xlim(0,25)
# Dot Plot 3
df_sorted = df.sort_values(by='sum', ascending=True)
axes[3].scatter(-df_sorted.col1, np.arange(1, 26), label='col1')
axes[3].scatter(df_sorted.col2, np.arange(1, 26), label='col2')
axes[3].vlines(x=0, ymin=-1, ymax=27, linewidth=2.5, color='k')
axes[3].hlines(
y = np.arange(1, 26),
xmin = -df_sorted.col1, xmax = df_sorted.col2,
zorder=0, linewidth=2
)
# Style axes[3]
legend = axes[3].legend(ncol=2, loc='center', bbox_to_anchor=(0.14,1.025), edgecolor='w')
style_func(axes[3], 'Dot Plot')
# Strip plot
axes[4].scatter(df.col1, [4] * 25)
axes[4].scatter(df.col2, [6] * 25)
axes[4].set_ylim(0, 10)
axes[4].vlines(
x = [df.col1.mean(), df.col2.mean()],
ymin = [3.5, 5.5], ymax=[4.5,6.5],
color='black', linewidth =2
)
# Style axes[4]
axes[4].yaxis.set_major_locator(ticker.FixedLocator([4,6]))
axes[4].yaxis.set_major_formatter(ticker.FixedFormatter(['col1','col2']))
hide_spines = [axes[4].spines[x].set_visible(False) for x in ['left','top','right']]
set_title = axes[4].set_title('Strip Plot', fontweight='bold')
tick_params = axes[4].tick_params(axis='y', left=False)
grid = axes[4].grid(axis='y', dashes=(8,3), alpha=0.3, color='gray')
# Slope chart
for i in range(25):
    axes[5].plot([0,1], [df.col1[i], df.col2[i]], color='k')
align = ['left', 'right']
for i in range(1,3):
    axes[5].text(x = i - 1, y = 0, s = 'col' + str(i),
                 fontsize=14, fontweight='bold', ha=align[i-1])
set_title = axes[5].set_title('Slope chart', fontweight='bold')
axes[5].axis('off')
# Helper used by the plotting code above; it must be defined before that code runs
def style_func(ax, title):
    hide_spines = [ax.spines[x].set_visible(False) for x in ['left','top','right']]
    set_title = ax.set_title(title, fontweight='bold')
    set_xlim = ax.set_xlim(-25,25)
    x_locator = ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    y_locator = ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(1,26, 2)))
    spine_width = ax.spines['bottom'].set_linewidth(1.5)
    x_tick_params = ax.tick_params(axis='x', length=8, width=1.5)
    y_tick_params = ax.tick_params(axis='y', left=False)

What about a parallel coordinates plot with plotly? It lets you see the distinct values of each original array, and also whether they converge on the same difference.
https://plot.ly/python/parallel-coordinates-plot/
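As a rough illustration of the parallel coordinates idea, here is a sketch that uses pandas.plotting.parallel_coordinates (which draws with matplotlib) instead of plotly, simply to stay within the libraries already used in this thread. The column names, the binning of the difference into three classes and the fixed seed are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
arr1 = rng.uniform(1, 20, size=25)
arr2 = arr1 + rng.uniform(1, 10, size=25)

df = pd.DataFrame({'arr1': arr1, 'arr2': arr2})
df['diff'] = df['arr2'] - df['arr1']

# Colour each line by the size of its difference (three equal-width bins)
df['diff_class'] = pd.cut(df['diff'], bins=3,
                          labels=['small', 'medium', 'large']).astype(str)

fig, ax = plt.subplots()
# One line per pair of elements, crossing the arr1, arr2 and diff axes
parallel_coordinates(df, 'diff_class', cols=['arr1', 'arr2', 'diff'], ax=ax)
```

Each line shows the two raw values and their difference at once, so pairs such as (2, 5) and (15, 18) end at the same height on the diff axis but start at different heights.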

Seaborn how to add number of samples per HUE in sns.catplot

I have a catplot drawing using:
s = sns.catplot(x="type", y="val", hue="Condition", kind='box', data=df)
However, the number of samples per hue level of "Condition" is not equal:
blue has n=8 samples and green has n=11 samples.
What is the best way to add this info to the graph?

This is essentially the same solution as an earlier answer of mine, which I simplified a bit since:
df = sns.load_dataset('tips')
x_col='day'
y_col='total_bill'
order=['Thur','Fri','Sat','Sun']
hue_col='smoker'
hue_order=['Yes','No']
width=0.8
g = sns.catplot(kind="box", x=x_col, y=y_col, order=order, hue=hue_col, hue_order=hue_order, data=df)
ax = g.axes[0,0]
# get the offsets used by boxplot when hue-nesting is used
# https://github.com/mwaskom/seaborn/blob/c73055b2a9d9830c6fbbace07127c370389d04dd/seaborn/categorical.py#L367
n_levels = len(df[hue_col].unique())
each_width = width / n_levels
offsets = np.linspace(0, width - each_width, n_levels)
offsets -= offsets.mean()
pos = [x+o for x in np.arange(len(order)) for o in offsets]
counts = df.groupby([x_col,hue_col])[y_col].size()
counts = counts.reindex(pd.MultiIndex.from_product([order,hue_order]))
medians = df.groupby([x_col,hue_col])[y_col].median()
medians = medians.reindex(pd.MultiIndex.from_product([order,hue_order]))
for p,n,m in zip(pos,counts,medians):
    if not np.isnan(m):
        ax.annotate('N={:.0f}'.format(n), xy=(p, m), xycoords='data', ha='center', va='bottom')
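To make the offset computation concrete, here is the same arithmetic worked through for two hue levels and a width of 0.8, without any plotting. The number of categories (4) mirrors the example above; everything else is taken from the code as written.

```python
import numpy as np

width = 0.8
n_levels = 2      # e.g. smoker = Yes / No
n_categories = 4  # e.g. Thur, Fri, Sat, Sun

each_width = width / n_levels
offsets = np.linspace(0, width - each_width, n_levels)
offsets -= offsets.mean()  # centre the boxes around each category tick

# x position of every box: category index plus the hue offset
pos = [x + o for x in np.arange(n_categories) for o in offsets]

print(offsets)  # offsets of the two boxes relative to the tick
print(pos[:4])  # centres of the first four boxes
```

With these values the offsets are -0.2 and 0.2, so the two boxes of the first category are centred at x = -0.2 and x = 0.2, and those of the second category at 0.8 and 1.2.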

Using Hlines ruins legends in Matplotlib

I'm struggling to fix my plot legend after adding an axhline/hline at the 100 level in the graph (screenshot added).
Is there a way to do this correctly so that no information is lost in the legend, and maybe to add another hline and include it in the legend?
I'm adding the code here; maybe I'm not writing it properly.
fig, ax1 = plt.subplots(figsize = (9,6),sharex=True)
BundleFc_Outcome['Spend'].plot(kind = 'bar',color = 'blue',width = 0.4, ax = ax1,position = 1)
#
# Make the y-axis label, ticks and tick labels match the line color.
ax1.set_ylabel('SPEND', color='b', size = 18)
ax1.set_xlabel('Bundle FC',color='w',size = 18)
ax2 = ax1.twinx()
ax2.set_ylabel('ROAS', color='r',size = 18)
ax1.tick_params(axis='x', colors='w',size = 20)
ax2.tick_params(axis = 'y', colors='w',size = 20)
ax1.tick_params(axis = 'y', colors='w',size = 20)
#ax1.text()
#
ax2.axhline(100)
BundleFc_Outcome['ROAS'].plot(kind = 'bar',color = 'red',width = 0.4, ax = ax2,position = 0.25)
plt.grid()
#ax2.set_ylim(0, 4000)
ax2.set_ylim(0,300)
plt.title('ROAS & SPEND By Bundle FC',color = 'w',size= 20)
plt.legend([ax2,ax1],labels = ['SPEND','ROAS'],loc = 0)
The code gives me the following picture:
After implementing the suggestion in the comments, the picture looks like this (does not solve the problem):

You can use the bbox_to_anchor argument to set the legend location manually.
ax1.legend([ax1],labels = ['SPEND'],loc='upper right', bbox_to_anchor=(1.25,0.70))
plt.legend([ax2,ax1],labels = ['SPEND','ROAS'],loc='upper right', bbox_to_anchor=(1.25,0.70))
https://matplotlib.org/users/legend_guide.html#legend-location

I finally figured it out; it turned out to be simpler than expected, for some reason.
I even managed to add another threshold at level 2 for minimum spend.
fig, ax1 = plt.subplots(figsize = (9,6),sharex=True)
BundleFc_Outcome['Spend'].plot(kind = 'bar',color = 'blue',width = 0.4, ax = ax1,position = 1)
#
# Make the y-axis label, ticks and tick labels match the line color.
ax1.set_ylabel('SPEND', color='b', size = 18)
ax1.set_xlabel('Region',color='w',size = 18)
ax2 = ax1.twinx()
ax2.set_ylabel('ROAS', color='r',size = 18)
ax1.tick_params(axis='x', colors='w',size = 20)
ax2.tick_params(axis = 'y', colors='w',size = 20)
ax1.tick_params(axis = 'y', colors='w',size = 20)
#ax1.text()
#
BundleFc_Outcome['ROAS'].plot(kind = 'bar',color = 'red',width = 0.4, ax = ax2,position = 0.25)
plt.grid()
#ax2.set_ylim(0, 4000)
ax2.set_ylim(0,300)
plt.title('ROAS & SPEND By Region',color = 'w',size= 20)
fig.legend([ax2,ax1],labels = ['SPEND','ROAS'],loc = 0)
plt.hlines([100,20],xmin = 0,xmax = 8,color= ['r','b'])

I don't recommend using pandas' built-in plotting functions for more complex plots. Also, when asking a question it is common courtesy to provide a minimal and verifiable example. I took the liberty of simulating your problem.
Because the data is spread over two axes, we need to generate our own legend.
This can be achieved with:
import matplotlib.pyplot as plt, pandas as pd, numpy as np
from matplotlib.lines import Line2D
# generate dummy data.
X = np.random.rand(10, 2)
X[:,1] *= 1000
x = np.arange(X.shape[0]) * 2 # xticks
df = pd.DataFrame(X, columns = 'Spend Roast'.split())
# end dummy data
fig, ax1 = plt.subplots(figsize = (9,6),sharex=True)
ax2 = ax1.twinx()
# tmp axes
axes = [ax1, ax2] # setup axes
colors = plt.cm.tab20(x)
width = .5 # bar width
# generate dummy legend
elements = []
# plot data: one column per axis
for idx, col in enumerate(df.columns):
    tax = axes[idx]
    tax.bar(x + idx * width, df[col], label = col, width = width, color = colors[idx])
    # Line2D comes from matplotlib.lines; it is not an attribute of the axes
    element = Line2D([0], [0], color = colors[idx], label = col) # setup dummy label
    elements.append(element)
# desired hline (tax is now the second axis, ax2)
tax.axhline(200, color = 'red')
tax.set(xlabel = 'Bundle FC', ylabel = 'ROAST')
axes[0].set_ylabel('SPEND')
tax.legend(handles = elements)
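An alternative to building dummy handles, sketched here under the assumption that every artist is given a label: collect the handles from both axes with get_legend_handles_labels and pass them to a single legend call. The data, the labels and the threshold value are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(8)
spend = rng.random(8) * 50
roas = rng.random(8) * 300

fig, ax1 = plt.subplots(figsize=(9, 6))
ax2 = ax1.twinx()

# Every artist that should appear in the legend gets a label
ax1.bar(x - 0.2, spend, width=0.4, color='blue', label='SPEND')
ax2.bar(x + 0.2, roas, width=0.4, color='red', label='ROAS')
ax2.axhline(100, color='k', linestyle='--', label='ROAS threshold')

# Merge the legend entries of both axes into one legend
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
legend = ax2.legend(h1 + h2, l1 + l2, loc='upper left')
```

Because the horizontal line carries its own label, it shows up in the legend alongside the two bar series instead of displacing them.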

matplotlib uneven group size bar charts side-by-side

I am trying to plot groups of data which have different bar sizes and may have different group sizes. How can I group the bars that belong to the same groups (shown as the same color) so that they are side by side? (Similar to this, except the same colors should be side-by-side)
width = 0.50
groupgap=2
y1=[20,80]
y2=[60,30,10]
x1 = np.arange(len(y1))
x2 = np.arange(len(y2))+groupgap
ind = np.concatenate((x1,x2))
fig, ax = plt.subplots()
rects1 = ax.bar(x1, y1, width, color='r', ecolor= "black",label="Gender")
rects2 = ax.bar(x2, y2, width, color='b', ecolor= "black",label="Type")
ax.set_ylabel('Population',fontsize=14)
ax.set_xticks(ind)
ax.set_xticklabels(('Male', 'Female','Student', 'Faculty','Others'),fontsize=14)
ax.legend()

The idea of using a gap between the categories (groupgap) is indeed a way to go. You would just have to add the length of the first group as well:
x2 = np.arange(len(y2))+groupgap+len(y1)
Here is the complete example where I used groupgap=1:
import matplotlib.pyplot as plt
import numpy as np
width = 1
groupgap=1
y1=[20,80]
y2=[60,30,10]
x1 = np.arange(len(y1))
x2 = np.arange(len(y2))+groupgap+len(y1)
ind = np.concatenate((x1,x2))
fig, ax = plt.subplots()
rects1 = ax.bar(x1, y1, width, color='r', edgecolor= "black",label="Gender")
rects2 = ax.bar(x2, y2, width, color='b', edgecolor= "black",label="Type")
ax.set_ylabel('Population',fontsize=14)
ax.set_xticks(ind)
ax.set_xticklabels(('Male', 'Female','Student', 'Faculty','Others'),fontsize=14)
plt.show()
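The same offset logic extends to any number of groups. The helper below, group_positions, is a hypothetical name introduced for this sketch; it returns one array of x positions per group and leaves groupgap empty slots between consecutive groups.

```python
import numpy as np

def group_positions(sizes, groupgap=1):
    """Return a list of x-position arrays, one per group."""
    positions = []
    start = 0
    for n in sizes:
        positions.append(np.arange(n) + start)
        start += n + groupgap  # skip past this group plus the gap
    return positions

# Two groups of sizes 2 and 3, as in the example above
x1, x2 = group_positions([2, 3], groupgap=1)
print(x1, x2)
```

For the two groups above this reproduces x1 = [0, 1] and x2 = [3, 4, 5], the same values as np.arange(len(y2)) + groupgap + len(y1).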

Matplotlib - 2 problems. Common colorbar / labels not showing up

I finally forced the 3 plots I want into one plot with 3 subplots...now I need to add a common colorbar, preferably horizontally oriented. Also, now that I have them as subplots, I have lost the labels that were there in a previous iteration.
It seems that the examples suggest I add an axes, but I don't quite get what the numbers in the arguments are.
def plot_that_2(x_vals, y_vals, z_1_vals, z_2_vals, z_3_vals, figname, units, efficiency_or_not):
    global letter_pic_width
    plt.close() #I moved this up from the end of the file because it solved my QTagg problem
    UI = [uniformity_calc(z_1_vals), uniformity_calc(z_2_vals), uniformity_calc(z_3_vals)]
    ranges = [ str(int(np.max(z_1_vals) - np.min(z_1_vals))), str(int(np.max(z_2_vals) - np.min(z_2_vals))), str(int(np.max(z_3_vals) - np.min(z_3_vals)))]
    z_vals = [z_1_vals, z_2_vals, z_3_vals]
    fig = plt.figure(figsize = (letter_pic_width, letter_pic_width/3 ))
    ax0 = fig.add_subplot(1,3,1, aspect = 1)
    ax1 = fig.add_subplot(1,3,2, aspect = 1)
    ax2 = fig.add_subplot(1,3,3, aspect = 1)
    axenames = [ax0, ax1, ax2]
    for z_val, unif, rangenum, ax in zip(z_vals, UI, ranges, axenames):
        ax.scatter(x_vals, y_vals, c = z_val, s = 100, cmap = 'rainbow')
        if efficiency_or_not:
            ax.vmin = 0
            ax.vmax = 1
            ax.xlabel = 'Uniformity: ' + unif
        else:
            ax.xlabel = 'Uniformity: ' + unif + ' ' + rangenum + ' ppm'
    plt.savefig('./'+ figname + '.jpg', dpi = 100)

To set the xlabel, use ax.set_xlabel('Uniformity: ' + unif). See the documentation for axes for more information.
The example you linked to uses the add_axes method of a figure as an alternative to add_subplot. The documentation for figures explains what the numbers in add_axes are: "Add an axes at position rect [left, bottom, width, height] where all quantities are in fractions of figure width and height."
rect = l,b,w,h
fig.add_axes(rect)
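A small check of what those four numbers mean, as a sketch with an arbitrary rectangle: the axes created by add_axes reports the same figure fractions back through get_position().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 3))

# left, bottom, width, height, all as fractions of the figure size
rect = [0.2, 0.1, 0.6, 0.05]
cax = fig.add_axes(rect)

# bounds is (left, bottom, width, height) of the new axes
bounds = cax.get_position().bounds
print(bounds)
```

Here the axes starts 20% in from the left edge and 10% up from the bottom, and is 60% of the figure wide and 5% of the figure tall, which is a plausible shape for a horizontal colorbar.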

To answer your question about the colorbar axis, the numbers represent
[bottom_left_x_coord, bottom_left_y_coord, width, height]
An appropriate colorbar might be
# x y w h
[0.2, 0.1, 0.6, 0.05]
Here's your code, somewhat reworked which adds a colorbar:
import numpy as np
import matplotlib.pyplot as plt
WIDTH = 9
def uniformity_calc(x):
    return x.mean()
def plotter(x, y, zs, name, units, efficiency=True):
    fig, axarr = plt.subplots(1, 3, figsize=(WIDTH, WIDTH/3),
                              subplot_kw={'aspect':1})
    fig.suptitle(name)
    UI = map(uniformity_calc, zs)
    ranges = map(lambda x: int(np.max(x)-np.min(x)), zs)
    for ax, z, unif, rangenum in zip(axarr, zs, UI, ranges):
        scat = ax.scatter(x, y, c=z, s=100, cmap='rainbow')
        label = 'Uniformity: %i'%unif
        if not efficiency:
            label += ' %i ppm'%rangenum
        ax.set_xlabel(label)
    # Colorbar [left, bottom, width, height]
    cax = fig.add_axes([0.2, 0.1, 0.6, 0.05])
    cbar = fig.colorbar(scat, cax, orientation='horizontal')
    cbar.set_label('This is a colorbar')
    plt.show()
def main():
    x, y = np.meshgrid(np.arange(10), np.arange(10))
    zs = [np.random.rand(*y.shape) for _ in range(3)]
    plotter(x.flatten(), y.flatten(), zs, 'name', None)
if __name__ == "__main__":
    main()
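One caveat with a common colorbar: it is built from a single scatter, so it only describes all three panels if they share the same colour limits. Here is a hedged sketch of one way to enforce that, passing the same vmin and vmax to every scatter call (setting ax.vmin and ax.vmax as attributes on the axes, as in the question, has no effect on the colours). The limits 0 and 1, the random data and the seed are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = np.meshgrid(np.arange(10), np.arange(10))
zs = [rng.random(x.shape) for _ in range(3)]

fig, axarr = plt.subplots(1, 3, figsize=(9, 4), subplot_kw={'aspect': 1})

scats = []
for ax, z in zip(axarr, zs):
    # identical vmin/vmax means identical colour mapping in every panel
    scats.append(ax.scatter(x.ravel(), y.ravel(), c=z.ravel(), s=100,
                            cmap='rainbow', vmin=0, vmax=1))

# Passing all axes via ax= lets matplotlib make room for one shared colorbar
cbar = fig.colorbar(scats[-1], ax=list(axarr), orientation='horizontal')
cbar.set_label('shared scale')
```

Using the ax= argument instead of a hand-placed add_axes rectangle is a design choice in this sketch; either works once the colour limits agree.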
