Altering individual graphs in Pandas hist groupby plot - python

I'm plotting frequencies grouped by country in an IPython notebook using:
df['country'].hist(by=df['frequency'], bins=yr_bins)
But the resulting figure is badly formatted.
Things I'd like to be able to define:
y axis log or not
sizing of individual graphs
x axis limits
auto layout
spacing between the individual graphs so the labels don't overlap
Things I've realised so far:
the call to .hist outputs a 9x9 2d array of matplotlib.axes._subplots.AxesSubplot objects
all of these AxesSubplot objects are embedded in a single figure

Best working case so far:
For log or not: just use the keyword log=True
Sizing of individual graphs and an auto layout:
Determine the number of groups: n = len(df.groupby('country'))
Then use the combination of keywords layout=(rows, columns) and figsize=(width, height):
Hard-code the number of columns to c, and the desired width w and height h of each graph
Then use layout=(r, c) and figsize=(c * w, r * h), where r = math.ceil(n / c) is the number of rows
Setting x axis limits: get the axes array by axes = df... then loop over the axes applying set_xlim(lim)
for row in axes:
    for ax in row:
        ax.set_xlim(lim)
The spacing turned out OK, but if required then do:
plt.subplots_adjust(wspace=0.5, hspace=1)
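Putting the steps above together, a minimal sketch could look like the following. The sample data, the column names country and year, and the values of c, w, h and yr_bins are all assumptions made for illustration; only the keywords log, layout, figsize and bins come from the notes above.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 5 countries with 40 "year" values each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": np.repeat(list("ABCDE"), 40),
    "year": rng.integers(1950, 2020, size=200),
})
yr_bins = np.arange(1950, 2021, 10)  # assumed bin edges

n = df["country"].nunique()  # number of groups
c = 3                        # hard-coded number of columns
rows = math.ceil(n / c)      # rows needed to fit every group
w, h = 4, 3                  # width and height of each graph, in inches

axes = df["year"].hist(by=df["country"], bins=yr_bins, log=True,
                       layout=(rows, c), figsize=(c * w, rows * h))

# Apply the same x limits to every subplot, whatever the array shape
for ax in np.asarray(axes).ravel():
    ax.set_xlim(1950, 2020)

plt.subplots_adjust(wspace=0.5, hspace=1)
```

Flattening the axes array with ravel() avoids the nested loop and also works when the layout has a single row.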

Related

Seaborn scatterplot with varying marker sizes and informative legend

I'm trying to create a scatterplot in which the size of the markers varies according to the values of specific columns in the dataframe. This can be done with size, and the marker sizes may be scaled using sizes. However, when creating the appropriate legend, three problems arise. The following is a simple example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df_ex = pd.DataFrame({'x': np.random.normal(size=10),
                      'y1': np.random.normal(size=10),
                      'y2': 2 + np.random.normal(size=10),
                      'p': np.random.uniform(size=10),
                      'q': np.random.uniform(size=10)})
# Note that `y2` has been shifted but this is on purpose as it will be useful below.
fig, axs = plt.subplots(figsize=(12,7), tight_layout=True)
axs = sns.scatterplot(data = df_ex, x='x',y='y1', size='p',label = 'Y1')
axs = sns.scatterplot(data = df_ex, x='x',y='y2', size='q',label = 'Y2')
axs.set_xlabel('X')
axs.set_ylabel('Y')
plt.show()
First, the legend presents both the original label of the series of interest (that defines the vertical axis) and the variable that defines the size of the markers. In the example above, this is shown both as Y1 and Y2, the labels of the variables that define the vertical axis, and p and q, which define the marker size for y1 and y2, respectively. I would like to show just Y1 and Y2 in the legend, without referring to p and q, but still showing the different marker sizes and the associated values (in other words, take the same legend that the plot currently shows and delete p and q).
Second, in the legend the marker sizes are all presented in black. I would like to use the same color as the one used to show Y1 and Y2 (that correspond to the colors used in the plot).
Third, is there a way that the two series can use the same marker size standardization so a marker of a specific size represents the same underlying values across series? In the plot, each series has its own standardization that depends on the values in the columns y1 and y2 in the dataframe. However, in some cases it may be necessary for the marker size to be defined considering both series jointly, and not individually as it is currently being done.

Python 2d Ratio Plot with weighted mean trendline

Hello and thanks in advance. I am starting with a pandas dataframe and I would like to make a 2d plot with a trendline showing the weighted mean y value, with error bars for the uncertainty on the mean. The mean should be weighted by the total number of events in each bin. I start by grouping the df into a "photon" group and a "total" group, where "photon" is a subset of the total. In each bin, I am plotting the ratio of photon events to total. On the x axis and y axis I have two unrelated variables, "cluster energy" and "perimeter energy".
My attempt:
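import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from uncertainties import ufloat
#make the 2d binning and total hist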
energybins=[11,12,13,14,15,16,17,18,19,20,21,22]
ybins = [0,.125,.25,.5,.625,.75,1.,1.5,2.5]
total_hist,x,y,i = plt.hist2d(train['total_energy'].values,train['max_perimeter'].values,[energybins,ybins])
total_hist = np.array(total_hist)
#make the photon 2d hist with same bins
groups = train.groupby(['isPhoton'])
prompt_hist,x,y,i = plt.hist2d(groups.get_group(1)['total_energy'].values,groups.get_group(1)['max_perimeter'].values,bins=[energybins,ybins])
prompt_hist = np.array(prompt_hist)
ratio = np.divide(prompt_hist,total_hist,out=np.zeros_like(prompt_hist),where = total_hist!=0)
#plot the ratio
fig, ax = plt.subplots()
ratio=np.transpose(ratio)
p = ax.pcolormesh(ratio,)
for i in range(len(ratio)):
    for j in range(len(ratio[i])):
        text = ax.text(j+1, i+1, round(ratio[i, j], 2),ha="right", va="top", color="w")
ax.set_xticklabels(energybins)
ax.set_yticklabels(ybins)
plt.xlabel("Cluster Energy")
plt.ylabel("5x5 Perimeter Energy")
plt.title("Prompt Photon Fraction")
def myBinnedStat(x,v,bins):
    means,_,_ = stats.binned_statistic(x,v,'mean',bins)
    std,_ ,_= stats.binned_statistic(x,v,'std',bins)
    count,_,_ = stats.binned_statistic(x,v,'count',bins)
    return [ufloat(m,s/(c**(1./2))) for m,s,c in zip(means,std,count)]
I can then plot an errorbar plot, but I have not been able to plot the errorbar on the same axis as the pcolormesh. I was able to do this with hist2d. I am not sure why that is. I feel like there is a cleaner way to do the whole thing.
This yields a plot
pcolormesh plots each element as a unit on the x axis. That is, if you plot 8 columns, this data will span 0-8 on the x axis. However, you also redefined the x axis ticklabel so that 0-10 is labeled as 11-21.
For your errorbars, you specified x values at 11-21, or so it looks, which is where the data is plotted. Those positions are not labeled as such, since you changed the ticklabels to correspond to pcolormesh.
This discrepancy is why your two plots do not align. Instead, you could use "default" x values for errorbar or define x values for pcolormesh. For example, use:
ax.errorbar(range(11), means[0:11], yerr=uncertainties[0:11])
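The other option mentioned above, defining x values for pcolormesh, can be sketched as follows. The bin edges are taken from the question; the ratio, means and uncertainties arrays are random placeholders (assumptions) standing in for the real histogram ratio and binned statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

energybins = np.arange(11, 23)  # 12 edges, i.e. 11 bins along x
ybins = np.array([0, .125, .25, .5, .625, .75, 1., 1.5, 2.5])

# Placeholder data (assumption): one row per y bin, one column per x bin
rng = np.random.default_rng(0)
ratio = rng.uniform(size=(len(ybins) - 1, len(energybins) - 1))
means = rng.uniform(0.5, 1.5, size=len(energybins) - 1)
uncertainties = np.full_like(means, 0.1)

fig, ax = plt.subplots()
# Passing the bin edges puts the mesh in data coordinates,
# so no tick relabeling is needed
mesh = ax.pcolormesh(energybins, ybins, ratio)
fig.colorbar(mesh, ax=ax)

# Error bars at the bin centers now line up with the mesh columns
centers = 0.5 * (energybins[:-1] + energybins[1:])
ax.errorbar(centers, means, yerr=uncertainties, fmt="o-", color="w")
ax.set_xlabel("Cluster Energy")
ax.set_ylabel("5x5 Perimeter Energy")
```

With this variant the set_xticklabels and set_yticklabels calls are no longer needed, because both artists share the same coordinates.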

Matplotlib 3D Waterfall Plot with Colored Heights

I'm trying to visualise a dataset in 3D which consists of a time series (along y) of x-z data, using Python and Matplotlib.
I'd like to create a plot like the one below (which was made in Python: http://austringer.net/wp/index.php/2011/05/20/plotting-a-dolphin-biosonar-click-train/), but where the colour varies with Z - i.e. so the intensity is shown by a colormap as well as the peak height, for clarity.
An example showing the colormap in Z is (apparently made using MATLAB):
This effect can be created using the waterfall plot option in MATLAB, but I understand there is no direct equivalent of this in Python.
I have also tried using the plot_surface option in Python (below), which works ok, but I'd like to 'force' the lines running over the surface to only be in the x direction (i.e. making it look more like a stacked time series than a surface). Is this possible?
Any help or advice greatly welcomed. Thanks.
I have generated a function that replicates the matlab waterfall behaviour in matplotlib, but I don't think it is the best solution when it comes to performance.
I started from two examples in the matplotlib documentation: multicolor lines and multiple lines in a 3d plot. From these examples, the only way I found to draw a line whose color follows a colormap according to its z value is to reshape the input array so that the line is drawn in segments of 2 points, setting the color of each segment to the mean z value of its 2 points.
Thus, given the n,m input matrices X, Y and Z, the function loops over the smallest of the dimensions n and m and plots each line as in the example, in 2-point segments; the reshaping into segments uses the same code as the example.
def waterfall_plot(fig,ax,X,Y,Z):
    '''
    Make a waterfall plot
    Input:
        fig,ax : matplotlib figure and axes to populate
        Z : n,m numpy array. Must be a 2d array even if only one line should be plotted
        X,Y : n,m array
    '''
    # Set normalization to the same values for all plots
    norm = plt.Normalize(Z.min().min(), Z.max().max())
    # Check sizes to loop always over the smallest dimension
    n,m = Z.shape
    if n>m:
        X=X.T; Y=Y.T; Z=Z.T
        m,n = n,m
    for j in range(n):
        # reshape the X,Z into pairs
        points = np.array([X[j,:], Z[j,:]]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        lc = LineCollection(segments, cmap='plasma', norm=norm)
        # Set the values used for colormapping
        lc.set_array((Z[j,1:]+Z[j,:-1])/2)
        lc.set_linewidth(2) # set linewidth a little larger to see properly the colormap variation
        line = ax.add_collection3d(lc,zs=(Y[j,1:]+Y[j,:-1])/2, zdir='y') # add line to axes
    fig.colorbar(lc) # add colorbar; as the normalization is the same for all, it doesn't matter which of the lc objects we use
Therefore, plots that look like a MATLAB waterfall can be generated with the same input matrices as a matplotlib surface plot:
import numpy as np; import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from mpl_toolkits.mplot3d import Axes3D
# Generate data
x = np.linspace(-2,2, 500)
y = np.linspace(-2,2, 40)
X,Y = np.meshgrid(x,y)
Z = np.sin(X**2+Y**2)
# Generate waterfall plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
waterfall_plot(fig,ax,X,Y,Z)
ax.set_xlabel('X') ; ax.set_xlim3d(-2,2)
ax.set_ylabel('Y') ; ax.set_ylim3d(-2,2)
ax.set_zlabel('Z') ; ax.set_zlim3d(-1,1)
The function assumes that when generating the meshgrid, the x array is the longest, and by default the lines have fixed y and it is the x coordinate that varies. However, if the y dimension is larger, the matrices are transposed, generating the lines with fixed x. Thus, generating the meshgrid with the sizes inverted (len(x)=40 and len(y)=500) yields:
With a pandas DataFrame that has the x axis as the first column and each spectrum as another column:
offset=0
for c in s.columns[1:]:
    plt.plot(s.wavelength,s[c]+offset)
    offset+=.25
plt.xlim([1325,1375])

marking specific ordinates on pandas hist

I have a Pandas DataFrame of which I plot histogram of counts using DataFrame.hist(), for example
my_v['v'].hist(bins=50)
Of course, there is a grid, but I would like to add vertical lines for specific values of some ordinates, say at values of df where
w0 = 144.0
df=pd.DataFrame(w0/np.arange(1,6))
Any clue?
Thank you in advance
You need to use axvline to add vertical lines.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create some random data
np.random.seed(42)
df = pd.DataFrame(np.random.choice(list(range(200)), (100,5)), columns=list('abcde'))
Plot the histogram on the current figure. Iterate over the array to plot the vertical lines on this existing axes object.
w0 = 144.0
df['a'].hist(bins=50, color='g')
for co_ords in np.nditer(w0/np.arange(1,6)):
    plt.axvline(co_ords, color='k')
You can even vary the line-widths/y-axis span limits of the multiple vertical lines by tweaking various keyword arguments to suit your purpose.
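As a rough illustration of that last point, the sketch below shortens and thins each successive line through the ymax and linewidth keywords of axvline. The particular scaling (1/k) is an arbitrary choice made for this example, not something the original answer prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.choice(list(range(200)), (100, 5)),
                  columns=list("abcde"))

w0 = 144.0
positions = w0 / np.arange(1, 6)  # 144, 72, 48, 36, 28.8

ax = df["a"].hist(bins=50, color="g")
for k, x in enumerate(positions, start=1):
    # ymin/ymax are fractions of the axes height (0 = bottom, 1 = top)
    ax.axvline(x, ymin=0.0, ymax=1.0 / k, linewidth=3.0 / k, color="k")
```

Calling axvline on the axes object returned by hist, rather than through plt, keeps the lines on the intended plot even when several figures are open.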

matplotlib legend order horizontally first

Is there a way to make a plot legend run horizontally (left to right) instead of vertically, without specifying the number of columns (ncol=...)?
I'm plotting a varying number of lines (roughly 5-15), and I'd rather not try to calculate the optimal number of columns dynamically (i.e. the number of columns that will fit across the figure without running off, when the labels are varying). Also, when there is more than a single row, the ordering of entries goes top-down, then left-right; if it could default to horizontal, this would also be alleviated.
Related: Matplotlib legend, add items across columns instead of down
It seems at this time that matplotlib defaults to a vertical layout. Though not ideal, an option is to do the number of lines/2, as a workaround:
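import math
import numpy as np
import matplotlib.pyplot as plt
npoints = 1000
xs = [np.random.randn(npoints) for i in range(10)]
ys = [np.random.randn(npoints) for i in range(10)]
fig, ax = plt.subplots()
for i, (x, y) in enumerate(zip(xs, ys)):
    ax.scatter(x, y, label='series %d' % i)  # a label is needed for a legend entry
nlines = len(xs)
ncol = int(math.ceil(nlines/2.))
plt.legend(ncol=ncol)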
So here you would take the length of the number of lines you're plotting (via nlines = len(xs)) and then transform that into a number of columns, via ncol = int(math.ceil(nlines/2.)) and send that to plt.legend(ncol=ncol)
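For the ordering part of the question (entries running top-down before left-right), one possible workaround is to reorder the handles and labels before passing them to legend, so that the column-by-column fill reads left to right. The helper name row_major and the sample lines are made up for this sketch, and ncol still has to be chosen by hand.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np


def row_major(items, ncol):
    """Reorder items so that a legend filled column by column reads row by row."""
    return list(itertools.chain(*[items[i::ncol] for i in range(ncol)]))


fig, ax = plt.subplots()
for i in range(7):
    ax.plot(np.arange(3), np.arange(3) + i, label="line %d" % i)

ncol = 3
handles, labels = ax.get_legend_handles_labels()
legend = ax.legend(row_major(handles, ncol), row_major(labels, ncol), ncol=ncol)
```

With 7 entries and 3 columns the reordered sequence is 0, 3, 6, 1, 4, 2, 5, which the legend lays out as rows (0, 1, 2), (3, 4, 5), (6).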
