Python 2d Ratio Plot with weighted mean trendline

Hello and thanks in advance. I am starting with a pandas dataframe and I would like to make a 2d plot with a trendline showing the weighted mean y value, with error bars for the uncertainty on the mean. The mean should be weighted by the total number of events in each bin. I start by grouping the df into a "photon" group and a "total" group, where "photon" is a subset of the total. In each bin, I am plotting the ratio of photon events to total. On the x axis and y axis I have two unrelated variables, "cluster energy" and "perimeter energy".
My attempt:
#make the 2d binning and total hist
ybins = [0,.125,.25,.5,.625,.75,1.,1.5,2.5]
total_hist,x,y,i = plt.hist2d(train['total_energy'].values,train['max_perimeter'].values,[energybins,ybins])
total_hist = np.array(total_hist)
#make the photon 2d hist with same bins
groups = train.groupby(['isPhoton'])
prompt_hist,x,y,i = plt.hist2d(groups.get_group(1)['total_energy'].values,groups.get_group(1)['max_perimeter'].values,bins=[energybins,ybins])
prompt_hist = np.array(prompt_hist)
ratio = np.divide(prompt_hist,total_hist,out=np.zeros_like(prompt_hist),where = total_hist!=0)
#plot the ratio
fig, ax = plt.subplots()
p = ax.pcolormesh(ratio,)
for i in range(len(ratio)):
    for j in range(len(ratio[i])):
        text = ax.text(j+1, i+1, round(ratio[i, j], 2),ha="right", va="top", color="w")
plt.xlabel("Cluster Energy")
plt.ylabel("5x5 Perimeter Energy")
plt.title("Prompt Photon Fraction")
def myBinnedStat(x,v,bins):
    means,_,_ = stats.binned_statistic(x,v,'mean',bins)
    std,_ ,_= stats.binned_statistic(x,v,'std',bins)
    count,_,_ = stats.binned_statistic(x,v,'count',bins)
    return [ufloat(m,s/(c**(1./2))) for m,s,c in zip(means,std,count)]
I can then plot an errorbar plot, but I have not been able to plot the errorbar on the same axis as the pcolormesh. I was able to do this with hist2d. I am not sure why that is. I feel like there is a cleaner way to do the whole thing.
This yields a plot

pcolormesh plots each element as a unit on the x axis. That is, if you plot 8 columns, this data will span 0-8 on the x axis. However, you also redefined the x axis ticklabel so that 0-10 is labeled as 11-21.
For your errorbars, you specified x values at 11-21, or so it looks, which is where the data is plotted. That region is not labeled as such, because you changed the tick labels to correspond to pcolormesh.
This discrepancy is why your two plots do not align. Instead, you could use "default" x values for errorbar or define x values for pcolormesh. For example, use:
ax.errorbar(range(11), means[0:11], yerr=uncertainties[0:11])
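The other option mentioned above, defining x values for pcolormesh, could look roughly like the following. This is only a sketch on synthetic data: the arrays x, y, is_photon and the edges xedges are made up for illustration, and the trend (the count-weighted mean of the y bin centers in each x bin, with a standard error taken from the weighted spread) is one assumed reading of "weighted mean", not necessarily what the asker intended.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the dataframe columns (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(11, 21, 5000)
y = rng.uniform(0, 2.5, 5000)
is_photon = rng.random(5000) < 0.3

xedges = np.linspace(11, 21, 11)
yedges = np.array([0, .125, .25, .5, .625, .75, 1., 1.5, 2.5])

# Histogram both groups on the same edges, without drawing anything yet
total, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
photon, _, _ = np.histogram2d(x[is_photon], y[is_photon], bins=[xedges, yedges])
ratio = np.divide(photon, total, out=np.zeros_like(photon), where=total != 0)

fig, ax = plt.subplots()
# histogram2d returns shape (nx, ny); pcolormesh expects (ny, nx), hence .T
mesh = ax.pcolormesh(xedges, yedges, ratio.T)
fig.colorbar(mesh, ax=ax)

# Count-weighted mean of y per x bin (assumes no x bin is empty)
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])
n = total.sum(axis=1)
mean_y = (total * ycenters).sum(axis=1) / n
var_y = (total * (ycenters - mean_y[:, None]) ** 2).sum(axis=1) / n
err_y = np.sqrt(var_y / n)

# Same data coordinates as the mesh, so the two layers line up
ax.errorbar(xcenters, mean_y, yerr=err_y, color="w", marker="o")
ax.set_xlabel("Cluster Energy")
ax.set_ylabel("5x5 Perimeter Energy")
```

Because the mesh is drawn against the real bin edges, no tick relabeling is needed and the errorbar x values can be the bin centers directly.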


Setting specific display range for scatter plot axes in Matplotlib

I am trying to make a scatter plot showing the housing prices in Manhattan using the longitude and latitude from a data set. When I create the scatter plot, the x axis stretches to cover the extreme longitude values, although all the other values are grouped in the range -74 to -72. I don't know how to set a specific range on the x axis so that the plot shows the relevant data from the data set.
x = dataset_noise['Lon']
y = dataset_noise['Lat']
no_of_values = len(dataset_noise['Lon'])
index = np.arange(no_of_values)
plt.scatter(x, y, cmap=plt.get_cmap("jet"),linewidths=0.5,marker='.',alpha=0.2,label='Prices')
plt.title('House prices in Manhattan')
This is what I coded and the output
You can use the following two functions provided by matplotlib:
These functions accept an interval (a list or tuple with lower (L) and upper (U) limits): [L, U] or (L, U).
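The names of the two functions did not survive in the answer above. Assuming they are plt.xlim and plt.ylim (an assumption; both do accept a (lower, upper) pair), a minimal sketch with made-up coordinates might look like this. The lon and lat arrays and the chosen limits are hypothetical stand-ins for the dataset_noise columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up coordinates: a dense cluster plus two extreme outliers
rng = np.random.default_rng(1)
lon = np.concatenate([rng.uniform(-74.0, -72.0, 200), [-120.0, 10.0]])
lat = np.concatenate([rng.uniform(40.5, 41.0, 200), [35.0, 50.0]])

plt.scatter(lon, lat, linewidths=0.5, marker=".", alpha=0.2, label="Prices")
plt.xlim((-74, -72))    # tuple form (L, U)
plt.ylim([40.5, 41.0])  # list form [L, U]
plt.title("House prices in Manhattan")
```

The outliers are still in the data; they simply fall outside the displayed range.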

Cumulative histogram for 2D data in Python

My data consists of a 2-D array of masses and distances. I want to produce a plot where the x-axis is distance and the y axis is the number of data elements with distance <= x (i.e. a cumulative histogram plot). What is the most efficient way to do this with Python?
PS: the masses are irrelevant since I already have filtered by mass, so all I am trying to produce is a plot using the distance data.
Example plot below:
You can combine numpy.cumsum() and plt.step():
import matplotlib.pyplot as plt
import numpy as np
N = 15
distances = np.random.uniform(1, 4, 15).cumsum()
counts = np.random.uniform(0.5, 3, 15)
plt.step(distances, counts.cumsum())
Alternatively, plt.bar() can be used to draw a histogram, with the widths defined by the difference between successive distances. Optionally, an extra distance needs to be appended to give the last bar a width.
plt.bar(distances, counts.cumsum(), width=np.diff(distances, append=distances[-1]+1), align='edge')
plt.autoscale(enable=True, axis='x', tight=True) # make x-axis tight
Instead of appending a value, a zero could be prepended, for example, depending on the exact interpretation of the data.
plt.bar(distances, counts.cumsum(), width=-np.diff(distances, prepend=0), align='edge')
This is what I figured I can do given a 1D array of data:
counts = np.ones(len(data))
plt.step(np.sort(data), counts.cumsum())
This apparently works with duplicate elements also, as the ys will be added for each x.
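As a small self-contained check of the 1D approach above, here is a sketch with a made-up data array that contains a duplicate. The use of where="post" and of np.searchsorted to read off a single count are additions for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

data = np.array([3.0, 1.0, 2.0, 2.0, 5.0])  # made-up distances, one duplicate

xs = np.sort(data)
ys = np.arange(1, len(xs) + 1)  # same as np.ones(len(data)).cumsum()

# where="post" keeps each count constant until the next distance is reached
plt.step(xs, ys, where="post")
plt.xlabel("distance")
plt.ylabel("number of elements with distance <= x")

# Number of elements with distance <= 2.0, read directly from the sorted array
n_within = np.searchsorted(xs, 2.0, side="right")
```

At the duplicated distance the step simply rises by two, which is why duplicates need no special handling.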

Seaborn heatmap, custom tick values

I am plotting a pandas dataframe as a seaborn heatmap, and I would like to set specific y-axis ticks at specific locations.
My dataframe index has 100 rows, which correspond to a "depth" parameter, but the values in this index are not arranged at a regular interval:
I would like to set tick labels at multiples of 100. I can do this fine using:
yticks = np.linspace(10,100,10)
ylabels = np.linspace(100,1000,10)
for my dataframe, which has 100 rows with values from approx. 100 - 1000, but the result is clearly not desirable, as the positions of the tick labels do not correspond to the correct depth values (index values), only to the position in the index.
How can I produce a heatmap where the plot is warped so that the actual depth values (index values) are aligned with the ylabels I am setting?
A complicating factor for this is also that the index values are not sampled linearly...
My solution is a little bit ugly but it works for me. Suppose your depth data is in depth_list and num_ticks is the number of ticks you want:
num_ticks = 10
# the index of the position of yticks
yticks = np.linspace(0, len(depth_list) - 1, num_ticks, dtype=int)
# the content of labels of these yticks
yticklabels = [depth_list[idx] for idx in yticks]
then plot the heatmap in this way (where your data is in data):
ax = sns.heatmap(data, yticklabels=yticklabels)
When plotting with seaborn, you have to specify the xticklabels and yticklabels arguments of the heatmap function. In your case these arguments have to be lists with custom tick labels.
I have developed a solution which does what I intended, modified after liwt31's solution:
def round(n, k):
    # function to round number 'n' up/down to nearest 'k'
    # use positive k to round up
    # use negative k to round down
    return n - n % k
# note: the df.index is a series of elevation values
tick_step = 25
tick_min = int(round(data.index.min(), (-1 * tick_step))) # round down
tick_max = (int(round(data.index.max(), (1 * tick_step)))) + tick_step # round up
# the depth values for the tick labels
# I want my y tick labels to refer to these elevations,
# but with min and max values being a multiple of 25.
yticklabels = range(tick_min, tick_max, tick_step)
# the index position of the tick labels
yticks = []
for label in yticklabels:
    idx_pos = df.index.get_loc(label)
    yticks.append(idx_pos)
cmap = sns.color_palette("coolwarm", 128)
plt.figure(figsize=(30, 10))
ax1 = sns.heatmap(df, annot=False, cmap=cmap, yticklabels=yticklabels)
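Both answers above rely on the wanted label values being present in the index. When they are not, one possible alternative is to interpolate the tick positions from the depth values. The sketch below swaps in a plain matplotlib pcolormesh instead of seaborn, with made-up depths and data; it places the ticks at the right rows but does not warp the plot itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up, non-linearly sampled depth index (must be increasing for np.interp)
depths = 100 + 900 * np.linspace(0, 1, 100) ** 2
rng = np.random.default_rng(2)
data = rng.random((len(depths), 20))

# Depth values we want to label, and the centre of each row in plot units
label_values = np.arange(100, 1001, 100)
row_centers = np.arange(len(depths)) + 0.5

# Fractional row position at which each label value falls
tick_positions = np.interp(label_values, depths, row_centers)

fig, ax = plt.subplots()
ax.pcolormesh(data)  # row i spans y in [i, i + 1]
ax.set_yticks(tick_positions)
ax.set_yticklabels([str(v) for v in label_values])
ax.invert_yaxis()    # depth increasing downwards, as in a seaborn heatmap
```

With this sampling the ticks come out unevenly spaced, which reflects the uneven depth steps in the index.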

Plotting data points on where they fall in a distribution

Let's say I have a large data set that I can manipulate in some sort of analysis, for example by looking at values in a probability distribution.
Now that I have this large data set, I then want to compare known, actual data to it. Primarily, how many of the values in my data set have the same value or property with the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made is how visually the points could possibly be projected on a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plot it cumulatively along side the distribution of the generated data.
If I were to try and retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData-uncert) < i < (DefData+uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then having it all set up to where I can plot it. Any idea on how to gather this data and project this onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information but staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assuming we have a distribution x, which we divide into bins. We can compute the histogram h, using numpy.histogram. The probability to find a value in each bin is then given by h/h.sum().
Having a value v=0.77, of which we want to know the probability according to the distribution, we can find out the bin in which it would belong by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)
x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())
ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind]) # which prints 0.058
So the probability is 5.8% to sample a value in the bin around 0.77.
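One detail worth checking when applying this: with side="right" on the array of bin edges, np.searchsorted returns the index of the right edge of the bin, so the bin that actually contains the value has index ind - 1 (the plotting code further down uses pix-1 for that reason). A small sketch with a made-up five-element sample, chosen so the counts are easy to verify by hand:

```python
import numpy as np

bins = np.linspace(0, 4, 41)                  # edges 0.0, 0.1, ..., 4.0
x = np.array([0.71, 0.75, 0.79, 0.85, 1.5])   # made-up sample

h, _ = np.histogram(x, bins=bins)
prob = h / h.sum()

v = 0.77
ind = np.searchsorted(bins, v, side="right")  # index of the edge just above v
bin_index = ind - 1                           # bin with bins[i] <= v < bins[i + 1]

# np.digitize follows the same convention and gives the same bin
same_bin = np.digitize(v, bins) - 1

print(bin_index, prob[bin_index])
```

Here three of the five samples share the bin with 0.77, so prob[bin_index] is 3/5, while prob[ind] would describe the neighboring bin from 0.8 to 0.9.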
A different option would be to interpolate the histogram between the bin centers to find the probability.
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())
points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']
colors = ["k", "crimson" , "gold"]
labels = list("ABC")
kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k")
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)
for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    # plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)

How do I plot more than one set of bars per axis on a bar plot in python?

I currently use the align='edge' parameter and positive/negative widths in bar() to plot the bar data of one metric to each axis. However, if I try to plot a second set of data to one axis, it covers the first set. Is there a way for pyplot to automatically space this data correctly?
lns3 = ax[1].bar(bucket_df.index,bucket_df.original_revenue,color='c',width=-0.4,align='edge')
lns4 = ax[1].bar(bucket_df.index,bucket_df.revenue_lift,color='m',bottom=bucket_df.original_revenue,width=-0.4,align='edge')
lns5 =,bucket_df.perc_first_priced,color='grey',width=0.4,align='edge')
lns6 =,bucket_df.perc_revenue_lift,color='y',width=0.4,align='edge')
This is what it looks like when I show the plot:
The data shown in yellow completely covers the data in grey. I'd like it to be shown next to the grey data.
Is there any easy way to do this? Thanks!
The first argument to the bar() plotting method is an array of the x-coordinates for your bars. Since you pass the same x-coordinates they will all overlap. You can get what you want by staggering the bars by doing something like this:
x = np.arange(10) # define your x-coordinates
width = 0.1 # set a width for your plots
offset = 0.15 # define an offset to separate each set of bars
fig, ax = plt.subplots() # define your figure and axes objects
ax.bar(x, y1, width=width) # plot the first set of bars
ax.bar(x + offset, y2, width=width) # plot the second set of bars
Since you have a few sets of data to plot, it makes more sense to make the code a bit more concise (assume y_vals is a list containing the y-coordinates you'd like to plot, bucket_df.original_revenue, bucket_df.revenue_lift, etc.). Then your plotting code could look like this:
for i, y in enumerate(y_vals):
    ax.bar(x + i * offset, y, width=width)
If you want to plot more sets of bars you can decrease the width and offset accordingly.
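Putting the loop together into something runnable, a minimal sketch might look like this. The y_vals arrays are random placeholders for the bucket_df columns, and setting the offset equal to the bar width (so that the bars in a group touch) is a choice made here, not something the answer prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x = np.arange(10)                             # one group per x position
y_vals = [rng.random(10) for _ in range(3)]   # placeholder data sets

width = 0.25    # bar width; 3 sets * 0.25 leaves a gap between groups
offset = width  # shift each set by one bar width so bars sit side by side

fig, ax = plt.subplots()
for i, y in enumerate(y_vals):
    ax.bar(x + i * offset, y, width=width, label=f"set {i}")

# Centre the tick under each group of bars
ax.set_xticks(x + offset * (len(y_vals) - 1) / 2)
ax.set_xticklabels([str(v) for v in x])
ax.legend()
```

As long as the number of sets times the width stays below 1, neighboring groups do not run into each other.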
