Seaborn distplot y-axis normalisation wrong ticklabels - python

Just to note, I have already checked this question and this question.
So, I'm using distplot to draw some histograms on separate subplots:
import numpy as np
#import netCDF4 as nc # used to get p0_dict
import matplotlib.pyplot as plt
from collections import OrderedDict
import seaborn.apionly as sns
import cPickle as pickle
'''
LINK TO PICKLE
https://drive.google.com/file/d/0B8Xks3meeDq0aTFYcTZEZGFFVk0/view?usp=sharing
'''
p0_dict = pickle.load(open('/path/to/pickle/test.dat', 'r'))
fig = plt.figure(figsize = (15,10))
ax = plt.gca()
j=1
for region, val in p0_dict.iteritems():
val = np.asarray(val)
subax = plt.subplot(5,5,j)
print region
try:
sns.distplot(val, bins=11, hist=True, kde=True, rug=True,
ax = subax, color = 'k', norm_hist=True)
except Exception as Ex:
print Ex
subax.set_title(region)
subax.set_xlim(0, 1) # the data varies from 0 to 1
j+=1
plt.subplots_adjust(left = 0.06, right = 0.99, bottom = 0.07,
top = 0.92, wspace = 0.14, hspace = 0.6)
fig.text(0.5, 0.02, r'$ P(W) = 0,1 $', ha ='center', fontsize = 15)
fig.text(0.02, 0.5, '% occurrence', ha ='center',
rotation='vertical', fontsize = 15)
# obviously I'd multiply the fractional ticklabels by 100 to get
# the percentage...
plt.show()
What I expect is for the area under the KDE curve to sum to 1, and for the y axis ticklabels to reflect this. However, I get the following:
As you can see, the y axis ticklabels are not in the range [0,1], as would be expected. Turning on/off norm_hist or kde does not change this. For reference, the output with both turned off:
Just to verify:
aus = np.asarray(p0_dict['AUS'])
aus_bins = np.histogram(aus, bins=11)[0]
plt.subplot(121)
plt.hist(aus,11)
plt.subplot(122)
plt.bar(range(0,11),aus_bins.astype(np.float)/np.sum(aus_bins))
plt.show()
The y ticklabels in this case properly reflect those of a normalised histogram.
What am I doing wrong?
Thank you for your help.

The y axis is a density, not a probability. I think you are expecting the normalized histogram to show a probability mass function, where the sum the bar heights equals 1. But that's wrong; the normalization ensures that the sum of the bar heights times the bar widths equals 1. This is what ensures that the normalized histogram is comparable to the kernel density estimate, which is normalized so that the area under the curve is equal to 1.

Related

Render y-axis properly when overlaying pandas KDE and histogram

Similar questions to this have been asked before but not using these exact two plotting functions together so here we are:
I have a column from a pandas DataFrame that I am plotting both a histogram and the KDE. However, when I plot them, the y-axis is using the raw data value range instead of discrete number of samples/bin (what I want). How can I fix this? The actual plot is perfect, but the y-axis is wrong.
Data:
t2 = [140547476703.0, 113395471484.0, 158360225172.0, 105497674121.0, 186457736557.0, 153705359063.0, 36826568371.0, 200653068740.0, 190761317478.0, 126529980843.0, 98776029557.0, 132773701862.0, 14780432449.0, 167507656251.0, 121353262386.0, 136377019007.0, 134190768743.0, 218619462126.0, 07912778721.0, 215628911255.0, 147024833865.0, 94136343562.0, 135685803096.0, 165901502129.0, 45476074790.0, 125195690010.0, 113910844263.0, 123134290987.0, 112028565305.0, 93448218430.0, 07341012378.0, 93146854494.0, 132958913610.0, 102326700019.0, 196826471714.0, 122045354980.0, 76591131961.0, 134694468251.0, 120212625727.0, 108456858852.0, 106363042112.0, 193367024628.0, 39578667378.0, 178075400604.0, 155513974664.0, 132834624567.0, 137336282646.0, 125379267464.0]
Code:
fig = plt.figure()
# plot hist + kde
t2[t2.columns[0]].plot.kde(color = "maroon", label = "_nolegend_")
t2[t2.columns[0]].plot.hist(density = True, edgecolor = "grey", color = "tomato", title = t2.columns[0])
# plot mean/stdev
m = t2[t2.columns[0]].mean()
stdev = t2[t2.columns[0]].std()
plt.axvline(m, color = "black", ymax = 0.05, label = "mean")
plt.axvline(m-2*stdev, color = "black", ymax = 0.05, linestyle = ":", label = "+/- 2*Stdev")
plt.axvline(m+2*stdev, color = "black", ymax = 0.05, linestyle = ":")
plt.legend()
What it looks like now:
If you want the real counts, the you'll need to scale the KDE up by the width of the bins multiplied by the number of observations. The trickiest part is accessing the data pandas uses to plot the KDE. (I've removed parts related to the legend to simplify the problem at hand).
import matplotlib.pyplot as plt
import numpy as np
# Calculate KDE, get data
axis = t2[t2.columns[0]].plot.kde(color = "maroon", label = "_nolegend_")
xdata = axis.get_children()[0]._x
ydata = axis.get_children()[0]._y
plt.clf()
# Real figure
fig, ax = plt.subplots(figsize=(7,5))
# Plot Histogram, no density.
x = ax.hist(t2[t2.columns[0]], edgecolor = "grey", color = "tomato")
# size of the bins * N obs
scale = np.diff(x[1])[0]*len(t2)
# Plot scaled KDE
ax.plot(xdata, ydata*scale, color='blue')
ax.set_ylabel('N observations')
plt.show()

How to draw the normal distribution of a barplot with log x axis?

I'd like to draw a lognormal distribution of a given bar plot.
Here's the code
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import numpy as np; np.random.seed(1)
import scipy.stats as stats
import math
inter = 33
x = np.logspace(-2, 1, num=3*inter+1)
yaxis = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.03,0.3,0.75,1.24,1.72,2.2,3.1,3.9,
4.3,4.9,5.3,5.6,5.87,5.96,6.01,5.83,5.42,4.97,4.60,4.15,3.66,3.07,2.58,2.19,1.90,1.54,1.24,1.08,0.85,0.73,
0.84,0.59,0.55,0.53,0.48,0.35,0.29,0.15,0.15,0.14,0.12,0.14,0.15,0.05,0.05,0.05,0.04,0.03,0.03,0.03, 0.02,
0.02,0.03,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0,0]
fig, ax = plt.subplots()
ax.bar(x[:-1], yaxis, width=np.diff(x), align="center", ec='k', color='w')
ax.set_xscale('log')
plt.xlabel('Diameter (mm)', fontsize='12')
plt.ylabel('Percentage of Total Particles (%)', fontsize='12')
plt.ylim(0,8)
plt.xlim(0.01, 10)
fig.set_size_inches(12, 12)
plt.savefig("Test.png", dpi=300, bbox_inches='tight')
Resulting plot:
What I'm trying to do is to draw the Probability Density Function exactly like the one shown in red in the graph below:
An idea is to convert everything to logspace, with u = log10(x). Then draw the density histogram in there. And also calculate a kde in the same space. Everything gets drawn as y versus u. When we have u at a top twin axes, x can stay at the bottom. Both axes get aligned by setting the same xlims, but converted to logspace on the top axis. The top axis can be hidden to get the desired result.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
inter = 33
u = np.linspace(-2, 1, num=3*inter+1)
x = 10**u
us = np.linspace(u[0], u[-1], 500)
yaxis = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.03,0.3,0.75,1.24,1.72,2.2,3.1,3.9,
4.3,4.9,5.3,5.6,5.87,5.96,6.01,5.83,5.42,4.97,4.60,4.15,3.66,3.07,2.58,2.19,1.90,1.54,1.24,1.08,0.85,0.73,
0.84,0.59,0.55,0.53,0.48,0.35,0.29,0.15,0.15,0.14,0.12,0.14,0.15,0.05,0.05,0.05,0.04,0.03,0.03,0.03, 0.02,
0.02,0.03,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0,0]
yaxis = np.array(yaxis)
# reconstruct data from the given frequencies
u_data = np.repeat((u[:-1] + u[1:]) / 2, (yaxis * 100).astype(np.int))
kde = stats.gaussian_kde((u[:-1]+u[1:])/2, weights=yaxis, bw_method=0.2)
total_area = (np.diff(u)*yaxis).sum() # total area of all bars; divide by this area to normalize
fig, ax = plt.subplots()
ax2 = ax.twiny()
ax2.bar(u[:-1], yaxis, width=np.diff(u), align="edge", ec='k', color='w', label='frequencies')
ax2.plot(us, total_area*kde(us), color='crimson', label='kde')
ax2.plot(us, total_area * stats.norm.pdf(us, u_data.mean(), u_data.std()), color='dodgerblue', label='lognormal')
ax2.legend()
ax.set_xscale('log')
ax.set_xlabel('Diameter (mm)', fontsize='12')
ax.set_ylabel('Percentage of Total Particles (%)', fontsize='12')
ax.set_ylim(0,8)
xlim = np.array([0.01,10])
ax.set_xlim(xlim)
ax2.set_xlim(np.log10(xlim))
ax2.set_xticks([]) # hide the ticks at the top
plt.tight_layout()
plt.show()
PS: Apparently this also can be achieved directly without explicitly using u (at the cost of being slightly more cryptic):
x = np.logspace(-2, 1, num=3*inter+1)
xs = np.logspace(-2, 1, 500)
total_area = (np.diff(np.log10(x))*yaxis).sum() # total area of all bars; divide by this area to normalize
kde = gaussian_kde((np.log10(x[:-1])+np.log10(x[1:]))/2, weights=yaxis, bw_method=0.2)
ax.bar(x[:-1], yaxis, width=np.diff(x), align="edge", ec='k', color='w')
ax.plot(xs, total_area*kde(np.log10(xs)), color='crimson')
ax.set_xscale('log')
Note that the bandwidth set for gaussian_kde is a somewhat arbitrarily value. Larger values give a more equalized curve, smaller values keep closer to the data. Some experimentation can help.

Parasite x-axis in loglog plot

I have a graph where the x-axis is the temperature in GeV, but I also need to put a reference of the temperature in Kelvin, so I thought of putting a parasite axis with the temperature in K. Trying to follow this answer How to add a second x-axis in matplotlib , Here is the example of the code. I get a second axis at the top of my graph, but it is not the temperature in K as I need.
import numpy as np
import matplotlib.pyplot as plt
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
ax1.loglog(tt,yy)
ax1.set_xlabel('Temperature (GeV')
new_tick_locations = np.array([.2, .5, .9])
def tick_function(X):
V = X*1.16e13
return ["%.1f" % z for z in V]
ax2.set_xlim(ax1.get_xlim())
ax2.set_xticks(new_tick_locations)
ax2.set_xticklabels(tick_function(ax1Xs))
ax2.set_xlabel('Temp (Kelvin)')
plt.show()
This is what I get when I run the code.
loglog plot
I need the parasite axis be proportional to the original x-axis. And that it can be easy to read the temperature in Kelvin when anyone sees the graph. Thanks in advance.
A general purpose solution may look as follows. Since you have a non-linear scale, the idea is to find the positions of nice ticks in Kelvin, convert to GeV, set the positions in units of GeV, but label them in units of Kelvin. This sounds complicated, but the advantage is that you do not need to find the ticks yourself, just rely on matplotlib for finding them.
What this requires though is the functional dependence between the two scales, i.e. the converion between GeV and Kelvin and its inverse.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
plt.setp([ax1,ax2], xscale="log", yscale="log")
ax1.get_shared_x_axes().join(ax1, ax2)
ax1.plot(tt,yy)
ax1.set_xlabel('Temperature (GeV)')
ax2.set_xlabel('Temp (Kelvin)')
fig.canvas.draw()
# 1 GeV == 1.16 × 10^13 Kelvin
Kelvin2GeV = lambda k: k / 1.16e13
GeV2Kelvin = lambda gev: gev * 1.16e13
loc = mticker.LogLocator()
locs = loc.tick_values(*GeV2Kelvin(np.array(ax1.get_xlim())))
ax2.set_xticks(Kelvin2GeV(locs))
ax2.set_xlim(ax1.get_xlim())
f = mticker.ScalarFormatter(useOffset=False, useMathText=True)
g = lambda x,pos : "${}$".format(f._formatSciNotation('%1.10e' % GeV2Kelvin(x)))
fmt = mticker.FuncFormatter(g)
ax2.xaxis.set_major_formatter(mticker.FuncFormatter(fmt))
plt.show()
The problem appears to be the following: When you use ax2.set_xlim(ax1.get_xlim()), you are basically setting the limit of upper x-axis to be the same as that of the lower x-axis. Now if you do
print(ax1.get_xlim())
print(ax2.get_xlim())
you get for both axes the same values as
(6.309573444801943e-16, 158489319246.11108)
(6.309573444801943e-16, 158489319246.11108)
but your lower x-axis is having a logarithmic scale. When you assign the limits using ax2.set_xlim(), the limits of ax2 are the same but the scale is still linear. That's why when you set the ticks at [.2, .5, .9], these values appear as ticks on the far left of the upper x-axis as in your figure.
The solution is to set the upper x-axis also to be a logarithmic scale. This is required because your new_tick_locations corresponds to the actual values on the lower x-axis. You just want to rename these values to show the ticklabels in Kelvin. It is clear from your variable names that new_tick_locations corresponds to the new tick locations. I use some modified values of new_tick_locations to highlight the problem.
I am using scientific formatting '%.0e' because 1 GeV = 1.16e13 K and so 0.5 GeV would be a very large value with many zeros.
Below is a sample answer:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
tt = np.logspace(-14,10,100)
yy = np.logspace(-10,-2,100)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
ax1.loglog(tt,yy)
ax1.set_xlabel('Temperature (GeV)')
new_tick_locations = np.array([0.000002, 0.05, 9000])
def tick_function(X):
V = X*1.16e13
return ["%.1f" % z for z in V]
ax2.set_xscale('log') # Setting the logarithmic scale
ax2.set_xlim(ax1.get_xlim())
ax2.set_xticks(new_tick_locations)
ax2.set_xticklabels(tick_function(new_tick_locations))
ax2.xaxis.set_major_formatter(mtick.FormatStrFormatter('%.0e'))
ax2.set_xlabel('Temp (Kelvin)')
plt.show()

Polar plot - Put one grid line in bold

I am trying to make use the polar plot projection to make a radar chart. I would like to know how to put only one grid line in bold (while the others should remain standard).
For my specific case, I would like to highlight the gridline associated to the ytick "0".
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
#Variables
sespi = pd.read_csv("country_progress.csv")
labels = sespi.country
progress = sespi.progress
angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
#Concatenation to close the plots
progress=np.concatenate((progress,[progress[0]]))
angles=np.concatenate((angles,[angles[0]]))
#Polar plot
fig=plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, progress, '.--', linewidth=1, c="g")
#ax.fill(angles, progress, alpha=0.25)
ax.set_thetagrids(angles * 180/np.pi, labels)
ax.set_yticklabels([-200,-150,-100,-50,0,50,100,150,200])
#ax.set_title()
ax.grid(True)
plt.show()
The gridlines of a plot are Line2D objects. Therefore you can't make it bold. What you can do (as shown, in part, in the other answer) is to increase the linewidth and change the colour but rather than plot a new line you can do this to the specified gridline.
You first need to find the index of the y tick labels which you want to change:
y_tick_labels = [-100,-10,0,10]
ind = y_tick_labels.index(0) # find index of value 0
You can then get a list of the gridlines using gridlines = ax.yaxis.get_gridlines(). Then use the index you found previously on this list to change the properties of the correct gridline.
Using the example from the gallery as a basis, a full example is shown below:
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
ax = plt.subplot(111, projection='polar')
ax.set_rmax(2)
ax.set_rticks([0.5, 1, 1.5, 2]) # less radial ticks
ax.set_rlabel_position(-22.5) # get radial labels away from plotted line
ax.grid(True)
y_tick_labels = [-100, -10, 0, 10]
ax.set_yticklabels(y_tick_labels)
ind = y_tick_labels.index(0) # find index of value 0
gridlines = ax.yaxis.get_gridlines()
gridlines[ind].set_color("k")
gridlines[ind].set_linewidth(2.5)
plt.show()
Which gives:
It is just a trick, but I guess you could just plot a circle and change its linewidth and color to whatever could be bold for you.
For example:
import matplotlib.pyplot as plt
import numpy as np
Yline = 0
Npoints = 300
angles = np.linspace(0,360,Npoints)*np.pi/180
line = 0*angles + Yline
ax = plt.subplot(111, projection='polar')
plt.plot(angles, line, color = 'k', linewidth = 3)
plt.ylim([-1,1])
plt.grid(True)
plt.show()
In this piece of code, I plot a line using plt.plot between any point of the two vectors angles and line. The former is actually all the angles between 0 and 2*np.pi. The latter is constant, and equal to the 'height' you want to plot that line Yline.
I suggest you try to decrease and increase Npoints while having a look to the documentaion of np.linspace() in order to understand your problem with the roundness of the circle.

How to make a twin y-axis by only taking the maximum value of the line indicated from 0 to 1

I know how to plot twin y-axis but for my problem, I need to make a twin y-axis in percentages (like 0.1 = 10% to 1 = 100%) and here the maximum value or the end point of the line graph should be matched with '1' i.e.100%.
For example: x = 1,3,6,8 & y=2,7,9,17 now if we plot this we know the highest point is (8,17) then w.r.t. this point(8,17) I need to show on the twin y-axis as '1' the twin y-axis can be a dummy axis as it is just needed as a reference axis to compare the graph.
I tried using normalization method but the original plot is getting disturbed and also the y-axis points are displayed according to new twin axis not with the original y-axis values.
For reference, this is the current output:
Below is my code which I tried.
import numpy as np
import matplotlib.pyplot as plt
#below x and y points can be changed according to the user
y = np.array([0, 38, 71, 99, 123, 205])
x = np.array([0, 0.37, 0.74, 1.11, 1.48, 1.85])
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y,'b--')
yrep = y-min(y[:])
yrep = yrep/max(yrep[:])
print (yrep,"yrep")
ax2 = ax1.twinx()
ax2.plot(x, yrep, alpha=0)
plt.show()

Categories