Show density and frequency on the same histogram - python

I would like to see both the density and frequency on my histogram. For example, display density on the left side and frequency on the right side.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
def plot_histogram():
    bins = range(-11, 12, 1)
    bins_str = []
    for i in bins:
        bins_str.append(str(i)+"%")
    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist(np.clip(df.Returns, bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
    xlabels = bins_str[:]
    xlabels[-1] = "Over"
    xlabels[0] = "Under"
    N_labels = len(xlabels)
    plt.xlim([bins[0], bins[-1]])
    plt.xticks(bins)
    ax.set_xticklabels(xlabels)
    plt.title("Returns distribution")
    plt.grid(axis="y", linewidth=0.5)
plot_histogram()
I tried adding density=True in plt.hist() but it removes the count from the histogram. Is it possible to display both the frequency and density on the same histogram?

A density plot sets the heights of the bars such that the area of all the bars (taking rwidth=1 for that calculation) sums to 1. As such, the bar heights of a counting histogram get divided by (the number of values times the bar widths).
With that conversion factor, you can recalculate the counts from the density (or vice versa). The recalculation can be used to label the bars and/or set a secondary y-axis. Note that the ticks of both y axes are not aligned by default, so the grid only works well for one of them. (A secondary y-axis is a bit different from ax.twinx(), as the former has a fixed conversion between both y axes).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
bins = range(-11, 12, 1)
bins_str = [str(i) + "%" for i in bins]
fig, ax = plt.subplots(figsize=(9, 5))
values, bins, patches = ax.hist(np.clip(df["Returns"], bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
# conversion between counts and density: number of values times bin width
factor = len(df) * (bins[1] - bins[0])
ax.bar_label(patches, ['' if v == 0 else f'{v * factor:.0f}' for v in values])
xlabels = bins_str[:]
xlabels[-1] = "Over"
xlabels[0] = "Under"
ax.set_xlim([bins[0], bins[-1]])
ax.set_xticks(bins, xlabels)
ax.set_title("Returns distribution")
ax.grid(axis="y", linewidth=0.5)
secax = ax.secondary_yaxis('right', functions=(lambda y: y * factor, lambda y: y / factor))
secax.set_ylabel('counts')
ax.set_ylabel('density')
plt.show()
To have the same grid positions for both y-axes, you can copy the ticks of one and convert them to set them at the other. For the ticks to be calculated, the plot needs to be drawn once (at the end of the code). Note that the converted values are only shown with a limited number of digits.
fig.canvas.draw()
ax.set_yticks(secax.get_yticks() / factor)
plt.show()
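The conversion factor can also be checked numerically without any plotting. The following is only a minimal sketch: the data is random and the names `data`, `edges` and `recovered` are made up for the illustration. It assumes equal-width bins, as in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=70)            # illustrative data, not the question's returns
edges = np.arange(-4, 5, 1)           # equal-width bins from -4 to 4
clipped = np.clip(data, edges[0], edges[-1])

counts, _ = np.histogram(clipped, bins=edges)
density, _ = np.histogram(clipped, bins=edges, density=True)

# same factor as in the answer: number of values times bin width
factor = len(clipped) * (edges[1] - edges[0])
recovered = density * factor          # density converted back to counts

print(counts)
print(np.round(recovered, 6))
```

Both printed arrays should agree, and the bar areas of the density version sum to 1. With unequal bin widths a single factor would not be enough; each bar would need its own width.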

Related

Change the width of merged bins in Matplotlib and Seaborn

I have a table of grades and I want all of the bins to be drawn with the same width.
I want the bin edges to be [0,56,60,65,70,80,85,90,95,100],
so that the first bin covers 0-56, the next 56-60, and so on, all drawn with the same width.
sns.set_style('darkgrid')
newBins = [0,56,60,65,70,80,85,90,95,100]
sns.displot(data= scores , bins=newBins)
plt.xlabel('grade')
plt.xlim(0,100)
plt.xticks(newBins);
Expected output
How can I balance the width of the bins?
You need to cheat a bit. Define your own bins and name the bins with a linear range. Here is an example:
s = pd.Series(np.random.randint(100, size=100000))
bins = [-0.1, 50, 75, 95, 101]
s2 = pd.cut(s, bins=bins, labels=range(len(bins)-1))
ax = s2.astype(int).plot.hist(bins=len(bins)-1)
ax.set_xticks(np.linspace(0, len(bins)-2, len(bins)))
ax.set_xticklabels(bins)
Output:
Old answer:
Why don't you let seaborn pick the bins for you:
sns.displot(data=scores, bins='auto')
Or set the number of bins that you want:
sns.displot(data=scores, bins=10)
They will be evenly distributed
You are assigning a list to the bins argument of sns.displot(). This specifies the edges of the bins. Since these edges are not evenly spaced, the bin widths vary.
I think that you may want to use a bar plot (sns.barplot()) and not a histogram. You would need to compute how many data points are in each bin, and then plot bars without the information about which range of values each bar represents. Something like this:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
import numpy as np
# sample data
data = np.random.randint(0, 100, 200)
newBins = [0,56,60,65,70,80,85,90,95,100]
# compute bar heights
hist, _ = np.histogram(data, bins=newBins)
# plot a bar diagram
sns.barplot(x = list(range(len(hist))), y = hist)
plt.show()
It gives:
Just change the list of values that you are using as bins:
newBins = np.arange(0, 100, 1)
You can use the bins parameter of histplot, but to get an exact answer you have to use pd.cut() to create your own bins.
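np.random.seed(101)
scores = pd.Series(np.random.randint(100, size=175))
# build the DataFrame after `scores` exists; include_lowest keeps scores of 0 in the first bin
df = pd.DataFrame({'scores': scores,
                   'bins_created': pd.cut(scores, bins=[0,55,60,65,70,75,80,85,90,95,100],
                                          include_lowest=True)})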
new_data = df['bins_created'].value_counts()
plt.figure(figsize=(10,5),dpi=100)
plots = sns.barplot(x=new_data.index,y=new_data.values)
plt.xlabel('grades')
plt.ylabel('counts')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=10, xytext=(0,5),
                   textcoords='offset points')
plt.show()
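As a variation on the bar-plot answers above, the bin ranges can be written onto the x ticks so the information about each bar's range is not lost. This is only a sketch with random stand-in data; `scores`, `edges` and `labels` are illustrative names, and plain matplotlib bars are used here instead of seaborn.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = rng.integers(0, 101, size=200)   # hypothetical grades between 0 and 100
edges = [0, 56, 60, 65, 70, 80, 85, 90, 95, 100]

# count the values per (unequal) bin, then draw the counts at equally spaced positions
counts, _ = np.histogram(scores, bins=edges)
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

fig, ax = plt.subplots(figsize=(9, 4))
positions = np.arange(len(counts))
ax.bar(positions, counts, width=0.9)
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_xlabel("grade")
ax.set_ylabel("count")
```

Keep in mind that such a chart is no longer a true histogram: the bars have equal width although the bins do not, so the bar areas cannot be compared.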

How to automatically set the y-axis limits after limiting the x-axis

Let's say that I have a certain number of data sets that I want to plot together.
And then I want to zoom on a certain part (for example, using ax.set_xlim, or plt.xlim or plt.axis). When I do that it still keeps the calculated range prior to the zoom. How can I make it rescale to what is currently being shown?
For example, using
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
data_x = [d for d in range(100)]
data_y = [2*d for d in range(100)]
data_y2 = [(d-50)*(d-50) for d in range(100)]
fig = plt.figure(constrained_layout=True)
gs = gridspec.GridSpec(2, 1, figure=fig)
ax1 = fig.add_subplot(gs[0, 0])
ax1.grid()
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.scatter(data_x, data_y, s=0.5)
ax1.scatter(data_x, data_y2, s=0.5)
ax2 = fig.add_subplot(gs[1, 0])
ax2.grid()
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.scatter(data_x, data_y, s=0.5)
ax2.scatter(data_x, data_y2, s=0.5)
ax2.set_xlim(35,45)
fig.savefig('scaling.png', dpi=300)
plt.close(fig)
This generates:
As you can see, the lower plot is hard to read because the y-axis keeps the same range as the non-limited version.
I have tried using relim, autoscale or autoscale_view but that did not work. For a single data set, I could use ylim with the minimum and maximum values for that dataset. But with several data sets, I would have to look through all of them.
Is there a better way to force a recalculation of the y-axis range?
Convert the lists to numpy arrays
create a Boolean mask of data_x based on xlim_min and xlim_max
use the mask to select the relevant data points in the y data
combine the two selected y arrays
select the min and max values from the selected y values and set them as ylim
import numpy as np
import matplotlib.pyplot as plt
# use a variable for the xlim limits
xlim_min = 35
xlim_max = 45
# convert lists to arrays
data_x = np.array(data_x)
data_y = np.array(data_y)
data_y2 = np.array(data_y2)
# create a mask for the values to be plotted based on the xlims
x_mask = (data_x >= xlim_min) & (data_x <= xlim_max)
# use the mask on y arrays
y2_vals = data_y2[x_mask]
y_vals = data_y[x_mask]
# combine y arrays
y_all = np.concatenate((y2_vals, y_vals))
# get min and max y
ylim_min = y_all.min()
ylim_max = y_all.max()
# other code from op
...
# use the values to set xlim and ylim
ax2.set_xlim(xlim_min, xlim_max)
ax2.set_ylim(ylim_min, ylim_max)
Instead of using ylim and xlim, you can do x_vals = data_x[x_mask] and then plot x_vals with y_vals and y2_vals, which removes 5 lines of code.
This is similar to Matplotlib - fixing x axis scale and autoscale y axis
# use a variable for the xlim limits
xlim_min = 35
xlim_max = 45
# convert lists to arrays
data_x = np.array(data_x)
data_y = np.array(data_y)
data_y2 = np.array(data_y2)
# create a mask for the values to be plotted based on the xlims
x_mask = (data_x >= xlim_min) & (data_x <= xlim_max)
# use the mask on x
x_vals = data_x[x_mask]
# use the mask on y
y2_vals = data_y2[x_mask]
y_vals = data_y[x_mask]
# other code from op
...
# plot
ax2.scatter(x_vals, y_vals, s=0.5)
ax2.scatter(x_vals, y2_vals, s=0.5)
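If there are many data sets, the masking can be done on whatever is already drawn on the axes instead of on each array by hand. The helper below is a hypothetical sketch (`rescale_y_to_visible` is not a matplotlib function). It assumes the data was drawn with scatter() or plot() and that it contains no NaN values.

```python
import numpy as np
import matplotlib.pyplot as plt


def rescale_y_to_visible(ax, margin=0.05):
    """Set the y-limits of `ax` from the points inside its current x-limits.

    Hypothetical helper: it only looks at scatter collections and plain lines.
    """
    x_lo, x_hi = sorted(ax.get_xlim())
    visible = []
    # scatter() stores its points as (x, y) offsets on a collection
    for coll in ax.collections:
        offsets = np.asarray(coll.get_offsets(), dtype=float)
        if offsets.size:
            mask = (offsets[:, 0] >= x_lo) & (offsets[:, 0] <= x_hi)
            visible.append(offsets[mask, 1])
    # plot() stores its points on Line2D objects
    for line in ax.lines:
        xs = np.asarray(line.get_xdata(), dtype=float)
        ys = np.asarray(line.get_ydata(), dtype=float)
        visible.append(ys[(xs >= x_lo) & (xs <= x_hi)])
    ys_all = np.concatenate(visible) if visible else np.array([])
    if ys_all.size == 0:
        return  # nothing visible, keep the current limits
    y_lo, y_hi = ys_all.min(), ys_all.max()
    pad = margin * (y_hi - y_lo) if y_hi > y_lo else 1.0
    ax.set_ylim(y_lo - pad, y_hi + pad)


# same data as in the question
data_x = np.arange(100)
fig, ax = plt.subplots()
ax.scatter(data_x, 2 * data_x, s=0.5)
ax.scatter(data_x, (data_x - 50) ** 2, s=0.5)
ax.set_xlim(35, 45)
rescale_y_to_visible(ax)
```

Call the helper after every change of the x-limits; it does not hook into zoom or pan events by itself.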

peak_widths w.r.t. the x axis

I am using scipy.signal to calculate the width of the different peaks. I have 100 values at different time points. I am using the following code to find the peaks and then their widths. The problem is that it does not take the time on the x axis into account when calculating the widths.
peaks_control, _ = find_peaks(x_control, height=2100)
time_control = time[:100]
width_control = peak_widths(x_control, peaks_control, rel_height=0.9)
The output of width_control is
(array([12.84785714, 13.21299534, 13.4502381 , 12.71311143]),
array([2042.5, 2048.8, 2057.4, 2065. ]),
array([ 5.795 ,28.29469697, 51.245 , 74.17150396]),
array([18.64285714, 41.50769231, 64.6952381 , 86.88461538]))
I am using the following to put time on the x axis and show the signal, which is correct:
plt.plot(time_control, x_control)
plt.plot(time_control[peaks_control], x_control[peaks_control], "x")
#plt.plot(np.zeros_like(x_control), "--", color="gray")
#plt.xlim(time_control.tolist())
plt.title('Control')
plt.xlabel('Time (s)')
plt.ylabel('RFU')
plt.show()
I am using the following code to show the widths as well, but I am not able to put the actual time on the x axis.
plt.plot(x_control)
plt.plot(peaks_control, x_control[peaks_control], "x")
plt.hlines(*width_control[1:], color="C3")
plt.title('Control')
plt.xlabel('Time (s)')
plt.ylabel('RFU')
plt.show()
I had the same problem just now, so here's my solution (there are probably more elegant solutions but this worked for me):
peak_widths() returns the widths (in samples), height at which the widths were calculated, and the interpolated positions of left and right intersection points of a horizontal line at the respective evaluation height (also in samples).
To convert those values from samples back to our x-axis we can use scipy.interpolate.interp1d():
from scipy.signal import find_peaks, peak_widths
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import numpy as np
def index_to_xdata(xdata, indices):
    "interpolate the values from signal.peak_widths to xdata"
    ind = np.arange(len(xdata))
    f = interp1d(ind,xdata)
    return f(indices)
x = np.linspace(0, 1, 10)
y = np.sin(4*x-0.2)
peaks, _ = find_peaks(y)
widths, width_heights, left_ips, right_ips = peak_widths(y, peaks)
left_ips = index_to_xdata(x, left_ips)
right_ips = index_to_xdata(x, right_ips)
# widths are lengths, not positions, so derive them from the converted end points
widths = right_ips - left_ips
plt.plot(x,y)
plt.plot(x[peaks], y[peaks], "x")
plt.hlines(width_heights, left_ips, right_ips, color='r')
plt.xlabel('x values')
plt.ylabel('y values')
plt.show()
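The same conversion can be sketched with np.interp instead of interp1d, applied to the intersection points printed in the question. The time axis below is an assumption (0.5 s sampling starting at 10 s), since the question does not show the real `time` array. The names `left_t`, `right_t` and `widths_t` are illustrative.

```python
import numpy as np

# left/right intersection points from the question's peak_widths output (in samples)
left_ips = np.array([5.795, 28.29469697, 51.245, 74.17150396])
right_ips = np.array([18.64285714, 41.50769231, 64.6952381, 86.88461538])

# hypothetical time axis: 100 samples, 0.5 s apart, starting at t = 10 s
time = 10.0 + 0.5 * np.arange(100)
sample_index = np.arange(len(time))

# map fractional sample positions onto the time axis
left_t = np.interp(left_ips, sample_index, time)
right_t = np.interp(right_ips, sample_index, time)

# a width is a length, so take the difference of the converted end points
widths_t = right_t - left_t
print(widths_t)
```

The converted end points can then be passed to plt.hlines(width_heights, left_t, right_t) on a plot that uses time on the x axis. Taking the difference of the end points also stays correct when the time axis does not start at zero or is unevenly sampled.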

How to scale histogram y-axis in million in matplotlib

I am plotting a histogram using matplotlib but my y-axis range is in the millions. How can I scale the y-axis so that instead of printing 5000000 it will print 5?
Here is my code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filename = './norstar10readlength.csv'
df=pd.read_csv(filename, sep=',',header=None)
n, bins, patches = plt.hist(x=df.values, bins=10, color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
And here is the plot I am generating now
An elegant solution is to apply a FuncFormatter to format y labels.
Instead of your source data, I used the following DataFrame:
Val
0 800000
1 2600000
2 6700000
3 1400000
4 1700000
5 1600000
and made a bar plot. "Ordinary" bar plot:
df.Val.plot.bar(rot=0, width=0.75);
yields a picture with original values on the y axis (1000000 to 7000000).
But if you run:
from matplotlib.ticker import FuncFormatter
def lblFormat(n, pos):
    return str(int(n / 1e6))
lblFormatter = FuncFormatter(lblFormat)
ax = df.Val.plot.bar(rot=0, width=0.75)
ax.yaxis.set_major_formatter(lblFormatter)
then y axis labels are integers (the number of millions):
So you can arrange your code something like this:
n, bins, patches = plt.hist(x=df.values, ...)
#
# Other drawing actions, up to "plt.ylim" (including)
#
ax = plt.gca()
ax.yaxis.set_major_formatter(lblFormatter)
plt.show()
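Put together as a self-contained sketch, using the sample values from the answer above in a plain matplotlib bar chart. The formatter name `millions` is made up, and it uses the `g` format instead of int() so that a tick such as 2500000 is shown as 2.5 and not truncated to 2.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def millions(value, pos):
    """Tick formatter: show the value in millions, keeping fractions like 2.5."""
    return f"{value / 1e6:g}"


# the sample values used in the answer above
values = [800000, 2600000, 6700000, 1400000, 1700000, 1600000]

fig, ax = plt.subplots()
ax.bar(range(len(values)), values, width=0.75)
ax.yaxis.set_major_formatter(FuncFormatter(millions))
ax.set_ylabel("Frequency (millions)")
```

The data itself is left untouched; only the tick labels change, so it is worth stating the unit in the axis label.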
You can also modify your df itself; you just need to decide on a ratio.
For example, if you want 50000 to become 5, the ratio is 5/50000, which is 0.0001.
Once you have the ratio, multiply all the y-axis values in your DataFrame by it.
Hope this helps!!

Plotting negative values using matplotlib scatter

I want to plot scatter points corresponding to 6 different datasets over global maps of the Earth. The problem is that some of these quantities have negative values and they don't appear in the maps. I have tried to overcome this problem by taking absolute values of the data and multiplying (or taking the power of) them by some factors, but nothing seems to work the way I want. The problem is that the datasets have very different ranges. Ideally, I want them all to have the same scale so everything will be more organized, but I don't know how to do this.
I created some synthetic data to illustrate this issue
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.basemap import Basemap, addcyclic, shiftgrid
from matplotlib.pyplot import cm
np.random.seed(100)
VarReTx = np.random.uniform(low=-0.087, high=0.0798, size=(52,))
VarReTy = np.random.uniform(low=-0.076, high=0.1919, size=(52,))
VarImTx = np.random.uniform(low=-0.0331, high=0.0527, size=(52,))
VarImTy = np.random.uniform(low=-0.0311, high=0.2007, size=(52,))
eTx = np.random.uniform(low=0.0019, high=0.0612, size=(52,))
eTy = np.random.uniform(low=0.0031, high=0.0258, size=(52,))
obslat = np.array([18.62, -65.25, -13.8, -7.95, -23.77, 51.84, 40.14, 58.07,
-12.1875, -35.32, 36.37, -46.43, 40.957, -43.474, 38.2 , 37.09,
48.17, 0.6946, 13.59, 28.32, 51., -25.88, -34.43, 21.32,
-12.05, 52.27, 36.23, -12.69, 31.42, 5.21, -22.22, 36.1,
14.38, -54.5, 43.91, 61.16, 48.27, 52.07, 54.85, 45.403,
52.971, -17.57, -51.7, 18.11, 39.55, 47.595, 22.79, -37.067,
-1.2, 32.18, 51.933, 48.52])
obslong = np.array([-287.13, -64.25, -171.78, -14.38, -226.12, -339.21, -105.24,
-321.77, -263.1664, -210.64, -233.146, -308.13, -359.667, -187.607,
-77.37, -119.72, -348.72, -287.8463, -215.13, -16.43, -4.48,
-332.29, -340.77, -158., -75.33, -255.55, -219.82, -227.53,
-229.12, -52.73, -245.9, -256.16, -16.97, -201.05, -215.81,
-45.442, -117.12, -347.32, -276.77, -75.552, -201.752, -149.58,
-57.89, -66.15, -4.35, -52.677, -354.47, -12.315, -48.5,
-110.73, -10.25, -123.42, ])
fig, ([ax1, ax2], [ax3, ax4], [eax1, eax2]) = plt.subplots(3,2, figsize=(24,23))
matplotlib.rc('xtick', labelsize=12)
matplotlib.rc('ytick', labelsize=12)
plots = [ax1, ax2, ax3, ax4, eax1, eax2]
Vars = [VarReTx, VarReTy, VarImTx, VarImTy, eTx, eTy]
titles = [r'$\Delta$ ReTx', r'$\Delta$ ReTy', r'$\Delta$ ImTx', r'$\Delta$ ImTy', 'Error (X)', 'Error (Y)']
colors = iter(cm.jet(np.reshape(np.linspace(0.0, 1.0, len(plots)), ((len(plots), 1)))))
for j in range(len(plots)):
    c3 = next(colors)
    lat = np.arange(-91, 91, 0.5)
    long = np.arange(-0.1, 360.1, 0.5)
    longrid, latgrid = np.meshgrid(long, lat)
    plots[j].set_title(titles[j], fontsize=48, y=1.05)
    condmap = Basemap(projection='robin', llcrnrlat=-90, urcrnrlat=90,\
                      llcrnrlon=-180, urcrnrlon=180, resolution='c', lon_0=0, ax=plots[j])
    maplong, maplat = condmap(longrid, latgrid)
    condmap.drawcoastlines()
    condmap.drawmapboundary(fill_color='white')
    parallels = np.arange(-90, 90, 15)
    condmap.drawparallels(parallels,labels=[False,True,True,False], fontsize=15)
    x,y = condmap(obslong, obslat)
    w = []
    for m in range(obslong.size):
        w.append(Vars[j][m])
    w = np.array(w)
    condmap.scatter(x, y, s = w*1e+4, c=c3)
    r = np.linspace(np.min(Vars[j]), np.max(Vars[j]), 4)
    for n in r:
        condmap.scatter([], [], c=c3, s=n*1e+4, label=str(np.round(n, 4)))
    plots[j].legend(bbox_to_anchor=(0., -0.2, 1., .102), loc='lower left',
                    ncol=4, mode="expand", borderaxespad=0., fontsize=16, frameon = False)
plt.show()
plt.close('all')
As you can see in the map, the negative data points are not being shown. I want them all to appear in the maps, and I want all the scatter plots to use the same scale within their respective ranges. Thanks!
It looks like you are trying to map your dataset to dot size. Obviously you cannot have negative size dots, so that won't work.
Instead, you need to normalize your dataset to a strictly positive range and use those normalized values for the size parameter. A simple way to do this would be to use matplotlib.colors.Normalize(vmin, vmax), which allows you to map any values in the interval [vmin, vmax] to the interval [0,1].
If you want to have a shared scale for all your datasets, first find the global min and max, and use that to instantiate your normalization, then normalize each dataset when plotting:
datasets = [VarReTx,VarReTy,VarImTx,VarImTy,eTx,eTy]
min_val = min([d.min() for d in datasets])
max_val = max([d.max() for d in datasets])
norm = matplotlib.colors.Normalize(vmin=min_val, vmax=max_val)
plt.scatter(x,y,s=norm(VarReTx)*100) # choose an appropriate scaling factor instead of 100 to get nicely sized dots
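One detail with a plain [0, 1] normalization is that the smallest value gets size 0 and disappears. A possible workaround, sketched below under stated assumptions, is to map the normalized values onto a size range with a non-zero minimum. The helper `marker_sizes` and the limits `s_min` and `s_max` are hypothetical, and only two of the six data sets are recreated here to keep the example short.

```python
import numpy as np
from matplotlib.colors import Normalize

rng = np.random.default_rng(100)
# stand-ins for two of the question's data sets (same ranges, new random values)
datasets = {
    "VarReTx": rng.uniform(-0.087, 0.0798, 52),
    "eTx": rng.uniform(0.0019, 0.0612, 52),
}

# one shared normalization over all data sets
global_min = min(d.min() for d in datasets.values())
global_max = max(d.max() for d in datasets.values())
norm = Normalize(vmin=global_min, vmax=global_max)


def marker_sizes(values, s_min=10.0, s_max=300.0):
    """Map values to marker areas between s_min and s_max (hypothetical helper)."""
    scaled = np.asarray(norm(values))      # values in [0, 1]
    return s_min + scaled * (s_max - s_min)


sizes = {name: marker_sizes(values) for name, values in datasets.items()}
```

The resulting arrays can be passed as the s argument of scatter. For the legend, the same helper can be applied to the representative values so that the legend markers match the plotted ones.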
