How to scale histogram y-axis in million in matplotlib - python

I am plotting a histogram using matplotlib, but my y-axis range is in the millions. How can I scale the y-axis so that instead of printing 5000000 it prints 5?
Here is my code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filename = './norstar10readlength.csv'
df=pd.read_csv(filename, sep=',',header=None)
n, bins, patches = plt.hist(x=df.values, bins=10, color='#0504aa',
alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
plt.show()
And here is the plot I am generating now

An elegant solution is to apply a FuncFormatter to format y labels.
Instead of your source data, I used the following DataFrame:
Val
0 800000
1 2600000
2 6700000
3 1400000
4 1700000
5 1600000
and made a bar plot. "Ordinary" bar plot:
df.Val.plot.bar(rot=0, width=0.75);
yields a picture with original values on the y axis (1000000 to
7000000).
But if you run:
from matplotlib.ticker import FuncFormatter
def lblFormat(n, pos):
    return str(int(n / 1e6))
lblFormatter = FuncFormatter(lblFormat)
ax = df.Val.plot.bar(rot=0, width=0.75)
ax.yaxis.set_major_formatter(lblFormatter)
then y axis labels are integers (the number of millions):
So you can arrange your code something like this:
n, bins, patches = plt.hist(x=df.values, ...)
#
# Other drawing actions, up to "plt.ylim" (including)
#
ax = plt.gca()
ax.yaxis.set_major_formatter(lblFormatter)
plt.show()
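For reference, the formatter approach applied directly to a histogram can be sketched as follows. This is a minimal example on synthetic data (the CSV from the question is not available, so the random sample, the function name `millions` and the axis label are assumptions made for illustration). It uses the `g` format instead of `int()` so that a tick at 2500000 is shown as 2.5 rather than truncated to 2.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def millions(value, pos):
    """Format a tick value as a number of millions, e.g. 5000000 -> '5'."""
    return f"{value / 1e6:g}"


# Synthetic stand-in for the CSV column: enough samples that bin counts reach the millions.
rng = np.random.default_rng(0)
data = rng.normal(size=5_000_000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=10, color='#0504aa', alpha=0.7, rwidth=0.85)
ax.yaxis.set_major_formatter(FuncFormatter(millions))  # only the labels change, not the data
ax.set_xlabel('Value')
ax.set_ylabel('Frequency (millions)')
```

Calling plt.show() (or fig.savefig(...)) afterwards displays the result. The bar heights returned in n are still the raw counts; only the tick labels are rescaled.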

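You can also rescale the data itself; you just need to decide on a ratio.
For example, to turn 50000 into 5 the ratio is 5/50000, which is 0.0001 (for 5000000 to 5 it would be 0.000001).
Once you have the ratio, multiply all the y-axis values by it. Note that for a histogram the y-axis holds the bin counts, not the values in the DataFrame, so the ratio has to be applied to the counts rather than to the data.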
Hope this helps!!
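One way to apply such a ratio to the counts of a histogram is the weights argument of hist, so that every observation contributes the ratio instead of 1 to its bin. The following is only a sketch under assumptions: the data is synthetic and the variable name `scale` is made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the data in the question.
rng = np.random.default_rng(0)
data = rng.normal(size=3_000_000)

scale = 1e-6  # ratio that turns 5000000 into 5

fig, ax = plt.subplots()
# Each observation adds `scale` to its bin, so the bar heights are counts in millions.
n, bins, patches = ax.hist(data, bins=10, weights=np.full(data.shape, scale))
ax.set_ylabel('Frequency (millions)')
```

Unlike a tick formatter, this changes the returned heights n as well, so any later computation on n (such as a y-limit) works in millions.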

Related

Show density and frequency on the same histogram

I would like to see both the density and frequency on my histogram. For example, display density on the left side and frequency on the right side.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
def plot_histogram():
    bins = range(-11, 12, 1)
    bins_str = []
    for i in bins:
        bins_str.append(str(i)+"%")
    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist(np.clip(df.Returns, bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
    xlabels = bins_str[:]
    xlabels[-1] = "Over"
    xlabels[0] = "Under"
    N_labels = len(xlabels)
    plt.xlim([bins[0], bins[-1]])
    plt.xticks(bins)
    ax.set_xticklabels(xlabels)
    plt.title("Returns distribution")
    plt.grid(axis="y", linewidth=0.5)
plot_histogram()
I tried adding density=True in plt.hist() but it removes the count from the histogram. Is it possible to display both the frequency and density on the same histogram?
A density plot sets the heights of the bars such that the area of all the bars (taking rwidth=1 for that calculation) sums to 1. As such, the bar heights of a counting histogram get divided by (the number of values times the bar widths).
With that conversion factor, you can recalculate the counts from the density (or vice versa). The recalculation can be used to label the bars and/or set a secondary y-axis. Note that the ticks of both y axes are aligned, so the grid only works well for one of them. (A secondary y-axis is a bit different from ax.twinx(), as the former has a fixed conversion between both y axes).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
bins = range(-11, 12, 1)
bins_str = [str(i) + "%" for i in bins]
fig, ax = plt.subplots(figsize=(9, 5))
values, bins, patches = ax.hist(np.clip(df["Returns"], bins[0], bins[-1]),
bins=bins, density=True, rwidth=0.8)
# conversion between counts and density: number of values times bin width
factor = len(df) * (bins[1] - bins[0])
ax.bar_label(patches, ['' if v == 0 else f'{v * factor:.0f}' for v in values])
xlabels = bins_str[:]
xlabels[-1] = "Over"
xlabels[0] = "Under"
ax.set_xlim([bins[0], bins[-1]])
ax.set_xticks(bins, xlabels)
ax.set_title("Returns distribution")
ax.grid(axis="y", linewidth=0.5)
secax = ax.secondary_yaxis('right', functions=(lambda y: y * factor, lambda y: y / factor))
secax.set_ylabel('counts')
ax.set_ylabel('density')
plt.show()
To have the same grid positions for both y-axes, you can copy the ticks of one and convert them to set them at the other. For the ticks to be calculated, the plot needs to be drawn once (at the end of the code). Note that the converted values are only shown with a limited number of digits.
fig.canvas.draw()
ax.set_yticks(secax.get_yticks() / factor)
plt.show()
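The conversion factor described above can be checked numerically without plotting. This is a small sketch on synthetic values (the random sample is an assumption; the bin edges are the same unit-width bins as in the answer):

```python
import numpy as np

# Synthetic stand-in for the returns; any 1-D sample works.
rng = np.random.default_rng(1)
values = rng.normal(loc=1.0, scale=4.0, size=70)

edges = np.arange(-11, 12, 1)                      # unit-width bins from -11 to 11
clipped = np.clip(values, edges[0], edges[-1])     # fold outliers into the outer bins

counts, _ = np.histogram(clipped, bins=edges)
density, _ = np.histogram(clipped, bins=edges, density=True)

# density = counts / (number of values * bin width), so multiplying gives the counts back
factor = len(clipped) * (edges[1] - edges[0])
recovered = density * factor

print(np.allclose(recovered, counts))              # counts recovered from the density
print(np.isclose((density * np.diff(edges)).sum(), 1.0))  # total area of the density bars
```

The same factor works for the secondary axis only because all bins have equal width; with unequal bins there is no single conversion between the two axes.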

Make bars in a bar chart fade out as the values decrease

The following code plots a horizontal bar chart in a decreasing order. I would like to change the colors of the bars, so they fade out as the values decrease. In this case California will stay as it is, but Minnesota will be very light blue, almost transparent.
I know that I can manually hardcode the values in a list of colors, but is there a better way to achieve this?
x_state = df_top_states["Percent"].nlargest(10).sort_values(ascending=True).index
y_percent = df_top_states["Percent"].nlargest(10).sort_values(ascending=True).values
plt_size = plt.figure(figsize=(9,6))
plt.barh(x_state, y_percent)
plt.title("Top 10 States with the most number of accidents (2016 - 2021)", fontsize=16)
plt.ylabel("State", fontsize=13)
plt.yticks(size=13)
plt.xlabel("% of Total Accidents", fontsize=13)
plt.xticks(size=13)
plt.tight_layout()
plt.show()
You could create a list of colors with decreasing alpha from the list of percentages. Here is some example code:
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
import seaborn as sns # to set the 'darkgrid' style
import pandas as pd
import numpy as np
sns.set_style('darkgrid')
# first, create some suitable test data
df_top_states = pd.DataFrame({"Percent": np.random.rand(20)**3},
index=["".join(np.random.choice([*'abcdef'], np.random.randint(3, 9))) for _ in range(20)])
df_top_states["Percent"] = df_top_states["Percent"] / df_top_states["Percent"].sum() * 100
df_largest10 = df_top_states["Percent"].nlargest(10).sort_values(ascending=True)
x_state = df_largest10.index
y_percent = df_largest10.values
max_perc = y_percent.max()
fig = plt.figure(figsize=(9, 6))
plt.barh(x_state, y_percent, color=[to_rgba('dodgerblue', alpha=perc / max_perc) for perc in y_percent])
plt.title("Top 10 States with the most number of accidents (2016 - 2021)", fontsize=16)
plt.ylabel("State", fontsize=13)
plt.yticks(size=13)
plt.xlabel("% of Total Accidents", fontsize=13)
plt.xticks(size=13)
plt.margins(y=0.02) # less vertical margin
plt.tight_layout()
plt.show()
PS: Note that plt.figure(...) returns a matplotlib figure, not some kind of size element.
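With alpha proportional to the value, very small bars can become nearly invisible. A possible variation, sketched here with made-up percentages and a hypothetical `min_alpha` floor (neither is from the original answer), keeps the smallest bar faintly visible:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# Made-up percentages, sorted ascending as in the question.
percentages = np.array([2.1, 2.6, 3.0, 3.4, 4.2, 5.0, 6.3, 8.9, 12.5, 25.0])
labels = [f"state {i}" for i in range(len(percentages))]

min_alpha = 0.15  # hypothetical lower bound for the opacity

# Opacity grows linearly with the value and reaches 1 for the largest bar.
alphas = min_alpha + (1 - min_alpha) * percentages / percentages.max()
colors = [to_rgba('dodgerblue', alpha=a) for a in alphas]

fig, ax = plt.subplots(figsize=(9, 6))
ax.barh(labels, percentages, color=colors)
ax.set_xlabel("% of Total Accidents")
```

Setting min_alpha to 0 gives back the purely proportional fade from the answer above.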

How to format bar chart values to two decimal places, only for the values present in the data?

I am trying to plot a horizontal bar chart using seaborn. I'd like the y-axis to display my data with two decimal places, but only for the values that are present in the data, for example 0.96, 0.93, ... .
Here is what I have:
df=pd.read_excel('file.xlsx', sheet_name='all')
print(df['digits'])
0 0.96270
1 0.93870
2 0.93610
3 0.69610
4 0.61250
5 0.61280
6 0.52965
7 0.50520
sns.histplot(y=df['digits'])
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
Here is the output
To create a histogram where the value rounded to 2 decimals defines the bin, you can create bin edges halfway between these values. E.g. bin edges at 0.195 and 0.205 would define the bin around 0.20. You can use `np.arange(-0.005, 1.01, 0.01)` to create an array with these bin edges.
In order to only set tick labels at the used positions, you can use ax.set_yticks(). You can round all the y-values and use the unique values for the y ticks.
If you don't want rounding, but truncation, you could use bins=np.arange(0, 1.01, 0.01) and ax.set_yticks(np.unique(np.round(y-0.005, 2))).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])
ax = sns.histplot(y=y, bins=np.arange(-0.005, 1.01, 0.01), color='crimson')
ax.set_yticks(np.unique(np.round(y, 2)))
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.tick_params(axis='y', labelsize=6)
ax.set_xlabel("frequency", fontsize=15)
ax.set_ylabel("results", fontsize=15)
plt.show()
Note that even with a small fontsize the tick labels can overlap.
Another approach is to use a countplot on the rounded (or truncated) values. Then the bars get evenly spaced, without taking empty spots into account:
y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])
y_rounded = [f'{yi:.2f}' for yi in sorted(y)]
# y_truncated = [f'{yi - .005:.2f}' for yi in sorted(y)]
ax = sns.countplot(y=y_rounded, color='dodgerblue')
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
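To see why the half-step bin edges line up with the rounded values, the binning can be reproduced with plain NumPy. This is only an illustrative check using the eight values from the question; the variable names are made up for this sketch.

```python
import numpy as np

y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])

# Edges halfway between the two-decimal values: -0.005, 0.005, 0.015, ...
edges = np.arange(-0.005, 1.01, 0.01)
counts, _ = np.histogram(y, bins=edges)

# The center of every occupied bin is a value rounded to two decimals.
centers = (edges[:-1] + edges[1:]) / 2
occupied = np.round(centers[counts > 0], 2)

# The tick positions used above: the unique rounded data values.
ticks = np.unique(np.round(y, 2))

print(occupied)
print(ticks)
```

Both arrays contain the same positions, which is why setting the y ticks to the unique rounded values places one label on each bar.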
This is handled by matplotlib:
ax = sns.histplot(y=np.random.randn(20))
ax.yaxis.set_major_formatter("{x:.2f}")  # the data values are on the y axis when y= is passed
ax.set_xlabel("frequency", fontsize=15)
ax.set_ylabel("results",fontsize=15)
You might want to use this:
import matplotlib.ticker as tkr
y = [0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520]
g = sns.histplot(y=y)
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
g.axes.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.2f}'))
or this:
import matplotlib.ticker as tkr
y = [0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520]
g = sns.histplot(y=y, binwidth=0.01)
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
g.axes.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.2f}'))
binwidth=0.01:

How to plot median values in each bin, and to show the 25 and 75 percent value

I would like to plot my data similar to the following figure, showing the median in each bin together with the 25% and 75% values. [The solid line and open circles show the median values in each bin, and the broken lines show the 25% and 75% values.]
I have this sample data, and I did the following to get a similar plot:
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 15
bins = np.linspace(min(X),max(X), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X,bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'--r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.xlabel(r'log $\delta_{5th}[Mpc^{-3}]$')  # raw string, so that \d is not treated as an escape sequence
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
And I got this plot.
There is a large offset. I also changed the number of bins, but I still get the large offset.
How can I plot this correctly, and how can I include the 25% and 75% values in my plot, as in the previous figure?
To also answer the other question: you can use np.percentile. I had to lower the bin number (there was a bin without data, this leads to problems with the percentile). For the logarithmic bins see my comment above:
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 10
#bins = np.linspace(min(X), max(X), total_bins)
bins = np.logspace(np.log10(0.0001), np.log10(0.1), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X, bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
running_prc25 = [np.percentile(Y[idx==k], 25) for k in range(total_bins)]
running_prc75 = [np.percentile(Y[idx==k], 75) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.xlabel(r'log $\delta_{5th}[Mpc^{-3}]$')  # raw string, so that \d is not treated as an escape sequence
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
which produces
EDIT:
To show a filled plot you may try (just relevant section shown):
fig, ax = plt.subplots()
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
ax.fill_between(bins-delta/2,running_prc25,running_median, facecolor='orange')
ax.fill_between(bins-delta/2,running_prc75,running_median, facecolor='orange')
which produces
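Since sample_data.fits is not available here, the binning logic can also be sketched on synthetic data. The following is a variation, not the code of the answer above: it relies on np.digitize numbering the first bin as 1 (index 0 means "below the first edge"), skips empty bins, and places each point at the geometric midpoint of its logarithmic bin. The synthetic data and the choice of midpoint are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for density (x) and line flux (y), spanning 1e-4 .. 1e-1 in x.
rng = np.random.default_rng(2)
x = 10 ** rng.uniform(-4, -1, size=2000)
y = 10 ** (0.5 * np.log10(x) + rng.normal(0, 0.3, size=x.size))

edges = np.logspace(-4, -1, 11)      # 11 edges -> 10 logarithmic bins
idx = np.digitize(x, edges)          # idx == k means edges[k-1] <= x < edges[k]

centers, p25, p50, p75 = [], [], [], []
for k in range(1, len(edges)):
    in_bin = y[idx == k]
    if in_bin.size == 0:             # skip empty bins instead of computing on no data
        continue
    q25, q50, q75 = np.percentile(in_bin, [25, 50, 75])
    centers.append(np.sqrt(edges[k - 1] * edges[k]))   # geometric midpoint of the bin
    p25.append(q25)
    p50.append(q50)
    p75.append(q75)

fig, ax = plt.subplots()
ax.plot(x, y, '.', markersize=2, alpha=0.3)
ax.plot(centers, p50, '-r', marker='o', fillstyle='none')   # median per bin
ax.plot(centers, p25, '--r')                                 # 25% value per bin
ax.plot(centers, p75, '--r')                                 # 75% value per bin
ax.set_xscale('log')
ax.set_yscale('log')
```

Replacing the two dashed lines with ax.fill_between(centers, p25, p75, alpha=0.3) gives the filled variant.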

How to set fixed spaces between ticks in matplotlib

I am preparing a graph of latency percentile results. This is what my pd.DataFrame looks like:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
result = pd.DataFrame(np.random.randint(133000, size=(5,3)), columns=list('ABC'), index=[99.0, 99.9, 99.99, 99.999, 99.9999])
I am using this function (commented lines are different pyplot methods I have already tried to achieve my goal):
def plot_latency_time_bar(result):
    ind = np.arange(4)
    means = []
    stds = []
    for index, row in result.iterrows():
        means.append(np.mean([row[0]//1000, row[1]//1000, row[2]//1000]))
        stds.append(np.std([row[0]//1000, row[1]//1000, row[2]//1000]))
    plt.bar(result.index.values, means, 0.2, yerr=stds, align='center')
    plt.xlabel('Percentile')
    plt.ylabel('Latency')
    plt.xticks(result.index.values)
    # plt.xticks(ind, ('99.0', '99.9', '99.99', '99.999', '99.99999'))
    # plt.autoscale(enable=False, axis='x', tight=False)
    # plt.axis('auto')
    # plt.margins(0.8, 0)
    # plt.semilogx(basex=5)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    fig = plt.gcf()
    fig.set_size_inches(15.5, 10.5)
And here is the figure:
As you can see, the bars for all percentiles above 99.0 overlap and are completely unreadable. I would like to set a fixed space between ticks so that they are all evenly spaced.
Since you're using pandas, you can do all this from within that library:
# `result` is the DataFrame from the question
means = result.mean(axis=1)/1000
stds = result.std(axis=1)/1000
means.plot.bar(yerr=stds, fc='b')
# Make some room for the x-axis tick labels
plt.subplots_adjust(bottom=0.2)
plt.show()
Not wishing to take anything away from xnx's answer (which is the most elegant way to do things given that you're working in pandas, and therefore likely the best answer for you) but the key insight you're missing is that, in matplotlib, the x positions of the data you're plotting and the x tick labels are independent things. If you say:
nominalX = np.arange( 1, 6 ) ** 2
y = np.arange( 1, 6 ) ** 4
positionalX = np.arange(len(y))
plt.bar( positionalX, y ) # graph y against the numbers 1..n
plt.gca().set(xticks=positionalX + 0.4, xticklabels=nominalX) # ...but superficially label the X values as something else
then that's different from tying positions to your nominal X values:
plt.bar( nominalX, y )
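Note that I added 0.4 to the x position of the ticks because that is half the default width of the bars, bar( ..., width=0.8 ), so the ticks end up in the middle of the bars. This applies to Matplotlib versions before 2.0, where bars were aligned by their left edge by default; since 2.0 bars are centered on their x positions, so the 0.4 offset should be dropped.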
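Applied to the latency data from the question, the idea of separating positions from labels can be sketched like this. The means and standard deviations are made-up numbers used only for illustration, and the sketch assumes Matplotlib 2.0 or later, where bars are centered on their positions.

```python
import numpy as np
import matplotlib.pyplot as plt

percentiles = [99.0, 99.9, 99.99, 99.999, 99.9999]
means = [12.0, 35.0, 61.0, 88.0, 120.0]   # made-up latencies
stds = [1.5, 4.0, 6.0, 9.0, 15.0]         # made-up spreads

# Evenly spaced bar positions 0, 1, 2, ...; the percentiles are used only as labels.
positions = np.arange(len(percentiles))

fig, ax = plt.subplots()
ax.bar(positions, means, width=0.6, yerr=stds)
ax.set_xticks(positions)
ax.set_xticklabels([str(p) for p in percentiles])
ax.set_xlabel('Percentile')
ax.set_ylabel('Latency')
```

Because the bars sit at 0 to 4 rather than at 99.0 to 99.9999, they no longer overlap, at the cost of the x-axis no longer being a numeric scale.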
