I would like to see both the density and frequency on my histogram. For example, display density on the left side and frequency on the right side.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
def plot_histogram():
    bins = range(-11, 12, 1)
    bins_str = []
    for i in bins:
        bins_str.append(str(i)+"%")
    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist(np.clip(df.Returns, bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
    xlabels = bins_str[:]
    xlabels[-1] = "Over"
    xlabels[0] = "Under"
    N_labels = len(xlabels)
    plt.xlim([bins[0], bins[-1]])
    plt.xticks(bins)
    ax.set_xticklabels(xlabels)
    plt.title("Returns distribution")
    plt.grid(axis="y", linewidth=0.5)
plot_histogram()
I tried adding density=True in plt.hist() but it removes the count from the histogram. Is it possible to display both the frequency and density on the same histogram?
A density plot sets the heights of the bars such that the area of all the bars (taking rwidth=1 for that calculation) sums to 1. As such, the bar heights of a counting histogram get divided by (the number of values times the bar widths).
With that conversion factor, you can recalculate the counts from the density (or vice versa). The recalculation can be used to label the bars and/or set a secondary y-axis. Note that the ticks of both y axes are not aligned, so the grid only works well for one of them. (A secondary y-axis is a bit different from ax.twinx(), as the former has a fixed conversion between both y axes.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = [6.950915827194559, 0.5704464713012669, -1.655326152372283, 5.867122206816244, -1.809359944941513, -6.164821482653027, -2.538999462076397, 0.2108693568484643, -8.740600769897465, 2.121232876712331, 7.967032967032961, 10.61701196601832, 1.847419201771516, 0.6858006670780847, -2.008695652173909, 2.86991153132885, 1.703131050506168, -1.346913193356314, 3.334927671049193, -15.64688995215311, 20.00022688856367, 10.05956454173731, 2.044936877124148, 3.06513409961684, -0.9973614775725559, 1.190631873030967, -1.509991311902692, -0.3333827233664155, 1.898473282442747, 1.618299899267539, -0.1897860593512823, 1.000000000000001, 3.03501945525293, -7.646697418593529, -0.9769069279216391, -2.918403811792736, -3.90929422276739, 9.609846259653532, 3.240690674452962, 10.08973134408675, 1.98356309650054, 1.915301127899549, -0.7792207792207684, -3.308682400714091, -3.312977099236647, 19.98101265822785, 3.661973444534827, -5.770676691729326, 0.5268044012063156, -1.573767040370533, 3.234974862888484, -1.514352732634994, 6.564849624060143, 9.956794019127146, 3.232590278195024, 2.042007001166857, 1.601164483260553, -2.384737678855331, -2.731242556570068, 0.6069707315088602, 1.40561881957264, -6.805306861851957, 2.492102492102499, -3.639688275501762, 0.7958485384154335, 2.799187725631769, 0.9195966872689088, -2.366608280379856, 0.797679477882518, -3.80380434782609]
df = pd.DataFrame(x, columns=["Returns"])
bins = range(-11, 12, 1)
bins_str = [str(i) + "%" for i in bins]
fig, ax = plt.subplots(figsize=(9, 5))
values, bins, patches = ax.hist(np.clip(df["Returns"], bins[0], bins[-1]),
                                bins=bins, density=True, rwidth=0.8)
# conversion between counts and density: number of values times bin width
factor = len(df) * (bins[1] - bins[0])
ax.bar_label(patches, ['' if v == 0 else f'{v * factor:.0f}' for v in values])
xlabels = bins_str[:]
xlabels[-1] = "Over"
xlabels[0] = "Under"
ax.set_xlim([bins[0], bins[-1]])
ax.set_xticks(bins, xlabels)
ax.set_title("Returns distribution")
ax.grid(axis="y", linewidth=0.5)
secax = ax.secondary_yaxis('right', functions=(lambda y: y * factor, lambda y: y / factor))
secax.set_ylabel('counts')
ax.set_ylabel('density')
plt.show()
To have the same grid positions for both y-axes, you can copy the ticks of one and convert them to set them at the other. For the ticks to be calculated, the plot needs to be drawn once (at the end of the code). Note that the converted values are only shown with a limited number of digits.
fig.canvas.draw()
ax.set_yticks(secax.get_yticks() / factor)
plt.show()
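As a quick check of the conversion factor described above, the following sketch compares NumPy's density and count histograms directly. The random sample and the variable names are illustrative assumptions, not part of the original data.

```python
import numpy as np

# Illustrative data (an assumption): any sample works, as long as every value
# falls inside the bin range; otherwise len(values) overcounts.
rng = np.random.default_rng(0)
values = np.clip(rng.normal(0, 5, size=70), -11, 11)
bins = np.arange(-11, 12, 1)

counts, _ = np.histogram(values, bins=bins)
density, _ = np.histogram(values, bins=bins, density=True)

# With equal-width bins: density = counts / (number of values * bin width)
factor = len(values) * (bins[1] - bins[0])
recovered_counts = density * factor

print(np.allclose(recovered_counts, counts))
```

The same factor is what the two lambdas passed to secondary_yaxis apply in each direction.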
The following code plots a horizontal bar chart in a decreasing order. I would like to change the colors of the bars, so they fade out as the values decrease. In this case California will stay as it is, but Minnesota will be very light blue, almost transparent.
I know that I can manually hardcode the values in a list of colors, but is there a better way to achieve this?
x_state = df_top_states["Percent"].nlargest(10).sort_values(ascending=True).index
y_percent = df_top_states["Percent"].nlargest(10).sort_values(ascending=True).values
plt_size = plt.figure(figsize=(9,6))
plt.barh(x_state, y_percent)
plt.title("Top 10 States with the most number of accidents (2016 - 2021)", fontsize=16)
plt.ylabel("State", fontsize=13)
plt.yticks(size=13)
plt.xlabel("% of Total Accidents", fontsize=13)
plt.xticks(size=13)
plt.tight_layout()
plt.show()
You could create a list of colors with decreasing alpha from the list of percentages. Here is some example code:
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
import seaborn as sns  # to set the 'darkgrid' style
import pandas as pd
import numpy as np
sns.set_style('darkgrid')
# first, create some suitable test data
df_top_states = pd.DataFrame({"Percent": np.random.rand(20)**3},
index=["".join(np.random.choice([*'abcdef'], np.random.randint(3, 9))) for _ in range(20)])
df_top_states["Percent"] = df_top_states["Percent"] / df_top_states["Percent"].sum() * 100
df_largest10 = df_top_states["Percent"].nlargest(10).sort_values(ascending=True)
x_state = df_largest10.index
y_percent = df_largest10.values
max_perc = y_percent.max()
fig = plt.figure(figsize=(9, 6))
plt.barh(x_state, y_percent, color=[to_rgba('dodgerblue', alpha=perc / max_perc) for perc in y_percent])
plt.title("Top 10 States with the most number of accidents (2016 - 2021)", fontsize=16)
plt.ylabel("State", fontsize=13)
plt.yticks(size=13)
plt.xlabel("% of Total Accidents", fontsize=13)
plt.xticks(size=13)
plt.margins(y=0.02) # less vertical margin
plt.tight_layout()
plt.show()
PS: Note that plt.figure(...) returns a matplotlib figure, not some kind of size element.
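One caveat of alpha=perc / max_perc is that bars with very small values become nearly invisible. A minimal sketch of a variant with a lower bound on the alpha follows; the helper name fading_colors and its min_alpha parameter are hypothetical, introduced only for illustration.

```python
import numpy as np
from matplotlib.colors import to_rgba

def fading_colors(values, base_color="dodgerblue", min_alpha=0.15):
    """Return one RGBA tuple per value; alpha scales linearly with the value.

    Hypothetical helper: the largest value gets alpha 1, and a value of 0
    would get min_alpha. Assumes the largest value is positive.
    """
    values = np.asarray(values, dtype=float)
    alphas = min_alpha + (1 - min_alpha) * values / values.max()
    return [to_rgba(base_color, alpha=a) for a in alphas]

colors = fading_colors([2.0, 5.0, 10.0])
```

The resulting list can be passed straight to the bar call, for example plt.barh(x_state, y_percent, color=fading_colors(y_percent)).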
I am trying to plot a horizontal bar chart using seaborn. I'd like the y-axis to display my data with two decimal places, but only for the values that are present in the data, for example 0.96, 0.93, ... .
Here is what I have:
df=pd.read_excel('file.xlsx', sheet_name='all')
print(df['digits'])
0 0.96270
1 0.93870
2 0.93610
3 0.69610
4 0.61250
5 0.61280
6 0.52965
7 0.50520
sns.histplot(y=df['digits'])
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
Here is the output
To create a histogram where the value rounded to 2 decimals defines the bin, you can create bin edges halfway between these values. E.g. bin edges at 0.195 and 0.205 would define the bin around 0.20. You can use np.arange(-0.005, 1.01, 0.01) to create an array with these bin edges.
In order to only set tick labels at the used positions, you can use ax.set_yticks(). You can round all the y-values and use the unique values for the y ticks.
If you don't want rounding, but truncation, you could use bins=np.arange(0, 1.01, 0.01) and ax.set_yticks(np.unique(np.round(y-0.005, 2))).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])
ax = sns.histplot(y=y, bins=np.arange(-0.005, 1.01, 0.01), color='crimson')
ax.set_yticks(np.unique(np.round(y, 2)))
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.tick_params(axis='y', labelsize=6)
ax.set_xlabel("frequency", fontsize=15)
ax.set_ylabel("results", fontsize=15)
plt.show()
Note that even with a small fontsize the tick labels can overlap.
Another approach is to use a countplot on the rounded (or truncated) values. Then the bars get evenly spaced, without taking empty spots into account:
y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])
y_rounded = [f'{yi:.2f}' for yi in sorted(y)]
# y_truncated = [f'{yi - .005:.2f}' for yi in sorted(y)]
ax = sns.countplot(y=y_rounded, color='dodgerblue')
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
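The bin-edge reasoning above can be checked numerically without plotting. The sketch below uses np.histogram with the same edges and the values from the question; it is only a sanity check, not part of the original answer.

```python
import numpy as np

y = np.array([0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520])

# Edges halfway between two-decimal values: -0.005, 0.005, ..., 1.005
edges = np.arange(-0.005, 1.01, 0.01)
counts, _ = np.histogram(y, bins=edges)

# Each bin centre is a value with two decimals
centers = np.round((edges[:-1] + edges[1:]) / 2, 2)
used_centers = centers[counts > 0]

# List the occupied bins together with their counts
for center, count in zip(used_centers, counts[counts > 0]):
    print(f"{center:.2f}: {count}")
```

The occupied bin centres coincide with np.unique(np.round(y, 2)), which is why those values work as tick positions.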
This is handled by matplotlib:
ax = sns.histplot(y=np.random.randn(20))
ax.yaxis.set_major_formatter("{x:.2f}")  # the data values are on the y-axis here
ax.set_xlabel("frequency", fontsize=15)
ax.set_ylabel("results",fontsize=15)
You might want to use this:
import matplotlib.ticker as tkr
y = [0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520]
g = sns.histplot(y=y)
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
g.axes.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.2f}'))
or this:
import matplotlib.ticker as tkr
y = [0.96270, 0.93870, 0.93610, 0.69610, 0.61250, 0.61280, 0.52965, 0.50520]
g = sns.histplot(y=y, binwidth=0.01)
plt.xlabel("frequency", fontsize=15)
plt.ylabel("results",fontsize=15)
g.axes.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: f'{y:.2f}'))
binwidth=0.01:
I would like to plot my data like the following figure, showing the median in each bin together with the 25% and 75% values. [The solid line and open circles show the median values in each bin, and the broken lines show the 25% and 75% values.]
I have this sample data, and this is what I did to get a similar plot:
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 15
bins = np.linspace(min(X),max(X), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X,bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'--r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.xlabel(r'log $\delta_{5th}[Mpc^{-3}]$')
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
And I got this plot.
There is a large offset. I also changed the bin size, but I still got the large offset.
How can I plot this correctly, and how can I include the 25% and 75% values in my plot, as in the previous figure?
To also answer the other question: you can use np.percentile. I had to lower the bin number (there was a bin without data, this leads to problems with the percentile). For the logarithmic bins see my comment above:
import numpy as np
import matplotlib.pyplot as plt
from astropy.table import Table
data=Table.read('sample_data.fits')
# Sample data
X=data['density']
Y=data['lineflux']
total_bins = 10
#bins = np.linspace(min(X), max(X), total_bins)
bins = np.logspace(np.log10(0.0001), np.log10(0.1), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X, bins)
running_median = [np.median(Y[idx==k]) for k in range(total_bins)]
running_prc25 = [np.percentile(Y[idx==k], 25) for k in range(total_bins)]
running_prc75 = [np.percentile(Y[idx==k], 75) for k in range(total_bins)]
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
plt.xlabel(r'log $\delta_{5th}[Mpc^{-3}]$')
plt.ylabel('log OII[flux]')
plt.loglog()
plt.axis('tight')
plt.show()
which produces
EDIT:
To show a filled plot you may try (just relevant section shown):
fig, ax = plt.subplots()
plt.plot(X,Y,'.')
plt.plot(bins-delta/2,running_median,'-r',marker='o',fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc25,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
#plt.plot(bins-delta/2,running_prc75,'--r',marker=None,fillstyle='none',markersize=20,alpha=1)
ax.fill_between(bins-delta/2,running_prc25,running_median, facecolor='orange')
ax.fill_between(bins-delta/2,running_prc75,running_median, facecolor='orange')
which produces
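For reference, here is a minimal, self-contained sketch of per-bin medians and quartiles on logarithmic bins that skips empty bins and places each point at the geometric centre of its bin. It uses synthetic data in place of sample_data.fits, and the helper name binned_percentiles is hypothetical; it is one possible way to organise the computation, not the code that produced the figures above.

```python
import numpy as np

def binned_percentiles(x, y, edges, percentiles=(25, 50, 75)):
    """Percentiles of y in each bin of x; empty bins yield NaN."""
    # np.digitize returns i when edges[i-1] <= x < edges[i], so bin k
    # (between edges[k] and edges[k+1]) corresponds to index k + 1.
    idx = np.digitize(x, edges)
    n_bins = len(edges) - 1
    result = np.full((len(percentiles), n_bins), np.nan)
    for k in range(n_bins):
        in_bin = y[idx == k + 1]
        if in_bin.size > 0:
            result[:, k] = np.percentile(in_bin, percentiles)
    return result

# Synthetic stand-in for the FITS table (an assumption for illustration)
rng = np.random.default_rng(42)
X = 10 ** rng.uniform(-4, -1, size=500)
Y = X ** 0.5 * 10 ** rng.normal(0, 0.2, size=500)

edges = np.logspace(-4, -1, 11)              # 10 logarithmic bins
centers = np.sqrt(edges[:-1] * edges[1:])    # geometric bin centres
prc25, median, prc75 = binned_percentiles(X, Y, edges)
```

The arrays can then be drawn with plt.plot(centers, median, '-r', marker='o', fillstyle='none') and ax.fill_between(centers, prc25, prc75, facecolor='orange'), followed by plt.loglog().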
I am preparing a graph of latency percentile results. This is what my pd.DataFrame looks like:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
result = pd.DataFrame(np.random.randint(133000, size=(5,3)), columns=list('ABC'), index=[99.0, 99.9, 99.99, 99.999, 99.9999])
I am using this function (commented lines are different pyplot methods I have already tried to achieve my goal):
def plot_latency_time_bar(result):
    ind = np.arange(4)
    means = []
    stds = []
    for index, row in result.iterrows():
        means.append(np.mean([row[0]//1000, row[1]//1000, row[2]//1000]))
        stds.append(np.std([row[0]//1000, row[1]//1000, row[2]//1000]))
    plt.bar(result.index.values, means, 0.2, yerr=stds, align='center')
    plt.xlabel('Percentile')
    plt.ylabel('Latency')
    plt.xticks(result.index.values)
    # plt.xticks(ind, ('99.0', '99.9', '99.99', '99.999', '99.99999'))
    # plt.autoscale(enable=False, axis='x', tight=False)
    # plt.axis('auto')
    # plt.margins(0.8, 0)
    # plt.semilogx(basex=5)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    fig = plt.gcf()
    fig.set_size_inches(15.5, 10.5)
And here is the figure:
As you can see, the bars for all percentiles above 99.0 overlap and are completely unreadable. I would like to set a fixed distance between the ticks so that they are all evenly spaced.
Since you're using pandas, you can do all this from within that library (df below is your result DataFrame):
means = df.mean(axis=1)/1000
stds = df.std(axis=1)/1000
means.plot.bar(yerr=stds, fc='b')
# Make some room for the x-axis tick labels
plt.subplots_adjust(bottom=0.2)
plt.show()
Not wishing to take anything away from xnx's answer (which is the most elegant way to do things given that you're working in pandas, and therefore likely the best answer for you) but the key insight you're missing is that, in matplotlib, the x positions of the data you're plotting and the x tick labels are independent things. If you say:
nominalX = np.arange( 1, 6 ) ** 2
y = np.arange( 1, 6 ) ** 4
positionalX = np.arange(len(y))
plt.bar( positionalX, y ) # graph y against the positions 0..n-1
plt.gca().set(xticks=positionalX + 0.4, xticklabels=nominalX) # ...but superficially label the X values as something else
then that's different from tying positions to your nominal X values:
plt.bar( nominalX, y )
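Note that I added 0.4 to the x position of the ticks, because that's half the default width of the bars, bar( ..., width=0.8 ), so the ticks end up in the middle of the bar when the bars are edge-aligned (align='edge', the default before Matplotlib 2.0). With the current default, align='center', the bars are already centred on positionalX and the offset should be dropped.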
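Applied to the latency data from the question, the same idea gives evenly spaced bars. This is a sketch under assumptions: the random DataFrame stands in for real measurements, the bars use the centre alignment that current Matplotlib applies by default, and ddof=0 is passed so that the pandas standard deviation matches np.std (pandas defaults to ddof=1).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data with the same shape as in the question (values are random)
rng = np.random.default_rng(1)
result = pd.DataFrame(rng.integers(0, 133000, size=(5, 3)), columns=list("ABC"),
                      index=[99.0, 99.9, 99.99, 99.999, 99.9999])

means = result.mean(axis=1) / 1000
stds = result.std(axis=1, ddof=0) / 1000

# Plot against evenly spaced positions, then label them with the percentiles
positions = np.arange(len(result))
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(positions, means, width=0.6, yerr=stds)
ax.set_xticks(positions)
ax.set_xticklabels([str(p) for p in result.index])
ax.set_xlabel("Percentile")
ax.set_ylabel("Latency")
```

Call plt.show() afterwards to display the figure.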