Seaborn histplot uses weird y axis limits?

Seaborn histplot uses weird y axis limits? - python

I'm iterating through all columns of my df to plot their densities to see if and how I need to transform/normalize my data. I'm using Seaborn and this code:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40))
fig.tight_layout() #othwerwise the plots overlapped each other and I couldn't see the column names
for i, column in enumerate(df.columns):
sns.histplot(df[column],ax=axes[i//n_cols,i%n_cols], kde=True, legend=True, fmt='g')
This results in a mostly okay graph, however the scaling of the y axis is waaay too big in some cases:
City 3 and 4 are just fine, however, the highest Count for City 4 is at around 200, yet the plot scales y until 10 000, which makes the data hard to interpret. The x axis also goes way beyond where it should, as the highest cost is at about 1000000, but the plot goes until 25000000. When I plot City 4 separately and force a ylim of 200 and xlim of 1000000 I get a much more understandable plot:
Why is the y axis (and actually, the x axis also) scaled so weirdly, and how can I change my code to scale it down so that I don't get a ylim much higher than the actually displayed data?
Thank you!

Set the shared_yaxis to False.
This will get the subplots to plot at the respective maximum points of the corresponding data.
Example:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40), sharey=False)

Related

How to Generate Two Separate Y-Axes For A Histogram on the Same Figure In Seaborn

I'd like to generate a single figure that has two y axes: Count (from the histogram) and Density (from the KDE).
I want to use sns.displot in Seaborn >= v 0.11.
import seaborn as sns
df = sns.load_dataset('tips')
# graph 1: This should be the Y-Axis on the left side of the figure
sns.displot(df['total_bill'], kind='hist', bins=10)
# graph 2: This should be the Y-axis on the right side of the figure
sns.displot(df['total_bill'], kind='kde')
The code I've written generates two separate graphs; I could just use a facet grid for two separate graphs, but I want to be more concise and place the two y-axes on the two separate grids into a single figure sharing the same x-axis.

displot() is a figure-level function, which can create multiple subplots inside a figure. As such, you don't have control over individual axes.
To create combined plots, you can use the underlying axes-level functions: histplot() and kdeplot() for Seaborn v.0.11. These functions accept an ax= parameter. twinx() creates a second y-axis.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('tips')
fig, ax = plt.subplots()
sns.histplot(df['total_bill'], bins=10, ax=ax)
ax2 = ax.twinx()
sns.kdeplot(df['total_bill'], ax=ax2)
plt.tight_layout()
plt.show()
Edit:
As mentioned in the comments, the y-axes aren't aligned. The left axis only tells something about the histogram. E.g. the highest bin having height 68 means that there are exactly 68 total bills between 12.618 and 17.392. The right axis only tells something about the kde. E.g. a y-value of 0.043 for x=20 would mean there is about 4.3 % probability that the total bill would be between 19.5 and 20.5.
To align both similar to sns.histplot(..., kde=True), the area of the histogram can be calculated (bin width times number of data values) and used as a scaling factor. Such scaling would make the area of the histogram and the area below the kde curve equal when measured in pixels:
num_bins = 10
bin_width = (df['total_bill'].max() - df['total_bill'].min()) / num_bins
hist_area = len(df) * bin_width
ax2.set_ylim(ymax=ax.get_ylim()[1] / hist_area)
Note that the right axis would be more similar to a percentage if the histogram would use a bin width with a power of ten (e.g. sns.histplot(..., bins=np.arange(0, df['total_bill'].max()+10, 10)). Which bins would be most suitable strongly depends on how you want to interpret your data.

How to center the histogram bars around tick marks using seaborn displot? Stacking bars is essential

I have searched many ways of making histograms centered around tick marks but not able to find a solution that works with seaborn displot. The function displot lets me stack the histogram according to a column in the dataframe and thus would prefer a solution using displot or something that allows stacking based on a column in a data frame with color-coding as with palette.
Even after setting the tick values, I am not able to get the bars to center around the tick marks.
Example code
# Center the histogram on the tick marks
tips = sns.load_dataset('tips')
sns.displot(x="total_bill",
hue="day", multiple = 'stack', data=tips)
plt.xticks(np.arange(0, 50, 5))
I would also like to plot a histogram of a variable that takes a single value and choose the bin width of the resulting histogram in such a way that it is centered around the value. (0.5 in this example.)
I can get the center point by choosing the number of bins equal to a number of tick marks but the resulting bar is very thin. How can I increase the bin size in this case, where there is only one bar but want to display all the other possible points. By displaying all the tick marks, the bar width is very tiny.
I want the same centering of the bar at the 0.5 tick mark but make it wider as it is the only value for which counts are displayed.
Any solutions?
tips['single'] = 0.5
sns.displot(x='single',
hue="day", multiple = 'stack', data=tips, bins = 10)
plt.xticks(np.arange(0, 1, 0.1))
Edit:
Would it be possible to have more control over the tick marks in the second case? I would not want to display the round off to 1 decimal place but chose which of the tick marks to display. Is it possible to display just one value in the tick mark and have it centered around that?
Does the min_val and max_val in this case refer to value of the variable which will be 0 in this case and then the x axis would be plotted on negative values even when there are none and dont want to display them.

For your first problem, you may want to figure out a few properties of the data that your plotting. For example the range of the data. Additionally, you may want to choose beforehand the number of bins that you want displayed.
tips = sns.load_dataset('tips')
min_val = tips.total_bill.min()
max_val = tips.total_bill.max()
val_width = max_val - min_val
n_bins = 10
bin_width = val_width/n_bins
sns.histplot(x="total_bill",
hue="day", multiple = 'stack', data=tips,
bins=n_bins, binrange=(min_val, max_val),
palette='Paired')
plt.xlim(0, 55) # Define x-axis limits
Another thing to remember is that width a of a bar in a histogram identifies the bounds of its range. So a bar spanning [2,5] on the x-axis implies that the values represented by that bar belong to that range.
Considering this, it is easy to formulate a solution. Assume that we want the original bar graphs - identifying the bounds of each bar graph, one solution may look like
plt.xticks(np.arange(min_val-bin_width, max_val+bin_width, bin_width))
Now, if we offset the ticks by half a bin-width, we will get to the centers of the bars.
plt.xticks(np.arange(min_val-bin_width/2, max_val+bin_width/2, bin_width))
For your single value plot, the idea remains the same. Control the bin_width and the x-axis range and ticks. Bin-width has to be controlled explicitly since automatic inference of bin-width will probably be 1 unit wide which on the plot will have no thickness. Histogram bars always indicate a range - even though when we have just one single value. This is illustrated in the following example and figure.
single_val = 23.5
tips['single'] = single_val
bin_width = 4
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(12,4)) # Get 2 subplots
# Case 1 - With the single value as x-tick label on subplot 0
sns.histplot(x='single',
hue="day", multiple = 'stack', data=tips,
binwidth=bin_width, binrange=(single_val-bin_width, single_val+bin_width),
palette='rocket',
ax=axs[0])
ticks = [single_val, single_val+bin_width] # 2 ticks - given value and given_value + width
axs[0].set(
title='Given value as tick-label starts the bin on x-axis',
xticks=ticks,
xlim=(0, int(single_val*2)+bin_width)) # x-range such that bar is at middle of x-axis
axs[0].xaxis.set_major_formatter(FormatStrFormatter('%.1f'))
# Case 2 - With centering on the bin starting at single-value on subplot 1
sns.histplot(x='single',
hue="day", multiple = 'stack', data=tips,
binwidth=bin_width, binrange=(single_val-bin_width, single_val+bin_width),
palette='rocket',
ax=axs[1])
ticks = [single_val+bin_width/2] # Just the bin center
axs[1].set(
title='Bin centre is offset from single_value by bin_width/2',
xticks=ticks,
xlim=(0, int(single_val*2)+bin_width) ) # x-range such that bar is at middle of x-axis
axs[1].xaxis.set_major_formatter(FormatStrFormatter('%.1f'))
Output:
I feel from your description that what you are really implying by a bar graph is a categorical bar graph. The centering is then automatic. Because the bar is not a range anymore but a discrete category. For the numeric and continuous nature of the variable in the example data, I would not recommend such an approach. Pandas provides for plotting categorical bar plots. See here. For our example, one way to do this is as follows:
n_colors = len(tips['day'].unique()) # Get number of uniques categories
agg_df = tips[['single', 'day']].groupby(['day']).agg(
val_count=('single', 'count'),
val=('single','max')
).reset_index() # Get aggregated information along the categories
agg_df.pivot(columns='day', values='val_count', index='val').plot.bar(
stacked=True,
color=sns.color_palette("Paired", n_colors), # Choose "number of days" colors from palette
width=0.05 # Set bar width
)
plt.show()
This yields:

Matplotlib: Plot on double y-axis plot misaligned

I'm trying to plot two datasets into one plot with matplotlib. One of the two plots is misaligned by 1 on the x-axis.
This MWE pretty much sums up the problem. What do I have to adjust to bring the box-plot further to the left?
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
titles = ["nlnd", "nlmd", "nlhd", "mlnd", "mlmd", "mlhd", "hlnd", "hlmd", "hlhd"]
plotData = pd.DataFrame(np.random.rand(25, 9), columns=titles)
failureRates = pd.DataFrame(np.random.rand(9, 1), index=titles)
color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue',
'caps': 'Gray'}
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx()
plotData.plot.box(ax=ax1, color=color, sym='+')
failureRates.plot(ax=ax2, color='b', legend=False)
ax1.set_ylabel('Seconds')
ax2.set_ylabel('Failure Rate in %')
plt.xlim(-0.7, 8.7)
ax1.set_xticks(range(len(titles)))
ax1.set_xticklabels(titles)
fig.tight_layout()
fig.show()
Actual result. Note that its only 8 box-plots instead of 9 and that they're starting at index 1.

The issue is a mismatch between how box() and plot() work - box() starts at x-position 1 and plot() depends on the index of the dataframe (which defaults to starting at 0). There are only 8 plots because the 9th is being cut off since you specify plt.xlim(-0.7, 8.7). There are several easy ways to fix this, as #Sheldore's answer indicates, you can explicitly set the positions for the boxplot. Another way you can do this is to change the indexing of the failureRates dataframe to start at 1 in construction of the dataframe, i.e.
failureRates = pd.DataFrame(np.random.rand(9, 1), index=range(1, len(titles)+1))
note that you need not specify the xticks or the xlim for the question MCVE, but you may need to for your complete code.

You can specify the positions on the x-axis where you want to have the box plots. Since you have 9 boxes, use the following which generates the figure below
plotData.plot.box(ax=ax1, color=color, sym='+', positions=range(9))

Plot two datasets at same position based on their index

I'm trying to plot two datasets (called Height and Temperature) on different y axes.
Both datasets have the same length.
Both datasets are linked together by a third dataset, RH.
I have tried to use matplotlib to plot the data using twiny() but I am struggling to align both datasets together on the same plot.
Here is the plot I want to align.
The horizontal black line on the figure is defined as the 0°C degree line that was found from Height and was used to test if both datasets, when plotted, would be aligned. They do not. There is a noticable difference between the black line and the 0°C tick from Temperature.
Rather than the two y axes changing independently from each other I would like to plot each index from Height and Temperature at the same y position on the plot.
Here is the code that I used to create the plot:
#Define number of subplots sharing y axis
f, ax1 = plt.subplots()
ax1.minorticks_on()
ax1.grid(which='major',axis='both',c='grey')
#Set axis parameters
ax1.set_ylabel('Height $(km)$')
ax1.set_ylim([np.nanmin(Height), np.nanmax(Height)])
#Plot RH
ax1.plot(RH, Height, label='Original', lw=0.5)
ax1.set_xlabel('RH $(\%)$')
ax2 = ax1.twinx()
ax2.plot(RH, Temperature, label='Original', lw=0.5, c='black')
ax2.set_ylabel('Temperature ($^\circ$C)')
ax2.set_ylim([np.nanmin(Temperature), np.nanmax(Temperature)])
Any help on this would be amazing. Thanks.

Maybe the atmosphere is wrong. :)
It sounds like you are trying to align the two y axes at particular values. Why are you doing this? The relationship of Height vs. Temperature is non-linear, so I think you are setting the stage for a confusing graph. Any particular line you plot can only be interpreted against one vertical axis.
If needed, I think you will be forced to "do some math" on the limits of the y axes. This link may be helpful:
align scales

Setting both axes logarithmic in bar plot matploblib

I have already binned data to plot a histogram. For this reason I'm using the plt.bar() function. I'd like to set both axes in the plot to a logarithmic scale.
If I set plt.bar(x, y, width=10, color='b', log=True) which lets me set the y-axis to log but I can't set the x-axis logarithmic.
I've tried plt.xscale('log') unfortunately this doesn't work right. The x-axis ticks vanish and the sizes of the bars don't have equal width.
I would be grateful for any help.

By default, the bars of a barplot have a width of 0.8. Therefore they appear larger for smaller x values on a logarithmic scale. If instead of specifying a constant width, one uses the distance between the bin edges and supplies this to the width argument, the bars will have the correct width. One would also need to set the align to "edge" for this to work.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(1)
x = np.logspace(0, 5, num=21)
y = (np.sin(1.e-2*(x[:-1]-20))+3)**10
fig, ax = plt.subplots()
ax.bar(x[:-1], y, width=np.diff(x), log=True,ec="k", align="edge")
ax.set_xscale("log")
plt.show()
I cannot reproduce missing ticklabels for a logarithmic scaling. This may be due to some settings in the code that are not shown in the question or due to the fact that an older matplotlib version is used. The example here works fine with matplotlib 2.0.

If the goal is to have equal width bars, assuming datapoints are not equidistant, then the most proper solution is to set width as
plt.bar(x, y, width=c*np.array(x), color='b', log=True) for a constant c appropriate for the plot. Alignment can be anything.

I know it is a very old question and you might have solved it but I've come to this post because I was with something like this but at the y axis and I manage to solve it just using ax.set_ylim(df['my data'].min()+100, df['my data'].max()+100). In y axis I have some sensible information which I thouhg the best way was to show in log scale but when I set log scale I couldn't see the numbers proper (as this post in x axis) so I just leave the idea of use log and use the min and max argment. It sets the scale of my graph much like as log. Still looking for another way for doesnt need use that -+100 at set_ylim.

While this does not actually use pyplot.bar, I think this method could be helpful in achieving what the OP is trying to do. I found this to be easier than trying to calibrate the width as a function of the log-scale, though it's more steps. Create a line collection whose width is independent of the chart scale.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.collections as coll
#Generate data and sort into bins
a = np.random.logseries(0.5, 1000)
hist, bin_edges = np.histogram(a, bins=20, density=False)
x = bin_edges[:-1] # remove the top-end from bin_edges to match dimensions of hist
lines = []
for i in range(len(x)):
pair=[(x[i],0), (x[i], hist[i])]
lines.append(pair)
linecoll = coll.LineCollection(lines, linewidths=10, linestyles='solid')
fig, ax = plt.subplots()
ax.add_collection(linecoll)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlim(min(x)/10,max(x)*10)
ax.set_ylim(0.1,1.1*max(hist)) #since this is an unweighted histogram, the logy doesn't make much sense.
Resulting plot - no frills
One drawback is that the "bars" will be centered, but this could be changed by offsetting the x-values by half of the linewidth value ... I think it would be
x_new = x + (linewidth/2)*10**round(np.log10(x),0).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.