Bivariate histogram with user-defined contour lines in python [duplicate] - python

I am trying to plot two displots side by side with this code
fig,(ax1,ax2) = plt.subplots(1,2)
sns.displot(x =X_train['Age'], hue=y_train, ax=ax1)
sns.displot(x =X_train['Fare'], hue=y_train, ax=ax2)
It returns the following result (two empty subplots, followed by one displot each on two separate lines).
If I try the same code with violinplot, it returns the expected result:
fig,(ax1,ax2) = plt.subplots(1,2)
sns.violinplot(x=y_train, y=X_train['Age'], ax=ax1)
sns.violinplot(x=y_train, y=X_train['Fare'], ax=ax2)
Why is displot returning a different kind of output and what can I do to output two plots on the same line?

seaborn.distplot has been deprecated since seaborn 0.11 and is replaced by the following:
displot(), a figure-level function with similar flexibility over the kind of plot to draw. It creates a FacetGrid and does not have the ax parameter, so it will not work with matplotlib.pyplot.subplots.
histplot(), an axes-level function for plotting histograms, including with kernel density smoothing. It does have the ax parameter, so it will work with matplotlib.pyplot.subplots.
The same holds for every seaborn figure-level (FacetGrid) plot: none of them has an ax parameter. Use the equivalent axes-level plot instead.
Look at the documentation for the figure-level plot to find the appropriate axes-level plot function for your needs.
See Figure-level vs. axes-level functions
Because histograms of two different columns are desired, it's easier to use histplot.
See How to plot in multiple subplots for a number of different ways to plot into matplotlib.pyplot.subplots
Also review seaborn histplot and displot output doesn't match
Tested in seaborn 0.11.1 & matplotlib 3.4.2
fig, (ax1, ax2) = plt.subplots(1, 2)
sns.histplot(x=X_train['Age'], hue=y_train, ax=ax1)
sns.histplot(x=X_train['Fare'], hue=y_train, ax=ax2)
Imports and DataFrame Sample
import seaborn as sns
import matplotlib.pyplot as plt
# load data
penguins = sns.load_dataset("penguins", cache=False)
# display(penguins.head())
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE
Axes Level Plot
With the data in a wide format, use sns.histplot
# select the columns to be plotted
cols = ['bill_length_mm', 'bill_depth_mm']
# create the figure and axes
fig, axes = plt.subplots(1, 2)
axes = axes.ravel() # flattening the array makes indexing easier
for col, ax in zip(cols, axes):
    sns.histplot(data=penguins[col], kde=True, stat='density', ax=ax)
fig.tight_layout()
plt.show()
Figure Level Plot
With the dataframe in a long format, use displot
# create a long dataframe
dfl = penguins.melt(id_vars='species', value_vars=['bill_length_mm', 'bill_depth_mm'], var_name='bill_size', value_name='vals')
# display(dfl.head())
species bill_size vals
0 Adelie bill_length_mm 39.1
1 Adelie bill_depth_mm 18.7
2 Adelie bill_length_mm 39.5
3 Adelie bill_depth_mm 17.4
4 Adelie bill_length_mm 40.3
# plot
sns.displot(data=dfl, x='vals', col='bill_size', kde=True, stat='density', common_bins=False, common_norm=False, height=4, facet_kws={'sharey': False, 'sharex': False})
Multiple DataFrames
If there are multiple dataframes, combine them with pd.concat and use .assign to create an identifying 'source' column, which can then be used for row=, col=, or hue=
# list of dataframes
lod = [df1, df2, df3]
# create one dataframe with a new 'source' column to use for row, col, or hue
df = pd.concat((d.assign(source=f'df{i}') for i, d in enumerate(lod, 1)), ignore_index=True)
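A quick way to see what this produces, using three tiny made-up frames in place of df1, df2 and df3:

```python
import pandas as pd

# hypothetical frames standing in for df1, df2, df3
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3]})
df3 = pd.DataFrame({"x": [4, 5, 6]})
lod = [df1, df2, df3]

# tag each frame with its origin, then stack them with a fresh index
df = pd.concat((d.assign(source=f"df{i}") for i, d in enumerate(lod, 1)),
               ignore_index=True)
print(df)
```

Every row keeps a label saying which frame it came from, so the combined frame can be passed to a figure-level function with col='source' or hue='source'.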
See Import multiple csv files into pandas and concatenate into one DataFrame to read multiple files into a single dataframe with an identifying column.

Related

how to set x_axis label(not xtick label) for all subplots in relplot?

I am drawing subplots with seaborn's relplot. Because the original dataset varies, I don't always know in advance how many subplots there will be.
I set col_wrap to limit the number of columns, but sometimes the result doesn't look good. For example, with col_wrap = 3 and 5 subplots I get the figure below:
As the figure shows, the x-axis label only appears in C, D and E, which seems strange. I want the x-axis label to be shown in all subplots (from A to E).
I already know that facet_kws={'sharex': 'col'} allows plots to have independent axis scales (according to set axis limits on individual facets of seaborn facetgrid).
But I want to set the x-axis label on all subplots, and I haven't found a solution for it.
Methods such as set_xlabels on the FacetGrid object don't seem to help, because the official documentation says they only label "the bottom row of the grid".
FacetGrid.set_xlabels(label=None, clear_inner=True, **kwargs)
Label the x axis on the bottom row of the grid.
The following are my example data and my code:
city date value
0 A 1 9
1 B 1 20
2 C 1 4
3 D 1 33
4 E 1 2
5 A 2 22
6 B 2 32
7 C 2 27
8 D 2 32
9 E 2 18
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_excel("data/example_data.xlsx")
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
                errorbar=None, facet_kws={'sharex': 'col'})
(g.set_axis_labels("x_axis", "y_axis")
  .set_titles("{col_name}")
  .tight_layout()
  .add_legend()
)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()
Thanks in advance.
In order to reduce superfluous information, Seaborn makes these inner labels invisible. You can make them visible again:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.repeat([1, 2], 5),
                   'value': np.random.randint(1, 20, 10),
                   'city': np.tile([*'abcde'], 2)})
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
                errorbar=None, facet_kws={'sharex': 'col'})
g.set_titles("{col_name}")
g.add_legend()
# make the axis labels visible on every facet, not only the outer ones
for ax in g.axes.flat:
    ax.set_xlabel('x axis', visible=True)
    ax.set_ylabel('y axis', visible=True)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()

Creating a multi-bar plot in MatplotLib

Given a simple pd.Dataframe df that looks like this:
workflow blocked_14 blocked_7 blocked_5 blocked_2 blocked_1
au_in_service_order_response au_in_service_order_response 12.00 11.76 15.38 25.0 0.0
au_in_cats_sync_billing_period au_in_cats_sync_billing_period 3.33 0.00 0.00 0.0 0.0
au_in_MeterDataNotification au_in_MeterDataNotification 8.70 0.00 0.00 0.0 0.0
I want to create a bar-chart that shows the blocked_* columns as the x-axis.
Since df.plot(x='workflow', kind='bar') obviously puts the workflows on the x-axis, I tried ax = blocked_df.plot(x=['blocked_14','blocked_7',...], kind='bar') but this gives me
ValueError: x must be a label or position
How would I create 5 x-axis groups, one per blocked_* column, and have each bar show the corresponding value for each workflow?
Since pandas interprets the x as the index and y as the values you want to plot, you'll need to transpose your dataframe first.
import matplotlib.pyplot as plt
ax = df.set_index('workflow').T.plot.bar()
plt.show()
But that doesn't look too good, does it? Let's make sure all of the labels fit on the Axes and move the legend outside of the plot so it doesn't obscure the data.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(14, 6), layout='constrained')
ax = df.set_index('workflow').T.plot.bar(legend=False, ax=ax)
ax.legend(loc='upper left', bbox_to_anchor=(1, .8))
plt.show()
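To see why the transpose is what puts the blocked_* columns on the x-axis, it can help to look at the transposed frame itself. The sketch below rebuilds the frame from the values shown in the question (without the duplicated index) and plots it.

```python
import pandas as pd

# the data from the question, rebuilt by hand
df = pd.DataFrame({
    "workflow": ["au_in_service_order_response",
                 "au_in_cats_sync_billing_period",
                 "au_in_MeterDataNotification"],
    "blocked_14": [12.00, 3.33, 8.70],
    "blocked_7": [11.76, 0.0, 0.0],
    "blocked_5": [15.38, 0.0, 0.0],
    "blocked_2": [25.0, 0.0, 0.0],
    "blocked_1": [0.0, 0.0, 0.0],
})

# after the transpose, the blocked_* names form the index (one x-axis group each)
# and every workflow is a column (one bar per group)
transposed = df.set_index("workflow").T
print(transposed)

ax = transposed.plot.bar()
```

The index of the transposed frame holds the five blocked_* labels, and pandas draws one group of three bars for each of them.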

Seaborn boxplot with grouped data into categories with count column

I ran into a problem when trying to plot my dataset with a seaborn boxplot. The dataset comes from the database already grouped, like this:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
I would like to create a box plot with the region on the x-axis and age on the y-axis, weighted by the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. The one hard-coded option is to repeat each row by its total, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column to make the values manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
                         'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
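A rough sketch of that alternative follows. To keep it dependency-free, it swaps DescrStatsW for a small hand-written weighted quantile (the helper weighted_quantile is hypothetical, not a library function), uses the minimum and maximum age as whiskers, and draws no outliers; all three are simplifying assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, q * cum[-1], side="left")
    return values[min(idx, len(values) - 1)]


df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})

# one dict of precomputed statistics per region, in the format Axes.bxp expects
box_stats = []
for region, grp in df.groupby("region", sort=False):
    q1, med, q3 = (weighted_quantile(grp["age"], grp["total"], q)
                   for q in (0.25, 0.5, 0.75))
    box_stats.append({"label": region, "q1": q1, "med": med, "q3": q3,
                      "whislo": grp["age"].min(), "whishi": grp["age"].max(),
                      "fliers": []})

fig, ax = plt.subplots()
ax.bxp(box_stats, showfliers=False)  # draws boxes from the statistics, not from raw data
ax.set_xlabel("region")
ax.set_ylabel("age")
```

Because the quartiles are computed directly from the weights, no rows need to be repeated, however large the totals are.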

Pandas dataframe plotting - issue when switching from two subplots to single plot w/ secondary axis

I have two sets of data I want to plot together on a single figure. I have a set of flow data at 15 minute intervals I want to plot as a line plot, and a set of precipitation data at hourly intervals, which I am resampling to a daily time step and plotting as a bar plot. Here is what the format of the data looks like:
2016-06-01 00:00:00 56.8
2016-06-01 00:15:00 52.1
2016-06-01 00:30:00 44.0
2016-06-01 00:45:00 43.6
2016-06-01 01:00:00 34.3
At first I set this up as two subplots, with precipitation and flow rate on different axes. This works totally fine. Here's my code:
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
filename = 'manhole_B.csv'
plotname = 'SSMH-2A B'
plt.style.use('bmh')
# Read csv with precipitation data, change index to datetime object
pdf = pd.read_csv('precip.csv', delimiter=',', header=None, index_col=0)
pdf.columns = ['Precipitation[in]']
pdf.index.name = ''
pdf.index = pd.to_datetime(pdf.index)
pdf = pdf.resample('D').sum()
print(pdf.head())
# Read csv with flow data, change index to datetime object
qdf = pd.read_csv(filename, delimiter=',', header=None, index_col=0)
qdf.columns = ['Flow rate [gpm]']
qdf.index.name = ''
qdf.index = pd.to_datetime(qdf.index)
# Plot
f, ax = plt.subplots(2)
qdf.plot(ax=ax[1], rot=30)
pdf.plot(ax=ax[0], kind='bar', color='r', rot=30, width=1)
ax[0].get_xaxis().set_ticks([])
ax[1].set_ylabel('Flow Rate [gpm]')
ax[0].set_ylabel('Precipitation [in]')
ax[0].set_title(plotname)
f.set_facecolor('white')
f.tight_layout()
plt.show()
2 Axis Plot
However, I decided I want to show everything on a single axis, so I modified my code to put precipitation on a secondary axis. Now my flow data has disappeared from the plot, and even when I set the axis ticks to an empty set, I get these 00:15, 00:30 and 00:45 tick marks along the x-axis.
Secondary-y axis plots
Any ideas why this might be occurring?
Here is my code for the single axis plot:
f, ax = plt.subplots()
qdf.plot(ax=ax, rot=30)
pdf.plot(ax=ax, kind='bar', color='r', rot=30, secondary_y=True)
ax.get_xaxis().set_ticks([])
Here is an example:
Setup
In [1]: from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'x' : np.arange(10),
                   'y1' : np.random.rand(10,),
                   'y2' : np.square(np.arange(10))})
df
Out[1]: x y1 y2
0 0 0.451314 0
1 1 0.321124 1
2 2 0.050852 4
3 3 0.731084 9
4 4 0.689950 16
5 5 0.581768 25
6 6 0.962147 36
7 7 0.743512 49
8 8 0.993304 64
9 9 0.666703 81
Plot
In [2]: fig, ax1 = plt.subplots()
ax1.plot(df['x'], df['y1'], 'b-')
ax1.set_xlabel('Series')
ax1.set_ylabel('Random', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
ax2 = ax1.twinx() # Note twinx, not twiny. I was wrong when I commented on your question.
ax2.plot(df['x'], df['y2'], 'ro')
ax2.set_ylabel('Square', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
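Applied to the question's data, the same twinx idea might look like the sketch below. A likely reason the line vanished in the original code is that pandas draws bar plots at integer positions 0, 1, 2, ... regardless of the index, while the line plot uses the datetime values, so the two series do not line up on one x-axis. Drawing the bars with matplotlib's ax.bar keeps both on datetime positions. The flow and precipitation values here are random stand-ins for the CSV files, which are not available.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# made-up flow data at 15 minute intervals over 5 days
flow_index = pd.date_range("2016-06-01", periods=4 * 24 * 5, freq="15min")
qdf = pd.DataFrame({"Flow rate [gpm]": rng.uniform(30, 60, len(flow_index))},
                   index=flow_index)

# made-up hourly precipitation, resampled to daily sums as in the question
precip_index = pd.date_range("2016-06-01", periods=24 * 5, freq="60min")
pdf = pd.DataFrame({"Precipitation [in]": rng.uniform(0, 0.1, len(precip_index))},
                   index=precip_index).resample("D").sum()

fig, ax = plt.subplots()
ax.plot(qdf.index, qdf["Flow rate [gpm]"])
ax.set_ylabel("Flow Rate [gpm]")

# twinx shares the datetime x-axis; on a date axis the bar width is in days
ax2 = ax.twinx()
ax2.bar(pdf.index, pdf["Precipitation [in]"], width=1.0, color="r", alpha=0.4)
ax2.set_ylabel("Precipitation [in]")
fig.autofmt_xdate(rotation=30)
```

Both series now live on real dates, so the line stays visible and the stray 00:15, 00:30 and 00:45 ticks should not appear.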
