Custom sort for histogram - python

After looking at countless questions and answers on custom sorting of the bars in bar charts (or a histogram, in my case), the consensus seemed to be: sort the dataframe as desired, then plot. In practice the plot ignores the row order and blithely sorts the categories alphabetically. There does not seem to be a simple option to turn sorting off, or to pass the plot a list to sort by.
Here's my sample code:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
df = pd.DataFrame(diamonds.loc[cuts_order].carat)
df.reset_index(inplace=True)
plt.hist(df.cut);
This returns the cuts in alphabetical order rather than in the order of the data. I was quite excited to have figured out a clever way of sorting the data, so it was all the more disappointing that the plot ignores it.
What is the most straightforward way of doing this?
Here's what I get with the above code:

Update of your code with the answers in the comments:
In [1]:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
df = pd.DataFrame(diamonds.loc[cuts_order].carat)
df.plot.bar(use_index=True, y='carat')
Out [1]:

A histogram was not the right plot here. With the following code the bars, sorted as desired, are created:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
c_classes = pd.api.types.CategoricalDtype(ordered = True, categories = cuts_order)
diamonds['cut'] = diamonds['cut'].astype(c_classes)
to_plot = diamonds.cut.value_counts(sort=False)
plt.bar(to_plot.index, to_plot.values)
Side note: matplotlib 2.1.0 behaves differently, because there plt.bar blithely ignores the order it is given. I can only confirm that this works with 3.0.3 (and hopefully higher).
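If the installed matplotlib version cannot be relied on to keep the order of string categories, one possible workaround is to sidestep the categorical axis altogether: draw the bars at integer positions and attach the labels afterwards. A minimal sketch, assuming the same cuts_order as above; the abbreviated data and the names cut_dtype, positions and ax are illustrative only:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

cuts = pd.Series(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], name='cut')
cuts_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Ordered categorical: value_counts(sort=False) then follows the category order
cut_dtype = pd.api.types.CategoricalDtype(categories=cuts_order, ordered=True)
to_plot = cuts.astype(cut_dtype).value_counts(sort=False)

# Plot against integer positions so matplotlib never gets strings to reorder
positions = list(range(len(to_plot)))
fig, ax = plt.subplots()
bars = ax.bar(positions, to_plot.values)
ax.set_xticks(positions)
ax.set_xticklabels(list(to_plot.index))
```

Because the x values are plain numbers, the bar order is exactly the order of to_plot, whatever the matplotlib version does with string categories.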
I also tried sorting the data by index, but for some reason this has no effect. It looks like value_counts(sort=False) does not return the values in the order in which they appear in the data:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
diamonds = diamonds.loc[cuts_order]
to_plot = diamonds.index.value_counts(sort=False)
plt.bar(to_plot.index, to_plot.values)
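One way to take the ordering out of the hands of value_counts is to count first and then reindex the result with the desired order. A minimal sketch, assuming only pandas; fill_value=0 is an addition of this sketch, so that a cut missing from the data would still get a zero-height bar:

```python
import pandas as pd

cut = pd.Series(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], name='cut')
cuts_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Count in whatever order pandas produces, then impose the custom order explicitly
to_plot = cut.value_counts().reindex(cuts_order, fill_value=0)

print(to_plot.index.tolist())
```

to_plot.index and to_plot.values can then be handed to plt.bar as in the code above.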
Seaborn is also an option as it potentially removes the dependency on the available matplotlib version:
import pandas as pd
import seaborn as sb
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
c_classes = pd.api.types.CategoricalDtype(ordered = True, categories = cuts_order)
diamonds['cut'] = diamonds['cut'].astype(c_classes)
to_plot = diamonds.cut.value_counts(sort=False)
ax = sb.barplot(data = diamonds, x = to_plot.index, y = to_plot.values)

Related

Create a Single Boxplot from Multiple DataFrames

I have multiple data frames with different numbers of rows and the same number of columns, i.e.
DATA
female_df1 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df2 = pd.DataFrame({'ID': [75,1,7], 'value': [39, 66.7, 77.9]})
female_df3 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df4 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
male_df1 = pd.DataFrame({'ID': [35,1,7], 'value': [15, 36.7, 87.9]})
male_df2 = pd.DataFrame({'ID': [5,11,17], 'value': [99, 96.7, 97.9]})
male_df3 = pd.DataFrame({'ID': [35,41,37], 'value': [15, 16.7, 17.9]})
male_df4 = pd.DataFrame({'ID': [51,11,27], 'value': [35, 36.7, 37.9]})
Now I would like to draw a single boxplot from the data frames above. I used the code below to do so:
fig, ax2 = plt.subplots(figsize = (15,10))
vec = [female_df1['value'].values,female_df2['value'].values,female_df3['value'].values,female_df4['value'].values]
labels = ['f1','f2','f3', 'f4']
ax2.boxplot(vec, labels = labels)
plt.show()
The output is a boxplot of the female values. I have male data frames with values as well, and I want to plot the two side by side (i.e. fbeta1.0 and mbeta1.0) to observe the difference in data distribution. Any insights are much appreciated.
Desired Output plot:
This is a bit manual, but should do what you need...
import pandas as pd
import matplotlib.pyplot as plt
### DATA ###
female_df1 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df2 = pd.DataFrame({'ID': [75,1,7], 'value': [39, 66.7, 77.9]})
female_df3 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df4 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
male_df1 = pd.DataFrame({'ID': [35,1,7], 'value': [15, 36.7, 87.9]})
male_df2 = pd.DataFrame({'ID': [5,11,17], 'value': [99, 96.7, 97.9]})
male_df3 = pd.DataFrame({'ID': [35,41,37], 'value': [15, 16.7, 17.9]})
male_df4 = pd.DataFrame({'ID': [51,11,27], 'value': [35, 36.7, 37.9]})
### PLOTTING ###
fig, ax = plt.subplots(1,4, figsize = (15,6))
# One subplot per female/male pair; the tick labels match the pair being shown
ax[0].boxplot([female_df1['value'].values, male_df1['value'].values], labels = ['f1','m1'])
ax[1].boxplot([female_df2['value'].values, male_df2['value'].values], labels = ['f2','m2'])
ax[2].boxplot([female_df3['value'].values, male_df3['value'].values], labels = ['f3','m3'])
ax[3].boxplot([female_df4['value'].values, male_df4['value'].values], labels = ['f4','m4'])
ax[0].set_title("M1 & F1")
ax[1].set_title("M2 & F2")
ax[2].set_title("M3 & F3")
ax[3].set_title("M4 & F4")
plt.show()
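If all pairs should share one axes rather than four subplots, a possible variation is to call boxplot twice with shifted positions. This is a minimal sketch, not the approach of the answer above; the offset, the colours and the 'pair N' tick labels are arbitrary choices made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

female = [pd.DataFrame({'value': v}) for v in
          ([85, 56.7, 77.9], [39, 66.7, 77.9], [85, 56.7, 77.9], [85, 56.7, 77.9])]
male = [pd.DataFrame({'value': v}) for v in
        ([15, 36.7, 87.9], [99, 96.7, 97.9], [15, 16.7, 17.9], [35, 36.7, 37.9])]

offset = 0.2  # horizontal shift of each box away from the centre of its pair
centers = list(range(1, len(female) + 1))
f_pos = [c - offset for c in centers]
m_pos = [c + offset for c in centers]

fig, ax = plt.subplots(figsize=(10, 6))
f_box = ax.boxplot([d['value'].values for d in female], positions=f_pos,
                   widths=0.3, patch_artist=True)
m_box = ax.boxplot([d['value'].values for d in male], positions=m_pos,
                   widths=0.3, patch_artist=True)

# Colour the two groups so they can be told apart in the legend
for patch in f_box['boxes']:
    patch.set_facecolor('lightcoral')
for patch in m_box['boxes']:
    patch.set_facecolor('lightskyblue')

# One tick per pair, centred between its female and male box
ax.set_xticks(centers)
ax.set_xticklabels([f'pair {c}' for c in centers])
ax.legend([f_box['boxes'][0], m_box['boxes'][0]], ['female', 'male'])
```

Each female box then sits directly to the left of the male box it is compared with, on a shared y-axis.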

How to make a cell higher in matplotlib using the plt.table function?

The following code uses colWidths to adjust the cells' width:
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['FangSong']
mpl.rcParams['axes.unicode_minus'] = False
labels = ['A难度水平', 'B难度水平', 'C难度水平', 'D难度水平']
students = [0.35, 0.15, 0.2, 0.3]
explode = [0.1, 0.1, 0.1, 0.1]
colors = ['r', 'y', 'b', 'gray']
plt.pie(students, autopct='%3.1f%%',
labels=labels, textprops={'fontsize': 12,
'family': 'FangSong',
'fontweight': 'bold'},
explode=explode, colors=colors)
studentValues = [['A', 'B', 'C', 'D'], [350, 150, 200, 300], ['test', 'test', 'test', 'test']]
cellcolors = [['r', 'y', 'b', 'gray'], ['b', 'gray', 'y', 'r'], ['gray', 'y', 'b', 'r']]
rowLabels = ['aaaaa','bbbbb','ccccc']
plt.table(cellText=studentValues,
cellColours=cellcolors,
cellLoc='center', colWidths=[0.1] * 4,
rowLabels=rowLabels)
plt.show()
How could I adjust the height of the cell inside the plt.table function?
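Keep a reference to the table that plt.table returns, then adjust the cell size with its scale method. The second argument is the vertical factor, so any value above 1 (1.5 here, as an example) makes the cells taller:
ytable = plt.table(cellText=studentValues,
cellColours=cellcolors,
cellLoc='center', colWidths=[0.1] * 4,
rowLabels=rowLabels)
ytable.scale(1, 1.5)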
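To see what scale does, the cell height can be read before and after the call. A minimal sketch with a made-up 2x2 table; the factor 2 and the per-row resize at the end are illustrative choices, not part of the answer above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axis('off')
table = ax.table(cellText=[['A', 'B'], ['350', '150']],
                 cellLoc='center', colWidths=[0.2] * 2, loc='center')

cells = table.get_celld()          # dict keyed by (row, column)
height_before = cells[(0, 0)].get_height()

table.scale(1, 2)                  # widths unchanged, every height doubled
height_after = cells[(0, 0)].get_height()

# A single row can also be made taller by setting its cells' heights directly
for col in range(2):
    cells[(1, col)].set_height(height_after * 1.5)
```

scale multiplies the size of every cell, while set_height on the cells of one row changes only that row.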

Categorize and order bar chart by Hue

I have a problem: I want to show the two countries with the highest count in each category, but I only get the output below. I would like the part to be listed as an extra category.
Is there an option?
import pandas as pd
import seaborn as sns
d = {'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 's', 's']
}
df = pd.DataFrame(data=d)
print(df)
#print(df.sort_values('count', ascending=False).groupby('part').head(2))
ax = sns.barplot(x="country", y="count", hue='part',
data=df.sort_values('count', ascending=False).groupby('part').head(2), palette='GnBu')
What I got
What I want
You can always skip seaborn and plot everything in matplotlib directly.
from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
df = pd.DataFrame({
'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 'b', 's']
})
fig, ax = plt.subplots()
offset = .2
xticks, xlabels = [], []
handles, labels = [], []
for i, (idx, group) in enumerate(df.groupby('part')):
    plot_data = group.nlargest(2, 'count')
    x = [i - offset, i + offset]
    barcontainer = ax.bar(x=x, height=plot_data['count'], width=.35)
    xticks += x
    xlabels += plot_data['country'].tolist()
    handles.append(barcontainer[0])
    labels.append(idx)
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels)
ax.legend(handles=handles, labels=labels, title='Part')
plt.show()
The following approach creates a FacetGrid for your data. Seaborn 0.11.2 introduced the helpful g.axes_dict. (In the example data I changed the second entry for 'BG' to 'b', supposing that each country/part combination only occurs once, as in the example plots.)
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
d = {'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 'b', 's']
}
df = pd.DataFrame(data=d)
sns.set()
g = sns.FacetGrid(data=df, col='part', col_wrap=2, sharey=True, sharex=False)
for part, df_part in df.groupby('part'):
    order = df_part.nlargest(2, 'count')['country']
    ax = sns.barplot(data=df_part, x='country', y='count', order=order, palette='summer', ax=g.axes_dict[part])
    ax.set(xlabel=f'part = {part}')
g.set_ylabels('count')
plt.tight_layout()
plt.show()
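Both answers depend on first picking the two largest counts within each part. That selection step can be checked on its own, without any plotting. A minimal sketch using the modified example data (second 'BG' entry set to 'b'); the label column is a hypothetical helper for telling the two 'BG' rows apart on a shared axis:

```python
import pandas as pd

df = pd.DataFrame({
    'count': [50, 20, 30, 100, 3, 40, 5],
    'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
    'part': ['b', 'b', 's', 's', 'b', 'b', 's'],
})

# Sort by count, keep the first two rows of every part, then group the rows by part
top2 = (
    df.sort_values('count', ascending=False)
      .groupby('part')
      .head(2)
      .sort_values(['part', 'count'], ascending=[True, False])
      .reset_index(drop=True)
)

# Hypothetical helper column: a unique label per country/part combination
top2['label'] = top2['country'] + ' (' + top2['part'] + ')'
print(top2)
```

The resulting frame has one row per bar, already in the order in which the bars should appear.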

Position 'add_vline' annotation text vertically and center mid-point on vertical line

import plotly as plotly
from plotly.offline import plot
import plotly.express as px
import pandas as pd
import numpy as np
df = pd.DataFrame({'height': [712, 712, 716, 716, 718, np.nan, np.nan, np.nan, np.nan, np.nan],
'moisture ': [0.06, 0.19, 0.18, 0.17, 0.18, np.nan, np.nan, np.nan, np.nan, np.nan],
'tasks': ['water', None, None, 'prune', None, None, 'position', None, 'prune', None],
'weather': [None, 'humid', None, None, 'wet', None, None, None, None, 'hot']},
index=['2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09',
'2020-01-10', '2020-01-11', '2020-01-12', '2020-01-13'])
df.index.name = 'date'
fig = px.line(df, y="height")
for x in df.loc[~df["tasks"].isna()].index:
    fig.add_vline(x=pd.to_datetime(x).timestamp() * 1000, line_width=1, line_dash="dash", line_color="red", annotation_text=df.loc[x, 'tasks'])
for x in df.loc[~df["weather"].isna()].index:
    fig.add_vline(x=pd.to_datetime(x).timestamp() * 1000, line_width=1, line_dash="dash", line_color="blue", annotation_text=df.loc[x, 'weather'])
fig.show()
This produces the following output:
I want to position the annotation text vertically for each line, centered at the mid-point (halfway up).
In the documentation, I can only see an option to position the annotation at the top or bottom, using annotation_position.
Also, since the lines come from different columns, I would like to add a legend and the ability to click to hide/show the column data (represented by the 'red' and 'blue' lines).
Many thanks in advance.
You can modify the annotations created by add_vline():
import plotly as plotly
from plotly.offline import plot
import plotly.express as px
import pandas as pd
import numpy as np
df = pd.DataFrame({'height': [712, 712, 716, 716, 718, np.nan, np.nan, np.nan, np.nan, np.nan],
'moisture ': [0.06, 0.19, 0.18, 0.17, 0.18, np.nan, np.nan, np.nan, np.nan, np.nan],
'tasks': ['water', None, None, 'prune', None, None, 'position', None, 'prune', None],
'weather': [None, 'humid', None, None, 'wet', None, None, None, None, 'hot']},
index=['2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09',
'2020-01-10', '2020-01-11', '2020-01-12', '2020-01-13'])
df.index.name = 'date'
fig = px.line(df, y="height")
for x in df.loc[~df["tasks"].isna()].index:
    fig.add_vline(x=pd.to_datetime(x).timestamp() * 1000, line_width=1, line_dash="dash", line_color="red", annotation_text=df.loc[x, 'tasks'])
for x in df.loc[~df["weather"].isna()].index:
    fig.add_vline(x=pd.to_datetime(x).timestamp() * 1000, line_width=1, line_dash="dash", line_color="blue", annotation_text=df.loc[x, 'weather'])
fig.update_layout(annotations=[{**a, **{"y":.5}} for a in fig.to_dict()["layout"]["annotations"]])
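The last line works by merging an override into every annotation dict. The same merge can carry further annotation attributes, for example textangle to turn the text vertical. A minimal sketch of just the merge, with plain dicts standing in for what fig.to_dict()["layout"]["annotations"] would return; the stand-in dicts and the choice of -90 degrees are assumptions, and y=0.5 is only the mid-point if the annotations are positioned relative to the plot area (yref of "y domain"):

```python
# Stand-ins for the annotation dicts of a real figure (assumed shape, abbreviated)
annotations = [
    {"text": "water", "x": 1578096000000.0, "xref": "x",
     "y": 1, "yref": "y domain", "showarrow": False},
    {"text": "humid", "x": 1578182400000.0, "xref": "x",
     "y": 1, "yref": "y domain", "showarrow": False},
]

# Keys listed here replace the existing values; all other keys are kept
overrides = {"y": 0.5, "yanchor": "middle", "textangle": -90}

centered = [{**a, **overrides} for a in annotations]
print(centered[0])
```

With a real figure, the merged list would be passed back through fig.update_layout(annotations=centered), as in the line above.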

bar plot with vertical lines for each bar

%matplotlib inline
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08], 'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')
I want to plot a horizontal bar chart with first and second as bars.
I want to use the max column for displaying a vertical line at the corresponding values of the other columns.
I have only managed the bar plot so far.
Like this:
Any hints on how to achieve this?
thx
I have replaced the NaN with a finite value; then you can use the following code:
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08],
'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 2.5]}).set_index('index')
plt.barh(range(4), df['first'], height=-0.25, align='edge')
plt.barh(range(4), df['second'], height=0.25, align='edge', color='red')
plt.yticks(range(4), df.index);
for i, val in enumerate(df['max']):
    plt.vlines(val, i-0.25, i+0.25, color='limegreen')
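If the missing value should stay missing rather than be replaced, an alternative is to skip those rows when drawing the lines. A minimal sketch, assuming the NaN is stored as a real np.nan and not as the string 'NaN'; the list marked only records which rows got a line:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': [1.2, 1.23, 1.32, 1.08],
                   'second': [2, 2.2, 3, 1.08],
                   'max': [1.5, 3, 0.9, np.nan]},
                  index=['A', 'B', 'C', 'D'])

fig, ax = plt.subplots()
y = list(range(len(df)))
ax.barh(y, df['first'], height=-0.25, align='edge')
ax.barh(y, df['second'], height=0.25, align='edge', color='red')
ax.set_yticks(y)
ax.set_yticklabels(df.index)

marked = []
for i, val in enumerate(df['max']):
    if pd.notna(val):  # rows without a max value get no line
        ax.vlines(val, i - 0.25, i + 0.25, color='limegreen')
        marked.append(df.index[i])
```

Row D then simply shows its two bars without a marker.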
