How to remove count from a plotly express bar chart hover data? - python

Given the following code:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1')
fig.show()
This generates a bar plot.
How do I remove count from the hover data?
plotly==5.1.0

You can remove it by overriding the hovertemplate:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1').update_traces(hovertemplate='col1=%{y}<br><extra></extra>')
fig.show()

Related

dplython: remove duplicates

Is there a way to remove duplicate rows for two specified columns using dplython?
This is an example of what I want to accomplish:
import pandas as pd
from dplython import *
data = {'store': [1, 1, 2, 2, 4, 4],
'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
'weekly_sales': [100, 200, 300, 400, 500, 200]}
df = pd.DataFrame(data)
df.drop_duplicates(subset=["store", "Type"])
This is my dplython attempt:
df_R = DplyFrame(df)
df_R >> sift(drop_duplicates(subset=[X.store,X.Type]))
Thanks!

Categorize and order bar chart by Hue

I want to show the two countries with the highest count in each category, but I only get the output below. I would like part to be listed as a separate category instead.
Is there an option?
import pandas as pd
import seaborn as sns
d = {'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 's', 's']
}
df = pd.DataFrame(data=d)
print(df)
#print(df.sort_values('count', ascending=False).groupby('part').head(2))
ax = sns.barplot(x="country", y="count", hue='part',
data=df.sort_values('count', ascending=False).groupby('part').head(2), palette='GnBu')
What I got
What I want
You can also skip seaborn and plot everything with matplotlib directly:
from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('seaborn')  # named 'seaborn-v0_8' in matplotlib 3.6 and later
df = pd.DataFrame({
'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 'b', 's']
})
fig, ax = plt.subplots()
offset = .2
xticks, xlabels = [], []
handles, labels = [], []
for i, (idx, group) in enumerate(df.groupby('part')):
    plot_data = group.nlargest(2, 'count')
    x = [i - offset, i + offset]
    barcontainer = ax.bar(x=x, height=plot_data['count'], width=.35)
    xticks += x
    xlabels += plot_data['country'].tolist()
    handles.append(barcontainer[0])
    labels.append(idx)
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels)
ax.legend(handles=handles, labels=labels, title='Part')
plt.show()
The following approach creates a FacetGrid for your data. Seaborn 0.11.2 introduced the helpful g.axes_dict. (In the example data, I changed the part of the second 'BG' entry to 'b', supposing that each country/part combination occurs only once, as in the example plots.)
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
d = {'count': [50, 20, 30, 100, 3, 40, 5],
'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
'part': ['b', 'b', 's', 's', 'b', 'b', 's']
}
df = pd.DataFrame(data=d)
sns.set()
g = sns.FacetGrid(data=df, col='part', col_wrap=2, sharey=True, sharex=False)
for part, df_part in df.groupby('part'):
    order = df_part.nlargest(2, 'count')['country']
    ax = sns.barplot(data=df_part, x='country', y='count', order=order, palette='summer', ax=g.axes_dict[part])
    ax.set(xlabel=f'part = {part}')
g.set_ylabels('count')
plt.tight_layout()
plt.show()
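Before plotting, it can help to check which rows the top-two selection actually keeps. The following is a minimal pandas-only sketch, assuming the modified example data from the answers above (second 'BG' entry moved to part 'b'); the name top2 is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'count': [50, 20, 30, 100, 3, 40, 5],
    'country': ['DE', 'CN', 'CN', 'BG', 'PL', 'BG', 'RU'],
    'part': ['b', 'b', 's', 's', 'b', 'b', 's'],
})

# Keep the two rows with the largest count within each part,
# then order the result by part and descending count for readability.
top2 = (
    df.sort_values('count', ascending=False)
      .groupby('part')
      .head(2)
      .sort_values(['part', 'count'], ascending=[True, False])
      .reset_index(drop=True)
)
print(top2)
```

Each part contributes exactly two rows here, which is what both plotting approaches rely on when they place two bars per category.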

Pandas: Bar-Plot with two bars from repetitive x-column in dataframe

I have a slightly odd csv file where the month column is repeated as such. My goal is to create a bar graph where each month has two columns of y (from both a and b). I have tried to approach this by separating the data frame into two - a only and b only - but the repetition of the month column gets in the way. Fairly new to Python and Pandas so perhaps there is a function I'm not aware of? Any help is appreciated.
month  cond.  y
Jan    a      4
Jan    b      8
Feb    a      2
Feb    b      9
March  a      3
March  b      7
Perhaps the most common way to approach this problem is to reshape the long-form data to wide-form via pivot and then DataFrame.plot:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
df.pivot(index='month', columns='cond.', values='y').plot(kind='bar', rot=0)
plt.tight_layout()
plt.show()
There is a noticeable issue: the x-axis categories appear out of order, because pivot sorts them alphabetically rather than by date. One option is to reindex before plotting. There would be more options if the month column were regular, but since it contains both full month names and abbreviations, manually reindexing is likely the best option.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
(
df.pivot(index='month', columns='cond.', values='y')
.reindex(['Jan', 'Feb', 'March']) # Re-order so they appear correctly on x-axis
.plot(kind='bar', rot=0)
)
plt.tight_layout()
plt.show()
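If the rows of the source frame are already in calendar order, the order can also be taken from the data instead of being typed out. A minimal sketch under that assumption (the names month_order and wide are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
    'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
    'y': [4, 8, 2, 9, 3, 7]
})

# unique() keeps the order of first appearance in the frame
month_order = df['month'].unique()

# Reshape to wide form, then restore the original month order
wide = df.pivot(index='month', columns='cond.', values='y').reindex(month_order)
print(wide)
```

Calling wide.plot(kind='bar', rot=0) afterwards should draw the months in that order.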
Seaborn is a popular choice for this kind of plot, as the hue argument makes the reshaping step unnecessary. Additionally, x will be in order of appearance in the frame, so reindex is also unnecessary (assuming the data appears in the correct order in the source DataFrame).
Using sns.barplot:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_theme() # (optional) Use seaborn theme
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
sns.barplot(data=df, x='month', y='y', hue='cond.')
plt.tight_layout()
plt.show()
Using the hue argument to categorize also works:
import seaborn as sns
sns.barplot(data=df, x='month', y='y', hue='cond.')

Aggregate by percentile and count for groups in python

I'm a new python user familiar with R.
I want to calculate user-defined quantiles for groups complete with the count of observations in each group.
In R I would do:
df_sum <- df %>% group_by(group) %>%
dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group percentile count
A 7.4 5
B 6.55 4
You can use agg() on the grouped column to apply multiple functions at once.
For the percentile you can use numpy.quantile(). Its default linear interpolation reproduces the expected result above.
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x : np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
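As an alternative sketch, pandas named aggregation can set the column names in the same call, so the separate df_sum.columns assignment is not needed. This assumes pandas 0.25 or later; the output names percentile and count are simply the ones from the expected result.

```python
import pandas as pd

data = {'group': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'],
        'obsval': [1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)

# Each keyword becomes an output column; the value is the function to apply.
df_sum = df.groupby('group')['obsval'].agg(
    percentile=lambda x: x.quantile(0.85),  # linear interpolation by default
    count='count',
)
print(df_sum)
```

Series.quantile() accepts an interpolation argument if a different quantile definition is required.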

bar plot with vertical lines for each bar

%matplotlib inline
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08], 'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')
I want to plot a horizontal bar chart with first and second as bars.
I want to use the max column for displaying a vertical line at the corresponding values of the other columns.
I only managed the bar plot as for now.
Like this:
Any hints on how to achieve this?
thx
I have replaced the NaN with a finite value; you can then use the following code:
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08],
'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 2.5]}).set_index('index')
plt.barh(range(4), df['first'], height=-0.25, align='edge')
plt.barh(range(4), df['second'], height=0.25, align='edge', color='red')
plt.yticks(range(4), df.index);
for i, val in enumerate(df['max']):
    plt.vlines(val, i-0.25, i+0.25, color='limegreen')
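If the missing value should stay missing rather than be replaced, one possible variation is to convert the max column to numbers and skip the rows without a value. This is only a sketch under that assumption; max_vals and drawn are illustrative names, and the axes-based calls are the object-oriented equivalents of the plt calls above.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'index': ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08],
                   'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')

# Turn the mixed column into floats; anything unparseable becomes NaN
max_vals = pd.to_numeric(df['max'], errors='coerce')

fig, ax = plt.subplots()
pos = list(range(len(df)))
ax.barh(pos, df['first'], height=-0.25, align='edge')
ax.barh(pos, df['second'], height=0.25, align='edge', color='red')
ax.set_yticks(pos)
ax.set_yticklabels(df.index)

drawn = []  # values that actually received a vertical line
for i, val in enumerate(max_vals):
    if pd.notna(val):  # skip rows without a max value
        ax.vlines(val, i - 0.25, i + 0.25, color='limegreen')
        drawn.append(val)
```

Row D has no max value, so it keeps its two bars and gets no vertical line.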
