I've been struggling to plot the results of the GroupBy on three columns.
I have the data on the different absences (AbsenceType) of employees (Employee) over 3 years (MonthYear). I would like to plot in one plot, how many absences of a particular type an employee had in each month-year. I have only two employees in the example, but there are more in the data as well as more month-year values.
Create data
import pandas as pd
import matplotlib.pyplot as plt
data = {'Employee': ['ID1', 'ID1','ID1','ID1','ID1','ID1','ID1', 'ID1', 'ID1', 'ID2','ID2','ID2','ID2','ID2', 'ID2'],
        'MonthYear': ['201708', '201601','201601','201708','201710','201801','201801', '201601', '201601', '201705', '201705', '201705', '201810', '201811', '201705'],
        'AbsenceType': ['0210', '0210','0250','0215','0217','0260','0210', '0210', '0210', '0260', '0250', '0215', '0217', '0215', '0250']}
columns = ['Employee','MonthYear','AbsenceType']
df = pd.DataFrame(data, columns=columns)
Then I map each of the codes of the AbsenceType into two categories: Sick or Injury.
df['SickOrInjury'] =df['AbsenceType'].replace({'0210':'Sick', '0215':'Sick', '0217':'Sick', '0250':'Injury', '0260':'Injury'})
What I want to achieve is the following groupby:
test = df.groupby(['Employee', 'MonthYear', 'SickOrInjury'])['SickOrInjury'].count()
But, when I try to plot it, it does not fully show what I want. So far I managed to get to the stage:
df.groupby(['Employee', 'MonthYear', 'SickOrInjury'])['SickOrInjury'].count().unstack('SickOrInjury', fill_value=0).plot()
plt.show()
test plot
However, the employee IDs are shown on the X axis and not in the legend.
What I want to have is something like this:
desired plot
I would like to have time on the X axis and the count for each absence type (sick or injury) on the Y axis. There should be two different types of lines (e.g. solid and dashed) for each absence type and different colors for each employee (e.g. black and red).
I think unstacking is the right approach to fill missing values but you should probably convert MonthYear to date and resample by month. You can then plot your dataframe using seaborn.lineplot:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'Employee': ['ID1', 'ID1','ID1','ID1','ID1','ID1','ID1', 'ID1', 'ID1', 'ID2','ID2','ID2','ID2','ID2', 'ID2'],
'MonthYear': ['201708', '201601','201601','201708','201710','201801','201801', '201601', '201601', '201705', '201705', '201705', '201810', '201811', '201705'],
'AbsenceType': ['0210', '0210','0250','0215','0217','0260','0210', '0210', '0210', '0260', '0250', '0215', '0217', '0215', '0250']}
columns = ['Employee','MonthYear','AbsenceType']
df = pd.DataFrame(data, columns=columns)
df['SickOrInjury'] = df['AbsenceType'].replace({'0210':'Sick', '0215':'Sick', '0217':'Sick', '0250':'Injury', '0260':'Injury'})
df['MonthYear'] = pd.to_datetime(df['MonthYear'], format="%Y%m")
df = df.groupby(['MonthYear', 'Employee', 'SickOrInjury']).count()
# renaming the aggregated (and unique) column
df = df.rename(columns={'AbsenceType': 'EmpAbsCount'})
df = df.unstack(['Employee', 'SickOrInjury'], fill_value=0)
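# resampling for monthly values ('M' = month end; pandas 2.2+ prefers the alias 'ME'):
df = df.resample('M').sum().stack(['Employee', 'SickOrInjury'])
# ci=None turns off the confidence band (seaborn 0.12+ prefers errorbar=None)
sns.lineplot(x='MonthYear', y='EmpAbsCount', data=df, hue='Employee', style='SickOrInjury', markers=True, ci=None)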
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output:
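As an alternative that avoids seaborn, the same layout (one colour per employee, one line style per absence type) can be sketched with plain pandas and matplotlib. This is only a sketch: the colour and line-style dictionaries are assumptions taken from the wording of the question, and unlike the answer above it does not resample, so months without any absence are simply not drawn as zero.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

data = {'Employee': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1',
                     'ID2', 'ID2', 'ID2', 'ID2', 'ID2', 'ID2'],
        'MonthYear': ['201708', '201601', '201601', '201708', '201710', '201801',
                      '201801', '201601', '201601', '201705', '201705', '201705',
                      '201810', '201811', '201705'],
        'AbsenceType': ['0210', '0210', '0250', '0215', '0217', '0260', '0210',
                        '0210', '0210', '0260', '0250', '0215', '0217', '0215', '0250']}
df = pd.DataFrame(data)
df['SickOrInjury'] = df['AbsenceType'].map(
    {'0210': 'Sick', '0215': 'Sick', '0217': 'Sick', '0250': 'Injury', '0260': 'Injury'})
df['MonthYear'] = pd.to_datetime(df['MonthYear'], format='%Y%m')

# One row per month, one column per (Employee, SickOrInjury) pair, 0 where nothing was recorded
counts = (df.groupby(['MonthYear', 'Employee', 'SickOrInjury'])
            .size()
            .unstack(['Employee', 'SickOrInjury'], fill_value=0))

colors = {'ID1': 'black', 'ID2': 'red'}   # assumed: one colour per employee
styles = {'Sick': '-', 'Injury': '--'}    # assumed: solid for Sick, dashed for Injury

fig, ax = plt.subplots()
for employee, kind in counts.columns:
    ax.plot(counts.index, counts[(employee, kind)],
            color=colors[employee], linestyle=styles[kind],
            marker='o', label=f'{employee} {kind}')
ax.set_xlabel('MonthYear')
ax.set_ylabel('Number of absences')
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

With more employees, the colours would have to come from a colormap rather than a hand-written dictionary.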
Related
I am trying to make dynamic plots with plotly. I want to plot a count of data that have been aggregated (using groupby).
I want to facet the plot by color (and maybe even by column). The problem is that I want the value count to be displayed on each bar. With histogram, I get smooth bars but I can't find how to display the count:
With a bar plot I can display the count, but I don't get smooth bars, and the count does not appear once for the whole bar but separately for each case that makes up the bar.
Here is my code for the barplot
val = pd.DataFrame(data2.groupby(["program", "gender"])["experience"].value_counts())
px.bar(x=val.index.get_level_values(0), y=val, color=val.index.get_level_values(1), barmode="group", text=val)
It's basically the same for the histogram.
Thank you for your help!
px.histogram does not seem to have a text attribute. So if you're willing to do the binning yourself before producing your plot, I would use px.bar. Normally, you add text to a bar plot using px.bar(..., text=<something>), but this gives the result you've described, with text on every subcategory of your data. Since px.bar adds traces in the order in which the source data is organized, we can simply set the text on the last subcategory applied using fig.data[-1].text = sums. The only remaining challenge is some data munging to retrieve the correct sums.
Plot:
Complete code with data example:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data
df = pd.DataFrame({'x':['a', 'b', 'c', 'd'],
'y1':[1, 4, 9, 16],
'y2':[1, 4, 9, 16],
'y3':[6, 8, 4.5, 8]})
df = df.set_index('x')
# calculations
# column sums for transposed dataframe
sums = []
for col in df.T:
    sums.append(df.T[col].sum())
# change dataframe format from wide to long for input to plotly express
df = df.reset_index()
df = pd.melt(df, id_vars = ['x'], value_vars = df.columns[1:])
fig = px.bar(df, x='x', y='value', color='variable')
fig.data[-1].text = sums
fig.update_traces(textposition='inside')
fig.show()
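The "data munging" step that produces the totals can also be done without the explicit loop. The sketch below uses only pandas and the same example data; long_df and check are names introduced here for illustration, and the second calculation just confirms that the totals match after the frame has been melted to long format.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c', 'd'],
                   'y1': [1, 4, 9, 16],
                   'y2': [1, 4, 9, 16],
                   'y3': [6, 8, 4.5, 8]}).set_index('x')

# Row-wise sum: one total per x category, equivalent to looping over df.T
sums = df.sum(axis=1).tolist()

# Wide to long, the shape plotly express expects
long_df = df.reset_index().melt(id_vars='x', var_name='variable', value_name='value')

# Same totals recomputed from the long frame, as a sanity check
check = long_df.groupby('x')['value'].sum().tolist()
print(sums)
```

Either list can then be assigned to fig.data[-1].text as in the answer above.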
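If your first graph was made with the graph objects library, you can try:
# Use textposition='auto' to draw each bar's value on the bar itself
counts = val.iloc[:, 0]  # val is a one-column DataFrame of counts
fig = go.Figure(data=[go.Bar(x=val.index.get_level_values(0),
                             y=counts, text=counts, textposition='auto')])
# go.Bar takes no color or barmode argument; barmode is a layout setting
fig.update_layout(barmode="group")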
I have a data frame (quarterly national account data for different countries in the OECD) that is a panel with indexes country and time. I want to create a bar chart with the rates of growth of GDP across countries for a given date with country names as xticks...
So far I have been able to create a chart with sorted values, but I am not able to create sorted country names as xticks
quant = quant.sort_values('rGDP_Chg', ascending = False)
#print(quant.rGDP_Chg[:, '2019-Q1'])
ind = np.argsort(quant.rGDP_Chg[:, '2019-Q1'])
print(ind)
plt.bar(ind, quant.rGDP_Chg[:, '2019-Q1'])
plt.show()
Using some dummy data:
import pandas as pd
import matplotlib.pyplot as plt
data = [
['B', 5, '2019-Q1'],
['A', 2, '2019-Q1'],
['A', 3, '2019-Q2'],
]
columns = ['Country', 'Growth', 'Date']
df = pd.DataFrame(data=data, columns=columns)
df[df['Date'] == '2019-Q1'].sort_values(['Country']).plot(x='Country', y='Growth', kind='bar')
plt.show()
Result:
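The question asks for bars ordered by growth rate rather than by country name. A minimal sketch of that variant, reusing the dummy data above plus one extra row ('C') that is made up here purely to make the ordering visible:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

data = [
    ['B', 5, '2019-Q1'],
    ['A', 2, '2019-Q1'],
    ['A', 3, '2019-Q2'],
    ['C', 7, '2019-Q1'],  # hypothetical extra row, not in the original dummy data
]
df = pd.DataFrame(data=data, columns=['Country', 'Growth', 'Date'])

# Keep one quarter, then sort by the value being plotted (largest first)
q1 = df[df['Date'] == '2019-Q1'].sort_values('Growth', ascending=False)

# x='Country' makes the country names the tick labels, in the sorted order
ax = q1.plot(x='Country', y='Growth', kind='bar', legend=False)
ax.set_ylabel('Growth')
plt.tight_layout()
```

Sorting the frame before plotting is what fixes the order; the bar plot just follows the row order it is given.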
I am running into an issue where I can't get my bar chart to show up in descending order of a column grouped by region.
I have tried to order the values and then group by and plot on a bar chart.
df1 = df.drop(['Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'year', 'Unnamed: 0', 'Date'], axis=1)
df1 = df1.sort_values(['AveragePrice'],ascending=True).groupby('region').mean().plot(kind='bar', figsize=(15,5))
The graph still plots the values out in alphabetical order by region.
Group the values first then sort and change ascending=True to False:
df1 = df1.groupby('region').mean().sort_values(['AveragePrice'],ascending=False).plot(kind='bar', figsize=(15,5))
Also, that code overwrites df1 with the Matplotlib Axes object returned by plot() instead of the updated dataframe. Further calls to df1 will just output the type (matplotlib.axes._subplots.AxesSubplot) instead of displaying the dataframe.
To update df1 with the grouped and sorted dataframe you should first manipulate the dataframe and save it, then call plot on the updated dataframe, as shown below:
# Manipulate the dataframe
df1 = df1.groupby('region').mean().sort_values(['AveragePrice'],ascending=False)
# Plot the results
df1.plot(kind='bar', figsize=(15,5))
This way, further calls to df1 will display the grouped and sorted dataframe as expected.
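A small self-contained sketch of the group-then-sort order of operations. The region names and prices are invented for illustration, since the original dataset is not shown:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real data
df = pd.DataFrame({
    'region': ['West', 'East', 'West', 'South', 'East', 'South'],
    'AveragePrice': [1.2, 1.8, 1.4, 1.0, 1.6, 1.1],
})

# Aggregate first, then sort the aggregated values
df1 = df.groupby('region').mean().sort_values('AveragePrice', ascending=False)

# Keep the Axes in its own variable so df1 stays a DataFrame
ax = df1.plot(kind='bar', figsize=(15, 5))
```

Sorting before groupby has no lasting effect because groupby sorts the group keys again by default, which is why the original chart came out alphabetical.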
I have two DataFrames, north and south. Each has the same rows and columns. I would like to plot the speed columns of both DataFrames in one figure as a bar chart. I am trying this:
ax = south['speed'].plot(kind='bar', color='gray')
north['speed'].plot(kind = 'bar', color='red', ax=ax)
plt.show()
But it plots only the last DataFrame, i.e. only the north DataFrame. Can you help me?
1) If you would like to plot just the 'speed' column, you have to concatenate the dataframes like:
df = pd.concat([north, south])
or (only in pandas versions before 2.0, where DataFrame.append was removed):
df = north.append(south)
2) If you would like to compare the 'speed' column of both dataframes, you have to join the dataframes along axis=1 like:
df = pd.concat([north, south], axis=1, ignore_index=True)
and then call the plot method of df.
For more info: https://pandas.pydata.org/pandas-docs/stable/merging.html
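For the side-by-side comparison, one possible sketch is to concatenate only the two speed columns and label them with keys, so that each frame becomes its own bar series. The index labels and values below are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frames with identical rows and columns
north = pd.DataFrame({'speed': [10, 12, 9]}, index=['Mon', 'Tue', 'Wed'])
south = pd.DataFrame({'speed': [8, 11, 13]}, index=['Mon', 'Tue', 'Wed'])

# axis=1 puts the two Series next to each other; keys become the column names
both = pd.concat([north['speed'], south['speed']], axis=1, keys=['north', 'south'])

# One group of bars per row, one bar per column within each group
ax = both.plot(kind='bar', color=['red', 'gray'])
ax.set_ylabel('speed')
```

Using keys instead of ignore_index=True keeps readable names ('north', 'south') in the legend rather than 0 and 1.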
I would like to plot 12 graphs (one graph per month) including columns 'A' and 'B' on the left y axis and column 'C' on the right.
Code below plots everything on the left side.
import numpy as np
import pandas as pd
index=pd.date_range('2011-1-1 00:00:00', '2011-12-31 23:50:00', freq='1h')
df=pd.DataFrame(np.random.rand(len(index),3),columns=['A','B','C'],index=index)
df2 = df.groupby(lambda x: x.month)
for key, group in df2:
    group.plot()
How can I separate the columns and use something like this: group.plot({'A','B':style='g'},{'C':secondary_y=True})?
You can capture the axes which the Pandas plot() command returns and use it again to plot C specifically on the right axis.
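import numpy as np
import pandas as pd
index=pd.date_range('2011-1-1 00:00:00', '2011-12-31 23:50:00', freq='1h')
df=pd.DataFrame(np.random.randn(len(index),3).cumsum(axis=0),columns=['A','B','C'],index=index)
df2 = df.groupby(lambda x: x.month)
for key, group in df2:
    # plot A and B on the left axis and keep the returned axes
    ax = group[['A', 'B']].plot()
    # reuse those axes so C is drawn against the right-hand y axis
    group[['C']].plot(secondary_y=True, ax=ax)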
To get all lines in a single legend see:
Legend only shows one label when plotting with pandas
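One way to get a single legend is to collect the line handles from both axes and pass them to one legend call. The sketch below swaps pandas' secondary_y for matplotlib's twinx() so that both axes are explicit. plot_month is a helper name introduced here, and daily instead of hourly data is used only to keep the example light.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

index = pd.date_range('2011-01-01', '2011-12-31', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((len(index), 3)).cumsum(axis=0),
                  columns=['A', 'B', 'C'], index=index)

def plot_month(group):
    """Plot A and B on the left axis, C on the right, with one shared legend."""
    fig, ax = plt.subplots()
    ax.plot(group.index, group['A'], label='A')
    ax.plot(group.index, group['B'], label='B')
    ax2 = ax.twinx()  # second y axis sharing the same x axis
    ax2.plot(group.index, group['C'], color='g', label='C')
    # Gather the handles from both axes so a single legend lists all three lines
    lines = list(ax.get_lines()) + list(ax2.get_lines())
    ax.legend(lines, [line.get_label() for line in lines])
    fig.autofmt_xdate()
    return fig, ax

# One figure per calendar month
figures = {}
for month, group in df.groupby(df.index.month):
    figures[month] = plot_month(group)
```

Each entry of figures holds the figure and its left axis, so individual months can be saved with fig.savefig(...) afterwards.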