I've been struggling to plot the result of a groupby on three columns.
I have data on the different absence types (AbsenceType) of employees (Employee) over three years (MonthYear). I would like to show, in a single plot, how many absences of a particular type each employee had in each month-year. The example has only two employees, but the real data has more employees and more month-year values.
Create data
data = {'Employee': ['ID1', 'ID1','ID1','ID1','ID1','ID1','ID1', 'ID1', 'ID1', 'ID2','ID2','ID2','ID2','ID2', 'ID2'],
'MonthYear': ['201708', '201601','201601','201708','201710','201801','201801', '201601', '201601', '201705', '201705', '201705', '201810', '201811', '201705'],
'AbsenceType': ['0210', '0210','0250','0215','0217','0260','0210', '0210', '0210', '0260', '0250', '0215', '0217', '0215', '0250']}
columns = ['Employee','MonthYear','AbsenceType']
df = pd.DataFrame(data, columns=columns)
Then I map each of the codes of the AbsenceType into two categories: Sick or Injury.
df['SickOrInjury'] = df['AbsenceType'].replace({'0210':'Sick', '0215':'Sick', '0217':'Sick', '0250':'Injury', '0260':'Injury'})
What I want to achieve is the following groupby:
test = df.groupby(['Employee', 'MonthYear', 'SickOrInjury'])['SickOrInjury'].count()
But, when I try to plot it, it does not fully show what I want. So far I managed to get to the stage:
df.groupby(['Employee', 'MonthYear', 'SickOrInjury'])['SickOrInjury'].count().unstack('SickOrInjury', fill_value=0).plot()
plt.show()
[Image: test plot]
However, the employee IDs are shown on the X axis and not in the legend.
What I want to have is something like this:
[Image: desired plot]
I would like to have time on the X axis and the count for each absence type (sick or injury) on the Y axis. There should be two different types of lines (e.g. solid and dashed) for each absence type and different colors for each employee (e.g. black and red).
I think unstacking is the right approach to fill missing values, but you should probably convert MonthYear to datetime and resample by month. You can then plot the dataframe using seaborn.lineplot:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'Employee': ['ID1', 'ID1','ID1','ID1','ID1','ID1','ID1', 'ID1', 'ID1', 'ID2','ID2','ID2','ID2','ID2', 'ID2'],
'MonthYear': ['201708', '201601','201601','201708','201710','201801','201801', '201601', '201601', '201705', '201705', '201705', '201810', '201811', '201705'],
'AbsenceType': ['0210', '0210','0250','0215','0217','0260','0210', '0210', '0210', '0260', '0250', '0215', '0217', '0215', '0250']}
columns = ['Employee','MonthYear','AbsenceType']
df = pd.DataFrame(data, columns=columns)
df['SickOrInjury'] = df['AbsenceType'].replace({'0210':'Sick', '0215':'Sick', '0217':'Sick', '0250':'Injury', '0260':'Injury'})
df['MonthYear'] = pd.to_datetime(df['MonthYear'], format="%Y%m")
df = df.groupby(['MonthYear', 'Employee', 'SickOrInjury']).count()
# renaming the aggregated (and unique) column
df = df.rename(columns={'AbsenceType': 'EmpAbsCount'})
df = df.unstack(['Employee', 'SickOrInjury'], fill_value=0)
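# resampling for monthly values ('M' is the month-end alias; pandas >= 2.2 prefers 'ME'):
df = df.resample('M').sum().stack(['Employee', 'SickOrInjury'])
# hue gives one color per employee, style one line style per absence category
# (in seaborn >= 0.12, ci=None is deprecated in favor of errorbar=None)
sns.lineplot(x='MonthYear', y='EmpAbsCount', data=df, hue='Employee', style='SickOrInjury', markers=True, ci=None)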
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output: [Image: resulting line plot]
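For comparison, here is a minimal sketch of the same idea using plain matplotlib instead of seaborn. It counts absences per month, fills the months without absences with zeros, and draws one line per (employee, category) pair. The colors and styles dictionaries are assumptions chosen to match the black/red and solid/dashed request in the question; with more employees they would need to be extended or replaced by a colormap.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = {'Employee': ['ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1', 'ID1',
                     'ID2', 'ID2', 'ID2', 'ID2', 'ID2', 'ID2'],
        'MonthYear': ['201708', '201601', '201601', '201708', '201710', '201801', '201801',
                      '201601', '201601', '201705', '201705', '201705', '201810', '201811', '201705'],
        'AbsenceType': ['0210', '0210', '0250', '0215', '0217', '0260', '0210', '0210', '0210',
                        '0260', '0250', '0215', '0217', '0215', '0250']}
df = pd.DataFrame(data)
df['SickOrInjury'] = df['AbsenceType'].replace(
    {'0210': 'Sick', '0215': 'Sick', '0217': 'Sick', '0250': 'Injury', '0260': 'Injury'})
df['MonthYear'] = pd.to_datetime(df['MonthYear'], format='%Y%m')

# Count rows per month/employee/category, then spread the last two levels into columns
counts = (df.groupby(['MonthYear', 'Employee', 'SickOrInjury'])
            .size()
            .unstack(['Employee', 'SickOrInjury'], fill_value=0))

# Insert the months that have no absences at all, with a count of zero
months = pd.date_range(counts.index.min(), counts.index.max(), freq='MS')
counts = counts.reindex(months, fill_value=0)

# Assumed mapping: one color per employee, one line style per category
colors = {'ID1': 'black', 'ID2': 'red'}
styles = {'Sick': '-', 'Injury': '--'}

fig, ax = plt.subplots()
for employee, category in counts.columns:
    ax.plot(counts.index, counts[(employee, category)],
            color=colors[employee], linestyle=styles[category],
            label=f'{employee} {category}')
ax.set_xlabel('MonthYear')
ax.set_ylabel('Absence count')
ax.legend()
fig.autofmt_xdate()
```

With an interactive backend, calling plt.show() afterwards displays the figure.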
I am running into an issue where I can't get my bar chart to show up in descending order of a column grouped by region.
I have tried to order the values and then group by and plot on a bar chart.
df1 = df.drop(['Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'year', 'Unnamed: 0', 'Date'], axis=1)
df1 = df1.sort_values(['AveragePrice'],ascending=True).groupby('region').mean().plot(kind='bar', figsize=(15,5))
The graph still plots the values out in alphabetical order by region.
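By default, groupby orders its result by the group key, so sorting before grouping has no effect on the final order. Group the values first, then sort, and change ascending=True to False: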
df1 = df1.groupby('region').mean().sort_values(['AveragePrice'],ascending=False).plot(kind='bar', figsize=(15,5))
Also, that code assigns the return value of plot() to df1, so df1 becomes a Matplotlib Axes object instead of the updated dataframe. Further references to df1 will just show the type (matplotlib.axes._subplots.AxesSubplot) instead of displaying the dataframe.
To update df1 with the grouped and sorted dataframe you should first manipulate the dataframe and save it, then call plot on the updated dataframe, as shown below:
# Manipulate the dataframe
df1 = df1.groupby('region').mean().sort_values(['AveragePrice'],ascending=False)
# Plot the results
df1.plot(kind='bar', figsize=(15,5))
This way, further calls to df1 will display the grouped and sorted dataframe as expected.
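The difference between the two orderings can be checked on a small example. The data below is a hypothetical stand-in for the real dataset (only the region and AveragePrice column names come from the question), so treat it as a sketch rather than a reproduction of the original result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    'region': ['West', 'East', 'West', 'East', 'South'],
    'AveragePrice': [1.0, 2.0, 1.4, 1.8, 0.9],
})

# Sorting first is undone: groupby orders the result by region name
sorted_first = (df.sort_values('AveragePrice', ascending=False)
                  .groupby('region')['AveragePrice'].mean())

# Grouping first and sorting the means afterwards keeps the descending order
grouped_then_sorted = (df.groupby('region')['AveragePrice'].mean()
                         .sort_values(ascending=False))

# Keep the Axes in its own variable so the sorted data stays available
ax = grouped_then_sorted.plot(kind='bar', figsize=(15, 5))
```

Here sorted_first comes back in alphabetical order of region, while grouped_then_sorted is ordered by mean price, and the bars follow that order.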
I have two DataFrames, north and south, with the same rows and columns. I would like to plot the speed columns of both DataFrames in one figure as a bar chart. I am trying this:
ax = south['speed'].plot(kind='bar', color='gray')
north['speed'].plot(kind = 'bar', color='red', ax=ax)
plt.show()
But it plots only the last DataFrame, i.e. only north. Can you help me?
1) If you would like to plot just the 'speed' column, concatenate the dataframes:
df = pd.concat([north, south])
or, in pandas versions before 2.0 (DataFrame.append was removed in 2.0):
df = north.append(south)
2) If you would like to compare the 'speed' columns of both dataframes, join the dataframes along axis=1:
df = pd.concat([north, south], axis=1, ignore_index=True)
and then call the plot method of df.
For more info: https://pandas.pydata.org/pandas-docs/stable/merging.html
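A minimal sketch of the second option, restricted to the speed columns. The contents of north and south are made-up values for illustration, and the keys argument is used here (instead of ignore_index=True) so that the two columns keep readable names in the legend. In the original attempt, both bar plots are most likely drawn at the same x positions, so the second set of bars covers the first.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import pandas as pd

# Hypothetical data: same rows and columns in both frames, as in the question
north = pd.DataFrame({'speed': [10, 20, 30]}, index=['a', 'b', 'c'])
south = pd.DataFrame({'speed': [15, 25, 5]}, index=['a', 'b', 'c'])

# Put the two speed columns side by side, labeled by their origin
speeds = pd.concat([north['speed'], south['speed']], axis=1, keys=['north', 'south'])

# One call draws grouped bars: one bar per frame for every row
ax = speeds.plot(kind='bar', color=['red', 'gray'])
ax.set_ylabel('speed')
```

With an interactive backend, plt.show() (after importing matplotlib.pyplot as plt) displays the grouped bars.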
I read a CSV with many columns, including a Date column, into a pandas dataframe called 'breakageDf'.
Since the Date column is of 'object' type, I first convert it to string and then to datetime:
breakageDf["Date"] = breakageDf["Date"].astype("str")
breakageDf["Date"] = pd.to_datetime(breakageDf["Date"], format="%d-%m-%Y")
Then I set this Date column as index.
breakageDf = breakageDf.set_index("Date")
I wish to create a Bokeh line plot with Date as X axis and some other column values as Y axis.
p = figure(x_axis_type="datetime", width=w, height=h, tools=[hover, 'pan', 'wheel_zoom'])
p.line(breakageDf.index, breakageDf["BreakageValue"], color="#A6CEE3", legend="BreakageValue")
But the plot comes out wrong: there is no spread on the X axis at all.
[Image: Plot Errors]
The index, when printed, looks like this:
DatetimeIndex(['2015-03-05', '2015-03-07', '2015-03-10', '2015-03-11',
'2015-03-12', '2015-03-12', '2015-03-15', '2015-03-15',
'2015-03-15', '2015-03-20',
...
'2016-02-21', '2016-02-23', '2016-02-26', '2016-02-27',
'2016-02-28', '2016-03-08', '2016-03-14', '2016-03-15',
'2016-03-17', '2016-03-18'],
dtype='datetime64[ns]', name='Date', length=192, freq=None)
Is the frequency being None the reason? Should it be daily? How can I set it programmatically?
I would like to color slices of my DataFrame with chosen colors. I know how to slice the DataFrame, but I don't know how to put it all together in one plot. Currently my MWE produces two separate plots:
import pandas as pd
import numpy as np
index=pd.date_range('2011-1-1 00:00:00', '2011-1-31 23:50:00', freq='1h')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
df2 = df.groupby([lambda x: x.month, lambda x: x.day]).sum()
df2[:11].plot(kind='bar', color='r')
df2[12:].plot(kind='bar', color='y')
I would like to have one plot (not two as in the example) with all 31 values, where the bars in the range [:11] are red and those in [12:] are yellow.
You need to concatenate each so that they are separate series. You also need to rename them so they are not exactly the same (I've appended a space).
df3 = pd.concat([df2[:11], df2[12:]], axis=1)
df3.columns = ['A', 'B', 'A ', 'B ']
df3.plot(kind='bar', color=['r', 'r', 'y', 'y'])
Alternatively, specify the color for each value in the series.
colors = tuple(['r'] * 11 + ['y'] * (len(df2) - 11))
df2.plot(kind='bar', color=[colors], legend=False)
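The per-bar color idea can also be sketched with matplotlib's bar function directly, which accepts one color per bar. This is a variant of the approach above rather than the exact code from the answer: it plots only column A, uses a seeded random generator so the run is repeatable, and groups by day of month only (the data covers a single month).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for repeatability (an assumption, not in the question)
index = pd.date_range('2011-01-01 00:00:00', '2011-01-31 23:00:00', freq='h')
df = pd.DataFrame(rng.standard_normal((len(index), 2)).cumsum(axis=0),
                  columns=['A', 'B'], index=index)

# One value per day of the month for column A
daily = df['A'].groupby(df.index.day).sum()

# First 11 bars red, all remaining bars yellow
colors = ['r'] * 11 + ['y'] * (len(daily) - 11)

fig, ax = plt.subplots()
ax.bar(daily.index, daily.values, color=colors)
ax.set_xlabel('day of month')
```

Because the color list has the same length as the data, no bar is skipped, unlike the [:11] and [12:] slices, which leave out the bar at position 11.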