I am having difficulty adding a multi-level axis (month, then year) to my plot, and I have been unable to find an answer anywhere. I have a dataframe that contains the upload date as a datetime dtype, plus the year and month for each row. See below:
Upload Date Year Month DocID
0 2021-03-22 2021 March DOC146984
1 2021-12-16 2021 December DOC173111
2 2021-12-07 2021 December DOC115350
3 2021-10-29 2021 October DOC150149
4 2021-03-12 2021 March DOC125480
5 2021-06-25 2021 June DOC101062
6 2021-05-03 2021 May DOC155916
7 2021-11-14 2021 November DOC198519
8 2021-03-20 2021 March DOC159523
9 2021-07-19 2021 July DOC169328
10 2021-04-13 2021 April DOC182660
11 2021-10-08 2021 October DOC176871
12 2021-09-19 2021 September DOC185854
13 2021-05-16 2021 May DOC192329
14 2021-06-29 2021 June DOC142190
15 2021-11-30 2021 November DOC140231
16 2021-11-12 2021 November DOC145392
17 2021-11-10 2021 November DOC178159
18 2021-11-06 2021 November DOC160932
19 2021-06-16 2021 June DOC131448
What I am trying to achieve is to build a bar chart which has the count for number of documents in each month and year. The graph would look something like this:
The main thing is that the x-axis is split by month and then further by year, rather than labelling each column with month and year (e.g. 'March 2021'). However, I can't figure out how to achieve this. I've tried using a countplot, but it only lets me choose month or year (see below). I have also tried groupby, but the end result is always the same. Any ideas?
This is using randomly generated data, see the code to replicate below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from datetime import date, timedelta
from random import choices
np.random.seed(42)
# initializing dates ranges
test_date1, test_date2 = date(2020, 1, 1), date(2021, 6, 30)
# initializing K
K = 2000
res_dates = [test_date1]
# loop to get each date till end date
while test_date1 != test_date2:
    test_date1 += timedelta(days=1)
    res_dates.append(test_date1)
# random K dates from pack
res = choices(res_dates, k=K)
# Generating dataframe
df = pd.DataFrame(res, columns=['Upload Date'])
# Generate other columns
df['Upload Date'] = pd.to_datetime(df['Upload Date'])
df['Year'] = df['Upload Date'].dt.year
df['Month'] = df['Upload Date'].dt.month_name()
df['DocID'] = np.random.randint(100000,200000, df.shape[0]).astype('str')
df['DocID'] = 'DOC' + df['DocID']
# plotting graph
sns.set_color_codes("pastel")
f, ax = plt.subplots(figsize=(20,8))
sns.countplot(x='Month', data=df)
A new column with year and month in numeric form can serve to indicate the x-positions, correctly ordered. The x-tick labels can be renamed to the month names. Vertical lines and manual placing of the year labels lead to the final plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
test_date1, test_date2 = '20200101', '20210630'
months = pd.date_range('2021-01-01', periods=12, freq='M').strftime('%B')
K = 2000
df = pd.DataFrame(np.random.choice(pd.date_range(test_date1, test_date2), K), columns=['Upload Date'])
df['Year'] = df['Upload Date'].dt.year
# df['Month'] = pd.Categorical(df['Upload Date'].dt.strftime('%B'), categories=months)
df['YearMonth'] = df['Upload Date'].dt.strftime('%Y%m').astype(int)
df['DocID'] = np.random.randint(100000, 200000, df.shape[0]).astype('str')
df['DocID'] = 'DOC' + df['DocID']
sns.set_style("white")
sns.set_color_codes("pastel")
fig, ax = plt.subplots(figsize=(20, 8))
sns.countplot(x='YearMonth', data=df, ax=ax)
sns.despine()
yearmonth_labels = [int(l.get_text()) for l in ax.get_xticklabels()]
ax.set_xticklabels([months[ym % 100 - 1] for ym in yearmonth_labels])
ax.set_xlabel('')
# calculate the positions of the borders between the years
pos = []
years = []
prev = None
for i, ym in enumerate(yearmonth_labels):
    if ym // 100 != prev:
        pos.append(i)
        prev = ym // 100
        years.append(prev)
pos.append(len(yearmonth_labels))
pos = np.array(pos) - 0.5
# vertical lines to separate the years
ax.vlines(pos, 0, -0.12, color='black', lw=0.8, clip_on=False, transform=ax.get_xaxis_transform())
# years at the center of their range
for year, pos0, pos1 in zip(years, pos[:-1], pos[1:]):
    ax.text((pos0 + pos1) / 2, -0.07, year, ha='center', clip_on=False, transform=ax.get_xaxis_transform())
ax.set_xlim(pos[0], pos[-1])
ax.set_ylim(ymin=0)
plt.tight_layout()
plt.show()
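The year/month encoding used in this answer can be checked without plotting. The sketch below uses a handful of made-up dates (not the question's data) and shows how the integer `YearMonth` value sorts chronologically, how `// 100` and `% 100` recover year and month, and where the year separators would fall. The names `counts`, `month_numbers` and `year_starts` are only illustrative.

```python
import pandas as pd

# Hypothetical sample dates, not the question's data
dates = pd.to_datetime(
    ["2020-11-03", "2020-12-25", "2021-01-09", "2021-01-30", "2021-02-14"]
)
df = pd.DataFrame({"Upload Date": dates})

# Encode year and month as one sortable integer, e.g. 202011
df["YearMonth"] = df["Upload Date"].dt.strftime("%Y%m").astype(int)

# Count documents per YearMonth; sort_index keeps chronological order
counts = df["YearMonth"].value_counts().sort_index()

# Decode: integer division gives the year, the remainder gives the month
years = [int(ym) // 100 for ym in counts.index]
month_numbers = [int(ym) % 100 for ym in counts.index]

# Index of the first bar of each year, i.e. where a separator line belongs
year_starts = [i for i, y in enumerate(years) if i == 0 or y != years[i - 1]]

print(counts)
print(years, month_numbers, year_starts)
```

Because the encoded value is a plain integer, seaborn orders the bars chronologically across the year boundary, which is what a month-name column alone cannot do.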
Related
I'm trying to plot a barplot with seaborn and I want to change the labels of the xticks.
When the ticks remain numbers, all the bars show on the graph. To get these numbers I created a column with the month of each sale date and used groupby to group the data by month.
pr_sales = sns.barplot(x='Month', y='dollartotal', color='lightblue', data=df_pr_2020, )
pr_sales.set(xlabel='2020 by Month', ylabel='Dollars', title='2020 Pre-Roll Sales ')
See below X axis numbers:
But when I convert the axis to labels I lose a bar of data. The main difference in this code is Months are taken from the saledate column.
# month labels must be defined before they are used for the ticks
Months = df_pr_2020['saledate'].dt.strftime('%b')
pr_sales = sns.barplot(x='Month', y='dollartotal', color='lightblue', data=df_pr_2020)
pr_sales.set(xlabel='2020 by Month', ylabel='Dollars', title='2020 Pre-Roll Sales ')
pr_sales.set_xticks(range(len(Months)))
pr_sales.set_xticklabels(Months)
As you can see, December is now not showing:
In your original dataframe, the data for April are missing, so when you change the tick labels, the labels shift one position to the left and it looks as if December is empty.
I suppose you are working with a dataframe like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
samples = 10
df_pr_2020 = pd.DataFrame({'date': np.repeat(pd.date_range(start = '2020-01-01', end = '2020-12-31', freq = 'D'), samples)})
df_pr_2020 = df_pr_2020[df_pr_2020['date'].dt.month != 4]
df_pr_2020['dollartotal'] = 14000*df_pr_2020['date'].dt.month + 150000 + 20000*np.random.randn(len(df_pr_2020))
df_pr_2020['Month'] = df_pr_2020['date'].dt.month.astype(str).apply(lambda x: x.zfill(2))
date dollartotal Month
0 2020-01-01 176160.921650 01
1 2020-01-01 166157.143534 01
2 2020-01-01 187411.157138 01
3 2020-01-01 114761.306578 01
4 2020-01-01 164297.114934 01
5 2020-01-01 163810.407586 01
6 2020-01-01 133359.064046 01
7 2020-01-01 151415.390232 01
8 2020-01-01 131282.857219 01
9 2020-01-01 155171.540172 01
10 2020-01-02 189664.518915 01
11 2020-01-02 156596.250250 01
12 2020-01-02 148865.016156 01
13 2020-01-02 173507.621407 01
14 2020-01-02 155717.433085 01
15 2020-01-02 166862.159482 01
When plotting, you get (notice the missing April):
When you extract the 'Month' column, you can convert it directly from month number to short month name:
df_pr_2020['Month'] = df_pr_2020['date'].dt.month_name().str[:3]
Complete Code
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
samples = 10
df_pr_2020 = pd.DataFrame({'date': np.repeat(pd.date_range(start = '2020-01-01', end = '2020-12-31', freq = 'D'), samples)})
df_pr_2020 = df_pr_2020[df_pr_2020['date'].dt.month != 4]
df_pr_2020['dollartotal'] = 14000*df_pr_2020['date'].dt.month + 150000 + 20000*np.random.randn(len(df_pr_2020))
df_pr_2020['Month'] = df_pr_2020['date'].dt.month_name().str[:3]
pr_sales = sns.barplot(x='Month', y='dollartotal', color='lightblue', data=df_pr_2020)
pr_sales.set(xlabel='2020 by Month', ylabel='Dollars', title='2020 Pre-Roll Sales ')
plt.tight_layout()
plt.show()
As you can see, the x ticks are converted to short month names and the data for April are still missing, as in the original data.
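The label shift described in this answer can be reproduced without any plotting. The sketch below assumes a year of daily dates with April removed (made-up data) and compares two ways of labelling the 11 remaining bars: assigning the 12 month names by position, and deriving each label from the month the bar actually holds. The names `positional` and `derived` are only illustrative.

```python
import pandas as pd

# Hypothetical data: every day of 2020 except April
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")
dates = dates[dates.month != 4]

# Month numbers that actually occur: 11 values, no 4
present = sorted({int(m) for m in dates.month})

all_labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Positional relabelling: the i-th bar gets the i-th label,
# regardless of which month the bar really represents
positional = dict(zip(present, all_labels))

# Labels derived from the data: each bar keeps its own month name
derived = {m: all_labels[m - 1] for m in present}

print(positional)
print(derived)
```

With positional labels the May bar is called "Apr" and the December bar is called "Nov", so the "Dec" tick has no bar behind it. Deriving the labels from the data avoids the mismatch.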
I wanna plot a dataframe with seaborn lineplot, which has the following structure:
A Year Month diff
Der 2019 1 3
Der 2019 2 4
Die 2019 1 1
Die 2019 2 1
Right now I am trying:
sns.lineplot(x= ['Year', 'Month'], y='diff', hue='A', ci=None, data = df)
plt.show()
How can I get a timeline graph starting with 2019 1 and going over the order of the months without having a time column?
You could create a new date column from the year and month and just set the day to be 1:
from datetime import date
import matplotlib.dates as mdates
df['date'] = df.apply(lambda row: date(row['Year'], row['Month'], 1), axis=1)
ax = sns.lineplot(x='date', y='diff', hue='A', ci=None, data=df)
# To only show x-axis ticks once per month:
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
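As a possible alternative to the row-wise `apply`, `pd.to_datetime` can assemble dates from columns named `year`, `month` and `day`. The sketch below rebuilds the small frame from the question and is only one way to do it; the intermediate name `parts` is illustrative.

```python
import pandas as pd

# The example frame from the question
df = pd.DataFrame({
    "A": ["Der", "Der", "Die", "Die"],
    "Year": [2019, 2019, 2019, 2019],
    "Month": [1, 2, 1, 2],
    "diff": [3, 4, 1, 1],
})

# to_datetime can assemble dates from 'year', 'month' and 'day' columns;
# the day is fixed to 1 because the data has monthly resolution
parts = (
    df[["Year", "Month"]]
    .rename(columns={"Year": "year", "Month": "month"})
    .assign(day=1)
)
df["date"] = pd.to_datetime(parts)

print(df)
```

The resulting `date` column has a datetime dtype, so it can be passed as `x='date'` to `sns.lineplot` exactly as in the answer above.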
I want to draw a chart from my dataframe (which has daily data) and want the x labels to show up as months (covering the whole period of daily data).
For example, if I have data from 2010-01-01 to 2010-12-31, I want 365 daily data points, but on the x-axis I want just Jan, Feb, Mar, etc., each month spanning exactly the period of its days. I am struggling to get this.
This is the DataFrame:
Daily CP ROI S2 positive Month Day
Date
2008-01-02 100087.000 True January 2
2008-01-03 101967.000 True January 3
2008-01-04 102167.000 True January 4
2008-01-07 104004.000 True January 7
2008-01-08 105192.000 True January 8
pl_plot = pl_plot.set_index('Date')
figure(num=None, figsize=(20, 8), dpi=80, facecolor='w', edgecolor='k')
plt.ylabel('USD', fontsize=18)
plt.rc('ytick',labelsize=16)
final_value = new_df_test.iloc[ei-2]['Daily_Compound_ROI']
roi = round(((final_value - very_init_budget)/very_init_budget)*100,3)
roi_s=str(roi)
plt_title_s = new_title+'\nPeriod: '+y_init_s+'-'+y_end_s+', (ROI: '+roi_s+'%)'
plt.title(plt_title_s, fontsize=24)
pl_plot['positive'] = pl_plot['Daily_Compound_ROI'] > 100000
pl_plot['Daily CP ROI S2'].plot(kind='bar', color=pl_plot.positive.map({True: '#5cb85c', False: 'r'}))
ax1 = plt.axes()
plt.rc('xtick',labelsize=14)
x_axis = ax1.axes.get_xaxis()
x_axis.set_visible(True)
I want to keep the coloring (red when the value is below 100000, green otherwise), but on the x-axis I would like to see just Jan, Feb, Mar, etc., without any separation between the months and without a label for every single day as I see right now.
IIUC you could give seaborn a try.
If you add a month column and a day column to your dataframe, you can make a barplot with one block of daily bars per month.
Example:
# import pandas as pd
# import numpy as np
# df = pd.DataFrame(index=pd.date_range('15.8.2019', '27.11.2019'))
# df['Value'] = np.random.random(len(df))
# df['Month'] = df.index.month_name()
# df['Day'] = df.index.day
# Value Month Day
# 2019-08-15 0.813130 August 15
# 2019-08-16 0.850873 August 16
# 2019-08-17 0.728416 August 17
# 2019-08-18 0.326072 August 18
# 2019-08-19 0.880385 August 19
# ... ... ..
# 2019-11-23 0.771801 November 23
# 2019-11-24 0.638811 November 24
# 2019-11-25 0.824542 November 25
# 2019-11-26 0.451075 November 26
# 2019-11-27 0.151469 November 27
given this dataframe, you could do
import seaborn as sns
sns.catplot(kind='bar', x='Month', y='Value', hue='Day', data=df, color='b', legend=False)
results in
It looks quite strange that I am answering my own question.
Anyway, I found the way:
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
sns.set(font_scale=1.5, style="whitegrid")
fig, ax = plt.subplots(figsize=(20, 8))
ax.bar(df.index.values, df['Daily CP ROI S2'].values, width=1, color=pl_plot.positive.map({True: '#5cb85c', False: 'r'}))
ax.set(ylabel="USD", title=title1)  # xlabel="Period" intentionally left out
ax.title.set(fontsize=24)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(DateFormatter("%b"))
plt.show()
The above fixes the problem and gives me the chart as I wanted it.
The only problem left is that I now need to fit the title within the chart :) I will look into this, but if you have any suggestions, please go ahead! Thanks.
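For readers who want to try this approach without the original data, here is a self-contained sketch under stated assumptions: the daily series is synthetic (a random walk around 100000), and the `Agg` backend line is only there so the code runs without a display. Only `ax.bar`, `MonthLocator` and `DateFormatter` come from the answer above; everything else is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove to view interactively
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily series (values are made up)
idx = pd.date_range("2008-01-01", "2008-12-31", freq="D")
rng = np.random.default_rng(0)
s = pd.Series(100000 + rng.normal(0, 500, len(idx)).cumsum(), index=idx)

# Green above the 100000 threshold, red otherwise
colors = ["#5cb85c" if v > 100000 else "r" for v in s.to_numpy()]

fig, ax = plt.subplots(figsize=(20, 8))
# width=1 is one day on a date axis, so neighbouring bars touch
ax.bar(s.index.to_numpy(), s.to_numpy(), width=1, color=colors)

# One major tick per month, labelled with the abbreviated month name
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
ax.set_ylabel("USD")

fig.savefig("daily_bars_by_month.png")
```

Keeping a real date axis (instead of the categorical axis that `plot(kind='bar')` creates) is what allows the month locator to place one tick per month while every day still gets its own bar.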
In[1] df
Out[0] week year number_of_cases
8 2010 583.0
9 2010 116.0
10 2010 358.0
11 2010 420.0
... ... ...
52 2010 300.0
1 2011 123.0
2 2011 145.0
How may I create a timeline graph where my y-axis is the number of cases and my x axis is increasing week number that corresponds with the year?
I want it to go from week 1 to 52 in the year 2010 then week 1 to 52 in the year 2011. And have this as one large graph to see how the number of cases vary each year according to week.
Python 3, Pandas.
You can create a datetime column based on the year and the week, plot 'number_of_cases' against the date, and then use mdates to format the x-ticks.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Determine the date
df['date'] = pd.to_datetime(df.assign(day=1, month=1)[['year', 'month', 'day']])+pd.to_timedelta(df.week*7, unit='days')
# Plot
fig, ax = plt.subplots()
df.plot(x='date', y='number_of_cases', marker='o', ax=ax)
# Format the x-ticks
myFmt = mdates.DateFormatter('%Y week %U')
ax.xaxis.set_major_formatter(myFmt)
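Note that the formula above places week w at 1 January plus 7w days, so week 1 lands on 8 January. If the `week` column holds ISO 8601 week numbers (an assumption, since the question does not say), a possible alternative is `date.fromisocalendar` (Python 3.8+), which maps a year and week to the Monday of that week. The sample rows below are taken from the question's output.

```python
from datetime import date

import pandas as pd

# A few rows from the question's frame
df = pd.DataFrame({
    "week": [8, 9, 52, 1, 2],
    "year": [2010, 2010, 2010, 2011, 2011],
    "number_of_cases": [583.0, 116.0, 300.0, 123.0, 145.0],
})

# Assumption: 'week' is an ISO 8601 week number.
# fromisocalendar(year, week, 1) returns the Monday of that ISO week.
df["date"] = pd.to_datetime([
    date.fromisocalendar(int(y), int(w), 1)
    for y, w in zip(df["year"], df["week"])
])

print(df)
```

The resulting `date` column increases monotonically across the year boundary, so it can be used as the x-axis in the same `df.plot(x='date', ...)` call.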
I am making a stacked bar plot over a year time span where the x-axis is company names, y-axis is the number of calls, and the stacks are the months.
I want to be able to make this plot run for a time span of a month, where the stacks are days, and a time span of a week, where the stacks are days. I am having trouble doing this since my code is built already around the year time span.
My input is a dataframe that looks like this
pivot_table.head(3)
Out[12]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
CompanyName
Customer1 17 30 29 39 15 26 24 12 36 21 18 15
Customer2 4 11 13 22 35 29 15 18 29 31 17 14
Customer3 11 8 25 24 7 15 20 0 21 12 12 17
and my code is this so far.
First I grab a years worth of data (I would change this to a month or a week for this question)
# parse the received timestamps
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
#Only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for the final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - pd.DateOffset(years=1)
# take slice with the final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
Then I create the pivot_table shown above.
###########################################################
#Create Dataframe
###########################################################
df = df.set_index('recvd_dttm')
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg(len).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
pivot_table = result.pivot(index='Month', columns='CompanyName', values='NumberCalls').fillna(0)
# Series.sort and .ix were removed from pandas; sort_values and .loc replace them
s = pivot_table.sum().sort_values(ascending=False)
pivot_table = pivot_table.loc[:, s.index[:30]]
pivot_table = pivot_table.transpose()
pivot_table = pivot_table.reset_index()
pivot_table['CompanyName'] = [str(x) for x in pivot_table['CompanyName']]
Companies = list(pivot_table['CompanyName'])
pivot_table = pivot_table.set_index('CompanyName')
pivot_table.to_csv('pivot_table.csv')
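One way the filtering and pivot steps could be generalized to shorter spans is to pass the span as a `pd.DateOffset` and stack by calendar day. This is only a sketch under assumptions: the call log below is made up in the shape of the question's input, and `daily_pivot` is a hypothetical helper name, not part of the original code.

```python
import pandas as pd

# Hypothetical call log in the same shape as the question's input
df = pd.DataFrame({
    "CompanyName": ["Company1", "Company2", "Company1", "Company3", "Company1"],
    "recvd_dttm": pd.to_datetime([
        "2015-06-05 18:28", "2015-06-05 14:25", "2015-06-06 09:10",
        "2015-06-20 21:45", "2015-05-01 08:00",
    ]),
})

def daily_pivot(calls, span):
    """Count calls per company and calendar day over the last `span`.

    span is a pd.DateOffset, e.g. pd.DateOffset(months=1) or
    pd.DateOffset(weeks=1). (Hypothetical helper for illustration.)
    """
    range_max = calls["recvd_dttm"].max()
    range_min = range_max - span
    in_range = (calls["recvd_dttm"] >= range_min) & (calls["recvd_dttm"] <= range_max)
    window = calls[in_range]
    # normalize() drops the time of day, leaving one value per calendar day
    day = window["recvd_dttm"].dt.normalize().rename("Day")
    # rows: companies, columns: days, values: number of calls
    return pd.crosstab(window["CompanyName"], day)

monthly = daily_pivot(df, pd.DateOffset(months=1))
weekly = daily_pivot(df, pd.DateOffset(weeks=1))
print(monthly)
print(weekly)
```

Stacking by full calendar date rather than by day-of-month avoids collisions when the window straddles two months. The resulting table has the same orientation as the transposed `pivot_table` in the question (companies as rows), with days instead of months as columns.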
Then I use the pivot table to create an OrderedDict for Plotting
###########################################################
#Create OrderedDict for plotting
###########################################################
months = [pivot_table[(m)].astype(float).values for m in range(1, 13)]
names = ["Jan", "Feb", "Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov", "Dec"]
months_dict = OrderedDict(list(zip(names, months)))
###########################################################
#Plot!
###########################################################
palette = brewer["RdYlGn"][8]
hover = HoverTool(
    tooltips = [
        ("Month", "#months"),
        ("Number of Calls", "#NumberCalls"),
    ]
)
output_file("stacked_bar.html")
bar = Bar(months_dict, Companies, title="Number of Calls Each Month", palette = palette, legend = "top_right", width = 1200, height=900, stacked=True)
bar.add_tools(hover)
show(bar)
Does anyone have ideas on how to approach modifying this code so it can work for shorter time spans? This is what the graph looks like for a year
EDIT Added the full code. Input looks like this example:
CompanyName recvd_dttm
Company1 6/5/2015 18:28:50 PM
Company2 6/5/2015 14:25:43 PM
Company3 9/10/2015 21:45:12 PM
Company4 6/5/2015 14:30:43 PM
Company5 6/5/2015 14:32:33 PM