line chart with months for x-labels but using weekly data - python

Below is a script that builds a simplified version of the df in question:
import pandas as pd
df = pd.DataFrame({
    'week': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],
    'month': ['JAN','JAN','JAN','JAN','FEB','FEB','FEB','FEB','MAR','MAR',
              'MAR','MAR','APR','APR','APR','APR','APR'],
    'weekly_stock': [4,2,5,6,2,3,6,8,7,9,5,3,5,4,5,8,9]
})
df
week month weekly_stock
0 1 JAN 4
1 2 JAN 2
2 3 JAN 5
3 4 JAN 6
4 5 FEB 2
5 6 FEB 3
6 7 FEB 6
7 8 FEB 8
8 9 MAR 7
9 10 MAR 9
10 11 MAR 5
11 12 MAR 3
12 13 APR 5
13 14 APR 4
14 15 APR 5
15 16 APR 8
16 17 APR 9
As it currently stands, the script below produces a line chart with the week as x-labels:
import matplotlib.pyplot as plt
# plot chart
labels = df['week']
line = df['weekly_stock']
x = range(len(labels))  # one tick position per row
fig, ax = plt.subplots(figsize=(20, 8))
line1 = ax.plot(x, line, label='2019')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=0)
ax.set_ylabel('Stock')
ax.set_xlabel('week')
ax.set_title('weekly stock')
However, I would like to have the month as the x-label.
INTENDED PLOT:
Any help would be greatly appreciated.

My recommendation is to use a column of valid datetime values instead of the separate 'month' and 'week' columns you have. Matplotlib handles valid datetime values well, so I'd structure the dates like this first:
import pandas as pd
import matplotlib.pyplot as plt
# valid datetime values in a range
dates = pd.date_range(
    start='2019-01-01',
    end='2019-04-30',
    freq='W',  # weekly increments (each date is a Sunday)
    name='dates',
    inclusive='left'  # pandas >= 1.4; older versions spelled this closed='left'
)
weekly_stocks = [4,2,5,6,2,3,6,8,7,9,5,3,5,4,5,8,9]
df = pd.DataFrame(
    {'weekly_stocks': weekly_stocks},
    index=dates  # set dates column as index
)
df.plot(
    figsize=(20,8),
    kind='line',
    title='Weekly Stocks',
    legend=False,
    xlabel='Week',
    ylabel='Stock'
)
plt.grid(which='both', linestyle='--', linewidth=0.5)
This is a fairly simple solution. Notice that the ticks fall on the weekly dates; Matplotlib did all the work for us!
There are two ways to go about this:
(easier) Lay the "data foundation" correctly before plotting, i.e., format the data so that Matplotlib does all the work, as we did above (think of the ticks as the actual date points created by pd.date_range()).
(harder) Use tick locators/formatters, as described in the Matplotlib docs.
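For the locator/formatter route, here is a minimal sketch of how month names could be placed on the x-axis while the data stays weekly. It assumes the same 2019 weekly dates as above and plots through Matplotlib directly rather than through df.plot(); the names stock, fig and ax are only illustrative.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Same weekly data as above, indexed by real dates (Sundays in 2019).
dates = pd.date_range(start='2019-01-01', end='2019-04-30', freq='W')
weekly_stocks = [4, 2, 5, 6, 2, 3, 6, 8, 7, 9, 5, 3, 5, 4, 5, 8, 9]
stock = pd.Series(weekly_stocks, index=dates)

fig, ax = plt.subplots(figsize=(20, 8))
# Plot through Matplotlib itself so that its date locators apply.
ax.plot(stock.index, stock.values, label='2019')

# One major tick at the start of each month, labelled with the
# abbreviated month name (locale-dependent, e.g. 'Jan', 'Feb').
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))

ax.set_xlabel('Month')
ax.set_ylabel('Stock')
ax.set_title('Weekly stock')
```

Call plt.show() afterwards to display the figure. The weekly points are still drawn at their own dates; only the tick labels change to months.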
Hope this was helpful.

Related

Change X-axis for timeseries plot in Python

So I have a dataframe like this:
YM       YQMD          YQM        Year  Quarter  Month  Day  srch_id
2012-11  2012-4-11-01  2012-4-11  2012  4        11     01   3033780585
2012-11  2012-4-11-02  2012-4-11  2012  4        11     02   2812558229
..
2013-06  2013-2-06-26  2013-2-06  2013  2        06     26   5000321400
2013-06  2013-2-06-27  2013-2-06  2013  2        06     27   3953504722
Now I want to plot a lineplot. I did it with this code:
#plot lineplot
sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(14,8)})
ax = sns.lineplot(data=df_monthly, x ='YQMD', y = 'click_bool')
plt.ylabel('Number of Search Queries')
plt.xlabel('Year-Month')
plt.show()
This is the plot that I got. As you can see, the dates on the x-axis are unreadable because there are too many of them.
The shape of the plot is what I want, but is it possible to get an x-axis like the one below?
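One possible direction, as a minimal sketch only: build a real datetime column and let a month locator decide where the ticks go. The frame below contains just the four rows shown in the question, and srch_id is used as the y value purely for illustration (the question plots click_bool from df_monthly, which is not shown). The column name date is an assumption of this sketch.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The four rows shown in the question; the real frame has many more.
df = pd.DataFrame({
    'YM': ['2012-11', '2012-11', '2013-06', '2013-06'],
    'Day': ['01', '02', '26', '27'],
    'srch_id': [3033780585, 2812558229, 5000321400, 3953504722],
})

# Build a real datetime column from the year-month and day strings.
df['date'] = pd.to_datetime(df['YM'] + '-' + df['Day'], format='%Y-%m-%d')

fig, ax = plt.subplots(figsize=(14, 8))
ax.plot(df['date'], df['srch_id'])

# One tick per month, labelled as year-month, instead of one per day.
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
fig.autofmt_xdate()  # rotate the labels so they do not overlap

ax.set_xlabel('Year-Month')
ax.set_ylabel('Number of Search Queries')
```

Passing x='date' and ax=ax to sns.lineplot should work the same way, since seaborn draws on a regular Matplotlib axis.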

How to calculate Cumulative Average Revenue? Python

I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it. Since the toy data is not the same as the real data, some changes will probably be needed, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year'] - df1['Year Onboarded']
# Find the average revenue per Year Onboarded
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
# Calculate the cumulative sum of Revenue (which is now the average per Year Onboarded) per Yearsdiff (because this will be our x-axis in the plot)
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
# Finally plot the data, using the column 'Year' as hue to account for the different years.
sns.lineplot(x=df1['Yearsdiff'], y=df1['Revenue'], hue=df1['Year'])
You can create a rolling mean like this:
# window of 10 rows with min_periods=1; transform keeps the result aligned with df's index
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].transform(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
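Since no group in the toy data has more than 10 rows, the rolling window above behaves like a cumulative mean here. A minimal sketch of an explicit expanding mean follows, under one possible reading of "cumulative average revenue": the running mean of all transactions of a cohort, ordered by years elapsed since onboarding. The names cum_avg_revenue and curve are illustrative only.

```python
import pandas as pd

dataset = {
    'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
    'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,
                       2019,2020,2016,2016,2016,2016,2016,2016,2016],
    'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,
             2019,2020,2016,2017,2018,2019,2020,2017,2018],
    'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8],
}
df = pd.DataFrame(dataset)
df['Yearsdiff'] = df['Year'] - df['Year Onboarded']

# Order rows so the running mean follows elapsed years within each cohort.
df = df.sort_values(['Year Onboarded', 'Yearsdiff'])

# Expanding (cumulative) mean of Revenue within each 'Year Onboarded' cohort.
df['cum_avg_revenue'] = (
    df.groupby('Year Onboarded')['Revenue']
      .transform(lambda s: s.expanding().mean())
)

# One point per cohort and elapsed year: the running mean after the
# last transaction recorded for that elapsed year.
curve = df.groupby(['Year Onboarded', 'Yearsdiff'])['cum_avg_revenue'].last()
print(curve)
```

Calling curve.reset_index() gives a frame that could be passed to sns.lineplot with x='Yearsdiff', y='cum_avg_revenue' and hue='Year Onboarded'.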

Multiline plot for each id

I would like to plot multiple lines in Python for this dataset: (x = year, y = freq)
Student_ID Year Freq
A 2012 6
B 2008 22
C 2009 18
A 2010 7
B 2012 13
D 2012 31
D 2013 1
where each student_id has data for different years. count is the result of a groupby.
I would like to have one line for each student_id.
I have tried with this:
df.groupby(['year'])['freq'].count().plot()
but it does not plot one line for each student_id.
Any suggestions are more than welcome. Thank you for your help
I wasn't sure from your question whether you wanted count (which in your example is all ones) or sum, so I solved this with sum. If you'd like count, just swap sum for count in the first line.
df_ = df.groupby(['Student_ID', 'Year'])['Freq'].sum()
print(df_)
> Student_ID Year
A 2010 7
2012 6
B 2008 22
2012 13
C 2009 18
D 2012 31
2013 1
Name: Freq, dtype: int64
fig, ax = plt.subplots()
for student in set(a[0] for a in df_.index):
    df_[student].plot(ax=ax, label=student)
plt.legend()
plt.show()
Which gives you one line per student on the same axes.
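An alternative, sketched under the assumption that the data sits in a frame with the columns Student_ID, Year and Freq as shown in the question: reshape to a wide table with one column per student and plot each column. The name wide is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Student_ID': ['A', 'B', 'C', 'A', 'B', 'D', 'D'],
    'Year': [2012, 2008, 2009, 2010, 2012, 2012, 2013],
    'Freq': [6, 22, 18, 7, 13, 31, 1],
})

# Rows become years, columns become students; missing combinations are NaN.
wide = df.pivot_table(index='Year', columns='Student_ID',
                      values='Freq', aggfunc='sum')

fig, ax = plt.subplots()
# Drop the NaN gaps per student so each line connects its own years;
# markers keep students with a single data point (C) visible.
for student in wide.columns:
    wide[student].dropna().plot(ax=ax, marker='o', label=student)

ax.set_xlabel('Year')
ax.set_ylabel('Freq')
ax.legend(title='Student_ID')
```

Call plt.show() to display the result.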

Make plot between multiple files for a particular Id in a specific period

I am trying to make a plot to compare 4 different files. Each file has an ID, a Date and a Value. While the IDs and dates remain the same, the value differs in each of the files. Now I want to plot the Value field for one ID, let's say "A", for some 7-day period in January. The result would be an overlaid plot of four different values from the four different files. How can I go about this in Python? I want to keep it as automated as possible, without several manual steps. I appreciate all your help!
Example data set below
Sample data set 1
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 18
A 02-01-18 12
B 01-01-18 11
B 01-02-18 19
C 01-01-18 15
Sample data set 2
ID Date Value
A 01-01-18 13
A 01-02-18 16
A 01-03-18 12
A 02-01-18 13
B 01-01-18 16
B 01-02-18 15
C 01-01-18 13
Sample data set 3
ID Date Value
A 01-01-18 12
A 01-02-18 12
A 01-03-18 13
A 02-01-18 14
B 01-01-18 15
B 01-02-18 12
C 01-01-18 13
Sample data set 4
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 14
A 02-01-18 12
B 01-01-18 11
B 01-02-18 14
C 01-01-18 13
From this sample data, let's say I am trying to plot the values for ID "A" between 01-01-18 and 01-03-18. I would then have a plot of 4 different lines, each representing the values from one of the data sets.
I have been able to do this in Excel, but it involved too many manual steps and the data is 800,000+ lines, so I don't feel very confident. I am sure there is a better way to do it in Python.
Suppose your data is stored in separate text files. Then you may do what you want with the following code:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
filenames = ['sample_1.txt', 'sample_2.txt', 'sample_3.txt', 'sample_4.txt']
data = list()
for filename in filenames:
    data.append(pd.read_table(filename, delimiter=' ', parse_dates=[1]))
fig = plt.figure()
for idx in range(len(filenames)):
    condition_1 = data[idx].loc[:, 'ID'] == 'A'
    condition_2 = (
        (data[idx].loc[:, 'Date'] >= '2018-01-01') &
        (data[idx].loc[:, 'Date'] <= '2018-01-03'))
    plt.plot(
        data[idx].loc[condition_1 & condition_2, 'Date'],
        data[idx].loc[condition_1 & condition_2, 'Value'], 'o--')
plt.title('Some figure')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend(filenames)
# X-axis formatting
days = mdates.DayLocator()
days_fmt = mdates.DateFormatter('%Y-%m-%d')
fig.gca().xaxis.set_major_locator(days)
fig.gca().xaxis.set_major_formatter(days_fmt)
Result:

Manipulate Python Data frame to plot line charts

I have the following data frame:
Parameters Year 2016 Year 2017 Year 2018....
0) X 10 12 13
1) Y 12 12 45
2) Z 89 23 97
3
.
.
.
I want to make a line chart with the column headers starting from Year 2016 to be on the x-axis and each line on the chart to represent each of the parameters - X, Y, Z
I am using the matplotlib library to make the plot but it is throwing errors.
Given this dataframe:
df = pd.DataFrame({'Parameters':['X','Y','Z'],
'Year 2016':[10,12,13],
'Year 2017':[12,12,45],
'Year 2018':[89,23,97]})
Input Dataframe:
Parameters Year 2016 Year 2017 Year 2018
0 X 10 12 89
1 Y 12 12 23
2 Z 13 45 97
You can use some dataframe shaping and pandas plot:
df_out = df.set_index('Parameters').T
df_out.set_axis(pd.to_datetime(df_out.index, format='Year %Y'), axis=0)\
    .plot()
Output Graph:
If you have a pandas DataFrame, let's call it df, whose columns are X, Y, Z, etc. and whose rows are the years in order, you can simply call df.plot() to plot each column as a line, with the values on the y-axis and the row labels on the x-axis.
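As a minimal sketch of that layout, assuming the example frame from the answer above: transpose so that years are rows, turn the 'Year 2016' style labels into integers, and call plot(). The name by_year is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Parameters': ['X', 'Y', 'Z'],
                   'Year 2016': [10, 12, 13],
                   'Year 2017': [12, 12, 45],
                   'Year 2018': [89, 23, 97]})

# Transpose so that each row is a year and each column a parameter.
by_year = df.set_index('Parameters').T

# Turn labels such as 'Year 2016' into the integer 2016.
by_year.index = by_year.index.str.replace('Year ', '', regex=False).astype(int)

# One line per column (X, Y, Z), with the years on the x-axis.
ax = by_year.plot(marker='o')
ax.set_xticks(list(by_year.index))  # tick only at the actual years
ax.set_xlabel('Year')
```

Call plt.show() to display the chart.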
