So I have a dataframe like this:
YM       YQMD          YQM        Year  Quarter  Month  Day  srch_id
2012-11  2012-4-11-01  2012-4-11  2012  4        11     01   3033780585
2012-11  2012-4-11-02  2012-4-11  2012  4        11     02   2812558229
..
2013-06  2013-2-06-26  2013-2-06  2013  2        06     26   5000321400
2013-06  2013-2-06-27  2013-2-06  2013  2        06     27   3953504722
Now I want to plot a lineplot. I did it with this code:
#plot lineplot
sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(14,8)})
ax = sns.lineplot(data=df_monthly, x ='YQMD', y = 'click_bool')
plt.ylabel('Number of Search Queries')
plt.xlabel('Year-Month')
plt.show()
This is the plot I got. As you can see, the dates on the x-axis are unreadable because there are too many of them.
The shape of the plot is exactly what I want, but is it possible to get an x-axis like the one below?
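One possible way to thin out the x-axis labels, sketched under assumptions: the x values are converted to real datetimes (built here from the Year, Month and Day columns) and matplotlib's date locators are asked for one tick per month. The frame `df_demo`, the `date` column and the use of `srch_id` as the y column are illustrative choices, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook or interactive session
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The four rows visible in the question; srch_id stands in for the y column (assumption).
df_demo = pd.DataFrame({
    "Year": ["2012", "2012", "2013", "2013"],
    "Month": ["11", "11", "06", "06"],
    "Day": ["01", "02", "26", "27"],
    "srch_id": [3033780585, 2812558229, 5000321400, 3953504722],
})

# Build a real datetime column so matplotlib can place ticks by calendar unit.
parts = df_demo[["Year", "Month", "Day"]].astype(int)
parts.columns = ["year", "month", "day"]
df_demo["date"] = pd.to_datetime(parts)

fig, ax = plt.subplots(figsize=(14, 8))
ax.plot(df_demo["date"], df_demo["srch_id"])

# One major tick per month, labelled as YYYY-MM.
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
ax.set_xlabel("Year-Month")
ax.set_ylabel("Number of Search Queries")
fig.autofmt_xdate()  # rotate the labels so they do not overlap
```

Since sns.lineplot returns a matplotlib Axes, the same two `ax.xaxis` calls should also work on the `ax` from the seaborn code, provided the x column holds datetimes rather than strings.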
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts out cumulative, but the last row is no longer cumulative (I am not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly. I tried other ways as well, but didn't get good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it. Since the toy data is not the same as the real data, some changes will probably be needed, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
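You can create a rolling mean like this (window of 10 with min_periods=1; transform keeps the result aligned with the original index):
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].transform(lambda x: x.rolling(10, 1).mean())
df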
#     ClientId  Year Onboarded  Year  Revenue  rolling_mean
# 0          1            2018  2019      100    100.000000
# 1          2            2019  2019       50     50.000000
# 2          3            2020  2020       25     25.000000
# 3          1            2018  2019       30     65.000000
# 4          2            2019  2019       40     45.000000
# 5          3            2020  2020       50     37.500000
# 6          1            2018  2018       60     63.333333
# 7          2            2019  2020      100     63.333333
# 8          3            2020  2020       20     31.666667
# 9          1            2018  2020       40     57.500000
# 10         2            2019  2019      100     72.500000
# 11         3            2020  2020       20     28.750000
# 12         4            2016  2016        5      5.000000
# 13         4            2016  2017        5      5.000000
# 14         4            2016  2018        8      6.000000
# 15         4            2016  2019        4      5.500000
# 16         4            2016  2020       10      6.400000
# 17         4            2016  2017       20      8.666667
# 18         4            2016  2018        8      8.571429
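For comparison, here is a minimal sketch of one reading of "cumulative average revenue": for each 'Year Onboarded' cohort, the running mean of all transactions made up to and including each elapsed year. That interpretation is an assumption, and the names `per_year`, `running` and `cum_avg` are only illustrative. It uses the toy data from the question.

```python
import pandas as pd

dataset = {
    'ClientId': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
    'Year Onboarded': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020,
                       2018, 2019, 2020, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    'Year': [2019, 2019, 2020, 2019, 2019, 2020, 2018, 2020, 2020, 2020,
             2019, 2020, 2016, 2017, 2018, 2019, 2020, 2017, 2018],
    'Revenue': [100, 50, 25, 30, 40, 50, 60, 100, 20, 40,
                100, 20, 5, 5, 8, 4, 10, 20, 8],
}
df = pd.DataFrame(data=dataset)
df['Yearsdiff'] = df['Year'] - df['Year Onboarded']

# Total revenue and number of transactions per cohort and elapsed year.
per_year = (
    df.groupby(['Year Onboarded', 'Yearsdiff'])['Revenue']
      .agg(['sum', 'count'])
      .sort_index()
)

# Running totals within each cohort, in order of elapsed years.
running = per_year.groupby(level='Year Onboarded').cumsum()

# Running mean = running revenue / running number of transactions.
cum_avg = (
    (running['sum'] / running['count'])
    .rename('cum_avg_revenue')
    .reset_index()
)
print(cum_avg)
```

The long-format result can then be passed to seaborn, for example sns.lineplot(data=cum_avg, x='Yearsdiff', y='cum_avg_revenue', hue='Year Onboarded'), which gives one line per onboarding year as the question asks.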
I would like to plot multiple lines in Python for this dataset: (x = year, y = freq)
Student_ID Year Freq
A 2012 6
B 2008 22
C 2009 18
A 2010 7
B 2012 13
D 2012 31
D 2013 1
where each Student_ID has data for different years. Freq is a count that came out of a groupby.
I would like to have one line for each student_id.
I have tried with this:
df.groupby(['year'])['freq'].count().plot()
but it does not plot one line for each student_id.
Any suggestions are more than welcome. Thank you for your help
I wasn't sure from your question whether you wanted a count (which in your example is all ones) or a sum, so I solved this with sum. If you'd like count, just swap sum out for count in the first line.
df_ = df.groupby(['Student_ID', 'Year'])['Freq'].sum()
print(df_)
> Student_ID Year
A 2010 7
2012 6
B 2008 22
2012 13
C 2009 18
D 2012 31
2013 1
Name: Freq, dtype: int64
fig, ax = plt.subplots()
# Plot one line per student (the first level of the MultiIndex)
for student in df_.index.get_level_values('Student_ID').unique():
    df_[student].plot(ax=ax, label=student)
plt.legend()
plt.show()
Which gives you:
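A variant of the same idea, as a rough sketch: iterate over the groups directly and draw each student's own years with a marker, so that a student with a single data point (C here) still shows up. It assumes the Freq values are already the numbers to plot, so no further aggregation is done.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook or interactive session
import matplotlib.pyplot as plt
import pandas as pd

# The data set from the question.
df = pd.DataFrame({
    "Student_ID": ["A", "B", "C", "A", "B", "D", "D"],
    "Year": [2012, 2008, 2009, 2010, 2012, 2012, 2013],
    "Freq": [6, 22, 18, 7, 13, 31, 1],
})

fig, ax = plt.subplots()
# Sort by year first so each line is drawn left to right.
for student, grp in df.sort_values("Year").groupby("Student_ID"):
    ax.plot(grp["Year"], grp["Freq"], marker="o", label=student)
ax.set_xlabel("Year")
ax.set_ylabel("Freq")
ax.legend(title="Student_ID")
```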
I am trying to make a plot that compares 4 different files. Each file has ID, Date and Value columns. The IDs and dates are the same in every file, but the values differ. I want to plot the Value field for one ID, say "A", over a 7-day period in January. The result would be an overlaid plot of four different lines, one per file. How can I go about this in Python? I want to keep it as automated as possible, without several manual steps. I appreciate all your help!
Example data set below
Sample data set 1
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 18
A 02-01-18 12
B 01-01-18 11
B 01-02-18 19
C 01-01-18 15
Sample data set 2
ID Date Value
A 01-01-18 13
A 01-02-18 16
A 01-03-18 12
A 02-01-18 13
B 01-01-18 16
B 01-02-18 15
C 01-01-18 13
Sample data set 3
ID Date Value
A 01-01-18 12
A 01-02-18 12
A 01-03-18 13
A 02-01-18 14
B 01-01-18 15
B 01-02-18 12
C 01-01-18 13
Sample data set 4
ID Date Value
A 01-01-18 12
A 01-02-18 15
A 01-03-18 14
A 02-01-18 12
B 01-01-18 11
B 01-02-18 14
C 01-01-18 13
From this sample data, let's say I am trying to plot the values for ID "A" between 01-01-18 and 01-03-18. I would then have a plot with 4 different lines, one for each data set.
I have been able to do this in Excel, but it involved too many manual steps and the data has more than 800,000 lines, so I don't feel very confident in the result. I am sure there is a better way to do it in Python.
Suppose your data is stored in separate text files. Then you may do what you want with the following code:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

filenames = ['sample_1.txt', 'sample_2.txt', 'sample_3.txt', 'sample_4.txt']

# Read each space-delimited file, parsing the second column (Date) as dates
data = list()
for filename in filenames:
    data.append(pd.read_table(filename, delimiter=' ', parse_dates=[1]))

fig = plt.figure()
for idx in range(len(filenames)):
    # Keep only ID "A" within the chosen date range
    condition_1 = data[idx].loc[:, 'ID'] == 'A'
    condition_2 = (
        (data[idx].loc[:, 'Date'] >= '2018-01-01') &
        (data[idx].loc[:, 'Date'] <= '2018-01-03'))
    plt.plot(
        data[idx].loc[condition_1 & condition_2, 'Date'],
        data[idx].loc[condition_1 & condition_2, 'Value'], 'o--')
plt.title('Some figure')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend(filenames)

# X-axis formatting: one tick per day, labelled as YYYY-MM-DD
days = mdates.DayLocator()
days_fmt = mdates.DateFormatter('%Y-%m-%d')
fig.gca().xaxis.set_major_locator(days)
fig.gca().xaxis.set_major_formatter(days_fmt)
plt.show()
Result:
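An alternative layout, sketched under assumptions: stack all files into one long frame with a column naming the source, filter once, and pivot so each file becomes a column. The in-memory strings stand in for the real files (only a few rows from the samples are included), the `source` column is a name made up for this sketch, and the dates are assumed to be month-day-year.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook or interactive session
import pandas as pd

# Stand-ins for the four files; with real files, pass each path to pd.read_csv instead.
samples = {
    "sample_1": "ID Date Value\nA 01-01-18 12\nA 01-02-18 15\nA 01-03-18 18\nA 02-01-18 12\nB 01-01-18 11\n",
    "sample_2": "ID Date Value\nA 01-01-18 13\nA 01-02-18 16\nA 01-03-18 12\nA 02-01-18 13\nB 01-01-18 16\n",
    "sample_3": "ID Date Value\nA 01-01-18 12\nA 01-02-18 12\nA 01-03-18 13\nA 02-01-18 14\nB 01-01-18 15\n",
    "sample_4": "ID Date Value\nA 01-01-18 12\nA 01-02-18 15\nA 01-03-18 14\nA 02-01-18 12\nB 01-01-18 11\n",
}

frames = []
for name, text in samples.items():
    frame = pd.read_csv(io.StringIO(text), sep=" ")
    # Assumed date format: month-day-two-digit-year.
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m-%d-%y")
    frame["source"] = name  # remember which file each row came from
    frames.append(frame)
combined = pd.concat(frames, ignore_index=True)

# Keep ID "A" within the chosen window, then put one column per file.
in_window = combined["Date"].between(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-01-03"))
selected = combined[(combined["ID"] == "A") & in_window]
wide = selected.pivot(index="Date", columns="source", values="Value")

ax = wide.plot(marker="o", linestyle="--")
ax.set_ylabel("Value")
```

With this layout, changing the ID or the date window only touches the filter, and adding a fifth file only adds one more entry to the list of inputs.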
I have the following data frame:
   Parameters  Year 2016  Year 2017  Year 2018  ....
0  X           10         12         13
1  Y           12         12         45
2  Z           89         23         97
3  ...
I want to make a line chart with the column headers starting from Year 2016 to be on the x-axis and each line on the chart to represent each of the parameters - X, Y, Z
I am using the matplotlib library to make the plot but it is throwing errors.
Given this dataframe:
df = pd.DataFrame({'Parameters':['X','Y','Z'],
'Year 2016':[10,12,13],
'Year 2017':[12,12,45],
'Year 2018':[89,23,97]})
Input Dataframe:
Parameters Year 2016 Year 2017 Year 2018
0 X 10 12 89
1 Y 12 12 23
2 Z 13 45 97
You can use some dataframe shaping and pandas plot:
df_out = df.set_index('Parameters').T
# set_axis returns a new frame; the inplace argument was removed in pandas 2.0
df_out.set_axis(pd.to_datetime(df_out.index, format='Year %Y'), axis=0)\
.plot()
Output Graph:
If you have a pandas DataFrame, let's call it df, for which your columns are X, Y, Z, etc. and your rows are the years in order, you can simply call df.plot() to plot each column as a line with the y axis being the values and the row name giving the x-axis.
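As a small illustration of that layout, assuming the same three-parameter frame used in the answer above: moving 'Parameters' into the index and transposing puts the years on the rows, after which a plain plot() call draws one line per parameter. The name `by_year` is only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook or interactive session
import pandas as pd

df = pd.DataFrame({'Parameters': ['X', 'Y', 'Z'],
                   'Year 2016': [10, 12, 13],
                   'Year 2017': [12, 12, 45],
                   'Year 2018': [89, 23, 97]})

# Rows become the years, columns become the parameters X, Y, Z.
by_year = df.set_index('Parameters').T

# Each column is drawn as its own line; the row labels are used for the x-axis.
ax = by_year.plot(marker='o')
ax.set_xlabel('Year')
```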