I have this set of dataframe:Dataframe
I can obtain the values that is 15% greater than the mean by:
df[df['Interest']>(df["Interest"].mean()*1.15)].Interest.to_string()
I obtained all values that are 15% greater than interest in their respective categories
The question is how do I get the year where these values occurred without starting with:
df=df.set_index('Year")
at the start as the function above requires my year values with df.iloc
How do I get the year where these values occurred without starting with df.set_index('Year")
Use .loc:
>>> df
Year Dividends Interest Other Types Rent Royalties Trade Income
0 2007 7632014 4643033 206207 626668 89715 18654926
1 2008 6718487 4220161 379049 735494 58535 29677697
2 2009 1226858 5682198 482776 1015181 138083 22712088
3 2010 978925 2229315 565625 1260765 146791 15219378
4 2011 1500621 2452712 675770 1325025 244073 19697549
5 2012 308064 2346778 591180 1483543 378998 33030888
6 2013 275019 4274425 707344 1664747 296136 17503798
7 2014 226634 3124281 891466 1807172 443671 16023363
8 2015 2171559 3474825 1144862 1858838 585733 16778858
9 2016 767713 4646350 2616322 1942102 458543 13970498
10 2017 759016 4918320 1659303 2001220 796343 9730659
11 2018 687308 6057191 1524474 2127583 1224471 19570540
>>> df.loc[df['Interest']>(df["Interest"].mean()*1.15), ['Year', 'Interest']]
Year Interest
0 2007 4643033
2 2009 5682198
9 2016 4646350
10 2017 4918320
11 2018 6057191
This will return a DataFrame with Year and the Interest values that match your condition
df[df['Interest']>(df["Interest"].mean()*1.15)][['Year', 'Interest']]
This will return the Year :-
df.loc[df["Interest"]>df["Interest"].mean()*1.15]["Year"]
I have a data frame containing the daily CO2 data since 2015, and I would like to produce the monthly mean data for each year, then put this into a new data frame. A sample of the data frame I'm using is shown below.
month day cycle trend
year
2011 1 1 391.25 389.76
2011 1 2 391.29 389.77
2011 1 3 391.32 389.77
2011 1 4 391.36 389.78
2011 1 5 391.39 389.79
... ... ... ... ...
2021 3 13 416.15 414.37
2021 3 14 416.17 414.38
2021 3 15 416.18 414.39
2021 3 16 416.19 414.39
2021 3 17 416.21 414.40
I plan on using something like the code below to create the new monthly mean data frame, but the main problem I'm having is indicating the specific subset for each month of each year so that the mean can then be taken for this. If I could highlight all of the year "2015" for the month "1" and then average this etc. that might work?
Any suggestions would be hugely appreciated and if I need to make any edits please let me know, thanks so much!
dfs = list()
for l in L:
dfs.append(refined_data[index = 2015, "month" = 1. day <=31].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({ 'Revenue': 'mean'})
.reset_index(level=0, drop=True)
.add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar as the one below but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it and considering the toy data is not the same, probably some changes should be done, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
You can create rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:
for i,x in enumerate(ax):`
print("the value of a is", a)
print("the number of rows with value of a", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have a given value of activity_month equal to that,
a="January"
len(df[df["activity_month"]=a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
suppose now I group the above on id by python command.
grouped = file.groupby(file.id)
I would like to get a new file with only the row in each group with recent year that is highest of all the year in the group.
Please let me know the command, I am trying with apply but it ll only given the boolean expression. I want the entire row with latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform(max) == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014