How to assign the groupby results to a series in pandas - python

I have a df which looks like this:
Date Value
2020 0
2020 100
2020 200
2020 300
2021 100
2021 150
2021 0
I want to get the average of the grouped Value by Date where Value > 0. When I tried:
df['Yearly AVG'] = df[df['Value']>0].groupby('Date')['Value'].mean()
I get NaN values. When I print the line above on its own, I get what I need, but indexed by the Date column.
Date
2020 200
2021 125
How can I get the following:
Date Value Yearly AVG
2020 0 200
2020 100 200
2020 200 200
2020 300 200
2021 100 125
2021 150 125
2021 0 125

Here is a trick: replace the non-matching values with missing values, then use GroupBy.transform to fill a new column with the aggregated values:
df['Yearly AVG'] = df['Value'].where(df['Value']>0).groupby(df['Date']).transform('mean')
print (df)
Date Value Yearly AVG
0 2020 0 200.0
1 2020 100 200.0
2 2020 200 200.0
3 2020 300 200.0
4 2021 100 125.0
5 2021 150 125.0
6 2021 0 125.0
Detail:
print (df['Value'].where(df['Value']>0))
0 NaN
1 100.0
2 200.0
3 300.0
4 100.0
5 150.0
6 NaN
Name: Value, dtype: float64
Your solution fails because the aggregated result is indexed by Date while df has a default RangeIndex, so the assignment aligns on index and produces NaN. Map the aggregate back through the Date column instead:
df['Yearly AVG'] = df['Date'].map(df[df['Value']>0].groupby('Date')['Value'].mean())
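For completeness, here is a minimal self-contained sketch of both approaches, rebuilt from the sample data above so it can be pasted and run (the 'Yearly AVG 2' name is just for the side-by-side comparison):
import pandas as pd

df = pd.DataFrame({'Date': [2020, 2020, 2020, 2020, 2021, 2021, 2021],
                   'Value': [0, 100, 200, 300, 100, 150, 0]})

# transform broadcasts each group's mean back onto every row
df['Yearly AVG'] = df['Value'].where(df['Value'] > 0).groupby(df['Date']).transform('mean')

# map looks each Date up in the aggregated Series instead
df['Yearly AVG 2'] = df['Date'].map(df[df['Value'] > 0].groupby('Date')['Value'].mean())
Both columns come out identical; transform simply avoids building the intermediate Series by hand.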

Related

How do I create a new column that references other row's data for its values?

I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468     1
1         1    2  2003   46.538    46.66    46.47    46.673     1
2         1    3  2003   46.717    46.781   46.53    46.750     1
3         1    4  2003   46.815    46.843   46.68    46.750     1
4         1    5  2003   46.935    47.000   46.56    46.593     1
...     ...  ...   ...      ...       ...      ...      ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019    43
7258     10   27  2022  383.07   385.00    379.329  379.98     43
7259     10   28  2022  379.869  389.519   379.67   389.019    43
7260     10   31  2022  386.44   388.399   385.26   386.209    44
7261     11    1  2022  390.14   390.39    383.29   384.519    44
I want to create a new column titled 'week high' which will reference each week every year and pull in the high. So for Week 1, Year 2003, it will take the Highest High from rows 0 to 4 but for Week 43, Year 2022, it will take the Highest High from rows 7257 to 7259.
Is it possible to reference the columns Week and Year to calculate that value? Thanks!
Assuming pandas, create a weekly period and use it as the grouper for transform('max'):
group = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('W')
# or, if you already have a "Week" column, pair it with Year
# so week numbers don't collide across years:
# group = ['Year', 'Week']
df['week_high'] = df.groupby(group)['High'].transform('max')
Output:
Month Day Year Open High Low Close Week week_high
0 1 1 2003 46.593 46.6560 46.405 46.468 1.0 47.000
1 1 2 2003 46.538 46.6600 46.470 46.673 1.0 47.000
2 1 3 2003 46.717 46.7810 46.530 46.750 1.0 47.000
3 1 4 2003 46.815 46.8430 46.680 46.750 1.0 47.000
4 1 5 2003 46.935 47.0000 46.560 46.593 1.0 47.000
7257 10 26 2022 381.619 387.5799 381.350 382.019 43.0 389.519
7258 10 27 2022 383.070 385.0000 379.329 379.980 43.0 389.519
7259 10 28 2022 379.869 389.5190 379.670 389.019 43.0 389.519
7260 10 31 2022 386.440 388.3990 385.260 386.209 44.0 390.390
7261 11 1 2022 390.140 390.3900 383.290 384.519 44.0 390.390
I am assuming you are using pandas; other libraries will work similarly.
Create a new DataFrame aggregated per week using groupby and join it back to your original DataFrame:
df_grouped = df[["Week", "High"]].groupby("Week").max().rename(columns={"High": "Highest High"})
df_result = df.join(df_grouped, on="Week")
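One caveat on grouping by "Week" alone: week 1 of 2003 and week 1 of 2022 would be pooled together. A sketch of the join variant keyed on Year and Week together (my adaptation, untested on the full data):
# aggregate per (Year, Week) so week numbers don't collide across years
df_grouped = (df[['Year', 'Week', 'High']]
              .groupby(['Year', 'Week']).max()
              .rename(columns={'High': 'Highest High'}))
df_result = df.join(df_grouped, on=['Year', 'Week'])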

Year over Year difference and selecting maximum row in pandas

I have a dataframe given as below:
ID YEAR NPS
500 2020 0
500 2021 0
500 2022 0
501 2020 32
501 2021 52
501 2022 99
503 2021 1
503 2022 4
504 2020 45
504 2021 55
504 2022 50
I have to calculate year over year difference as given below:
ID YEAR NPS nps_gain_yoy
500 2020 0 0
500 2021 0 0
500 2022 0 0
501 2020 32 0
501 2021 52 20
501 2022 99 47
503 2021 1 0
503 2022 4 3
504 2020 45 0
504 2021 55 10
504 2022 50 -5
In the above output, for the starting year 2020 (or the first occurrence of an ID), nps_gain_yoy needs to be zero; then for 2021, nps_gain_yoy is the difference between the NPS of 2021 and 2020, i.e. 52 - 32 = 20, as shown for ID 501 in year 2021, and so on.
After this I need to pick the maximum difference (maximum nps_gain_yoy) for each ID, as given in the output below:
ID YEAR NPS NPS_gain_yoy
500 2022 0 0
501 2022 99 47
503 2022 4 3
504 2021 55 10
Here 47 is the maximum NPS gain for ID 501 (in year 2022); similarly 3 for ID 503 and 10 for ID 504.
If the years are consecutive within each ID, first use DataFrameGroupBy.diff:
df = df.sort_values(['ID','YEAR'])
df['nps_gain_yoy'] = df.groupby('ID')['NPS'].diff().fillna(0)
print (df)
ID YEAR NPS nps_gain_yoy
0 500 2020 0 0.0
1 500 2021 0 0.0
2 500 2022 0 0.0
3 501 2020 32 0.0
4 501 2021 52 20.0
5 501 2022 99 47.0
6 503 2021 1 0.0
7 503 2022 4 3.0
8 504 2020 45 0.0
9 504 2021 55 10.0
10 504 2022 50 -5.0
Then use DataFrameGroupBy.idxmax with DataFrame.loc (the iloc[::-1] reversal makes idxmax pick the last row when there are ties):
df1 = df.loc[df.iloc[::-1].groupby('ID')['nps_gain_yoy'].idxmax()]
#alternative solution
#df1 = df.sort_values(['ID','nps_gain_yoy']).drop_duplicates('ID', keep='last')
print (df1)
ID YEAR NPS nps_gain_yoy
2 500 2022 0 0.0
5 501 2022 99 47.0
7 503 2022 4 3.0
9 504 2021 55 10.0
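If the years are not guaranteed to be consecutive, one possible sketch (my addition, not part of the original answer) is to keep the diff only when the previous row really is the prior year and use 0 otherwise:
df = df.sort_values(['ID', 'YEAR'])
prev_year = df.groupby('ID')['YEAR'].shift()

# zero out the diff whenever the previous row is not exactly YEAR - 1
df['nps_gain_yoy'] = (df.groupby('ID')['NPS'].diff()
                        .where(prev_year.eq(df['YEAR'] - 1), 0))
The first row of each ID has no previous year, so the condition is False and it gets 0, matching the expected output.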

Python Pandas: when I try to add a column to an existing dataframe, my new column is not correct

I am trying to finish a religion adherence data visualization project, but I am stuck on this problem. Please help, thank you!
x = range(1945, 2011, 5)
for i in x:
    df_new = df_new.append(pd.DataFrame({'year': [i]}))
years
0 1945
0 1950
0 1955
0 1960
0 1965
0 1970
0 1975
0 1980
0 1985
0 1990
0 1995
0 2000
0 2005
0 2010
This is my dataframe for now, and I want to add a column which looks like this:
0 1.307603e+08
1 2.941211e+08
2 3.440720e+08
3 4.351231e+08
4 5.146341e+08
5 5.923423e+08
6 6.636743e+08
7 6.471395e+08
8 7.457716e+08
9 9.986003e+08
10 1.153186e+09
11 1.314048e+09
12 1.426454e+09
13 1.555483e+09
When I add them up like this:
a=df.groupby(['year'],as_index=False)['islam'].sum()
b=a['islam']
df_new.insert(1,'islam',b)
the dataframe looks like this, which is not correct. Please help, thank you!
year islam
0 1945 130760281.0
0 1950 130760281.0
0 1955 130760281.0
0 1960 130760281.0
0 1965 130760281.0
0 1970 130760281.0
0 1975 130760281.0
0 1980 130760281.0
0 1985 130760281.0
0 1990 130760281.0
0 1995 130760281.0
0 2000 130760281.0
0 2005 130760281.0
0 2010 130760281.0
df:
year name christianity judaism islam budism nonrelig
0 1945 USA 110265118 4641182.0 0.0 1601218 22874544
1 1950 USA 122994019 6090837.0 0.0 0 22568130
2 1955 USA 134001770 5333332.0 0.0 90173 23303540
3 1960 USA 150234347 5500000.0 0.0 2012131 21548225
4 1965 USA 167515758 5600000.0 0.0 1080892 19852362
... ... ... ... ... ... ... ...
1990 1990 WSM 159500 0.0 37.0 15 1200
1991 1995 WSM 161677 0.0 43.0 16 1084
1992 2000 WSM 174600 0.0 50.0 18 1500
1993 2005 WSM 177510 0.0 58.0 18 1525
1994 2010 WSM 180140 0.0 61.0 19 2750
Try skipping the first step where you create a dataframe of years. (The repeated value you saw is an index-alignment artifact: each append created a fresh frame indexed 0, so every row of df_new has index 0, and insert matched them all to b's entry at index 0.) If you group the dataframe by year and leave off the as_index argument, it will give you what you're looking for.
summed_df = df.groupby(['year'])['islam'].sum()
That will give you a Series with the year as the index. Now you just have to reset the index, and you'll have a two-column dataframe with years and the sum values.
summed_df = summed_df.reset_index()
(Note: the default for reset_index() is drop=False. The drop parameter specifies whether you discard the index values (True) or insert them as a column into the dataframe (False). You want False here to preserve those year values.)
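Putting both steps together, the whole fix is one line (a sketch against the df shown above):
# sum adherence across countries per year; reset_index turns the
# year index back into a regular column
df_new = df.groupby('year')['islam'].sum().reset_index()
This replaces both the year-building loop and the insert call, so no index mismatch can occur.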

Pandas - Determine if Churn occurs with missing years

I have a large pandas dataframe which contains ids, years, spend values, and a slew of other columns, as shown below:
id year spend .... n_columns
1 2015 321 ... ...
1 2016 342 ... ...
1 2017 843
1 2018 483
2 2015 234
2 2018 321
2 2019 232 ... ...
I am trying to create a new column which classifies each year based on the next year's value. Something akin to:
id year spend cat
1 2015 321 increase
1 2016 342 increase
1 2017 843 decrease
1 2018 483 churned #as there is no 2019 data
2 2015 234 churned #as there is no 2016 data
2 2018 321 decrease
2 2019 232 decrease
2 2020 200 nan #max data only goes up to 2020
I have been trying to do this with something like the below, to get the difference between years to determine the category:
def categorize(x):
    if abs(x['diff']) == x['spend']:
        return "churned"
    elif x['diff'] < 0:
        return "decrease"
    elif x['diff'] > 0:
        return "increase"
    else:
        return None

df = df.sort_values(['id', 'year'], ascending=True)
df['diff'] = df.groupby('id')['spend'].diff(-1)
df['cat'] = df.apply(categorize, axis=1)
However, this method and all similar methods seem to fail because some years are missing for some ids (such as id = 2 missing 2016 and 2017 above). Is there an easy way to ensure all ids contain all of the years, even if the values are zeroed or nulled out? And is there a better way to determine whether a year is an increase/decrease/churn than how I am doing it?
Thanks!
Here is one way to solve it:
Expand the dataframe to include the missing rows of years; I'll use the complete function from pyjanitor for this - it makes the implicitly missing rows explicit:
# pip install pyjanitor
import janitor
import numpy as np

tempo = (df.complete(columns=["id",
                              {"year": lambda df: np.arange(df.year.min(),
                                                            df.year.max() + 1)}])
           .assign(temp=lambda df: df.spend.ffill(),
                   temp_diff=lambda df: df.temp.diff(-1)))
tempo
id year spend temp temp_diff
0 1 2015 321.0 321.0 -21.0
1 1 2016 342.0 342.0 -501.0
2 1 2017 843.0 843.0 360.0
3 1 2018 483.0 483.0 0.0
4 1 2019 NaN 483.0 249.0
5 2 2015 234.0 234.0 0.0
6 2 2016 NaN 234.0 0.0
7 2 2017 NaN 234.0 -87.0
8 2 2018 321.0 321.0 89.0
9 2 2019 232.0 232.0 NaN
Next step is to create conditions, and combine with np.select:
cond2 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.ge(0))
cond1 = (tempo.spend.shift(-1).notna()) & (tempo.temp_diff.lt(0))
cond3 = (tempo.spend.shift(-1).isna()) & (tempo.temp_diff.eq(0))
tempo["cat"] = np.select([cond1, cond2, cond3],
["increase", "decrease", "churn"],
np.nan)
id year spend temp temp_diff cat
0 1 2015 321.0 321.0 -21.0 increase
1 1 2016 342.0 342.0 -501.0 increase
2 1 2017 843.0 843.0 360.0 decrease
3 1 2018 483.0 483.0 0.0 churn
4 1 2019 NaN 483.0 249.0 decrease
5 2 2015 234.0 234.0 0.0 churn
6 2 2016 NaN 234.0 0.0 churn
7 2 2017 NaN 234.0 -87.0 increase
8 2 2018 321.0 321.0 89.0 decrease
9 2 2019 232.0 232.0 NaN nan
Filter out the null rows in spend column:
tempo.query("spend.notna()").drop(columns = ['temp_diff', 'temp'])
id year spend cat
0 1 2015 321.0 increase
1 1 2016 342.0 increase
2 1 2017 843.0 decrease
3 1 2018 483.0 churn
5 2 2015 234.0 churn
8 2 2018 321.0 decrease
9 2 2019 232.0 nan
I used your original dataframe (which stopped at 2019); let me know how it goes.
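If you would rather avoid the pyjanitor dependency, here is a rough pandas-only sketch of the same completion step (my adaptation, under the same assumptions about the data):
import numpy as np
import pandas as pd

# build the full (id, year) grid so the missing years appear as NaN rows
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(),
     np.arange(df['year'].min(), df['year'].max() + 1)],
    names=['id', 'year'])

tempo = (df.set_index(['id', 'year'])
           .reindex(full_index)
           .reset_index()
           .assign(temp=lambda d: d.spend.ffill(),
                   temp_diff=lambda d: d.temp.diff(-1)))
From here the np.select step is identical.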

Calculate difference from previous year/forecast in pandas dataframe

I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
...into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 30
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT: I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missed (as expected) is the comparison between 2014 forecast & 2013 actual. I could just duplicate the 2013 run in the dataset, calculate the chg_from_prev_year for 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly, to get the change from the previous year, do a shift on each of the groups:
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
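(Side note, not in the original answer: current pandas spells this shift-and-subtract directly as df['chg_from_prev_year'] = g['revenue'].diff().)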
The next part is more complicated; I think you need a pivot_table here:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run actual forecast
ids year
1 2013 10 NaN
2014 10 30
2015 130 120
2 2013 20 NaN
2014 40 50
2015 100 210
3 2013 20 NaN
2014 50 90
2015 190 150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids ids year
1 1 2013 NaN
2014 -20
2015 10
2 2 2013 NaN
2014 -10
2015 -110
3 3 2013 NaN
2014 -40
2015 40
dtype: float64
This is the result you want, but not in the correct format (see [31] below if you're not too fussed)... the following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
Note: I think the first 6 results of chg_from_prev_year should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
revenue chg_from_prev_year chg_from_forecast
run actual forecast actual forecast
ids year
1 2013 10 NaN NaN NaN NaN
2014 10 30 0 NaN -20
2015 130 120 120 90 10
2 2013 20 NaN NaN NaN NaN
2014 40 50 20 NaN -10
2015 100 210 60 160 -110
3 2013 20 NaN NaN NaN NaN
2014 50 90 30 NaN -40
2015 190 150 140 60 40
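As a closing footnote (my addition, assuming at most one actual and one forecast row per (ids, year) pair): index alignment gives the forecast gap without any pivoting:
# set (ids, year) as the index on each run; subtraction then aligns pairwise
actual = df[df['run'] == 'actual'].set_index(['ids', 'year'])['revenue']
forecast = df[df['run'] == 'forecast'].set_index(['ids', 'year'])['revenue']
chg_from_forecast = actual - forecast   # NaN where a year has no forecast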
