This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 7 months ago.
I have a large dataset with columns col, row, year, potveg, and total. I am trying to get the row with the max value of the 'total' column for each year within each group, i.e., for the dataset below:
col row year potveg total
-125.0 42.5 2015 9 697.3
2015 13 535.2
2015 15 82.3
2016 9 907.8
2016 13 137.6
2016 15 268.4
2017 9 961.9
2017 13 74.2
2017 15 248.0
2018 9 937.9
2018 13 575.6
2018 15 215.5
-135.0 70.5 2015 8 697.3
2015 10 535.2
2015 19 82.3
2016 8 907.8
2016 10 137.6
2016 19 268.4
2017 8 961.9
2017 10 74.2
2017 19 248.0
2018 8 937.9
2018 10 575.6
2018 19 215.5
I would like the output to look like this:
col row year potveg total
-125.0 42.5 2015 9 697.3
2016 9 907.8
2017 9 961.9
2018 9 937.9
-135.0 70.5 2015 8 697.3
2016 8 907.8
2017 8 961.9
2018 8 937.9
I tried this:
df.groupby(['col', 'row', 'year', 'potveg']).agg({'total': 'max'})
and this:
df.groupby(['col', 'row', 'year', 'potveg'])['total'].max()
but they do not seem to work because the output has too many rows.
I think the issue is the 'potveg' column, which is a subgroup. I am not sure how to select the rows containing the max value of 'total'.
One possible solution, using .idxmax() inside groupby.apply:
print(
    df.groupby(["col", "row", "year"], as_index=False, sort=False).apply(
        lambda x: x.loc[x["total"].idxmax()]
    )
)
Prints:
col row year potveg total
0 -125.0 42.5 2015.0 9.0 697.3
1 -125.0 42.5 2016.0 9.0 907.8
2 -125.0 42.5 2017.0 9.0 961.9
3 -125.0 42.5 2018.0 9.0 937.9
4 -135.0 70.5 2015.0 8.0 697.3
5 -135.0 70.5 2016.0 8.0 907.8
6 -135.0 70.5 2017.0 8.0 961.9
7 -135.0 70.5 2018.0 8.0 937.9
DataFrame used:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
1 -125.0 42.5 2015 13 535.2
2 -125.0 42.5 2015 15 82.3
3 -125.0 42.5 2016 9 907.8
4 -125.0 42.5 2016 13 137.6
5 -125.0 42.5 2016 15 268.4
6 -125.0 42.5 2017 9 961.9
7 -125.0 42.5 2017 13 74.2
8 -125.0 42.5 2017 15 248.0
9 -125.0 42.5 2018 9 937.9
10 -125.0 42.5 2018 13 575.6
11 -125.0 42.5 2018 15 215.5
12 -135.0 70.5 2015 8 697.3
13 -135.0 70.5 2015 10 535.2
14 -135.0 70.5 2015 19 82.3
15 -135.0 70.5 2016 8 907.8
16 -135.0 70.5 2016 10 137.6
17 -135.0 70.5 2016 19 268.4
18 -135.0 70.5 2017 8 961.9
19 -135.0 70.5 2017 10 74.2
20 -135.0 70.5 2017 19 248.0
21 -135.0 70.5 2018 8 937.9
22 -135.0 70.5 2018 10 575.6
23 -135.0 70.5 2018 19 215.5
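A variant of the idxmax approach that avoids groupby.apply entirely: take the per-group idxmax (the index label of each group's maximum row) and feed it to df.loc. A minimal sketch on a reconstructed two-year slice of the question's data:

```python
import pandas as pd

# Hypothetical reconstruction of the first two years of the question's data.
df = pd.DataFrame({
    "col": [-125.0] * 6,
    "row": [42.5] * 6,
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "potveg": [9, 13, 15, 9, 13, 15],
    "total": [697.3, 535.2, 82.3, 907.8, 137.6, 268.4],
})

# idxmax per group returns the index label of each group's maximum row;
# df.loc then selects exactly those rows from the original frame.
out = df.loc[df.groupby(["col", "row", "year"])["total"].idxmax()]
print(out)
```

This keeps the original row labels and dtypes intact and avoids calling a Python lambda once per group.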
Option 1: One way is to do the groupby() and then merge the result back with the original df (note that ties on 'total' within a group will produce multiple rows):
df1 = pd.merge(df.groupby(['col','row','year']).agg({'total':'max'}).reset_index(),
               df,
               on=['col', 'row', 'year', 'total'])
print(df1)
Output:
col row year total potveg
0 -125.0 42.5 2015 697.3 9
1 -125.0 42.5 2016 907.8 9
2 -125.0 42.5 2017 961.9 9
3 -125.0 42.5 2018 937.9 9
4 -135.0 70.5 2015 697.3 8
5 -135.0 70.5 2016 907.8 8
6 -135.0 70.5 2017 961.9 8
7 -135.0 70.5 2018 937.9 8
Option 2: Or use sort_values() and drop_duplicates(), sorting 'total' in descending order within each group so that keep='first' retains the group maximum:
df1 = df.sort_values(['col','row','year','total'], ascending=[True, True, True, False]).drop_duplicates(['col','row','year'], keep='first')
print(df1)
Output:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
3 -125.0 42.5 2016 9 907.8
6 -125.0 42.5 2017 9 961.9
9 -125.0 42.5 2018 9 937.9
12 -135.0 70.5 2015 8 697.3
15 -135.0 70.5 2016 8 907.8
18 -135.0 70.5 2017 8 961.9
21 -135.0 70.5 2018 8 937.9
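A third pattern worth knowing: transform('max') broadcasts each group's maximum back to every row, giving a row-aligned boolean mask. A sketch on a reconstructed slice of the data:

```python
import pandas as pd

# Small hypothetical frame mirroring the question's structure.
df = pd.DataFrame({
    "col": [-125.0] * 4,
    "row": [42.5] * 4,
    "year": [2015, 2015, 2016, 2016],
    "potveg": [9, 13, 9, 13],
    "total": [697.3, 535.2, 907.8, 137.6],
})

# transform('max') returns a Series the same length as df, holding each
# group's max, so the comparison yields a boolean mask over the rows.
mask = df["total"] == df.groupby(["col", "row", "year"])["total"].transform("max")
print(df[mask])
```

Unlike drop_duplicates, this mask keeps every row that ties for the maximum.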
This is my available df, it contains year from 2016 to 2020
Year Month Bill
-----------------
2016 1 2
2016 2 5
2016 3 10
2016 4 2
2016 5 4
2016 6 9
2016 7 7
2016 8 8
2016 9 9
2016 10 5
2016 11 1
2016 12 3
.
.
.
2020 12 10
Now I want to create 2 new columns in this dataframe, Level and Contribution.
The Level column contains Q1, Q2, Q3, Q4, representing the 4 quarters of the year, and the Contribution column contains the average value of the Bill column over the 3 months of that quarter in the respective year.
For example, Q1 for 2016 will contain the average of the Bill values for months 1, 2, 3 of 2016 in the Contribution column,
and likewise Q3 for 2020 will contain the average of months 7, 8, 9 of the 2020 Bill column in the Contribution column. The expected dataframe is given below:
Year Month Bill levels contribution
------------------------------------
2016 1 2 Q1 5.66
2016 2 5 Q1 5.66
2016 3 10 Q1 5.66
2016 4 2 Q2 5
2016 5 4 Q2 5
2016 6 9 Q2 5
2016 7 7 Q3 8
2016 8 8 Q3 8
2016 9 9 Q3 8
2016 10 5 Q4 3
2016 11 1 Q4 3
2016 12 3 Q4 3
.
.
2020 10 2 Q4 6
2020 11 6 Q4 6
2020 12 10 Q4 6
This process is repeated for all 4 quarters of each year.
I am not able to figure this out, as it is something new to me.
You can try:
import math

df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
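A self-contained run of this approach on the question's 2016 rows (data reconstructed from the example above):

```python
import math

import pandas as pd

# Hypothetical 2016 slice of the question's data.
df = pd.DataFrame({
    "Year": [2016] * 12,
    "Month": list(range(1, 13)),
    "Bill": [2, 5, 10, 2, 4, 9, 7, 8, 9, 5, 1, 3],
})

# Months 1-3 -> Q1, 4-6 -> Q2, etc.: ceil(month / 3) gives the quarter number.
df["levels"] = "Q" + df["Month"].div(3).apply(math.ceil).astype(str)
# transform('mean') broadcasts each quarter's mean back onto its three months.
df["contribution"] = df.groupby(["Year", "levels"])["Bill"].transform("mean")
print(df)
```

This reproduces the expected output: months 1-3 get Q1 with contribution (2+5+10)/3 ≈ 5.67, months 10-12 get Q4 with contribution 3.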
pandas actually has a datatype for monthly and quarterly values called pandas.Period.
See this similar question:
How to create a period range with year and week of year in python?
In your case it would look like this:
from datetime import datetime
# First create dates for 1st of each month period
df['Dates'] = [datetime(row['Year'], row['Month'], 1) for i, row in df[['Year', 'Month']].iterrows()]
# Create monthly periods
df['Month Periods'] = df['Dates'].dt.to_period('M')
# Use the new monthly index
df = df.set_index('Month Periods')
# Group by quarters
df_qtrly = df['Bill'].resample('Q').mean()
df_qtrly.index.names = ['Quarters']
print(df_qtrly)
Result:
Quarters
2016Q1 5.666667
2016Q2 5.000000
2016Q3 8.000000
2016Q4 3.000000
Freq: Q-DEC, Name: Bill, dtype: float64
If you want to put these values back into the monthly dataframe you could do this:
df['Quarters'] = df['Dates'].dt.to_period('Q')
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
Year Month Bill Dates Quarters Contributions
Month Periods
2016-01 2016 1 2 2016-01-01 2016Q1 5.666667
2016-02 2016 2 5 2016-02-01 2016Q1 5.666667
2016-03 2016 3 10 2016-03-01 2016Q1 5.666667
2016-04 2016 4 2 2016-04-01 2016Q2 5.000000
2016-05 2016 5 4 2016-05-01 2016Q2 5.000000
2016-06 2016 6 9 2016-06-01 2016Q2 5.000000
2016-07 2016 7 7 2016-07-01 2016Q3 8.000000
2016-08 2016 8 8 2016-08-01 2016Q3 8.000000
2016-09 2016 9 9 2016-09-01 2016Q3 8.000000
2016-10 2016 10 5 2016-10-01 2016Q4 3.000000
2016-11 2016 11 1 2016-11-01 2016Q4 3.000000
2016-12 2016 12 3 2016-12-01 2016Q4 3.000000
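An alternative sketch that avoids the iterrows loop: pd.to_datetime can assemble dates from a dict of year/month/day components, to_period('Q') then gives the quarter directly, and transform('mean') puts the quarterly mean back onto every month (2016 data reconstructed from the question):

```python
import pandas as pd

# Hypothetical 2016 slice of the question's data.
df = pd.DataFrame({
    "Year": [2016] * 12,
    "Month": list(range(1, 13)),
    "Bill": [2, 5, 10, 2, 4, 9, 7, 8, 9, 5, 1, 3],
})

# to_datetime understands a dict of year/month/day components (day broadcasts).
dates = pd.to_datetime(dict(year=df["Year"], month=df["Month"], day=1))
df["Quarter"] = dates.dt.to_period("Q")
# Broadcast each quarter's mean back onto its months.
df["Contribution"] = df.groupby("Quarter")["Bill"].transform("mean")
print(df)
```

The "Quarter" and "Contribution" names are illustrative; the quarterly means match the resampled result above (Q1 ≈ 5.67, Q2 = 5.0, Q3 = 8.0, Q4 = 3.0).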
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I am trying to create a new column on an existing dataframe based on values of another dataframe.
# Define a dataframe containing 2 columns Date-Year and Date-Qtr
data1 = {'Date-Year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
'Date-Qtr': ['2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2017Q1', '2017Q2']}
dfx = pd.DataFrame(data1)
# Define another dataframe containing 2 columns Date-Year and Interest Rate
data2 = {'Date-Year': [2000, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
'Interest Rate': [0.00, 8.20, 8.20, 7.75, 7.50, 7.50, 6.50, 6.50]}
dfy = pd.DataFrame(data2)
# Add 1 more column to the first dataframe
dfx['Int-rate'] = float(0)
Output for dfx
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 0.0
1 2015 2015Q2 0.0
2 2015 2015Q3 0.0
3 2015 2015Q4 0.0
4 2016 2016Q1 0.0
5 2016 2016Q2 0.0
6 2016 2016Q3 0.0
7 2016 2016Q4 0.0
8 2017 2017Q1 0.0
9 2017 2017Q2 0.0
Output for dfy
Date-Year Interest Rate
0 2000 0.00
1 2015 8.20
2 2016 8.20
3 2017 7.75
4 2018 7.50
5 2019 7.50
6 2020 6.50
7 2021 6.50
Now I need to update the 'Int-rate' column of dfx by picking up the value of 'Interest Rate' from dfy for the corresponding year, which I am currently achieving with 2 for loops:
#Check the year from dfx - goto dfy - check the interest rate from dfy for that year and modify Int-rate of dfx with this value
for i in range(len(dfx['Date-Year'])):
    for j in range(len(dfy['Date-Year'])):
        if dfx['Date-Year'][i] == dfy['Date-Year'][j]:
            dfx.loc[i, 'Int-rate'] = dfy['Interest Rate'][j]
and I get the desired output
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Is there a way I can achieve the same output:
without declaring dfx['Int-rate'] = float(0)? I get a KeyError: 'Int-rate' if I don't declare it.
without the 2 for loops? Is it possible to do this in a better way (like using map or merge or joins)?
I have tried looking through other posts, and the best one I found is here; I tried using map but could not get it to work. Any help will be appreciated.
thanks
You could use replace with a dictionary:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dict(dfy.to_numpy()))
print(dfx)
Output
Date-Year Date-Qtr Int-Rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Or with a Series as an alternative:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dfy.set_index('Date-Year').squeeze())
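Since the question explicitly mentions map, a lookup through a Series indexed by year is another option; unlike replace, years missing from dfy become NaN rather than staying as the raw year value. A sketch on shortened reconstructions of the two frames:

```python
import pandas as pd

# Shortened reconstructions of the question's two frames.
dfx = pd.DataFrame({"Date-Year": [2015, 2016, 2017],
                    "Date-Qtr": ["2015Q1", "2016Q1", "2017Q1"]})
dfy = pd.DataFrame({"Date-Year": [2015, 2016, 2017],
                    "Interest Rate": [8.20, 8.20, 7.75]})

# Build a year -> rate lookup Series, then map each year through it.
rates = dfy.set_index("Date-Year")["Interest Rate"]
dfx["Int-rate"] = dfx["Date-Year"].map(rates)
print(dfx)
```

No pre-declared Int-rate column is needed here, so the KeyError from the question never arises.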
You can simply use df.merge (this assumes dfx without the pre-declared Int-rate column; otherwise the rename would create a duplicate column name):
In [4448]: df = dfx.merge(dfy).rename(columns={'Interest Rate':'Int-rate'})
In [4449]: df
Out[4449]:
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
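One caveat worth noting about this merge: the default inner join silently drops dfx rows whose year is absent from dfy, whereas how='left' keeps them with NaN. A sketch with a hypothetical unmatched year:

```python
import pandas as pd

# Minimal frames with one year (2022) missing from dfy.
dfx = pd.DataFrame({"Date-Year": [2015, 2022],
                    "Date-Qtr": ["2015Q1", "2022Q1"]})
dfy = pd.DataFrame({"Date-Year": [2015],
                    "Interest Rate": [8.20]})

# An inner join (the default) would drop the 2022 row entirely;
# a left join keeps it, with NaN in the rate column.
out = dfx.merge(dfy, how="left").rename(columns={"Interest Rate": "Int-rate"})
print(out)
```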
I've created a pandas dataframe using the read_html method from an external source. There's no problem creating the dataframe; however, I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, therefore, the solution requires a dynamic approach. So far I've only been able to achieve the desired outcome using .iloc to manually update each row, which works fine until the data is updated at source next month.
This is what my dataframe looks like:
df = pd.read_html(url)[0]  # read_html returns a list of tables; take the first
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
Month Value
0 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
Right now the best idea I've come up with is to select, extract and append the year to each row in the 'Month' column until the month 'December' is reached, and then increment to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented?), or is there a better way?
Many thanks from a long time reader and first time poster of stackoverflow!
Use ffill based on Value: where Value is NaN, the Month entry actually holds a year, so forward-fill those years and paste one onto each month name, then drop the year rows:
df.Month=df.Month+' '+df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
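A self-contained run of the same idea on a reconstructed slice of the question's table:

```python
import pandas as pd

# Hypothetical reconstruction of the scraped table's first rows:
# year rows have NaN in Value, month rows have a number.
df = pd.DataFrame({
    "Month": ["2017", "November", "December", "2018", "January"],
    "Value": [None, 1.29, 1.29, None, 1.29],
})

# Where Value is NaN, Month holds a year; forward-fill those years
# and append them to each month name, then drop the year rows.
df["Month"] = df["Month"] + " " + df["Month"].where(df["Value"].isna()).ffill()
df = df.dropna()
print(df)
```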
I have two data frames, and I want to join them so that I can check the quantity of a given week in every year in a single data frame.
df1= City Week qty Year
hyd 35 10 2015
hyd 36 15 2015
hyd 37 11 2015
hyd 42 10 2015
hyd 23 10 2016
hyd 32 15 2016
hyd 37 11 2017
hyd 42 10 2017
pune 35 10 2015
pune 36 15 2015
pune 37 11 2015
pune 42 10 2015
pune 23 10 2016
pune 32 15 2016
pune 37 11 2017
pune 42 10 2017
df2= city Week qty Year
hyd 23 10 2015
hyd 32 15 2015
hyd 35 12 2016
hyd 36 15 2016
hyd 37 11 2016
hyd 42 10 2016
hyd 43 12 2016
hyd 44 18 2016
hyd 35 11 2017
hyd 36 15 2017
hyd 37 11 2017
hyd 42 10 2017
hyd 51 14 2017
hyd 52 17 2017
pune 35 12 2016
pune 36 15 2016
pune 37 11 2016
pune 42 10 2016
pune 43 12 2016
pune 44 18 2016
pune 35 11 2017
pune 36 15 2017
pune 37 11 2017
pune 42 10 2017
pune 51 14 2017
pune 52 17 2017
I want to join the two data frames as shown in the result: for each city, append that week's quantity from every year into a single data frame.
city Week qty Year y2016_wk qty y2017_wk qty y2015_week qty
hyd 35 10 2015 2016_35 12 2017_35 11 nan nan
hyd 36 15 2015 2016_36 15 2017_36 15 nan nan
hyd 37 11 2015 2016_37 11 2017_37 11 nan nan
hyd 42 10 2015 2016_42 10 2017_42 10 nan nan
hyd 23 10 2016 nan nan 2017_23 x 2015_23 10
hyd 32 15 2016 nan nan 2017_32 y 2015_32 15
hyd 37 11 2017 2016_37 11 nan nan 2015_37 x
hyd 42 10 2017 2016_42 10 nan nan 2015_42 y
pune 35 10 2015 2016_35 12 2017_35 11 nan nan
pune 36 15 2015 2016_36 15 2017_36 15 nan nan
pune 37 11 2015 2016_37 11 2017_37 11 nan nan
pune 42 10 2015 2016_42 10 2017_42 10 nan nan
You can break down your task into a few steps:
Combine your dataframes df1 and df2.
Create a list of dataframes from your combined dataframe, splitting by year.
At the same time, rename columns to reflect year, set index to Week.
Finally, concatenate along axis=1 and reset_index.
Here is an example:
df = pd.concat([df1, df2], ignore_index=True)
dfs = [df[df['Year'] == y].rename(columns=lambda x: x+'_'+str(y) if x != 'Week' else x)\
.set_index('Week') for y in df['Year'].unique()]
res = pd.concat(dfs, axis=1).reset_index()
Result:
print(res)
Week qty_2015 Year_2015 qty_2016 Year_2016 qty_2017 Year_2017
0 35 10.0 2015.0 12.0 2016.0 11.0 2017.0
1 36 15.0 2015.0 15.0 2016.0 15.0 2017.0
2 37 11.0 2015.0 11.0 2016.0 11.0 2017.0
3 42 10.0 2015.0 10.0 2016.0 10.0 2017.0
4 43 NaN NaN 12.0 2016.0 NaN NaN
5 44 NaN NaN 18.0 2016.0 NaN NaN
6 51 NaN NaN NaN NaN 14.0 2017.0
7 52 NaN NaN NaN NaN 17.0 2017.0
Personally I don't think your example output is that readable, so unless you need that format for a specific reason I might consider using a pivot table. I also think the code required is cleaner.
import pandas as pd
df3 = pd.concat([df1, df2], ignore_index=True)
df4 = df3.pivot(index='Week', columns='Year', values='qty')
print(df4)
Year 2015 2016 2017
Week
35 10.0 12.0 11.0
36 15.0 15.0 15.0
37 11.0 11.0 11.0
42 10.0 10.0 10.0
43 NaN 12.0 NaN
44 NaN 18.0 NaN
51 NaN NaN 14.0
52 NaN NaN 17.0
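One caveat: with both cities in the combined frame, df.pivot raises on duplicate (Week, Year) pairs, so the table above presumably covers a single city. pivot_table with City in the index keeps the cities separate; the sketch below assumes the two frames' city columns are first normalized to one name (the question has 'City' vs 'city'):

```python
import pandas as pd

# Shortened hypothetical versions of df1/df2 with the city column
# normalized to 'City' in both.
df1 = pd.DataFrame({"City": ["hyd", "hyd", "pune"],
                    "Week": [35, 36, 35],
                    "qty": [10, 15, 10],
                    "Year": [2015, 2015, 2015]})
df2 = pd.DataFrame({"City": ["hyd", "hyd", "pune"],
                    "Week": [35, 36, 35],
                    "qty": [12, 15, 12],
                    "Year": [2016, 2016, 2016]})

df3 = pd.concat([df1, df2], ignore_index=True)
# Keeping City in the index preserves the per-city breakdown and avoids
# the duplicate-entry error that plain pivot would raise here.
out = df3.pivot_table(index=["City", "Week"], columns="Year", values="qty")
print(out)
```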
I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
I create a GroupBy object because I would like to obtain a rolling mean grouped by columns A, B, C, D:
DfGby = Df.groupby(['A','B', 'C','D'])
Then I compute the rolling mean:
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
But I obtain
E
A B C D
23 2015 1 14937 0 NaN
19054 1 NaN
2 14937 2 NaN
19054 3 NaN
3 14937 4 NaN
19054 5 NaN
4 14937 6 NaN
19054 7 NaN
5 14937 8 NaN
19054 9 NaN
6 14937 10 NaN
19054 11 NaN
What is the problem here?
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
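A likely explanation, sketched below: grouping by all four of A, B, C, D makes every (A, B, C, D) combination a single-row group, so a window of 3 never has enough observations and every result is NaN. If the intent is a rolling mean per (A, B, D) series ordered by C (an assumption about the desired output), dropping C from the group keys works:

```python
import pandas as pd

# Reconstruction of the question's frame.
df = pd.DataFrame({
    "A": [23] * 12,
    "B": [2015] * 12,
    "C": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "D": [14937, 19054] * 6,
    "E": [16.25, 7.50, 16.75, 17.25, 71.75, 15.00,
          13.00, 37.75, 4.25, 18.25, 16.50, 1.00],
})

# Grouping by A, B, C, D gives 12 one-row groups -> all-NaN rolling means.
# Dropping C leaves two 6-row series (one per D), so window=3 can fill.
rolled = (df.sort_values(["A", "B", "D", "C"])
            .groupby(["A", "B", "D"])["E"]
            .rolling(window=3, center=False)
            .mean())
print(rolled)
```

Each series then starts with two NaNs (not enough observations for the window) followed by the 3-period means, matching the shape of the desired output.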