Locate, extract and re-append year from column in pandas DataFrame - python

I've created a pandas DataFrame using pd.read_html from an external source. There's no problem creating the DataFrame; however, I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, so the solution needs to be dynamic. So far I've only been able to achieve the desired outcome by using .iloc to update each row manually, which works fine only until the data is updated at the source next month.
This is what my dataframe looks like:
df = pd.read_html(url)
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
Month Value
0 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
Right now the best idea I've come up with is to select, extract and append the year to each row in the 'Month' column until 'December' is reached, then increment to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented?), or is there a better way?
Many thanks from a long-time reader and first-time poster on Stack Overflow!

Use ffill based on Value: where Value is NaN, the Month cell actually holds a year, so mask the month names down to those year rows, forward-fill the years, and append them to each month name:
df.Month = df.Month + ' ' + df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
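The same forward-fill idea can be shown end-to-end on a small hand-built sample (a sketch; the frame below stands in for the pd.read_html result):

```python
import pandas as pd

# Hand-built stand-in for the scraped frame: year rows have NaN in Value.
df = pd.DataFrame({
    "Month": ["2017", "November", "December", "2018", "January"],
    "Value": [None, 1.29, 1.29, None, 1.29],
})

# Rows where Value is NaN hold a year; keep only those, forward-fill the
# years downwards, then append the year to every month name.
year = df["Month"].where(df["Value"].isna()).ffill()
df["Month"] = df["Month"] + " " + year

# Finally drop the year-only rows (their Value is still NaN).
df = df.dropna().reset_index(drop=True)
print(df)
```

Because the fill key is derived from the data itself, next month's update at the source needs no code changes.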

Pandas groupby: get max value in a subgroup [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 7 months ago.
I have a large dataset grouped by column, row, year, potveg, and total. I am trying to get the max value of 'total' column in a specific year of a group. i.e., for the dataset below:
col row year potveg total
-125.0 42.5 2015 9 697.3
2015 13 535.2
2015 15 82.3
2016 9 907.8
2016 13 137.6
2016 15 268.4
2017 9 961.9
2017 13 74.2
2017 15 248.0
2018 9 937.9
2018 13 575.6
2018 15 215.5
-135.0 70.5 2015 8 697.3
2015 10 535.2
2015 19 82.3
2016 8 907.8
2016 10 137.6
2016 19 268.4
2017 8 961.9
2017 10 74.2
2017 19 248.0
2018 8 937.9
2018 10 575.6
2018 19 215.5
I would like the output to look like this:
col row year potveg total
-125.0 42.5 2015 9 697.3
2016 9 907.8
2017 9 961.9
2018 9 937.9
-135.0 70.5 2015 8 697.3
2016 8 907.8
2017 8 961.9
2018 8 937.9
I tried this:
df.groupby(['col', 'row', 'year', 'potveg']).agg({'total': 'max'})
and this:
df.groupby(['col', 'row', 'year', 'potveg'])['total'].max()
but they do not seem to work because the output has too many rows.
I think the issue is the 'potveg' column which is a subgroup. I am not sure how to select rows containing max value of 'total'.
One possible solution, using .idxmax() inside groupby.apply:
print(
    df.groupby(["col", "row", "year"], as_index=False, sort=False).apply(
        lambda x: x.loc[x["total"].idxmax()]
    )
)
Prints:
col row year potveg total
0 -125.0 42.5 2015.0 9.0 697.3
1 -125.0 42.5 2016.0 9.0 907.8
2 -125.0 42.5 2017.0 9.0 961.9
3 -125.0 42.5 2018.0 9.0 937.9
4 -135.0 70.5 2015.0 8.0 697.3
5 -135.0 70.5 2016.0 8.0 907.8
6 -135.0 70.5 2017.0 8.0 961.9
7 -135.0 70.5 2018.0 8.0 937.9
DataFrame used:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
1 -125.0 42.5 2015 13 535.2
2 -125.0 42.5 2015 15 82.3
3 -125.0 42.5 2016 9 907.8
4 -125.0 42.5 2016 13 137.6
5 -125.0 42.5 2016 15 268.4
6 -125.0 42.5 2017 9 961.9
7 -125.0 42.5 2017 13 74.2
8 -125.0 42.5 2017 15 248.0
9 -125.0 42.5 2018 9 937.9
10 -125.0 42.5 2018 13 575.6
11 -125.0 42.5 2018 15 215.5
12 -135.0 70.5 2015 8 697.3
13 -135.0 70.5 2015 10 535.2
14 -135.0 70.5 2015 19 82.3
15 -135.0 70.5 2016 8 907.8
16 -135.0 70.5 2016 10 137.6
17 -135.0 70.5 2016 19 268.4
18 -135.0 70.5 2017 8 961.9
19 -135.0 70.5 2017 10 74.2
20 -135.0 70.5 2017 19 248.0
21 -135.0 70.5 2018 8 937.9
22 -135.0 70.5 2018 10 575.6
23 -135.0 70.5 2018 19 215.5
Option 1: One way is to do the groupby() and then merge with the original df:
df1 = pd.merge(df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
               df,
               on=['col', 'row', 'year', 'total'])
print(df1)
print(df1)
Output:
col row year total potveg
0 -125.0 42.5 2015 697.3 9
1 -125.0 42.5 2016 907.8 9
2 -125.0 42.5 2017 961.9 9
3 -125.0 42.5 2018 937.9 9
4 -135.0 70.5 2015 697.3 8
5 -135.0 70.5 2016 907.8 8
6 -135.0 70.5 2017 961.9 8
7 -135.0 70.5 2018 937.9 8
Option 2: Or use sort_values() and drop_duplicates(), sorting 'total' in descending order so the max row comes first in each group (sorting by the group keys alone would just keep whichever row happens to appear first):
df1 = df.sort_values(['col', 'row', 'year', 'total'],
                     ascending=[True, True, True, False]).drop_duplicates(['col', 'row', 'year'], keep='first')
print(df1)
Output:
col row year potveg total
0 -125.0 42.5 2015 9 697.3
3 -125.0 42.5 2016 9 907.8
6 -125.0 42.5 2017 9 961.9
9 -125.0 42.5 2018 9 937.9
12 -135.0 70.5 2015 8 697.3
15 -135.0 70.5 2016 8 907.8
18 -135.0 70.5 2017 8 961.9
21 -135.0 70.5 2018 8 937.9
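Another common pattern, assuming a default RangeIndex as in the DataFrame above, is to take the row labels of the per-group maxima with idxmax and select them in a single .loc call (a sketch on a trimmed sample):

```python
import pandas as pd

# Trimmed sample of the question's data (one coordinate, two years).
df = pd.DataFrame({
    "col":    [-125.0] * 6,
    "row":    [42.5] * 6,
    "year":   [2015, 2015, 2015, 2016, 2016, 2016],
    "potveg": [9, 13, 15, 9, 13, 15],
    "total":  [697.3, 535.2, 82.3, 907.8, 137.6, 268.4],
})

# idxmax returns the index label of the max 'total' in each group;
# .loc then pulls those full rows, keeping all columns and dtypes.
out = df.loc[df.groupby(["col", "row", "year"])["total"].idxmax()]
print(out)
```

This avoids both the apply lambda and the merge back to the original frame.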

How to create a new column on an existing Dataframe based on values of another Dataframe both of which are of different length [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I am trying to create a new column on an existing dataframe based on values of another dataframe.
# Define a dataframe containing 2 columns, Date-Year and Date-Qtr
data1 = {'Date-Year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
         'Date-Qtr': ['2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2',
                      '2016Q3', '2016Q4', '2017Q1', '2017Q2']}
dfx = pd.DataFrame(data1)
# Define another dataframe containing 2 columns, Date-Year and Interest Rate
data2 = {'Date-Year': [2000, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
         'Interest Rate': [0.00, 8.20, 8.20, 7.75, 7.50, 7.50, 6.50, 6.50]}
dfy = pd.DataFrame(data2)
# Add 1 more column to the first dataframe
dfx['Int-rate'] = float(0)
Output for dfx
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 0.0
1 2015 2015Q2 0.0
2 2015 2015Q3 0.0
3 2015 2015Q4 0.0
4 2016 2016Q1 0.0
5 2016 2016Q2 0.0
6 2016 2016Q3 0.0
7 2016 2016Q4 0.0
8 2017 2017Q1 0.0
9 2017 2017Q2 0.0
Output for dfy
Date-Year Interest Rate
0 2000 0.00
1 2015 8.20
2 2016 8.20
3 2017 7.75
4 2018 7.50
5 2019 7.50
6 2020 6.50
7 2021 6.50
Now I need to update the 'Int-rate' column of dfx by picking up the value of 'Interest Rate' from dfy for the corresponding year, which I am currently doing with two nested for loops:
# Check the year from dfx, look up the interest rate for that year in dfy,
# and write it into the Int-rate column of dfx
for i in range(len(dfx['Date-Year'])):
    for j in range(len(dfy['Date-Year'])):
        if dfx['Date-Year'][i] == dfy['Date-Year'][j]:
            dfx['Int-rate'][i] = dfy['Interest Rate'][j]
and I get the desired output
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Is there a way I can achieve the same output:
without declaring dfx['Int-rate'] = float(0)? I get a KeyError: 'Int-rate' if I don't declare it.
without the two for loops? I'm not very happy with them. Is it possible to do this in a better way (like using map, merge or joins)?
I have tried looking through other posts, and the best one I found is here; I tried using map but could not get it to work. Any help will be appreciated.
Thanks
You could use replace with a dictionary:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dict(dfy.to_numpy()))
print(dfx)
Output
Date-Year Date-Qtr Int-Rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Or with a Series as an alternative:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dfy.set_index('Date-Year').squeeze())
You can simply use df.merge:
In [4448]: df = dfx.merge(dfy).rename(columns={'Interest Rate':'Int-rate'})
In [4449]: df
Out[4449]:
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
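Since the question also asks about map: a lookup Series built from dfy, keyed by year, can be mapped onto dfx directly (a sketch using the question's data, shortened):

```python
import pandas as pd

dfx = pd.DataFrame({
    "Date-Year": [2015, 2015, 2016, 2017],
    "Date-Qtr": ["2015Q1", "2015Q2", "2016Q1", "2017Q1"],
})
dfy = pd.DataFrame({
    "Date-Year": [2000, 2015, 2016, 2017],
    "Interest Rate": [0.00, 8.20, 8.20, 7.75],
})

# Turn dfy into a year -> rate lookup Series, then map each year in dfx.
# No pre-declared column is needed; assignment creates 'Int-rate' here.
rates = dfy.set_index("Date-Year")["Interest Rate"]
dfx["Int-rate"] = dfx["Date-Year"].map(rates)
print(dfx)
```

Unlike the inner merge, map keeps every row of dfx even for years missing from dfy (they become NaN), which may or may not be what you want.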

Multiply dataframes with similar index but unequal levels

I have two dataframes as follows
df1
Location Month Date Ratio
A June Jun 1 0.2
A June Jun 2 0.3
A June Jun 3 0.4
B June Jun 1 0.6
B June Jun 2 0.7
B June Jun 3 0.8
And df2
Location Month Value
A June 1000
B June 2000
Result should be as :
df3
Location Month Date Value
A June Jun 1 200
A June Jun 2 300
A June Jun 3 400
B June Jun 1 1200
B June Jun 2 1400
B June Jun 3 1600
How do I go about doing this? I can carry out division without problems, as pandas does a great job of matching indices, but with multiplication the result is all over the place.
Thanks.
You can use df.merge and df.assign (using the question's names, where df1 holds the ratios and df2 the values):
df1.assign(Value=df1.merge(df2, how='inner', on=['Location', 'Month'])['Value']
               .mul(df1['Ratio']))
# or
# df1 = df1.merge(df2, how='inner', on=['Location', 'Month'])
# df1['Value'] *= df1['Ratio']
Location Month Date Ratio Value
0 A June Jun 1 0.2 200.0
1 A June Jun 2 0.3 300.0
2 A June Jun 3 0.4 400.0
3 B June Jun 1 0.6 1200.0
4 B June Jun 2 0.7 1400.0
5 B June Jun 3 0.8 1600.0
Or, using df.set_index on both frames so the multiplication aligns on Location and Month:
df1.set_index(['Location', 'Month'], inplace=True)
df2.set_index(['Location', 'Month'], inplace=True)
df1['Value'] = df1['Ratio'] * df2['Value']
IIUC, and if Location is the index of both dataframes, you can use pandas.Series.mul:
df1["Value"] = df1.Ratio.mul(df2.Value)
df1
Month Date Ratio Value
Location
A June Jun 1 0.2 200.0
A June Jun 2 0.3 300.0
A June Jun 3 0.4 400.0
B June Jun 1 0.6 1200.0
B June Jun 2 0.7 1400.0
B June Jun 3 0.8 1600.0
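Put together as a runnable sketch (column-based frames rather than an index, merging on Location and Month):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Location": ["A", "A", "B", "B"],
    "Month": ["June"] * 4,
    "Date": ["Jun 1", "Jun 2", "Jun 1", "Jun 2"],
    "Ratio": [0.2, 0.3, 0.6, 0.7],
})
df2 = pd.DataFrame({
    "Location": ["A", "B"],
    "Month": ["June", "June"],
    "Value": [1000, 2000],
})

# Broadcast each location's Value onto its daily rows, then scale by Ratio.
df3 = df1.merge(df2, on=["Location", "Month"])
df3["Value"] = df3["Value"] * df3["Ratio"]
print(df3)
```

The merge makes the alignment explicit, so multiplication behaves the same way division did.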

Sorting values in a dataframe

I've been trying to sort the values in a dataframe that I've been given to work on.
The following is my dataframe.
1981 1.78
1982 1.74
1983 1.61
1984 1.62
1985 1.61
1986 1.43
1987 1.62
1988 1.96
1989 1.75
1990 1.83
1991 1.73
1992 1.72
1993 1.74
1994 1.71
1995 1.67
1996 1.66
1997 1.61
1998 1.48
1999 1.47
2000 1.6
2001 1.41
2002 1.37
2003 1.27
2004 1.26
2005 1.26
2006 1.28
2007 1.29
2008 1.28
2009 1.22
2010 1.15
2011 1.2
2012 1.29
2013 1.19
2014 1.25
2015 1.24
2016 1.2
2017 1.16
2018 1.14
I've been trying to sort my dataframe in descending order, so that the highest values (on the right) appear first. However, whenever I try to sort it, it only sorts on the year, i.e. the values on the left.
dataframe.sort_values('1')
I've tried using sort_values, passing '1' as the column I want sorted. This, however, returns ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>
From the error the OP mentioned, the data structure is a Series rather than a DataFrame, so the sort should be called directly, without a column name:
s = s.sort_values(ascending=False)
The error was raised because in pandas.Series.sort_values the first positional argument is axis, so '1' was interpreted as an axis name.
If the data is instead loaded as a DataFrame, the argument of sort_values() should be the name of the column to sort by:
df = df.sort_values('col2', ascending=False)
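The Series case can be sketched on a slice of the data above (a hand-built sample; the full frame would behave the same):

```python
import pandas as pd

# The year/value data loads as a Series: years are the index, not a column.
s = pd.Series([1.78, 1.74, 1.61, 1.96], index=[1981, 1982, 1983, 1988])

# A Series has only one set of values, so sort_values takes no column
# name; ascending=False puts the highest values first.
top = s.sort_values(ascending=False)
print(top)
```

If you prefer a DataFrame, `s.reset_index()` first gives you named columns to sort by.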

Transpose multiple rows of data in pandas df [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 3 years ago.
I have the below table which shows rainfall by month in the UK across a number of years. I want to transpose it so that each row is one month/year and the data is chronological.
Year JAN FEB MAR APR
2010 79.7 74.8 79.4 48
2011 102.8 114.5 49.7 36.7
2012 110.9 60 37 128
2013 110.5 59.8 64.6 63.6
I would like it so the table looks like the below with year, month & rainfall as the columns:
2010 JAN 79.7
2010 FEB 74.8
2010 MAR 79.4
2010 APR 48
2011 JAN 102.8
2011 FEB 114.5
I think I need to use a for loop and iterate through each row to create a new dataframe, but I'm not sure of the syntax. I've tried the loop below, which nearly does what I want but doesn't output a dataframe.
for index, row in weather.iterrows():
    print(row["Year"], row)
2014.0 Year 2014.0
JAN 188.0
FEB 169.2
MAR 80.0
APR 67.8
MAY 99.6
JUN 54.8
JUL 64.7
Any help would be appreciated.
You should avoid using for-loops and instead use stack.
df.set_index('Year') \
  .stack() \
  .reset_index() \
  .rename(columns={'level_1': 'Month', 0: 'Amount'})
Year Month Amount
0 2010 JAN 79.7
1 2010 FEB 74.8
2 2010 MAR 79.4
3 2010 APR 48.0
4 2011 JAN 102.8
5 2011 FEB 114.5
6 2011 MAR 49.7
7 2011 APR 36.7
8 2012 JAN 110.9
9 2012 FEB 60.0
etc...
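An equivalent reshape can be done with melt; note that melt emits all JAN rows, then all FEB rows, so a stable sort on Year is needed to restore the within-year month order that stack gives for free (a sketch on a trimmed sample):

```python
import pandas as pd

# Trimmed sample of the rainfall table (two years, two months).
df = pd.DataFrame({
    "Year": [2010, 2011],
    "JAN": [79.7, 102.8],
    "FEB": [74.8, 114.5],
})

# melt turns the month columns into rows; a stable (mergesort) sort on
# Year keeps the original JAN-before-FEB column order within each year.
long = (df.melt(id_vars="Year", var_name="Month", value_name="Amount")
          .sort_values("Year", kind="mergesort")
          .reset_index(drop=True))
print(long)
```

melt is handy when you want to keep control of the new column names without a rename step.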