Get values of latest year and all its months in pandas - python

Below is the Raw Data.
Event Month Year
Event1 January 2012
Event1 February 2013
Event1 March 2014
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2018
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012
Event1's latest year is 2017, so its months should be April, May, June.
Event2's latest year is 2019, so its month should be May.
Event3's latest year is 2012, so its months should be February, March, April.
The output should be:
Event Month Year
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012

You can transform the latest year per group and use it to slice:
out = df[df['Year'].eq(df.groupby('Event')['Year'].transform('max'))]
output:
Event Month Year
3 Event1 April 2017
4 Event1 May 2017
5 Event1 June 2017
7 Event2 May 2019
8 Event3 February 2012
9 Event3 March 2012
10 Event3 April 2012
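A self-contained sketch of the same approach, for anyone who wants to run it. The DataFrame construction below is an assumption about how the raw data is loaded; the values are copied from the question.

```python
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    'Event': ['Event1'] * 6 + ['Event2'] * 2 + ['Event3'] * 3,
    'Month': ['January', 'February', 'March', 'April', 'May', 'June',
              'May', 'May', 'February', 'March', 'April'],
    'Year': [2012, 2013, 2014, 2017, 2017, 2017, 2018, 2019, 2012, 2012, 2012],
})

# transform('max') broadcasts each event's latest year back onto every row,
# so the comparison yields a boolean mask aligned with df.
latest = df.groupby('Event')['Year'].transform('max')
out = df[df['Year'].eq(latest)]
print(out)
```

Because transform returns a Series with the same index as df, no merge or join is needed before filtering.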

Related

How to remove unwanted data from a data column using pandas DataFrame

The scraped data gives me the date twice in the Date column: once in long form with the full weekday and once in short form (for example, Monday, December 13, 2021Mon, Dec 13, 2021). I want to drop the long part (December 13, 2021Mon,), keep the short date, and put the weekday in a separate new column. I also want to remove the last column, Volume.
Script
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----
I added the necessary steps to your code:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column (everything before the first comma)
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format (the last 12 characters, e.g. "Dec 13, 2021")
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
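The three operations can also be tried without hitting the network. This is a minimal sketch on two rows copied from the "Current output" table, using the vectorised .str accessor instead of apply. It assumes the short date is always 12 characters long, which holds only while the day is zero-padded as in the sample.

```python
import pandas as pd

# Two sample rows copied from the question; the real frame comes from read_html.
df = pd.DataFrame({
    'Date': ['Monday, December 13, 2021Mon, Dec 13, 2021',
             'Friday, December 10, 2021Fri, Dec 10, 2021'],
    'Close': [77.77, 77.61],
    'Volume': ['00.00', '----'],
})

# Everything before the first comma is the full weekday name.
df.insert(0, 'Day', df['Date'].str.split(',').str[0])
# Keep the trailing 12 characters, e.g. "Dec 13, 2021".
df['Date'] = df['Date'].str[-12:]
# Drop the unwanted column.
df = df.drop(columns='Volume')
print(df)
```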
I would use a regex here to split the column. Then you can combine the pieces and parse them any way you like afterwards:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
# split at the point where the 4-digit year runs into the short weekday ("2021Mon")
df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
# piece 0 is "Monday, December 13, " and piece 1 is the year
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
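Note that the regex answer above keeps the long date and the Volume column. If the Day and short Date columns from the expected output are wanted, one possible variant is Series.str.extract with named groups. The pattern below is my own assumption about the string layout (full weekday, then anything, then a three-letter month with a two-digit day and four-digit year at the end), so treat it as a sketch.

```python
import pandas as pd

# Sample strings copied from the question's "Current output" table.
dates = pd.Series(['Monday, December 13, 2021Mon, Dec 13, 2021',
                   'Tuesday, December 07, 2021Tue, Dec 07, 2021'])

# Named groups become column names: the weekday before the first comma,
# and the short date anchored at the end of the string.
pattern = r'^(?P<Day>\w+),.*?(?P<Date>[A-Z][a-z]{2} \d{2}, \d{4})$'
parts = dates.str.extract(pattern)
print(parts)
```

The resulting two columns can then be assigned back to the DataFrame, and Volume dropped with df.drop(columns='Volume').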

duplicate specific rows of a dataframe based on column values

Hi, I have the following DataFrame:
weather day month activity
sunny Monday April go for cycling
raining Friday December stay home
I want each row to appear 5 times, with the activity column filled in only on the first copy.
So the output should be:
weather day month activity
sunny Monday April go for cycling
sunny Monday April
sunny Monday April
sunny Monday April
sunny Monday April
raining Friday December stay home
raining Friday December
raining Friday December
raining Friday December
raining Friday December
Use Index.repeat with DataFrame.loc for repeated rows and then replace duplicated activity by Series.mask with Index.duplicated:
df = df.loc[df.index.repeat(5)]
df['activity'] = df['activity'].mask(df.index.duplicated(), '')
df = df.reset_index(drop=True)
print (df)
weather day month activity
0 sunny Monday April go for cycling
1 sunny Monday April
2 sunny Monday April
3 sunny Monday April
4 sunny Monday April
5 raining Friday December stay home
6 raining Friday December
7 raining Friday December
8 raining Friday December
9 raining Friday December
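For a runnable version, here is a minimal sketch that rebuilds the two-row frame from the question. The variable n and the .copy() call are my additions; the copy only makes the later column assignment unambiguous.

```python
import pandas as pd

df = pd.DataFrame({
    'weather': ['sunny', 'raining'],
    'day': ['Monday', 'Friday'],
    'month': ['April', 'December'],
    'activity': ['go for cycling', 'stay home'],
})

n = 5  # total number of copies of each row, including the original
out = df.loc[df.index.repeat(n)].copy()
# Repeated rows keep their original index label, so duplicated() is False
# for the first copy of each row and True for the others.
out['activity'] = out['activity'].mask(out.index.duplicated(), '')
out = out.reset_index(drop=True)
print(out)
```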

create another dataframe datetime column based on the value of the datetime in another dataframe column

I have a DataFrame with a datetime column; let's call it my_dates.
I also have a list of dates, say these 5 for this example:
15th Jan 2020
20th Mar 2020
28th Jun 2020
20th Jul 2020
8th Aug 2020
What I want to do is create another column in my DataFrame that, for each datetime in my_dates, takes the first date in my list that is on or after it.
For example, if the value in my_dates is 23rd June 2020, I want the new column to hold 28th Jun 2020 for that row. Hopefully the examples below make this clear.
More examples:
my_dates expected_values
14th Jan 2020 15th Jan 2020
15th Jan 2020 15th Jan 2020
16th Jan 2020 20th Mar 2020
... ...
19th Mar 2020 20th Mar 2020
20th Mar 2020 20th Mar 2020
21st Mar 2020 28th Jun 2020
What is the most efficient way to do this rather than looping?
IIUC, you need pd.merge_asof with the argument direction set to forward
import pandas as pd

dates = ['15th Jan 2020',
         '20th Mar 2020',
         '28th Jun 2020',
         '20th Jul 2020',
         '8th Aug 2020']
dates_proper = [pd.to_datetime(d) for d in dates]
# ISO strings avoid any day-first/month-first ambiguity
df = pd.DataFrame(pd.date_range('2020-01-14', '2020-03-21'), columns=['my_dates'])
df1 = pd.DataFrame(dates_proper, columns=['date_list'])
merged_df = pd.merge_asof(
    df, df1, left_on=["my_dates"], right_on=["date_list"], direction="forward"
)
print(merged_df)
my_dates date_list
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-01-17 2020-03-20
4 2020-01-18 2020-03-20
.. ... ...
63 2020-03-17 2020-03-20
64 2020-03-18 2020-03-20
65 2020-03-19 2020-03-20
66 2020-03-20 2020-03-20
67 2020-03-21 2020-06-28
Finally a use case for pd.merge_asof! :) From the documentation:
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
It would have been helpful to make your example reproducible, like this:
In [12]: reference = pd.DataFrame([['15th Jan 2020'],['20th Mar 2020'],['28th Jun 2020'],['20th Jul 2020'],['8th Aug 2020']], columns=['reference']).apply(pd.to_datetime)
In [13]: my_dates = pd.DataFrame([['14th Jan 2020'], ['15th Jan 2020'], ['16th Jan 2020'], ['19th Mar 2020'], ['20th Mar 2020'],['21st Mar 2020']], columns=['dates']).apply(pd.to_datetime)
In [15]: pd.merge_asof(my_dates, reference, left_on='dates', right_on='reference', direction='forward')
Out[15]:
dates reference
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-03-19 2020-03-20
4 2020-03-20 2020-03-20
5 2020-03-21 2020-06-28
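One thing worth knowing: pd.merge_asof raises a ValueError if either key column is unsorted. The examples above happen to be sorted already. A minimal sketch for unsorted input, with the dates deliberately shuffled (the shuffled order is my own test data, not from the question):

```python
import pandas as pd

# Reference dates and lookup dates, intentionally out of order.
reference = pd.DataFrame({'reference': pd.to_datetime(
    ['2020-06-28', '2020-01-15', '2020-08-08', '2020-03-20', '2020-07-20'])})
my_dates = pd.DataFrame({'dates': pd.to_datetime(
    ['2020-03-21', '2020-01-14', '2020-03-19', '2020-01-16'])})

# Sort both key columns first; direction='forward' then picks the first
# reference date that is on or after each value in 'dates'.
merged = pd.merge_asof(
    my_dates.sort_values('dates'),
    reference.sort_values('reference'),
    left_on='dates', right_on='reference', direction='forward',
)
print(merged)
```

The result comes back in sorted order of dates, so keep an extra column with the original position if the input order matters.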

Transpose multiple rows of data in panda df [duplicate]

I have the below table which shows rainfall by month in the UK across a number of years. I want to transpose it so that each row is one month/year and the data is chronological.
Year JAN FEB MAR APR
2010 79.7 74.8 79.4 48
2011 102.8 114.5 49.7 36.7
2012 110.9 60 37 128
2013 110.5 59.8 64.6 63.6
I would like it so the table looks like the below with year, month & rainfall as the columns:
2010 JAN 79.7
2010 FEB 74.8
2010 MAR 79.4
2010 APR 48
2011 JAN 102.8
2011 FEB 114.5
I think I need to use a for loop and iterate through each row to create a new dataframe but I'm not sure of the syntax. I've tried the below loop which nearly does what I want but doesn't output as a dataframe.
for index, row in weather.iterrows():
    print(row["Year"], row)
2014.0 Year 2014.0
JAN 188.0
FEB 169.2
MAR 80.0
APR 67.8
MAY 99.6
JUN 54.8
JUL 64.7
Any help would be appreciated.
You should avoid using for-loops and instead use stack.
df.set_index('Year') \
.stack() \
.reset_index() \
.rename(columns={'level_1': 'Month', 0: 'Amount'})
Year Month Amount
0 2010 JAN 79.7
1 2010 FEB 74.8
2 2010 MAR 79.4
3 2010 APR 48.0
4 2011 JAN 102.8
5 2011 FEB 114.5
6 2011 MAR 49.7
7 2011 APR 36.7
8 2012 JAN 110.9
9 2012 FEB 60.0
etc...
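DataFrame.melt is another common way to go from wide to long. A minimal sketch on the four years shown in the question follows; the column names Month and Amount mirror the stack answer, and the ordered categorical is my own addition to get chronological order, since melt emits all JAN rows first, then all FEB rows, and so on.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2010, 2011, 2012, 2013],
    'JAN': [79.7, 102.8, 110.9, 110.5],
    'FEB': [74.8, 114.5, 60.0, 59.8],
    'MAR': [79.4, 49.7, 37.0, 64.6],
    'APR': [48.0, 36.7, 128.0, 63.6],
})

months = list(df.columns[1:])  # month columns in calendar order
long_df = df.melt(id_vars='Year', var_name='Month', value_name='Amount')
# An ordered categorical makes sort_values follow calendar order
# instead of alphabetical order.
long_df['Month'] = pd.Categorical(long_df['Month'], categories=months, ordered=True)
long_df = long_df.sort_values(['Year', 'Month']).reset_index(drop=True)
print(long_df.head(6))
```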

Locate, extract and re-append year from column in pandas DataFrame

I've created a pandas dataframe using the 'read html' method from an external source. There's no problem creating the dataframe, however, I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, therefore, the solution requires a dynamic approach. So far I've only been able to achieve the desired outcome using .iloc to manually update each row, which works fine until the data is updated at source next month.
This is what my dataframe looks like:
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
Right now the best idea I've come up with is to select the year, extract it and append it to each row in the 'Month' column until the month 'December' is reached, and then switch to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented?) or is there a better way?
Many thanks from a long time reader and first time poster of stackoverflow!
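Use ffill based on Value: the rows where Value is NaN are the ones holding the year, so keep Month only on those rows, forward-fill it down, and append it to each month name. Then drop the year-only rows: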
df.Month=df.Month+' '+df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
Month Value
1 November 2017 1.29
2 December 2017 1.29
4 January 2018 1.29
5 February 2018 1.29
6 March 2018 1.29
7 April 2018 1.29
8 May 2018 1.29
9 June 2018 1.28
10 July 2018 1.28
11 August 2018 1.28
12 September 2018 1.28
13 October 2018 1.26
14 November 2018 1.16
15 December 2018 1.09
17 January 2019 1.25
18 February 2019 1.34
19 March 2019 1.34
20 April 2019 1.34
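The same idea as a self-contained sketch, on the first six rows of the question's table. It assumes the Month column holds strings, including the year rows, which is what I would expect from read_html on a mixed column but have not verified against the actual source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Month': ['2017', 'November', 'December', '2018', 'January', 'February'],
    'Value': [np.nan, 1.29, 1.29, np.nan, 1.29, 1.29],
})

# Keep the Month text only on the year-header rows (Value is NaN there),
# then forward-fill so every row carries the year it belongs to.
year = df['Month'].where(df['Value'].isna()).ffill()
df['Month'] = df['Month'] + ' ' + year
# The header rows still have NaN in Value, so dropping on Value removes them.
df = df.dropna(subset=['Value'])
print(df)
```

Since the year is taken from the header rows rather than inferred from "December", this keeps working when a new month or year is added at the source.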
