How to scrape a table into a pandas DataFrame using Selenium in Python?

I want to copy the table (id=symbolMarket) at this link and save it as a pandas DataFrame: https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
What is the simplest/cleanest way to do this?
Obviously I can retrieve the elements one by one, but I believe there is a better way.
(I am using Selenium to access the page, if that helps.)
Many thanks for sharing your knowledge.

I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions showing how to read an HTML table into a pandas DataFrame, so it is worth searching first.
But just use .read_html(). It returns a list of DataFrames, so you'll only have to figure out which DataFrame in that list you want:
import pandas as pd

url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
table = tables[3]
print(table)
Output:
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]
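One follow-up worth noting: in the output above, row 0 of tables[3] holds the column names as data. A small sketch (using a toy frame in place of the scraped one) of promoting that first row to the header:

```python
import pandas as pd

# toy frame standing in for tables[3]: the real header row landed in row 0
table = pd.DataFrame([['Date', 'Open', 'Change (%)'],
                      ['Mar 20, 2019 21:00', '25737', '+0.97%']])

table.columns = table.iloc[0]                  # promote the first row to the header
table = table.iloc[1:].reset_index(drop=True)  # drop the header row from the data
print(table)
```

After this, columns can be addressed by name, e.g. table['Open'].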

Related

How to remove unwanted data from a data column using pandas DataFrame

The scraped Date column contains the date twice, comma-separated, along with the day name. My goal is to remove the December 13, 2021Mon, portion, create a separate/new column for the day names, and also drop the last column (the Volume column).
Script
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----
I added the necessary steps to your code:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column (everything before the first comma)
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format ('Dec 13, 2021' is the last 12 characters)
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
I would use a regex here to split. Then you can combine the pieces and parse them any way you like afterwards:
import requests
import pandas as pd

isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)

df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
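For what it's worth, both fields can also be pulled in one pass with str.extract; a sketch on a standalone Series (the regex assumes the doubled "long date + short date" format shown in the current output):

```python
import pandas as pd

# standalone Series mimicking the scraped Date column
s = pd.Series(['Monday, December 13, 2021Mon, Dec 13, 2021'])

# first group: day name before the first comma; second group: trailing 'Mon DD, YYYY'
out = s.str.extract(r'^(\w+), .*?(\w{3} \d{2}, \d{4})$')
out.columns = ['Day', 'Date']
print(out)
```

This yields the Day and short Date columns directly, without relying on fixed character positions.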

Python AttributeError: 'NoneType' object has no attribute 'find_all' with BeautifulSoup

I'm trying to create a list that I can parse through to get some data, but I'm running into this error; AttributeError: 'NoneType' object has no attribute 'find_all'. I've began my code with this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/michigan/2021-schedule.html")
soup = BeautifulSoup(page.text, features="html.parser")
table = soup.find("table", attrs={"id": "schedule"})
table_rows = table.find_all('tr')
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
test_df = pd.DataFrame(l)
This code works, but now I'm trying to extend it to multiple schools. This is my current attempt at doing this:
query_set = ["Duke, Michigan"]
for query in query_set:
    page = requests.get("https://www.sports-reference.com/cbb/schools/" + str(query) + "/2021-schedule.html")
    soup = BeautifulSoup(page.text, "html.parser")
    table = soup.find("table", attrs={"id": "schedule"})
    table_rows = table.find_all('tr')
    l = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        l.append(row)
    schedule_df = pd.DataFrame(l)
However, now I'm getting that AttributeError and I can't figure out why. Does anyone have any advice on how to fix this? Thanks.
There are 2 problems here:
Yes, the URL is case sensitive as stated, so you need the school name in all lower case. Simply use the .lower() method.
But even after lowercasing, your list is a single string element "Duke, Michigan", when you want two elements: "Duke", "Michigan".
Just want to point out that, as an alternative, you could use pandas' .read_html() to read in the table, since that's what you are converting it to anyway (and pandas uses BeautifulSoup under the hood). But bs4 is fine here too if you leave it as is.
Code:
import requests
import pandas as pd

query_set = ["Duke", "Michigan"]
for query in query_set:
    page = requests.get("https://www.sports-reference.com/cbb/schools/" + query.lower() + "/2021-schedule.html")
    schedule_df = pd.read_html(page.text, attrs={"id": "schedule"})[0]
Output:
G Date ... Streak Arena
0 1 Sat, Nov 28, 2020 ... W 1 Cameron Indoor Stadium
1 2 Tue, Dec 1, 2020 ... L 1 Cameron Indoor Stadium
2 3 Fri, Dec 4, 2020 ... W 1 Cameron Indoor Stadium
3 4 Tue, Dec 8, 2020 ... L 1 Cameron Indoor Stadium
4 5 Wed, Dec 16, 2020 ... W 1 Purcell Pavilion at the Joyce Center
5 6 Wed, Jan 6, 2021 ... W 2 Cameron Indoor Stadium
6 7 Sat, Jan 9, 2021 ... W 3 Cameron Indoor Stadium
7 8 Tue, Jan 12, 2021 ... L 1 Cassell Coliseum
8 9 Tue, Jan 19, 2021 ... L 2 Petersen Events Center
9 10 Sat, Jan 23, 2021 ... L 3 KFC Yum! Center
10 11 Tue, Jan 26, 2021 ... W 1 Cameron Indoor Stadium
11 12 Sat, Jan 30, 2021 ... W 2 Cameron Indoor Stadium
12 13 Mon, Feb 1, 2021 ... L 1 BankUnited Center
13 14 Sat, Feb 6, 2021 ... L 2 Cameron Indoor Stadium
14 15 Tue, Feb 9, 2021 ... L 3 Cameron Indoor Stadium
15 16 Sat, Feb 13, 2021 ... NaN NaN
16 17 Wed, Feb 17, 2021 ... NaN NaN
17 18 Sat, Feb 20, 2021 ... NaN NaN
18 19 Mon, Feb 22, 2021 ... NaN NaN
19 20 Sat, Feb 27, 2021 ... NaN NaN
20 21 Tue, Mar 2, 2021 ... NaN NaN
21 22 Sat, Mar 6, 2021 ... NaN NaN
[22 rows x 15 columns]
G Date Time Type ... W L Streak Arena
0 1 Wed, Nov 25, 2020 4:00p REG ... 1.0 0.0 W 1 Crisler Arena
1 2 Sun, Nov 29, 2020 6:00p REG ... 2.0 0.0 W 2 Crisler Arena
2 3 Wed, Dec 2, 2020 7:00p REG ... 3.0 0.0 W 3 Crisler Arena
3 4 Sun, Dec 6, 2020 4:00p REG ... 4.0 0.0 W 4 Crisler Arena
4 5 Wed, Dec 9, 2020 6:00p REG ... 5.0 0.0 W 5 Crisler Arena
5 6 Sun, Dec 13, 2020 2:00p REG ... 6.0 0.0 W 6 Crisler Arena
6 7 Fri, Dec 25, 2020 6:00p REG ... 7.0 0.0 W 7 Pinnacle Bank Arena
7 8 Thu, Dec 31, 2020 8:00p REG ... 8.0 0.0 W 8 Xfinity Center
8 9 Sun, Jan 3, 2021 7:30p REG ... 9.0 0.0 W 9 Crisler Arena
9 10 Wed, Jan 6, 2021 8:30p REG ... 10.0 0.0 W 10 Crisler Arena
10 11 Tue, Jan 12, 2021 7:00p REG ... 11.0 0.0 W 11 Crisler Arena
11 12 Sat, Jan 16, 2021 2:00p REG ... 11.0 1.0 L 1 Williams Arena
12 13 Tue, Jan 19, 2021 7:00p REG ... 12.0 1.0 W 1 Crisler Arena
13 14 Fri, Jan 22, 2021 7:00p REG ... 13.0 1.0 W 2 Mackey Arena
14 15 Sun, Feb 14, 2021 1:00p REG ... NaN NaN NaN NaN
15 16 Thu, Feb 18, 2021 NaN REG ... NaN NaN NaN NaN
16 17 Sun, Feb 21, 2021 NaN REG ... NaN NaN NaN NaN
17 18 Sat, Feb 27, 2021 NaN REG ... NaN NaN NaN NaN
18 19 Thu, Mar 4, 2021 NaN REG ... NaN NaN NaN NaN
19 20 Sun, Mar 7, 2021 NaN REG ... NaN NaN NaN NaN
[20 rows x 15 columns]
My guess is that what's going on is the URL is case sensitive.
https://www.sports-reference.com/cbb/schools/michigan/2021-schedule.html
returns data
https://www.sports-reference.com/cbb/schools/Michigan/2021-schedule.html
results in a 404.
Try changing your query set to query_set = ["duke", "michigan"].
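Whichever fix you go with, it's also worth guarding against soup.find returning None, which is exactly what a 404 page produces; a small sketch of that pattern (the function name is just for illustration):

```python
from bs4 import BeautifulSoup

def schedule_rows(html):
    """Return the <tr> rows of the #schedule table, or [] if the table is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": "schedule"})
    if table is None:  # e.g. a 404 page has no schedule table
        return []
    return table.find_all("tr")

# a page without the table no longer raises AttributeError
print(len(schedule_rows("<p>Page Not Found (404)</p>")))  # 0
```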

How to fill in blank cells in df based on string in sequential row, Pandas

I have a df, where the data looks like this:
Time Value
60.8
Jul 2019 58.1
58.8
56.9
Oct 2019 51.8
54.6
56.8
Jan 2020 58.8
54.2
51.3
Apr 2020 52.2
I want to fill in the blank cells in the Time variable according to the calendar year. So:
Time Value
Jun 2019 60.8
Jul 2019 58.1
Aug 2019 58.8
Sep 2019 56.9
Oct 2019 51.8
Nov 2019 54.6
Dec 2019 56.8
Jan 2020 58.8
Feb 2020 54.2
Mar 2020 51.3
Apr 2020 52.2
I saw a post where pandas could be used to fill in numeric values, but since my variable isn't necessarily defined in a numeric way, I'm not entirely sure how to apply it in this situation.
There seem to me to be two ways of approaching this: 1) modifying the list before writing it to the df, or 2) modifying the df itself.
I prefer the first solution, but I'm not sure if it is possible.
Thanks.
My script:
import pandas as pd

totalmonth = ['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
totalvalue = ['60.8', '58.1', '58.8', '56.9', '51.8', '54.6', '56.8', '58.8', '54.2', '51.3', '52.2', '48.7']
df = pd.DataFrame({'Time': totalmonth,
                   'Value': totalvalue})
OK, this took me longer than I would like to admit. I solved it with your first approach (modifying the list before writing it to the df).
Output:
***********************BEFORE********************************
['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
Time Value
0 60.8
1 Jul 2019 58.1
2 58.8
3 56.9
4 Oct 2019 51.8
5 54.6
6 56.8
7 Jan 2020 58.8
8 54.2
9 51.3
10 Apr 2020 52.2
11 48.7
***********************AFTER********************************
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7
Code:
from datetime import datetime
from dateutil.relativedelta import relativedelta
totalmonth=['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
print(new_totalmonth)
Breakdown
This line of code creates a list of all the valid dates and puts them in a format that I can run the min() function on.
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
What this prints out
print(new_totalmonth)
[datetime.datetime(2019, 7, 1, 0, 0), datetime.datetime(2019, 10, 1, 0, 0), datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 4, 1, 0, 0)]
This is creating the variable index and assigning it the index of the minimum date in totalmonth
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
min(new_totalmonth) # this is finding the minimum date in new_totalmonth
print(min(new_totalmonth))
2019-07-01 00:00:00
min(new_totalmonth).strftime('%b %Y') # This is putting that minimum in a format that matches what is in totalmonth so the function totalmonth.index() can get the correct index
print(min(new_totalmonth).strftime('%b %Y'))
Jul 2019
This is using list comprehension.
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
I am using the index of the minimum date in totalmonth to manipulate the range of values (how many months) I am going to add to the minimum month in totalmonth
range(-index,len(totalmonth) - index)
print(list(range(-index,len(totalmonth) - index)))
[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Since the minimum month (Jul 2019) is at index 1 I need to add -1 months to it to get the month that comes before it which is Jun 2019
So it can be broken out to:
(min(new_totalmonth) + relativedelta(months=-1)).strftime('%b %Y') = Jun 2019
(min(new_totalmonth) + relativedelta(months=0)).strftime('%b %Y') = Jul 2019
(min(new_totalmonth) + relativedelta(months=1)).strftime('%b %Y') = Aug 2019
...
(min(new_totalmonth) + relativedelta(months=10)).strftime('%b %Y') = May 2020
Take all those values and put them in the list new_totalmonth
print(new_totalmonth)
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
The start month is the minimum in the 'Time' column minus one month, the end month is the maximum plus two months, and the column is regenerated with date_range() to get successive values:
import datetime
import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])
# note: this month arithmetic assumes the minimum is not in January and the maximum is not in November or December
startM = datetime.datetime((df['Time'].min()).year, (df['Time'].min()).month - 1, 1)
endM = datetime.datetime((df['Time'].max()).year, (df['Time'].max()).month + 2, 1)
df['Time'] = pd.date_range(startM, endM, freq='1M')
df
Time Value
0 2019-06-30 60.8
1 2019-07-31 58.1
2 2019-08-31 58.8
3 2019-09-30 56.9
4 2019-10-31 51.8
5 2019-11-30 54.6
6 2019-12-31 56.8
7 2020-01-31 58.8
8 2020-02-29 54.2
9 2020-03-31 51.3
10 2020-04-30 52.2
11 2020-05-31 48.7
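Note that date_range with freq='1M' yields month-end timestamps rather than 'Jun 2019'-style labels. If the string format matters, a short sketch of the conversion on a standalone series:

```python
import pandas as pd

# month-end timestamps like those produced by the date_range approach above
s = pd.Series(pd.to_datetime(['2019-06-30', '2019-07-31', '2019-08-31']))

# format the timestamps back into 'Mon YYYY' labels
labels = s.dt.strftime('%b %Y')
print(list(labels))  # ['Jun 2019', 'Jul 2019', 'Aug 2019']
```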
First use pd.to_datetime to convert the Time column to a pandas datetime series t. Next use pd.period_range to generate a period range with a monthly frequency, whose starting period is calculated from the earliest parsed date and whose number of periods equals the length of t. Finally use .strftime with the format specifier %b %Y to return the string representation of the period range in the desired format:
t = pd.to_datetime(df['Time'])
df['Time'] = pd.period_range(
t.min().to_period('M') - t.idxmin(), periods=len(t), freq='M').strftime('%b %Y')
Details:
# print(t)
0 NaT
1 2019-07-01
2 NaT
3 NaT
4 2019-10-01
5 NaT
6 NaT
7 2020-01-01
8 NaT
9 NaT
10 2020-04-01
11 NaT
Name: Time, dtype: datetime64[ns]
# print(t.min(), t.idxmin())
Timestamp('2019-07-01 00:00:00'), 1
# print(t.min().to_period('M') - t.idxmin())
Period('2019-06', 'M') # starting period of the period range
Result:
# print(df)
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7

create another dataframe datetime column based on the value of the datetime in another dataframe column

I have a dataframe which has a datetime column; let's call it my_dates.
I also have a list of dates which has say 5 dates for this example.
15th Jan 2020
20th Mar 2020
28th Jun 2020
20th Jul 2020
8th Aug 2020
What I want to do is create another column in my dataframe that looks at the datetime in the my_dates column and, where it is less than a date in my date list, takes that list value.
For example, say a row's my_dates value is 23rd June 2020; I want the new column to hold 28th June 2020 for that row. Hopefully the examples below are clear.
More examples
my_dates expected_values
14th Jan 2020 15th Jan 2020
15th Jan 2020 15th Jan 2020
16th Jan 2020 20th Mar 2020
... ...
19th Mar 2020 20th Mar 2020
20th Mar 2020 20th Mar 2020
21st Mar 2020 28th Jun 2020
What is the most efficient way to do this rather than looping?
IIUC, you need pd.merge_asof with the argument direction set to forward
import pandas as pd

dates = ['15th Jan 2020',
         '20th Mar 2020',
         '28th Jun 2020',
         '20th Jul 2020',
         '8th Aug 2020']
dates_proper = [pd.to_datetime(d) for d in dates]
df = pd.DataFrame(pd.date_range('14-01-2020', '21-03-2020'), columns=['my_dates'])
df1 = pd.DataFrame(dates_proper, columns=['date_list'])
merged_df = pd.merge_asof(
    df, df1, left_on=["my_dates"], right_on=["date_list"], direction="forward"
)
print(merged_df)
my_dates date_list
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-01-17 2020-03-20
4 2020-01-18 2020-03-20
.. ... ...
63 2020-03-17 2020-03-20
64 2020-03-18 2020-03-20
65 2020-03-19 2020-03-20
66 2020-03-20 2020-03-20
67 2020-03-21 2020-06-28
Finally, a use case for pd.merge_asof! :) From the documentation:
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
It would have been helpful to make your example reproducible like this:
In [12]: reference = pd.DataFrame([['15th Jan 2020'], ['20th Mar 2020'], ['28th Jun 2020'], ['20th Jul 2020'], ['8th Aug 2020']], columns=['reference']).astype('datetime64')
In [13]: my_dates = pd.DataFrame([['14th Jan 2020'], ['15th Jan 2020'], ['16th Jan 2020'], ['19th Mar 2020'], ['20th Mar 2020'], ['21st Mar 2020']], columns=['dates']).astype('datetime64')
In [15]: pd.merge_asof(my_dates, reference, left_on='dates', right_on='reference', direction='forward')
Out[15]:
dates reference
0 2020-01-14 2020-01-15
1 2020-01-15 2020-01-15
2 2020-01-16 2020-03-20
3 2020-03-19 2020-03-20
4 2020-03-20 2020-03-20
5 2020-03-21 2020-06-28
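If you ever need the same forward match without a merge, searchsorted exposes the underlying mechanism; a sketch assuming the reference dates are sorted and every date has a later reference:

```python
import pandas as pd

my_dates = pd.to_datetime(['2020-01-14', '2020-01-16', '2020-03-21'])
reference = pd.to_datetime(['2020-01-15', '2020-03-20', '2020-06-28'])  # must be sorted

# position of the first reference >= each date (the 'forward' match);
# a date past the last reference would index out of bounds here
idx = reference.searchsorted(my_dates, side='left')
matched = reference[idx]
print(matched)
```

merge_asof handles the out-of-range case (it emits NaT), so prefer it unless you need the raw positions.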

How to collect and move data from a JSON link into a pandas DataFrame using requests

I have a link as below:
https://iislliveblob.niftyindices.com/jsonfiles/Heatmap/FinalHeatMapNIFTY%20BANK.json?_=1566641233858
I want to collect and move the table data from the link into a pandas DataFrame using requests.
What is simpler than using the pd.read_json routine for such a case?
In [2]: pd.read_json('https://iislliveblob.niftyindices.com/jsonfiles/Heatmap/FinalHeatMapNIFTY%20BANK.json?_=1566641233858')
Out[2]:
Indexmcap_today Indexmcap_yst NewIndexValue ... sharesOutstanding symbol time
0 185787484706 185096824912 27035.620450 ... 1986792096 FEDERALBNK Aug 23, 2019 16:00:00
1 94015024217 92934391755 27036.436029 ... 4782477126 IDFCFIRSTB Aug 23, 2019 16:00:00
2 170122910956 167793447043 27039.047804 ... 427318817 RBLBANK Aug 23, 2019 16:00:00
3 78039629610 75401497477 27039.693345 ... 4604047028 PNB Aug 23, 2019 16:00:00
4 129197826454 125259710888 27042.412097 ... 3846727356 BANKBARODA Aug 23, 2019 16:00:00
5 127258599093 120922516944 27047.427144 ... 2316958738 YESBANK Aug 23, 2019 16:00:00
6 1571596601509 1565086749643 27047.790562 ... 2619107432 AXISBANK Aug 23, 2019 16:00:00
7 1205529358622 1194190000951 27057.890868 ... 8924611534 SBIN Aug 23, 2019 16:00:00
8 2443146242739 2466008258667 26986.362979 ... 6451364340 ICICIBANK Aug 23, 2019 16:00:00
9 4183387487538 4205438912774 26988.058228 ... 2732812271 HDFCBANK Aug 23, 2019 16:00:00
10 820545888847 836293106783 27001.242688 ... 692756723 INDUSINDBK Aug 23, 2019 16:00:00
11 1881780775065 1892084363175 27012.627358 ... 1909120492 KOTAKBANK Aug 23, 2019 16:00:00
[12 rows x 24 columns]
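If you do need requests (for custom headers, a session, or retries), you can fetch first and build the frame from the parsed JSON; a minimal sketch with a toy payload standing in for the endpoint's response:

```python
import pandas as pd

# for the live endpoint you would do:
# import requests
# payload = requests.get('https://iislliveblob.niftyindices.com/jsonfiles/Heatmap/'
#                        'FinalHeatMapNIFTY%20BANK.json?_=1566641233858').json()

# toy payload standing in for the JSON the endpoint returns (a list of records)
payload = [
    {"symbol": "FEDERALBNK", "NewIndexValue": 27035.62},
    {"symbol": "AXISBANK", "NewIndexValue": 27047.79},
]
df = pd.DataFrame(payload)
print(df)
```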
