I'm getting date two times using comma separation along with day in date column from the scraped data. My goal is to remove this December 13, 2021Mon, portion and want to create a separate/new column for days and I also wanted to remove the last one column meaning the Volumn column.
Script
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----
I added the necessary steps to your code:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
I would use regex here to split. Then you can combine them and parse anyway you like afterwards:
import requests
import pandas as pd
isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})',expand=True)[[0,1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
I'm working on my python skills and I'm trying to scrape only the "Results" table from this page https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results . I'm new to web scraping, could anyone help me with an elegant solution for scraping the Results wikitable? Thanks!
The easiest way is to use Pandas to load the tables:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
# print second table (index 1):
print(tables[1])
Prints:
Date Venue Home team Away team Score Competition Winner Match report
0 7 March 2020 Twickenham Stadium England Wales 33–30 2020 Six Nations England BBC
1 22 February 2020 Principality Stadium Wales France 23–27 2020 Six Nations France BBC
2 8 February 2020 Aviva Stadium Ireland Wales 24–14 2020 Six Nations Ireland BBC
3 1 February 2020 Principality Stadium Wales Italy 42–0 2020 Six Nations Wales BBC
4 30 November 2019 Principality Stadium Wales Barbarians 43–33 Tour Match Wales BBC
.. ... ... ... ... ... ... ... ...
741 5 January 1884 Cardigan Fields England Wales 1G 2T–1G 1884 Home Nations Championship England NaN
742 8 January 1883 Raeburn Place Scotland Wales 3G–1G 1883 Home Nations Championship Scotland NaN
743 16 December 1882 St Helen's Wales England 0–2G 4T 1883 Home Nations Championship England NaN
744 28 January 1882 Lansdowne Road Ireland Wales 0–2G 2T NaN Wales NaN
745 19 February 1881 Richardson's Field England Wales 7G 6T 1D–0 NaN England NaN
[746 rows x 8 columns]
I am trying to web scrape, by using Python 3, a chart off of this website into a .csv file: 2013-14 NBA National TV Schedule
The chart starts out like:
Game/Time Network Matchup
Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
I imported the data by:
pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
The output sample is:
0 1 2
0 Game/Time Network Matchup
1 Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
2 Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
The output I want in a .csv file looks like this:
I am unsure how I can split the game/time up into separate columns. Notice how the date is formatted like 10/29/13. I also am unsure how to split matchup into away (first team) and home (second team) into separate columns. I know pd.to_datetime and str.split() should be used. How do I implement the scraper to get this output?
Here's my take:
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set the correct column names
df = df.T.set_index([0]).T
# separate date and time
datetime = df['Game/Time'].str.extract('(?P<Date>.*), (?P<Time>.*) ET$')
# extract Home and Away
home_away = df['Matchup'].str.extract('^(?P<Away>.*) vs\. (?P<Home>.*)$')
# join the data
final_df = pd.concat([datetime, home_away, df[['Network']]], axis=1)
Output:
Date Time Away Home Network
1 Oct. 29 8 p.m. Chicago Miami TNT
2 Oct. 29 10:30 p.m. LA Clippers LA Lakers TNT
3 Oct. 31 8 p.m. New York Chicago TNT
4 Oct. 31 10:30 p.m. Golden State LA Clippers TNT
5 Nov. 1 8 p.m. Miami Brooklyn ESPN
.. ... ... ... ... ...
141 Apr. 13 1 p.m. Chicago New York ABC
142 Apr. 15 8 p.m. New York Brooklyn TNT
143 Apr. 15 10:30 p.m. Denver LA Clippers TNT
144 Apr. 16 8 p.m. Atlanta Milwaukee ESPN
145 Apr. 16 10:30 p.m. Golden State Denver ESPN
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
This line should help you format the date in the exact way you want
import pandas as pd
import numpy as np
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule",header=0)[0]
df['Date']=df['Game/Time'].str.extract(r'(.*),',expand=True)
df['Time']=df['Game/Time'].str.extract(r',(.*) ET',expand=True)
df['Time']=df['Time'].str.replace('p.m.','PM')
df['Date'] = np.where(df.Date.str.startswith(('10/', 11/', '12/')), df.Date + ' 13', df.Date + ' 14')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
df['Home'] = df['Matchup'].str.extract('(.*)vs')
df['Away'] = df['Matchup'].str.extract('vs.(.*)')
df = df.drop(columns=['Game/Time','Matchup'])
print(df)
Network Date Time Home Away
0 TNT 10/29/2013 8 PM Chicago Miami
1 TNT 10/29/2013 10:30 PM LA Clippers LA Lakers
2 TNT 10/31/2013 8 PM New York Chicago
3 TNT 10/31/2013 10:30 PM Golden State LA Clippers
4 ESPN 11/01/2013 8 PM Miami Brooklyn
I hope this is what you were looking for.
You can use regex to split out your columns, your time has different format so we can handle those by using specific formats and forcing the errors into NaT values.
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set column
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
#set date and time column.
df['date'] = pd.to_datetime((df['Game/Time'].str.split(',',expand=True)[0] + ' 2019')
,format='%b. %d %Y')
df['time'] = df['Game/Time'].str.split(',',expand=True)[1]
#time column has different formats, lets handle those.
s = pd.to_datetime(df['time'].str.strip('ET').str.replace('\.','').str.strip(),
format='%H %p',errors='coerce')
s = s.fillna(pd.to_datetime(df['time'].str.strip('ET').str.replace('\.','').str.strip(),
format='%H:%M %p',errors='coerce'))
df['time'] = s.dt.time
#home and away columns.
df['home'] = df['Matchup'].str.extract('(.*)vs(.*)')[0].str.strip()
df['away'] = df['Matchup'].str.extract('(.*)vs(.*)')[1].str.strip('.')
# slice dataframe.
df2 = df[['date','time','home','away','Network']]
print(df2)
0 date time home away Network
0 2019-10-29 08:00:00 Chicago Miami TNT
1 2019-10-29 10:30:00 LA Clippers LA Lakers TNT
2 2019-10-31 08:00:00 New York Chicago TNT
3 2019-10-31 10:30:00 Golden State LA Clippers TNT
4 2019-11-01 08:00:00 Miami Brooklyn ESPN
.. ... ... ... ... ...
140 2019-04-13 01:00:00 Chicago New York ABC
141 2019-04-15 08:00:00 New York Brooklyn TNT
142 2019-04-15 10:30:00 Denver LA Clippers TNT
143 2019-04-16 08:00:00 Atlanta Milwaukee ESPN
144 2019-04-16 10:30:00 Golden State Denver ESPN
I need to read from a text file, then print the information separately.
for example:
i'm given a list of names in this format: Orville Wright 21 July 1988
And i need to make the outcome as so:
Name
1. Orville Wright
Date
1. 21 July 1988
I've tried using a reader to separate but I would have to have a separate code line for every name and date given as they are not the same length.
with open('File name and location', 'r') as reader:
print(reader.readline(14))
``````````````````````````````````````````````````
this is the outcome : Orville Wright
```````````````````````````````````````````````````
I want my results to be:
Name:
1. Orville Wright
2. Rogelio Holloway
etc
Date:
1. 21 July 1988
2. 13 September 1988
etc
````````````````````````````````````````````````````
The contents of the file are as follows:
Orville Wright 21 July 1988
Rogelio Holloway 13 September 1988
Marjorie Figueroa 9 October 1988
Debra Garner 7 February 1988
Tiffany Peters 25 July 1988
Hugh Foster 2 June 1988
Darren Christensen 21 January 1988
Shelia Harrison 28 July 1988
Ignacio James 12 September 1988
Jerry Keller 30 February 1988
Frankie Cobb 1 July 1988
Clayton Thomas 10 December 1988
Laura Reyes 9 November 1988
Danny Jensen 19 September 1988
Sabrina Garcia 20 October 1988
Winifred Wood 27 July 1988
Juan Kennedy 4 March 1988
Nina Beck 7 May 1988
Tanya Marshall 22 May 1988
Kelly Gardner 16 August 1988
Cristina Ortega 13 January 1988
Guy Carr 21 June 1988
Geneva Martinez 5 September 1988
Ricardo Howell 23 December 1988
Bernadette Rios 19 July 1988
This is one approach using regex.
Ex:
import re
names = []
dates = []
with open(filename) as infile:
for line in infile:
line = line.strip()
date = re.search("(\d{1,2} [a-zA-Z]+ \d{4})", line).group(1) #Extract Date.
dates.append(date)
names.append(line.replace(date, "").strip()) #Get Name.
print("Name:")
for name in names:
print(name)
print("---"*10)
print("Date:")
for date in dates:
print(date)
Output:
Name:
Orville Wright
Rogelio Holloway
Marjorie Figueroa
Debra Garner
Tiffany Peters
Hugh Foster
Darren Christensen
Shelia Harrison
Ignacio James
Jerry Keller
Frankie Cobb
Clayton Thomas
Laura Reyes
Danny Jensen
Sabrina Garcia
Winifred Wood
Juan Kennedy
Nina Beck
Tanya Marshall
Kelly Gardner
Cristina Ortega
Guy Carr
Geneva Martinez
Ricardo Howell
Bernadette Rios
------------------------------
Date:
21 July 1988
13 September 1988
9 October 1988
7 February 1988
25 July 1988
2 June 1988
21 January 1988
28 July 1988
12 September 1988
30 February 1988
1 July 1988
10 December 1988
9 November 1988
19 September 1988
20 October 1988
27 July 1988
4 March 1988
7 May 1988
22 May 1988
16 August 1988
13 January 1988
21 June 1988
5 September 1988
23 December 1988
19 July 1988
Store all the names and dates inside different lists, then display each one.
The following code assumes that each name & date are separated by a newline, and that the first digit in that line is the start of the date.
import re
names = []
dates = []
with open('File name and location', 'r') as reader:
for line in reader.readlines():
date_position = re.search("\d", line).start()
names.append(line[:date_position - 1])
dates.append(line[date_position:])
Now you can print each name and date to your liking:
for i, name in enumerate(names):
print(f"{i+1}. {name}")
And for dates:
for i, date in enumerate(dates):
print(f"{i+1}. {name}")
Output (for part of the text file):
1. Orville Wright
2. Rogelio Holloway
3. Marjorie Figueroa
4. Debra Garner
5. Tiffany Peters
6. Hugh Foster
7. Darren Christensen
8. Shelia Harrison
9. Ignacio James
10. Jerry Keller
11. Frankie Cobb
12. Clayton Thomas
13. Laura Reyes
14. Danny Jensen
15. Sabrina Garcia
16. Winifred Wood
17. Juan Kennedy
1. 21 July 1988
2. 13 September 1988
3. 9 October 1988
4. 7 February 1988
5. 25 July 1988
6. 2 June 1988
7. 21 January 1988
8. 28 July 1988
9. 12 September 1988
10. 30 February 1988
11. 1 July 1988
12. 10 December 1988
13. 9 November 1988
14. 19 September 1988
15. 20 October 1988
16. 27 July 1988
17. 4 March 1988
I have a first dataframe containing a serie called 'Date', and a variable number n of series called 'People_1' to 'People_n' :
Id Date People_1 People_2 People_3 People_4 People_5 People_6 People_7
12.0 Sat Dec 19 00:00:00 EST 1970 Loretta Lynn Owen Bradley
13.0 Sat Jun 07 00:00:00 EDT 1980 Sissy Spacek Loretta Lynn Owen Bradley
14.0 Sat Dec 04 00:00:00 EST 2010 Loretta Lynn Sheryl Crow Miranda Lambert
15.0 Sat Aug 09 00:00:00 EDT 1969 Charley Pride Dallas Frazier A.L. "Doodle" Chet Atkins Jack Clement Bob Ferguson Felton Jarvis
I also have another dataframe containing a list of names and biographic datas :
People Birth_date Birth_state Sex Ethnicity
Charles Kelley Fri Sep 11 00:00:00 EDT 1981 GA Male Caucasian
Hillary Scott Tue Apr 01 00:00:00 EST 1986 TN Female Caucasian
Reba McEntire Mon Mar 28 00:00:00 EST 1955 OK Female Caucasian
Wanda Jackson Wed Oct 20 00:00:00 EST 1937 OK Female Caucasian
Carrie UnderwoodThu Mar 10 00:00:00 EST 1983 OK Female Caucasian
Toby Keith Sat Jul 08 00:00:00 EDT 1961 OK Male Caucasian
David Bellamy Sat Sep 16 00:00:00 EDT 1950 FL Male Caucasian
Howard Bellamy Sat Feb 02 00:00:00 EST 1946 FL Male Caucasian
Keith Urban Thu Oct 26 00:00:00 EDT 1967 Northland Male Caucasian
Miranda Lambert Thu Nov 10 00:00:00 EST 1983 TX Female Caucasian
Sam Hunt Sat Dec 08 00:00:00 EST 1984 GA Male Caucasian
Johnny Cash Fri Feb 26 00:00:00 EST 1932 AR Male Caucasian
June Carter Sun Jun 23 00:00:00 EDT 1929 VA Female Caucasian
Merle Haggard Tue Apr 06 00:00:00 EST 1937 CA Male Caucasian
Waylon Jennings Tue Jun 15 00:00:00 EDT 1937 TX Male Caucasian
Willie Nelson Sat Apr 29 00:00:00 EST 1933 TX Male Caucasian
Loretta Lynn Thu Apr 14 00:00:00 EST 1932 KY Female Caucasian
Sissy Spacek Sun Dec 25 00:00:00 EST 1949 TX Female Caucasian
Sheryl Crow Sun Feb 11 00:00:00 EST 1962 MO Female Caucasian
Charley Pride Sun Mar 18 00:00:00 EST 1934 MS Male African American
Rodney Clawon ? TX Male Caucasian
Nathan Chapman ? TN Male Caucasian
I want to get for each date the biographic datas of each people involved that day :
Date Birth_state Sex Ethnicity
Sat Dec 19 00:00:00 EST 1970 KY Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 TX Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 MO Female Caucasian
Sat Dec 04 00:00:00 EST 2010 TX Female Caucasian
Sat Aug 09 00:00:00 EDT 1969 MS Male African American
Precision :
Consider that my bio datas aren't complete yet, some names are missing, which explains why I don't have a row for each person.
So is there a way to perform this task in Python ?
Lionel
You may to use left join in pandas to join the two tables first and then select the columns you need.
For example, you can first aggregate all person into a new column, for example, named 'person'. Then add one single row for each person. After you finished doing this, then left join two dataframe as you did.