How to web scrape a table by using Python?

I am trying to web scrape, by using Python 3, a table off of this website into a .csv file: 2015 NBA National TV Schedule
The chart starts out like:
Date Teams Network
Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
The output I want in the .csv file covers those same lines of the chart, but with the date and the time in separate columns, so the same date can appear on more than one row. How do I implement the scraper to get this output?

pd.read_html will get most of the way there:
In [73]: pd.read_html("https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952")[0]
Out[73]:
0 1 2
0 Date Teams Network
1 Oct. 27, 8:00 p.m. ET Cleveland # Chicago TNT
2 Oct. 27, 10:30 p.m. ET New Orleans # Golden State TNT
3 Oct. 28, 8:00 p.m. ET San Antonio # Oklahoma City ESPN
4 Oct. 28, 10:30 p.m. ET Minnesota # L.A. Lakers ESPN
.. ... ... ...
139 Apr. 9, 8:30 p.m. ET Cleveland # Chicago ABC
140 Apr. 12, 8:00 p.m. ET Oklahoma City # San Antonio TNT
141 Apr. 12, 10:30 p.m. ET Memphis # L.A. Clippers TNT
142 Apr. 13, 8:00 p.m. ET Orlando # Charlotte ESPN
143 Apr. 13, 10:30 p.m. ET Utah # L.A. Lakers ESPN
You'd just need to parse out the date into columns and separate the teams.

You'll use pandas to grab the table with .read_html(), and then continue with pandas to manipulate the data:
import pandas as pd
import numpy as np
df = pd.read_html('https://deadspin.com/nba-national-tv-espn-tnt-abc-nba-tv-1723767952', header=0)[0]
# Split the Date column at the comma into Date and Time columns
df[['Date','Time']] = df.Date.str.split(',', expand=True)
# Replace the 'p.m. ET' suffix in the Time column
df['Time'] = df['Time'].str.replace('p.m. ET', 'PM')
# The dates carry no year, so they can't be converted to datetime directly.
# One way to handle it here: anything from October to December gets ', 2015',
# everything else gets ', 2016'. With more than one season you would need another approach.
df['Date'] = np.where(df.Date.str.startswith(('Oct.', 'Nov.', 'Dec.')), df.Date + ', 2015', df.Date + ', 2016')
# Change the date format from abbreviated month to full name (e.g. Oct. -> October).
# '%#d' drops the zero padding for the day on Windows; use '%-d' on Linux/macOS,
# or plain '%d' if you want the padding.
df['Date'] = pd.to_datetime(df['Date'].astype(str)).dt.strftime('%B %#d, %Y')
# Split the Teams column into the two teams
df[['Team 1','Team 2']] = df.Teams.str.split('#', expand=True)
# Remove any leading/trailing whitespace
df = df.applymap(lambda x: x.strip() if type(x) is str else x)
# Final dataframe with the desired columns
df = df[['Date','Time','Team 1','Team 2','Network']]
Output:
Date Time Team 1 Team 2 Network
0 October 27, 2015 8:00 PM Cleveland Chicago TNT
1 October 27, 2015 10:30 PM New Orleans Golden State TNT
2 October 28, 2015 8:00 PM San Antonio Oklahoma City ESPN
3 October 28, 2015 10:30 PM Minnesota L.A. Lakers ESPN
4 October 29, 2015 8:00 PM Atlanta New York TNT
5 October 29, 2015 10:30 PM Dallas L.A. Clippers TNT
6 October 30, 2015 7:00 PM Miami Cleveland ESPN
7 October 30, 2015 9:30 PM Golden State Houston ESPN
8 November 4, 2015 8:00 PM New York Cleveland ESPN
9 November 4, 2015 10:30 PM L.A. Clippers Golden State ESPN
10 November 5, 2015 8:00 PM Oklahoma City Chicago TNT
11 November 5, 2015 10:30 PM Memphis Portland TNT
12 November 6, 2015 8:00 PM Miami Indiana ESPN
13 November 6, 2015 10:30 PM Houston Sacramento ESPN
14 November 11, 2015 8:00 PM L.A. Clippers Dallas ESPN
15 November 11, 2015 10:30 PM San Antonio Portland ESPN
16 November 12, 2015 8:00 PM Golden State Minnesota TNT
17 November 12, 2015 10:30 PM L.A. Clippers Phoenix TNT
18 November 18, 2015 8:00 PM New Orleans Oklahoma City ESPN
19 November 18, 2015 10:30 PM Chicago Phoenix ESPN
20 November 19, 2015 8:00 PM Milwaukee Cleveland TNT
21 November 19, 2015 10:30 PM Golden State L.A. Clippers TNT
22 November 20, 2015 8:00 PM San Antonio New Orleans ESPN
23 November 20, 2015 10:30 PM Chicago Golden State ESPN
24 November 24, 2015 8:00 PM Boston Atlanta TNT
25 November 24, 2015 10:30 PM L.A. Lakers Golden State TNT
26 December 3, 2015 7:00 PM Oklahoma City Miami TNT
27 December 3, 2015 9:30 PM San Antonio Memphis TNT
28 December 4, 2015 7:00 PM Brooklyn New York ESPN
29 December 4, 2015 9:30 PM Cleveland New Orleans ESPN
.. ... ... ... ... ...
113 March 10, 2016 10:30 PM Cleveland L.A. Lakers TNT
114 March 12, 2016 8:30 PM Oklahoma City San Antonio ABC
115 March 13, 2016 3:30 PM Cleveland L.A. Clippers ABC
116 March 14, 2016 8:00 PM Memphis Houston ESPN
117 March 14, 2016 10:30 PM New Orleans Golden State ESPN
118 March 16, 2016 7:00 PM Oklahoma City Boston ESPN
119 March 16, 2016 9:30 PM L.A. Clippers Houston ESPN
120 March 19, 2016 8:30 PM Golden State San Antonio ABC
121 March 22, 2016 8:00 PM Houston Oklahoma City TNT
122 March 22, 2016 10:30 PM Memphis L.A. Lakers TNT
123 March 23, 2016 8:00 PM Milwaukee Cleveland ESPN
124 March 23, 2016 10:30 PM Dallas Portland ESPN
125 March 29, 2016 8:00 PM Houston Cleveland TNT
126 March 29, 2016 10:30 PM Washington Golden State TNT
127 March 31, 2016 7:00 PM Chicago Houston TNT
128 March 31, 2016 9:30 PM L.A. Clippers Oklahoma City TNT
129 April 1, 2016 8:00 PM Cleveland Atlanta ESPN
130 April 1, 2016 10:30 PM Boston Golden State ESPN
131 April 3, 2016 3:30 PM Oklahoma City Houston ABC
132 April 5, 2016 8:00 PM Chicago Memphis TNT
133 April 5, 2016 10:30 PM L.A. Lakers L.A. Clippers TNT
134 April 6, 2016 7:00 PM Cleveland Indiana ESPN
135 April 6, 2016 9:30 PM Houston Dallas ESPN
136 April 7, 2016 8:00 PM Chicago Miami TNT
137 April 7, 2016 10:30 PM San Antonio Golden State TNT
138 April 9, 2016 8:30 PM Cleveland Chicago ABC
139 April 12, 2016 8:00 PM Oklahoma City San Antonio TNT
140 April 12, 2016 10:30 PM Memphis L.A. Clippers TNT
141 April 13, 2016 8:00 PM Orlando Charlotte ESPN
142 April 13, 2016 10:30 PM Utah L.A. Lakers ESPN
[143 rows x 5 columns]
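Since the question asks for a .csv file, the last step is one more line; a sketch assuming an arbitrary file name and that you don't want the integer index written out:
# write the cleaned table to disk without the row index
df.to_csv('nba_tv_schedule.csv', index=False)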

Related

How to remove unwanted data from a data column using pandas DataFrame

In the Date column of the scraped data I get the date twice, comma-separated, together with the day name (e.g. Monday, December 13, 2021Mon, Dec 13, 2021). My goal is to remove the December 13, 2021Mon, portion, create a separate/new column for the day, and also remove the last column, i.e. the Volume column.
Script
import requests
import pandas as pd
isins = ['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
Expected Output
Day Date Open High Low Close
Monday Dec 13, 2021 77.77 77.77 77.77 77.77
Friday Dec 10, 2021 77.61 77.61 77.61 77.61
Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
Current output
Date Open High Low Close Volume
Monday, December 13, 2021Mon, Dec 13, 2021 77.77 77.77 77.77 77.77 00.00
Friday, December 10, 2021Fri, Dec 10, 2021 77.61 77.61 77.61 77.61 ----
Thursday, December 09, 2021Thu, Dec 09, 2021 77.60 77.60 77.60 77.60 ----
Wednesday, December 08, 2021Wed, Dec 08, 2021 77.47 77.47 77.47 77.47 ----
Tuesday, December 07, 2021Tue, Dec 07, 2021 77.64 77.64 77.64 77.64 ----
I added the necessary steps to your code:
import requests
import pandas as pd
isins = ['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
# get the Day column (everything before the first comma)
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# reformat Date to the desired format (the last 12 characters, e.g. 'Dec 13, 2021')
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# remove the Volume column
df.pop('Volume')
print(df)
After those three operations, df looks like this:
Day Date Open High Low Close
0 Monday Dec 13, 2021 77.77 77.77 77.77 77.77
1 Friday Dec 10, 2021 77.61 77.61 77.61 77.61
2 Thursday Dec 09, 2021 77.60 77.60 77.60 77.60
3 Wednesday Dec 08, 2021 77.47 77.47 77.47 77.47
4 Tuesday Dec 07, 2021 77.64 77.64 77.64 77.64
5 Monday Dec 06, 2021 77.70 77.70 77.70 77.70
6 Friday Dec 03, 2021 77.72 77.72 77.72 77.72
...
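If you also want Date as a real datetime (for sorting or plotting) rather than a string, one more line on top of the answer above should do it; a sketch, assuming the 'Dec 13, 2021'-style strings all parse with the same format:
# parse the 'Dec 13, 2021'-style strings into datetime64 values
df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')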
I would use a regex here to split the column. Then you can combine the pieces and parse them any way you like afterwards:
import requests
import pandas as pd
isins = ['LU0526609390:EUR','IE00BHBX0Z19:EUR']
dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))
df = pd.concat(dfs)
print(df)
# split on the four-digit year followed by the abbreviated weekday (e.g. '2021Mon'),
# keeping the text before the match and the captured year
df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
df['Date'] = df['Date_alpha'] + df['Date_beta']
df = df.drop(['Date_alpha', 'Date_beta'], axis=1)
Output:
print(df)
Date Open High Low Close Volume
0 Monday, December 13, 2021 77.77 77.77 77.77 77.77 ----
1 Friday, December 10, 2021 77.61 77.61 77.61 77.61 ----
2 Thursday, December 09, 2021 77.60 77.60 77.60 77.60 ----
3 Wednesday, December 08, 2021 77.47 77.47 77.47 77.47 ----
4 Tuesday, December 07, 2021 77.64 77.64 77.64 77.64 ----
5 Monday, December 06, 2021 77.70 77.70 77.70 77.70 ----
6 Friday, December 03, 2021 77.72 77.72 77.72 77.72 ----
7 Thursday, December 02, 2021 77.56 77.56 77.56 77.56 ----
8 Wednesday, December 01, 2021 77.51 77.51 77.51 77.51 ----
9 Tuesday, November 30, 2021 77.52 77.52 77.52 77.52 ----
10 Monday, November 29, 2021 77.37 77.37 77.37 77.37 ----
11 Friday, November 26, 2021 77.44 77.44 77.44 77.44 ----
12 Thursday, November 25, 2021 77.11 77.11 77.11 77.11 ----
13 Wednesday, November 24, 2021 77.10 77.10 77.10 77.10 ----
14 Tuesday, November 23, 2021 77.02 77.02 77.02 77.02 ----
15 Monday, November 22, 2021 77.32 77.32 77.32 77.32 ----
16 Friday, November 19, 2021 77.52 77.52 77.52 77.52 ----
17 Thursday, November 18, 2021 77.38 77.38 77.38 77.38 ----
18 Wednesday, November 17, 2021 77.26 77.26 77.26 77.26 ----
19 Tuesday, November 16, 2021 77.24 77.24 77.24 77.24 ----
20 Monday, November 15, 2021 77.30 77.30 77.30 77.30 ----
0 Monday, December 13, 2021 11.09 11.09 11.09 11.09 ----
1 Friday, December 10, 2021 11.08 11.08 11.08 11.08 ----
2 Thursday, December 09, 2021 11.08 11.08 11.08 11.08 ----
3 Wednesday, December 08, 2021 11.06 11.06 11.06 11.06 ----
4 Tuesday, December 07, 2021 11.08 11.08 11.08 11.08 ----
5 Monday, December 06, 2021 11.09 11.09 11.09 11.09 ----
6 Friday, December 03, 2021 11.08 11.08 11.08 11.08 ----
7 Thursday, December 02, 2021 11.08 11.08 11.08 11.08 ----
8 Wednesday, December 01, 2021 11.05 11.05 11.05 11.05 ----
9 Tuesday, November 30, 2021 11.07 11.07 11.07 11.07 ----
10 Monday, November 29, 2021 11.07 11.07 11.07 11.07 ----
11 Friday, November 26, 2021 11.08 11.08 11.08 11.08 ----
12 Thursday, November 25, 2021 11.04 11.04 11.04 11.04 ----
13 Wednesday, November 24, 2021 11.03 11.03 11.03 11.03 ----
14 Tuesday, November 23, 2021 11.04 11.04 11.04 11.04 ----
15 Monday, November 22, 2021 11.07 11.07 11.07 11.07 ----
16 Friday, November 19, 2021 11.09 11.09 11.09 11.09 ----
17 Thursday, November 18, 2021 11.06 11.06 11.06 11.06 ----
18 Wednesday, November 17, 2021 11.05 11.05 11.05 11.05 ----
19 Tuesday, November 16, 2021 11.05 11.05 11.05 11.05 ----
20 Monday, November 15, 2021 11.05 11.05 11.05 11.05 ----
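To get from here to the exact layout in the question (a Day column, the short date, and no Volume column), a possible continuation of the snippet above; the formats are assumptions based on the output shown, not part of the original answer:
# day name is everything before the first comma
df.insert(0, 'Day', df['Date'].str.split(',').str[0])
# shorten 'December 13, 2021' to 'Dec 13, 2021'
df['Date'] = pd.to_datetime(df['Date'].str.split(',', n=1).str[1].str.strip(),
                            format='%B %d, %Y').dt.strftime('%b %d, %Y')
# drop the Volume column
df = df.drop(columns=['Volume'])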

Web scraping the second of two tables on a page in Python 3 with BeautifulSoup

I'm working on my python skills and I'm trying to scrape only the "Results" table from this page https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results . I'm new to web scraping, could anyone help me with an elegant solution for scraping the Results wikitable? Thanks!
The easiest way is to use Pandas to load the tables:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
# print second table (index 1):
print(tables[1])
Prints:
Date Venue Home team Away team Score Competition Winner Match report
0 7 March 2020 Twickenham Stadium England Wales 33–30 2020 Six Nations England BBC
1 22 February 2020 Principality Stadium Wales France 23–27 2020 Six Nations France BBC
2 8 February 2020 Aviva Stadium Ireland Wales 24–14 2020 Six Nations Ireland BBC
3 1 February 2020 Principality Stadium Wales Italy 42–0 2020 Six Nations Wales BBC
4 30 November 2019 Principality Stadium Wales Barbarians 43–33 Tour Match Wales BBC
.. ... ... ... ... ... ... ... ...
741 5 January 1884 Cardigan Fields England Wales 1G 2T–1G 1884 Home Nations Championship England NaN
742 8 January 1883 Raeburn Place Scotland Wales 3G–1G 1883 Home Nations Championship Scotland NaN
743 16 December 1882 St Helen's Wales England 0–2G 4T 1883 Home Nations Championship England NaN
744 28 January 1882 Lansdowne Road Ireland Wales 0–2G 2T NaN Wales NaN
745 19 February 1881 Richardson's Field England Wales 7G 6T 1D–0 NaN England NaN
[746 rows x 8 columns]
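If you would rather not depend on the Results table being second on the page, pd.read_html also takes a match argument that keeps only tables containing the given text. A sketch, assuming the string 'Match report' appears only in the Results table:
import pandas as pd

# keep only tables whose text matches the pattern
tables = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results',
    match='Match report',
)
results = tables[0]
print(results)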

How to separate columns and format date when web scraping by using Python?

I am trying to web scrape, by using Python 3, a chart off of this website into a .csv file: 2013-14 NBA National TV Schedule
The chart starts out like:
Game/Time Network Matchup
Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
I imported the data by:
pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
The output sample is:
0 1 2
0 Game/Time Network Matchup
1 Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
2 Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
The output I want in the .csv file has the Game/Time column split into separate columns, with the date formatted like 10/29/13, and the Matchup column split into away (first team) and home (second team) columns. I know pd.to_datetime and str.split() should be used. How do I implement the scraper to get this output?
Here's my take:
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set the correct column names
df = df.T.set_index([0]).T
# separate date and time
datetime = df['Game/Time'].str.extract('(?P<Date>.*), (?P<Time>.*) ET$')
# extract Home and Away
home_away = df['Matchup'].str.extract(r'^(?P<Away>.*) vs\. (?P<Home>.*)$')
# join the data
final_df = pd.concat([datetime, home_away, df[['Network']]], axis=1)
Output:
Date Time Away Home Network
1 Oct. 29 8 p.m. Chicago Miami TNT
2 Oct. 29 10:30 p.m. LA Clippers LA Lakers TNT
3 Oct. 31 8 p.m. New York Chicago TNT
4 Oct. 31 10:30 p.m. Golden State LA Clippers TNT
5 Nov. 1 8 p.m. Miami Brooklyn ESPN
.. ... ... ... ... ...
141 Apr. 13 1 p.m. Chicago New York ABC
142 Apr. 15 8 p.m. New York Brooklyn TNT
143 Apr. 15 10:30 p.m. Denver LA Clippers TNT
144 Apr. 16 8 p.m. Atlanta Milwaukee ESPN
145 Apr. 16 10:30 p.m. Golden State Denver ESPN
This line should help you format the date in the exact way you want, once the column holds actual datetimes:
df['Date'] = df['Date'].dt.strftime('%m/%d/%Y')
The full solution:
import pandas as pd
import numpy as np
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule", header=0)[0]
df['Date'] = df['Game/Time'].str.extract(r'(.*),', expand=True)
df['Time'] = df['Game/Time'].str.extract(r',(.*) ET', expand=True)
df['Time'] = df['Time'].str.replace('p.m.', 'PM')
# October-December games belong to 2013, the rest of the season to 2014
df['Date'] = np.where(df.Date.str.startswith(('Oct', 'Nov', 'Dec')), df.Date + ' 13', df.Date + ' 14')
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].dt.strftime('%m/%d/%Y')
df['Home'] = df['Matchup'].str.extract(r'(.*)vs')
df['Away'] = df['Matchup'].str.extract(r'vs\.(.*)')
df = df.drop(columns=['Game/Time','Matchup'])
print(df)
Network Date Time Home Away
0 TNT 10/29/2013 8 PM Chicago Miami
1 TNT 10/29/2013 10:30 PM LA Clippers LA Lakers
2 TNT 10/31/2013 8 PM New York Chicago
3 TNT 10/31/2013 10:30 PM Golden State LA Clippers
4 ESPN 11/01/2013 8 PM Miami Brooklyn
I hope this is what you were looking for.
You can use a regex to split out your columns. The times come in different formats, so we can handle those by parsing with specific formats, coercing the failures to NaT, and then filling them in with the second format.
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# promote the first row to the header
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
# set date and time columns
df['date'] = pd.to_datetime(df['Game/Time'].str.split(',', expand=True)[0] + ' 2019',
                            format='%b. %d %Y')
df['time'] = df['Game/Time'].str.split(',', expand=True)[1]
# the time column has different formats, handle both of them
s = pd.to_datetime(df['time'].str.strip('ET').str.replace(r'\.', '', regex=True).str.strip(),
                   format='%H %p', errors='coerce')
s = s.fillna(pd.to_datetime(df['time'].str.strip('ET').str.replace(r'\.', '', regex=True).str.strip(),
                            format='%H:%M %p', errors='coerce'))
df['time'] = s.dt.time
# home and away columns
df['home'] = df['Matchup'].str.extract('(.*)vs(.*)')[0].str.strip()
df['away'] = df['Matchup'].str.extract('(.*)vs(.*)')[1].str.strip('.')
# slice dataframe
df2 = df[['date','time','home','away','Network']]
print(df2)
print(df2)
0 date time home away Network
0 2019-10-29 08:00:00 Chicago Miami TNT
1 2019-10-29 10:30:00 LA Clippers LA Lakers TNT
2 2019-10-31 08:00:00 New York Chicago TNT
3 2019-10-31 10:30:00 Golden State LA Clippers TNT
4 2019-11-01 08:00:00 Miami Brooklyn ESPN
.. ... ... ... ... ...
140 2019-04-13 01:00:00 Chicago New York ABC
141 2019-04-15 08:00:00 New York Brooklyn TNT
142 2019-04-15 10:30:00 Denver LA Clippers TNT
143 2019-04-16 08:00:00 Atlanta Milwaukee ESPN
144 2019-04-16 10:30:00 Golden State Denver ESPN
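One caveat on the time parsing above: with format='%H %p' the %p has no effect (it only works together with %I), so '8 p.m.' comes out as 08:00 rather than 20:00, as the output shows. If you want true evening times, a drop-in replacement for the time-parsing lines (my adjustment, not part of the original answer):
# rebuild the cleaned time strings from the raw Game/Time column
cleaned = df['Game/Time'].str.split(',', expand=True)[1].str.strip('ET').str.replace(r'\.', '', regex=True).str.strip()
# 12-hour parsing: try the hour-only format first, then fall back to hour:minute
s = pd.to_datetime(cleaned, format='%I %p', errors='coerce')
s = s.fillna(pd.to_datetime(cleaned, format='%I:%M %p', errors='coerce'))
df['time'] = s.dt.time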

Reading from a text file then splitting that information

I need to read from a text file, then print the information separately.
For example, I'm given a list of names in this format: Orville Wright 21 July 1988
And i need to make the outcome as so:
Name
1. Orville Wright
Date
1. 21 July 1988
I've tried using a reader to separate the parts, but I would need a separate readline() call with a different length for every name and date, since they are not all the same length.
with open('File name and location', 'r') as reader:
    print(reader.readline(14))
This is the outcome: Orville Wright
I want my results to be:
Name:
1. Orville Wright
2. Rogelio Holloway
etc
Date:
1. 21 July 1988
2. 13 September 1988
etc
The contents of the file are as follows:
Orville Wright 21 July 1988
Rogelio Holloway 13 September 1988
Marjorie Figueroa 9 October 1988
Debra Garner 7 February 1988
Tiffany Peters 25 July 1988
Hugh Foster 2 June 1988
Darren Christensen 21 January 1988
Shelia Harrison 28 July 1988
Ignacio James 12 September 1988
Jerry Keller 30 February 1988
Frankie Cobb 1 July 1988
Clayton Thomas 10 December 1988
Laura Reyes 9 November 1988
Danny Jensen 19 September 1988
Sabrina Garcia 20 October 1988
Winifred Wood 27 July 1988
Juan Kennedy 4 March 1988
Nina Beck 7 May 1988
Tanya Marshall 22 May 1988
Kelly Gardner 16 August 1988
Cristina Ortega 13 January 1988
Guy Carr 21 June 1988
Geneva Martinez 5 September 1988
Ricardo Howell 23 December 1988
Bernadette Rios 19 July 1988
Here is one approach using a regex to pull the date out of each line:
import re
names = []
dates = []
with open(filename) as infile:
    for line in infile:
        line = line.strip()
        date = re.search(r"(\d{1,2} [a-zA-Z]+ \d{4})", line).group(1)  # Extract the date.
        dates.append(date)
        names.append(line.replace(date, "").strip())  # What is left is the name.
print("Name:")
for name in names:
    print(name)
print("---"*10)
print("Date:")
for date in dates:
    print(date)
Output:
Name:
Orville Wright
Rogelio Holloway
Marjorie Figueroa
Debra Garner
Tiffany Peters
Hugh Foster
Darren Christensen
Shelia Harrison
Ignacio James
Jerry Keller
Frankie Cobb
Clayton Thomas
Laura Reyes
Danny Jensen
Sabrina Garcia
Winifred Wood
Juan Kennedy
Nina Beck
Tanya Marshall
Kelly Gardner
Cristina Ortega
Guy Carr
Geneva Martinez
Ricardo Howell
Bernadette Rios
------------------------------
Date:
21 July 1988
13 September 1988
9 October 1988
7 February 1988
25 July 1988
2 June 1988
21 January 1988
28 July 1988
12 September 1988
30 February 1988
1 July 1988
10 December 1988
9 November 1988
19 September 1988
20 October 1988
27 July 1988
4 March 1988
7 May 1988
22 May 1988
16 August 1988
13 January 1988
21 June 1988
5 September 1988
23 December 1988
19 July 1988
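If you want the numbered '1.', '2.' format from the question instead of plain lines, enumerate while printing; a small variation on the loops above:
print("Name:")
for i, name in enumerate(names, start=1):
    print(f"{i}. {name}")
print("Date:")
for i, date in enumerate(dates, start=1):
    print(f"{i}. {date}")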
Store all the names and dates in separate lists, then display each list.
The following code assumes that each name-and-date pair sits on its own line and that the first digit in the line marks the start of the date.
import re
names = []
dates = []
with open('File name and location', 'r') as reader:
    for line in reader.readlines():
        date_position = re.search(r"\d", line).start()
        names.append(line[:date_position - 1])
        dates.append(line[date_position:].strip())
Now you can print each name and date to your liking:
for i, name in enumerate(names):
    print(f"{i+1}. {name}")
And for dates:
for i, date in enumerate(dates):
    print(f"{i+1}. {date}")
Output (for part of the text file):
1. Orville Wright
2. Rogelio Holloway
3. Marjorie Figueroa
4. Debra Garner
5. Tiffany Peters
6. Hugh Foster
7. Darren Christensen
8. Shelia Harrison
9. Ignacio James
10. Jerry Keller
11. Frankie Cobb
12. Clayton Thomas
13. Laura Reyes
14. Danny Jensen
15. Sabrina Garcia
16. Winifred Wood
17. Juan Kennedy
1. 21 July 1988
2. 13 September 1988
3. 9 October 1988
4. 7 February 1988
5. 25 July 1988
6. 2 June 1988
7. 21 January 1988
8. 28 July 1988
9. 12 September 1988
10. 30 February 1988
11. 1 July 1988
12. 10 December 1988
13. 9 November 1988
14. 19 September 1988
15. 20 October 1988
16. 27 July 1988
17. 4 March 1988

Data Preparation - Python

I have a first dataframe containing a series called 'Date' and a variable number n of series called 'People_1' to 'People_n':
Id Date People_1 People_2 People_3 People_4 People_5 People_6 People_7
12.0 Sat Dec 19 00:00:00 EST 1970 Loretta Lynn Owen Bradley
13.0 Sat Jun 07 00:00:00 EDT 1980 Sissy Spacek Loretta Lynn Owen Bradley
14.0 Sat Dec 04 00:00:00 EST 2010 Loretta Lynn Sheryl Crow Miranda Lambert
15.0 Sat Aug 09 00:00:00 EDT 1969 Charley Pride Dallas Frazier A.L. "Doodle" Chet Atkins Jack Clement Bob Ferguson Felton Jarvis
I also have another dataframe containing a list of names and biographical data:
People Birth_date Birth_state Sex Ethnicity
Charles Kelley Fri Sep 11 00:00:00 EDT 1981 GA Male Caucasian
Hillary Scott Tue Apr 01 00:00:00 EST 1986 TN Female Caucasian
Reba McEntire Mon Mar 28 00:00:00 EST 1955 OK Female Caucasian
Wanda Jackson Wed Oct 20 00:00:00 EST 1937 OK Female Caucasian
Carrie Underwood Thu Mar 10 00:00:00 EST 1983 OK Female Caucasian
Toby Keith Sat Jul 08 00:00:00 EDT 1961 OK Male Caucasian
David Bellamy Sat Sep 16 00:00:00 EDT 1950 FL Male Caucasian
Howard Bellamy Sat Feb 02 00:00:00 EST 1946 FL Male Caucasian
Keith Urban Thu Oct 26 00:00:00 EDT 1967 Northland Male Caucasian
Miranda Lambert Thu Nov 10 00:00:00 EST 1983 TX Female Caucasian
Sam Hunt Sat Dec 08 00:00:00 EST 1984 GA Male Caucasian
Johnny Cash Fri Feb 26 00:00:00 EST 1932 AR Male Caucasian
June Carter Sun Jun 23 00:00:00 EDT 1929 VA Female Caucasian
Merle Haggard Tue Apr 06 00:00:00 EST 1937 CA Male Caucasian
Waylon Jennings Tue Jun 15 00:00:00 EDT 1937 TX Male Caucasian
Willie Nelson Sat Apr 29 00:00:00 EST 1933 TX Male Caucasian
Loretta Lynn Thu Apr 14 00:00:00 EST 1932 KY Female Caucasian
Sissy Spacek Sun Dec 25 00:00:00 EST 1949 TX Female Caucasian
Sheryl Crow Sun Feb 11 00:00:00 EST 1962 MO Female Caucasian
Charley Pride Sun Mar 18 00:00:00 EST 1934 MS Male African American
Rodney Clawon ? TX Male Caucasian
Nathan Chapman ? TN Male Caucasian
For each date, I want to get the biographical data of every person involved that day:
Date Birth_state Sex Ethnicity
Sat Dec 19 00:00:00 EST 1970 KY Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 TX Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 MO Female Caucasian
Sat Dec 04 00:00:00 EST 2010 TX Female Caucasian
Sat Aug 09 00:00:00 EDT 1969 MS Male African American
Note: my bio data isn't complete yet; some names are missing, which explains why I don't get a row for every person listed.
So, is there a way to perform this task in Python?
Lionel
You can use a left join in pandas to join the two tables and then select the columns you need.
First reshape the schedule dataframe so that each person gets their own row: melt the People_1 to People_n columns into a single 'People' column. Then left join that long table with the bio dataframe on 'People', as sketched below.
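A minimal sketch of that approach, assuming the two dataframes are called df_dates and df_bio and that the names are spelled identically in both (mismatches simply produce no bio data):
import pandas as pd

# one row per (Date, person), dropping the empty People_x cells
people_cols = [c for c in df_dates.columns if c.startswith('People_')]
people_long = df_dates.melt(id_vars=['Date'], value_vars=people_cols, value_name='People')
people_long = people_long.dropna(subset=['People'])

# left join the biographical data on the person's name
# (use how='inner' instead if you want to drop people with no bio row, as in the expected output)
merged = people_long.merge(df_bio, on='People', how='left')

# keep only the requested columns
result = merged[['Date', 'Birth_state', 'Sex', 'Ethnicity']]
print(result)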
