I am trying to extract dates from text in Python. These are the possible texts and the date patterns in them:
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far:
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1 and 2:
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print(match.group())
This somewhat captures cases 3, 4, and 5, but it prints everything from the text. In the case below, I want 25 Nov 2016, but the regex pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1: How do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print(match.group())
Question 2: Similarly, how do I capture cases 6, 7, and 8? What should the regex be for that?
If not, is there any other better way to capture date from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note that I made all the groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: In case you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching 4- or 2-digit whole words after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
            ^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
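A quick self-contained check of why the lookbehind helps (an illustrative sketch; the variable names below exist only for this example):
import re

year_naive = r'\b\d{2}(?:\d{2})?\b'        # also matches the "00" in "3:00"
year_safe = r'\b(?<!:)\d{2}(?:\d{2})?\b'   # negative lookbehind skips digits right after ":"
s = "Friday 25 Nov 3:00 p.m. (On Site)(2016)"
print(re.findall(year_naive, s))  # ['25', '00', '2016']
print(re.findall(year_safe, s))   # ['25', '2016']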
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = ["Auction details: 14 December 2016, Pukekohe Park","Auction details: 17 Feb 2017, Gold Sacs Road","Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)","Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)","Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)","Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)","Auction details: Thursday, 28th February '19","Auction details: Friday, 1st February '19","Friday 25 Nov 3:00 p.m. (On Site)(2016)"]
for s in strs:
print(s)
m = re.search(rx, s)
if m:
print("{} {}".format(m.group(1), m.group(2)))
else:
print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016
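If you then need an actual date object rather than the matched text, a possible follow-up (assuming the python-dateutil package is available) is to feed the concatenated groups to dateutil, e.g. for the "27 Apr" / "2016" result above:
from dateutil import parser

# Concatenate group 1 (day and month) with group 2 (year) and parse it;
# dayfirst=True keeps "27 Apr 2016"-style strings unambiguous.
date_str = "{} {}".format("27 Apr", "2016")
print(parser.parse(date_str, dayfirst=True).date())  # 2016-04-27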
I'm very new to Python and a novice at development in general. I've been tasked with iterating through a div, and I am able to get some data returned. However, when no tag exists I'm getting null results. Would anyone be able to help? I have researched and tried for days now, and I truly appreciate your patience and explanations. I'm trying to extract the following:
Start Date [i.e., Sat 31 Jul 2021], which I do get results for with my code
End Date [i.e., Fri 20 Aug 2021], which I get no results for with my code
Description [i.e., 20 Night New! Malta, The Adriatic & Greece], which I get no results for with my code
Ship Name [i.e., Viking Sea], which I do get results for with my code
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
Here is my code (no judgements, please... haha):
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
#initiate data storage
cruisestartdate = []
cruiseenddate = []
itinerarydescription = []
cruiseshipname = []
cruisecabinprice = []
destinationportname = []
portdatetime = []
cruise_div = soup.find_all('div', class_='cd-info')
#our loop through each container
for container in cruise_div:
    # cruise start date
    startdate = container.b.text
    print(startdate)
    cruisestartdate.append(startdate)
    # cruise end date
    enddate = container.string
    cruiseenddate.append(enddate)
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
#pandas dataframe
cruise = pd.DataFrame({
    'Sail Date': cruisestartdate,
    'End Date': cruiseenddate,
    #'Description': description,
    'Ship Name': cruiseshipname,
    #'imdb': imdb_ratings,
    #'metascore': metascores,
    #'votes': votes,
    #'us_grossMillions': us_gross,
})
print(soup)
print(cruisestartdate)
print(cruiseenddate)
print(itinerarydescription)
print(cruiseshipname)
Here are my results from the print:
['Sat 31 Jul 2021'] [None] [] ['Viking Sea']
container.text is nicely formatted, one field per line. Just split it into lines and use the pieces:
cruise_div = soup.find_all('div', class_='cd-info')
#our loop through each container
for container in cruise_div:
    lines = container.text.splitlines()
    cruisestartdate.append(lines[1])
    cruiseenddate.append(lines[2])
    itinerarydescription.append(lines[3])
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
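As an aside on why the original attempt got None for the end date: BeautifulSoup's .string is only defined when a tag has exactly one child, and the cd-info div has many children, so container.string returns None. A minimal illustration:
from bs4 import BeautifulSoup

single = BeautifulSoup("<b>Sat 31 Jul 2021</b>", "html.parser").b
multi = BeautifulSoup("<div>From <b>x</b> y</div>", "html.parser").div
print(single.string)  # Sat 31 Jul 2021 -- exactly one child, so .string works
print(multi.string)   # None -- several children, so .string is undefined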
Another solution, using bs4 API:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.content, "html.parser")
all_data = []
for div in soup.select("div.cd-info"):
    from_ = div.b
    to_ = from_.find_next_sibling(text=True).strip()
    # remove (To )
    to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
    desc = from_.find_next("b").get_text(strip=True)
    # remove html chars
    desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
    ship_name = div.a.get_text(strip=True)
    all_data.append(
        [
            from_.get_text(strip=True),
            to_,
            desc,
            ship_name,
        ]
    )
df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
4 Fri 13 Aug 2021 Mon 23 Aug 2021 10 Night New! Malta & Greek Isles Discovery Viking Venus
5 Fri 13 Aug 2021 Thu 2 Sep 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
6 Sat 14 Aug 2021 Sat 21 Aug 2021 7 Night Iceland's Natural Beauty Viking Sky
7 Mon 16 Aug 2021 Mon 13 Sep 2021 28 Night Star Collector: From Greek Gods to Gaudí Wind Surf
8 Mon 16 Aug 2021 Fri 3 Sep 2021 18 Night Star Collector: Flamenco of the Mediterranean Wind Surf
9 Mon 16 Aug 2021 Wed 29 Sep 2021 44 Night Star Collector: Myths & Masterpieces of the Mediterranean Wind Surf
and saves data.csv (screenshot from LibreOffice).
EDIT: To scrape the data from all pages:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {"Accept-Language": "en-US, en;q=0.5"}
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
offset = 1
all_data = []
while True:
    u = url + f"&offset={offset}"
    print(f"Getting offset {offset=}")
    results = requests.get(u, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")

    for div in soup.select("div.cd-info"):
        from_ = div.b
        to_ = from_.find_next_sibling(text=True).strip()
        # remove (To )
        to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
        desc = from_.find_next("b").get_text(strip=True)
        # remove html chars
        desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
        ship_name = div.a.get_text(strip=True)
        all_data.append(
            [
                from_.get_text(strip=True),
                to_,
                desc,
                ship_name,
            ]
        )

    # get next offset:
    offset = soup.select_one(".cd-buttonlist > b + a")
    if offset:
        offset = int(offset.get_text(strip=True).split("-")[0])
    else:
        break
df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
Getting offset offset=1
Getting offset offset=11
Getting offset offset=21
...
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
...
144 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nights Mediterranean MSC Splendida
145 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nächte Große Freiheit - Schwedische Küste 1 Mein Schiff 6
146 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Night Iceland's Natural Beauty Viking Jupiter
Getting the start date, description, and ship name is straightforward. The only trick here is getting the end date, which does not lie inside its own tag. To get it, you have to tweak some regex. Here is one solution similar to @Andrej's:
from bs4 import BeautifulSoup
import re
html = """
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
bold = soup.find_all('b')
print(f"Start date: {bold[0].text}")
print(f"Description: {bold[1].text}")
untagged_line = re.sub("\n", "", soup.find(text=re.compile('To')))
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(f"End date: {end_date}")
ship = soup.find('a', class_='red')
print(f"Ship: {ship.text}")
Output:
Start date: Sat 31 Jul 2021
Description: 20 Night New! Malta, The Adriatic & Greece
End date: Fri 20 Aug 2021
Ship: Viking Sea
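A small version note (an aside, not required for the solution): in newer BeautifulSoup releases the text= keyword used above is also available as string=, so the same lookup can be written as:
untagged_line = re.sub("\n", "", soup.find(string=re.compile('To')))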
I have a column of values as below:
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I require the output as:
array(['Mar'18', 'Jun'18', 'Sep'18', 'Dec'18', 'Mar'19',
'Jun'19', 'Sep'19', 'Dec'19', 'Mar'20', 'Jun'20',
'Sep'20', 'Dec'20'], dtype=object)
I have tried the following code:
df['Period'] = df['Period'].replace({'20','''})
But this wasn't doing the conversion; how do I do the replacement?
Any help?
Thanks
With your shown samples, please try the following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20
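Note that replace returns a new Series rather than changing df['Period'] in place, so assign it back if you want to keep the result:
df['Period'] = df['Period'].replace(r" \d{2}", "'", regex=True)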
Try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which holds the last two digits in this case.
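To see the same capture-and-backreference at work on a plain string (a standalone re sketch, independent of pandas):
import re

# r"\s\d{2}(\d{2})" matches " 2018" and captures "18" into group 1,
# so the replacement r"'\1" turns "Mar 2018" into "Mar'18".
print(re.sub(r"\s\d{2}(\d{2})", r"'\1", "Mar 2018"))  # Mar'18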
Following your code (slightly changed so it works) will not get you what you need, as it will replace every occurrence of '20':
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
Another way, without using regex, would be to use vectorized str methods:
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20
I'm trying to parse dates from individual health records. Since the entries appear to be entered manually, the date formats are all over the place. My regex patterns are apparently not making the cut for several observations. Here's the list of tasks I need to accomplish along with the accompanying code. The DataFrame has been subsetted to 16 observations for convenience.
Parse dates:
#Create DF:
health_records = ['08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',
'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',
'28 Sep 2015 Primary Care Doctor:',
'06 Mar 1974 Primary Care Doctor:',
'none; but currently has appt with new HJH PCP Rachel Salas, MD on October. 11, 2013 Other Agency Involvement: No',
'.Came back to US on Jan 24 1986, saw Dr. Quackenbush at Beaufort Memorial Hospital. Checked VPA level and found it to be therapeutic and confirmed BPAD dx. Also, has a general physician exam and found to be in good general health, except for being slightly overwt',
'September. 15, 2011 Total time of visit (in minutes):',
'sLanguage based learning disorder, dyslexia. Placed on IEP in 1st grade through Westerbrook HS prior to transitioning to VFA in 8th grade. Graduated from VF Academy in May 2004. Attended 1.5 years college at Arcadia.Employment Currently employed: Yes',
') - Zoloft 100 mg daily: February, 2010 : self-discontinued due to side effects (unknown)',
'6/1998 Primary Care Doctor:',
'12/2008 Primary Care Doctor:',
'ran own business for 35 years, sold in 1985',
'011/14/83 Audit C Score Current:',
'safter evicted in February 1976, hospitalized at Pemberly for 1 mo.Hx of Outpatient Treatment: No',
'. Age 16, 1991, frontal impact. out for two weeks from sports.',
's Mr. Moss is a 27-year-old, Caucasian, engaged veteran of the Navy. He was previously scheduled for an intake at the Southton Sanitorium in January, 2013 but cancelled due to ongoing therapy (see Psychiatric History for more details). He presents to the current intake with primary complaints of sleep difficulties, depressive symptoms, and PTSD.']
import numpy as np
import pandas as pd
df = pd.DataFrame(health_records, columns=['records'])
#Date parsing: patten 1:
df['new'] = (df['records'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
.astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '').str.strip())
#Date parsing pattern 2:
df['new2'] = (df['records'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
.astype(str).str.replace(r'\[|\]|\'|\,','').str.strip())
df['date'] = df['new']+df['new2']
and here is the output:
df['date']
0 08/11/78
1 7/11/77 16/0.83 4.9/36
2 28 Sep 2015
3 06 Mar 1974
4
5 24 1986
6
7 May 2004
8
9 6/1998
10 12/2008
11
12 011/14
13 February 1976
14
15
As you can see, in some places the code works perfectly, but for complex sentences my pattern either does not work or spits out inaccurate results. Here is a list of all possible date combinations:
04/20/2009; 04/20/09; 4/20/09; 4/3/09;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Mar 20th, 2009;
Mar 21st, 2009; Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009
2009; 2010
Clean dates
Next I tried to clean the dates using a solution provided in another answer. While it should work, since my format is similar to the one in that question, it doesn't.
#Clean dates to date format
df['clean_date'] = df.date.apply(
lambda x: pd.to_datetime(x).strftime('%m/%d/%Y'))
df['clean_date']
The above code does not work. Any help would be deeply appreciated. Thanks for your time!
Well, I figured it out on my own. I still had to make some manual adjustments.
df['new'] = (df['header'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
.astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '').str.strip())
df['new2'] = (df['header'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
.astype(str).str.replace(r'\[|\]|\'|\,','').str.strip())
df['new3'] = (df['header'][455:501].str.findall(r'\d{4}')
.astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '').str.strip())
#Coerce dates data to date-format
df['date1'] = df['new'].str.strip() + df['new2'].str.strip() + df['new3'].str.strip()
df['date1'] = pd.to_datetime(df['date1'], errors='coerce')
df[['date1', 'header']].sort_values(by ='date1')
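If the manual regex adjustments keep piling up, one hedged alternative (assuming the python-dateutil package, and using the 'records' column from the sample df above rather than the full dataset's 'header' column) is to let dateutil take a fuzzy, best-effort pass and fall back to NaT. It often mis-reads free text containing several unrelated numbers, so treat it as a cross-check rather than a replacement for the patterns above:
import pandas as pd
from dateutil import parser

def fuzzy_parse(text):
    # Best-effort extraction: fuzzy=True lets dateutil skip non-date tokens,
    # and default= fills in fields that the text does not mention.
    try:
        return parser.parse(text, fuzzy=True, default=pd.Timestamp('1900-01-01'))
    except (ValueError, OverflowError):
        return pd.NaT

df['fuzzy_date'] = df['records'].apply(fuzzy_parse)
print(df[['records', 'fuzzy_date']])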
I am trying to web scrape a chart from this website into a .csv file using Python 3: 2016 NBA National TV Schedule
The chart starts out like:
Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN
I am using these packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
The output I want in the .csv file is the first six lines of the chart on the website, with the date repeated on each game's row. Notice how multiple dates are used more than once. How do I implement the scraper to get this output?
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])
all_data = {'Date':[], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
Date Time Team 1 Team 2 Network
0 Tuesday, October 25 8:00 PM Knicks Cavaliers TNT
1 Tuesday, October 25 10:30 PM Spurs Warriors TNT
2 Wednesday, October 26 8:00 PM Thunder Sixers ESPN
3 Wednesday, October 26 10:30 PM Rockets Lakers ESPN
4 Thursday, October 27 8:00 PM Celtics Bulls TNT
.. ... ... ... ... ...
159 Saturday, April 8 8:30 PM Clippers Spurs ABC
160 Monday, April 10 8:00 PM Wizards Pistons TNT
161 Monday, April 10 10:30 PM Rockets Clippers TNT
162 Wednesday, April 12 8:00 PM Hawks Pacers ESPN
163 Wednesday, April 12 10:30 PM Pelicans Blazers ESPN
[164 rows x 5 columns]
And saves data.csv (screenshot from LibreOffice).
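A note on the groupby step in the answer above: it splits the flat list of scraped lines into alternating runs of date headers and game rows, keyed on whether a weekday name appears in the line. A minimal illustration with a toy slice of the chart:
from itertools import groupby

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
data = ['Tuesday, October 25', '8:00 PM Knicks/Cavaliers TNT',
        '10:30 PM Spurs/Warriors TNT', 'Wednesday, October 26',
        '8:00 PM Thunder/Sixers ESPN']
# Consecutive lines with the same key (date header or not) are grouped together.
for is_date, group in groupby(data, lambda line: any(d in line for d in days)):
    print(is_date, list(group))
# True ['Tuesday, October 25']
# False ['8:00 PM Knicks/Cavaliers TNT', '10:30 PM Spurs/Warriors TNT']
# True ['Wednesday, October 26']
# False ['8:00 PM Thunder/Sixers ESPN']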
Say I have two Python lists:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
this gives me result as:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think regular expressions could be handy here, but I'm not sure how to approach this. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both the 'Sales' and 'Units sold' components, and then concatenated.
My main goal here is to get the list which I can use later to generate a statement that I'll be using to write Hive query.
Added more explanation as suggested by @andrew_reece
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I first concatenate the ListB elements pairwise and then join ListA with this new list.
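For readability, the same pairing can also be written in two steps (equivalent to the one-liner above):
# Pair consecutive ListB entries: 'Sales Jan 2018 Units sold Jan 2018', ...
pairs = [b1 + " " + b2 for b1, b2 in zip(ListB[::2], ListB[1::2])]
# Prefix each pair with the matching ListA entry.
List_op = [a + " " + p for a, p in zip(ListA, pairs)]
print(List_op)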
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]