How can I get a dictionary like below from this text file:
[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester
I have tried list comprehensions and for loops with the enumerate function, but I could not build this dictionary.
My desired dictionary is:
{'[Fri Aug 20]': ['shamooshak 4-0 milan', 'Tehran 2-0 Ams', 'Liverpool 0-2 Mes'], '[Fri Aug 19]': ['Esteghlal 1-0 perspolise', 'Paris 2-0 perspolise'], ... and so on}
Assuming your data is lines of text...
def process_arbitrary_text(text):
    obj = {}
    arr = []
    k = None
    for line in text:
        line = line.strip()  # drop trailing newlines if reading from a file
        if line.startswith('[') and line.endswith(']'):
            if k and arr:  # omit empty keys?
                obj[k] = arr
            k = line
            arr = []
        elif line:
            arr.append(line)
    if k and arr:  # flush the final group after the loop ends
        obj[k] = arr
    return obj

desired_dict = process_arbitrary_text(text)
Edit: since you edited the question to say it is a text file, just use the following pattern:
with open('filename.txt', 'r') as file:
    for line in file:
        # do something with each line
        ...

# ...or read everything at once:
with open('filename.txt', 'r') as file:
    text = file.readlines()
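Note that dates repeat in the sample ([Fri Aug 20] appears twice). If repeated headers should merge into one key, a collections.defaultdict variant of the same loop handles that; here is a self-contained sketch on a shortened sample:

```python
from collections import defaultdict

def parse_schedule(text):
    """Group match lines under the [date] header that precedes them."""
    groups = defaultdict(list)
    key = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('[') and line.endswith(']'):
            key = line          # repeated headers reuse the same list
        elif line and key:
            groups[key].append(line)
    return dict(groups)

sample = """[Fri Aug 20]
shamooshak 4-0 milan
[Fri Aug 19]
Esteghlal 1-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan"""

print(parse_schedule(sample))
```

Both [Fri Aug 20] sections end up under one key.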
A for loop can be a savior here:
a='''[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester'''
d = {}
temp_value = []
temp_key = ''
for i in a.split('\n'):
    if i.startswith('['):
        if temp_key and temp_key in d:
            d[temp_key] = d[temp_key] + temp_value
        elif temp_key:
            d[temp_key] = temp_value
        temp_key = i
        temp_value = []
    else:
        temp_value.append(i)
# flush the last group after the loop ends
if temp_key and temp_key in d:
    d[temp_key] = d[temp_key] + temp_value
elif temp_key:
    d[temp_key] = temp_value
print(d)

Output

{'[Fri Aug 20]': ['shamooshak 4-0 milan', 'Tehran 2-0 Ams', 'Liverpool 0-2 Mes', 'RahAhan 0-0 milan'], '[Fri Aug 19]': ['Esteghlal 1-0 perspolise', 'Paris 2-0 perspolise'], '[Wed Agu 11]': ['Munich 3-3 ABC'], '[Wed Agu 12]': ['RM 0-0 Tarakto'], '[Sat Jau 01]': ['Bayern 2-0 Manchester']}
Using regular expressions (re module) and your sample text:
text = '''[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester'''
import re

x = re.findall(r'\[.+?\][^\[]*', text)
x = [i.split('\n') for i in x]
d = dict()
for i in x:
    d[i[0]] = [j for j in i[1:] if j != '']
It gives the following dictionary d:
`{'[Fri Aug 20]': ['RahAhan 0-0 milan'], '[Sat Jau 01]': ['Bayern 2-0 Manchester'], '[Fri Aug 19]': ['Esteghlal 1-0 perspolise', 'Paris 2-0 perspolise'], '[Wed Agu 12]': ['RM 0-0 Tarakto'], '[Wed Agu 11]': ['Munich 3-3 ABC']}`
I overlooked that dates might repeat, as pointed out by mad_. To avoid losing data, replace the for loop with:
for i in x:
    d[i[0]] = []
for i in x:
    d[i[0]] = d[i[0]] + [j for j in i[1:] if j != '']
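The two passes can also be collapsed into one with dict.setdefault; here is a self-contained sketch on a shortened version of the sample text:

```python
import re

text = ("[Fri Aug 20]\nshamooshak 4-0 milan\n"
        "[Fri Aug 19]\nParis 2-0 perspolise\n"
        "[Fri Aug 20]\nRahAhan 0-0 milan")

d = {}
for block in re.findall(r'\[.+?\][^\[]*', text):
    lines = block.split('\n')
    # setdefault creates the list on first sight of a key, then extends it
    d.setdefault(lines[0], []).extend(j for j in lines[1:] if j != '')
print(d)
```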
I'm very new to Python and a novice at development in general. I've been tasked with iterating through a div, and I am able to get some data returned. However, when no tag exists I'm getting null results. Would anyone be able to help? I have researched and tried for days now. I truly appreciate your patience and explanations. I'm trying to extract the following:
Start Date [i.e., Sat 31 Jul 2021], which I get results for with my code
End Date [i.e., Fri 20 Aug 2021], which I get no results for with my code
Description [i.e., 20 Night New! Malta, The Adriatic & Greece], which I get no results for with my code
Ship Name [i.e., Viking Sea], which I get results for with my code
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
Here is my code (no judgements please..haha)
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

# initiate data storage
cruisestartdate = []
cruiseenddate = []
itinerarydescription = []
cruiseshipname = []
cruisecabinprice = []
destinationportname = []
portdatetime = []

cruise_div = soup.find_all('div', class_='cd-info')

# our loop through each container
for container in cruise_div:
    # cruise start date
    startdate = container.b.text
    print(startdate)
    cruisestartdate.append(startdate)
    # cruise end date
    enddate = container.string
    cruiseenddate.append(enddate)
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)

# pandas dataframe
cruise = pd.DataFrame({
    'Sail Date': cruisestartdate,
    'End Date': cruiseenddate,
    #'Description': description,
    'Ship Name': cruiseshipname,
    #'imdb': imdb_ratings,
    #'metascore': metascores,
    #'votes': votes,
    #'us_grossMillions': us_gross,
})

print(soup)
print(cruisestartdate)
print(cruiseenddate)
print(itinerarydescription)
print(cruiseshipname)
Here are my results from the print:
['Sat 31 Jul 2021'] [None] [] ['Viking Sea']
container.text is a nicely formatted list of lines. Just split it and use the pieces:
cruise_div = soup.find_all('div', class_='cd-info')
# our loop through each container
for container in cruise_div:
    lines = container.text.splitlines()
    cruisestartdate.append(lines[1])
    cruiseenddate.append(lines[2])
    itinerarydescription.append(lines[3])
    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
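To see why those indexes work, here is an offline sketch of what splitlines() returns; the string below is a trimmed, hypothetical stand-in for container.text (the real flattened text starts with a blank line, which is why the data begins at index 1):

```python
# Trimmed stand-in for container.text from the sample cd-info div
text = """
From Sat 31 Jul 2021
(To Fri 20 Aug 2021)
20 Night New! Malta, The Adriatic & Greece
Ship"""

lines = text.splitlines()
print(lines[1])  # From Sat 31 Jul 2021
print(lines[2])  # (To Fri 20 Aug 2021)
print(lines[3])  # 20 Night New! Malta, The Adriatic & Greece
```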
Another solution, using bs4 API:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.content, "html.parser")
all_data = []
for div in soup.select("div.cd-info"):
    from_ = div.b
    to_ = from_.find_next_sibling(text=True).strip()
    # remove (To )
    to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
    desc = from_.find_next("b").get_text(strip=True)
    # remove html chars
    desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
    ship_name = div.a.get_text(strip=True)
    all_data.append(
        [
            from_.get_text(strip=True),
            to_,
            desc,
            ship_name,
        ]
    )

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
4 Fri 13 Aug 2021 Mon 23 Aug 2021 10 Night New! Malta & Greek Isles Discovery Viking Venus
5 Fri 13 Aug 2021 Thu 2 Sep 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
6 Sat 14 Aug 2021 Sat 21 Aug 2021 7 Night Iceland's Natural Beauty Viking Sky
7 Mon 16 Aug 2021 Mon 13 Sep 2021 28 Night Star Collector: From Greek Gods to Gaudí Wind Surf
8 Mon 16 Aug 2021 Fri 3 Sep 2021 18 Night Star Collector: Flamenco of the Mediterranean Wind Surf
9 Mon 16 Aug 2021 Wed 29 Sep 2021 44 Night Star Collector: Myths & Masterpieces of the Mediterranean Wind Surf
and saves data.csv.
EDIT: To scrape the data from all pages:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {"Accept-Language": "en-US, en;q=0.5"}
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
offset = 1
all_data = []

while True:
    u = url + f"&offset={offset}"
    print(f"Getting offset {offset=}")
    results = requests.get(u, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")

    for div in soup.select("div.cd-info"):
        from_ = div.b
        to_ = from_.find_next_sibling(text=True).strip()
        # remove (To )
        to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
        desc = from_.find_next("b").get_text(strip=True)
        # remove html chars
        desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
        ship_name = div.a.get_text(strip=True)
        all_data.append(
            [
                from_.get_text(strip=True),
                to_,
                desc,
                ship_name,
            ]
        )

    # get next offset:
    offset = soup.select_one(".cd-buttonlist > b + a")
    if offset:
        offset = int(offset.get_text(strip=True).split("-")[0])
    else:
        break
df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
Getting offset offset=1
Getting offset offset=11
Getting offset offset=21
...
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
...
144 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nights Mediterranean MSC Splendida
145 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nachte Grose Freiheit - Schwedische Kuste 1 Mein Schiff 6
146 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Night Iceland's Natural Beauty Viking Jupiter
Getting the start date, description, and ship name is straightforward. The only trick here is the end date, which does not lie inside a specific tag. To get it, you have to tweak some regex. Here is one solution similar to @Andrej's:
from bs4 import BeautifulSoup
import re
html = """
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
bold = soup.find_all('b')
print(f"Start date: {bold[0].text}")
print(f"Description: {bold[1].text}")
untagged_line = re.sub("\n", "", soup.find(text=re.compile('To')))
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(f"End date: {end_date}")
ship = soup.find('a', class_='red')
print(f"Ship: {ship.text}")
Output:
Start date: Sat 31 Jul 2021
Description: 20 Night New! Malta, The Adriatic & Greece
End date: Fri 20 Aug 2021
Ship: Viking Sea
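The (To ...) unwrapping step can be checked on its own; here is a minimal stdlib sketch using the same substitution:

```python
import re

# The backreference \1 keeps only what the group captured inside "(To ...)"
untagged_line = "(To Fri 20 Aug 2021)"
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(end_date)  # Fri 20 Aug 2021
```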
I have a column of values as below:
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I require the output as:
array(["Mar'18", "Jun'18", "Sep'18", "Dec'18", "Mar'19",
       "Jun'19", "Sep'19", "Dec'19", "Mar'20", "Jun'20",
       "Sep'20", "Dec'20"], dtype=object)
I have tried the following code,
df['Period'] = df['Period'].replace({'20': ''})
But it wasn't converting; how do I replace it?
Any help?
Thanks
With your shown samples, please try the following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20
try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which holds the last two digits in this case.
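You can check the same pattern outside pandas with the stdlib re module on a single sample value:

```python
import re

# " 2018" matches \s\d{2}(\d{2}); the captured "18" comes back as \1
print(re.sub(r"\s\d{2}(\d{2})", r"'\1", "Mar 2018"))  # Mar'18
```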
Your code (slightly changed so it runs) will not get you what you need, as it will replace all occurrences of '20':
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
Another way, without regex, is to use the vectorized str methods:
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20
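If every value is guaranteed to parse as a month-year, you can also go through an actual date parse, which validates the input along the way. This is a stdlib sketch; the pandas equivalent would presumably be pd.to_datetime(df['Period'], format='%b %Y').dt.strftime("%b'%y"):

```python
from datetime import datetime

def shorten(period):
    """Turn 'Mar 2018' into "Mar'18" via a real date round-trip."""
    return datetime.strptime(period, '%b %Y').strftime("%b'%y")

print(shorten('Mar 2018'))  # Mar'18
print(shorten('Dec 2020'))  # Dec'20
```

A malformed value like 'Mar 18' would raise a ValueError instead of silently producing garbage.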
The code works fine; however, the URL I'm fetching the table from seems to have headers repeated throughout the table. I'm not sure how to deal with this and remove those rows, as I'm trying to get the data into BigQuery and certain characters aren't allowed.
URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html')
driver.quit()

tables = soup.find_all('table', {"id": ["schedule"]})
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]

json_string = ''
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'

with open('example.json', 'w') as f:
    f.write(json_string)
print(json_string)
You can select only the tr rows that have no class, so that you don't pick up the duplicate header rows.
The following code creates a dataframe from the table
from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id": "div_schedule"}).find("table")
columns = [i.get_text() for i in table.find("thead").find_all('th')]
data = []
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ... .. ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors
[80 rows x 10 columns]
In order to insert into BigQuery, you can insert JSON directly, or load a dataframe with DataFrame.to_gbq: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
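If you already have the table in a DataFrame with the header text repeated as data rows, you can also drop those rows after the fact. A sketch with made-up data shaped like the scraped schedule (the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical frame where the header text ("Date", "PTS") repeats as data rows,
# the way basketball-reference repeats its table headers mid-table.
df = pd.DataFrame({
    'Date': ['Sat, Aug 1, 2020', 'Date', 'Sun, Aug 2, 2020'],
    'PTS': ['125', 'PTS', '110'],
})
clean = df[df['Date'] != 'Date'].reset_index(drop=True)
print(clean)
```

This keeps only the rows whose 'Date' cell is a real date rather than a repeated header label.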
I have a test log like the one below and am trying to read it in a better way. I got a KeyError while adding elements to the dictionary: when checking the if condition, no output is generated, and the elif raises the KeyError.
Jan 23 2016 10:30:08AM - bla bla Server-1A linked
Jan 23 2016 11:04:56AM - bla bla Server-1B linked
Jan 23 2016 1:18:32PM - bla bla Server-1B dislinked from server
Jan 23 2016 4:16:09PM - bla bla DOS activity from 201.10.0.4
Jan 23 2016 9:43:44PM - bla bla Server-1A dislinked from server
Feb 1 2016 12:40:28AM - bla bla Server-1A linked
Feb 1 2016 1:21:52AM - bla bla DOS activity from 192.168.123.4
Mar 29 2016 1:13:07PM - bla bla Server-1A dislinked from server
Code
result = []
_dict = {}
spu = []

with open(r'C:\Users\Desktop\test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        conn_disconn = rest.split(' ')[3]
        server_name = rest.split(' ')[2]
        if line.strip()[-1].isdigit():
            dos = re.findall('[0-9]+(?:\.[0-9]+){3}', line)
            spu.extend(dos)
        ## Error part is below
        if conn_disconn == 'linked':
            dict_to_append = {server_name: [(conn_disconn, date)]}
            print(dict_to_append)
            _dict[server_name] = dict_to_append
            result.append(dict_to_append)
        elif conn_disconn == 'dislinked':
            _dict[server_name][server_name].append(conn_disconn, date)
            del _dict[server_name]

print(result)
Expected output
[{'Server-1A': [('linked', 'Jan 23 2016 11:30:08AM'), ('dislinked', 'Jan 23 2016 10:43:44PM')]},
{'Server-1B': [('linked', 'Jan 23 2016 12:04:56AM'), ('dislinked', 'Jan 23 2016 2:18:32PM')]},
{'Server-1A': [('linked', 'Feb 1 2016 1:40:28AM'), ('dislinked', 'Mar 29 2016 2:13:07PM')]},
{'Server-1A': [('linked', 'Jan 23 2016 11:30:08AM'), ('dislinked', 'Jan 23 2016 10:43:44PM')]},
{'Server-1B': [('linked', 'Jan 23 2016 12:04:56AM'), ('dislinked', 'Jan 23 2016 2:18:32PM')]},
{'Server-1A': [('linked', 'Feb 1 2016 1:40:28AM'), ('dislinked', 'Mar 29 2016 2:13:07PM')]},
{'Server-1A': [('linked', 'Jan 23 2016 11:30:08AM'), ('dislinked', 'Jan 23 2016 10:43:44PM')]},
{'Server-1B': [('linked', 'Jan 23 2016 12:04:56AM'), ('dislinked', 'Jan 23 2016 2:18:32PM')]},
{'Server-1A': [('linked', 'Feb 1 2016 1:40:28AM'), ('dislinked', 'Mar 29 2016 2:13:07PM')]},
{Dos:['201.10.0.4','192.168.123.4']}]
When you check if conn_disconn == 'linked':, conn_disconn actually contains 'linked\n', so nothing is added to the dictionary, and the elif branch then raises the KeyError.
import re

result = []
_dict = {}
spu = []

with open(r'C:\Users\Desktop\test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        conn_disconn = rest.split(' ')[3].strip()
        server_name = rest.split(' ')[2]
        if line.strip()[-1].isdigit():
            dos = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
            spu.extend(dos)
        ## Error part is below
        if conn_disconn == 'linked':
            dict_to_append = {server_name: [(conn_disconn, date)]}
            print(dict_to_append)
            _dict[server_name] = dict_to_append[server_name]
            result.append(dict_to_append)
        elif conn_disconn == 'dislinked':
            _dict[server_name].append((conn_disconn, date))
            del _dict[server_name]

print(result)
Output:
[{'Server-1A': [('linked', 'Jan 23 2016 10:30:08AM'), ('dislinked', 'Jan 23 2016 9:43:44PM')]}, {'Server-1B': [('linked', 'Jan 23 2016 11:04:56AM'), ('dislinked', 'Jan 23 2016 1:18:32PM')]}, {'Server-1A': [('linked', 'Feb 1 2016 12:40:28AM'), ('dislinked', 'Mar 29 2016 1:13:07PM')]}]
append takes one argument, but you have passed two in some places. Look at this line's append parameters in your code:
_dict[server_name][server_name].append(conn_disconn,date)
Instead, you need to add parentheses in order to pass a tuple, like this:
_dict[server_name][server_name].append((conn_disconn,date))
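To see the difference in isolation, here is a minimal sketch: the tuple form works, while passing two arguments raises a TypeError.

```python
pairs = []
pairs.append(('linked', 'Jan 23 2016 10:30:08AM'))  # one tuple argument: OK

try:
    pairs.append('dislinked', 'Jan 23 2016 9:43:44PM')  # two arguments: error
except TypeError as e:
    print(e)  # append() takes exactly one argument (2 given)

print(pairs)
```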
Try this:
data = []
dff.seek(0)
for line in dff:
    try:
        date = re.search(r'\b^.*PM|\b^.*AM', line).group()
        server = re.search(r'\b(?:Server-\d[A-Z]|Server-1B)\b', line).group()
        linked = re.search(r'\b(?:linked|dislinked)\b', line).group().split()[0]
    except:
        continue
    data.append({server: [(linked, date)]})
data
Out[2374]:
#[{'Server-1A': [('linked', 'Jan 23 2016 10:30:08AM')]},
# {'Server-1B': [('linked', 'Jan 23 2016 11:04:56AM')]},
# {'Server-1B': [('dislinked', 'Jan 23 2016 1:18:32PM')]},
# {'Server-1A': [('dislinked', 'Jan 23 2016 9:43:44PM')]},
# {'Server-1A': [('linked', 'Feb 1 2016 12:40:28AM')]},
# {'Server-1A': [('dislinked', 'Mar 29 2016 1:13:07PM')]}#]
I am trying to extract dates from text in Python. These are the possible texts and date patterns in them:
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far,
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1,2.
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print match.group()
This somewhat captures cases 3, 4, and 5, but it prints everything from the text. So in the case below, I want 25 Nov 2016, but the regex pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1: how do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print match.group()
Question 2: Similarly, how do I capture cases 6, 7, and 8? What should the regex be for that?
If not, is there a better way to capture dates from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note I made all groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: In case you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching the 4- or 2-digit whole words after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
^^^^^^
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = ["Auction details: 14 December 2016, Pukekohe Park","Auction details: 17 Feb 2017, Gold Sacs Road","Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)","Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)","Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)","Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)","Auction details: Thursday, 28th February '19","Auction details: Friday, 1st February '19","Friday 25 Nov 3:00 p.m. (On Site)(2016)"]
for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016