Python Regex: Data scrape <li>data<ol><li>data</li></ol></li> - python

I have an HTML file that I read with Python, and I would like to customize how it is printed.
For each country, I want to print the country name followed by the names of the players who belong to it.
My HTML file looks like this:
<ul>
<li>
Australia
<ol>
<li>Steve Smith</li>
<li>David Warner</li>
<li>Aaron Finch</li>
</ol>
</li>
<li>
Bangladesh
<ol>
<li>Shakib Al Hasan</li>
<li>Tamim Iqbal</li>
<li>Mushfiqur Rahim</li>
</ol>
</li>
<li>
England
<ol>
<li>Ben Stokes</li>
<li>Joe Root</li>
<li>Eoin Morgan</li>
</ol>
</li>
Now I want to scrape this data from my HTML file:
Australia - Steve Smith, David Warner, Aaron Finch
Bangladesh - Shakib Al Hasan, Tamim Iqbal, Mushfiqur Rahim
England - Ben Stokes, Joe Root, Eoin Morgan
But so far I can only scrape the players' names. This is my code:
import re
file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()
pat = re.compile(r'<li>(.*?)</li>')
result = pat.findall(team)
res = ", ".join([str(player) for player in result])
print(res)
Also, I don't use any package like bs4. I would like to solve this using regex.

Here is a solution using regex.
import re
file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()
# [A-Za-z ] rather than [A-z ]: the A-z range also matches [, \, ], ^, _ and `
regex = re.compile(r'<li>\s+(?P<country>[A-Za-z ]+)|<li>(?P<name>[A-Za-z ]+)</li>')
country_team_rel = {}
country = None
for result in regex.findall(team):
    if result[0]:
        country = result[0]
        country_team_rel[country] = []
    else:
        country_team_rel[country].append(result[1])
# Or If you like to print
buffer = []
for result in regex.findall(team):
    if result[0]:
        if buffer:
            print(", ".join(buffer))
            buffer = []
        print(result[0] + " - ", end='')
    else:
        buffer.append(result[1])
print(", ".join(buffer))
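A variation on the same regex idea, in case it is useful: match each country block first, then pull the players out of that block, so the pairing does not depend on the order of matches. This is only a sketch. It assumes the markup looks exactly like the sample in the question (a country name directly followed by an <ol>), and the name teams_from_html is made up for illustration.

```python
import re

# Assumption: every country <li> is directly followed by an <ol> of players.
COUNTRY_BLOCK = re.compile(r'<li>\s*([^<]+?)\s*<ol>(.*?)</ol>', re.DOTALL)
PLAYER = re.compile(r'<li>\s*(.*?)\s*</li>', re.DOTALL)


def teams_from_html(html):
    """Return {country: [players]} for markup shaped like the question's sample."""
    return {
        country: PLAYER.findall(players_html)
        for country, players_html in COUNTRY_BLOCK.findall(html)
    }


# Shortened copy of the question's HTML, used only to show the call.
sample = """<ul>
<li>
Australia
<ol>
<li>Steve Smith</li>
<li>David Warner</li>
</ol>
</li>
<li>
England
<ol>
<li>Joe Root</li>
</ol>
</li>
</ul>"""

for country, players in teams_from_html(sample).items():
    print(country, "-", ", ".join(players))
```

Like any regex over HTML, this breaks as soon as the tags gain attributes or the nesting changes.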

As already suggested, BeautifulSoup is the right tool for this task:
import bs4
file_name = "team.html"
mode = "r"
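with open(file_name, mode) as fp:
    team = fp.read()
# name the parser explicitly; bs4 warns when it has to guess
soup = bs4.BeautifulSoup(team, "html.parser")
country = None
for i in soup.find_all('li'):
    if '\n' in i.text:
        # an outer <li> (country): print the previous country before starting a new one
        if country:
            print(country, '-', ', '.join(players))
        country = i.text.splitlines()[1].strip()
        players = []
    else:
        players.append(i.text)
print(country, '-', ', '.join(players))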

It is probably a mistake to use regex in this case (I am not 100% sure).
You should use Beautiful Soup
or another HTML parser.
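If installing a package is the obstacle, the standard library's html.parser is one such "other parser". The sketch below tracks how deeply nested the current <li> is: text at depth 1 is treated as a country, text at depth 2 as a player. The class and function names are made up for illustration, and it assumes the two-level list structure shown in the question.

```python
from html.parser import HTMLParser


class TeamParser(HTMLParser):
    """Collects {country: [players]} from a two-level <li> structure."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <li> elements are currently open
        self.country = None   # country whose players are being read
        self.teams = {}

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "li":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self.depth == 1:
            self.country = text
            self.teams[text] = []
        elif self.depth == 2 and self.country is not None:
            self.teams[self.country].append(text)


def parse_teams(html):
    parser = TeamParser()
    parser.feed(html)
    parser.close()
    return parser.teams


sample = """<ul>
<li>
Australia
<ol>
<li>Steve Smith</li>
<li>David Warner</li>
</ol>
</li>
</ul>"""

for country, players in parse_teams(sample).items():
    print(country, "-", ", ".join(players))
```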

Related

Can't locate and capture few fields out of some unstructured html

I'm trying to scoop out four fields from a webpage using BeautifulSoup library. It's hard to identify the fields individually and that is the reason I seek help.
Sometimes both emails are present but that is not always the case. I used indexing to capture the email for this example but surely this is the worst idea to go with. Moreover, with the following attempt I can only parse the caption of the email, not the email address.
I've tried with (minimum working example):
from bs4 import BeautifulSoup
html = """
<p>
<strong>
Robert Romanoff
</strong>
<br/>
146 West 29th Street, Suite 11W
<br/>
New York, New York 10001
<br/>
Telephone: (718) 527-1577
<br/>
Fax: (718) 276-8501
<br/>
Email:
<a href="mailto:robert#absol.com">
robert#absol.com
</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:bill#absol.com">
bill#absol.com
</a>
</p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name,contact_email,additional_contact,additional_email)
Current output:
Robert Romanoff Email: William Locantro Email:
Expected output:
Robert Romanoff robert#absol.com William Locantro bill#absol.com
For more complex HTML/XML parsing, take a look at XPath, which allows very powerful selector rules.
In Python it is available in the parsel package.
from parsel import Selector
html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff robert#absol.com William Locantro bill#absol.com
You can do it like this:
Select the <div> that has the data you need.
Create a list of the data present inside the above selected <div>
Iterate over the list and extract the data you require.
Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.nyeca.org/find-a-contractor-by-name/'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
d = soup.find_all('div', class_='sabai-directory-body')
for i in d:
    x = i.text.strip().split('\n')
    data = [x[0].strip()]
    for item in x:
        if item.startswith('Email'):
            data.append(item.split(':')[1].strip())
        elif item.startswith('Additional'):
            data.append(item.split(':')[1].strip())
    print(data)
Gives a list of the contractor details and also additional details (if any).
['Ron Singh', 'rsingh#atechelectric.com']
['George Pacacha', 'Office#agvelectricalservices.com']
['Andrew Drazic', 'ADrazic#atjelectrical.com']
['Albert Barbato', 'Abarbato#abelectriccorp.com']
['Ralph Sica', 'Ralph.Sica#abm.com', 'Henry Kissinger', 'Henry.Kissinger#abm.com']
['Robert Romanoff', 'robert#absoluteelectric.com', 'William Locantro', 'bill#absoluteelectric.com']
.
.
Here is a solution you can try:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]
# note: matching on "absol" only works for this sample's addresses
email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]
print(" ".join(i + " " + j for i, j in zip(names_, email_)))
Robert Romanoff robert#absol.com William Locantro bill#absol.com
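Since both addresses in the sample sit in mailto: links, another option is to read them from the href attribute instead of searching the text. The sketch below does this with the standard library's html.parser rather than BeautifulSoup; MailtoCollector is a made-up name, and the approach assumes every address is wrapped in an <a href="mailto:..."> as in the question's snippet.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collects the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            self.emails.append(href[len("mailto:"):])


# Shortened copy of the question's HTML (addresses kept as they appear there).
html = """
<p>
<strong>Robert Romanoff</strong>
<br/>
Email:
<a href="mailto:robert#absol.com">robert#absol.com</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:bill#absol.com">bill#absol.com</a>
</p>
"""

collector = MailtoCollector()
collector.feed(html)
collector.close()
print(collector.emails)
```

When only one address is present the list simply has one entry, which avoids the indexing problem mentioned in the question.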

How to scrape content from a website with no class or id specified in attribute with BeautifulSoup4

I want to scrape specific pieces of content for a whole webpage: the text in the 'a' tag (i.e. only the name, "42mm Architecture"), plus 'Scope of services', 'Types of Built Projects', 'Locations of Built Projects', 'Style of work' and 'Website' as CSV file headers with their contents as the row values.
The elements have no class or ID associated with them, so I am stuck on how to extract those details properly. There are also 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit time consuming! The webpage is inconsistent and has few tags and identifiers. On top of that, the content has not been proofread: for example, one place has the heading "Scope of Services" and another has "Scope of services", and there are many more cases like that. So what I have done is a crude extraction, which should still help you, including if you want to paginate.
import requests
from bs4 import BeautifulSoup
import csv
page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')
# there are many h2 tags but we want the one without any class name
h2 = soup.find_all('h2', class_= '')
headers = []
contents = []
header_len = []
a_tags = []
for i in h2:
    if i.find_next().name == 'a': # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h =[j.text for j in p.find_all('strong')] # some headings were bold in the website
        headers.append(h)
        header_len.append(len(h))
# since only some headings were in bold the max number of bold would give all headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserted a new heading
headers.insert(0, 'Firm')
# n for traversing through headers list
# k for traversing through a_tags list
n =1
k =0
# this is the difficult part where the content will have all the details in one value including the heading like this
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it using the ':' and then splicing it from the start of the each heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n+=1
            k+=1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers)-1:
                n+=1
    n =1
    # merging those extra values in the list if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])
# writing into csv file
# if you don't want a line space between each row then add newline = '' argument in the open function below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output:
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num <13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num +=1
Hope this helps, let me know if there's any error.
EDIT 1:
Sorry, I forgot to mention the most important part: if a tag has no class name, you can still select it the way I did in the code above:
h2 = soup.find_all('h2', class_= '')
This asks for all the h2 tags that do not have a class name. The absence of a class can itself sometimes act as a unique identifier.
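The splitting step in the answer above can also be sketched with a single regular expression that cuts the flattened paragraph text at each known label. This is not the answer's method, just an alternative under stated assumptions: the label list is copied from the question's sample, matching is case-insensitive to absorb the "Scope of Services" / "Scope of services" inconsistency, and split_fields is a made-up name.

```python
import re

# Labels taken from the question's sample; extend the list for other headings.
LABELS = [
    "Scope of services",
    "Types of Built Projects",
    "Locations of Built Projects",
    "Style of work",
    "Website",
]


def split_fields(text, labels=LABELS):
    """Split 'Label: value' runs that were flattened into one string."""
    canonical = {label.lower(): label for label in labels}
    pattern = re.compile(
        "(" + "|".join(re.escape(label) for label in labels) + r")\s*:\s*",
        re.IGNORECASE,
    )
    # With one capturing group, split() yields
    # [prefix, label, value, label, value, ...]
    parts = pattern.split(text)
    return {
        canonical[label.lower()]: value.strip()
        for label, value in zip(parts[1::2], parts[2::2])
    }


flattened = (
    "Scope of services: Architecture, Interiors, Urban Design."
    "Types of Built Projects: Residential, commercial"
    "Locations of Built Projects: New Delhi and nearby states"
    "Style of work: Contemporary"
    "Website: 42mm.co.in"
)
print(split_fields(flattened))
```

Each resulting dict can be passed to csv.DictWriter with LABELS (plus a firm-name column) as the field names. A value that itself contains a label followed by a colon would be split incorrectly, so this is only as reliable as the page is regular.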
You can use this example as a basis for scraping the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}
all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    # only use the <ul> if it belongs directly to this <h3>
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice):

Parsing html elements

In my (downloaded) HTML files, the executives are listed at the top of every file (like "Dror Ben Asher" in the code below):
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further along in the HTML, these executives' names recur multiple times, each followed by a text element that I want to parse.
Example
<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade. We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects. In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>
For now I have code (see below) that identifies one executive, "Dror Ben Asher", and grabs all the text that occurs after his name in the P element. But I would like this to work for all executives, and for multiple HTML files in which different executives (from different companies) are mentioned.
import textwrap
import os
from bs4 import BeautifulSoup
directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
            txt = answer.get_text(strip=True)
            s = answer.find_next_sibling()
            while s:
                if s.name == 'strong' or s.find('strong'):
                    break
                if s.name == 'p':
                    txt += ' ' + s.get_text(strip=True)
                s = s.find_next_sibling()
            txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
            print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
Does anyone have a suggestion to tackle this challenge?
If I understand your question correctly, you could put the code in a function, to which you could pass the name you need as an argument and use that variable to construct your search strings.
for example:
def func(name_to_find):
    # some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
        # some other code
and call it like so:
func('Dror Ben Asher')
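To avoid hard-coding the names at all, the list of executives could be read from the participants block at the top of each file and fed into the function. The sketch below is a rough illustration using only the re module. It assumes the markup matches the question's snippet (an "Executives" heading in <STRONG>, then one <P>Name - Title</P> per person, ending at the next <STRONG> heading), and both function names are made up.

```python
import re


def executive_names(html):
    """Names listed under the 'Executives' heading, without their titles."""
    section = re.search(
        r"<STRONG>\s*Executives\s*</STRONG>\s*</P>(.*?)(?=<P>\s*<STRONG>|\Z)",
        html,
        flags=re.I | re.S,
    )
    if section is None:
        return []
    # Each entry looks like <P>Name - Title</P>; keep the part before " - ".
    return re.findall(r"<P>\s*([^<>]+?)\s+-\s+[^<>]+</P>", section.group(1), flags=re.I)


def selector_for(name):
    """Build the CSS selector from the answer above for one executive."""
    return (
        'p:contains("Question-and-Answer Session") ~ '
        'strong:contains("{n}") + p'.format(n=name)
    )


participants = """<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>"""

for name in executive_names(participants):
    print(name, "->", selector_for(name))
```

Each selector can then be passed to soup.select() inside the per-file loop, so the same code handles files from different companies.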

BeautifulSoup - Scraping a comment when the ID field changes

I'm scraping baseball game data for a number of seasons. Here's an example of the data.
https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml
For this question, I'm specifically looking for a way to pull out the comments that contain the umpire and game data. Note, these html files are now stored locally, so I'm trying to iterate through a folder. In the source code it looks like this:
<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>
As you can see it's inside a comment. The real challenge is that the ID value changes between venues and seasons. I'm parsing through 10 years of data. Can someone tell me how to pull the comment text when the ID actually changes?
Here's my code:
# import libraries and files
from bs4 import BeautifulSoup, Comment
import os
print
# Setup Games list for append
games = []
path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"
for filename in os.listdir(path):
    if filename.endswith(".html"):
        fullpath = os.path.join(path, filename)
        print 'Processing {:}...'.format(fullpath)
        # Get Page, Make Soup
        soup = BeautifulSoup(open(fullpath), 'lxml')
        # Setting up game object to append to list
        game = {}
        # Get Description
        # Note: Skip every other child because of 'Navigable Strings' from BS.
        divs = soup.findAll('div', {'scorebox_meta'})
        for div in divs:
            for idx, child in enumerate(div.children):
                if idx == 1:
                    game['date'] = child.text
                elif idx == 3:
                    game['start_time'] = child.text.split(':', 1)[1].strip()
                elif idx == 7:
                    game['venue'] = child.text.split(':', 1)[1].strip()
                elif idx == 9:
                    game['duration'] = child.text.split(':', 1)[1].strip()
        # Get Player Data from tables
        for comment in soup.find_all(string=lambda text:isinstance(text,Comment)):
            data = BeautifulSoup(comment,"lxml")
            for items in data.select("table tr"):
                player_data = [' '.join(item.text.split()) for item in items.select("th,td")]
                print(player_data)
            print '======================================================='
        # Get Umpire Data
        # Append game data to full list
        games.append(game)
print
print 'Results'
print '*' * 80
# Print the games harvested to the console
for idx, game in enumerate(games):
    print str(idx) + ': ' + str(game)
# Write to CSV
csvfile = "C:/Users/Benny/Desktop/anatest.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(game)
Thanks so much,
Benny
I used re module to extract the comment section:
from bs4 import BeautifulSoup
import re
data = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>"""
soup = BeautifulSoup(re.search(r'(?<=<!--)(.*?)(?=-->)', data, flags=re.DOTALL)[0], 'lxml')
umpires, time_of_game, attendance, start_time_weather = soup.select('div.section_content > div')
print('ID: ', soup.find('div', class_="section_content")['id'])
print('umpires: ', umpires.text)
print('time of game: ', time_of_game.text)
print('attendance: ', attendance.text)
print('start_time_weather: ', start_time_weather.text)
Output:
ID: div_342042674
umpires: Umpires: HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
time of game: Time of Game: 3:21.
attendance: Attendance: 33,809.
start_time_weather: Start Time Weather: 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
If you kick out those vicious signs <!--,--> from the html elements, you can easily access the content. This is how you can go:
import requests
from bs4 import BeautifulSoup
url = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
res = requests.get(url)
content = res.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(content,"lxml")
umpire, gametime, attendance, weather = soup.find_all(class_="section_content")[2]("strong")
print(f'{umpire.next_sibling}\n{gametime.next_sibling}\n{attendance.next_sibling}\n{weather.next_sibling}\n')
Output:
HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
3:21.
33,809.
70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
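Neither answer above needs the changing ID, and the same is true of a standard-library approach: html.parser calls handle_comment() for every comment, whatever element it sits in. The sketch below collects the comments and then picks out "<strong>Label:</strong> value" pairs with a regex. It is an illustration only, not taken from either answer; the class and function names are made up, and the field layout is assumed to match the snippet in the question.

```python
import re
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Stores the raw text of every HTML comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


# Assumed field layout: <div><strong>Label:</strong> value</div>
FIELD = re.compile(r"<strong>([^<:]+):</strong>\s*([^<]*)</div>")


def other_info(html):
    """Return {label: value} for labelled fields found inside comments."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    fields = {}
    for comment in collector.comments:
        for label, value in FIELD.findall(comment):
            fields[label.strip()] = value.strip()
    return fields


# Shortened copy of the question's markup; the id values play no role.
sample = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
</div>
-->
</div>"""

print(other_info(sample))
```

If two comments on a page use the same label, the later one overwrites the earlier one in this sketch.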

Get text from <a> element?

I would like to get the school name, "Perkins College..." from this link using beautifulSoup.
The code I use returns nothing.
school = soup.find('a','profiles-show-school-name-sm-link')
print 'school: ', school
print 'school.text: ', school.text
output:
school: <a class="profiles-show-school-name-sm-link" href="/profiles/show/online-degrees/stephen-f-austin-state-university/perkins-college-of-education-undergraduate/395/5401">
<img border="0" src="/images/profiles/243x60/4613/degrees/undergraduate-certificate-in-hospitality-administration.png"/>
</a>
school.text:
Suggestions for a BeautifulSoup implementation to extract school name (not URL)? Thx!
school = soup.find('a','profiles-show-school-name-sm-link')
url = school['href']
Assuming the school is always in the same spot in the url:
for i in range(5):
    url = url[url.find("/")+1:]
schoolname = url[:url.find("/")]
print " ".join(schoolname.split("-")).title()
Yields:
Perkins College Of Education Undergraduate
Getting the university works the same way, starting again from the full href:
url = school['href']
for i in range(4):
    url = url[url.find("/")+1:]
university= url[:url.find("/")]
print " ".join(university.split("-")).title()
Yields:
Stephen F Austin State University
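The same idea can be written more directly by splitting the href on "/" and indexing the segments. This is a minimal Python 3 sketch under the same assumption as above, namely that the university and school slugs are always the fourth and fifth path segments; the function names are made up.

```python
def slug_to_title(slug):
    """Turn 'perkins-college-of-education' into 'Perkins College Of Education'."""
    return " ".join(slug.split("-")).title()


def university_and_school(href):
    # Drop empty segments caused by the leading slash.
    parts = [part for part in href.split("/") if part]
    # Assumed layout: profiles/show/<category>/<university>/<school>/...
    return slug_to_title(parts[3]), slug_to_title(parts[4])


href = (
    "/profiles/show/online-degrees/stephen-f-austin-state-university/"
    "perkins-college-of-education-undergraduate/395/5401"
)
university, school = university_and_school(href)
print(university)
print(school)
```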
