Can't locate and capture a few fields out of some unstructured HTML - Python

I'm trying to scoop out four fields from a webpage using the BeautifulSoup library. It's hard to identify the fields individually, and that is the reason I seek help.
Sometimes both emails are present, but that is not always the case. I used indexing to capture the emails for this example, but surely that is the worst way to go about it. Moreover, with the following attempt I can only parse the caption of the email, not the email address itself.
Here is what I've tried (minimal working example):
from bs4 import BeautifulSoup
html = """
<p>
<strong>
Robert Romanoff
</strong>
<br/>
146 West 29th Street, Suite 11W
<br/>
New York, New York 10001
<br/>
Telephone: (718) 527-1577
<br/>
Fax: (718) 276-8501
<br/>
Email:
<a href="mailto:robert#absol.com">
robert#absol.com
</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:bill#absol.com">
bill#absol.com
</a>
</p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name,contact_email,additional_contact,additional_email)
Current output:
Robert Romanoff Email: William Locantro Email:
Expected output:
Robert Romanoff robert#absol.com William Locantro bill#absol.com

For more complex HTML/XML parsing you should take a look at XPath, which allows very powerful selector rules.
In Python it's available via the parsel package.
from parsel import Selector
html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
# re:test() is an EXSLT regex extension that parsel's XPath supports
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff robert#absol.com William Locantro bill#absol.com

You can do it like this:
1. Select the <div> that has the data you need.
2. Create a list of the data present inside the selected <div>.
3. Iterate over the list and extract the data you require.
Here is the code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.nyeca.org/find-a-contractor-by-name/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
d = soup.find_all('div', class_='sabai-directory-body')
for i in d:
    x = i.text.strip().split('\n')
    data = [x[0].strip()]  # the first line is the contractor's name
    for item in x:
        if item.startswith('Email'):
            data.append(item.split(':')[1].strip())
        elif item.startswith('Additional'):
            data.append(item.split(':')[1].strip())
    print(data)
This gives a list of each contractor's details, plus the additional contact details (if any).
['Ron Singh', 'rsingh#atechelectric.com']
['George Pacacha', 'Office#agvelectricalservices.com']
['Andrew Drazic', 'ADrazic#atjelectrical.com']
['Albert Barbato', 'Abarbato#abelectriccorp.com']
['Ralph Sica', 'Ralph.Sica#abm.com', 'Henry Kissinger', 'Henry.Kissinger#abm.com']
['Robert Romanoff', 'robert#absoluteelectric.com', 'William Locantro', 'bill#absoluteelectric.com']
...

Here is a solution you can try:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]
email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]
print(" ".join(i + " " + j for i, j in zip(names_, email_)))
Robert Romanoff robert#absol.com William Locantro bill#absol.com
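Note that the zip-based pairing above assumes the names and emails line up one-to-one. If an entry might be missing an email, here is a minimal BeautifulSoup sketch (reusing the html variable from the question) that instead pairs each "Email:" caption directly with the <a> tag that follows it:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
emails = []
for caption in soup.find_all(text=re.compile("Email")):
    a = caption.find_next("a")  # the mailto link right after this caption
    if a:
        emails.append(a.text.strip())
print(emails)  # ['robert#absol.com', 'bill#absol.com']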

Related

Python BeautifulSoup - Create dataframe using html tags between <div>

I have an HTML website without any tables, and I want to scrape the data in the form of a table.
Here is the sample HTML:
<div class='ah-content'>
<h4>XYZ Community</h4>
<p>123 Street</p>
<p>Atlanta, Georgia, 12345</p>
<p>1234567890</p>
</div>
It is a long list like this, and I want to capture the <h4> and <p> contents between each <div>.
So the output will be:
Name           Address     Address2                 Phone
xyz Community  123 Street  Atlanta, Georgia, 12345  1234567890
If every <div class='ah-content'> follows the same pattern as in your example, you can use this script to create a DataFrame:
import pandas as pd
from bs4 import BeautifulSoup
html_doc = """\
<div class='ah-content'>
<h4>XYZ Community</h4>
<p>123 Street</p>
<p>Atlanta, Georgia, 12345</p>
<p>1234567890</p>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
# one row per .ah-content div: the text of every child tag, in document order
strings = [[t.text for t in c.find_all()] for c in soup.select(".ah-content")]
df = pd.DataFrame(strings, columns=["Name", "Address", "Address2", "Phone"])
print(df.to_markdown(index=False))
Prints:
| Name          | Address    | Address2                | Phone      |
|:--------------|:-----------|:------------------------|:-----------|
| XYZ Community | 123 Street | Atlanta, Georgia, 12345 | 1234567890 |
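One caveat: if some divs have fewer child tags (say, a missing phone <p>), the rows will come out with unequal lengths and the DataFrame constructor will raise an error. A hedged guard (an assumption, not part of the answer above) is to pad short rows first:
# assumption: rows may be shorter than the column list, never longer
columns = ["Name", "Address", "Address2", "Phone"]
strings = [row + [None] * (len(columns) - len(row)) for row in strings]
df = pd.DataFrame(strings, columns=columns)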

How to scrape content from a website with no class or id specified in attribute with BeautifulSoup4

I want to scrape separate pieces of content: the text in the 'a' tag (i.e. only the name, "42mm Architecture"), plus 'Scope of services, Types of Built Projects, Locations of Built Projects, Style of work, Website' as CSV file headers with their content, for the whole webpage.
The elements have no class or ID associated with them, so I am stuck on how to extract those details properly; there are also 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit of a time-consuming one! The webpage is not well structured and has few tags and identifiers. On top of that, they haven't even spell-checked the content: one place has the heading Scope of Services and another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will help you if you also have pagination in mind.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the ones without any class name
h2 = soup.find_all('h2', class_='')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold on the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were bold, the entry with the most bold tags has all the headers
headers = headers[header_len.index(max(header_len))]
# remove the trailing ':' from the headings
headers = [i[:len(i) - 1] for i in headers]
# insert a new heading for the firm name
headers.insert(0, 'Firm')

# n traverses the headers list, k traverses the a_tags list
n = 1
k = 0

# this is the difficult part: each content string packs all the details,
# headings included, into one value, like this:
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# so I split on ':' and then slice each piece at the start of the next heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers) - 1:
                n += 1
    n = 1
    # merge the extra values in the list, if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# write to a csv file
# if you don't want a blank line between rows, add newline='' to the open() call
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output (screenshot omitted).
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
Sorry, I forgot to say the most important part: if a tag has no class name, you can still get it with what I used in the code above
h2 = soup.find_all('h2', class_='')
This says: give me all the h2 tags that do not have a class name. The absence of a class can itself act as a unique identifier, since we are using the empty class value to select them.
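A quick self-contained check of that trick (a sketch; the two-tag snippet here is made up for illustration):
from bs4 import BeautifulSoup

demo = "<h2 class='title'>skip me</h2><h2>grab me</h2>"
demo_soup = BeautifulSoup(demo, 'lxml')
print(demo_soup.find_all('h2', class_=''))  # [<h2>grab me</h2>]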
You can use this example as a basis for how to scrape the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

parent = soup.select_one("div.govspeak")
# normalize a couple of inconsistent key spellings on the page
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        # each <li> looks like "key: value"; the := walrus needs Python 3.8+
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).

Select and store values between line-breaks into list

I'm trying to select postcodes that sit between line breaks and store them in a list. However, given the nature of the web page, I am finding it very difficult. I'm trying to organise this by hospital name; gathering the hospital names is fine, but there is a lot of text between the line breaks, so getting only the postcode is challenging.
Here is what I have tried:
import requests
from bs4 import BeautifulSoup

l = ['http://www.wales.nhs.uk/ourservices/directory/Hospitals/92',
     'http://www.wales.nhs.uk/ourservices/directory/Hospitals/62']
waleit = {'hospital': [], 'address': []}
asit = []
for url in l:
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for s in soup.find('div', {'style': 'width:500px; float:left; '}):
        soupy = s.find_next('h1')
        try:
            waleit['hospital'].append(soupy.text)
        except AttributeError:
            continue
    wel = soup.find('div', {'style': 'width:500px; float:left; '})
    for a in wel.childGenerator():
        x = [a]
        print(x[0])
which prints:
<h1>Bronglais General Hospital</h1>
Caradoc Road, Aberystwyth
<br/>
SY23 1ER
<br/>
<br/>
Tel: 01970 623131
<br/>
<br/>
Type of Hospital: Major acute - Major A&E - Open 24 hours
<br/>
How do I extract specific text between <br>...<br>, like the postcode?
expected output:
{'hospital': ['Bronglais General Hospital', 'Glan Clwyd Hospital'],
'address': ['SY23 1ER','LL18 5UJ']}
You are almost there. See below; note I used css selectors instead of find(); I just prefer them:
import requests
from bs4 import BeautifulSoup

urls = ['http://www.wales.nhs.uk/ourservices/directory/Hospitals/92',
        'http://www.wales.nhs.uk/ourservices/directory/Hospitals/62']
waleit = {'hospital': [], 'address': []}
hospitals, addresses = [], []
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    # stripped_strings yields the div's text nodes in document order:
    # [0] hospital name, [1] street, [2] postcode, ...
    data = list(soup.select_one("div[style='width:500px; float:left; ']").stripped_strings)
    hospitals.append(data[0])
    addresses.append(data[2])
waleit['hospital'] = hospitals
waleit['address'] = addresses
waleit
Output:
{'hospital': ['Bronglais General Hospital', 'Glan Clwyd Hospital'],
'address': ['SY23 1ER', 'LL18 5UJ']}
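If the postcode is not always the third text node, matching it by pattern is more defensive. A sketch reusing the data list from the loop above (the UK postcode regex here is simplified, not exhaustive):
import re

# pick the first stripped string that looks like a UK postcode
postcode = next((s for s in data
                 if re.fullmatch(r'[A-Z]{1,2}\d{1,2}[A-Z]? ?\d[A-Z]{2}', s)), None)
print(postcode)  # 'LL18 5UJ' for the last url processed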

BeautifulSoup to scrape street address

I am using the code at the very bottom to get the weblink and the Masjid name; however, I would also like to get the denomination and street address. Please help, I am stuck.
Currently I am getting the following
Weblink:
<div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">
and Masjid name
<b>Masjid Al-Hijrah</b>
But I would like to get the below:
Denomination
<b>Denomination:</b> Sunni (Traditional)
and street address
<br>45 Station Street (Sydney)
The code below scrapes the following HTML:
<td width=25><img src='http://www.halalfire.com/images/en/photo_small.jpg' alt='Masjid Al-Hijrah' title='Masjid Al-Hijrah' border=0 width=48 height=36></a></td><td width=10><img src="http://www.salatomatic.com/images/spacer.gif" width=10 border=0></td><td nowrap><div class="subtitleLink"><b>Masjid Al-Hijrah</b> </div><div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)<br>45 Station Street (Sydney) </div></td><td align=right valign=center><div class="tinyLink"></div></td>
CODE:
from bs4 import BeautifulSoup
import urllib2

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class": "subtitleLink"})
for result in results:
    br = result.find('b')
    a = result.find('a')
    currenturl = a.get('href')
    if not currenturl.startswith("http"):
        currenturl = "http://www.salatomatic.com" + currenturl
        print currenturl
    elif currenturl.startswith("http"):
        print a.get('href')
    pos = br.get_text()
    print pos
You can check the next <div> element with a class attribute of value tinyLink that contains both a <b> and a <br> tag, and extract their strings:
...
print pos
div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
if div and div.b and div.br:
    print(div.b.next_sibling.string)
    print(div.br.next_sibling.string)
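For reference, a self-contained sketch of that idea against the snippet from the question (Python 3 syntax here, unlike the urllib2 code above):
from bs4 import BeautifulSoup

snippet = ('<div class="subtitleLink"><b>Masjid Al-Hijrah</b></div>'
           '<div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)'
           '<br>45 Station Street (Sydney)</div>')
soup = BeautifulSoup(snippet, 'lxml')
div = soup.find('div', attrs={'class': 'tinyLink'})
print(div.b.next_sibling.string.strip())   # Sunni (Traditional)
print(div.br.next_sibling.string.strip())  # 45 Station Street (Sydney)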

Get text from <a> element?

I would like to get the school name, "Perkins College...", from this link using BeautifulSoup.
The code I use returns nothing.
school = soup.find('a','profiles-show-school-name-sm-link')
print 'school: ', school
print 'school.text: ', school.text
output:
school: <a class="profiles-show-school-name-sm-link" href="/profiles/show/online-degrees/stephen-f-austin-state-university/perkins-college-of-education-undergraduate/395/5401">
<img border="0" src="/images/profiles/243x60/4613/degrees/undergraduate-certificate-in-hospitality-administration.png"/>
</a>
school.text:
Suggestions for a BeautifulSoup implementation to extract school name (not URL)? Thx!
school = soup.find('a','profiles-show-school-name-sm-link')
url = school['href']
Assuming the school is always in the same spot in the url:
for i in range(5):
    url = url[url.find("/")+1:]
schoolname = url[:url.find("/")]
print " ".join(schoolname.split("-")).title()
Yields:
Perkins College Of Education Undergraduate
Getting the University
url = school['href']  # start again from the original href
for i in range(4):
    url = url[url.find("/")+1:]
university = url[:url.find("/")]
print " ".join(university.split("-")).title()
Yields:
Stephen F Austin State University
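A slightly less index-heavy variant of the same idea, splitting the href once (a sketch, under the same assumption that the path layout is fixed):
parts = school['href'].strip('/').split('/')
# e.g. ['profiles', 'show', 'online-degrees',
#       'stephen-f-austin-state-university',
#       'perkins-college-of-education-undergraduate', '395', '5401']
university = parts[3].replace('-', ' ').title()
college = parts[4].replace('-', ' ').title()
print(college)     # Perkins College Of Education Undergraduate
print(university)  # Stephen F Austin State University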
