I'm trying to select postcodes that sit between line-breaks and store them in a list. However, given the structure of the web-page, I am finding this very difficult. I'm trying to organise the results by hospital name; gathering the hospital names is fine, but there is a lot of other text between the line-breaks, so getting only the postcode is challenging.
Here is what I have tried:
import requests
from bs4 import BeautifulSoup

urls = ['http://www.wales.nhs.uk/ourservices/directory/Hospitals/92',
        'http://www.wales.nhs.uk/ourservices/directory/Hospitals/62']
waleit = {'hospital': [], 'address': []}

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for s in soup.find('div', {'style': 'width:500px; float:left; '}):
        soupy = s.find_next('h1')
        try:
            waleit['hospital'].append(soupy.text)
        except AttributeError:
            continue
    wel = soup.find('div', {'style': 'width:500px; float:left; '})
    for a in wel.childGenerator():
        x = [a]
        print(x[0])
which prints:
<h1>Bronglais General Hospital</h1>
Caradoc Road, Aberystwyth
<br/>
SY23 1ER
<br/>
<br/>
Tel: 01970 623131
<br/>
<br/>
Type of Hospital: Major acute - Major A&E - Open 24 hours
<br/>
How do I extract specific text that sits between <br/>...<br/>, like the postcode?
expected output:
{'hospital': ['Bronglais General Hospital', 'Glan Clwyd Hospital'],
'address': ['SY23 1ER','LL18 5UJ']}
You are almost there. See below; note that I used CSS selectors instead of find(), simply because I prefer them:
import requests
from bs4 import BeautifulSoup

urls = ['http://www.wales.nhs.uk/ourservices/directory/Hospitals/92',
        'http://www.wales.nhs.uk/ourservices/directory/Hospitals/62']
waleit = {'hospital': [], 'address': []}
hospitals, addresses = [], []

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    # stripped_strings yields the name, street address, postcode, phone, etc. in order
    data = list(soup.select_one("div[style='width:500px; float:left; ']").stripped_strings)
    hospitals.append(data[0])   # hospital name
    addresses.append(data[2])   # the postcode is the third stripped string

waleit['hospital'] = hospitals
waleit['address'] = addresses
waleit
Output:
{'hospital': ['Bronglais General Hospital', 'Glan Clwyd Hospital'],
'address': ['SY23 1ER', 'LL18 5UJ']}
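If you'd rather not rely on the postcode always being the third stripped string, you can pick it out with a rough UK-postcode pattern instead. A minimal sketch, reusing urls and waleit from above (the regex is a simplification I'm assuming here, not the full official postcode grammar):

import re
import requests
from bs4 import BeautifulSoup

POSTCODE = re.compile(r'^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$')  # rough UK postcode shape

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    strings = list(soup.select_one("div[style='width:500px; float:left; ']").stripped_strings)
    waleit['hospital'].append(strings[0])
    # keep only the strings that look like a postcode
    waleit['address'].extend(s for s in strings if POSTCODE.match(s))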
So, I am trying to scrape data from a journal.
While I can successfully scrape paper titles, keywords, and so on, and save them in a dataframe properly, when it comes to collecting authors' names, every author after the first one is stored in a new row. The same problem applies to affiliations.
This obviously makes the stored data misaligned and unrelated: instead of the same number of rows in every column, I get stuck with a useless dataframe.
It is my understanding that the problem arises because the program doesn't "know" to store all the data associated with each paper in its own row. Additionally, some papers have only one author, while others have 3-4. Authors need to be stored in a "NameSurname, NameSurname, NameSurname..." format within separate rows containing the information about each research paper: authors, affiliations, etc.
But when it comes to specifying classes that I intend to scrape, I am uncertain how to set up the Python (BS4) code properly.
Here's a snippet of the relevant code from the simple scraper:
import time

import requests
from bs4 import BeautifulSoup

title = []
authors = []
affiliations = []

for link in urls:
    page = requests.get(link)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for t in soup.select(".obj_article_details .page_title"):
        title.append(t.get_text(strip=True))
    for au in soup.select(".obj_article_details .authors .name"):
        authors.append(au.get_text(strip=True))
    for af in soup.select(".obj_article_details .item.authors .affiliation"):
        affiliations.append(af.get_text(strip=True))
    time.sleep(3)
Also, here is the structure of the section I intend to scrape:
...
<article class="obj_article_details">
<h1 class="page_title">
Lorem ipsum dolor sit amet
</h1>
<div class="row">
<div class="main_entry">
<section class="item authors">
<ul class="authors">
<li>
<span class="name">Brandon Scott </span>
<span class="affiliation"> Villanova University, Pennsylvania </span>
</li>
<li>
<span class="name">Alvaro Cote </span>
<span class="affiliation">Carleton College, Minnesota</span>
</li>
</ul>
</section>
...
What I am getting now:

+---------------+------------------------------------+
| Authors       | Affiliation                        |
+---------------+------------------------------------+
| Brandon Scott | Villanova University, Pennsylvania |
+---------------+------------------------------------+
| Alvaro Cote   | Carleton College, Minnesota        |
+---------------+------------------------------------+
| ...           | ...                                |

What I want:

+----------------------------+------------------------+
| Authors                    | Affiliation            |
+----------------------------+------------------------+
| Brandon Scott, Alvaro Cote | Villanova University...|
+----------------------------+------------------------+
| ...                        | ...                    |
+----------------------------+------------------------+
For cases like this, you should use nested loops: an outer loop over the containers ResultSet (soup.select('article.obj_article_details') here), and inner loops for the details you want (title/authors/affiliations/etc.). It is also better to build a dictionary of the details for each container and add it to a list of dictionaries than to try to bind together separate lists (you've already faced some of the issues caused by that approach).
Since you're doing the same thing for each detail (select followed by get_text), it would be more convenient to move those operations to a function like
def getText_bySelector(tagSoup, selector, sep=None):
    # an empty selector means "use the container itself"
    selTags = tagSoup.select(selector) if selector else [tagSoup]
    if isinstance(sep, str):
        # join every match into one string
        return sep.join([s.get_text(' ').strip() for s in selTags])
    # otherwise return only the first match (or None if there is none)
    return selTags[0].get_text(' ').strip() if selTags else None
(This is a variation of a function I use in most of my bs4 projects.)
If you pass a string (like ', ' or '; ') as sep, it will join all the results with it (or return an empty string "" if there are no results); otherwise, it will return the first result (or None if there are no results).
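For example, with the sample HTML above (assuming a is one container from soup.select('article.obj_article_details')):

getText_bySelector(a, '.authors .name', ', ')   # -> 'Brandon Scott, Alvaro Cote'
getText_bySelector(a, '.page_title')            # -> 'Lorem ipsum dolor sit amet'
getText_bySelector(a, '.no-such-class', ', ')   # -> ''
getText_bySelector(a, '.no-such-class')         # -> None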
Another reason I like using functions like this is that they let me use a list comprehension instead of the innermost for loop.
Then you just need to define a reference dictionary with the arguments you'll need to pass to getText_bySelector:
refDict = {
    'title': ('.page_title', None),
    'authors': ('.authors .name', ', '),
    'affiliations': ('.item.authors .affiliation', '; ')
}
Now you can build a list of dictionaries with:
dictList = []
for link in urls:
    page = requests.get(link)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    dictList += [{
        k: getText_bySelector(a, vsel, vsep)
        for k, (vsel, vsep) in refDict.items()
    } for a in soup.select('article.obj_article_details')]
The items in dictList will look like
{
    'title': 'Lorem ipsum dolor sit amet',
    'authors': 'Brandon Scott, Alvaro Cote',
    'affiliations': 'Villanova University, Pennsylvania; Carleton College, Minnesota'
}
and you can easily use pandas to view dictList as a table
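For example (a minimal sketch; the CSV filename is just an illustration):

import pandas as pd

df = pd.DataFrame(dictList)  # one row per article, one column per key in refDict
print(df)
df.to_csv('articles.csv', index=False)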
EDIT [PART 1]: Without a function, you'd just have to do the same operations in an inner for loop:
dictList = []
for link in urls:
    page = requests.get(link)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        dets = {}
        for k, (sel, sep) in refDict.items():
            selTags = a.select(sel) if sel else [a]
            if isinstance(sep, str):
                dets[k] = sep.join([s.get_text(' ').strip() for s in selTags])
            else:
                dets[k] = selTags[0].get_text(' ').strip() if selTags else None
        dictList.append(dets)
EDIT [PART 2]: If you must have separate lists:
title = []
authors = []
affiliations = []
for link in urls:
    page = requests.get(link)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        titleA = a.select_one('.page_title')
        if titleA: titleA = titleA.get_text(' ').strip()
        title.append(titleA)

        authorsA = a.select('.authors .name')
        # authors.append(', '.join([aa.get_text(' ').strip() for aa in authorsA]))
        listAuth = []
        for aa in authorsA: listAuth.append(aa.get_text(' ').strip())
        authors.append(', '.join(listAuth))
        # authors.append(listAuth)  # if you want a list instead of a string

        affA = a.select('.item.authors .affiliation')
        # affiliations.append('; '.join([aa.get_text(' ').strip() for aa in affA]))
        listAff = []
        for aa in affA: listAff.append(aa.get_text(' ').strip())
        affiliations.append('; '.join(listAff))
        # affiliations.append(listAff)  # if you want a list instead of a string
The DataFrame arguments would be a little different this time:
[To test with multiple rows, I copied the sample HTML twice and added a1/a2 to differentiate the copies.]
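A sketch of what that construction might look like, using the three lists built above:

import pandas as pd

df = pd.DataFrame({'title': title, 'authors': authors, 'affiliations': affiliations})
print(df)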
Even that can be shortened with an inner for loop and a list comprehension:
refDict = {
    'title': '.page_title', 'authors': '.authors .name',
    'affiliations': '.item.authors .affiliation'
}
listsDict = {k: [] for k in refDict}
for link in urls:
    page = requests.get(link)
    content = page.text
    soup = BeautifulSoup(content, "html.parser")
    for a in soup.select('article.obj_article_details'):
        for k in refDict:
            kvals = [t.get_text(' ').strip() for t in (
                a.select(refDict[k]) if refDict[k] else [a]
            )]
            listsDict[k].append('; '.join(kvals))
            # listsDict[k].append(kvals[0] if len(kvals) == 1 else kvals)
refDict was simplified, so you can't have different separators for different columns.
By the way, if you want multiple authors/affiliations as lists rather than joined strings, you can remove the listsDict[k].append('; '.join(kvals)) line and uncomment the next line:
listsDict[k].append(kvals[0] if len(kvals) == 1 else kvals)
Btw, with this last method, if there is more than one .page_title in a container, all of them will be included; with all my other methods, only the first title from each container would have been included. (I assumed that there would always be only one title per container.)
The important thing is that the title/authors/affiliations lists are appended to the same number of times for each container; that's why you need to iterate over the containers and append a fixed number of times from each one.
I'm trying to scoop out four fields from a webpage using the BeautifulSoup library. It's hard to identify the fields individually, and that is the reason I seek help.
Sometimes both emails are present, but that is not always the case. I used indexing to capture the emails for this example, but surely that is the worst way to go about it. Moreover, with the following attempt I can only parse the caption of the email, not the email address itself.
I've tried with (minimum working example):
from bs4 import BeautifulSoup
html = """
<p>
<strong>
Robert Romanoff
</strong>
<br/>
146 West 29th Street, Suite 11W
<br/>
New York, New York 10001
<br/>
Telephone: (718) 527-1577
<br/>
Fax: (718) 276-8501
<br/>
Email:
<a href="mailto:robert#absol.com">
robert#absol.com
</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:bill#absol.com">
bill#absol.com
</a>
</p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name,contact_email,additional_contact,additional_email)
Current output:
Robert Romanoff Email: William Locantro Email:
Expected output:
Robert Romanoff robert#absol.com William Locantro bill#absol.com
For more complex HTML/XML parsing you should take a look at XPath, which allows very powerful selector rules.
In Python it's available in the parsel package:
from parsel import Selector
html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff robert#absol.com William Locantro bill#absol.com
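parsel is not part of the standard library; it's the selector library that powers Scrapy, and you can install it with:

pip install parsel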
You can do it like this:
1. Select the <div> that has the data you need.
2. Create a list of the data present inside the selected <div>.
3. Iterate over the list and extract the data you require.
Here is the code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.nyeca.org/find-a-contractor-by-name/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

d = soup.find_all('div', class_='sabai-directory-body')
for i in d:
    x = i.text.strip().split('\n')
    data = [x[0].strip()]
    for item in x:
        if item.startswith('Email'):
            data.append(item.split(':')[1].strip())
        elif item.startswith('Additional'):
            data.append(item.split(':')[1].strip())
    print(data)
This gives a list of the contractor details, plus the additional contact details (if any):
['Ron Singh', 'rsingh#atechelectric.com']
['George Pacacha', 'Office#agvelectricalservices.com']
['Andrew Drazic', 'ADrazic#atjelectrical.com']
['Albert Barbato', 'Abarbato#abelectriccorp.com']
['Ralph Sica', 'Ralph.Sica#abm.com', 'Henry Kissinger', 'Henry.Kissinger#abm.com']
['Robert Romanoff', 'robert#absoluteelectric.com', 'William Locantro', 'bill#absoluteelectric.com']
...
Here is a solution you can try:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]
email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]
print(" ".join(i + " " + j for i, j in zip(names_, email_)))
Robert Romanoff robert#absol.com William Locantro bill#absol.com
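(A small aside: in newer versions of BeautifulSoup the text= argument to find()/find_all() is named string=, with text= kept as an alias, so the same lookup can also be written as:

soup.find(string=re.compile("Additional Contact:"))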
I want to scrape separate pieces of content: the text in the 'a' tag (i.e. only the name, "42mm Architecture"), and 'Scope of services', 'Types of Built Projects', 'Locations of Built Projects', 'Style of work', and 'Website' as CSV file headers with their content, for a whole webpage.
The elements have no class or ID associated with them, so I am stuck on how to extract those details properly; there are also 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit time-consuming! The webpage is incomplete and has few tags and identifiers. On top of that, the content hasn't even been spell-checked: e.g. one place has the heading Scope of Services and another has Scope of services, and there are many more like that! So what I have done is a crude extraction, and I'm sure it will help you if you also have pagination in mind.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')

# there are many h2 tags but we want the ones without any class name
h2 = soup.find_all('h2', class_='')

headers = []
contents = []
header_len = []
a_tags = []

for i in h2:
    if i.find_next().name == 'a':  # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h = [j.text for j in p.find_all('strong')]  # some headings were bold on the website
        headers.append(h)
        header_len.append(len(h))

# since only some headings were in bold, the entry with the most bold tags gives all the headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserting a new heading for the firm name
headers.insert(0, 'Firm')

# n for traversing through the headers list
# k for traversing through the a_tags list
n = 1
k = 0

# this is the difficult part: each content value has all the details in one string,
# headings included, like this:
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it on ':' and then slicing from the start of each heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n += 1
            k += 1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers) - 1:
                n += 1
    n = 1
    # merging those extra values in the list, if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])

# writing into the csv file
# if you don't want a blank line between rows, add the newline='' argument to the open() call below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
This was the output (screenshot of the resulting CSV omitted).
If you want to paginate, just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num < 13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num += 1
Hope this helps, let me know if there's any error.
EDIT 1:
I forgot to mention the most important part, sorry: if a tag has no class name, you can still get it with what I used in the code above:
h2 = soup.find_all('h2', class_='')
This just says "give me all the h2 tags that do not have a class name". The absence of a class can itself sometimes serve as a unique identifier, since we are using the empty class value to identify the tags.
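If you prefer CSS selectors, a similar filter can be written with :not() (a sketch; bs4's select() supports this via the soupsieve package, and it matches h2 tags with no class attribute at all):

h2 = soup.select('h2:not([class])')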
You can use this example as a basis for how to scrape the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}

all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).
So I have an html document that looks something like this:
<title>Speaker Name: Title of Talk | Subtitle | website.com</title>
... [Other Stuff]
<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span><span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>15:57</data>
I have 2200 files like this, and I am hoping to put them all into a CSV file with columns of AUTHOR, TITLE, DATE, LENGTH, and TEXT. Right now, what I have is not the prettiest code, but it works:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(file).read(), "lxml")
at = soup.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|")]
text = soup.find("div", attrs={"class": "talk-article__body"})  # still needs cleaning
date = ...    # can't work this out
length = ...  # want the LAST talk-transcript__para__time value in the file
I cannot for the life of me figure out how to get at the date: I suspect it's a combination of soup and re, but I confess that I can't wrap my head around the combination.
The trick with the length is that I need to find the LAST time <data class='talk-transcript__para__time'> occurs in the file and grab THAT value.
You can try this:
import re

date_spans = soup.find_all('span', {'class': 'meta__val'})
date = [x.get_text().strip("\n\r") for x in date_spans
        if re.search(r"(?s)[A-Z][a-z]{2}\s+\d{4}", x.get_text().strip("\n\r"))][0]
print(date)
# date = re.findall(r"(?s)<span class=.*?>\s*([A-Z][a-z]{2}\s+\d{4})", str(soup))

length_data = soup.find_all('data', {'class': 'talk-transcript__para__time'})
length = [x.get_text().strip("\n\r") for x in length_data
          if re.search(r"(?s)\d{2}:\d{2}", x.get_text().strip("\n\r"))][-1]
print(length)
# length = re.findall(r"(?s).*<data class=.*?>(.*)</data>", str(soup))
Output
Jun 2006
15:57
You don't need a regex for the date if the first meta__val is the date, and you definitely don't need one for the time, as you can just use the class name talk-transcript__para__time:
from bs4 import BeautifulSoup
h = """<title>Speaker Name: Title of Talk | Subtitle | website.com</title>
<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span><span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>15:57</data>"""
soup = BeautifulSoup(h,"html.parser")
date = soup.select_one("span.meta__val").text
time = soup.select_one("data.talk-transcript__para__time").text
print(date, time)
Output:
(u'\nJun 2006\n', u'15:57')
If you were using a regex you would pass it to find or find_all:
r = re.compile(r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}")
soup = BeautifulSoup(h, "html.parser")
date = soup.find("span", {"class": "meta__val"}, text=r).text.strip()
Which would give you:
'Jun 2006'
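If a file contains many talk-transcript__para__time tags and you want the last one, select() returns matches in document order, so just take the final element (a small sketch):

times = soup.select("data.talk-transcript__para__time")
length = times[-1].text if times else None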
def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc
This is the code that gives me the text from this HTML:
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
Read all announcements in Prestige Estate </p><p> </p>
</div>
This result is fine for me; I just want to exclude the content of "Read all announcements in Prestige Estate" from the result (desc in my script) if it is present, and ignore it if it is not. How can I do this?
You can use extract() to remove unnecessary tags from the find() result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
[s.extract() for s in descItem('a')] # remove <a> tags
return descItem.get_text() # return the text
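A variation on the same idea, if you prefer a plain loop over a list comprehension used for its side effects: Tag.decompose() likewise removes a tag from the tree (destroying it rather than detaching it the way extract() does):

descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})
for a in descItem('a'):  # descItem('a') is shorthand for descItem.find_all('a')
    a.decompose()        # remove each <a> tag and its contents
return descItem.get_text()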
Just make some changes to the last line and add the re module. (Note: for the regex to see the <a> tag, desc must hold the HTML rather than the extracted text, e.g. desc = str(soup.find('div', attrs={'class': 'op_gd14 FL'})), as the output below shows.)
import re
...
    return re.sub(r'<a(.*)</a>', '', desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>