Webscraping for a specific text string issue - python

I am trying to web-scrape definitions for Hebrew names in literature, such as on these pages:
https://www.biblestudytools.com/dictionaries/smiths-bible-dictionary/aaron.html
https://www.biblestudytools.com/dictionaries/smiths-bible-dictionary/abednego.html
I have already scraped a list of names from another site; it is currently stored in the list 'test'.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup

# create smiths dictionary
smiths_names = {}
# loop through the names in the 'test' list
for name in test:
    try:
        # make a request to the website
        url = f"https://www.biblestudytools.com/dictionaries/smiths-bible-dictionary/{name}.html"
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        # find the definition in the page; definitions are contained in <i> tags
        itags = soup.find('i')
        if itags:
            meaning = itags.get_text()
            smiths_names[name] = meaning
        else:
            print(f'{name} not found')
    except requests.exceptions.RequestException as e:
        print(e)
        print(f'{name} not found')
If I specify the text that I want, e.g. for https://www.biblestudytools.com/dictionaries/smiths-bible-dictionary/aaron.html, I can scrape what I want OK, but I want to iterate through the list of names in the 'test' list to get the definition for each name. After I get the meaning of a name, both the name and the scraped meaning are added to a Python dict as key and value.
# find the definition in the website; definitions are contained in <i> tags
itags = soup.find('i', text='a teacher, or lofty')
if itags:
    meaning = itags.get_text()
    smiths_names[name] = meaning
I need to be able to get the definitions from each of these dictionary entries, such as the ones at the links above.
Thanks

You can use
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
to return all of the elements you are looking for, whereas
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
returns just the first element that matches your target.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
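Note also that text='a teacher, or lofty' in the question requires an exact match of the whole string. If a partial match is enough, a compiled regex can be passed instead (a small sketch; the search word here is just an example):
import re
# match any <i> tag whose text contains 'teacher' rather than the exact string
itag = soup.find('i', text=re.compile('teacher'))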
Try using the following code instead:
# find all the <i> tags
itags = soup.find_all('i')
for i in itags:
    if name in i.text:
        meaning = i.text
        smiths_names[name] = meaning
        break
else:
    print(f'{name} not found')
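Stitched into your existing loop, the whole thing might look like this (a sketch: the test list below is a two-name placeholder for the list you scraped earlier, and the substring check assumes the name appears in the italicized definition text):
import requests
from bs4 import BeautifulSoup

test = ["aaron", "abednego"]  # placeholder; use your scraped list of names
smiths_names = {}

for name in test:
    url = f"https://www.biblestudytools.com/dictionaries/smiths-bible-dictionary/{name}.html"
    try:
        page = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print(e)
        print(f'{name} not found')
        continue
    soup = BeautifulSoup(page.content, 'html.parser')
    # scan every <i> tag and keep the first one that mentions the name
    for i in soup.find_all('i'):
        if name in i.text:
            smiths_names[name] = i.text
            break
    else:
        print(f'{name} not found')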

Related

python beautifulsoup4 how to get span text in div tag

This is the HTML code:
<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>
I used this:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    print(item.find('div', class_='salary-snippet'))
and I got as the result a list such as:
<div aria-label="RM 3,500 to RM 8,000 a month" class="salary-snippet"><span>RM 3,500 - RM 8,000 a month</span></div>
If I use
print(item.find('div', class_='salary-snippet').text.strip())
it returns the error
AttributeError: 'NoneType' object has no attribute 'text'
So how can I get only the <span> text? It's my first time web scraping.
Maybe this is what you are looking for.
First, select all the <div> tags with class salary-snippet, as this is the parent of the <span> tag that you are looking for; use .find_all().
Now iterate over all the selected <div> tags from above and find the <span> in each <div>.
Based on your question, I assume that not all of these <div> tags have a <span> tag. In that case, you can print the text only if the <div> contains a <span> tag. See below:
# Find all the divs
d = soup.find_all('div', class_='salary-snippet')
# Iterate over the <div> tags
for item in d:
    # Find <span> in each item; if it doesn't exist, x will be None
    x = item.find('span')
    # Check that x is not None, and only then print
    if x:
        print(x.text.strip())
Here is the complete code:
from bs4 import BeautifulSoup

s = """<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>"""
soup = BeautifulSoup(s, 'lxml')
d = soup.find_all('div', class_='salary-snippet')
for item in d:
    x = item.find('span')
    if x:
        print(x.text.strip())
which prints:
RM 6,000 a month
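For what it's worth, a CSS selector can collapse the two find steps into one; a sketch against the same markup:
from bs4 import BeautifulSoup

s = '<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>'
soup = BeautifulSoup(s, 'html.parser')
# select only <span> tags that sit inside a div with class salary-snippet
for span in soup.select('div.salary-snippet span'):
    print(span.text.strip())  # RM 6,000 a month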
I believe the line should be:
print(item.find('div', {'class':'salary-snippet'}).text.strip())
Alternatively, if there is only the one <span>, you can simply use:
item.find("span").text.strip()
Considering you used the .find_all() method, you might want to ensure that every div returned from your HTML
soup.find_all('div', class_='job_seen_beacon')
contains the element you are looking for, as this error could arise if even one element doesn't.
i.e.
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    try:
        print(item.find('div', {'class':'salary-snippet'}).text.strip())
    except AttributeError:
        print("Item Not available")
This will try to get the text, but if that fails it will print a placeholder for the item that failed, so you can identify why; perhaps it doesn't have the element you are searching for.

Python3 BeautifulSoup: Remove a portion of HTML and return as soup object

I would like to remove a portion of my soup HTML object, concatenate the pieces together, and return them as one single soup object.
The portion of the HTML object that I want to remove is all the content within the span and div tags that carry a certain class name.
An example of the HTML is like so; note that the items are in a list of Tag objects:
body = [
<div class="content-block">
    <p>Some text</p>
</div>
,
<div class="content-block">
    <p style="margin-left:30px;">Some content here</p>
    <span class="special_class">          //Remove
        <a class="explanations-link"></a> //Remove
        ...                               //Remove
    </span>                               //Remove
</div>
,
<div class="content-block">
    <p style="margin-left:30px;">Some content here</p>
    <div class="special_class">           //Remove
        <p>Some content here</p>          //Remove
        ...                               //Remove
    </div>                                //Remove
</div>
]
I would like to remove everything inside the span and div tags that carry the class name special_class, as highlighted.
My current implementation loops over each Tag object, converts it into a str, and then does a replace. After replacing, I concatenate the pieces together as a str. It turns out that the replace didn't remove any of those tags, despite having matched.
text_str = ""
for item in body:
    item_str = str(item)
    span_class_items = item.findAll("span", {"class": "special_class"})
    div_class_items = item.findAll("div", {"class": "special_class"})
    for i in span_class_items:
        item_str.replace(str(i), "")
    for d in div_class_items:
        item_str.replace(str(d), "")
    text_str += item_str
new_soup = BeautifulSoup(text_str, "html.parser")
Also, after parsing text_str back into a soup object, the returned object is not one single soup object, but still holds the same number of items as the body list.
What have I missed?
EDIT:
Attempt using extract():
for item in body:
    span_class_items = item.findAll("span", {"class": "legend-block explanations"})
    div_class_items = item.findAll("div", {"class": "explanations-fancybox"})
    test_item = item
    if len(span_class_items) > 0:
        for s_item in span_class_items:
            test_item.s_item.extract()
    if len(div_class_items) > 0:
        for d_item in div_class_items:
            test_item.d_item.extract()
This attempt throws
'NoneType' object has no attribute 'extract'
Attempt using replace_with()
for item in body:
    span_class_items = item.findAll("span", {"class": "legend-block explanations"})
    div_class_items = item.findAll("div", {"class": "explanations-fancybox"})
    test_item = item
    if len(span_class_items) > 0:
        for s_item in span_class_items:
            test_item.replace_with(s_item)
    if len(div_class_items) > 0:
        for d_item in div_class_items:
            test_item.replace_with(d_item)
This attempt throws
Cannot replace one element with another when the element to be replaced is not part of a tree.
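(As an aside, the 'NoneType' object has no attribute 'extract' error in the first attempt comes from test_item.s_item: attribute access on a tag looks up a child tag literally named s_item and returns None. extract() is called on the tag that findAll() returned; a sketch of that reading:)
for item in body:
    for s_item in item.findAll("span", {"class": "special_class"}):
        s_item.extract()  # call extract() on the found tag itself
    for d_item in item.findAll("div", {"class": "special_class"}):
        d_item.extract()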
You can try the decompose() method, which destroys a tag and removes it from the tree. Here is an example:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
More information can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
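Applied to the markup in the question, a minimal sketch might look like this (the class name special_class is taken from the example above; decompose() edits the tree in place, so nothing has to be re-parsed or concatenated):
from bs4 import BeautifulSoup

html = '''
<div class="content-block"><p>Some text</p></div>
<div class="content-block"><p>Some content here</p>
<span class="special_class"><a class="explanations-link"></a></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# destroy every <span> or <div> that carries the special class
for tag in soup.find_all(['span', 'div'], class_='special_class'):
    tag.decompose()
print(soup)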

Using Python + BeautifulSoup to extract tags in tandem, creating a list of lists

I'm a bit new to Python/BeautifulSoup, and was wondering if I could get some direction on how to get the following accomplished.
I have HTML from a webpage that is structured as follows:
1) a block of code contained within a tag that contains all image names (Name1, Name2, Name3)
2) a block of code contained within a tag that has image URLs
3) a date, which appears once on the webpage; I put it into the 'date' variable (this has already been extracted)
From the code, I'm trying to extract a list of lists that will contain [['image1','url1','date'], ['image2','url2','date']], which I will later convert into a dictionary (via the dict(zip(labels, values)) function) and insert into a MySQL table.
All I can come up with is how to extract two lists, one containing all image names and one containing all URLs. Any idea on how to get what I'm trying to do accomplished?
A few things to keep in mind:
1) the number of images always changes, along with the names (1:1)
2) the date always appears once
P.S. Also, if there is a more elegant way to extract the data via bs4, please let me know!
from bs4 import BeautifulSoup

name = []
url = []
date = '2017-10-12'
text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>'
soup = BeautifulSoup(text, 'lxml')
#print soup.prettify()
# get image urls
for imgz in soup.find_all('div', attrs={'class':'img-wrapper'}):
    for imglinks in imgz.find_all('img', src=True):
        #print imgz
        url.append((imglinks['src']).encode("utf-8"))
# get image names
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        name.append((litag.text).encode("utf-8"))  # dump all names into a list
print url
print name
Here's another possible route to pulling the urls and names:
url = [tag.get('src') for tag in soup.find_all('img')]
name = [tag.text.strip() for tag in soup.find_all('li')]
print(url)
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
print(name)
# ['NAME1', 'NAME2', 'NAME3']
As for the ultimate list creation, here's something that's functionally similar to what @t.m.adam has suggested:
print([pair + [date] for pair in list(map(list, zip(url, name)))])
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
# ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
# ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
Note that map is pretty infrequently used nowadays and its use is outright discouraged in some places.
Or:
n = len(url)
print(list(map(list, zip(url, name, [date] * n))))
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
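And since map() reads awkwardly to many people (as noted above), here is an equivalent with a plain list comprehension, as a sketch:
records = [[u, n, date] for u, n in zip(url, name)]
# same output as above: [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ...]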

I can't change a string when I use BeautifulSoup. The string is auto-encoded

The < and > are automatically encoded (escaped as entities) when I try to insert a string containing tags.
Please help me, thank you.
You can do this in a more semantic way, with new_tag(..), Tag.clear() and Tag.append():
atag = soup.a
atag.clear() # clear the content of the <a> tag
atag.append('support') # add the 'support' text
emtag = soup.new_tag('em') # create a new <em> tag
atag.append(emtag) # add the <em> tag to the <a> tag
emtag.string = '[6666]' # alter the text in the <em> tag
This constructs:
>>> soup
<html><body>support<em>[6666]</em></body></html>
On my machine.
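A self-contained version of the same approach, for reference (the input markup is assumed here, since the question does not show it):
from bs4 import BeautifulSoup

# assumed input; the original markup was not shown in the question
soup = BeautifulSoup('<a href="#">old</a>', 'html.parser')
atag = soup.a
atag.clear()                # clear the content of the <a> tag
atag.append('support')      # append plain text
emtag = soup.new_tag('em')  # build <em> as a real tag so the brackets are not escaped
atag.append(emtag)
emtag.string = '[6666]'
print(soup)  # <a href="#">support<em>[6666]</em></a>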

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I’m trying to extract data from this webpage, and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs, and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far:
import urllib2
from bs4 import BeautifulSoup

#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies the URL; uses "%" to sub in different OGAP IDs from the list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it with the "lxml" parser
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (they share the parent <tr> (table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each PubMed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address;
            # for each <a> tag found, print the address (href attribute) and the extra bits.
            # link.string provides the string that appears to be hyperlinked;
            # in this case, it is the PubMed ID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the <div> element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
import requests
from bs4 import BeautifulSoup
import re

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # PubMed links look like http://www.ncbi.nlm.nih.gov/pubmed/<digits>
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first <a> tag that matches the target URL (which ends with numbers), then check if 'PMID' is in its previous element.
This website is very inconsistent, and I tried many times; I hope this helps.
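(For the asker's follow-up idea of searching for any element containing the text "PMID": BeautifulSoup can match text nodes directly with a regex. A sketch against the second HTML sample, in Python 3:)
import re
from bs4 import BeautifulSoup

html = '<div class="STYLE28">PMID: 16408927 [Azide-tag, nano-HPLC/tandem MS]</div>'
soup = BeautifulSoup(html, 'lxml')
# find the text node that mentions PMID, then read the surrounding string
node = soup.find(text=re.compile('PMID'))
if node:
    print(node.strip())  # PMID: 16408927 [Azide-tag, nano-HPLC/tandem MS]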
