Extending selection with BeautifulSoup - python

I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular, one which I will call TagA is
<div class ="A">...</div>
and one which I will call TagB
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB, so that
<div class="A">...</div> becomes <marker><div class="A">...</div></marker>
But when TagA is followed immediately by TagB, I want the marker tag to surround them both, so that
<div class="A">...</div><p class="B">...</p>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.

BeautifulSoup does have a "next sibling" attribute. Find all tags of class A and use a.next_sibling to check whether it is B.
Look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
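A minimal sketch of that idea (assuming TagA is <div class="A"> and TagB is <p class="B">, as in the question; the marker-wrapping step is left out):

```python
from bs4 import BeautifulSoup

html = '<div class="A">one</div><p class="B">two</p><div class="A">three</div>'
soup = BeautifulSoup(html, "html.parser")

for tag_a in soup.find_all("div", class_="A"):
    sib = tag_a.next_sibling
    # next_sibling can also be a whitespace string; a fuller version would skip those
    followed_by_b = (getattr(sib, "name", None) == "p"
                     and "B" in sib.get("class", []))
    print(tag_a.text, followed_by_b)  # one True / three False
```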

I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead I found that the following code, which inserts the outer 'Marker' tag and then inserts the A and B tags into it, does the trick.
I am pretty new to Python so would appreciate advice regarding improvements or snags with the following.
from bs4 import BeautifulSoup

def isTagB(tag):
    # If tag is <p class="B"> return True;
    # if not - or tag is just a string - return False
    try:
        return tag.name == 'p' and 'B' in tag.get('class', [])
    except AttributeError:
        return False

soup = BeautifulSoup("""<div class="A"><p><i>more content</i></p></div><div class="A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class="fred">not content</div>""", "html.parser")
for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # skip over whitespace
    while str(nexttag).isspace():
        nexttag = nexttag.next_sibling
    TagA.replace_with(Marker)  # put the marker where the A element was
    Marker.insert(1, TagA)
    if isTagB(nexttag):
        Marker.insert(2, nexttag)
print(soup)

import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen("http://ursite.com")  # gives the html response
soup = BeautifulSoup(html)
all_div = soup.findAll("div", attrs={})  # use attrs as a dict for attribute matching
# e.g. attrs={'class': "class", "id": "1234"}
single_div = all_div[0]
# to find a p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.findNext(), obj.findAllNext(), obj.findAllPrevious(), obj.findPrevious();
to get an attribute you can use obj.get("href"), obj.get("title"), etc.
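For instance, a small sketch of .get with made-up markup (using the bs4-style import):

```python
from bs4 import BeautifulSoup

html = '<p><a href="/docs" title="Docs">link</a></p>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
# .get returns the attribute value, or None if the attribute is absent
print(a.get("href"))   # /docs
print(a.get("title"))  # Docs
print(a.get("id"))     # None
```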

Related

Using next_sibling with font color in BS4

I need to get the data after a certain link with the text "map", but it doesn't work when the data after the link is colored. How do I get that?
Currently I am using next_sibling, but it only gets the data points that are not red.
The HTML is like this. I can read the number from here:
<a class="link2">map</a> 2.8
but not from here:
<a class="link2">map</a>
<font color="red">3.1</font>
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        # print each next_sibling
        print(i.next_sibling)
        # extract text if needed
        try:
            output.write(i.next_sibling.get_text().strip() + "\n")
        except AttributeError:
            output.write(i.next_sibling.strip() + "\n")
output.close()
The program writes all of the numbers that are not in red, and leaves empty spaces where there are red numbers. I want it to show everything.
If we could see more of your HTML tree there would probably be a better way to do this, but given the little bit of HTML you've shown us, here's one way that will likely work.
from bs4 import BeautifulSoup

html = """<a class="link2">map</a> 2.8
<a class="link2">map</a>
<font color="red">3.1</font>"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        siblings = list(i.next_siblings)
        map_sibling_text = siblings[0].strip()
        if map_sibling_text == '' and len(siblings) > 1:
            if siblings[1].name == 'font':
                map_sibling_text = siblings[1].get_text().strip()
        output.write("{0}\n".format(map_sibling_text))
output.close()
It depends on how your HTML is overall. Is that class name always associated with an a tag, for example? You might be able to do the following. Requires bs4 4.7.1.
from bs4 import BeautifulSoup as bs

html = '''
<a class="link2">map</a> 2.8
<a class="link2">map</a>
<font color="red">3.1</font>
'''
soup = bs(html, 'lxml')
data = [item.next_sibling.strip() if item.name == 'a' else item.text.strip()
        for item in soup.select('.link2:not(:has(+font)), .link2 + font')]
print(data)

Python + BeautifulSoup: Finding a HTML tag where an attribute contains a matched pattern of text?

I'm new to both Python and BeautifulSoup. I'm trying to figure out how to match only the <div> elements whose attributes contain a certain pattern of text: for example, all cases where 'id' is 'testid', or everywhere 'class' is 'title'.
This is what I have so far:
def cleanup(filename):
    fh = open(filename, "r")
    soup = BeautifulSoup(fh, 'html.parser')
    for div_tag in soup.find('div', {'class': 'title'}):
        h2_tag = soup.h2_tag("h2")
        div_tag.div.replace_with(h2_tag)
        del div_tag['class']
    f = open("/tmp/filename.modified", "w")
    f.write(soup.prettify(formatter="html5"))
    f.close()
Once I can match all those particular elements, I can figure out how to manipulate the attributes (delete the class, rename the tag itself from <div> to <h1>, etc.). So I'm aware that the actual cleanup part probably doesn't work as it currently stands.
It seems this works sufficiently, but let me know if there's a "better" or "more standard" way to do it.
for tag in soup.findAll(attrs={'class': 'title'}):
    del tag['class']
.find(tagName, attributes) returns a single element.
.find_all(tagName, attributes) returns a list of elements.
You can find more in the docs.
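A quick illustration of the difference, with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div class="title">first</div><div class="title">second</div>'
soup = BeautifulSoup(html, "html.parser")

one = soup.find("div", attrs={"class": "title"})       # a single Tag (or None if no match)
many = soup.find_all("div", attrs={"class": "title"})  # a list of all matching Tags
print(one.text)   # first
print(len(many))  # 2
```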
To replace an element, you need to create a new one with .new_tag(tagName); to delete an attribute, use del element.attrs[attributeName]. See below for an example.
from bs4 import BeautifulSoup

html = '''
<div id="title" class="testTitle">
heading h1
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print('html before')
print(soup)
div = soup.find('div', id="title")
# delete the class attribute
del div.attrs['class']
print('html after removing the attribute')
print(soup)
# to replace, create an h1 element
h1 = soup.new_tag("h1")
# set text from the previous element
h1.string = div.text
# uncomment to carry over the id
# h1['id'] = div['id']
div.replace_with(h1)
print('html after replace')
print(soup)

BeautifulSoup to find unofficial HTML tags/attributes

In my job we are using tags that we have created. One of them, called can-edit, looks like this in the code (for example):
<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>
It could be inside any tag (img, p, h1, h2, div...).
What I wish to get is all the can-edit values within a page; for example, with the HTML above:
['banner top text', 'banner bottom text']
I've tried
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
But it is not finding any.
I've tried
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
But it is not finding any.
The reason this does not work is that here you look for a tag with the name can-edit, i.e. <can-edit ...>, and no such tag exists.
You can use the find_all function of the soup to find all tags with a certain attribute. For example:
soup.find_all(attrs={'can-edit': True})
So here we use the attrs parameter and pass it a filter saying that we want tags that have a can-edit attribute. This gives us a list of tags with a can-edit attribute (regardless of its value). If we now want to obtain the value of that attribute, we can take the ['can-edit'] item of each tag, so we can write a list comprehension:
all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]
Or a full working version:
from bs4 import BeautifulSoup

s = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>"""
soup = BeautifulSoup(s, 'lxml')
all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]

Check if a specific class present in HTML using beautifulsoup Python

I am writing a script and want to check if a particular class is present in html or not.
from bs4 import BeautifulSoup
import requests

def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
print("3 star", html.has_attr("i-stars i-stars--large-3 rating-very-large"))  # returns False
res = html.find("i-stars i-stars--large-3 rating-very-large")  # returns None
Please guide me on how I can resolve this issue. If I could somehow get the title (title="3.0 star rating"), that would also work for me. The HTML from the browser console looks like this:
<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
<img class="offscreen" height="303" src="https://s3-media1.fl.yelpcdn.com/assets/srv0/yelp_design_web/8a6fc2d74183/assets/img/stars/stars.png" width="84" alt="3.0 star rating">
</div>
has_attr is a method that checks whether an element has the attribute you want. class is an attribute name; i-stars i-stars--large-3 rating-very-large is its value, so has_attr("i-stars ...") is False.
find does not take a string of class values as its first argument; that argument is a tag name. To match by CSS selector instead, use html.select_one('div.i-stars.i-stars--large-3.rating-very-large'), which looks for a div with all of these classes.
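A sketch of the distinction, with a made-up snippet: has_attr checks attribute names on a single tag, while class values are matched through find (or a CSS selector):

```python
from bs4 import BeautifulSoup

html = '<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating"></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

print(div.has_attr("class"))    # True: 'class' is an attribute name
print(div.has_attr("i-stars"))  # False: 'i-stars' is a class value, not an attribute
# match on class values instead
print(soup.find(class_="i-stars") is not None)                    # True
print(soup.select_one("div.i-stars.rating-very-large")["title"])  # 3.0 star rating
```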
I was having similar problems getting the exact classes. They can be read back from the tag's attrs dictionary as follows.
from bs4 import BeautifulSoup

html = '<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">'
soup = BeautifulSoup(html, 'html.parser')
find = soup.div
classes = find.attrs['class']
c1 = find.attrs['class'][0]
print(classes, c1)
from bs4 import BeautifulSoup
import requests

def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
res = html.find(class_='i-stars i-stars--large-3 rating-very-large')
if res:
    print("3 star", 'whatever you want to print')
out:
3 star whatever you want to print

Finding specific tag using BeautifulSoup

Here is the website that I'm parsing: http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US
I would like to be able to find the word that will be in line 39 of that page, between the td tags. That line tells me whether the address is residential or commercial, which is what I need for my script.
Here's what I have, but i'm getting this error:
AttributeError: 'NoneType' object has no attribute 'find_next'
The code I'm using is:
from bs4 import BeautifulSoup
import urllib
page = "http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US"
z = urllib.urlopen(page).read()
thesoup = BeautifulSoup(z, "html.parser")
comres = (thesoup.find("th",text=" Residential or ").find_next("td").text)
print(str(comres))
The text argument would not work in this particular case. This is related to how the .string property of an element is calculated. Instead, I would use a search function where you can call get_text() and check the complete "text" of an element, including its children nodes:
label = thesoup.find(lambda tag: tag and tag.name == "th" and
                     "Residential" in tag.get_text())
comres = label.find_next("td").get_text()
print(str(comres))
Prints Commercial.
We can go a little bit further and make a reusable function to get a value by label:
soup = BeautifulSoup(z, "html.parser")
def get_value_by_label(soup, label):
    cell = soup.find(lambda tag: tag and tag.name == "th" and label in tag.get_text())
    return cell.find_next("td").get_text(strip=True)

print(get_value_by_label(soup, "Residential"))
print(get_value_by_label(soup, "City"))
Prints:
Commercial
NYC
All you are missing is a bit of housekeeping:
ths = thesoup.find_all("th")
for th in ths:
    if 'Residential or' in th.text:
        comres = th.find_next("td").text
        print(str(comres))
>> Commercial
You'll need to use a regular expression as your text argument, like re.compile('Residential or'), rather than a plain string.
This was working for me. I had to iterate over the results, though if you only expect a single result per page you could swap find_all for find:
import re

for r in thesoup.find_all(text=re.compile('Residential or')):
    print(r.find_next('td').text)
