I am writing a script and want to check if a particular class is present in html or not.
from bs4 import BeautifulSoup
import requests

def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
print("3 star", html.has_attr("i-stars i-stars--large-3 rating-very-large"))  # returns False
res = html.find('i-stars i-stars--large-3 rating-very-large')  # returns None
How can I resolve this? Getting the title (title="3.0 star rating") would also work for me. The HTML as shown in the browser console:
<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
<img class="offscreen" height="303" src="https://s3-media1.fl.yelpcdn.com/assets/srv0/yelp_design_web/8a6fc2d74183/assets/img/stars/stars.png" width="84" alt="3.0 star rating">
</div>
has_attr is a method that checks if an element has the attribute that you want. class is an attribute, i-stars i-stars--large-3 rating-very-large is its value.
find expects a tag name plus optional attribute filters, not a CSS selector or a bare class value. For CSS selectors use select_one (or select for all matches): html.select_one('div.i-stars.i-stars--large-3.rating-very-large') looks for a div that has all of these classes.
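A minimal sketch of how these calls behave, run against a static copy of the snippet from the question rather than the live page (the live markup may differ, so treat the snippet as an assumption):

```python
from bs4 import BeautifulSoup

# Static copy of the markup from the question; the live page may differ.
snippet = '''
<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
  <img class="offscreen" alt="3.0 star rating">
</div>
'''
soup = BeautifulSoup(snippet, "html.parser")

# has_attr() takes an attribute *name* and is meant to be called on one tag.
div = soup.find("div")
has_class_attr = div.has_attr("class")

# select_one() takes a CSS selector; every class listed must be present.
stars = soup.select_one("div.i-stars.i-stars--large-3.rating-very-large")
title = stars["title"] if stars is not None else None

print(has_class_attr, title)
```

Checking `stars is not None` before reading the title keeps the script from failing on pages that have no such element.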
I was having similar problems getting the exact classes. They can be read from the tag's attrs dictionary, where the class value comes back as a list of class names:
html = '<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">'
soup = BeautifulSoup(html, 'html.parser')
find = soup.div
classes = find.attrs['class']
c1 = find.attrs['class'][0]
print (classes, c1)
from bs4 import BeautifulSoup
import requests

def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
res = html.find(class_='i-stars i-stars--large-3 rating-very-large')
if res:
    print("3 star", 'whatever you want print')
out:
3 star whatever you want print
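One caveat with passing the whole class string to class_: it matches only when the attribute string is exactly the same, including the order of the classes. A small sketch with a hypothetical reordered snippet shows the difference from a single-class search or a CSS selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same classes as in the question, in a different order.
snippet = '<div class="rating-very-large i-stars i-stars--large-3" title="3.0 star rating"></div>'
soup = BeautifulSoup(snippet, "html.parser")

# Exact-string match: finds nothing because the order differs.
exact = soup.find(class_="i-stars i-stars--large-3 rating-very-large")

# Single class name: matches regardless of the other classes on the tag.
single = soup.find(class_="i-stars--large-3")

# CSS selector: all three classes are required, in any order.
by_selector = soup.select_one(".i-stars.i-stars--large-3.rating-very-large")

print(exact, single["title"], by_selector["title"])
```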
I am using BeautifulSoup to scrape a website. The retrieved resultset looks like this:
<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
From here, I want to retrieve the class name "I_Want_This_Class_Name". I can get the "Text Is Here" part no problem, but the class name itself is proving to be difficult.
Is there a way to do this using BeautifulSoup resultset or do I need to convert to a dictionary?
Thank you
from bs4 import BeautifulSoup

doc = '''<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
'''
soup = BeautifulSoup(doc, 'html.parser')
res = soup.find('td')
out = {}
for each in res:
    if each.has_attr('class'):
        out[each['class'][0]] = each.text
print(out)
output will be like:
{'I_Want_This_Class_Name': '', 'other_name': 'Text Is Here'}
If you are trying to get the class name for this one result, then I would use the select method on your soup object, calling the class key:
foo_class = soup.select('td>span.I_Want_This_Class_Name')[0]['class'][0]
Note here that the select method does return a list, hence the indexing before the key.
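If only the first match is needed, select_one avoids the list indexing and returns None when nothing matches. A short sketch on the markup from the question (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

doc = '<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>'
soup = BeautifulSoup(doc, "html.parser")

# First <span> that is a direct child of the <td>.
first_span = soup.select_one("td > span")

# "class" is multi-valued, so the value is a list of class names.
class_names = first_span.get("class", []) if first_span is not None else []
print(class_names)
```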
I need to get data after a certain link with text map, but it doesn't work when the data after the link is colored. How do I get that?
Currently, I am using next_sibling, but it only gets the data points that are not red.
The HTML is like this.
I can read the number from here
map
" 2.8 "
but not from here
map
<font color="red">3.1</font>
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        # prints each next_sibling
        print(i.next_sibling)
        # Extracts text if needed.
        try:
            output.write(i.next_sibling.get_text().strip() + "\n")
        except AttributeError:
            output.write(i.next_sibling.strip() + "\n")
output.close()
The program writes all of the numbers that are not in red, and leaves empty spaces where there are red numbers. I want it to show everything.
If we could see more of your HTML tree there is probably a better way to do this, but given the little HTML you have shown us, here is one way that will likely work.
from bs4 import BeautifulSoup

# The <a class="link2"> tags are reconstructed from the find_all() call below.
html = """<a class="link2">map</a>2.8
<a class="link2">map</a>
<font color="red">3.1</font>"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all("a", {'class': 'link2'})
output = open("file.txt", "w")
for i in tags:
    if i.get_text() == "map":
        siblings = [sib for sib in i.next_siblings]
        map_sibling_text = siblings[0].strip()
        # An empty first sibling means the value sits in the following <font> tag.
        if map_sibling_text == '' and len(siblings) > 1:
            if siblings[1].name == 'font':
                map_sibling_text = siblings[1].get_text().strip()
        output.write("{0}\n".format(map_sibling_text))
output.close()
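The same idea can be generalised a little: walk the following siblings and take the first one that yields non-empty text, whether it is a bare string or a tag. This is only a sketch; the markup and the helper name text_after are assumptions, not taken from the original page.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup modelled on the question: a "map" link followed by a value.
html = '''
<a class="link2" href="#">map</a> 2.8
<a class="link2" href="#">map</a> <font color="red">3.1</font>
'''
soup = BeautifulSoup(html, "html.parser")

def text_after(tag):
    """Return the first non-empty text found among the siblings after `tag`."""
    for sib in tag.next_siblings:
        if isinstance(sib, NavigableString):
            text = sib.strip()                 # plain text node
        else:
            text = sib.get_text(strip=True)    # element such as <font>
        if text:
            return text
    return ""

values = [text_after(a) for a in soup.find_all("a", class_="link2")
          if a.get_text() == "map"]
print(values)
```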
It depends on how your HTML is structured overall. Is that class name always associated with an a tag, for example? If so, you might be able to do the following. It requires bs4 4.7.1 or later.
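from bs4 import BeautifulSoup as bs

# The <a class="link2"> tags are reconstructed from the selector used below.
html = '''
<a class="link2">map</a>
" 2.8 "
<a class="link2">map</a>
<font color="red">3.1</font>
'''
soup = bs(html, 'lxml')
# Links not directly followed by <font> contribute their text sibling;
# otherwise the <font> element itself is selected.
data = [
    item.next_sibling.strip() if item.name == 'a' else item.text.strip()
    for item in soup.select('.link2:not(:has(+font)), .link2 + font')
]
print(data)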
I have a question about Python. I want to scrape a single page that may use different tags and classes, and loop over them. These are the elements I need:
'a' : "class: a"
'div': "class: b"
'h1' : "class: c"
Each page has only one of them, so I tried "else if" and "try" statements, but I still can't get it to work. This code handles one class only:
#!/usr/bin/env python
import csv
import requests
from bs4 import BeautifulSoup

urls = csv.reader(open('link.csv'))
for url in urls:
    response = requests.get(url[0])
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    condition = soup.find('a', attrs={'class': 'a'}).get_text()
    print(condition)
I have searched this forum for the same problem but I am still stuck.
I hope someone can help me. Thank you.
If you want to select all variations of the elements, you could use the .select() method along with the three relevant CSS selectors to cover the example that you provided, a.a, div.b, h1.c.
If there are any matched elements, you could then grab the first one and get its text:
elements = soup.select('a.a, div.b, h1.c')
if elements:
    condition = elements[0].get_text()
    print(condition)
import bs4

html = """<html>
<body>
<div class="a"></div>
<a class="b"></a>
<h1 class="c"></h1>
</body>
</html>"""
soup = bs4.BeautifulSoup(html, 'lxml')
soup.find_all(class_=['a', 'b', 'c'])
soup.select('.a, .b, .c')
In find_all(), ['a', 'b', 'c'] means a or b or c.
In select(), '.a, .b, .c' means a or b or c.
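Put together for the "only one of them is on the page" case, a sketch might look like this. The page content is made up for illustration, and select_one is used so that a missing element gives None instead of an exception:

```python
from bs4 import BeautifulSoup

# Hypothetical page that contains only one of the three candidate elements.
html = '<html><body><p>intro</p><div class="b">found me</div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# A selector group matches any of the listed tag/class pairs.
element = soup.select_one("a.a, div.b, h1.c")
condition = element.get_text(strip=True) if element is not None else None

# find_all() with a list of class names matches any of them, on any tag.
by_class = soup.find_all(class_=["a", "b", "c"])

print(condition, len(by_class))
```

Note that the class list in find_all does not tie a class to a particular tag name, while the selector group does.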
I am trying to do some scraping from Wikipedia using BeautifulSoup4.
Unfortunately I can't get past one findAll call. I have a workaround, but would like to understand why this one is not working.
Sample code:
from bs4 import BeautifulSoup
import requests

html = requests.get('http://en.wikipedia.org/wiki/Brazil_national_football_team').text
soup = BeautifulSoup(html, "html.parser")
title = "Edit section: Current squad"
print("findAll method :", soup.findAll("a", {"title", title}))
results = soup.findAll("a")
for r in results:
    if 'title' in r.attrs:
        if r.attrs['title'] == 'Edit section: Current squad':
            print("for if if method :", r['href'])
Sample output:
findAll method : []
for if if method : /w/index.php?title=Brazil_national_football_team&action=edit&section=35
So my alternative code with the 'for if if' method does return the right 'a href' but the beautifulsoup variant doesn't.
What am I doing wrong?
You made a mistake in your dictionary syntax:
soup.findAll("a",{"title",title})
# ----------------------^
You passed in a set, not a dictionary there; replace the comma with a colon:
soup.findAll("a",{"title":title})
Alternatively, just use a keyword argument:
soup.findAll("a", title=title)
Demo:
>>> soup.findAll("a",{"title",title})
[]
>>> soup.findAll("a",{"title":title})
[edit]
>>> soup.findAll("a", title=title)
[edit]
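The difference between the two literals is easy to check without fetching the page. The markup below is a hypothetical stand-in for the Wikipedia HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page.
html = ('<a href="/edit/35" title="Edit section: Current squad">edit</a>'
        '<a href="/other">other</a>')
soup = BeautifulSoup(html, "html.parser")
title = "Edit section: Current squad"

wrong = {"title", title}   # set literal: two unrelated strings
right = {"title": title}   # dict literal: attribute name -> wanted value

print(type(wrong).__name__, type(right).__name__)

by_dict = soup.find_all("a", right)
by_keyword = soup.find_all("a", title=title)
print(by_dict[0]["href"], by_dict == by_keyword)
```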
I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular, one which I will call TagA is
<div class ="A">...</div>
and one which I will call TagB
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB so that
<div class="A">...</div> becomes <marker><div class="A">...</div></marker>
But when TagA is immediately followed by TagB, I want the marker tag to surround them both, so that
<div class="A">...</div><p class="B">...</p>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.
BeautifulSoup does have a "next sibling" attribute. Find all tags of class A and use a.next_sibling to check whether it is B.
Look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
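One thing to watch for: next_sibling returns whatever node comes next, which is often a whitespace string rather than a tag. A small sketch, on made-up markup, of the difference between next_sibling and find_next_sibling():

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a newline sits between the A and B elements.
html = '<div class="A">a</div>\n<p class="B">b</p>'
soup = BeautifulSoup(html, "html.parser")
tag_a = soup.find("div", class_="A")

raw = tag_a.next_sibling               # the "\n" string, not the <p>
next_tag = tag_a.find_next_sibling()   # next sibling that is a tag

followed_by_b = (
    next_tag is not None
    and next_tag.name == "p"
    and "B" in next_tag.get("class", [])
)
print(repr(raw), followed_by_b)
```

find_next_sibling() skips every string, including non-whitespace text, so if "immediately followed" must rule out text in between, that has to be checked separately.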
I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead, I found that the following code, which inserts the outer 'Marker' tag and then moves the A and B tags into it, does the trick.
I am pretty new to Python so would appreciate advice regarding improvements or snags with the following.
from bs4 import BeautifulSoup, NavigableString, Tag

def isTagB(tag):
    # Return True only for <p class="B">; strings and None return False.
    return isinstance(tag, Tag) and tag.name == 'p' and 'B' in tag.get('class', [])

soup = BeautifulSoup("""<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""", 'html.parser')
for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # Skip over whitespace-only strings
    while isinstance(nexttag, NavigableString) and nexttag.isspace():
        nexttag = nexttag.next_sibling
    TagA.replace_with(Marker)  # Put the marker where the A element was
    Marker.append(TagA)
    if isTagB(nexttag):
        Marker.append(nexttag)  # Move the B element inside the marker too
print(soup)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://ursite.com")  # gives the HTML response
soup = BeautifulSoup(html, "html.parser")
all_div = soup.find_all("div", attrs={})  # pass attrs as a dict to filter on attributes
# e.g. attrs={'class': "class", "id": "1234"}
single_div = all_div[0]
# find the p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.find_next(), obj.find_all_next(), obj.find_all_previous() and obj.find_previous() to move through the document.
To read an attribute, use obj.get("href"), obj.get("title"), etc.
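A small offline sketch of these calls, using made-up markup in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = ('<div id="1234" class="box"><p>first</p></div>'
        '<a href="/next" title="Next page">next</a>')
soup = BeautifulSoup(html, "html.parser")

single_div = soup.find("div", attrs={"class": "box", "id": "1234"})
p_tag_obj = single_div.find("p")

# find_next() walks forward through the document, not only through siblings.
link = single_div.find_next("a")
href = link.get("href")
missing = link.get("rel")   # get() returns None for an absent attribute

print(p_tag_obj.get_text(), href, missing)
```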