I have a question about Python. I want to scrape a single page that may use one of several attribute classes, and loop over them. These are the elements I need:
'a' : "class: a"
'div': "class: b"
'h1' : "class: c"
Each page has only one of them, so I tried elif and try statements, but I still can't get it to work. This code handles one class only:
#!/usr/bin/env python
import csv
import requests
from bs4 import BeautifulSoup
urls = csv.reader(open('link.csv'))
for url in urls:
    response = requests.get(url[0])
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    condition = soup.find('a', attrs={'class': 'a'}).get_text()
    print(condition)
I have searched for similar problems on this forum, but I am still stuck.
I hope someone can help me. Thank you.
If you want to select all variations of the elements, you could use the .select() method with the three relevant CSS selectors covering your example: a.a, div.b, h1.c.
If there are any matched elements, you could then grab the first one and get its text:
elements = soup.select('a.a, div.b, h1.c')
if elements:
    condition = elements[0].get_text()
    print(condition)
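Putting it together, here is a minimal, self-contained sketch (the page contents and texts are made up for illustration) showing that the same selector list handles whichever variant a given page happens to contain:

```python
from bs4 import BeautifulSoup

# Each page contains only one of the three variants; the same
# selector list covers all of them.
pages = [
    '<html><body><a class="a">in stock</a></body></html>',
    '<html><body><div class="b">sold out</div></body></html>',
    '<html><body><h1 class="c">preorder</h1></body></html>',
]

for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    elements = soup.select('a.a, div.b, h1.c')
    if elements:
        print(elements[0].get_text())
```

In your script you would keep the requests/CSV loop and apply the same `soup.select(...)` call to each downloaded page.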
import bs4
html = """<html>
<body>
<div class="a"></div>
<a class="b"></a>
<h1 class="c"></h1>
</body>
</html>"""
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.find_all(class_=['a', 'b', 'c']))
print(soup.select('.a, .b, .c'))
In find_all(), class_=['a', 'b', 'c'] means class a or b or c
In select(), '.a, .b, .c' means class a or b or c
Related
I just got started learning to use BeautifulSoup in Python to parse HTML and have a very simple question. Somehow, I just couldn't get Text 1 only from the HTML below (stored in containers).
....
<div class="listA">
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
</div>
...
soup = BeautifulSoup(driver.page_source, 'html.parser')
containers = soup.findAll("div", {"class": "listA"})
datas = []
for data in containers:
    textspan = data.find("span")
    datas.append(textspan.text)
The output is as follows: Text 1Text 2Text 3
Any advice how to delimit them as well? Thanks and much appreciated!
If you just want Text 1, use this code:
import bs4
content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')
# soup('span') will give you
# [<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>, <span>Text 1</span>]
span_text = soup('span')
for e in span_text:
    if not e('span'):
        print(e.text)
Output:
Text 1
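Equivalently, a CSS child selector picks out the inner span directly, and joining the stripped strings gives you all three pieces with a delimiter (a sketch using the same snippet):

```python
import bs4

content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')

# 'span > span' matches a span that is a direct child of another span,
# i.e. the inner one holding "Text 1".
print(soup.select_one('span > span').get_text())

# To keep all three pieces but delimited, join the stripped strings
# of the outer span.
print(' | '.join(soup.select_one('span').stripped_strings))
```

This prints `Text 1` and then `Text 1 | Text 2 | Text 3`, which also answers the delimiter part of the question.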
Another option is SimplifiedDoc from the simplified_scrapy package. Note that it is itself a third-party library, but it is lightweight, fast, and beginner-friendly.
More examples are available here.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
'''
doc = SimplifiedDoc(html)
span = doc.span    # Get the outermost span
first = span.span  # Get the first span inside it
print(first.text)
second = span.b
print(second.text)
third = second.next
print(third.text)
Result:
Text 1
Text 2
Text 3
I'm trying to scrape the BBC Sounds website for **all of the** 'currently playing' images. I'm not bothered about which size to use; 400w might be good.
Below is a relevant excerpt from the HTML and my current Python script. A variation on this works brilliantly for the 'now playing' text, but I haven't been able to get it to work for the image URLs, which is what I'm after. I think that's probably because a) there are so many image URLs to choose from and b) there's whitespace, which no doubt the parser doesn't like. Please bear in mind the HTML code below is repeated about 10 times, once for each of the channels; I've included just one as an example. Thank you!
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.co.uk/sounds"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "sc-o-responsive-image__img sc-u-circle"})
for item in g_data[:10]:
    print(item.text)
<div class="gel-layout__item sc-o-island">
<div class="sc-c-network-item__image sc-o-island" aria-hidden="true">
<div class="sc-c-rsimage sc-o-responsive-image sc-o-responsive-image--1by1 sc-u-circle">
<img alt="" class="sc-o-responsive-image__img sc-u-circle"
src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg" srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/192x192/p07fzzgr.jpg 192w,
https://ichef.bbci.co.uk/images/ic/224x224/p07fzzgr.jpg 224w,
https://ichef.bbci.co.uk/images/ic/288x288/p07fzzgr.jpg 288w,
https://ichef.bbci.co.uk/images/ic/368x368/p07fzzgr.jpg 368w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/448x448/p07fzzgr.jpg 448w,
https://ichef.bbci.co.uk/images/ic/496x496/p07fzzgr.jpg 496w,
https://ichef.bbci.co.uk/images/ic/512x512/p07fzzgr.jpg 512w,
https://ichef.bbci.co.uk/images/ic/576x576/p07fzzgr.jpg 576w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"
sizes="(max-width: 400px) 34vw,(max-width: 600px) 25vw,17vw"/>
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bbc.co.uk/sounds")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
    print(item.get("src"))
Output:
https://ichef.bbci.co.uk/images/ic/400x400/p05mpj80.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07dg040.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07zml97.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p0428n3t.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p01lyv4b.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06yphh0.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p05v4t1c.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06z9zzc.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06x0hxb.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06n253f.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p060m6jj.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07l4fjw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03710d6.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p078qrgm.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03crmyc.jpg
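If you specifically want the 400w variant rather than whatever happens to be in src, you can parse the srcset attribute yourself. A sketch against a shortened stand-in for one of the tags from the question (note that BeautifulSoup's html.parser lowercases attribute names, so srcSet is read back as srcset):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one BBC Sounds image tag.
html = '''<img alt="" class="sc-o-responsive-image__img sc-u-circle"
src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg"
srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"/>'''

soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all("img", {"class": "sc-o-responsive-image__img sc-u-circle"}):
    # Each srcset candidate is "URL WIDTHw", comma-separated; split()
    # also swallows the newlines and surrounding whitespace.
    for entry in img["srcset"].split(","):
        url, width = entry.split()
        if width == "400w":
            print(url)
```

So the whitespace in the attribute is not actually a problem; splitting on commas and then on whitespace recovers each URL/width pair cleanly.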
In my job we use custom attributes that we have created. One of them, called can-edit, looks like this in the code (for example):
<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>
It could be on any tag (img, p, h1, h2, div, ...).
What I wish to get is all of the can-edit values within a page; for example, with the HTML above:
['banner top text', 'banner bottom text']
I've tried:
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
But it does not find any.
I've tried:
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
But it does not find any.
The reason this does not work is that here you are looking for a *tag* with the name can-edit, i.e. <can-edit ...>, not for an attribute.
You can use the find_all function of the soup to find all tags with a certain attribute. For example:
soup.find_all(attrs={'can-edit': True})
So here we use the attrs parameter and pass it a filter saying that we want tags that have a can-edit attribute, regardless of its value. If we then want the value of that attribute, we can access tag['can-edit'], so we can write a list comprehension:
all_can_edit_attrs = [tag['can-edit']
for tag in soup.find_all(attrs={'can-edit': True})]
Or a full working version:
from bs4 import BeautifulSoup
s = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>"""
soup = BeautifulSoup(s, 'lxml')
all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]
print(all_can_edit_attrs)  # ['banner top text', 'banner bottom text']
I am writing a script and want to check if a particular class is present in html or not.
from bs4 import BeautifulSoup
import requests
def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html
html=makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
print("3 star", html.has_attr("i-stars i-stars--large-3 rating-very-large"))  # it's returning False
res = html.find('i-stars i-stars--large-3 rating-very-large')  # it's returning None
Please guide me on how I can resolve this issue. If I can somehow get the title (title="3.0 star rating"), that will also work for me. Screenshot of console HTML:
<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
<img class="offscreen" height="303" src="https://s3-media1.fl.yelpcdn.com/assets/srv0/yelp_design_web/8a6fc2d74183/assets/img/stars/stars.png" width="84" alt="3.0 star rating">
</div>
has_attr is a method that checks whether an element has a given attribute. class is the attribute here; i-stars i-stars--large-3 rating-very-large is its value.
find expects a tag name (optionally with attribute filters), not a CSS selector string. For a CSS selector, use select_one instead: html.select_one('div.i-stars.i-stars--large-3.rating-very-large'). This looks for a div that carries all of these classes.
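For example, applied to a local stand-in for the snippet from the question (this also recovers the title value the asker mentioned):

```python
from bs4 import BeautifulSoup

# Stand-in for the Yelp snippet from the question.
html = '''<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
<img class="offscreen" alt="3.0 star rating">
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# select_one takes a CSS selector; div.a.b.c requires all three classes.
div = soup.select_one('div.i-stars.i-stars--large-3.rating-very-large')
if div:
    print(div['title'])  # 3.0 star rating
```

Bear in mind that on the live Yelp page the markup may be rendered by JavaScript, in which case requests alone will not see this div at all.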
I was having similar problems getting the exact classes. They come back as a list under the tag's attrs, as follows:
from bs4 import BeautifulSoup

html = '<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">'
soup = BeautifulSoup(html, 'html.parser')
find = soup.div
classes = find.attrs['class']
c1 = find.attrs['class'][0]
print(classes, c1)
from bs4 import BeautifulSoup
import requests
def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html
html=makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
res = html.find(class_='i-stars i-stars--large-3 rating-very-large')
if res:
    print("3 star", 'whatever you want print')
out:
3 star whatever you want print
#!/usr/bin/env python
import requests, bs4
res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
for d in web_page.findAll("div", {"class": "actionColumnText"}):
    print(d)
Result:
<div class="actionColumnText">
/service/api/console/gsm/{gsmKey}/sites/{siteId}/endpoints/reactivate
</div>
<div class="actionColumnText">
Reactivates a list of endpoints, or all endpoints on a site. </div>
I am interested in seeing output with only the last line (Reactivates a list of endpoints, or all endpoints on a site), with the opening and closing <div> tags removed.
Not interested in the line with href
Any help is greatly appreciated.
In a simple case, you can just get the text:
for d in web_page.find_all("div", {"class": "actionColumnText"}):
    print(d.get_text())
Or/and, if there is only single element you want to find, you can get the last match by index:
d = web_page.find_all("div", {"class": "actionColumnText"})[-1]
print(d.get_text())
Or, you can also find div elements with a specific class which don't have an a child element:
def filter_divs(elm):
    # class values are stored as a list, so check membership there
    return (elm and elm.name == "div"
            and "actionColumnText" in elm.get("class", [])
            and elm.a is None)

for d in web_page.find_all(filter_divs):
    print(d.get_text())
Or, in case of a single element:
web_page.find(filter_divs).get_text()
You can select the last one with a CSS selector (there is no :last pseudo-class in CSS, so index the result list instead):
d = web_page.select("div.actionColumnText")[-1]
print(d.get_text())
If this text changes, you can use:
#!/usr/bin/env python
import requests, bs4
res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
your_text = web_page.findAll("div", {"class": "actionColumnText"})[-1]
print(your_text.get_text().strip())