I want to use the following code, which is what the BeautifulSoup documentation says to do. The only problem is that "class" isn't just a word that can appear inside HTML; it's also a Python keyword, which causes this code to throw a syntax error.
So how do I do the following?
soup.findAll('ul', class="score")
Your problem seems to be that you expect find_all to find an exact match for your string. In fact, as the documentation says:
When you search for a tag that matches a certain CSS class, you’re
matching against any of its CSS classes:
You can search by the class attribute as #alKid said, or you can use the class_ keyword argument:
soup.find_all('ul', class_="score")
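A minimal, self-contained sketch of that behavior (the sample HTML here is made up for illustration): class_="score" matches any tag whose multi-valued class attribute includes "score".

```python
from bs4 import BeautifulSoup

html = '<ul class="score header"><li>1</li></ul><ul class="other"><li>2</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# class_="score" matches any tag that has "score" among its CSS classes,
# so the multi-valued class="score header" is found too.
matches = soup.find_all("ul", class_="score")
print(len(matches))         # 1
print(matches[0]["class"])  # ['score', 'header']
```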
Here is how to do it:
soup.find_all('ul', {'class':"score"})
If you're interested in getting the finalScore by going through the ul, you could solve this with a couple of lines of gazpacho:
from gazpacho import Soup
html = """\
<div>
<ul class="score header" id="400488971-linescoreHeader" style="display: block">
<li>1</li>
<li>2</li>
<li>3</li>
<li>4</li>
<li id="400488971-lshot"> </li>
<li class="finalScore">T</li>
</ul>
</div>
"""
soup = Soup(html)
soup.find("ul", {"class": "score"}).find("li", {"class": "finalScore"}).text
Which would output:
'T'
I have a class in my soup element that is the description of a unit.
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0].
Now I want to remove the <div class="ats-description"> wrapper itself, keeping all the inner tags (to retain the text structure). How do I do that?
soup.select(".ats-description")[0].getText() gives me all the texts within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But removes all the inner tags, so it's just unstructured text. I want to keep the tags as well.
To get the innerHTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
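A minimal runnable sketch of the approach, using a trimmed version of the markup from the question:

```python
from bs4 import BeautifulSoup

html = """<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# decode_contents() returns the tag's inner markup as a string,
# without the enclosing <div class="ats-description"> itself.
inner = soup.select_one(".ats-description").decode_contents()
print(inner)
```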
Try this: match by a list of tag names in soup.find_all():
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))
I have the following HTML code:
<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>
I would like to get the anchor tag that has Shop as its text, disregarding the whitespace before and after it. I have tried the following code, but I keep getting an empty list:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
prog = re.compile(r'\s*Shop\s*')
print(soup.find_all("a", string=prog))
# Output: []
I also tried retrieving the text using get_text():
text = soup.find_all("a")[0].get_text()
print(repr(text))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
and ran the following code to make sure my regex was right, which seems to be the case:
result = prog.match(text)
print(repr(result.group()))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
I also tried selecting span instead of a, but I get the same issue. I'm guessing it's something with find_all; I have read the BeautifulSoup documentation but I still can't find the issue. Any help would be appreciated. Thanks!
The problem you have here is that the text you are looking for is inside a tag that contains child tags, and when a tag has children, its string property is None, so a string filter never matches.
You can use a lambda expression in the .find call, and since you are looking for a fixed string, a simple 'Shop' in t.text condition is enough instead of a regex check:
soup.find(lambda t: t.name == "a" and 'Shop' in t.text)
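A self-contained sketch using the question's own snippet, to show the lambda in action:

```python
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""

soup = BeautifulSoup(html, "html.parser")
# t.text concatenates the text of all descendants, so the check succeeds
# even though "Shop" sits inside a child <span>.
link = soup.find(lambda t: t.name == "a" and "Shop" in t.text)
print(link["href"])
```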
The text Shop you are searching for is inside the span tag, not directly in the a tag, so matching the a tag's string against the regex fetches nothing.
You can instead use the regex to find the text node and then walk up to its parent:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(text=re.compile('Shop')).parent.parent)
If you have BeautifulSoup 4.7.1 or above, you can use the following CSS selector:
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('a:contains("Shop")'))
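Note that newer versions of Soup Sieve (2.1 and above, the selector engine used by BeautifulSoup 4) deprecate :contains() in favor of the equivalent :-soup-contains(). The same lookup with the newer name, assuming a recent soupsieve install:

```python
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""

soup = BeautifulSoup(html, "html.parser")
# :-soup-contains matches elements whose text contains the given substring.
link = soup.select_one('a:-soup-contains("Shop")')
print(link["href"])
```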
I am trying to scrape a web page (this page, for example) for the product names, but somehow my find_all method doesn't work properly and doesn't find all the tags I specified.
So here is what I am doing
from urllib import request
from bs4 import BeautifulSoup

url = 'https://www.toysrus.fi/nallet-ja-pehmolelut/interaktiiviset-pehmolelut'
soup = BeautifulSoup(request.urlopen(url).read(), 'html.parser')
print(len(soup.find_all('div', {'class': 'inner-wrapper'})))
The actual number of class='inner-wrapper' divs on the page is 4, but this finds only 1. Please guide me in scraping the product names from the page: how can I get the correct number of div tags with class 'inner-wrapper'? Thanks.
Beautiful Soup only finds proper HTML div tags; those that happen to be inside scripts are ignored, since Beautiful Soup does not evaluate scripts.
Just open the HTML source: you will see one real div of that class, and a bunch of script/JS templates like the one below.
<script type="text/x-jsrender" id="product-list-skuid-template">
<div class="product-list-component type-{{:TemplateInfo.type}} outer-wrapper">
<div class="inner-wrapper">
<ul class="product-list-container">
{{for Data}} {{include tmpl="#product-template"/}} {{/for}}
</ul>
</div>
</div>
{{!-- SHADOW --}} {{if TemplateInfo.divider=='roundshadow'}}
<div class="round-shadow"></div>
{{else TemplateInfo.divider=='simple'}}
<hr /> {{/if}}
</script>
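Since the template markup is just text inside the script tag, one workaround is to re-parse that text with a second BeautifulSoup pass. A sketch against an inline sample; note that on the real page the templates contain only placeholders, so the actual product data presumably still arrives via a separate request:

```python
from bs4 import BeautifulSoup

page = """<script type="text/x-jsrender" id="product-list-skuid-template">
<div class="inner-wrapper">
  <ul class="product-list-container"></ul>
</div>
</script>"""

soup = BeautifulSoup(page, "html.parser")
# A script tag's contents are plain text to the parser; pull the text out
# and feed it to a second BeautifulSoup pass so the template markup
# becomes searchable tags.
template_text = soup.find("script", id="product-list-skuid-template").get_text()
inner = BeautifulSoup(template_text, "html.parser")
print(len(inner.find_all("div", class_="inner-wrapper")))  # 1
```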
How do I get the href of all the a tags under the class "subforum" in the given code?
<li class="subforum">
<a href="Link1">Link1 Text</a>
</li>
<li class="subforum">
<a href="Link2">Link2 Text</a>
</li>
<li class="subforum">
<a href="Link3">Link3 Text</a>
</li>
I have tried this code but obviously it didn't work.
Bs = BeautifulSoup(requests.get(url).text,"lxml")
Class = Bs.findAll('li', {'class': 'subforum"'})
for Sub in Class:
print(Link.get('href'))
The href belongs to the a tag, not the li tag; use li.a to reach the a tag.
Document: Navigating using tag names
import bs4
html = '''<li class="subforum">
<a href="Link1">Link1 Text</a>
</li>
<li class="subforum">
<a href="Link2">Link2 Text</a>
</li>
<li class="subforum">
<a href="Link3">Link3 Text</a>
</li>'''
soup = bs4.BeautifulSoup(html, 'lxml')
for li in soup.find_all(class_="subforum"):
    print(li.a.get('href'))
out:
Link1
Link2
Link3
Why use class_?
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
You are almost there; you just need to find an a element for every li you've located:
Class = Bs.findAll('li', {'class': 'subforum'})
for Sub in Class:
    print(Sub.find("a").get('href'))  # or Sub.a.get('href')
But, there is an easier way - a CSS selector:
for a in Bs.select("li.subforum a"):
    print(a.get('href'))
Here, li.subforum a would match all a elements under the li elements having subforum class attribute.
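A self-contained sketch of the selector approach; the href values here are hypothetical, for illustration only:

```python
from bs4 import BeautifulSoup

html = """<li class="subforum"><a href="Link1">Text</a></li>
<li class="subforum"><a href="Link2">Text</a></li>
<li class="subforum"><a href="Link3">Text</a></li>"""

soup = BeautifulSoup(html, "html.parser")
# li.subforum a matches every <a> nested under an <li> with class "subforum".
hrefs = [a.get("href") for a in soup.select("li.subforum a")]
print(hrefs)  # ['Link1', 'Link2', 'Link3']
```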
As a side note, in BeautifulSoup 4, findAll() was renamed to find_all(). And, you should follow the Python general variable naming guidelines.
Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class="g" divs to grabbing just the items of interest in that specific div, but I just get None returns or no prints.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
site = item.find('cite')
comment = item.find('span', {'class': 'st'})
print site
print comment
I have also attempted stepping into the initial div and finding all;
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
site = result.find('cite')
comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but still nothing was printed. So I decided to take another look at the soup, and it looks different. I was previously looking at response.text from requests. I can only think that BeautifulSoup modifies response.text, or I somehow got the sample completely wrong the first time (not sure how). Below is a new sample based on what I see when I print the soup, and below that my attempt to get at the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach with no success so far. Am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
First get the div with class name srg, then find all the class g divs inside it and pull out the site and comment text from each. Below is code that works for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
    if comment:  # check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT
Why your code does not work:
For try one: result = main.find('div', {'class': 'g'}) grabs only the first matching element, and that first element contains no div with class name s, so the rest of that code finds nothing.
For try two: you are printing site and comment outside the scope of the loop. Print them inside the for loop:
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site.text  # grab the text
    print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
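The switch from find_next_sibling to find_next matters here: in the updated markup the span with class="st" is no longer a direct sibling of the cite (which sits inside the div with class="kv"), and find_next searches all following elements rather than only siblings. A minimal sketch of the difference, using made-up markup shaped like the update:

```python
from bs4 import BeautifulSoup

html = """<div class="s">
  <div class="kv"><cite>www.url.com/index.html</cite></div>
  <span class="st">Details about URI</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
cite = soup.find("cite")
# The span is a sibling of the enclosing div, not of <cite> itself:
print(cite.find_next_sibling("span", class_="st"))  # None
# find_next walks the whole document forward from <cite>:
print(cite.find_next("span", class_="st").get_text())  # Details about URI
```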
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup