Ignore one div class in BeautifulSoup find_all in Python 3 - python

I want to ignore one class when using find_all. I've followed this solution Select all divs except ones with certain classes in BeautifulSoup
My divs are a bit different, I want to ignore description-0
<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>
Following is my code
classToIgnore = ["description-0"]
all = soup.find_all('div', class_=lambda x: x not in classToIgnore)
It is reading all divs on the page, instead of just the ones with "descriptions-n". How to fix it?

Use regex, like this, for example:
import re
from bs4 import BeautifulSoup
sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""
classes_regex = (
BeautifulSoup(sample_html, "lxml")
.find_all("div", {"class": (re.compile(r"description-[1-9]"))})
)
print(classes_regex)
Output:
[<div class="description-1"></div>, <div class="description-2"></div>]

Related

python beautifulsoup: how to find all before certain stop tag?

I need to find all tags of a certain kind (class "nice") but excluding those after a certain other tag (class "stop").
<div class="nice"></div>
<div class="nice"></div>
<div class="stop">here should be the end of found items</div>
<div class="nice"></div>
<div class="nice"></div>
How do I accomplish this using bs4?
I found this as a similar question but it appears a bit fuzzy.
You can use for example .find_previous to filter out unwanted tags:
from bs4 import BeautifulSoup
html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
for div in soup.find_all("div", class_="nice"):
if div.find_previous("div", class_="stop"):
break
print(div)
Prints:
<div class="nice">want 1</div>
<div class="nice">want 2</div>

CSS: Div and Argument Selectors

I have the following structure,
<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>
How do I acess the data-ret argument inside div id done? For doing some web scraping.
Tried a couple of ways but don't seem to be able to stick it.
Thanks
Using beautiful soup library:
from bs4 import BeautifulSoup
html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''
soup = BeautifulSoup(html,"lxml")
data_ret = soup.find("div",{'id':'done'})
print(data_ret['data-ret'])
O/P:
512,500

beautifulsoup - extracting link, text, and title within child div

The layout is as follows:
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
install app
</div>
</div>
I'm trying to grab The TITLE, then the APP_URL and ideally, when I print via html, I would like for the TITLE to become a hyper link of the APP_URL.
My code is like this but doesn't yield desire results. I believe I need to add another command within the loop to grab the title. Only problem is, How do I make sure that I grab the TITLE and APP_URL so that they go together? There are at least 15 apps with the class of <div class="App">. Of course, I want all 15 results as well.
IMPORTANT: for the href links, I need it from the class called "signed button".
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
a = div.findAll('a')[1]
print a.text.strip(), '=>', a.attrs['href']
Use CSS selectors:
from bs4 import BeautifulSoup
html = """
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
install app
</div>
</div>"""
soup = BeautifulSoup(html, 'html5lib')
for div in soup.select('div.App'):
title = div.select_one('div.title')
link = div.select_one('a')
print("Click here: <a href='{}'>{}</a>".format(link["href"], title.text))
Which yields
Click here: <a href='http://app_url'>Application Name #1</a>
Maybe something like this will work?
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
a = div.findAll('a')[0]
print div.findAll('div', {'class': 'title'})[0].text, '=>', a.attrs['href']

Having problems understanding BeautifulSoup filtering

Could someone please explain how the filtering works with Beautiful Soup. Ive got the below HTML I am trying to filter specific data from but I cant seem to access it. Ive tried various approaches, from gathering all class=g's to grabbing just the items of interest in that specific div, but I just get None returns or no prints.
Each page has a <div class="srg"> div with multiple <div class="g"> divs, the data i am looking to use is the data withing <div class="g">. Each of these has
multiple divs, but im only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works, any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
site = item.find('cite')
comment = item.find('span', {'class': 'st'})
print site
print comment
I have also attempted stepping into the initial div and finding all;
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
site = result.find('cite')
comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right but was still not getting anything printed. So I decided to take another look at the soup and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text or I somehow just got the sample completely wrong the first time (not sure how). However Below is the new sample based on what I am seeing from a soup print. And below that my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">‎
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach to no success so far, am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
First get div with class name srg then find all div with class name s inside that srg and get text of that site and comment. Below is the working code for me-
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
site = data.find('cite',{'class':'_Rm'})
comment = data.find('span',{'class':'st'})
if site:#Check if site in not None
if site.text.strip() not in sites:
sites.append(site.text.strip())
else:
pass
if comment:#Check if comment in not None
if comment.text.strip() not in comments:
comments.append(comment.text.strip())
else: pass
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT--
Why your code does not work
For try One-
You are using result = main.find('div', {'class': 'g'}) it will grab single and first encountered element but first element has not div with class name s . So the next part of this code will not work.
For try Two-
You are printing site and comment that is not in the print scope. So try to print inside for loop.
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
site = result.find('cite')
comment = result.find('span', {'class': 'st'})
print site.text#Grab text
print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Also, make sure you are using the 4th BeautifulSoup version:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.
This is the code snippet
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It's working in the case of exact match
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print all_findings
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend #Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print all_findings

Categories