Nested Beautiful Soup classes - python

I am trying to fetch all classes (including the data inside "data_from", "data_to") from the following structure:
<div class="alldata">
<div class="data_from">
<div class="data_to">
<div class="data_to">
<div class="data_from">
</div>
So far I have tried finding all classes, without success. The "data_from", "data_to" classes are not being fetched by:
soup.find_all(class_=True)
When I try to iterate over the "alldata" class I only fetch the first "data_from" class.
for data in soup.findAll('div', attrs={"class": "alldata"}):
    print(data.prettify())
All assistance is greatly appreciated. Thank you.

In newer code, avoid the old syntax findAll() (or a mix of old and new syntax) and use find_all() only. For more, take a minute to check the docs.
Your HTML is not valid, but assuming valid HTML, you could reach your goal with a CSS selector that selects all <div> elements with a class attribute contained in your outer <div>:
soup.select('.alldata div[class]')
Example
from bs4 import BeautifulSoup
html='''<div class="alldata">
<div class="data_from"></div>
<div class="data_to"></div>
<div class="data_to"></div>
<div class="data_from"></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.select('.alldata div[class]')
Output
[<div class="data_from"></div>,
<div class="data_to"></div>,
<div class="data_to"></div>,
<div class="data_from"></div>]
In addition, if you would like to get the texts, iterate over your ResultSet:
for e in soup.select('.alldata div[class]'):
    print(e.text)
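If you also need the class names themselves, here is a small sketch reusing the soup from above; note that .get("class") returns a list, because class is a multi-valued attribute:
for e in soup.select('.alldata div[class]'):
    # e.g. ['data_from'] together with whatever text the div contains
    print(e.get("class"), e.text)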

Related

python beautifulsoup: how to find all before certain stop tag?

I need to find all tags of a certain kind (class "nice") but excluding those after a certain other tag (class "stop").
<div class="nice"></div>
<div class="nice"></div>
<div class="stop">here should be the end of found items</div>
<div class="nice"></div>
<div class="nice"></div>
How do I accomplish this using bs4?
I found this as a similar question but it appears a bit fuzzy.
You can use, for example, .find_previous to filter out the unwanted tags:
from bs4 import BeautifulSoup
html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
for div in soup.find_all("div", class_="nice"):
    if div.find_previous("div", class_="stop"):
        break
    print(div)
Prints:
<div class="nice">want 1</div>
<div class="nice">want 2</div>

Ignore one div class in BeautifulSoup find_all in Python 3

I want to ignore one class when using find_all. I've followed this solution: Select all divs except ones with certain classes in BeautifulSoup.
My divs are a bit different; I want to ignore description-0:
<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>
Following is my code:
classToIgnore = ["description-0"]
all = soup.find_all('div', class_=lambda x: x not in classToIgnore)
It is reading all divs on the page instead of just the ones with "description-n". How do I fix it?
Use regex, like this, for example:
import re
from bs4 import BeautifulSoup
sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""
classes_regex = (
    BeautifulSoup(sample_html, "lxml")
    .find_all("div", {"class": re.compile(r"description-[1-9]")})
)
print(classes_regex)
Output:
[<div class="description-1"></div>, <div class="description-2"></div>]
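As an aside, the original lambda matches every div because any class that is not "description-0" passes the test, and tags with no class at all pass it too (the function is also called with None, and None is not in classToIgnore). A sketch of a fix along those lines, reusing sample_html from above:
from bs4 import BeautifulSoup

classToIgnore = ["description-0"]
soup = BeautifulSoup(sample_html, "lxml")
# Only accept classes that start with "description-" and are not in the
# ignore list; the "x is not None" guard handles tags without any class.
divs = soup.find_all(
    "div",
    class_=lambda x: x is not None
    and x.startswith("description-")
    and x not in classToIgnore,
)
print(divs)
# [<div class="description-1"></div>, <div class="description-2"></div>]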

CSS: Div and Argument Selectors

I have the following structure:
<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>
How do I access the data-ret attribute inside the div with id done? This is for some web scraping.
I've tried a couple of ways but can't seem to make it stick.
Thanks
Using the Beautiful Soup library:
from bs4 import BeautifulSoup
html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''
soup = BeautifulSoup(html,"lxml")
data_ret = soup.find("div",{'id':'done'})
print(data_ret['data-ret'])
Output:
512,500
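Alternatively, a CSS selector can target the element directly (a short sketch using the same soup as above); the attribute is then read like a dictionary key:
done = soup.select_one("div.main div#done[data-ret]")
print(done["data-ret"])             # 512,500
x, y = done["data-ret"].split(",")  # split the two values if needed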

Anything similar to "until" in CSS selector?

I would like to get the movie names available between the "tracked_by" id and the "buzz_off" id. I have already created a selector which can grab names after the "tracked_by" id. However, my intention is to let the script do the parsing UNTIL it finds the "buzz_off" id. The elements containing the names are:
html = '''
<div class="list">
<a id="allow" name="allow"></a>
<h4 class="cluster">Allow</h4>
<div class="base min">Sally</div>
<div class="base max">Blood Diamond</div>
<a id="tracked_by" name="tracked_by"></a>
<h4 class="cluster">Tracked by</h4>
<div class="base min">Gladiator</div>
<div class="base max">Troy</div>
<a id="buzz_off" name="buzz_off"></a>
<h4 class="cluster">Buzz-off</h4>
<div class="base min">Heat</div>
<div class="base max">Matrix</div>
</div>
'''
from lxml import html as htm
root = htm.fromstring(html)
for item in root.cssselect("a#tracked_by ~ div.base"):
    print(item.text)
The selector I've tried with (also mentioned in the above script):
a#tracked_by ~ div.base
Results I'm having:
Gladiator
Troy
Heat
Matrix
Results I would like to get:
Gladiator
Troy
Btw, I would like to use this selector to parse the names, not to style them.
This is a reference for CSS selectors. As you can see, they don't have any form of logic, since CSS is not a programming language. You'd have to loop in Python and handle each element one at a time, stopping at the terminating element, or append the wanted ones to a list; see the sketch below.
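A minimal sketch of that idea, reusing the lxml root parsed in the question: walk the children of div.list in document order, start collecting after the tracked_by anchor, and stop as soon as the buzz_off anchor appears.
names = []
collecting = False
for el in root.cssselect("div.list")[0]:      # children in document order
    if el.tag == "a" and el.get("id") == "tracked_by":
        collecting = True                     # start after this anchor
    elif el.tag == "a" and el.get("id") == "buzz_off":
        break                                 # the "until" part - stop here
    elif collecting and el.tag == "div" and "base" in (el.get("class") or ""):
        names.append(el.text)
print(names)  # ['Gladiator', 'Troy']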

Having problems understanding BeautifulSoup filtering

Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class="g" divs to grabbing just the items of interest in that specific div, but I just get None returns or no prints.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
I have also attempted stepping into the initial div and finding all:
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but I was still not getting anything printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow just got the sample completely wrong the first time (not sure how). However, below is the new sample based on what I am seeing from a soup print, and below that my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">‎
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach with no success so far; am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
First get the div with class name srg, then find all divs with class name g inside it, and get the text of each site and comment. Below is the working code for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # Check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
        else:
            pass
    if comment:  # Check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
        else:
            pass
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT:
Why your code does not work
For try one -
You are using result = main.find('div', {'class': 'g'}), which grabs only the first element encountered, and that first element has no div with class name s. So the next part of the code does not work.
For try two -
You are printing site and comment outside the scope of the for loop. Try printing inside the loop:
soup = BeautifulSoup(html, 'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site.text  # Grab the text
    print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements inside specific containers. Let's ask for cite elements located inside div elements with class="g", located inside the div element with class="srg" - the div.srg div.g cite CSS selector finds exactly what we are asking for:
for cite in soup.select("div.srg div.g cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
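If some results happen to lack that snippet span, a slightly more defensive variant (just a sketch) avoids an AttributeError by skipping them:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    if span is None:
        continue  # no snippet next to this cite; skip it
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))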
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup
