get all soup above a certain div - python

I have a soup of this format:
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between these is not constant, so I can't just get the first three paragraphs (it could be anywhere from 1 to 5).
How do I go about dividing this soup to get the paragraphs? Regex seemed decent at first, but it didn't work for me because I would still need a soup object afterwards for further extraction.
Thanks a ton

You could select your element, iterate over its siblings and break if there is no p:
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Or go the other way around, which is closer to your initial question: select the <div class = 'bar'> and call find_previous_siblings('p'). Note that the siblings come back in reverse document order, nearest first:
for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)
Example
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Output
<p> </p>
<p> </p>
<p> </p>
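If the paragraphs are needed as a list rather than printed, the same sibling walk can be written with itertools.takewhile. This is only a minimal sketch: the paragraph texts and the closed outer div are made up for illustration, and it assumes the table is a direct child of the foo div.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Illustrative HTML (assumed): same shape as the question, with text added.
html = """
<div class="foo">
<table></table>
<p>one</p>
<p>two</p>
<div class="bar">
<p>inside bar</p>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("div.foo > table")

# find_next_siblings() with no arguments yields only Tag siblings,
# so whitespace text nodes between the elements are skipped.
paragraphs = list(takewhile(lambda tag: tag.name == "p", table.find_next_siblings()))
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

The elements in paragraphs are still Tag objects, so further extraction with find() or select() works on each of them.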

If the HTML is as shown, you can use :not to filter out the p tags that come after the bar div:
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('.foo > table ~ p:not(.bar ~ p)'))
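To see what the :not part contributes, here is a small sketch with an extra paragraph placed after the bar div. The HTML and its text content are assumptions for illustration; the selector is the one from the answer.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (assumed): one paragraph follows the bar div as a sibling.
html = """
<div class="foo">
<table></table>
<p>a</p>
<p>b</p>
<div class="bar"><p>nested</p></div>
<p>c</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "table ~ p" matches every later sibling paragraph of the table;
# ":not(.bar ~ p)" drops the ones that come after the bar div.
matched = soup.select(".foo > table ~ p:not(.bar ~ p)")
matched_texts = [p.get_text() for p in matched]
print(matched_texts)
```

The paragraph nested inside the bar div is never a candidate, because it is not a sibling of the table.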

Related

Beautiful Soup remove first appearance after selector

I am trying to use Beautiful Soup to remove some HTML from an HTML text.
This could be an example of my HTML:
<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>
Focus on those two elements:
<h2 class="myclass"><strong>television</strong></h2>
<ul>
I am trying to remove the first <ul> after <h2 class="myclass"><strong>television</strong></h2>. Also, if possible, I would like to remove this <ul> only if it appears one or two elements after that <h2>.
Is that possible?
You can search for the second <h2> tag using a CSS selector, h2:nth-of-type(2), and if its next_sibling, or the next_sibling after that, is a <ul> tag, then remove it from the HTML using the .decompose() method:
from bs4 import BeautifulSoup
html = """<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>"""
soup = BeautifulSoup(html, "html.parser")
looking_for = soup.select_one("h2:nth-of-type(2)")
if (
    looking_for.next_sibling.name == "ul"
    or looking_for.next_sibling.next_sibling.name == "ul"
):
    soup.select_one("ul:nth-of-type(2)").decompose()
print(soup.prettify())
Output:
<p>
whatever
</p>
<h2 class="myclass">
<strong>
fruit
</strong>
</h2>
<ul>
<li>
something
</li>
</ul>
<div>
whatever
</div>
<h2 class="myclass">
<strong>
television
</strong>
</h2>
<div>
whatever
</div>
You can use a CSS selector (adjacent sibling selector +) and then .extract():
for tag in soup.select('h2.myclass+ul'):
    tag.extract()
If you want to extract every <ul> that follows the <h2> as a later sibling, not just the immediately adjacent one, use the ~ (general sibling) selector:
for tag in soup.select('h2.myclass~ul'):
    tag.extract()
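The "only if it appears one or two elements after that <h2>" condition from the question can also be checked explicitly. The sketch below is one possible way to do it; remove_nearby_ul and its max_distance parameter are hypothetical names introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = (
    '<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2>'
    "<ul><li>something</li></ul><div>whatever</div>"
    '<h2 class="myclass"><strong>television</strong></h2>'
    "<div>whatever</div><ul><li>test</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")


def remove_nearby_ul(soup, heading_text, max_distance=2):
    """Hypothetical helper: remove the first <ul> found within
    max_distance tag siblings after the h2.myclass with the given text."""
    for h2 in soup.select("h2.myclass"):
        if h2.get_text(strip=True) != heading_text:
            continue
        # find_next_siblings(True) returns tags only, nearest first.
        for sibling in h2.find_next_siblings(True)[:max_distance]:
            if sibling.name == "ul":
                sibling.decompose()
                return True
    return False


removed = remove_nearby_ul(soup, "television")
remaining = [ul.get_text() for ul in soup.find_all("ul")]
print(removed, remaining)
```

Unlike the h2.myclass+ul selector, this leaves the <ul> after the "fruit" heading in place, because only the heading with the requested text is considered.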

Scraping informations with Beautiful Soup in same name tags

I want to scrape information from an HTML page with Beautiful Soup in Python, and all the information I need is in tags with the same name. How can I differentiate each piece of information I need?
All the information I need is in different class="hAyfc" tags.
The results will be in order. You just need to read them out, because the order of the results is the same as the order in the HTML:
from bs4 import BeautifulSoup
html = """
<div class = "hAyfc">
<div class = "BgcNfc">pro </div>
<span class = "htlgb">
<div>
<span class = "htlgb">
codeA
</span>
</div>
</span>
</div>
<div class = "hAyfc">
<div class = "BgcNfc">pro </div>
<span class = "htlgb">
<div>
<span class = "htlgb">
codeB
</span>
</div>
</span>
</div>
"""
bs = BeautifulSoup(html,"lxml")
result = [e.text for e in bs.find_all("div",{"class":"hAyfc"})]
print(result)
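If each block carries its own label, the values can also be told apart by name instead of by position. This is a rough sketch under the assumption that the BgcNfc div holds the label and the htlgb span holds the value; the labels "Size" and "Version" are invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (assumed labels): same class names as in the question.
html = """
<div class="hAyfc">
  <div class="BgcNfc">Size</div>
  <span class="htlgb"><div><span class="htlgb">codeA</span></div></span>
</div>
<div class="hAyfc">
  <div class="BgcNfc">Version</div>
  <span class="htlgb"><div><span class="htlgb">codeB</span></div></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

info = {}
for block in soup.find_all("div", class_="hAyfc"):
    # The first matching child of each kind is enough here.
    label = block.find("div", class_="BgcNfc").get_text(strip=True)
    value = block.find("span", class_="htlgb").get_text(strip=True)
    info[label] = value
print(info)
```

A dictionary only works when the labels are unique; with repeated labels a list of (label, value) tuples would preserve every entry.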

Python - Beautifulsoup count only outer tag children of a tag

HTML of page:
<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
<!-- many more like this. -->
I am using Beautiful Soup to scrape a page. On that page I am able to get a form tag by its name.
tag = soup.find("form", {"name": "compareprd"})
Now I want to count all immediate child divs but not all nested divs.
Say for example there are 20 immediate divs inside form.
I tried :
len(tag.findChildren("div"))
But it gives 1500.
I think it gives all "div" inside "form" tag.
Any help appreciated.
You can use a single CSS selector, form[name=compareprd] > div, which will find divs that are immediate children of the form:
html = """<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
</form>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select("form[name=compareprd] > div")))
Or, as commented, pass recursive=False to find_all. findChildren goes back to the bs2 days and is only provided for backwards compatibility.
len(tag.find_all("div", recursive=False))
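The difference between the two calls is easy to see on the sample form. A minimal sketch, using a shortened copy of the HTML from the question:

```python
from bs4 import BeautifulSoup

html = """<form name="compareprd" action="">
<div class="gridBox product" id="quickLookItem-1"><div class="gridItemTop"></div></div>
<div class="gridBox product" id="quickLookItem-2"><div class="gridItemTop"></div></div>
</form>"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("form", {"name": "compareprd"})

# Default search descends into every level below the form.
all_divs = len(tag.find_all("div"))
# recursive=False restricts the search to direct children.
direct_divs = len(tag.find_all("div", recursive=False))
print(all_divs, direct_divs)
```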

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.
This is the code snippet
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It's working in the case of exact match
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml" (this parser requires lxml to be installed):
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
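Two other ways to require all classes at once while keeping the HTML parser are a filter function and a compound CSS selector. This is a sketch rather than the answerers' method; has_all_classes is a hypothetical helper, and the two non-matching divs are added only to show what gets excluded.

```python
import re

from bs4 import BeautifulSoup

# Illustrative HTML (assumed): only the first div has all three classes.
html_doc = """
<div class="common text sighting_4619012"><p>match</p></div>
<div class="common sighting_123"><p>missing text class</p></div>
<div class="text other"><p>missing common class</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
sighting = re.compile(r"^sighting_\d+$")


def has_all_classes(tag):
    """Hypothetical filter: div with 'common', 'text' and a sighting_N class."""
    classes = tag.get("class", [])
    return (
        tag.name == "div"
        and "common" in classes
        and "text" in classes
        and any(sighting.match(c) for c in classes)
    )


by_function = soup.find_all(has_all_classes)
# Chained class selectors are ANDed; [class*=...] is a substring test.
by_selector = soup.select('div.common.text[class*="sighting_"]')
print(len(by_function), len(by_selector))
```

The selector version is shorter, but its substring test is looser than the regular expression, which insists on digits after sighting_.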

Python : Extract HTML content

Is there any way to get the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text
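When several labelled fields share this layout, the sibling lookup can be wrapped in a small function. The sketch below assumes every label is a p.class_label followed by its value in the next p; value_for_label is a hypothetical name, and "Email" is a made-up label used to show the not-found case.

```python
from bs4 import BeautifulSoup

data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")


def value_for_label(soup, label):
    """Hypothetical helper: text of the <p> following the given label, or None."""
    for label_tag in soup.select("p.class_label"):
        if label_tag.get_text(strip=True) == label:
            value_tag = label_tag.find_next_sibling("p")
            return value_tag.get_text(strip=True) if value_tag else None
    return None


print(value_for_label(soup, "User Name"))
print(value_for_label(soup, "Email"))
```

Returning None for a missing label avoids the AttributeError that the chained one-liners raise when find() comes back empty.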
