I am trying to use Beautiful Soup to remove some HTML from an HTML text.
This could be an example of my HTML:
<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>
Focus on those two elements:
<h2 class="myclass"><strong>television</strong></h2>
<ul>
I am trying to remove the first <ul> after <h2 class="myclass"><strong>television</strong></h2>. If possible, I would also like to remove this <ul> only if it appears within 1 or 2 elements after that <h2>.
Is that possible?
You can search for the second <h2> tag using a CSS selector, h2:nth-of-type(2). If its next_sibling, or the next_sibling after that, is a <ul> tag, then remove that <ul> from the HTML using the .decompose() method:
from bs4 import BeautifulSoup
html = """<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>"""
soup = BeautifulSoup(html, "html.parser")
# The second <h2> among its siblings is the "television" heading.
looking_for = soup.select_one("h2:nth-of-type(2)")
if (
    looking_for.next_sibling.name == "ul"
    or looking_for.next_sibling.next_sibling.name == "ul"
):
    # In this document the <ul> near "television" is the second <ul>.
    soup.select_one("ul:nth-of-type(2)").decompose()
print(soup.prettify())
Output:
<p>
 whatever
</p>
<h2 class="myclass">
 <strong>
  fruit
 </strong>
</h2>
<ul>
 <li>
  something
 </li>
</ul>
<div>
 whatever
</div>
<h2 class="myclass">
 <strong>
  television
 </strong>
</h2>
<div>
 whatever
</div>
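The check described above can also be written without relying on the position of the <ul> in the document. The following is only a minimal sketch: the helper name remove_ul_near_heading and its max_distance parameter are made up for illustration, and it assumes the heading can be identified by its text.

```python
from bs4 import BeautifulSoup

html = (
    '<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2>'
    '<ul><li>something</li></ul><div>whatever</div>'
    '<h2 class="myclass"><strong>television</strong></h2>'
    '<div>whatever</div><ul><li>test</li></ul>'
)


def remove_ul_near_heading(soup, heading_text, max_distance=2):
    """Remove the first <ul> within max_distance sibling tags after the matching <h2>.

    Hypothetical helper: returns True if a <ul> was removed, False otherwise.
    """
    for h2 in soup.select("h2.myclass"):
        if h2.get_text(strip=True) != heading_text:
            continue
        # Passing True matches any tag, so text nodes between elements are skipped.
        for sibling in h2.find_next_siblings(True, limit=max_distance):
            if sibling.name == "ul":
                sibling.decompose()
                return True
    return False


soup = BeautifulSoup(html, "html.parser")
removed = remove_ul_near_heading(soup, "television")
print(removed)
print(soup)
```

Setting max_distance=1 would leave this document unchanged, because a <div> sits between the "television" heading and its <ul>.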
You can use a CSS selector (adjacent sibling selector +) and then .extract():
for tag in soup.select('h2.myclass+ul'):
    tag.extract()
If you want to extract every <ul> that follows the <h2> as a sibling, not just the immediately adjacent one, use the ~ selector instead:
for tag in soup.select('h2.myclass~ul'):
    tag.extract()
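To see how the two combinators behave on the HTML from the question, this short sketch only counts the matches without removing anything. The variable names adjacent and following are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2>'
    '<ul><li>something</li></ul><div>whatever</div>'
    '<h2 class="myclass"><strong>television</strong></h2>'
    '<div>whatever</div><ul><li>test</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# "+" matches a <ul> only when it comes directly after an h2.myclass.
adjacent = soup.select("h2.myclass + ul")

# "~" matches any <ul> that has an h2.myclass somewhere before it among its siblings.
following = soup.select("h2.myclass ~ ul")

print(len(adjacent), [ul.get_text() for ul in adjacent])
print(len(following), [ul.get_text() for ul in following])
```

On this particular document the + selector picks the list after "fruit" rather than the one after "television", because a <div> separates the "television" heading from its list, so the selector may need to be narrowed before extracting.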
Related
I have a soup of this format:
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
I want to scrape all the paragraphs between the table and bar div. The challenge is that number of paragraphs between these is not constant. So I can't just get the first three paragraphs (it could be anywhere from 1-5).
How do I go about dividing this soup to get the paragraphs? Regex seemed reasonable at first, but it didn't work for me because I would still need a soup object afterwards for further extraction.
Thanks a ton
You could select your element, iterate over its siblings and break if there is no p:
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Or go the other way around, closer to your initial question: select the <div class = 'bar'> and call find_previous_siblings('p'). Note that this returns the paragraphs in reverse document order, nearest first:
for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)
Example
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# Walk the tags after the table and stop at the first one that is not a <p>.
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Output
<p> </p>
<p> </p>
<p> </p>
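If the paragraphs are needed as a list of Tag objects for further extraction rather than printed, the same stop-at-the-first-non-<p> idea can be expressed with itertools.takewhile. This is a minimal sketch; the paragraph texts and the trailing "after bar" paragraph are added here only to make the result visible.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
  <p>after bar</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one(".foo > table")

# find_next_siblings(True) returns the following sibling tags in document order,
# skipping whitespace text nodes; takewhile stops at the first non-<p> tag.
paragraphs = list(
    takewhile(lambda tag: tag.name == "p", table.find_next_siblings(True))
)
texts = [p.get_text() for p in paragraphs]
print(texts)
```

Each item in paragraphs is still a Tag, so methods such as find() or select() remain available on it.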
If the HTML is as shown, you can use :not in a CSS selector to filter out the p tags that come after the bar div:
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# Sibling <p> tags after the table, excluding any <p> that follows .bar
print(soup.select('.foo > table ~ p:not(.bar ~ p)'))
I have a class in my soup element that is the description of a unit.
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0].
Now I want to remove <div class="ats-description">, only to keep all the inner tags (to retain text structure). How to do it?
soup.select(".ats-description")[0].getText() gives me all the texts within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But removes all the inner tags, so it's just unstructured text. I want to keep the tags as well.
To get the inner HTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
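As a self-contained sketch of this, the example below uses a shortened version of the HTML from the question. It also shows Tag.unwrap(), which is a different technique from the one in the answer: it removes the wrapper from the tree in place while keeping its children, which may be preferable when you want to keep working with the soup instead of a string.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="ats-description">'
    "<p>Here is a paragraph</p><div>inner div</div>"
    "<ul><li>Item1</li><li>Item2</li></ul>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
wrapper = soup.select_one(".ats-description")

# decode_contents() serializes the children only, without the wrapper tag itself.
inner_html = wrapper.decode_contents()
print(inner_html)

# unwrap() removes the wrapper tag from the tree and leaves its children in place.
wrapper.unwrap()
print(soup)
```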
Try this: pass a list of tag names to find_all() to match any of them:
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))
HTML of page:
<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
<!-- many more like this. -->
I am using Beautiful Soup to scrape a page. On that page I am able to get a form tag by its name.
tag = soup.find("form", {"name": "compareprd"})
Now I want to count all immediate child divs but not all nested divs.
Say for example there are 20 immediate divs inside form.
I tried :
len(tag.findChildren("div"))
But it gives 1500.
I think it returns every "div" nested inside the "form" tag.
Any help appreciated.
You can use a single css selector form[name=compareprd] > div which will find div's that are immediate children of the form:
html = """<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
</form>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select("form[name=compareprd] > div")))
Or, as commented, pass recursive=False, but use find_all; findChildren is an older alias that is only provided for backwards compatibility.
len(tag.find_all("div", recursive=False))
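To make the difference concrete, this small sketch counts the divs in the sample form three ways. The variable names all_divs, direct_divs and css_divs are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = """<form name="compareprd" action="">
<div class="gridBox product " id="quickLookItem-1">
<div class="gridItemTop">
</div>
</div>
<div class="gridBox product " id="quickLookItem-2">
<div class="gridItemTop">
</div>
</div>
</form>"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find("form", {"name": "compareprd"})

# Default search is recursive: every nested <div> is counted.
all_divs = form.find_all("div")

# recursive=False restricts the search to direct children of the form.
direct_divs = form.find_all("div", recursive=False)

# The child combinator ">" expresses the same restriction in CSS.
css_divs = soup.select("form[name=compareprd] > div")

print(len(all_divs), len(direct_divs), len(css_divs))
```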
I have an HTML document as follows:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
I don't need references from the article, I want to slice the document at the second h2 tag.
Obviously I can find a list of h2 tags like so:
soup = BeautifulSoup(html)
soupset = soup.find_all('h2')
soupset[1] #this would get the h2 heading 'References' but not what comes before it
I don't want to get a list of the h2 tags, I want to slice the document right at the second h2 tag and keep the above contents in a new variable. Basically the desired output I want is:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
What is the best way to go about "slicing" or cutting the HTML document like this, instead of simply finding tags and outputting the tags themselves?
You can remove/extract every sibling element of the "References" element and the element itself:
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
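# string= is the current name of the older text= argument.
references = soup.find("h2", string=re.compile("References"))
# Remove everything after the "References" heading, then the heading itself.
for elm in references.find_next_siblings():
    elm.extract()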
references.extract()
print(soup)
Prints:
<div>
<h1> Name of Article</h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
</div>
You can also find the position of the last h2 in the original HTML string and slice the string just before it:
last_h2_tag = str(soup.find_all("h2")[-1])
# Works only if the serialized tag matches the source text exactly.
html[:html.rfind(last_h2_tag)]
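The tree-based approach can be wrapped in a small helper that cuts at the n-th h2. This is a sketch only: the function name cut_from_heading is hypothetical, and the sample markup assumes the mismatched </h2> after the h1 has been corrected to </h1>.

```python
from bs4 import BeautifulSoup

html = """<div>
<h1>Name of Article</h1>
<p>First Paragraph I want</p>
<h2>Subheading in the article I also want</h2>
<p>Even more Html I want to pull out of the document.</p>
<h2>References</h2>
<p>Html I do not want...</p>
</div>"""


def cut_from_heading(soup, index):
    """Remove the <h2> at 0-based position `index` and every sibling after it."""
    heading = soup.find_all("h2")[index]
    # next_siblings also yields text nodes; copy to a list before removing items.
    for node in list(heading.next_siblings):
        node.extract()
    heading.extract()
    return soup


soup = cut_from_heading(BeautifulSoup(html, "html.parser"), 1)
print(soup)
```

Because the result is still a soup object, the kept part can be searched further without parsing the string again.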
Is there any way to extract the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(string='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text
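When a list holds several label/value pairs, the label-plus-next-sibling idea can be applied in a loop to build a dictionary. This is a minimal sketch: the second <li> ("Email") is invented here purely to illustrate more than one pair.

```python
from bs4 import BeautifulSoup

data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
<li>
<p class="class_label">Email</p>
<p>someone@example.com</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")

fields = {}
for label in soup.select("p.class_label"):
    # The value is assumed to be the next <p> sibling of each label.
    value = label.find_next_sibling("p")
    if value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields)
```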