Cutting/Slicing an HTML document into pieces with BeautifulSoup? - python

I have an HTML document as follows:
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
I don't need the references from the article, so I want to slice the document at the second h2 tag.
Obviously I can find a list of h2 tags like so:
soup = BeautifulSoup(html)
soupset = soup.find_all('h2')
soupset[1] #this would get the h2 heading 'References' but not what comes before it
I don't want to get a list of the h2 tags, I want to slice the document right at the second h2 tag and keep the above contents in a new variable. Basically the desired output I want is:
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
What is the best way to go about doing this "slicing"/cutting of the HTML document, instead of simply finding tags and outputting the tags themselves?

You can remove/extract every sibling element of the "References" element and the element itself:
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
    elm.extract()
references.extract()
print(soup)
Prints:
<div>
<h1> Name of Article</h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
</div>
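A non-mutating variant of the same idea (a sketch, assuming the heading markup is well formed) is to join everything that precedes the "References" heading instead of extracting what follows it:

```python
import re
from bs4 import BeautifulSoup

data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
references = soup.find("h2", string=re.compile("References"))

# find_previous_siblings() walks backwards, so reverse to restore document order
kept = "".join(str(elm) for elm in reversed(references.find_previous_siblings()))
print(kept)
```

Note that `find_previous_siblings()` only returns tags within the same parent, so this keeps the soup itself untouched while producing the sliced markup as a string.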

You can find the location of the last h2 tag in the raw string and then slice up to it:
last_h2_tag = str(soup.find_all("h2")[-1])
html[:html.rfind(last_h2_tag)]  # everything before the "References" heading
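A runnable version of this string-slicing approach (a sketch: it drops the final heading to match the desired output, and it relies on `str(tag)` reproducing the markup exactly as it appears in the raw string, which only holds when the parser doesn't rewrite the tag):

```python
from bs4 import BeautifulSoup

html = """<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>"""

soup = BeautifulSoup(html, "html.parser")
last_h2_tag = str(soup.find_all("h2")[-1])

# Slice off the last <h2> and everything after it
sliced = html[:html.rfind(last_h2_tag)]
print(sliced)
```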

Related

How to scrape just one text value on one p tag from bs4

The website has one <p>, but inside it there are two text values and I only want to scrape one of them. The website HTML is below:
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
In the HTML above, there are two text values ("Great Clips" & "Request Info") if we target <p>. I just want to scrape "Great Clips", not both. How would I do that with bs4?
You could use .contents with indexing to extract only the first child:
soup.p.contents[0].strip()
Example
from bs4 import BeautifulSoup
html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
soup.p.contents[0].strip()
Output
Great Clips
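An alternative sketch on the same markup uses `.stripped_strings`, which yields each piece of text with surrounding whitespace already trimmed, so the first item is the text you want:

```python
from bs4 import BeautifulSoup

html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''

soup = BeautifulSoup(html, "html.parser")

# .stripped_strings yields "Great Clips" first, then "Request Info"
first_text = next(soup.p.stripped_strings)
print(first_text)  # Great Clips
```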

Split a flat HTML document into sections by tags - Beautifulsoup

I have a flat HTML document where different elements are separated by H2 tags:
<h2>section 1</h2>
other elements in the section...
<h2>section 2</h2>
other elements in the section...
I need to split the document into hierarchical sections using BS4. This is basically what I want to end up with:
<section>
<h2>Section 1</h2>
elements following the previous H2
</section>
<section>
<h2>Section 2</h2>
elements following the previous H2
</section>
How do I go about it with BS4/XPath or CSS selectors?
I need to get the elements between the two H2 tags using BS4...
Of course I can linearly traverse the DOM tree (the current document is actually more like a flat array) and chunk it up, but that would be ugly...
BeautifulSoup does let you insert tags, so you could add <section> tags around the <h2>text elements as follows:
from bs4 import BeautifulSoup
from copy import copy
html = """<html><body>
<h2>section 1</h2>
other elements in the section...
<h2>section 2</h2>
other elements in the section...
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
for h2 in soup.find_all('h2'):
    section_soup = soup.new_tag('section')
    section_soup.append(copy(h2))
    section_soup.append(h2.next_sibling)
    h2.insert_before(section_soup)
    h2.extract()
print(soup)
Giving you an updated HTML as:
<html><body>
<section><h2>section 1</h2>
other elements in the section...
</section><section><h2>section 2</h2>
other elements in the section...
</section></body></html>
This works by first creating a new section tag and adding a copy of the <h2> tag into it followed by the text following it. It then inserts this new soup before the existing <h2> tag. Finally it removes the original <h2> tag.
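The snippet above moves only the single node that follows each <h2>. If a section can contain several nodes, a sketch (a hypothetical variant, not from the original answer) that keeps moving siblings until the next heading might look like:

```python
from bs4 import BeautifulSoup

html = """<html><body>
<h2>section 1</h2>
other elements in the section...
<h2>section 2</h2>
other elements in the section...
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for h2 in soup.find_all("h2"):
    section = soup.new_tag("section")
    h2.insert_before(section)
    section.append(h2)  # .append() moves the node into the wrapper
    # keep moving following siblings until the next heading (or end of parent)
    node = section.next_sibling
    while node is not None and node.name != "h2":
        section.append(node)
        node = section.next_sibling
print(soup)
```

Because `.append()` relocates a node that is already in the tree, each loop iteration pulls the heading and everything up to the next heading inside the new <section>.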

How to get all the tags (with content) under a certain class with BeautifulSoup?

I have a class in my soup element that is the description of a unit.
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0].
Now I want to remove <div class="ats-description">, only to keep all the inner tags (to retain text structure). How to do it?
soup.select(".ats-description")[0].getText() gives me all the texts within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But removes all the inner tags, so it's just unstructured text. I want to keep the tags as well.
To get the innerHTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
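Another option (a sketch on the same markup) is `.unwrap()`, which removes the tag itself but leaves all of its children in place in the tree:

```python
from bs4 import BeautifulSoup

html = """<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# Drop the wrapper div but keep its contents in the soup
soup.select_one(".ats-description").unwrap()
print(soup)
```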
Try this: match by a list of tag names in soup.find_all():
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))

Beautiful Soup remove first appearance after selector

I am trying to use Beautiful Soup to remove some HTML from an HTML text.
This could be an example of my HTML:
<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>
Focus on those two elements:
<h2 class="myclass"><strong>television</strong></h2>
<ul>
I am trying to remove the first <ul> after <h2 class="myclass"><strong>television</strong></h2>. Also, if possible, I would like to remove this <ul> only if it appears one or two elements after that <h2>.
Is that possible?
You can search for the second <h2> tag using a CSS selector, h2:nth-of-type(2), and if the next_sibling (or the next_sibling after that) is an <ul> tag, then remove it from the HTML using the .decompose() method:
from bs4 import BeautifulSoup
html = """<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>"""
soup = BeautifulSoup(html, "html.parser")
looking_for = soup.select_one("h2:nth-of-type(2)")
if (
    looking_for.next_sibling.name == "ul"
    or looking_for.next_sibling.next_sibling.name == "ul"
):
    soup.select_one("ul:nth-of-type(2)").decompose()
print(soup.prettify())
Output:
<p>
whatever
</p>
<h2 class="myclass">
<strong>
fruit
</strong>
</h2>
<ul>
<li>
something
</li>
</ul>
<div>
whatever
</div>
<h2 class="myclass">
<strong>
television
</strong>
</h2>
<div>
whatever
</div>
You can use a CSS selector (the adjacent sibling combinator +) and then .extract():
for tag in soup.select('h2.myclass + ul'):
    tag.extract()
If you want to extract every following sibling <ul> instead, use the general sibling combinator ~:
for tag in soup.select('h2.myclass ~ ul'):
    tag.extract()
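A runnable sketch of the adjacent-sibling approach on the question's markup. Note that + matches only a <ul> that is the immediate next sibling of the <h2>, so here it removes the list after "fruit" but not the one after "television" (which has a <div> in between):

```python
from bs4 import BeautifulSoup

html = ('<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2>'
        '<ul><li>something</li></ul><div>whatever</div>'
        '<h2 class="myclass"><strong>television</strong></h2>'
        '<div>whatever</div><ul><li>test</li></ul>')

soup = BeautifulSoup(html, "html.parser")

# Remove every <ul> that directly follows an <h2 class="myclass">
for tag in soup.select("h2.myclass + ul"):
    tag.extract()
print(soup)
```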

With Beautifulsoup, Extract Tags of Element Except Those specified

I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:
<div class="the-one-i-want">
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<ol>
<li>
list item
</li>
<li>
list item
</li>
</ol>
<div class="something-i-dont-want">
content
</div>
<script class="something-else-i-dont-want">
script
</script>
<p>
content
</p>
</div>
All of the content that I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following methods, which work most of the time:
soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
This excludes scripts, weird insert div's and otherwise un-predictable content such as ads or 'recommended content' type stuff.
Now, there are some instances in which there are elements other than just the <p> tags, which has content that is contextually important to the main content, such as lists.
Is there a way to get the content from the <div class="the-one-i-want"> in a manner as such:
soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
Where desired-content-elements would include every element that I deem fit for that particular content? Such as all <p> tags and all <ol> and <li> tags, but no <div> or <script> tags.
Perhaps noteworthy, is my method of saving the content:
content_string = ''
for p in content:
content_string += str(p)
This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.
You can pass a list of tags that you want:
content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
If we run something similar on your question URL looking for p and pre tags, you can see we get both:
for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
    print(ele)
<p>I'm using Beutifulsoup 4 and Python 3.5+ to extract webdata. I have the following html, from which I am extracting:</p>
<pre><code><div class="the-one-i-want">
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<ol>
<li>
list item
</li>
<li>
list item
</li>
</ol>
<div class='something-i-don't-want>
content
</div>
<script class="something-else-i-dont-want'>
script
</script>
<p>
content
</p>
</div>
</code></pre>
<p>All of the content that I want to extract is found within the <code><div class="the-one-i-want"></code> element. Right now, I'm using the following methods, which work most of the time:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
</code></pre>
<p>This excludes scripts, weird insert <code>div</code>'s and otherwise un-predictable content such as ads or 'recommended content' type stuff.</p>
<p>Now, there are some instances in which there are elements other than just the <code><p></code> tags, which has content that is contextually important to the main content, such as lists.</p>
<p>Is there a way to get the content from the <code><div class="the-one-i-want"></code> in a manner as such:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
</code></pre>
<p>Where <code>desired-content-elements</code>would be inclusive of every element that I deemed fit for that particular content? Such as, all <code><p></code> tags, all <code><ol></code> and <code><li></code> tags, but no <code><div></code> or <code><script></code> tags.</p>
<p>Perhaps noteworthy, is my method of saving the content:</p>
<pre><code>content_string = ''
for p in content:
content_string += str(p)
</code></pre>
<p>This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.</p>
You can do that quite easily using filter:
soup = BeautifulSoup(html.text, 'lxml')
desired_tags = {'p', 'ol'}  # add what you need
content = filter(lambda x: x.name in desired_tags,
                 soup.find('div', class_='the-one-i-want').children)
This will go through all direct children of the div tag. If you want this to happen recursively (you said something about adding li tags), use .descendants instead of .children. Happy crawling!
Does this work for you? It should loop through the content, adding the text you want while ignoring the div and script tags:
for p in content:
    if p.find('div') or p.find('script'):
        continue
    content_string += str(p)