Consider an HTML structure like:
<div class="entry_content">
<p>
<script>some blah blah script here</script>
<fb:like--blah blah></fb:like>
<img/>
</p>
<p align="left">
content to be scraped begins here
</p>
<p>
more content to be scraped in one or many paragraphs from this paragraph onwards
</p>
-- there could be many more <p> here which also need to be included
</div>
The soup call
content = soup.html.body.find('div', class_='entry_content')
gives me everything within the outermost div tag, including the JavaScript, the Facebook code, and all the HTML tags.
Now, how do I remove everything before <p align="left">?
I tried something like:
content.split('<p align="left">')[1]
But this is not doing the trick.
Have a look at extract or decompose. (Your split attempt fails because content is a Tag object, not a string, so str.split does not apply to it.)
PageElement.extract() removes a tag or string from the tree and returns the tag or string that was extracted.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
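For the structure above, one way to apply this is to locate the <p align="left"> anchor and extract every sibling that precedes it. A minimal sketch, assuming html holds the page source shown in the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
content = soup.find('div', class_='entry_content')

# find the paragraph where the real content starts
start = content.find('p', align='left')

# remove everything inside the div that comes before it
for elm in start.find_previous_siblings():
    elm.extract()  # detaches it from the tree (works for tags and strings)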
I have a class in my soup element that is the description of a unit.
<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>
I can easily grab this part with soup.select(".ats-description")[0].
Now I want to remove the <div class="ats-description"> wrapper itself and keep all the inner tags (to retain the text structure). How do I do it?
soup.select(".ats-description")[0].getText() gives me all the texts within, like this:
'\nHere is a paragraph\ninner div\nAnother div\n\nItem1\nItem2\nItem3\n\n\n'
But it removes all the inner tags, so it's just unstructured text. I want to keep the tags as well.
To get the innerHTML, use the .decode_contents() method:
innerHTML = soup.select_one('.ats-description').decode_contents()
print(innerHTML)
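With the markup from the question, this should print the children with their tags intact (compare with the flattened getText() output above), roughly:
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>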
Alternatively, match by a list of tag names in soup.find_all():
from bs4 import BeautifulSoup
html="""<div class="ats-description">
<p>Here is a paragraph</p>
<div>inner div</div>
<div>Another div</div>
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one("div.ats-description").find_all(['p','div','ul']))
So I have some HTML like this:
<div class="content">
<div class="infobox">
<p> text </p>
<p> more text </p>
</div>
<p> text again </p>
<p> even more text </p>
</div>
And I am using the selector '.content p::text'. I thought this would only get me the immediate children, so I wanted it to extract "text again" and "even more text", but it is also getting the text from the paragraphs inside the other div. How can I prevent this? I only want text from the paragraphs that are immediate children of the div with the class .content.
Scrapy supports an extended set of CSS selectors and XPath selectors; in your case, you're using CSS selectors. The relationship selector you want is >, denoting a parent/child relationship, as in .content > p::text. Scrapy's selectors are described in the section titled "Selectors" in its documentation.
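For illustration, a minimal sketch of how that reads inside a spider callback (parse and response are the usual Scrapy conventions):
# inside your Scrapy spider
def parse(self, response):
    # '>' limits the match to p elements that are direct children of .content
    texts = response.css('.content > p::text').getall()
    # with the sample HTML, roughly: [' text again ', ' even more text ']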
To get the children: div > p is the child combinator (for the inner div it matches "text" and "more text").
In your case, apply the child combinator to the outer div: .content > p. (Beware that div + p is the adjacent-sibling combinator; it matches only a p immediately following a div, so it would pick up "text again" but miss "even more text".)
Worth reading: http://www.w3schools.com/cssref/css_selectors.asp
I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:
<div class="the-one-i-want">
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<ol>
<li>
list item
</li>
<li>
list item
</li>
</ol>
<div class="something-i-dont-want">
content
</div>
<script class="something-else-i-dont-want">
script
</script>
<p>
content
</p>
</div>
All of the content that I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following methods, which work most of the time:
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
This excludes scripts, odd inserted divs, and otherwise unpredictable content such as ads or 'recommended content' type stuff.
Now, there are some instances in which there are elements other than just the <p> tags that carry content contextually important to the main content, such as lists.
Is there a way to get the content from the <div class="the-one-i-want"> in a manner as such:
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
where desired-content-elements would include every element that I deemed fit for that particular content? Such as all <p> tags, all <ol> and <li> tags, but no <div> or <script> tags.
Perhaps noteworthy is my method of saving the content:
content_string = ''
for p in content:
content_string += str(p)
This approach collects the data in order of occurrence, which would be difficult to manage if I gathered the different element types in separate iteration passes. If possible, I'm looking to NOT have to reassemble split lists back into the order in which each element originally occurred.
You can pass a list of tags that you want:
content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
If we run something similar on your question URL, looking for p and pre tags, you can see we get both, interleaved in document order (output truncated here):
...: for ele in soup.select_one("td.postcell").find_all(["pre","p"]):
...: print(ele)
...:
<p>I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:</p>
<pre><code><div class="the-one-i-want">
<p>
content
</p>
...
</code></pre>
<p>All of the content that I want to extract is found within the <code><div class="the-one-i-want"></code> element. Right now, I'm using the following methods, which work most of the time:</p>
... (the rest of the question's p and pre elements follow, in document order)
You can do that quite easily using:
soup = BeautifulSoup(html.text, 'lxml')
desired_tags = {'p', 'ol'}  # add whatever tag names you need
content = filter(lambda x: getattr(x, 'name', None) in desired_tags,
                 soup.find('div', class_='the-one-i-want').children)
This will go through all direct children of the div tag. If you want this to happen recursively (you said something about adding li tags), use .descendants instead of .children. Happy crawling!
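Since filter preserves document order, the saving step from the question works unchanged; a small sketch:
# content is the filter object from above; str() re-serializes each tag
content_string = ''.join(str(tag) for tag in content)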
Does this work for you? It loops through the content, adding the text you want while ignoring any element that contains a div or script tag:
content_string = ''
for p in content:
    # skip elements that wrap a div or script
    if p.find('div') or p.find('script'):
        continue
    content_string += str(p)
I have an HTML document as follows:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
I don't need the references from the article; I want to slice the document at the second h2 tag.
Obviously I can find a list of h2 tags like so:
soup = BeautifulSoup(html)
soupset = soup.find_all('h2')
soupset[1]  # this gets the h2 heading 'References' but not what comes before it
I don't want a list of the h2 tags; I want to slice the document right at the second h2 tag and keep the contents above it in a new variable. Basically, the desired output I want is:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
What is the best way to go about doing this "slicing"/cutting of the HTML document, instead of simply finding tags and outputting the tags themselves?
You can remove/extract every sibling element of the "References" element and the element itself:
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
elm.extract()
references.extract()
print(soup)
Prints:
<div>
<h1> Name of Article</h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
</div>
Alternatively, you can find the location of the last h2 tag in the raw HTML string and slice up to it:
last_h2_tag = str(soup.find_all("h2")[-1])
html[:html.rfind(last_h2_tag)]  # keep everything before the final h2
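A self-contained sketch, assuming html holds the raw markup from the question. Note that this relies on str(tag) serializing exactly as the tag appears in the source, so it is more brittle than the extract() approach above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
last_h2_tag = str(soup.find_all("h2")[-1])  # '<h2> References </h2>'
# slice the original string just before the final h2
sliced_html = html[:html.rfind(last_h2_tag)]
print(sliced_html)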
So I need to scrape a site using Python, but the problem is that the markup is random, unstructured, and proving hard to work with.
For example
<p style='font-size: 24px;'>
<strong>Title A</strong>
</p>
<p>
<strong> First Subtitle of Title A </strong>
"Text for first subtitle"
</p>
Then it will switch to
<p>
<strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>
Then sometimes the new subtitles are added to the end of the previous subtitle's text
<p>
...title E's content finishes
<strong>
<span id="inserted31" style="font-size: 24px;"> Title F </span>
</strong>
</p>
<p>
<strong> First Subtitle for Title F </strong>
</p>
Enough confusion; it's simply poor markup. Obvious patterns such as 'font-size: 24px;' can find the titles, but there isn't a solid, reusable method to scrape the children and associate them with their title.
Regex might work but I feel like the randomness would result in scraping patterns that are too specific and not DRY.
I could offer to rewrite the HTML and fix the hierarchy; however, this being a WordPress site, I fear the content might come back as incompatible for the admin in the WordPress interface.
Any suggestions for either a better scraping method or a way to go about the WordPress side would be greatly appreciated. I want to avoid just copying/pasting as much as possible.
At the least, you can rely on the tag names and the text, navigating the DOM tree horizontally (going sideways): everything you are showing is a strong, p, or span (with an id attribute set) tag.
For example, you can get the strong text and get the following sibling:
>>> from bs4 import BeautifulSoup
>>> data = """
... <p style='font-size: 24px;'>
... <strong>Title A</strong>
... </p>
... <p>
... <strong> First Subtitle of Title A </strong>
... "Text for first subtitle"
... </p>
... """
>>> soup = BeautifulSoup(data)
>>> titles = soup.find_all('strong')
>>> titles[0].text
u'Title A'
>>> titles[1].get_text(strip=True)
u'First Subtitle of Title A'
>>> titles[1].next_sibling.strip()
u'"Text for first subtitle"'