BeautifulSoup: Parsing bad WordPress HTML - python

So I need to scrape a site using Python, but the problem is that the markup is random, unstructured, and hard to work with.
For example:
<p style='font-size: 24px;'>
<strong>Title A</strong>
</p>
<p>
<strong> First Subtitle of Title A </strong>
"Text for first subtitle"
</p>
Then it will switch to
<p>
<strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>
Then sometimes the new subtitles are added to the end of the previous subtitle's text
<p>
...title E's content finishes
<strong>
<span id="inserted31" style="font-size: 24px;"> Title F </span>
</strong>
</p>
<p>
<strong> First Subtitle for Title F </strong>
</p>
Enough confusion; it's simply poor markup. Obvious patterns such as 'font-size: 24px;' can find the titles, but there isn't a solid, reusable method to scrape the children and associate them with their title.
Regex might work but I feel like the randomness would result in scraping patterns that are too specific and not DRY.
I could offer to rewrite the HTML and fix the hierarchy; however, this being a WordPress site, I fear the content might come back incompatible with the admin's WordPress interface.
Any suggestions for either a better scraping method or a way to go about WordPress would be greatly appreciated. I want to avoid just copying/pasting as much as possible.

At the least, you can rely on the tag names and text, navigating the DOM tree horizontally, going sideways: what you are showing are all strong, p, and span (with an id attribute set) tags.
For example, you can get the strong text and get the following sibling:
>>> from bs4 import BeautifulSoup
>>> data = """
... <p style='font-size: 24px;'>
... <strong>Title A</strong>
... </p>
... <p>
... <strong> First Subtitle of Title A </strong>
... "Text for first subtitle"
... </p>
... """
>>> soup = BeautifulSoup(data)
>>> titles = soup.find_all('strong')
>>> titles[0].text
u'Title A'
>>> titles[1].get_text(strip=True)
u'First Subtitle of Title A'
>>> titles[1].next_sibling.strip()
u'"Text for first subtitle"'

Related

Python - How to Remove (Delete) Unclosed Tags

I'm looking for a way to remove unpaired opening tags!
BS4 as well as lxml are good at removing unpaired closing tags.
But if they find an unpaired opening tag, they try to close it, and they close it at the very end :(
Example
from bs4 import BeautifulSoup
import lxml.html
codeblock = '<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>'
soup = BeautifulSoup(codeblock, "html.parser").prettify()
print(soup)
root = lxml.html.fromstring(codeblock)
res = lxml.html.tostring(root)
print(res)
Output bs4:
<strong>
 Good
</strong>
Some text and bad closed strong
Some text and bad open strong PROBLEM HERE
<strong>
 Some text
 <h2>
  Some
 </h2>
 or
 <h3>
  Some
 </h3>
 <p>
  Some Some text
  <strong>
   Good2
  </strong>
 </p>
</strong>
Output lxml:
b'<div><strong>Good</strong> Some text and bad closed strong Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p></strong></div>'
I would be fine if the tag were closed before the first following tag, as here in the example with the h2:
PROBLEM HERE <strong> Some text </strong><h2>Some</h2>
I would also be OK with simply removing the unpaired opening <strong> tag.
But the fact that it closes at the very end is the problem!
In the real code the index (position) of the tag <strong> is not known!
What are the solutions?
I tried to do it with BS4 and lxml but it didn't work!
If you know the solution, please help!
Maybe the solution can be to .unwrap() the second <strong> tag:
codeblock = "<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>"
soup = BeautifulSoup(codeblock, "html.parser")
soup.select("strong")[1].unwrap()
print(soup.prettify())
Prints:
<strong>
 Good
</strong>
Some text and bad closed strong
Some text and bad open strong PROBLEM HERE
Some text
<h2>
 Some
</h2>
or
<h3>
 Some
</h3>
<p>
 Some Some text
 <strong>
  Good2
 </strong>
</p>
As a temporary solution, I decided to unwrap <strong> tags that have children:
from bs4 import BeautifulSoup
codeblock = '<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>'
soup = BeautifulSoup(codeblock, "html.parser")
for item in soup.find_all('strong'):
    # A <strong> containing child tags is one the parser auto-closed too late.
    if item.find():
        item.unwrap()
print(soup)
Prints:
<strong>Good</strong> Some text and bad closed strong Some text and bad open strong PROBLEM HERE Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>
If you see a better solution, please write...
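One more idea, sketched below: instead of unwrapping wholesale, close the tag right before the first following block element, as in your h2 example. The BLOCK_TAGS list is my assumption; extend it to taste.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["h1", "h2", "h3", "h4", "p", "div", "ul", "ol"]  # assumed set

codeblock = '<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>'
soup = BeautifulSoup(codeblock, "html.parser")

for strong in soup.find_all("strong"):
    first_block = strong.find(BLOCK_TAGS, recursive=False)
    if first_block is None:
        continue  # no block-level children, leave this <strong> alone
    # Move the first block child and everything after it out of the
    # <strong>, which effectively closes the tag right before it.
    anchor, node = strong, first_block
    while node is not None:
        nxt = node.next_sibling
        anchor.insert_after(node.extract())
        anchor = node
        node = nxt

print(soup)
# ... PROBLEM HERE <strong> Some text </strong><h2>Some</h2> or <h3>Some</h3> ...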

Get all text in a tag unless it is in another tag

I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore all text that appears within a small tag. For example, this HTML:
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
should give the text Final definition. Note that this is a minimal example. In the real HTML, there are many other tags involved, so the approach should exclude small rather than include a.
The text attribute of the tag is close to what I want, but it would include Fun fact. I could concatenate the text of all children except the small tags, but that would leave out definition. I couldn't find a method like get_text_until (the small tag is always at the end), so what can I do?
You can use find_all to find all the <small> tags, clear them, then use get_text():
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
>>> for el in soup.find_all("small"):
...     el.clear()
...
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small></small>
</li>
>>> soup.get_text()
'\n\n\n Final\n \n definition.\n \n\n'
You can also pass recursive=False to state that you don't want to recurse into child tags, like:
soup.li.find(text=True, recursive=False)
That returns only the first direct string; to gather every direct string:
' '.join(s.strip() for s in soup.li.find_all(text=True, recursive=False))
Note that recursive=False skips text inside every child tag, including <a>, so for the example above (where the <a> text is wanted) the clear() approach is the better fit.
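Another option, sketched under the assumption that only <small> should be dropped: keep the tree intact and filter strings by ancestry instead of clearing tags:
from bs4 import BeautifulSoup

html = '<li> <a href="/path"> Final </a> definition. <small> Fun fact. </small> </li>'
soup = BeautifulSoup(html, "html.parser")

# Keep every non-empty string that has no <small> ancestor.
text = " ".join(
    s.strip()
    for s in soup.li.find_all(text=True)
    if s.strip() and s.find_parent("small") is None
)
print(text)  # Final definition.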

Scrapy XPath - Can't get text within span

I'm trying to reach the address information on a site. Here's an example of my code:
companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()
And here's an example of how addresses are formatted on the site:
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
But when I run the Scrapy script I get an IndexError: list index out of range for the address (vcard hq). I've tried to rewrite the code to get the data but it does not work. The rest of the spider works fine. Am I missing something?
Your example works fine, but I guess your XPath expressions failed on another page or HTML fragment.
The problem is the use of indexes (span[3]) in the headquarters_list XPath expression. Using indexes, you depend heavily on:
1. the total number of span elements
2. the exact order of the span elements
In general, the use of indexes tends to make XPath expressions more fragile and more likely to fail, so, if possible, I would always avoid them. In your example you actually pick up the locality of the address info. The span element can just as easily be referenced by its class name, which makes your expression much more robust:
//li[@class="vcard hq"]/p/span[@class='locality']/text()
Here is my testing code according to your problem description:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
html_text = """
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
"""
sel = Selector(text=html_text)
companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()
It doesn't raise any exception. So chances are there exist web pages with a different structure causing errors.
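Following the same idea, here is a small sketch that collects every address part by its class instead of by position; the field names are taken from the classes in the HTML above, and sel is the Selector built there:
# The wildcard * covers both <span> and <abbr class="region">.
fields = ["street-address", "locality", "region", "postal-code", "country-name"]
address = {
    field: sel.xpath(
        './/li[@class="vcard hq"]/p/*[@class="%s"]/text()' % field
    ).extract()
    for field in fields
}
# e.g. address["locality"] == ['Stockholm,']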
It's good practice not to use indexes directly in XPath rules. dron22's answer gives an awesome explanation.

BeautifulSoup Scraping How to

Consider an HTML structure like:
<div class="entry_content">
<p>
<script>some blah blah script here</script>
<fb:like--blah blah></fb:like>
<img/>
</p>
<p align="left">
content to be scraped begins here
</p>
<p>
more content to be scraped in one or many paragraphs from this paragraph onwards
</p>
-- there could be many more <p> here which also need to be included
</div>
The soup:
content = soup.html.body.find('div', class_='entry_content')
gives me everything within the outermost div tag, including JavaScript, Facebook code, and all HTML tags.
Now, how do I remove everything before <p align="left">?
I tried something like:
content.split('<p align="left">')[1]
But this is not doing the trick.
Have a look at extract or decompose.
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
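A minimal sketch of that approach, assuming soup holds the parsed page (note that content.split fails because content is a Tag, not a string): find the first <p align="left"> and extract everything that precedes it.
content = soup.html.body.find('div', class_='entry_content')
start = content.find('p', align='left')  # the first paragraph to keep

# Pull every earlier sibling (script, fb:like, img, stray strings)
# out of the tree; what remains in the div is the wanted content.
for node in list(start.previous_siblings):
    node.extract()

print(content)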

Parsing HTML with BeautifulSoup

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that the .strings attribute is what you want; it returns an iterator of each string within the object. So if you turn that iterator into a list and take the last item, you should get what you want. For example:
$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text)
>>> soup.find_all("div", style="mine")[0].text
u'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
u'wanted'
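Applied to the review HTML from the question (assuming soup was built from it), that would look like:
review_div = soup.find("div", style="margin-left:0.5em;")
review = list(review_div.strings)[-1].strip()
print(review)  # That is true. I tried it myself this morning. ...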
To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").find_next_sibling(text=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
soup = BeautifulSoup(html, "html.parser")
review = soup.find("div", "tiny").find_next_sibling(text=True)
print(review)
Output
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Here's equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
