I am attempting to fetch data from a <UL> where the list markup is malformed. In other words, the end tags (</LI>) are missing from the list:
<UL>
<LI>Blah2
<LI><A>Blah</A>
<LI><A>Blah2</A>
</UL>
<UL>
<LI><A>Blah</A>
<LI>Blah2
<LI><A>Blah2</A>
</UL>
<UL>
<LI><A>Blah</A>
</UL>
<UL>
<LI>Blah
</UL>
Can I somehow iterate through this? As shown in the example, the items can be a mixture of links and plain text. What's most important is that I fetch the links (if any) and the text.
Unfortunately, BeautifulSoup attempts to repair the malformed HTML and causes more damage than needed:
from bs4 import BeautifulSoup as bsoup
html = '<UL><LI>Blah><LI><A>Blah</A><LI><A>Blah2</A></UL>'
print bsoup(html).prettify()
>>> <ul>
>>> <li>
>>> Blah>
>>> <li>
>>> <a>
>>> Blah
>>> </a>
>>> <li>
>>> <a>
>>> Blah2
>>> </a>
>>> </li>
>>> </li>
>>> </li>
>>> </ul>
As seen in the example above, BeautifulSoup nests the list items and adds all the end tags at the end of the list.
As per my comment, BS4 handles invalid HTML differently depending on which parser you use. The four parsers that are supported are:
html.parser (which is built in)
lxml's HTML parser
lxml's XML parser
html5lib (which works in this case)
You can use trial and error, or look specifically at your issue and at the way each parser handles it (see the documentation for each parser), and choose the parser that behaves the way you want.
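For instance, a minimal sketch of the html5lib route (the parser noted above as handling this case), using markup like the question's:
from bs4 import BeautifulSoup

html = '<UL><LI>Blah<LI><A>Blah</A><LI><A>Blah2</A></UL>'
# html5lib implicitly closes each <li> before the next one starts,
# instead of nesting the items the way html.parser does above.
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())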
If there are no nested list items, you can manually close the <li> tags using regular expressions:
>>> import re
>>> from bs4 import BeautifulSoup
>>> xhtml = re.sub(r'\<li\>(.*?)(?=\<li\>)', r'<li>\1</li>', html,
...                flags=re.IGNORECASE | re.DOTALL)
>>> xhtml
'<UL><li>Blah></li><li><A>Blah</A></li><LI><A>Blah2</A></UL>'
>>> print(BeautifulSoup(xhtml).prettify())
<html>
<body>
<ul>
<li>
Blah>
</li>
<li>
<a>
Blah
</a>
</li>
<li>
<a>
Blah2
</a>
</li>
</ul>
</body>
</html>
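With the markup repaired (whether by html5lib or by the regex above), pulling the text and the link (if any) out of each item could look roughly like this sketch, reusing the xhtml string from the previous snippet:
soup = BeautifulSoup(xhtml, 'html.parser')
for li in soup.find_all('li'):
    a = li.find('a')                      # None when the item has no link
    text = li.get_text(strip=True)        # the item's visible text
    href = a.get('href') if a else None   # link target, if the <a> has one
    print(text, href)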
I'm having some issues rewriting HTML text. I want to change my text.
For example:
<p>
Some Text
</P>
I want to output it like this: <p>This is some text</p>
or
<ol>
<li>My Elm<li>
</ol>
<ul>
<li>My Elm<li>
</ul>
Output should be like this:
<ol>
<li>This is my elm<li>
</ol>
<ul>
<li>This is my elm<li>
</ul>
I want to change only the text content and keep the same HTML structure.
Because the question is not that clear and needs some improvement, I would point you in a direction with BeautifulSoup's replace_with().
Example
from bs4 import BeautifulSoup
html = '''
<ol>
<li>My Elm</li>
</ol>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.ol.li.string.replace_with('Some other text ...')
soup
Output
<ol>
<li>Some other text ...</li>
</ol>
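If the same replacement should be applied to every list item (both the <ol> and the <ul> from the question), an assumed extension of the same idea would be to loop over find_all('li'), roughly like this:
from bs4 import BeautifulSoup

html = '''
<ol>
<li>My Elm</li>
</ol>
<ul>
<li>My Elm</li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
for li in soup.find_all('li'):
    li.string.replace_with('This is my elm')  # only the text node is swapped
print(soup)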
I'm trying to parse through HTML using BeautifulSoup (with the lxml parser).
On nested tags I'm getting repeated text.
I've tried only counting tags that have no children, but then I'm losing out on data.
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features = "lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # False on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() finds all the <li> tags with class comment_forbidden first last, and the content of each matching <li>'s <span> child is obtained through its string attribute.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a costlier parser, though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once.
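A minimal sketch of that approach, assuming the same soup and markup as above:
li = soup.find('li', class_='comment_forbidden first last')
if li is not None:
    print(li.span.string)  # only the first match, e.g. ' to post comments1'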
Is there any way to get the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data)
print soup.find('p', class_='class_label').find_next_sibling('p').text
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(text='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text
I want to take a large HTML document and strip away all the inner text between the tags. Everything I seem to find is about extracting text from the HTML. All I want is the raw HTML tags with their attributes intact. How would one go about filtering out the text?
Find all text with soup.find_all(text=True), and call .extract() on each text element to remove it from the document:
for textelement in soup.find_all(text=True):
    textelement.extract()
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><body><p>Hello world!<p>
... <div><ul><li>This is all
... </li><li>Set to go!</li></ul></div>
... </body></html>''')
>>> soup
<html><body><p>Hello world!</p><p>
</p><div><ul><li>This is all
</li><li>Set to go!</li></ul></div>
</body></html>
>>> for textelement in soup.find_all(text=True):
... textelement.extract()
...
u'Hello world!'
u'\n'
u'This is all\n'
u'Set to go!'
u'\n'
>>> print soup.prettify()
<html>
<body>
<p>
</p>
<p>
</p>
<div>
<ul>
<li>
</li>
<li>
</li>
</ul>
</div>
</body>
</html>
I am trying to parse the elements of an HTML list which looks like this:
<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>
I am fine with parsing this for the most part, but the biggest problem for me is getting the DOM text back. Unfortunately, lxml's node.text_content() returns the text of the complete tree under it. Can I obtain the text content of just that element using lxml, or would I need to use string manipulation or regex for that?
For example: the node with d1 returns "d1e1e2", whereas I want it to return just "d1".
Each node has an attribute called text. That's what you are looking for.
e.g.:
for node in root.iter("*"):
    print node.text
    # print node.tail  # e.g.: <div> <span> abc </span> def </div> => abc def
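Put together for a shortened version of the list in the question, a sketch of this might look like the following (the variable names are just for illustration):
import lxml.html

doc = lxml.html.fromstring(
    '<ol><li>r1</li><li>r2<ul><li>n1</li><li>n2</li></ul></li></ol>')
for node in doc.iter('li'):
    # .text is only the text that appears before the node's first child element,
    # so nested items are not repeated (prints r1, r2, n1, n2).
    print(node.text)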