Python ElementTree XML parsing

I am using Python's ElementTree to parse an XML file.
Let's say I have an XML file like this:
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>
Is there any way I can extract the body with the p tags intact, like this?
desired= "<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>"

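The following code does the trick (Python 3):
import xml.etree.ElementTree as ET
root = ET.fromstring(doc)  # doc is a string containing the example file
body = root.find('body')
# Iterate over body directly; getchildren() was removed in Python 3.9.
# encoding='unicode' makes tostring() return str instead of bytes.
desired = ' '.join(ET.tostring(c, encoding='unicode').strip() for c in body)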
Now:
>>> desired
'<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>'
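If the body can also contain loose text between the child elements, the same idea can be wrapped in a small helper. This is only a sketch: `inner_xml` is a hypothetical name, not part of ElementTree, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Serialize the content of `element`, keeping child tags intact."""
    parts = []
    # Text that sits directly inside `element`, before its first child.
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        # encoding="unicode" makes tostring() return str rather than bytes.
        # tostring() also emits the child's tail text, so only the outer
        # whitespace is stripped here.
        parts.append(ET.tostring(child, encoding="unicode").strip())
    return " ".join(parts)


# Hypothetical sample with text before and between the <p> elements.
body = ET.fromstring("<body>intro <p>a</p> between <p>b</p></body>")
print(inner_xml(body))  # intro <p>a</p> between <p>b</p>
```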

You can use the lxml library.
This code will help you:
import lxml.html
# parse() expects a file name or file object; use fromstring() for a string.
htmltree = lxml.html.fromstring('''
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>''')
p_tags = htmltree.xpath('//p')
p_content = [p.text_content() for p in p_tags]
print(p_content)

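A slightly different way from @DavidAlber's, where the children can easily be selected:
from xml.etree import ElementTree
tree = ElementTree.parse("example.xml")
# A relative path avoids the FutureWarning that a leading "/" triggers.
paragraphs = tree.findall("./body/p")
result = []
for elem in paragraphs:
    # encoding="unicode" makes tostring() return str instead of bytes
    result.append(ElementTree.tostring(elem, encoding="unicode").strip())
print(" ".join(result))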

Related

How to extract text and the xpath to that element of the HTML page in Python

I am working on a Django project where I need to extract all the text-containing elements and the XPath to each of those elements.
For example:
<html>
<head>
<title>
The Demo page
</title>
</head>
<body>
<div>
<section>
<h1> Hello world
</h1>
</section>
<div>
<p>
Hope you all are doing well,
</p>
</div>
<div>
<p>
This is the example HTML
</p>
</div>
</div>
</body>
</html>
The output should be something like:
/head/title: The Demo Page
/body/div/section/h1: Hello world!
/body/div/div[1]/p: Hope you all are doing well,
/body/div/div[2]/p: This is the example HTML
Something like this should work:
from lxml import etree
html = """[your html above]"""
root = etree.fromstring(html)
targets = root.xpath('//text()[normalize-space()]/..')
tree = etree.ElementTree(root)
for target in targets:
    print(tree.getpath(target), target.text.strip())
Output:
/html/head/title The Demo page
/html/body/div/section/h1 Hello world
/html/body/div/div[1]/p Hope you all are doing well,
/html/body/div/div[2]/p This is the example HTML
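If lxml is not available, similar paths can be built with the standard library alone. The following is a rough sketch, assuming the input is well-formed XML; `text_paths` is a hypothetical helper, and unlike the XPath query above it ignores tail text (text that follows a child element).

```python
import xml.etree.ElementTree as ET


def text_paths(root):
    """Yield (path, text) for every element whose own text is non-blank."""
    def walk(elem, path):
        if elem.text and elem.text.strip():
            yield path, elem.text.strip()
        # Count each tag among the children so that a positional index
        # is added only where siblings share a tag name.
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            yield from walk(child, path + "/" + step)

    yield from walk(root, "/" + root.tag)


# Hypothetical, shortened sample document.
sample = "<html><body><div><p> one </p></div><div><p>two</p></div></body></html>"
for path, text in text_paths(ET.fromstring(sample)):
    print(path, text)
```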

Add missing paragraph tags to HTML

I'm processing some medium fancy HTML pages to convert to simpler XHTML ones. The source pages have several divs (that I'm removing), that contain text not inside <p> tags. I need to add these <p> tags.
Here is a minimal example of the source page
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>
I want to convert it to
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<p>This is a sample page</p>
<p>Lots of things to learn!</p>
<p>And lots to test</p>
<p>Enough with the sample code</p>
</body>
</html>
I'm developing a Python script that uses BeautifulSoup4 to do all of this, and I'm stuck at this step. It looks like a regex job to locate the text that needs <p> tags and pass it to BeautifulSoup4. What do you think is the best approach to the problem?
I've scanned several pages and have seen this loose text at the start of divs, but I can't rule out that more of it appears in random places elsewhere in the pages (i.e. a script that only checks the start of divs probably won't be reliable).
Note that the <br/> tags have to be used to split the text into separate <p> paragraphs.
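This script removes every tag inside <body> except <p>, then wraps each remaining loose text node in a new <p>. Because the <br/> tags are removed as well, the text on either side of them ends up in separate paragraphs: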
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>'''
soup = BeautifulSoup(txt, 'html.parser')
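# Remove every tag that is not <p>, keeping its contents in place.
for tag in soup.body.find_all(lambda tag: tag.name != 'p'):
    tag.unwrap()
# string=True (bs4 4.4+) selects text nodes; older versions spell it text=True.
for text in soup.body.find_all(string=True):
    if text.find_parent('p') or text.strip() == '':
        continue
    text.wrap(soup.new_tag("p"))
print(soup.prettify())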
Prints:
<!DOCTYPE html>
<html>
<body>
<p>
Hello world!
</p>
<p>
This is a sample page
</p>
<p>
Lots of things to learn!
</p>
<p>
And lots to test
</p>
<p>
Enough with the sample code
</p>
</body>
</html>

Python BeautifulSoup: Insert attribute to tags

I'm trying to add a new attribute to all the nested tables in an HTML document. I'm using the code below, but it doesn't seem to add the attribute to all the table tags. I would really appreciate any help.
Input html code:
<html>
<head>
<title>Test</title>
</head>
<body>
<div>
<table>
<tr>t<td><table></table></td></tr>
<tr>t<td><table></table></td></tr>
<tr>t<td><table></table></td></tr>
</table>
</div>
</body>
</html>
Code:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("file://xxxxx.html").read()
soup = BeautifulSoup(html)
for tag in soup.find_all(True):
    if (tag.name == "table"):
        tag['attr'] = 'new'
        print(tag)
    else:
        print(tag.contents)
Output html code:
<html>
<head>
<title>Test</title>
</head>
<body>
<div>
<table attr="new">
<tr>t<td><table attr="new"></table></td></tr>
<tr>t<td><table attr="new"></table></td></tr>
<tr>t<td><table attr="new"></table></td></tr>
</table>
</div>
</body>
</html>
Your tag['attr'] = 'new' seems to work correctly. The problem is that print(tag.contents) will print parts of the document recursively before the descendants have been modified.
The simple fix is to make one pass to modify the document first, then make just one print(soup) call at the end.
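A minimal sketch of that two-step approach, assuming a shortened version of the input and the built-in `html.parser` backend (the original reads the document from a file instead):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the input document.
html = """<div>
<table>
<tr><td><table></table></td></tr>
<tr><td><table></table></td></tr>
</table>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# First pass: modify every table, nested or not, without printing anything.
for table in soup.find_all("table"):
    table["attr"] = "new"

# Only now serialize the whole document, once.
print(soup)
```

Printing once at the end means every table has already been modified by the time any part of the tree is serialized.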

Get a structure of HTML code

I'm using BeautifulSoup4 and I'm curious whether there is a function that returns the structure (the ordered tags) of the HTML code.
Here is an example:
<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>
print page.structure():
>>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>
I tried to find a solution but had no success.
Thanks
There is not, to my knowledge, but a little recursion should work:
import bs4
def taggify(soup):
    # Recurse into Tag nodes only; text nodes are skipped.
    for tag in soup:
        if isinstance(tag, bs4.Tag):
            yield '<{}>{}</{}>'.format(tag.name, ''.join(taggify(tag)), tag.name)
demo:
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
''.join(taggify(soup))
Out[34]: '<html><body><h1></h1><p></p></body></html>'
A simple Python regular expression can also do what you want:
import re
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
structure = ''.join(re.findall(r'(</?.+?>|\n+?)', html))
This method preserves newline characters.
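Note that the regular expression keeps whatever attributes appear inside each tag. If bare tags are wanted and BeautifulSoup is not at hand, the standard library's html.parser can produce the same skeleton. This is a sketch of a different technique from the two answers above; `Skeleton` and `structure` are hypothetical names.

```python
from html.parser import HTMLParser


class Skeleton(HTMLParser):
    """Collect bare start and end tags, dropping text and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)


def structure(html):
    parser = Skeleton()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = '<html><body><h1 class="x">Simple example</h1><p>text</p></body></html>'
print(structure(page))  # <html><body><h1></h1><p></p></body></html>
```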

Finding the parent tag of a text string with ElementTree/lxml

I'm trying to take a string of text, and "extract" the rest of the text in the paragraph/document from the html.
My current approach is to find the "parent tag" of the string in the HTML, which has been parsed with lxml. (If you know of a better way to tackle this problem, I'm all ears!)
For example, search the tree for "TEXT STRING HERE" and return the "p" tag. (note that I won't know the exact layout of the html beforehand)
<html>
<head>
...
</head>
<body>
....
<div>
...
<p>TEXT STRING HERE ......</p>
...
</html>
Thanks for your help!
This is a simple way to do it with ElementTree. It does require that your HTML input is valid XML (so I have added the appropriate end tags to your HTML):
import xml.etree.ElementTree as ET
html = """<html>
<head>
</head>
<body>
<div>
<p>TEXT STRING HERE ......</p>
</div>
</body>
</html>"""
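# iter() replaces getiterator(), which was removed in Python 3.9.
for e in ET.fromstring(html).iter():
    # e.text is None for elements without text, so check it first
    if e.text and 'TEXT STRING HERE' in e.text:
        print("Found string %r, element = %r" % (e.text, e))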
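The standard library's ElementTree elements carry no pointer to their parent (lxml elements have getparent() for that). If the enclosing element is needed as well, one possible sketch is to build a child-to-parent map first; `parent_of` and `find_by_text` are hypothetical names.

```python
import xml.etree.ElementTree as ET

html = """<html><head></head><body><div>
<p>TEXT STRING HERE ......</p>
</div></body></html>"""

root = ET.fromstring(html)

# Map every element to its parent; the root has no entry.
parent_of = {child: parent for parent in root.iter() for child in parent}


def find_by_text(root, needle):
    """Return the first element whose own text contains `needle`."""
    for elem in root.iter():
        if elem.text and needle in elem.text:
            return elem
    return None


match = find_by_text(root, "TEXT STRING HERE")
print(match.tag)             # p
print(parent_of[match].tag)  # div
```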
