Parsing HTML tag with ":" with lxml - python

I am new to Python and I'm trying to parse an HTML page with lxml. I want to get the text from <p> tags, but inside one of them I have a strange tag like this:
<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>
How can I ignore this tag inside <p>? I want to strip all tags containing ":" from any HTML page, because other lxml functions don't work properly with tags like this.
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('problemtags.html', parser)
root = tree.getroot()
text = [b.text for b in root.iterfind(".//p")]
I expect to get the text inside the <p> tags, but when I run this, it fails on fragments like the one above and reports: "b'Tag g:plusone invalid'". All I need is to ignore all incorrect tags like this. I don't know exactly how many such tags I will encounter in the future, but I think the problem really is the ":", because when I use ".tag" to get the name, it is just "plusone", not "g:plusone".

Here is a way I found to clean up the html:
from lxml import etree
from StringIO import StringIO
s = '''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(s), parser)
result = etree.tostring(tree.getroot(),pretty_print=True,method="html")
print result
This prints
<html><body><p style="margin-left:0px;padding:0 0 0 0;float:left;">
<plusone size="medium">
</plusone>
</p></body></html>
To get the root element (an etree._Element) from an etree._ElementTree, call getroot():
root = tree.getroot()
print type(root) # prints lxml.etree._Element
According to the _Element class documentation, lxml.etree._Element is the class of element instances; in other words, it's what you get back from calling etree.Element, for example:
el = etree.Element("an_etree.Element_reference")
print type(el) # prints lxml.etree._Element

The g: is a namespace prefix. The actual tag name is only plusone. So, lxml is correct in only returning plusone as the tag name. See a summary of namespaces here.
As I understand it, lxml's HTML Parser is not namespace aware. However, the XML Parser is. Presumably, given that this HTML document contains XML, it is most likely actually an XHTML document (if not, then it is probably an invalid HTML document and you cannot expect lxml to parse it correctly). Therefore, you need to run it through the XML Parser rather than HTML Parser. lxml's namespace API is explained in their tutorial.
However, with the fragment you provided the parser returns this:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone size="medium">
... </g:plusone>
... </p>''')
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
File "parser.pxi", line 1674, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101299)
File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:96481)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: Namespace prefix g on plusone is not defined, line 2, column 23
Note that it complains that the "Namespace prefix g on plusone is not defined." Presumably, the namespace prefix is defined elsewhere in your document. As I don't know what that definition is, I'll just make something up and define it on the plusone tag in your fragment:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone xmlns:g="something" size="medium">
... </g:plusone>
... </p>''')
>>> d
<Element p at 0x2563cd8>
>>> d.tag
'p'
>>> d[0]
<Element {something}plusone at 0x2563940>
>>> d[0].tag
'{something}plusone'
Notice that the g: prefix was replaced with the actual namespace ({something} in this case, because I set it like so: xmlns:g="something"). Usually the namespace would actually be a URI, so you may find that your tag looks something like this: {http://where.it/is/from.xml}plusone
Nevertheless, I find working with namespaces rather bothersome when they are not necessary. You may actually find it easier to use the HTML parser, which ignores namespaces. Now that you know that the tag is named plusone, not g:plusone, you may be able to get on with your work using just the HTML parser.
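A minimal sketch of that last suggestion, assuming all you need is the text inside each <p>. The sample markup is made up for illustration. itertext() walks the paragraph's own text plus the text and tail of any child elements, so the result should not depend on what name the parser gives the prefixed tag.

```python
from lxml import etree

# Made-up sample markup containing a prefixed tag, for illustration only
html = '<div><p>Hello <g:plusone size="medium"></g:plusone>world</p></div>'

# The HTML parser recovers from the unknown tag instead of raising
parser = etree.HTMLParser()
root = etree.fromstring(html, parser)

# itertext() yields p.text plus the text and tail of every descendant
texts = ["".join(p.itertext()) for p in root.iterfind(".//p")]
print(texts)
```

Using b.text alone would return only the text before the first child element, which is why the join over itertext() is used here.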

Related

Including XHTML content when creating ReqIf XML document using pyXB

A bit of background: in the scope of a requirements management plugin for Sphinx, I'm looking into ways to export ReqIF XML content. I've found pyreqif, but found that it isn't complete enough to suit our needs at the moment.
I decided to take a look at the Reqif bindings generated by pyXB instead, with the idea that pyXB can do all the grunt work of converting things to and from XML and I just have to worry about adding some convenience functions/classes.
The project can be found here: https://github.com/bavovanachte/reqif_pyxb_tryout
So far it's going great: I've managed to create instances of all the objects and they tie together nicely into an xml document. The only thing I'm having trouble with is the creation of XHTML content. Ideally I'd want to take existing html content and insert that into the tree.
The naive approach to doing that caused the XML-unsafe characters to be escaped, so that didn't work.
These are some of my attempts:
Attempt 1: Passing the xml as a string to the XHTML_CONTENT constructor
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xml_string))
Result: Escaped XML content:
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div></ns2:div>
Attempt 2: Passing the xml as a string to the XHTML_CONTENT constructor, with the "_from_xml" flag set
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(xml_string, _from_xml=True))
Result: pyXB exception:
Traceback (most recent call last):
File "examples/export_test.py", line 105, in <module>
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(xml_string, _from_xml=True))
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2127, in __init__
self.extend(args, _from_xml=from_xml, _location=location)
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in extend
[ self.append(_v, **kw) for _v in value_list ]
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in <listcomp>
[ self.append(_v, **kw) for _v in value_list ]
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2588, in append
raise pyxb.MixedContentError(self, value, location)
pyxb.exceptions_.MixedContentError: Invalid non-element content
Attempt 3: Passing the xml as a string to the xhtml_div_type constructor, with the "_from_xml" flag set, then assigning the resulting object to the div member
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(xml_string, _from_xml=True)))
Result: Escaped XML content:
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div></ns2:div>
Attempt 4: Converting the string to a DOM first and using that in the constructor
xml_string = '''
<div>
XY Block Adapter shall translate the Communication to TMN-Block in a bidirectional manner and support all functionalities of a TMN-Block.<br/>
</div>'''
dom_content = xml.dom.minidom.parseString('<myxml>Some data<empty/> some more data</myxml>')
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(_dom_node=dom_content)))
Result: pyXB exception:
Unable to convert DOM node empty at [UNAVAILABLE] to binding
Traceback (most recent call last):
File "examples/export_test.py", line 130, in <module>
att_value_xhtml = ATTRIBUTE_VALUE_XHTML(definition=text_attribute, THE_VALUE=XHTML_CONTENT(div=xhtml_div_type(_dom_node=dom_content)))
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2133, in __init__
self.extend(dom_node.childNodes[:], fallback_namespace)
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in extend
[ self.append(_v, **kw) for _v in value_list ]
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2612, in <listcomp>
[ self.append(_v, **kw) for _v in value_list ]
File "/home/bvn/.pyenv/versions/3.6.10/lib/python3.6/site-packages/pyxb/binding/basis.py", line 2567, in append
raise pyxb.UnrecognizedContentError(self, self.__automatonConfiguration, value, location)
pyxb.exceptions_.UnrecognizedContentError: Invalid content empty (expect {http://www.w3.org/1999/xhtml}h1 or {http://www.w3.org/1999/xhtml}h2 or ...
What would be the correct way of handling the xhtml content?

Error with Python and XML

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like; I'm trying to grab the value in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll run into the namespace issue. The element you're trying to find (opensearch:totalResults) has the prefix opensearch, which is bound to the URI http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
os = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print textelem.text
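As an alternative to building the Clark-notation string by hand, find() also accepts a prefix-to-URI mapping. A minimal sketch, assuming a trimmed-down copy of the feed above as bytes; the XML declaration is added here for illustration, to show that bytes input with a declaration is accepted:

```python
import lxml.etree

# Trimmed-down stand-in for r.content (bytes, including an encoding declaration)
content = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b'<opensearch:totalResults>4</opensearch:totalResults>'
    b'</feed>'
)

root = lxml.etree.fromstring(content)

# Map the prefix used in the path to the namespace URI it is bound to
ns = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}
total = root.find("opensearch:totalResults", namespaces=ns).text
print(total)
```

The prefix in the mapping only has to match the one used in the path; it does not need to match the prefix used in the document.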

Shelve raises out of overflow pages

I have a lot of data that I read from an XML file, but only a part of it will be saved in the database later.
The XML is structured like this:
<element id="1" other_attrib="">
<element id="2" other_attrib="">
...
<other_element>
<elem id="1">
<elem id="100">
</other_element>
<other_element>
<elem id...>
</other_element>
All of the element tags precede the other_element tags (this XML is third-party; I can't restructure it).
I have to read all of the elements first because they contain additional attributes, but only save those that are referenced by other_elements.
So I opened a shelve with writeback=False and use lxml.etree.iterparse() to parse the XML and save elements on the go, but after many elements have been added (I don't know the exact number, but it goes into the hundreds of thousands) I receive the following error:
HASH: Out of overflow pages. Increase page size
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
...
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 133, in __setitem__
self.dict[key] = f.getvalue()
Basically, this is what I do for every element iterparse returns:
if elem.tag == 'element':
    # Store only the attributes needed later, keyed by the element's id
    shlv[elem.attrib['id']] = {'attrib1': elem.attrib['other_attrib'], 'attrib2': elem.attrib['attrib2']}
elif elem.tag == 'other_element':
    # Iterate through this tag's children and look up the referenced entries in shlv
    for ref in elem:
        save_in_database(shlv[ref.attrib['id']])
What can I change so that shelve handles more data? Or should I use something else to store that data?
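One option not mentioned in the question is to replace the shelve with an SQLite table used as a key/value store. This is only a sketch under that assumption: the table name, the put/get helpers, and the JSON serialization are all hypothetical choices made for illustration, and an in-memory database stands in for a real file.

```python
import json
import sqlite3

# Hypothetical stand-in for the shelve object; pass a file path instead of
# ":memory:" to keep the data on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elements (id TEXT PRIMARY KEY, attrs TEXT)")

def put(elem_id, attrs):
    # Store the attribute dict as JSON text, keyed by the element id
    conn.execute(
        "INSERT OR REPLACE INTO elements VALUES (?, ?)",
        (elem_id, json.dumps(attrs)),
    )

def get(elem_id):
    # Return the stored dict, or None if the id was never saved
    row = conn.execute(
        "SELECT attrs FROM elements WHERE id = ?", (elem_id,)
    ).fetchone()
    return json.loads(row[0]) if row is not None else None

put("1", {"attrib1": "a"})
put("100", {"attrib1": "b"})
conn.commit()

looked_up = get("100")
print(looked_up)
```

In the parsing loop, put() would take the place of the shlv[...] assignment and get() the place of the shlv[...] lookup.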

WordPress XML ParseError: unbound prefix?

I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was stupidly obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means that the prefix "atom" wasn't "bound" to a namespace definition?
Anyway, adding said definition has solved the problem. Although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
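The difference can be reproduced with a small sketch. The markup below is a made-up, trimmed-down stand-in for the export file, the example.com URL is hypothetical, and the namespace URI is the standard Atom one, assumed here to be the definition that was missing:

```python
import xml.etree.ElementTree as ET

# The "atom" prefix is used but never declared
unbound = (
    '<rss version="2.0">'
    '<atom:link rel="search" href="http://example.com/osd.xml" />'
    '</rss>'
)

# Same document with the prefix bound to a namespace URI on the root element
bound = (
    '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">'
    '<atom:link rel="search" href="http://example.com/osd.xml" />'
    '</rss>'
)

try:
    ET.fromstring(unbound)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)  # the parser reports the unbound prefix

root = ET.fromstring(bound)
link_tag = root[0].tag  # the prefix is replaced by the namespace URI
print(error_message)
print(link_tag)
```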
If you remove all the namespace prefixes, it works fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>

ExpatError: junk after document element

I really don't know what the problem is. I get the following error:
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: junk after document element: line 5, column 0
I don't see any junk! Any help? I'm going crazy.
text = """<questionaire>
<question>
<questiontext>Question1</questiontext>
<answer>Your Answer: 99</answer>
</question>
<question>
<questiontext>Question2</questiontext>
<answer>Your Answer: 64</answer>
</question>
<question>
<questiontext>Question3</questiontext>
<answer>Your Answer: 46</answer>
</question>
<question>
<questiontext>Bitte geben</questiontext>
<answer>Your Answer: 544</answer>
<answer>Your Answer: 943</answer>
</question>
</questionaire>"""
cleandata = text.split('<questionaire>')
cleandatastring= "".join(cleandata)
stripped = cleandatastring.strip()
planhtml = stripped.split('</questionaire>')[0]
clean= planhtml.strip()
from xml.dom import minidom
doc = minidom.parseString(clean)
for question in doc.getElementsByTagName('question'):
    for answer in question.getElementsByTagName('answer'):
        if answer.childNodes[0].nodeValue.strip() == 'Your Answer: 99':
            question.parentNode.removeChild(question)
print doc.toxml()
Thanx!
Your original text string is well-formed XML. Then you do a bunch of stuff to it that breaks it. Parse your original text, and you will be fine.
XML is required to have exactly one top-level element. By the time you parse it, it has a number of top-level <question> tags. The XML parser is parsing the first one as a root element, and then is surprised to find another top-level element.
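A minimal sketch of both points, using a shortened version of the data from the question. Stripping the root element reproduces the error, while parsing the original text works and lets the matching question be removed:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Shortened version of the data from the question
text = """<questionaire>
<question><questiontext>Question1</questiontext><answer>Your Answer: 99</answer></question>
<question><questiontext>Question2</questiontext><answer>Your Answer: 64</answer></question>
</questionaire>"""

# Removing the root leaves two top-level <question> elements, which is not
# well-formed XML
fragment = text.replace("<questionaire>", "").replace("</questionaire>", "").strip()
try:
    minidom.parseString(fragment)
    error = None
except ExpatError as exc:
    error = str(exc)

# Parsing the original text works
doc = minidom.parseString(text)
for question in list(doc.getElementsByTagName("question")):
    for answer in question.getElementsByTagName("answer"):
        if answer.firstChild.nodeValue.strip() == "Your Answer: 99":
            question.parentNode.removeChild(question)
            break  # this question is gone; move on to the next one

remaining = [
    q.getElementsByTagName("questiontext")[0].firstChild.nodeValue
    for q in doc.getElementsByTagName("question")
]
print(error)
print(remaining)
```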
In my case it was caused by the changes made in libxml2-2.9.11 that made tostring() (lxml) return more content (what follows the element) than it should. E.g.
from lxml import etree
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
</b>
</a>
'''
t = etree.fromstring(xml.encode()).getroottree()
print(etree.tostring(
t.xpath('/a/b')[0],
encoding=t.docinfo.encoding,
).decode())
Expected output:
<b>
</b>
Actual output:
<b>
</b>
</a>
Should you pass the result to xml.dom.minidom.parseString(), it will complain.
More on it here.
To avoid this you either need libxml2 <= 2.9.10, or Alpine Linux >= 3.14.
