ExpatError: junk after document element - python

I really don't know, what the Problem is? I get the following error:
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: junk after document element: line 5, column 0
I DONT SEE NO JUNK! Any help? I'm getting crazy......
text = """<questionaire>
<question>
<questiontext>Question1</questiontext>
<answer>Your Answer: 99</answer>
</question>
<question>
<questiontext>Question2</questiontext>
<answer>Your Answer: 64</answer>
</question>
<question>
<questiontext>Question3</questiontext>
<answer>Your Answer: 46</answer>
</question>
<question>
<questiontext>Bitte geben</questiontext>
<answer>Your Answer: 544</answer>
<answer>Your Answer: 943</answer>
</question>
</questionaire>"""
cleandata = text.split('<questionaire>')
cleandatastring= "".join(cleandata)
stripped = cleandatastring.strip()
planhtml = stripped.split('</questionaire>')[0]
clean= planhtml.strip()
from xml.dom import minidom
doc = minidom.parseString(clean)
for question in doc.getElementsByTagName('question'):
for answer in question.getElementsByTagName('answer'):
if answer.childNodes[0].nodeValue.strip() == 'Your Answer: 99':
question.parentNode.removeChild(question)
print doc.toxml()
Thanx!

Your original text string is well-formed XML. Then you do a bunch of stuff to it that breaks it. Parse your original text, and you will be fine.
XML is required to have exactly one top-level element. By the time you parse it, it has a number of top-level <question> tags. The XML parser is parsing the first one as a root element, and then is surprised to find another top-level element.

In my case it was caused by the changes made in libxml2-2.9.11 that made tostring() (lxml) return more content (what follows the element) than it should. E.g.
from lxml import etree
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
</b>
</a>
'''
t = etree.fromstring(xml.encode()).getroottree()
print(etree.tostring(
t.xpath('/a/b')[0],
encoding=t.docinfo.encoding,
).decode())
Expected output:
<b>
</b>
Actual output:
<b>
</b>
</a>
Should you pass the result to xml.dom.minidom.parseString(), it will complain.
More on it here.
To avoid this you either need libxml2 <= 2.9.10, or Alpine Linux >= 3.14.

Related

How to extract text from XML based on a tag & then put it back (Python)?

I have a messy XML, with some tag structure like -
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> -----some text goes here-----
</p>
</div>
</textTag>
I want to extract -----some text goes here-----, make some changes and put it back into the XML. How should I go about it?
Option 1:
You can use the xml module of python for parsing, updating and saving the xml file. A problem though, would be that the resulting xml file might have order of attributes etc different from the original xml file. So when you make a diff, you might see a lot of differences.
So you might do something like.
from xml.etree import ElementTree as ET
tree = ET.parse('xmlfilename')
root = tree.getroot()
p_nodes = root.findall('.//<p>')
for node in p_nodes:
# process
tree.save()
Option 2:
Use regular expression.
Read the file line by line and look for the pattern you are interested in and make the update and write it back.
The obvious advantage being the diff between the original and modified file will shown only the update you made.
import re
with open(outputfile) as fout:
with open(xmlfile) as f:
data = f.readlines()
pattern = re.compile(r"...") # your pattern
for line in data:
re.sub(line, pattern, update)
fout.write(line)
You could use lxml (which has much better XPath 1.0 support than ElementTree) to find all text() nodes that contain "-----some text goes here-----", modify the text, and then replace the .text (or .tail) of the parent.
Example...
Python 3.x
from lxml import etree
xml = """
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>-----some text goes here-----
</p>
</div>
</textTag>"""
tree = etree.fromstring(xml)
for text in tree.xpath(".//text()[contains(.,'-----some text goes here-----')]"):
parent = text.getparent()
new_text = text.replace("-----some text goes here-----", "---- BAM! ----")
if text.is_text:
parent.text = new_text
elif text.is_tail:
parent.tail = new_text
etree.dump(tree)
Output (dumped to console)
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> ---- BAM! ----
</p>
</div>
</textTag>

How to to produce XML output? [duplicate]

This question already has answers here:
Creating a simple XML file using python
(6 answers)
Closed 5 years ago.
I'm creating an web api and need a good way to very quickly generate some well formatted xml. I cannot find any good way of doing this in python.
Note: Some libraries look promising but either lack documentation or only output to files.
ElementTree is a good module for reading xml and writing too e.g.
from xml.etree.ElementTree import Element, SubElement, tostring
root = Element('root')
child = SubElement(root, "child")
child.text = "I am a child"
print(tostring(root))
Output:
<root><child>I am a child</child></root>
See this tutorial for more details and how to pretty print.
Alternatively if your XML is simple, do not underestimate the power of string formatting :)
xmlTemplate = """<root>
<person>
<name>%(name)s</name>
<address>%(address)s</address>
</person>
</root>"""
data = {'name':'anurag', 'address':'Pune, india'}
print xmlTemplate%data
Output:
<root>
<person>
<name>anurag</name>
<address>Pune, india</address>
</person>
</root>
You can use string.Template or some template engine too, for complex formatting.
Using lxml:
from lxml import etree
# create XML
root = etree.Element('root')
root.append(etree.Element('child'))
# another child with text
child = etree.Element('child')
child.text = 'some text'
root.append(child)
# pretty string
s = etree.tostring(root, pretty_print=True)
print s
Output:
<root>
<child/>
<child>some text</child>
</root>
See the tutorial for more information.
I would use the yattag library.
from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('food'):
with tag('name'):
text('French Breakfast')
with tag('price', currency='USD'):
text('6.95')
with tag('ingredients'):
for ingredient in ('baguettes', 'jam', 'butter', 'croissants'):
with tag('ingredient'):
text(ingredient)
print(doc.getvalue())
FYI I'm the author of the library.
Use lxml.builder class, from: http://lxml.de/tutorial.html#the-e-factory
import lxml.builder as lb
from lxml import etree
nstext = "new story"
story = lb.E.Asset(
lb.E.Attribute(nstext, name="Name", act="set"),
lb.E.Relation(lb.E.Asset(idref="Scope:767"),
name="Scope", act="set")
)
print 'story:\n', etree.tostring(story, pretty_print=True)
Output:
story:
<Asset>
<Attribute name="Name" act="set">new story</Attribute>
<Relation name="Scope" act="set">
<Asset idref="Scope:767"/>
</Relation>
</Asset>
An optional way if you want to use pure Python:
ElementTree is good for most cases, but it can't CData and pretty print.
So, if you need CData and pretty print you should use minidom:
minidom_example.py:
from xml.dom import minidom
doc = minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)
leaf = doc.createElement('leaf')
text = doc.createTextNode('Text element with attributes')
leaf.appendChild(text)
leaf.setAttribute('color', 'white')
root.appendChild(leaf)
leaf_cdata = doc.createElement('leaf_cdata')
cdata = doc.createCDATASection('<em>CData</em> can contain <strong>HTML tags</strong> without encoding')
leaf_cdata.appendChild(cdata)
root.appendChild(leaf_cdata)
branch = doc.createElement('branch')
branch.appendChild(leaf.cloneNode(True))
root.appendChild(branch)
mixed = doc.createElement('mixed')
mixed_leaf = leaf.cloneNode(True)
mixed_leaf.setAttribute('color', 'black')
mixed_leaf.setAttribute('state', 'modified')
mixed.appendChild(mixed_leaf)
mixed_text = doc.createTextNode('Do not use mixed elements if it possible.')
mixed.appendChild(mixed_text)
root.appendChild(mixed)
xml_str = doc.toprettyxml(indent=" ")
with open("minidom_example.xml", "w") as f:
f.write(xml_str)
minidom_example.xml:
<?xml version="1.0" ?>
<root>
<leaf color="white">Text element with attributes</leaf>
<leaf_cdata>
<![CDATA[<em>CData</em> can contain <strong>HTML tags</strong> without encoding]]> </leaf_cdata>
<branch>
<leaf color="white">Text element with attributes</leaf>
</branch>
<mixed>
<leaf color="black" state="modified">Text element with attributes</leaf>
Do not use mixed elements if it possible.
</mixed>
</root>
I've tried a some of the solutions in this thread, and unfortunately, I found some of them to be cumbersome (i.e. requiring excessive effort when doing something non-trivial) and inelegant. Consequently, I thought I'd throw my preferred solution, web2py HTML helper objects, into the mix.
First, install the the standalone web2py module:
pip install web2py
Unfortunately, the above installs an extremely antiquated version of web2py, but it'll be good enough for this example. The updated source is here.
Import web2py HTML helper objects documented here.
from gluon.html import *
Now, you can use web2py helpers to generate XML/HTML.
words = ['this', 'is', 'my', 'item', 'list']
# helper function
create_item = lambda idx, word: LI(word, _id = 'item_%s' % idx, _class = 'item')
# create the HTML
items = [create_item(idx, word) for idx,word in enumerate(words)]
ul = UL(items, _id = 'my_item_list', _class = 'item_list')
my_div = DIV(ul, _class = 'container')
>>> my_div
<gluon.html.DIV object at 0x00000000039DEAC8>
>>> my_div.xml()
# I added the line breaks for clarity
<div class="container">
<ul class="item_list" id="my_item_list">
<li class="item" id="item_0">this</li>
<li class="item" id="item_1">is</li>
<li class="item" id="item_2">my</li>
<li class="item" id="item_3">item</li>
<li class="item" id="item_4">list</li>
</ul>
</div>

parse xml with lxml including namespace

I need to get some info after a specific tag in lxml.
the xml doc looks like this
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/
ns/j2ee/web-app_2_4.xsd"
version="2.4">
<display-name>Community Bank</display-name>
<description>WebGoat for Cigital</description>
<context-param>
<param-name>PropertiesPath</param-name>
<param-value>/WEB-INF/properties.txt</param-value>
<description>This is the path to the properties file from the servlet root</description>
</context-param>
<servlet>
<servlet-name>Index</servlet-name>
<servlet-class>com.cigital.boi.servlet.index</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index.html</url-pattern>
</servlet-mapping>
I want to read com.cigital.boi.servlet.index .
I have used this code to read everything under servlets
context = etree.parse(handle)
list = parser.xpath('//servlet')
print list
list contains nothing
more info : iterating over the context field i found these lines.
<Element {http://java.sun.com/xml/ns/j2ee}servlet-name at 2ad19e6eca48>
<Element {http://java.sun.com/xml/ns/j2ee}servlet-class at 2ad19e6ecaf8>
I am thinking as I have not included name space while searching , output is empty list.
please suggest hoe to read "com.cigital.boi.servlet.index" in the servlet-class tag
Try following:
from lxml import etree
context = etree.parse(handle)
print next(x.text for x in context.xpath('.//*[local-name()="servlet-class"]'))
Alternative:
from lxml import etree
context = etree.parse(handle)
nsmap = context.getroot().nsmap.copy()
nsmap['xmlns'] = nsmap.pop(None)
print next(x.text for x in context.xpath('.//xmlns:servlet-class', namespaces=nsmap))

Best way to generate xml? [duplicate]

This question already has answers here:
Creating a simple XML file using python
(6 answers)
Closed 5 years ago.
I'm creating an web api and need a good way to very quickly generate some well formatted xml. I cannot find any good way of doing this in python.
Note: Some libraries look promising but either lack documentation or only output to files.
ElementTree is a good module for reading xml and writing too e.g.
from xml.etree.ElementTree import Element, SubElement, tostring
root = Element('root')
child = SubElement(root, "child")
child.text = "I am a child"
print(tostring(root))
Output:
<root><child>I am a child</child></root>
See this tutorial for more details and how to pretty print.
Alternatively if your XML is simple, do not underestimate the power of string formatting :)
xmlTemplate = """<root>
<person>
<name>%(name)s</name>
<address>%(address)s</address>
</person>
</root>"""
data = {'name':'anurag', 'address':'Pune, india'}
print xmlTemplate%data
Output:
<root>
<person>
<name>anurag</name>
<address>Pune, india</address>
</person>
</root>
You can use string.Template or some template engine too, for complex formatting.
Using lxml:
from lxml import etree
# create XML
root = etree.Element('root')
root.append(etree.Element('child'))
# another child with text
child = etree.Element('child')
child.text = 'some text'
root.append(child)
# pretty string
s = etree.tostring(root, pretty_print=True)
print s
Output:
<root>
<child/>
<child>some text</child>
</root>
See the tutorial for more information.
I would use the yattag library.
from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('food'):
with tag('name'):
text('French Breakfast')
with tag('price', currency='USD'):
text('6.95')
with tag('ingredients'):
for ingredient in ('baguettes', 'jam', 'butter', 'croissants'):
with tag('ingredient'):
text(ingredient)
print(doc.getvalue())
FYI I'm the author of the library.
Use lxml.builder class, from: http://lxml.de/tutorial.html#the-e-factory
import lxml.builder as lb
from lxml import etree
nstext = "new story"
story = lb.E.Asset(
lb.E.Attribute(nstext, name="Name", act="set"),
lb.E.Relation(lb.E.Asset(idref="Scope:767"),
name="Scope", act="set")
)
print 'story:\n', etree.tostring(story, pretty_print=True)
Output:
story:
<Asset>
<Attribute name="Name" act="set">new story</Attribute>
<Relation name="Scope" act="set">
<Asset idref="Scope:767"/>
</Relation>
</Asset>
An optional way if you want to use pure Python:
ElementTree is good for most cases, but it can't CData and pretty print.
So, if you need CData and pretty print you should use minidom:
minidom_example.py:
from xml.dom import minidom
doc = minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)
leaf = doc.createElement('leaf')
text = doc.createTextNode('Text element with attributes')
leaf.appendChild(text)
leaf.setAttribute('color', 'white')
root.appendChild(leaf)
leaf_cdata = doc.createElement('leaf_cdata')
cdata = doc.createCDATASection('<em>CData</em> can contain <strong>HTML tags</strong> without encoding')
leaf_cdata.appendChild(cdata)
root.appendChild(leaf_cdata)
branch = doc.createElement('branch')
branch.appendChild(leaf.cloneNode(True))
root.appendChild(branch)
mixed = doc.createElement('mixed')
mixed_leaf = leaf.cloneNode(True)
mixed_leaf.setAttribute('color', 'black')
mixed_leaf.setAttribute('state', 'modified')
mixed.appendChild(mixed_leaf)
mixed_text = doc.createTextNode('Do not use mixed elements if it possible.')
mixed.appendChild(mixed_text)
root.appendChild(mixed)
xml_str = doc.toprettyxml(indent=" ")
with open("minidom_example.xml", "w") as f:
f.write(xml_str)
minidom_example.xml:
<?xml version="1.0" ?>
<root>
<leaf color="white">Text element with attributes</leaf>
<leaf_cdata>
<![CDATA[<em>CData</em> can contain <strong>HTML tags</strong> without encoding]]> </leaf_cdata>
<branch>
<leaf color="white">Text element with attributes</leaf>
</branch>
<mixed>
<leaf color="black" state="modified">Text element with attributes</leaf>
Do not use mixed elements if it possible.
</mixed>
</root>
I've tried a some of the solutions in this thread, and unfortunately, I found some of them to be cumbersome (i.e. requiring excessive effort when doing something non-trivial) and inelegant. Consequently, I thought I'd throw my preferred solution, web2py HTML helper objects, into the mix.
First, install the the standalone web2py module:
pip install web2py
Unfortunately, the above installs an extremely antiquated version of web2py, but it'll be good enough for this example. The updated source is here.
Import web2py HTML helper objects documented here.
from gluon.html import *
Now, you can use web2py helpers to generate XML/HTML.
words = ['this', 'is', 'my', 'item', 'list']
# helper function
create_item = lambda idx, word: LI(word, _id = 'item_%s' % idx, _class = 'item')
# create the HTML
items = [create_item(idx, word) for idx,word in enumerate(words)]
ul = UL(items, _id = 'my_item_list', _class = 'item_list')
my_div = DIV(ul, _class = 'container')
>>> my_div
<gluon.html.DIV object at 0x00000000039DEAC8>
>>> my_div.xml()
# I added the line breaks for clarity
<div class="container">
<ul class="item_list" id="my_item_list">
<li class="item" id="item_0">this</li>
<li class="item" id="item_1">is</li>
<li class="item" id="item_2">my</li>
<li class="item" id="item_3">item</li>
<li class="item" id="item_4">list</li>
</ul>
</div>

Python xml minidom. generate <text>Some text</text> element

I have the following code.
from xml.dom.minidom import Document
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
main = doc.createElement('Text')
root.appendChild(main)
text = doc.createTextNode('Some text here')
main.appendChild(text)
print doc.toprettyxml(indent='\t')
The result is:
<?xml version="1.0" ?>
<root>
<Text>
Some text here
</Text>
</root>
This is all fine and dandy, but what if I want the output to look like this?
<?xml version="1.0" ?>
<root>
<Text>Some text here</Text>
</root>
Can this easily be done?
Orjanp...
Can this easily be done?
It depends what exact rule you want, but generally no, you get little control over pretty-printing. If you want a specific format you'll usually have to write your own walker.
The DOM Level 3 LS parameter format-pretty-print in pxdom comes pretty close to your example. Its rule is that if an element contains only a single TextNode, no extra whitespace will be added. However it (currently) uses two spaces for an indent rather than four.
>>> doc= pxdom.parseString('<a><b>c</b></a>')
>>> doc.domConfig.setParameter('format-pretty-print', True)
>>> print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
<b>c</b>
</a>
(Adjust encoding and output format for whatever type of serialisation you're doing.)
If that's the rule you want, and you can get away with it, you might also be able to monkey-patch minidom's Element.writexml, eg.:
>>> from xml.dom import minidom
>>> def newwritexml(self, writer, indent= '', addindent= '', newl= ''):
... if len(self.childNodes)==1 and self.firstChild.nodeType==3:
... writer.write(indent)
... self.oldwritexml(writer) # cancel extra whitespace
... writer.write(newl)
... else:
... self.oldwritexml(writer, indent, addindent, newl)
...
>>> minidom.Element.oldwritexml= minidom.Element.writexml
>>> minidom.Element.writexml= newwritexml
All the usual caveats about the badness of monkey-patching apply.
I was looking for exactly the same thing, and I came across this post. (the indenting provided in xml.dom.minidom broke another tool that I was using to manipulate the XML, and I needed it to be indented) I tried the accepted solution with a slightly more complex example and this was the result:
In [1]: import pxdom
In [2]: xml = '<a><b>fda</b><c><b>fdsa</b></c></a>'
In [3]: doc = pxdom.parseString(xml)
In [4]: doc.domConfig.setParameter('format-pretty-print', True)
In [5]: print doc.pxdomContent
<?xml version="1.0" encoding="utf-16"?>
<a>
<b>fda</b><c>
<b>fdsa</b>
</c>
</a>
The pretty printed XML isn't formatted correctly, and I'm not too happy about monkey patching (i.e. I barely know what it means, and understand it's bad), so I looked for another solution.
I'm writing the output to file, so I can use the xmlindent program for Ubuntu ($sudo aptitude install xmlindent). So I just write the unformatted to the file, then call the xmlindent from within the python program:
from subprocess import Popen, PIPE
Popen(["xmlindent", "-i", "2", "-w", "-f", "-nbe", file_name],
stderr=PIPE,
stdout=PIPE).communicate()
The -w switch causes the file to be overwritten, but annoyingly leaves a named e.g. "myfile.xml~" which you'll probably want to remove. The other switches are there to get the formatting right (for me).
P.S. xmlindent is a stream formatter, i.e. you can use it as follows:
cat myfile.xml | xmlindent > myfile_indented.xml
So you might be able to use it in a python script without writing to a file if you needed to.
This could be done with toxml(), using regular expressions to tidy things up.
>>> from xml.dom.minidom import Document
>>> import re
>>> doc = Document()
>>> root = doc.createElement('root')
>>> _ = doc.appendChild(root)
>>> main = doc.createElement('Text')
>>> _ = root.appendChild(main)
>>> text = doc.createTextNode('Some text here')
>>> _ = main.appendChild(text)
>>> out = doc.toxml()
>>> niceOut = re.sub(r'><', r'>\n<', re.sub(r'(<\/.*?>)', r'\1\n', out))
>>> print niceOut
<?xml version="1.0" ?>
<root>
<Text>Some text here</Text>
</root>
The pyxml package offers a simple solution to this by using the xml.dom.ext.PrettyPrint() function. It can also print to a file descriptor.
But the pyxml package is no longer maintained.
Oerjan Pettersen
This solution worked for me without monkey patching or ceasing to use minidom:
from xml.dom.ext import PrettyPrint
from StringIO import StringIO
def toprettyxml_fixed (node, encoding='utf-8'):
tmpStream = StringIO()
PrettyPrint(node, stream=tmpStream, encoding=encoding)
return tmpStream.getvalue()
http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace/#best-solution
Easiest way to do this is to use prettyxml, and remove the \n and \t inside the tags. That way you keep your indent as you required in your example.
xml_output = doc.toprettyxml()
nojunkintags = re.sub('>(\n|\t)</', '', xml_output)
print nojunkintags

Categories