Python ElementTree does not like colon in name of processing instruction - python

The following code:
import xml.etree.ElementTree as ET
xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
<?LazyComment Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''
root = ET.fromstring(xml)
xml2 = xml.replace('LazyComment ', 'LazyComment:')
print(xml2)
try:
root2 = ET.fromstring(xml2)
except ET.ParseError:
print("\nERROR in xml2!!!\n")
xml3 = xml2.replace('testCaseConfig', 'testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/"', 1)
print(xml3)
try:
root3 = ET.fromstring(xml3)
except ET.ParseError:
print("\nERROR in xml3!!!\n")
raise
Gives this output:
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>
ERROR in xml2!!!
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>
ERROR in xml3!!!
Traceback (most recent call last):
File "C:\Users\Paddy3118\Google Drive\Code\elementtree_error.py", line 30, in <module>
root3 = ET.fromstring(xml3)
File "C:\Anaconda3\envs\Py3.5\lib\xml\etree\ElementTree.py", line 1333, in XML
parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 17
I searched and found this Q that pointed to other resources that I read.
It seems that the '?' makes it a processing instruction whose tag name can include colons. Without the '?' then a colon in a name indicates namespace and one of the answers stated that defining the namespace should make things work.
Combining '?' and ':' though causes issues with ElementTree.
I am given xml files of this type that are used by other tools that do parse it OK and want to process the files myself using Python. Any ideas?
Thanks.

According to the W3C Extensible Markup Language 1.0 Specifications under Common Syntactic Constructs:
The Namespaces in XML Recommendation [XML Names] assigns a meaning to
names containing colon characters. Therefore, authors should not use
the colon in XML names except for namespace purposes, but XML
processors must accept the colon as a name character.
And further in the W3C XPath 1.0 note on Processing Instruction nodes:
A processing instruction has an expanded-name: the local part is the
processing instruction's target; the namespace URI is null.
Altogether, <?LazyComment:Blah de blah/?> is an invalid processing instruction as colons is used to reference namespace URIs and for processing instructions that part is null or empty. Therefore, Python's XML processor complains that using such an instruction does not render a well-formed XML.
Also, reconsider such tools that are generating such invalid processing instructions as they are not handling valid XML documents. Possibly, such tools are treating XML files as text documents (similar to the way you were able to replace the string representation of XML but would not have been able to append an instruction using etree).

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
Is invalid XML. You can't have attributes in the closing tag. The last line should be just </testCaseConfig>
Also comments are written like this
<!-- this is a comment -->

Related

how to read a xml file python? [duplicate]

I have a big XML file with several article nodes. I have included only one with the problem. I try to parse it in Python to filter some data and I get the error
File "<string>", line unknown
ParseError: undefined entity Ö: line 90, column 17
Sample of the XML file
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer Özsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
From my search in Google, I found that this kind of error appears if you have issues in the node names. However, the line with the error is the second author, in the text.
This is my Python code
with open('xaa.xml', 'r') as xml_file:
xml_tree = etree.parse(xml_file)
The declaration of the Ouml entity is presumably in the DTD (dblp.dtd), but ElementTree does not support external DTDs. ElementTree only recognizes entities declared directly in the XML file (in the "internal subset"). This is a working example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml 'Ö'>
]>
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer Özsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
To parse the XML file in the question without errors, you need a more powerful XML library that supports external DTDs. lxml is a good choice for that.

Why does XML only validate against an XSD when lxml etree.ElementTree is a string?

I have an XSD:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns="http://tempuri.org/me"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://tempuri.org/me">
<xs:element name="B"></xs:element>
</xs:schema>
Using lxml I can create an XML document and validate it against the XSD:
path = os.path.join(os.path.dirname(__file__), 'sample.xsd')
schema = etree.XMLSchema(file=path)
el = etree.Element('B', nsmap={None: 'http://tempuri.org/me'})
doc = etree.ElementTree(el)
schema.assertValid(doc)
However it produces the following error:
lxml.etree.DocumentInvalid: Element 'B': No matching global declaration available for the validation root.
That error doesn't occur if I convert doc to a string and back again it validates:
st = etree.tostring(doc)
schema.assertValid(etree.XML(st)) # This validates.
What is going on here? Why do I need to convert my etree document to a string and back again to make it validate? How can I prevent that wasteful step?
I'm using Python 3.8 and lxml 4.4.1.

How to replace xml lines using 'if statements' in python?

Hi I'm new to xml files in general, but I am trying to replace specific lines in a xml file using 'if statements' in python 3.6. I've been looking at suggestions to use ElementTree, but none of the posts online quite fit the problem I have, so here I am.
My file is as followed:
<?xml version="1.0" encoding="UTF-8"?>
-<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
I want to replace
url value="http://example.org/fhir/StructureDefinition/MyObservation"/
to something like
url value="http://example.org/fhir/StructureDefinition/NewObservation"/
by using conditional statements - because these are repeated multiple times in other files.
I have tried for-looping through the xml find to find the exact string match (which I've succeeded), but I wasn't able to delete, or replace the line (probably having to do with the fact that this isn't a .txt file).
Any help is greatly appreciated!
Your sample file contains a "-"-token in ln 3 that may be overlooked when copy/pasting in order to find a solution.
Input File
<?xml version="1.0" encoding="UTF-8"?>
<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
Script
from xml.dom.minidom import parse # use minidom for this task
dom = parse('june.xml') #read in your file
search = "http://example.org/fhir/StructureDefinition/MyObservation" #set search value
replace = "http://example.org/fhir/StructureDefinition/NewObservation" #set replace value
res = dom.getElementsByTagName('url') #iterate over url tags
for element in res:
if element.getAttribute('value') == search: #in case of match
element.setAttribute('value', replace) #replace
with open('june_updated.xml', 'w') as f:
f.write(dom.toxml()) #update the dom, save as new xml file
Output file
<?xml version="1.0" ?><StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/NewObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>

keep original xml declaration when editing with python

my original xml file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<foo/>
and I want to change it to
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>confusing dev</bar>
</foo>
I am using xml.etree.ElementTree as suggested by this tutorial
with open('file.xml','r+b') as f:
tree = etree.parse(f)
f.seek(0,0)
tree.write(f,xml_declaration=True)# default argument: encoding="us-ascii"
this outputs
<?xml version='1.0' encoding='us-ascii'?>
<foo/>
But how do I get the encoding of file.xml at runtime and pass it as an argument to tree.write or is there a better way to edit xml in python? I just want to change some Element.text but keep the declaration and namespace unchanged.

Error in escaping XML for a KML file

Some time ago I asked a question trying to figure out why modifying a KML file increased the file size.
After poking around, I've found that the issue had to do with escaping XML.
Essentially, the "<", ">", and "&" characters were being replaced with:
"<", ">", and "&"
It's not a big deal for smaller files, but the extra characters make a big difference in larger files.
I copied some code from this site to help solve the problem:
import lxml
from lxml import etree
import pykml
from pykml.factory import KML_ElementMaker as KML
from pykml import parser
def unescape(s):
s = s.replace("<", "<")
s = s.replace(">", ">")
## Ampersands must be last to avoid errors in text replacement
s = s.replace("&", "&")
return s
with open("myplaces.kml", "rb") as f:
doc = parser.parse(f).getroot()
a = doc.Document.Folder[0].Folder[1]
for q in GEList:
x = KML.Folder(KML.name(q))
a.append(x)
finished = (etree.tostring(doc, pretty_print = True))
finished = unescape(finished)
with open("myplaces.kml", "wb") as f:
f.write(finished)
Now however, I'm running into another error. I compared the file before and after I replaced the <, >, and & characters.
Before: <description><![CDATA[<img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
After: <description><img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
Now it seems to be throwing out "< ![CDATA[", & I can't figure out why.
I had the same issue but then I found this (https://developers.google.com/kml/documentation/kml_tut#descriptive_html):
Using the CDATA Element
If you want to write standard HTML inside a tag, you can put it inside a CDATA tag. If you don't, the angle brackets need to be written as entity references to prevent Google Earth from parsing the HTML incorrectly (for example, the symbol > is written as > and the symbol < is written as <). This is a standard feature of XML and is not unique to Google Earth.
Consider the difference between HTML markup with CDATA tags and without CDATA. First, here's the with CDATA tags:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>CDATA example</name>
<description>
<![CDATA[
<h1>CDATA Tags are useful!</h1>
<p><font color="red">Text is <i>more readable</i> and
<b>easier to write</b> when you can avoid using entity
references.</font></p>
]]>
</description>
<Point>
<coordinates>102.595626,14.996729</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And here's the without CDATA tags, so that special characters must use entity references:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>Entity references example</name>
<description>
<h1>Entity references are hard to type!</h1>
<p><font color="green">Text is
<i>more readable</i>
and <b>easier to write</b>
when you can avoid using entity references.</font></p>
</description>
<Point>
<coordinates>102.594411,14.998518</coordinates>
</Point>
</Placemark>
</Document>
</kml>

Categories