Python xml etree DTD from a StringIO source? - python

I'm adapting the following code (created via advice in this question), which took an XML file and its DTD and converted them to a different format. For this problem only the loading section is important:
xmldoc = open(filename)
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(xmldoc, parser)
This worked fine, whilst using the file system, but I'm converting it to run via a web framework, where the two files are loaded via a form.
Loading the xml file works fine:
tree = etree.parse(StringIO(data['xml_file']))
But as the DTD is linked to in the top of the xml file, the following statement fails:
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(StringIO(data['xml_file']), parser)
Via this question, I tried:
etree.DTD(StringIO(data['dtd_file']))
tree = etree.parse(StringIO(data['xml_file']))
Whilst the first line doesn't cause an error, the second falls over on unicode entities the DTD is meant to pick up (and does so in the file system version):
XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46
How do I go about correctly loading this DTD?

Here's a short but complete example, using the custom resolver technique @Steven mentioned.
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
from lxml import etree
data = dict(
    xml_file = '''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''',
    dtd_file = '''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')
class DTDResolver(etree.Resolver):
    # Called whenever the parser needs an external resource such as "a.dtd"
    def resolve(self, url, id, context):
        return self.resolve_string(data['dtd_file'], context)
xmldoc = StringIO(data['xml_file'])
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
try:
    tree = etree.parse(xmldoc, parser)
except etree.XMLSyntaxError as e:
    # handle xml and validation errors
    print(e)

You could probably use a custom resolver. The docs actually give an example of doing this to provide a dtd.

Related

Editing existing XML file and sending post via Jboss

I have the following Python method which runs through an XML file, parses it, and TRIES to edit a field:
import requests
import xml.etree.ElementTree as ET
import random
def runThrougheTree():
    #open xml file
    with open("testxml.xml") as xml:
        from lxml import etree
        #parse
        parser = etree.XMLParser(strip_cdata=True, recover=True)
        tree = etree.parse("testxml.xml", parser)
        root= tree.getroot()
        #ATTEMPT to edit field - will not work as of now
        for ci in root.iter("CurrentlyInjured"):
            ci.text = randomCurrentlyInjured(['sffdgdg', 'sdfsdfdsfsfsfsd','sfdsdfsdfds'])
        #Overwrite initial xml file with new fields - will not work as of now
        etree.ElementTree(root).write("testxml.xml",pretty_print=True, encoding='utf-8', xml_declaration=True)
        #send post (Jboss)
        requests.post('http://localhost:9000/something/RuleServiceImpl', data="testxml.xml")
def randomCurrentlyInjured(ran):
    random.shuffle(ran)
    return ran[0]
#-----------------------------------------------
if __name__ == "__main__":
    runThrougheTree()
Edited XML file:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rule="http://somewebsite.com/" xmlns:ws="http://somewebsite.com/" xmlns:bus="http://somewebsite.com/">
<soapenv:Header/>
<soapenv:Body>
<ws:Respond>
<ws:busMessage>
<bus:SomeRef>insertnumericvaluehere</bus:SomeRef>
<bus:Content><![CDATA[<SomeDef>
<SomeType>ABCD</Sometype>
<A_Message>
<body>
<AnonymousField>
<RefIndicator>1111111111111</RefIndicator>
<OneMoreType>HIJK</OneMoreType>
<CurrentlyInjured>ABCDE</CurentlyInjured>
</AnonymousField>
</body>
</A_Message>
</SomeDef>]]></bus:Content>
<bus:MessageTypeId>somenumericvalue</bus:MessageTypeId>
</ws:busMessage>
</ws:Respond>
</soapenv:Body>
</soapenv:Envelope>
Issues:
The field is not being edited.
Jboss error: Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
Note: I have ensured that there are no characters prior to the first XML tag.
In the end, I was unable to use lxml or ElementTree to edit the fields and post to Jboss because:
I had CDATA in the XML, as mzjn pointed out in the comments.
Jboss did not like the request after it had been parsed, even when the CDATA tags were removed.
Workaround/eventual solution: I was able to (somewhat tediously) use .replace() in my script to edit the plain text successfully, and then send the POST to Jboss. I hope this helps someone else someday!
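For anyone who would rather not fall back to .replace(), here is a minimal sketch of the two-stage parse that the CDATA wrapper calls for. It uses only the standard library. ENVELOPE is a hypothetical, well-formed stand-in for the message above, and set_currently_injured is a name made up for this example. The parser returns the CDATA payload as plain text, so the payload is parsed a second time, edited, and written back as escaped text. Whether the Jboss service accepts escaped text in place of a CDATA section is an assumption that would need testing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the SOAP message in the question.
ENVELOPE = """<Envelope xmlns:bus="http://somewebsite.com/">
  <bus:Content><![CDATA[<SomeDef><CurrentlyInjured>ABCDE</CurrentlyInjured></SomeDef>]]></bus:Content>
</Envelope>"""

NS = {"bus": "http://somewebsite.com/"}
# Keep the "bus" prefix on output instead of an auto-generated ns0.
ET.register_namespace("bus", NS["bus"])


def set_currently_injured(envelope_xml, new_value):
    """Return the envelope with every CurrentlyInjured value replaced."""
    outer = ET.fromstring(envelope_xml)
    content = outer.find(".//bus:Content", NS)
    # The CDATA payload arrives as plain text, so parse it separately.
    inner = ET.fromstring(content.text)
    for field in inner.iter("CurrentlyInjured"):
        field.text = new_value
    # ElementTree writes the payload back as escaped text (&lt;...),
    # not as a CDATA section; XML parsers treat the two the same way.
    content.text = ET.tostring(inner, encoding="unicode")
    return ET.tostring(outer, encoding="unicode")


updated = set_currently_injured(ENVELOPE, "sffdgdg")
print(updated)
```

Note also that data="testxml.xml" in the question sends the literal file name as the request body, which would plausibly explain the "Content is not allowed in prolog" error at [1,1]. Passing the document itself, for example data=updated.encode("utf-8"), is probably what was intended.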

How to write an xml file using libxml2 in python?

I am attempting to write an xml file in python3 using libxml2. I cannot find any relevant documentation regarding python about writing files with libxml. When I attempt to write an xml file parsed with libxml2 I get the error:
xmlDoc has no attribute write
Anyone here done this before? I can get it to work in Etree just fine but Etree will not respect the attribute order that I need.
You can use saveFile() or saveFileEnc(). Example:
import libxml2
XML = """
<root a="1" b="2">XYZ</root>
"""
doc = libxml2.parseDoc(XML)
doc.saveFile("test.xml")
doc.saveFileEnc("test2.xml", "UTF-8")
I could not find any good documentation for the Python API. Here is the corresponding C documentation: http://xmlsoft.org/html/libxml-tree.html#xmlSaveFile.
import libxml2
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okibgo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
doc = libxml2.parseDoc(DOC)
root = doc.children
print(root)
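On the attribute-order point in the question: since Python 3.8 the standard library's ElementTree serializer keeps attributes in the order they were parsed or set, rather than sorting them. If the ordering problem was seen on an older interpreter, a sketch like the following may make libxml2 unnecessary. The sample document here is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Attributes deliberately not in alphabetical order.
root = ET.fromstring('<root b="2" a="1">XYZ</root>')
root.set("c", "3")  # appended after the parsed attributes

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

On Python 3.8 or later this prints the attributes as b, a, c. ElementTree.write() uses the same serializer, so a file written that way should keep the same order.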

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions, but I have not been able to find an answer. I know that I can do this by loading the text file and going over it using pandas; however, I'm sure there's a much better way.
I was unsuccessful with ElementTree as well as with xml.dom (using minidom).
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)
An example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>
The pair: prefix in your XML is bound to the namespace urn:us:gov:uspto:pair, and ElementTree expands each tag to the form {urn:us:gov:uspto:pair}ApplicationNumber. A search for the literal name 'pair:ApplicationNumber' therefore does not match, even though it looks like it should.
I've used ElementTree to extract the application numbers as follows (I've used a local XML file in my examples, rather than the full path in your code).
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            if 'ApplicationNumber' in child.tag:
                print(child.text)
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
Hope this may be useful.
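Since the question asks for a list rather than printed output, here is a short sketch that collects the values with findall() and a prefix-to-URI mapping. SAMPLE is a trimmed-down copy of the file in the question and the variable names are just for illustration. With the real file you would call ElementTree.parse() on the path instead of fromstring().

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the file from the question.
SAMPLE = """<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
</pair:PatentApplicationList>"""

# Maps the prefix used in the search path to the namespace URI.
NS = {"pair": "urn:us:gov:uspto:pair"}

root = ET.fromstring(SAMPLE)
numbers = [el.text for el in root.findall(".//pair:ApplicationNumber", NS)]
print(numbers)
```

The prefix in the mapping only has to match the one used in the search path. It does not have to be the same prefix the document uses.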

python lxml with py2exe

I have generated an XML document with DOM and I want to use lxml to pretty-print it.
This is my code for pretty-printing the XML:
def prettify_xml(xml_str):
    import lxml.etree as etree
    root = etree.fromstring(xml_str)
    xml_str = etree.tostring(root, pretty_print=True)
    return xml_str
My output should be an XML-formatted string.
I got this code from a post on Stack Overflow. It works flawlessly when I run it with Python itself. But when I convert my project to a binary created with py2exe (my binary is a Windows service with a named pipe), I have two problems:
My service was not starting. I solved this by adding lxml.etree to the includes option of the py2exe function; after that my service started properly.
When XML generation is called, this is the error I see in my log:
'module' object has no attribute 'fromstring'
Where do I rectify this error? And is my solution to the first problem correct?
My XML generation code:
from xml.etree import ElementTree
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring, XML
import lxml.etree
def prettify_xml(xml_str):
    root = lxml.etree.fromstring(xml_str)
    xml_str = lxml.etree.tostring(root, pretty_print=True)
    return xml_str
def dll_xml(status):
    try:
        xml_declaration = '<?xml version="1.0" standalone="no" ?>'
        rootTagName='response'
        root = Element(rootTagName)
        root.set('id' , 'rp001')
        parent = SubElement(root, 'command', opcode ='-ac')
        # Create children
        chdtag1Name = 'mode'
        chdtag1Value = 'repreport'
        chdtag2Name='status'
        chdtag2Value = status
        fullchildtag1 = ''+chdtag1Name+' value = "'+chdtag1Value+'"'
        fullchildtag2=''+chdtag2Name+' value="'+chdtag2Value+'"'
        children = XML('''<root><'''+fullchildtag1+''' /><'''+fullchildtag2+'''/></root> ''')
        # Add parent
        parent.extend(children)
        dll_xml_doc = xml_declaration + tostring(root)
        dll_xml_doc = prettify_xml(dll_xml_doc)
        return dll_xml_doc
    except Exception as error:
        log.error("xml_generation_failed : %s" % error)
Try using PyInstaller instead of py2exe. I converted your program to a binary .exe with no problem just by running python pyinstaller.py YourPath\xml_a.py.
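If the only thing lxml is needed for is pretty-printing, another option worth considering is to drop the dependency and use minidom, which the generation code already imports. This is a swapped-in technique, not what the question used, and it is only a rough sketch. Note that minidom adds extra blank lines if the input already contains whitespace between tags, so it works best on compact output such as that produced by tostring().

```python
from xml.dom import minidom


def prettify_xml(xml_str):
    """Return xml_str re-indented with two spaces per nesting level."""
    return minidom.parseString(xml_str).toprettyxml(indent="  ")


compact = ('<response id="rp001"><command opcode="-ac">'
           '<mode value="repreport"/></command></response>')
pretty = prettify_xml(compact)
print(pretty)
```

Because minidom ships with Python, there is no compiled extension module for py2exe to bundle.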

How do I validate xml against a DTD file in Python

I need to validate an XML string (and not a file) against a DTD description file.
How can that be done in Python?
Another good option is lxml's validation which I find quite pleasant to use.
A simple example taken from the lxml site:
from StringIO import StringIO
from lxml import etree
dtd = etree.DTD(StringIO("""<!ELEMENT foo EMPTY>"""))
root = etree.XML("<foo/>")
print(dtd.validate(root))
# True
root = etree.XML("<foo>bar</foo>")
print(dtd.validate(root))
# False
print(dtd.error_log.filter_from_errors())
# <string>:1:0:ERROR:VALID:DTD_NOT_EMPTY: Element foo was declared EMPTY this one has content
From the examples directory in the libxml2 Python bindings:
#!/usr/bin/python -u
import libxml2
import sys
# Memory debug specific
libxml2.debugMemory(1)
dtd="""<!ELEMENT foo EMPTY>"""
instance="""<?xml version="1.0"?>
<foo></foo>"""
dtd = libxml2.parseDTD(None, 'test.dtd')
ctxt = libxml2.newValidCtxt()
doc = libxml2.parseDoc(instance)
ret = doc.validateDtd(ctxt, dtd)
if ret != 1:
    print("error doing DTD validation")
    sys.exit(1)
doc.freeDoc()
dtd.freeDtd()
del dtd
del ctxt
