Python xml - handle unclosed token

Python xml - handle unclosed token - python

I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.
Quick background just fwiw:
These XML files were at one point totally valid but somehow when processing them historically my process which copied/pasted them may have corrupted them. (Turns out it was a flushing issue / with statement not closing, if you care, see the good help I got on that investigation at... Python shutil copyfile - missing last few lines ).
Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents which are valid XML. The files are only missing the last 4 or 5KB of a 6MB file. As alluded to earlier, though, the file just 'cuts out'. it looks like this:
</Maintag>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
where (perhaps obviously) Scheduled_E is the beginning of what should be another attribute, <.Scheduled_Event>, say. But the file gets cut short mid tag. Once again, before this point in the file, there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cutoff entry (and obviously anything that should have come after) as an unrecoverable fail.
A simple but incomplete method of dealing with this might be to simply - pre XML processing - look for the last instance of the string <./Maintag> in the file, and replace what follows (which will be broken, at some point) with the 'opening' tags. Again, this at least lets me process what is still there and valid.
If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
<Source FileName="myfile">
I am hoping that even easier than that, there might be an elementtree or beautifulsoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.
Thanks

For dealing with unclosed elements -or token as in the title of this questioin-, I'd recommend to try lxml. lxml's XMLParser has recover option which documented as :
recover - try hard to parse through broken XML
For example, given a broken XML as follow :
from lxml import etree
xml = """
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
"""
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc))
The recovered XML as printed by the above code is as follow :
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E/></Maintag></root>

Related

Parsing XML with ElementTree's iter() with no argument, does not return the first several tags in file

I am trying to extract all of the headers from an XML file and put them into a list in python, however, every time I run my code the first tag extracted from the file is not actually first tag in the XML file. It instead begins with the 18th tag and then prints the remainder of the list from there. The really weird part is when I originally wrote this code, it worked as expected, but as I added code to extract the element text and put it in a list, the header code stopped working, both in the original program and the standalone code below. I should also mention the complete program does not manipulate the XML file in any way. All manipulation is done exclusively on the python lists after the extraction.
import xml.etree.ElementTree as ET
tree = ET.parse("Sample.xml")
root = tree.getroot()
headers = [elem.tag for elem in root.iter()]
print(headers)
Sample.XML is a sensitive file so I had to redact all the element text. It is also a very large file so I only included one account's worth of elements.
-<ExternalCollection xmlns="namespace.xsd">
-<Batch>
<BatchID>***</BatchID>
<ExternalCollectorName>***</ExternalCollectorName>
<PrintDate>***</PrintDate>
<ProviderOrganization>***</ProviderOrganization>
<ProvOrgID>***</ProvOrgID>
-<Account>
<AccountNum>***</AccountNum>
<Guarantor>***</Guarantor>
<GuarantorAddress1>***</GuarantorAddress1>
<GuarantorAddress2/>
<GuarantorCityStateZip>***</GuarantorCityStateZip>
<GuarantorEmail/>
<GuarantorPhone>***</GuarantorPhone>
<GuarantorMobile/>
<GuarantorDOB>***</GuarantorDOB>
<AccountID>***</AccountID>
<GuarantorID>***</GuarantorID>
-<Incident>
<Patient>***</Patient>
<PatientDOB>***</PatientDOB>
<FacilityName>***</FacilityName>
-<ServiceLine>
<DOS>***</DOS>
<Provider>***</Provider>
<Code>***</Code>
<Modifier>***</Modifier>
<Description>***</Description>
<Billed>***</Billed>
<Expected>***</Expected>
<Balance>***</Balance>
<SelfPay>***</SelfPay>
<IncidentID>***</IncidentID>
<ServiceLineID>***</ServiceLineID>
-<OtherActivity>
</OtherActivity>
</ServiceLine>
</Incident>
</Account>
</Batch>
</ExternalCollection>
The output is as follows:
'namespace.xsd}PatientDOB', '{namespace.xsd}FacilityName', '{namespace.xsd}ServiceLine', '{namespace.xsd}DOS', '{namespace.xsd}Provider', '{namespace.xsd}Code', '{namespace.xsd}Modifier', '{namespace.xsd}Description', '{namespace.xsd}Billed', '{namespace.xsd}Expected', '{namespace.xsd}Balance', '{namespace.xsd}SelfPay', '{namespace.xsd}IncidentID', '{namespace.xsd}ServiceLineID', '{namespace.xsd}OtherActivity'
As you can see, for some reason the first returned value is Patient DOB instead of the actual first tag.
Thank y'all in advance!

Your input file should not contain "-" chars in front of XML tags.
You should drop at least the first "-", in front of the root tag, otherwise
a parsing error occurs.
Note also that your first printed tag name has no initial "{", so apparently
something weird is going on with your list, presumably, after your loop.
I ran your code and got a proper list, containing all tags.
Try the following loop:
for elem in root.iter():
print(elem.tag)
Maybe it will give you some clue about the real cause of your error.
Consider also upgrading your Python installation. Maybe you have
some outdated modules.
Yet another hint: Run your code on just this input that you included
in your post, with content replaced with "***". Maybe the real cause
of your error is in the actual content of any source element
(which you replaced here with asterixes).

how to modify attribute values from xml tag with python

I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup like:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml, but I don't understand how to achieve the same result. I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?

When you say that you have many files "starting with" those 4 lines, if you really mean they're exactly like that, the fastest way is probably to entirely ignore that fact that it's XML, and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file out. Repeat for each file.
On the other hand, if you have namespace attributes anywhere else in the file this method wouldn't catch them, and you should probably do a real XML-based solution. With a regular SAX parser, you get a callback for each element start, element end, text node, etc. as it comes along. So you'd just copy them out until you hit the one(s) you want (in this case, a graphml element), then instead of copying out that start-tag, write out the new one you want. Then back to copying. XSLT is also a fine way to do this, which would let you write a tiny generic copier, plus one rule to handle the graphml element.

Editing a DOCX file

I am working on a little project that should be quite simple. I know its been done before but for the life of me, I cannot get it to work. Alright so I made a docx template using Microsoft word that contains a Header and just some text in the body of the paper. My goal is have a program that can change this text. Using python-docx I have successfully been able to write a program that modifies the body text easily. That being said I am trying to learn how to do the same thing using XML parsing, which will allow the header to be changed. Long story short, XML parsing (I think thats what it is) will give me much more freedom down the road.
I know after the docx is unzipped, the word/document.xml contains the body text.
Here is my code so far.
from lxml import etree as ET
tree = ET.parse('document.xml')
root = tree.getroot()
for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
if i.text == 'Title':
i.text = 'How to cook'
tree.write('document_output.xml', xml_declaration = True, encoding = "UTF-8", method = "xml" \
, standalone = "yes")
This program successfully changes the wanted text to the updated text.
Here is the original document.xml
https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0
Here is the output.
https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0
P.S. viewing the code from dropbox, it makes everything start at line 4 instead of line 1.
If you view them in an XML viewer you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. And I wouldn't think this would matter but the top line uses single quotes instead of double.
Hope someone can shed some light on why this is still not opening properly in Word.
Thanks for all the help!!

you're having the usual problems with ET.
As a starter, check out these Stackoverflow threads:
Namespace 1
Namespace 2
Namespace 3 with xml declaration
xml declaration
As you can see, you're not the first person with these problems.
What you could do for the namespaces is parse the xml twice:
first time in order to extract the namespaces and
a second time in order to do your actual work.
Besides, some people already suggested to switch from Elementtree to lxml.

Python and ElementTree: write() isn't working properly

First question. If I screwed up somehow let me know.
Ok, what I need to do is the following. I'm trying to use Python to get some data from an API. The API sends it to me in XML. I'm trying to use ElementTree to parse it.
Now every time I request information from the API, it's different. I want to construct a list of all the data I get. I could use Python's lists, but since I want to save it to a file at the end I figured - why not use ElementTree for that too.
Start with an Element, lets call it ListE. Call the API, parse the XML, get the root Element from the ElementTree. Add the root Element as a subelement into ListE. Call the API again, and do it all over. At the end ListE should be an Element whose subelements are the results of each API call. And the end of everything just wrap ListE into an ElementTree in order to use the ElementTree write() function. Below is the code.
import xml.etree.ElementTree as ET
url = "http://http://api.intrade.com/jsp/XML/MarketData/ContractBookXML.jsp?id=769355"
try:
returnurl=urlopen(url)
except IOError:
exit()
tree = ET.parse(returnurl)
root = tree.getroot()
print "root tag and attrib: ",root.tag, root.attrib
historyE = ET.Element('historical data')
historyE.append(root)
historyE.append(root)
historyET = ET.ElementTree(historyE)
historyET.write('output.xml',"UTF-8")
The program doesn't return any error. The problem is when I ask the browser to open it, it claims a syntax error. Opening the file with notepad here's what I find:
<?xml version='1.0' encoding='UTF-8'?>
<historical data><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo></historical data>
I think the reason for the syntax error is that there isn't a space or a return between 'historical data' and 'ContractBookInfo lastUpdateTime="0"'. Suggestions?

The problem is here:
historyE = ET.Element('historical data')
You shouldn't use a space. As summarized on Wikipedia:
The element tags are case-sensitive; the beginning and end tags must
match exactly. Tag names cannot contain any of the characters
!"#$%&'()*+,/;<=>?#[]^`{|}~, nor a space character, and cannot start
with -, ., or a numeric digit.
See this section of the XML spec for the details ("Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.")

Comments in XML at beginning of document

my PYTHON xml parser fails if there´s a comment at the beginnging of an xml file like::
<?xml version="1.0" encoding="utf-8"?>
<!-- Script version: "1"-->
<!-- Date: "07052010"-->
<component name="abc">
<pp>
....
</pp>
</component>
is it illegal to place a comment like this?
EDIT:
well it´s not throwing an error but the DOM module will fail and not recognize the child nodes:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
for component in sub_tree.firstChild.childNodes:
print(component)
I cannot acces the child nodes; sub_tree.firstChild.childNodes returns an empty list,but if I remove those 2 comments I can loop through the list and read the childnodes as usual!
EDIT:
Guys, this simple example is working and enough to figure it out. start your python shell and execute this small code above. Once it will output nothing and after deleting the comments it will show up the node!

If you do this:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
print sub_tree.children
You will see what is your problem:
>>> print sub_tree.childNodes
[<DOM Comment node " Script ve...">, <DOM Comment node " Date: "07...">, <DOM Element: component at 0x7fecf88c>]
firstChild will obviously pick up the first child, which is a comment and doesn't have any children of its own.
You could iterate over the children and skip all comment nodes.
Or you could ditch the DOM model and use ElementTree, which is so much nicer to work with. :)

It is legal; from XML 1.0 Reference:
2.5 Comments
[Definition: Comments may appear
anywhere in a document outside other
markup; in addition, they may appear
within the document type declaration
at places allowed by the grammar. They
are not part of the document's
character data; an XML processor MAY,
but need not, make it possible for an
application to retrieve the text of
comments. For compatibility, the
string " -- " (double-hyphen) MUST NOT
occur within comments.] Parameter
entity references MUST NOT be
recognized within comments.

To get better answers, show us (a) a small complete Python script and (b) a small complete XML document that together demonstrate the unexpected behaviour.
Have you considered using ElementTree?

That should be legal as long as the XML declaration is on the first line.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.