Editing a DOCX file - python

I am working on a little project that should be quite simple. I know its been done before but for the life of me, I cannot get it to work. Alright so I made a docx template using Microsoft word that contains a Header and just some text in the body of the paper. My goal is have a program that can change this text. Using python-docx I have successfully been able to write a program that modifies the body text easily. That being said I am trying to learn how to do the same thing using XML parsing, which will allow the header to be changed. Long story short, XML parsing (I think thats what it is) will give me much more freedom down the road.
I know after the docx is unzipped, the word/document.xml contains the body text.
Here is my code so far.
from lxml import etree as ET
tree = ET.parse('document.xml')
root = tree.getroot()
for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
if i.text == 'Title':
i.text = 'How to cook'
tree.write('document_output.xml', xml_declaration = True, encoding = "UTF-8", method = "xml" \
, standalone = "yes")
This program successfully changes the wanted text to the updated text.
Here is the original document.xml
https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0
Here is the output.
https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0
P.S. viewing the code from dropbox, it makes everything start at line 4 instead of line 1.
If you view them in an XML viewer you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. And I wouldn't think this would matter but the top line uses single quotes instead of double.
Hope someone can shed some light on why this is still not opening properly in Word.
Thanks for all the help!!

you're having the usual problems with ET.
As a starter, check out these Stackoverflow threads:
Namespace 1
Namespace 2
Namespace 3 with xml declaration
xml declaration
As you can see, you're not the first person with these problems.
What you could do for the namespaces is parse the xml twice:
first time in order to extract the namespaces and
a second time in order to do your actual work.
Besides, some people already suggested to switch from Elementtree to lxml.

Related

Parsing XML with ElementTree's iter() with no argument, does not return the first several tags in file

I am trying to extract all of the headers from an XML file and put them into a list in python, however, every time I run my code the first tag extracted from the file is not actually first tag in the XML file. It instead begins with the 18th tag and then prints the remainder of the list from there. The really weird part is when I originally wrote this code, it worked as expected, but as I added code to extract the element text and put it in a list, the header code stopped working, both in the original program and the standalone code below. I should also mention the complete program does not manipulate the XML file in any way. All manipulation is done exclusively on the python lists after the extraction.
import xml.etree.ElementTree as ET
tree = ET.parse("Sample.xml")
root = tree.getroot()
headers = [elem.tag for elem in root.iter()]
print(headers)
Sample.XML is a sensitive file so I had to redact all the element text. It is also a very large file so I only included one account's worth of elements.
-<ExternalCollection xmlns="namespace.xsd">
-<Batch>
<BatchID>***</BatchID>
<ExternalCollectorName>***</ExternalCollectorName>
<PrintDate>***</PrintDate>
<ProviderOrganization>***</ProviderOrganization>
<ProvOrgID>***</ProvOrgID>
-<Account>
<AccountNum>***</AccountNum>
<Guarantor>***</Guarantor>
<GuarantorAddress1>***</GuarantorAddress1>
<GuarantorAddress2/>
<GuarantorCityStateZip>***</GuarantorCityStateZip>
<GuarantorEmail/>
<GuarantorPhone>***</GuarantorPhone>
<GuarantorMobile/>
<GuarantorDOB>***</GuarantorDOB>
<AccountID>***</AccountID>
<GuarantorID>***</GuarantorID>
-<Incident>
<Patient>***</Patient>
<PatientDOB>***</PatientDOB>
<FacilityName>***</FacilityName>
-<ServiceLine>
<DOS>***</DOS>
<Provider>***</Provider>
<Code>***</Code>
<Modifier>***</Modifier>
<Description>***</Description>
<Billed>***</Billed>
<Expected>***</Expected>
<Balance>***</Balance>
<SelfPay>***</SelfPay>
<IncidentID>***</IncidentID>
<ServiceLineID>***</ServiceLineID>
-<OtherActivity>
</OtherActivity>
</ServiceLine>
</Incident>
</Account>
</Batch>
</ExternalCollection>
The output is as follows:
'namespace.xsd}PatientDOB', '{namespace.xsd}FacilityName', '{namespace.xsd}ServiceLine', '{namespace.xsd}DOS', '{namespace.xsd}Provider', '{namespace.xsd}Code', '{namespace.xsd}Modifier', '{namespace.xsd}Description', '{namespace.xsd}Billed', '{namespace.xsd}Expected', '{namespace.xsd}Balance', '{namespace.xsd}SelfPay', '{namespace.xsd}IncidentID', '{namespace.xsd}ServiceLineID', '{namespace.xsd}OtherActivity'
As you can see, for some reason the first returned value is Patient DOB instead of the actual first tag.
Thank y'all in advance!
Your input file should not contain "-" chars in front of XML tags.
You should drop at least the first "-", in front of the root tag, otherwise
a parsing error occurs.
Note also that your first printed tag name has no initial "{", so apparently
something weird is going on with your list, presumably, after your loop.
I ran your code and got a proper list, containing all tags.
Try the following loop:
for elem in root.iter():
print(elem.tag)
Maybe it will give you some clue about the real cause of your error.
Consider also upgrading your Python installation. Maybe you have
some outdated modules.
Yet another hint: Run your code on just this input that you included
in your post, with content replaced with "***". Maybe the real cause
of your error is in the actual content of any source element
(which you replaced here with asterixes).

Python finding exact string in .html file

I have a .html file which gets dynamically filled depending on what actions are taken in the program, however I am having an issue when searching for an exact string, the issue is that although I know the file is not blank, the loop doesn't return anything and thinks its blank.
I have searched and read many other SO questions and tried many of them, including 'blah' in line, re.findall, and with open() all the time they return only blank, I'm thinking I need HTML parsing or similar?
Can anyone shed any light on this for me?
f = open(outApp + '_report.html', 'r+')
for line in f:
#check the for loop works
self.progressBox.AppendText(line)
if 'mystring' in line:
#do stuff
The string I wish to find is My country which is wrapped in h2 tags
It is definitely shouldn't be done without special HTML parser.
Google about any python HTML parser you want. For basic usage they are all easy. For example lxml. In pseudo-code your task would be:
from some_cool_lib import SomeCoolHTMLParser
parser = SomeCoolHTMLParser()
doc = parser.parse(path_to_my_html_file)
h2_elements = doc.findall('h2')
for h2 in h2_elements:
if h2.text == 'My country':
# do stuff

Parsing Google Earth KML file in Python (lxml, namespaces)

I am trying to parse a .kml file into Python using the xml module (after failing to make this work in BeautifulSoup, which I use for HTML).
As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:
from lxml import etree
tree=etree.parse('kmlfile')
Here is the example from the tutorial I am trying to emulate:
If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
print element.tag, '-', element.text
I would like to get all data under 'Placemark', so I tried
for i in tree.getiterterator("Placemark"):
print i, type(i)
which doesn't give me anything. What does work is:
for i in tree.getiterterator("{http://www.opengis.net/kml/2.2}Placemark"):
print i, type(i)
I don't understand how this comes about. The www.opengis.net is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...) , but I don't understand
how the part in {} relates to my specific example at all
why it is different from the tutorial
and what I am doing wrong
Any help is much appreciated!
Here is my solution.
So, the most important thing to do is read this as posted by Tomalak. It's a really good description of namespaces and easy to understand.
We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.
###Problem
Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:
<Placemark> <name>CITY NAME</name>
The XPath equivalent to the non-functional code I posted above is:
tree=etree.parse('kml document')
result=tree.xpath('//Placemark/name/text()')
Where the text() part is needed to get the text contained in the location //Placemark/name.
Now this doesn't work, as Tomalak pointed out, cause the name of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up in the actual document (which confused me) but it is defined at the beginning of the XML document like this:
xmlns="http://www.opengis.net/kml/2.2"
###Solution
We can supply namespaces to xpath by setting the namespaces argument:
xpath(X, namespaces={prefix: namespace})
This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
However, Xpath does not understand what a default namespace is (cf docs). Therefore, we need to trick it, like Tomalak suggested above: We invent a prefix for the default and add it to our search terms. We can just call it kml for instance. This piece of code actually does the trick:
tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
The tutorial mentions that there is also an ETXPath method, that works just like Xpath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.

using python to parse xml data

I have a question with regards to XML and python. I want to comb through this xml file and look for certain tags, and then within those tags look for where there is data separated by a comma. split that and make a new line. I have the logic down, im just not too familiar with python to know whoch modules I should be researching. Any help as to where i should start researching would help.
172.28.18.142,10.0.0.2
thanks
I think when it comes to xml parsing in python there are a few options: lxml, xml, and BeautifulSoup. Most of my experience has dealt with the first two and I've found lxml to be extraordinarily faster than xml. Here's an lxml code snippet for parsing all elements of the root with a particular tag and storing the comma-separated text of each tag as a list. I think you'll want to add a lot of try and except blocks and tinker with the details, but this should get you started.
from lxml import etree
file_path = r'C:\Desktop\some_file.xml'
tree = etree.parse(file_path)
info_list = []
my_tag_path = tree.xpath('//topTag')
for elem in my_tag_path:
if elem.find('.//childTag') is not None:
info_list.append(elem.xpath('.//childTag')[0].text.split(','))

Python xml - handle unclosed token

I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.
Quick background just fwiw:
These XML files were at one point totally valid but somehow when processing them historically my process which copied/pasted them may have corrupted them. (Turns out it was a flushing issue / with statement not closing, if you care, see the good help I got on that investigation at... Python shutil copyfile - missing last few lines ).
Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents which are valid XML. The files are only missing the last 4 or 5KB of a 6MB file. As alluded to earlier, though, the file just 'cuts out'. it looks like this:
</Maintag>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
where (perhaps obviously) Scheduled_E is the beginning of what should be another attribute, <.Scheduled_Event>, say. But the file gets cut short mid tag. Once again, before this point in the file, there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cutoff entry (and obviously anything that should have come after) as an unrecoverable fail.
A simple but incomplete method of dealing with this might be to simply - pre XML processing - look for the last instance of the string <./Maintag> in the file, and replace what follows (which will be broken, at some point) with the 'opening' tags. Again, this at least lets me process what is still there and valid.
If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
<Source FileName="myfile">
I am hoping that even easier than that, there might be an elementtree or beautifulsoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.
Thanks
For dealing with unclosed elements -or token as in the title of this questioin-, I'd recommend to try lxml. lxml's XMLParser has recover option which documented as :
recover - try hard to parse through broken XML
For example, given a broken XML as follow :
from lxml import etree
xml = """
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
"""
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc))
The recovered XML as printed by the above code is as follow :
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E/></Maintag></root>

Categories