XML scheme to scheme conversion with Python

I am a writer of books and I am new to Python. The question I have is strategic. I write my manuscripts into an xml-file with proprietary tags (they have a size around 1 MB and 5000-1000 lines):
manuscript.xml
<h>Title of Chapter</h>
<p>This is a sentence, with one word written in <i>italics</i></p>
I often want to output what I have written so far and I am trying to create a fully automated workflow with Python. Python should convert my XML into two different XML-schemes:
1. HTML for epub (with creating IDs):
<h1 id="title-of-chapter">Title of Chapter</h1>
<p>This is a sentence, with one word written in <i>italics</i></p>
Then save as manuscript.html.
2. ODT:
<text:h text:style-name="HeadlineStyle1" text:outline-level="1">GetByName</text:h>
<text:p text:style-name="ParagraphStyle1">This is a sentence, with one word written in <text:span text:style-name="Italics">italics</text:span></text:p>
Then save as content.xml
I am not sure whether I should really parse the XML (original XML → dict → new XML) at all. Wouldn't it be easier to handle the original file as plain text and to let Python just convert the tags, thus <p> becomes <text:p text:style-name="ParagraphStyle1">?
On the other side, the task above is only the first step. Later on, I would like to make Python create a table of contents by collecting all headlines and write it into the helper file toc.ncx and finally letting Python zip all those files into an epub-container.
There are a lot of tutorials out there about xml → dict, but it is hard to find anything detailed about the second step, dict → xml.

It is easily possible to rename tags in ElementTree:
for oldTag in root.iter('oldtag'):
    oldTag.tag = 'newtag'
XSLT could also do this, since it is designed for transforming XML, but for simply renaming tags ElementTree is enough.
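As a rough sketch of the HTML half of the workflow described in the question, the renaming approach can be combined with ID generation and headline collection for a later table of contents. The wrapping <body> root element, the slugify helper and the convert_to_html function are assumptions made for this illustration; they are not part of the original manuscript format.

```python
import re
import xml.etree.ElementTree as ET


def slugify(text):
    """Turn a headline into an id such as 'title-of-chapter' (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def convert_to_html(root):
    """Rename <h> to <h1>, add an id, and return (id, title) pairs for a TOC."""
    toc = []
    # list() takes a snapshot so that renaming tags cannot disturb the iteration
    for headline in list(root.iter("h")):
        title = "".join(headline.itertext())
        headline.tag = "h1"
        headline.set("id", slugify(title))
        toc.append((headline.get("id"), title))
    return toc


# Assumption: the manuscript is wrapped in a single root element.
source = (
    "<body><h>Title of Chapter</h>"
    "<p>This is a sentence, with one word written in <i>italics</i></p></body>"
)
root = ET.fromstring(source)
toc = convert_to_html(root)
html = ET.tostring(root, encoding="unicode")
```

Working on the parsed tree rather than on plain text means the same pass can collect the headlines, which is what the later toc.ncx step would need.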

Related

xml2csv with xml Attribute and Values in Python

How can I convert a large XML file (500 MB) with a complex structure into CSV?
Sample XML:
<images>
<image ismain="1" sml="1" med="1" big="0"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/45656.jpeg</url></image>
<image ismain="1" sml="1" med="0" big="1"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/354456.jpeg</url></image>
</images>
Python code:
from xmlutils.xml2csv import xml2csv
converter = xml2csv("/home/mehul/Downloads/instant/static/images.xml", "/home/mehul/Downloads/instant/static/images.csv", encoding="utf-8")
converter.convert(tag="image")
Actual Output:
id,title,url
2,,www.mysite.com/45656.jpeg
2,,www.mysite.com/354456.jpeg
Expected Output:
id,ismain sml med big,title,url
2,,,,,www.mysite.com/45656.jpeg
2,,,,,www.mysite.com/354456.jpeg

In my experience, xmlutils doesn't work well with complex structures such as XML with nested tags, and you also want all the attributes.
I worked on this in a company project and basically had to write my own parsing code.
You can use Python's built-in xml library to parse through the XML, check for events such as start and end tags, and extract the data as you go.
In fact, if all your tag names are lowercase, you can just use Python's HTMLParser. It has predefined handler methods for these events which you can override. Note, however, that it converts tag names to lowercase (if they are uppercase originally).
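A minimal sketch of the event-based approach with the standard library, assuming every <image> element carries the same attributes and child tags as in the sample. The function name images_to_csv and the column order (attributes first, then child elements) are choices made for this illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def images_to_csv(xml_source, csv_target, row_tag="image"):
    """Stream row_tag elements from xml_source and write one CSV row per element."""
    writer = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != row_tag:
            continue
        # Attributes first, then the text of each direct child element
        row = dict(elem.attrib)
        for child in elem:
            row[child.tag] = child.text or ""
        if writer is None:
            # Assumption: the first row defines the columns for the whole file
            writer = csv.DictWriter(
                csv_target, fieldnames=list(row), restval="", extrasaction="ignore"
            )
            writer.writeheader()
        writer.writerow(row)
        elem.clear()  # drop the element's content to limit memory use


sample = (
    b'<images><image ismain="1" sml="1" med="1" big="0"><id>2</id>'
    b"<title><![CDATA[]]></title><url>www.mysite.com/45656.jpeg</url></image></images>"
)
out = io.StringIO()
images_to_csv(io.BytesIO(sample), out)
csv_lines = out.getvalue().splitlines()
```

For a real file, xml_source would be a path or an open binary file, and csv_target a file opened with newline="".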

Editing a DOCX file

I am working on a little project that should be quite simple. I know it's been done before, but for the life of me I cannot get it to work. I made a docx template in Microsoft Word that contains a header and some text in the body of the document. My goal is to have a program that can change this text. Using python-docx, I have successfully written a program that modifies the body text easily. That being said, I am trying to learn how to do the same thing using XML parsing, which will allow the header to be changed as well. Long story short, XML parsing (I think that's what it is) will give me much more freedom down the road.
I know after the docx is unzipped, the word/document.xml contains the body text.
Here is my code so far.
from lxml import etree as ET
tree = ET.parse('document.xml')
root = tree.getroot()
for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    if i.text == 'Title':
        i.text = 'How to cook'
tree.write('document_output.xml', xml_declaration=True, encoding="UTF-8",
           method="xml", standalone="yes")
This program successfully changes the wanted text to the updated text.
Here is the original document.xml
https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0
Here is the output.
https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0
P.S. viewing the code from dropbox, it makes everything start at line 4 instead of line 1.
If you view them in an XML viewer you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. And I wouldn't think this would matter but the top line uses single quotes instead of double.
Hope someone can shed some light on why this is still not opening properly in Word.
Thanks for all the help!!

You're having the usual problems with ET.
As a starter, check out these Stack Overflow threads:
Namespace 1
Namespace 2
Namespace 3 with xml declaration
xml declaration
As you can see, you're not the first person with these problems.
What you could do for the namespaces is parse the xml twice:
first time in order to extract the namespaces and
a second time in order to do your actual work.
Besides, some people have already suggested switching from ElementTree to lxml.
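The two-pass idea can be sketched as follows for the standard-library xml.etree.ElementTree, where prefixes are otherwise rewritten to ns0, ns1 and so on when the tree is serialized. The function name parse_preserving_prefixes and the tiny sample document are assumptions for illustration only; a real word/document.xml declares many more namespaces.

```python
import io
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_preserving_prefixes(data):
    """Parse XML bytes twice: once to register prefixes, once for the real tree."""
    # Pass 1: collect every (prefix, uri) declaration in the document
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    # Pass 2: the actual parse used for editing
    return ET.ElementTree(ET.fromstring(data))


sample = (
    b'<w:document xmlns:w="' + W_NS.encode() + b'">'
    b"<w:body><w:p><w:r><w:t>Title</w:t></w:r></w:p></w:body></w:document>"
)
tree = parse_preserving_prefixes(sample)
for node in tree.iter("{%s}t" % W_NS):
    if node.text == "Title":
        node.text = "How to cook"
output = ET.tostring(tree.getroot())
```

Note that register_namespace changes module-wide state, so the registered prefixes also affect other documents serialized later in the same process.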

Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from Pubmed Central. I'm trying to parse the XML with ElementTree to just have the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces and in my case, I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?
Here's an example. Note: I use Python 3.4
Small snippet of the XML:
<sec sec-type="materials|methods" id="s5">
<title>Materials and Methods</title>
<sec id="s5a">
<title>Overgo design</title>
<p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50–60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
<table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
<label>Table 2</label>
<caption>
<title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
</caption>
<alternatives>
<graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"/>
<col align="center" span="1"/>
</colgroup>
For my project, I want all of the text in the "p" tag (not just for this snippet of the XML, but for the entire XML string).
Now, I already know that I can make the whole XML string into an ElementTree Object
>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)
Now if I try to get the text using the tag like this:
>>> text = root.find('p')
>>> print("".join(text.itertext()))
or
>>> text = root.get('p').text
I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.
While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!
--- EDIT ---
So now I know that I should be using this code to get everything in the 'p' tags:
>>> text = root.find('.//p')
>>> print("".join(text.itertext()))
Despite the fact that I'm using itertext(), it's only returning content from the first "p" tag and not looking at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.
--- FINAL EDIT ---
I figured out that itertext() only works within one tag and find() only returns the first item. In order to get the entire text that I want, I must use findall():
>>> all_text = root.findall('.//p')
>>> for texts in all_text:
...     print("".join(texts.itertext()))

root.get() is the wrong method, as it retrieves an attribute of the root tag, not a subtag.
root.find() is correct, as it finds the first matching subtag (alternatively, use root.findall() for all matching subtags).
If you want to find not only direct subtags but also indirect subtags (as in your example), the expression passed to root.find/root.findall has to use the supported subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':
text = root.find('.//p')
print("".join(text.itertext()))
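To make the difference between these calls concrete, here is a small self-contained sketch. The shortened sample document and the helper name paragraph_texts are made up for illustration and are not taken from the actual PubMed Central data.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(root):
    """Return the full text of every <p> element at any depth below root."""
    return ["".join(p.itertext()) for p in root.findall(".//p")]


xml_string = (
    '<sec id="s5"><title>Materials and Methods</title>'
    '<sec id="s5a"><p>First <xref rid="t2">Table 2</xref> paragraph.</p>'
    "<p>Second paragraph.</p></sec></sec>"
)
root = ET.fromstring(xml_string)

attribute_value = root.get("id")     # get() reads an attribute of root itself
direct_child = root.find("p")        # None: <p> is not a direct child of root
first_nested = root.find(".//p")     # first <p> at any depth
paragraphs = paragraph_texts(root)   # text of all <p> elements, nested tags included
```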

Python xml - handle unclosed token

I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.
Quick background just fwiw:
These XML files were at one point totally valid but somehow when processing them historically my process which copied/pasted them may have corrupted them. (Turns out it was a flushing issue / with statement not closing, if you care, see the good help I got on that investigation at... Python shutil copyfile - missing last few lines ).
Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents, which are valid XML. The files are only missing the last 4 or 5 KB of a 6 MB file. As alluded to earlier, though, the file just 'cuts out'. It looks like this:
</Maintag>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
where (perhaps obviously) Scheduled_E is the beginning of what should be another element, <Scheduled_Event>, say. But the file gets cut short mid-tag. Once again, before this point in the file, there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cut-off entry (and obviously anything that should have come after) as an unrecoverable failure.
A simple but incomplete method of dealing with this might be to look, before any XML processing, for the last instance of the string </Maintag> in the file, and replace what follows (which will be broken at some point) with the closing counterparts of the 'opening' tags. Again, this at least lets me process what is still there and valid.
If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
<Source FileName="myfile">
I am hoping that even easier than that, there might be an elementtree or beautifulsoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.
Thanks

For dealing with unclosed elements (or tokens, as in the title of this question), I'd recommend trying lxml. lxml's XMLParser has a recover option, which is documented as:
recover - try hard to parse through broken XML
For example, given broken XML as follows:
from lxml import etree
xml = """
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
"""
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc))
The recovered XML printed by the above code is as follows:
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E/></Maintag></root>
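If lxml is not available, the string-based repair described in the question can be sketched with the standard library alone. This assumes the closing tags are </Source></FirstTag>, inferred from the opening tags listed in the question; the function name repair_truncated is hypothetical.

```python
import xml.etree.ElementTree as ET


def repair_truncated(data, last_good=b"</Maintag>", closers=b"</Source></FirstTag>"):
    """Cut the bytes after the last complete record and append the closing tags."""
    cut = data.rfind(last_good)
    if cut == -1:
        raise ValueError("no complete record found")
    return data[: cut + len(last_good)] + closers


truncated = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>'
    b'<FirstTag><Source FileName="myfile">'
    b"<Maintag><Name>Good</Name></Maintag>"
    b"<Maintag><Name>Atlas</Name><Scheduled_E"
)
root = ET.fromstring(repair_truncated(truncated))
names = [record.findtext("Name") for record in root.iter("Maintag")]
```

Unlike the recover option, this drops the cut-off record entirely instead of keeping its readable fields.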

XML parsing using ElementTree in Python

I am trying to read an XML file with Python (version 2.6.7) using ElementTree.
There are some tags of the format:
<tag, [attributes]>
....Data....
</tag>
The data in my case is usually some binary data that I read using text attribute.
However there are some cases where data can reference any other tag in the file.
<tag, [attributes]>
....Data....
<ref target='idname'/>
</tag>
Which attribute or method of ElementTree can be used to parse them?

Try XPath expressions.
This will tell you whether the tag is present and, if present, returns the node.

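I think I would use something like this:
for iteration in root.iter('tag'):
    if iteration.find('ref') is not None:
        ...
The explicit comparison with None matters, because an element without children (such as the empty ref tag) evaluates as false. So basically I would handle those cases separately.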
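A slightly fuller sketch of that loop, returning the data and the referenced target for each element. The sample document and the name read_tags are illustrative assumptions. Note also that Element.iter() requires Python 2.7 or later; on 2.6 the older getiterator() would be the equivalent.

```python
import xml.etree.ElementTree as ET


def read_tags(root, tag_name="tag"):
    """Return (data, target) per element; target is None when there is no <ref>."""
    results = []
    for elem in root.iter(tag_name):
        ref = elem.find("ref")
        # <ref/> has no children, so it must be compared with None explicitly
        target = ref.get("target") if ref is not None else None
        results.append(((elem.text or "").strip(), target))
    return results


doc = ET.fromstring(
    "<root><tag>AAAA</tag><tag>BBBB<ref target='idname'/></tag></root>"
)
parsed = read_tags(doc)
```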
