I am using lxml and Python for writing XML files. I was wondering what is the accepted practice: creating a document tree first and then adding the sub elements OR adding the sub elements and creating the tree later? I know this hardly makes any difference as to the output, but I was interested in knowing what is the accepted norm in this from a coding-style point of view.
Sample code:
page = etree.Element('root')
#first create the tree
doc = etree.ElementTree(page)
#add the subelements
headElt = etree.SubElement(page, 'head')
Or this:
page = etree.Element('root')
headElt = etree.SubElement(page, 'head')
#create the tree in the end
doc = etree.ElementTree(page)
Since tree construction is typically a recursive action, I would say that the tree root could get created last, once the subtree is done. However, I don't see any reason why that should be any better than creating the tree first. I honestly don't think there's an accepted norm for this, and rather than trying to find one I would advise you to write your code in such a way that it makes sense for you and anyone else that might need to read and understand it later.
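To make the "hardly any difference" point concrete, here is a minimal sketch comparing both orders. It uses the standard-library xml.etree.ElementTree as a stand-in (lxml.etree exposes the same Element, SubElement and ElementTree calls); the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_tree_first():
    """Wrap the root in an ElementTree before adding children."""
    page = ET.Element("root")
    doc = ET.ElementTree(page)
    ET.SubElement(page, "head")
    return doc


def build_tree_last():
    """Add the children first and wrap the root at the end."""
    page = ET.Element("root")
    ET.SubElement(page, "head")
    return ET.ElementTree(page)


# The tree object only holds a reference to the root element,
# so both orders serialize to the same bytes.
first = ET.tostring(build_tree_first().getroot())
last = ET.tostring(build_tree_last().getroot())
print(first == last)
```

Since the result is the same, the choice can be made purely on readability.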
I am trying to parse through a large XML file and each node has an ID. My hope is to create one document at the end. To get there, my idea was to parse through each Node and get the information I needed, then join all of these DF's together to get one final DF.
For example:
import xml.etree.ElementTree as ET  # assumption: ET is the standard-library ElementTree

tree = ET.parse('test.xml')
root = tree.getroot()
Session_list = []
Data_list = []
Policy_list = []
Line_list = []
for session in root.iter('session'):
    print(session.tag, session.get('id'))
    Session_list.append({'SessionID': session.get('id')})
for data in root.iter('data'):
    print(data.tag, data.get('id'))
    Data_list.append({'DataID': data.get('id')})
    # How would I go up 1 level to get SessionID here to join on later?
for policy in root.iter('policy'):
    print(policy.tag, policy.get('id'))
    Policy_list.append({'PolicyID': policy.get('id')})
    # How would I go up 1 level to get dataID here to join on later?
for line in root.iter('line'):
    print(line.tag, line.get('id'))
    Line_list.append({'LineID': line.get('id')})
    # How would I go up 1 level to get policyID here to join on later?
I have some code working to go through the XML but I am trying to create a better, simpler solution. The other solution goes through a lot of for loops to keep extracting nodes, and then joins later (because I am in the loop, I can reference the loop above to get the correct ID).
Any help would be awesome! I feel I am close to a way less complicated solution :)
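One possible way to "go up one level" with the standard-library ElementTree, which keeps no parent pointers, is to build a child-to-parent map once and look each element up in it. The sketch below assumes each element's direct parent is the level you want to join on (session > data > policy > line); the sample XML and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file layout is an assumption.
SAMPLE = (
    '<root><session id="s1"><data id="d1"><policy id="p1">'
    '<line id="l1"/><line id="l2"/>'
    '</policy></data></session></root>'
)


def build_parent_map(root):
    """Map every element to its direct parent."""
    return {child: parent for parent in root.iter() for child in parent}


def rows_with_parent(root, tag, parent_key, child_key):
    """Return one dict per `tag` element holding its own id and its parent's id."""
    parent_map = build_parent_map(root)
    return [
        {child_key: element.get("id"), parent_key: parent_map[element].get("id")}
        for element in root.iter(tag)
    ]


root = ET.fromstring(SAMPLE)
line_rows = rows_with_parent(root, "line", "PolicyID", "LineID")
policy_rows = rows_with_parent(root, "policy", "DataID", "PolicyID")
print(line_rows)
print(policy_rows)
```

Each list of dicts can then be turned into a DataFrame and joined on the shared ID column. With lxml, element.getparent() would make the map unnecessary.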
Edit - The issue was that I was running an outdated version of lxml - I feel really stupid now but I'm glad I found out.
I'm having trouble iterating through an XML tree to export single child elements.
What I'm looking for is a way to isolate child elements and export them to separate XML files. My problem is that when I use the 'etree.iter' function, I get not only the child elements but also all following siblings. How can I get only one child element at a time?
This should explain it better. Here's my sample code:
from lxml import etree
root = etree.XML("<users><user><name>Test</name><id>01</id></user> \
<user><name>Test</name><id>02</id></user> \
<user><name>Test</name><id>03</id></user></users>")
for record in root.iter("user"):
    print(etree.tostring(record))
It produces the following output:
b'<user><name>Test</name><id>01</id></user><user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>'
b'<user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>'
b'<user><name>Test</name><id>03</id></user></users>'
But what I need is
b'<user><name>Test</name><id>01</id></user>'
b'<user><name>Test</name><id>02</id></user>'
b'<user><name>Test</name><id>03</id></user>'
What am I doing wrong?
I'm not quite sure why iter produces that output. Try this instead; it works fine.
xn = etree.fromstring("<users><user><name>Test</name><id>01</id></user><user><name>Test</name><id>02</id></user><user><name>Test</name><id>03</id></user></users>")
user_nodes = xn.findall("user")
str_nodes = [etree.tostring(un) for un in user_nodes]
print(str_nodes)
This produces the expected output:
[
b'<user><name>Test</name><id>01</id></user>',
b'<user><name>Test</name><id>02</id></user>',
b'<user><name>Test</name><id>03</id></user>']
I am trying to create an XML export from a python application and need to structure the file in a specific way for the external recipient of the file.
The root node needs to be namespaced, but the child nodes should not.
The root node should look like this:
<ns0:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice">...</ns0:SalesInvoice_Custom_Xml>
I have tried to generate the same node using the lxml library on Python 2.7, but it does not behave as expected.
Here is the code that should generate the root node:
def create_edi(self, document):
    _logger.info("INFO: Started creating EDI invoice with invoice number %s", document.number)
    rootNs = etree.QName("ns0", "SalesInvoice_Custom_Xml")
    doc = etree.Element(rootNs, nsmap={
        'ns0': "http://EDI-export/Invoice"
    })
This gives the following output
<ns1:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice" xmlns:ns1="ns0">...</ns1:SalesInvoice_Custom_Xml>
What should I change in my code to get lxml to generate the correct root node?
You need to use
rootNs = etree.QName(ns0, "SalesInvoice_Custom_Xml")
with
ns0 = "http://EDI-export/Invoice"
The whole data structure itself is agnostic of any namespace mapping you might apply later, i.e. the tags know the true namespaces (e.g. http://EDI-export/Invoice), not their prefixes (e.g. ns0).
Later, when you finally serialize this into a string, a namespace mapping is needed. Then (and only then) a namespace mapping will be used.
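Putting the pieces together, a minimal sketch of a namespaced root with un-namespaced children might look like this. The child element name InvoiceNumber and its text are made up for illustration; only the root tag and namespace URI come from the question.

```python
from lxml import etree

NS0 = "http://EDI-export/Invoice"  # the real namespace URI, not the prefix

# QName takes the namespace URI; the prefix only appears in nsmap.
root_name = etree.QName(NS0, "SalesInvoice_Custom_Xml")
doc = etree.Element(root_name, nsmap={"ns0": NS0})

# Hypothetical child: created without a namespace, so it gets no prefix.
number = etree.SubElement(doc, "InvoiceNumber")
number.text = "INV-001"

xml = etree.tostring(doc).decode()
print(xml)
```

The prefix ns0 is chosen at serialization time from nsmap, while doc.tag itself stays {http://EDI-export/Invoice}SalesInvoice_Custom_Xml.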
Also, after parsing you can ask the etree object which namespace mapping was found during parsing. But that is not part of the structure; it is just additional information about how the structure was encoded as a string. Consider that the following two XMLs are logically equal:
<x:tag xmlns:x="namespace"></x:tag>
and
<y:tag xmlns:y="namespace"></y:tag>
After parsing, their structures will be equal, their namespace mappings will not.
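A quick way to see this is to parse both snippets and compare the tags. The sketch uses the standard-library xml.etree.ElementTree, which stores tags in the same {namespace}name form as lxml.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<x:tag xmlns:x="namespace"></x:tag>')
second = ET.fromstring('<y:tag xmlns:y="namespace"></y:tag>')

# The prefix is gone after parsing; only the namespace URI remains in the tag.
print(first.tag)
print(first.tag == second.tag)
```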
I'd like to insert a couple of rows in the middle of a table using python-docx. Is there any way to do it? I've tried an approach similar to the one for inserting pictures, but it didn't work.
If not, I'd appreciate any hint on which module is a better fit for this task. Thanks.
Here is my attempt to mimic the idea for inserting a picture. It's WRONG. 'Run' object has no attribute 'add_row'.
from docx import Document
doc = Document('your docx file')
tables = doc.tables
p = tables[1].rows[4].cells[0].add_paragraph()
r = p.add_run()
r.add_row()
doc.save('test.docx')
The short answer is No. There's no Table.insert_row() method in the API.
A possible approach is to write a so-called "workaround function" that manipulates the underlying XML directly. You can get to any given XML element (e.g. <w:tbl> in this case or perhaps <w:tr>) from its python-docx proxy object. For example:
tbl = table._tbl
Which gives you a starting point in the XML hierarchy. From there you can create a new element from scratch or by copying and use lxml._Element API calls to place it in the right position in the XML.
It's a little bit of an advanced approach, but probably the simplest option. There are no other Python packages out there that provide a more extensive API as far as I know. The alternative would be to do something in Windows with their COM API or whatever from VBA, possibly IronPython. That would only work at small scale (desktop, not server) running Windows OS.
A search on python-docx workaround function and python-pptx workaround function will find you some examples.
You can add a row at the end of the table and then move it to another position as follows:
from docx import Document
doc = Document('your docx file')
t = doc.tables[0]
row0 = t.rows[0] # for example
row1 = t.rows[-1]
row0._tr.addnext(row1._tr)
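The reason this moves the row rather than duplicating it is that an lxml element can have only one parent, so calling addnext() with an element that is already in the tree relocates it. A small sketch with plain lxml elements standing in for the table's row elements (the tbl/tr names are simplified placeholders, not real WordprocessingML):

```python
from lxml import etree

# Simplified stand-in for a table with four rows.
tbl = etree.XML("<tbl><tr>0</tr><tr>1</tr><tr>2</tr><tr>new</tr></tbl>")
rows = tbl.findall("tr")

# Move the last row so it sits directly after the first one.
rows[0].addnext(rows[-1])

order = [tr.text for tr in tbl]
print(order)
```

The row count stays at four; only the order changes.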
Although the python-docx documentation lists no directly usable API for this, there is a simple solution that does not require importing any other library such as lxml: use the underlying element classes provided by python-docx, such as CT_Tbl and CT_Row.
These classes have common methods like addnext and addprevious, which add an element as a sibling right after or before the current element.
So the problem can be solved as below (tested on python-docx v0.8.10):
from docx import Document
doc = Document('your docx file')
tables = doc.tables
row = tables[1].rows[4]
tr = row._tr # this is a CT_Row element
for new_tr in build_rows():  # build_rows should return a list/iterator of CT_Row instances
    tr.addnext(new_tr)
doc.save('test.docx')
This should solve the problem.
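One detail worth noting: calling addnext() repeatedly on the same anchor places each new element directly after the anchor, so the rows end up in reverse order. A sketch of a helper that keeps the original order by advancing the anchor, shown with plain lxml elements as stand-ins for CT_Row (the helper name insert_rows_after is hypothetical):

```python
from lxml import etree


def insert_rows_after(anchor, new_rows):
    """Insert new_rows after anchor, preserving their order."""
    for new_row in new_rows:
        anchor.addnext(new_row)
        anchor = new_row  # the next row goes after the one just inserted


# Simplified stand-in for a table; real code would pass row._tr objects.
table = etree.XML("<tbl><tr>a</tr><tr>b</tr></tbl>")
new_rows = []
for label in ("new1", "new2"):
    row = etree.Element("tr")
    row.text = label
    new_rows.append(row)

insert_rows_after(table[0], new_rows)
texts = [tr.text for tr in table]
print(texts)
```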
You can add a row in the last position this way:
from win32com import client
word = client.Dispatch('Word.Application')  # start Word through COM
doc = word.Documents.Open(r'yourFile.docx')
doc = word.ActiveDocument
table = doc.Tables(1)  # number of the table you want to manipulate
table.Rows.Add()
addnext() in lxml.etree seems to be the better option and works fine. The only problem is that I cannot set the height of the row, so please share an answer if you know how.
import copy
from docx.enum.table import WD_ROW_HEIGHT_RULE

# table and row_index are assumed to be defined earlier
current_row = table.rows[row_index]
table.rows[row_index].height_rule = WD_ROW_HEIGHT_RULE.AUTO
tbl = table._tbl
border_copied = copy.deepcopy(current_row._tr)  # copy the row, including its borders
tr = border_copied
current_row._tr.addnext(tr)
I created a video here to demonstrate how to do this because it threw me for a loop the first time.
https://www.youtube.com/watch?v=nhReq_0qqVM
from docx import Document

document = Document("MyDocument.docx")
Table = document.tables[0]
Table.add_row()  # the new row is appended at the bottom
for cells in Table.rows[-1].cells:
    cells.text = "test text"
insertion_row = Table.rows[4]._tr
insertion_row.addnext(Table.rows[-1]._tr)  # move the new row below row 4
document.save("MyDocument.docx")
The python-docx module doesn't have a method for this, so the best workaround I've found is to create a new row at the bottom of the table and then use methods from the XML elements to move it to the position where it is supposed to be.
This creates a new row in which every cell has the value "test text", and then places that row directly underneath insertion_row.
How to check if two XML files are equivalent?
For example, the two XML files below are the same even though the ordering is different. I need to check whether the two XML files contain the same textual info, disregarding the order.
<a>
<b>hello</b>
<c><d>world</d></c>
</a>
<a>
<c><d>world</d></c>
<b>hello</b>
</a>
Are there tools for this out there?
It all depends on your definition of "equivalent".
Assuming you really only care about the text nodes (for example: the d tags in your example do not even matter, you only care about the content word), you can just make a set of the text nodes of each document, and compare the sets. Using lxml, this could look like:
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
print(set(tree1.getroot().itertext()) == set(tree2.getroot().itertext()))
You might even want to ignore whitespace nodes, doing something like:
set(i for i in tree.getroot().itertext() if i.strip())
Note that using sets means you will NOT take into account how many times certain pieces of text occur in the document (this might be what you want, it might not). If the order is not important, but the number of times something occurs is, you could use a dictionary instead of a set and keep track of the number of occurrences (e.g. with collections.defaultdict() or collections.Counter in Python 2.7).
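A minimal sketch of the counting variant, using collections.Counter and the standard-library ElementTree. The helper name text_counts is made up, and stripping surrounding whitespace from each text node is an assumption about what should count as "the same text".

```python
from collections import Counter
import xml.etree.ElementTree as ET


def text_counts(xml_text):
    """Count non-blank text nodes, ignoring order and surrounding whitespace."""
    root = ET.fromstring(xml_text)
    return Counter(text.strip() for text in root.itertext() if text.strip())


doc_a = "<a>\n  <b>hello</b>\n  <c><d>world</d></c>\n</a>"
doc_b = "<a>\n  <c><d>world</d></c>\n  <b>hello</b>\n</a>"
doc_c = "<a><b>hello</b><b>hello</b><c><d>world</d></c></a>"

print(text_counts(doc_a) == text_counts(doc_b))
print(text_counts(doc_a) == text_counts(doc_c))
```

Unlike the set version, the repeated "hello" in doc_c makes it compare as different from doc_a.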
But if it is only the order of the direct child elements of the root element (in your case, children of the a element) that may be ignored, and everything inside those elements really counts, you would need another approach. You could for example do xml canonicalization on each child element to get a normalized version of each child (again, I don't know if this is normalized enough for your needs).
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())
print(set1 == set2)
Note: to keep the example simpler, I've used the development version of lxml. In older versions there is no method='c14n' for etree.tostring(), only a c14n() method on the ElementTree that writes to a file-like object. So to get it working there, you'd have to copy each element to a tree of its own and use a StringIO() object as a dummy file.
Also, this way of doing it is probably not recommended with very large files.
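The same idea can be sketched with the standard library alone, assuming Python 3.8 or later, where xml.etree.ElementTree.canonicalize() is available. This swaps lxml's c14n serialization for the standard-library C14N 2.0 writer, and it compares sorted lists instead of sets so that duplicate children still count; the helper name canonical_children is made up.

```python
import xml.etree.ElementTree as ET


def canonical_children(xml_text):
    """Return the canonical form of each direct child of the root, sorted."""
    root = ET.fromstring(xml_text)
    return sorted(
        ET.canonicalize(
            ET.tostring(child, encoding="unicode").strip(),  # drop tail whitespace
            strip_text=True,
        )
        for child in root
    )


doc_a = "<a>\n  <b>hello</b>\n  <c><d>world</d></c>\n</a>"
doc_b = "<a>\n  <c><d>world</d></c>\n  <b>hello</b>\n</a>"
doc_c = "<a>\n  <c><d>other</d></c>\n  <b>hello</b>\n</a>"

print(canonical_children(doc_a) == canonical_children(doc_b))
print(canonical_children(doc_a) == canonical_children(doc_c))
```

Only the order of the root's direct children is ignored here; the order inside each child still matters.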
But again: a BIG WARNING: you really have to know what you need as "equivalent", and create your own solution based on that knowledge!
Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).
My solution is below. It compares the tags, text, tails and attributes of all elements recursively.
Some code is adapted from: Testing Equivalence of xml.etree.ElementTree
import xml.etree.ElementTree as ET
def elements_equal(e1, e2):
    if e1.tag != e2.tag:
        return False
    # Text and tail only count as different when both sides are set
    if e1.text != e2.text:
        if e1.text is not None and e2.text is not None:
            return False
    if e1.tail != e2.tail:
        if e1.tail is not None and e2.tail is not None:
            return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    # Children are compared pairwise, in document order
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

def is_two_xml_equal(f1, f2):
    tree1 = ET.parse(f1)
    root1 = tree1.getroot()
    tree2 = ET.parse(f2)
    root2 = tree2.getroot()
    return elements_equal(root1, root2)

f1 = '2.xml'
f2 = '1.xml'
print(is_two_xml_equal(f1, f2))