I wish to use ElementMaker in lxml to build an xml representation of an excel spreadsheet with the corresponding nesting. I would like
<excelbook>
workbook info
<excelsheet>
<sheetname>Sheet1</sheetname>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
</excelsheet>
<excelsheet>
<sheetname>Sheet2</sheetname>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
</excelsheet>
</excelbook>
My python code looks like the following
for excelSheet in excelBook.excelSheets:
for excelTable in excelSheet.excelTables:
exceltable = E.exceltable(
E.num_rows(str(excelTable.num_rows)),
E.num_cols(str(excelTable.num_cols)),
)
excelsheet = E.excelsheet(
exceltable,
E.sheetname(excelSheet.sheetName),
)
excelbook = E.excelbook(
excelsheet,
E.bookname(fullpathname),
E.numSheets(str(excelBook.numSheets)))
root = E.XML(excelbook)
The problem is that I can only nest one sheet inside each book and one table inside each sheet. How do I change the code to allow multiple sheets in each book and multiple tables inside each sheet.
You can't next a tag with the same tag name twice under the same root and expect them to be appended one after another rather than the second one overriding the first one, because this is not very coherent.
For example if you have:
<root>
<elem1>some text 1</elem1>
</root>
If you try to append another "elem1" tag to root, it will override the existing one.
I'd suggest you use indexing ("elem_0", "elem_1", "elem_2".. etc).
Related
I have been using python docx library and oxml to automate some changes to my tables in my word document. Unfortunately, no matter what I do, I cannot wrap the text in the table cells.
I managed to successfully manipulate 'autofit' and 'fit-text' properties of my table, but non of them contribute to the wrapping of the text in the cells. I can see that there is a "w:noWrap" in the xml version of my word document and no matter what I do I cannot manipulate and remove it. I believe it is responsible for the word wrapping in my table.
for example in this case I am adding a table. I can fit text in cell and set autofit to 'true' but cannot for life of me wrap the text:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
table = doc.add_table(5,5)
table.autofit = True # Does Autofit but not wrapping
tc = table.cell(0,0)._tc # As a test, fit text to cell 0,0
tcPr = tc.get_or_add_tcPr()
tcFitText = OxmlElement('w:tcFitText')
tcFitText.set(qn('w:val'),"true")
tcPr.append(tcFitText) #Does fitting but no wrapping
doc.save('demo.docx')
I would appreciate any help or hints.
The <w:noWrap> element appears to be a child of <w:tcPr>, the element that controls table cell properties.
You should be able to access it from the table cell element using XPath:
tc = table.cell(0, 0)._tc
noWraps = tc.xpath(".//w:noWrap")
The noWraps variable here will then be a list containing zero or more <w:noWrap> elements, in your case probably one.
Deleting it is probably the simplest approach, which you can accomplish like this:
if noWraps: # ---skip following code if list is empty---
noWrap = noWraps[0]
noWrap.getparent().remove(noWrap)
You can also take the approach of setting the value of the w:val attribute of the w:noWrap element, but then you have to get into specifying the Clark name of the attribute namespace, which adds some extra fuss and doesn't really produce a different outcome unless for some reason you want to keep that element around.
This question already has an answer here:
XML change tag name using Python
(1 answer)
Closed 4 years ago.
Very new to XML and Python. I want to change the tag names of certain elements in an XML document. Here's how the document looks now:
<Company>
<Employee>
<SSN>xxxxx1234</SSN>
<Dependent>
<SSN>xxxxx4321</SSN>
I want to change the <SSN> tag under the Employee to <EE SSN> and leave the tag under the Dependent the same. The document includes hundreds of companies and thousands of employees both with tens to hundreds of sub elements, so a find and replace option is what I believe I need.
I want to use ElementTree module, but open to other suggestions. There are other modifications I want to make (copying and pasting elements) and will be posting another question, so I would like to maintain one module for all. The only code I have that is working is the importing the data and writing it to a new file. Thanks for all your help!
ElementTree is your way to go.
First you get the root, and after that it's just a matter of accessing the nodes:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml) # assuming you can store your data in a string (xml)
root[i] # access the ith node of the root
root[i][j] # access the jth node of the ith node
To get data from the nodes, you can use:
root[i][j].text
...where you can change the data as needed.
Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
doc = Document('test.docx')
for paragraph in doc.paragraphs:
if '<DATE>' in paragraph.text:
print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.
A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
doc = Document('test.docx')
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if '<DATE>' in paragraph.text:
print('found date: ', paragraph.text)
Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc._body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html
If I have two XML files with the same tags but different attributes and texts values. I have tried xmldiff but I did not know how to use it in my project.
I need to extract the differences in the attributes and the values to new XML structure.
In my code I tried to compare every node with all the other nodes but if the other file have sub elements with the same name multiple times this code does not work.
My code:
walkAll = tree1.getiterator()
for elt in walkAll:
el=root2.find(elt.tag)
for ee in root2.iter(elt.tag):
p1=ee.getparent()
p2=el.getparent()
if p1==p2:
if ee.text !=elt.text:
print(ee.text+"****"+elt.text )
How can I get the direct first order parent to an xml node?
I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way using elementtree to be able to do a search for <Id> and grab what ever info is in that tag. I was trying things like
for Reel in root.findall('Reel'):
... id = Reel.findtext('Id')
... print id
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy just to match <what I want> in any XML file and get the contents of that tag, or do i need to know the structure of the XML well enough to know its relation to Root/child etc?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
#Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file.I couldn't post the whole thing unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of all text nodes of all Id tags which are children of Reel.
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The xpath subset ElementTree supports is rather limited. If you want a more complete support, you should use lxml instead, it's xpath support is way more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.getiterator():
... if 'KeyId' in node.tag:
... mylist = node.tag
... print(mylist)
...