lxml parsing with RAM performance - python

I am trying to parse this kind of structure (really big xml file) and would like to keep RAM efficiency, so I think lxml would be nice for it (iterparser maybe) but I run into some problems...
There is a sample of xml structure:
<root>
<item1>
<NAME>X</NAME>
<someothertag>...</someothertag>
<someothertag2>
<NAME>Y</NAME>
</sometohertag2>
<OPTION>Z<OPTION>
</item1>
<item2>
<NAME>A</NAME>
<someothertag>...</someothertag>
<someothertag2>
<NAME>B</NAME>
</sometohertag2>
<Options>
<01>
<OPTION>C</OPTION>
</01>
</Options>
</item2>
</root>
You can see texts: X, Y, A, B, C, what I am trying to do is to get only X, Z, A and CI tried this so far but no idea if there is better way to do so:
for event, element in etree.iterparse('file.xml', tag='NAME'):
print(element.text)
element.clear()
My issue here is I want for tag NAME to be only from top level (X, A) and for tag OPTION to be from top level or specific tag(OPTIONS) and from its children.I could use parse and look for it by specifiing loops etc. but it is so ram heavy.Any idea is welcomed.Thanks in advice!

Related

can we search multiple pattern using etree findall() in xml?

For my case, I have to find few elements in the XML file and update their values using the text attribute. For that, I have to search xml element A, B and C. My project is using xml.etree and python language. Currently I am using:
self.get_root.findall(H/A/T)
self.get_root.findall(H/B/T)
self.get_root.findall(H/C/T)
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through the similar question: XPath to select multiple tags but it didn't work for my case
Update: maybe regular expression inside the findall()?
The html in your question is malformed; assuming it's properly formatted (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>

how to modify attribute values from xml tag with python

I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup like:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml, but I don't understand how to achieve the same result. I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?
When you say that you have many files "starting with" those 4 lines, if you really mean they're exactly like that, the fastest way is probably to entirely ignore that fact that it's XML, and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file out. Repeat for each file.
On the other hand, if you have namespace attributes anywhere else in the file this method wouldn't catch them, and you should probably do a real XML-based solution. With a regular SAX parser, you get a callback for each element start, element end, text node, etc. as it comes along. So you'd just copy them out until you hit the one(s) you want (in this case, a graphml element), then instead of copying out that start-tag, write out the new one you want. Then back to copying. XSLT is also a fine way to do this, which would let you write a tiny generic copier, plus one rule to handle the graphml element.

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way using elementtree to be able to do a search for <Id> and grab what ever info is in that tag. I was trying things like
for Reel in root.findall('Reel'):
... id = Reel.findtext('Id')
... print id
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy just to match <what I want> in any XML file and get the contents of that tag, or do i need to know the structure of the XML well enough to know its relation to Root/child etc?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
#Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file.I couldn't post the whole thing unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of all text nodes of all Id tags which are children of Reel.
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The xpath subset ElementTree supports is rather limited. If you want a more complete support, you should use lxml instead, it's xpath support is way more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.getiterator():
... if 'KeyId' in node.tag:
... mylist = node.tag
... print(mylist)
...

Reading text from XML nodes using Python's libxml2

I am a first time XPath user and need to be able to get the text values of these different elements.. for instance time, title, etc.. I am using the libxml2 module in Python and so far have not had much luck getting just the values of the text I need. The code below here only returns the element tags.. i need the values.. any help would be GREATLY appreciated!
I'm using this code:
doc = libxml2.parseDoc(xmlOutput)
result = doc.xpathEval('//*')
With the following document:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SCAN_LIST_OUTPUT SYSTEM "https://qualysapi.qualys.com/api/2.0/fo/sca/scan_list_output.dtd">
<SCAN_LIST_OUTPUT>
<RESPONSE>
<DATETIME>2012-01-22T01:21:53Z</DATETIME>
<SCAN_LIST>
<SCAN>
<REF>scan/2343423</REF>
<TYPE>Scheduled</TYPE>
<TITLE><![CDATA[customer 1 5/20/2012]]></TITLE>
<USER_LOGIN>user1</USER_LOGIN>
<LAUNCH_DATETIME>2012-02-21T04:11:05Z</LAUNCH_DATETIME>
<STATUS>
<STATE>Finished</STATE>
</STATUS>
<TARGET><![CDATA[13.3.3.2, 13.8.8.10, 13.10.12.60, 13.10.12.11...]]></TARGET>
</SCAN>
</SCAN_LIST>
</RESPONSE>
</SCAN_LIST_OUTPUT>
You can call getContent() on each returned xmlNode object to retrieve the associated text. Note that this is recursive -- to non-recursively access text content in libxml2, you'll want to retrieve the associated text node under the element, and call .getContent() on that.
That said, this would be easier if you used lxml.etree (a higher-level Python API, still backing into the C libxml2 library) instead of the Python libxml2; in that case, it's simply element.text to access the associated content as a string.
Have a look at Mark Pilgrim's Dive Into Python 3, Chapter 12. XML
The chapter starts with short course to XML (general talk but with the Atom Syndication Feed example), then it continues with the standard xml.etree.ElementTree and continues with third party lxml that implements more with the same interface (full XPATH 1.0, based on libxml2).

Quickest/Best way to traverse XML with lxml in Python

I have an XML file that looks like this:
xml = '''<?xml version="1.0"?>
<root>
<item>text</item>
<item2>more text</item2>
<targetroot>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
</targetroot>
...more items
</root>
'''
With lxml I'm trying to acces the text in the element < target >. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:
target = etree.XML(xml)
for x in target.getiterator('root'):
item1 = x.findtext('item')
for target in x.iterchildren('targetroot'):
for t in target.iterchildren('targetcontainer'):
targetText = t.findtext('target')
Although this works, as it gives me acces to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.
So my question is this: is there a more efficient way to access the < target >'s texts while staying in the loop of root, because I also need access to the other elements.
You can use XPath:
for x in target.xpath('/root/targetroot/targetcontainer/target'):
print x.text
We ask all elements that match a path. In this case, the path is /root/targetroot/targetcontainer/target, which means
all the <target> elements that are inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element. Also, the <root> element should be the document root because it is preceded by /, which means the beginning of the document.
Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document - and in this example it is preceded by a newline and some space. Also, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question anyway.
EDIT: this solution can be improved yet. You do not need to pass all the path - you can just ask to all elements <target> inside the document. This is done by preceding the tag name by two slashes. Since you want all the <target> texts, independent of where they are, this can be a better solution. So, the loop above can be written just as:
for x in target.xpath('//target'):
print x.text
I tried it at first but it did not worked. The problem, however, was the syntax problems in the XML, not the XPath, but I tried the other, longer path and forgot to retry this one. Sorry! Anyway, I hope I put some light about XPath nonetheless :)

Categories