XML attributes get sorted - python

When I create a document using the minidom, attributes get sorted alphabetically in the element. Take this example from here:
from xml.dom import minidom
# New document
xml = minidom.Document()
# Creates user element
userElem = xml.createElement("user")
# Set attributes to user element
userElem.setAttribute("name", "Sergio Oliveira")
userElem.setAttribute("nickname", "seocam")
userElem.setAttribute("email", "seocam#taboca.com")
userElem.setAttribute("photo","seocam.png")
# Append user element in xml document
xml.appendChild(userElem)
# Print the xml code
print xml.toprettyxml()
The result is this:
<?xml version="1.0" ?>
<user email="seocam#taboca.com" name="Sergio Oliveira" nickname="seocam" photo="seocam.png"/>
Which is all very well if you wanted the attributes in email/name/nickname/photo order instead of name/nickname/email/photo order as they were created.
How do you get the attributes to show up in the order you created them? Or, how do you control the order at all?

According to the documentation, the order of attributes is arbitrary but consistent for the life of the DOM. This is common across DOM implementations. Sorry.

Related

How to get values from this XML?

I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I use standard library - xml but I can't get values from inner tags. That's my exemplary code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
print(elem.attrib)
It should work but it's not. I was trying to find issue in xml structure but I coludn't find it. I always got empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but there are no attributes for those elements. If you want to print the inner text of the element, use .text instead of .attrib
for elem in root.iter('home_goals_time'):
print(elem.text)
The reason you're having issues is that you need to parse through the xml level by level. Using findall, I was able to get the value inside <home_goals_time>.
for i in root.findall('.//home_goals_time'):
print (i.text)
None

Sibling nodes in ElementTree in Python

I am looking at a piece of XML that I want to add a node in.
<profile>
<dog>1</dog>
<halfdog>0</halfdog>
<cat>545</cat>
<lions>0</lions>
<bird>23</bird>
<dino>0</dino>
<pineapples>2</pineapples>
<people>0</people>
</profile>
With the above XML, I'm able to insert XML nodes into it. However, I'm not able to insert it at exact locations.
Is there a way to find if I am next to a certain node, whether it be before or after. Say if I wanted to add <snail>2</snail> between the <dino>0</dino> and <pineapples>2</pineapples> nodes.
Using ElementTree how can I find what node is next to me? I'm asking about ElementTree or any standard Python library. Unfortunately, lxml is out of the question for me.
I believe its not doable using ElementTree, but you can do it using the standard python minidom:
# create snail element
snail = dom.createElement('snail')
snail_text = dom.createTextNode('2')
snail.appendChild(snail_text)
# add it in the right place
profile = dom.getElementsByTagName('profile')[0]
pineapples = dom.getElementsByTagName('pineapples')[0]
profile.insertBefore(snail, pineapples)
output:
<?xml version="1.0" ?><profile>
<dog>1</dog>
<halfdog>0</halfdog>
<cat>545</cat>
<lions>0</lions>
<bird>23</bird>
<dino>0</dino>
<snail>2</snail><pineapples>2</pineapples>
<people>0</people>
</profile>
If you know the parent element and the element to insert before, you can use the following method with ElementTree:
index = parentElem.getchildren().index(elemToInsertBefore)
parent.insert(index, newElement)

Why does this xpath expression return an empty list?

I'm trying to parse this XML. It's a YouTube feed. I'm working based on code in the tutorial. I want to get all the entry nodes that are nested under the feed.
from lxml import etree
root = etree.fromstring(text)
entries = root.xpath("/feed/entry")
print entries
For some reason entries is an empty list. Why?
feed and all its children are actually in the http://www.w3.org/2005/Atom namespace. You need to tell your xpath that:
entries = root.xpath("/atom:feed/atom:entry",
namespaces={'atom': 'http://www.w3.org/2005/Atom'})
or, if you want to change the default empty namespace:
entries = root.xpath("/feed/entry",
namespaces={None: 'http://www.w3.org/2005/Atom'})
or, if you don't want to use shorthandles at all:
entries = root.xpath("/{http://www.w3.org/2005/Atom}feed/{http://www.w3.org/2005/Atom}entry")
To my knowledge the "local namespace" is implicitly assumed for the node you're working with so that operations on children in the same namespace do not require you to set it again. So you should be able to do something along the lines of:
feed = root.find("/atom:feed",
namespaces={'atom': 'http://www.w3.org/2005/Atom'})
title = feed.xpath("title")
entries = feed.xpath("entries")
# etc...
It's because of the namespace in the XML. Here is an explanation: http://www.edankert.com/defaultnamespaces.html#Conclusion.

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way using elementtree to be able to do a search for <Id> and grab what ever info is in that tag. I was trying things like
for Reel in root.findall('Reel'):
... id = Reel.findtext('Id')
... print id
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy just to match <what I want> in any XML file and get the contents of that tag, or do i need to know the structure of the XML well enough to know its relation to Root/child etc?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
#Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file.I couldn't post the whole thing unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of all text nodes of all Id tags which are children of Reel.
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The xpath subset ElementTree supports is rather limited. If you want a more complete support, you should use lxml instead, it's xpath support is way more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.getiterator():
... if 'KeyId' in node.tag:
... mylist = node.tag
... print(mylist)
...

How to grab the attribute value using beautifulSoup?

Code:
soup=BeautifulSoup(f.read())
data=soup.findAll('node',{'id':'memory'})
print data
Output
[<node id="memory" claimed="true" class="memory" handle="DMI:000E">
<description>
System Memory
</description>
<physid>
e
</physid>
<slot>
System board or motherboard
</slot>
<size units="bytes">
3221225472
</size>
<capacity units="bytes">
3221225472
</capacity>
</node>]
Now how will I grab the attributes value like the data between tag that is System Memory and so on. Any help is appreciated.
To get <...>this</...> you should use contents field, so in your case it would be:
print data.description.contents
To get attributes access them as they were a dictionary
print data.size['units']
And to iterate all the tags, use findAll that you already know:
for node in data.findAll(True):
# do stuff on node
beautifulsoup can create a tree. you can then iterate over that tree and get the attributes
check out the following link
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#TheattributesofTags

Categories