replicate namespace and root element attributes with ElementTree - python

Trying to replicate the following root element including namespace:
<ns0:StdFX1.3 xmlns:ns0="http://website.com/schemas/StdFX1.3.In"
CutOff="2200LON" DataSource="" SpotDataSource="">
</ns0:StdFX1.3>
here is my code so far:
import xml.etree.ElementTree as ET
ET.register_namespace("", "http://website.com/schemas/StdFX1.3.In")
top = ET.Element('{http://website.com/schemas/StdFX1.3.In}Stuff')
it only gets me the following though:
<?xml version='1.0' encoding='UTF-8'?>
< xmlns="http://website.com/schemas/StdFX1.3.In">

I gave up and used string substitution on the serialized output:
xml_text = ET.tostring(root, encoding="unicode").replace("mangled toplevel namespace", '<ns0:StdFX1.3 xmlns:ns0="http://website.com/schemas/StdFX1.3.In" CutOff="2200LON" DataSource="" SpotDataSource="">')
Likewise for the closing tag. Nothing else I tried would keep the changes I specified.
You can use the fromstring method to get back to an element tree. I was only submitting the XML, so I didn't need to, and I can't remember whether the changes survive that round trip, but you do get the desired XML.
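For comparison, here is a minimal sketch that builds the root element directly instead of patching the string. It assumes ElementTree's automatically generated ns0 prefix is acceptable (register_namespace rejects prefixes of the form ns followed by digits, so ns0 cannot be registered explicitly). The attribute values come from the question; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://website.com/schemas/StdFX1.3.In"

# Clark notation: "{namespace-uri}local-name".
root = ET.Element(
    f"{{{NS_URI}}}StdFX1.3",
    {"CutOff": "2200LON", "DataSource": "", "SpotDataSource": ""},
)

# No prefix is registered for NS_URI here, so the serializer generates "ns0".
# short_empty_elements=False keeps the explicit closing tag.
xml_text = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(xml_text)
```

This only yields the ns0 prefix if no other prefix (including the empty default prefix) has been registered for the same URI earlier in the process.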

Related

Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of the feeds I use as input, but occasionally I find a feed with additional namespaces, such as the one below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read I would need to add the additional namespace here (xmlns:xsi I guess) to the namespace dictionary to get my xpath to work with multiple namespaces.
However, this is not a long-term solution for me, as I might come across other namespaces in the future. Is there a way for me to search for, detect, or even delete the namespace? The element structure will always be the same, so my XPath wouldn't change.
Thanks
You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
One way to handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI of the root element and then use that in your dict mapping...
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
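If switching parsers is an option, a namespace wildcard avoids the mapping entirely. This is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than lxml, and it assumes Python 3.8 or later, where the {*} wildcard is supported in findall paths. The inline SITEMAP string is a stand-in for a real feed.

```python
import xml.etree.ElementTree as ET

# Stand-in for a downloaded feed (an assumption for illustration).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <priority>1.00</priority>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP.encode("utf-8"))

# "{*}name" matches the element "name" in any namespace (or in none),
# so the path works for both the http and https variants of the URI.
locs = [loc.text for loc in root.findall("{*}url/{*}loc")]
print(locs)
```

The path is relative to the root element, so urlset itself does not appear in it.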

Better way to find the deepest element tag in XML iteratively?

I want to find the deepest (last) XML tag iteratively. I found some other questions, but they all rely on a fixed path to find it. I always want to add each new element to the last, deepest tag, one iteration at a time.
from xml.etree.ElementTree import Element

root = Element('soap:Envelope', {"xmlns:soap":"http://www.w3.org/2003/05/soap-envelope", "xmlns:aut":"Automidia"})
sub_elementos = [Element("soap:Body"),
                 Element("information", {"token":"ABC"}),
                 Element("data"),
                 Element("value")]
for elemento in sub_elementos:
    list(root.iter())[-1].append(elemento) # This is the way I've found
I saw in the xml ElementTree documentation that there is a findall() method that supports XPath to navigate through XML easily. I want to know how I can use it to find the last element with the last() function, instead of list(root.iter())[-1] as in my code above, which hurts readability in my opinion. Any ideas on how I could achieve this?
This is my final output:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:aut="Automidia">
  <soap:Body>
    <information token="ABC">
      <data>
        <value/>
      </data>
    </information>
  </soap:Body>
</soap:Envelope>
Something like this:
import xml.etree.ElementTree as ET
tree_elements = {'body':{}, 'info':{'token':'ABC'}, 'data':{}, 'value':{}}
tree = ET.Element('root')
root = tree
for ele,ele_attrs in tree_elements.items():
    root = ET.SubElement(root, ele)
    root.attrib = ele_attrs
ET.dump(tree)
output
<root><body><info token="ABC"><data><value /></data></info></body></root>
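To get the prefixed soap: tags from the question with the same chaining approach, one option is to register the prefix and write the tags in Clark notation. This is a minimal sketch: the list of (tag, attributes) pairs is a layout chosen for illustration, and the unused aut namespace is left out because ElementTree only declares namespaces that are actually used by a tag or attribute.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
ET.register_namespace("soap", SOAP_NS)

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")

# Each entry becomes a child of the previous one.
chain = [
    (f"{{{SOAP_NS}}}Body", {}),
    ("information", {"token": "ABC"}),
    ("data", {}),
    ("value", {}),
]

deepest = envelope
for tag, attrs in chain:
    # SubElement returns the new child, which becomes the next parent.
    deepest = ET.SubElement(deepest, tag, attrs)

xml_text = ET.tostring(envelope, encoding="unicode")
print(xml_text)
```

Keeping a reference to the most recently created element avoids walking the whole tree with list(root.iter())[-1] on every iteration.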

Python add new element by xml ElementTree

XML file
<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
<RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>
I want to add a new element inside the Info tag.
Here is what I did.
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
ele = ET.Element('element1')
ele.text = 'ele1'
root.append(ele)
tree.write("output.xhtml")
Output
<ns0:Info xmlns:ns0="BuildTest">
<ns0:RequestDate>5/4/2020 12:27:46 AM</ns0:RequestDate>
<element1>ele1</element1></ns0:Info>
Three questions:
The <?xml version="1.0" encoding="utf-8"?> is missing.
The namespace is wrong.
The whitespace of the new element is gone.
I saw many questions related to this topic, most of them are suggesting other packages.
Is there any way it can handle properly?
The <?xml ...?> line is the XML declaration. It looks like a processing instruction, and like one it is not an XML element. If you search for "are processing instructions part of an XML", the first result states:
Processing instructions are markup, but they're not elements.
Since the package you are using is literally called ElementTree, you can reasonably expect its objects to be trees of elements, so the declaration is not stored in the tree. You can have it written on output by passing xml_declaration=True (together with an encoding) to tree.write. If I remember correctly, DOM-compliant XML packages can support non-element markup in XML.
For the namespace issue, the answer is on Stack Overflow, at "Remove ns0 from XML": you just have to register the namespace you specified in the top element of your document, with exactly the same spelling and case. The following worked for me:
ET.register_namespace("", "BuildTest")
As for the whitespace: the new element does not have any whitespace. You can assign to its tail attribute to add a linefeed after the element.
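The three points can be combined in one place. This is a minimal sketch, assuming Python 3.8 or later (for the xml_declaration argument of tostring) and an in-memory copy of the file from the question. Putting element1 into the BuildTest namespace is an assumption about what the asker wants.

```python
import xml.etree.ElementTree as ET

NS = "BuildTest"
# Serialize this namespace as the default one instead of ns0.
ET.register_namespace("", NS)

SOURCE = """<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
  <RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>"""

root = ET.fromstring(SOURCE.encode("utf-8"))
previous_last = root[-1]

# Create the new element in the same namespace as its parent.
ele = ET.SubElement(root, f"{{{NS}}}element1")
ele.text = "ele1"

# Whitespace after an element lives in its tail: hand the old closing
# newline to the new element and indent the new element like its sibling.
ele.tail = previous_last.tail
previous_last.tail = "\n  "

result = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(result)
```

With a file on disk, the equivalent call would be tree.write(path, encoding="utf-8", xml_declaration=True).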

How to get values from this XML?

I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
  <round_1>
    <match_1>
      <home_team>team_5</home_team>
      <away_team>team_13</away_team>
      <home_goals_time>None</home_goals_time>
      <away_goals_time>24;37</away_goals_time>
      <home_age_average>27.4</home_age_average>
      <away_age_average>28.3</away_age_average>
      <score>0:2</score>
      <ball_possession>46:54</ball_possession>
      <shots>8:19</shots>
      <shots_on_target>2:6</shots_on_target>
      <shots_off_target>5:10</shots_off_target>
      <blocked_shots>1:3</blocked_shots>
      <corner_kicks>3:4</corner_kicks>
      <fouls>10:12</fouls>
      <offsides>0:0</offsides>
    </match_1>
  </round_1>
</matches>
I use the standard library's xml package, but I can't get values from the inner tags. Here is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work, but it doesn't. I tried to find a problem in the XML structure but couldn't find one. I always get an empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but those elements have no attributes. If you want to print the inner text of the element, use .text instead of .attrib:
for elem in root.iter('home_goals_time'):
    print(elem.text)
Another option is findall with a path that searches every level of the tree. Using it, I was able to get the value inside <home_goals_time>:
for i in root.findall('.//home_goals_time'):
    print(i.text)
None
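Building on the .text approach, a whole match can be collected into a dict in one pass. This is a minimal sketch over a shortened copy of the XML from the question. The variable names and the idea of splitting "0:2" and "24;37" into integers are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
SAMPLE = """<matches>
  <round_1>
    <match_1>
      <home_team>team_5</home_team>
      <away_team>team_13</away_team>
      <away_goals_time>24;37</away_goals_time>
      <score>0:2</score>
    </match_1>
  </round_1>
</matches>"""

root = ET.fromstring(SAMPLE)
match = root.find("./round_1/match_1")

# Iterating over an element yields its direct children.
stats = {child.tag: child.text for child in match}

# The values are plain strings, so compound fields are split by hand.
home_goals, away_goals = (int(n) for n in stats["score"].split(":"))
goal_minutes = [int(m) for m in stats["away_goals_time"].split(";")]
print(stats["home_team"], home_goals, away_goals, goal_minutes)
```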

Retaining empty elements when parsing with ElementTree

Using Python 3.4 and ElementTree, I'm trying to add a sub-element to an xml file, keeping the xml file (written in UTF-16) otherwise exactly the same.
My code:
import xml.etree.ElementTree as ET

# num and fake_path are defined elsewhere in my script
new = 'new_XML_file.xml'
tree = ET.parse(new)
root = tree.getroot()
new_element = ET.SubElement(root, 'RENAMED_SOUND_FILE')
new_element.text = new.split('\\')[num][:-4] + '.wav'
tree.write(fake_path + new.split('\\')[num], encoding='utf-16', xml_declaration=True)
The problem I'm having is that empty elements are being changed in this process. For example:
<EMPTY_ELEMENT></EMPTY_ELEMENT>
becomes:
<EMPTY_ELEMENT />
I know that to a machine, this is basically the same thing, but I'd like to retain the earlier formatting for testing purposes.
Any ideas on how I can retain the full empty elements?
Per the documentation, output methods (whether you're using tostring methods or write) have a "short_empty_elements" keyword that defaults to True. Making this False should give you your desired output:
import xml.etree.ElementTree as ET
root=ET.Element("root")
print(ET.tostring(root,short_empty_elements=False))
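The same keyword applies to tree.write, which is what the question uses. This is a minimal sketch that writes to an in-memory buffer instead of a file. The ROOT wrapper element and the example.wav value are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring("<ROOT><EMPTY_ELEMENT></EMPTY_ELEMENT></ROOT>")
ET.SubElement(root, "RENAMED_SOUND_FILE").text = "example.wav"

# BytesIO stands in for the output file from the question.
buffer = io.BytesIO()
ET.ElementTree(root).write(
    buffer,
    encoding="utf-16",
    xml_declaration=True,
    short_empty_elements=False,  # keep <EMPTY_ELEMENT></EMPTY_ELEMENT>
)

text = buffer.getvalue().decode("utf-16")
print(text)
```

Other formatting details, such as the quoting style of the XML declaration, may still differ from the input file.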
