How can I include a link to XSLT file with Python ElementTree? - python

I'm trying hard to include my XSLT header in my XML with ElementTree, and I can't find any information on how to do it.
Here's my Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('myfile.xml')  # parse the XML document into an ElementTree
root = tree.getroot()  # get the root element
root[0][0].text = "ole"
root[0][1].text = "ole"
tree.write('test_file.xml', encoding='utf-8', method="xml")  # write the XML file
The only problem is to include this header:
<?xml-stylesheet type="text/xsl" href="myfile.xsl"?>

Unfortunately, xml.etree.ElementTree does not support XSLT (for instance, the documentation for write() says that method is either xml, text or html).
Luckily, you can easily do that if you rely on lxml, which adds support for XSLT.

I just found the answer!!!
You need to use lxml instead and this is the new code:
from lxml import etree as ET
parser = ET.XMLParser(strip_cdata=False)  # strip_cdata=False keeps CDATA sections from being stripped
tree = ET.parse('myfile.xml', parser)
root = tree.getroot()  # get the root element
tag1 = root.find('TAG1')
tag1.find('TAG2').text = 'text change here'
tree.write('test_file.xml', encoding='utf-8', method="xml")
Your XML template (myfile.xml) is like this:
<?xml-stylesheet type="text/xsl" href="your_file.xsl" ?>
<FirstTAG>
<TAG1>
<TAG2>your text</TAG2>
</TAG1>
</FirstTAG>
And the new one will be like this:
<?xml-stylesheet type="text/xsl" href="your_file.xsl" ?>
<FirstTAG>
<TAG1>
<TAG2>text change here</TAG2>
</TAG1>
</FirstTAG>
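If the template does not already contain the stylesheet line, one way to add it with only the standard library is to serialize a processing instruction yourself and place it before the root element. This is a minimal sketch, not the method from the answer above; the helper name serialize_with_stylesheet and the inline sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def serialize_with_stylesheet(root, href):
    """Return the document as a string with an xml-stylesheet PI before the root.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Build the processing instruction node and serialize it on its own.
    pi = ET.ProcessingInstruction("xml-stylesheet", f'type="text/xsl" href="{href}"')
    header = '<?xml version="1.0" encoding="utf-8"?>\n'
    return (
        header
        + ET.tostring(pi, encoding="unicode")
        + "\n"
        + ET.tostring(root, encoding="unicode")
    )


# Inline stand-in for myfile.xml (assumption: same structure as the template above).
root = ET.fromstring("<FirstTAG><TAG1><TAG2>your text</TAG2></TAG1></FirstTAG>")
root.find("TAG1/TAG2").text = "text change here"

document = serialize_with_stylesheet(root, "your_file.xsl")
print(document)
```

To write the result to disk, open the target file with encoding="utf-8" and write the returned string.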

Related

In Python, using beautiful soup- how do I get the text of a XML tag that contains a hyphen

I'm trying to get the text content of the tag 'event-id' in the XML, but the hyphenated name is not recognized as an element. I know the script otherwise works, because if I replace the hyphen with an underscore in the XML and run the script, it works. Does anybody know what the problem could be?
<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>
from bs4 import BeautifulSoup
dir_path = '20211006085201.xml'
file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')
events = soup.find_all('fullEventUpdate')
print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')
idi = soup.find_all('event-reference')
for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)
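The problem is that idi[x].event-id is not attribute access on a tag named event-id: Python parses the hyphen as subtraction (idi[x].event minus id). You are also dealing with namespaced XML, and for that type of document you can use CSS selectors instead:
events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)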
Output:
24425412
24342548
More generally, when dealing with XML documents, you are probably better off using something that supports XPath (like lxml or ElementTree).
For XML parsing, the idiomatic approach is to use XPath selectors.
In Python this can easily be achieved with the parsel package, which is similar to BeautifulSoup but built on top of lxml for full XPath support:
body = ...
from parsel import Selector
selector = Selector(body)
for event in selector.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())
results in:
24425412
24342548
Without any external library (just ElementTree):
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate> '''
root = ET.fromstring(xml)
ids = [e.text for e in root.findall('.//event-id')]
print(ids)
output
['24425412', '24342548']
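If you also need the matching event-update for each id, a small variation keeps the two values together by iterating over each event-reference. This is only a sketch along the same lines, using the sample document from the question; the variable name records is arbitrary.

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>"""

root = ET.fromstring(xml)

# The xmlns="" attributes reset the default namespace, so the inner tags
# can be matched by their plain names, hyphen included.
records = [
    (ref.findtext("event-id"), ref.findtext("event-update"))
    for ref in root.iter("event-reference")
]
print(records)
```

Because the tag names are passed as strings, the hyphen causes no trouble here.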

Parse xml in Python (One correct way to do so) using lxml

I am getting a response using the requests module in Python, and the response is in the form of XML. I want to parse it and get the details out of each 'dt' tag, but I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the XML document.
import requests
from lxml import etree
response = requests.get(url)  # url is the address of the XML resource
root = etree.fromstring(response.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This returns the direct text nodes of each 'dt' tag in the XML document; text inside nested tags such as 'sx' is not included.
from xml.dom import minidom
# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the list result
print(dt_elems)
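Both approaches above collect only the text that sits directly inside each 'dt' tag. If the text of nested tags such as 'sx' should be included as well, one option is itertext() from the standard library's ElementTree. The following is a sketch on a shortened copy of the response from the question, not a replacement for the answers above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question (assumption: same structure).
xml = """<entry_list version="1.0">
<entry id="harsh">
<def>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
</def>
</entry>
</entry_list>"""

root = ET.fromstring(xml)

# itertext() yields the element's own text plus the text of all descendants,
# in document order, so the content of <sx> is kept.
definitions = ["".join(dt.itertext()) for dt in root.iter("dt")]
for definition in definitions:
    print(definition)
```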

set text to SubElement which has more than one text field in XML using ElementTree Python specifically using xpath query if possible

<?xml version="1.0"?>
<doc>
<doc1>
<branch name="testing" hash="1cdf045c">
text,source
</branch>
<some name="release01" hash:hashing="f200013e">
<sub-branch name="subrelease01">
xml,sgml
</sub-branch>
</branch>
<bsome name="invalid">
</branch>
</doc1>
</doc>
I wrote something like this:
import xml.etree.ElementTree as ET
from lxml import etree
filename='/testxml'
tree = etree.parse(filename)
root = tree.getroot()
print(root)
primary = etree.Element("{http://schemas.dmtf.org/xxx/doc/1}doc")
secondary = etree.SubElement(primary, "{http://schemas.dmtf.org/xxx/doc/1}some name")
secondary.text = ????
print(etree.tostring(primary, pretty_print=True))
I want to change the value of hash:hashing.
If possible, I want to use an XPath query, for example /doc/doc1/some name/#hash:hashing, to do this task.
Please help. Thank you!
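One possible approach, sketched with the standard library's ElementTree: its limited XPath support cannot select an attribute directly, so you locate the element with a path and then call set() on it. The sample below is a well-formed variant of the document above, which is an assumption on my part: the closing tags are matched up and the hash prefix is bound to a made-up namespace URI (http://example.com/hash), since the original declares none. The new attribute value and text are placeholders.

```python
import xml.etree.ElementTree as ET

HASH_NS = "http://example.com/hash"  # hypothetical URI; the question declares none

# Well-formed variant of the question's document (assumed structure).
xml = f"""<doc xmlns:hash="{HASH_NS}">
  <doc1>
    <branch name="testing" hash="1cdf045c">text,source</branch>
    <some name="release01" hash:hashing="f200013e">
      <sub-branch name="subrelease01">xml,sgml</sub-branch>
    </some>
  </doc1>
</doc>"""

root = ET.fromstring(xml)

# Select the element by path and attribute predicate.
target = root.find("./doc1/some[@name='release01']")

# Namespaced attributes are keyed as "{uri}localname".
attr = f"{{{HASH_NS}}}hashing"
old_value = target.get(attr)
target.set(attr, "0badc0de")  # placeholder value

# Setting the text of a sub-element works the same way: find it, then assign.
target.find("sub-branch").text = "xml,sgml,html"  # placeholder text

# Keep the "hash" prefix in the output instead of an auto-generated one.
ET.register_namespace("hash", HASH_NS)
result = ET.tostring(root, encoding="unicode")
print(result)
```

With lxml, root.xpath() accepts full XPath 1.0 expressions, but the attribute is still changed through set() on the element.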

ElementTree find returns None?

I'm using ElementTree with Python to parse an XML file to find the contents of the tag contentType.
Here's the Python line:
extensionType = ET.parse("src/" + str(filename)).find('contentType')
And here's the XML:
<?xml version="1.0" encoding="UTF-8"?>
<StaticResource xmlns="http://soap.sforce.com/2006/04/metadata">
<cacheControl>Private</cacheControl>
<contentType>image/jpeg</contentType>
</StaticResource>
What am I doing wrong?
Thanks!
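So far you are just parsing the XML file. The element lives in the document's default namespace (the xmlns attribute), so find('contentType') without the namespace returns None. This is how you can get your element with an XPath-style path that includes the namespace:
import xml.etree.ElementTree as ET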
tree = ET.parse('test.xml')
root = tree.getroot()
xmlns = {'soap': '{http://soap.sforce.com/2006/04/metadata}'}
ct_element = root.find('.//{soap}contentType'.format(**xmlns))
print(ct_element.text)
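An equivalent way to write the lookup is to pass a prefix-to-URI mapping as the second argument of find(). This is a sketch of that variant using the XML from the question inline; the prefix sf is an arbitrary choice and does not have to appear in the document.

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0" encoding="UTF-8"?>
<StaticResource xmlns="http://soap.sforce.com/2006/04/metadata">
    <cacheControl>Private</cacheControl>
    <contentType>image/jpeg</contentType>
</StaticResource>"""

root = ET.fromstring(xml)

# Without the namespace the lookup finds nothing.
missing = root.find("contentType")
print(missing)

# Map an arbitrary prefix to the namespace URI and use it in the path.
ns = {"sf": "http://soap.sforce.com/2006/04/metadata"}
content_type = root.find("sf:contentType", ns)
print(content_type.text)
```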

How to get XML tag value in Python

I have some XML in a unicode-string variable in Python as follows:
<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value><text>6</text></value>
</field>
</result>
</results>
How do I extract the 6 in <value><text>6</text></value> using Python?
With lxml:
import lxml.etree
# xmlstr is your xml in a string
root = lxml.etree.fromstring(xmlstr)
textelem = root.find('result/field/value/text')
print(textelem.text)
Edit: But I imagine there could be more than one result...
import lxml.etree
# xmlstr is your xml in a string
root = lxml.etree.fromstring(xmlstr)
results = root.findall('result')
textnumbers = [r.find('field/value/text').text for r in results]
BeautifulSoup is the simplest way to parse XML as far as I know.
Assuming you have read its introduction, simply use:
from bs4 import BeautifulSoup
soup = BeautifulSoup(xmlstr, 'xml')  # xmlstr is your XML in a string
print(soup.find('text').string)
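The same value can also be read with the standard library alone. The sketch below uses ElementTree's findtext() with an attribute predicate so that only the field named count is matched; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xmlstr = """<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value><text>6</text></value>
</field>
</result>
</results>"""

root = ET.fromstring(xmlstr)

# findtext() returns the text of the first match, or None if nothing matches.
count_text = root.findtext("result/field[@k='count']/value/text")
count = int(count_text)
print(count)
```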
