how to parse the second xml tree in a file

Suppose I have an XML file like this:
<?xml version="1.0" encoding="utf-8"?>
<items>
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<price>1500</price>
<info> asfgfdff</info>
</item>
</items>
How do I parse it so that the parser picks only the most recently written XML tree, that is, the one that starts at the last <?xml declaration?

import re

with open('file', 'r') as f:
    newestXml = []
    for line in f:
        # A new XML declaration starts a new tree, so drop the lines collected so far.
        if re.search(r'^<\?xml', line):
            newestXml = [line]
        else:
            newestXml.append(line)
At the end of the loop, newestXml will contain all the lines from the last occurrence of <?xml to the end of the file.
Now you can join the lines with ''.join(newestXml) and pass the result to an XML parser.
Note - I can't check this code now, so it may contain small mistakes, but I hope the idea will help you.
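The idea above can also be sketched without a line loop. This is a minimal sketch, assuming the whole file fits in memory and that everything after the last <?xml declaration is well formed; parse_last_tree is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def parse_last_tree(text):
    """Parse only the document that starts at the last XML declaration."""
    start = text.rfind('<?xml')
    if start == -1:
        # No declaration at all: assume the text holds a single tree.
        start = 0
    return ET.fromstring(text[start:])


sample = '''<?xml version="1.0" encoding="utf-8"?>
<items>
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<price>1500</price>
<info> asfgfdff</info>
</item>
</items>'''

root = parse_last_tree(sample)
print(root.tag, root.find('item/price').text)
```

For a real file, read it with open(...).read() and pass the resulting string to the helper.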

Related

Get xml value of ElementTree Element

I would like to get the xml value of an element in ElementTree. For example, if I had the code:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<child>asd</child>
hello world
<ch>jkl</ch>
</item>
It would get me
<child>asd</child>
hello world
<ch>jkl</ch>
Here's what I tried so far:
import xml.etree.ElementTree as ET
root = ET.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<item>
<child>asd</child>
hello world
<ch>jkl</ch>
</item>""")
print(root.text)

Try
print(ET.tostring(root.find('.//child')).decode(),ET.tostring(root.find('.//ch')).decode())
Or, more readable:
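elems = ['child', 'ch']
for elem in elems:
    print(ET.tostring(root.find(f'.//{elem}')).decode())
Based on the XML in your question, the output should be what you're looking for: ET.tostring() also serializes the text that follows an element (its tail), so hello world is printed along with <child>.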

Building on Jack Fleeting's answer, I created a solution I feel is more general, not just relating to the xml I inserted.
import xml.etree.ElementTree as ET
root = ET.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<item>
<child>asd</child>
hello world
<ch>jkl</ch>
</item>""")
for elem in root:
    # Serialize each child directly; find() by tag would return only the first match when tags repeat.
    print(ET.tostring(elem).decode())
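The loops above skip any text that appears before the first child. A minimal sketch that collects the complete inner XML as one string follows; inner_xml is a hypothetical helper name, and the approach relies on ET.tostring() including each child's tail text.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return everything between the element's start and end tags."""
    # Text before the first child, then each child with its tail text.
    parts = [element.text or '']
    parts.extend(ET.tostring(child, encoding='unicode') for child in element)
    return ''.join(parts)


root = ET.fromstring("""<item>
<child>asd</child>
hello world
<ch>jkl</ch>
</item>""")

print(inner_xml(root).strip())
```

The strip() call only trims the newlines that surround the children in this particular sample.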

Remove unwanted tags from XML file

I am working on an XML file that contains SOAP tags. I want to remove those SOAP tags as part of an XML cleanup process.
How can I achieve this in either Python or Scala? A shell script is not an option.
Sample Input :
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>
Expected Output :
<?xml version="1.0" encoding="UTF-8"?>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>

This could help you!
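from lxml import etree

doc = etree.parse('test.xml')
# 'soap' is a namespace prefix, not an element name, so '//soap' matches nothing.
# Select the payload inside soap:Body instead of removing the wrapper elements.
ns = {'soap': 'http://sample.com/'}
for payload in doc.xpath('//soap:Body/*', namespaces=ns):
    print(etree.tostring(payload).decode())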
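Because removing the Envelope element would also remove everything inside it, one alternative is to pull the payload out of the Body element and serialize only that. The following is a sketch using the standard library rather than lxml; unwrap_body and local_name are hypothetical helper names, and matching Body by its local name is an assumption that avoids depending on the exact SOAP namespace URI.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part from a qualified tag name."""
    return tag.rsplit('}', 1)[-1]


def unwrap_body(xml_text, prefix='com', uri='http://sample.com/'):
    """Return the first child of the SOAP Body as a standalone XML string."""
    # Keep the payload's prefix readable instead of an auto-generated ns0.
    ET.register_namespace(prefix, uri)
    envelope = ET.fromstring(xml_text)
    body = next((el for el in envelope.iter() if local_name(el.tag) == 'Body'), None)
    if body is None or len(body) == 0:
        raise ValueError('no payload found inside a Body element')
    payload = body[0]
    payload.tail = None  # drop whitespace that followed the payload
    return ET.tostring(payload, encoding='unicode')


sample = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>"""

result = unwrap_body(sample)
print(result)
```

The result has no XML declaration; prepend one when writing the string to a file if the consumer needs it.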

How to get specific block(group) based on child node's value in XPath from the XML?

I am new to XPath. I have the following XML file:
<?xml version='1.0' encoding='utf-8'?>
<items>
<item>
<country>India</country>
<referenceId>IN375TP</referenceId>
<price>400</price>
</item>
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
<item>
<country>United Kingdom</country>
<referenceId>UK862RB</referenceId>
<price>20</price>
</item>
</items>
I want the following <item> tag as an output:
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
Note: Please use condition like /items/item[referenceId/text()="AU120ED"]

If you want to find the item by country, use an XPath predicate that matches the item whose country child has the given text:
from lxml.etree import parse, HTMLParser
xml = parse("check.xml",HTMLParser())
print(xml.find("//items//item[country='Australia']"))
<Element item at 0x7f40faa28950>
If you actually want to search by referenceId, just change the predicate to item[referenceid='AU120ED']. The tag name is lowercase here because HTMLParser lowercases element names:
print(xml.find("//items//item[referenceid='AU120ED']"))
<Element item at 0x7f02c0c24998>
With the standard library's xml.etree.ElementTree, which keeps the original case of the tag names:
from xml.etree import ElementTree as et
xml = et.parse("check.xml")
print(xml.find(".").find("./item[referenceId='AU120ED']"))
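For reference, here is a self-contained sketch of the same lookup that parses the XML from a string instead of check.xml, so it can be run as is. It uses only the standard library; the predicate [referenceId='AU120ED'] matches an item whose referenceId child has exactly that text.

```python
import xml.etree.ElementTree as ET

xml_text = """<items>
<item>
<country>India</country>
<referenceId>IN375TP</referenceId>
<price>400</price>
</item>
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
<item>
<country>United Kingdom</country>
<referenceId>UK862RB</referenceId>
<price>20</price>
</item>
</items>"""

root = ET.fromstring(xml_text)
# find() returns the first matching <item>, or None if nothing matches.
item = root.find("./item[referenceId='AU120ED']")
print(ET.tostring(item, encoding='unicode'))
```

To get every matching block rather than the first one, use root.findall() with the same expression.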

how to add xml child node with namespace in python?

I'm really stuck on this. I have a file with an XML layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I need to add another 'item' to the channel node. That part is easy, but I can't find a way to add the item's children with the namespace.
I'm trying lxml, but the documentation is not very clear for a beginner.
Any help would be appreciated.
I found a way to do it with lxml:
from lxml import etree as et

xml = et.parse('feed.xml')  # hypothetical file name; load your own document here
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
# '{SomeName}title' is Clark notation: the namespace URI in braces, then the local name.
title = et.SubElement(item, '{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item, '{SomeName}poster')
poster.text = 'My poster'
url = et.SubElement(item, '{SomeName}url')
url.text = 'http://My.url.com'
channel.append(item)
I am still interested in a better solution, though.
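The same Clark-notation approach also works with the standard library. The sketch below is self-contained and parses the feed from a string; the loop over tag and text pairs is just one way to avoid repeating the SubElement calls, and register_namespace() is what keeps the irc prefix in the serialized output.

```python
import xml.etree.ElementTree as ET

NS = 'SomeName'
# Without this, the serializer would invent a prefix such as ns0.
ET.register_namespace('irc', NS)

root = ET.fromstring("""<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>""")

channel = root.find('channel')
item = ET.SubElement(channel, 'item')
for tag, text in (('title', 'My new title'),
                  ('poster', 'My poster'),
                  ('url', 'http://My.url.com')):
    # '{SomeName}title' means: local name 'title' in namespace 'SomeName'.
    child = ET.SubElement(item, f'{{{NS}}}{tag}')
    child.text = text

print(ET.tostring(root, encoding='unicode'))
```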

Alternatively, you can use XSLT, a declarative language for transforming, re-formatting, and re-structuring XML files. Python's lxml module includes an XSLT processor.
Simply declare the needed namespace on the stylesheet's root element and use its prefix in any new node. This might appear to be overkill for your current need, but a more complex transformation with namespaces may call for it. The stylesheet below gives each item a new title and copies the existing poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open(os.path.join(cd, 'output.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

python lxml using iterparse to edit and output xml

I've been experimenting with the lxml library for a while, and maybe I'm missing something, but I can't figure out how to edit an element once I match a certain XPath and then write the result back out as XML while I'm still parsing element by element.
Say we have this xml as an example:
<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>
While parsing, whenever I hit the path /xml/items/pie, I would like to wrap the pie element in a new item element, so the output looks like this:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
The output needs to be written to a file piece by piece as I hit each tag and edit the XML at certain paths. I could print the start tag, the text, the attributes if any, and then the end tag by hard-coding certain parts, but that would be very messy, and it would be nice to avoid that if possible.
Here's my attempt:
from lxml import etree

path=[]
count=0
context=etree.iterparse(file,events=('start','end'))
for event, element in context:
    if event=='start':
        path.append(element.tag)
        if '/'+'/'.join(path)=='/xml/items/pie':
            itemnode=etree.Element('item',id=str(count))
            itemnode.text=""
            element.addprevious(itemnode)#Not the right way to do it of course
            #write/print out xml here.
    else:
        element.clear()
        path.pop()
Edit: Also, I need to run through fairly big files, so I have to use iterparse.

Here's a solution using iterparse(). The idea is to catch all tag "start" events, remember the parent (items) tag, then for every pie tag create an item tag and put the pie into it:
from io import BytesIO

from lxml import etree
from lxml.etree import Element

data = b"""<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""

stream = BytesIO(data)
context = etree.iterparse(stream, events=("start", ))
for action, elem in context:
    if elem.tag == 'items':
        # Remember the parent so each pie can be replaced in place.
        items = elem
        index = 1
    elif elem.tag == 'pie':
        # Swap the pie for a new item, then move the pie inside it.
        item = Element('item', {'id': str(index)})
        items.replace(elem, item)
        item.append(elem)
        index += 1

print(etree.tostring(context.root).decode())
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>

There is a cleaner way to make the modifications you need:
1. iterate over the pie elements
2. make an item element for each one
3. use replace() to replace the pie element with the item, then append the pie to the item
The lxml documentation describes replace(self, old_element, new_element) as: "Replaces a subelement with the element passed as second argument."
from lxml import etree
from lxml.etree import XMLParser, Element

data = """<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""

tree = etree.fromstring(data, parser=XMLParser())
items = tree.find('.//items')
for index, pie in enumerate(items.xpath('.//pie'), start=1):
    item = Element('item', {'id': str(index)})
    items.replace(pie, item)
    item.append(pie)

print(etree.tostring(tree, pretty_print=True).decode())
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
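Both answers above keep the whole tree in memory until the end, which may not suit the large files mentioned in the question. The following is a rough sketch of writing output while parsing. It uses the standard library's iterparse rather than lxml, and it assumes the elements above the target path are plain wrappers with no attributes or text worth keeping. wrap_pies and TARGET are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

TARGET = ['xml', 'items', 'pie']  # the path /xml/items/pie from the question


def wrap_pies(source, sink):
    """Stream source to sink, wrapping each target element in <item id="n">."""
    stack = []  # currently open elements, outermost first
    count = 0
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            stack.append(elem)
            tags = [e.tag for e in stack]
            # Echo the start tag of wrappers that lie above the target.
            if len(tags) < len(TARGET) and tags == TARGET[:len(tags)]:
                sink.write(f'<{elem.tag}>')
            continue
        tags = [e.tag for e in stack]
        stack.pop()
        if tags == TARGET:
            count += 1
            elem.tail = None
            stack[-1].remove(elem)  # detach so memory use stays flat
            item = ET.Element('item', {'id': str(count)})
            item.append(elem)
            sink.write(ET.tostring(item, encoding='unicode'))
        elif tags == TARGET[:len(tags)]:
            sink.write(f'</{elem.tag}>')


data = b"""<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""

out = io.StringIO()
wrap_pies(io.BytesIO(data), out)
print(out.getvalue())
```

For a real file, pass the input path (or a file opened in binary mode) as source and a file opened for text writing as sink. The output is written on one line, without the original whitespace.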

I would suggest using an XSLT template, as it seems a better fit for this task. XSLT is a little tricky until you get used to it, but if all you want is to generate some output from an XML document, it is a great tool.
