lxml, get xml between elements

given this sample xml:
<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>
I need to parse it and get all the content between the pb elements, saving each chunk into a separate external file.
Expected result:
$ cat id1
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
$ cat id2
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
What is the correct XPath axis to use?
from lxml import etree
xml = etree.parse("sample.xml")
for pb in xml.xpath('//pb'):
    filename = pb.xpath('@facs')[0]
    f = open(filename, 'w')
    content = **{{ HOW TO GET THE CONTENT HERE? }}**
    f.write(content)
    f.close()
Is there any XPath expression to get all descendants and stop when a new pb is reached?

Do you want to extract the tags between two pb elements? If so, note that they are not nested inside pb: because each pb tag is self-closing, the elements that follow it are its siblings on the same level, not its children. If pb were instead closed after the test tag, test would become a child of pb.
In other words, if your XML looks like this:
<xml>
<pb facs="id1">
<test></test>
</pb>
<test></test>
<pb facs="id2" />
<test></test>
<test></test>
</xml>
then you can use
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    for subchild in child:
        print(subchild)
to print each subchild ('test') that has pb as its parent.
If that's not the case (you just want to extract the attributes of the pb tags), you can use either of the two methods below.
With Python's built-in ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
With the lxml library you can parse it like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))

OK, I tested this code:
lists = []
for node in tree.findall('*'):
    if node.tag == 'pb':
        lists.append([])
    else:
        lists[-1].append(node)
Output:
>>> lists
[[<Element test at 2967fa8>, <Element test at 2a89030>, <Element lot-of-xml at 2a89080>], [<Element test at 2a89170>, <Element test at 2a891c0>, <Element lot-of-xml at 2a89210>]]
Input file (just in case):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>
<pb facs="id1" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
</xml>
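The grouping loop above can be extended to produce the text of each page. This is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the function name split_by_pb and the inline SAMPLE string are assumptions made for illustration, and elements that appear before the first pb are simply skipped.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>"""


def split_by_pb(root):
    """Map each pb's facs value to the serialized siblings that follow it."""
    pages = {}
    current = None  # facs of the most recent <pb>; None before the first one
    for node in root:
        if node.tag == "pb":
            current = node.get("facs")
            pages[current] = []
        elif current is not None:
            # short_empty_elements=False keeps <aa></aa> instead of <aa />
            pages[current].append(
                ET.tostring(node, encoding="unicode", short_empty_elements=False)
            )
    return {facs: "".join(chunks) for facs, chunks in pages.items()}


pages = split_by_pb(ET.fromstring(SAMPLE))
for facs, content in pages.items():
    print(facs)
    print(content)
```

To get one file per page, each value can be written to a file named after its key, for example with open(facs, "w"). The same loop should work unchanged on an lxml tree, since both libraries iterate over direct children the same way.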

Related

How to properly use xmlfile api from lxml

I have a large (5+ gigs) XML file which I need to parse, do some operation & write a new XML file.
dummy.xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="5678">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="9999">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
As you can see, the above XML has 3 product tags, and I need to keep only the products whose product-id appears in a predefined list of ids.
I am using lxml iterparse to parse the XML iteratively and want to use the xmlfile API to create the new XML incrementally, to keep the memory footprint low. So my goal is to filter out the product tags that don't meet the criteria and copy the rest of the XML tags as they are.
from lxml import etree
f = './dummy.xml'
f1 = './test.xml'
context = etree.iterparse(f, events=('start',))
productsToExport = ['1234']
with etree.xmlfile(f1, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element('catalog xmlns="http://www.namespace.com" catalog-id="test-catalog"'):
        for event, element in context:
            tagName = etree.QName(element.tag).localname
            if (tagName == 'product'):
                pid = element.get('product-id')
                if (pid in productsToExport):
                    xf.write(element)
            elif (tagName == 'header'):
                xf.write(element) # copy existing header tag as it is
The above code works and generates the XML below:
<?xml version='1.0' encoding='utf-8'?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header xmlns="http://www.namespace.com">
<name>Product Catalog</name>
</header>
<product xmlns="http://www.namespace.com" product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
If you look at the above XML, it has a few issues:
1. The closing <catalog> tag has xmlns and catalog-id present in it.
2. All tags like header and product have an xmlns attribute present in them.
I checked the xmlfile API documentation but couldn't find a way to fix the above issues.
EDIT:
I managed to resolve the 1st issue by using the following:
attribs = {'xmlns' : 'http://www.namespace.com', 'catalog-id' : 'test-catalog'}
with xf.element('catalog', attribs):
    # previous logic
So now I am left with removing the namespace declaration from every element.

Consider simply rebuilding the XML tree with lxml.etree methods instead of the xmlfile API, still in the context of your iterparse. Note that the matching elements are collected in memory before being written out:
import copy
from lxml import etree
src = './dummy.xml'
dst = './test.xml'
productsToExport = ['1234']
# ROOT ELEMENT WITH DEFAULT NAMESPACE
my_nmsp = {None: 'http://www.namespace.com'}
root = etree.Element('{http://www.namespace.com}catalog', nsmap=my_nmsp)
root.attrib['catalog-id'] = "test-catalog"
# INITIALIZE ITERATOR ('end' events: each element is complete when it is seen)
context = etree.iterparse(src, events=('end',))
for event, element in context:
    tagName = etree.QName(element.tag).localname
    if tagName not in ('header', 'product'):
        continue
    # HEADER ELEMENT, OR PRODUCT ELEMENT WITH A WANTED ID
    if tagName == 'header' or element.get('product-id') in productsToExport:
        # append a copy; the parser still owns the original element
        root.append(copy.deepcopy(element))
    # free the parsed subtree to keep memory low
    element.clear()
# OUTPUT TREE TO FILE
with open(dst, 'wb') as out:
    out.write(etree.tostring(root, pretty_print=True))
Output
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
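For comparison, the same filtering can be sketched with only the standard library. This is an illustration under stated assumptions, not the asker's or answerer's code: the function name filter_products and the inline SAMPLE are hypothetical, and kept products still accumulate in memory, so it reduces the footprint only when most products are dropped. ET.register_namespace("", NS) is used so the namespace is serialized once, as the default namespace on the root.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.namespace.com"

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="5678">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>"""


def filter_products(source, wanted_ids):
    """Return the catalog as a string, keeping only the wanted products."""
    ET.register_namespace("", NS)  # serialize NS without a prefix
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of <catalog>
    product_tag = "{%s}product" % NS
    for event, elem in context:
        if event == "end" and elem.tag == product_tag:
            if elem.get("product-id") not in wanted_ids:
                root.remove(elem)  # drop the unwanted product from the tree
    return ET.tostring(root, encoding="unicode")


result = filter_products(io.BytesIO(SAMPLE), {"1234"})
print(result)
```

Because the unwanted products are removed as soon as their end tag is seen, the tree never holds more than the kept elements plus the one currently being parsed.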

Python XML get immediate child elements only

I have an xml file as below:
<?xml version="1.0" encoding="utf-8"?>
<EDoc CID="1000101" Cname="somename" IName="iname" CSource="e1" Version="1.0">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
<nodeA nodeAID="1000012" nodeAName="node2A" nodeAExtID="9000012" />
<nodeA nodeAID="1000013" nodeAName="node3A" nodeAExtID="9000013" />
<nodeA nodeAID="1000014" nodeAName="node4A" nodeAExtID="9000014" />
<nodeA nodeAID="1000015" nodeAName="node5A" nodeAExtID="9000015" />
<nodeA nodeAID="1000016" nodeAName="node6A" nodeAExtID="9000016" />
<nodeA nodeAID="1000017" nodeAName="node7A" nodeAExtID="9000017" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
<nodeA nodeAID="1000023" nodeAName="node3B" nodeAExtID="9000023" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>
I need to search for a value of the RIGName attribute and, if a match is found, print all the values of nodeAName below it.
Example:
Searching for RIGName = "RgName2" should print the values node1B, node2B, node3B.
As of now I am only able to get the first part, as below:
import xml.etree.ElementTree as eT
import re
xmlfilePath = "Path of xml file"
searchtxt = "RgName2"  # text to search for
count = 0
tree = eT.parse(xmlfilePath)
root = tree.getroot()
for elem in root.iter("RIG"):
    # print(elem.tag, elem.attrib)
    if re.findall(searchtxt, elem.attrib['RIGName'], re.IGNORECASE):
        print(elem.attrib)
        count += 1
How can I get only the values of the immediate child nodes?

Switching from xml.etree to lxml would let you do it in a single go, because lxml has much more complete XPath support:
In [1]: from lxml import etree as ET
In [2]: tree = ET.parse('input.xml')
In [3]: root = tree.getroot()
In [4]: root.xpath('//RIG[@RIGName = "RgName2"]/ListID/nodeA/@nodeAName')
Out[4]: ['node1B', 'node2B', 'node3B']
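If switching libraries is not an option, a similar lookup can be sketched with the standard library, whose limited XPath subset supports attribute predicates but cannot select attribute values directly. The helper name node_names and the shortened SAMPLE are assumptions for illustration; the rig_name value is interpolated into the path as-is, so it must not contain a single quote.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<EDoc CID="1000101" Cname="somename">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" />
<nodeA nodeAID="1000012" nodeAName="node2A" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" />
<nodeA nodeAID="1000022" nodeAName="node2B" />
<nodeA nodeAID="1000023" nodeAName="node3B" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>"""


def node_names(root, rig_name):
    """Return the nodeAName values under the RIG with the given RIGName."""
    # Select the nodeA elements first, then read the attribute in Python.
    path = ".//RIG[@RIGName='%s']/ListID/nodeA" % rig_name
    return [node.get("nodeAName") for node in root.findall(path)]


names = node_names(ET.fromstring(SAMPLE), "RgName2")
print(names)
```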

Modifying an xml attribute element with a value of a child element with lxml

I have an XML snippet like this:
<parent id="1">
<child1>
<child2>[content]I need to get[/content]Other text</child2>
</child1>
</parent>
And I would like to add the [content] of "child1" as an attribute into the parent element.
Getting something like this:
<parent id="1" value = "I need to get">
<child1>
<child2>Other text</child2>
</child1>
</parent>
I have this code; however, it does not work, as it looks like it only iterates over the first child and does not go on to the next.
pattern = re.compile('[content](.*?)[/content]')
xml_parser = et.parse(str(xml_file))
root_xml = xml_parser.getroot()
translatable_elements = root_xml.xpath('//parent')
for element in translatable_elements:
    for child_element in element.iterchildren():
        if child_element.tag == 'child1':
            source_content = child_element.text
            value_str = pattern.match(source_content).group(1)
            element.attrib['value'] = value_str
            source_content = pattern.sub(source_content,'')
tree = et.ElementTree(root_xml)
tree.write(str(xml_file), encoding='utf-8', pretty_print=True)

You need to compile the regex from a properly escaped string, because unescaped square brackets define a character class. Also, you were trying to grab the text from child1 instead of child2. This should be along the lines of what you're looking for:
import re
from lxml import etree
with open(path, 'rb') as f:  # path: location of your XML file
    tree = etree.parse(f)
pattern = re.compile(r'\[content\](.*?)\[/content\]')
root = tree.getroot()
pars = root.xpath('//parent')
for par in pars:
    for child1 in par.iterchildren('child1'):
        child2 = child1.find('child2')
        val = pattern.match(child2.text).group(1)
        par.set('value', val)
        child2.text = pattern.sub('', child2.text)
print(etree.tostring(tree, encoding='unicode', pretty_print=True))

Another option is to not use regex at all and use plain xpath.
Since you said your XML was a snippet, I wrapped it in a doc element and added another parent to show what happens when there are multiples.
Example...
XML Input (input.xml)
<doc>
<parent id="1">
<child1>
<child2>[content]I need to get[/content]Other text</child2>
</child1>
</parent>
<parent id="2">
<child1>
<child2>[content]I need to get this too[/content]More other text</child2>
</child1>
</parent>
</doc>
Python
from lxml import etree
tree = etree.parse("input.xml")
for parent in tree.xpath(".//parent"):
    child2 = parent.xpath("./child1/child2")[0]
    parent.attrib["value"] = child2.xpath("substring-before(substring-after(.,'[content]'),'[/content]')")
    child2.text = child2.xpath("substring-after(.,'[/content]')")
tree.write("output.xml")
Output (output.xml)
<doc>
<parent id="1" value="I need to get">
<child1>
<child2>Other text</child2>
</child1>
</parent>
<parent id="2" value="I need to get this too">
<child1>
<child2>More other text</child2>
</child1>
</parent>
</doc>
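Both answers above assume that every child2 starts with a [content] marker. A defensive variant might look like the following sketch, which uses the standard library and skips elements without a marker instead of raising an error; the function name lift_content and the second parent in SAMPLE (one without a marker) are assumptions added for illustration.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<doc>
<parent id="1">
<child1>
<child2>[content]I need to get[/content]Other text</child2>
</child1>
</parent>
<parent id="2">
<child1>
<child2>No marker here</child2>
</child1>
</parent>
</doc>"""

# Square brackets are escaped so they match literally.
PATTERN = re.compile(r"\[content\](.*?)\[/content\]")


def lift_content(root):
    """Copy the leading [content]...[/content] text of child2 into parent/@value."""
    for parent in root.iter("parent"):
        child2 = parent.find("child1/child2")
        if child2 is None or not child2.text:
            continue  # nothing to inspect
        match = PATTERN.match(child2.text)
        if match is None:
            continue  # no marker at the start of the text; leave it untouched
        parent.set("value", match.group(1))
        child2.text = child2.text[match.end():]  # keep only the text after the marker
    return root


root = lift_content(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```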

how to get value of an xml element not directly under root

I am trying to parse an XML file and get the value of dir_path as shown below; however, I don't get the desired output. What's wrong here and how can I fix it?
input.xml
<?xml version="1.0" ?>
<data>
<software>
<name>xyz</name>
<role>xyz</role>
<future>unknown</future>
</software>
<software>
<name>abc</name>
<role>abc</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP_softwareROOT">\\location\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
<software>
<name>def</name>
<role>def</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP2_softwareROOT">\\location1\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
</data>
CODE:-
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
dir_path = root.find(".//dir_path")
print(dir_path.text)
OUTPUT:-
.\
EXPECTED OUTPUT:-
\\location\software\INR\

Try the following, which selects the dir_path of the software element whose name child is "abc":
from xml.etree import ElementTree as ET
tree = ET.parse('filename.xml')
item = tree.find('software[name="abc"]/dir_path')
print(item.text if item is not None else None)
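When several software entries are of interest, the lookup can be done once for all of them. The sketch below uses a shortened version of the input and a hypothetical helper named dir_paths_by_name; findtext returns None for entries that have no dir_path child, so the first software element does not cause an error.

```python
import xml.etree.ElementTree as ET

# Raw string, so the backslashes in the paths are kept literally.
SAMPLE = r"""<data>
<software>
<name>xyz</name>
</software>
<software>
<name>abc</name>
<dir_path cmm_root_path_var="COMP_softwareROOT">\\location\software\INR\</dir_path>
</software>
<software>
<name>def</name>
<dir_path cmm_root_path_var="COMP2_softwareROOT">\\location1\software\INR\</dir_path>
</software>
</data>"""


def dir_paths_by_name(root):
    """Map each software name to its dir_path text (None when it has none)."""
    result = {}
    for software in root.findall("software"):
        result[software.findtext("name")] = software.findtext("dir_path")
    return result


paths = dir_paths_by_name(ET.fromstring(SAMPLE))
print(paths["abc"])
```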

how can I select all descendants of a certain element with ElementTree in Python 3.3?

This is the sample data.
input.xml
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
I'd like to put the <example> node and its descendants into a new <examplegrp> element. That is,
output.xml
<root>
<entry id="1">
<headword>go</headword>
<examplegrp>
<example>I <hw>go</hw> to school.</example>
</examplegrp>
</entry>
</root>
My poor and incomplete script is:
import codecs
import xml.etree.ElementTree as ET
fin = codecs.open(r'input.xml', 'rb', encoding='utf-8')
data = ET.parse(fin)
root = data.getroot()
example = root.find('.//example')
for elem in example.iter():
    ---and then I don't know what to do---

Here's an example of how it can be done:
text = """
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
"""
import io
import lxml.etree
data = lxml.etree.parse(io.StringIO(text))
root = data.getroot()
for entry in root.xpath('//example/ancestor::entry[1]'):
    examplegrp = lxml.etree.SubElement(entry, "examplegrp")
    nodes = [node for node in entry.xpath('./example')]
    # move every example child of this entry into the new group
    for node in nodes:
        entry.remove(node)
        examplegrp.append(node)
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode'))
which will output:
<root>
<entry id="1">
<headword>go</headword>
<examplegrp><example>I <hw>go</hw> to school.</example>
</examplegrp></entry>
</root>

http://docs.python.org/3/library/xml.dom.html?highlight=xml#node-objects
http://docs.python.org/3/library/xml.dom.html?highlight=xml#document-objects
You probably want to follow the pattern of creating a new element from the document and appending each result to it:
group = document.createElement(tagName)  # document is an xml.dom Document instance
for found in founds:
    group.appendChild(found)
Or something like this.
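Since the question asks about ElementTree specifically, the wrapping can also be sketched without lxml or xml.dom. ElementTree elements do not know their parent, so this sketch walks the entry elements and rearranges their children; the function name wrap_examples is an assumption made for illustration, and the group is inserted where the first example used to be.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>"""


def wrap_examples(root):
    """Move the example children of each entry into a new examplegrp child."""
    # list() takes a snapshot, so the tree can be modified while looping.
    for entry in list(root.iter("entry")):
        examples = entry.findall("example")
        if not examples:
            continue
        position = list(entry).index(examples[0])
        group = ET.Element("examplegrp")
        for example in examples:
            entry.remove(example)  # detach from the entry...
            group.append(example)  # ...and re-attach, descendants included
        entry.insert(position, group)
    return root


root = wrap_examples(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

Moving an element carries its whole subtree with it, so the nested hw element needs no separate handling.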
