How to properly use xmlfile api from lxml - python

I have a large (5+ gigs) XML file which I need to parse, do some operation & write a new XML file.
dummy.xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="5678">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="9999">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
As you see the above XML has 3 product tags & I need to filter some product-ids on basis of a pre-defined list of ids.
I am using lxml iterparse to parse the XML iteratively & want to use xmlfile API to create a new XML incrementally to keep the memory footprint low. So, my motive is to filter out the product tags which don't meet the criteria & copy the rest of the XML tags as it is.
from lxml import etree
f = './dummy.xml'
f1 = './test.xml'
context = etree.iterparse(f, events=('start',))
productsToExport = ['1234']
with etree.xmlfile(f1, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element('catalog xmlns="http://www.namespace.com" catalog-id="test-catalog"'):
        for event, element in context:
            tagName = etree.QName(element.tag).localname
            if (tagName == 'product'):
                pid = element.get('product-id')
                if (pid in productsToExport):
                    xf.write(element)
            elif (tagName == 'header'):
                xf.write(element) # copy existing header tag as it is
Above code works ok & generates a XML as below
<?xml version='1.0' encoding='utf-8'?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header xmlns="http://www.namespace.com">
<name>Product Catalog</name>
</header>
<product xmlns="http://www.namespace.com" product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
If you look at the XML above, it has a few issues:
1. The closing <catalog> tag has xmlns & catalog-id present in it.
2. All tags like header and product have an xmlns attribute present in them.
I checked xmlfile api documentation but couldn't find a way to fix above issues.
EDIT:
I managed to resolve the 1st issue by using below
attribs = {'xmlns' : 'http://www.namespace.com', 'catalog-id' : 'test-catalog'}
with xf.element('catalog', attribs):
    # previous logic
So now I am left with removing the namespace declaration from every element.

Consider simply rebuilding the XML tree with lxml.etree methods instead of the xmlfile API, still in the context of your iterparse:
from lxml import etree
f = './dummy.xml'
f1 = './test.xml'
productsToExport = ['1234']
# ROOT ELEMENT WITH DEFAULT NAMESPACE (created once, before the loop)
my_nmsp = {None: 'http://www.namespace.com'}
root = etree.Element('catalog', nsmap=my_nmsp)
root.text = '\n\t'
root.attrib['catalog-id'] = "test-catalog"
# INITIALIZE ITERATOR ('end' events fire once an element is fully parsed)
context = etree.iterparse(f, events=('end',))
for event, element in context:
    tagName = etree.QName(element.tag).localname
    # PRODUCT ELEMENT
    if tagName == 'product':
        pid = element.get('product-id')
        if pid in productsToExport:
            root.append(element)
    # HEADER ELEMENT
    elif tagName == 'header':
        root.append(element)
# OUTPUT TREE TO FILE
with open(f1, 'wb') as out:
    out.write(etree.tostring(root, pretty_print=True))
Output
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
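Because the tree above is assembled fully in memory, it may not suit a 5+ GB input. The following is a minimal streaming sketch, assuming the layout of dummy.xml and root attributes without a namespace. It swaps in the standard library's xml.etree.ElementTree with hand-written output instead of lxml's xmlfile API, and the function name filter_catalog is hypothetical. The idea is to write the root tag once with the default namespace, then strip the namespace from each kept element's tags before serializing, so the children inherit xmlns from the root instead of repeating it.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

NS = "http://www.namespace.com"  # default namespace of dummy.xml


def filter_catalog(src, dst, wanted_ids):
    """Copy <header> and the wanted <product> elements from src to dst.

    src is a path or binary file object; dst is a text file object.
    (Hypothetical helper, not part of lxml or the standard library.)
    """
    prefix = "{" + NS + "}"
    depth = 0
    root = None
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                # Root element: write its opening tag once, declaring the namespace.
                root = elem
                attrs = "".join(
                    " %s=%s" % (k, quoteattr(v)) for k, v in elem.attrib.items()
                )
                dst.write("<catalog xmlns=%s%s>\n" % (quoteattr(NS), attrs))
            continue
        depth -= 1
        if depth == 0:
            dst.write("</catalog>\n")
        elif depth == 1:
            # A direct child of the root has just been fully parsed.
            local = elem.tag[len(prefix):] if elem.tag.startswith(prefix) else elem.tag
            keep = local == "header" or (
                local == "product" and elem.get("product-id") in wanted_ids
            )
            if keep:
                # Drop the namespace from the tags; the root's xmlns covers them.
                for node in elem.iter():
                    if node.tag.startswith(prefix):
                        node.tag = node.tag[len(prefix):]
                elem.tail = None
                dst.write(ET.tostring(elem, encoding="unicode") + "\n")
            root.clear()  # release children that were already handled


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header><name>Product Catalog</name></header>
<product product-id="1234"><name>product1</name></product>
<product product-id="5678"><name>product1</name></product>
</catalog>"""
out = io.StringIO()
filter_catalog(io.BytesIO(sample), out, {"1234"})
print(out.getvalue())
```

For a real file, src would be the input path and dst a file opened with open(path, "w", encoding="utf-8"). Passing the wanted ids as a set keeps the membership test cheap when the list is long.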

Related

How to convert XML to CSV when attributes are inside the root tag?

I need to convert an XML file to CSV file, in which attributes are inside the root tag.
I've checked many links but didn't find any example where XML is converted to CSV by parsing root tag's attributes.
XML Sample:
<Products>
<Product ProductID="1234" ProductName="ABC" Category="Food" />
<Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>
Python Code (getting NoneType error):
ProductID = member.find('Product').attrib.get('ProductID')
Product_nodes.append(ProductID)
Expected Results in CSV file:
ProductID, ProductName, Category
1234, ABC,Food
1236, ABE,Healthcare
Using the csv and lxml libraries:
import csv
from lxml import etree
xml_content = """<Products>
<Product ProductID="1234" ProductName="ABC" Category="Food" />
<Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>"""
outfile = "products.csv"
field_names = ["ProductID", "ProductName", "Category"]
root_node = etree.fromstring(xml_content)
with open(outfile, "w", newline="") as filehandle:
    writer = csv.DictWriter(filehandle, fieldnames=field_names)
    writer.writeheader()
    for node in root_node.findall("Product"):
        writer.writerow(dict(node.attrib))
If you've got access to XQuery or XPath 2.0+, it's simply
string-join(//Product/string-join((@ProductID, @ProductName, @Category),','),'
')
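lxml's xpath() implements XPath 1.0, so the string-join expression needs an XPath 2.0 processor. A rough Python equivalent, sketched here with the standard library's ElementTree on the sample data, could look like the following. The NoneType error in the question likely comes from calling find('Product') on an element that has no Product child (for instance, on a Product element itself), in which case find returns None.

```python
import xml.etree.ElementTree as ET

xml_content = """<Products>
<Product ProductID="1234" ProductName="ABC" Category="Food" />
<Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>"""

fields = ["ProductID", "ProductName", "Category"]
root = ET.fromstring(xml_content)

# The root element is <Products>, so iterate over its <Product> children
# directly and read each attribute with get().
rows = [",".join(p.get(name, "") for name in fields) for p in root.iter("Product")]
csv_text = "\n".join(rows)
print(csv_text)
```

Plain string joining is only safe while the values contain no commas or quotes; the csv module shown earlier handles quoting for you.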

ElementTree parse xml file - problem with parsing

I have a problem parsing data from an XML file. I'm using xml.etree.ElementTree to extract data from files and then save it to .csv. I have all the necessary modules installed on the server.
I am aware that there is the bs4 module with BeautifulSoup, yet I would like to know if it is possible to parse this XML file using ElementTree. Sorry if the answer is easy or obvious, but I'm still very much a beginner and could not name the problem well enough to search for an answer.
When I run the Python script below, I get no errors and no output. I don't really know what I should change, and I cannot find a solution. I tried using different child.tag values or attributes, but with no result.
The XML file that I have a problem with:
<?xml version="1.0" encoding="utf-8"?>
<offer file_format="IOF" version="2.6" extensions="yes" xmlns="http://www.iai-shop.com/developers/iof.phtml">
<product id="9" vat="23.0" code_on_card="BHA">
<producer id="1308137276" name="BEAL"/>
...
<price gross="175" net="142.28"/>
<sizes>
<size code_producer="3700288265272" code="9-uniw" weight="0">
<stock id="0" quantity="-1"/>
<stock id="1" quantity="4"/>
</size>
</sizes>
</product>
<product>
...
</product>
...
and the script that I tried to use (here to extract code_on_card, price net, quantity).
(I am aware that there are two stock children, each with a quantity, and I'm completely fine with the second one overwriting the first one)
import requests
import os,sys
import csv
import xml.etree.ElementTree as ET
reload(sys)
sys.setdefaultencoding('utf-8')
xml_path = '/file.xml'
xml = ET.parse(xml_path)
with open('/home/file.csv', 'wb') as f:
    c = csv.writer(f, delimiter=';')
    for product in xml.iter('product'):
        product_id = product.attrib["code_on_card"]
        for child in product:
            if child.tag == 'price':
                if child.attrib["net"] != None:
                    hurt_net = child.attrib["net"]
        for size in product.iter('size'):
            for stock in size.iter('stock'):
                if 'quantity' in stock.attrib.keys():
                    quantity = stock.attrib["quantity"]
        line = product_id, hurt_net, quantity
        c.writerow(line)
Files that seem to me to be built on similar scheme work just fine (offer -> product ->child/attrib ), like this one:
<?xml version="1.0" encoding="UTF-8"?>
<offer file_format="IOF" version="2.5">
<product id="2">
<price gross="0.00" net="0.00" vat="23.0"/>
<srp gross="0.00" net="0" vat="23.0"/>
<sizes>
<size id="0" code="2-0" weight="0" >
</size>
</sizes>
</product>
...
</product>
...
EDIT:
The outcome should be a .csv file containing multiple rows (one for each product in the XML file) of code_on_card, price net, quantity. It should look like:
BC097B.50GD.O;70.81;37
BC097B.50.A;76.75;24
BC086C.50.B;76.75;29
BGRT.L;3;96.75;28
....
EDIT2
The code as it is now, after drec4s's answer:
import requests
import os,sys
import csv
import xml.etree.cElementTree as ET
reload(sys)
sys.setdefaultencoding('utf-8')
xml_path = '/home/platne/serwer16373/dane/z_hurtowni/pobrane/beal2.xml'
root = ET.parse(xml_path)
ns = {'offer': 'http://www.iai-shop.com/developers/iof.phtml'}
products = root.getchildren()
with open('/home/platne/serwer16373/dane/z_hurtowni/stany_magazynowe/karol/bealKa.csv', 'wb') as f:
    c = csv.writer(f, delimiter=';')
    hurtownia = 'beal'
    for product in root.iter('product'):
        qtt = [1]
        code = product.get('code_on_card')
        hurt_net = product.find('price').get('net')
        for stock in product.find('sizes').find('size').getchildren():
            qtt.append(stock.get('quantity'))
        quantity = max(qtt)
        line = 'beal-'+str(code), hurt_net, quantity
        c.writerow(line)
somehow I'm getting
AttributeError: 'ElementTree' object has no attribute 'getchildren'
I've got Ele
This is how I would go and parse an xml file with namespaces. As per official documentation, the easiest way is to define a dictionary specifying the namespace.
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
root = ET.fromstring("""
<offer file_format="IOF" version="2.6" extensions="yes" xmlns="http://www.iai-shop.com/developers/iof.phtml">
<product id="9" vat="23.0" code_on_card="BHA">
<producer id="1308137276" name="BEAL"/>
<price gross="175" net="142.28"/>
<sizes>
<size code_producer="3700288265272" code="9-uniw" weight="0">
<stock id="0" quantity="-1"/>
<stock id="1" quantity="4"/>
</size>
</sizes>
</product>
</offer>
""")
ns = {'offer': 'http://www.iai-shop.com/developers/iof.phtml'}
# getchildren() was also removed in Python 3.9; select the products by namespaced tag
products = root.findall('offer:product', ns)
for p in products:
    qtt = []  # to store all stock quantities
    product_id = p.get('code_on_card')
    hurt_net = p.find('offer:price', ns).get('net')
    # iterating over an element yields its children
    for stock in p.find('offer:sizes', ns).find('offer:size', ns):
        qtt.append(int(stock.get('quantity')))
    quantity = max(qtt)  # or sum
    line = (product_id, hurt_net, quantity)
    print(line)
Outputs:
('BHA', '142.28', 4)
Also, I did not understand which stock quantity you needed to extract, since you were only getting the value of the last stock child (change the max function to sum or to whatever you need).
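Two details from the question may be worth spelling out. ET.parse returns an ElementTree object, not an Element, which is why the EDIT2 code raises the AttributeError; calling getroot() gives the element to iterate over. And because the file declares a default namespace, un-prefixed lookups such as iter('product') match nothing. The sketch below uses the {uri}tag form as an alternative to the prefix dictionary and writes the semicolon-separated row in Python 3; the shortened sample document and the in-memory buffers are stand-ins for the real files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace in Clark notation, prepended to every tag name.
NS = "{http://www.iai-shop.com/developers/iof.phtml}"

data = b"""<?xml version="1.0" encoding="utf-8"?>
<offer file_format="IOF" version="2.6" xmlns="http://www.iai-shop.com/developers/iof.phtml">
<product id="9" vat="23.0" code_on_card="BHA">
<price gross="175" net="142.28"/>
<sizes>
<size code="9-uniw" weight="0">
<stock id="0" quantity="-1"/>
<stock id="1" quantity="4"/>
</size>
</sizes>
</product>
</offer>"""

# ET.parse returns an ElementTree; getroot() gives the root Element.
root = ET.parse(io.BytesIO(data)).getroot()

out = io.StringIO()  # stands in for open(csv_path, "w", newline="")
writer = csv.writer(out, delimiter=";")
for product in root.iter(NS + "product"):
    code = product.get("code_on_card")
    net = product.find(NS + "price").get("net")
    # Convert to int so max() compares numbers rather than strings.
    quantities = [int(s.get("quantity")) for s in product.iter(NS + "stock")]
    writer.writerow([code, net, max(quantities)])
print(out.getvalue())
```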

Modifying an xml attribute element with a value of a child element with lxml

I have an XML snippet like this:
<parent id="1">
<child1>
<child2>[content]I need to get[/content]Other text</child2>
</child1>
</parent>
And I would like to add the [content] of "child1" as an attribute into the parent element.
Getting something like this:
<parent id="1" value = "I need to get">
<child1>
<child2>Other text</child2>
</child1>
</parent>
I have this code, but it does not work: it looks like it only iterates over the first child and does not go on to the next.
pattern = re.compile('[content](.*?)[/content]')
xml_parser = et.parse(str(xml_file))
root_xml = xml_parser.getroot()
translatable_elements = root_xml.xpath('//parent')
for element in translatable_elements:
    for child_element in element.iterchildren():
        if child_element.tag == 'child1':
            source_content = child_element.text
            value_str = pattern.match(source_content).group(1)
            element.attrib['value'] = value_str
            source_content = pattern.sub(source_content,'')
tree = et.ElementTree(root_xml)
tree.write(str(xml_file), encoding='utf-8', pretty_print=True)
You need to compile the regex from a properly escaped string, since unescaped square brackets define a character class. Also, you were trying to grab text from child1 instead of child2. This should be along the lines of what you're looking for:
import re
from lxml import etree
# path is the location of the XML file
tree = etree.parse(path)
pattern = re.compile(r'\[content\](.*?)\[/content\]')
root = tree.getroot()
pars = root.xpath('//parent')
for par in pars:
    for child1 in par.iterchildren('child1'):
        child2 = child1[0]  # first child of child1
        val = pattern.match(child2.text).group(1)
        par.set('value', val)
        child2.text = pattern.sub('', child2.text)
print(etree.tostring(tree, encoding='utf-8', pretty_print=True))
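To see why the escaping matters, here is a small standalone check using only the re module on the text of child2 from the question. It also shows the argument order of sub(), which the question's code had reversed.

```python
import re

text = "[content]I need to get[/content]Other text"

# Unescaped: "[content]" is a character class matching ONE of c, o, n, t, e.
unescaped = re.compile('[content](.*?)[/content]')
# Escaped: the square brackets are matched literally.
escaped = re.compile(r'\[content\](.*?)\[/content\]')

# The text starts with "[", which is not in the character class, so no match.
no_match = unescaped.match(text)
value = escaped.match(text).group(1)
# sub() takes the replacement first, then the string to search.
remainder = escaped.sub('', text)
print(no_match, value, remainder)
```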
Another option is to not use regex at all and use plain xpath.
Since you said your XML was a snippet, I wrapped it in a doc element and added another parent to show what happens when there are multiples.
Example...
XML Input (input.xml)
<doc>
<parent id="1">
<child1>
<child2>[content]I need to get[/content]Other text</child2>
</child1>
</parent>
<parent id="2">
<child1>
<child2>[content]I need to get this too[/content]More other text</child2>
</child1>
</parent>
</doc>
Python
from lxml import etree
tree = etree.parse("input.xml")
for parent in tree.xpath(".//parent"):
    child2 = parent.xpath("./child1/child2")[0]
    parent.attrib["value"] = child2.xpath("substring-before(substring-after(.,'[content]'),'[/content]')")
    child2.text = child2.xpath("substring-after(.,'[/content]')")
tree.write("output.xml")
Output (output.xml)
<doc>
<parent id="1" value="I need to get">
<child1>
<child2>Other text</child2>
</child1>
</parent>
<parent id="2" value="I need to get this too">
<child1>
<child2>More other text</child2>
</child1>
</parent>
</doc>

Python add Tags to XML using lxml

I have the following Input XML:
<?xml version="1.0" encoding="utf-8"?>
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
</Scenario>
My program adds three tags to the XML, but they are formatted incorrectly.
The Output XML looks like the following:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration><EVC-SW-Version>08.02.0001.0027</EVC-SW-Version><STAC-Release>08.02.0001.0027</STAC-Release></Scenario>
That's my source code:
import os
from lxml import etree
class XmlManager:
    @staticmethod
    def write_xml(xml_path, duration, evc_sw_version):
        xml_path = os.path.abspath(xml_path)
        if os.path.isfile(xml_path) and xml_path.endswith(".xml"):
            # parse XML into etree
            root = etree.parse(xml_path).getroot()
            # add tags
            duration_tag = etree.SubElement(root, "Duration")
            duration_tag.text = duration
            sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
            sw_version_tag.text = evc_sw_version
            stac_release = evc_sw_version
            stac_release_tag = etree.SubElement(root, "STAC-Release")
            stac_release_tag.text = stac_release
            # write changes to the XML-file
            tree = etree.ElementTree(root)
            tree.write(xml_path, pretty_print=False)
        else:
            XmlManager.logger.log("Invalid path to XML-file")
def main():
    xml = r".\Test_Input_Data_Base\blnmerf1_md1czjyc_REL_V_08.01.0001.000x\Test_startup_0029\Test_startup_0029.xml"
    XmlManager.write_xml(xml, "12", "08.02.0001.0027")
My question is how to add the new tags to the XML in the right format. I guess the changed XML can still be parsed this way, but it is not nicely formatted. Any ideas? Thanks in advance.
To ensure nice pretty-printed output, you need to do two things:
1. Parse the input file using an XMLParser object with remove_blank_text=True.
2. Write the output using pretty_print=True.
Example:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("Test_startup_0029.xml", parser)
root = tree.getroot()
duration_tag = etree.SubElement(root, "Duration")
duration_tag.text = "12"
sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
sw_version_tag.text = "08.02.0001.0027"
stac_release_tag = etree.SubElement(root, "STAC-Release")
stac_release_tag.text = "08.02.0001.0027"
tree.write("output.xml", pretty_print=True)
Contents of output.xml:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration>
<EVC-SW-Version>08.02.0001.0027</EVC-SW-Version>
<STAC-Release>08.02.0001.0027</STAC-Release>
</Scenario>
See also http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output.
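The underlying cause is that the whitespace between the original tags is kept as text, while the appended elements have none, so they end up on one line. As an alternative sketch outside lxml, the standard library's ElementTree offers ET.indent (Python 3.9+), which rewrites that whitespace in place. The shortened Scenario document below is only for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Scenario>\n  <TestCase>t</TestCase>\n</Scenario>")
ET.SubElement(root, "Duration").text = "12"

# Without re-indentation the new tag is glued to the closing root tag.
before = ET.tostring(root, encoding="unicode")

# ET.indent replaces whitespace-only text and tails throughout the tree.
ET.indent(root, space="  ")
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```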

lxml, get xml between elements

given this sample xml:
<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>
I need to parse it and get all the content between the pb elements, saving each part into a separate external file.
expected result:
$ cat id1
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
$ cat id2
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
What is the correct XPath axis to use?
from lxml import etree
xml = etree.parse("sample.xml")
for pb in xml.xpath('//pb'):
    filename = pb.xpath('@facs')[0]
    f = open(filename, 'w')
    content = **{{ HOW TO GET THE CONTENT HERE? }}**
    f.write(content)
    f.close()
Is there any XPath expression to get all descendants and stop when a new pb is reached?
Do you want to extract the tags between two pb elements? Strictly speaking, nothing is inside pb: because each pb is self-closed, the tags that follow it are its siblings at the same level, not its children. If you closed the pb tag after the test tag instead, test would become a child of pb.
In other words if your xml is like this:
<xml>
<pb facs="id1">
<test></test>
</pb>
<test></test>
<pb facs="id2" />
<test></test>
<test></test>
</xml>
Then you can use
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    for subchild in child:
        print(subchild)
to print the subchild ('test') that has pb as its parent.
If that's not the case (you just want to extract the attributes of the pb tag), then you can use either of the two methods shown below.
With Python's built-in ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
With the lxml library you can parse it like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
OK, I tested this code:
lists = []
for node in tree.findall('*'):
    if node.tag == 'pb':
        lists.append([])
    else:
        lists[-1].append(node)
Output:
>>> lists
[[<Element test at 2967fa8>, <Element test at 2a89030>, <Element lot-of-xml at 2a89080>], [<Element test at 2a89170>, <Element test at 2a891c0>, <Element lot-of-xml at 2a89210>]]
Input file (just in case):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>
<pb facs="id1" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
</xml>
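Building on the grouping idea above, the following sketch serializes each group and keys it by the facs value of the preceding pb, which is what the question asked to save into separate files. It uses the standard library's ElementTree on the sample from the question; the helper name split_on_pb is hypothetical, and the file writing is left as a comment.

```python
import xml.etree.ElementTree as ET

sample = """<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>"""


def split_on_pb(root):
    """Map each pb's facs value to the serialized siblings that follow it."""
    groups = {}
    current = None
    for node in root:
        if node.tag == "pb":
            # A new pb starts a new group named after its facs attribute.
            current = groups.setdefault(node.get("facs"), [])
        elif current is not None:
            # short_empty_elements=False keeps <aa></aa> instead of <aa />.
            current.append(
                ET.tostring(node, encoding="unicode", short_empty_elements=False)
            )
    return {name: "".join(parts) for name, parts in groups.items()}


groups = split_on_pb(ET.fromstring(sample))
for name, content in groups.items():
    print(name, repr(content))
    # To save each group to its own file:
    # with open(name, "w") as f:
    #     f.write(content)
```

Elements that appear before the first pb are skipped here; whether they should be kept is a design choice that depends on the real data.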
