ElementTree parse xml file - problem with parsing - python

I have a problem parsing data from an XML file. I'm using xml.etree.ElementTree to extract data from files and then save it to .csv. I have all the necessary modules installed on the server.
I am aware that there is the bs4 module with BeautifulSoup, yet I would like to know whether it is possible to parse this XML file using ElementTree. Sorry if the answer is easy or obvious; I'm still very much a beginner, and I could not name the problem well enough to search for an answer.
When I run the Python script below I get no errors and no output. I don't know what I should change, and I cannot find a solution. I tried using different child.tag values or attributes, but with no result.
The XML file that I have a problem with:
<?xml version="1.0" encoding="utf-8"?>
<offer file_format="IOF" version="2.6" extensions="yes" xmlns="http://www.iai-shop.com/developers/iof.phtml">
<product id="9" vat="23.0" code_on_card="BHA">
<producer id="1308137276" name="BEAL"/>
...
<price gross="175" net="142.28"/>
<sizes>
<size code_producer="3700288265272" code="9-uniw" weight="0">
<stock id="0" quantity="-1"/>
<stock id="1" quantity="4"/>
</size>
</sizes>
</product>
<product>
...
</product>
...
And the script that I tried to use (here to extract code_on_card, price net, quantity).
(I am aware that there are two stock children, each with a quantity, and I'm completely fine with the second one overwriting the first one.)
import requests
import os,sys
import csv
import xml.etree.ElementTree as ET
reload(sys)
sys.setdefaultencoding('utf-8')
xml_path = '/file.xml'
xml = ET.parse(xml_path)
with open('/home/file.csv', 'wb') as f:
    c = csv.writer(f, delimiter=';')
    for product in xml.iter('product'):
        product_id = product.attrib["code_on_card"]
        for child in product:
            if child.tag == 'price':
                if child.attrib["net"] != None:
                    hurt_net = child.attrib["net"]
        for size in product.iter('size'):
            for stock in size.iter('stock'):
                if 'quantity' in stock.attrib.keys():
                    quantity = stock.attrib["quantity"]
        line = product_id, hurt_net, quantity
        c.writerow(line)
Files that seem to me to be built on a similar scheme (offer -> product -> child/attrib) work just fine, like this one:
<?xml version="1.0" encoding="UTF-8"?>
<offer file_format="IOF" version="2.5">
<product id="2">
<price gross="0.00" net="0.00" vat="23.0"/>
<srp gross="0.00" net="0" vat="23.0"/>
<sizes>
<size id="0" code="2-0" weight="0" >
</size>
</sizes>
</product>
...
</product>
...
EDIT:
The outcome should be a .csv file containing multiple rows (one for each product in the XML file) of code_on_card, price net, quantity. It should look like:
BC097B.50GD.O;70.81;37
BC097B.50.A;76.75;24
BC086C.50.B;76.75;29
BGRT.L;3;96.75;28
....
EDIT2:
The code as it is now, after drec4s's answer:
import requests
import os,sys
import csv
import xml.etree.cElementTree as ET
reload(sys)
sys.setdefaultencoding('utf-8')
xml_path = '/home/platne/serwer16373/dane/z_hurtowni/pobrane/beal2.xml'
root = ET.parse(xml_path)
ns = {'offer': 'http://www.iai-shop.com/developers/iof.phtml'}
products = root.getchildren()
with open('/home/platne/serwer16373/dane/z_hurtowni/stany_magazynowe/karol/bealKa.csv', 'wb') as f:
    c = csv.writer(f, delimiter=';')
    hurtownia = 'beal'
    for product in root.iter('product'):
        qtt = [1]
        code = product.get('code_on_card')
        hurt_net = product.find('price').get('net')
        for stock in product.find('sizes').find('size').getchildren():
            qtt.append(stock.get('quantity'))
        quantity = max(qtt)
        line = 'beal-'+str(code), hurt_net, quantity
        c.writerow(line)
Somehow I'm getting:
AttributeError: 'ElementTree' object has no attribute 'getchildren'
I've got Ele

This is how I would parse an XML file with namespaces. As per the official documentation, the easiest way is to define a dictionary that maps a prefix to the namespace URI.
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
root = ET.fromstring("""
<offer file_format="IOF" version="2.6" extensions="yes" xmlns="http://www.iai-shop.com/developers/iof.phtml">
<product id="9" vat="23.0" code_on_card="BHA">
<producer id="1308137276" name="BEAL"/>
<price gross="175" net="142.28"/>
<sizes>
<size code_producer="3700288265272" code="9-uniw" weight="0">
<stock id="0" quantity="-1"/>
<stock id="1" quantity="4"/>
</size>
</sizes>
</product>
</offer>
""")
ns = {'offer': 'http://www.iai-shop.com/developers/iof.phtml'}
# getchildren() was removed in Python 3.9; query the namespaced children instead
products = root.findall('offer:product', ns)
for p in products:
    qtt = []  # to store all stock quantities
    product_id = p.get('code_on_card')
    hurt_net = p.find('offer:price', ns).get('net')
    # iterating over an element yields its direct children
    for stock in p.find('offer:sizes', ns).find('offer:size', ns):
        qtt.append(int(stock.get('quantity')))
    quantity = max(qtt)  # or sum
    line = (product_id, hurt_net, quantity)
    print(line)
Outputs:
('BHA', '142.28', 4)
Also, I did not understand which stock quantity you needed to extract, since you were only keeping the value of the last stock child (change the max function to sum or to whatever you need).
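For the file-based case in the question, two details seem to matter. First, ET.parse() returns an ElementTree rather than an Element, so getroot() is needed before walking children, which would explain the AttributeError in EDIT2. Second, because of the default xmlns, every tag is really {http://www.iai-shop.com/developers/iof.phtml}product, so a plain iter('product') matches nothing. Below is a minimal sketch, assuming Python 3. The in-memory sample and the helper name products_to_rows are made up for illustration and stand in for the real file paths:

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {'offer': 'http://www.iai-shop.com/developers/iof.phtml'}

def products_to_rows(xml_source):
    """Yield (code_on_card, net price, max stock quantity) for each product."""
    # parse() returns an ElementTree; getroot() gives the <offer> Element.
    root = ET.parse(xml_source).getroot()
    for product in root.findall('offer:product', NS):
        code = product.get('code_on_card')
        price = product.find('offer:price', NS)
        net = price.get('net') if price is not None else ''
        quantities = [int(s.get('quantity'))
                      for s in product.iterfind('.//offer:stock', NS)
                      if s.get('quantity') is not None]
        yield code, net, max(quantities, default=0)

# Hypothetical stand-in for the real XML file, mirroring the question's sample.
sample = io.BytesIO(b"""<?xml version="1.0" encoding="utf-8"?>
<offer xmlns="http://www.iai-shop.com/developers/iof.phtml">
  <product id="9" code_on_card="BHA">
    <price gross="175" net="142.28"/>
    <sizes><size code="9-uniw">
      <stock id="0" quantity="-1"/>
      <stock id="1" quantity="4"/>
    </size></sizes>
  </product>
</offer>""")

out = io.StringIO()  # for a real file: open(path, 'w', newline='')
csv.writer(out, delimiter=';').writerows(products_to_rows(sample))
csv_text = out.getvalue()
print(csv_text)
```

Swapping max for sum, or for the last list item, changes which stock quantity ends up in the row.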

Related

lxml is not reading the XML opening and closing tags when they are in different lines

I am using the lxml package to read tags and attribute values from XML. It reads the values when the opening and closing tags are on one line, but not when they are on different lines.
In the XML below, the price tags <price> and </price> are on the same line, and price appears in the output:
a.xml
<catalog>
<product description="Cardigan Sweater" product_image="cardigan.jpg">
<catalog_item gender="Men's">
<cool_number>QWZ5671</cool_number>
<price></price>
</catalog_item>
</product>
</catalog>
Output:
[{'gender': ["Men's"], 'cool_number': ['QWZ5671'], 'price': ['None']}]
But if the price tags are on different lines, price does not appear in the output:
a.xml
<catalog>
<product description="Cardigan Sweater" product_image="cardigan.jpg">
<catalog_item gender="Men's">
<cool_number>QWZ5671</cool_number>
<price>
</price>
</catalog_item>
</product>
</catalog>
Output:
[{'gender': ["Men's"], 'cool_number': ['QWZ5671']}]
The code is the same for both XML files:
from lxml import etree
from collections import defaultdict
root_1 = etree.parse('a.xml').getroot()
d1= []
for node in root_1.findall('.//catalog_item'):
    item = defaultdict(list)
    for x in node.iter():
        # iterate over the items
        for k, v in x.attrib.items():
            item[k].append(v)
        if x.attrib is None:
            item[x.attrib].append('None')
        if x.text is None:
            item[x.tag].append('None')
        elif x.text.strip():
            item[x.tag].append(x.text.strip())
    d1.append(dict(item))
print(d1)
Any idea why the price tag is missing from the output when its opening and closing tags are on different lines?
How can I fix this?
Your issue is with this condition:
if x.text is None:
    item[x.tag].append('None')
You are checking whether the tag contains no text at all. That is the case for <price></price>, because the closing tag follows the opening tag immediately. Here, however,
...
<price>
</price>
...
your tag does contain text: a newline and some whitespace characters. To fix this, change your condition from if x.text is None: to something like if not x.text or not x.text.strip():
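A minimal sketch of that fix, using the standard-library xml.etree.ElementTree in place of lxml so it runs without extra packages. The helper name item_to_dict is made up for illustration. One assumption goes beyond the answer: 'None' is recorded only for leaf elements, because otherwise the whitespace-only text of catalog_item itself would also produce a 'None' entry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def item_to_dict(node):
    """Collect attribute values and stripped text for node and its descendants."""
    item = defaultdict(list)
    for x in node.iter():
        for k, v in x.attrib.items():
            item[k].append(v)
        text = (x.text or '').strip()  # None and whitespace-only both become ''
        if text:
            item[x.tag].append(text)
        elif len(x) == 0:
            # Leaf element with no real text (assumption: containers are skipped).
            item[x.tag].append('None')
    return dict(item)

xml_text = """<catalog>
  <product description="Cardigan Sweater" product_image="cardigan.jpg">
    <catalog_item gender="Men's">
      <cool_number>QWZ5671</cool_number>
      <price>
      </price>
    </catalog_item>
  </product>
</catalog>"""

root = ET.fromstring(xml_text)
result = [item_to_dict(n) for n in root.findall('.//catalog_item')]
print(result)
```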

Extracting comments from XML file in Python

I would like to extract the comment section of an XML file. The information I want is inside the Tag element, within its Text tag; here it is "EXAMPLE".
The structure of the XML file is shown below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_cooments(xml):
    tree = lxml.etree.parse(xml)
    Comments= {}
    for comment in tree.xpath("//Boxes/Box"):
        #
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            #
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help to extract comments from XML file shown above? Any help is appreciated!
Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local variable of that function and is not visible from outside. A better and more readable approach (as I did) is for the function to return this DataFrame.
This function also contains continue in two places, to guard against possible "error cases" where either the Box element does not contain a Tag child or Tag does not contain any Text child element.
I also noticed that there is no need to replace &lt; or &gt; with < or > in my own code, as lxml performs it on its own.
Edit
My test is as follows. Start from the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
  id      Comment
0  3  **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.
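The two-step idea behind the function (read the Tag text, which the parser has already unescaped, then parse that text as a second XML document) can also be sketched with the standard library alone. This is only an illustration under assumptions: it uses xml.etree.ElementTree instead of lxml, returns plain tuples instead of a DataFrame, and the second Box and the name read_comment_rows are made up to show the guard.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the question; the inner document is stored as escaped text.
outer = """<Boxes>
  <Box Id="3" ZIndex="13">
    <Shape>Rectangle</Shape>
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
  <Box Id="4"><Shape>Ellipse</Shape></Box>
</Boxes>"""

def read_comment_rows(xml_text):
    """Return (box id, comment text) pairs for boxes that carry a comment."""
    rows = []
    for box in ET.fromstring(xml_text).findall('Box'):
        tag_text = box.findtext('Tag')
        if not tag_text or not tag_text.strip():
            continue  # no Tag child, or an empty one
        # The unescaped Tag text is itself an XML document; parse it separately.
        inner = ET.fromstring(tag_text.strip())
        comment = inner.findtext('Text')
        if comment is None:
            continue  # no Text child inside the embedded document
        rows.append((box.get('Id'), comment.strip()))
    return rows

rows = read_comment_rows(outer)
print(rows)
```

The resulting list can be passed to pd.DataFrame(rows, columns=['id', 'Comment']) if a DataFrame is wanted.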

How to convert XML to CSV when attributes are inside the root tag?

I need to convert an XML file to a CSV file, where the attributes are inside the root tag.
I've checked many links but didn't find any example where XML is converted to CSV by parsing the root tag's attributes.
XML Sample:
<Products>
<Product ProductID="1234" ProductName="ABC" Category="Food" />
<Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>
Python Code (getting NoneType error):
ProductID = member.find('Product').attrib.get('ProductID')
Product_nodes.append(ProductID)
Expected Results in CSV file:
ProductID, ProductName, Category
1234, ABC,Food
1236, ABE,Healthcare
Using the csv and lxml libraries:
import csv
from lxml import etree
xml_content = """<Products>
<Product ProductID="1234" ProductName="ABC" Category="Food" />
<Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>"""
outfile = "products.csv"
field_names = ["ProductID", "ProductName", "Category"]
root_node = etree.fromstring(xml_content)
filehandle = open(outfile, "w", newline="")
writer = csv.DictWriter(filehandle, fieldnames=field_names)
writer.writeheader()
for node in root_node.findall("Product"):
    writer.writerow(dict(node.attrib))
filehandle.close()
If you've got access to XQuery or XPath 2.0+, it's simply
string-join(//Product/string-join((@ProductID, @ProductName, @Category),','),'
')
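For readers without an XPath 2.0 processor, the same nested join can be sketched in plain Python: one comma join per Product, then a newline join across products. This is an illustrative equivalent using the standard-library ElementTree, not part of the answer above, and it does no CSV quoting.

```python
import xml.etree.ElementTree as ET

xml_content = """<Products>
    <Product ProductID="1234" ProductName="ABC" Category="Food" />
    <Product ProductID="1236" ProductName="ABE" Category="Healthcare" />
</Products>"""

fields = ("ProductID", "ProductName", "Category")
root = ET.fromstring(xml_content)

# Inner join: attributes of one Product; outer join: one line per Product.
csv_body = "\n".join(
    ",".join(node.get(name, "") for name in fields)
    for node in root.iter("Product")
)
print(csv_body)
```

If attribute values may contain commas or quotes, the csv module version shown earlier is the safer choice.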

How to properly use xmlfile api from lxml

I have a large (5+ gigs) XML file which I need to parse, do some operation & write a new XML file.
dummy.xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="5678">
<available-flag>false</available-flag>
<name>product1</name>
</product>
<product product-id="9999">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
As you can see, the XML above has 3 product tags, and I need to filter them by product-id against a pre-defined list of ids.
I am using lxml iterparse to parse the XML iteratively and want to use the xmlfile API to create a new XML file incrementally, to keep the memory footprint low. So my goal is to filter out the product tags that don't meet the criteria and copy the rest of the XML tags as they are.
from lxml import etree
f = './dummy.xml'
f1 = './test.xml'
context = etree.iterparse(f, events=('start',))
productsToExport = ['1234']
with etree.xmlfile(f1, encoding='utf-8') as xf:
    xf.write_declaration()
    with xf.element('catalog xmlns="http://www.namespace.com" catalog-id="test-catalog"'):
        for event, element in context:
            tagName = etree.QName(element.tag).localname
            if (tagName == 'product'):
                pid = element.get('product-id')
                if (pid in productsToExport):
                    xf.write(element)
            elif (tagName == 'header'):
                xf.write(element) # copy existing header tag as it is
The above code works OK and generates the XML below:
<?xml version='1.0' encoding='utf-8'?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header xmlns="http://www.namespace.com">
<name>Product Catalog</name>
</header>
<product xmlns="http://www.namespace.com" product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
If you look at the XML above, it has a few issues:
The closing <catalog> tag has xmlns and catalog-id present in it
All tags like header and product have the xmlns attribute present in them
I checked the xmlfile API documentation but couldn't find a way to fix the above issues.
EDIT:
I managed to resolve the 1st issue by using the code below:
attribs = {'xmlns' : 'http://www.namespace.com', 'catalog-id' : 'test-catalog'}
with xf.element('catalog', attribs):
    # previous logic
So now I'm left with removing the namespace from every element.
Consider simply rebuilding the XML tree with lxml.etree methods instead of the xmlfile API, still in the context of your iterparse:
from lxml import etree
f = './dummy.xml'
f1 = './test.xml'
productsToExport = ['1234']
# ROOT ELEMENT WITH DEFAULT NAMESPACE (built once, before the loop)
my_nmsp = {None: 'http://www.namespace.com'}
root = etree.Element('catalog', nsmap=my_nmsp)
root.text = '\n\t'
root.attrib['catalog-id'] = "test-catalog"
# INITIALIZE ITERATOR ('end' events, so each element is complete when handled)
context = etree.iterparse(f, events=('end',))
for event, element in context:
    tagName = etree.QName(element.tag).localname
    # PRODUCT ELEMENT
    if tagName == 'product':
        pid = element.get('product-id')
        if pid in productsToExport:
            root.append(element)
    # HEADER ELEMENT
    elif tagName == 'header':
        root.append(element)
# OUTPUT TREE TO FILE (once, after all elements have been collected)
with open(f1, 'wb') as out:
    out.write(etree.tostring(root, pretty_print=True))
Output
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
<header>
<name>Product Catalog</name>
</header>
<product product-id="1234">
<available-flag>false</available-flag>
<name>product1</name>
</product>
</catalog>
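Because the rebuilt tree is held in memory, it may not suit a 5+ GB input. As a rough alternative, here is a streaming sketch that declares the namespace once on the root and writes each kept child without its own xmlns. It is written under assumptions and is not the lxml xmlfile API: it uses the standard-library ElementTree.iterparse, the names filter_catalog and local are made up, it assumes the document uses the single default namespace shown in the question, and it has not been measured on large files.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

NS = "http://www.namespace.com"  # assumed to be the only namespace in the file

def local(tag):
    """Drop a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def filter_catalog(source, sink, wanted_ids):
    """Stream source; copy the header and the wanted products to sink."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
                # Write the root's opening tag once; xmlns is declared only here.
                attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
                sink.write(f'<{local(elem.tag)} xmlns="{NS}"{attrs}>\n')
            continue
        depth -= 1
        if depth == 0:
            sink.write(f"</{local(elem.tag)}>\n")  # plain closing tag
        elif depth == 1:
            name = local(elem.tag)
            keep = name == "header" or (
                name == "product" and elem.get("product-id") in wanted_ids)
            if keep:
                for node in elem.iter():
                    node.tag = local(node.tag)  # children inherit the root's xmlns
                elem.tail = None
                sink.write(ET.tostring(elem, encoding="unicode") + "\n")
            root.clear()  # drop children that have already been handled

dummy = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.namespace.com" catalog-id="test-catalog">
  <header><name>Product Catalog</name></header>
  <product product-id="1234"><name>product1</name></product>
  <product product-id="5678"><name>product1</name></product>
</catalog>"""

out = io.StringIO()  # for real files: open('./test.xml', 'w', encoding='utf-8')
filter_catalog(io.BytesIO(dummy), out, {"1234"})
result = out.getvalue()
print(result)
```

Stripping the namespace from the tags before serialising is what keeps xmlns off the header and product elements; on re-parsing, they still fall into the default namespace declared on catalog.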

python extract xml element value to csv

I'm new to Python, so please bear with me as I try to explain what I am trying to do.
Here is my XML:
<?xml version="1.0"?>
<playlist>
<list>
<txdate>2015-10-30</txdate>
<channel>cake</channel>
<name>Play List</name>
</list>
<eventlist>
<event type="MEDIA">
<title>title1</title>
<starttype>FIX</starttype>
<mediaid>a</mediaid>
<onairtime>2015-10-30T13:30:00:00</onairtime>
<som>00:00:40:03</som>
<duration>01:15:47:15</duration>
<reconcilekey>123</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>a</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
<event type="MEDIA">
<title>THREE DAYS AND A CHILD</title>
<mediaid>b</mediaid>
<onairtime>2015-10-30T14:45:47:15</onairtime>
<som>00:00:00:00</som>
<duration>01:19:41:07</duration>
<reconcilekey>321</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>b</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
</eventlist>
</playlist>
I would like to print all the mediaid values to a file.
This is my code so far:
import os
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
wfile = 'new.csv'
for child in root:
    child.find( "media type" )
    for x in child.iter("mediaid"):
        file = open(wfile, 'a')
        file.write(str(x))
        file.close
I tried this with a few other nonstandard libraries but I didn't have much success
For your requirement (as mentioned in the comments):
just the mediaid from each <event type="MEDIA">
you should use the findall() method of ElementTree to get all the event elements with type="MEDIA", and then get the child mediaid element from each of them. Example:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
with open('new.csv','w') as outfile:
    for elem in root.findall('.//event[@type="MEDIA"]'):
        mediaidelem = elem.find('./mediaid')
        if mediaidelem is not None:
            outfile.write("{}\n".format(mediaidelem.text))
