python get xml element by path - python

I am trying to walk through a large XML file and collect some data. Since the location of the data can be found by its path, I used XPath, but I get no results.
Could someone suggest what I am doing wrong?
Example of the xml:
<?xml version="1.0" encoding="UTF-8"?>
<rootnode>
<subnode1>
</subnode1>
<subnode2>
</subnode2>
<subnode3>
<listnode>
<item id="1"><name>test name1</name></item>
<item id="2"><name>test name2</name></item>
<item id="3"><name>test name3</name></item>
</listnode>
</subnode3>
</rootnode>
The code:
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('./rootnode/subnode3/listnode')
for next_item in subtree:
    Id = next_item.attrib.get('id')
    name = next_item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))

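You are pretty close. When xpath() is called on an lxml ElementTree, a relative path such as ./rootnode/... is evaluated against the root element, so it looks for a rootnode child inside rootnode and finds nothing. Use an absolute path starting with /rootnode instead. Also, id and name belong to the item elements, not to listnode, so you need to loop over the items.
Example: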
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('/rootnode/subnode3/listnode')
for next_item in subtree:
    for item in next_item.findall('item'):
        Id = item.attrib.get('id')
        name = item.find('name').text
        print('{:>20} - {:>20}'.format(name,Id))
Or select the item elements directly:
subtree = tree.xpath('/rootnode/subnode3/listnode/item')
for item in subtree:
    Id = item.attrib.get('id')
    name = item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))
Output:
          test name1 -                    1
          test name2 -                    2
          test name3 -                    3
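
The same path rule can be sketched with the standard library's xml.etree.ElementTree instead of lxml (a swapped-in parser, chosen only so the snippet runs without extra packages). The inline XML string is a shortened copy of the question's file, and the names wrong, right and pairs are illustrative. ElementTree paths are relative to the element they are called on, so repeating the root's own name in the path finds nothing:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (assumption: same structure).
XML = """<rootnode>
  <subnode3>
    <listnode>
      <item id="1"><name>test name1</name></item>
      <item id="2"><name>test name2</name></item>
    </listnode>
  </subnode3>
</rootnode>"""

root = ET.fromstring(XML)

# The path starts AT <rootnode>, so this looks for a <rootnode> child of it.
wrong = root.findall('./rootnode/subnode3/listnode/item')

# Dropping the root's own name from the path reaches the items.
right = root.findall('./subnode3/listnode/item')

pairs = [(item.get('id'), item.findtext('name')) for item in right]
print(len(wrong), pairs)
```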

Related

How to parse and get elements of an XML using a Python data frame

This is my XML string. I receive it as a message, so it is not a file.
<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>
I want to get the output in the format below:
message,code,Id
I have mentioned only three elements, but there can be many more.
This is what I have tried, but I am not getting the expected output.
I have just started learning Python, so please excuse any silly mistakes.
from __future__ import print_function
import pandas as pd
def lambda_handler():
    import xml.etree.ElementTree as et
    xtree = et.parse('''<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>''')
    xroot = xtree.getroot()
    df_cols = ["message", "code", "Id"]
    rows = []
    for node in xroot:
        s_name = node.attrib.get("message")
        s_mail = node.find("code").text if node is not None else None
        s_grade = node.find("Id").text if node is not None else None
lambda_handler()
You can try using XPath; it makes it easier to retrieve the data you want:
import xml.etree.ElementTree as et
import pandas as pd
xtree = et.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>""")
keys = ["message", "code", "Id"]
data = {k: [xtree.find(".//"+k).text] for k in keys}
print(pd.DataFrame(data))
# Outputs:
# message code Id
# 0 5jb10x5rf7sp1fov5msgoof7r COMPLETED dfkjlhgd98568y
Is this the output you desire?
# !pip install xmltodict
import xmltodict
xml = """
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>
"""
d = xmltodict.parse(xml)
print(d['name']['message'])
print(d['name']['code'])
print(d['name']['Id'])
Output
5jb10x5rf7sp1fov5msgoof7r
COMPLETED
dfkjlhgd98568y
More info on xmltodict at https://github.com/martinblech/xmltodict
Given your string:
your_string='''\
<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>'''
Since this is a string, you would use .fromstring() rather than .parse(). It returns the root element directly (i.e., there is no need to call .getroot()):
>>> root = et.fromstring(your_string)
>>> root
<Element 'name' at 0x1050f51d0>
Once you have the data structure with name as the root, you can either iterate over the sub elements:
df_cols = ["message", "code", "Id"]
for node in root:
    if node.tag in df_cols:
        print({node.tag:node.text})
Prints:
{'message': '5jb10x5rf7sp1fov5msgoof7r'}
{'code': 'COMPLETED'}
{'Id': 'dfkjlhgd98568y'}
Or you can use an xpath query to find each element of interest:
for k in df_cols:
    print({k:root.find(f'./{k}').text})
# same output
Since a data frame can be constructed from a dict of the form {key:[list_of_elements],...}, you can build that kind of dict from what we have here:
df=pd.DataFrame({k:[root.find(f'./{k}').text] for k in df_cols})
If an element occurs several times, use findall (every column must then yield the same number of values, or the DataFrame constructor raises a ValueError):
df=pd.DataFrame({k:[x.text for x in root.findall(f'./{k}')] for k in df_cols})
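
When only one element repeats (such as groupId in the question's original message) the column lists differ in length. One possible way around that, sketched here with the standard library only, is to build one row per repeated element and copy the single-valued fields into each row. The choice of columns and the names MESSAGE and rows are assumptions for illustration; a list of dicts like this can be handed to pd.DataFrame(rows).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the message from the question.
MESSAGE = """<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<userDetails>
<clientId>client_1</clientId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>"""

root = ET.fromstring(MESSAGE)

# Values that occur once are looked up a single time.
event_id = root.findtext('EventId')
client_id = root.findtext('userDetails/clientId')

# One row per repeated <groupId>, repeating the single-valued fields.
rows = [
    {'EventId': event_id, 'clientId': client_id, 'groupId': group.text}
    for group in root.iter('groupId')
]

for row in rows:
    print(row)
```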

Fetching elements from XML and inserting them into a Postgres DB

I have an XML file like the one below, and I need to insert its data into a PostgreSQL DB. Below are the sample XML and the code I use, but I am not getting any output. Can someone please guide me on how to fetch these XML values effectively?
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
<item>
.
.
.
</item>
Below is the script I use (Python 3.5, Postgres 11):
# import modules
import sys
import psycopg2
import datetime
now = datetime.datetime.now()
# current data and time
dt = now.strftime("%Y%m%dT%H%M%S")
# xml tree access
#from xml.etree import ElementTree
import xml.etree.ElementTree as ET
# incremental variable
x = 0
with open('/Users/admin/documents/shopping.xml', 'rt',encoding="utf8") as f:
    #tree = ElementTree.parse(f)
    tree = ET.parse(f)
# connection to postgreSQL database
try:
    conn=psycopg2.connect(host='localhost', database='postgres',
                          user='postgres', password='postgres',port='5432')
except:
    print ("Hey I am unable to connect to the database.")
cur = conn.cursor()
# access the xml tree element nodes
try:
    for node in tree.findall('.//item'):
        src = node.find('id')
        tgt = node.find('mpn')
        print(node)
except:
    print ("Oops I can't insert record into database table!")
conn.commit()
conn.close()
The current output I am getting is:
None
None
None
Expected output:
id title description gtin ......
20 product 1 g:description xxxx .....
It is strange that you can't find item. Perhaps you are reading the wrong file and it contains no item elements.
Using your XML data as a string with ET.fromstring(), I have no problem finding item.
Try print( f.read() ) to check what you really read from the file.
The only problem is id and mpn: they use the g: namespace prefix, so find() needs more than just 'g:id' and 'g:mpn'. Pass a namespace mapping as well:
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Or use the full namespace URI directly, as '{http://base.google.com/ns/1.0}id' and '{http://base.google.com/ns/1.0}mpn':
tree = ET.fromstring(xml)
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
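
A further option, not part of the answer above, is the namespace wildcard that xml.etree.ElementTree accepts from Python 3.8 onward: '{*}id' matches an id element in any namespace. The sketch below assumes Python 3.8 or later and uses a cut-down copy of the feed; the names plain and wild are illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the feed from the question.
XML = """<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<item>
<g:id>20</g:id>
<g:mpn>0014</g:mpn>
</item>
</channel>
</rss>"""

root = ET.fromstring(XML)
item = root.find('.//item')

# The real tag is '{http://base.google.com/ns/1.0}id', so a bare 'id' finds nothing.
plain = item.find('id')

# '{*}id' matches <id> in any namespace (requires Python 3.8+).
wild = item.find('{*}id')
mpn = item.findtext('{*}mpn')

print(plain, wild.text, mpn)
```

The wildcard is convenient for quick scripts, but an explicit namespace mapping is safer when a document mixes several namespaces that reuse the same local names.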
Minimal working code:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Result:
Node: <Element 'item' at 0x7f74ba45b710>
src: 20
tgt: 0014
BTW: It works even when I use io.StringIO to simulate a file:
f = io.StringIO(xml)
tree = ET.parse(f)
Minimal working code:
import xml.etree.ElementTree as ET
import io
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
f = io.StringIO(xml)
tree = ET.parse(f)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('mpn:', tgt.text)
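
To get closer to the expected output in the question (one record per item with id, title and so on), one possible sketch is to turn every item into a dict keyed by the tag name without its namespace. The helper local_name and the variable rows are hypothetical names introduced here, and nested blocks such as g:shipping are simply skipped, which is an assumption about what the table should contain. A dict like this could then serve as the parameters of a parameterized INSERT.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the feed from the question.
XML = """<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>"""


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an element tag (hypothetical helper)."""
    return tag.split('}', 1)[-1]


root = ET.fromstring(XML)
rows = []
for item in root.iter('item'):
    # Keep only leaf children; elements with children (g:shipping) are skipped.
    row = {local_name(child.tag): child.text for child in item if len(child) == 0}
    rows.append(row)

print(rows)
```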

XML parser returns NoneType

I am trying to parse the XML below using ElementTree in Python, but I get None for "member", and using .text raises an AttributeError.
<address-group>
<entry name="TBR">
<static>
<member>TBR1-1.1.1.1_21</member>
<member>TBR2-2.2.2.2_24</member>
<member>TBR3-3.3.3.3_21</member>
<member>TBR4-4.4.4.4_24</member>
</static>
</entry>
</address-group>
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("addrgrp.xml")
root = tree.getroot()
tag = root.tag
print (tag)
attr = root.attrib
for entries in root.findall("entry"):
    name = entries.get('name')
    print (name)
    ip = entries.find('static')
    print (ip)
    for mem in ip.findall('member'):
        member = mem.find('member')
        print (member)
The code below aggregates the members of each entry by entry name:
import xml.etree.ElementTree as ET
import pprint
XML = '''
<address-group>
<entry name="TBR1">
<static>
<member>TBR1-1.1.1.1_21</member>
<member>TBR2-2.2.2.2_24</member>
<member>TBR3-3.3.3.3_21</member>
<member>TBR4-4.4.4.4_24</member>
</static>
</entry>
<entry name="TBR2">
<static>
<member>TBR1-4.1.1.1_21</member>
<member>TBR2-4.2.2.2_24</member>
<member>TBR3-4.3.3.3_21</member>
<member>TBR4-9.4.4.4_24</member>
</static>
</entry>
</address-group>'''
root = ET.fromstring(XML)
data_by_entry = {}
entries = root.findall('.//entry')
for entry in entries:
    data_by_entry[entry.attrib['name']] = [m.text for m in entry.findall('./static/member')]
pprint.pprint(data_by_entry)
output
{'TBR1': ['TBR1-1.1.1.1_21',
'TBR2-2.2.2.2_24',
'TBR3-3.3.3.3_21',
'TBR4-4.4.4.4_24'],
'TBR2': ['TBR1-4.1.1.1_21',
'TBR2-4.2.2.2_24',
'TBR3-4.3.3.3_21',
'TBR4-9.4.4.4_24']}
The source of your problem is that within the for mem in ip.findall('member'): loop, mem is already the current member element, but the first instruction in the loop is member = mem.find('member'). You therefore try to find another (nested) member inside the current member, which doesn't exist, so find() returns None.
Another flaw in your code is that there is no point in printing a node which does not have any text (print(ip) only shows the element object).
Change your loop to the code below:
for entries in root.findall('entry'):
    name = entries.get('name')
    print(name)
    ip = entries.find('static')
    print('Members:')
    for mem in ip.findall('member'):
        print(mem.text)
and you will get a meaningful result.
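
The AttributeError from the question comes from calling .text on the None that find() returns. A small sketch of two ways to guard against that, using a shortened copy of the XML; the variable names and the '(missing)' default are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question.
XML = """<address-group>
<entry name="TBR">
<static>
<member>TBR1-1.1.1.1_21</member>
<member>TBR2-2.2.2.2_24</member>
</static>
</entry>
</address-group>"""

root = ET.fromstring(XML)
first_member = root.find('entry/static/member')

# A <member> has no <member> child, so find() returns None here.
nested = first_member.find('member')

# Option 1: check for None before touching .text.
nested_text = nested.text if nested is not None else None

# Option 2: findtext() returns the supplied default when nothing matches.
fallback = first_member.findtext('member', default='(missing)')

print(first_member.text, nested, nested_text, fallback)
```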

Python XML Element Tree finding the value of an XML tag

I'm trying to retrieve the value of a particular tag in an XML file. The problem is that printing the result shows the element object (with its memory address) instead of the actual value.
I have already tried multiple approaches using other libraries as well, but none of them yielded the result.
from xml.etree import ElementTree
tree = ElementTree.parse('C:\\Users\\Sid\\Desktop\\Test.xml')
root = tree.getroot()
items = root.find("items")
item= items.find("item")
print(item)
Expected was 1 2 3 4. Actual : Memory address.
XML File is :
<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>
Using BeautifulSoup:
from bs4 import BeautifulSoup
test = '''<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>'''
soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("item")
for d in data:
    print(d.text)
OUTPUT:
1
2
3
4
Using XML Element Tree:
from xml.etree import ElementTree
tree = ElementTree.parse('list.txt')
root = tree.getroot()
items = root.findall("items")
for elem in items:
    desired_tag = elem.find("item")
    print(desired_tag.text)
OUTPUT:
1
2
3
4
EDIT:
If you want them printed on one line, separated by spaces, replace the print call inside the loop with:
print(desired_tag.text, end=" ")
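
An alternative sketch for the one-line output, assuming the same document as in the question (inlined here as a string): collect the texts first and join them, which avoids the trailing separator that end=" " leaves behind. The names values and line are illustrative.

```python
import xml.etree.ElementTree as ET

# Same structure as the question's file, inlined as a string.
XML = """<data>
<items><item>1</item></items>
<items><item>2</item></items>
<items><item>3</item></items>
<items><item>4</item></items>
</data>"""

root = ET.fromstring(XML)

# 'items/item' selects every <item> that sits directly under an <items> child of the root.
values = [item.text for item in root.findall('items/item')]

line = ' '.join(values)
print(line)
```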

XML Attributes Empty

I'm reading an XML document from a file into Python 3.6 on Windows 10. Here is a sample of the XML:
<?xml version="1.0"?>
<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<BurnLocation># 32 40 52.99 # 80 57 33.00</BurnLocation>
<geo:lat>32.681389</geo:lat>
<geo:long>-80.959167</geo:long>
<County>Jasper</County>
<BurnType>PD</BurnType>
<BurnTypeDescription>PILED DEBRIS</BurnTypeDescription>
<Acres>2</Acres>
</item>
<item>
<BurnLocation># 33 29 34.26 # 81 15 52.89</BurnLocation>
<geo:lat>33.492851</geo:lat>
<geo:long>-81.264694</geo:long>
<County>Orangebrg</County>
<BurnType>PD</BurnType>
<BurnTypeDescription>PILED DEBRIS</BurnTypeDescription>
<Acres>1</Acres>
</item>
</channel>
</rss>
Here is a version of my code:
import os
import xml.etree.ElementTree as ET
local_filename = os.path.join('C:\\Temp\\test\\', filename)
tree = ET.parse(local_filename)
root = tree.getroot()
for child in root:
    for next1 in child:
        for next2 in next1:
            print(next2.tag,next2.attrib)
The issue I'm having is that I cannot seem to isolate the attributes of the child tags, they are coming up as empty dictionaries. Here is an example of the result:
BurnLocation {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}lat {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}long {}
County {}
BurnType {}
BurnTypeDescription {}
Acres {}
BurnLocation {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}lat {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}long {}
County {}
BurnType {}
BurnTypeDescription {}
Acres {}
I am trying to print out the text within the tags (e.g. Jasper). What am I doing wrong?
What you want here is the text contents of each element, and not their attributes.
This ought to do it (slightly simplified for a fixed filename):
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    for next1 in child:
        for next2 in next1:
            print ('{} = "{}"'.format(next2.tag,next2.text))
        print ()
However, I'd simplify it a bit by locating all <item> elements at once and then looping over each item's child elements.
Thus:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
for item in tree.findall('*/item'):
    for elem in list(item):
        print ('{} = "{}"'.format(elem.tag,elem.text))
    print ()
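
If specific fields are wanted rather than a dump of every child, a sketch along these lines reads them by name, using a namespace mapping for the geo: elements. The XML is a shortened copy of the sample, and the dict keys (county, lat, lon, acres) and the name burns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample feed from the question.
XML = """<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
<channel>
<item>
<geo:lat>32.681389</geo:lat>
<geo:long>-80.959167</geo:long>
<County>Jasper</County>
<Acres>2</Acres>
</item>
<item>
<geo:lat>33.492851</geo:lat>
<geo:long>-81.264694</geo:long>
<County>Orangebrg</County>
<Acres>1</Acres>
</item>
</channel>
</rss>"""

# Prefix-to-URI mapping so that 'geo:lat' can be used in paths.
NS = {'geo': 'http://www.w3.org/2003/01/geo/wgs84_pos#'}

root = ET.fromstring(XML)
burns = []
for item in root.findall('channel/item'):
    burns.append({
        'county': item.findtext('County'),
        'lat': float(item.findtext('geo:lat', namespaces=NS)),
        'lon': float(item.findtext('geo:long', namespaces=NS)),
        'acres': int(item.findtext('Acres')),
    })

for burn in burns:
    print(burn)
```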
