python get xml element by path


I am trying to walk through a large XML file and collect some data. Since the data can be located by its path, I used XPath, but I get no result.
Could someone suggest what I am doing wrong?
Example of the XML:
<?xml version="1.0" encoding="UTF-8"?>
<rootnode>
<subnode1>
</subnode1>
<subnode2>
</subnode2>
<subnode3>
<listnode>
<item id="1"><name>test name1</name></item>
<item id="2"><name>test name2</name></item>
<item id="3"><name>test name3</name></item>
</listnode>
</subnode3>
</rootnode>
The code:
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('./rootnode/subnode3/listnode')
for next_item in subtree:
    Id = next_item.attrib.get('id')
    name = next_item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))

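You are pretty close. The id attribute and the name element belong to each item, not to listnode, so loop over the item children of listnode (the path below is also written as an absolute one).
Ex: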
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('/rootnode/subnode3/listnode')
for next_item in subtree:
    for item in next_item.findall('item'):
        Id = item.attrib.get('id')
        name = item.find('name').text
        print('{:>20} - {:>20}'.format(name,Id))
OR
subtree = tree.xpath('/rootnode/subnode3/listnode/item')
for item in subtree:
    Id = item.attrib.get('id')
    name = item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))
Output:
          test name1 -                    1
          test name2 -                    2
          test name3 -                    3
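For comparison, here is a minimal sketch of the same lookup using only the standard library's xml.etree.ElementTree instead of lxml. It assumes the XML from the question is available as a string; the variable names are illustrative. Note that findall() on the root element takes a path relative to that element, so the path starts below rootnode.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined as a string.
XML = """<rootnode>
  <subnode1/>
  <subnode2/>
  <subnode3>
    <listnode>
      <item id="1"><name>test name1</name></item>
      <item id="2"><name>test name2</name></item>
      <item id="3"><name>test name3</name></item>
    </listnode>
  </subnode3>
</rootnode>"""

root = ET.fromstring(XML)

# The path is relative to the root element (<rootnode>),
# so <rootnode> itself is not part of it.
pairs = [
    (item.findtext("name"), item.get("id"))
    for item in root.findall("./subnode3/listnode/item")
]

for name, item_id in pairs:
    print("{:>20} - {:>20}".format(name, item_id))
```

ElementTree supports only a subset of XPath, so for more complex expressions lxml's xpath() remains the better fit.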

Related

How to parse XML and get its elements into a Python data frame

This is my XML string. I receive it as a message, so it is not a file:
<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>
I want to get the output in the format below:
message,code,Id
I have mentioned only three elements, but there can be many more.
This is what I tried, but I am not getting the expected output.
I have just started learning Python, so please excuse any silly mistakes.
from __future__ import print_function
import pandas as pd
def lambda_handler():
    import xml.etree.ElementTree as et
    xtree = et.parse('''<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>''')
    xroot = xtree.getroot()
    df_cols = ["message", "code", "Id"]
    rows = []
    for node in xroot:
        s_name = node.attrib.get("message")
        s_mail = node.find("code").text if node is not None else None
        s_grade = node.find("Id").text if node is not None else None
lambda_handler()

You can try using XPath; it makes it easier to retrieve the data you want:
import xml.etree.ElementTree as et
import pandas as pd
xtree = et.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>""")
keys = ["message", "code", "Id"]
data = {k: [xtree.find(".//"+k).text] for k in keys}
print(pd.DataFrame(data))
# Outputs:
#                      message       code              Id
# 0  5jb10x5rf7sp1fov5msgoof7r  COMPLETED  dfkjlhgd98568y

Is this the output you desire?
# !pip install xmltodict
import xmltodict
xml = """
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>
"""
d = xmltodict.parse(xml)
print(d['name']['message'])
print(d['name']['code'])
print(d['name']['Id'])
Output
5jb10x5rf7sp1fov5msgoof7r
COMPLETED
dfkjlhgd98568y
More info on xmltodict at https://github.com/martinblech/xmltodict

Given your string:
your_string='''\
<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>'''
Since this is a string, use .fromstring() rather than .parse(). It returns the root element directly, so there is no need to call .getroot():
root = et.fromstring(your_string)
>>> root
<Element 'name' at 0x1050f51d0>
Once you have the data structure with name as the root, you can either iterate over the sub elements:
df_cols = ["message", "code", "Id"]
for node in root:
    if node.tag in df_cols:
        print({node.tag:node.text})
Prints:
{'message': '5jb10x5rf7sp1fov5msgoof7r'}
{'code': 'COMPLETED'}
{'Id': 'dfkjlhgd98568y'}
Or you can use an xpath query to find each element of interest:
for k in df_cols:
    print({k:root.find(f'./{k}').text})
# same output
Since a data frame can be constructed from a dict of the form {key:[list_of_elements],...}, you can build that kind of dict from what we have here:
df=pd.DataFrame({k:[root.find(f'./{k}').text] for k in df_cols})
If you have multiple elements, use findall:
df=pd.DataFrame({k:[x.text for x in root.findall(f'./{k}')] for k in df_cols})
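When a message carries several records, one alternative is to build one dict per record rather than one list per column. The sketch below uses only the standard library; the wrapping records element and the sample values are hypothetical and not part of the original message.

```python
import xml.etree.ElementTree as ET

# Hypothetical message with two records; <records> is an assumed wrapper.
XML = """<records>
  <name><message>m1</message><code>COMPLETED</code><Id>a1</Id></name>
  <name><message>m2</message><code>FAILED</code><Id>a2</Id></name>
</records>"""

COLUMNS = ["message", "code", "Id"]

root = ET.fromstring(XML)

# One dict per <name> record; findtext() returns None for a missing child,
# so a record with an absent field does not raise AttributeError.
rows = [
    {col: record.findtext(col) for col in COLUMNS}
    for record in root.findall("./name")
]

print(rows)
```

A list of dicts like this can be passed to pd.DataFrame(rows, columns=COLUMNS), and it keeps the rows aligned even if one record lacks a field.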

Fetching elements from XML and insert into Postgres DB

I have an XML file like the one below, and I need to insert its data into a PostgreSQL DB. Below are a sample of the XML and the code I use, but I'm not getting any output. Can someone please guide me on how to fetch these XML values effectively?
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
<item>
.
.
.
</item>
Below is the script I use (Python 3.5, PostgreSQL 11):
# import modules
import sys
import psycopg2
import datetime
now = datetime.datetime.now()
# current data and time
dt = now.strftime("%Y%m%dT%H%M%S")
# xml tree access
#from xml.etree import ElementTree
import xml.etree.ElementTree as ET
# incremental variable
x = 0
with open('/Users/admin/documents/shopping.xml', 'rt',encoding="utf8") as f:
    #tree = ElementTree.parse(f)
    tree = ET.parse(f)
# connection to postgreSQL database
try:
    conn=psycopg2.connect(host='localhost', database='postgres',
    user='postgres', password='postgres',port='5432')
except:
    print ("Hey I am unable to connect to the database.")
cur = conn.cursor()
# access the xml tree element nodes
try:
    for node in tree.findall('.//item'):
        src = node.find('id')
        tgt = node.find('mpn')
        print(node)
except:
    print ("Oops I can't insert record into database table!")
conn.commit()
conn.close()
The current output I get is:
None
None
None
Expected output:
id title description gtin ......
20 product 1 g:description xxxx .....

It is strange that you can't find item; maybe you are reading the wrong file and it doesn't contain any item elements.
Using your XML data as a string with ET.fromstring(), I have no problem finding item.
Try print( f.read() ) to see what you really read from the file.
The only problem is with id and mpn, which use the namespace prefix g:. A plain 'id' or 'mpn' does not match them, and 'g:id' or 'g:mpn' alone is not enough either; you also need to pass a namespace mapping:
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
or use the full namespace URI directly, as '{http://base.google.com/ns/1.0}id' and '{http://base.google.com/ns/1.0}mpn':
tree = ET.fromstring(xml)
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Minimal working code:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Result:
Node: <Element 'item' at 0x7f74ba45b710>
src: 20
tgt: 0014
BTW: It also works when I use io.StringIO to simulate a file:
f = io.StringIO(xml)
tree = ET.parse(f)
Minimal working code:
import xml.etree.ElementTree as ET
import io
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
f = io.StringIO(xml)
tree = ET.parse(f)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('mpn:', tgt.text)
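To connect the namespace lookup back to the database insert, here is a rough sketch of the whole flow. It is not the asker's setup: sqlite3 (in memory) is used purely as a stand-in for PostgreSQL so the example runs without a server, and the products table, its columns, and the extract_rows helper are hypothetical names chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {"g": "http://base.google.com/ns/1.0"}

# Trimmed-down copy of the feed from the question.
FEED = """<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <g:id>20</g:id>
      <g:title>product 1</g:title>
      <g:mpn>0014</g:mpn>
    </item>
  </channel>
</rss>"""


def extract_rows(xml_text, fields=("id", "title", "mpn")):
    """Return one tuple of namespaced child texts per <item> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall(".//item"):
        # findtext() returns None when a child is missing instead of raising.
        rows.append(tuple(item.findtext("g:" + f, namespaces=NS) for f in fields))
    return rows


rows = extract_rows(FEED)

# sqlite3 is only a stand-in here; the table name and columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id TEXT, title TEXT, mpn TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT id, title, mpn FROM products").fetchall()
print(stored)
conn.close()
```

With psycopg2 the structure would be similar, but the placeholders are %s rather than ?, and the connection comes from psycopg2.connect() as in the question.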

XML parser returns NoneType

I am trying to parse the XML below using ElementTree in Python, but I get "member" as None, and when I use .text it raises an AttributeError.
<address-group>
<entry name="TBR">
<static>
<member>TBR1-1.1.1.1_21</member>
<member>TBR2-2.2.2.2_24</member>
<member>TBR3-3.3.3.3_21</member>
<member>TBR4-4.4.4.4_24</member>
</static>
</entry>
</address-group>
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("addrgrp.xml")
root = tree.getroot()
tag = root.tag
print (tag)
attr = root.attrib
for entries in root.findall("entry"):
    name = entries.get('name')
    print (name)
    ip = entries.find('static')
    print (ip)
    for mem in ip.findall('member'):
        member = mem.find('member')
        print (member)

The code below aggregates the members of each entry by entry name:
import xml.etree.ElementTree as ET
import pprint
XML = '''
<address-group>
<entry name="TBR1">
<static>
<member>TBR1-1.1.1.1_21</member>
<member>TBR2-2.2.2.2_24</member>
<member>TBR3-3.3.3.3_21</member>
<member>TBR4-4.4.4.4_24</member>
</static>
</entry>
<entry name="TBR2">
<static>
<member>TBR1-4.1.1.1_21</member>
<member>TBR2-4.2.2.2_24</member>
<member>TBR3-4.3.3.3_21</member>
<member>TBR4-9.4.4.4_24</member>
</static>
</entry>
</address-group>'''
root = ET.fromstring(XML)
data_by_entry = {}
entries = root.findall('.//entry')
for entry in entries:
    data_by_entry[entry.attrib['name']] = [m.text for m in entry.findall('./static/member')]
pprint.pprint(data_by_entry)
Output:
{'TBR1': ['TBR1-1.1.1.1_21',
          'TBR2-2.2.2.2_24',
          'TBR3-3.3.3.3_21',
          'TBR4-4.4.4.4_24'],
 'TBR2': ['TBR1-4.1.1.1_21',
          'TBR2-4.2.2.2_24',
          'TBR3-4.3.3.3_21',
          'TBR4-9.4.4.4_24']}

The source of your problem is this: within the loop for mem in ip.findall('member'):, mem is already the current member element, but the first instruction in the loop is member = mem.find('member'). You are therefore trying to find another (nested) member inside the current member, which doesn't exist, so find() returns None.
Another flaw in your code is that there is no point in printing a node that does not have any text, such as static.
Change your loop to the code below:
for entries in root.findall('entry'):
    name = entries.get('name')
    print(name)
    ip = entries.find('static')
    print('Members:')
    for mem in ip.findall('member'):
        print(mem.text)
and you will get a meaningful result.
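The behaviour described above can be checked in isolation. This is a small sketch using a shortened copy of the XML from the question as a string; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
XML = """<address-group>
  <entry name="TBR">
    <static>
      <member>TBR1-1.1.1.1_21</member>
      <member>TBR2-2.2.2.2_24</member>
    </static>
  </entry>
</address-group>"""

root = ET.fromstring(XML)
first_member = root.find("./entry/static/member")

# find() only searches below the element it is called on.
# A <member> has no <member> child, so the result is None.
nested = first_member.find("member")
print(nested)

# The value is the text of the element itself.
print(first_member.text)

# Collect every member in one pass with a path expression.
members = [m.text for m in root.iterfind("./entry/static/member")]
print(members)
```

Calling .text on the None returned by the nested find() is what raises the AttributeError mentioned in the question.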

Python XML Element Tree finding the value of an XML tag

I'm trying to retrieve the value of a particular tag in an XML file. The problem is that it returns a memory address instead of the actual value.
I have already tried multiple approaches, including other libraries, but none produced the expected result.
from xml.etree import ElementTree
tree = ElementTree.parse('C:\\Users\\Sid\\Desktop\\Test.xml')
root = tree.getroot()
items = root.find("items")
item= items.find("item")
print(item)
Expected: 1 2 3 4. Actual: a memory address.
The XML file is:
<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>

Using BeautifulSoup:
from bs4 import BeautifulSoup
test = '''<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>'''
soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("item")
for d in data:
    print(d.text)
OUTPUT:
1
2
3
4
Using XML Element Tree:
from xml.etree import ElementTree
tree = ElementTree.parse('list.txt')
root = tree.getroot()
items = root.findall("items")
for elem in items:
    desired_tag = elem.find("item")
    print(desired_tag.text)
OUTPUT:
1
2
3
4
EDIT:
If you want them printed on one line separated by spaces:
print(desired_tag.text, end=" ")
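As a compact variant, the sketch below contrasts printing the element object with reading its text, and collects all values with a single path. It assumes the XML from the question is available as a string rather than a file.

```python
import xml.etree.ElementTree as ET

# Document from the question, inlined as a string.
XML = """<data>
  <items><item>1</item></items>
  <items><item>2</item></items>
  <items><item>3</item></items>
  <items><item>4</item></items>
</data>"""

root = ET.fromstring(XML)

# find() returns an Element object; printing it shows its repr,
# which is the "memory address" seen in the question.
first = root.find("./items/item")
print(first)

# The value is stored in the element's .text attribute.
values = [item.text for item in root.findall("./items/item")]
line = " ".join(values)
print(line)
```

Note that find() returns only the first match, which is why the original code could never produce all four values; findall() returns every match.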

XML Attributes Empty

I'm reading an xml object into Python 3.6 on Windows 10 from file. Here is a sample of the xml:
<?xml version="1.0"?>
<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<BurnLocation># 32 40 52.99 # 80 57 33.00</BurnLocation>
<geo:lat>32.681389</geo:lat>
<geo:long>-80.959167</geo:long>
<County>Jasper</County>
<BurnType>PD</BurnType>
<BurnTypeDescription>PILED DEBRIS</BurnTypeDescription>
<Acres>2</Acres>
</item>
<item>
<BurnLocation># 33 29 34.26 # 81 15 52.89</BurnLocation>
<geo:lat>33.492851</geo:lat>
<geo:long>-81.264694</geo:long>
<County>Orangebrg</County>
<BurnType>PD</BurnType>
<BurnTypeDescription>PILED DEBRIS</BurnTypeDescription>
<Acres>1</Acres>
</item>
</channel>
</rss>
Here is a version of my code:
import os
import xml.etree.ElementTree as ET
local_filename = os.path.join('C:\\Temp\\test\\', filename)
tree = ET.parse(local_filename)
root = tree.getroot()
for child in root:
    for next1 in child:
        for next2 in next1:
            print(next2.tag,next2.attrib)
The issue I'm having is that I cannot seem to isolate the attributes of the child tags, they are coming up as empty dictionaries. Here is an example of the result:
BurnLocation {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}lat {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}long {}
County {}
BurnType {}
BurnTypeDescription {}
Acres {}
BurnLocation {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}lat {}
{http://www.w3.org/2003/01/geo/wgs84_pos#}long {}
County {}
BurnType {}
BurnTypeDescription {}
Acres {}
I am trying to print out the items within the tags (i.e. Jasper), what am I doing wrong?

What you want here is the text contents of each element, and not their attributes.
This ought to do it (slightly simplified for a fixed filename):
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    for next1 in child:
        for next2 in next1:
            print ('{} = "{}"'.format(next2.tag,next2.text))
        print ()
However, I'd simplify it a bit by:
locating all <item> elements at once, and
then looping over each item's child elements.
Thus:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
for item in tree.findall('*/item'):
    for elem in list(item):
        print ('{} = "{}"'.format(elem.tag,elem.text))
    print ()
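The difference between attributes and text can be seen on a cut-down version of the feed. In this sketch the XML is inlined as a string and reduced to two fields; the geo prefix mapping is the one declared in the question's document.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the feed from the question.
XML = """<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <item>
      <geo:lat>32.681389</geo:lat>
      <County>Jasper</County>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(XML)

# Attributes are the name="value" pairs inside the opening tag.
# Only <rss> has one here (namespace declarations are not reported).
root_attrs = root.attrib
print(root_attrs)

# <County> has no attributes, so .attrib is empty; its value is in .text.
county = root.find("./channel/item/County")
print(county.attrib, county.text)

# Namespaced children need a prefix mapping (or the full {uri}tag form).
ns = {"geo": "http://www.w3.org/2003/01/geo/wgs84_pos#"}
lat = root.find("./channel/item/geo:lat", ns)
print(lat.text)
```

This is why every child in the question printed an empty dictionary: none of those elements carries attributes, only text.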
