Python XML element tree data extract - python

I am very new to Python and Airflow. I have been asked to create a Python script, to be run in Airflow, that goes through some XML, extracts data and places it into a variable. So far this has gone well and I have extracted data successfully, but I have now hit a problem and I am not sure why. Here is the XML I am trying to extract from:
<?xml version="1.0"?>
<message>
<m_control>
</m_control>
<m_content>
<b_control>
</b_control>
<intermediary type="IFA">
</intermediary>
<application>
<personal_client id="pc1">
</personal_client>
<product type="xxx" product_code="xxx">
<risk_benefit id="xx1" type="xxx">
<cover_purpose>xxxxxx</cover_purpose>
<cover_period>
<end_age definition="">xx</end_age>
</cover_period>
</risk_benefit>
</product>
</application>
</m_content>
</message>
Below is the code I am using, which has worked before:
from xml.etree import ElementTree as ET
import re
strip_namespace_regex = re.compile(' xmlns="[^"]+"')
product = root.findall('.//product')
for product in product:
    risk_benefit_node = product.find('.//risk_benefit')
    result['cover_purpose'] = risk_benefit_node.find('cover_purpose').text if risk_benefit_node.find('cover_purpose').text is not None else ''
However at the moment I get
ERROR - Failed to execute task: 'NoneType' object has no attribute 'text'
When I try the below
from xml.etree import ElementTree as ET
import re
strip_namespace_regex = re.compile(' xmlns="[^"]+"')
product = root.findall('.//product')
for product in product:
    risk_benefit_node = product.find('.//risk_benefit')
    result['risk_benefit_id'] = risk_benefit_node.attrib['id'] if risk_benefit_node.attrib['id'] is not None else ''
I get
ERROR - Failed to execute task: 'id'
I am not sure whether it's because the risk benefit section has two attributes or it is not picking up the risk benefit section.
Does anyone know what I am doing wrong?
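For what it's worth, here is a minimal sketch of a more defensive version of the extraction, using only the standard library. It assumes the input is well formed and shaped like the sample above (with the product element closed); the SAMPLE text, the extract function and the second, empty product are hypothetical and exist only for illustration. Using findtext with a default and attrib.get avoids both the 'NoneType' object has no attribute 'text' failure and the KeyError on 'id' when an element or attribute is missing, which may help narrow down where the real input differs from the sample.

```python
from xml.etree import ElementTree as ET

# Hypothetical, trimmed input modelled on the sample in the question.
SAMPLE = """<?xml version="1.0"?>
<message>
  <m_content>
    <application>
      <product type="xxx" product_code="xxx">
        <risk_benefit id="xx1" type="xxx">
          <cover_purpose>xxxxxx</cover_purpose>
        </risk_benefit>
      </product>
      <product type="yyy" product_code="yyy"/>
    </application>
  </m_content>
</message>"""


def extract(xml_text):
    """Return one dict per <product>, tolerating missing nodes."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall('.//product'):
        node = product.find('.//risk_benefit')
        if node is None:
            # This product has no risk_benefit element at all.
            rows.append({'risk_benefit_id': '', 'cover_purpose': ''})
            continue
        rows.append({
            # attrib.get returns the default instead of raising KeyError
            'risk_benefit_id': node.attrib.get('id', ''),
            # findtext returns the default if cover_purpose is absent
            'cover_purpose': node.findtext('cover_purpose', default=''),
        })
    return rows


rows = extract(SAMPLE)
print(rows)
```

If the real file declares a default namespace (the unused strip_namespace_regex hints that earlier files did), the plain tag names would not match until the namespace is stripped or included in the search path.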

Related

How can we parse xml data that contains nodes with xml namespace tags in python?

I am getting XML as a response and want to parse it. I have tried several Python libraries but have not got the result I want, so any help would be appreciated.
The following code finds nothing (it prints an empty list):
xmlResponse = ET.fromstring(context.response_document)
a = xmlResponse.findall('.//Body')
print(a)
Sample XML Data:
<S:Envelope
xmlns:S="http://www.w3.org/2003/05/soap-envelope">
<S:Header>
<wsa:Action s:mustUnderstand="1"
xmlns:s="http://www.w3.org/2003/05/soap-envelope"
xmlns:wsa="http://www.w3.org/2005/08/addressing">urn:ihe:iti:2007:RegistryStoredQueryResponse
</wsa:Action>
</S:Header>
<S:Body>
<query:AdhocQueryResponse status="urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success"
xmlns:query="urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0">
<rim:RegistryObjectList
xmlns:rim="urn:oasis:names:tc:ebxml-regrep:xsd:rim:3.0"/>
</query:AdhocQueryResponse>
</S:Body>
</S:Envelope>
I want to get the status, which is inside Body. If a different approach or library would work better, please suggest it. Thanks.
Given the following base code:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
Let's build on top of it to get your desired output.
Your initial search for the .//Body XPath finds nothing (findall returns an empty list) because no element with the plain tag name Body exists in your XML response.
Each tag in your XML has a namespace associated with it.
Consider the following line with xmlns value (xml-namespace):
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">
The value of namespace S is set to be http://www.w3.org/2003/05/soap-envelope.
Replacing the prefix S in S:Envelope with that value, wrapped in braces, gives the fully qualified tag name that ElementTree uses. Since Envelope is the root element, that is the value of root.tag:
print(root.tag) # {http://www.w3.org/2003/05/soap-envelope}Envelope (topmost node)
We need to do the same for <S:Body>.
To get the <S:Body> element and its child nodes, you can do the following:
body_node = root.find('{http://www.w3.org/2003/05/soap-envelope}Body')
for response_child_node in list(body_node):
    print(response_child_node.tag) # tag of the child node
    print(response_child_node.get('status')) # the status you're looking for
Outputs:
{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success
Alternatively
You can also directly find all {query}AdhocQueryResponse in your XML using:
response_nodes = root.findall('.//{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse')
for response in response_nodes:
    print(response.get('status'))
Outputs:
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success
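The same lookup can also be written with a prefix map instead of spelling out the URIs in every path. The sketch below assumes a trimmed copy of the response from the question; the soap and q prefixes are made up for this script and do not have to match the prefixes used in the document, only the URIs do.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question.
xml = """<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">
  <S:Header/>
  <S:Body>
    <query:AdhocQueryResponse
        status="urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success"
        xmlns:query="urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0"/>
  </S:Body>
</S:Envelope>"""

# Prefix-to-URI map; these prefixes are local to this script (assumption).
ns = {
    'soap': 'http://www.w3.org/2003/05/soap-envelope',
    'q': 'urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0',
}

root = ET.fromstring(xml)
# The path is relative to the root element (the Envelope).
response = root.find('soap:Body/q:AdhocQueryResponse', ns)
status = response.get('status') if response is not None else None
print(status)
```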

Generating one XML file from another using Python

How would one go about generating a local XML file from a url?
I need to be able to select certain values from the remote XML file and place them into a local one. Currently I only have snippets.
from xml.dom import minidom
from urllib.request import urlopen
import requests
url = 'http://url.php'
private_list = [16735,7456,18114]
xmldoc = minidom.parse(urlopen(url))
public_list = xmldoc.getElementsByTagName('server')
for public_server in public_list:
    for private_server in private_list:
        if public_server.attributes['id'].value == str(private_server):
            print("Found one!")
Sadly, that is about as far as I can get. I am able to grab the correct number of elements from the public list, but I am not sure how to take those elements and build a local copy of them.
Can anyone assist?
Edit: Example
The example XML looks like this:
<settings>
<servers>
<server url="192.168.1.100" name="CentOS" id="12" host="Kirk.corporate.lan"/>
<server url="10.0.0.95" name="Ubunutu" id="4" url2="192.168.1.50" host="Spock.corporate.lan"/>
<server url="10.0.1.95" id="30" host="scotty.corporate.lan"/>
</servers>
</settings>
In this example my list will only include ID 4, so I want to take just the subelement whose ID is 4. I can find it with my code above, but I don't know how to take that entire element and write it into a new file.
It looks like with xml.etree.ElementTree I can do
import xml.etree.ElementTree as ET
settings = ET.Element('settings')
servers = ET.SubElement(settings, 'servers')
That will build the base, but that's about as far as I can get successfully.
Edit 2:
Got a little further
from lxml import etree
root = etree.Element('settings')
main = etree.SubElement(root,'servers')
main = etree.SubElement(main, "server", url = "192.168.1.100", name="CentOS", id="12", host="Kirk.corporate.lan")
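One possible way to finish this, sketched with the standard library only: build the settings/servers skeleton as above, then copy each matching server element (with all its attributes) into it. The REMOTE_XML string stands in for the document that would be fetched from the URL, and private_ids and the local.xml file name are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Stand-in for the remote document; in practice this text would come
# from the URL (for example via urllib.request.urlopen(url).read()).
REMOTE_XML = """<settings>
  <servers>
    <server url="192.168.1.100" name="CentOS" id="12" host="Kirk.corporate.lan"/>
    <server url="10.0.0.95" name="Ubunutu" id="4" url2="192.168.1.50" host="Spock.corporate.lan"/>
    <server url="10.0.1.95" id="30" host="scotty.corporate.lan"/>
  </servers>
</settings>"""

# Ids to keep, stored as strings because XML attribute values are strings.
private_ids = {"4"}

remote_root = ET.fromstring(REMOTE_XML)

# Build the skeleton of the local document.
settings = ET.Element('settings')
servers = ET.SubElement(settings, 'servers')

# Copy every matching <server> element, attributes included.
for server in remote_root.iter('server'):
    if server.get('id') in private_ids:
        servers.append(copy.deepcopy(server))

local_xml = ET.tostring(settings, encoding='unicode')
print(local_xml)

# To save it to disk (file name is an assumption):
# ET.ElementTree(settings).write('local.xml', encoding='utf-8', xml_declaration=True)
```

Copying the element keeps whatever attributes the remote server happens to have (such as url2), so nothing needs to be listed by hand.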

error traversing xml in Python

My attempts to traverse an XML file retrieved from a URL have always failed, although it works if I type the XML directly into the code, such as:
smplexml = ''' somexml'''
I have been unable to make code like this work:
import xml.etree.ElementTree as ET
from urllib.request import urlopen
xmlstr = urlopen('http://www.w3schools.com/xml/simple.xml').read()
tree = ET.fromstring(xmlstr)
print(tree.find('name').text)
What am I doing wrong? Sometimes I get an error message like:
AttributeError: 'NoneType' object has no attribute 'text'
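import xml.etree.ElementTree as ET
from urllib.request import urlopen
xmlstr = urlopen('http://www.w3schools.com/xml/simple.xml').read()
tree = ET.fromstring(xmlstr)
# fromstring() returns the root element; find('name') only searches its
# direct children, and the name elements sit one level down inside each food
for food in tree:
    print(food.find('name').text)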
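The difference can be checked without a network connection. The sketch below uses a shortened, hypothetical document with the same shape as simple.xml (a root element whose children are food elements, each holding a name), and contrasts a direct-child lookup with a path and a descendant search.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in with the same nesting as the remote file (assumption).
SAMPLE = """<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>"""

root = ET.fromstring(SAMPLE)

# find() with a bare tag name looks only at direct children of root,
# and root's children are <food> elements, so this is None.
direct = root.find('name')

# A path, or a descendant search, reaches the nested elements.
first_name = root.find('food/name').text
all_names = [el.text for el in root.iter('name')]

print(direct, first_name, all_names)
```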

Extracting nested namespace from a xml using lxml

I'm new to Python and currently learning to parse XML. All was going well until I hit a wall with nested namespaces.
Below is a snippet of my XML, with the root element and the child element that I'm trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
-------------
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>
What I need is a way to extract the namespace URI of the element with the local name 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#". I have looked through a lot of tutorials but cannot seem to find a solution.
If anyone out there can lend their expertise, it would be much appreciated.
Here is what I have done so far, with help from two contributors:
#!/usr/bin/env python
from xml.etree import ElementTree as ET #import ElementTree module as an alias ET
from lxml import objectify, etree
def parse():
    import os
    import sys
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file,cpl_file)
    with open(xml_file) as f:
        xml = f.read()
    tree = etree.XML(xml)
    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace
    print(caption_namespace)
    print(tree.nsmap)
    nsmap = {}
    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
    return nsmap
if __name__=="__main__":
    parse()
But it's not working so far. I got the result 'None' when I used QName to locate the tag and its namespace, and when I tried to collect all namespaces in the XML using a for loop, as suggested in another post, I got the error 'Unknown return type: dict'.
Any suggestions, please?
This program prints the namespace of the indicated tag:
from lxml import etree
# Bytes literal: under Python 3, lxml rejects str input that carries an encoding declaration
xml = etree.XML(b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')
print(etree.QName(xml.find('.//{*}MainClosedCaption')).namespace)
Result:
http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
Reference: http://lxml.de/tutorial.html#namespaces
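If lxml is not available, roughly the same result can be had from the standard library, as sketched below. This assumes Python 3.8 or newer (where ElementTree accepts the {*} wildcard) and a trimmed copy of the XML above; namespace_of is a hypothetical helper that splits the URI out of ElementTree's '{uri}local' tag form.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question.
SAMPLE = """<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>"""


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or None."""
    if element.tag.startswith('{'):
        return element.tag[1:].split('}', 1)[0]
    return None


root = ET.fromstring(SAMPLE)
# '{*}' matches the local name in any namespace (Python 3.8 or newer).
caption = root.find('.//{*}MainClosedCaption')
caption_namespace = namespace_of(caption) if caption is not None else None
print(caption_namespace)
```

As for the 'Unknown return type: dict' error in the question's code, the most likely cause is the keyword: lxml's xpath() expects namespaces=nsmap, and an unrecognised keyword such as namespace=nsmap appears to be treated as an XPath variable instead.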

Python XML parsing from website

I am trying to parse XML from a website and I am stuck. I will provide the XML below. I have two questions: what is the best way to read XML from a website, and how do I dig into the XML to get the rate I need?
The figure I need back is base:OBS_VALUE, which is 0.12.
What I have so far:
from xml.dom import minidom
import urllib
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
ff_DataSet = xmldoc.getElementsByTagName('ff:DataSet')[0]
ff_series = ff_DataSet.getElementsByTagName('ff:Series')[0]
for line in ff_series:
    price = line.getElementsByTagName('base:OBS_VALUE')[0].firstChild.data
    print(price)
XML from the website:
<Header>
<ID>FFD</ID>
<Test>false</Test>
<Name xml:lang="en">Federal Funds daily averages</Name>
<Prepared>2013-05-08</Prepared>
<Sender id="FRBNY">
<Name xml:lang="en">Federal Reserve Bank of New York</Name>
<Contact>
<Name xml:lang="en">Public Information Web Team</Name>
<Email>ny.piwebteam#ny.frb.org</Email>
</Contact>
</Sender>
<!--ReportingBegin></ReportingBegin-->
</Header>
<ff:DataSet>
<ff:Series TIME_FORMAT="P1D" DISCLAIMER="G" FF_METHOD="D" DECIMALS="2" AVAILABILITY="A">
<ffbase:Key>
<base:FREQ>D</base:FREQ>
<base:RATE>FF</base:RATE>
<base:MATURITY>O</base:MATURITY>
<ffbase:FF_SCOPE>D</ffbase:FF_SCOPE>
</ffbase:Key>
<ff:Obs OBS_CONF="F" OBS_STATUS="A">
<base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
<base:OBS_VALUE>0.12</base:OBS_VALUE>
If you want to stick with xml.dom.minidom, try this:
from xml.dom import minidom
from urllib.request import urlopen
url_str = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
xml_str = urlopen(url_str).read()
xmldoc = minidom.parseString(xml_str)
obs_values = xmldoc.getElementsByTagName('base:OBS_VALUE')
# prints the first base:OBS_VALUE it finds
print(obs_values[0].firstChild.nodeValue)
# prints the second base:OBS_VALUE it finds
print(obs_values[1].firstChild.nodeValue)
# prints all base:OBS_VALUE in the XML document
for obs_val in obs_values:
    print(obs_val.firstChild.nodeValue)
However, if you want to use lxml, use underrun's solution. Also, your original code had some errors: you were attempting to parse the document variable, which holds the web address. You need to parse the XML returned from the website, which in your example is held in the get_web variable.
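The minidom approach can be tried offline with a small stand-in document, as in the sketch below. The Root element and the namespace URIs are placeholders (assumptions), since the real ones are declared on the feed's root element, which is not shown in the question.

```python
from xml.dom import minidom

# Abbreviated stand-in for the feed. The Root element and the namespace
# URIs are placeholders (assumptions), not the feed's real declarations.
SAMPLE = """<Root xmlns:ff="urn:example:ff" xmlns:base="urn:example:base">
  <ff:DataSet>
    <ff:Series>
      <ff:Obs OBS_CONF="F" OBS_STATUS="A">
        <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
        <base:OBS_VALUE>0.12</base:OBS_VALUE>
      </ff:Obs>
    </ff:Series>
  </ff:DataSet>
</Root>"""

xmldoc = minidom.parseString(SAMPLE)

# getElementsByTagName matches the qualified name, prefix included.
values = [
    node.firstChild.nodeValue
    for node in xmldoc.getElementsByTagName('base:OBS_VALUE')
    if node.firstChild is not None  # skip empty elements
]
print(values)
```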
Take a look at your code:
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
I'm not sure document is what you intend, unless you really want http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=dailyr, because that is what you get: the parentheses only group here, and adjacent string literals are concatenated automatically.
After that you do some work to create get_web, but then you don't use it in the next line. Instead you try to parse document, which is the URL.
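The implicit concatenation described above can be checked directly. This is a small sketch; the url variable is introduced only for the comparison.

```python
# Adjacent string literals are joined at compile time; the parentheses
# only group the expression and do not create a tuple.
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')

# The address that was presumably intended (name is hypothetical).
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'

print(type(document).__name__)      # str
print(document == url + 'r')        # True
print(document.endswith('dailyr'))  # True
```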
Beyond that, I would suggest you use ElementTree, preferably lxml's ElementTree (http://lxml.de/). lxml's etree parser also takes a file-like object, which can be a urllib response. Once the rest of your code is straightened out, you could do something like this:
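from lxml import etree
from urllib.request import urlopen
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
tree = etree.parse(urlopen(url))
# Assumption: the ff and base prefixes are declared on the root element,
# so its nsmap can be reused; the None (default namespace) key is dropped
# because XPath needs a named prefix
ns = {prefix: uri for prefix, uri in tree.getroot().nsmap.items() if prefix}
for obs in tree.xpath('//ff:Series/ff:Obs', namespaces=ns):
    # xpath() returns a list, so take the first match before reading .text
    price = obs.xpath('./base:OBS_VALUE', namespaces=ns)[0].text
    print(price)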
