error traversing xml in Python

My attempts to traverse an XML file retrieved from a URL have always failed, although it works if I type the XML directly into the code, such as:
smplexml = ''' somexml'''
I have been unable to get code like the following to work:
import xml.etree.ElementTree as ET
import urllib
xmlstr = urllib.urlopen('http://www.w3schools.com/xml/simple.xml').read()
tree = ET.fromstring(xmlstr)
print tree.find('name').text
What am I doing wrong? Sometimes I get an error message like:
AttributeError: 'NoneType' object has no attribute 'text'

import xml.etree.ElementTree as ET
import urllib
xmlstr = urllib.urlopen('http://www.w3schools.com/xml/simple.xml').read()
tree = ET.fromstring(xmlstr)
for food in tree:
    print food.find('name').text
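
Looping over the root helps because find('name') only looks at the direct children of the element it is called on, and returns None when there is no match. Calling .text on that None produces the AttributeError from the question. Below is a minimal Python 3 sketch of the same idea. The inline SAMPLE string is an assumed stand-in for the downloaded document (a root element whose food children each contain a name), and the food_names helper is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the downloaded XML: a root with <food> children,
# each of which holds a <name> element.
SAMPLE = """<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>"""


def food_names(xml_text):
    """Return the text of every <name> that sits directly under a <food>."""
    root = ET.fromstring(xml_text)
    # root.find('name') would be None here: <name> is a grandchild of the
    # root, and find() with a bare tag name only checks direct children.
    return [food.findtext('name', default='') for food in root.findall('food')]


print(food_names(SAMPLE))
```

In Python 3 the download itself would use urllib.request.urlopen(url).read() rather than urllib.urlopen. To search at any depth without an explicit loop, root.iter('name') or root.find('.//name') are alternatives.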

Related

Python XML element tree data extract

I am very new to Python and Airflow. I have been asked to create a Python script, to be run in Airflow, which goes through XML, extracts data, and places it into a variable. So far I have extracted data successfully, but I have now hit a problem and I am not sure why. Below is the XML I am trying to extract from:
<?xml version="1.0"?>
<message>
  <m_control>
  </m_control>
  <m_content>
    <b_control>
    </b_control>
    <intermediary type="IFA">
    </intermediary>
    <application>
      <personal_client id="pc1">
      </personal_client>
      <product type="xxx" product_code="xxx">
        <risk_benefit id="xx1" type="xxx">
          <cover_purpose>xxxxxx</cover_purpose>
          <cover_period>
            <end_age definition="">xx</end_age>
          </cover_period>
        </risk_benefit>
      </product>
    </application>
  </m_content>
</message>
Below is the code I am using, which has worked before:
from xml.etree import ElementTree as ET
import re
strip_namespace_regex = re.compile(' xmlns="[^"]+"')
product = root.findall('.//product')
for product in product:
    risk_benefit_node = product.find('.//risk_benefit')
    result['cover_purpose'] = risk_benefit_node.find('cover_purpose').text if risk_benefit_node.find('cover_purpose').text is not None else ''
However, at the moment I get:
ERROR - Failed to execute task: 'NoneType' object has no attribute 'text'
When I try the following:
from xml.etree import ElementTree as ET
import re
strip_namespace_regex = re.compile(' xmlns="[^"]+"')
product = root.findall('.//product')
for product in product:
    risk_benefit_node = product.find('.//risk_benefit')
    result['risk_benefit_id'] = risk_benefit_node.attrib['id'] if risk_benefit_node.attrib['id'] is not None else ''
I get:
ERROR - Failed to execute task: 'id'
I am not sure whether it's because the risk_benefit element has two attributes or because the code is not picking up the risk_benefit element at all.
Does anyone know what I am doing wrong?
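
Some general ElementTree behavior is relevant here. find() returns None when nothing matches, so the `.text is not None` check in the first snippet runs too late: the AttributeError is raised before the check is evaluated. Likewise, attrib['id'] raises KeyError('id') when the attribute is absent instead of returning None. Having two attributes on an element is not a problem in itself. The sketch below shows one defensive pattern using findtext() and get(), which accept defaults. It assumes (this is not confirmed by the question) that some product in the real file has a risk_benefit without an id or without a cover_purpose child. The SAMPLE document and the extract_risk_benefits name are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the second product's risk_benefit has neither an
# id attribute nor a cover_purpose child, which would trigger both errors.
SAMPLE = """<message><m_content><application>
  <product type="xxx" product_code="xxx">
    <risk_benefit id="xx1" type="xxx">
      <cover_purpose>xxxxxx</cover_purpose>
    </risk_benefit>
  </product>
  <product type="yyy" product_code="yyy">
    <risk_benefit type="yyy"/>
  </product>
</application></m_content></message>"""


def extract_risk_benefits(root):
    """Collect id and cover_purpose per product, defaulting to ''."""
    rows = []
    for product in root.findall('.//product'):
        node = product.find('.//risk_benefit')
        if node is None:
            # No risk_benefit under this product: nothing to extract.
            continue
        rows.append({
            # get() returns the default instead of raising KeyError.
            'risk_benefit_id': node.get('id', ''),
            # findtext() returns the default when the child is missing.
            'cover_purpose': node.findtext('cover_purpose', default=''),
        })
    return rows


print(extract_risk_benefits(ET.fromstring(SAMPLE)))
```

Printing the tag and attrib of each matched node is a quick way to see which element in the real file lacks the expected data.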

python minidom: 'NoneType' object has no attribute 'data' from url

I'm trying to parse XML with Python using minidom. When I parse an XML file from my filesystem, I have no problem:
doc = minidom.parse("PATH HERE")
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
    probPrecip = dia.getElementsByTagName("maxima")[0]
    print(probPrecip.firstChild.data)
But when I try to parse XML from a URL with this code:
url = urllib2.urlopen('URL HERE')
doc = minidom.parse(url)
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
    probPrecip = dia.getElementsByTagName("maxima")[0]
    print(probPrecip.firstChild.data)
I get an error message.
The XML at the path and at the URL is the same. Thanks.

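The urlopen function returns a file-like response object. You can call the read() method of this object to get the actual content of the response, and pass that to minidom.parseString() (minidom.parse() expects a filename or an open file object, not the document text):
minidom.parseString(url.read())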

Try the newer urllib.request module instead, as below.
It prints Hello. Is that what you want?
from xml.dom import minidom
from urllib import request
url = request.urlopen('http://localhost:8000/sample.xml')
doc = minidom.parse(url)
etiquetaDia = doc.getElementsByTagName("dia")
for dia in etiquetaDia:
    probPrecip = dia.getElementsByTagName("maxima")[0]
    print(probPrecip.firstChild.data)
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<dia>
  <maxima>Hello</maxima>
</dia>
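
For reference, the message 'NoneType' object has no attribute 'data' appears when firstChild is None, which happens for an element that has no child nodes at all (for example an empty <maxima/>). The following Python 3 sketch guards against that case. It assumes, without confirmation from the question, that the document served at the URL contains at least one empty maxima element. The SAMPLE string, its root wrapper element, and the maxima_values helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical document: the second <maxima> is empty, so its
# firstChild is None.
SAMPLE = """<root>
  <dia><maxima>25</maxima></dia>
  <dia><maxima/></dia>
</root>"""


def maxima_values(xml_text):
    """Return the text of the first <maxima> in each <dia>, or None."""
    doc = minidom.parseString(xml_text)
    values = []
    for dia in doc.getElementsByTagName("dia"):
        nodes = dia.getElementsByTagName("maxima")
        # Guard both a missing <maxima> and an empty one.
        first = nodes[0].firstChild if nodes else None
        values.append(first.data if first is not None else None)
    return values


print(maxima_values(SAMPLE))
```

Comparing the bytes returned by the URL with the local file is a simple way to check whether the two documents really are identical.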

Generating XML file with proper indentation

I am trying to generate an XML file in Python, but it is not indented; the output comes out on a single line.
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
name = str(request.POST.get('name'))
top = Element('scenario')
environment = SubElement(top, 'environment')
cluster = SubElement(top, 'cluster')
cluster.text=name
I tried to use a pretty printer, but it gives me an error: 'Element' object has no attribute 'read'
import xml.dom.minidom
xml_p = xml.dom.minidom.parse(top)
pretty_xml = xml_p.toprettyxml()
Is the input given to the parser in the proper format? If this is the wrong method, please suggest another way to indent.

You cannot parse top directly, because it is an Element. You need to convert it to a string first (which is why you should use tostring, which you import but currently do not use), and then call xml.dom.minidom.parseString() on the result:
import xml.dom.minidom
xml_p = xml.dom.minidom.parseString(tostring(top))
pretty_xml = xml_p.toprettyxml()
print(pretty_xml)
That gives:
<?xml version="1.0" ?>
<scenario>
    <environment/>
    <cluster>xyz</cluster>
</scenario>
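
Put together as a self-contained Python 3 sketch, the approach could look like the following. The build_scenario and prettify helper names are hypothetical, the value 'xyz' stands in for the name read from request.POST in the question, and the two-space indent is an arbitrary choice.

```python
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring


def build_scenario(name):
    """Build the <scenario> tree from the question with a given cluster name."""
    top = Element('scenario')
    SubElement(top, 'environment')
    cluster = SubElement(top, 'cluster')
    cluster.text = name
    return top


def prettify(element, indent='  '):
    """Serialize an Element, re-parse it with minidom, and indent it."""
    raw = tostring(element)  # bytes in Python 3; parseString accepts them
    return minidom.parseString(raw).toprettyxml(indent=indent)


pretty = prettify(build_scenario('xyz'))
print(pretty)
```

On Python 3.9 and later, xml.etree.ElementTree.indent() can indent the tree in place before serializing, which avoids the round trip through minidom.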

Extracting nested namespace from a xml using lxml

I'm new to Python and currently learning to parse XML. All was going well until I hit a wall with nested namespaces.
Below is a snippet of my XML, with the root element and the child element that I'm trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <!-- Generated by orca_wrapping version 3.8.3-0 -->
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  -------------
  -------------
  -------------
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
  ------------
  ------------
  ------------
</CompositionPlaylist>
What I need is a way to extract the namespace URI of the element with the local name 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#". I have looked through a lot of tutorials but cannot seem to find a solution.
If anyone out there can lend their expertise, it would be much appreciated.
Here is what I have done so far, with help from the two contributors:
#!/usr/bin/env python
from xml.etree import ElementTree as ET #import ElementTree module as an alias ET
from lxml import objectify, etree
def parse():
    import os
    import sys
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file,cpl_file)
    with open(xml_file) as f:
        xml = f.read()
    tree = etree.XML(xml)
    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace
    print caption_namespace
    print tree.nsmap
    nsmap = {}
    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
    return nsmap
if __name__=="__main__":
    parse()
But it's not working so far. I got the result 'None' when I used QName to locate the tag and its namespace. And when I tried to collect all namespaces in the XML using a for loop, as suggested in another post, I got the error 'Unknown return type: dict'.
Any suggestions, please?

This program prints the namespace of the indicated tag:
from lxml import etree
xml = etree.XML('''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')
print etree.QName(xml.find('.//{*}MainClosedCaption')).namespace
Result:
http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
Reference: http://lxml.de/tutorial.html#namespaces
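
Two side notes. In lxml, the keyword argument that carries the prefix map for xpath() is namespaces (plural). And the same lookup can be approximated with the standard library alone, since ElementTree stores a qualified tag as '{uri}localname'. The sketch below is a swapped-in standard-library technique, not the lxml approach from the answer. The namespace_of helper is a hypothetical name and SAMPLE is a trimmed copy of the document above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question.
SAMPLE = """<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <EditRate>24 1</EditRate>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>"""


def namespace_of(root, local_name):
    """Return the namespace URI of the first element with this local name."""
    for elem in root.iter():
        tag = elem.tag
        if not isinstance(tag, str):
            continue  # skip comments and processing instructions
        if tag.startswith('{'):
            # Qualified tags look like '{uri}localname'.
            uri, _, local = tag[1:].partition('}')
        else:
            uri, local = None, tag
        if local == local_name:
            return uri
    return None


print(namespace_of(ET.fromstring(SAMPLE), 'MainClosedCaption'))
```

The function returns None both when the element is not found and when it is found without a namespace, so callers that need to tell those cases apart would have to extend it.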

python lxml with py2exe

I have generated an XML document with DOM and I want to use lxml to pretty-print it.
This is my code for pretty-printing the XML:
def prettify_xml(xml_str):
    import lxml.etree as etree
    root = etree.fromstring(xml_str)
    xml_str = etree.tostring(root, pretty_print=True)
    return xml_str
My output should be an XML-formatted string.
I got this code from a post on Stack Overflow. It works flawlessly when I run it with Python itself. But when I convert my project to a binary with py2exe (my binary is a Windows service with a named pipe), I have two problems:
1. My service would not start. I solved this by adding lxml.etree to the includes option in the py2exe setup; after that my service started properly.
2. When XML generation is called, this is the error I see in my log:
'module' object has no attribute 'fromstring'
Where do I rectify this error? And is my solution to the first problem correct?
My XML generation code:
from xml.etree import ElementTree
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring, XML
import lxml.etree
def prettify_xml(xml_str):
    root = lxml.etree.fromstring(xml_str)
    xml_str = lxml.etree.tostring(root, pretty_print=True)
    return xml_str
def dll_xml(status):
    try:
        xml_declaration = '<?xml version="1.0" standalone="no" ?>'
        rootTagName='response'
        root = Element(rootTagName)
        root.set('id' , 'rp001')
        parent = SubElement(root, 'command', opcode ='-ac')
        # Create children
        chdtag1Name = 'mode'
        chdtag1Value = 'repreport'
        chdtag2Name='status'
        chdtag2Value = status
        fullchildtag1 = ''+chdtag1Name+' value = "'+chdtag1Value+'"'
        fullchildtag2=''+chdtag2Name+' value="'+chdtag2Value+'"'
        children = XML('''<root><'''+fullchildtag1+''' /><'''+fullchildtag2+'''/></root> ''')
        # Add parent
        parent.extend(children)
        dll_xml_doc = xml_declaration + tostring(root)
        dll_xml_doc = prettify_xml(dll_xml_doc)
        return dll_xml_doc
    except Exception , error:
        log.error("xml_generation_failed : %s" % error)

Try using PyInstaller instead of py2exe. I converted your program to a binary .exe with no problem just by running python pyinstaller.py YourPath\xml_a.py.
