Entity references and lxml - python

Here's the code I have:
from cStringIO import StringIO
from lxml import etree
xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY test "This is a test">
]>
<root>
<sub>&test;</sub>
</root>''')
d1 = etree.parse(xml)
print '%r' % d1.find('/sub').text
parser = etree.XMLParser(resolve_entities=False)
d2 = etree.parse(xml, parser=parser)
print '%r' % d2.find('/sub').text
Here's the output:
'This is a test'
None
How do I get lxml to give me '&test;', i.e., the raw entity reference?

With resolve_entities=False, the unresolved entity is left as a child node of the sub element, which is why sub.text is None:
>>> print d2.find('/sub')[0]
&test;
>>> d2.find('/sub').getchildren()
[&test;]

Related

Reading a spreadsheet like .xml with ElementTree

I am reading an XML file using ElementTree, but there is one Cell whose data I cannot read.
I adapted my file to make a reproducible example, presented below:
from xml.etree import ElementTree
import io
xmlf = """<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook ss:ResourcesPackageName="" ss:ResourcesPackageVersion="" xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="DigitalOutput" ss:IsDeviceType="true">
<Row ss:AutoFitHeight="0">
<Cell><Data ss:Type="String">A</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">B</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">C</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="7"><ss:Data ss:Type="String"
xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="10"><Data ss:Type="String">D</Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
</Row>
</Worksheet>
</Workbook>"""
ss = "urn:schemas-microsoft-com:office:spreadsheet"
worksheet_label = '{%s}Worksheet' % ss
row_label = '{%s}Row' % ss
cell_label = '{%s}Cell' % ss
data_label = '{%s}Data' % ss
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall(worksheet_label):
    for table in ws.findall(row_label):
        for c in table.findall(cell_label):
            data = c.find(data_label)
            print(data.text)
The output is:
A
B
C
None
D
So the fourth cell was not read. Can you help me fix this?
Question: Reading a spreadsheet like .xml with ElementTree
Documentation: The lxml.etree Tutorial- Namespaces
Define the namespaces used:
ns = {'ss': "urn:schemas-microsoft-com:office:spreadsheet",
      'html': "http://www.w3.org/TR/REC-html40"}
Use the namespaces with find(...) and findall(...):
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall('ss:Worksheet', ns):
    for table in ws.findall('ss:Row', ns):
        for c in table.findall('ss:Cell', ns):
            data = c.find('ss:Data', ns)
            if data.text is None:
                text = []
                data = data.findall('html:Font', ns)
                for element in data:
                    text.append(element.text)
                data_text = ''.join(text)
                print(data_text)
            else:
                print(data.text)
Output:
A
B
C
CAN'T READ THIS
D
Tested with Python: 3.5
The text content of the fourth cell belongs to the two Font subelements, which are bound to another namespace. Demo:
for e in root.iter():
    text = e.text.strip() if e.text else None
    if text:
        print(e, text)
Output:
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> A
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> B
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> C
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e08> CAN'T READ
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e48> THIS
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01e48> D
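If the markup inside a Data element is not known in advance, another option is to collect all nested text with itertext(), which walks the element and its descendants regardless of namespace. The following is only a minimal sketch: the workbook is a cut-down version of the one in the question (shortened here as an assumption for brevity), and cell_texts is a name chosen for illustration.

```python
import io
from xml.etree import ElementTree

# Cut-down version of the workbook from the question (shortened for brevity).
xmlf = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <Worksheet ss:Name="DigitalOutput">
  <Row>
   <Cell><Data ss:Type="String">A</Data></Cell>
   <Cell ss:Index="7"><ss:Data ss:Type="String"
    xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data></Cell>
   <Cell ss:Index="10"><Data ss:Type="String">D</Data></Cell>
  </Row>
 </Worksheet>
</Workbook>"""

ns = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}
root = ElementTree.parse(io.StringIO(xmlf)).getroot()

# itertext() yields the element's own text plus the text of all descendants,
# so the two Font fragments are joined without naming their namespace.
cell_texts = []
for data in root.iterfind('.//ss:Cell/ss:Data', ns):
    cell_texts.append(''.join(data.itertext()))

print(cell_texts)
```

This trades precision for convenience: any text nested under Data is included, so it suits cells where only the visible string matters.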

How to search for a word in an XML file and print it in Python

I want to search for a specific word (entered by the user) in an .xml file. This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<words>
<entry>
<word>John</word>
<pron>()</pron>
<gram>[Noun]</gram>
<poem></poem>
<meanings>
<meaning>name</meaning>
</meanings>
</entry>
</words>
Here is my code:
import nltk
from nltk.tokenize import word_tokenize
import os
import xml.etree.ElementTree as etree
sen = input("Enter Your sentence - ")
print(sen)
print("\n")
print(word_tokenize(sen)[0])
tree = etree.parse('roman.xml')
node=etree.fromstring(tree)
#node=etree.fromstring('<a><word>waya</word><gram>[Noun]</gram><meaning>talking</meaning></a>')
s = node.findtext(word_tokenize(sen)[0])
print(s)
I have tried everything, but it still gives me this error:
a bytes-like object is required, not 'ElementTree'
I really don't know how to solve it.
The error happens because you are passing an ElementTree object to fromstring(), which expects a string or bytes containing XML. parse() has already parsed the file, so call getroot() on its result instead:
>>> import os
>>> import xml.etree.ElementTree as etree
>>> a = etree.parse('a.xml')
>>> a
<xml.etree.ElementTree.ElementTree object at 0x10fcabeb8>
>>> b = a.getroot()
>>> b
<Element 'words' at 0x10fb21f48>
>>> b[0][0].text
'John'
Use the find() and findall() methods to search.
For more information, see the library documentation: https://docs.python.org/3/library/xml.etree.elementtree.html
Simple example:
test.xml
<?xml version="1.0" encoding="UTF-8"?>
<words>
<word value="John"></word>
<word value="Mike"></word>
<word value="Scott"></word>
</words>
example.py
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse("test.xml")
>>> search = root.findall(".//word[@value='John']")
>>> search
[<Element 'word' at 0x10be9c868>]
>>> search[0].attrib
{'value': 'John'}
>>> search[0].tag
'word'
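For the XML layout in the question, where the word is element text rather than an attribute, a lookup could be sketched as follows. This is only an illustration under assumptions: lookup is a hypothetical helper name, the XML is inlined instead of being read from roman.xml, and the word is hard-coded where the question reads it from input().

```python
import xml.etree.ElementTree as ET

# Same structure as the file in the question, inlined for a self-contained example.
xml_text = """<words>
<entry>
<word>John</word>
<pron>()</pron>
<gram>[Noun]</gram>
<poem></poem>
<meanings>
<meaning>name</meaning>
</meanings>
</entry>
</words>"""


def lookup(root, word):
    """Return (gram, meanings) for the first entry whose <word> equals word, else None."""
    for entry in root.iter('entry'):
        if entry.findtext('word') == word:
            meanings = [m.text for m in entry.iter('meaning')]
            return entry.findtext('gram'), meanings
    return None


root = ET.fromstring(xml_text)   # for a file: ET.parse('roman.xml').getroot()
result = lookup(root, 'John')    # hypothetical search term
missing = lookup(root, 'Mike')   # no such entry, so None
print(result)
```

Comparing the text in Python rather than building a path expression from user input avoids quoting problems when the word contains an apostrophe.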

Parsing XML file with lxml in Python

I need to parse an XML file, let's say called example.xml, that looks like the following:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nf:rpc-reply xmlns:nf="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns="http://www.cisco.com/nxos:1.0:if_manager">
<nf:data>
<show>
<interface>
<__XML__OPT_Cmd_show_interface___readonly__>
<__readonly__>
<TABLE_interface>
<ROW_interface>
<interface>Ethernet1/1</interface>
<state>down</state>
<state_rsn_desc>Link not connected</state_rsn_desc>
<admin_state>up</admin_state>
I need to get the text of the "interface" and "state" elements, like this: ['Ethernet1/1', 'down']
Here is my solution, which doesn't work:
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse('example.xml', parser)
print tree.xpath('//*/*/*/*/*/*/*/*/interface/text()')
print tree.xpath('//*/*/*/*/*/*/*/*/state/text()')
You need to handle namespaces here:
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse('example.xml', parser)
ns = {'ns': 'http://www.cisco.com/nxos:1.0:if_manager'}
interface = tree.find('//ns:ROW_interface', namespaces=ns)
print [interface.find('.//ns:interface', namespaces=ns).text,
       interface.find('.//ns:state', namespaces=ns).text]
Prints:
['Ethernet1/1', 'down']
Using collections.namedtuple():
from collections import namedtuple
interface_node = tree.find('//ns:ROW_interface', ns)
Interface = namedtuple('Interface', ['interface', 'state'])
interface = Interface(interface=interface_node.find('.//ns:interface', ns).text,
                      state=interface_node.find('.//ns:state', ns).text)
print interface
Prints:
Interface(interface='Ethernet1/1', state='down')
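The same prefix-to-URI mapping also works with the standard library's xml.etree.ElementTree, which may help when lxml is not installed. The sketch below is written for Python 3 and collects every ROW_interface rather than only the first. Note the assumption: the XML in the question is truncated, so the closing tags here were added to make the sample well-formed, and the remaining child elements are left out.

```python
import xml.etree.ElementTree as ET

# The question's XML is cut off; the closing tags below are assumed.
xml_bytes = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<nf:rpc-reply xmlns:nf="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns="http://www.cisco.com/nxos:1.0:if_manager">
 <nf:data>
  <show>
   <interface>
    <__XML__OPT_Cmd_show_interface___readonly__>
     <__readonly__>
      <TABLE_interface>
       <ROW_interface>
        <interface>Ethernet1/1</interface>
        <state>down</state>
       </ROW_interface>
      </TABLE_interface>
     </__readonly__>
    </__XML__OPT_Cmd_show_interface___readonly__>
   </interface>
  </show>
 </nf:data>
</nf:rpc-reply>"""

# Unprefixed elements live in the default namespace; bind it to a prefix of our choosing.
ns = {'ns': 'http://www.cisco.com/nxos:1.0:if_manager'}
root = ET.fromstring(xml_bytes)

rows = []
for row in root.iterfind('.//ns:ROW_interface', ns):
    rows.append([row.findtext('ns:interface', namespaces=ns),
                 row.findtext('ns:state', namespaces=ns)])

print(rows)
```

The prefix ns is arbitrary; only the URI has to match the xmlns declaration in the document.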

etree.strip_tags returning 'None' when trying to strip tag

Script:
print entryDetails
for i in range(len(entryDetails)):
    print etree.tostring(entryDetails[i])
    print etree.strip_tags(entryDetails[i], 'entry-details')
Output:
[<Element entry-details at 0x234e0a8>, <Element entry-details at 0x234e878>]
<entry-details>2014-02-05 11:57:01</entry-details>
None
<entry-details>2014-02-05 12:11:05</entry-details>
None
How is etree.strip_tags failing to strip the entry-details tag? Is the dash in the tag name affecting it?
strip_tags() does not return anything. It strips off the tags in-place.
The documentation says: "Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.".
Demo code:
from lxml import etree
XML = """
<root>
<entry-details>ABC</entry-details>
</root>"""
root = etree.fromstring(XML)
ed = root.xpath("//entry-details")[0]
print ed
print
etree.strip_tags(ed, "entry-details") # Has no effect
print etree.tostring(root)
print
etree.strip_tags(root, "entry-details")
print etree.tostring(root)
Output:
<Element entry-details at 0x2123b98>
<root>
<entry-details>ABC</entry-details>
</root>
<root>
ABC
</root>

Parsing WSDL (retrieve namespaces from the definitions) using ElementTree

I am trying to parse a WSDL file using ElementTree. As part of this, I'd like to retrieve all the namespaces from a given WSDL definitions element.
For instance, in the snippet below, I am trying to retrieve all the namespaces in the definitions tag:
<?xml version="1.0"?>
<definitions name="DateService" targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl" xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:myType="DateType_NS" xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
My code looks like this
import xml.etree.ElementTree as ET
xml_file='<path_to_my_wsdl>'
tree = ET.parse(xml_file)
rootElement = tree.getroot()
print (rootElement.tag) #{http://schemas.xmlsoap.org/wsdl/}definitions
print(rootElement.attrib) #targetNamespace="http://dev-b..../DateService.wsdl"
As I understand it, in ElementTree the namespace URI is combined with the local name of the element. How can I retrieve all the namespace entries from the definitions element?
Appreciate your help on this
P.S: I am new (very!) to python
>>> import xml.etree.ElementTree as etree
>>> from StringIO import StringIO
>>>
>>> s = """<?xml version="1.0"?>
... <definitions
... name="DateService"
... targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl"
... xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
... xmlns="http://schemas.xmlsoap.org/wsdl/"
... xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
... xmlns:myType="DateType_NS"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema"
... xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
... </definitions>"""
>>> file_ = StringIO(s)
>>> namespaces = []
>>> for event, elem in etree.iterparse(file_, events=('start-ns',)):
...     print elem
...
(u'tns', 'http://dev-b.handel-dev.local:8080/DateService.wsdl')
('', 'http://schemas.xmlsoap.org/wsdl/')
(u'soap', 'http://schemas.xmlsoap.org/wsdl/soap/')
(u'myType', 'DateType_NS')
(u'xsd', 'http://www.w3.org/2001/XMLSchema')
(u'wsdl', 'http://schemas.xmlsoap.org/wsdl/')
Inspired by the ElementTree documentation
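For Python 3, the same start-ns technique could be sketched as below, collecting the declarations into a dict instead of printing them. The WSDL is inlined as bytes rather than read from a path, and namespaces is simply an illustrative variable name.

```python
import io
import xml.etree.ElementTree as ET

# The definitions element from the question, closed so that it is well-formed.
wsdl = b"""<?xml version="1.0"?>
<definitions name="DateService"
 targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl"
 xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
 xmlns="http://schemas.xmlsoap.org/wsdl/"
 xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
 xmlns:myType="DateType_NS"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
 xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
</definitions>"""

namespaces = {}
# With events=('start-ns',), each item is ('start-ns', (prefix, uri));
# the default namespace is reported with an empty-string prefix.
for event, (prefix, uri) in ET.iterparse(io.BytesIO(wsdl), events=('start-ns',)):
    # Keep the first declaration seen for each prefix.
    namespaces.setdefault(prefix, uri)

print(namespaces)
```

Because iterparse reports declarations from the whole document, a prefix redeclared on a nested element would be ignored by setdefault here; for a WSDL that declares everything on definitions this makes no difference.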
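Alternatively, you can use lxml, where the namespace mapping is available as the nsmap attribute of the root element: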
from lxml import etree
tree = etree.parse(file)
root = tree.getroot()
namespaces = root.nsmap
see https://stackoverflow.com/a/26807636/5375693
