I'm using lxml to parse a file that contains xi:include elements, and I'm resolving the includes using xinclude().
Given an element, is there any way to identify the file and source line that the element originally appeared in?
For example:
from lxml import etree
doc = etree.parse('file.xml')
doc.xinclude()
xpath_expression = ...
elt = doc.xpath(xpath_expression)
# Print file name and source line of `elt` location
The xinclude expansion adds an xml:base attribute to the top-level expanded element,
and elt.base and elt.sourceline are updated for its child nodes as well, so:
print(elt.base, elt.sourceline)
will give you what you want.
If elt is not part of the xinclude expansion, then elt.base will point to the base
document ( 'file.xml' ) and elt.sourceline will be the line number in that file.
( Note that sourceline usually seems to actually point to the line where the element tag
ends, not to the line where it begins, if the element is on multiple lines, just as
validation error messages usually point to the closing tag where the error occurs. )
You can find the initial xincluded elements and check this with:
xels = doc.xpath('//*[@xml:base]')
for x in xels:
    print(x.tag, x.base, x.sourceline)
    # iterate over the direct children of the included element
    for c in x:
        print(c.tag, c.base, c.sourceline)
Sadly, current versions of lxml no longer include this ability. However, I've developed a workaround using a simple custom loader. Here's a test script which demonstrates the bug in the approach above along with the workaround. Note that this approach only updates the xml:base attribute of the root tag of the included document.
The output of the program (using Python 3.9.1, lxml 4.6.3):
Included file was source.xml; xinclude reports it as document.xml
Included file was source.xml; workaround reports it as source.xml
Here's the sample program.
# Includes
# ========
from pathlib import Path
from textwrap import dedent
from lxml import etree as ElementTree
from lxml import ElementInclude
# Setup
# =====
# Create a sample document, taken from the `Python stdlib
# <https://docs.python.org/3/library/xml.etree.elementtree.html#id3>`_...
Path("document.xml").write_text(
    dedent(
        """\
        <?xml version="1.0"?>
        <document xmlns:xi="http://www.w3.org/2001/XInclude">
          <xi:include href="source.xml" parse="xml" />
        </document>
        """
    )
)
# ...and the associated include file.
Path("source.xml").write_text("<para>This is a paragraph.</para>")
# Failing xinclude case
# =====================
# Load and xinclude this.
tree = ElementTree.parse("document.xml")
tree.xinclude()
# Show that the ``base`` attribute refers to the top-level
# ``document.xml``, instead of the xincluded ``source.xml``.
root = tree.getroot()
print(f"Included file was source.xml; xinclude reports it as {root[0].base}")
# Workaround
# ==========
# As a workaround, define a loader which sets the ``xml:base`` of an
# xincluded element. While lxml evidently used to do this, a change
# eliminated this ability per some `discussion
# <https://mail.gnome.org/archives/xml/2014-April/msg00015.html>`_,
# which included a rejected patch fixing this problem. `Current source
# <https://github.com/GNOME/libxml2/blob/master/xinclude.c#L1689>`_
# lacks this patch.
def my_loader(href, parse, encoding=None, parser=None):
    ret = ElementInclude._lxml_default_loader(href, parse, encoding, parser)
    ret.attrib["{http://www.w3.org/XML/1998/namespace}base"] = href
    return ret
new_tree = ElementTree.parse("document.xml")
ElementInclude.include(new_tree, loader=my_loader)
new_root = new_tree.getroot()
print(f"Included file was source.xml; workaround reports it as {new_root[0].base}")
Trying to get the text from an HtmlElement in lxml. For example, I have the HTML read in by
thing = lxml.html.fromstring("<code>&lt;div&gt;</code>")
But when I call thing.text I get <div>, meaning that lxml is translating escape characters. Is there a way to get this raw text, i.e., &lt;div&gt;? This is part of the output when I do lxml.html.tostring(thing), but that includes the opening and closing tags, which I don't want.
Tried calling tostring with a few different encoding options but no luck.
So I looked into it a bit closer:
cdef tostring(...) in src\lxml\etree.pyx - see https://github.com/lxml/lxml/blob/master/src/lxml/etree.pyx
cdef _tostring(...) in src\lxml\serializer.pxi - see https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi
and I couldn't find anything that would suggest you could get the escaped string by configuring the parameters of the tostring() function. It seems like it will always return the unescaped string maybe due to security concerns ...
The way I see it, you would have to use another function such as html.escape to get the escaped string:
import lxml.html
from html import escape as html_escape
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
raw_thing = lxml.html.tostring(thing, method="text", encoding="unicode")  # <div>MY TEST DIV</div>
escaped_thing = html_escape(raw_thing)  # &lt;div&gt;MY TEST DIV&lt;/div&gt;
print(escaped_thing)
Essentially, what you are looking for is lxml.html.tostring(root, method='xml', encoding="unicode"):
import lxml.html
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
output = lxml.html.tostring(thing, method='xml', encoding="unicode")
print(output)  # <code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>
The problem is that it cannot separate the root element from its content in <code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>
However with a different approach you can get the desired output:
import xml.etree.ElementTree as ET
thing = """
<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div><div>&lt;div&gt;BBB&lt;/div&gt;</div></code>
"""
root = ET.fromstring(thing)
# ET._escape_attrib is a private ElementTree helper and may change between versions
root_text = ET._escape_attrib(root.text)
print(root_text)
for child in root:
    child_text = ET._escape_attrib(child.text)
    print(child_text)
The code above prints out:
&lt;div&gt;MY TEST DIV&lt;/div&gt;
&lt;div&gt;AAA&lt;/div&gt;
&lt;div&gt;BBB&lt;/div&gt;
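Since ET._escape_attrib is a private helper, the same idea can be sketched with the public xml.sax.saxutils.escape function instead. This is only an illustration, not the answer's original method; the sample markup and the helper name escaped_texts are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample markup (assumed for illustration): escaped text inside <code>
# and inside one nested <div>.
markup = (
    "<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;"
    "<div>&lt;div&gt;AAA&lt;/div&gt;</div></code>"
)
root = ET.fromstring(markup)


def escaped_texts(element):
    """Return the re-escaped text of element and of each direct child."""
    nodes = [element, *element]
    # The parser has already unescaped the text; escape() turns
    # &, < and > back into entity references.
    return [escape(node.text or "") for node in nodes]


for line in escaped_texts(root):
    print(line)
```

html.escape from the standard library would work here too; it additionally escapes quote characters by default.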
I am trying to parse some HTML which has as an example
<solids>
&sub2;
</solids>
The html file is read in as a string. I need to insert the HTML from a file that sub2 defines into the appropriate part of the string before then processing the whole string as XML.
I have tried HTMLParser and using its handlers with
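from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_entityref(self, name):
        # This gets called when the entity is referenced
        # (in Python 3, only if the parser is created with convert_charrefs=False)
        print("Entity reference : " + name)
        print("Current Section : " + self.get_starttag_text())
        print(self.getpos())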
But getpos returns a line number and offset rather than position in the string. ( The insertion can be at any point in the file )
I found this link, which suggests using lxml. I have looked at lxml but cannot see how it would solve the problem. Its scanner does not seem to have an entity handler, and it seems to be for XML rather than HTML.
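For reference, the (line, offset) pair returned by getpos() can be converted into an index into the original string, which is enough to splice replacement text in place of an entity reference. The following is a minimal sketch under stated assumptions: the names absolute_index, EntityLocator and expand_entities are hypothetical, the replacement text is passed in as a dict rather than read from a file, and each reference is assumed to end with a semicolon.

```python
from html.parser import HTMLParser


def absolute_index(text, lineno, offset):
    """Convert a (1-based line, 0-based column) pair into an index into text."""
    lines = text.split("\n")
    # Each earlier line contributes its length plus one newline character.
    return sum(len(line) + 1 for line in lines[:lineno - 1]) + offset


class EntityLocator(HTMLParser):
    """Record the name and absolute position of every named entity reference."""

    def __init__(self, text):
        # convert_charrefs=False is needed for handle_entityref to be called.
        super().__init__(convert_charrefs=False)
        self.text = text
        self.found = []  # list of (name, index of the '&')
        self.feed(text)
        self.close()

    def handle_entityref(self, name):
        lineno, offset = self.getpos()
        self.found.append((name, absolute_index(self.text, lineno, offset)))


def expand_entities(text, replacements):
    """Replace known '&name;' references, working backwards so indices stay valid."""
    for name, start in reversed(EntityLocator(text).found):
        token = "&" + name + ";"
        if name in replacements and text.startswith(token, start):
            text = text[:start] + replacements[name] + text[start + len(token):]
    return text


source = "<solids>\n&sub2;\n</solids>"
print(EntityLocator(source).found)
print(expand_entities(source, {"sub2": "<box name='b1'/>"}))
```

Working from the last reference to the first means earlier indices are not shifted by the text already inserted.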
Okay, I found that lxml will handle the ENTITY references for me.
I just had to set up the parser with the option resolve_entities=True:
from lxml import etree
parser = etree.XMLParser(resolve_entities=True)
root = etree.parse(filename, parser=parser)
I am a newbie with one week of experience writing Python scripts.
I am trying to write a generic parser (a library for all my future jobs) that parses any input XML without any prior knowledge of its tags. It should:
Parse the input XML.
Get the values from the XML and set them based on the tags.
Use these values in the rest of the job.
I am using the "xml.etree.ElementTree" library, and I am able to parse the XML in the way shown below.
#!/usr/bin/python
import os
import xml.etree.ElementTree as etree
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.info('start reading XML property file')
filename = "mood_ib_history_parameters_DEV.xml"
logger.info('getting the current location')
__currentlocation__ = os.getcwd()
__fullpath__ = os.path.join(__currentlocation__,filename)
logger.info('start parsing the XML property file')
tree = etree.parse(__fullpath__)
root = tree.getroot()
hive_db = root.find("hive_db").text
EDGE_HIVE_CONN = root.find("EDGE_HIVE_CONN").text
target_dir = root.find("target_dir").text
to_email_alias = root.find("to_email_alias").text
to_email_cc = root.find("to_email_cc").text
from_email_alias = root.find("from_email_alias").text
dburl = root.find("dburl").text
SQOOP_EDGE_CONN = root.find("SQOOP_EDGE_CONN").text
user_name = root.find("user_name").text
password = root.find("password").text
IB_log_table = root.find("IB_log_table").text
SR_DG_master_table = root.find("SR_DG_master_table").text
SR_DG_table = root.find("SR_DG_table").text
logger.info('Hive DB %s', hive_db)
logger.info('Edge Hive Connection %s', EDGE_HIVE_CONN)
logger.info('Target Directory %s', target_dir)
logger.info('To Email address %s', to_email_alias)
logger.info('CC Email address %s', to_email_cc)
logger.info('From Email address %s', from_email_alias)
logger.info('DB URL %s',dburl)
logger.info('Sqoop Edge node connection %s',SQOOP_EDGE_CONN)
logger.info('Log table name %s',IB_log_table)
logger.info('Master table name %s',SR_DG_master_table)
logger.info('Data governance table name %s',SR_DG_table)
Now the question is: if I want to parse an XML file without any knowledge of its tags and elements and then use the values, how do I do it? I have gone through multiple tutorials, but all of them parse the XML by using known tag names, like below:
SQOOP_EDGE_CONN = root.find("SQOOP_EDGE_CONN").text
Can anybody point me to the right tutorial, library, or code snippet to parse the XML dynamically?
I think official documentation is pretty clear and contains some examples: https://docs.python.org/3/library/xml.etree.elementtree.html
The main part you need to implement is a loop over the child nodes (potentially recursively):
for child in root:
    # child.tag contains the tag name, child.attrib contains the attributes
    print(child.tag, child.attrib)
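The recursive variant could look roughly like this. It is a sketch only: the sample document and the helper name walk are invented for illustration and are not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are only for illustration.
SAMPLE = """
<config>
  <hive_db>sales</hive_db>
  <email>
    <to>team@example.com</to>
    <cc>lead@example.com</cc>
  </email>
</config>
"""


def walk(element, path=""):
    """Yield (path, text) for every element that carries non-blank text."""
    current = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield current, text
    # Recurse into the children, whatever their tag names are.
    for child in element:
        yield from walk(child, current)


root = ET.fromstring(SAMPLE)
for path, text in walk(root):
    print(path, "=", text)
```

Building the path while descending keeps nested tags with the same name apart, which a plain tag-to-text mapping would not.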
Well, parsing is as easy as etree.parse(path).
Once you've got the root in hand using tree.getroot(), you can just iterate over the tree using Python's "in":
for child_node in tree.getroot():
    print(child_node.text)
Then, to see which tags these child nodes have, you do the same trick.
This lets you go over all tags in the XML without having to know the tag names at all.
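For a flat property file like the one in the question, the iteration can be collected into a dictionary keyed by tag name. This is a minimal sketch: the root element name, the values, and the helper name load_settings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Stand-in for the property file; the root name and the values are made up.
SAMPLE = """
<properties>
  <hive_db>mood_dev</hive_db>
  <target_dir>/tmp/out</target_dir>
  <user_name>etl_user</user_name>
</properties>
"""


def load_settings(root):
    """Map each direct child's tag name to its stripped text."""
    return {child.tag: (child.text or "").strip() for child in root}


settings = load_settings(ET.fromstring(SAMPLE))
for name, value in settings.items():
    print(name, "->", value)
```

The rest of the job can then read settings["hive_db"] and so on, and a new tag in the file needs no change to the parsing code.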
In Python/Django, I need to parse and objectify an .xml file according to a given XML Schema made of three .xsd files that refer to each other in the following way:
schema3.xsd (referring schema1.xsd)
schema2.xsd (referring schema1.xsd)
schema1.xsd (referring schema2.xsd)
For this I'm using the following piece of code, which I have already tested successfully with a single pair of xml/xsd files (where the .xsd is "standalone" and does not refer to other .xsd files):
import lxml
import os.path
from django.shortcuts import render
from lxml import etree, objectify
from lxml.etree import XMLSyntaxError

def xml_validator(request):
    # define the paths of the files
    path_file_xml = '../myxmlfile.xml'
    path_file_xsd = '../schema3.xsd'
    # read the XML file
    xml_file = open(path_file_xml, 'r')
    xml_string = xml_file.read()
    xml_file.close()
    # load the XML Schema
    doc = etree.parse(path_file_xsd)
    schema = etree.XMLSchema(doc)
    # define the parser
    parser = objectify.makeparser(schema=schema)
    # transform the XML file
    root = objectify.fromstring(xml_string, parser)
    test1 = root.tag
    return render(request, 'risultati.html', {'risultato': test1})
Unfortunately, I'm stuck on the following error, which I get with the multiple .xsd files described above:
complex type 'ObjectType': The content model is not determinist.
Request Method: GET
Request URL: http://127.0.0.1:8000/xml_validator
Django Version: 1.9.1
Exception Type: XMLSchemaParseError
Exception Value: complex type 'ObjectType': The content model is not determinist., line 80
Any idea about that ?
Thanks a lot in advance for any suggestion or useful tips to approach this problem...
cheers
Update 23/03/2016
Here (and in the following answers to the post, because it actually exceeds the maximum number of characters for a post) is a sample of the files to help figure out the problem:
sample files on GitHub
My best guess would be that your XSD model does not obey the Unique Particle Attribution rule. I would rule that out before looking at anything else.
Okay guys, I'm new to parsing XML and Python, and am trying to get this to work. If someone could help me with this it would be greatly appreciated. If you can help me (educate me) on how to figure it out for myself, that would be even better!
I am having trouble trying to figure out the range to reference for an XML document as I can't find any documentation on it. Here is my code and I'll include the entire Traceback after.
#import library to do http requests:
import urllib.request
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('Data.Results.Power.ID')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<id>','').replace('</id>','')
#print out the xml tag and data in this format: <tag>data</tag>
print(xmlTag)
#just print the data
print(xmlData)
Traceback
/usr/bin/python3.4 /home/mint/PycharmProjects/DnD_Project/Power_Name.py
Traceback (most recent call last):
File "/home/mint/PycharmProjects/DnD_Project/Power_Name.py", line 14, in <module>
xmlTag = dom.getElementsByTagName('id')[0].toxml()
IndexError: list index out of range
Process finished with exit code 1
print(len(dom.getElementsByTagName('id')))
EDIT:
ids = dom.getElementsByTagName('id')
if len(ids) > 0:
    xmlTag = ids[0].toxml()
    # rest of code
EDIT: I added an example because I saw in another comment that you don't know how to use it.
BTW: I added some comments in the code about the file/connection.
import urllib.request
from xml.dom.minidom import parseString
# create connection to data/file on server
connection = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
# read from server as string (not "convert" to string):
data = connection.read()
#close connection because we dont need it anymore:
connection.close()
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# check if there are any data
if len(ids) > 0:
    xmlTag = ids[0].toxml()
    xmlData = xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
else:
    print("Sorry, there was no data")
Or you can use a for loop if there are more tags:
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    xmlData = xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
BTW:
getElementsByTagName() expects a tag name (ID), not a path (Data.Results.Power.ID)
the tag name is ID, so you have to replace <ID>, not <id>
for this tag you can even use one_tag.firstChild.nodeValue in place of xmlTag.replace
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('ID') # tagname
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    #xmlData = xmlTag.replace('<ID>','').replace('</ID>','')
    xmlData = one_tag.firstChild.nodeValue
    print(xmlTag)
    print(xmlData)
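To try this without hitting the server, the same calls can be run against an inline string. The XML below is a hypothetical stand-in whose layout is only guessed from the path Data.Results.Power.ID used in the question, and element_texts is a helper name made up for this sketch.

```python
from xml.dom.minidom import parseString

# Hypothetical response; the real service's layout is an assumption.
SAMPLE = """<Data><Results>
<Power><ID>101</ID><Name>Healing Word</Name></Power>
<Power><ID>202</ID><Name>Mass Healing Word</Name></Power>
</Results></Data>"""


def element_texts(xml_text, tag_name):
    """Return the text of every element called tag_name (case-sensitive)."""
    dom = parseString(xml_text)
    texts = []
    for node in dom.getElementsByTagName(tag_name):
        # firstChild is None for an empty element such as <ID/>
        texts.append(node.firstChild.nodeValue if node.firstChild else "")
    return texts


print(element_texts(SAMPLE, "ID"))
print(element_texts(SAMPLE, "id"))
print(element_texts(SAMPLE, "Data.Results.Power.ID"))
```

Only the first call finds anything: the lowercase name and the dotted path both return an empty list, which is what leads to the IndexError when [0] is taken.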
I haven't used the built in xml library in a while, but it's covered in Mark Pilgrim's great Dive into Python book.
-- I see as I'm typing this that your question has already been answered but since you mention being new to Python I think you will find the text useful for xml parsing and as an excellent introduction to the language.
If you would like to try another approach to parsing xml and html, I highly recommend lxml.