How to read and parse XML without schema in Python?

Is there a way to read an XML document in Python without the schema? In my use case there is a file similar to the following.
<people>
  <human>
    <weight>75</weight>
    <height>174</height>
  </human>
  <human>
    <weight>89</weight>
    <height>187</height>
  </human>
</people>
I need to extract an array of the weight values from it. This could easily be done with string manipulation, but surely there is a cleaner way to do it with an XML parser?

You could use ElementTree (included in the Python standard library) and do the following:
import xml.etree.ElementTree
tree = xml.etree.ElementTree.parse("foo.xml")
myArray = [int(x.text) for x in tree.getroot().findall("human/weight")]
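For a version that runs without a file on disk, here is a minimal sketch that parses the sample from the question out of a string. The variable names (xml_text, weights, heights) are just illustrative, and it assumes every weight and height element contains an integer.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, held in a string instead of foo.xml.
xml_text = """<people>
  <human><weight>75</weight><height>174</height></human>
  <human><weight>89</weight><height>187</height></human>
</people>"""

# fromstring() returns the root Element (<people>) directly.
root = ET.fromstring(xml_text)

# findall() takes a path relative to the element it is called on.
weights = [int(w.text) for w in root.findall("human/weight")]

# iter() walks the whole subtree, so no path is needed.
heights = [int(h.text) for h in root.iter("height")]

print(weights)
print(heights)
```

No schema is involved at any point: ElementTree only checks that the document is well-formed.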

Related

Using xml.etree.ElementTree to parse a buffer, not a file [duplicate]

ElementTree.parse reads from a file. How can I use it if I already have the XML data in a string?
Maybe I am missing something here, but there must be a way to use the ElementTree without writing out the string to a file and reading it again.

You can parse the text as a string, which creates an Element, and create an ElementTree using that Element.
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(xmlstring))
I just came across this issue and the documentation, while complete, is not very straightforward on the difference in usage between the parse() and fromstring() methods.
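To make the difference concrete, here is a small sketch (the sample string is made up for illustration): fromstring() hands back the root Element, which already supports find() and findall(), while ElementTree is only a thin wrapper around that root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for xmlstring.
xmlstring = "<people><human><weight>75</weight></human></people>"

# fromstring() parses a string and returns the root Element.
root = ET.fromstring(xmlstring)

# Wrapping the Element gives an ElementTree, which is mainly useful
# when you want tree-level methods such as write().
tree = ET.ElementTree(root)

# Searching works on the Element itself, no tree required.
first_weight = root.find("human/weight").text
print(root.tag, first_weight)
```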

If you're using xml.etree.ElementTree.parse to parse from a file, then you can use xml.etree.ElementTree.fromstring to get the root Element of the document. Often you don't actually need an ElementTree.
See xml.etree.ElementTree

You need xml.etree.ElementTree.fromstring(text):
from xml.etree.ElementTree import fromstring
myxml = fromstring(text)

io.StringIO is another option for getting XML into xml.etree.ElementTree:
import io
import xml.etree.ElementTree as ET
f = io.StringIO(xmlstring)
tree = ET.parse(f)
root = tree.getroot()
However, the XML declaration is not kept in the resulting tree, even though one might assume it is (it matters when calling ElementTree.write()). See How to write XML declaration using xml.etree.ElementTree.
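On the XML declaration point, here is a hedged sketch of one way to get a declaration back on output: ElementTree.write() accepts an xml_declaration argument. The sample string and the in-memory BytesIO target are assumptions chosen for illustration, so nothing touches the file system.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input that starts with an XML declaration.
xmlstring = '<?xml version="1.0" encoding="UTF-8"?><people><human/></people>'
tree = ET.parse(io.StringIO(xmlstring))

# Default serialization: the declaration from the input is not reproduced.
plain = ET.tostring(tree.getroot())

# Ask for a declaration explicitly when writing.
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
with_declaration = buffer.getvalue()

print(plain)
print(with_declaration)
```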

Modifying XML file from string source - how to do it?

I have a problem: I want to change some lines in my XML, but the XML is not in a file, it is in a string. I am using Python 3.x and the xml.etree.ElementTree library for this purpose.
I have this piece of code, which I know works for files in the project, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save it. I tried
source_tree.write(source_config, encoding='latin-1') but this doesn't work (it treats the whole XML string as a file name).

I don't think you need both source_tree and source_tree_root. By having both, you're parsing the string twice and creating two separate trees. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested since you didn't supply an MCVE):
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
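If the goal is to stay entirely in strings, one option is ET.tostring(), which serializes an Element back to text. This is only a sketch: the sample config and the "old"/"new" replacement values are made up and stand in for source_config, self.firstarg and self.secondarg.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for source_config.
source_config = (
    "<config><generation>old-value</generation>"
    "<generation>keep</generation></config>"
)

root = ET.fromstring(source_config)
for item in root.iter("generation"):
    if item.text:  # empty elements have text None, so skip them
        item.text = item.text.replace("old", "new")

# encoding="unicode" makes tostring() return str instead of bytes.
updated_config = ET.tostring(root, encoding="unicode")
print(updated_config)
```

Because there is only one tree here, the serialized result always reflects the edits.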

OK, I thought that you were using lxml. My bad. Well here is how you would do it with lxml. I think that lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree
doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
    e.set('foo', 'bar')

Understanding XML and XSD parsing in Python 3

I will ask this question somewhat vaguely because I'm not sure whether it can even be asked. Here goes:
I want to read an XML tree using Python 3, which I am new to. I have accomplished this with relative ease using:
xml.etree.ElementTree.parse(urllib.request.urlopen(url))
The XML streams are different data sets, and there is an XSD available, which I have also parsed in the same way. Now, my question is: can I create a parser using the XSD schema? I am new to using XML in this way, but I have found examples where the parser object was generated from the XSD and the XML was then read in accordingly. However, I cannot find the equivalent in Python 3.
Here is roughly what I want, as written for Python 2.x:
schema = etree.XMLSchema(schema_root)
xmlparser = etree.XMLParser(schema=schema)
I'm not sure if I'm even conceptualizing this correctly. Maybe this is an XML problem rather than a Python problem, i.e., maybe you can only validate the XML against the schema and not actually use the specifics from the XSD to parse it. Can anyone help clear this up?

xmlschema is a Python module for managing XML Schemas and validating XML instances against an XSD.
Example that validates XML documents fetched from a URL against an XML Schema using Python 3:
import requests
import xmlschema
import xml.etree.ElementTree as ET
# to make it simple use external XML Schema and create a local file from it to validate XML examples
xsd_url = "https://raw.githubusercontent.com/sissaschool/xmlschema/master/tests/test_cases/examples/collection/collection.xsd"
with open("test.xsd", "w", newline="") as out:
    out.write(requests.get(xsd_url).text)
xsd = xmlschema.XMLSchema("test.xsd")
# XML #1 validates to the Schema
url1 = "https://raw.githubusercontent.com/sissaschool/xmlschema/master/tests/test_cases/examples/collection/collection.xml"
xt = ET.fromstring(requests.get(url1).text)
print("xml1 valid=", xsd.is_valid(xt), sep="")
# XML #2 with invalid structure
url2 = "https://raw.githubusercontent.com/sissaschool/xmlschema/master/tests/test_cases/examples/collection/collection-1_error.xml"
xt = ET.fromstring(requests.get(url2).text)
print("xml2 valid=", xsd.is_valid(xt), sep="")
Output:
xml1 valid=True
xml2 valid=False

Following xs:include when parsing XSD as XML with lxml in Python

My problem is that I'm trying to do something a little unorthodox. I have a complicated set of XSD files. However, I don't want to use these XSD files to validate an XML file; I want to parse the XSDs as XML and interrogate them just as I would a normal XML file. This is possible because XSDs are valid XML. I am using lxml with Python 3.
The problem I'm having is with the statement:
<xs:include schemaLocation="sdm-extension.xsd"/>
If I instruct lxml to build a schema object for validation like this:
schema = etree.XMLSchema(schema_root)
this dependency will be resolved (the file exists in the same directory as the one I've just loaded). However, I am treating these files as plain XML, so lxml, correctly, treats this as a normal element with an attribute and does not follow it.
Is there an easy or correct way to extend lxml so that I get the same or similar behaviour as with, say:
<xi:include href="metadata.xml" parse="xml" xpointer="title"/>
I could, of course, manually create a separate XML file that includes all the dependencies in the XSD schema. Is that perhaps a solution?

So it seems like one option is to use xi:include and create a separate XML file that includes all the XSDs I want to parse. Note that the xi prefix has to be bound to the XInclude namespace for the includes to be processed. Something along the lines of:
<fullxsd xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
  <xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>
Then use some lxml along the lines of
from lxml import etree
def combine(xsd_file):
    with open(xsd_file, 'rb') as f_xsd:
        parser = etree.XMLParser(recover=True, encoding='utf-8', remove_comments=True, remove_blank_text=True)
        xsd_source = f_xsd.read()
        root = etree.fromstring(xsd_source, parser)
        incl = etree.XInclude()
        incl(root)  # expands the xi:include elements in place
        print(etree.tostring(root, pretty_print=True))
It's not ideal, but it seems to be the proper way. I've looked at custom URI parsers in lxml, but that would mean actually altering the XSDs, which seems messier.
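Another possibility, sketched here with the standard library's ElementTree rather than lxml, is to follow the xs:include elements by hand: parse each XSD as plain XML, look for xs:include children of the root, and recurse into the file named by schemaLocation. The function name load_with_includes, the demo helper, and the tiny schemas it writes to a temporary directory are all hypothetical, and the sketch assumes schemaLocation values are local paths relative to the including file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def load_with_includes(path, seen=None):
    """Parse an XSD as plain XML; return its root plus the roots of every file it xs:includes."""
    seen = set() if seen is None else seen
    path = os.path.abspath(path)
    if path in seen:  # guard against circular includes
        return []
    seen.add(path)
    root = ET.parse(path).getroot()
    roots = [root]
    # xs:include is only allowed as a direct child of xs:schema.
    for inc in root.findall(f"{{{XS}}}include"):
        location = inc.get("schemaLocation")
        if location:
            target = os.path.join(os.path.dirname(path), location)
            roots.extend(load_with_includes(target, seen))
    return roots

def demo():
    """Write two tiny made-up schemas to a temp dir and list all declared element names."""
    with tempfile.TemporaryDirectory() as tmp:
        main = os.path.join(tmp, "main.xsd")
        ext = os.path.join(tmp, "sdm-extension.xsd")
        with open(main, "w", encoding="utf-8") as f:
            f.write(f'<xs:schema xmlns:xs="{XS}">'
                    '<xs:include schemaLocation="sdm-extension.xsd"/>'
                    '<xs:element name="Study"/></xs:schema>')
        with open(ext, "w", encoding="utf-8") as f:
            f.write(f'<xs:schema xmlns:xs="{XS}">'
                    '<xs:element name="Extension"/></xs:schema>')
        roots = load_with_includes(main)
        return [el.get("name") for r in roots for el in r.iter(f"{{{XS}}}element")]

element_names = demo()
print(element_names)
```

This leaves the XSD files untouched and needs no wrapper document, at the cost of getting a list of separate roots instead of one merged tree. It does not handle xs:import or remote schemaLocation URLs.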

Try this:
from lxml import etree
def validate_xml(schema_file, xml_file):
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    return xsd.validate(xml)
