lxml include relative path - python

Using Python's lxml library, I'm trying to load a .xsd as schema. The Python script is in one directory and the schemas are in another:
/root
    my_script.py
    /data
        /xsd
            schema_1.xsd
            schema_2.xsd
The problem is that schema_1.xsd includes schema_2.xsd like this:
<xsd:include schemaLocation="schema_2.xsd"/>
Because schema_2.xsd is given as a relative path (the two schemas are in the same directory), lxml doesn't find it and raises an error:
schema_root = etree.fromstring(open('data/xsd/schema_1.xsd').read().encode('utf-8'))
schema = etree.XMLSchema(schema_root)
--> lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}include': Failed to load the document './schema_2.xsd' for inclusion
How can I solve this problem without changing the schema files?

One option is to use an XML Catalog. You could also probably use a custom URI Resolver, but I've always used a catalog. It's easier for non-developers to make configuration changes. This is especially helpful if you're delivering an executable instead of plain Python.
Using a catalog is different between Windows and Linux; see here for more info.
Here's a Windows example using Python 3.#.
XSD #1 (schema_1.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:include schemaLocation="schema_2.xsd"/>
<xs:element name="doc">
<xs:complexType>
<xs:sequence>
<xs:element ref="test"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="test" type="test"/>
</xs:schema>
XSD #2 (schema_2.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:simpleType name="test">
<xs:restriction base="xs:string">
<xs:enumeration value="Hello World"/>
</xs:restriction>
</xs:simpleType>
</xs:schema>
XML Catalog (catalog.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- The path in @uri is relative to this file (catalog.xml). -->
<system systemId="schema_2.xsd" uri="./xsd_test/schema_2.xsd"/>
</catalog>
Python
import os
from urllib.request import pathname2url
from lxml import etree
# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
if "XML_CATALOG_FILES" not in os.environ:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path
schema_root = etree.fromstring(open('xsd_test/schema_1.xsd').read().encode('utf-8'))
schema = etree.XMLSchema(schema_root)
print(schema)
Print Output
<lxml.etree.XMLSchema object at 0x02B4B3F0>

There may also be a simpler solution in your case. I ran into this today and resolved it by temporarily changing the current working directory while loading the XML schema:
import os
from lxml import etree
xml_schema_path = 'data/xsd/schema_1.xsd'
# Remember the working directory the script was run from
run_dir = os.getcwd()
# Switch to the schema directory so relative includes resolve from there
os.chdir(os.path.dirname(xml_schema_path))
try:
    # Load the schema. Note that you can use the `file=` option to point to a file path
    xml_schema = etree.XMLSchema(file=os.path.basename(xml_schema_path))
finally:
    # Restore the working directory, even if loading the schema fails
    os.chdir(run_dir)
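A possible alternative that leaves the working directory alone, offered as a sketch: it assumes the include fails because etree.fromstring() receives only bytes and therefore has no file location to resolve schemaLocation against, whereas etree.parse() on a path records that location. The helper name load_schema and the two throwaway schemas written to a temporary directory are illustrative only.

```python
import os
import tempfile
from lxml import etree


def load_schema(path):
    """Load an XSD from a file path (hypothetical helper name)."""
    # Parsing from a path keeps the document's location, which can then
    # serve as the base for a relative schemaLocation such as "schema_2.xsd".
    return etree.XMLSchema(etree.parse(path))


# Two minimal schemas modelled on the ones in the question.
MAIN_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="schema_2.xsd"/>
  <xs:element name="test" type="test"/>
</xs:schema>"""

INCLUDED_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="test">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Hello World"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>"""

with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("schema_1.xsd", MAIN_XSD), ("schema_2.xsd", INCLUDED_XSD)):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    # The current working directory is NOT the schema directory here.
    schema = load_schema(os.path.join(tmp, "schema_1.xsd"))
    accepted = schema.validate(etree.fromstring("<test>Hello World</test>"))
    rejected = schema.validate(etree.fromstring("<test>Goodbye</test>"))

print(accepted, rejected)
```

If the schema has to come from a string, etree.fromstring() also accepts a base_url argument that can play the same role.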

Related

Why does ElementTree eat/ignore namespaces (in attribute values)?

I'm trying to read XML with ElementTree and write the result back to disk. My long-term goal is to prettify the XML this way. However, in my naive approach, ElementTree eats all the namespace declarations in the document and I don't understand why. Here is an example
test.xsd
<?xml version='1.0' encoding='UTF-8'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
xmlns='sdformat/pose' targetNamespace='sdformat/pose'
xmlns:pose='sdformat/pose'
xmlns:types='http://sdformat.org/schemas/types.xsd'>
<xs:import namespace='sdformat/pose' schemaLocation='./pose.xsd'/>
<xs:element name='pose' type='poseType' />
<xs:simpleType name='string'><xs:restriction base='xs:string' /></xs:simpleType>
<xs:simpleType name='pose'><xs:restriction base='types:pose' /></xs:simpleType>
<xs:complexType name='poseType'>
<xs:simpleContent>
<xs:extension base="pose">
<xs:attribute name='relative_to' type='string' use='optional' default=''>
</xs:attribute>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:schema>
test.py
from xml.etree import ElementTree
ElementTree.register_namespace("types", "http://sdformat.org/schemas/types.xsd")
ElementTree.register_namespace("pose", "sdformat/pose")
ElementTree.register_namespace("xs", "http://www.w3.org/2001/XMLSchema")
tree = ElementTree.parse("test.xsd")
tree.write("test_out.xsd")
Produces test_out.xsd
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="sdformat/pose">
<xs:import namespace="sdformat/pose" schemaLocation="./pose.xsd" />
<xs:element name="pose" type="poseType" />
<xs:simpleType name="string"><xs:restriction base="xs:string" /></xs:simpleType>
<xs:simpleType name="pose"><xs:restriction base="types:pose" /></xs:simpleType>
<xs:complexType name="poseType">
<xs:simpleContent>
<xs:extension base="pose">
<xs:attribute name="relative_to" type="string" use="optional" default="">
</xs:attribute>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:schema>
Notice how test_out.xsd is missing the namespace declarations from test.xsd. I would expect the two files to be identical. I verified that test_out.xsd is valid XML by validating it. It validates with the exception of my choice of namespace URI, which I think shouldn't matter.
Update:
Based on mzji's comment I realized that this only happens for values of attributes. With this in mind, I can manually add the namespaces like so:
from xml.etree import ElementTree
namespaces = {
    "types": "http://sdformat.org/schemas/types.xsd",
    "pose": "sdformat/pose",
    "xs": "http://www.w3.org/2001/XMLSchema"
}
for prefix, ns in namespaces.items():
    ElementTree.register_namespace(prefix, ns)
tree = ElementTree.parse("test.xsd")
root = tree.getroot()
queue = [tree.getroot()]
while queue:
    element: ElementTree.Element = queue.pop()
    for value in element.attrib.values():
        try:
            prefix, value = value.split(":")
        except ValueError:
            # no namespace, nothing to do
            pass
        else:
            if prefix == "xs":
                continue  # ignore XMLSchema namespace, keep checking other attributes
            root.attrib[f"xmlns:{prefix}"] = namespaces[prefix]
    for child in element:
        queue.append(child)
tree.write("test_out.xsd")
While this solves the problem, it is quite an ugly solution. I also still don't understand why this happens in the first place, so it doesn't answer the question.
There is a valid reason for this behaviour, but it requires a good understanding of XML Schema concepts.
First, some important facts:
Your XML document is not just any old XML document. It is an XSD.
An XSD is itself described by a schema (see the 'schema for schema').
The attribute xs:restriction/@base is not an xs:string. Its type is xs:QName.
Based on the above facts, we can assert the following:
if test.xsd is parsed as an XML document, but without knowledge of the 'schema for schema' then the value of the base attribute will be treated as a string (technically, as PCDATA).
if test.xsd is parsed using a validating XML parser, with the 'schema for schema' as the XSD, then the value of the base attribute will be parsed as xs:QName
When ElementTree writes the output XML, its behaviour should depend on the data type of base. If base is a QName then ElementTree should detect that it is using the namespace prefix 'types' and it should emit the corresponding namespace declaration.
If you are not supplying the 'schema for schema' when parsing test.xsd then ElementTree is off the hook, because it cannot possibly know that base is supposed to be interpreted as a QName.
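To make the QName point concrete, here is a minimal sketch using only the standard library. It relies on the documented behaviour that an attribute value wrapped in ElementTree.QName gets namespace handling on output. The helper qualify_attribute_values, the choice of attribute names (base, type, ref), and the trimmed-down schema are assumptions for illustration; the prefix map is the one from the question's update.

```python
from xml.etree import ElementTree as ET

# Prefix-to-URI map, copied from the question's update.
NAMESPACES = {
    "types": "http://sdformat.org/schemas/types.xsd",
    "pose": "sdformat/pose",
    "xs": "http://www.w3.org/2001/XMLSchema",
}

# Trimmed-down stand-in for test.xsd.
SOURCE = """<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
    xmlns:types='http://sdformat.org/schemas/types.xsd'>
  <xs:simpleType name='pose'><xs:restriction base='types:pose'/></xs:simpleType>
</xs:schema>"""


def qualify_attribute_values(root, namespaces, attr_names=("base", "type", "ref")):
    """Replace 'prefix:local' strings in the named attributes with QName objects.

    Hypothetical helper: attr_names lists attributes assumed to hold QNames.
    """
    for element in root.iter():
        for name in attr_names:
            value = element.get(name)
            if not value or ":" not in value:
                continue  # unprefixed value, nothing to qualify
            prefix, local = value.split(":", 1)
            if prefix in namespaces:
                element.set(name, ET.QName(namespaces[prefix], local))


for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

# Plain round trip: the value is an ordinary string, so xmlns:types is dropped.
plain = ET.tostring(ET.fromstring(SOURCE), encoding="unicode")

# Round trip with QName values: the serializer now knows a namespace is in use.
root = ET.fromstring(SOURCE)
qualify_attribute_values(root, NAMESPACES)
qualified = ET.tostring(root, encoding="unicode")

print(plain)
print(qualified)
```

This sketch only covers prefixed values. Unprefixed values such as type='poseType', which depend on the default namespace declaration, would need separate handling.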

Why does XML only validate against an XSD when lxml etree.ElementTree is a string?

I have an XSD:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns="http://tempuri.org/me"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://tempuri.org/me">
<xs:element name="B"></xs:element>
</xs:schema>
Using lxml I can create an XML document and validate it against the XSD:
path = os.path.join(os.path.dirname(__file__), 'sample.xsd')
schema = etree.XMLSchema(file=path)
el = etree.Element('B', nsmap={None: 'http://tempuri.org/me'})
doc = etree.ElementTree(el)
schema.assertValid(doc)
However it produces the following error:
lxml.etree.DocumentInvalid: Element 'B': No matching global declaration available for the validation root.
That error doesn't occur if I convert doc to a string and back again; then it validates:
st = etree.tostring(doc)
schema.assertValid(etree.XML(st)) # This validates.
What is going on here? Why do I need to convert my etree document to a string and back again to make it validate? How can I prevent that wasteful step?
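One likely explanation, offered as a sketch rather than a confirmed diagnosis: nsmap only attaches a default namespace declaration to the element, while the tag 'B' itself stays unqualified until the document is serialised and parsed again. Writing the tag in Clark notation ('{namespace}B') puts the element in the namespace from the start. The variable names below are illustrative, and the schema is built inline from the XSD shown above.

```python
from lxml import etree

NS = "http://tempuri.org/me"

XSD = f"""<xs:schema xmlns="{NS}"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="{NS}">
  <xs:element name="B"/>
</xs:schema>"""
schema = etree.XMLSchema(etree.fromstring(XSD))

# nsmap declares xmlns="..." on the element, but the tag itself is plain 'B'.
unqualified = etree.Element("B", nsmap={None: NS})

# Serialising and reparsing lets the parser apply the default namespace to the tag.
reparsed = etree.XML(etree.tostring(unqualified))

# Clark notation places the element in the namespace without a round trip.
qualified = etree.Element(f"{{{NS}}}B", nsmap={None: NS})

for label, element in (("unqualified", unqualified),
                       ("reparsed", reparsed),
                       ("qualified", qualified)):
    print(label, element.tag, schema.validate(etree.ElementTree(element)))
```

Under this reading, building the element with the qualified tag avoids the string round trip.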
I'm using Python 3.8 and lxml 4.4.1.

error XMLSchemaParseError while creating xsd schema in python lxml

I encountered a problem with creating an xsd schema in Python using the lxml library.
I have prepared the xsd schema file below (the content was cut to the minimum):
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="qualified"
version="2.4">
<xs:annotation>
<xs:documentation xml:lang="de">Bundeseinheitlicher Medikationsplan</xs:documentation>
</xs:annotation>
<xs:element name="MP">
<xs:annotation>
<xs:documentation>Bundeseinheitlicher Medikationsplan</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:attribute name="p" use="prohibited">
<xs:annotation>
<xs:documentation>Name: Patchnummer</xs:documentation>
</xs:annotation>
<xs:simpleType>
<xs:restriction base="xs:int">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="99"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
</xs:schema>
and when using lxml library to create xsd schema like this
from lxml import etree
with open('some_file.xsd') as schema_file: # some_file.xsd is the file above
    etree.XMLSchema(file=schema_file)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "src/lxml/xmlschema.pxi", line 87, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:197804)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}attribute': The content is not valid. Expected is (annotation?)., line 16
But when doing this with the Python standard library, everything works:
import xml.etree.ElementTree as ET
with open('some_file.xsd') as f:
    tree = ET.parse(f)
I played around a bit with the xsd file and discovered that removing use="prohibited" from the attribute element resolves the problem with the lxml library, but I need that property.
What could be the reason for that? Is something wrong with the lxml library, or is the XML structure of the above xsd incorrect?
This question is old, but had me stumped for a bit.
Here's how I solved it.
schema_root = etree.parse(xsd_filename)
schema = etree.XMLSchema(schema_root)
xml_parser = etree.XMLParser(schema=schema, no_network=False)
Then if you attempt to open it with something like
with open(xml_filename, 'rb') as f:
    etree.fromstring(f.read(), xml_parser)
You will only get actual XMLSchemaErrors
https://lxml.de/api/lxml.etree.XMLParser-class.html

How to extend an XSD schema to support SVG?

I am trying to extend the ISOSTS XSD schema to support SVG image tags.
I found an XSD schema for SVG and put it next to ISOSTS.xsd.
Now I am trying to extend ISOSTS.xsd:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tbx="urn:iso:std:iso:30042:ed-1"
xmlns:xlink="http://www.w3.org/1999/xlink"
<!-- my line -->
xmlns:svg="http://www.w3.org/2000/svg"
elementFormDefault="qualified">
<xs:import namespace="http://www.w3.org/1998/Math/MathML"
schemaLocation="ncbi-mathml2/mathml2.xsd"/>
<xs:import namespace="http://www.w3.org/1999/xlink"
schemaLocation="xlink.xsd"/>
<!-- XSD import of namespace http://www.w3.org/2001/XMLSchema-instance suppressed (not necessary) -->
<xs:import namespace="http://www.w3.org/XML/1998/namespace"
schemaLocation="xml.xsd"/>
<xs:import namespace="urn:iso:std:iso:30042:ed-1"
schemaLocation="tbx.xsd"/>
<!-- my line -->
<xs:import namespace="http://www.w3.org/2000/svg"
schemaLocation="SVG.xsd"/>
....
<xs:element name="p">
<xs:complexType mixed="true">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<!-- my line --> <xs:element ref="svg:svg"/>
<xs:element ref="email"/>
....
But I get an error when I try to load the schema:
from lxml.etree import parse, XMLSchema
schema_file = open(self._schema_filename)
schema_doc = parse(schema_file)
schema_file.close()
self._xmlschema = XMLSchema(schema_doc) # Error
Error message:
File "src/lxml/xmlschema.pxi", line 87, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:197819)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': References from this schema to components in the namespace 'http://www.w3.org/2000/svg' are not allowed, since not indicated by an import statement., line 4664
What is wrong?
The message seems clear enough to me, I'm not sure which part of it you don't understand. Your schema document imports schema components for various namespaces (mathml, xlink, xml, etc) but it makes no attempt to import the schema for SVG, and the error message is telling you so.
I replicated your three modifications (declaring a namespace binding for the SVG namespace, importing the SVG namespace, and referring to the svg:svg element), but got no error from Xerces or Saxon EE.
So it seems to me that you've done everything right.
The error message suggests that your XSD validator is not picking up the import.
If I had to guess (and I suppose I have to, since while you've given a very concise statement of the problem, we don't have a reproducible error), your validator is looking at an interim version of the schema document in which the reference to svg:svg has been added to the content model of p, but the xs:import statement has not yet been added to the beginning of the schema document.
Possibly your Python bytecode is out of date and your Python needs to be recompiled? (Pure conjecture; I don't know how much schema information lxml generates at compile time and how much it generates at run time.)
The problem was solved by using the following XSD schema for SVG: https://github.com/dumistoklus/svg-xsd-schema

xml validation: validating a URI type

I'm using python's lxml to validate xmls against a schema. I have a schema with an element:
<xs:element name="link-url" type="xs:anyURL"/>
and I test, for example, this (part of an) xml:
<a link-url="server/path"/>
I would like this test to FAIL because the link-url doesn't start with http://. I tried switching anyURI to anyURL but this results in an exception: it's not a valid type.
Is this possible with lxml? Is it possible at all with schema validation?
(I'm pretty sure xs:anyURL is not valid. The XML Schema standard calls it anyURI. And since link-url is an attribute, shouldn't you be using xs:attribute instead of xs:element?)
You could restrict the URIs by creating a new simpleType based on it, and put a restriction on the pattern. For example,
#!/usr/bin/env python2.6
from lxml import etree
from StringIO import StringIO
schema_doc = etree.parse(StringIO('''
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:simpleType name="httpURL">
<xs:restriction base="xs:anyURI">
<xs:pattern value='https?://.+'/>
<!-- accepts only http:// or https:// URIs. -->
</xs:restriction>
</xs:simpleType>
<xs:element name="a">
<xs:complexType>
<xs:attribute name="link-url" type="httpURL"/>
</xs:complexType>
</xs:element>
</xs:schema>
''')) #/
schema = etree.XMLSchema(schema_doc)
schema.assertValid(etree.parse(StringIO('<a link-url="http://sd" />')))
assert not schema(etree.parse(StringIO('<a link-url="server/path" />')))
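As a rough way to see which values the pattern above admits without running a validator, here is a small standard-library sketch. It assumes Python's re.fullmatch is a close enough stand-in for this simple pattern, since an XSD pattern has to match the whole value rather than a substring. The function name looks_like_http_url is made up for illustration.

```python
import re

# Same expression as the xs:pattern in the schema above.
HTTP_URL = re.compile(r"https?://.+")


def looks_like_http_url(value):
    """Return True if the whole value matches the pattern (hypothetical helper)."""
    # fullmatch() mirrors the implicit anchoring of XSD patterns;
    # re.search() would wrongly accept "see http://sd".
    return HTTP_URL.fullmatch(value) is not None


for candidate in ("http://sd", "https://example.org/x", "server/path", "see http://sd"):
    print(candidate, looks_like_http_url(candidate))
```

This is only a quick check of the expression. XSD regular expressions and Python's re differ in other details, so the schema remains the authority on what is valid.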
