xml validation: validating a URI type


I'm using Python's lxml to validate XML documents against a schema. I have a schema with an element:
<xs:element name="link-url" type="xs:anyURI"/>
and I test, for example, this (part of an) xml:
<a link-url="server/path"/>
I would like this test to FAIL because the link-url doesn't start with http://. I tried switching anyURI to anyURL, but this results in an exception: it's not a valid tag.
Is this possible with lxml? Is it possible at all with schema validation?

(I'm pretty sure xs:anyURL is not valid. The XML Schema standard calls it anyURI. And since link-url is an attribute, shouldn't you be using xs:attribute instead of xs:element?)
You could restrict the URIs by creating a new simpleType based on it, and put a restriction on the pattern. For example,
#!/usr/bin/env python3
from io import StringIO
from lxml import etree
schema_doc = etree.parse(StringIO('''
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:simpleType name="httpURL">
<xs:restriction base="xs:anyURI">
<xs:pattern value='https?://.+'/>
<!-- accepts only http:// or https:// URIs. -->
</xs:restriction>
</xs:simpleType>
<xs:element name="a">
<xs:complexType>
<xs:attribute name="link-url" type="httpURL"/>
</xs:complexType>
</xs:element>
</xs:schema>
'''))
schema = etree.XMLSchema(schema_doc)
schema.assertValid(etree.parse(StringIO('<a link-url="http://sd" />')))
assert not schema(etree.parse(StringIO('<a link-url="server/path" />')))
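To see what the pattern facet accepts without running the validator, here is a rough standard-library sketch. It assumes that Python's re.fullmatch is a close enough stand-in for XSD pattern matching, which is implicitly anchored at both ends; the two regex dialects differ in other details. The helper name is_http_url is made up for illustration.

```python
import re

# Same expression as the xs:pattern facet in the schema above.
HTTP_URL_PATTERN = re.compile(r"https?://.+")


def is_http_url(value: str) -> bool:
    """Approximate the facet check: the whole value must match the pattern."""
    return HTTP_URL_PATTERN.fullmatch(value) is not None


print(is_http_url("http://sd"))      # accepted by the pattern
print(is_http_url("server/path"))    # rejected: no http:// or https:// prefix
print(is_http_url("see http://sd"))  # rejected: the match is anchored at the start
```

Because the facet must match the entire attribute value, there is no need to add ^ or $ to the pattern in the schema.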

Related

Why does ElementTree eat/ignore namespaces (in attribute values)?

I'm trying to read XML with ElementTree and write the result back to disk. My long-term goal is to prettify the XML this way. However, in my naive approach, ElementTree eats all the namespace declarations in the document and I don't understand why. Here is an example
test.xsd
<?xml version='1.0' encoding='UTF-8'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
xmlns='sdformat/pose' targetNamespace='sdformat/pose'
xmlns:pose='sdformat/pose'
xmlns:types='http://sdformat.org/schemas/types.xsd'>
<xs:import namespace='sdformat/pose' schemaLocation='./pose.xsd'/>
<xs:element name='pose' type='poseType' />
<xs:simpleType name='string'><xs:restriction base='xs:string' /></xs:simpleType>
<xs:simpleType name='pose'><xs:restriction base='types:pose' /></xs:simpleType>
<xs:complexType name='poseType'>
<xs:simpleContent>
<xs:extension base="pose">
<xs:attribute name='relative_to' type='string' use='optional' default=''>
</xs:attribute>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:schema>
test.py
from xml.etree import ElementTree
ElementTree.register_namespace("types", "http://sdformat.org/schemas/types.xsd")
ElementTree.register_namespace("pose", "sdformat/pose")
ElementTree.register_namespace("xs", "http://www.w3.org/2001/XMLSchema")
tree = ElementTree.parse("test.xsd")
tree.write("test_out.xsd")
Produces test_out.xsd
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="sdformat/pose">
<xs:import namespace="sdformat/pose" schemaLocation="./pose.xsd" />
<xs:element name="pose" type="poseType" />
<xs:simpleType name="string"><xs:restriction base="xs:string" /></xs:simpleType>
<xs:simpleType name="pose"><xs:restriction base="types:pose" /></xs:simpleType>
<xs:complexType name="poseType">
<xs:simpleContent>
<xs:extension base="pose">
<xs:attribute name="relative_to" type="string" use="optional" default="">
</xs:attribute>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:schema>
Notice how test_out.xsd is missing most of the namespace declarations from test.xsd (only xmlns:xs survives). I would expect them to be identical. I verified that the latter is valid XML by validating it. It validates, with the exception of my choice of namespace URI, which I think shouldn't matter.
Update:
Based on mzji's comment I realized that this only happens for values of attributes. With this in mind, I can manually add the namespaces like so:
from xml.etree import ElementTree
namespaces = {
    "types": "http://sdformat.org/schemas/types.xsd",
    "pose": "sdformat/pose",
    "xs": "http://www.w3.org/2001/XMLSchema"
}
for prefix, ns in namespaces.items():
    ElementTree.register_namespace(prefix, ns)
tree = ElementTree.parse("test.xsd")
root = tree.getroot()
queue = [root]
while queue:
    element: ElementTree.Element = queue.pop()
    # Iterate over a copy, because root.attrib may be modified below.
    for value in list(element.attrib.values()):
        try:
            prefix, value = value.split(":")
        except ValueError:
            # no namespace, nothing to do
            pass
        else:
            if prefix == "xs":
                continue  # ignore XMLSchema namespace, keep checking other attributes
            root.attrib[f"xmlns:{prefix}"] = namespaces[prefix]
    for child in element:
        queue.append(child)
tree.write("test_out.xsd")
While this solves the problem, it is quite an ugly solution. I also still don't understand why this happens in the first place, so it doesn't answer the question.

There is a valid reason for this behaviour, but it requires a good understanding of XML Schema concepts.
First, some important facts:
Your XML document is not just any old XML document. It is an XSD.
An XSD is itself described by a schema (the 'schema for schemas').
The attribute xs:restriction/@base is not an xs:string. Its type is xs:QName.
Based on the above facts, we can assert the following:
If test.xsd is parsed as an XML document, but without knowledge of the 'schema for schema', then the value of the base attribute will be treated as a string (technically, as CDATA).
If test.xsd is parsed using a validating XML parser, with the 'schema for schema' as the XSD, then the value of the base attribute will be parsed as xs:QName.
When ElementTree writes the output XML, its behaviour should depend on the data type of base. If base is a QName then ElementTree should detect that it is using the namespace prefix 'types' and it should emit the corresponding namespace declaration.
If you are not supplying the 'schema for schema' when parsing test.xsd then ElementTree is off the hook, because it cannot possibly know that base is supposed to be interpreted as a QName.
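As far as I know, the standard library's ElementTree has no schema-aware mode, so in practice the first case always applies. A small sketch of the effect, using a cut-down inline document rather than test.xsd:

```python
from xml.etree import ElementTree

XS = "http://www.w3.org/2001/XMLSchema"

# Cut-down stand-in for test.xsd: the 'types' prefix is used only inside
# an attribute value, never in an element or attribute name.
SRC = (
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' "
    "xmlns:types='http://sdformat.org/schemas/types.xsd'>"
    "<xs:simpleType name='pose'>"
    "<xs:restriction base='types:pose'/>"
    "</xs:simpleType>"
    "</xs:schema>"
)

ElementTree.register_namespace("xs", XS)
root = ElementTree.fromstring(SRC)

# The attribute value is kept as an opaque string, not as a QName.
restriction = root.find(f".//{{{XS}}}restriction")
base_value = restriction.get("base")
print(repr(base_value))

# On output, only namespaces used in element or attribute names are declared.
serialized = ElementTree.tostring(root, encoding="unicode")
print(serialized)
```

The printed document declares xmlns:xs but not xmlns:types, which matches what the question observed.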

lxml include relative path

Using Python's lxml library, I'm trying to load a .xsd as schema. The Python script is in one directory and the schemas are in another:
/root
    my_script.py
    /data
        /xsd
            schema_1.xsd
            schema_2.xsd
The problem is that schema_1.xsd includes schema_2.xsd like this:
<xsd:include schemaLocation="schema_2.xsd"/>
Because schema_2.xsd is a relative path (the two schemas are in the same directory), lxml doesn't find it and raises an error:
schema_root = etree.fromstring(open('data/xsd/schema_1.xsd').read().encode('utf-8'))
schema = etree.XMLSchema(schema_root)
--> lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}include': Failed to load the document './schema_2.xsd' for inclusion
How to solve this problem without changing the schema files?

One option is to use an XML Catalog. You could also probably use a custom URI Resolver, but I've always used a catalog. It's easier for non-developers to make configuration changes. This is especially helpful if you're delivering an executable instead of plain Python.
Using a catalog works differently on Windows and Linux.
Here's a Windows example using Python 3.
XSD #1 (schema_1.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:include schemaLocation="schema_2.xsd"/>
<xs:element name="doc">
<xs:complexType>
<xs:sequence>
<xs:element ref="test"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="test" type="test"/>
</xs:schema>
XSD #2 (schema_2.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:simpleType name="test">
<xs:restriction base="xs:string">
<xs:enumeration value="Hello World"/>
</xs:restriction>
</xs:simpleType>
</xs:schema>
XML Catalog (catalog.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<!-- The path in @uri is relative to this file (catalog.xml). -->
<system systemId="schema_2.xsd" uri="./xsd_test/schema_2.xsd"/>
</catalog>
Python
import os
from urllib.request import pathname2url
from lxml import etree
# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
if "XML_CATALOG_FILES" not in os.environ:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path
schema_root = etree.fromstring(open('xsd_test/schema_1.xsd').read().encode('utf-8'))
schema = etree.XMLSchema(schema_root)
print(schema)
Print Output
<lxml.etree.XMLSchema object at 0x02B4B3F0>

There may also be a simpler solution in your case. I ran into this today and resolved it by temporarily changing the current working directory on importing the xml schema:
import os
from lxml import etree
xml_schema_path = 'data/xsd/schema_1.xsd'
# Get the working directory the script was run from
run_dir = os.getcwd()
# Set the working directory to the schema dir so relative imports resolve from there
os.chdir(os.path.dirname(xml_schema_path))
# Load the schema. Note that you can use the `file=` option to point to a file path
xml_schema = etree.XMLSchema(file=os.path.basename(xml_schema_path))
# Re-set the working directory
os.chdir(run_dir)
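One caveat with the snippet above: if loading the schema raises, the working directory is never restored. A minimal sketch of a safer variant, assuming a small helper (the name working_directory is made up here) built on contextlib:

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the old directory is always restored.
        os.chdir(previous)


# Usage with the snippet above (paths are the ones from the question):
# with working_directory(os.path.dirname(xml_schema_path)):
#     xml_schema = etree.XMLSchema(file=os.path.basename(xml_schema_path))
```

On Python 3.11 and later, contextlib.chdir does the same job. It may also be worth trying etree.XMLSchema(file=xml_schema_path) with the full path first: as far as I understand, a relative schemaLocation is resolved against the location of the including document when the parser knows that location, but I have not tested that against this directory layout.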

How to generate a List<String> with zeep?

I'm using the Python library zeep to talk to a SOAP service. One of the required arguments in the documentation is of type List<String> and in the WSDL I found this:
<xs:element minOccurs="0" maxOccurs="1" name="IncludedLenders" type="tns:ArrayOfString"/>
And I believe ArrayOfString is defined as:
<xs:complexType name="ArrayOfString">
<xs:sequence>
<xs:element minOccurs="0" maxOccurs="unbounded" name="string" nillable="true" type="xs:string"/>
</xs:sequence>
</xs:complexType>
How do I make zeep generate the values for that? I tried with:
"IncludedLenders": [
"BMS",
"BME"
]
but that generates:
<ns0:IncludedLenders>
<ns0:string>BMS</ns0:string>
</ns0:IncludedLenders>
instead of:
<ns0:IncludedLenders>
<ns0:string>BMS</ns0:string>
<ns0:string>BME</ns0:string>
</ns0:IncludedLenders>
Any ideas how to generate the latter?

I figured it out. First I needed to extract the ArrayOfString type:
array_of_string_type = client.get_type("ns1:ArrayOfString")
and then create it this way:
"IncludedLenders": array_of_string_type(["BMS","BME"])

error XMLSchemaParseError while creating xsd schema in python lxml

I encountered a problem with creating an XSD schema in Python using the lxml library.
I have prepared an xsd schema file below (content was cut to the minimum)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="qualified"
version="2.4">
<xs:annotation>
<xs:documentation xml:lang="de">Bundeseinheitlicher Medikationsplan</xs:documentation>
</xs:annotation>
<xs:element name="MP">
<xs:annotation>
<xs:documentation>Bundeseinheitlicher Medikationsplan</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:attribute name="p" use="prohibited">
<xs:annotation>
<xs:documentation>Name: Patchnummer</xs:documentation>
</xs:annotation>
<xs:simpleType>
<xs:restriction base="xs:int">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="99"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
</xs:schema>
and when using lxml library to create xsd schema like this
from lxml import etree
with open('some_file.xsd') as schema_file:  # some_file.xsd is the file above
    etree.XMLSchema(file=schema_file)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "src/lxml/xmlschema.pxi", line 87, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:197804)
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}attribute': The content is not valid. Expected is (annotation?)., line 16
But when I parse the same file with the Python standard library, everything works:
import xml.etree.ElementTree as ET
with open('some_file.xsd') as f:
    tree = ET.parse(f)
I played around a bit with the XSD file and discovered that removing use="prohibited" from the attribute element resolves the problem with lxml, but I need that property.
What can be the reason for that? Is something wrong with the lxml library, or is the XML structure of the XSD above incorrect?

This question is old, but had me stumped for a bit.
Here's how I solved it.
schema_root = etree.parse(xsd_filename)
schema = etree.XMLSchema(schema_root)
xml_parser = etree.XMLParser(schema=schema, no_network=False)
Then if you attempt to open it with something like
with open(xml_filename, 'rb') as f:
    etree.fromstring(f.read(), xml_parser)
You will only get actual XMLSchemaErrors
https://lxml.de/api/lxml.etree.XMLParser-class.html

spyne generates bad WSDL/XSD schema for ComplexModels with ComplexModel children

I'm trying to use spyne to implement a SOAP service in Python. My client sends SOAP requests like this:
<ns1:loadServices xmlns:ns1="dummy">
<serviceParams xmlns="dummy">
<header>
<user>foo</user>
<password>secret</password>
</header>
</serviceParams>
</ns1:loadServices>
But I have difficulties putting that structure into a spyne model.
So far I came up with this code:
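from spyne.decorator import rpc
from spyne.model.complex import ComplexModel
from spyne.model.primitive import Unicode
from spyne.service import ServiceBase
class Header(ComplexModel):
    __type_name__ = 'header'
    user = Unicode
    password = Unicode
class serviceParams(ComplexModel):
    __type_name__ = 'serviceParams'
    header = Header()
class DummyService(ServiceBase):
    @rpc(serviceParams, _returns=Unicode)
    def loadServices(ctx, serviceParams):
        return '42'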
The problem is that spyne generates an XSD like this:
...
<xs:complexType name="loadServices">
<xs:sequence>
<xs:element name="serviceParams" type="tns:serviceParams" minOccurs="0" nillable="true"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="serviceParams"/>
...
which is not what I want because essentially it says that "serviceParams" is just an empty tag without children.
Is that a bug in spyne? Or am I missing something?

It turned out that this line was the culprit:
header = Header()
that should be:
header = Header
Very nasty behavior and really easy to overlook.
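I haven't checked spyne's source for the exact mechanism, but the symptom is consistent with declarative model classes that collect their fields by looking for class attributes that are themselves model classes. The sketch below is a toy illustration of that idea only; Model, DeclarativeMeta and the other names are invented, and this is not spyne code.

```python
class Model:
    """Stand-in for a base field type (hypothetical, not spyne's ComplexModel)."""


class DeclarativeMeta(type):
    """Toy metaclass: collect only attributes that are Model subclasses."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.fields = {
            key: value
            for key, value in namespace.items()
            # An instance is not a class, so isinstance(value, type) is False
            # and the attribute is skipped without any error.
            if isinstance(value, type) and issubclass(value, Model)
        }
        return cls


class Declarative(Model, metaclass=DeclarativeMeta):
    pass


class Header(Declarative):
    pass


class WithInstance(Declarative):
    header = Header()  # instance: silently ignored by the toy metaclass


class WithClass(Declarative):
    header = Header  # class: picked up as a field


print(WithInstance.fields)  # empty dict
print(WithClass.fields)     # contains the 'header' field
```

Under that assumption, a model whose only child is declared as an instance ends up with no fields at all, which would produce an empty complexType like the one in the question.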
