How do I verify the validity of a JSON Schema file?

I have a JSON Schema file like this one, which contains a couple of intentional bugs:
{
"$schema": "http://json-schema.org/schema#",
"type": "object",
"description": "MWE for JSON Schema Validation",
"properties": {
"valid_prop": {
"type": ["string", "number"],
"description": "This can be either a string or a number."
},
"invalid_prop": {
// NOTE: "type:" here should have been "type" (without the colon)
"type:": ["string", "null"],
"description": "Note the extra colon in the name of the type property above"
}
},
// NOTE: Reference to a non-existent property
"required": ["valid_prop", "nonexistent_prop"]
}
I'd like to write a Python script (or, even better, install a CLI with pip) that can find those bugs.
I've seen this answer, which suggests doing the following (modified for my use case):
import json
from jsonschema import Draft4Validator

with open('./my-schema.json') as schemaf:
    schema = json.load(schemaf)
Draft4Validator.check_schema(schema)
print("OK!")  # on an invalid schema we don't get here
but the above script doesn't detect either of the errors in the schema file. I would have expected it to detect at least the extra colon in the "type:" property.
Am I using the library incorrectly? How do I write a validation script that detects this error?

You say the schema is invalid, but that isn't the case with the example you've provided.
Unknown keywords are ignored. This is to allow for extensions to be created. If unknown keywords were prevented, we wouldn't have the ecosystem of extensions that various people and groups have created, like form generation.
You say that the value in required is a "Reference to a non-existent property". The required keyword has no link to the properties keyword.
required determines which keys an object must have.
properties determines how a subschema should be applied to values in an object.
There's no need for values in required to also be included in properties. In fact, it's common for them not to be when building complex modular schemas.
In terms of validating if a schema is valid, you can use the JSON Schema meta schema.
In terms of checking for additional things that you consider undesirable, that's down to you, given that the examples you've provided are valid.
Some libraries may provide a sanity check, but such is unlikely to pick up on the examples you've provided, as they aren't errors.
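Since both "bugs" are valid as far as JSON Schema is concerned, catching them means writing your own lint on top of the meta-schema check. The following is only a minimal sketch using the standard library: lint_schema and KNOWN_KEYWORDS are hypothetical names, the keyword set is a non-exhaustive subset of Draft 4 keywords chosen for illustration, and the function only recurses into "properties".

```python
# Hypothetical, non-exhaustive subset of Draft 4 keywords; extend it as needed.
KNOWN_KEYWORDS = {
    "$schema", "id", "$ref", "title", "description", "default", "type", "enum",
    "properties", "patternProperties", "additionalProperties", "required",
    "items", "additionalItems", "allOf", "anyOf", "oneOf", "not", "definitions",
    "minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum", "multipleOf",
    "minLength", "maxLength", "pattern", "format", "dependencies",
    "minItems", "maxItems", "uniqueItems", "minProperties", "maxProperties",
}


def lint_schema(schema, path="#"):
    """Return warnings for things that are valid JSON Schema but look suspicious."""
    warnings = []
    if not isinstance(schema, dict):
        return warnings
    # Flag keywords that are not in our allow-list (e.g. "type:" with a stray colon).
    for key in schema:
        if key not in KNOWN_KEYWORDS:
            warnings.append(f"{path}: unknown keyword {key!r}")
    props = schema.get("properties", {})
    if not isinstance(props, dict):
        props = {}
    # Flag required names that have no matching entry in "properties".
    for name in schema.get("required", []):
        if name not in props:
            warnings.append(f"{path}: required name {name!r} is not in properties")
    # Only "properties" subschemas are visited in this sketch.
    for name, subschema in props.items():
        warnings.extend(lint_schema(subschema, f"{path}/properties/{name}"))
    return warnings


example = {
    "$schema": "http://json-schema.org/schema#",
    "type": "object",
    "properties": {
        "valid_prop": {"type": ["string", "number"]},
        "invalid_prop": {"type:": ["string", "null"]},
    },
    "required": ["valid_prop", "nonexistent_prop"],
}
for warning in lint_schema(example):
    print(warning)
```

Because unknown keywords are legitimately used by extensions, a check like this will produce false positives on schemas that rely on them, so the allow-list has to match your own conventions.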

Related

How to see what can be set/updated on an issue?

I'm trying to use the JIRA Python API to create and update issues on different projects.
Currently I'm after timetracking but I've seen other fields that cannot be set on this or that project getting the error message:
... cannot be set. It is not on the appropriate screen, or unknown.
I can already set timetracking on some projects like:
issue.update(fields={'timetracking': {'originalEstimate': '4h'}})
But on others I get the mentioned error message although the field is clearly present among the issue fields:
>>> issue.fields.timetracking
<JIRA TimeTracking at 2072336111640>
There seems to be nothing obvious on the object itself that could make me identify the thing as "not set-able".
Here is a post on how to get the fields on the screen via REST API. I think that's what the Python thing is doing in the background. But do I really need to go that way?

Using the path from the answer to that REST API question, we can fetch the data with the private _get_json method:
_FIELDS_PATH = 'issue/createmeta?projectKeys={KEY}&expand=projects.issuetypes.fields'
data = jira_connection._get_json(_FIELDS_PATH.format(KEY=project_key))
project_fields = {}
for issuetype in data['projects'][0]['issuetypes']:
    # Map each field id to its display name for this issue type.
    project_fields[issuetype['name']] = {f: v['name'] for f, v in issuetype['fields'].items()}
This will result in a project_fields dictionary like:
{
"ISSUE_TYPE_NAME": {
"FIELD_ID": "FIELD_NAME",
...
}, // for example:
"Task": {
"summary": "Summary",
"issuetype": "Issue Type",
...
}
}
This workaround is useful as long as the jira package does not offer such a feature directly.
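With a project_fields dictionary of that shape, a field can be checked before calling issue.update. This is only a sketch: can_set is a hypothetical helper, and the sample dictionary below is made-up data in the shape shown above, not output from a real JIRA instance.

```python
def can_set(project_fields, issue_type, field_id):
    """Return True if field_id is listed for issue_type in project_fields."""
    return field_id in project_fields.get(issue_type, {})


# Made-up sample in the shape produced by the loop above.
sample_fields = {
    "Task": {"summary": "Summary", "issuetype": "Issue Type"},
    "Bug": {"summary": "Summary", "timetracking": "Time Tracking"},
}

print(can_set(sample_fields, "Bug", "timetracking"))
print(can_set(sample_fields, "Task", "timetracking"))
```

An unknown issue type simply yields False here rather than raising, which keeps the check usable as a guard.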

Best practice for collections in jsons: array vs dict/map

I need to pass data in a python back-end to a front end through an api call, using a json format. In the python back end, the data is in a dictionary structure, which I can easily and directly convert to a json. But should I?
My front-end developer believes the answer is no, for reasons related to best practice.
But I challenge that:
Is it best to structure a json as it is in python, or should it rather be converted to some other form, such as several arrays (as would be necessary in my example case below)?
Or, differently put:
What should be the governing principles related to collections/dicts/maps/arrays for interfacing information through jsons?
I've done some googling for an answer, but I've not come across much that addresses this directly. Links would be appreciated.
(Note about the example below: of course if the data is written to a database, it would probably make most sense for the front-end to access the database directly, but let's assume this is not the case)
Example:
In the back end there is a collection of objects called pets:
each item in the collection has a unique pet_id, some non-optional properties, e.g. name and date_of_birth, some optional properties registration_certificate_nr, adopted_from_kennel, some lists like siblings and children and some objects like medication.
Assuming that the front end needs all of this info at some point, it could be
{
"pets": {
"17-01-24-01": {
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
},
"17-03-04-01": {
"name": "Hooch",
"date_of_birth": "05/02/2015",
"adopted_from_kennel": "Pretoria Shire",
"children": [
"17-05-01-01",
"17-05-01-02",
"17-05-01-03"
]
},
"17-05-01-01": {
"name": "Snappy",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-02",
"17-05-01-03"
]
},
"17-05-01-02": {
"name": "Gizmo",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-01",
"17-05-01-03"
]
},
"17-05-01-03": {
"name": "Toothless",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-01",
"17-05-01-02"
],
"medication": [
{
"name": "anti-worm",
"code": "aw445",
"dosage": "1 pill per day"
},
{
"name": "disinfectant",
"code": "pdi-2",
"dosage": "as required"
}
]
}
}
}

JSON formatting is a somewhat subjective matter, and related disagreements are usually best settled between colleagues.
That being said, there are some potentially valid criticisms to be made against the JSON format in the question, especially if we are trying to create a consistent, RESTful API.
The 2 pain points that stand out:
A map collection is represented in JSON, which isn't really JSON standard compliant, or particularly RESTful.
None of the pet objects have an id defined. There is a pet_id mentioned in the question, but it seems to be maintained separately from the pet object itself. If a value is accessed in the pets map in the question, a user of the API would have to manually add the pet_id to the provided pet object in order to have the id available further down the line, when the full JSON may no longer be available.
The closest things we have to guiding standards in this situation are the REST architectural style and the JSON standard.
We can start by looking at the JSON standard. Here is a quote from the JSON wiki:
JavaScript syntax defines several native data types that are not included in the JSON standard: Map, Set, Date, Error, Regular Expression, Function, Promise, and undefined.
The key takeaway here is that JSON is not meant to represent the map data type. Python dictionaries are a map implementation, so directly serializing a dictionary to JSON with the intent to represent a map-like collection goes against the intended use of JSON.
For an individual object like a pet, the JSON object is appropriate, but for collections there is one option: the JSON array. There is a usage example with the JSON array further down in this answer.
There may be edge cases where deviating from the standard makes sense, but I don't see a reason in this scenario.
There are also some shortcomings in the JSON format from a RESTful design perspective. RESTful API design is nice because it encourages one to keep things simple and consistent. It also happens to be a de facto industry standard.
In a RESTful HTTP API, this is how fetching a single pet resource should look:
Request: GET /api/pets/17-01-24-01
Response: 200 {
"id": "17-01-24-01",
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
}
The response is a completely defined resource with an explicitly defined id. It is also the simplest complete JSON representation of a pet.
Next, we define what fetching multiple pet resources looks like, assuming only 2 pets are defined:
Request: GET /api/pets
Response: 200 [
{
"id": "17-01-24-01",
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
},
{
"id": "17-03-04-01",
"name": "Hooch",
"date_of_birth": "05/02/2015",
"adopted_from_kennel": "Pretoria Shire",
"children": [
"17-05-01-01",
"17-05-01-02",
"17-05-01-03"
]
}
]
The above response format is the most straightforward way to pluralize the single resource response format, thus keeping the API as simple and consistent as possible. (For the sake of brevity, I only used 2 of the sample resources from the question.) Once again, the ids are explicitly defined, and belong to their respective pet objects.
Nothing is gained from adding map keys to the above format.
Proponents of the JSON format in the question may suggest simply adding the id field to each pet object in order to work around pain point 2, but that would raise the question of repeating data within the response. Why does the id need to be both inside and outside the object? Surely it should only be on the inside? After eliminating the redundant data, the result will look like the response above.
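On the Python side, getting from the dictionary in the question to the array format is a one-line transformation. A minimal sketch, assuming the back end holds the pets in a dict keyed by pet_id (pets_map_to_list is a hypothetical helper name):

```python
import json


def pets_map_to_list(pets_by_id):
    """Turn {pet_id: pet} into a list of pets that each carry their own "id"."""
    return [{"id": pet_id, **pet} for pet_id, pet in pets_by_id.items()]


pets = {
    "17-01-24-01": {"name": "Buster", "date_of_birth": "04/01/2017"},
    "17-03-04-01": {"name": "Hooch", "date_of_birth": "05/02/2015"},
}

# The list serializes to a JSON array of self-describing pet objects.
print(json.dumps(pets_map_to_list(pets), indent=2))
```

The front end can rebuild a lookup by id from the array if it needs one, so no information is lost by dropping the map keys.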
That is the REST argument. There are use cases where REST doesn't really work, but this is far from that.
PS. Front ends should never access databases directly. The API is responsible for writing to and reading from whatever data persistence infrastructure is used. In a lot of bigger real world systems, there is even an additional BFF layer between the front end and the API(s), separating the front end and the DB even further.

How to check the completeness of JSON data

I get data in JSON from an API and it may be that the received data is not complete (= some fields are missing). I am also not sure that the structure of the data follows the JSON standard.
The solution for the second problem is simple: I will decode the JSON inside a try block and act accordingly on ValueError and TypeError exceptions.
For the first problem, my solution would also be to
d = {'a': 1}
try:
    d['a']
    d['b']
    d['x']['shouldbethere']
except KeyError:
    (...)
that is to list all the keys I need to have in the dict created from a successful JSON conversion.
This made me think that there may be a method to declare the expected keys (and possibly values types) and match the retrieved JSON against it, an unsuccessful match raising a specific exception?

The standard way to validate a JSON structure is to use JSON Schema.
Basic characteristics (quoted from official webpage) are:
JSON Schema:
describes your existing data format
clear, human- and machine-readable documentation
complete structural validation, useful for
automated testing
validating client-submitted data
There is no built-in package for validating a JSON object against a schema, but you can use jsonschema from PyPI.
Sample usage (paraphrased from official docs) may be:
import jsonschema
schema = {
"type": "object",
"properties": {
"price": {"type": "number"},
"name": {"type": "string"},
},
}
jsonschema.validate({"name": "Eggs", "price": 34.99}, schema)
# No exception from line above - document is valid
jsonschema.validate({"name": "Eggs", "price": "Invalid"}, schema)
# ValidationError: 'Invalid' is not of type 'number'

JSON parsers aren't much help with error correction, so if the data isn't valid JSON I think it would be very difficult to auto-correct it into something you can parse. Your solution for the invalid JSON is therefore probably the most reasonable one.
A function to verify that a dict contains a particular set of keys is relatively easy to implement. I'm not aware of any JSON object methods to perform that test, but given a JSON object j you could check it as follows (it might also be sensible to check that it's a dict, since JSON objects can also be of other types):
def has_all_keys(j, keylist):
    return all(key in j for key in keylist)
Using this interactively suggests it might work (in this example I rely on the fact that iteration over a string yields the individual characters, but obviously you will need a real list of string key values).
>>> has_all_keys({}, "abc")
False
>>> has_all_keys({'a':1, 'b':1, 'c':1}, "abc")
True
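The same idea can be extended to nested keys and value types, raising a specific exception on a mismatch as the question asks. This is only a sketch using the standard library: IncompleteDataError and check_structure are hypothetical names, and the spec format (a dict mapping each required key to a type or to a nested spec) is an assumption made for illustration.

```python
class IncompleteDataError(ValueError):
    """Raised when decoded JSON lacks an expected key or has an unexpected type."""


def check_structure(data, spec, path=""):
    """Check that data contains every key in spec, recursing into nested specs."""
    if not isinstance(data, dict):
        raise IncompleteDataError(f"{path or 'root'}: expected an object")
    for key, expected in spec.items():
        where = f"{path}.{key}" if path else key
        if key not in data:
            raise IncompleteDataError(f"missing key: {where}")
        if isinstance(expected, dict):
            # A nested spec describes a nested object.
            check_structure(data[key], expected, where)
        elif not isinstance(data[key], expected):
            raise IncompleteDataError(f"unexpected type for {where}")


spec = {"a": int, "b": int, "x": {"shouldbethere": str}}

check_structure({"a": 1, "b": 2, "x": {"shouldbethere": "yes"}}, spec)  # passes silently
try:
    check_structure({"a": 1}, spec)
except IncompleteDataError as exc:
    print(exc)
```

Keys that are present but not named in the spec are ignored, so extra fields from the API do not cause a failure.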

Python Eve - where clause using objectid

I have the following resource defined in settings.py,
builds = {
'item_title': 'builds',
'schema': {
'sources': {
'type': 'list',
'schema': {
'type': 'objectid',
'data_relation': {
'resource': 'sources',
'embeddable': True,
}
}
},
'checkin_id': {
'type': 'string',
'required': True,
'minlength': 1,
},
}
}
When I try to filter based on a member whose value is an objectid, I get an empty list.
http://127.0.0.1:5000/builds?where={"sources":"54e328ec537d3d20bbdf2ed5"}
54e328ec537d3d20bbdf2ed5 is the id of source
Is there any way to do this?

Your query should work just fine assuming that you actually have the 54e328ec537d3d20bbdf2ed5 value included in any sources field within any builds document.
What I mean is, you can't query the builds endpoint for the existence of a document in the sources endpoint (you can of course do that at the sources endpoint). But if you actually store a builds document and it references a sources document, then your query will work fine, because what you are actually asking is "get me all builds documents which have a reference to this sources document". For example, if you POST a document like this to the builds endpoint:
{
"sources": ["54e328ec537d3d20bbdf2ed5"],
"checkin_id": "A"
}
Then this query:
http://127.0.0.1:5000/builds?where={"sources":"54e328ec537d3d20bbdf2ed5"}
Will return that one document. Of course since you defined sources as embeddable you can also do:
http://127.0.0.1:5000/builds?where={"sources":"54e328ec537d3d20bbdf2ed5"}&embedded={"sources":1}
Which will get you referenced documents embedded along with any matching document, like so:
{
"sources": [{"field1": "hey", "field2":"I'm an embedded source"}],
"checkin_id": "A"
}
Whereas you would get a 'raw' document without the explicit embed. It is probably worth mentioning that you can also enable predefined embedding of referenced resources, so your clients don't have to explicitly request an embed.
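When building these URLs from a client, the where and embedded values are JSON documents and need to be URL-encoded. A minimal sketch using only the standard library, assuming the local address from the question (build_builds_url is a hypothetical helper name):

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_builds_url(source_id, embed=False, base="http://127.0.0.1:5000/builds"):
    """Build a builds query URL filtering on a referenced sources id."""
    params = {"where": json.dumps({"sources": source_id})}
    if embed:
        # Ask for the referenced sources documents to be embedded in the response.
        params["embedded"] = json.dumps({"sources": 1})
    return f"{base}?{urlencode(params)}"


url = build_builds_url("54e328ec537d3d20bbdf2ed5", embed=True)
print(url)
```

The id is passed as a plain string here, exactly as in the URLs above.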
Hope this helps.

I'm new to Eve, but I can add to Nicola's "should work", because in my experience it does not. Since this question is what comes up when searching while trying to figure out why, here is what I found.
Debugging the library got me to the point where Eve automagically decides that something that looks like "54e328ec537d3d20bbdf2ed5" should be cast to an ObjectId, which is all good. However, comparing the ObjectId 54e328ec537d3d20bbdf2ed5 against the string 54e328ec537d3d20bbdf2ed5 is not an equality, so your filter returns no results.
The really simple solution is to change checkin_id to ObjectId. Eve starters can be assured you don't need all the additional decorations, so in the above example just change 'type': 'string' to 'type': 'objectid' and you will be good. Specifically, if you have calling code where this field is defined as a string, you can leave it as it is; the cast will occur within Eve as described above and it will just work as expected.
Edit: see also Eve's schema-level "query_objectid_as_string" configuration setting, which, from my reading, seems to override this behaviour.

structured query language for JSON (in Python)

I am working on a system to output a JSON file and I use Python to parse the data and display it in a UI (PySide). I now would like to add filtering to that system and I think instead of writing a query system, if there was one out there for JSON (in Python), that would save me a lot of development time. I found this thread:
Is there a query language for JSON?
but that's more for a Web-based system. Any ideas on a Python equivalent?
EDIT [for clarity]:
The format the data that I'll be generating is like this:
{
"Operations": [
{
"OpID": "0",
"type": "callback",
"stringTag1": "foo1",
"stringTag2": "FooMsg",
"Children": [...],
"value": "0.000694053"
},
{
"OpID": "1",
"type": "callback",
"stringTag1": "moo1",
"string2": "MooMsg",
"Children": [...],
"value": "0.000468427"
}
]
}
Where 'Children' could be nested arrays of the same thing (other operations). The system will be built to allow users to add their own tags to the data as well. My hope was to have a querying system that would allow users to define their own 'filters' too, hence the question about the querying language. If there was something that would let me do something like "SELECT * WHERE "type" == "callback"" and get the requisite operations back, that would be great.
The suggestion of Pynq is interesting, I'll give that a look.

I notice this question was asked a few years ago, but if someone else finds this, here are some newer projects trying to address the same problem:
ObjectPath (for Python and Javascript): http://objectpath.org/
jsonpath (Python reimplementation of the Javascript equivalent): https://pypi.org/project/jsonpath/
yaql: https://yaql.readthedocs.io/en/latest/readme.html
pyjq (Python bindings for jq https://stedolan.github.io/jq/): https://pypi.org/project/pyjq/
JMESPath: https://github.com/jmespath/jmespath.py
I personally went with pyjq because I use jq all the time for data exploration but ObjectPath seems very attractive and not limited to json.

I thought about this a little bit, and I lean away from something as specific as a "JSON Query Language" and towards something more generic. I remembered from working with C# a bit that it has a somewhat generic querying system called LINQ for handling these sorts of querying issues.
It looks as though Python has something similar called Pynq which supports basic querying such as:
filtered_collection = From(some_collection).where("item.property > 10").select_many()
It even appears to have some basic aggregation functions. While not being specific to JSON, I think it's at least a good starting point for querying.

You can also check out PythonQL, a query language extension to Python that handles SQL and JSON queries: pythonql

pyjsonq
https://github.com/s1s1ty/py-jsonq
from pyjsonq import JsonQ
qe = JsonQ('myfile.json')
res = qe.at('products').where('cat', '=', 2).get()
print(res)
"""
[
{
id: 3,
city: 'dhk',
name: 'Redmi 3S Prime',
cat: 2,
price: 12000
},
...
]
"""
I think it's important that the interaction with json is in-memory so that you can still do things manually for complex criteria
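For simple criteria, the in-memory approach needs no library at all. A minimal sketch for the Operations structure from the question, assuming 'Children' is a (possibly missing) list of nested operations; iter_operations and select are hypothetical helper names:

```python
def iter_operations(ops):
    """Yield every operation depth-first, following nested 'Children' lists."""
    for op in ops:
        yield op
        yield from iter_operations(op.get("Children", []))


def select(ops, predicate):
    """Return all operations, at any depth, for which predicate(op) is true."""
    return [op for op in iter_operations(ops) if predicate(op)]


operations = [
    {"OpID": "0", "type": "callback", "Children": [
        {"OpID": "2", "type": "timer", "Children": []},
    ]},
    {"OpID": "1", "type": "callback"},
]

# Rough equivalent of: SELECT * WHERE "type" == "callback"
callbacks = select(operations, lambda op: op.get("type") == "callback")
print([op["OpID"] for op in callbacks])
```

User-defined filters then become ordinary Python callables, which can be combined freely for more complex criteria.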
