Convert a JSON schema to a Python class

Is there a Python library for converting a JSON schema to a Python class definition, similar to jsonschema2pojo (https://github.com/joelittlejohn/jsonschema2pojo) for Java?

So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema:

>>> schema = {
        'name': 'Country',
        'properties': {
            'name': {'type': 'string'},
            'abbreviation': {'type': 'string'},
        },
        'additionalProperties': False,
    }
Create a model:

>>> import warlock
>>> Country = warlock.model_factory(schema)

Create an object using your model:

>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects Warlock produces lack much in the way of introspectable goodies, and if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark; the only downside is that metaclasses in Python can't really be turned into 'code'.
Other useful modules to look at:
jsonschema - the validation library that Warlock is built on top of
valideer - similar to jsonschema but with a worse name
bunch - an interesting structure builder that's halfway between a dotdict and construct; see the sketch below
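To illustrate that last item, here is a minimal sketch of bunch's dotdict-style behavior (this assumes the bunch package from PyPI; the example values are mine, not from its docs):

from bunch import Bunch

# a Bunch is a dict whose keys are also readable as attributes
country = Bunch(name='Sweden', abbreviation='SE')
print(country.name)             # attribute-style access, like a dotdict
print(country['abbreviation'])  # still behaves as a plain dict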
If you end up finding a good one-stop solution for this, please follow up your question - I'd love to find one. I pored through GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p

python-jsonschema-objects is an alternative to warlock, built on top of jsonschema. It provides an automatic class-based binding to JSON schemas for use in Python.
Usage:
Sample JSON schema:

schema = '''{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        },
        "dogs": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 4
        },
        "gender": {
            "type": "string",
            "enum": ["male", "female"]
        },
        "deceased": {
            "enum": ["yes", "no", 1, 0, "true", "false"]
        }
    },
    "required": ["firstName", "lastName"]
}'''
Converting the schema object to a class:

>>> import python_jsonschema_objects as pjs
>>> import json
>>> schema = json.loads(schema)
>>> builder = pjs.ObjectBuilder(schema)
>>> ns = builder.build_classes()
>>> Person = ns.ExampleSchema
>>> james = Person(firstName="James", lastName="Bond")
>>> james.lastName
u'Bond'
>>> james
<example_schema lastName=Bond age=None firstName=James>
Validation:

>>> james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
One problem, though: it still uses Draft 4 validation even though jsonschema has moved past Draft 4; I filed an issue on the repo regarding this. Unless you are using an old version of jsonschema, the above package will work as shown.

I just created this small project to generate code classes from JSON schema; even though we're dealing with Python, I think generated classes can be useful when working on business projects:

pip install jsonschema2popo

Running the following command will generate a Python module containing the classes defined by the JSON schema (it uses jinja2 templating):

jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json

More info at: https://github.com/frx08/jsonschema2popo

Related

singer tap-zendesk - how to extract catalog.json with selected:True from discovery mode

I am using singer's tap-zendesk library and want to extract data from specific schemas.
I am running the following command in sync mode:
tap-zendesk --config config.json --catalog catalog.json
Currently my config.json file has the following parameters:
{
    "email": "<email>",
    "api_token": "<token>",
    "subdomain": "<domain>",
    "start_date": "<start_date>"
}
I've managed to extract data by putting 'selected':true under schema, properties and metadata in the catalog.json file. But I was wondering if there was an easier way to do this? There are around 15 streams I need to go through.
I manage to get the catalog.json file through the discovery mode command:
tap-zendesk --config config.json --discover > catalog.json
The output looks something like the following, but that means that I have to go and add selected:True under every field.
{
    "streams": [
        {
            "stream": "tickets",
            "tap_stream_id": "tickets",
            "schema": {
                "selected": "true",
                "properties": {
                    "organization_id": {
                        "selected": "true",
                    },
            "metadata": [
                {
                    "breadcrumb": [],
                    "metadata": {
                        "selected": "true"
                    }
The "selected": "true" needs to be applied only once per stream: it goes in the metadata section of the stream where "breadcrumb" is []. This is very poorly documented.
Please see this blog post for some helpful details: https://medium.com/getting-started-guides/extracting-ticket-data-from-zendesk-using-singer-io-tap-zendesk-57a8da8c3477
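Since the change is mechanical, it can be scripted instead of edited by hand. Here is a minimal sketch that patches every stream in a discovered catalog; the file names are assumptions, and it assumes the stream-level metadata layout of discovery output, touching only the entry whose breadcrumb is empty, per the rule above:

import json

# load the catalog produced by discovery mode (file name assumed)
with open('catalog.json') as f:
    catalog = json.load(f)

# mark each stream as selected via its top-level metadata entry
# (the entry whose "breadcrumb" is the empty list)
for stream in catalog['streams']:
    for entry in stream.get('metadata', []):
        if entry.get('breadcrumb') == []:
            entry['metadata']['selected'] = 'true'

with open('catalog_selected.json', 'w') as f:
    json.dump(catalog, f, indent=2)

Then run the sync against the patched file: tap-zendesk --config config.json --catalog catalog_selected.json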

Scrapy Field Names that are not as per Python variable restrictions

Is it possible to have field names that do not conform to Python variable naming rules? To elaborate, is it possible to have the field name "Job Title" instead of "job_title" in the export file? While this may not matter for JSON or XML exports, such functionality would be useful when exporting CSV, for instance when I need the data imported into another system that is already configured to accept CSVs with certain column names.
I tried reading the Item Pipelines documentation, but that appears to cover what happens after "an item has been scraped by a spider", not the field names themselves (could be totally wrong though).
Any help in this direction would be really helpful!
I would suggest using a third-party lib called scrapy-jsonschema. With it you can define your Items like this:
from scrapy_jsonschema.item import JsonSchemaItem

class MyItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "MyItem",
        "description": "My Item with spaces",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for the employee",
                "type": "integer"
            },
            "name": {
                "description": "Name of the employee",
                "type": "string"
            },
            "job title": {
                "description": "The title of employee's job.",
                "type": "string"
            }
        },
        "required": ["id", "name", "job title"]
    }
And populate it like this:
item = MyItem()
item['job title'] = 'Boss'
You can read more about it here. This solution addresses the Item definition as you asked, but you can achieve similar results without defining an Item. For example, you could just scrape the data into a dict and yield it back to scrapy:
yield {
    "id": response.xpath('...').get(),
    "name": response.xpath('...').get(),
    "job title": response.xpath('...').get(),
}
With scrapy crawl myspider -o file.csv, that will export a CSV whose columns have the names you chose. You could also have the spider (or an item pipeline) write the CSV directly; there are several ways to do it without an Item definition.
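For completeness, here is a minimal sketch of the pipeline route; the class name, output file, and field mapping are illustrative assumptions, not part of scrapy-jsonschema:

import csv

class CsvExportPipeline:
    # write scraped items to a CSV whose headers need not be valid identifiers
    def open_spider(self, spider):
        self.file = open('output.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=['id', 'name', 'Job Title'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # remap item keys to the exact headers the downstream system expects
        self.writer.writerow({
            'id': item.get('id'),
            'name': item.get('name'),
            'Job Title': item.get('job title'),
        })
        return item

    def close_spider(self, spider):
        self.file.close()

The pipeline would then be enabled through the usual ITEM_PIPELINES setting.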

Trouble saving repeated protobuf object to file (Python)

I'm new to protobuf, so I don't know how to frame the question correctly.
Anyway, I'm using this Model Config proto file. I converted it into Python using the command protoc -I=. --python_out=. ./model_server_config.proto from the Protocol Buffers page. Now I have some Python files which I can import and work with. My objective is to create a file (for running the TensorFlow model server with multiple models) which should look like the following:
model_config_list: {
    config: {
        name: "name1",
        base_path: "path1",
        model_platform: "tensorflow"
    },
    config: {
        name: "name2",
        base_path: "path2",
        model_platform: "tensorflow"
    },
    config: {
        name: "name3",
        base_path: "path3",
        model_platform: "tensorflow"
    },
}
Now, using the compiled Python package, I made a protobuf object which looks like this when I print it out:
model_config_list {
    config {
        name: "name1"
        base_path: "path1"
        model_platform: "tensorflow"
    }
    config {
        name: "name2"
        base_path: "path2"
        model_platform: "tensorflow"
    }
    config {
        name: "name3"
        base_path: "path3"
        model_platform: "tensorflow"
    }
}
But when serializing the object using objectname.SerializeToString(), I get weird output such as:
b'\n\x94\x01\n \n\x04name1\x12\x0cpath1"\ntensorflow\n7\n\x08name2\x12\x1fpath2"\ntensorflow\n7\n\x08name3\x12\x1fpath3"\ntensorflow'
I also tried converting it to JSON using protobuf's Python support, like this:
from google.protobuf.json_format import MessageToJson
MessageToJson(objectname)
which gave me a result like:
{
    "modelConfigList": {
        "config": [
            {
                "name": "name1",
                "basePath": "path1",
                "modelPlatform": "tensorflow"
            },
            {
                "name": "name2",
                "basePath": "path2",
                "modelPlatform": "tensorflow"
            },
            {
                "name": "name3",
                "basePath": "path3",
                "modelPlatform": "tensorflow"
            }
        ]
    }
}
with all the configs in a list and every value quoted as a string, which is not acceptable as a TensorFlow model server config.
Any ideas on how to write it into a file correctly? Or am I creating the objects incorrectly? Any help is welcome; thanks in advance.
I don't know anything about what system will be reading your file, so I can't say anything about how you should write it to a file. It really depends on how the Model Server expects to read it.
That said, I don't see anything wrong with how you're creating the message, or any of the serialization methods you've shown.
The print() method shows a "text format" proto, which is good for debugging and is sometimes used for storing configuration files. It's not very compact (field names are present in the file) and doesn't have all the backwards- and forwards-compatibility features of the binary representation. It's actually functionally the same as what you've said the file "should look like": the colons and commas are optional.
The SerializeToString() method uses the binary serialization format. This is arguably what Protocol Buffers were built to do. It's a compact representation and provides backwards and forwards compatibility, but it's not very human-readable.
As the name suggests, the json_format module provides a JSON representation of the message. That's perfectly good if the system you're interacting with expects JSON, but it's not exactly common.
Appendix: instead of using print(), the google.protobuf.text_format module has utilities better suited to using the text format programmatically. To write to a file, you could use:

from google.protobuf import text_format
(...)
with open(file_path, 'w') as output:
    text_format.PrintMessage(my_message, output)
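Putting the pieces together, an end-to-end sketch might look like the following. The generated module name model_server_config_pb2 and the message name ModelServerConfig are assumptions based on the proto file name, so check your generated code before relying on them:

from google.protobuf import text_format
from model_server_config_pb2 import ModelServerConfig  # name assumed from protoc output

config = ModelServerConfig()
for name, path in [('name1', 'path1'), ('name2', 'path2'), ('name3', 'path3')]:
    entry = config.model_config_list.config.add()  # append to the repeated config field
    entry.name = name
    entry.base_path = path
    entry.model_platform = 'tensorflow'

# write the text-format config the model server reads
with open('models.config', 'w') as output:
    text_format.PrintMessage(config, output)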

Add stdout of subprocess to JSON report if test case fails

I'm investigating methods of adding to the JSON report generated by either pytest-json or pytest-json-report; I'm not hung up on either plugin. So far, I've done the bulk of my evaluation using pytest-json. So, for example, the JSON object has this for a test case:
{
    "name": "fixture_test.py::test_failure1",
    "duration": 0.0012421607971191406,
    "run_index": 2,
    "setup": {
        "name": "setup",
        "duration": 0.00011181831359863281,
        "outcome": "passed"
    },
    "call": {
        "name": "call",
        "duration": 0.0008759498596191406,
        "outcome": "failed",
        "longrepr": "def test_failure1():\n> assert 3 == 4, \"3 always equals 3\"\nE AssertionError: 3 always equals 3\nE assert 3 == 4\n\nfixture_test.py:19: AssertionError"
    },
    "teardown": {
        "name": "teardown",
        "duration": 0.00014257431030273438,
        "outcome": "passed"
    },
    "outcome": "failed"
}
This is from experiments I'm running. In practice, some of the test cases work by spawning a subprocess via Popen, and the assert is that a certain string appears in its stdout. In the event that the test case fails, I need to add a key/value to the call dictionary which contains the stdout of that subprocess. I have tried in vain thus far to find the correct fixture or apparatus to accomplish this. It seems that pytest_exception_interact may be the way to go, but drilling into the JSON structure has thus far eluded me. All I need to do is add to or modify the JSON structure at the point of an error; pytest_runtest_call seems too heavy-handed.
Alternatively, is there a means of altering the value of longrepr in the above? I've been unable to find the correct way of doing either of these, and it's time to ask.
As it would appear, the pytest-json project is rather defunct. The developer/owner of pytest-json-report has this to say (under Related Tools at this link).
pytest-json has some great features but appears to be unmaintained. I borrowed some ideas and test cases from there.
The pytest-json-report project handles exactly the case that I'm requiring: capturing stdout from a subprocess and putting it into the JSON report. A crude example of doing so follows:
import subprocess as sp
import sys
import re

def specialAssertHandler(output, assertMessage):
    # because pytest automatically captures stdout/stderr, printing is all that's
    # needed; when the report is generated, this ends up in a field named "stdout"
    print(output)
    return assertMessage

def test_subProcessStdoutCapture():
    # NOTE: if your version of Python 3 is sufficiently mature, add text=True as well
    proc = sp.Popen(['find', '.', '-name', '*.json'], stdout=sp.PIPE)
    if sys.version_info[0] == 3:
        # on the Ubuntu I was using, Python 3's proc.stdout.read() returns a
        # binary object, not a string, so it has to be decoded
        output = proc.stdout.read().decode()
    else:
        # the other version of Python I'm using is 2.7.15; it's exceedingly
        # frustrating that the language changed so much between 2 and 3 -- in 2,
        # the output was already a string object
        output = proc.stdout.read()
    m = re.search('some string', output)
    assert m is not None, specialAssertHandler(output, "did not find 'some string' in output")
With the above, using pytest-json-report, the full output of the subprocess is captured by the infrastructure and placed into the aforementioned report. An excerpt showing this is below:
{
    "nodeid": "expirment_test.py::test_stdout",
    "lineno": 25,
    "outcome": "failed",
    "keywords": [
        "PyTest",
        "test_stdout",
        "expirment_test.py"
    ],
    "setup": {
        "duration": 0.0002694129943847656,
        "outcome": "passed"
    },
    "call": {
        "duration": 0.02718186378479004,
        "outcome": "failed",
        "crash": {
            "path": "/home/afalanga/devel/PyTest/expirment_test.py",
            "lineno": 32,
            "message": "AssertionError: Expected to find always\nassert None is not None"
        },
        "traceback": [
            {
                "path": "expirment_test.py",
                "lineno": 32,
                "message": "AssertionError"
            }
        ],
        "stdout": "./.report.json\n./report.json\n./report1.json\n./report2.json\n./simple_test.json\n./testing_addition.json\n\n",
        "longrepr": "..."
    },
    "teardown": {
        "duration": 0.0004875659942626953,
        "outcome": "passed"
    }
}
The longrepr field holds the full text of the failing test case, but in the interest of brevity it is shown here as an ellipsis. In the crash field, the value of assertMessage from my example is placed. This shows that it is possible to place such messages into the report at the point of occurrence instead of in post-processing.
I think it may be possible to handle this more cleverly using the hook I referenced in my original question, pytest_exception_interact. If I find that it is so, I'll update this answer with a demonstration.
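As an aside, pytest-json-report also documents a json_metadata fixture for attaching arbitrary key/value data to a test's entry in the report, which may be a more direct route than a print-based workaround. A minimal sketch (the key name is just illustrative):

def test_with_metadata(json_metadata):
    # json_metadata is provided by pytest-json-report; anything stored here
    # appears under the test's "metadata" key in the JSON report
    json_metadata['subprocess_stdout'] = 'captured output would go here'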

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/veoibd_schema.json",
    "title": "clinical data manifest schema",
    "description": "Validates clinical data manifests",
    "type": "object",
    "properties": {
        "individualID": {
            "type": "string",
            "pattern": "^CENTER-"
        },
        "medicationAtDx": {
            "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
        }
    },
    "required": [
        "individualID",
        "medicationAtDx"
    ]
}
The schema referenced by the $ref looks like this:
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "http://example.com/clinicalData.json",
    "definitions": {
        "ageDxYears": {
            "description": "Age in years at diagnosis",
            "type": "number",
            "minimum": 0,
            "maximum": 90
        },
        "ageOnset": {
            "description": "Age in years of first symptoms",
            "type": "number",
            "exclusiveMinimum": 0
        },
        "medicationAtDx": {
            "description": "Medication prescribed at diagnosis",
            "type": "string"
        }
    }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to be able to determine the "type" of "medicationAtDx" and am trying to figure out how to use jsonschema.RefResolver to dereference it, but I am a little lost in the terminology used in the documentation and can't find a good example that explains what the parameters are and what it returns "in small words", i.e. something a beginning JSON schema user would easily understand.
I created a RefResolver from the clinical manifest schema:

import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)

Then I fed it the URL in the "$ref":

meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment, as follows:

ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)

This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back onto the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding exactly what parameters resolve_fragment expects.
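One hedged reading of those errors: resolve_fragment expects the already-fetched document (a dict), not the document's URL string, so the remote schema has to be resolved first. A minimal sketch under that assumption:

# assumption: resolve_fragment(document, fragment) walks a JSON pointer
# through an already-loaded document, so fetch the remote schema first
remote_doc = testref.resolve_remote(ref_doc)             # fetch and parse clinicalData.json
med_dx = testref.resolve_fragment(remote_doc, ref_frag)  # walk "/definitions/medicationAtDx"
print(med_dx)  # should print just the medicationAtDx definition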
