Is it possible to use references in JSON? - python

I have this JSON:
{
  "app_name": "my_app",
  "version": {
    "1.0": {
      "path": "/my_app/1.0"
    },
    "2.0": {
      "path": "/my_app/2.0"
    }
  }
}
Is it somehow possible to reference the app_name value and the keys under version so that I don't have to repeat "my_app" and the version numbers?
I was thinking something along the lines of... (code totally made up):
{
  "#app_name": "my_app",
  "version": {
    "1.0": {
      "path": "/{{$app_name}}/{{key[-1]}}"
    },
    "2.0": {
      "path": "/{{$app_name}}/{{key[-1]}}"
    }
  }
}
Or is this something that could instead be handled better using YAML?
In the end, I intend to read this data into a Python dictionary.

No, JSON does not have references. (The kind of expansion you are asking for would open the parser up to memory-exhaustion attacks, much like XML entity expansion; by not supporting it, JSON avoids that class of vulnerability.)
If you want such functionality, you need to implement it yourself.

Not in pure JSON, but you could perform string substitution after you parse the JSON.
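For example, a minimal post-processing sketch (the {app_name}/{version} placeholder convention is invented here for illustration; only str.format and the parsed dict are doing the work):

import json

raw = '''
{
  "app_name": "my_app",
  "version": {
    "1.0": {"path": "/{app_name}/{version}"},
    "2.0": {"path": "/{app_name}/{version}"}
  }
}
'''

data = json.loads(raw)

# Expand the placeholders ourselves after parsing.
for version, entry in data["version"].items():
    entry["path"] = entry["path"].format(app_name=data["app_name"], version=version)

print(data["version"]["2.0"]["path"])  # /my_app/2.0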

Related

Elasticsearch prevent indexing of Markdown hyperlinks

I am building a Markdown file content search using Elasticsearch. Currently the whole content of each MD file is indexed in Elasticsearch, so the search results show raw links such as [Mylink](https://link-url-here.org) and [Mylink2](another_page.md).
I would like to prevent indexing of hyperlinks and references to other pages. When someone searches for "Mylink" it should return only the text, without the URL. It would be great if someone could help me with the right solution for this.
You need to render the Markdown in your indexing application, then remove the HTML tags and save the result alongside the Markdown source.
I think you have two main solutions for this problem.
First: clean the data in your source code before indexing it into Elasticsearch (a minimal sketch follows below).
Second: use an Elasticsearch ingest pipeline to clean the data for you.
The first solution is the easier one, but if you need to do this processing inside Elasticsearch you have to create an ingest pipeline.
You can then use the script processor to clean the data with a Painless script that finds your regex and removes the matches.
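If you go with the first option, a minimal pre-indexing sketch (the regex and function name are illustrative, not part of any library):

import re

# Replace [text](url) with just the link text before sending the doc to Elasticsearch.
LINK_RE = re.compile(r"\[([^\]\[]+)\]\([^)]+\)")

def strip_markdown_links(markdown):
    return LINK_RE.sub(r"\1", markdown)

print(strip_markdown_links("[Mylink](https://link-url-here.org) and [Mylink2](another_page.md)"))
# prints: Mylink and Mylink2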
You could use an ingest pipeline with a script processor to extract the link text:
1. Set up the pipeline
PUT _ingest/pipeline/clean_links
{
  "description": "...",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            // nothing to do here
            return;
          }
          def content = ctx["content"];
          Pattern pattern = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/;
          Matcher matcher = pattern.matcher(content);
          def purged_content = matcher.replaceAll("$1");
          ctx["purged_content"] = purged_content;
        """
      }
    }
  ]
}
The regex can be tested here and is inspired by this.
2. Include the pipeline when ingesting the docs
POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}
POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink2](another_page.md)"
}
The python docs are here.
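For reference, the equivalent ingest call with the official Python client looks roughly like this (a sketch; document is the 8.x keyword argument, older clients use body):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The pipeline runs server-side before the document is stored.
es.index(
    index="my-index",
    pipeline="clean_links",
    document={"content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"},
)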
3. Verify
GET my-index/_search?filter_path=hits.hits._source
should yield
{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "purged_content" : "Mylink anotherLink",
          "content" : "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
        }
      },
      {
        "_source" : {
          "purged_content" : "Mylink2",
          "content" : "[Mylink2](another_page.md)"
        }
      }
    ]
  }
}
You could instead overwrite the original content if you want to fully discard the links from your _source.
Or you could go a step further in the other direction and store the text + link pairs in a nested field of the form:
{
  "content": "...",
  "links": [
    {
      "text": "Mylink",
      "href": "https://link-url-here.org"
    },
    ...
  ]
}
so that when you later decide to make them searchable, you'll be able to do so with precision.
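A sketch of how you might build that structure in Python before indexing (again, the regex is illustrative):

import re

LINK_RE = re.compile(r"\[([^\]\[]+)\]\(([^)]+)\)")

def doc_with_links(markdown):
    # Keep the raw content and pull each (text, href) pair into a nested list.
    return {
        "content": markdown,
        "links": [{"text": text, "href": href} for text, href in LINK_RE.findall(markdown)],
    }

print(doc_with_links("[Mylink](https://link-url-here.org)"))
# {'content': '[Mylink](https://link-url-here.org)',
#  'links': [{'text': 'Mylink', 'href': 'https://link-url-here.org'}]}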
Shameless plug: you can find other hands-on ingestion guides in my Elasticsearch Handbook.

singer tap-zendesk - how to extract catalog.json with selected:True from discovery mode

I am using singer's tap-zendesk library and want to extract data from specific schemas.
I am running the following command in sync mode:
tap-zendesk --config config.json --catalog catalog.json
Currently my config.json file has the following parameters:
{
  "email": "<email>",
  "api_token": "<token>",
  "subdomain": "<domain>",
  "start_date": "<start_date>"
}
I've managed to extract data by putting 'selected':true under schema, properties and metadata in the catalog.json file. But I was wondering if there was an easier way to do this? There are around 15 streams I need to go through.
I managed to get the catalog.json file through the discovery mode command:
tap-zendesk --config config.json --discover > catalog.json
The output looks something like the following, but that means that I have to go and add selected:True under every field.
{
  "streams": [
    {
      "stream": "tickets",
      "tap_stream_id": "tickets",
      "schema": {
        "selected": "true",
        "properties": {
          "organization_id": {
            "selected": "true"
          },
          ...
        }
      },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected": "true"
          }
        },
        ...
      ]
    },
    ...
  ]
}
The selected = true needs to be applied only once per stream: add it to the metadata entry under the stream whose "breadcrumb" is []. This is very poorly documented.
Please see this blog post for some helpful details: https://medium.com/getting-started-guides/extracting-ticket-data-from-zendesk-using-singer-io-tap-zendesk-57a8da8c3477
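Since the selection lives in one metadata entry per stream, you can also script it instead of editing 15 streams by hand. A minimal sketch, assuming a catalog.json produced by --discover:

import json

with open("catalog.json") as f:
    catalog = json.load(f)

# Mark every stream as selected via its root metadata entry (breadcrumb == []).
for stream in catalog["streams"]:
    for entry in stream["metadata"]:
        if entry["breadcrumb"] == []:
            entry["metadata"]["selected"] = True

with open("catalog.json", "w") as f:
    json.dump(catalog, f, indent=2)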

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/veoibd_schema.json",
  "title": "clinical data manifest schema",
  "description": "Validates clinical data manifests",
  "type": "object",
  "properties": {
    "individualID": {
      "type": "string",
      "pattern": "^CENTER-"
    },
    "medicationAtDx": {
      "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
    }
  },
  "required": [
    "individualID",
    "medicationAtDx"
  ]
}
The schema referenced by the $ref looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/clinicalData.json",
  "definitions": {
    "ageDxYears": {
      "description": "Age in years at diagnosis",
      "type": "number",
      "minimum": 0,
      "maximum": 90
    },
    "ageOnset": {
      "description": "Age in years of first symptoms",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "medicationAtDx": {
      "description": "Medication prescribed at diagnosis",
      "type": "string"
    }
  }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to be able to figure out the "type" of "medicationAtDx" and am trying to figure out how to use jsonschema.RefResolver to de-reference it, but am a little lost in the terminology used in the documentation and can't find a good example that explains what the parameters are and what it returns "in small words", i.e. something that a beginning JSON schema user would easily understand.
I created a RefResolver from the clinical manifest schema:
import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)
I fed it the url in the "$ref":
meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment as follows:
ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)
This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back into the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding the exact parameters that resolve_fragment is expecting.
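Judging from those errors, resolve_fragment expects the parsed document itself (a dict), not its URL string, which would explain the TypeError. A hedged sketch reusing the variables above:

ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]

# resolve_remote fetches and parses the referenced document ...
remote_doc = testref.resolve_remote(ref_doc)

# ... and resolve_fragment walks that parsed dict using the JSON pointer.
definition = testref.resolve_fragment(remote_doc, ref_frag)
print(definition)  # {'description': 'Medication prescribed at diagnosis', 'type': 'string'}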

pymongo include javascript in aggregate query

I'm currently tasked with researching databases and am trying various queries using the pymongo library to investigate suitability for given projects.
My timestamps are saved in millisecond integer format and I'd like to do a simple sales by day aggregated query. I understand from here (answer by Alexandre Russel) that as the timestamps weren't uploaded in BSON format I can't use date and time functions to create bins, but can manipulate timestamps using embedded javascript.
As such I've written the following query:
[{
  "$project": {
    "year": {
      "$year": {
        "$add": ["new Date(0)", "$data.horaContacto"]
      }
    },
    "month": {
      "$month": {
        "$add": ["new Date(0)", "$data.horaContacto"]
      }
    }
  }
}, {
  "$group": {
    "_id": {
      "year": "$year",
      "month": "$month"
    },
    "sales": {
      "$sum": {
        "$cond": ["$data.estadoVenta", 1, 0]
      }
    }
  }
}]
But get this error:
pymongo.errors.OperationFailure: exception: $add only supports numeric or date types, not String
I think what's happening is that the JS "new Date(0)" is being interpreted by the Mongo driver as a string, not evaluated as JS. If I remove the surrounding double quotes, then Python tries to interpret the code and errors accordingly. This is just one example: I'd like to include more JS in future test queries, but can't see a way to get it to play nicely with Python (having said this, I'm fairly new to Python too).
Does anybody know if:
1. I'm correct in assuming the error occurs because Mongo interprets the JS as a string and tries to sum it directly?
2. I can indicate to Mongo from Python that this is JS, without Python trying to interpret the code?
So far I've tried searching via Google and various combinations of single and double inverted commas.
Pasted below are a few rows of randomly generated test data, if required:
{'_id': 0,'data': {'edad': '74','estadoVenta': True,'visits': [{'visitLength': 1819.349246663518,'visitNo': 1,'visitTime': 1480244647948.0}],'apellido2': 'Aguilar','apellido1': 'Garcia','horaContacto': 1464869545373.0,'preNombre': 'Agustin','_id': 0,'telefono': 630331272,'location': {'province': 'Aragón','city': 'Zaragoza','type': 'Point','coordinates': [-0.900203, 41.747726],'country': 'Spain'}}},
{'_id': 1,'data': {'edad': '87','estadoVenta': False,'visits': [{'visitLength': 2413.9938072105024,'visitNo': 1,'visitTime': 1465417353597.0}],'apellido2': 'Torres','apellido1': 'Acosta','horaContacto': 1473404147769.0,'preNombre': 'Sara','_id': 1,'telefono': 665968746,'location': {'province': 'Galicia','city': 'Cualedro','type': 'Point','coordinates': [-7.659321, 41.925328],'country': 'Spain'}}},
{'_id': 2,'data': {'edad': '48','estadoVenta': True,'visits': [{'visitLength': 2413.9938072105024,'visitNo': 1,'visitTime': 1465415138597.0}],'apellido2': 'Perez','apellido1': 'Sanchez','horaContacto': 1473404923569.0,'preNombre': 'Sara','_id': 2,'telefono': 665967346,'location': {'province': 'Galicia','city': 'Barcelona','type': 'Point','coordinates': [-7.659321, 41.925328],'country': 'Spain'}}}
Thanks,
James
The MongoDB aggregation framework cannot run any JavaScript. You must specify all the data in your aggregation pipeline using BSON. PyMongo can translate a standard Python datetime to BSON, and you can send it as part of the aggregation pipeline, like so:
import datetime
epoch = datetime.datetime.fromtimestamp(0)
pipeline = [{
"$project": {
"year": {
"$year": {
"$add": [epoch, "$data.horaContacto"]
}
},
# the rest of your pipeline here ....
}
}]
cursor = db.collection.aggregate(pipeline)
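For completeness, a sketch of the full pipeline with the $group stage from the question folded in (db is assumed to be a pymongo database handle). Note that estadoVenta has to be carried through $project, because later stages only see projected fields:

import datetime

epoch = datetime.datetime.fromtimestamp(0)

pipeline = [
    {"$project": {
        "year": {"$year": {"$add": [epoch, "$data.horaContacto"]}},
        "month": {"$month": {"$add": [epoch, "$data.horaContacto"]}},
        "estadoVenta": "$data.estadoVenta",  # pass through for the $group stage
    }},
    {"$group": {
        "_id": {"year": "$year", "month": "$month"},
        "sales": {"$sum": {"$cond": ["$estadoVenta", 1, 0]}},
    }},
]

cursor = db.collection.aggregate(pipeline)
for doc in cursor:
    print(doc)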

Convert a JSON schema to a python class

Is there a python library for converting a JSON schema to a python class definition, similar to jsonschema2pojo -- https://github.com/joelittlejohn/jsonschema2pojo -- for Java?
So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
...     'name': 'Country',
...     'properties': {
...         'name': {'type': 'string'},
...         'abbreviation': {'type': 'string'},
...     },
...     'additionalProperties': False,
... }
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that Warlock produces lack much in the way of introspectible goodies. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem that I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark, the only downside is that meta-classes in Python can't really be turned into 'code'.
Other useful modules to look for:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - an interesting structure builder that's halfway between a dotdict and construct
If you end up finding a good one-stop solution for this, please follow up your question - I'd love to find one. I pored over GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p
python-jsonschema-objects is an alternative to warlock, built on top of jsonschema.
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in python.
Usage:
Sample JSON schema
schema = '''{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        },
        "dogs": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 4
        },
        "gender": {
            "type": "string",
            "enum": ["male", "female"]
        },
        "deceased": {
            "enum": ["yes", "no", 1, 0, "true", "false"]
        }
    },
    "required": ["firstName", "lastName"]
}'''
Converting the schema object to a class
import python_jsonschema_objects as pjs
import json
schema = json.loads(schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
Person = ns.ExampleSchema
james = Person(firstName="James", lastName="Bond")
>>> james.lastName
u'Bond'
>>> james
<example_schema lastName=Bond age=None firstName=James>
Validation:
>>> james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
The problem is that it is still using Draft 4 validation while jsonschema has moved beyond Draft 4; I filed an issue on the repo regarding this.
Unless you are using an old version of jsonschema, the above package will work as shown.
I just created this small project to generate code classes from a JSON schema; even though it deals with Python, I think it can be useful when working on business projects:
pip install jsonschema2popo
Running the following command will generate a Python module containing the classes defined in the JSON schema (it uses Jinja2 templating):
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
more info at: https://github.com/frx08/jsonschema2popo
