I am working on a system that outputs a JSON file, and I use Python to parse the data and display it in a UI (PySide). I would now like to add filtering to that system, and rather than writing a query system myself, an existing one for JSON (in Python) would save me a lot of development time. I found this thread:
Is there a query language for JSON?
but that's more for a Web-based system. Any ideas on a Python equivalent?
EDIT [for clarity]:
The format of the data that I'll be generating is like this:
{
    "Operations": [
        {
            "OpID": "0",
            "type": "callback",
            "stringTag1": "foo1",
            "stringTag2": "FooMsg",
            "Children": [...],
            "value": "0.000694053"
        },
        {
            "OpID": "1",
            "type": "callback",
            "stringTag1": "moo1",
            "string2": "MooMsg",
            "Children": [...],
            "value": "0.000468427"
        }
    ]
}
Where 'Children' could be nested arrays of the same thing (other operations). The system will be built to allow users to add their own tags to the data as well. My hope was to have a querying system that would let users define their own 'filters' too, hence the question about a query language. If there was something that would let me write something like SELECT * WHERE type == "callback" and get the matching operations back, that would be great.
The suggestion of Pynq is interesting; I'll give that a look.
I notice this question was asked a few years ago, but if someone else finds this, here are some newer projects trying to address the same problem:
ObjectPath (for Python and Javascript): http://objectpath.org/
jsonpath (Python reimplementation of the Javascript equivalent): https://pypi.org/project/jsonpath/
yaql: https://yaql.readthedocs.io/en/latest/readme.html
pyjq (Python bindings for jq https://stedolan.github.io/jq/): https://pypi.org/project/pyjq/
JMESPath: https://github.com/jmespath/jmespath.py
I personally went with pyjq because I use jq all the time for data exploration but ObjectPath seems very attractive and not limited to json.
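For instance, here is a minimal (untested) sketch of the OP's SELECT * WHERE type == "callback" filter with pyjq; the filename is a placeholder:

import json
import pyjq  # pip install pyjq

# assuming the OP's structure: {"Operations": [{...}, {...}]}
with open("myfile.json") as f:  # placeholder filename
    data = json.load(f)

# equivalent of: SELECT * WHERE type == "callback"
callbacks = pyjq.all('.Operations[] | select(.type == "callback")', data)
print(callbacks)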
I thought about this a little bit, and I leaned away from something as specific as a "JSON query language" towards something more generic. I remembered from working with C# a bit that it has a fairly generic querying system called LINQ for handling this sort of querying problem.
It looks as though Python has something similar called Pynq which supports basic querying such as:
filtered_collection = From(some_collection).where("item.property > 10").select_many()
It even appears to have some basic aggregation functions. While not specific to JSON, I think it's at least a good starting point for querying.
You can also check out PythonQL, a query language extension to Python that handles SQL and JSON queries: pythonql
pyjsonq
https://github.com/s1s1ty/py-jsonq
from pyjsonq import JsonQ
qe = JsonQ('myfile.json')
res = qe.at('products').where('cat', '=', 2).get()
print(res)
"""
[
{
id: 3,
city: 'dhk',
name: 'Redmi 3S Prime',
cat: 2,
price: 12000
},
...
]
I think it's important that the interaction with the JSON happens in memory, so that you can still handle more complex criteria manually.
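For example, if I remember the py-jsonq API correctly, it also accepts an already-loaded dict, so you can query data you still have in memory (the payload here is made up):

from pyjsonq import JsonQ

payload = {"products": [{"id": 1, "cat": 1}, {"id": 3, "cat": 2}]}  # hypothetical in-memory data
qe = JsonQ(data=payload)
res = qe.at("products").where("cat", "=", 2).get()
print(res)  # [{'id': 3, 'cat': 2}]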
Is there any function available in Python to convert a given JSON input into HL7 FHIR format, by passing a Liquid template (Shopify) and the input source data?
Not that I am aware of, but I stand to be corrected; you will most likely have to create your own.
See this example of someone who had a similar problem: https://forums.librehealth.io/t/project-rest-json-to-fhir-json-mapping-implementation/1765/14
I think you are looking for something like the fhir.resources library, where you can pass a dictionary of data to an object that can then be serialized.
from fhir.resources.organization import Organization
from fhir.resources.address import Address

# plain dict holding the source data
data = {
    "id": "f001",
    "active": True,
    "name": "Acme Corporation",
    "address": [{"country": "Switzerland"}]
}

# build (and validate) the FHIR resource from the dict
org = Organization(**data)
assert org.resource_type == "Organization"
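From there, depending on the fhir.resources release you have installed, serializing back out should be something along the lines of the following (the method name is the usual pydantic one, so check your version's docs):

fhir_json = org.json()  # pydantic-v1-based releases; newer releases may expose a different method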
P.S. But don't quite follow why you mentioned Liquid Templates.
I'm downloading tweets from the Twitter Streaming API using Tweepy. I can check whether the downloaded data has keys such as 'extended_tweet', but I'm struggling with a specific key inside another key.
def on_data(self, data):
    savingTweet = {}
    if not "retweeted_status" in data:
        dataJson = json.loads(data)
        if 'extended_tweet' in dataJson:
            savingTweet['text'] = dataJson['extended_tweet']['full_text']
        else:
            savingTweet['text'] = dataJson['text']
        if 'coordinates' in dataJson:
            if 'coordinates' in dataJson['coordinates']:
                savingTweet['coordinates'] = dataJson['coordinates']['coordinates']
            else:
                savingTweet['coordinates'] = 'null'
I'm checking 'extended_tweet' properly, but when I try to do the same with ['coordinates']['coordinates'] I get the following error:
TypeError: argument of type 'NoneType' is not iterable
Twitter documentation says that key 'coordinates' has the following structure:
"coordinates":
{
"coordinates":
[
-75.14310264,
40.05701649
],
"type":"Point"
}
I managed to solve it by just wrapping the problematic check in a try/except, but I don't think that is the most suitable approach to the problem. Any other ideas?
So the Twitter API docs are probably lying a bit about what they return (shock, horror!), and it looks like you're getting a None in place of the expected data structure. You've already decided against using try/except, so I won't go over that, but here are a few other suggestions.
Using dict.get() with a default
There are a couple of options that occur to me. The first is to make use of the default argument of dict.get(): you can provide a fallback if the expected key does not exist, which allows you to chain multiple calls together.
For example you can achieve most of what you are trying to do with the following:
# here `data` is the already-parsed dict (the result of json.loads)
return {
    # fall back to the plain 'text' field when there is no extended tweet
    'text': data.get('extended_tweet', {}).get('full_text', data['text']),
    # 'coordinates' can be present but None, so fall back to an empty dict first
    'coordinates': (data.get('coordinates') or {}).get('coordinates', 'null'),
}
It's not super pretty, but it does work. It's likely to be a little slower than what you are doing, too.
Using JSONPath
Another option, which is likely overkill for this situation is to use a JSONPath library which will allow you to search within data structures for items matching a query. Something like:
from jsonpath_rw import parse  # pip install jsonpath-rw

# search the parsed data for anything matching the path expression
matches = parse('extended_tweet.full_text').find(data)
if matches:
    print(matches[0].value)
This is going to be a lot slower than what you are doing, and for just a few fields it's overkill, but if you are doing a lot of this kind of work it could be a handy tool in the box. JSONPath can also express much more complicated or very deeply nested paths where the get() approach might not work or would be unwieldy.
Parse the JSON first!
The last thing I would mention is to make sure you parse your JSON before you test for "retweeted_status". As written, that test is a substring check on the raw string, so if the text appears anywhere (say, inside the text of a tweet) it will trigger.
JSON parsing with a competent library is usually extremely fast too, so unless you are having real speed problems it's not necessarily worth worrying about.
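As a rough sketch of that last point, combined with the get() approach above (hedged; it assumes json is already imported and the fields are as in the question):

def on_data(self, data):
    dataJson = json.loads(data)  # parse first
    if "retweeted_status" not in dataJson:  # key lookup on the dict, not a substring check
        savingTweet = {
            'text': dataJson.get('extended_tweet', {}).get('full_text', dataJson.get('text')),
            # 'coordinates' can be present but None, so fall back to an empty dict first
            'coordinates': (dataJson.get('coordinates') or {}).get('coordinates', 'null'),
        }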
I need to pass data from a Python back end to a front end through an API call, using a JSON format. In the Python back end, the data is in a dictionary structure, which I can easily and directly convert to JSON. But should I?
My front-end developer believes the answer is no, for reasons related to best practice.
But I challenge that:
Is it best to structure the JSON as it is in Python, or should it rather be converted to some other form, such as several arrays (as would be necessary in my example case below)?
Or, differently put:
What should the governing principles be for collections/dicts/maps/arrays when interfacing information through JSON?
I've done some googling for an answer, but I've not come across much that addresses this directly. Links would be appreciated.
(Note about the example below: of course if the data is written to a database, it would probably make most sense for the front-end to access the database directly, but let's assume this is not the case)
Example:
In the back end there is a collection of objects called pets:
each item in the collection has a unique pet_id, some non-optional properties (e.g. name and date_of_birth), some optional properties (registration_certificate_nr, adopted_from_kennel), some lists like siblings and children, and some objects like medication.
Assuming that the front end needs all of this info at some point, it could be
{
"pets": {
"17-01-24-01": {
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
},
"17-03-04-01": {
"name": "Hooch",
"date_of_birth": "05/02/2015",
"adopted_from_kennel": "Pretoria Shire",
"children": [
"17-05-01-01",
"17-05-01-02",
"17-05-01-03"
]
},
"17-05-01-01": {
"name": "Snappy",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-02",
"17-05-01-03"
]
},
"17-05-01-02": {
"name": "Gizmo",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-01",
"17-05-01-03"
]
},
"17-05-01-03": {
"name": "Toothless",
"date_of_birth": "17-05-01",
"siblings": [
"17-05-01-01",
"17-05-01-03"
],
"medication": [
{
"name": "anti-worm",
"code": "aw445",
"dosage": "1 pill per day"
},
{
"name": "disinfectant",
"code": "pdi-2",
"dosage": "as required"
}
]
}
}
}
JSON formatting is a somewhat subjective matter, and related disagreements are usually best settled between colleagues.
That being said, there are some potentially valid criticisms to be made against the JSON format in the question, especially if we are trying to create a consistent, RESTful API.
The 2 pain points that stand out:
The collection is represented as a map (an object keyed by pet_id), even though JSON has no map type, and it isn't particularly RESTful.
None of the pet objects have an id defined. There is a pet_id mentioned in the question, but it seems to be maintained separately from the pet object itself. If a value is accessed in the pets map in the question, a user of the API would have to manually add the pet_id to the provided pet object in order to have the id available further down the line, when the full JSON may no longer be available.
The closest things we have to guiding standards in this situation is the REST architectural style and the JSON standard.
We can start by looking at the JSON standard. Here is a quote from the JSON wiki:
JavaScript syntax defines several native data types that are not included in the JSON standard: Map, Set, Date, Error, Regular Expression, Function, Promise, and undefined.
The key takeaway here is that JSON is not meant to represent the map data type. Python dictionaries are a map implementation, so directly serializing a dictionary to JSON with the intent to represent a map-like collection goes against the intended use of JSON.
For an individual object like a pet, the JSON object is appropriate, but for collections there is one option: the JSON array. There is a usage example with the JSON array further down in this answer.
There may be edge cases where deviating from the standard makes sense, but I don't see a reason in this scenario.
There are also some shortcomings in the JSON format from a RESTful design perspective. RESTful API design is nice because it encourages one to keep things simple and consistent. It also happens to be a de facto industry standard.
In a RESTful HTTP API, this is how fetching a single pet resource should look:
Request: GET /api/pets/17-01-24-01
Response: 200 {
"id": "17-01-24-01",
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
}
The response is a completely defined resource with an explicitly defined id. It is also the simplest complete JSON representation of a pet.
Next, we define what fetching multiple pet resources looks like, assuming only 2 pets are defined:
Request: GET /api/pets
Response: 200 [
{
"id": "17-01-24-01",
"name": "Buster",
"date_of_birth": "04/01/2017",
"registration_certificate_nr": "AAD-1123-1432"
},
{
"id": "17-03-04-01",
"name": "Hooch",
"date_of_birth": "05/02/2015",
"adopted_from_kennel": "Pretoria Shire",
"children": [
"17-05-01-01",
"17-05-01-02",
"17-05-01-03"
]
}
]
The above response format is the most straightforward way to pluralize the single-resource response format, keeping the API as simple and consistent as possible (for the sake of brevity, I only used two of the sample resources from the question). Once again, the ids are explicitly defined and belong to their respective pet objects.
Nothing is gained from adding map keys to the above format.
Proponents of the JSON format in the question may suggest to just add the id field into each pet object in order to work around pain point 2, but that would raise the question of repeating data within the response. Why does the id need to be both inside and outside the object? Surely it should only be on the inside? After eliminating the redundant data, the result will look like the response above.
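If the back end already holds the pets in a dict keyed by pet_id (as in the question), producing the array format above is a one-liner; a sketch, assuming pets is that dict:

# pets is the back-end dict keyed by pet_id, as described in the question
pets_list = [{"id": pet_id, **attrs} for pet_id, attrs in pets.items()]
# json.dumps(pets_list) then yields the RESTful array response shown above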
That is the REST argument. There are use cases where REST doesn't really work, but this is far from that.
PS. Front ends should never access databases directly. The API is responsible for writing to and reading from whatever data persistence infrastructure is used. In a lot of bigger real world systems, there is even an additional BFF layer between the front end and the API(s), separating the front end and the DB even further.
I have a file that has one JSON per line. Here is a sample:
{
"product": {
"id": "abcdef",
"price": 19.99,
"specs": {
"voltage": "110v",
"color": "white"
}
},
"user": "Daniel Severo"
}
I want to create a parquet file with columns such as:
product.id, product.price, product.specs.voltage, product.specs.color, user
I know that parquet has a nested encoding using the Dremel algorithm, but I haven't been able to use it in python (not sure why).
I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is json data -> dask -> parquet -> pandas, although if anyone has a simple example of creating and reading these nested encodings in parquet using Python I think that would be good enough :D
EDIT
So, after digging in the PRs I found this: https://github.com/dask/fastparquet/pull/177
which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?
dask version: 0.15.1
fastparquet version: 0.1.1
Implementing the conversions on both the read and write path for arbitrary Parquet nested data is quite complicated to get right -- implementing the shredding and reassembly algorithm with associated conversions to some Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported at the moment). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.
This can be analogously implemented in fastparquet as well, but it's going to be a lot of work (and test cases to write) no matter how you slice it.
I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.
I believe this feature has finally been added in arrow/pyarrow 2.0.0:
https://issues.apache.org/jira/browse/ARROW-1644
https://arrow.apache.org/docs/python/json.html
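For the newline-delimited file in the question, a minimal sketch with pyarrow >= 2.0 (the file names are placeholders):

import pyarrow.json as pa_json
import pyarrow.parquet as pq

# one JSON document per line, as in the question
table = pa_json.read_json("products.jsonl")   # nested JSON objects become struct columns
pq.write_table(table, "products.parquet")

# read it back into pandas; struct columns come back as dict-valued columns
df = pq.read_table("products.parquet").to_pandas()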
This is not exactly the right answer, but it can help.
We could try to convert your dictionary to a pandas DataFrame, and then write it to a .parquet file:
import pandas as pd
from fastparquet import write, ParquetFile
d = {
"product": {
"id": "abcdef",
"price": 19.99,
"specs": {
"voltage": "110v",
"color": "white"
}
},
"user": "Daniel Severo"
}
df_test = pd.DataFrame(d)
write('file_test.parquet', df_test)
This raises an error:
ValueError: Can't infer object conversion type: 0 abcdef
1 19.99
2 {'voltage': '110v', 'color': 'white'}
Name: product, dtype: object
So an easy solution is to convert the product column to lists:
df_test['product'] = df_test['product'].apply(lambda x: [x])
# this should now work
write('file_test.parquet', df_test)
# and now compare the file with the initial DataFrame
ParquetFile('file_test.parquet').to_pandas().explode('product')
index product user
0 id abcdef Daniel Severo
1 price 19.99 Daniel Severo
2 specs {'voltage': '110v', 'color': 'white'} Daniel Severo
Has anyone successfully performed a put operation of a map into dynamodb using boto (python)?
I basically need to put a JSON object. So far I have only been able to put it as a JSON string, but I cannot find an example of inserting a map anywhere.
Thanks a lot.
Since it does not look like boto supports JSON in its high-level API, you have to use the low-level API and annotate your JSON object into the DynamoDB wire format, like this:
"time": {
"M": {
"creation_timestamp_utc": {
"S": "2012-08-31T03:35:56.881Z"
},
"localtime": {
"S": "12:25:31"
},
"received_timestamp_utc": {
"S": "2012-08-31T07:50:50.367Z"
},
"spacecraft_clock": {
"S": "399657440.746"
}
}
In the above snippet, M is used to denote a "map" attribute, and S denotes a string value for each of the entries. You can find more information on what annotations to use for each type here.
I can understand why this is extremely annoying to do though, so you could always open an issue (perhaps there is already one open) at https://github.com/boto/boto/issues/new so they are aware of the feature request.
Support for Maps and Lists is now available in boto v2.35:
https://github.com/boto/boto/issues/2737
To upgrade: pip install -U boto
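With an upgraded boto 2.x you should then be able to put a plain nested dict through the high-level dynamodb2 API, along these lines (a sketch; the table name and key are hypothetical):

from boto.dynamodb2.table import Table

table = Table('my_table')  # hypothetical table name
table.put_item(data={
    'id': '399657440.746',          # hypothetical hash key
    'time': {                       # nested dict stored as a DynamoDB Map with boto >= 2.35
        'creation_timestamp_utc': '2012-08-31T03:35:56.881Z',
        'localtime': '12:25:31',
    },
})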