I am trying to call an API which in turn triggers a stored procedure in our SQL Server database. This is how I coded it:
class Api_Name(Resource):
    def __init__(self):
        pass

    @classmethod
    def get(self):
        try:
            engine = database_engine
            connection = engine.connect()
            sql = "DECLARE @return_value int EXEC @return_value = [dbname].[dbo].[proc_name]"
            return call_proc(sql, connection)
        except Exception as e:
            return {'message': 'Proc execution failed with error => {error}'.format(error=e)}, 400
call_proc is the method where I return the JSON from the database:
from flask import Response
import json

def call_proc(sql: str, connection):
    try:
        json_data = []
        rv = connection.execute(sql)
        for result in rv:
            json_data.append(dict(zip(result.keys(), result)))
        return Response(json.dumps(json_data), status=200)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()
The problem with the output is the way the JSON is returned and its size.
At first the API used to take 1 minute 30 seconds, when the return statement was like this:
case1: return Response(json.dumps(json_data), status=200, mimetype='application/json')
After looking online, I found that the above statement was trying to prettify the JSON, so I removed mimetype from the response and made it:
case2: return Response(json.dumps(json_data), status=200)
Now the API runs for 30 seconds; the JSON output is not aligned properly, but it's still JSON.
I see that the size of the JSON returned from the API is close to 20 MB. I observed this in the Postman response:
Status: 200 OK    Time: 29s    Size: 19 MB
The difference in JSON output:
case1:
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
case2:
[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]
Is the JSON output from the two aforementioned cases actually different? If so, how can I fix the problem?
If there is no difference, is there any way to speed this up and reduce the run time further, such as compressing the JSON I am returning?
You can use gzip compression to shrink your plain-text payload from megabytes down to kilobytes, or use the flask-compress library for that.
I'd also suggest using ujson to make the dumps() call faster.
import gzip
from flask import Flask, make_response
import ujson as json

app = Flask(__name__)

@app.route('/data.json')
def compress():
    compression_level = 5  # out of 9 max
    data = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    content = gzip.compress(json.dumps(data).encode('utf8'), compression_level)
    response = make_response(content)
    response.headers['Content-Length'] = len(content)
    response.headers['Content-Encoding'] = 'gzip'
    return response
Documentation:
https://docs.python.org/3/library/gzip.html
https://github.com/colour-science/flask-compress
https://pypi.org/project/ujson/
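If you go the flask-compress route mentioned above, a minimal sketch might look like the following (the /data.json route and the payload are placeholders, not your actual endpoint):

from flask import Flask, jsonify
from flask_compress import Compress

app = Flask(__name__)
Compress(app)  # gzip-compresses responses when the client sends Accept-Encoding: gzip

@app.route('/data.json')
def data():
    # placeholder payload standing in for the stored-procedure result
    return jsonify(results=[{"col1": "val1", "col2": "val2"},
                            {"col1": "val1", "col2": "val2"}])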
First of all, profile: if 90% of the time is being spent transferring across the network, then optimising processing speed is less useful than optimising transfer speed (for example, by compressing the response as wowkin recommended, though the web server may be configured to do this automatically if you are using one).
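As a rough way to see where the time goes, you could time the query and the serialisation separately and compare against the total request time Postman reports; the remainder is mostly network transfer. This is only a sketch (call_proc_timed is a made-up name, and connection is the SQLAlchemy connection from the question):

import json
import time

def call_proc_timed(sql, connection):
    t0 = time.perf_counter()
    rows = connection.execute(sql).fetchall()
    t1 = time.perf_counter()
    payload = json.dumps([dict(zip(row.keys(), row)) for row in rows])
    t2 = time.perf_counter()
    print('query: %.2fs, serialise: %.2fs, payload: %.1f MB'
          % (t1 - t0, t2 - t1, len(payload) / 1e6))
    return payload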
Assuming that constructing the JSON is slow, if you control the database code you could use its JSON capabilities to serialise the data, and avoid doing it at the Python layer. For example,
SELECT col1, col2
FROM tbl
WHERE col3 > 42
FOR JSON AUTO
would give you
[
{
"col1": "foo",
"col2": 1
},
{
"col1": "bar",
"col2": 2
},
...
]
Nested structures can be created too, as described in the docs.
If the requester only needs the data, return it as a download using Flask's send_file feature and avoid the cost of constructing an HTML response:
from io import BytesIO
from flask import send_file
def call_proc(sql: str, connection):
    try:
        rv = connection.execute(sql)
        json_data = rv.fetchone()[0]
        # BytesIO expects encoded data; if you can get the server to encode
        # the data instead it may be faster.
        encoded_json = json_data.encode('utf-8')
        buf = BytesIO(encoded_json)
        return send_file(buf, mimetype='application/json', as_attachment=True, conditional=True)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()
You need to implement pagination in your API. 19 MB is absurdly large and will lead to some very annoyed users.
gzip and cleverness with the JSON responses will sadly not be enough; you'll need to put in a bit more legwork.
Luckily, there are many pagination questions and answers out there, and Flask's modular approach means someone has probably written a module applicable to your problem. I'd start off by re-implementing the method with an ORM; I hear that SQLAlchemy is quite good.
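As a starting point, here is a hedged sketch of plain offset/limit pagination on top of the existing raw-SQL approach; the page/per_page parameter names, the table and column names, and the T-SQL OFFSET/FETCH clause are all illustrative assumptions, and app/connection are assumed to be the Flask app and database connection from the question:

from flask import request, jsonify

@app.route('/records')
def records():
    # page and per_page are validated as ints, so interpolating them is safe here;
    # bound parameters are still preferable in general.
    page = int(request.args.get('page', 1))
    per_page = min(int(request.args.get('per_page', 100)), 1000)
    offset = (page - 1) * per_page
    sql = ("SELECT col1, col2 FROM tbl ORDER BY col1 "
           "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, per_page))
    rows = connection.execute(sql).fetchall()
    return jsonify(results=[dict(zip(r.keys(), r)) for r in rows])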
To answer your question:
1 - Both JSONs are semantically identical.
You can use http://www.jsondiff.com to compare the two.
2 - I would recommend chunking your data and sending it across the network in pieces.
This might help:
https://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
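A minimal client-side sketch in the spirit of that article, downloading the large response in chunks with the requests library so the whole payload never sits in memory at once (the URL and filename are placeholders):

import requests

with requests.get('http://api.example.com/endpoint', stream=True) as r:
    r.raise_for_status()
    with open('response.json', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            f.write(chunk)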
TL;DR: Try restructuring your JSON payload (i.e., change the schema).
I see that you are constructing the JSON response in one of your APIs. Currently, your JSON payload looks something like:
[
{
"col0": "val00",
"col1": "val01"
},
{
"col0": "val10",
"col1": "val11"
}
...
]
I suggest you restructure it so that each (first-level) key in your JSON represents an entire column. So, for the above case, it will become something like:
{
"col0": ["val00", "val10", "val20", ...],
"col1": ["val01", "val11", "val21", ...]
}
Here are the results from some offline tests I performed.
Experiment variables:
NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5
#!/usr/bin/env python3
import json

NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5

def get_column_name(id_):
    return 'col%d' % id_

def random_data():
    import string
    import random
    return ''.join(random.choices(string.ascii_letters, k=LENGTH_OF_STR_DATA))

def get_row():
    return {
        get_column_name(i): random_data()
        for i in range(NUMBER_OF_COLUMNS)
    }

# data1 has the same schema as your JSON
data1 = [
    get_row() for _ in range(NUMBER_OF_ROWS)
]
with open("/var/tmp/1.json", "w") as f:
    json.dump(data1, f)

def get_column():
    return [random_data() for _ in range(NUMBER_OF_ROWS)]

# data2 has the new proposed schema, to help you reduce the size
data2 = {
    get_column_name(i): get_column()
    for i in range(NUMBER_OF_COLUMNS)
}
with open("/var/tmp/2.json", "w") as f:
    json.dump(data2, f)
Comparing sizes of the two JSONs:
$ du -h /var/tmp/1.json
17M
$ du -h /var/tmp/2.json
8.6M
In this case, the size was almost halved.
I would suggest you do the following:
First and foremost, profile your code to see the real culprit. If it is really the payload size, proceed further.
Try to change your JSON's schema (as suggested above).
Compress your payload before sending, either at your Flask WSGI app layer or at your webserver level (if you are running your Flask app behind a production-grade webserver like Apache or Nginx). A Flask-level sketch follows below.
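For the Flask-layer option, one way to do it without an extension is an after_request hook. This is only a sketch that assumes every response is safe to compress; flask-compress, mentioned in another answer, handles the edge cases for you:

import gzip
from flask import Flask, request

app = Flask(__name__)

@app.after_request
def gzip_response(response):
    # only compress if the client advertises gzip support
    if 'gzip' in request.headers.get('Accept-Encoding', ''):
        compressed = gzip.compress(response.get_data())
        response.set_data(compressed)
        response.headers['Content-Encoding'] = 'gzip'
        response.headers['Content-Length'] = str(len(compressed))
    return response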
For large data that you can't paginate, using something like NDJSON (or any delimited record format) can really reduce the server resources needed, since you avoid holding the whole JSON object in memory. You would need access to the response stream to write each object/line to the response, though.
The response
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
would end up looking like
{"col1":"val1","col2":"val2"}
{"col1":"val1","col2":"val2"}
This also has an advantage on the client, since it can parse and process each line on its own.
If you aren't dealing with nested data structures, responding with CSV will be even smaller.
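A hedged sketch of streaming NDJSON from Flask, where rv stands in for the database cursor from the question and the mimetype shown is one common choice for NDJSON:

import json
from flask import Response

def to_ndjson(rv):
    # one JSON object per line, written as rows are read from the cursor
    for row in rv:
        yield json.dumps(dict(zip(row.keys(), row))) + '\n'

# in the view:
# return Response(to_ndjson(rv), mimetype='application/x-ndjson')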
I want to note that there is a standard way to write a sequence of separate records in JSON, and it's described in RFC 7464. For each record:
Write the record separator byte (0x1E).
Write the JSON record, which is a regular JSON document that can also contain inner line breaks, in UTF-8.
Write the line feed byte (0x0A).
(Note that the JSON text sequence format, as it's called, uses a more liberal syntax for parsing text sequences of this kind; see the RFC for details.)
In your example, the JSON text sequence would look as follows, where \x1E and \x0A are the record separator and line feed bytes, respectively:
\x1E{"col1":"val1","col2":"val2"}\x0A\x1E{"col1":"val1","col2":"val2"}\x0A
Since the JSON text sequence format allows inner line breaks, you can write each JSON record as you naturally would, as in the following example:
\x1E{
"col1":"val1",
"col2":"val2"}
\x0A\x1E{
"col1":"val1",
"col2":"val2"
}\x0A
Notice that the media type for JSON text sequences is not application/json, but application/json-seq; see the RFC.
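For illustration, a small Python sketch of emitting such a sequence (record separator 0x1E before each record, line feed 0x0A after it, per the RFC):

import json

RS, LF = '\x1e', '\x0a'

def to_json_seq(records):
    for rec in records:
        yield RS + json.dumps(rec) + LF

rows = [{"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}]
body = ''.join(to_json_seq(rows))
# serve with mimetype='application/json-seq'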
I am trying to figure out how to convert the following JSON object into a dataframe:
[
{
"id":"123",
"sources":[
{
"name":"ABC",
"first":"2020-02-26T03:19:23.247Z",
"last":"2020-02-26T03:19:23.247Z"
},
{
"name":"XYZ",
"first":"2020-02-26T03:19:23.247Z",
"last":"2020-02-26T03:19:23.247Z"
}
]
}
]
The dataframe should appear like this.
id ABC.first ABC.last XYZ.first XYZ.last
123 2020-02-26.. 2020-02-26.. 2020-02-26.. 2020-02-26T03:19:23.247Z
Thanks in advance for your help
Python includes the useful json module. You can find some documentation for it here.
To use it, you'll need to import it into your Python script with import json.
From there, you can use the json.loads() method and pass it some JSON. The loads() method returns a Python dictionary, which then lets you access the properties of your JSON object by name. Some example code might look like:
import json
# some JSON:
personJSON = '{"name":"John", "age":30, "city":"New York", "address": {"street": "Fake Street", "streetNumber": "123"}}'
# parse personJSON:
personDict = json.loads(personJSON)
# the result is a Python dictionary:
print(personDict["name"]) # Will print "John"
print(personDict["address"]["street"]) # Will print "Fake Street"
Good luck!
I have some JSON text grabbed from a website's API:
{"result":"true","product":{"made":{"Taiwan":"Taipei","HongKong":"KongStore","Area":"Asia"}}}
I want to capture "Taiwan" and "Taipei", but I always fail.
Here is my code:
import json
import urllib2

weather = urllib2.urlopen('url')
wjson = weather.read()
wjdata = json.loads(wjson)
print wjdata['product']['made'][0]['Taiwan']
I always get the following error:
KeyError: 0
What's the correct way to parse that JSON?
You are indexing into an array where there is none.
The JSON is the following:
{
"result":"true",
"product": {
"made": {
"Taiwan":"Taipei",
"HongKong":"KongStore",
"Area":"Asia"
}
}
}
And the above contains no arrays.
You are assuming the JSON structure to be something like this:
{
"result":"true",
"product": {
"made": [
{"Taiwan":"Taipei"},
{"HongKong":"KongStore"},
{"Area":"Asia"}
]
}
}
From a brief look at the doc pages for the json package, I found this conversion table: Conversion table using json.loads
It tells us that a JSON object translates to a dict. And a dict has a method called keys, which returns a list of the keys.
I suggest you try something like this:
# ... omitted code
objectKeys = wjdata['product']['made'].keys()
# You should now have a list of the keys stored in objectKeys.
for key in objectKeys:
    print key
    if key == 'Taiwan':
        print 'Eureka'
I haven't tested the above code, but I think you get the gist here :)
wjdata['product']['made']['Taiwan'] works
I am attempting to write out some JSON output into a CSV file, but first I am trying to understand how the data is structured. I am working from a sample script which connects to an API and pulls down data based on a specified query.
The json is returned from the server with this query:
response = api_client.get_search_results(search_id, 'application/json')
body = response.read().decode('utf-8')
body_json = json.loads(body)
If I perform a
print(body_json.keys())
I get the following output:
dict_keys(['events'])
So from this, is it right to assume that the entries I am really interested in are in another dictionary inside the events dictionary?
If so, how can I 'access' them?
Sample JSON data the search query returns to the variable above
{
"events":[
{
"source_ip":"10.143.223.172",
"dest_ip":"104.20.251.41",
"domain":"www.theregister.co.uk",
"Domain Path":"NULL",
"Domain Query":"NULL",
"Http Method":"GET",
"Protocol":"HTTP",
"Category":"NULL",
"FullURL":"http://www.theregister.co.uk"
},
{
"source_ip":"10.143.223.172",
"dest_ip":"104.20.251.41",
"domain":"www.theregister.co.uk",
"Domain Path":"/2017/05/25/windows_is_now_built_on_git/",
"Domain Query":"NULL",
"Http Method":"GET",
"Protocol":"HTTP",
"Category":"NULL",
"FullURL":"http://www.theregister.co.uk/2017/05/25/windows_is_now_built_on_git/"
}
]
}
Any help would be greatly appreciated.
json_data.keys() only returns the top-level keys of the parsed JSON.
Here is the code:
# json_data here is the parsed response (body_json in the question)
for key in json_data.keys():
    for i in range(len(json_data[key])):
        key2 = json_data[key][i].keys()
        for k in key2:
            print k + ":" + json_data[key][i][k]
Output:
Http Method:GET
Category:NULL
domain:www.theregister.co.uk
Protocol:HTTP
Domain Query:NULL
Domain Path:NULL
source_ip:10.143.223.172
FullURL:http://www.theregister.co.uk
dest_ip:104.20.251.41
Http Method:GET
Category:NULL
domain:www.theregister.co.uk
Protocol:HTTP
Domain Query:NULL
Domain Path:/2017/05/25/windows_is_now_built_on_git/
source_ip:10.143.223.172
FullURL:http://www.theregister.co.uk/2017/05/25/windows_is_now_built_on_git/
dest_ip:104.20.251.41
To answer your question: yes. Your body_json is a dictionary with an "events" key whose value is a list of dictionaries.
The best way to 'access' them would be to iterate over them.
A very rudimentary example:
for i in body_json['events']:
    print(i)
Of course, during the iteration you could access the specific data you need by replacing print(i) with print(i['FullURL']), saving it to a variable, and so on.
It's important to note that whenever you're working with APIs that return a JSON response, you're simply working with dictionaries and Python data structures.
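And since the original goal was a CSV file, a minimal sketch with csv.DictWriter over the 'events' list (the output filename is a placeholder, and the field names are taken from the first event in the sample payload):

import csv

events = body_json['events']
with open('events.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(events[0].keys()))
    writer.writeheader()
    writer.writerows(events)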
Best of luck.
I'm attempting to create a web service using MongoDB and Flask (using the pymongo driver). A query to the database returns documents with the "_id" field included, of course. I don't want to send this to the client, so how do I remove it?
Here's a Flask route:
@app.route('/theobjects')
def index():
    objects = db.collection.find()
    return str(json.dumps({'results': list(objects)},
                          default=json_util.default,
                          indent=4))
This returns:
{
"results": [
{
"whatever": {
"field1": "value",
"field2": "value",
},
"whatever2": {
"field3": "value"
},
...
"_id": {
"$oid": "..."
},
...
}
]}
I thought it was a dictionary and I could just delete the element before returning it:
del objects['_id']
But that returns a TypeError:
TypeError: 'Cursor' object does not support item deletion
So it isn't a dictionary, but something I have to iterate over, with each result being a dictionary. So I try to do that with this code:
for object in objects:
    del object['_id']
Each object dictionary looks the way I'd like it to now, but the objects cursor is empty. So I try to create a new dictionary and after deleting _id from each, add to a new dictionary that Flask will return:
new_object = {}
for object in objects:
    for key, item in object.items():
        if key == '_id':
            del object['_id']
    new_object.update(object)
This just returns a dictionary with the first-level keys and nothing else.
So this is sort of a standard nested dictionaries problem, but I'm also shocked that MongoDB doesn't have a way to easily deal with this.
The MongoDB documentation explains that you can exclude _id with
{ _id : 0 }
But that does nothing with pymongo. The PyMongo documentation explains that you can list the fields you want returned, but "_id will always be included". Seriously? Is there no way around this? Is there something simple and stupid that I'm overlooking here?
To exclude the _id field in a find query in pymongo, you can use:
db.collection.find({}, {'_id': False})
The documentation is somewhat misleading on this, as it says the _id field is always included, but you can exclude it as shown above.
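Applied to the route from the question, that might look like the following sketch (app, db, json, and json_util as in the question):

@app.route('/theobjects')
def index():
    objects = db.collection.find({}, {'_id': False})
    return str(json.dumps({'results': list(objects)},
                          default=json_util.default,
                          indent=4))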
The above answer fails if we want specific fields and still want to ignore _id. In such cases, pass the projection as the second argument:
db.collection.find({}, {'required_column_A': 1, 'required_col_B': 1, '_id': False})
You are calling
del objects['_id']
on the cursor object!
The cursor object is obviously an iterable over the result set, not a single document that you can manipulate.
for obj in objects:
    del obj['_id']
is likely what you want.
So your claim is completely wrong as the following code shows:
import pymongo

c = pymongo.Connection()
db = c['mydb']

db.foo.remove({})
db.foo.save({'foo': 42})
for row in db.foo.find():
    del row['_id']
    print row
$ bin/python foo.py
> {u'foo': 42}