JSON schema validator using Python

JSON schema validator using Python - python

I have written JSON validator in python using jsonschema module.
Its not validating the schema correctly. I am also using web based tool, http://jsonschemalint.com/ for validating.
I wanted to exactly have some thing similar. I am putting my code here please point out the things that I am mising.
from jsonschema import validate
import json
class jsonSchemaValidator(object):
def __init__(self, schema_file):
self.__schema_file = open(schema_file)
self.__json_schema_obj = json.load(self.__schema_file)
def validate(self, json_file):
json_data_obj = json.load(open(json_file))
try:
validate(json_data_obj, self.__json_schema_obj)
print 'The JSON is follows the schema'
except Exception, extraInfo:
print str(extraInfo)
data_file_path = 'C:\\Users\\LT-BPant\\Desktop\\Del\\Schema\\new schema\\sample_output\\'
schema_path = 'C:\\Users\\LT-BPant\\Desktop\\Del\\Schema\\new schema\\'
def main():
json_file = data_file_path + 'report.json'
schema = schema_path+ 'report_new.schema'
obj = jsonSchemaValidator(schema)
obj.validate(json_file)
main()
I have manually modified the json data but still I am getting JSON DATA follows the schema as oputput whereas the web based tool is correctly showing the difference.

Related

Import a JSON project wise, so it loads just once

I have a Python project that performs a JSON validation against a specific schema.
It will run as a Transform step in GCP Dataflow, so it's very important that all dependencies are gathered before the run to avoid downloading the same file again and again.
The schema is placed in a separated Git repository.
The nature of the Transformer is that you receive a single record in your class, and you work with it. The typical flow is that you load the JSON Schema, you validate the record against it, and then you do stuff with the invalid and with the valid. Loading the schema in this way means that I download the schema from the repo for every record, and it could be hundred thousands.
The code gets "cloned" into the workers and then work kinda independent.
Inspired by the way Python loads the requirements at the beginning (one single time) and using them as imports, I thought I could add the repository (where the JSON schema lives) as a Python requirement, and then simply use it in my Python code. But of course, it's a JSON, not a Python module to be imported. How can it work?
An example would be something like:
requirements.txt
git+git://github.com/path/to/json/schema#41b95ec
dataflow_transformer.py
import apache_beam as beam
import the_downloaded_schema
from jsonschema import validate
class Verifier(beam.DoFn):
def process(self, record: dict):
validate(instance=record, schema=the_downloaded_schema)
# ... more stuff
yield record
class Transformer(beam.PTransform):
def expand(self, record):
return (
record
| "Verify Schema" >> beam.ParDo(Verifier())
)

You can load the json schema once and use it as a side input.
An example:
import json
import requests
json_current='https://covidtracking.com/api/v1/states/current.json'
def get_json_schema(url):
with requests.Session() as session:
schema = json.loads(session.get(url).text)
return schema
schema_json = get_json_schema(json_current)
def feed_schema(data, schema):
yield {'record': data, 'schema': schema[0]}
schema = p | beam.Create([schema_json])
data = p | beam.Create(range(10))
data_with_schema = data | beam.FlatMap(feed_schema, schema=beam.pvalue.AsSingleton(schema))
# Now do your schema validation
Just a demonstration of what the data_with_schema pcollection looks like

Why don't you just use a class for loading your resources that uses a cache in order to prevent double loading? Something along the lines of:
class JsonLoader:
def __init__(self):
self.cache = set()
def import(self, filename):
filename = os.path.absname(filename)
if filename not in self.cache:
self._load_json(filename)
self.cache.add(filename)
def _load_json(self, filename):
...

JSON serialization Error in Python 3.2

I am using JSON library and trying to import a page feed to an CSV file. Tried many a ways to get the result however every time code execute it Gives JSON not serialzable. No Facebook use auth code which I have and used it so connection string will change however if you use a page which has public privacy you will still be able to get the result from below code.
following is the code
import urllib3
import json
import requests
#from pprint import pprint
import csv
from urllib.request import urlopen
page_id = "abcd" # username or id
api_endpoint = "https://graph.facebook.com"
fb_graph_url = api_endpoint+"/"+page_id
try:
#api_request = urllib3.Requests(fb_graph_url)
#http = urllib3.PoolManager()
#api_response = http.request('GET', fb_graph_url)
api_response = requests.get(fb_graph_url)
try:
#print (list.sort(json.loads(api_response.read())))
obj = open('data', 'w')
# write(json_dat)
f = api_response.content
obj.write(json.dumps(f))
obj.close()
except Exception as ee:
print(ee)
except Exception as e:
print( e)
Tried many approach but not successful. hope some one can help

api_response.content is the text content of the API, not a Python object so you won't be able to dump it.
Try either:
f = api_response.content
obj.write(f)
Or
f = api_response.json()
obj.write(json.dumps(f))

requests.get(fb_graph_url).content
is probably a string. Using json.dumps on it won't work. This function expects a list or a dictionary as the argument.
If the request already returns JSON, just write it to the file.

Google App Engine Datastore Query to JSON with Python

How can I get a JSON Object in python from getting data via Google App Engine Datastore?
I've got model in datastore with following field:
id
key_name
object
userid
created
Now I want to get all objects for one user:
query = Model.all().filter('userid', user.user_id())
How can I create a JSON object from the query so that I can write it?
I want to get the data via AJAX call.

Not sure if you got the answer you were looking for, but did you mean how to parse the model (entry) data in the Query object directly into a JSON object? (At least that's what I've been searching for).
I wrote this to parse the entries from Query object into a list of JSON objects:
def gql_json_parser(query_obj):
result = []
for entry in query_obj:
result.append(dict([(p, unicode(getattr(entry, p))) for p in entry.properties()]))
return result
You can have your app respond to AJAX requests by encoding it with simplejson e.g.:
query_data = MyModel.all()
json_query_data = gql_json_parser(query_data)
self.response.headers['Content-Type'] = 'application/json'
self.response.out.write(simplejson.dumps(json_query_data))
Your app will return something like this:
[{'property1': 'value1', 'property2': 'value2'}, ...]
Let me know if this helps!

If I understood you correctly I have implemented a system that works something like this. It sounds like you want to store an arbitrary JSON object in a GAE datastore model. To do this you need to encode the JSON into a string of some sort on the way into the database and decode it from a string into a python datastructure on the way out. You will need to use a JSON coder/decoder to do this. I think the GAE infrastructure includes one. For example you could use a "wrapper class" to handle the encoding/decoding. Something along these lines...
class InnerClass(db.Model):
jsonText = db.TextProperty()
def parse(self):
return OuterClass(self)
class Wrapper:
def __init__(self, storage=None):
self.storage = storage
self.json = None
if storage is not None:
self.json = fromJsonString(storage.jsonText)
def put(self):
jsonText = ToJsonString(self.json)
if self.storage is None:
self.storage = InnerClass()
self.storage.jsonText = jsonText
self.storage.put()
Then always operate on parsed wrapper objects instead of the inner class
def getall():
all = db.GqlQuery("SELECT * FROM InnerClass")
for x in all:
yield x.parse()
(untested). See datastoreview.py for some model implementations that work like this.

I did the following to convert the google query object to json. I used the logic in jql_json_parser above as well except for the part where everything is converted to unicode. I want to preserve the data-types like integer, floats and null.
import json
class JSONEncoder(json.JSONEncoder):
def default(self, obj):
if hasattr(obj, 'isoformat'): #handles both date and datetime objects
return obj.isoformat()
else:
return json.JSONEncoder.default(self, obj)
class BaseResource(webapp2.RequestHandler):
def to_json(self, gql_object):
result = []
for item in gql_object:
result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
return json.dumps(result, cls=JSONEncoder)
Now you can subclass BaseResource and call self.to_json on the gql_object

How to find user id from session_data from django_session table?

In django_session table session_data is stored which is first pickled using pickle module of Python and then encoded in base64 by using base64 module of Python.
I got the decoded pickled session_data.
session_data from django_session table:
gAJ9cQEoVQ9fc2Vzc2lvbl9leHBpcnlxAksAVRJfYXV0aF91c2VyX2JhY2tlbmRxA1UpZGphbmdvLmNvbnRyaWIuYXV0aC5iYWNrZW5kcy5Nb2RlbEJhY2tlbmRxBFUNX2F1dGhfdXNlcl9pZHEFigECdS5iZmUwOWExOWI0YTZkN2M0NDc2MWVjZjQ5ZDU0YjNhZA==
after decoding it by base64.decode(session_data):
\x80\x02}q\x01(U\x0f_session_expiryq\x02K\x00U\x12_auth_user_backendq\x03U)django.contrib.auth.backends.ModelBackendq\x04U\r_auth_user_idq\x05\x8a\x01\x02u.bfe09a19b4a6d7c44761ecf49d54b3ad
I want to find out the value of auth_user_id from auth_user_idq\x05\x8a\x01\x02u.

I had trouble with Paulo's method (see my comment on his answer), so I ended up using this method from a scottbarnham.com blog post:
from django.contrib.sessions.models import Session
from django.contrib.auth.models import User
session_key = '8cae76c505f15432b48c8292a7dd0e54'
session = Session.objects.get(session_key=session_key)
uid = session.get_decoded().get('_auth_user_id')
user = User.objects.get(pk=uid)
print user.username, user.get_full_name(), user.email

NOTE: format changed since original answer, for 1.4 and above see the update below
import pickle
data = pickle.loads(base64.decode(session_data))
>>> print data
{'_auth_user_id': 2L, '_auth_user_backend': 'django.contrib.auth.backends.ModelBackend',
'_session_expiry': 0}
[update]
My base64.decode requires filename arguments, so then I tried base64.b64decode, but this returned "IndexError: list assignment index out of range".
I really don't know why I used the base64 module, I guess because the question featured it.
You can just use the str.decode method:
>>> pickle.loads(session_data.decode('base64'))
{'_auth_user_id': 2L, '_auth_user_backend': 'django.contrib.auth.backends.ModelBackend',
'_session_expiry': 0}
I found a work-around (see answer below), but I am curious why this doesn't work.
Loading pickled data from user sources (cookies) is a security risk, so the session_data format was changed since this question was answered (I should go after the specific issue in Django's bug tracker and link it here, but my pomodoro break is gone).
The format now (since Django 1.4) is "hash:json-object" where the first 40 byte hash is a crypto-signature and the rest is a JSON payload. For now you can ignore the hash (it allows checking if the data was not tampered by some cookie hacker).
>>> json.loads(session_data.decode('base64')[41:])
{u'_auth_user_backend': u'django.contrib.auth.backends.ModelBackend',
u'_auth_user_id': 1}

If you want to learn more about it and know how does encode or decode work, there are some relevant code.
By the way the version of Django that i use is 1.9.4.
django/contrib/sessions/backends/base.py
class SessionBase(object):
def _hash(self, value):
key_salt = "django.contrib.sessions" + self.__class__.__name__
return salted_hmac(key_salt, value).hexdigest()
def encode(self, session_dict):
"Returns the given session dictionary serialized and encoded as a string."
serialized = self.serializer().dumps(session_dict)
hash = self._hash(serialized)
return base64.b64encode(hash.encode() + b":" + serialized).decode('ascii')
def decode(self, session_data):
encoded_data = base64.b64decode(force_bytes(session_data))
try:
# could produce ValueError if there is no ':'
hash, serialized = encoded_data.split(b':', 1)
expected_hash = self._hash(serialized)
if not constant_time_compare(hash.decode(), expected_hash):
raise SuspiciousSession("Session data corrupted")
else:
return self.serializer().loads(serialized)
except Exception as e:
# ValueError, SuspiciousOperation, unpickling exceptions. If any of
# these happen, just return an empty dictionary (an empty session).
if isinstance(e, SuspiciousOperation):
logger = logging.getLogger('django.security.%s' %
e.__class__.__name__)
logger.warning(force_text(e))
return {}
django/contrib/sessions/serializer.py
class JSONSerializer(object):
"""
Simple wrapper around json to be used in signing.dumps and
signing.loads.
"""
def dumps(self, obj):
return json.dumps(obj, separators=(',', ':')).encode('latin-1')
def loads(self, data):
return json.loads(data.decode('latin-1'))
Let's focus on SessionBase's encode function.
Serialize the session dictionary to a json
create a hash salt
add the salt to serialized session , base64 the concatenation
So, decode is inverse.
We can simplify the decode function in the following code.
import json
import base64
session_data = 'YTUyYzY1MjUxNzE4MzMxZjNjODFiNjZmZmZmMzhhNmM2NWQzMTllMTp7ImNvdW50Ijo0fQ=='
encoded_data = base64.b64decode(session_data)
hash, serialized = encoded_data.split(b':', 1)
json.loads(serialized.decode('latin-1'))
And that what session.get_decoded() did.

from django.conf import settings
from django.contrib.auth.models import User
from django.utils.importlib import import_module
def get_user_from_sid(session_key):
django_session_engine = import_module(settings.SESSION_ENGINE)
session = django_session_engine.SessionStore(session_key)
uid = session.get('_auth_user_id')
return User.objects.get(id=uid)

I wanted to do this in pure Python with the latest version of DJango (2.05). This is what I did:
>>> import base64
>>> x = base64.b64decode('OWNkOGQxYjg4NzlkN2ZhOTc2NmU1ODY0NWMzZmQ4YjdhMzM4OTJhNjp7Im51bV92aXNpdHMiOjJ9')
>>> print(x)
b'9cd8d1b8879d7fa9766e58645c3fd8b7a33892a6:{"num_visits":2}'
>>> import json
>>> data = json.loads(x[41:])
>>> print(data)
{'num_visits': 2}

I just had to solve something like this on a Django install. I knew the ID (36) of the user and wanted to delete the session data for that specific user. I wanted to put this code out as a prototype to build from for finding a user in session data:
from django.contrib.sessions.models import Session
TARGET_USER = 36 # edit this to match target user.
TARGET_USER = str(TARGET_USER) # type found to be a string
for session in Session.objects.all():
raw_session= session.get_decoded()
uid = session.get_decoded().get('_auth_user_id')
if uid == TARGET_USER: # this could be a list also if multiple users
print(session)
# session.delete() # uncomment to delete session data associated with the user
Hope this helps anyone out there.

Extracting data from a URL result with special formatting

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm
where modifying the inputs, limit and query, will generate wanted data. Limit is the max number of term possible and query is the seed term.
The URL provides text result formatted in this way:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});
I'd like to write a python script that takes two arguments (limit number and the query seed), go fetch the data online, parse the result and return a list with the new terms ['newterm1','newterm2'] in this case.
I'd love some help, especially with the URL fetching since I have never done this before.

It sounds like you can break this problem up into several subproblems.
Subproblems
There are a handful of problems that need to be solved before composing the completed script:
Forming the request URL: Creating a configured request URL from a template
Retrieving data: Actually making the request
Unwrapping JSONP: The returned data appears to be JSON wrapped in a JavaScript function call
Traversing the object graph: Navigating through the result to find the desired bits of information
Forming the request URL
This is just simple string formatting.
url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
Python 2 Note
You will need to use the string formatting operator (%) here.
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
Retrieving data
You can use the built-in urllib.request module for this.
import urllib.request
data = urllib.request.urlopen(url) # url from previous section
This returns a file-like object called data. You can also use a with-statement here:
with urllib.request.urlopen(url) as data:
# do processing here
Python 2 Note
Import urllib2 instead of urllib.request.
Unwrapping JSONP
The result you pasted looks like JSONP. Given that the wrapping function that is called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this method call out.
result = data.read()
prefix = 'oo.visualization.Query.setResponse('
suffix = ');'
if result.startswith(prefix) and result.endswith(suffix):
result = result[len(prefix):-len(suffix)]
Parsing JSON
The resulting result string is just JSON data. Parse it with the built-in json module.
import json
result_object = json.loads(result)
Traversing the object graph
Now, you have a result_object that represents the JSON response. The object itself be a dict with keys like version, reqId, and so on. Based on your question, here is what you would need to do to create your list.
# Get the rows in the table, then get the second column's value for
# each row
terms = [row['c'][2]['v'] for row in result_object['table']['rows']]
Putting it all together
#!/usr/bin/env python3
"""A script for retrieving and parsing results from requests to
somewhere.com.
This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""
import urllib.request
import json
import sys
E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2
def parse_result(result):
"""Parse a JSONP result string and return a list of terms"""
prefix = 'oo.visualization.Query.setResponse('
suffix = ');'
# Strip JSONP function wrapper
if result.startswith(prefix) and result.endswith(suffix):
result = result[len(prefix):-len(suffix)]
# Deserialize JSON to Python objects
result_object = json.loads(result)
# Get the rows in the table, then get the second column's value
# for each row
return [row['c'][2]['v'] for row in result_object['table']['rows']]
def retrieve_terms(limit, seedterm):
"""Retrieves and parses data and returns a list of terms"""
url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=limit, seedterm=seedterm)
try:
with urllib.request.urlopen(url) as data:
data = perform_request(limit, seedterm)
result = data.read()
except:
print('Could not request data from server', file=sys.stderr)
exit(E_OPERATION_ERROR)
terms = parse_result(result)
print(terms)
def main(limit, seedterm):
"""Retrieves and parses data and prints each term to standard output"""
terms = retrieve_terms(limit, seedterm)
for term in terms:
print(term)
if __name__ == '__main__'
try:
limit = int(sys.argv[1])
seedterm = sys.argv[2]
except:
error_message = '''{} limit seedterm
limit must be an integer'''.format(sys.argv[0])
print(error_message, file=sys.stderr)
exit(2)
exit(main(limit, seedterm))
Python 2.7 version
#!/usr/bin/env python2.7
"""A script for retrieving and parsing results from requests to
somewhere.com.
This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""
import urllib2
import json
import sys
E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2
def parse_result(result):
"""Parse a JSONP result string and return a list of terms"""
prefix = 'oo.visualization.Query.setResponse('
suffix = ');'
# Strip JSONP function wrapper
if result.startswith(prefix) and result.endswith(suffix):
result = result[len(prefix):-len(suffix)]
# Deserialize JSON to Python objects
result_object = json.loads(result)
# Get the rows in the table, then get the second column's value
# for each row
return [row['c'][2]['v'] for row in result_object['table']['rows']]
def retrieve_terms(limit, seedterm):
"""Retrieves and parses data and returns a list of terms"""
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
try:
with urllib2.urlopen(url) as data:
data = perform_request(limit, seedterm)
result = data.read()
except:
sys.stderr.write('%s\n' % 'Could not request data from server')
exit(E_OPERATION_ERROR)
terms = parse_result(result)
print terms
def main(limit, seedterm):
"""Retrieves and parses data and prints each term to standard output"""
terms = retrieve_terms(limit, seedterm)
for term in terms:
print term
if __name__ == '__main__'
try:
limit = int(sys.argv[1])
seedterm = sys.argv[2]
except:
error_message = '''{} limit seedterm
limit must be an integer'''.format(sys.argv[0])
sys.stderr.write('%s\n' % error_message)
exit(2)
exit(main(limit, seedterm))

i didn't understand well your problem because from your code there it seem to me that you use Visualization API (it's the first time that i hear about it by the way).
But well if you are just searching for a way to fetch data from a web page you could use urllib2 this is just for getting data, and if you want to parse the retrieved data you will have to use a more appropriate library like BeautifulSoop
if you are dealing with another web service (RSS, Atom, RPC) rather than web pages you can find a bunch of python library that you can use and that deal with each service perfectly.
import urllib2
from BeautifulSoup import BeautifulSoup
result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmletxt = resul.read()
result.close()
soup = BeautifulSoup(htmltext, convertEntities="html" )
# you can parse your data now check BeautifulSoup API.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

JSON schema validator using Python - python

Related

Import a JSON project wise, so it loads just once

JSON serialization Error in Python 3.2

Google App Engine Datastore Query to JSON with Python

How to find user id from session_data from django_session table?

Extracting data from a URL result with special formatting

Categories

Resources