I have a Python project that performs a JSON validation against a specific schema.
It will run as a Transform step in GCP Dataflow, so it's very important that all dependencies are gathered before the run to avoid downloading the same file again and again.
The schema lives in a separate Git repository.
The nature of the Transformer is that your class receives a single record and works with it. The typical flow is to load the JSON Schema, validate the record against it, and then handle the valid and the invalid records. Loading the schema this way means that I download it from the repo for every record, and there could be hundreds of thousands of them.
The code gets "cloned" onto the workers, which then work more or less independently.
Inspired by the way Python loads the requirements once at the beginning and then uses them as imports, I thought I could add the repository (where the JSON schema lives) as a Python requirement, and then simply use it in my Python code. But of course, it's a JSON file, not a Python module to be imported. How can this work?
An example would be something like:
requirements.txt
git+git://github.com/path/to/json/schema@41b95ec
dataflow_transformer.py
import apache_beam as beam
import the_downloaded_schema
from jsonschema import validate

class Verifier(beam.DoFn):
    def process(self, record: dict):
        validate(instance=record, schema=the_downloaded_schema)
        # ... more stuff
        yield record

class Transformer(beam.PTransform):
    def expand(self, record):
        return (
            record
            | "Verify Schema" >> beam.ParDo(Verifier())
        )
You can load the json schema once and use it as a side input.
An example:
import json
import requests
import apache_beam as beam

json_current='https://covidtracking.com/api/v1/states/current.json'

def get_json_schema(url):
    # Download and parse the JSON document once, outside the pipeline
    with requests.Session() as session:
        schema = json.loads(session.get(url).text)
    return schema

schema_json = get_json_schema(json_current)

def feed_schema(data, schema):
    # The side input arrives as an ordinary Python value
    yield {'record': data, 'schema': schema[0]}

p = beam.Pipeline()  # assumed: the pipeline object was not shown in the snippet
schema = p | 'Schema' >> beam.Create([schema_json])
data = p | 'Data' >> beam.Create(range(10))
data_with_schema = data | beam.FlatMap(feed_schema, schema=beam.pvalue.AsSingleton(schema))
# Now do your schema validation
Just a demonstration of what the data_with_schema pcollection looks like
Why don't you just use a class for loading your resources that uses a cache in order to prevent double loading? Something along the lines of:
import os

class JsonLoader:
    def __init__(self):
        self.cache = set()

    def load(self, filename):
        filename = os.path.abspath(filename)
        if filename not in self.cache:
            self._load_json(filename)
            self.cache.add(filename)

    def _load_json(self, filename):
        ...
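A variant of the caching idea above, sketched with the standard library's functools.lru_cache so that the parsed schema is kept in memory after the first read. The function name load_schema and the throwaway schema file are assumptions made for illustration; they are not part of the original answer.

```python
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=None)
def load_schema(path):
    """Read and parse a JSON file; repeated calls with the same path hit the cache."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with a throwaway schema file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"type": "object", "required": ["id"]}, tmp)
    schema_path = os.path.abspath(tmp.name)

first = load_schema(schema_path)
second = load_schema(schema_path)  # served from the cache, no file access
cache_hits = load_schema.cache_info().hits
os.remove(schema_path)
```

Inside a Beam DoFn, a call like this could sit in process() (or in setup()), so that each worker process would read the file only once. That depends on the schema file being shipped to the workers, which this sketch does not cover.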
I am using the sample program from the Snowflake documentation on using Python to ingest data into a destination table.
So basically, I have to execute a put command to load the data into the internal stage and then run the Python program to notify Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;
create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
    select
        t.*
    from
        @exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
put command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging
logging.basicConfig(
    filename='/tmp/ingest.log',
    level=logging.DEBUG)
logger = getLogger(__name__)

# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
    return '<private_key_passphrase>'

with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
    pemlines = pem_in.read()
    private_key_obj = load_pem_private_key(pemlines,
                                           get_private_key_passphrase().encode(),
                                           default_backend())
private_key_text = private_key_obj.private_bytes(
    Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')
# Assume the public key has been registered in Snowflake:
# private key in PEM format
# List of files in the stage specified in the pipe definition
file_list=['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
                                     host='<account_identifier>.snowflakecomputing.com',
                                     user='<user_login_name>',
                                     pipe='exampledb.dbschema.example_pipe',
                                     private_key=private_key_text)
# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
    staged_file_list.append(StagedFile(file_name, None))
try:
    resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
    # HTTP error, may need to retry
    logger.error(e)
    exit(1)
# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')
# Needs to wait for a while to get result in history
while True:
    history_resp = ingest_manager.get_history()
    if len(history_resp['files']) > 0:
        print('Ingest Report:\n')
        print(history_resp)
        break
    else:
        # wait for 20 seconds
        time.sleep(20)
hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')
print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file from the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate the table and run the code again, the table in Snowflake stays empty.
Am I missing something here? How can I make this code run multiple times?
Update:
I found that if I use a file with a different name on each run, the data does load into the Snowflake table.
So how can I run this code without changing the data file name?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information, see the Snowflake documentation on Snowpipe load history.
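Since the load history is keyed by file name, one workaround consistent with the update in the question is to stage each run's file under a fresh name. The following is a minimal sketch using only the standard library. The helper name unique_copy and the timestamp-plus-random-suffix naming scheme are assumptions made for illustration, not a Snowflake API.

```python
import os
import shutil
import tempfile
import uuid
from datetime import datetime, timezone


def unique_copy(src_path, dest_dir):
    """Copy src_path into dest_dir under a name that differs on every call."""
    stem, ext = os.path.splitext(os.path.basename(src_path))
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    new_name = "{}_{}_{}{}".format(stem, stamp, uuid.uuid4().hex[:8], ext)
    dest_path = os.path.join(dest_dir, new_name)
    shutil.copyfile(src_path, dest_path)
    return dest_path


# Demonstration with a throwaway CSV file.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "a.csv")
with open(source, "w") as handle:
    handle.write("id,name\n1,example\n")

copy_one = unique_copy(source, work_dir)
copy_two = unique_copy(source, work_dir)
names = (os.path.basename(copy_one), os.path.basename(copy_two))
shutil.rmtree(work_dir)
```

The path returned by unique_copy would then be the file given to the put command, and the staged name (with .gz appended if the file is compressed on upload, as the a.csv.gz entry in the question suggests) would be the one passed to StagedFile.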
Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load yaml definitions, but they are split up in multiple files. The user-editable part contains eg.
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
The defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
    # Default values for user settings
    fileCountQuota: 1000
    diskSizeQuota: "300g"
I can process these together by loading both, concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (e.g. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. So far I haven't found a way to write just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you
don't have to use load_all, as you don't concatenate two documents into a
multiple-document stream. If you had done so, by using a format string with a document-end
marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"),
your aliases would not carry over from one document to another, as per the
YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml

def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data

class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd

def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)

md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML based API requires a class instead of an object, you need to
use a class generator, that actually adds the data elements to pre-populate on
the fly from within yaml.load().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of users.yaml at the root level of the dump of the combined updated
file and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)
# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end of line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break
merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but
Python2 at the moment and cannot compile/use a newer version. But what
you really should do is install virtualenv (can be done using EPEL, but also without
further "polluting" your system installation), make a virtualenv for the
code you are developing and install the latest version of ruamel.yaml (and
your other libraries) in there. You can also do that if you need
to distribute your software to other systems, just install virtualenv there as well.
I have all my utilities under /opt/util, managed with virtualenvutils, a
wrapper around virtualenv.
To write the user part, you will have to manually split the output of yaml.dump() for the combined data and write the appropriate part back to the users YAML file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')
data = None
with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
data = list(yaml.load_all(raw))
data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()
with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part here as this is the contents of users.yaml
I need to test a function with different parameters, and the most proper way for this seems to be using the with self.subTest(...) context manager.
However, the function writes something to the db, and it ends up in an inconsistent state. I can delete the things I write, but it would be cleaner if I could recreate the whole db completely. Is there a way to do that?
Not sure how to recreate the database in self.subTest() but I have another technique I am currently using which might be of interest to you. You can use fixtures to create a "snapshot" of your database which will basically be copied in a second database used only for testing purposes. I currently use this method to test code on a big project I'm working on at work.
I'll post some example code to give you an idea of what this will look like in practice, but you might have to do some extra research to tailor the code to your needs (I've added links to guide you).
The process is rather straightforward. You create a copy of your database containing only the data you need by using fixtures, which are stored in a .yaml file and accessed only by your unit tests.
Here is what the process would look like:
List the items you want to copy to your test database to populate it using fixtures. This creates a db with only the needed data instead of blindly copying the entire db. It will be stored in a .yaml file.
generate.py
import sys

import django
from django.core.management import call_command

django.setup()
stdout = sys.stdout
conf = [
    {
        'file': 'myfile.yaml',
        'models': [
            dict(model='your.model', pks='your, primary, keys'),
            dict(model='your.model', pks='your, primary, keys')
        ]
    }
]
for fixture in conf:
    print('Processing: %s' % fixture['file'])
    with open(fixture['file'], 'w') as f:
        # FixtureAnonymiser is a custom stdout wrapper, not shown here
        sys.stdout = FixtureAnonymiser(f)
        for model in fixture['models']:
            call_command('dumpdata', model.pop('model'), format='yaml', indent=4, **model)
        sys.stdout.flush()
        sys.stdout = stdout
In your unit test, import your generated .yaml file as a fixture and your test will automatically use the data from the fixture to carry out the tests, keeping your main database untouched.
test_class.py
from django.test import TestCase

class classTest(TestCase):
    fixtures = ('myfile.yaml',)

    def setUp(self):
        """setup tests cases"""
        # create the object you want to test here, which will use data from the fixtures

    def test_function(self):
        self.assertEqual(True,True)
        # write your test here
You can read up more here:
Django
YAML
If you have any questions because things are unclear just ask, I'd be happy to help you out.
Maybe my solution will help someone
I used transactions to roll back to the database state that I had at the start of the test.
I use Eric Cousineau's decorator function to parametrize the tests.
More about database transactions can be found on the Django documentation page.
import functools
from django.db import transaction
from django.test import TransactionTestCase
from django.contrib.auth import get_user_model

User = get_user_model()

def sub_test(param_list):
    """Decorates a test case to run it as a set of subtests."""
    def decorator(f):
        @functools.wraps(f)
        def wrapped(self):
            for param in param_list:
                with self.subTest(**param):
                    f(self, **param)
        return wrapped
    return decorator

class MyTestCase(TransactionTestCase):
    @sub_test([
        dict(email="new@user.com", password='12345678'),
        dict(email="new@user.com", password='password'),
    ])
    def test_passwords(self, email, password):
        # open a transaction
        with transaction.atomic():
            # Creates a new savepoint. Returns the savepoint ID (sid).
            sid = transaction.savepoint()
            # create a user and check that there is only one with this email in the DB
            user = User.objects.create(email=email, password=password)
            self.assertEqual(User.objects.filter(email=user.email).count(), 1)
            # Rolls back the transaction to savepoint sid.
            transaction.savepoint_rollback(sid)
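The same savepoint-per-subtest idea can be tried without a Django project. The sketch below swaps in the standard library's sqlite3 module for Django's ORM, so it only illustrates the rollback pattern, not Django's transaction API. The table layout and the savepoint name before_subtest are assumptions made for the example.

```python
import sqlite3
import unittest


class SavepointSubTests(unittest.TestCase):
    def setUp(self):
        # isolation_level=None puts sqlite3 in autocommit mode, so the
        # SAVEPOINT statements below are the only transaction control.
        self.conn = sqlite3.connect(":memory:", isolation_level=None)
        self.conn.execute("CREATE TABLE users (email TEXT, password TEXT)")

    def tearDown(self):
        self.conn.close()

    def count_users(self):
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_passwords(self):
        params = [
            dict(email="new@user.com", password="12345678"),
            dict(email="new@user.com", password="password"),
        ]
        for param in params:
            with self.subTest(**param):
                self.conn.execute("SAVEPOINT before_subtest")
                try:
                    self.conn.execute(
                        "INSERT INTO users VALUES (:email, :password)", param
                    )
                    self.assertEqual(self.count_users(), 1)
                finally:
                    # Undo everything the subtest wrote, even on failure.
                    self.conn.execute("ROLLBACK TO before_subtest")
                    self.conn.execute("RELEASE before_subtest")
        # Every subtest rolled back its insert, so the table is empty again.
        self.assertEqual(self.count_users(), 0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SavepointSubTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Putting the rollback in a finally block means a failing assertion in one subtest still leaves a clean table for the next one.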
I'm using TinyDB for a small CLI utility to manage personal document drafts. The database stores metadata for each draft; the file should be human-editable (so that I can add details manually), and for this reason I'd like to use YAML over JSON as the format.
I implemented a YamlStorage class subclassing storages.Storage as indicated in the TinyDB docs:
class TestYamlStorage(Storage):
    """
    Store the data in a YAML file.
    Written following the example at http://tinydb.readthedocs.io/en/latest/extend.html#write-a-custom-storage
    """
    def __init__(self, filename):  # (1)
        super().__init__()
        self.filename = filename
        touch(filename)

    def read(self):
        with open(self.filename) as handle:
            try:
                data = yaml.load(handle.read())
                return data
            except yaml.YAMLError:
                return None  # (3)

    def write(self, data):
        print('writing data: {}'.format(data))
        with open(self.filename, 'w') as handle:
            yaml.dump(data, handle)

    def close(self):  # (4)
        pass
Everything works fine when inserting only one element, or multiple elements at the same time using insert_multiple:
db = TinyDB('db.yaml', storage=TestYamlStorage)
dicts = [
    dict(name='Homer', age=38),
    dict(name='Marge', age=34),
    dict(name='Bart', age=10)
]
# this works as expected
db.insert_multiple(dicts)
The resulting db.yaml:
_default:
  1: {age: 38, name: Homer}
  2: {age: 34, name: Marge}
  3: {age: 10, name: Bart}
However, when inserting elements multiple times with insert, the resulting YAML file is different:
db = TinyDB('db.yaml', storage=TestYamlStorage)
db.insert(dict(name='Homer', age=38))
db.insert(dict(name='Bart', age=10))
db.yaml:
_default:
  1: !!python/object/new:tinydb.database.Element
    dictitems: {age: 38, name: Homer}
    state: {eid: 1}
  2: {age: 10, name: Bart}
The data in this format (apart from looking messier) seems to be not compatible with yaml.safe_load (calling db.all() returns []). My interpretation is that the YAML serialization process is in some way "over-eager", i.e. that the Element instance gets written to db.yaml instead of the underlying data.
Is there something wrong with my code? I've tried to fiddle with PyYAML options, using a different YAML module (ruamel.yaml), and create a second YamlStorage class copying from the default JSONStorage, but without any difference.
Version info: Python 3.4.3, TinyDB 3.2.0, PyYAML 3.11. I posted a runnable MWE with all imports here.
Edit
After @Anthon's suggestion, I tried printing the YAML output to sys.stdout immediately before dumping to file. The problem is reproduced in this case as well. See notebook.
When you update an existing "database" you retrieve a database.Element which includes (as you can see in the second YAML file) state information.
When that is saved again, you are not saving a dict but an instance of this Element, which is a subclass of dict, and for that ruamel.yaml (and PyYAML) needs to store both the dictitems (the key-value pairs of the dict) and the state (a dictionary representing the attributes and their values).
Converting your Element to a dict explicitly before writing should do the trick:
def write(self, data):
    print('writing data: {}'.format(data))
    with open(self.filename, 'w') as handle:
        yaml.dump(dict(data), handle)
        #         ^^^^^    ^
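Note that dict(data) converts only the outermost mapping. If the Element instances sit one level down, as they do under _default in the YAML shown in the question, a recursive conversion may be needed. A minimal sketch of that idea follows. The Element class here is a stand-in written for the example, not TinyDB's own class, and to_plain is a hypothetical helper name.

```python
class Element(dict):
    """Stand-in for a dict subclass that carries extra state (hypothetical)."""

    def __init__(self, value=None, eid=None):
        super().__init__(value or {})
        self.eid = eid


def to_plain(obj):
    """Recursively replace dict subclasses and nested sequences with builtins."""
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    return obj


data = {
    "_default": {
        1: Element({"age": 38, "name": "Homer"}, eid=1),
        2: {"age": 10, "name": "Bart"},
    }
}
plain = to_plain(data)
```

Passing to_plain(data) instead of dict(data) to yaml.dump() would then hand the serializer only builtin dicts, so no object state would need to be written out.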
I have written a JSON validator in Python using the jsonschema module.
It's not validating against the schema correctly. I am also using a web-based tool, http://jsonschemalint.com/, for validation.
I want my validator to behave exactly like it. I am putting my code here; please point out what I am missing.
from jsonschema import validate
import json

class jsonSchemaValidator(object):
    def __init__(self, schema_file):
        self.__schema_file = open(schema_file)
        self.__json_schema_obj = json.load(self.__schema_file)

    def validate(self, json_file):
        json_data_obj = json.load(open(json_file))
        try:
            validate(json_data_obj, self.__json_schema_obj)
            print 'The JSON is follows the schema'
        except Exception, extraInfo:
            print str(extraInfo)

data_file_path = 'C:\\Users\\LT-BPant\\Desktop\\Del\\Schema\\new schema\\sample_output\\'
schema_path = 'C:\\Users\\LT-BPant\\Desktop\\Del\\Schema\\new schema\\'

def main():
    json_file = data_file_path + 'report.json'
    schema = schema_path+ 'report_new.schema'
    obj = jsonSchemaValidator(schema)
    obj.validate(json_file)

main()
I have manually modified the JSON data, but I am still getting "JSON DATA follows the schema" as output, whereas the web-based tool correctly shows the difference.