Parsing Multiple AVRO (avsc files) which refer each other using python (fastavro) - python

I have a AVRO schema which is currently in single avsc file like below. Now I want to move address record to a different common avsc file which should be referenced from many other avsc file. So Customer and address will be separate avsc files. How can I separate them and and have customer avsc file reference address avsc file. Also how would both the files can be processed using python. I am currently using fast avro in python3 to process the single avsc file but open to use any other utility in python3 or pyspark.
File name - customer_details.avsc
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
import fastavro
s1 = fastavro.schema.load_schema('customer_details.avsc')
How can split the schema in different file where address record file can be referenced from other avsc file. Then how would I process multiple avsc files using fast Avro (Python) or any other python utility?

To do this, the schema for the AddressRecord should be in a file called com.company.model.AddressRecord.avsc with the following contents:
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
}
The Customer schema doesn't necessarily need a special naming convention since it is the top level schema, but it's probably a good idea to follow the same convention. So it would be in a file named com.company.model.Customer.avsc with the following contents:
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
The files must be in the same directory.
Then you should be able to do fastavro.schema.load_schema('com.company.model.Customer.avsc')

Related

Validating Avro schema that is referencing another schema

I am using the Python 3 avro_validator library.
The schema I want to validate references other schemas in sperate avro files. The files are in the same folder. How do I compile all the referenced schemas using the library?
Python code as follows:
from avro_validator.schema import Schema
schema_file = 'basketEvent.avsc'
schema = Schema(schema_file)
parsed_schema = schema.parse()
data_to_validate = {"test": "test"}
parsed_schema.validate(data_to_validate)
The error I get back:
ValueError: Error parsing the field [contentBasket]: The type [ContentBasket] is not recognized by Avro
And example Avro file(s) below:
basketEvent.avsc
{
"type": "record",
"name": "BasketEvent",
"doc": "Indicates that a user action has taken place with a basket",
"fields": [
{
"default": "basket",
"doc": "Restricts this event to having type = basket",
"name": "event",
"type": {
"name": "BasketEventType",
"symbols": ["basket"],
"type": "enum"
}
},
{
"default": "create",
"doc": "What is being done with the basket. Note: create / delete / update will always follow a product event",
"name": "action",
"type": {
"name": "BasketEventAction",
"symbols": ["create","delete","update","view"],
"type": "enum"
}
},
{
"default": "ContentBasket",
"doc": "The set of values that are specific to a Basket event",
"name": "contentBasket",
"type": "ContentBasket"
},
{
"default": "ProductDetail",
"doc": "The set of values that are specific to a Product event",
"name": "productDetail",
"type": "ProductDetail"
},
{
"default": "Timestamp",
"doc": "The time stamp for the event being sent",
"name": "timestamp",
"type": "Timestamp"
}
]
}
contentBasket.avsc
{
"name": "ContentBasket",
"type": "record",
"doc": "The set of values that are specific to a Basket event",
"fields": [
{
"default": [],
"doc": "A range of details about product / basket availability",
"name": "availability",
"type": {
"type": "array",
"items": "Availability"
}
},
{
"default": [],
"doc": "A range of care pland applicable to the basket",
"name": "carePlan",
"type": {
"type": "array",
"items": "CarePlan"
}
},
{
"default": "Category",
"name": "category",
"type": "Category"
},
{
"default": "",
"doc": "Unique identfier for this basket",
"name": "id",
"type": "string"
},
{
"default": "Price",
"doc": "Overall pricing info about the basket as a whole - individual product pricings will be dealt with at a product level",
"name": "price",
"type": "Price"
}
]
}
availability.avsc
{
"name": "Availability",
"type": "record",
"doc": "A range of values relating to the availability of a product",
"fields": [
{
"default": [],
"doc": "A list of offers associated with the overall basket - product level offers will be dealt with on an individual product basis",
"name": "shipping",
"type": {
"type": "array",
"items": "Shipping"
}
},
{
"default": "",
"doc": "The status of the product",
"name": "stockStatus",
"type": {
"name": "StockStatus",
"symbols": ["in stock","out of stock",""],
"type": "enum"
}
},
{
"default": "",
"doc": "The ID for the store when the stock can be collected, if relevant",
"name": "storeId",
"type": "string"
},
{
"default": "",
"doc": "The status of the product",
"name": "type",
"type": {
"name": "AvailabilityType",
"symbols": ["collection","shipping",""],
"type": "enum"
}
}
]
}
maxDate.avsc
{
"type": "record",
"name": "MaxDate",
"doc": "Indicates the timestamp for latest day a delivery should be made",
"fields": [
{
"default": "Timestamp",
"doc": "The time stamp for the delivery",
"name": "timestamp",
"type": "Timestamp"
}
]
}
minDate.avsc
{
"type": "record",
"name": "MinDate",
"doc": "Indicates the timestamp for earliest day a delivery should be made",
"fields": [
{
"default": "Timestamp",
"doc": "The time stamp for the delivery",
"name": "timestamp",
"type": "Timestamp"
}
]
}
shipping.avsc
{
"name": "Shipping",
"type": "record",
"doc": "A range of values relating to shipping a product for delivery",
"fields": [
{
"default": "MaxDate",
"name": "maxDate",
"type": "MaxDate"
},
{
"default": "MinDate",
"name": "minDate",
"type": "minDate"
},
{
"default": 0,
"doc": "Revenue generated from shipping - note, once a specific shipping object is selected, the more detailed revenye data sits within the one of object in pricing - this is more just to define if shipping is free or not",
"name": "revenue",
"type": "int"
},
{
"default": "",
"doc": "The shipping supplier",
"name": "supplier",
"type": "string"
}
]
}
timestamp.avsc
{
"name": "Timestamp",
"type": "record",
"doc": "Timestamp for the action taking place",
"fields": [
{
"default": 0,
"name": "timestampMs",
"type": "long"
},
{
"default": "",
"doc": "Timestamp converted to a string in ISO format",
"name": "isoTimestamp",
"type": "string"
}
]
}
I'm not sure if that library supports what you are trying to do, but fastavro should.
If you put the first schema in a file called BasketEvent.avsc and the second schema in a file called ContentBasket.avsc then you can do the following:
from fastavro.schema import load_schema
from fastavro import validate
schema = load_schema("BasketEvent.avsc")
validate({"test": "test"}, schema)
Note that when I tried to do this I got an error of fastavro._schema_common.UnknownType: Availability because it seems that there are other referenced schemas that you haven't posted here.

Validate Json schema with repeating json response in Python

I am getting none when I try to validate my Json schema with my Json response using Validate from Jsonschema.validate, while it shows matched on https://www.jsonschemavalidator.net/
Json Schema
{
"KPI": [{
"KPIDefinition": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"version": {
"type": "number"
},
"description": {
"type": "string"
},
"datatype": {
"type": "string"
},
"units": {
"type": "string"
}
},
"KPIGroups": [{
"id": {
"type": "number"
},
"name": {
"type": "string"
}
}]
}],
"response": [{
"Description": {
"type": "string"
}
}]
}
JSON Response
JSON Response
{
"KPI": [
{
"KPIDefinition": {
"id": "2",
"name": "KPI 2",
"version": 1,
"description": "This is KPI 2",
"datatype": "1",
"units": "perHour"
},
"KPIGroups": [
{
"id": 7,
"name": "Group 7"
}
]
},
{
"KPIDefinition": {
"id": "3",
"name": "Parameter 3",
"version": 1,
"description": "This is KPI 3",
"datatype": "1",
"units": "per Hour"
},
"KPIGroups": [
{
"id": 7,
"name": "Group 7"
}
]
}
],
"response": [
{
"Description": "RECORD Found"
}
]
}
Code
json_schema2 = {"KPI":[{"KPIDefinition":{"id_new":{"type":"number"},"name":{"type":"string"},"version":{"type":"number"},"description":{"type":"string"},"datatype":{"type":"string"},"units":{"type":"string"}},"KPIGroups":[{"id":{"type":"number"},"name":{"type":"string"}}]}],"response":[{"Description":{"type":"string"}}]}
json_resp = {"KPI":[{"KPIDefinition":{"id":"2","name":"Parameter 2","version":1,"description":"This is parameter 2 definition version 1","datatype":"1","units":"kN"},"KPIGroups":[{"id":7,"name":"Group 7"}]},{"KPIDefinition":{"id":"3","name":"Parameter 3","version":1,"description":"This is parameter 3 definition version 1","datatype":"1","units":"kN"},"KPIGroups":[{"id":7,"name":"Group 7"}]}],"response":[{"Description":"RECORD FETCHED"}]}
print(jsonschema.validate(instance=json_resp, schema=json_schema2))
Validation is not being done correctly, I changed the dataType and key name in my response but still, it is not raising an exception or error.
jsonschema.validate(..) is not supposed to return anything.
Your schema object and the JSON object are both okay and validation has passed if it didn't raise any exceptions -- which seems to be the case here.
That being said, you should wrap your call within a try-except block so as to be able to catch validation errors.
Something like:
try:
jsonschema.validate(...)
print("Validation passed!")
except ValidationError:
print("Validation failed")
# similarly catch SchemaError too if needed.
Update: Your schema is invalid. As it stands, it will validate almost all inputs. A schema JSON should be an object (dict) that should have fields like "type" and based on the type, it might have other required fields like "items" or "properties". Please read up on how to write JSONSchema.
Here's a schema I wrote for your JSON:
{
"type": "object",
"required": [
"KPI",
"response"
],
"properties": {
"KPI": {
"type": "array",
"items": {
"type": "object",
"required": ["KPIDefinition","KPIGroups"],
"properties": {
"KPIDefinition": {
"type": "object",
"required": ["id","name"],
"properties": {
"id": {"type": "string"},
"name": {"type": "string"},
"version": {"type": "integer"},
"description": {"type": "string"},
"datatype": {"type": "string"},
"units": {"type": "string"},
},
"KPIGroups": {
"type": "array",
"items": {
"type": "object",
"required": ["id", "name"],
"properties": {
"id": {"type": "integer"},
"name": {"type": "string"}
}
}
}
}
}
}
},
"response": {
"type": "array",
"items": {
"type": "object",
"required": ["Description"],
"properties": {
"Description": {"type": "string"}
}
}
}
}
}

Python: JSON to CSV

I am receiving a JSON file from a Docparser API, which I would like to convert to a CSV document.
The structure is here below:
{
"type": "object",
"properties": {
"id": {
"type": "string"
},
"document_id": {
"type": "string"
},
"remote_id": {
"type": "string"
},
"file_name": {
"type": "string"
},
"page_count": {
"type": "integer"
},
"uploaded_at": {
"type": "string"
},
"processed_at": {
"type": "string"
},
"table_data": [
{
"type": "array",
"items": {
"type": "object",
"properties": {
"account_ref": {
"type": "string"
},
"client": {
"type": "string"
},
"transaction_type": {
"type": "string"
},
"key_4": {
"type": "string"
},
"date_yyyymmdd": {
"type": "string"
},
"amount_excl": {
"type": "string"
}
},
"required": [
"account_ref",
"client",
"transaction_type",
"key_4",
"date_yyyymmdd",
"amount_excl"
]
}
}
]
}
}
The first problem that I have is how to only work with the table_data section?
My second problem is writing the actual code that allows me to put each section, i.e. account_ref, client, etc., into their own columns. I had so many changes to my code, the output varied from adding the properties into columns and dumping the table_data part into one cell, to only printing the headers into a single cell (as a list).
Here's my current code (which is not working correctly):
import pydocparser
import json
import pandas as pd
parser = pydocparser.Parser()
parser.login('API')
data2 = str(parser.fetch("Name of Parser", 'documentID'))
data2 = str(data2).replace("'", '"') # I had to put this in because it kept saying that it needs double quotes.
y = json.loads(str(data2))
json_file = open(r"C:\File.json", "w")
json_file.write(str(y))
json_file.close()
df1 = df = pd.DataFrame({str(y)})
df1.to_csv(r"C:\jsonCSV.csv")
Thanks for your help!
Pandas has a nice built in function called pandas.json_noramlize()
If you're using pandas version lower then 1.0.0 use pandas.io.json.json_normalize(), it should split the columns nicely.
read more about it here:
>1.0.0:
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.io.json.json_normalize.html
=<1.0.0
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html

How do I apply a subschema to an value when the key is unknown with JSON Schema?

Got a bit of a puzzle here, I'm trying to build a schema to use in my python app, But I can't figure out how to get this "we" field to be both required and contain a random string (ex: "QWERT1")
{
"we": [
{
"finished": "01.23.2020 12:56:31",
"run": "02611",
"scenarios": [
{
"name": "name",
"status": "failed",
"run_id": "42",
"tests": [
{
"test_id": "7",
"name": "TC29",
"status": "success",
"finished": "01.23.2020 12:56:31"
}
]
}
]
}
]
}
Rest of the fields should be also mandatory (name, status etc). If I exclude the "we" from the required the rest of the fields are treated as non-mandatory, and if I add the "we" as mandatory I can't then use there any other word :/
This my schema I've ended up with (with "we" mandatory):
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"we": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"finished": {
"type": "string"
},
"run": {
"type": "string"
},
"scenarios": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"name": {
"type": "string"
},
"status": {
"type": "string"
},
"run_id": {
"type": "string"
},
"tests": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"test_id": {
"type": "string"
},
"name": {
"type": "string"
},
"status": {
"type": "string"
},
"finished": {
"type": "string"
}
},
"required": [
"test_id",
"name",
"status",
"finished"
]
}
]
}
},
"required": [
"name",
"status",
"run_id",
"tests"
]
}
]
}
},
"required": [
"finished",
"run",
"scenarios"
]
}
]
}
},
"required": [
"we"
]
}
Any ideas ?
If I understand correctly, the root object key could be any string.
First, you need to replace required with minProperties: 1. If you require only 1 property, you also need maxProperties: 1.
Next, you need to use additionalProperties rather than properties > we.
additionalProperties applies the value subschema to all property values at the JSON instance location object.
Here's a bare version of that schema...
{
"$schema": "http://json-schema.org/draft-07/schema#",
"minProperties": 1,
"additionalProperties": {}
}
You can test it with your schema and instance here: https://jsonschema.dev/s/2kE9y

How to nest records in an Avro schema?

I'm trying to get Python to parse Avro schemas such as the following...
from avro import schema
mySchema = """
{
"name": "person",
"type": "record",
"fields": [
{"name": "firstname", "type": "string"},
{"name": "lastname", "type": "string"},
{
"name": "address",
"type": "record",
"fields": [
{"name": "streetaddress", "type": "string"},
{"name": "city", "type": "string"}
]
}
]
}"""
parsedSchema = schema.parse(mySchema)
...and I get the following exception:
avro.schema.SchemaParseException: Type property "record" not a valid Avro schema: Could not make an Avro Schema object from record.
What am I doing wrong?
According to other sources on the web I would rewrite your second address definition:
mySchema = """
{
"name": "person",
"type": "record",
"fields": [
{"name": "firstname", "type": "string"},
{"name": "lastname", "type": "string"},
{
"name": "address",
"type": {
"type" : "record",
"name" : "AddressUSRecord",
"fields" : [
{"name": "streetaddress", "type": "string"},
{"name": "city", "type": "string"}
]
}
}
]
}"""
Every time we provide the type as named type, the field needs to be given as:
"name":"some_name",
"type": {
"name":"CodeClassName",
"type":"record/enum/array"
}
However, if the named type is union, then we do not need an extra type field and should be usable as:
"name":"some_name",
"type": [{
"name":"CodeClassName1",
"type":"record",
"fields": ...
},
{
"name":"CodeClassName2",
"type":"record",
"fields": ...
}]
Hope this clarifies further!

Categories