How to load nested JSON data into BigQuery - Python

I'm trying to load JSON data from an API into a BigQuery table on GCP. The load fails because the JSON data seems to be missing a square bracket, and I get the error '"Repeated record with name trip_update added outside of an array."}]'. I don't know how to fix this.
Here is the data sample:
{
"header": {
"gtfs_realtime_version": "1.0",
"timestamp": 1607630971
},
"entity": [
{
"id": "65.5.17-120-cm1-1.18.O",
"trip_update": {
"trip": {
"trip_id": "65.5.17-120-cm1-1.18.O",
"start_time": "18:00:00",
"start_date": "20201210",
"schedule_relationship": "SCHEDULED",
"route_id": "17-120-cm1-1"
},
"stop_time_update": [
{
"stop_sequence": 1,
"departure": {
"delay": 0
},
"stop_id": "8220B1351201",
"schedule_relationship": "SCHEDULED"
},
{
"stop_sequence": 23,
"arrival": {
"delay": 2340
},
"departure": {
"delay": 2340
},
"stop_id": "8260B1025301",
"schedule_relationship": "SCHEDULED"
}
]
}
}
]
}
Here is a schema and code:
schema
[
{ "name":"header",
"type": "record",
"fields": [
{ "name":"gtfs_realtime_version",
"type": "string",
"description": "version of speed specification"
},
{ "name": "timestamp",
"type": "integer",
"description": "The moment where this dataset was generated on server e.g. 1593102976"
}
]
},
{"name":"entity",
"type": "record",
"mode": "REPEATED",
"description": "Multiple entities can be included in the feed",
"fields": [
{"name":"id",
"type": "string",
"description": "unique identifier for the entity"
},
{"name": "trip_update",
"type": "struct",
"mode": "REPEATED",
"description": "Data about the realtime departure delays of a trip. At least one of the fields trip_update, vehicle, or alert must be provided - all these fields cannot be empty.",
"fields": [
{ "name":"trip",
"type": "record",
"mode": "REPEATED",
"fields": [
{"name": "trip_id",
"type": "string",
"description": "selects which GTFS entity (trip) will be affected"
},
{ "name":"start_time",
"type": "string",
"description": "The initially scheduled start time of this trip instance 13:30:00"
},
{ "name":"start_date",
"type": "string",
"description": "The start date of this trip instance in YYYYMMDD format. Whether start_date is required depends on the type of trip: e.g. 20200625"
},
{ "name":"schedule_relationship",
"type": "string",
"description": "The relation between this trip and the static schedule e.g. SCHEDULED"
},
{ "name":"route_id",
"type": "string",
"description": "The route_id from the GTFS feed that this selector refers to e.g. 10-263-e16-1"
}
]
}
]
},
{ "name":"stop_time_update",
"type": "record",
"mode": "REPEATED",
"description": "Updates to StopTimes for the trip (both future, i.e., predictions, and in some cases, past ones, i.e., those that already happened). The updates must be sorted by stop_sequence, and apply for all the following stops of the trip up to the next specified stop_time_update. At least one stop_time_update must be provided for the trip unless the trip.schedule_relationship is CANCELED - if the trip is canceled, no stop_time_updates need to be provided.",
"fields": [
{"name":"stop_sequence",
"type": "string",
"description": "Must be the same as in stop_times.txt in the corresponding GTFS feed e.g 3"
},
{ "name":"arrival",
"type": "record",
"mode": "REPEATED",
"fields": [
{ "name":"delay",
"type": "string",
"description": "Delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule). Delay of 0 means that the vehicle is exactly on time e.g 5"
}
]
},
{ "name": "departure",
"type": "record",
"mode": "REPEATED",
"fields": [
{ "name":"delay",
"type": "integer"
}
]
},
{ "name":"stop_id",
"type": "string",
"description": "Must be the same as in stops.txt in the corresponding GTFS feed e.g. 8430B2552301"
},
{"name":"schedule_relationship",
"type": "string",
"description": "The relation between this StopTime and the static schedule e.g. SCHEDULED , SKIPPED or NO_DATA"
}
]
}
]
}
]
function (following google guideline https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions?authuser=2#before-you-begin)
def _insert_into_bigquery(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    row = json.loads(blob.download_as_string())
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table,
                                 json_rows=row,
                                 ignore_unknown_values=True,
                                 retry=retry.Retry(deadline=30))
    if errors != []:
        raise BigQueryError(errors)

Your schema definition is wrong. trip_update is not a repeated struct but a single record: in the sample data it is one JSON object, not an array. Its mode should be NULLABLE (or REQUIRED), but not REPEATED.
{"name": "trip_update",
"type": "record",
"mode": "NULLABLE",

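The mismatch behind this error can be checked locally before anything is sent to BigQuery. The following is a minimal sketch, not part of any BigQuery client library: the helper name find_repeated_mismatches is hypothetical, and the schema and row are trimmed-down versions of the ones in the question. It walks a schema and reports every field declared REPEATED whose value in the row is not a JSON array.

```python
def find_repeated_mismatches(schema_fields, row, path=""):
    """Return dotted paths of fields declared REPEATED whose value is not a list."""
    problems = []
    for field in schema_fields:
        name = field["name"]
        if name not in row:
            continue
        value = row[name]
        here = f"{path}.{name}" if path else name
        repeated = field.get("mode", "NULLABLE").upper() == "REPEATED"
        if repeated and not isinstance(value, list):
            problems.append(here)
        # Recurse into nested records; a single object is treated as a one-item list.
        if field.get("type", "").lower() in ("record", "struct"):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict):
                    problems.extend(
                        find_repeated_mismatches(field.get("fields", []), child, here)
                    )
    return problems


# Trimmed-down schema (as posted, with REPEATED on trip_update and trip).
schema = [
    {"name": "id", "type": "string"},
    {"name": "trip_update", "type": "record", "mode": "REPEATED", "fields": [
        {"name": "trip", "type": "record", "mode": "REPEATED", "fields": [
            {"name": "trip_id", "type": "string"},
        ]},
    ]},
]

# Trimmed-down entity from the sample data: both values are single objects.
row = {
    "id": "65.5.17-120-cm1-1.18.O",
    "trip_update": {"trip": {"trip_id": "65.5.17-120-cm1-1.18.O"}},
}

mismatches = find_repeated_mismatches(schema, row)
print(mismatches)  # ['trip_update', 'trip_update.trip']
```

Every path it prints is a place where either the schema mode or the shape of the data has to change.
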
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
I believe that your "trip_update" and "trip" fields must contain an array of values (indicated by square brackets), the same as you did for "stop_time_update".
"trip_update": [
{
"trip": [
{
"trip_id
I am not sure that will be enough though to load your data flawlessly.
Your example row has many newline characters in the middle of the JSON row. When you load data from JSON files, the rows must be newline delimited: BigQuery expects a newline-delimited JSON file to contain a single record per line, and the parser tries to interpret each line as a separate JSON row.
Your JSON data file should therefore contain each record on a single line.
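A minimal sketch of producing such a file with Python's standard json module, assuming the whole feed should become one row. The helper name to_ndjson is made up for illustration, and the record is a shortened version of the sample above.

```python
import json


def to_ndjson(records):
    """Serialize each record as one compact JSON object per line."""
    lines = (json.dumps(record, separators=(",", ":")) for record in records)
    return "\n".join(lines) + "\n"


# Shortened version of the feed from the question.
feed = {
    "header": {"gtfs_realtime_version": "1.0", "timestamp": 1607630971},
    "entity": [
        {
            "id": "65.5.17-120-cm1-1.18.O",
            "trip_update": {"trip": {"trip_id": "65.5.17-120-cm1-1.18.O"}},
        }
    ],
}

ndjson_text = to_ndjson([feed])
print(ndjson_text, end="")
```

json.dumps without an indent argument never emits raw newlines, so each record stays on one line even when it is deeply nested.
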

Related

Validating Avro schema that is referencing another schema

I am using the Python 3 avro_validator library.
The schema I want to validate references other schemas in separate Avro files. The files are in the same folder. How do I compile all the referenced schemas using the library?
Python code as follows:
from avro_validator.schema import Schema
schema_file = 'basketEvent.avsc'
schema = Schema(schema_file)
parsed_schema = schema.parse()
data_to_validate = {"test": "test"}
parsed_schema.validate(data_to_validate)
The error I get back:
ValueError: Error parsing the field [contentBasket]: The type [ContentBasket] is not recognized by Avro
And example Avro file(s) below:
basketEvent.avsc
{
"type": "record",
"name": "BasketEvent",
"doc": "Indicates that a user action has taken place with a basket",
"fields": [
{
"default": "basket",
"doc": "Restricts this event to having type = basket",
"name": "event",
"type": {
"name": "BasketEventType",
"symbols": ["basket"],
"type": "enum"
}
},
{
"default": "create",
"doc": "What is being done with the basket. Note: create / delete / update will always follow a product event",
"name": "action",
"type": {
"name": "BasketEventAction",
"symbols": ["create","delete","update","view"],
"type": "enum"
}
},
{
"default": "ContentBasket",
"doc": "The set of values that are specific to a Basket event",
"name": "contentBasket",
"type": "ContentBasket"
},
{
"default": "ProductDetail",
"doc": "The set of values that are specific to a Product event",
"name": "productDetail",
"type": "ProductDetail"
},
{
"default": "Timestamp",
"doc": "The time stamp for the event being sent",
"name": "timestamp",
"type": "Timestamp"
}
]
}
contentBasket.avsc
{
"name": "ContentBasket",
"type": "record",
"doc": "The set of values that are specific to a Basket event",
"fields": [
{
"default": [],
"doc": "A range of details about product / basket availability",
"name": "availability",
"type": {
"type": "array",
"items": "Availability"
}
},
{
"default": [],
"doc": "A range of care pland applicable to the basket",
"name": "carePlan",
"type": {
"type": "array",
"items": "CarePlan"
}
},
{
"default": "Category",
"name": "category",
"type": "Category"
},
{
"default": "",
"doc": "Unique identfier for this basket",
"name": "id",
"type": "string"
},
{
"default": "Price",
"doc": "Overall pricing info about the basket as a whole - individual product pricings will be dealt with at a product level",
"name": "price",
"type": "Price"
}
]
}
availability.avsc
{
"name": "Availability",
"type": "record",
"doc": "A range of values relating to the availability of a product",
"fields": [
{
"default": [],
"doc": "A list of offers associated with the overall basket - product level offers will be dealt with on an individual product basis",
"name": "shipping",
"type": {
"type": "array",
"items": "Shipping"
}
},
{
"default": "",
"doc": "The status of the product",
"name": "stockStatus",
"type": {
"name": "StockStatus",
"symbols": ["in stock","out of stock",""],
"type": "enum"
}
},
{
"default": "",
"doc": "The ID for the store when the stock can be collected, if relevant",
"name": "storeId",
"type": "string"
},
{
"default": "",
"doc": "The status of the product",
"name": "type",
"type": {
"name": "AvailabilityType",
"symbols": ["collection","shipping",""],
"type": "enum"
}
}
]
}
maxDate.avsc
{
"type": "record",
"name": "MaxDate",
"doc": "Indicates the timestamp for latest day a delivery should be made",
"fields": [
{
"default": "Timestamp",
"doc": "The time stamp for the delivery",
"name": "timestamp",
"type": "Timestamp"
}
]
}
minDate.avsc
{
"type": "record",
"name": "MinDate",
"doc": "Indicates the timestamp for earliest day a delivery should be made",
"fields": [
{
"default": "Timestamp",
"doc": "The time stamp for the delivery",
"name": "timestamp",
"type": "Timestamp"
}
]
}
shipping.avsc
{
"name": "Shipping",
"type": "record",
"doc": "A range of values relating to shipping a product for delivery",
"fields": [
{
"default": "MaxDate",
"name": "maxDate",
"type": "MaxDate"
},
{
"default": "MinDate",
"name": "minDate",
"type": "minDate"
},
{
"default": 0,
"doc": "Revenue generated from shipping - note, once a specific shipping object is selected, the more detailed revenye data sits within the one of object in pricing - this is more just to define if shipping is free or not",
"name": "revenue",
"type": "int"
},
{
"default": "",
"doc": "The shipping supplier",
"name": "supplier",
"type": "string"
}
]
}
timestamp.avsc
{
"name": "Timestamp",
"type": "record",
"doc": "Timestamp for the action taking place",
"fields": [
{
"default": 0,
"name": "timestampMs",
"type": "long"
},
{
"default": "",
"doc": "Timestamp converted to a string in ISO format",
"name": "isoTimestamp",
"type": "string"
}
]
}
I'm not sure if that library supports what you are trying to do, but fastavro should.
If you put the first schema in a file called BasketEvent.avsc and the second schema in a file called ContentBasket.avsc then you can do the following:
from fastavro.schema import load_schema
from fastavro import validate
schema = load_schema("BasketEvent.avsc")
validate({"test": "test"}, schema)
Note that when I tried to do this I got an error of fastavro._schema_common.UnknownType: Availability because it seems that there are other referenced schemas that you haven't posted here.
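Since load_schema looks for each referenced type in a file named after it in the same directory, it can help to list which files are missing before loading. The following is a rough, standard-library-only sketch and not how fastavro itself is implemented: the helper names are hypothetical, namespaces are ignored, and the three schemas are cut-down stand-ins for the ones in the question.

```python
import json
import os
import tempfile

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def referenced_names(schema, defined, refs):
    """Collect names referenced as plain strings and names defined inline."""
    if isinstance(schema, str):
        if schema not in PRIMITIVES:
            refs.add(schema)
    elif isinstance(schema, list):  # union: check every branch
        for branch in schema:
            referenced_names(branch, defined, refs)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in ("record", "enum", "fixed"):
            defined.add(schema["name"])
        if kind == "record":
            for field in schema.get("fields", []):
                referenced_names(field["type"], defined, refs)
        elif kind == "array":
            referenced_names(schema["items"], defined, refs)
        elif kind == "map":
            referenced_names(schema["values"], defined, refs)
        elif kind not in ("enum", "fixed"):
            referenced_names(kind, defined, refs)


def missing_schema_files(directory, top_level_file):
    """Follow references file by file; return names with no <Name>.avsc file."""
    missing, seen, queue = set(), set(), [top_level_file]
    while queue:
        filename = queue.pop()
        if filename in seen:
            continue
        seen.add(filename)
        with open(os.path.join(directory, filename)) as handle:
            schema = json.load(handle)
        defined, refs = set(), set()
        referenced_names(schema, defined, refs)
        for name in refs - defined:
            candidate = name + ".avsc"
            if os.path.exists(os.path.join(directory, candidate)):
                queue.append(candidate)
            else:
                missing.add(name)
    return sorted(missing)


# Cut-down stand-ins for the schemas in the question; Availability.avsc is absent.
files = {
    "BasketEvent.avsc": {"type": "record", "name": "BasketEvent", "fields": [
        {"name": "contentBasket", "type": "ContentBasket"},
        {"name": "timestamp", "type": "Timestamp"}]},
    "ContentBasket.avsc": {"type": "record", "name": "ContentBasket", "fields": [
        {"name": "availability", "type": {"type": "array", "items": "Availability"}}]},
    "Timestamp.avsc": {"type": "record", "name": "Timestamp", "fields": [
        {"name": "timestampMs", "type": "long"}]},
}

with tempfile.TemporaryDirectory() as folder:
    for filename, schema in files.items():
        with open(os.path.join(folder, filename), "w") as handle:
            json.dump(schema, handle)
    missing = missing_schema_files(folder, "BasketEvent.avsc")

print(missing)  # ['Availability']
```
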

Parsing Multiple AVRO (avsc files) which refer each other using python (fastavro)

I have an Avro schema that currently lives in a single avsc file, shown below. I want to move the address record to a separate, common avsc file that can be referenced from many other avsc files, so Customer and address will be separate avsc files. How can I separate them and have the customer avsc file reference the address avsc file? Also, how can both files be processed using Python? I am currently using fastavro in Python 3 to process the single avsc file, but I am open to any other utility in Python 3 or PySpark.
File name - customer_details.avsc
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
import fastavro
s1 = fastavro.schema.load_schema('customer_details.avsc')
How can I split the schema into different files so that the address record file can be referenced from other avsc files? And how would I then process multiple avsc files using fastavro (Python) or any other Python utility?
To do this, the schema for the AddressRecord should be in a file called com.company.model.AddressRecord.avsc with the following contents:
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
}
The Customer schema doesn't necessarily need a special naming convention since it is the top level schema, but it's probably a good idea to follow the same convention. So it would be in a file named com.company.model.Customer.avsc with the following contents:
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
The files must be in the same directory.
Then you should be able to do fastavro.schema.load_schema('com.company.model.Customer.avsc')
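The split itself can be scripted. The following is a minimal sketch using only the standard library, assuming the combined file is a top-level JSON list of named schemas as in the question. The helper name split_schema_file is hypothetical, the field lists are shortened, and names that already contain a dot are not handled.

```python
import json
import os
import tempfile


def split_schema_file(combined_path, out_dir):
    """Write each schema in a top-level JSON list to <namespace>.<name>.avsc."""
    with open(combined_path) as handle:
        schemas = json.load(handle)
    written = []
    for schema in schemas:
        namespace = schema.get("namespace")
        full_name = f"{namespace}.{schema['name']}" if namespace else schema["name"]
        filename = full_name + ".avsc"
        with open(os.path.join(out_dir, filename), "w") as out:
            json.dump(schema, out, indent=2)
        written.append(filename)
    return written


# Shortened version of customer_details.avsc from the question.
combined = [
    {"type": "record", "namespace": "com.company.model", "name": "AddressRecord",
     "fields": [{"name": "city", "type": "string"}]},
    {"type": "record", "namespace": "com.company.model", "name": "Customer",
     "fields": [{"name": "address",
                 "type": {"type": "array", "items": "com.company.model.AddressRecord"}}]},
]

with tempfile.TemporaryDirectory() as folder:
    combined_path = os.path.join(folder, "customer_details.avsc")
    with open(combined_path, "w") as handle:
        json.dump(combined, handle)
    written = split_schema_file(combined_path, folder)
    produced = sorted(name for name in os.listdir(folder) if name in written)

print(written)
```

The resulting files follow the naming convention described above, so the Customer file can then be passed to load_schema.
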

ElasticSearch Parse Error

I am attempting to read JSON data from a network port scan and store the results in an Elasticsearch index as a document. However, whenever I try to do this, I get a MapperParsingException on the scan output results. In my mapping, I even tried changing the analysis to not_analyzed and no, but the error doesn't go away. Then I figured that ES might be trying to interpret certain values as dates and attempted to set date_format to 0 or none. That led to a dead end as well, with the mapping throwing an Unsupported option exception.
I have a dump of the values that I want to index in ElasticSearch here:
{
"protocol": "tcp",
"service": "ssh",
"state": "open",
"script_out": [
{
"output": "\n 1024 de:4e:50:33:cd:f6:8a:d0:c4:5a:e9:7d:1e:7b:13:12 (DSA)\nssh-dss AAAAB3NzaC1kc3MAAACBANkPx1nphZwsN1SVPPQHwz93abIHuEC4wMEeZiXdBC8RoSUUeCmdgPfIh4or0LvZ1pqaZP/k0qzCLyVxFt/eI7n36Lb9sZdVMf1Ao7E9TSc7lj9wg5ffY58WbWob/GQs1llGZ2K9Gp7oWuwCjKP164MsxMvahoJAAaWfap48ZiXpAAAAFQCnRMwRp8wBzzQU6lia8NegIb5rswAAAIEAxvN66VMDxE5aU8SvwwVmcUNwtVQWZ6pxn2W0gzF6H7JL1BhcnbCwQ3J/S6WdtqL2Dscw8drdAvsrN4XC8RT6Jowsir4q4HSQCybll6fSpNEdlv/nLIlYsH5ZuZZUIMxbTQ9vT0oYvzpDHejIQ/Zl1inYnJ+6XJmOc0LPUsu5PEsAAACAQO+Tsd3inLGskrqyrWSDO0VDD3cApYW7C+uTWXBfIoh/sVw+X9+OPa833w/PQkpacm68kYPXKS7GK8lqhg93dwbUNYFKz9MMNY6WVOjeAX9HtUAbglgLyRIt0CBqmL4snoZeKab22Nlmaf4aU5cHFlG9gnFEcK0vVIwIWp2EM/I=\n 2048 94:5f:86:77:81:39:2e:03:e0:42:d8:7d:10:a5:60:f0 (RSA)\nssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDV9BKj+QSavAr4UcDaCoHADVIaMOpcI5/hx/X9CRLDTxmB/WvEiL42tziMZEx7ipHT28/hl4HOwK64eXZuK75JMrMDutCZ2gmvRmvFKl6mAVbUEOlVkMGZeNJxATCZyWQyrZ6wA9E2ns5+id6l9C8we+bdq39cIR/e+yR8Ht8sfaigDi0gcW67GrHDI/oIgTQ79l+T/xAqCVrtQxqn/6pCuaCWQUVCxgOPXmJPbsd+g+oqZtm0aEjIJvcDJocMkZ2qMMlgMPeJBN27FCTKB80UUbV57iHXHzZF+cD7v+Jlw0fmyMapMkkPH+aabOUy7Kkbty1mucrFxaisLsckEf47",
"elements": {
"null": [
{
"type": "ssh-dss",
"bits": "1024",
"key": "AAAAB3NzaC1kc3MAAACBANkPx1nphZwsN1SVPPQHwz93abIHuEC4wMEeZiXdBC8RoSUUeCmdgPfIh4or0LvZ1pqaZP/k0qzCLyVxFt/eI7n36Lb9sZdVMf1Ao7E9TSc7lj9wg5ffY58WbWob/GQs1llGZ2K9Gp7oWuwCjKP164MsxMvahoJAAaWfap48ZiXpAAAAFQCnRMwRp8wBzzQU6lia8NegIb5rswAAAIEAxvN66VMDxE5aU8SvwwVmcUNwtVQWZ6pxn2W0gzF6H7JL1BhcnbCwQ3J/S6WdtqL2Dscw8drdAvsrN4XC8RT6Jowsir4q4HSQCybll6fSpNEdlv/nLIlYsH5ZuZZUIMxbTQ9vT0oYvzpDHejIQ/Zl1inYnJ+6XJmOc0LPUsu5PEsAAACAQO+Tsd3inLGskrqyrWSDO0VDD3cApYW7C+uTWXBfIoh/sVw+X9+OPa833w/PQkpacm68kYPXKS7GK8lqhg93dwbUNYFKz9MMNY6WVOjeAX9HtUAbglgLyRIt0CBqmL4snoZeKab22Nlmaf4aU5cHFlG9gnFEcK0vVIwIWp2EM/I=",
"fingerprint": "de4e5033cdf68ad0c45ae97d1e7b1312"
},
{
"type": "ssh-rsa",
"bits": "2048",
"key": "AAAAB3NzaC1yc2EAAAADAQABAAABAQDV9BKj+QSavAr4UcDaCoHADVIaMOpcI5/hx/X9CRLDTxmB/WvEiL42tziMZEx7ipHT28/hl4HOwK64eXZuK75JMrMDutCZ2gmvRmvFKl6mAVbUEOlVkMGZeNJxATCZyWQyrZ6wA9E2ns5+id6l9C8we+bdq39cIR/e+yR8Ht8sfaigDi0gcW67GrHDI/oIgTQ79l+T/xAqCVrtQxqn/6pCuaCWQUVCxgOPXmJPbsd+g+oqZtm0aEjIJvcDJocMkZ2qMMlgMPeJBN27FCTKB80UUbV57iHXHzZF+cD7v+Jlw0fmyMapMkkPH+aabOUy7Kkbty1mucrFxaisLsckEf47",
"fingerprint": "945f867781392e03e042d87d10a560f0"
}
]
},
"id": "ssh-hostkey"
}
],
"banner": "product: OpenSSH version: 6.2 extrainfo: protocol 2.0",
"port": "22"
},
Update
I am able to index the content in the "output" key. However, the error appears when I try and index the content in the "elements" key
Update 2
There's a possibility that there's something wrong with my mapping. This is the python code that I am using for the mapping.
"scan_info": {
"properties": {
"protocol": {
"type": "string",
"index": "analyzed"
},
"service": {
"type": "string",
"index": "analyzed"
},
"state": {
"type": "string",
"index": "not_analyzed"
},
"banner": {
"type": "string",
"index": "analyzed"
},
"port": {
"type": "string",
"index": "not_analyzed"
},
"script_out": { #is this the problem??
"type": "object",
"dynamic": True
}
}
}
I am drawing a blank here. What do I need to do?

Is there a way to use JSON schemas to enforce values between fields?

I've recently started playing with JSON schemas to start enforcing API payloads. I'm hitting a bit of a roadblock with defining the schema for a legacy API that has some pretty kludgy design logic which has resulted (along with poor documentation) in clients misusing the endpoint.
Here's the schema so far:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"type": {
"type": "string"
},
"object_id": {
"type": "string"
},
"question_id": {
"type": "string",
"pattern": "^-1|\\d+$"
},
"question_set_id": {
"type": "string",
"pattern": "^-1|\\d+$"
},
"timestamp": {
"type": "string",
"format": "date-time"
},
"values": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"type",
"object_id",
"question_id",
"question_set_id",
"timestamp",
"values"
],
"additionalProperties": false
}
}
Notice that for question_id and question_set_id, they both take a numeric string that can either be a -1 or some other non-negative integer.
My question: is there a way to enforce that if question_id is set to -1, question_set_id is also set to -1, and vice versa?
It would be awesome if I could have that validated by the parser rather than having to do that check in application logic.
Just for additional context, I've been using Python's jsl module to generate this schema.
You can achieve the desired behavior by adding the following to your items schema. It asserts that each item must conform to at least one of the schemas in the list: either both values are "-1" or both are non-negative integers. (I assume you have good reason for representing integers as strings.)
"anyOf": [
{
"properties": {
"question_id": { "enum": ["-1"] },
"question_set_id": { "enum": ["-1"] }
}
},
{
"properties": {
"question_id": {
"type": "string",
"pattern": "^\\d+$"
},
"question_set_id": {
"type": "string",
"pattern": "^\\d+$"
}
}
}
]
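One detail worth checking is the pattern in the question's schema: in "^-1|\\d+$" the alternation binds more loosely than the anchors, so it reads as "starts with -1" or "ends with digits". The sketch below uses Python's re module as a stand-in for the validator's regex engine, which is an assumption since dialects differ slightly. The names LOOSE, STRICT, matches and ids_consistent are illustrative only.

```python
import re

LOOSE = r"^-1|\d+$"     # pattern as written in the question
STRICT = r"^(-1|\d+)$"  # grouped, so both anchors apply to each alternative


def matches(pattern, value):
    # JSON Schema "pattern" succeeds on a partial match, hence search, not fullmatch.
    return re.search(pattern, value) is not None


def ids_consistent(question_id, question_set_id):
    """Plain-Python mirror of the anyOf: both "-1", or both digit strings."""
    both_unset = question_id == "-1" and question_set_id == "-1"
    both_set = all(
        re.fullmatch(r"\d+", value) for value in (question_id, question_set_id)
    )
    return both_unset or both_set


for candidate in ["-1", "12", "-1abc", "abc12"]:
    print(candidate, matches(LOOSE, candidate), matches(STRICT, candidate))

print(ids_consistent("-1", "-1"))  # True
print(ids_consistent("-1", "7"))   # False
print(ids_consistent("3", "7"))    # True
```

With the loose pattern, "-1abc" and "abc12" are accepted. The grouped pattern rejects both.
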

How to make json-schema to allow one but not another field?

Is it possible to make a JSON schema allow only one of two fields?
For example, imagine I want a JSON document with either start_dt or end_dt, but not both at the same time, like this:
OK
{
"name": "foo",
"start_dt": "2012-10-10"
}
OK
{
"name": "foo",
"end_dt": "2012-10-10"
}
NOT OK
{
"name": "foo",
"start_dt": "2012-10-10",
"end_dt": "2013-11-11"
}
What should I add to the schema:
{
"title": "Request Schema",
"type": "object",
"properties": {
"name":
{
"type": "string"
},
"start_dt":
{
"type": "string",
"format": "date"
},
"end_dt":
{
"type": "string",
"format": "date"
}
}
}
You can express this using oneOf. This means that the data must match exactly one of the supplied sub-schemas, but not more than one.
Combining this with required, this schema says that instances must either define start_dt, OR define end_dt - but if they contain both, then it is invalid:
{
"type": "object",
"properties": {
"name": {"type": "string"},
"start_dt": {"type": "string", "format": "date"},
"end_dt": {"type": "string", "format": "date"}
},
"oneOf": [
{"required": ["start_dt"]},
{"required": ["end_dt"]}
]
}
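The "exactly one" semantics can be illustrated without a validator. This is a plain-Python sketch of the oneOf/required combination only, not a JSON Schema implementation. The function name exactly_one_present is hypothetical.

```python
def exactly_one_present(instance, alternatives):
    """Emulate oneOf over {"required": [...]} subschemas: exactly one must match."""
    satisfied = [
        keys for keys in alternatives if all(key in instance for key in keys)
    ]
    return len(satisfied) == 1


# One entry per subschema in the oneOf list above.
ALTERNATIVES = [["start_dt"], ["end_dt"]]

results = [
    exactly_one_present({"name": "foo", "start_dt": "2012-10-10"}, ALTERNATIVES),
    exactly_one_present({"name": "foo", "end_dt": "2012-10-10"}, ALTERNATIVES),
    exactly_one_present(
        {"name": "foo", "start_dt": "2012-10-10", "end_dt": "2013-11-11"},
        ALTERNATIVES,
    ),
    exactly_one_present({"name": "foo"}, ALTERNATIVES),
]
print(results)  # [True, True, False, False]
```

The last case shows that an instance with neither field is rejected as well, because zero matching subschemas also fails oneOf.
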
