I am building a data pipeline and making use of Avro as the file format. The pipeline is built in Python. The source data is received as CSV and converted to Avro format using a defined Avro schema, as expected.
During the conversion there is, naturally, validation between the file and the Avro schema: an error is output if a column is missing from the file or if data types in the file do not match the schema.
My issue is that the source files are received from an external source every month and often have changes to the headers and the content of the files. This is not an issue for major changes, as those would need to be corrected on our side regardless. However, the headers of the source files often contain minor changes that do not warrant a break in the system and interference from our side - say, if the schema header is "Client" and the source file header is "Client Name". Or even simpler, "Client" and "Clients".
Is there a way to perform partial header matches or keyword matching when converting to Avro format? Any assistance with this would be much appreciated!
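For what it's worth, here is a minimal sketch of one way to do this in Python, assuming fastavro for the write step (the library choice, field names and similarity cutoff are assumptions): fuzzy-match the incoming CSV headers against the schema field names with difflib before converting, so "Client Name" or "Clients" still maps to "Client".

```python
# Sketch: fuzzy-match CSV headers to Avro schema fields before conversion.
# fastavro and all names here are assumptions, not the asker's actual setup.
import csv
import difflib

import fastavro


def map_headers(csv_headers, schema_fields, cutoff=0.6):
    """Map each schema field to the closest CSV header, e.g. 'Client' -> 'Client Name'."""
    mapping = {}
    for field in schema_fields:
        match = difflib.get_close_matches(field, csv_headers, n=1, cutoff=cutoff)
        if not match:
            raise ValueError(f"No CSV header close enough to schema field '{field}'")
        mapping[field] = match[0]
    return mapping


def csv_to_avro(csv_path, avro_path, schema):
    schema_fields = [f["name"] for f in schema["fields"]]
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        mapping = map_headers(reader.fieldnames, schema_fields)
        records = [{field: row[src] for field, src in mapping.items()} for row in reader]
    with open(avro_path, "wb") as out:
        fastavro.writer(out, schema, records)
```

A stricter variant could normalise case and strip punctuation before matching; anything below the cutoff still fails loudly, so genuinely new or missing columns are not silently guessed.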
Can someone give me a better approach for the below use case, please?
1. Upload an XML file.
2. Scan the XML file for specific tags.
3. Store the required data - in which format? (I thought of building a JSON dump?)
4. I have data in different models for different components.
5. How can I compare the data I have in step 3 with the Django models and produce some output? (A sort of data comparison.)
Note: the JSON dump I get at step 3 is a full dump of the required data, while the data at step 4 refers to small chunks of data that have to be combined and compared against the JSON dump of the recently uploaded file.
I would define a model where you can store the uploaded file, plus a form (see the sketch after these steps).
(https://docs.djangoproject.com/en/3.2/topics/http/file-uploads/#handling-uploaded-files-with-a-model)
Use either lxml etree or generateDS to scan the XML files. (https://www.davekuhlman.org/generateDS.html)
To store the data you can use a JSON dump, or a PickleField in which you store the object of the XML file, if you use generateDS.
Store the data in the database and write a model for it in Django. Try to make it as granular as possible so you can compare the new XML file when you import it, and maybe only store the differences as pickled objects.
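A minimal sketch of the upload model and form from the first point, assuming Django 3.x; the app and class names are illustrative, following the documentation link above:

```python
# models.py - minimal model to hold the uploaded XML file (names are illustrative)
from django.db import models


class XMLUpload(models.Model):
    file = models.FileField(upload_to="xml_uploads/")
    uploaded_at = models.DateTimeField(auto_now_add=True)


# forms.py - form bound to that model for the upload view
from django import forms


class XMLUploadForm(forms.ModelForm):
    class Meta:
        model = XMLUpload
        fields = ["file"]
```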
Hope that helps a bit.
I'm just doing a reverse engineering exercise and have run across the application/x-protobuff content type.
I am currently sniffing network calls from Redfin using mitmproxy. I see an endpoint for a result; however, the response is unstructured JSON-formatted data with content type application/x-protobuff. After doing a bit of research, I found out that protobuf uses a schema to map the data internally, and I am assuming the schema also sits in the client somewhere, as a .proto file.
(screenshot of the response headers)
To validate my assumption about what that screenshot shows:
I can see there is a response header called X-ProtoBuf-Schema. Is that the location where the schema would be found, the same schema I can use to decode the response data? How would I go about reading that data in a more structured manner?
I am able to make a request to that endpoint using requests; it just gives me protobuf data.
PS: This is what the JSON format looks like:
https://pastebin.com/LY51X9KZ
"and I am assuming the schema also sits in the client somewhere, called .proto file." - I wouldn't assume that at all; the client, once built, doesn't need the .proto - the generated code is used instead of any explicit schema. If a site is publishing a schema, it is probably a serialized FileDescriptorSet from google/protobuf/descriptor.proto, which contains the intent of the .proto, but as data.
I have a Python script/job that daily calls an API, transforms the data and stores it in a data store. The problem is that the API has been changing the schema of the JSON (mainly adding fields), so the job fails. I can of course just write the script to parse the fields needed for the data store, but I wonder if there is some Python utility/function/library that reads a JSON schema definition (written in a file to be read by the script) and simply extracts from the JSON the data that conforms to that schema definition.
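As a rough sketch of what that could look like without a dedicated library: keep a small definition of the wanted fields in a file and recursively drop everything else, so fields the API adds later are simply ignored. The schema layout below is an assumption for illustration, not a standard format:

```python
# Sketch: keep only the fields declared in a simple field definition, so new
# fields added by the API are ignored. The definition would normally be loaded
# from a file, e.g. schema = json.load(open("schema.json")); inlined here.
import json

schema = {"id": "int", "user": {"name": "str"}}


def extract(record, definition):
    """Recursively keep only the keys declared in the definition."""
    out = {}
    for key, sub in definition.items():
        if key in record:
            out[key] = extract(record[key], sub) if isinstance(sub, dict) else record[key]
    return out


api_payload = {"id": 1, "user": {"name": "Ada", "new_field": "x"}, "extra": True}
print(json.dumps(extract(api_payload, schema)))  # -> {"id": 1, "user": {"name": "Ada"}}
```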
I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, then write out to file (S3) again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading in. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that are in all files, and these are the only ones I want.
Is it possible to write a partial schema that only reads the specific fields I want into Spark using PySpark?
First, it is recommended to have a JSONL file in which each line contains a single JSON record. Generally, you can read just a specific set of fields from a big JSON document, but that should not be considered Spark's job. You should have an initial method that converts your JSON input into an object of a serializable datatype, and feed that object into your Spark pipeline.
Passing the schema is not an appropriate design here and just makes the problem more severe. Instead, define a single method that extracts the specific fields after loading the data from the files. You can use the following link to find out how to extract specific fields from a JSON string in Python: How to extract specific fields and values from a JSON with python?
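A rough sketch of that approach in PySpark, assuming one JSON object per line (as recommended above); the core field names and S3 paths are placeholders:

```python
# Sketch: parse each JSON line yourself, keep only the "core" fields, and hand
# Spark already-flattened rows. Field names and S3 paths are placeholders.
import json

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("core-fields").getOrCreate()

CORE_FIELDS = ["id", "name", "timestamp"]  # ~100 core columns in the real job


def to_row(line):
    record = json.loads(line)
    return Row(**{field: record.get(field) for field in CORE_FIELDS})


lines = spark.sparkContext.textFile("s3://my-bucket/input/*.json")
df = spark.createDataFrame(lines.map(to_row))
df.write.mode("overwrite").json("s3://my-bucket/output/")
```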
I am using Python to download a city's data. I want to get the schema (data types and names) with Python for the columns downloaded with sodapy. I can get the data fine, but it is all of type string. So, it would be nice to have the data types so I could build a proper schema to load the data into.
On the website, they have data types laid out for the columns.
https://data.muni.org/Housing-and-Homelessness/CAMA-Property-Inventory-Residential-with-Details/r3di-nq2j
There is some metadata at this URL, but it does not have the column info.
https://data.muni.org/api/views/metadata/v1/r3di-nq2j
Unable to comment due to low rep, so adding a new answer, in case anyone else comes across this in search results, as I did.
Omitting /metadata/v1/ from the OP's original URL, as in https://data.muni.org/api/views/r3di-nq2j, will return the full table metadata.
Additionally, Socrata HTTP response headers include special fields (X-SODA2-Fields and X-SODA2-Types) containing a list of the column names and data types returned in the results.
Ref: https://dev.socrata.com/docs/response-codes.html
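As a rough sketch, assuming the standard /resource/<id>.json endpoint for this dataset, those headers can be read with plain requests (sodapy is not needed for this part):

```python
# Sketch: read column names and types from the X-SODA2-* response headers.
# The /resource/r3di-nq2j.json endpoint is assumed from the dataset id above.
import json

import requests

resp = requests.get(
    "https://data.muni.org/resource/r3di-nq2j.json",
    params={"$limit": 1},  # one row is enough; we only want the headers
)
fields = json.loads(resp.headers["X-SODA2-Fields"])
types = json.loads(resp.headers["X-SODA2-Types"])
print(dict(zip(fields, types)))
```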
There may be a better way, but this URL returns a JSON response with the information.
http://api.us.socrata.com/api/catalog/v1?ids=r3di-nq2j
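As a rough sketch, the column names and types can be pulled out of that catalog response like this; treat the exact keys (results[0].resource.columns_name and columns_datatype) as an assumption about the current response layout:

```python
# Sketch: pull column names and data types from the catalog response above.
# The keys under results[0].resource are assumptions about the response layout.
import requests

resp = requests.get(
    "http://api.us.socrata.com/api/catalog/v1", params={"ids": "r3di-nq2j"}
)
resource = resp.json()["results"][0]["resource"]
print(dict(zip(resource["columns_name"], resource["columns_datatype"])))
```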