Pyspark: Read in only certain fields from nested json data - python

I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, and then write out to file (S3) again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading in. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that appear in all files, and these are the only ones I want.
Is it possible to write a partial schema that only reads the specific fields I want into spark using pyspark?

First, it is recommended to use JSON Lines files, where each line contains a single JSON record. In general you can read just a specific set of fields out of a large JSON document, but that should not be considered Spark's job. You should have an initial method that converts each JSON record into an object of a serializable datatype, and feed those objects into your Spark pipeline.
Passing the schema is not an appropriate design here and only makes the problem worse. Instead, define a single method that extracts the specific fields after loading the data from the files. This question shows how to extract specific fields from a JSON string in Python: How to extract specific fields and values from a JSON with python?
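A minimal sketch of that approach, assuming JSON Lines input, an existing SparkSession named spark, and hypothetical field names and S3 path:
import json
from pyspark.sql import Row

CORE_FIELDS = ["id", "timestamp", "status"]  # hypothetical stand-ins for the ~100 core columns

def extract_core(line):
    # Parse one JSON record and keep only the core fields, dropping everything else.
    record = json.loads(line)
    return Row(**{field: record.get(field) for field in CORE_FIELDS})

rows = spark.sparkContext.textFile("s3://my-bucket/input/*.jsonl").map(extract_core)
df = spark.createDataFrame(rows)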

Related

Upload XML Files in Django, Parse XML & Compare with existing models dataset

Can someone suggest a better approach for the use case below, please?
1. Upload an XML file.
2. Scan the XML file for specific tags.
3. Store the required data (in which format? I thought of building a JSON dump).
4. I have data in different models for different components.
5. How can I compare the data I have at step 3 with the Django models and produce some output? (A sort of data comparison.)
Note: the JSON dump I get at step 3 is a full dump of the required data, while the data at step 4 consists of small chunks that have to be combined and compared against the JSON dump of the most recently uploaded file.
I would define a model where you can store the uploaded file, plus a form for it (a minimal sketch follows after these steps).
(https://docs.djangoproject.com/en/3.2/topics/http/file-uploads/#handling-uploaded-files-with-a-model)
Use either lxml's etree or generateDS to scan the XML files. (https://www.davekuhlman.org/generateDS.html)
For storage you can use a JSON dump, or a PickleField holding the object that generateDS builds from the XML file.
Store the data in the database and write a Django model for it. Try to make it as granular as possible so you can compare against a newly imported XML file and perhaps only store the differences as pickled objects.
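A minimal sketch of the upload model and form from the first step, with hypothetical names:
from django import forms
from django.db import models

class XMLUpload(models.Model):
    # Stores the uploaded XML file plus some basic metadata.
    file = models.FileField(upload_to="xml_uploads/")
    uploaded_at = models.DateTimeField(auto_now_add=True)

class XMLUploadForm(forms.ModelForm):
    class Meta:
        model = XMLUpload
        fields = ["file"]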
Hope that helps a bit.

Python function that automatically extracts conformed-to-schema data from a json

I have a Python script/job that calls an API daily, transforms the data, and stores it in a data store. The problem is that the API has been changing the schema of the JSON (mainly adding fields), so the job fails. I can, of course, just write the script to parse the fields needed for the data store, but I wonder if there is some Python utility/function/library that reads a JSON schema definition (written in a file to be read by the script) and simply extracts the data from the JSON based on what conforms to that schema definition.
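For illustration only, a minimal sketch of such a utility, assuming the schema definition is just a file listing the expected top-level fields (all names here are hypothetical):
import json

def extract_by_schema(payload, schema):
    # Keep only the keys declared in the schema; ignore any new fields the API adds.
    return {key: payload.get(key) for key in schema["fields"]}

# Hypothetical usage:
# with open("schema.json") as f:    # e.g. {"fields": ["id", "name", "created_at"]}
#     schema = json.load(f)
# record = extract_by_schema(api_response, schema)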

I want to parse out a json with only the information that I want

Are there any tools for parsing a .json file format?
You cannot selectively parse JSON. That makes no sense: how could a parser know where the section you want is without parsing the whole file? JSON usually has to be read as a whole; however, you can selectively process the loaded JSON by your own criteria. It is sufficient to iterate over the activities node, check the value of action_type, and act accordingly. You need to write that code yourself; there is no makePortal() magic in the language to help you with that.
Load the full JSON data and filter it after loading:
import json

with open('data.json') as f:
    data = json.load(f)

# Keep only the activities whose composite_id marks a stage change.
activities = [act for act in data['activities'] if '_STAGE_CHANGE' in act['composite_id']]

Storing pandas DataFrames in SQLAlchemy models

I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system, loading the data into a class object for data processing with pandas. However, as the data became large, we moved to SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
Have fun! Pandas rocks!
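A minimal sketch of the third option, assuming SQLAlchemy 1.4+ with PostgreSQL's JSON type and a hypothetical table name:
import json
import pandas as pd
from sqlalchemy import Column, DateTime, Integer, func
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class UploadedFrame(Base):
    __tablename__ = "uploaded_frames"   # hypothetical table name
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer)
    uploaded_at = Column(DateTime, server_default=func.now())
    data = Column(JSON)                 # the DataFrame serialized as a JSON document

# Storing: pass a dict so the driver writes a real JSON object rather than a quoted string.
#   frame = UploadedFrame(user_id=current_user.id, data=json.loads(df.to_json(orient="split")))
# Loading back into pandas:
#   df = pd.DataFrame(**frame.data)     # orient="split" round-trips via data/index/columns keys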

How can I convert a python array into a JSON string so that it can be stored in the database?

I know that in JavaScript there is the JSON.stringify command, but is there something like this in Python for Pyramid applications? Right now I have a view callable that takes an uploaded STL file and parses it into a format like this: data = [[[x1, x2, x3], ...], [[v1, v2, v3], ...]]. How can I convert this into a JSON string so that it can be stored in an SQLite database? Can I insert the JavaScript stringify command into my views.py file? Is there an easier way to do this?
You can use the json module to do this:
import json
data_str = json.dumps(data)
There are other array representations that can be stored in a database as well (see pickle).
However, if you're actually constructing a database, you should know that it's considered a violation of basic database principles (first normal form) to store multiple data items in a single value in a relational database. What you should do is decompose the array into rows (and possibly separate tables) and store a single value in each "cell". That will allow you to query and analyze the data using SQL.
If you're not trying to build an actual database (if the array is completely opaque to your application and you'll never want to search, sort, aggregate, or report by the values inside the array) you don't need to worry so much about normal form but you may also find that you don't need the overhead of an SQL database.
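For instance, a minimal sketch of that decomposition using sqlite3, with a hypothetical table that stores one triple per row (the "x"/"v" labels just mirror the names in the question):
import sqlite3

# Placeholder for the parsed STL arrays: data = [[[x1, x2, x3], ...], [[v1, v2, v3], ...]]
data = [[[0.0, 0.0, 1.0]], [[1.0, 2.0, 3.0]]]

conn = sqlite3.connect("model.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS triples (kind TEXT, a REAL, b REAL, c REAL)")

for kind, triples in zip(("x", "v"), data):
    # One row per triple instead of one opaque blob per upload.
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?, ?)",
                     [(kind, a, b, c) for a, b, c in triples])
conn.commit()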
You can also use cjson; it is faster than the standard json library.
import cjson
json_str = cjson.encode(data)  # encode the data structure, not a pre-built string
