I've been trying to normalize a JSON file and am looking for a Python (pandas) or PySpark script, as generic as possible, that can extract data from a deeply nested MongoDB JSON document (it comes from a third-party API and is saved in MongoDB) and return it as a relational dataset that we can consume from the data lake.
There are a lot of records and fields, so a single dataframe won't do. The layout also doesn't follow a consistent pattern.
Could you please help us?
What is the best-practice way to do this and, if possible, to do it recursively?
Below is a chunk of the JSON file:
https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json
We expect multiple dataframes that link to each other so we can consume the data like a relational database. The output files should also be laid out like database tables.
Thanks a lot for your help!
A way to approach this problem would be using the json module to deserialize the data into a python dictionary.
# Get the data
import urllib.request as urllib
link = "https://raw.githubusercontent.com/migueelcruz/sample_json/main/sample.json"
f = urllib.urlopen(link)
myfile = f.read()
# Deserialize
import json
data = json.loads(myfile)
data
Now you can get at the data using Python dictionary syntax.
For example, to get eventos, which is under nfe, which is under dados:
data["dados"]["nfe"]["eventos"]
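For the recursive part of the question, here is a minimal sketch of one possible approach, not a definitive implementation: walk the dictionary, keep scalar fields on the current row, and push every nested dict or list into a child table that points back to its parent. The function name `normalize`, the `_id` / `_parent_table` / `_parent_id` columns, and the sample fields `numero` and `tipo` are all made up for illustration; only the `dados` / `nfe` / `eventos` path comes from the snippet above.

```python
import itertools

def normalize(record, table, tables, counter, parent=None):
    """Recursively split a nested dict into flat rows grouped by table name."""
    row = {"_id": next(counter)}  # surrogate key, assumed for this sketch
    if parent is not None:
        row["_parent_table"], row["_parent_id"] = parent
    for key, value in record.items():
        child_table = f"{table}_{key}"
        if isinstance(value, dict):
            # A nested object becomes one row in a child table.
            normalize(value, child_table, tables, counter, (table, row["_id"]))
        elif isinstance(value, list):
            # A list becomes one child row per element.
            for item in value:
                if not isinstance(item, dict):
                    item = {"value": item}  # wrap scalars so they fit in a row
                normalize(item, child_table, tables, counter, (table, row["_id"]))
        else:
            row[key] = value  # scalar field stays on the current row
    tables.setdefault(table, []).append(row)
    return tables

# Hypothetical document shaped like the path used above.
document = {"dados": {"nfe": {"numero": 1, "eventos": [{"tipo": "a"}, {"tipo": "b"}]}}}
tables = normalize(document, "root", {}, itertools.count(1))

for name, rows in tables.items():
    print(name, rows)
```

Each value in `tables` is a list of flat dicts, so it can be passed to `pd.DataFrame(rows)` to get one dataframe per table, and the parent columns act as the foreign keys between them.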
Related
I created an API using FastAPI that returns JSON. At first, I converted the DataFrame to JSON with the pandas .to_json() method, which let me choose the correct "orient" parameter. That saved a .json file, which I then opened so FastAPI could return it, as follows:
DATA2.to_json("json_records.json",orient="records")
with open('json_records.json', 'r') as f:
    data = json.load(f)
return(data)
This worked perfectly, but I was told that my script shouldn't save any files, since it will be running on my company's server, so I had to turn the dataframe into JSON directly and return it. I tried doing this:
data = DATA2.to_json(orient="records")
return(data)
But now the API's output is JSON full of "\". I guess there is a problem with the parsing, but I can't find a way to do it properly.
The output now looks like this:
"[{\"ExtraccionHora\":\"12:53:00\",\"MiembroCompensadorCodigo\":117,\"MiembroCompensadorDescripcion\":\"OMEGA CAPITAL S.A.\",\"CuentaCompensacionCodigo\":\"1143517\",\"CuentaNeteoCodigo\":\"160234117\",\"CuentaNeteoDescripcion\":\"UNION FERRO SRA A\",\"ActivoDescripcion\":\"X17F3\",\"ActivoID\":8,\"FinalidadID\":2,\"FinalidadDescripcion\":\"Margenes\",\"Cantidad\":11441952,\"Monto\":-16924935.3999999985,\"Saldo\":-11379200.0,\"IngresosVerificados\":11538288.0,\"IngresosNoVerificado\":0.0,\"MargenDelDia\":0.0,\"SaldoConsolidadoFinal\":-16765847.3999999985,\"CuentaCompensacionCodigoPropia\":\"80500\",\"SaldoCuentaPropia\":-7411284.3200000003,\"Resultado\":\"0\",\"MiembroCompensadorID\":859,\"CuentaCompensacionID\":15161,\"CuentaNeteoID\":7315285}.....
What would be a proper way of turning my dataframe into a JSON using the "records" orient, and then returning it as the FastAPI output?
Thanks!
Update: I changed the to_json() method to to_dict() with the same parameters and it seems to work, but I don't know if it's correct.
data = DATA2.to_dict(orient="records")
return(data)
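A likely explanation for the backslashes: to_json() returns a str that already contains JSON text, and FastAPI then serializes whatever the endpoint returns to JSON once more, so the inner quotes get escaped. The sketch below reproduces that with the standard json module only. Here `json.dumps(records)` is a stand-in for `DATA2.to_json(orient="records")`, and the two fields are taken from the output shown above.

```python
import json

records = [{"ActivoDescripcion": "X17F3", "ActivoID": 8}]

# Stand-in for DATA2.to_json(orient="records"): a str holding JSON text.
json_text = json.dumps(records)

# Serializing that str a second time (what happens when the str is returned
# from the endpoint) escapes every inner quote.
double_encoded = json.dumps(json_text)

# Turning the text back into Python objects first, or never producing text
# at all as to_dict(orient="records") does, leaves a single round of encoding.
single_encoded = json.dumps(json.loads(json_text))

print(double_encoded)
print(single_encoded)
```

So returning the list from to_dict(orient="records") is a reasonable fix. Returning `json.loads(DATA2.to_json(orient="records"))` is another option if you would rather let pandas handle the conversion of its own types.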
I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to add the functionality that, once their table is created, they can export the data to a CSV/Excel file so we don't have to store their credentials (logins & schedules in a database). Is this possible? If so, how can I do it, preferably using Python?
This is not the exact answer but rather steps for you to follow in order to get a solution:
1 Read data from json. some_dict = json.loads(json_string)
2 Appropriate code to get the data from dictionary (sort/ conditions etc) and get that data in a 2D array (list)
3 Save that list as csv: https://realpython.com/python-csv/
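The three steps above can be sketched as follows. This is only an illustration: the `day` and `shift` fields are invented sample data, and the CSV is written to an in-memory buffer rather than a file, on the assumption that the goal is to hand the text back to the user as a download without storing anything on the server.

```python
import csv
import io
import json

# Invented sample input standing in for the real schedule JSON.
json_string = '[{"day": "Mon", "shift": "09:00-17:00"}, {"day": "Tue", "shift": "12:00-20:00"}]'

# Step 1: read the data from JSON.
schedule = json.loads(json_string)

# Step 2: pick the fields you need and build a 2D list (one inner list per row).
header = ["day", "shift"]
rows = [[entry[column] for column in header] for entry in schedule]

# Step 3: write the rows as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

`csv_text` can then be sent as the body of the HTTP response by whichever web framework the site uses.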
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of
import json
import pandas as pd
file = 'data.json'
with open(file) as j:
    json_data = json.load(j)
# Build the frame from the parsed data, not from the file handle
df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")
I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried using a python module called sqlbiter but again, like the others, was never really able to output or convert the file.
I'm not sure where to go now. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize pandas module for this data processing task as follows:
First, you need to read the JSON file using with, open and json.load.
Second, you need to change the format of your file a bit by changing the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, you can now utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, you can utilize pandas's to_csv function to write your DataFrame as a CSV file into disk.
It would look something like this:
import pandas as pd
import json
with open('./airports.json.txt','r') as f:
    j = json.load(f)
l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
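Since the question also mentions SQLite, the same list of dictionaries can be loaded with the standard sqlite3 module. This is a rough sketch under assumptions: the `code` and `name` columns and the two sample airports are invented, and an in-memory database is used so the example is self-contained. With the real file you would use its actual keys and a file path in `connect()`.

```python
import sqlite3

# Invented stand-in for the parsed file: one main key per airport.
j = {
    "AAA": {"code": "AAA", "name": "Example One"},
    "BBB": {"code": "BBB", "name": "Example Two"},
}
rows = list(j.values())  # list of dicts, as in the pandas version

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports (code TEXT PRIMARY KEY, name TEXT)")

# Named placeholders are filled from each dict's keys.
conn.executemany("INSERT INTO airports (code, name) VALUES (:code, :name)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM airports").fetchone()[0]
print(count)
```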
You need to load your JSON file and parse it so that all the fields are available (for example, load the contents into a dictionary). Then you could use pyodbc to write those fields to the database, or write them to a CSV file with the csv module (import csv first).
But this is just a general idea. You need to study Python and how to do each step.
For instance, for writing to the database you could do something like:
for i in range(0,max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
cursor1.commit()
The URL returns CSV-formatted data. I am trying to get the data and push it into a database. However, I am unable to read the data: it only prints the header of the file, not the complete CSV data. Is there a better option?
#!/usr/bin/python3
import pandas as pd
data = pd.read_csv("some-url")  # URL not provided due to security restrictions.
for row in data:
    print(row)
You can iterate through the results of df.to_dict(orient="records"):
data = pd.read_csv("some-url")
for row in data.to_dict(orient="records"):
    # For each loop, `row` will be filled with a key:value dict where each
    # key takes the value of the column name.
    # Use this dict to create a record for your db insert, eg as raw SQL or
    # to create an instance for an ORM like SQLAlchemy.
    pass
I do a similar thing to pre-format data for SQLAlchemy inserts, although I'm using Pandas to merge data from multiple sources rather than just reading the file.
Side note: There are plenty of other ways to do this without Pandas by just iterating through the lines of the file. However, Pandas's intuitive handling of CSVs makes it an attractive shortcut for what you need.
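As one example of the no-Pandas route, the standard csv module gives the same "one dict per row" shape. This is a small sketch, assuming the CSV body has already been downloaded into a string. The `id` and `name` columns are invented sample data.

```python
import csv
import io

# Invented stand-in for the text downloaded from the URL.
csv_text = "id,name\n1,alpha\n2,beta\n"

# DictReader uses the header line as keys and yields one dict per data row.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for row in records:
    print(row)
```

Unlike pd.read_csv, DictReader does no type inference, so every value arrives as a string and needs converting before the insert if the column is numeric.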
Scenario: I've 85789142 JSON documents, and a textfile with 32227957 items.
The textfile would look like:
url1
url2
url3
And a sample JSON document:
{"key1":"value1","key2":"value2","url":"some_url"}
I want to find the JSON documents corresponding to the items in the textfile.
What I've done:
import json
textfile_rdd = sc.textFile("path/to/textfile.txt")
urls = set(textfile_rdd.collect())
json_files_rdd = sc.textFile("path/to/the/directory/of/json/files")
json_rdd = json_files_rdd.filter(lambda x: (json.loads(x)).get("url") in urls )
This code works for a textfile of small size (I've tried with 500000 documents).
Currently I'm splitting my textfile of 32227957 into smaller files, are there any better approaches?
I would suggest parsing your JSON files with Spark SQL and loading the textfile as a one-column DataFrame too. Then you could simply join the two datasets, without any need to collect the first file to the driver, which is your scalability issue now.
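A minimal sketch of that join, assuming the JSON documents are stored one per line (as the sc.textFile code in the question implies) and that an existing SparkSession is passed in. The function name `match_documents` is made up for illustration.

```python
def match_documents(spark, urls_path, json_path):
    """Return the JSON documents whose `url` appears in the text file."""
    # spark.read.text yields a single string column named "value";
    # rename it so it lines up with the "url" field of the documents.
    urls_df = spark.read.text(urls_path).withColumnRenamed("value", "url")

    # spark.read.json infers a schema from the line-delimited JSON files.
    docs_df = spark.read.json(json_path)

    # A left semi join keeps only documents with a matching url and returns
    # only the document columns, so nothing is collected to the driver.
    return docs_df.join(urls_df, on="url", how="left_semi")
```

With the paths from the question this would be called as match_documents(spark, "path/to/textfile.txt", "path/to/the/directory/of/json/files"), and the resulting DataFrame can be written out or processed further in the cluster.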