Turning JSON into dataframe with pandas - python

I'm trying to get a data frame but keep running into various error messages depending on the arguments I specify in read_json after I specify my file.
I've run through many of the arguments in the pandas.read_json documentation, but haven't been able to identify a solution.
import pandas
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
pandas.read_json(json_file)
I'm trying to get a data frame but keep running into various error messages depending on the arguments I specify in read_json after I specify my file.

Because the JSON is not directly convertible to a DataFrame. read_json works only with a few formats, selected by the orient parameter. Your JSON doesn't follow any of the allowed formats, so you need to manipulate it before converting it to a data frame.
Let's take a high-level look at your JSON:
{
    "displayFieldName": ...,
    "fieldAliases": {...},
    "fields": {...},
    "features": [...]
}
I'm gonna hazard a guess and assume the features node is what you want. Let's dive deeper into features:
"features": [
    {
        "attributes": {
            "OBJECTID": 1,
            "SHELTER_ID": 223259,
            ...
        }
    },
    {
        "attributes": {
            "OBJECTID": 2,
            "SHELTER_ID": 223331,
            ...
        }
    },
    ...
]
features contains a list of objects, each having an attributes node. The data contained in the attributes node is what you actually want.
Here's the code
import pandas as pd
import json
from urllib.request import urlopen
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
data = urlopen(json_file).read()
raw_json = json.loads(data)
formatted_json = [feature['attributes'] for feature in raw_json['features']]
formatted_json is now a list of dictionaries containing the data we are after. It is no longer JSON. To create the data frame:
df = pd.DataFrame(formatted_json)
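To see what the extraction step produces without hitting the network, here is a minimal sketch on a trimmed-down payload. The sample string is made up for illustration (only OBJECTID and SHELTER_ID come from the output above, and the displayFieldName value is a placeholder); the real response has many more fields.

```python
import json

# Hypothetical, trimmed-down response with the same overall shape as the real payload.
sample_response = """
{
  "displayFieldName": "example",
  "features": [
    {"attributes": {"OBJECTID": 1, "SHELTER_ID": 223259}},
    {"attributes": {"OBJECTID": 2, "SHELTER_ID": 223331}}
  ]
}
"""

raw_json = json.loads(sample_response)

# Keep only the "attributes" object of every feature.
formatted_json = [feature["attributes"] for feature in raw_json["features"]]

# The distinct keys are the column names a data frame built from these records would get.
columns = sorted({key for record in formatted_json for key in record})

print(formatted_json)
print(columns)
```

Passing formatted_json to pd.DataFrame, as in the line above, should then give one row per feature and one column per attribute.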

Related

How to manipulate and slice multi-dimensional JSON data in Python?

I'm trying to set up a convenient system for storing and analyzing data from experiments. For the data files I use the following JSON format:
{
    "sample_id": "",
    "timestamp": "",
    "other_metadata1": "",
    "measurements": {
        "type1": {
            "timestamp": "",
            "other_metadata2": "",
            "data": {
                "parameter1": [1,2,3],
                "parameter2": [4,5,6]
            }
        },
        "type2": { ... }
    }
}
Now for analyzing many of these files, I want to filter for sample metadata and measurement metadata to get a subset of the data to plot. I wrote a function like this:
import copy

def get_subset(data_dict, include_samples={}, include_measurements={}):
    # Start with a copy of all datasets
    subset = copy.deepcopy(data_dict)
    # Include samples if they satisfy certain properties
    for prop, req in include_samples.items():
        subset = {file: sample for file, sample in subset.items() if sample[prop] == req}
    # Filter by measurement properties
    for file, sample in subset.items():
        measurements = sample['measurements'].copy()
        for prop, req in include_measurements.items():
            # measurements is a dict keyed by measurement type, so filter its items
            measurements = {name: meas for name, meas in measurements.items() if meas[prop] == req}
        # Replace the measurements dict
        sample['measurements'] = measurements
    return subset
While this works, I feel like I'm re-inventing the wheel of something like pandas. I would like to have more functionality, like dropping all NaN values, excluding based on metadata, etc., all of which is available in pandas. However, my data format is not compatible with the 2D nature of that.
Any suggestions on how to go about manipulating and slicing such data structures without reinventing a lot of things?
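One possible direction, not taken from the thread itself, is to flatten the nested files into one flat record per (file, measurement type) pair, which is the 2D shape tabular tools expect. The sketch below uses only the standard library; flatten_measurements, the meas_ prefix, and the example values are all hypothetical names chosen for illustration.

```python
def flatten_measurements(data_dict):
    """Return one flat record per (file, measurement type) pair."""
    records = []
    for file, sample in data_dict.items():
        # Sample-level metadata: every key except the nested measurements.
        sample_meta = {k: v for k, v in sample.items() if k != "measurements"}
        for meas_type, meas in sample.get("measurements", {}).items():
            record = {"file": file, "measurement_type": meas_type}
            record.update(sample_meta)
            # Measurement-level metadata gets a prefix to avoid clashes such as "timestamp".
            record.update({f"meas_{k}": v for k, v in meas.items() if k != "data"})
            record["data"] = meas.get("data", {})
            records.append(record)
    return records


# Hypothetical example data in the format shown in the question.
data_dict = {
    "a.json": {
        "sample_id": "S1",
        "timestamp": "t0",
        "measurements": {
            "type1": {"timestamp": "t1", "data": {"parameter1": [1, 2, 3]}},
            "type2": {"timestamp": "t2", "data": {"parameter1": [4, 5, 6]}},
        },
    }
}

records = flatten_measurements(data_dict)
type1_only = [r for r in records if r["measurement_type"] == "type1"]
print(type1_only)
```

A list of flat dictionaries like this can be handed to pandas.DataFrame, after which filtering on sample or measurement metadata becomes ordinary column selection.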

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
    {
        "432": [
            {
                "atttr1": null,
                "atttr2": "7DG6",
                "id":432,
                "score": 100
            }
        ]
    },
    {
        "238": [
            {
                "atttr1": null,
                "atttr2": "7SS8",
                "id":432,
                "score": 100
            }
        ]
    }
]
As output, I am looking for something like the following, in CSV form:
atttr1,atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details as shown below, but I don't want to hard-code 432 or 238 in the lambda expression, because in a bigger JSON these keys will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['432'])).first())
print(inputDF.rdd.map(lambda x:(x['238'])).first())
I also tried registering a temp table named "test", but it failed with an error saying that element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using Spark 2.4.
Without using pyspark features, you can do it like this:
import json

data = json.loads(json_str)  # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers
for item in data:
    # each item has a single key; loop over the list stored under it
    for obj in list(item.values())[0]:
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those in a variable or write them out to CSV instead of, or in addition to, printing them.
And if you're just looking to dump that to csv, see this answer.
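For the write-to-CSV variant, a minimal sketch with the standard csv module could look like this. It loops over every key and every object, so nothing is hard-coded; the io.StringIO buffer is only a stand-in for a real file, and json_str repeats the sample data from the question.

```python
import csv
import io
import json

# Sample data copied from the question.
json_str = """
[
  {"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
  {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}
]
"""

data = json.loads(json_str)
columns = ["atttr1", "atttr2", "id", "score"]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
for item in data:
    # Iterate over every key and every object under it, whatever the key is called.
    for objects in item.values():
        writer.writerows(objects)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the csv module writes None as an empty field rather than the text "None" or "null", so map those values first if you need a literal null in the file.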

Preserve Key Order: Loading JSON data into Snowflake VARIANT column using Python script

I was trying to extract the JSON response data from an API and load it into Snowflake VARIANT column using Python Script.
While loading the data, I noticed that the keys are re-arranged in alphabetical order.
Python/Postman data:
{
    "Data": [
        {
            "CompanyID": 3522,
            "MarketID": 23259,
            "MarketName": "XYZ_Market",
            "LocationID": 17745,
            "LocationName": "XYZ_Location"
        }
    ]
}
Snowflake data:
{
    "Data": [
        {
            "CompanyID": 3522,
            "LocationID": 17745,
            "LocationName": "XYZ_Location",
            "MarketID": 23259,
            "MarketName": "XYZ_Market"
        }
    ]
}
I was using the PARSE_JSON() function to load the data into Snowflake. Is there any way to preserve the order of the keys?
In Python 3.7+ dictionaries are guaranteed to maintain their insertion order (in CPython 3.6 this was an implementation detail). However, as noted in the Snowflake docs, JSON objects are unordered. So you may be limited by how your data is stored.
If you need to maintain order, consider an array of arrays instead:
[
    ["CompanyID", 3522],
    ["MarketID", 23259],
    ["MarketName", "XYZ_Market"],
    ["LocationID", 17745],
    ["LocationName", "XYZ_Location"]
]
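On the Python side, turning a record into that array-of-arrays form before sending it could be sketched as follows. The record is the sample from the question; how the resulting string is then loaded into the VARIANT column is left out, since that depends on your loading script.

```python
import json

# Sample record from the question, in its original key order.
record = {
    "CompanyID": 3522,
    "MarketID": 23259,
    "MarketName": "XYZ_Market",
    "LocationID": 17745,
    "LocationName": "XYZ_Location",
}

# dict.items() yields pairs in insertion order, and JSON arrays keep their order.
ordered_pairs = [[key, value] for key, value in record.items()]
payload = json.dumps(ordered_pairs)

# Reading it back: the key order survives because only arrays were stored.
restored = dict(json.loads(payload))

print(payload)
print(list(restored))
```

The trade-off is that consumers can no longer look fields up by key directly in the stored value; they have to scan the pairs or rebuild an object first.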

How to compare a json with a CSV file

I have a JSON payload that is used for a service request. After processing, that payload (JSON) is stored in S3, and through Athena we can download the data in CSV format. In the actual scenario there are more than 100 fields. I want to verify their values with an automated script instead of manually.
Say my sample payload is similar to the following:
{
    "BOOK": {
        "serialno": "123",
        "author": "xyz",
        "yearofpublish": "2015",
        "price": "16"
    },
    "Author": [
        {
            "isbn": "xxxxx", "title": "first", "publisher": "xyz", "year": "2020"
        },
        {
            "isbn": "yyyy", "title": "second", "publisher": "zmy", "year": "2019"
        }
    ]
}
The sample CSV will be like the following:
Can anyone please help me with how exactly I can do this in Python? Maybe with a library or a dictionary?
It looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns, you'll need some way to represent that mapping. Based only on the example, this works:
import json

# some_json_file is a placeholder for the path to your JSON payload
with open(some_json_file, 'r') as fin:
    j = json.load(fin)

result = []
for author in j['Author']:
    # one flat row per author, with the book fields repeated on every row
    val = {'book_serialno': j['BOOK']['serialno'],
           'book_author': j['BOOK']['author'],
           'book_yearofpublish': j['BOOK']['yearofpublish'],
           'book_price': j['BOOK']['price'],
           'author_isbn': author['isbn'],
           'author_title': author['title'],
           'author_publisher': author['publisher'],
           'author_year': author['year']}
    result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well. Depends how you want to use it later on. To write to a CSV:
import csv

# newline='' keeps the csv module from writing extra blank lines on Windows
with open(some_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(result[0].keys())
    writer.writerows(r.values() for r in result)
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
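If the goal is to verify the Athena export rather than to produce a CSV, the flattened rows can be compared against the downloaded file directly. The following is a rough sketch under a few assumptions: the CSV uses the column names from the mapping above, the rows come back in the same order, and all values can be compared as strings. diff_rows and the sample data are hypothetical.

```python
import csv
import io


def diff_rows(expected_rows, csv_text):
    """Return a list of (row_index, column, expected, actual) mismatches."""
    actual_rows = list(csv.DictReader(io.StringIO(csv_text)))
    mismatches = []
    if len(actual_rows) != len(expected_rows):
        mismatches.append((None, "row_count", len(expected_rows), len(actual_rows)))
    for index, (expected, actual) in enumerate(zip(expected_rows, actual_rows)):
        for column, value in expected.items():
            # CSV values are always read as strings, so compare as strings.
            if str(value) != actual.get(column):
                mismatches.append((index, column, value, actual.get(column)))
    return mismatches


# Hypothetical data: two flattened rows and a CSV export with one wrong price.
expected_rows = [
    {"book_serialno": "123", "book_price": "16", "author_isbn": "xxxxx"},
    {"book_serialno": "123", "book_price": "16", "author_isbn": "yyyy"},
]
csv_text = (
    "book_serialno,book_price,author_isbn\n"
    "123,16,xxxxx\n"
    "123,17,yyyy\n"
)

mismatches = diff_rows(expected_rows, csv_text)
print(mismatches)
```

With a real download you would read the file contents instead of using an inline string; if row order is not guaranteed, keying both sides on something like the ISBN before comparing would be safer.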

Format JSON objects in python

I have JSON objects in a text file (C:\data.txt). There are millions of records; I'm using just one record as an example. I want to see only the data in my text file, like:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I don't want the parentheses, etc.
Once I get the clean data, the plan is to import it from the text file (say C:\data2.txt) into a SQL database.
This is the format of the JSON object:
{
    "status":"ok",
    "items":[
        {
            "1":{
                "Work_Phone":"123-567-9876",
                "Name_Part":[
                    "TEST1",
                    "TEST"
                ],
                "Residence_Phone":"717-567-9876",
                "Mailing_City":"Harrisburg",
                "Mailing_Street_Address_line_1":"",
                "Cell_Phone":null,
                "Mailing_Country_AND_Province_OR_State":"US_PA"
            }
        }
    ]
}
Can someone please help with Python code to format this JSON object and export it to a text file?
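You can use
import simplejson as json  # the standard-library json module offers the same load/loads functions
Then you can open your file and load it into a Python dictionary:
with open("C:/data.txt", "r") as f:  # file() only exists in Python 2
    data = json.load(f)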
But this only works when the JSON objects are stored in an array in your file. So it has to look like this:
[{ ... first record ...},
{... second record ...},
...,
{... last record ...}]
Then data holds an array of dictionaries. Now you can write the records to another file:
with open("output.txt", "w") as g:
    for d in data:
        for i in d["items"]:
            for k in i.keys():
                # one line per record: its key followed by its values (lists are written as-is)
                g.write(k + " " + " ".join(str(v) for v in i[k].values()) + "\n")
If done well, the file output.txt contains the lines. The details might be difficult because each item seems to contain some arrays.
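To handle the arrays mentioned above, a minimal sketch could flatten list values and build one line per record. It makes two assumptions read off the desired output line in the question: JSON null is printed as the text null, and empty strings are skipped. record_to_line is a hypothetical helper name, and raw repeats the single sample record.

```python
import json


def record_to_line(key, record):
    """Build one space-separated line from a record's key and its values."""
    parts = [key]
    for value in record.values():
        # Treat scalars and lists the same way by wrapping scalars in a list.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is None:
                parts.append("null")  # keep JSON null visible, as in the desired output
            elif v != "":
                parts.append(str(v))  # empty strings are skipped (assumption)
    return " ".join(parts)


# Sample record copied from the question.
raw = """
{
  "status": "ok",
  "items": [
    {"1": {
      "Work_Phone": "123-567-9876",
      "Name_Part": ["TEST1", "TEST"],
      "Residence_Phone": "717-567-9876",
      "Mailing_City": "Harrisburg",
      "Mailing_Street_Address_line_1": "",
      "Cell_Phone": null,
      "Mailing_Country_AND_Province_OR_State": "US_PA"
    }}
  ]
}
"""

data = json.loads(raw)
lines = [
    record_to_line(key, record)
    for item in data["items"]
    for key, record in item.items()
]
print(lines[0])
```

Skipping empty values shifts the positions of later fields, so if the lines are going to be imported into SQL, a delimiter-based format with a fixed number of columns is probably the safer choice.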
