How to manipulate and slice multi-dimensional JSON data in Python? - python

I'm trying to set up a convenient system for storing and analyzing data from experiments. For the data files I use the following JSON format:
{
    "sample_id": "",
    "timestamp": "",
    "other_metadata1": "",
    "measurements": {
        "type1": {
            "timestamp": "",
            "other_metadata2": "",
            "data": {
                "parameter1": [1, 2, 3],
                "parameter2": [4, 5, 6]
            }
        },
        "type2": { ... }
    }
}
Now for analyzing many of these files, I want to filter for sample metadata and measurement metadata to get a subset of the data to plot. I wrote a function like this:
import copy

def get_subset(data_dict, include_samples={}, include_measurements={}):
    # Start with a copy of all datasets
    subset = copy.deepcopy(data_dict)
    # Include samples only if they satisfy the requested properties
    for prop, req in include_samples.items():
        subset = {file: sample for file, sample in subset.items() if sample[prop] == req}
    # Filter by measurement properties
    for file, sample in subset.items():
        measurements = sample['measurements'].copy()
        for prop, req in include_measurements.items():
            measurements = {name: meas for name, meas in measurements.items() if meas[prop] == req}
        # Replace the measurements with the filtered subset
        sample['measurements'] = measurements
    return subset
While this works, I feel like I'm re-inventing the wheel of something like pandas. I would like to have more functionality, like dropping all NaN values, excluding based on metadata, etc., all of which is available in pandas. However, my data format is not compatible with the 2D nature of a DataFrame.
Any suggestions on how to go about manipulating and slicing such data structures without reinventing a lot of things?
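One way to get pandas-style slicing without reinventing it is to flatten the nested files into a single long-form table, one row per data point; a minimal sketch, assuming data_dict maps file names to parsed JSON dicts of the format above, with placeholder filter values:

import pandas as pd

records = []
for file, sample in data_dict.items():
    for meas_type, meas in sample['measurements'].items():
        for param, values in meas['data'].items():
            for i, value in enumerate(values):
                records.append({
                    'file': file,
                    'sample_id': sample['sample_id'],
                    'sample_timestamp': sample['timestamp'],
                    'measurement_type': meas_type,
                    'measurement_timestamp': meas['timestamp'],
                    'parameter': param,
                    'point': i,
                    'value': value,
                })

df = pd.DataFrame(records)

# Metadata filtering, NaN handling, etc. are now one-liners
# ('some_sample' and 'type1' are placeholder filter values):
subset = df[(df['sample_id'] == 'some_sample') & (df['measurement_type'] == 'type1')]
subset = subset.dropna(subset=['value'])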

Related

Combining geoJSON/topoJSON with my own JSON generated from tabular data

I have a very large topoJSON file I generated from pulling census shapefiles for ZCTA and converting with mapshaper. https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2021&layergroup=ZIP+Code+Tabulation+Areas
An example of this homemade JSON looks like:
zips = [
    {
        "zip": "01001",
        "quantile": "high"
    },
    {
        "zip": "01002",
        "quantile": "high"
    },
    {
        "zip": "52155",
        "quantile": "low"
    },
    {
        "zip": "52156",
        "quantile": "low"
    }
]
I would like to somehow fuse these JSONs together so that I can eventually fill the zip code polygons with specific colors based on the quantile value in my zips. Or is there perhaps a better way to essentially use my homemade JSON as a lookup, so that I can find where the zip from my topoJSON matches my homemade JSON and extract the needed value?
In Python, what I have tried so far is to iterate through the geometries and then loop through the lookup (the homemade JSON) each time to check whether the zip code values match; if they do, I append that value. This, however, takes a while to process, and I was curious whether there are better methods for doing something like this.
# loop through topojson file
for i in data['objects']['tl_2021_us_zcta520']['geometries']:
    # loop through lookup homemade json and if the zip found matches, we will append a new key to our properties in the topojson
    for j in lkp:
        if i['properties']['zip'] == j['zip']:
            i['properties']['quantile'] = j['quantile']
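One common way to speed this up is to build a plain dict keyed by zip code once, so each geometry needs a single O(1) lookup instead of a scan of lkp; a minimal sketch reusing the data and lkp variables from the snippet above:

# Build a zip -> quantile lookup once, then do a single pass over the geometries
# instead of re-scanning lkp for every feature.
quantile_by_zip = {j['zip']: j['quantile'] for j in lkp}

for geom in data['objects']['tl_2021_us_zcta520']['geometries']:
    zip_code = geom['properties']['zip']
    if zip_code in quantile_by_zip:
        geom['properties']['quantile'] = quantile_by_zip[zip_code]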

Converting csv to nested Json using python

I want to convert a CSV file to a JSON file.
I have a large amount of data in the CSV file.
CSV Column Structure
This is my column structure in the CSV file. It has 200+ records.
id.oid libId personalinfo.Name personalinfo.Roll_NO personalinfo.addr personalinfo.marks.maths personalinfo.marks.physic clginfo.clgName clginfo.clgAddr clginfo.haveCert clginfo.certNo clginfo.certificates.cert_name_1 clginfo.certificates.cert_no_1 clginfo.certificates.cert_exp_1 clginfo.certificates.cert_name_2 clginfo.certificates.cert_no_2 clginfo.certificates.cert_exp_2 clginfo.isDept clginfo.NoofDept clginfo.DeptDetails.DeptName_1 clginfo.DeptDetails.location_1 clginfo.DeptDetails.establish_date_1 _v updatedAt.date
Expected Json
[{
    "id":
    {
        "$oid": "00001"
    },
    "libId": 11111,
    "personalinfo":
    {
        "Name": "xyz",
        "Roll_NO": 101,
        "addr": "aa bb cc ddd",
        "marks":
        {
            "maths": 80,
            "physic": 90
            .....
        }
    },
    "clginfo":
    {
        "clgName": "pqr",
        "clgAddr": "qwerty",
        "haveCert": true, //this is boolean true or false
        "certNo": 1, //this could be 1-10
        "certificates":
        [
            {
                "cert_name_1": "xxx",
                "cert_no_1": 12345,
                "cert_exp.1": "20/2/20202"
            },
            {
                "cert_name_2": "xxx",
                "cert_no_2": 12345,
                "cert_exp_2": "20/2/20202"
            },
            ......//could be up to 10
        ],
        "isDept": true, //this is boolean true or false
        "NoofDept": 1, //this could be 1-10
        "DeptDetails":
        [
            {
                "DeptName_1": "yyy",
                "location_1": "zzz",
                "establish_date_1": "1/1/1919"
            },
            ......//up to 10 records
        ]
    },
    "__v": 1,
    "updatedAt":
    {
        "$date": "2022-02-02T13:35:59.843Z"
    }
}]
I have tried using pandas, but I'm getting the output below.
My output
[{
    "id.$oid": "00001",
    "libId": 11111,
    "personalinfo.Name": "xyz",
    "personalinfo.Roll_NO": 101,
    "personalinfo.addr": "aa bb cc ddd",
    "personalinfo.marks.maths": 80,
    "personalinfo.marks.physic": 90,
    "clginfo.clgName": "pqr",
    "clginfo.clgAddr": "qwerty",
    "clginfo.haveCert": true,
    "clginfo.certNo": 1,
    "clginfo.certificates.cert_name_1": "xxx",
    "clginfo.certificates.cert_no_1": 12345,
    "clginfo.certificates.cert_exp.1": "20/2/20202",
    "clginfo.certificates.cert_name_2": "xxx",
    "clginfo.certificates.cert_no_2": 12345,
    "clginfo.certificates.cert_exp_2": "20/2/20202",
    "clginfo.isDept": true,
    "clginfo.NoofDept": 1,
    "clginfo.DeptDetails.DeptName_1": "yyy",
    "clginfo.DeptDetails.location_1": "zzz",
    "clginfo.DeptDetails.establish_date_1": "1/1/1919",
    "__v": 1,
    "updatedAt.$date": "2022-02-02T13:35:59.843Z"
}]
I am new to Python and only know the basics. Please help me get this output.
200+ records is really tiny, so even a naive solution is good.
It can't be totally generic, because there is no way to tell from the headers alone that certificates is a list, unless we rely on all names under certificates having _N at the end.
Proposed solution using only basic Python (a concrete sketch follows the steps):
- Read the header row and split all column names on the period. Iterate over the resulting list and create nested dicts with the appropriate keys and dummy values (if you want to handle lists: create an array if the current key ends with _N and use N as an index).
- For all data rows:
  - clone the dictionary with dummy values,
  - for each column, use the split keys from above to put the value into the corresponding dict (same approach as above for lists),
  - append the dictionary to the list of rows.
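A minimal sketch of that idea using only the csv module; the input file name is an assumption, it builds each nested record directly from the dotted column names rather than cloning a template of dummy values, and all values stay strings (the booleans/numbers in the expected output would still need explicit conversion):

import csv
import json
import re

LIST_KEY = re.compile(r'_(\d+)$')

def insert(record, dotted_key, value):
    # Walk/create nested dicts for every part of the dotted name except the last.
    *parents, leaf = dotted_key.split('.')
    node = record
    for i, part in enumerate(parents):
        # The direct parent of a key like cert_name_1 becomes a list of dicts.
        is_last_parent = (i == len(parents) - 1)
        default = [] if is_last_parent and LIST_KEY.search(leaf) else {}
        node = node.setdefault(part, default)
    match = LIST_KEY.search(leaf)
    if isinstance(node, list) and match:
        index = int(match.group(1)) - 1  # cert_name_1 -> element 0
        while len(node) <= index:
            node.append({})
        node[index][leaf] = value
    else:
        node[leaf] = value

rows = []
with open('input.csv', newline='') as f:  # file name is an assumption
    for row in csv.DictReader(f):
        record = {}
        for column, value in row.items():
            insert(record, column, value)
        rows.append(record)

print(json.dumps(rows, indent=2))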

Convert two CSV tables with one-to-many relation to JSON with embedded list of subdocuments

I have two CSV files which have one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
    {
        "main_id": "1",
        "name": "foobar",
        "attributes": [
            {
                "id": "100",
                "name": "color",
                "value": "red",
                "updated_at": "2020-10-10"
            },
            {
                "id": "101",
                "name": "shape",
                "value": "square",
                "updated_at": "2020-10-10"
            },
            {
                "id": "102",
                "name": "size",
                "value": "small",
                "updated_at": "2020-10-10"
            }
        ]
    }
]
I tried using Python and Pandas like:
import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)

attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"

main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)

main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running it on my whole dataset, I can load the individual CSV files without problems, but when trying to modify the data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, the use of apply seems to be pretty slow here.
This is a tough problem, and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you let pandas do the break-out itself.
import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
Now if you do the standard groupby
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(lambda: [])
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
The second approach would be to change it to use Dask and its parallelization. For background, see the discussion of what Dask is and how it is different from pandas.
The third is algorithmic: load up the main file, then go ID by ID, loading only the data for that ID, juggling what is in memory and what is on disk, and saving out partial results as they become available.
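A rough sketch of that third idea, assuming the main.csv / attributes.csv files from the question, string dtypes, and an arbitrary chunk size:

import json
import pandas as pd

# Read everything as strings so each row can be dumped to JSON directly.
main = pd.read_csv('main.csv', dtype=str)

with open('out.jsonl', 'w') as out:  # one JSON document per line
    for main_row in main.to_dict(orient='records'):
        attrs = []
        # Stream attributes.csv in chunks so only a small slice is in memory;
        # this trades repeated I/O for a low memory footprint.
        for chunk in pd.read_csv('attributes.csv', dtype=str, chunksize=100_000):
            match = chunk[chunk['main_id'] == main_row['main_id']]
            attrs.extend(match.drop(columns='main_id').to_dict(orient='records'))
        main_row['attributes'] = attrs
        out.write(json.dumps(main_row) + '\n')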
So in my case I was able to load the original tables into memory, but embedding the attributes exploded the size so that it no longer fit in memory. So I ended up still using Pandas to load the CSV files, but I then generate the output iteratively, row by row, saving each row as a separate JSON document. This means I do not hold one large data structure in memory for a single large JSON.
Another important realization was that it is important to make the related column an index, and that the index has to be sorted, so that querying it is fast (because generally there are duplicate entries in the related column).
I made the following two helper functions:
def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame and not a Series when there is only one matching row.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
import json
import sys

import pandas

main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)

# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')

columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")

How to compare two dicts and find matching values

I'm pulling data from an API for a weather system. The API returns a single JSON object with sensors broken up into two sub-nodes for each sensor. I'm trying to associate two (or more) sensors with their time-stamps. Unfortunately, not every sensor polls every single time (although they're supposed to).
In effect, I have a JSON object that looks like this:
{
    "sensor_data": {
        "mbar": [{
            "value": 1012,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 1011,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 1010,
            "timestamp": "2019-10-31T00:45:00"
        }],
        "temperature": [{
            "value": 10.3,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 10.2,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 10.0,
            "timestamp": "2019-10-31T00:45:00"
        }, {
            "value": 9.8,
            "timestamp": "2019-10-31T01:00:00"
        }]
    }
}
This example shows that I have one extra temperature reading, and this example is a really small one.
How can I take this data and associate a single reading for each timestamp, gathering as much sensor data as I can pull from matching timestamps? Ultimately, I want to export the data into a CSV file, with each row representing a slice in time from the sensor, to be graphed or further analyzed after.
For lists that are exactly the same length, I have a solution:
sensor_id = '007_OHMSS'
sensor_data = read_json('sensor_data.json')  # wrapper function for open and load json
list_a = sensor_data['mbar']
list_b = sensor_data['temperature']
pair_perfect_sensor_lists(sensor_id, list_a, list_b)

def pair_perfect_sensor_lists(sensor_id, list_a, list_b):
    # in this case, list_a will be mbar, list_b will be temperature
    matches = list()
    if len(list_a) == len(list_b):
        for idx, reading in enumerate(list_a):
            mbar_value = reading['value']
            timestamp = reading['timestamp']
            t_reading = list_b[idx]
            t_time = t_reading['timestamp']
            temp_value = t_reading['value']
            print(t_time == timestamp)
            if t_time == timestamp:
                match = {
                    'sensor_id': sensor_id,
                    'mbar_index': idx,
                    'time_index': idx,
                    'mbar_value': mbar_value,
                    'temp_value': temp_value,
                    'mbar_time': timestamp,
                    'temp_time': t_time,
                }
                print('here is your match:')
                print(match)
                matches.append(match)
            else:
                print("IMPERFECT!")
                print(t_time)
                print(timestamp)
        return matches
    return failure
When there's not a match, I want to skip a reading for the missing sensor (in this case, the last mbar reading) and just do an N/A.
In most cases, the offset is just one node - meaning temp has one extra reading, somewhere in the middle.
I was using the idx index to optimize the speed of the process, so I don't have to loop through the second (or third, or nth) dict to see if the timestamp exists in it, but I know that's not preferred either, because dicts aren't ordered. In this case, it appears every sub-node sensor dict is ordered by timestamp, so I was trying to leverage that convenience.
Is this a common problem? If so, just point me to the terminology. But I've searched already and cannot find a reasonable, efficient answer besides "loop through each sub-dict and look for a match".
Open to any ideas, because I'll have to do this often, and on large (25 MB files or larger, sometimes) JSON objects. The full dump is up and over 300 MB, but I've sliced them up by sensor IDs so they're more manageable.
You can use .get to avoid KeyErrors and get output like this.
st = yourjsonabove

mbar = {}
for item in st['sensor_data']['mbar']:
    mbar[item['timestamp']] = item['value']

temperature = {}
for item in st['sensor_data']['temperature']:
    temperature[item['timestamp']] = item['value']

for timestamp in temperature:
    print("Timestamp:", timestamp, "Sensor Reading: ", mbar.get(timestamp), "Temperature Reading: ", temperature[timestamp])
leading to output:
Timestamp: 2019-10-31T00:15:00 Sensor Reading: 1012 Temperature Reading: 10.3
Timestamp: 2019-10-31T00:30:00 Sensor Reading: 1011 Temperature Reading: 10.2
Timestamp: 2019-10-31T00:45:00 Sensor Reading: 1010 Temperature Reading: 10.0
Timestamp: 2019-10-31T01:00:00 Sensor Reading: None Temperature Reading: 9.8
Does that help?
You could make a dict keyed by timestamp for each of your sensor readings, like
mbar = {s['timestamp']:s['value'] for s in sensor_data['mbar']}
temp = {s['timestamp']:s['value'] for s in sensor_data['temperature']}
Now it is easy to compare using the difference of the key sets
mbar.keys() - temp.keys()
temp.keys() - mbar.keys()
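From there, the CSV export the question ultimately asks for is just the union of the two key sets plus .get() for the gaps; a minimal sketch, with the output file name as an assumption:

import csv

timestamps = sorted(mbar.keys() | temp.keys())

with open('combined.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['timestamp', 'mbar', 'temperature'])
    writer.writeheader()
    for ts in timestamps:
        writer.writerow({
            'timestamp': ts,
            'mbar': mbar.get(ts),                # None is written as an empty cell
            'temperature': temp.get(ts),
        })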

Turning JSON into dataframe with pandas

I'm trying to get a data frame but keep running into various error messages depending on the arguments I specify in read_json after I specify my file.
I've run through many of the arguments in the pandas.read_json documentation, but haven't been able to identify a solution.
import pandas
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
pandas.read_json(json_file)
That's because the JSON is not directly convertible to a DataFrame. read_json works only with a few formats, defined by the orient parameter. Your JSON doesn't follow any of the allowed formats, so you need to manipulate the JSON before converting it to a data frame.
Let's take a high level look at your JSON:
{
    "displayFieldName": ...,
    "fieldAliases": {...},
    "fields": {...},
    "features": [...]
}
I'm going to hazard a guess and assume the features node is what you want. Let's dive deeper into features:
"features": [
{
"attributes": {
"OBJECTID": 1,
"SHELTER_ID": 223259,
...
}
},
{
"attributes": {
"OBJECTID": 2,
"SHELTER_ID": 223331,
...
}
},
...
]
features contains a list of objects, each having an attributes node. The data contained in the attributes node is what you actually want.
Here's the code
import pandas as pd
import json
from urllib.request import urlopen
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
data = urlopen(json_file).read()
raw_json = json.loads(data)
formatted_json = [feature['attributes'] for feature in raw_json['features']]
formatted_json is now a list of dictionaries containing the data we are after. It is no longer JSON. To create the data frame:
df = pd.DataFrame(formatted_json)
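As an aside, a sufficiently recent pandas (1.0 or newer, where json_normalize is a top-level function) can do the flattening in one call; a minimal sketch reusing raw_json from above, stripping the attributes. prefix from the column names afterwards:

# json_normalize flattens the nested "attributes" dicts into dotted column names.
df = pd.json_normalize(raw_json['features'])
df.columns = [c.replace('attributes.', '') for c in df.columns]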
