Very nested JSON with optional fields into pandas dataframe - python

I have a JSON with the following structure. I want to extract some data to different lists so that I will be able to transform them into a pandas dataframe.
{
"ratings": {
"like": {
"average": null,
"counts": {
"1": {
"total": 0,
"users": []
}
}
}
},
"sharefile_vault_url": null,
"last_event_on": "2021-02-03 00:00:01",
],
"fields": [
{
"type": "text",
"field_id": 130987800,
"label": "Name and Surname",
"values": [
{
"value": "John Smith"
}
],
{
"type": "category",
"field_id": 139057651,
"label": "Gender",
"values": [
{
"value": {
"status": "active",
"text": "Male",
"id": 1,
"color": "DCEBD8"
}
}
],
{
"type": "category",
"field_id": 151333010,
"label": "Field of Studies",
"values": [
{
"value": {
"status": "active",
"text": "Languages",
"id": 3,
"color": "DCEBD8"
}
}
],
}
}
For example, I create a list
names = []
where if "label" in the "fields" list is "Name and Surname" I append ["values"][0]["value"] so names now contains "John Smith". I do exactly the same for the "Gender" label and append the value to the list genders.
The above dictionary is contained in a list of dictionaries, so I just have to loop through the list and extract the relevant fields like this:
names = []
genders = []
for response in users:
    for item in response.json()["items"]:
        for field in item["fields"]:
            if field["label"] == "Name and Surname":
                names.append(field["values"][0]["value"])
            elif field["label"] == "Gender":
                genders.append(field["values"][0]["value"]["text"])
            else:
                pass  # Something else
Here users is a list of responses from the API. The JSON of each response has an "items" key holding a list of dictionaries, and each of those has a "fields" key whose value is a list of field dictionaries (such as Name and Surname and Gender).
The problem is that the dictionary with "label: Field of Studies" is optional and is not always present in the list of fields.
How can I manage to check for its presence, and if so append its value to a list, and None otherwise?

To me it seems that the data you posted is not valid JSON. However, if I were you I would try pandas.json_normalize. Judging by the examples in the documentation, a key that is missing from a record ends up as NaN in the resulting dataframe instead of raising an error.
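If you would rather stay with plain loops, one possible approach is to first index each item's fields by label and then look up the labels you want, so that a missing field becomes None. This is only a sketch: the helper names field_value and item_to_row are made up for illustration, and the sample items are hypothetical data shaped like the structure in the question.

```python
def field_value(field):
    """Return the first value of a field dict, or None if it has none."""
    values = field.get("values") or []
    if not values:
        return None
    value = values[0]["value"]
    # Category fields wrap the text in a dict; text fields hold a plain string.
    return value.get("text") if isinstance(value, dict) else value


def item_to_row(item, labels):
    """Map each wanted label to its value, using None for absent fields."""
    by_label = {f["label"]: field_value(f) for f in item.get("fields", [])}
    return {label: by_label.get(label) for label in labels}


# Hypothetical sample: the second item has no "Field of Studies" field.
items = [
    {"fields": [
        {"label": "Name and Surname", "values": [{"value": "John Smith"}]},
        {"label": "Gender", "values": [{"value": {"text": "Male"}}]},
        {"label": "Field of Studies", "values": [{"value": {"text": "Languages"}}]},
    ]},
    {"fields": [
        {"label": "Name and Surname", "values": [{"value": "Jane Doe"}]},
        {"label": "Gender", "values": [{"value": {"text": "Female"}}]},
    ]},
]

labels = ["Name and Surname", "Gender", "Field of Studies"]
rows = [item_to_row(item, labels) for item in items]
```

Because every row is a dict with the same keys, passing rows to pandas.DataFrame should give one column per label, with the missing values left empty.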

Related

Find a value in a list of dictionaries

I have the following list:
{
"id":1,
"name":"John",
"status":2,
"custom_attributes":[
{
"attribute_code":"address",
"value":"st"
},
{
"attribute_code":"city",
"value":"st"
},
{
"attribute_code":"job",
"value":"test"
}]
}
I need to get the value of the entry whose attribute_code is equal to "city".
I've tried this code:
if list["custom_attributes"]["attribute_code"] == "city" in list:
    var = list["value"]
But this gives me the following error:
TypeError: list indices must be integers or slices, not str
What am I doing wrong here? I've read this solution and this solution but didn't understand how to access each value.
Another solution, using next():
dct = {
"id": 1,
"name": "John",
"status": 2,
"custom_attributes": [
{"attribute_code": "address", "value": "st"},
{"attribute_code": "city", "value": "st"},
{"attribute_code": "job", "value": "test"},
],
}
val = next(d["value"] for d in dct["custom_attributes"] if d["attribute_code"] == "city")
print(val)
Prints:
st
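One caveat with next(): without a second argument it raises StopIteration when nothing matches. A small variation, assuming you would rather get a fallback value, is to pass a default. The function name find_attribute and the "zip" lookup are only illustrative.

```python
def find_attribute(record, code, default=None):
    """Return the value of the first attribute matching code, else default."""
    return next(
        (d["value"] for d in record["custom_attributes"] if d["attribute_code"] == code),
        default,
    )


dct = {
    "id": 1,
    "name": "John",
    "status": 2,
    "custom_attributes": [
        {"attribute_code": "address", "value": "st"},
        {"attribute_code": "city", "value": "st"},
        {"attribute_code": "job", "value": "test"},
    ],
}

city = find_attribute(dct, "city")    # attribute present
missing = find_attribute(dct, "zip")  # attribute absent, falls back to None
```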
Your data is a dict, not a list.
You need to scan the attributes according to the criteria you mentioned.
See below:
data = {
"id": 1,
"name": "John",
"status": 2,
"custom_attributes": [
{
"attribute_code": "address",
"value": "st"
},
{
"attribute_code": "city",
"value": "st"
},
{
"attribute_code": "job",
"value": "test"
}]
}
for attr in data['custom_attributes']:
    if attr['attribute_code'] == 'city':
        print(attr['value'])
        break
output
st

Substring string column Pandas Python

I have a pandas dataframe with two columns: ticket number and history.
History is a string with the following structure. I need to create a third column containing the name of the author who changed the status from New to Open. Is it possible?
[
{
"id": "1",
"author": {
"name": "user1",
"emailAddress": "user1#test.com",
"displayName": "user1"
},
"created": "2021-06-09T12:54:22.915+0000",
"items": [
{
"field": "name",
"from": "1",
"fromString": null,
"to": "2",
"toString": "test"
}
]
},
{
"id": "2",
"author": {
"name": "user2",
"emailAdress": "user2#test.com",
"displayName": "user2"
},
"created": "2021-06-11T09:33:18.692+0000",
"items": [
{
"field": "status",
"from": 3,
"fromString": "New",
"to": "7",
"toString": "Open"
}
]
}]
If your dataframe is named df, the history column (column 2) is named history and the items in the history column actually are json strings with a structure like the one you've provided, you could do the following:
import json
def extract_author(json_string):
    records = json.loads(json_string)
    for record in records:
        items = record['items'][0]
        if (items['field'] == 'status'
                and items['fromString'] == 'New'
                and items['toString'] == 'Open'):
            return record['author']['name']
    return None
df['author'] = df['history'].map(extract_author)
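The function above only looks at the first entry of each record's items list. If a single history record can contain several changes, or if some cells are empty or not valid JSON, a slightly more defensive variant might look like the sketch below. The name find_status_change_author and the sample history string are assumptions for illustration, and the history is assumed to be a JSON list of records.

```python
import json


def find_status_change_author(json_string, old="New", new="Open"):
    """Return the author name of the first status change old -> new, or None."""
    try:
        records = json.loads(json_string)
    except (TypeError, ValueError):
        # Missing cell (None/NaN) or a string that is not valid JSON.
        return None
    for record in records:
        # Check every change in the record, not only the first one.
        for item in record.get("items", []):
            if (item.get("field") == "status"
                    and item.get("fromString") == old
                    and item.get("toString") == new):
                return record["author"]["name"]
    return None


# Hypothetical history string: the status change is the second item.
history = json.dumps([
    {"id": "1", "author": {"name": "user1"},
     "items": [{"field": "name", "fromString": None, "toString": "test"}]},
    {"id": "2", "author": {"name": "user2"},
     "items": [{"field": "name", "fromString": "a", "toString": "b"},
               {"field": "status", "fromString": "New", "toString": "Open"}]},
])

author = find_status_change_author(history)
```

It can be applied the same way as before, for example with df['history'].map(find_status_change_author).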

How to read JSON file with number of key in nested dictionary keeps changing

I have this nested JSON dictionary that I need to parse into a SQL table. The problem is that the number of entries under the tank key (at most 4) varies from one site_id to another. Is there a way to read them?
{
"data": [
{
"site_id": 30183,
"city": "Seattle",
"state": "US-WA",
"tank": [
{
"id": 00001,
"name": "Diesel"
},
{
"id": 00002,
"name": "Diesel"
},
{
"id": 00003,
"name": "Unleaded 89"
}
]
},
{
"site_id": 200942,
"city": "Boise",
"state": "ID-WA",
"tank": [
{
"id": 00001,
"name": "Diesel"
},
{
"id": 00002,
"name": "Unleaded 95"
}
]
}
]
}
Here is my current code:
for site in response['data']:
    row = []
    row.extend([site['site_id'], site['city'], site['state']])
    for tank in site['tank']:
        row.extend([tank['id'], tank['name']])
Any site_id that has fewer than 4 tanks can have the missing values replaced with NULL.
I don't know how to modify it to handle a varying number of tank entries. Any suggestions would help! Thank you.
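One way this could be approached is to build each row as before and then pad it with None up to the fixed width, since most SQL drivers store None as NULL. The sketch below makes a few assumptions: the helper name site_to_row and the constant MAX_TANKS are invented for illustration, and the tank ids are written as strings because a number with leading zeros such as 00001 is not valid in JSON or Python.

```python
MAX_TANKS = 4  # assumed upper bound, as stated in the question


def site_to_row(site, max_tanks=MAX_TANKS):
    """Flatten one site into a fixed-width row, padding missing tanks with None."""
    row = [site["site_id"], site["city"], site["state"]]
    tanks = site.get("tank", [])[:max_tanks]
    for tank in tanks:
        row.extend([tank["id"], tank["name"]])
    # Two columns (id, name) for every tank slot that is not filled.
    row.extend([None] * (2 * (max_tanks - len(tanks))))
    return row


# Hypothetical response shaped like the sample in the question.
response = {
    "data": [
        {"site_id": 30183, "city": "Seattle", "state": "US-WA",
         "tank": [{"id": "00001", "name": "Diesel"},
                  {"id": "00002", "name": "Diesel"},
                  {"id": "00003", "name": "Unleaded 89"}]},
        {"site_id": 200942, "city": "Boise", "state": "ID-WA",
         "tank": [{"id": "00001", "name": "Diesel"},
                  {"id": "00002", "name": "Unleaded 95"}]},
    ]
}

rows = [site_to_row(site) for site in response["data"]]
```

Every row then has the same 11 columns (3 site columns plus 4 tanks with 2 columns each), which is what a fixed table schema needs.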

How to make one JSON reference data from another JSON

I have the following json data.
users:
[
{
"group_ids": [
"group_1"
],
"user_id": "U_1",
"name": "kite"
},
{
"group_ids": [
"group_1",
"group_2"
],
"user_id": "U_2",
"name": "mike"
},
{
"group_ids": [
"group_1",
"group_3"
],
"user_id": "U_3",
"name": "an"
},
{
"group_ids": [
"group_3"
],
"user_id": "U_4",
"name": "joe"
}
]
groups:
{
"group_1": {
"label": "sre",
"group_type": "freelance"
},
"group_2": {
"label": "dev",
"group_type": "staff"
},
"group_3": {
"label": "qa",
"group_type": "member"
},
"group_4": {
"label": "ops",
"group_type": "staff"
}
}
I want to get the following output with the keys in order when given user id U_2.
Any pseudo code or hints will be good.
{
"groups": [
{"label": "sre", "group_type": "freelance"},
{"label": "dev", "group_type": "staff"}
],
"user_id": "U_2",
"name": "mike"
}
To guarantee the key order on any Python version, you can use the standard OrderedDict class (since Python 3.7, a plain dict also preserves insertion order).
In the snippet below, I assume you have two JSON files users.json and groups.json.
from collections import OrderedDict
import json
from pathlib import Path
# Load data from JSON files
users = json.loads(Path("users.json").read_text())
groups = json.loads(Path("groups.json").read_text())
# Index users by their ID
users = {user["user_id"]: user for user in users}
def get_groups(user_id):
    # Get the required user representation
    user = users[user_id].copy()
    # Get the list of its group IDs and remove it from the representation
    group_ids = user.pop("group_ids")
    # Add the group representation for each group
    user["groups"] = [groups[group_id] for group_id in group_ids]
    # Convert user to an OrderedDict so the keys come out in the required order
    keys = "groups", "user_id", "name"
    user = OrderedDict([(key, user[key]) for key in keys])
    # Done!
    return user
The result of get_groups("U_2") is then:
>>> get_groups("U_2")
OrderedDict([('groups', [{'label': 'sre', 'group_type': 'freelance'}, {'label': 'dev', 'group_type': 'staff'}]), ('user_id', 'U_2'), ('name', 'mike')])
Finally, the standard json.dump and json.dumps to convert to JSON string will respect the order of keys when you pass an OrderedDict to them.
>>> print(json.dumps(get_groups("U_2"), indent=4))
{
"groups": [
{
"label": "sre",
"group_type": "freelance"
},
{
"label": "dev",
"group_type": "staff"
}
],
"user_id": "U_2",
"name": "mike"
}
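On Python 3.7 or later, where a plain dict keeps its insertion order, the same result can probably be had without OrderedDict by building the output dict with the keys in the wanted order. This is a sketch with the data inlined rather than read from files; the names users_by_id and get_user_with_groups are made up for illustration.

```python
users = [
    {"group_ids": ["group_1"], "user_id": "U_1", "name": "kite"},
    {"group_ids": ["group_1", "group_2"], "user_id": "U_2", "name": "mike"},
]
groups = {
    "group_1": {"label": "sre", "group_type": "freelance"},
    "group_2": {"label": "dev", "group_type": "staff"},
}

# Index users by ID once, so each lookup does not scan the whole list.
users_by_id = {user["user_id"]: user for user in users}


def get_user_with_groups(user_id):
    """Return the user with group IDs replaced by the group records."""
    user = users_by_id[user_id]
    # Keys are inserted in the order they should appear in the output.
    return {
        "groups": [groups[group_id] for group_id in user["group_ids"]],
        "user_id": user["user_id"],
        "name": user["name"],
    }


result = get_user_with_groups("U_2")
```

json.dumps(result, indent=4) writes the keys in that same order.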

Unable to pull data from json using python

I have the following json
{
"response": {
"message": null,
"exception": null,
"context": [
{
"headers": null,
"name": "aname",
"children": [
{
"type": "cluster-connectivity",
"name": "cluster-connectivity"
},
{
"type": "consistency-groups",
"name": "consistency-groups"
},
{
"type": "devices",
"name": "devices"
},
{
"type": "exports",
"name": "exports"
},
{
"type": "storage-elements",
"name": "storage-elements"
},
{
"type": "system-volumes",
"name": "system-volumes"
},
{
"type": "uninterruptible-power-supplies",
"name": "uninterruptible-power-supplies"
},
{
"type": "virtual-volumes",
"name": "virtual-volumes"
}
],
"parent": "/clusters",
"attributes": [
{
"value": "true",
"name": "allow-auto-join"
},
{
"value": "0",
"name": "auto-expel-count"
},
{
"value": "0",
"name": "auto-expel-period"
},
{
"value": "0",
"name": "auto-join-delay"
},
{
"value": "1",
"name": "cluster-id"
},
{
"value": "true",
"name": "connected"
},
{
"value": "synchronous",
"name": "default-cache-mode"
},
{
"value": "true",
"name": "default-caw-template"
},
{
"value": "blah",
"name": "default-director"
},
{
"value": [
"blah",
"blah"
],
"name": "director-names"
},
{
"value": [
],
"name": "health-indications"
},
{
"value": "ok",
"name": "health-state"
},
{
"value": "1",
"name": "island-id"
},
{
"value": "blah",
"name": "name"
},
{
"value": "ok",
"name": "operational-status"
},
{
"value": [
],
"name": "transition-indications"
},
{
"value": [
],
"name": "transition-progress"
}
],
"type": "cluster"
}
],
"custom-data": null
}
}
which I'm trying to parse using the json module in Python. I am only interested in getting the following information out of it.
Name Value
operational-status Value
health-state Value
Here is what I have tried.
In the script below, data is the JSON returned from a web page.
json = json.loads(data)
healthstate= json['response']['context']['operational-status']
operationalstatus = json['response']['context']['health-status']
Unfortunately I think I must be missing something, as the above results in an error saying that indices must be integers, not strings.
If I try
healthstate= json['response'][0]
it errors saying index 0 is out of range.
Any help would be gratefully received.
json['response']['context'] is a list, so that object requires you to use integer indices.
Each item in that list is itself a dictionary again. In this case there is only one such item.
To get all "name": "health-state" dictionaries out of that structure you'd need to do a little more processing:
[attr['value'] for attr in json['response']['context'][0]['attributes'] if attr['name'] == 'health-state']
would give you a list of matching values for health-state in the first context.
Demo:
>>> [attr['value'] for attr in json['response']['context'][0]['attributes'] if attr['name'] == 'health-state']
[u'ok']
You have to follow the data structure. It's best to interactively manipulate the data and check what every item is. If it's a list you'll have to index it positionally or iterate through it and check the values. If it's a dict you'll have to index it by its keys. For example, here is a function that gets the context and then iterates through its attributes, checking for a particular name.
def get_attribute(data, attribute):
    for attrib in data['response']['context'][0]['attributes']:
        if attrib['name'] == attribute:
            return attrib['value']
    return 'Not Found'
>>> data = json.loads(s)
>>> get_attribute(data, 'operational-status')
u'ok'
>>> get_attribute(data, 'health-state')
u'ok'
json['response']['context'] is a list, not a dict. The structure is not exactly what you think it is.
For example, "operational-status" is the 15th entry in the attributes list (index 14), so its value can be read with the following, although looking it up by name is more robust than relying on the position:
json['response']['context'][0]['attributes'][14]['value']
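If you need several attributes, another option is to turn the attributes list into a name-to-value dict once and then use ordinary key lookups. The sketch below uses a shortened, hypothetical copy of the response, and stores the parsed result under the name payload, since assigning it to json (as in the question) would shadow the json module.

```python
import json

# Shortened, hypothetical copy of the response from the question.
raw = """
{"response": {"context": [{"attributes": [
    {"value": "true", "name": "allow-auto-join"},
    {"value": "ok", "name": "health-state"},
    {"value": "ok", "name": "operational-status"},
    {"value": [], "name": "health-indications"}
]}]}}
"""

payload = json.loads(raw)  # separate name, so the json module stays usable

# Build a lookup table from attribute name to attribute value.
attributes = {
    attr["name"]: attr["value"]
    for attr in payload["response"]["context"][0]["attributes"]
}

health_state = attributes.get("health-state")
operational_status = attributes.get("operational-status")
```

Using .get means a name that is not present gives None rather than raising a KeyError.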
