Pandas DataFrame to JSON with Nested list/dicts - python

Here is my df:
                    text                  date    channel sentiment product segment
0  I like the new layout  2021-08-30T18:15:22Z  Snowflake   predict  Skills    EMEA
I need to convert this to JSON output that matches the following:
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
I'm getting stuck with mapping the keys of the columns to the values in the first dict and mapping the column and row to new keys in the final dict. I've tried various options using df.groupby with .apply() but am coming up short.
Samples of what I've tried:
df.groupby(['text', 'date','channel','sentiment','product','segment']).apply(
lambda r: r[['product','segment']].to_dict(orient='records')).unstack('text').apply(lambda s: [
{s.index.name: idx, 'fields': value}
for idx, value in s.items()]
).to_json(orient='records')
Any and all help is appreciated!

Solved with this:
# Specify field column names
fieldcols = ['product','segment']
# Build a dict for each group as a Series named `fields`
res = (df.groupby(['text', 'date','channel','sentiment'])
.apply(lambda s: [{'field': field,
'value': value}
for field in fieldcols
for value in s[field].values])
).rename('fields')
# Convert Series to DataFrame and then to_json
res = res.reset_index().to_json(orient='records', date_format='iso')
Output:
[
{
"text": "I like the new layout",
"date": "2021-08-30T18:15:22Z",
"channel": "Snowflake",
"sentiment": "predict",
"fields": [
{
"field": "product",
"value": "Skills"
},
{
"field": "segment",
"value": "EMEA"
}
]
}
]
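For comparison, here is a minimal sketch of the same transformation without groupby, building the records row by row with plain dict/list comprehensions and json.dumps. It assumes the df and field columns shown above, and that date is stored as a string as in the sample (a Timestamp column would need converting first):
import json

fieldcols = ['product', 'segment']
records = []
for row in df.to_dict(orient='records'):
    # keep the non-field columns as top-level keys
    record = {k: v for k, v in row.items() if k not in fieldcols}
    # nest the field columns under "fields" as {"field": ..., "value": ...} pairs
    record['fields'] = [{'field': c, 'value': row[c]} for c in fieldcols]
    records.append(record)
print(json.dumps(records, indent=2))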

Related

how to use json_normalize but flip the axis

Hey guys, I've been working on converting some JSON text I'm receiving from an API. I noticed some people using json_normalize, but in my case it doesn't solve the full issue, and I was wondering if someone could help.
import pandas as pd
my_json = [
{
"total": "null",
"items": [
{
"key": "time",
"label": "Time",
"value": "2022-12-13T23:59:59.939-07:00"
},
{
"key": "agentNotes",
"label": "Agent Notes",
"value": "null"
},
{
"key": "blindTransferToAgent",
"label": "Blind Transfer To Agent",
"value": "0"
}]},
{"total": "null",
"items": [
{
"key": "time",
"label": "Time",
"value": "2022-12-13T23:59:59.939-07:00"
},
{
"key": "agentNotes",
"label": "Agent Notes",
"value": "null"
},
{
"key": "blindTransferToAgent",
"label": "Blind Transfer To Agent",
"value": "0"
}
]}]
df = pd.json_normalize(my_json, ["items"])
print(df)
This gives me a result like this
key ... value
0 time ... 2022-12-13T23:59:59.939-07:00
1 agentNotes ... null
2 blindTransferToAgent ... 0
[3 rows x 3 columns]
But I'm trying to turn the keys into columns and use the values as the row values, so the end result looks like this:
time                           agentNotes   blindTransferToAgent
2022-12-13T23:59:59.939-07:00  null         0
Any help would be appreciated.
I did not find a shortcut for this problem; maybe someone can enlighten us.
However, the solution isn't that long, so I thought I'd post it anyway.
Your "JSON" isn't really JSON if I'm reading your question correctly: it is a list of dictionaries, each with two keys, total and items. The value of items is a list of dictionaries, so we can iterate through them and take the key-value elements from each one:
from collections import defaultdict
import pandas as pd
dict_to_df = defaultdict(list)
dictionaries = [inner_dicts for items_dict in my_json for inner_dicts in items_dict['items']]
for dictionary in dictionaries:
dict_to_df[dictionary['key']].append(dictionary['value'])
df = pd.DataFrame.from_dict(dict_to_df, orient='index').T
print(df)
Which outputs:
time agentNotes blindTransferToAgent
0 2022-12-13T23:59:59.939-07:00 null 0
Explanations:
Initialize an empty defaultdict (with a list as the default value), which we will then read into a pandas DataFrame.
Insert the values per key from the "JSON" we have.
Read the dictionary into a pandas DataFrame, orienting on the index and transposing. This also covers the case where the values don't all have the same length (such as an extra blindTransferToAgent entry with value 4), i.e. if the items look like:
{
"key": "time",
"label": "Time",
"value": "2022-12-13T23:59:59.939-07:00"
},
{
"key": "agentNotes",
"label": "Agent Notes",
"value": "null"
},
{
"key": "blindTransferToAgent",
"label": "Blind Transfer To Agent",
"value": "0"
},
{
"key": "blindTransferToAgent",
"label": "Blind Transfer To Agent",
"value": "4"
}
Which will output:
time agentNotes blindTransferToAgent
0 2022-12-13T23:59:59.939-07:00 null 0
1 None None 4
Try changing this:
df = pd.json_normalize(my_json, ["items"])
into this:
df = pd.json_normalize(my_json, ["items"]).T
The T attribute of a pandas DataFrame returns its transpose (index and columns swapped), which is what you're looking for.
Output:
0 1 2
key time agentNotes blindTransferToAgent
label Time Agent Notes Blind Transfer To Agent
value 2022-12-13T23:59:59.939-07:00 null 0
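If you want the keys to end up as proper column names with one row per outer record, here is a minimal sketch that skips json_normalize entirely and builds the rows with a dict comprehension (it assumes each record's keys are unique within its items list):
import pandas as pd

# one dict per outer record, mapping each item's "key" to its "value"
rows = [{item["key"]: item["value"] for item in record["items"]} for record in my_json]
df = pd.DataFrame(rows)
print(df)
With the sample input above this gives one row per record, with time, agentNotes and blindTransferToAgent as the columns.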

Very nested JSON with optional fields into pandas dataframe

I have a JSON with the following structure. I want to extract some data to different lists so that I will be able to transform them into a pandas dataframe.
{
"ratings": {
"like": {
"average": null,
"counts": {
"1": {
"total": 0,
"users": []
}
}
}
},
"sharefile_vault_url": null,
"last_event_on": "2021-02-03 00:00:01",
],
"fields": [
{
"type": "text",
"field_id": 130987800,
"label": "Name and Surname",
"values": [
{
"value": "John Smith"
}
],
{
"type": "category",
"field_id": 139057651,
"label": "Gender",
"values": [
{
"value": {
"status": "active",
"text": "Male",
"id": 1,
"color": "DCEBD8"
}
}
],
{
"type": "category",
"field_id": 151333010,
"label": "Field of Studies",
"values": [
{
"value": {
"status": "active",
"text": "Languages",
"id": 3,
"color": "DCEBD8"
}
}
],
}
}
For example, I create a list
names = []
where if "label" in the "fields" list is "Name and Surname" I append ["values"][0]["value"] so names now contains "John Smith". I do exactly the same for the "Gender" label and append the value to the list genders.
The above dictionary is contained in a list of dictionaries so I just have to loop though the list and extract the relevant fields like this:
names = []
genders = []
for r in range(len(users)):
for i in range(len(users[r].json()["items"])):
for field in users[r].json()["items"][i]["fields"]:
if field["label"] == "Name and Surname":
names.append(field["values"][0]["value"])
elif field["label"] == "Gender":
genders.append(field["values"][0]["value"]["text"])
            else:
                pass  # something else
where users is a list of responses from the API; each response's JSON has an items key that is a list of dictionaries, and each of those contains a fields key whose value is a list of dictionaries for the different fields (like Name and Surname and Gender).
The problem is that the dictionary with "label: Field of Studies" is optional and is not always present in the list of fields.
How can I check for its presence and, if it is there, append its value to a list, and append None otherwise?
To me it seems that the data you have is not valid JSON. However, if I were you I would try pandas.json_normalize. According to the documentation, it fills in missing values (NaN/None) for records that do not contain a given field.
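If you want to stay with the explicit loop, here is a minimal sketch (assuming the same users/items/fields structure described above, and a hypothetical fields_of_study list) that appends None whenever the optional label is absent:
names = []
genders = []
fields_of_study = []
for response in users:
    for item in response.json()["items"]:
        # map label -> field dict so optional labels can be looked up with .get()
        by_label = {f["label"]: f for f in item["fields"]}
        names.append(by_label["Name and Surname"]["values"][0]["value"])
        genders.append(by_label["Gender"]["values"][0]["value"]["text"])
        studies = by_label.get("Field of Studies")
        fields_of_study.append(studies["values"][0]["value"]["text"] if studies else None)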

Merge nested key/values and nested list into a json

I am trying to merge a nested list of dictionary key/values into a single key with a list of values. I am loading a CSV file into a DataFrame and from that I am trying to convert it into nested JSON. Please see below what I have tried. Should I be going this route to create the JSON, or does pandas have native functionality that does this type of conversion?
Sample Data:
Subject,StudentName,Category
ENGLISH,Jane,
ENGLISH,,A
MATH,Matt,B
MATH,Newman,AA
MATH,,B
MATH,Dylan,A
ENGLISH,Noah,
ENGLISH,,C
Tried this:
df1 = pd.read_csv('../data/file.csv')
json_doc = defaultdict(list)
for _id in df1.T:
data = df1.T[_id]
key = data.Subject
values = {'StudentName': data.StudentName,'Category':data.Category}
json_doc[key].append(values)
new_d = json.dumps(json_doc, indent=4)
{k: int(v) for k, v in new_d} # error: ValueError: not enough values to unpack (expected 2, got 1)
and I get this from the code above:
{
"ENGLISH": [
{
"StudentName": "Jane",
"Category": NaN
},
{
"StudentName": NaN,
"Category": "A"
},
{
"StudentName": "Noah",
"Category": NaN
},
{
"StudentName": NaN,
"Category": "C"
}
],
"MATH": [
{
"StudentName": "Matt",
"Category": "B"
},
{
"StudentName": "Newman",
"Category": "AA"
},
{
"StudentName": NaN,
"Category": "B"
},
{
"StudentName": "Dylan",
"Category": "A"
}
]
}
How do I merge the key/values so the output looks like this?
{
"ENGLISH": [
{
"StudentName": ["Jane","Noah"],
"Category": ["A","C"]
}
],
"MATH": [
{
"StudentName": ["Matt","Newman","Dylan"]
"Category": ["B","AA","A"]
}
]
}
It is not entirely clear to me if it is safe to ignore missing values, but here is my one-liner:
df.groupby('Subject').agg(lambda g: list(g.dropna())).to_dict(orient='index')
The default methods (to_json, to_dict) do not have a suitable orient option, so we have to do some of the work by hand: group by Subject and convert each column's data within the group to a list. Then .to_dict(orient='index') does what you want (replace it with to_json if you want a string instead of a dict).
Note: Subject here is expected to be a column, not an index.
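Putting it together, here is a minimal sketch of the whole pipeline (the CSV path is hypothetical), with an extra wrapping step in case you want each subject to map to a one-element list of dicts exactly as in the desired output:
import json
import pandas as pd

df = pd.read_csv('../data/file.csv')   # hypothetical path, as in the question
merged = df.groupby('Subject').agg(lambda g: list(g.dropna())).to_dict(orient='index')
# merged is {'ENGLISH': {'StudentName': [...], 'Category': [...]}, 'MATH': {...}}
wrapped = {subject: [cols] for subject, cols in merged.items()}
print(json.dumps(wrapped, indent=4))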

Aggregation json by elements of sub-json

I have the following structure:
[
{
"Name": "a-1",
"Tags": [
{
"Value": "a",
"Key": "Type"
}
],
"CreationDate": "2018-02-25T17:33:19.000Z"
},
{
"Name": "a-2",
"Tags": [
{
"Value": "a",
"Key": "Type"
}
],
"CreationDate": "2018-02-26T17:33:19.000Z"
},
{
"Name": "b-1",
"Tags": [
{
"Value": "b",
"Key": "Type"
}
],
"CreationDate": "2018-01-21T17:33:19.000Z"
},
{
"Name": "b-2",
"Tags": [
{
"Value": "b",
"Key": "Type"
}
],
"CreationDate": "2018-01-22T17:33:19.000Z"
},
{
"Name": "c-1",
"Tags": [
{
"Value": "c",
"Key": "Type"
}
],
"CreationDate": "2018-08-29T17:33:19.000Z"
}
]
I want to print out the oldest Name for each Value when there is more than one member in the group (this should be configurable, e.g. the x oldest items when there are more than y members). In this case there are two a, two b and one c, so the expected result is:
a-1
b-1
Here is my Python code:
data = ec2.describe_images(Owners=['11111'])
images = data['Images']
grouper = groupby(map(itemgetter('Tags'), images))
groups = (list(vals) for _, vals in grouper)
res = list(chain.from_iterable(filter(None, groups)))
Currently res contains only a list of Key and Value entries and it is not grouped. Can anyone show me how to continue the code to get the expected result?
Here is a solution using pandas; it takes a JSON string as input (json_string).
A lot of the time pandas is overkill, but here I think it is a good fit, because you basically want to group by Value and then eliminate some groups based on criteria such as how many members they have.
import pandas as pd
# load the dataframe from the json string
df = pd.read_json(json_string)
df['CreationDate'] = pd.to_datetime(df['CreationDate'])
# create a value column from the nested tags column
df['Value'] = df['Tags'].apply(lambda x: x[0]['Value'])
# groupby value and iterate through groups
groups = df.groupby('Value')
output = []
for name, group in groups:
# skip groups with fewer than 2 members
if group.shape[0] < 2:
continue
# sort rows by creation date
group = group.sort_values('CreationDate')
    # keep the row with the oldest date (first row after the ascending sort)
    oldest_in_group = group.iloc[0]
    output.append(oldest_in_group['Name'])
print(output)
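To make the "x oldest items when there are more than y members" requirement configurable, here is a hedged sketch built on the same DataFrame (x and y are hypothetical parameters, not part of the original answer):
x, y = 1, 1   # keep the x oldest Names from groups with more than y members
result = (df.sort_values('CreationDate')
            .groupby('Value')
            .filter(lambda g: len(g) > y)
            .groupby('Value')
            .head(x)['Name']
            .tolist())
print(result)   # e.g. ['b-1', 'a-1'] for the sample data (order follows CreationDate)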

how to create a pandas.dataframe of a json with values in a list with no key in Python

I have this json:
{
"columns": [
{
"field": "date",
"name": "Date",
"type": "dim"
},
{
"field": "adsetstart_time",
"name": "Ad set start time",
"type": "dim"
},
{
"field": "adsetend_time",
"name": "Ad set end time",
"type": "dim"
},
{
"field": "adcampaign_id",
"name": "Campaign ID",
"type": "dim"
},
{
"field": "cost",
"name": "Amount spent",
"type": "met"
}
],
"data": [
[
"2017-10-21",
"2017-10-02",
"2017-11-01",
"6076466058814",
81.32
],
[
"2017-10-21",
"2017-10-04",
"2017-11-01",
"6076547852614",
47.46
],
[
"2017-10-21",
"2017-10-04",
"2017-11-01",
"6076549546014",
128.58
]
],
"notes": {
"datasource": "FA",
"numeric_format_columns_start": 4,
"numeric_format_rows_start": 0,
"result_rows": 50,
"result_values": 250,
"runtime_sec": 3,
"status": "success"
}
}
and I want this table:
[table screenshot omitted: one column per field name, one row per data record]
I'm new to programming. I tried to modify the dictionary structure by creating a new one with the key-value pairs inside data, like this:
{"data": ["Date": "2017-10-24", "Ad set start time": "2017-10-16", "Ad set end time": "2017-10-27", "Campaign ID": "6076811156014", "Amount spent": 106.84],...
I managed to extract the columns into a list, but I cannot iterate over data to assign each value its corresponding key.
What is the best way to create the dataframe from this json structure?
import pandas as pd
# data = dict provided in OP
colnames = [x["name"] for x in data["columns"]]
pd.DataFrame(data["data"], columns=colnames)
Date Ad set start time Ad set end time Campaign ID Amount spent
0 2017-10-21 2017-10-02 2017-11-01 6076466058814 81.32
1 2017-10-21 2017-10-04 2017-11-01 6076547852614 47.46
2 2017-10-21 2017-10-04 2017-11-01 6076549546014 128.58
First read the json object, extract the columns, and then use the data and columns within a DataFrame constructor.
import json
import pandas as pd

with open('jsonfile.json') as f:
    data = json.load(f)
columns = [field['name'] for field in data['columns']]
pd.DataFrame(data['data'], columns=columns)
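If you would rather use the machine-friendly identifiers from the "field" key as column names instead of the display names, the same pattern applies (a small variant, not from the original answer):
fieldnames = [c['field'] for c in data['columns']]
df = pd.DataFrame(data['data'], columns=fieldnames)
print(df.columns.tolist())   # ['date', 'adsetstart_time', 'adsetend_time', 'adcampaign_id', 'cost']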
