I have the below JSON string which I converted from a Pandas data frame.
[
{
"ID":"1",
"Salary1":69.43,
"Salary2":513.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
}
]
I want to change the above JSON to the below format.
[
{
"Date":"2022-06-09",
"Name":"john",
"DateTime":"2022-09-0710:57:55",
"employeeId":12,
"Results":[
{
"ID":1,
"Salary1":69.43,
"Salary2":513
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123
}
]
}
]
Kindly let me know how we can achieve this in Python.
Original Dataframe:
ID Salary1 Salary2 Date Name employeeId DateTime
1 69.43 513.0 2022-06-09 john 12 2022-09-0710:57:55
2 691.43 5123.0 2022-06-09 john 12 2022-09-0710:57:55
Thank you.
As #Harsha pointed, you can adapt one of the answers from another question, with just some minor tweaks to make it work for OP's case:
(
df.groupby(["Date","Name","DateTime","employeeId"])[["ID","Salary1","Salary2"]]
# to_dict(orient="records") - returns list of rows, where each row is a dict,
# "oriented" like [{column -> value}, … , {column -> value}]
.apply(lambda x: x.to_dict(orient="records"))
# groupBy makes a Series: with grouping columns as index, and dict as values.
# This structure is no good for the next to_dict() method.
# So here we create new DataFrame out of grouped Series,
# with Series' indexes as columns of DataFrame,
# and also renamimg our Series' values to "Results" while we are at it.
.reset_index(name="Results")
# Finally we can achieve the desired structure with the last call to to_dict():
.to_dict(orient="records")
)
# [{'Date': '2022-06-09', 'Name': 'john', 'DateTime': '2022-09-0710:57:55', 'employeeId': 12,
# 'Results': [
# {'ID': 1, 'Salary1': 69.43, 'Salary2': 513.0},
# {'ID': 2, 'Salary1': 691.43, 'Salary2': 5123.0}
# ]}]
Related
Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in column "mail" to remove the key "fax". Eg, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. Eg.;
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove a element from dict having the given key.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
import numpy as np
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output :
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
I need to convert this json file to a data frame in python:
print(resp2)
{
"totalCount": 1,
"nextPageKey": null,
"result": [
{
"metricId": "builtin:tech.generic.cpu.usage",
"data": [
{
"dimensions": [
"process_345678"
],
"dimensionMap": {
"dt.entity.process_group_instance": "process_345678"
},
"timestamps": [
1642021200000,
1642024800000,
1642028400000
],
"values": [
10,
15,
12
]
}
]
}
]
}
Output needs to be like this:
metricId dimensions timestamps values
builtin:tech.generic.cpu.usage process_345678 1642021200000 10
builtin:tech.generic.cpu.usage process_345678 1642024800000 15
builtin:tech.generic.cpu.usage process_345678 1642028400000 12
I have tried this:
print(pd.json_normalize(resp2, "data"))
I get invalid syntax, any ideas?
Take a look at the examples of json_normalize, and you'll see a list of dictionaries that have the key names of the columns you want, unique to each row. When you have nested lists/objects, then the columns will be flatten to have dot-notation, but nested arrays will not end up duplicated across rows.
Therefore, parse the data into a flat list, then you can use from_records.
data = []
for r in resp2['result']:
metricId = r['metricId']
for d in r['data']:
dimension = d['dimensions'][0] # unclear why this is an array
timestamps = d['timestamps']
values = d['values']
for t, v in zip(timestamps, values):
data.append({'metricId': metricId, 'dimensions': dimension, 'timestamps': t, 'values': v})
df = pd.DataFrame.from_records(data)
Requirement
My requirement is to have a Python code extract some records from a database, format and upload a formatted JSON to a sink.
Planned approach
1. Create JSON-like templates for each record. E.g.
json_template_str = '{{
"type": "section",
"fields": [
{{
"type": "mrkdwn",
"text": "Today *{total_val}* customers saved {percent_derived}%."
}}
]
}}'
2. Extract records from DB to a dataframe.
3. Loop over dataframe and replace the {var} variables in bulk using something like .format(**locals()))
Question
I haven't worked with dataframes before.
What would be the best way to accomplish Step 3 ? Currently I am
3.1 Looping over the dataframe objects 1 by 1 for i, df_row in df.iterrows():
3.2 Assigning
total_val= df_row['total_val']
percent_derived= df_row['percent_derived']
3.3 In the loop format and add str to a list block.append(json.loads(json_template_str.format(**locals()))
I was trying to use the assign() method in dataframe but was not able to figure out a way to use like a lambda function to create a new column with my expected value that I can use.
As a novice in pandas, I feel there might be a more efficient way to do this (which may even involve changing the JSON template string - which I can totally do). Will be great to hear thoughts and ideas.
Thanks for your time.
I would not write a JSON string by hand, but rather create a corresponding python object and then use the json library to convert it into a string. With this in mind, you could try the following:
import copy
import pandas as pd
# some sample data
df = pd.DataFrame({
'total_val': [100, 200, 300],
'percent_derived': [12.4, 5.2, 6.5]
})
# template dictionary for a single block
json_template = {
"type": "section",
"fields": [
{"type": "mrkdwn",
"text": "Today *{total_val:.0f}* customers saved {percent_derived:.1f}%."
}
]
}
# a function that will insert data from each row
# of the dataframe into a block
def format_data(row):
json_t = copy.deepcopy(json_template)
text_t = json_t["fields"][0]["text"]
json_t["fields"][0]["text"] = text_t.format(
total_val=row['total_val'], percent_derived=row['percent_derived'])
return json_t
# create a list of blocks
result = df.agg(format_data, axis=1).tolist()
The resulting list looks as follows, and can be converted into a JSON string if needed:
[{
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *100* customers saved 12.4%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *200* customers saved 5.2%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *300* customers saved 6.5%.'
}]
}]
I am receiving the following json from a webservice:
{
"headers":[
{
"seriesId":"18805",
"Name":"Name1",
"assetId":"4"
},
{
"seriesId":"18801",
"Name":"Name2",
"assetId":"209"
}
],
"values":[
{
"Date":"01-Jan-2021",
"18805":"127.93",
"18801":"75.85"
}
]
}
Is there a way to create a MultiIndex dataframe from this data? I would like Date to be the row index and the rest to be column indexes.
the values key is a straight forward data frame
columns can be rebuilt from headers key
js = {'headers': [{'seriesId': '18805', 'Name': 'Name1', 'assetId': '4'},
{'seriesId': '18801', 'Name': 'Name2', 'assetId': '209'}],
'values': [{'Date': '01-Jan-2021', '18805': '127.93', '18801': '75.85'}]}
# get values into dataframe
df = pd.DataFrame(js["values"]).set_index("Date")
# get headers for use in rebuilding column names
dfc = pd.DataFrame(js["headers"])
# rebuild columns
df.columns = pd.MultiIndex.from_tuples(dfc.apply(tuple, axis=1), names=dfc.columns)
print(df)
seriesId 18805 18801
Name Name1 Name2
assetId 4 209
Date
01-Jan-2021 127.93 75.85
I have dataframe as below with NaN value.
Category,Type,Capacity,Efficiency
Chiller,ChillerA,1000,6.0
Chiller,ChillerB,2000,5.5
Cooling Tower,Cooling TowerA,1000,NaN
Cooling Tower,Cooling TowerB,2000,NaN
I want to convert this pandas dataframe to below json format.
Can anyone tell me how to implement this?
{
"Chiller":{
"ChillerA":{
"Capacity":1000,
"Efficiency":6.0
},
"ChillerB":{
"Capacity":2000,
"Efficiency":5.5
},
},
"Cooling Tower":{
"Cooling TowerA":{
"Capacity":1000 <=Will not include efficiency because efficiency was NaN for this.
},
"Cooling TowerB":{
"Capacity":2000
},
},
}
This is a very robust solution that will get you to desired output using nested dict comprehension:
df = df.set_index(['Category', 'Type'])
{level: {chiller: {name: value for name, value in values.items() if not np.isnan(value)} for chiller, values in df.xs(level).to_dict('index').items()} for level in df.index.levels[0]}
#{'Cooling Tower':
# {'Cooling TowerA':
# {'Capacity': 1000.0},
# 'Cooling TowerB':
# {'Capacity': 2000.0}},
# 'Chiller':
# {'ChillerA': {'Efficiency': 6.0, 'Capacity': 1000.0},
# 'ChillerB': {'Efficiency': 5.5, 'Capacity': 2000.0}}}