Pandas DataFrame - remove / replace dict values based on key - python

Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in column "mail" to remove the key "fax". Eg, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. Eg.;
df = pd.json_normalize(df)
Is there a better way for this?

You can use pop to remove a element from dict having the given key.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com

Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}

This is the simplest technique to achieve your aim.
import pandas as pd
import numpy as np
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output :
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com

Related

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this: Dataframe
This approach can be more efficient when it comes to dealing with large datasets.
data = json.loads(data)
desired_data = list(
map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
if x["Ats"] is not None
else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider using this simple try and except approach when working with small datasets. In this case, whenever an error is found it should append new row to DataFrame with NAN.
Example:
data = json.loads(data)
df = pd.DataFrame()
for item in data:
try:
df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
except TypeError:
df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN

How to remove redundant elements from a JSON string in Python

I have the below JSON string which I converted from a Pandas data frame.
[
{
"ID":"1",
"Salary1":69.43,
"Salary2":513.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
}
]
I want to change the above JSON to the below format.
[
{
"Date":"2022-06-09",
"Name":"john",
"DateTime":"2022-09-0710:57:55",
"employeeId":12,
"Results":[
{
"ID":1,
"Salary1":69.43,
"Salary2":513
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123
}
]
}
]
Kindly let me know how we can achieve this in Python.
Original Dataframe:
ID Salary1 Salary2 Date Name employeeId DateTime
1 69.43 513.0 2022-06-09 john 12 2022-09-0710:57:55
2 691.43 5123.0 2022-06-09 john 12 2022-09-0710:57:55
Thank you.
As #Harsha pointed, you can adapt one of the answers from another question, with just some minor tweaks to make it work for OP's case:
(
df.groupby(["Date","Name","DateTime","employeeId"])[["ID","Salary1","Salary2"]]
# to_dict(orient="records") - returns list of rows, where each row is a dict,
# "oriented" like [{column -> value}, … , {column -> value}]
.apply(lambda x: x.to_dict(orient="records"))
# groupBy makes a Series: with grouping columns as index, and dict as values.
# This structure is no good for the next to_dict() method.
# So here we create new DataFrame out of grouped Series,
# with Series' indexes as columns of DataFrame,
# and also renamimg our Series' values to "Results" while we are at it.
.reset_index(name="Results")
# Finally we can achieve the desired structure with the last call to to_dict():
.to_dict(orient="records")
)
# [{'Date': '2022-06-09', 'Name': 'john', 'DateTime': '2022-09-0710:57:55', 'employeeId': 12,
# 'Results': [
# {'ID': 1, 'Salary1': 69.43, 'Salary2': 513.0},
# {'ID': 2, 'Salary1': 691.43, 'Salary2': 5123.0}
# ]}]

Create pandas MultiIndex Dataframe from json

I am receiving the following json from a webservice:
{
"headers":[
{
"seriesId":"18805",
"Name":"Name1",
"assetId":"4"
},
{
"seriesId":"18801",
"Name":"Name2",
"assetId":"209"
}
],
"values":[
{
"Date":"01-Jan-2021",
"18805":"127.93",
"18801":"75.85"
}
]
}
Is there a way to create a MultiIndex dataframe from this data? I would like Date to be the row index and the rest to be column indexes.
the values key is a straight forward data frame
columns can be rebuilt from headers key
js = {'headers': [{'seriesId': '18805', 'Name': 'Name1', 'assetId': '4'},
{'seriesId': '18801', 'Name': 'Name2', 'assetId': '209'}],
'values': [{'Date': '01-Jan-2021', '18805': '127.93', '18801': '75.85'}]}
# get values into dataframe
df = pd.DataFrame(js["values"]).set_index("Date")
# get headers for use in rebuilding column names
dfc = pd.DataFrame(js["headers"])
# rebuild columns
df.columns = pd.MultiIndex.from_tuples(dfc.apply(tuple, axis=1), names=dfc.columns)
print(df)
seriesId 18805 18801
Name Name1 Name2
assetId 4 209
Date
01-Jan-2021 127.93 75.85

DataFrame groupby a column which has dictionary values

I'm having a dataframe which contains a column as dictionary. And I need to groupby the column by the dictionary values. For example,
import pandas as pd
data = [
{
"name":"xx",
"values":{
"element":[
{
"path":"path1/id1"
},
{
"path":"path2/id1"
}
],
"nonrequired":[
{}
]
}
},
{
"name":"yy",
"values":{
"element":[
{
"path":"path1/id2"
},
{
"path":"path2/id2"
}
],
"nonrequired":[
{}
]
}
}
]
df = pd.DataFrame(data)
What I'm looking for,
I want to groupby the column "values" by inside specific key.
The grouping should be values->element->path
The grouping should be based on the partial path values. For example if path="path1/id2", the
grouping should be based on path="path1"
After grouping I need to extract the result as dictionary.
Expected result:
result = {
'path1': [
{
"name":'xx',
"renamecolumn":['id1','id2']
}
],
'path2': [
{
"name":'yy',
"renamecolumn":['id1','id2']
}
]
}
Still not 100% sure of the logic of the final dictionary creation as the example input and output don't quite match up. However, here is how you can extract the values and you can create your desired dictionary from there.
# ectract the values and split them on the forward slash
df['split'] = df['values'].apply(lambda x: [item['path'].split('/') for item in x['element']])
# generate the path and ids columns
df['path'] = df['split'].apply(lambda x: [x[i][0] for i in range(0,len(x))])
df['ids'] = df['split'].apply(lambda x: [x[i][1] for i in range(0,len(x))])
# separate out all the lists and
result = df.drop(['values', 'split'], axis=1) \
.explode('ids').explode('path').drop_duplicates()
Result is:
name path ids
0 xx path1 id1
0 xx path2 id1
1 yy path1 id2
1 yy path2 id2

creating a json object from pandas dataframe

Groups sub-groups selections
0 sg1 csg1 sc1
1 sg1 csg1 sc2
2 sg1 csg2 sc3
3 sg1 csg2 sc4
4 sg2 csg3 sc5
5 sg2 csg3 sc6
6 sg2 csg4 sc7
7 sg2 csg4 sc8
I have the dataframe mentioned above and I am trying to create a JSON object as follows:
{
"sg1": {
"csg1": ['sc1', 'sc2'],
"csg2": ['sc3', 'sc4']
},
"sg2": {
"csg3": ['sc5', 'sc6'],
"csg4": ['sc7', 'sc8']
}
}
I tried using the pandas to_json and to_dict with orient arguments but I am not getting the expected result. I also tried grouping by the columns and then creating the list and converting it into a JSON.
Any help is much appreciated.
You can groupby ['Groups','sub-groups'] and build a dictionary from the multiindex series with a dictionary comprehension:
s = df.groupby(['Groups','sub-groups']).selections.agg(list)
d = {k1:{k2:v} for (k1,k2),v in s.iteritems()}
print(d)
# {'sg1': {'csg2': ['sc3', 'sc4']}, 'sg2': {'csg4': ['sc7', 'sc8']}}
You need to group on the columns of interest such as:
import pandas as pd
data = {
'Groups': ['sg1', 'sg1', 'sg1', 'sg1', 'sg2', 'sg2', 'sg2', 'sg2'],
'sub-groups': ['csg1', 'csg1', 'csg2', 'csg2', 'csg3', 'csg3', 'csg4', 'csg4'],
'selections': ['sc1', 'sc2', 'sc3', 'sc4', 'sc5', 'sc6', 'sc7', 'sc8']
}
df = pd.DataFrame(data)
print(df.groupby(['Groups', 'sub-groups'])['selections'].unique().to_dict())
The output is:
{
('sg1', 'csg1'): array(['sc1', 'sc2'], dtype=object),
('sg1', 'csg2'): array(['sc3', 'sc4'], dtype=object),
('sg2', 'csg3'): array(['sc5', 'sc6'], dtype=object),
('sg2', 'csg4'): array(['sc7', 'sc8'], dtype=object)
}
Lets try dictify function which builds a nested dictionary with top level keys from the Groups and corresponding sub level keys from sub-groups:
from collections import defaultdict
def dictify():
dct = defaultdict(dict)
for (x, y), g in df.groupby(['Groups', 'sub-groups']):
dct[x][y] = [*g['selections']]
return dict(dct)
# dictify()
{
"sg1": {
"csg1": ["sc1","sc2"],
"csg2": ["sc3","sc4"]
},
"sg2": {
"csg3": ["sc5","sc6"],
"csg4": ["sc7","sc8"]
}
}

Categories