Replace DataFrame column with nested dictionary value - python

I'm trying to replace the 'starters' column of this DataFrame
starters
roster_id
Bob 3086
Bob 1234
Cam 6130
... ...
with the player names from a large nested dict like this. The values in my 'starters' column are the keys.
{
"3086": {
"team": "NE",
"player_id":"3086",
"full_name": "tombrady",
},
"1234": {
"team": "SEA",
"player_id":"1234",
"full_name": "RussellWilson",
},
"6130": {
"team": "BUF",
"player_id":"6130",
"full_name": "DevinSingletary",
},
...
}
I tried using DataFrame.replace(dict) and Series.map(dict), but that gives me back all the player info instead of just the name.
Is there a way to do this with a nested dict? Thanks.

Let df be the DataFrame and d the dictionary. You can then use pandas' apply along axis 1 to look up each name and assign the result back to the column:
df['starters'] = df.apply(lambda x: d[str(x.starters)]['full_name'], axis=1)

I am not sure I understand your question correctly. Have you tried using dict['full_name'] instead of simply dict?

Try pd.concat with Series.map:
>>> pd.concat([
df,
pd.DataFrame.from_records(
df.astype(str)
.starters
.map(dct)
.values
).set_index(df.index)
], axis=1)
starters team player_id full_name
roster_id
Bob 3086 NE 3086 tombrady
Bob 1234 SEA 1234 RussellWilson
Cam 6130 BUF 6130 DevinSingletary
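
If the goal is only to replace the 'starters' column with the name, one option is to flatten the nested dict into a plain id-to-name lookup first and pass that to Series.map. The following is a minimal sketch under the assumptions that the keys of the nested dict are strings and that the column holds integers; the names players and names are illustrative, not from the original post.

```python
import pandas as pd

# Sample data mirroring the question; the index is named roster_id.
df = pd.DataFrame(
    {"starters": [3086, 1234, 6130]},
    index=pd.Index(["Bob", "Bob", "Cam"], name="roster_id"),
)

players = {
    "3086": {"team": "NE", "player_id": "3086", "full_name": "tombrady"},
    "1234": {"team": "SEA", "player_id": "1234", "full_name": "RussellWilson"},
    "6130": {"team": "BUF", "player_id": "6130", "full_name": "DevinSingletary"},
}

# Flatten the nested dict to a simple {player_id: full_name} lookup.
names = {player_id: info["full_name"] for player_id, info in players.items()}

# The dict keys are strings, so cast the column to str before mapping.
df["starters"] = df["starters"].astype(str).map(names)
print(df)
```

Ids that are missing from the lookup come back as NaN rather than raising a KeyError, which may or may not be what you want.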

Related

Load nested JSON with incremental key/timestamp into Pandas DataFrame

I am trying to read a JSON dataset (part of it is shown below). I want to flatten it into a Pandas DataFrame so that I have access to all columns, in particular "A" and "B", for further processing.
import pandas as pd
datajson= {
"10001": {
"extra": {"user": "Tom"},
"data":{"A":5, "B":10}
},
"10002":{
"extra": {"user": "Ben"},
"data":{"A":7, "B":20}
},
"10003":{
"extra": {"user": "Ben"},
"data":{"A":6, "B":15}
}
}
df = pd.read_json(datajson, orient='index')
# same with DataFrame.from_dict
# df2 = pd.DataFrame.from_dict(datajson, orient='index')
which results in a DataFrame.
I am assuming there is a simple way to do this without looping/appending or writing a complicated and slow decoder, for example by using pandas' json_normalize().
I don't think you will be able to do that without looping through the json. You can do that relatively efficiently though if you make use of a list comprehension:
def parse_inner_dictionary(data):
    return pd.concat([pd.DataFrame(i, index=[0]) for i in data.values()], axis=1)
df = pd.concat([parse_inner_dictionary(v) for v in datajson.values()])
df.index = datajson.keys()
print(df)
user A B
10001 Tom 5 10
10002 Ben 7 20
10003 Ben 6 15
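
As an alternative to concatenating one small DataFrame per record, json_normalize can flatten the inner records in one call. This is a sketch, assuming the leaf key names ("user", "A", "B") are unique across the nested sections; the variable name flat is illustrative.

```python
import pandas as pd

datajson = {
    "10001": {"extra": {"user": "Tom"}, "data": {"A": 5, "B": 10}},
    "10002": {"extra": {"user": "Ben"}, "data": {"A": 7, "B": 20}},
    "10003": {"extra": {"user": "Ben"}, "data": {"A": 6, "B": 15}},
}

# Flatten each inner record; nested keys become dotted column names
# such as "extra.user" and "data.A".
flat = pd.json_normalize(list(datajson.values()))

# Keep only the last part of each dotted name (assumes leaf names are unique).
flat.columns = [col.split(".")[-1] for col in flat.columns]

# Restore the outer keys as the index.
flat.index = list(datajson.keys())
print(flat)
```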

Pandas DataFrame - remove / replace dict values based on key

Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to alter the value in column "mail" to remove the key "fax". For example, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. For example:
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove the element with the given key from a dict.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
Is there a reason you don't just access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output :
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
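
The answers above operate on a plain dict. If the data really is a DataFrame whose "mail" column holds one dict per row, a similar effect can be had with Series.apply. This is a sketch under that assumption; the helper drop_key is hypothetical and returns a new dict rather than mutating the original.

```python
import pandas as pd

# Assumed layout: one row per customer, with dicts stored in the cells.
df = pd.DataFrame({
    "customer_name": ["john"],
    "phone": [{"mobile": 0, "office": 111}],
    "mail": [{"office": "john#office.com",
              "personal": "john#home.com",
              "fax": "12345"}],
})

def drop_key(mapping, key):
    """Return a copy of mapping without the given key (no error if absent)."""
    return {k: v for k, v in mapping.items() if k != key}

# Rebuild each cell of the "mail" column without the "fax" entry.
df["mail"] = df["mail"].apply(lambda m: drop_key(m, "fax"))
print(df.loc[0, "mail"])
```

Building a new dict avoids side effects on any other object that still references the original cell values.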

How to create a multi index dataframe from a nested dictionary?

I have a nested dictionary, whose first level keys are [0, 1, 2...] and the corresponding values of each key are of the form:
{
"geometry": {
"type": "Point",
"coordinates": [75.4516454, 27.2520587]
},
"type": "Feature",
"properties": {
"state": "Rajasthan",
"code": "BDHL",
"name": "Badhal",
"zone": "NWR",
"address": "Kishangarh Renwal, Rajasthan"
}
}
I want to make a pandas dataframe of the form:
Geometry Type Properties
Type Coordinates State Code Name Zone Address
0 Point [..., ...] Features Rajasthan BDHL ... ... ...
1
2
I am not able to understand the examples over the net about multi indexing/nested dataframe/pivoting. None of them seem to take the first level keys as the primary index in the required dataframe.
How do I get from the data I have, to making it into this formatted dataframe?
I would suggest creating columns such as "geometry_type", "geometry_coord", etc., to differentiate these columns from the column you would name "type". In other words, use the first-level key as a prefix and the subkey as the rest of the name. Then parse the data and fill your DataFrame like this:
import json
import pandas as pd
with open("your_json.json") as f:
    j = json.load(f)  # json.loads expects a JSON string, not a file name
rows = []
for k, v in j.items():
    if k == "geometry":
        rows.append({
            "geometry_type": v.get("type"),
            "geometry_coord": v.get("coordinates")
        })
    ...
# DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
df = pd.DataFrame(rows, columns=["geometry_type", "geometry_coord", ... ])
Your output could then look like this:
geometry_type geometry_coord ...
0 [75.4516454, 27.2520587] NaN ...
PS: If you really want to go with your initial option, you could check here: Giving a column multiple indexes/headers
I suppose you have a list of nested dictionaries.
Use json_normalize to read the JSON data, then split the current column index into two parts using str.partition:
import pandas as pd
import json
data = json.load(open('data.json'))
df = pd.json_normalize(data)
df.columns = df.columns.str.partition('.', expand=True).droplevel(level=1)
Output:
>>> df.columns
MultiIndex([( 'type', ''),
( 'geometry', 'type'),
( 'geometry', 'coordinates'),
('properties', 'state'),
('properties', 'code'),
('properties', 'name'),
('properties', 'zone'),
('properties', 'address')],
)
>>> df
type geometry properties
type coordinates state code name zone address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
You can use pd.json_normalize() to normalize the nested dictionary into a dataframe df.
Then, split the column names with dots into multi-index with Index.str.split on df.columns with parameter expand=True, as follows:
Step 1: Normalize nested dict into a dataframe
j = {
"geometry": {
"type": "Point",
"coordinates": [75.4516454, 27.2520587]
},
"type": "Feature",
"properties": {
"state": "Rajasthan",
"code": "BDHL",
"name": "Badhal",
"zone": "NWR",
"address": "Kishangarh Renwal, Rajasthan"
}
}
df = pd.json_normalize(j)
Step 1 Result:
print(df)
type geometry.type geometry.coordinates properties.state properties.code properties.name properties.zone properties.address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
Step 2: Create Multi-index column labels
df.columns = df.columns.str.split('.', expand=True)
Step 2 (Final) Result:
print(df)
type geometry properties
NaN type coordinates state code name zone address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
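
To also use the first-level keys (0, 1, 2, ...) as the row index, the same two steps can be applied to the values of the outer dict. This is a sketch, assuming the outer structure is a dict keyed by integers; the name records is illustrative, and the second entry simply reuses the record from the question.

```python
import pandas as pd

record = {
    "geometry": {"type": "Point", "coordinates": [75.4516454, 27.2520587]},
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan",
    },
}

# Outer dict keyed by 0, 1, ...; the same record is reused for illustration.
records = {0: record, 1: dict(record)}

# Flatten every inner record into dotted column names.
df = pd.json_normalize(list(records.values()))

# Use the first-level keys as the row index.
df.index = list(records.keys())

# Split "geometry.type" into ("geometry", "type") to get MultiIndex columns.
df.columns = df.columns.str.split(".", expand=True)
print(df)
```

The top-level "type" column has no second part, so its second level is NaN, as in the final result shown above.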

Create json files from dataframe

I have a dataframe with 10 rows like this:
id name team
1 Employee1 Team1
2 Employee2 Team2
...
How can I generate 10 json files from the dataframe with python?
Here is the format of each json file:
{
"Company": "Company",
"id": "1",
"name": "Employee1",
"team": "Team1"
}
The field "Company": "Company" is the same in all json files.
The name of each JSON file is the employee's name (e.g. Employee1.json).
I do not really like iterrows but as you need a file per row, I cannot imagine how to vectorize the operation:
for _, row in df.iterrows():
    row['Company'] = 'Company'
    row.to_json(row['name'] + '.json')
You could use apply in the following way:
df.apply(lambda x: x.to_json(), axis=1)
Inside to_json, pass a file path built from the employee name, which is available as x['name'].
Another approach is to iterate over the rows like:
for i in df.index:
    df.loc[i].to_json("Employee{}.json".format(i))
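
If the files should match the requested format exactly ("Company" first, "id" as a string), another option is to build each record as a plain dict and write it with the json module. This is a sketch, not the only way to do it; the function name write_employee_files and the temporary output directory are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def write_employee_files(df, out_dir, company="Company"):
    """Write one JSON file per row, named after the row's 'name' value."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # astype(str) turns the numeric id into "1", "2", ... as in the target format.
    for record in df.astype(str).to_dict(orient="records"):
        payload = {"Company": company, **record}  # constant field goes first
        path = out_dir / f"{record['name']}.json"
        path.write_text(json.dumps(payload, indent=4))
        paths.append(path)
    return paths

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Employee1", "Employee2"],
    "team": ["Team1", "Team2"],
})

# A temporary directory keeps the example from writing into the working directory.
paths = write_employee_files(df, tempfile.mkdtemp())
print([p.name for p in paths])
```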

How to create a Pandas DataFrame from row-based list of dictionaries

I have a data structure like this:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has values of type integer, string or
list of strings (not all keys are in all dicts present), each
dictionary represents a row in a table; all rows are given as the list
of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor takes row-based arrays (amongst other structures) as data input. Therefore the following works:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
}]
df = pd.DataFrame(data)
print(df)
Output:
character name pattern skills
0 mean leopard striped [sprinting, hiding]
1 good antilope NaN [running]
