How to create a multi index dataframe from a nested dictionary? - python

I have a nested dictionary, whose first level keys are [0, 1, 2...] and the corresponding values of each key are of the form:
{
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}
I want to make a pandas dataframe of the form:
  Geometry                          Type     Properties
  Type      Coordinates                      State      Code  Name  Zone  Address
0 Point     [..., ...]              Feature  Rajasthan  BDHL  ...   ...   ...
1
2
I am not able to understand the examples over the net about multi indexing/nested dataframe/pivoting. None of them seem to take the first level keys as the primary index in the required dataframe.
How do I get from the data I have, to making it into this formatted dataframe?

I would suggest creating columns such as "geometry_type", "geometry_coord", etc., in order to differentiate these columns from the column you would name "type". In other words, use your first key as a prefix and the subkey as the name, creating a new column name. After that, just parse the data and fill your DataFrame like this:
import json

import pandas as pd

with open("your_json.json") as f:  # json.loads() parses a string; use json.load() for a file
    j = json.load(f)

df = pd.DataFrame(columns=["geometry_type", "geometry_coord", ... ])
for k, v in j.items():
    if k == "geometry":
        # DataFrame.append was removed in pandas 2.0; build a one-row frame and concat instead
        row = pd.DataFrame([{
            "geometry_type": v.get("type"),
            "geometry_coord": v.get("coordinates"),
        }])
        df = pd.concat([df, row], ignore_index=True)
    ...
Your output could then look like this:
  geometry_type             geometry_coord ...
0         Point  [75.4516454, 27.2520587] ...
PS: and if you really want to go for your initial option, you could check here: Giving a column multiple indexes/headers
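As a side note, pd.json_normalize can build prefix-style flat columns in one call via its sep parameter (the subkey names are used verbatim, so you get geometry_coordinates rather than geometry_coord); a minimal sketch, assuming the per-key records are first collected into a list:
import pandas as pd

records = [j]  # with the question's data: records = list(data.values())
flat = pd.json_normalize(records, sep="_")
# columns: type, geometry_type, geometry_coordinates, properties_state, ...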

I suppose you have a list of nested dictionaries.
Use json_normalize to read the JSON data, then split the resulting column index into two parts using str.partition:
import pandas as pd
import json
data = json.load(open('data.json'))
df = pd.json_normalize(data)
df.columns = df.columns.str.partition('.', expand=True).droplevel(level=1)
Output:
>>> df.columns
MultiIndex([(      'type',            ''),
            (  'geometry',        'type'),
            (  'geometry', 'coordinates'),
            ('properties',       'state'),
            ('properties',        'code'),
            ('properties',        'name'),
            ('properties',        'zone'),
            ('properties',     'address')],
           )
>>> df
      type geometry                            properties
           type       coordinates                   state  code    name zone                       address
0  Feature    Point   [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR  Kishangarh Renwal, Rajasthan
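If data.json actually holds the outer dict keyed by 0, 1, 2, ... (as in the question) rather than a list, feed json_normalize the values instead; a sketch:
df = pd.json_normalize(list(data.values()))  # one record per first-level key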

You can use pd.json_normalize() to normalize the nested dictionary into a dataframe df.
Then, split the dotted column names into a MultiIndex with Index.str.split on df.columns, passing expand=True, as follows:
Step 1: Normalize nested dict into a dataframe
import pandas as pd

j = {
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}
df = pd.json_normalize(j)
Step 1 Result:
print(df)
      type geometry.type      geometry.coordinates properties.state properties.code properties.name properties.zone            properties.address
0  Feature         Point  [75.4516454, 27.2520587]        Rajasthan            BDHL          Badhal             NWR   Kishangarh Renwal, Rajasthan
Step 2: Create Multi-index column labels
df.columns = df.columns.str.split('.', expand=True)
Step 2 (Final) Result:
print(df)
      type geometry                            properties
       NaN     type       coordinates               state  code    name zone                       address
0  Feature    Point   [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR  Kishangarh Renwal, Rajasthan
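A small wrinkle: the un-nested type column gets NaN in the second header level. If you prefer an empty label there, the tuples can be rebuilt; a minimal sketch:
df.columns = pd.MultiIndex.from_tuples(
    [(a, "" if pd.isna(b) else b) for a, b in df.columns]
)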

Related

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with a specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would have one row per ID, with NaN for Name and Desc where Ats is null.
This approach can be more efficient when it comes to dealing with large datasets.
import numpy as np
import pandas as pd

data = json.loads(data)
desired_data = list(map(
    lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
    if x["Ats"] is not None
    else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan},
    data))
df = pd.DataFrame(desired_data)
Output:
  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101
You might want to consider using this simple try/except approach when working with small datasets: whenever an error is raised, it appends a new row of NaNs to the DataFrame.
Example:
import numpy as np
import pandas as pd

data = json.loads(data)
frames = []
for item in data:
    try:
        frames.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # "Ats" is null: fall back to a one-row frame of NaNs
        # (DataFrame.append was removed in pandas 2.0, so collect frames and concat once)
        frames.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(frames, ignore_index=True)
print(df)
Output:
  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to the requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]  # pull out the inner "Ats" list; null stays NaN
df = df.explode("Ats")  # one row per inner dict
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)  # expand dicts into columns
print(df)
Prints:
    ID Name     Desc
0  100  At1  Lazy At
1  101  NaN      NaN

How to use a for loop in lambda and map?

I have a dataframe df that has a column tags. Each element of the tags column is a list of dictionaries and looks like this:
[
    {
        "id": "new",
        "name": "new",
        "slug": null,
        "type": "HashTag",
        "endIndex": 0,
        "startIndex": 0
    },
    {
        "id": "1234",
        "name": "abc ltd.",
        "slug": "5678",
        "type": "StockTag",
        "endIndex": 0,
        "startIndex": 0
    }
]
The list can have any number of elements.
I want to filter the dataframe df for rows where any element of the tags column has type StockTag or UserTag.
I was able to check whether the first element of the list has type StockTag as follows:
df[df['tags'].map(lambda d: d[0]['type'] == 'StockTag')]
I am unable to check the other elements. Instead of checking only the first (index=0) element, I want to iterate through all the elements and check each one.
Any help on this?
I'm supposing you have a dataframe like this:
        data                                               tags
0  some_data  [{'id': 'new', 'name': 'new', 'slug': None, 't...
where the tags column contains lists of dictionaries.
Then you can use any() to search the tags column for StockTag type:
print(df[df["tags"].apply(lambda x: any(d["type"] == "StockTag" for d in x))])

Replace DataFrame column with nested dictionary value

I'm trying to replace the 'starters' column of this DataFrame
           starters
roster_id
Bob            3086
Bob            1234
Cam            6130
...             ...
with the player names from a large nested dict like this. The values in my 'starters' column are the keys.
{
    "3086": {
        "team": "NE",
        "player_id": "3086",
        "full_name": "tombrady",
    },
    "1234": {
        "team": "SEA",
        "player_id": "1234",
        "full_name": "RussellWilson",
    },
    "6130": {
        "team": "BUF",
        "player_id": "6130",
        "full_name": "DevinSingletary",
    },
    ...
}
I tried using DataFrame.replace(dict) and DataFrame.map(dict) but that gives me back all the player info instead of just the name.
Is there a way to do this with a nested dict? Thanks.
Let df be the dataframe and d be the dictionary; then you can use apply from pandas on axis 1 to change the column:
df.apply(lambda x: d[str(x.starters)]['full_name'], axis=1)
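To actually replace the column, assign the result back; a sketch reusing the same names:
df["starters"] = df.apply(lambda x: d[str(x.starters)]["full_name"], axis=1)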
I am not sure, if I understand your question correctly. Have you tried using dict['full_name'] instead of simply dict?
Try pd.concat with series.map:
>>> pd.concat([
...     df,
...     pd.DataFrame.from_records(
...         df.astype(str)
...           .starters
...           .map(dct)
...           .values
...     ).set_index(df.index)
... ], axis=1)
           starters team player_id        full_name
roster_id
Bob            3086   NE      3086         tombrady
Bob            1234  SEA      1234    RussellWilson
Cam            6130  BUF      6130  DevinSingletary
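Alternatively, if only the name is needed, you can flatten the nested dict into a plain id-to-name mapping once and map it over the column; a sketch, assuming the nested dict is called dct as above:
names = {pid: info["full_name"] for pid, info in dct.items()}
df["starters"] = df["starters"].astype(str).map(names)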

How to create a Pandas DataFrame from row-based list of dictionaries

I have a data structure like this:
data = [{
    "name": "leopard",
    "character": "mean",
    "skills": ["sprinting", "hiding"],
    "pattern": "striped",
},
{
    "name": "antilope",
    "character": "good",
    "skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has values of type integer, string or list of strings (not all keys are present in all dicts); each dictionary represents a row in a table, and all rows are given as the list of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor takes a list of row dicts (amongst other structures) as data input. Therefore the following works:
import pandas as pd

data = [{
    "name": "leopard",
    "character": "mean",
    "skills": ["sprinting", "hiding"],
    "pattern": "striped",
},
{
    "name": "antilope",
    "character": "good",
    "skills": ["running"],
}]
df = pd.DataFrame(data)
print(df)
Output:
  character      name  pattern               skills
0      mean   leopard  striped  [sprinting, hiding]
1      good  antilope      NaN            [running]
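For what it's worth, pd.json_normalize(data) produces an equivalent frame here, since these dicts have no nesting; missing keys likewise come out as NaN (column order may differ). A sketch:
df = pd.json_normalize(data)  # "skills" stays a list-valued column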

Setting column in Google Sheets API (with Python) to be number-formatted

I'm trying to format a column of numbers in Google Sheets using the API (Sheets API v.4 and Python 3.6.1, specifically). A portion of my non-functional code is below. I know it's executing, as the background color of the column gets set, but the numbers still show as text, not numbers.
Put another way, I'm trying to get the equivalent of clicking on a column header (A, B, C, or whatever) then choosing the Format -> Number -> Number menu item in the GUI.
def sheets_batch_update(SHEET_ID, data):
    print("Sheets: Batch update")
    service.spreadsheets().batchUpdate(spreadsheetId=SHEET_ID, body=data).execute()  # ,valueInputOption='RAW'
data = {
    "requests": [
        {
            "repeatCell": {
                "range": {
                    "sheetId": all_sheets['Users'],
                    "startColumnIndex": 19,
                    "endColumnIndex": 20
                },
                "cell": {
                    "userEnteredFormat": {
                        "numberFormat": {
                            "type": "NUMBER",
                            "pattern": "#,##0",
                        },
                        "backgroundColor": {
                            "red": 0.0,
                            "green": 0.4,
                            "blue": 0.4
                        },
                    }
                },
                "fields": "userEnteredFormat(numberFormat,backgroundColor)"
            }
        },
    ]
}
sheets_batch_update(SHEET_ID, data)
The problem is likely that your data is currently stored as strings and therefore not affected by the number format.
"userEnteredValue": {
"stringValue": "1000"
},
"formattedValue": "1000",
"userEnteredFormat": {
"numberFormat": {
"type": "NUMBER",
"pattern": "#,##0"
}
},
When you set a number format via the UI (Format > Number > ...) it's actually doing two things at once:
Setting the number format.
Converting string values to number values, if possible.
Your API call is only doing #1, so any cells that are currently set with a string value will remain a string value and will therefore be unaffected by the number format. One solution would be to go through the affected values and move the stringValue to a numberValue if the cell contains a number.
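A related knob, in case the strings came from an upload (an assumption about how the data was written): values sent with valueInputOption='RAW' are stored verbatim as strings, while 'USER_ENTERED' makes Sheets parse them as if typed into the UI, so numeric strings become real numbers. A sketch, reusing service and SHEET_ID from the question; the range is hypothetical (column T corresponds to startColumnIndex 19):
body = {"values": [[1000], [2500]]}
service.spreadsheets().values().update(
    spreadsheetId=SHEET_ID,
    range="Users!T1",
    valueInputOption="USER_ENTERED",  # parse as if typed into the UI
    body=body,
).execute()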
To flesh out the answer from Eric Koleda a bit more, I ended up solving this two ways, depending on how I was getting the data for the Sheet:
First, if I was appending cells to the sheet, I used a function:
import re

def set_cell_type(cell_contents):
    current_cell_contents = str(cell_contents).replace(',', '')
    float_cell = re.compile(r"^\d+\.\d+$")  # raw strings avoid invalid-escape warnings
    int_cell = re.compile(r"^\d+$")
    if int_cell.search(current_cell_contents):
        data = {"userEnteredValue": {"numberValue": int(current_cell_contents)}}
    elif float_cell.search(current_cell_contents):
        data = {"userEnteredValue": {"numberValue": float(current_cell_contents)}}
    else:
        data = {"userEnteredValue": {"stringValue": str(cell_contents)}}
    return data
to format the cells properly. Here's the call that actually did the appending:
rows = [{"values": [set_cell_type(cell) for cell in row]} for row in daily_data_output]
data = { "requests": [ { "appendCells": { "sheetId": all_sheets['Daily record'], "rows": rows, "fields": "*", } } ], }
sheets_batch_update(SHEET_ID,data)
Second, if I was replacing a whole sheet, I did:
# convert the ints to ints and floats to floats
float_cell = re.compile(r"^\d+\.\d+$")
int_cell = re.compile(r"^\d+$")
row_list = error_message.split("\t")
i = 0
while i < len(row_list):
    current_cell = row_list[i].replace(',', '')  # remove the commas from any numbers
    if int_cell.search(current_cell):
        row_list[i] = int(current_cell)
    elif float_cell.search(current_cell):
        row_list[i] = float(current_cell)
    i += 1
error_output.append(row_list)
then the following to actually save error_output to the sheet:
data = {'values': error_output}
sheets_update(SHEET_ID,data,'Errors!A1')
Those two techniques, coupled with the formatting calls I had already figured out in my initial question, did the trick.
