How to convert nested json structure to dataframe - python

I converted a JSON into DataFrame and ended up with a column 'Structure_value' having below values as list of dictionary/dictionaries:
Structure_value
[{'Room': 6, 'Length': 7}, {'Room': 6, 'Length': 7}]
[{'Room': 6, 'Length': 22}]
[{'Room': 6, 'Length': 8}, {'Room': 6, 'Length': 9}]
Since the column holds Python objects, I guess that is why it ended up in this format.
I need to split it into below four columns:
Structure_value_room_1
Structure_value_length_1
Structure_value_room_2
Structure_value_length_2
All other solutions on StackOverflow only deal with converting Simple JSON into DataFrame and not the nested structure.
P.S.: I know I can do something by explicitly naming fields but I need a generic solution so that in future any JSON of this format can be handled
[Edit]: The output should look like this:
Structure_value_room_1 Structure_value_length_1 Structure_value_room_2 \
0 6 7 6.0
1 6 22 NaN
2 6 8 6.0
Structure_value_length_2
0 7.0
1 NaN
2 9.0

Use a list comprehension with a nested dictionary comprehension, using enumerate to make the keys of the dicts unique, then pass the resulting list of dictionaries to the DataFrame constructor:
import pandas as pd

L = [
    {f"{k}_{i}": v for i, y in enumerate(x, 1) for k, v in y.items()}
    for x in df["Structure_value"]
]
df = pd.DataFrame(L)
print(df)
Room_1 Length_1 Room_2 Length_2
0 6 7 6.0 7.0
1 6 22 NaN NaN
2 6 8 6.0 9.0
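To get the column names from the question, use:
def json_to_df(df, column):
    # Prefix each key with the column name and suffix it with its 1-based position in the list
    L = [
        {f"{column}_{k.lower()}_{i}": v for i, y in enumerate(x, 1) for k, v in y.items()}
        for x in df[column]
    ]
    return pd.DataFrame(L)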
df1 = json_to_df(df, 'Structure_value')
print(df1)
Structure_value_room_1 Structure_value_length_1 Structure_value_room_2 \
0 6 7 6.0
1 6 22 NaN
2 6 8 6.0
Structure_value_length_2
0 7.0
1 NaN
2 9.0
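If the original DataFrame has other columns that should survive, the same comprehension can be joined back onto it. This is only a sketch: the `Foo` column and the helper name `expand_column` are made up for illustration, and it assumes every cell of the target column is a list of flat dicts.

```python
import pandas as pd

df = pd.DataFrame({
    "Foo": ["a", "b", "c"],  # hypothetical extra column, not from the question
    "Structure_value": [
        [{"Room": 6, "Length": 7}, {"Room": 6, "Length": 7}],
        [{"Room": 6, "Length": 22}],
        [{"Room": 6, "Length": 8}, {"Room": 6, "Length": 9}],
    ],
})

def expand_column(df, column):
    # One flat dict per row; keys get the column name as prefix and the list position as suffix
    flat = pd.DataFrame(
        [
            {f"{column}_{k.lower()}_{i}": v
             for i, d in enumerate(items, 1)
             for k, v in d.items()}
            for items in df[column]
        ],
        index=df.index,  # keep the original index so the join lines up
    )
    return df.drop(columns=column).join(flat)

out = expand_column(df, "Structure_value")
print(out)
```

Rows with fewer list entries get NaN in the missing columns, as in the output above.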

A non-Pandas solution you can probably apply to your original JSON data, here represented by rows:
import pprint
rows = [
{"Foo": "1", "Structure": [{'Room': 6, 'Length': 7}, {'Room': 6, 'Length': 7}]},
{"Foo": "2", "Structure": [{'Room': 6, 'Length': 22}]},
{"Foo": "3", "Structure": [{'Room': 6, 'Length': 8}, {'Room': 6, 'Length': 9}]},
]
for row in rows:  # Modifies `rows` in-place
    for index, room_info in enumerate(row.pop("Structure", ()), 1):
        for key, value in room_info.items():
            row[f"Structure_value_{key.lower()}_{index}"] = value
pprint.pprint(rows)
outputs
[{'Foo': '1',
'Structure_value_length_1': 7,
'Structure_value_length_2': 7,
'Structure_value_room_1': 6,
'Structure_value_room_2': 6},
{'Foo': '2', 'Structure_value_length_1': 22, 'Structure_value_room_1': 6},
{'Foo': '3',
'Structure_value_length_1': 8,
'Structure_value_length_2': 9,
'Structure_value_room_1': 6,
'Structure_value_room_2': 6}]
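If a DataFrame is still the end goal, the flattened rows can be handed to the constructor afterwards. A minimal sketch, assuming pandas is installed and the data has the same `rows` layout as above:

```python
import pandas as pd

rows = [
    {"Foo": "1", "Structure": [{"Room": 6, "Length": 7}, {"Room": 6, "Length": 7}]},
    {"Foo": "2", "Structure": [{"Room": 6, "Length": 22}]},
    {"Foo": "3", "Structure": [{"Room": 6, "Length": 8}, {"Room": 6, "Length": 9}]},
]

# Flatten each row in place, exactly as in the loop above
for row in rows:
    for index, room_info in enumerate(row.pop("Structure", ()), 1):
        for key, value in room_info.items():
            row[f"Structure_value_{key.lower()}_{index}"] = value

# Keys missing from a row become NaN in the resulting frame
df = pd.DataFrame(rows)
print(df)
```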

Related

Parsing a pandas dataframe into a nested list object

Does anyone have a neat way of packing a dataframe including some columns which indicate hierarchy into a nested array?
Say I have the following data frame:
from pandas import DataFrame
df = DataFrame(
{
"var1": [1, 2, 3, 4, 9],
"var2": [5, 6, 7, 8, 9],
"group_1": [1, 1, 1, 1, 2],
"group_2": [None, 1, 2, 1, None],
"group_3": [None, None, None, 1, None],
}
)
var1 var2 group_1 group_2 group_3
0 1 5 1 NaN NaN
1 2 6 1 1.0 NaN
2 3 7 1 2.0 NaN
3 4 8 1 1.0 1.0
4 9 9 2 NaN NaN
The group_ columns show that the records on the 2nd and 3rd rows are children of the one on the first row. The 4th row is a child of the 2nd row, and the last row of the table has no children. I am looking to derive something like the following:
[
{
"var1": 1,
"var2": 5,
"children": [
{
"var1": 2,
"var2": 6,
"children": [{"var1": 4, "var2": 8, "children": []}],
},
{"var1": 3, "var2": 7, "children": []},
],
},
{"var1": 9, "var2": 9, "children": []},
]
You could check whether the following recursive .groupby over the group_n columns works for you:
import pandas as pd

def nest_it(df, level=1):
    record = {"var1": None, "var2": None, "children": []}
    for key, gdf in df.groupby(f"group_{level}", dropna=False):
        if pd.isna(key):
            # No deeper group value: this row is the record itself
            record["var1"], record["var2"] = map(int, gdf.iloc[0, 0:2])
        elif level == 3:
            # Last group column: the row is a leaf child
            var1, var2 = map(int, gdf.iloc[0, 0:2])
            record["children"].append({"var1": var1, "var2": var2, "children": []})
        else:
            record["children"].append(nest_it(gdf, level=level + 1))
    return record

result = nest_it(df)["children"]
While iterating over the (key, group) tuples from a (nested) df.groupby("group_n"), three things can happen:
The key is NaN, i.e. it's time to record the vars, and there aren't any more children.
The level is 3, i.e. the last group column is reached, so it's also time to wrap up, but this time as a child.
Otherwise (recursion): put the recursively retrieved children into the respective list.
Remark: I've only initialized the record dicts up front to get the item order as in your expected output.
Result for the sample:
[{'var1': 1,
'var2': 5,
'children': [{'var1': 2,
'var2': 6,
'children': [{'var1': 4, 'var2': 8, 'children': []}]},
{'var1': 3, 'var2': 7, 'children': []}]},
{'var1': 9, 'var2': 9, 'children': []}]
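For comparison, here is a non-recursive sketch that does not hard-code the depth of 3. It treats the non-NaN group values of each row as a path and attaches every row to the node at the parent path. The function name `nest_by_path` and its parameters are made up for illustration, and it assumes every child row has a parent row in the frame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "var1": [1, 2, 3, 4, 9],
        "var2": [5, 6, 7, 8, 9],
        "group_1": [1, 1, 1, 1, 2],
        "group_2": [None, 1, 2, 1, None],
        "group_3": [None, None, None, 1, None],
    }
)

def nest_by_path(df, group_cols, value_cols):
    # The empty path is a synthetic root that collects the top-level records
    nodes = {(): {"children": []}}
    rows = []
    for _, row in df.iterrows():
        path = tuple(row[c] for c in group_cols if pd.notna(row[c]))
        rows.append((path, row))
    # Stable sort by depth, so parents are always created before their children
    rows.sort(key=lambda item: len(item[0]))
    for path, row in rows:
        node = {c: int(row[c]) for c in value_cols}
        node["children"] = []
        nodes[path] = node
        nodes[path[:-1]]["children"].append(node)
    return nodes[()]["children"]

result = nest_by_path(df, ["group_1", "group_2", "group_3"], ["var1", "var2"])
print(result)
```

Because the sort is stable, siblings keep the order they have in the DataFrame.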

Is there a way to store a dictionary on each row of a dataframe column using a vectorized operation?

I am attempting to nest a dictionary inside of a dataframe.
here's an example of what I have:
x y z
1 2 3
4 5 6
7 8 9
here's an example of what I want:
x y z
1 2 {'z':3}
4 5 {'z':6}
7 8 {'z':9}
For this specific application, the whole point of using pandas is the vectorized operations that are scalable and efficient. Is it possible to transform that column into a column of dictionaries? I have attempted to use string concatenation, but then it is stored in pandas as a string and not a dict, and returns later with quotations around the dictionary because it is a string.
Example
data = {'x': {0: 1, 1: 4, 2: 7}, 'y': {0: 2, 1: 5, 2: 8}, 'z': {0: 3, 1: 6, 2: 9}}
df = pd.DataFrame(data)
Code
df['z'] = pd.Series(df[['z']].T.to_dict())
df
x y z
0 1 2 {'z': 3}
1 4 5 {'z': 6}
2 7 8 {'z': 9}
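Note that dictionaries end up in an object-dtype column either way, so pandas loops over Python objects internally rather than running a vectorized kernel. A plain list comprehension is an equally valid way to build the column. This is a sketch on the same sample data, not a benchmark:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 4, 7], "y": [2, 5, 8], "z": [3, 6, 9]})

# tolist() yields plain Python ints rather than NumPy scalars
df["z"] = [{"z": v} for v in df["z"].tolist()]
print(df)
```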

extract duplicate values with 3 or more duplicates in a column pandas dataframe

I'm trying to extract a dataframe which only shows duplicates with e.g 3 or more duplicates in a column. For example:
df = pd.DataFrame({
'one': pd.Series(['Berlin', 'Berlin', 'Tokyo', 'Stockholm','Berlin','Stockholm','Amsterdam']),
'two': pd.Series([1, 2, 3, 4, 5, 6, 7]),
'three': pd.Series([8, 9, 10, 11, 12])
})
Expected output:
one two three
0 Berlin 1 8
The extraction should only show the row of the first duplicate.
You could do it like this:
rows = df.groupby('one').filter(lambda group: group.shape[0] >= 3).groupby('one').first()
Output:
>>> rows
        two  three
one
Berlin    1    8.0
It works with multiple groups of 3+ duplicates, too. I tested it.
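If the result should keep the original row index and the 'one' column, as in the expected output, a variant with transform and drop_duplicates may fit better. A sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series(["Berlin", "Berlin", "Tokyo", "Stockholm", "Berlin", "Stockholm", "Amsterdam"]),
    "two": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "three": pd.Series([8, 9, 10, 11, 12]),
})

# Size of the group each row belongs to, aligned with the original index
counts = df.groupby("one")["one"].transform("size")

# Keep rows from groups with 3+ members, then only the first row of each group
result = df[counts >= 3].drop_duplicates("one")
print(result)
```

The 'three' column is float here because the shorter Series leaves NaN in the last two rows.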

How to convert dataframe into dictionary of sets?

I have a dataframe and want to convert it into a dictionary of sets.
To be specific, my dataframe and the result I want are shown below:
month date
0 JAN 1
1 JAN 1
2 JAN 1
3 FEB 2
4 FEB 2
5 FEB 3
6 MAR 1
7 MAR 2
8 MAR 3
My goal:
dict = {'JAN' : {1}, 'FEB' : {2,3}, 'MAR' : {1,2,3}}
I also wrote the code below, but I am not sure it is suitable.
In reality, the data is large,
so I would like to know any tips or other efficient (faster) way to make it.
import pandas as pd
df = pd.DataFrame({'month' : ['JAN','JAN','JAN','FEB','FEB','FEB','MAR','MAR','MAR'],
                   'date' : [1, 1, 1, 2, 2, 3, 1, 2, 3]})
df_list = df.values.tolist()
monthSet = ['JAN','FEB','MAR']
inst_id_dict = {}
for i in df_list:
    monStr = i[0]
    if monStr in monthSet:
        inst_id = i[1]
        inst_id_dict.setdefault(monStr, set([])).add(inst_id)
Let's try grouping on the 'month' column, then aggregating by GroupBy.unique:
df.groupby('month', sort=False)['date'].unique().map(list).to_dict()
# {'JAN': [1], 'FEB': [2, 3], 'MAR': [1, 2, 3]}
Or, if you'd prefer a dictionary of sets, use GroupBy.agg:
df.groupby('month', sort=False)['date'].agg(set).to_dict()
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
Another idea is to build the dict iteratively (despite the explicit loop, this may well be faster than the groupby option, so time both on your data):
out = {}
for m, d in df.drop_duplicates(['month', 'date']).to_numpy():
    out.setdefault(m, set()).add(d)
out
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
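To see which option is faster on a given dataset, the two approaches can be wrapped in functions and timed. The sketch below is only a rough harness: the function names and the 10,000x repetition factor are arbitrary choices for illustration, and the timings will vary with the data and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "month": ["JAN"] * 3 + ["FEB"] * 3 + ["MAR"] * 3,
    "date": [1, 1, 1, 2, 2, 3, 1, 2, 3],
})

def by_groupby(frame):
    return frame.groupby("month", sort=False)["date"].agg(set).to_dict()

def by_loop(frame):
    out = {}
    for m, d in frame.drop_duplicates(["month", "date"]).to_numpy():
        out.setdefault(m, set()).add(d)
    return out

# Both approaches should agree before their speed is compared
assert by_groupby(df) == by_loop(df)

big = pd.concat([df] * 10_000, ignore_index=True)  # hypothetical "large" input
for fn in (by_groupby, by_loop):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```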

How to multiply to each value in each element in SArray in Python?

I'm using Graphlab, but I guess this question can apply to pandas.
import graphlab
sf = graphlab.SFrame({'id': [1, 2, 3], 'user_score': [{"a":4, "b":3}, {"a":5, "b":7}, {"a":2, "b":3}], 'weight': [4, 5, 2]})
I want to create a new column where the value of each element in 'user_score' is multiplied by the number in 'weight'. That is,
sf = graphlab.SFrame({'id': [1, 2, 3], 'user_score': [{"a":4, "b":3}, {"a":5, "b":7}, {"a":2, "b":3}], 'weight': [4, 5, 2], 'new': [{"a":16, "b":12}, {"a":25, "b":35}, {"a":4, "b":6}]})
I tried to write a simple function below and applied to no avail. Any thoughts?
def trans(x, y):
    d = dict()
    for k, v in x.items():
        d[k] = v*y
    return d

sf.apply(trans(sf['user_score'], sf['weight']))
It got the following error message:
AttributeError: 'SArray' object has no attribute 'items'
I'm using pandas dataframe, but it should also work in your case.
import pandas as pd
df['new']=[dict((k,v*y) for k,v in x.items()) for x, y in zip(df['user_score'], df['weight'])]
Input dataframe:
df
Out[34]:
id user_score weight
0 1 {u'a': 4, u'b': 3} 4
1 2 {u'a': 5, u'b': 7} 5
2 3 {u'a': 2, u'b': 3} 2
Output:
df
Out[36]:
id user_score weight new
0 1 {u'a': 4, u'b': 3} 4 {u'a': 16, u'b': 12}
1 2 {u'a': 5, u'b': 7} 5 {u'a': 25, u'b': 35}
2 3 {u'a': 2, u'b': 3} 2 {u'a': 4, u'b': 6}
This is subtle, but I think what you want is this:
sf.apply(lambda row: trans(row['user_score'], row['weight']))
The apply function takes a function as its argument, and will pass each row as the parameter to that function. In your version, you are evaluating the trans function before apply is called, which is why the error message complains about passing an SArray to the trans function when a dict is expected.
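The same difference between calling a function and passing one can be shown in pandas, for readers without GraphLab. This is an analogous sketch, not SFrame code: pandas needs axis=1 to hand each row to the function, and `trans` is rewritten here as a dict comprehension.

```python
import pandas as pd

def trans(scores, weight):
    # Multiply every value of one row's dict by that row's weight
    return {k: v * weight for k, v in scores.items()}

df = pd.DataFrame({
    "id": [1, 2, 3],
    "user_score": [{"a": 4, "b": 3}, {"a": 5, "b": 7}, {"a": 2, "b": 3}],
    "weight": [4, 5, 2],
})

# Pass a callable: apply invokes it once per row, so trans sees a dict and a number
df["new"] = df.apply(lambda row: trans(row["user_score"], row["weight"]), axis=1)
print(df["new"].tolist())
```

Writing `df.apply(trans(df["user_score"], df["weight"]))` instead would call `trans` on whole columns first and fail in the same way as the original attempt.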
here is one of many possible solutions:
In [69]: df
Out[69]:
id user_score weight
0 1 {'b': 3, 'a': 4} 4
1 2 {'b': 7, 'a': 5} 5
2 3 {'b': 3, 'a': 2} 2
In [70]: df['user_score'] = df['user_score'].apply(lambda x: pd.Series(x)).mul(df.weight, axis=0).to_dict('records')
In [71]: df
Out[71]:
id user_score weight
0 1 {'b': 12, 'a': 16} 4
1 2 {'b': 35, 'a': 25} 5
2 3 {'b': 6, 'a': 4} 2
