I have a json file with a deeply nested recursive structure:
{"children": [
"val" = x
"data" = y
"children": [{
"val" = x
"data" = y
"children": [{
....
"val" = x
"data" = y
"children": [{
"val" = x
"data" = y
"children": [{
....
Using pandas json_normalize as follows:
json_normalize(data = self.data["children"], record_path="children")
gives a DataFrame where the first level is flattened, but the deeper levels remain JSON strings within the DataFrame.
How can I flatten my DataFrame so that the entire JSON tree is unpacked and flattened?
Provided your JSON is well formatted and has the same structure at all levels, you can extract all the data by passing a list of keys to json_normalize, one entry per level.
from pandas import json_normalize

json = {'children': [{
'val': 1,
'data': 2,
'children': [{
'val': 3,
'data' : 4,
'children': [{'val' : 4,
'data' : 5}],
}],
},{
'val' : 6,
'data' : 7,
'children': [{
'val' : 8,
'data' : 9,
'children': [{'val' : 10,
'data' : 11}],
}]
}]}
for i in range(1, 4):
    print(json_normalize(data=json, record_path=['children'] * i))
This gives the following output, which you could recursively combine into a single DataFrame if you wish (a sketch follows the output).
children data val
0 [{'val': 3, 'data': 4, 'children': [{'val': 4,... 2 1
1 [{'val': 8, 'data': 9, 'children': [{'val': 10... 7 6
children data val
0 [{'val': 4, 'data': 5}] 4 3
1 [{'val': 10, 'data': 11}] 9 8
data val
0 5 4
1 11 10
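For example, a minimal sketch (not part of the original answer) that stacks the per-level frames into one flat DataFrame, assuming the fixed depth of three used above:
import pandas as pd

# Collect every nesting level, drop the still-nested 'children' column,
# and stack the levels into a single flat frame (depth 3 assumed).
frames = [
    json_normalize(data=json, record_path=['children'] * i)
        .drop(columns='children', errors='ignore')
    for i in range(1, 4)
]
flat = pd.concat(frames, ignore_index=True)
print(flat)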
Question: I want to do an operation similar to ARRAY_AGG(STRUCT(table)) in Beam for Python.
Background:
Similar to this thread, I'm running a Beam pipeline in Python. I have two tables, one with IDs and a sum:
ID | total
1  | 10
2  | 15
3  | 5
And one breakdown table where each row is:
table1_id | item_name | item_price
1         | a         | 2
1         | b         | 8
2         | c         | 5
2         | d         | 5
2         | e         | 5
3         | f         | 7
I want the output in BigQuery to look like:
id | total | item.item_name | item.item_price
1  | 10    | a              | 2
   |       | b              | 8
2  | 15    | c              | 5
   |       | d              | 5
   |       | e              | 5
3  | 5     | f              | 7
In BQ this is solvable by doing an ARRAY_AGG(STRUCT(line_items)) grouped by table1_id, which can then be joined onto table1. Is there a smart way to do this in Beam with Python?
(I assume it involves a GroupBy, but I haven't been able to get it working.)
I propose a full implementation of your solution as a unit test:
from typing import List, Dict, Tuple, Any
import apache_beam as beam
import pytest
from apache_beam import Create
from apache_beam.pvalue import AsList
from apache_beam.testing.test_pipeline import TestPipeline
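# Note: test_pipeline, to_item_tuple_with_total and to_item_result are assumed
# to be methods of the same test class, hence the `self` parameter.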
def test_pipeline(self):
with TestPipeline() as p:
ids = [
{
'ID': 1,
'total': 10
},
{
'ID': 2,
'total': 15
},
{
'ID': 3,
'total': 5
}
]
items = [
{
'table1_id': 1,
'item_name': 'a',
'item_price': 2
},
{
'table1_id': 1,
'item_name': 'b',
'item_price': 8
},
{
'table1_id': 2,
'item_name': 'c',
'item_price': 5
},
{
'table1_id': 2,
'item_name': 'd',
'item_price': 5
},
{
'table1_id': 2,
'item_name': 'e',
'item_price': 5
},
{
'table1_id': 3,
'item_name': 'f',
'item_price': 7
}
]
ids_side_inputs = p | 'Side input IDs' >> Create(ids)
result = (p
| 'Input items' >> Create(items)
| beam.GroupBy(lambda i: i['table1_id'])
| beam.Map(self.to_item_tuple_with_total, ids=AsList(ids_side_inputs))
| beam.Map(self.to_item_result)
)
result | "Print outputs" >> beam.Map(print)
def to_item_tuple_with_total(self, item_tuple: Tuple[int, Any], ids: List[Dict]) -> Tuple[Dict, List[Dict]]:
table_id = item_tuple[0]
total = next(id_element for id_element in ids if id_element['ID'] == table_id)['total']
return {'id': table_id, 'total': total}, item_tuple[1]
def to_item_result(self, item_tuple: Tuple[Dict, Any]) -> Dict:
item_key = item_tuple[0]
return {'id': item_key['id'], 'total': item_key['total'], 'item': item_tuple[1]}
The result is:
{
'id': 1,
'total': 10,
'item':
[
{'table1_id': 1, 'item_name': 'a', 'item_price': 2},
{'table1_id': 1, 'item_name': 'b', 'item_price': 8}
]
}
{
'id': 2,
'total': 15,
'item':
[
{'table1_id': 2, 'item_name': 'c', 'item_price': 5},
{'table1_id': 2, 'item_name': 'd', 'item_price': 5},
{'table1_id': 2, 'item_name': 'e', 'item_price': 5}
]
}
{
'id': 3,
'total': 5,
'item':
[
{'table1_id': 3, 'item_name': 'f', 'item_price': 7}
]
}
Some explanations:
I simulated the items input PCollection from BigQuery.
I simulated the ids side input PCollection from BigQuery.
I added a GroupBy on table1_id to the items PCollection.
I added a Map with the IDs side input list to link the total to the items.
The last Map returns a Dict with the expected fields before the result is saved to BigQuery (a sketch of that final step follows).
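If you then want to persist these dicts to BigQuery, a minimal sketch follows; it is not part of the original answer, and the table reference and schema below are illustrative placeholders using a RECORD/REPEATED field for item:
from apache_beam.io.gcp.bigquery import WriteToBigQuery

# Illustrative schema: 'item' is a repeated record matching the dicts above
# (each per-item dict still carries table1_id; drop it or add it to the schema).
schema = {
    'fields': [
        {'name': 'id', 'type': 'INTEGER'},
        {'name': 'total', 'type': 'INTEGER'},
        {'name': 'item', 'type': 'RECORD', 'mode': 'REPEATED', 'fields': [
            {'name': 'item_name', 'type': 'STRING'},
            {'name': 'item_price', 'type': 'INTEGER'},
        ]},
    ]
}

result | 'Write to BigQuery' >> WriteToBigQuery(
    table='my_project:my_dataset.orders_with_items',  # placeholder table reference
    schema=schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)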
I have a pandas dataframe that has information about a user with multiple orders, and within each order there are multiple item purchases. An example of the dataframe format:
user_id | order_num | item_id | item_desc
1       | 1         | 1       | red
1       | 1         | 2       | blue
1       | 1         | 3       | green
I want to convert it to a JSONB object in a column so that I can query it in PostgreSQL.
Currently I am using the following code:
j = (reg_test.groupby(['user_id', 'order_num'], as_index=False)
.apply(lambda x: x[['item_id','item_desc']].to_dict('r'))
.reset_index()
.rename(columns={0:'New-Data'})
.to_json(orient='records'))
This is the result I am getting:
'''
[
{
"New-Data": [
{
"item_id": "1",
"item_desc": "red",
},
{
"item_id": "2",
"item_desc": "blue",
},
{
"item_id": "3",
"item_desc": "green",
}
],
"order_number": "1",
"user_id": "1"
}
]
'''
While that is correct json format, I want the result to look like this:
'''
[
{
"New-Data": [{
"1":
{
"item_id": "1",
"item_desc": "red",
},
"2": {
"item_id": "2",
"item_desc": "blue",
},
"3":
{
"item_id": "3",
"item_desc": "green",
}
}
],
"order_number": "1",
"user_id": "1"
}
]
'''
As an alternative to @rpanai's solution, I moved the processing into vanilla Python:
Convert the dataframe to a dict:
M = df.to_dict("records")
Create the dicts for the items:
items = [
{key: value
for key, value in entry.items()
if key not in ("user_id", "order_num")}
for entry in M
]
item_details = [{str(num + 1): entry}
for num, entry
in enumerate(items)]
print(item_details)
[{'1': {'item_id': 1, 'item_desc': 'red'}},
{'2': {'item_id': 2, 'item_desc': 'blue'}},
{'3': {'item_id': 3, 'item_desc': 'green'}}]
Initialize a dict and add the remaining data:
d = dict()
d['New-Data'] = item_details
d['order_number'] = M[0]['order_num']
d['user_id'] = M[0]['user_id']
wrapper = [d]
print(wrapper)
[{'New-Data': [{'1': {'item_id': 1, 'item_desc': 'red'}},
{'2': {'item_id': 2, 'item_desc': 'blue'}},
{'3': {'item_id': 3, 'item_desc': 'green'}}],
'order_number': 1,
'user_id': 1}]
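The steps above build the wrapper for a single (user_id, order_num) group; here is a sketch (not from the original answer) of the same construction applied once per group, assuming the df from the question may hold several orders:
from itertools import groupby
from operator import itemgetter

# Sort by the grouping keys so itertools.groupby sees each group contiguously.
records = sorted(df.to_dict("records"), key=itemgetter("user_id", "order_num"))
wrapper = []
for (user_id, order_num), group in groupby(records, key=itemgetter("user_id", "order_num")):
    rows = list(group)
    items = [{k: v for k, v in r.items() if k not in ("user_id", "order_num")}
             for r in rows]
    wrapper.append({
        "New-Data": [{str(i + 1): entry} for i, entry in enumerate(items)],
        "order_number": order_num,
        "user_id": user_id,
    })
print(wrapper)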
Have you considered using a custom function?
import pandas as pd
df = pd.DataFrame({'user_id': {0: 1, 1: 1, 2: 1},
'order_num': {0: 1, 1: 1, 2: 1},
'item_id': {0: 1, 1: 2, 2: 3},
'item_desc': {0: 'red', 1: 'blue', 2: 'green'}})
out = df.groupby(['user_id', 'order_num'])[["item_id", "item_desc"]]\
.apply(lambda x: x.to_dict("records"))\
.apply(lambda x: [{str(l["item_id"]):l for l in x}])\
.reset_index(name="New-Data")\
.to_dict("records")
where out returns
[{'user_id': 1,
'order_num': 1,
'New-Data': [{'1': {'item_id': 1, 'item_desc': 'red'},
'2': {'item_id': 2, 'item_desc': 'blue'},
'3': {'item_id': 3, 'item_desc': 'green'}}]}]
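Since the target is a JSONB column in PostgreSQL, the grouped structure can then be serialized to a JSON string before insertion; a small sketch (the table and column names below are illustrative, not from the original answer):
import json

record = out[0]
# Depending on the pandas version, numpy scalar types in the dicts may need
# converting first (e.g. json.dumps(..., default=str)).
payload = json.dumps(record["New-Data"])
# With psycopg2, for example, the string can be bound to a jsonb column:
# cur.execute(
#     "INSERT INTO orders (user_id, order_num, new_data) VALUES (%s, %s, %s::jsonb)",
#     (record["user_id"], record["order_num"], payload),
# )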
I have a dict that may be 'infinitely' nested and contains several pandas DataFrames (all the DataFrames have the same number of rows).
I want to create a new dict for each row in the DataFrames, with the row transformed into a dict (the keys are the column names) and the rest of the dictionary staying the same.
Note: I am not making a cartesian product between the rows of the different DataFrame's.
What would be the best and most Pythonic way to do it?
Example:
the original dict:
d = {'a': 1,
'inner': {
'b': 'string',
'c': pd.DataFrame({'c_col1': range(1,3), 'c_col2': range(2,4)})
},
'd': pd.DataFrame({'d_col1': range(4,6), 'd_col2': range(7,9)})
}
the desired result:
lst_of_dicts = [
{'a': 1,
'inner': {
'b': 'string',
'c': {
'c_col1': 1, 'c_col2':2
}
},
'd': {
'd_col1': 4, 'd_col2': 7
}
},
{'a': 1,
'inner': {
'b': 'string',
'c': {
'c_col1': 2, 'c_col2': 3
}
},
'd': {
'd_col1': 5, 'd_col2': 8
}
}
]
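One possible approach (a sketch, not from the original thread) is a small recursive helper that walks the dict and swaps each DataFrame for the dict form of a single row:
import copy
import pandas as pd

def replace_row(obj, i):
    # Replace every DataFrame found in the nested structure with row i as a dict;
    # everything else is deep-copied unchanged.
    if isinstance(obj, pd.DataFrame):
        return obj.iloc[i].to_dict()
    if isinstance(obj, dict):
        return {k: replace_row(v, i) for k, v in obj.items()}
    return copy.deepcopy(obj)

n_rows = 2  # all DataFrames are assumed to have the same number of rows
lst_of_dicts = [replace_row(d, i) for i in range(n_rows)]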
I've got a pandas dataset with a column that's a comma-separated string, e.g. 1,2,3,10:
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
I'd like to get a count and a mean score for each value in topics. So:
topic_id,count,mean
1,2,5
11,2,8
12,1,6
et cetera. How can I do this?
I've got as far as:
df['topic_ids'] = df.topics.str.split()
But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?
Unnest, then groupby and agg:
import numpy as np

df.topics = df.topics.str.split(',')
New_df = pd.DataFrame({
    'topics': np.concatenate(df.topics.values),
    'id': df.id.repeat(df.topics.apply(len)),
    'score': df.score.repeat(df.topics.apply(len))
})
New_df.groupby('topics').score.agg(['count', 'mean'])
Out[1256]:
count mean
topics
1 2 5.0
11 2 8.0
12 1 6.0
18 2 5.5
22 1 9.0
30 4 6.5
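On pandas 0.25 or newer, DataFrame.explode does the unnesting in one step; a short sketch (assuming topics is still the raw comma-separated string column):
# Split into lists, explode one topic per row, then aggregate per topic.
exploded = df.assign(topics=df['topics'].str.split(',')).explode('topics')
print(exploded.groupby('topics')['score'].agg(['count', 'mean']))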
In [111]: def mean1(x): return np.array(x).astype(int).mean()
In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
mean1 len
0 21.000000 3
1 19.666667 3
2 14.333333 3
3 16.333333 3
This is one way. Reindex & stack, then groupby & agg.
import pandas as pd
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
.stack()\
.reset_index(name='topics')\
.drop(columns='level_2')
df2.groupby('topics').score.agg(['count', 'mean']).reset_index()
How to sort this dictionary by 'votes' in Python?
{
1 : {
'votes' : 2,
'id' : 10
},
2 : {
'votes' : 10,
'id' : 12
},
3 : {
'votes' : 98,
'id' : 14
}
}
To result in:
{
3 : {
'votes' : 98,
'id' : 14
},
2 : {
'votes' : 10,
'id' : 12
},
1 : {
'votes' : 2,
'id' : 10
}
}
You could use an OrderedDict:
>>> from collections import OrderedDict
>>> od = OrderedDict(sorted(d.items(),
key=lambda t: t[1]['votes'],
reverse=True))
>>> od
OrderedDict([(3, {'votes': 98, 'id': 14}),
(2, {'votes': 10, 'id': 12}),
(1, {'votes': 2, 'id': 10})])
where d is your original dictionary.
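On Python 3.7+ plain dicts preserve insertion order, so the same sort works without OrderedDict; a small sketch (not from the original answer):
# Sort the items by votes and rebuild an ordinary dict (order is preserved).
sorted_d = dict(sorted(d.items(), key=lambda t: t[1]['votes'], reverse=True))
print(sorted_d)
# {3: {'votes': 98, 'id': 14}, 2: {'votes': 10, 'id': 12}, 1: {'votes': 2, 'id': 10}}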
Dictionaries are unsorted; if you want to be able to access elements from your dictionary in a specific order, you can use an OrderedDict as in jcollado's answer, or just sort the list of keys by whatever metric you are interested in, for example:
data = {1: {'votes': 2, 'id': 10}, 2: {'votes': 10, 'id': 12}, 3: {'votes': 98, 'id': 14}}
votes_order = sorted(data, key=lambda k: data[k]['votes'], reverse=True)
for key in votes_order:
    print(key, ':', data[key])
Output:
3 : {'votes': 98, 'id': 14}
2 : {'votes': 10, 'id': 12}
1 : {'votes': 2, 'id': 10}
Standard dictionaries (before Python 3.7) do not have a guaranteed order, so sorting them makes no sense; both those dictionaries are exactly the same.
Perhaps what you really want is a list?
aslist = list(originaldict.values())
aslist.sort(key=lambda item: item['votes'], reverse=True)
This will extract the items from your dict as a list and sort the list by votes.
You could also sort the items in the dictionary:
print(sorted(d.items(), key=lambda x: x[1]['votes'], reverse=True))
Like Francis' suggestion, but you keep the original key for every item.