Extract objects from nested json with Pandas - python

I have a nested json (like the one reported below) of translated labels, and I want to extract the leaves in separate json files, based on the languages key (it, en, etc).
I don't know at "compile time" the depth and the schema of the json, because there are a lot of files similar to the big nested one, but I know that I always have the following structure: key path/to/en/label and value content.
I tried using Pandas with the json_normalize function to flatten my json, and it works great, but afterwards I had trouble rebuilding the json schema. E.g. with the following json I get a 1x12 DataFrame, but I want a resulting DataFrame with shape 4x3, where 4 is the number of different labels (index) and 3 the number of different languages (columns).
import json
import os
import pathlib

import pandas as pd

def fix_df(df: pd.DataFrame):
    assert df.shape[0] == 1
    columns = df.columns
    columns_last_piece = [s.split("/")[-1] for s in columns]
    fixed_columns = [s.split(".")[1] for s in columns_last_piece]
    index = [".".join(elem.split(".")[2:]) for elem in columns_last_piece]
    return pd.DataFrame(df.values, index=index, columns=fixed_columns)

def main():
    path = pathlib.Path(os.getenv("FIXTURE_FLATTEN_PATH"))
    assert path.exists()
    json_dict = json.load(open(path, encoding="utf-8"))
    flattened_json = pd.json_normalize(json_dict)
    flattened_json_fixed = fix_df(flattened_json)
    # do something with flattened_json_fixed
Example of my_labels.json:
{
    "dynamicData": {
        "bff_scoring": {
            "subCollection": {
                "dynamicData/bff_scoring/translations": {
                    "it": {
                        "area_title.people": "PERSONE",
                        "area_title.planet": "PIANETA",
                        "area_title.prosperity": "PROSPERITÀ",
                        "area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
                    },
                    "en": {
                        "area_title.people": "PEOPLE",
                        "area_title.planet": "PLANET",
                        "area_title.prosperity": "PROSPERITY",
                        "area_title.principle-gov": "PRINCIPLE OF GOVERNANCE"
                    },
                    "fr": {
                        "area_title.people": "PERSONNES",
                        "area_title.planet": "PLANÈTE",
                        "area_title.prosperity": "PROSPERITÉ",
                        "area_title.principle-gov": "PRINCIPES DE GOUVERNANCE"
                    }
                }
            }
        }
    }
}
Example of my_labels_it.json:
{
    "area_title.people": "PERSONE",
    "area_title.planet": "PIANETA",
    "area_title.prosperity": "PROSPERITÀ",
    "area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
}

I finally managed to solve this problem.
First, I need to use the melt function.
>>> df = flattened_json.melt()
>>> df
variable value
0 dynamicData.bff_scoring.subCollection.dynamicD... PERSONE
1 dynamicData.bff_scoring.subCollection.dynamicD... PIANETA
2 dynamicData.bff_scoring.subCollection.dynamicD... PROSPERITÀ
3 dynamicData.bff_scoring.subCollection.dynamicD... PRINCIPI DI GOVERNANCE
...
From here, I can extract the fields I'm interested in with a regular expression. I tried using .str.extractall and explode, but I was greeted with an exception, so I resorted to calling .str.extract twice.
>>> df2 = df.assign(language=df.variable.str.extract(r".*\.([a-z]{2})\.[\w\.-]+$"), label=df.variable.str.extract(r"(?<=\.[a-z]{2}\.)([\w\.-]+)$")).drop(columns="variable")
>>> df2
value language label
0 PERSONE it area_title.people
1 PIANETA it area_title.planet
2 PROSPERITÀ it area_title.prosperity
3 PRINCIPI DI GOVERNANCE it area_title.principle-gov
...
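For what it's worth, both fields can probably be pulled out in a single .str.extract call with two named capture groups (a sketch of mine under the same path assumptions, not tested against every label shape):
extracted = df.variable.str.extract(r".*\.(?P<language>[a-z]{2})\.(?P<label>[\w\.-]+)$")
df2_alt = df.drop(columns="variable").join(extracted)  # columns: value, language, label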
And then, with a pivot, I can get the dataframe with the desired schema.
>>> df3 = df2.pivot(index="label", columns="language", values="value")
>>> df3
language en ... it
label ...
area_title.people PEOPLE ... PERSONE
area_title.planet PLANET ... PIANETA
area_title.principle-gov PRINCIPLE OF GOVERNANCE ... PRINCIPI DI GOVERNANCE
area_title.prosperity PROSPERITY ... PROSPERITÀ
From this dataframe it is very simple to obtain the expected json.
>>> df3["it"].to_json(force_ascii=False)
'{"area_title.people":"PERSONE","area_title.planet":"PIANETA","area_title.principle-gov":"PRINCIPI DI GOVERNANCE","area_title.prosperity":"PROSPERITÀ"}'

Related

Pandas JSON Orient Autodetection

I'm trying to find out if Pandas.read_json performs some level of autodetection. For example, I have the following data:
data_records = [
    {
        "device": "rtr1",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr2",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr3",
        "dc": "London",
        "vendor": "Cisco",
    },
]
data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}
If I do the following:
import pandas as pd
import json
pd.read_json(json.dumps(data_records))
---
device dc vendor
0 rtr1 London Cisco
1 rtr2 London Cisco
2 rtr3 London Cisco
Though I get the output I desired, the data is record-based. Given that the default orient is columns, I would not have thought this would work.
Therefore, is there some level of autodetection going on? With index-based inputs the behaviour seems more in line with the documentation: as shown below, the data appears to be parsed with the columns orient by default.
pd.read_json(json.dumps(data_index))
rtr1 rtr2 rtr3
dc London London London
vendor Cisco Cisco Cisco
pd.read_json(json.dumps(data_index), orient="index")
dc vendor
rtr1 London Cisco
rtr2 London Cisco
rtr3 London Cisco
We can't speak of auto-detection, but rather of a nested hierarchical structure interpreted according to the specified orientation (or the default one).
Moreover, it should be noted that not every data structure can be used with every orientation.
Case 1 : Dataframes to / from a JSON string
read_json : Convert a JSON string to Dataframe with argument typ = 'frame'
to_json : Convert a DataFrame to a JSON string
The orient value must be explicitly specified with the Pandas to_json and read_json functions for the split, index, records, table and values orientations.
It is not necessary to specify the orient value for columns, because the orientation is columns by default.
Case 2 : Series to / from a JSON string
read_json : Convert a JSON string to Series with argument typ = 'series'
to_json : Convert a Series to a JSON string
If typ='series' in read_json, the default value for the orient argument is index (see the Pandas documentation).
When converting a Series into a JSON string using to_json, the default orient value is also index.
With the other allowed orientation values for a Series (split, records, index, table), the orient argument must be specified.
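For illustration, a minimal round trip for the Series case (my own example, not from the question):
import pandas as pd

s = pd.Series({"x": 1, "y": 2})
js = s.to_json()                      # index orient by default: '{"x":1,"y":2}'
s2 = pd.read_json(js, typ="series")   # default orient for typ="series" is also index
print(js, s2.to_dict())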
Resources
There are some example oriented structures in the comment section at this GitHub link (have a look around line 680 in the _json.py file).
Note that there are no examples with orient=columns in the code comments on GitHub.
This is simply because in the absence of an orientation specification, columns is used by default.
Clearer view of a nested hierarchical structure
import pandas as pd
import json
##### BEGINNING : HIERARCHICAL LEVEL #####
# Second Level - Values levels
d21 = {'v1': "value 1", 'v2': "value 3"}
d22 = {'v1': "value 3", 'v2': "value 4"}
# First Level - Rows levels
d1 = {'row1': d21, 'row2': d22}
# 0-Level - Columns Levels
d = {'col1': d1}
##### END : HIERARCHICAL LEVEL #####
print(pd.read_json(json.dumps(d))) # No need for specification : orient is columns by default
# col1
# row1 {'v1': 'value 1', 'v2': 'value 3'}
# row2 {'v1': 'value 3', 'v2': 'value 4'}
Be careful here
Not every data structure can be used with every value of the orient argument; otherwise, an exception such as builtins.AttributeError may be raised (see the GitHub link for the different structures).
pd.read_json(json.dumps(data_records))
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
#### Since orient is columns by default, the previous call is the same as the following
pd.read_json(json.dumps(data_records), orient='columns')
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
pd.read_json(json.dumps(data_records), orient='values')
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
#### The dataframe shape also matters, and an exception may be raised
pd.read_json(json.dumps(data_records), orient='index')
# ...
# builtins.AttributeError: 'list' object has no attribute 'values'
pd.read_json(json.dumps(data_records), orient='table')
# builtins.KeyError: 'schema'
pd.read_json(json.dumps(data_records), orient='split')
# builtins.AttributeError: 'list' object has no attribute 'items'
Is there an autodetection mechanism?
After some experimentation, I can now say the answer is no.
On GitHub, the split data shape is presented as follows:
data = {
    "columns": ["col 1", "col 2"],
    "index": ["row 1", "row 2"],
    "data": [["a", "b"], ["c", "d"]]
}
So let's do an experiment.
We will use read_json to read the data without specifying the orientation and see if the split shape is recognized.
Then we will read the data with the split orientation specified.
If there is an automatic shape recognition, the result should be the same in both cases.
import pandas as pd
import json
data = {
    "columns": ["col 1", "col 2"],
    "index": ["row 1", "row 2"],
    "data": [["a", "b"], ["c", "d"]]
}
json_string = json.dumps(data)
First we print without specifying the orientation:
>>> pd.read_json(json_string)
columns index data
0 col 1 row 1 [a, b]
1 col 2 row 2 [c, d]
and now we print with the split orientation specified:
>>> pd.read_json(json_string, orient='split')
col 1 col 2
row 1 a b
row 2 c d
The dataframes are different: Pandas does not recognize the split shape. There is no automatic detection mechanism.
TL;DR
When using pd.read_json() with orient=None, the representation of the data is automatically determined through pd.DataFrame().
Explanation
The pandas documentation is a bit misleading here. When not specifying orient, the parser for 'columns' is used, which is self.obj = pd.DataFrame(json.loads(json)). So
pd.read_json(json.dumps(data_records))
is equivalent to
pd.DataFrame(json.loads(json.dumps(data_records)))
which again is equivalent to
pd.DataFrame(data_records)
I.e., you pass a list of dicts to the DataFrame constructor, which then performs the automatic determination of the data representation. Note that this does not mean that orient is auto-detected. Instead, simple heuristics (see below) on how the data should be loaded into a DataFrame are applied.
Loading JSON-like data through pd.DataFrame()
For the 3 most relevant cases of JSON-structured data, the DataFrame construction through pd.DataFrame() is:
Dict of lists
In[1]: data = {"a": [1, 2, 3], "b": [9, 8, 7]}
...: pd.DataFrame(data)
Out[1]:
a b
0 1 9
1 2 8
2 3 7
Dict of dicts
In[2]: data = {"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 9, "y": 8, "z": 7}}
...: pd.DataFrame(data)
Out[2]:
a b
x 1 9
y 2 8
z 3 7
List of dicts
In[3]: data = [{'a': 1, 'b': 9}, {'a': 2, 'b': 8}, {'a': 3, 'b': 7}]
...: pd.DataFrame(data)
Out[3]:
a b
0 1 9
1 2 8
2 3 7
No, Pandas does not perform any autodetection when using the read_json function.
It is entirely determined by the orient parameter, which specifies the format of the input json data.
In your first example, you passed the data_records list to the json.dumps function, which converted it to a json string. After passing the resulting json string to pd.read_json, it is seen as a records orientation.
In your second example, you passed data_index to json.dumps, and the result is then seen as a "columns" orientation.
In both cases, the behavior of the read_json function is entirely determined by the value of the orient parameter and not by any automatic detection by Pandas.
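A quick way to check the equivalences discussed above (a sketch of mine, reusing the data_records and data_index variables from the question):
import json
import pandas as pd

# both comparisons should print True, showing the orient-driven equivalences
print(pd.read_json(json.dumps(data_records)).equals(pd.DataFrame(data_records)))
print(pd.read_json(json.dumps(data_index), orient="index").equals(
    pd.DataFrame.from_dict(data_index, orient="index")))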
If you want to understand every detail of a function call, I would suggest using VSCode and setting "justMyCode": false in your launch.json for debugging.
That being said, if you follow what's going on when you call pd.read_json(), you'll find that it instantiates a JsonReader and then reads it, which in turn instantiates a FrameParser whose data is parsed with _parse_no_numpy:
def _parse_no_numpy(self):
    json = self.json
    orient = self.orient

    if orient == "columns":  # default
        self.obj = DataFrame(
            loads(json, precise_float=self.precise_float), dtype=None
        )
    elif orient == "split":
        decoded = {
            str(k): v
            for k, v in loads(json, precise_float=self.precise_float).items()
        }
        self.check_keys_split(decoded)
        self.obj = DataFrame(dtype=None, **decoded)
    elif orient == "index":  # your second case
        self.obj = DataFrame.from_dict(
            loads(json, precise_float=self.precise_float),
            dtype=None,
            orient="index",
        )
    elif orient == "table":
        self.obj = parse_table_schema(json, precise_float=self.precise_float)
    else:
        self.obj = DataFrame(
            loads(json, precise_float=self.precise_float), dtype=None
        )
As you can see, and as stated in a previous answer, in terms of orientation:
pd.read_json(json.dumps(data_records))
is equivalent to
pd.DataFrame(data_records)
and
pd.read_json(json.dumps(data_index), orient='index')
to
pd.DataFrame.from_dict(data_index, orient='index')
So in the end it all boils down to how pd.DataFrame handles the passed list of dicts.
Going down this hole, you'll find that the constructor checks if the data is list-like and then calls nested_data_to_arrays, which in turn calls to_arrays, which finally calls _list_of_dict_to_arrays:
def _list_of_dict_to_arrays(
    data: list[dict],
    columns: Index | None,
) -> tuple[np.ndarray, Index]:
    """
    Convert list of dicts to numpy arrays

    if `columns` is not passed, column names are inferred from the records
    - for OrderedDict and dicts, the column names match
      the key insertion-order from the first record to the last.
    - For other kinds of dict-likes, the keys are lexically sorted.

    Parameters
    ----------
    data : iterable
        collection of records (OrderedDict, dict)
    columns: iterables or None

    Returns
    -------
    content : np.ndarray[object, ndim=2]
    columns : Index
    """
    if columns is None:
        gen = (list(x.keys()) for x in data)
        sort = not any(isinstance(d, dict) for d in data)
        pre_cols = lib.fast_unique_multiple_list_gen(gen, sort=sort)
        columns = ensure_index(pre_cols)

    # assure that they are of the base dict class and not of derived
    # classes
    data = [d if type(d) is dict else dict(d) for d in data]
    content = lib.dicts_to_array(data, list(columns))
    return content, columns
The "autodetection" is actually the hierarchical handling of all possible cases/types.

Converting complex nested json to csv via pandas

I have the following json file
{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}
Note: the above json is a fragment of the original json. The actual file contains a lot more attributes after 'margins', some of them nested and others not. I just included a few for brevity and to give an idea of the expectations.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far -
import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"

with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')  # NameError: 'margin' is not defined in this branch

df = pd.DataFrame.from_dict(t_data, orient='index')
print(df)
I know that the data is getting overwritten and the loop is not properly structured. I am a bit new to dealing with JSON objects in Python and I can't work out how to concatenate the results.
My goal is, once all the results are appended, to use to_csv and convert them into rows. For each margin, the entire record is to be replicated as a separate row. Here is what I am expecting the output to be. Can someone please help me translate this?
From whatever I can find on the net, it is about first gathering the dictionary items, but I am not able to understand how to transpose them into rows. Also, is there a better way to parse the json than looping twice for one attribute, i.e. margins?
I can't use json_normalize as that library is not supported in our environment.
team                 overallResult  totalMatches  margin
Sunrisers Hyderabad  Won            3             290
Sunrisers Hyderabad  Won            3             90
Pune Warriors        None           0
Using the json and csv modules: create a dictionary for each team and for each margin, if there is one.
import json, csv
s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''
j = json.loads(s)
matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k, thing[k]) for k in ('team', 'overallResult', 'totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)
# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')
fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
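To write to an actual file instead of the in-memory stream, the same DictWriter calls apply; a minimal sketch, with 'out.csv' as an assumed output path:
with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)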
Getting multiple item values from a dictionary can be aided by using operator.itemgetter()
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
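For instance, a hedged sketch of how it could replace the dict((k, thing[k]) ...) line in the loop above (the names are my own):
import operator

fields = ('team', 'overallResult', 'totalMatches')
get_fields = operator.itemgetter(*fields)
# builds the same row dict as dict((k, thing[k]) for k in fields)
d = dict(zip(fields, get_fields(thing)))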
You can use pd.DataFrame to create the DataFrame and then explode the margins column:
import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
team overallResult totalMatches margins
0 Sunrisers Hyderabad Won 3 {'bar': 290}
1 Sunrisers Hyderabad Won 3 {'bar': 90}
2 Pune Warriors None 0 None
Then fill the None values in the margins column with a placeholder dictionary and expand it into its own column:
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
bar
0 290
1 90
2 <NA>
Finally, join the result back to the original dataframe and drop the old margins column:
df = df.join(bar).drop(columns='margins')
print(df)
team overallResult totalMatches bar
0 Sunrisers Hyderabad Won 3 290
1 Sunrisers Hyderabad Won 3 90
2 Pune Warriors None 0 <NA>
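From here, writing the CSV the question asks for should be a single call ('matches.csv' is just an example path):
df.to_csv('matches.csv', index=False)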

How to load JSON to Dataframe with key/value pairs as two columns

I have some JSON that I would like to load into a dataframe, while retaining the key/value pair structure.
I've tried this:
body = '''{
    "groupId": "1",
    "categories": [
        {
            "model": "xxx",
            "colour": "Black",
            "width": "100",
            "height": "200"
        }
    ]
}'''
categories = json.loads(body)['categories']
dfQuery = pd.DataFrame(categories)
print(dfQuery)
and that gives me a single-row DataFrame, with one column per key:
  model colour width height
0   xxx  Black    100    200
But I need it pivoted, with the columns aliased, so that it appears like this:
     type  value
0   model    xxx
1  colour  Black
2   width    100
3  height    200
I've tried transposing, but can't work out how to alias the 2 columns:
dfQuery = pd.DataFrame.from_dict(categories).T
Transposing is right, but you also need to reset the index and rename the columns:
dfQuery = dfQuery.T.reset_index().rename({'index': 'type', 0: 'value'}, axis=1)
Output:
>>> dfQuery
type value
0 model xxx
1 colour Black
2 width 100
3 height 200
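Alternatively, melt should get there in one step without transposing, since it accepts the column aliases directly (a sketch against the same categories list):
dfQuery = pd.DataFrame(categories).melt(var_name='type', value_name='value')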

Python dataframe to nested json file

I have a Python dataframe as below:
Emp_No Name Project Task
1 ABC P1 T1
1 ABC P2 T2
2 DEF P3 T3
3 IJH Null Null
I need to convert it to a json file and save it to disk as below.
Json File
{
    "Records": [
        {
            "Emp_No": "1",
            "Project_Details": [
                {
                    "Project": "P1",
                    "Task": "T1"
                },
                {
                    "Project": "P2",
                    "Task": "T2"
                }
            ],
            "Name": "ABC"
        },
        {
            "Emp_No": "2",
            "Project_Details": [
                {
                    "Project": "P3",
                    "Task": "T3"
                }
            ],
            "Name": "DEF"
        },
        {
            "Emp_No": "3",
            "Project_Details": [],
            "Name": "IJH"
        }
    ]
}
I feel like this post is not a doubt per se, but a cheeky attempt to avoid formatting the data, hahaha. But, since I'm trying to get used to the dataframe structure and the different ways of handling it, here you go!
import pandas as pd

asutosh_data = {'Emp_No': ["1", "1", "2", "3"],
                'Name': ["ABC", "ABC", "DEF", "IJH"],
                'Project': ["P1", "P2", "P3", "Null"],
                'Task': ["T1", "T2", "T3", "Null"]}
df = pd.DataFrame(data=asutosh_data)

records = []
dif_emp_no = df['Emp_No'].unique()
for emp_no in dif_emp_no:
    emp_data = df.loc[df['Emp_No'] == emp_no]
    emp_project_details = []
    for index, data in emp_data.iterrows():
        if data["Project"] != "Null":
            emp_project_details.append({"Project": data["Project"], "Task": data["Task"]})
    records.append({"Emp_No": emp_data.iloc[0]["Emp_No"],
                    "Project_Details": emp_project_details,
                    "Name": emp_data.iloc[0]["Name"]})

final_data = {"Records": records}
print(final_data)
If you have any questions about the code above, feel free to ask. I'll also leave below the documentation I've used to solve your problem (you may want to check it out):
unique : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
loc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
iloc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
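Since the question also asks to save the result to disk, one last step could be the following (a sketch; "records.json" is just an example path):
import json

with open("records.json", "w", encoding="utf-8") as f:
    json.dump(final_data, f, indent=4)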

How to convert csv to json with multi-level nesting using pandas

I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not be duplicated, while the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
                 'record_date',
                 'situs_county_id',
                 'document_type__title',
                 'document_file_url',
                 'document_amount',
                 'source_url'], as_index=False)
       .apply(lambda x: x[['source_url']].to_dict('r'))
       .reset_index()
       .rename(columns={0: 'metadata', 1: 'parcels'})
       .to_json(orient='records'))
Here's the output the sample CSV should produce:
{
    "metadata": {
        "source_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
        "document_file_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
    },
    "state_code": "NY",
    "nested_data": {
        "parcels": [
            {
                "apn": "3972-61",
                "situs_county_id": "36005"
            }
        ],
        "participants": [
            {
                "entity": {
                    "name": "5 AIF WILLOW, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantee"
            },
            {
                "entity": {
                    "name": "5 ARCH INCOME FUND 2, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantor"
            }
        ]
    },
    "record_date": "01/31/2019",
    "situs_county_id": "36005",
    "document_source_id": "2019012901225004",
    "document_type__title": "ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json
from pandas.io.json import json_normalize
import csv

li = []
with open('filename.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        li.append(row)
df = json_normalize(li)
Here, we create a list of dictionaries from the csv file and build a dataframe with the json_normalize function.
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id'
    , 'document_type__title', 'source_url', 'document_file_url']

# adjust some column names to map to those in the 'entity' node in the desired JSON
situs_mapping = {
    'street_number_street_name': 'situs_street'
    , 'city_name': 'situs_city'
    , 'unit': 'situs_unit'
    , 'state_code': 'state_code'
    , 'zipcode_full': 'situs_zip'
}

# define columns used for the 'entity' node (Python 3 syntax; Python 2 version below)
entity_cols = ['name', *situs_mapping.values()]
# for Python 2:
# entity_cols = ['name'] + list(situs_mapping.values())

# specify output fields
output_cols = ['metadata', 'state_code', 'nested_data', 'record_date'
    , 'situs_county_id', 'document_source_id', 'document_type__title']

# define a function to get nested_data
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('r')
        , 'participants': d[['entity', 'participation_type']].to_dict('r')
    }

j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('r'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('r'))[output_cols]
       .to_json(orient="records")
)
print(j)
Note: if participants can contain duplicates and you must run drop_duplicates() as we do on parcels, then assign(entity) can be moved into the definition of participants in the get_nested_data() function:
, 'participants': d[['participation_type', *entity_cols]] \
.drop_duplicates() \
.assign(entity=lambda x: x[entity_cols].to_dict('r')) \
.loc[:,['entity', 'participation_type']] \
.to_dict('r')
