Having difficulty transforming nested JSON to flat JSON using Python

I have the API response below. This is a very small subset which I am pasting here for reference; there can be 80+ columns.
[["name","age","children","city", "info"], ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}],
["Mary", "43", ["Phil", "Marshall", "Emily"], {"mobile":"yes","house_owner":"yes", "own_stocks": "yes"}],
["Drew", "21", [], {"mobile":"yes","house_owner":"no", "investor":"yes"}]]
Initially I thought pandas could help here and searched accordingly, but as a newbie to Python/coding I was not able to get much out of it. Any help or guidance is appreciated.
I am expecting output in a JSON key-value pair format such as below.
{"name":"Mary", "age":"43", "children":["Phil", "Marshall", "Emily"],"info_mobile":"yes","info_house_owner":"yes", "info_own_stocks": "yes"},
{"name":"Drew", "age":"21", "children":[], "info_mobile":"yes","info_house_owner":"no", "info_investor":"yes"}]```

I assume that the first list will always be the headers (column names)?
If that is the case, maybe something like this could work:
import pandas as pd

data = [["name", "age", "children", "info"], ["Ned", 40, ["Arya", "Rob"], {"dead": "yes", "winter is coming": "yes"}]]
headers = data[0]
rows = data[1:]
df = pd.DataFrame(rows, columns=headers)
# orient="records" gives one JSON object per row.
df_json = df.to_json(orient="records")
print(df_json)

Assuming that the first list always represents the keys (["name", "age", ...] etc.) and the subsequent lists represent the actual data/API response, you can construct a dictionary (key-value pairs) like this:
keys = ["name", "age", "children", "info"]
api_response = ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}]
data_dict = {k: v for k, v in zip(keys, api_response)}
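Neither snippet above flattens the nested info dict into the info_* keys shown in the expected output, though. Here is a minimal sketch of that step, assuming the first list is always the header row and any dict value should be expanded with its column name as a prefix:

def flatten_row(keys, row):
    flat = {}
    for key, value in zip(keys, row):
        if isinstance(value, dict):
            # Expand nested dicts into prefixed keys, e.g. info -> info_mobile.
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

data = [["name", "age", "children", "info"],
        ["Mary", "43", ["Phil", "Marshall", "Emily"], {"mobile": "yes", "house_owner": "yes", "own_stocks": "yes"}],
        ["Drew", "21", [], {"mobile": "yes", "house_owner": "no", "investor": "yes"}]]

flattened = [flatten_row(data[0], row) for row in data[1:]]
print(flattened)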

Related

How to convert a dataframe to nested json

I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"})
All the dataframe fields are ASCII strings and it is the output of a SQL query (pd.read_sql_query), so the line creating the dataframe above may not be quite right.
And I wish the final JSON output to be in the form
[{
  "Survey": "001_220816080015",
  "BCD": "001_220816080015.bcd",
  "Sections": [
    "4700A1/305",
    "4700A1/312"
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
  "Survey": "001_220816080015",
  "BCD": "001_220816080015.bcd",
  "Sections": "4700A1/305, 4700A1/312"
}]
Problem might be the structure of the dataframe but how to reformat it to produce the requirement is not clear to me.
The JSON line is:
df.to_json(orient='records', indent=2)
Isn't parsing Sections into a list the only thing you need to do?
import pandas as pd

df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
  {
    "Survey":"001_220816080015",
    "BCD":"001_220816080015.bcd",
    "Sections":[
      "4700A1\/305",
      "4700A1\/312"
    ]
  }
]
The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]
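Note that nested_json here is still a Python list, not a JSON string; if the consuming program needs actual JSON, serialize it with json.dumps:

import json

print(json.dumps(nested_json, indent=2))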

Replacing values in a data frame from a dictionary with multiple keys

I have not seen any posts about this on here. I have a data frame with some values that I would like to replace with the values found in a dictionary. This could simply be done with .replace, but I want to keep this dynamic and reference the df column names using a paired dictionary map.
import pandas as pd

data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])

replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "abc": {"Alphabet": "ABC", "Alphabet end": "XYZ"}}

def replace_dict():  # placeholder to be filled in; note it shadows the dict above
    return

df_map = {
    "letters": [replace_dict],
    "State": [replace_dict]
}

# replace the df values with the replace_dict values
I hope this makes sense, but to explain more: I want to replace the data under the columns 'letters' and 'State' with the values found in replace_dict, referencing the column names via the keys found in df_map. I know this is overcomplicated for this example, but I wanted to provide an easy example to understand.
I need help making the function 'replace_dict' do the operations above.
Expected output is:
data = [['ABC', 'IN']]
df = pd.DataFrame(data, columns=['letters', 'State'])
by creating a function and then running it with something along these lines:
for i in df_map:
    for j in df_map[i]:
        df = j(i, df)
How would I create a function to run these operations? I have not seen anyone try to do this with multiple dictionary keys in the replace_dict.
I'd keep the replace_dict keys the same as the column names.
import pandas as pd

def map_from_dict(data: pd.DataFrame, cols: list, mapping: dict) -> pd.DataFrame:
    # Series.map leaves values missing from the mapping as NaN.
    return pd.DataFrame([data[x].map(mapping.get(x)) for x in cols]).transpose()

df = pd.DataFrame({
    "letters": ["Alphabet"],
    "states": ["Indiana"]
})

replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"}
}

final_df = map_from_dict(df, ["letters", "states"], replace_dict)
print(final_df)

  letters states
0     ABC     IN
import pandas as pd

data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])

dict_ = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "abc": {"Alphabet": "ABC", "Alphabet end": "XYZ"}}

def replace_dict(df, dict_):
    for d in dict_.values():
        for val in d:
            for c in df.columns:
                # .loc avoids chained assignment, which newer pandas warns about.
                df.loc[df[c] == val, c] = d[val]
    return df

df = replace_dict(df, dict_)
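For what it's worth, pandas' built-in DataFrame.replace already accepts a nested dictionary keyed by column name, so if you are free to rename the outer keys of the mapping to match your columns, no custom function is needed. A minimal sketch:

import pandas as pd

df = pd.DataFrame([['Alphabet', 'Indiana']], columns=['letters', 'State'])

# Outer keys renamed to match the DataFrame's column names.
column_keyed = {
    "State": {"Illinois": "IL", "Indiana": "IN"},
    "letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"},
}

print(df.replace(column_keyed))
#   letters State
# 0     ABC    IN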

Create JSON from specific lines of data

I have this test data that I am trying to use to create JSON for just select items. I have the items listed and would like to output a JSON list with only the selected items.
What I have:
import json

# Data to be written
dictionary = {
    "id": "04",
    "name": "sunil",
    "department": "HR"
}

# Serializing json
x = json.dumps(dictionary, indent=0)

# JSON String
y = json.loads(x)

# Goal is to print:
# {
#   "id": "04",
#   "name": "sunil"
# }
If you don't need to keep the department key you can use this:
del y['department']
Then json.dumps(y) will give you what you wanted:
{"id": "04", "name": "sunil"}
Other ways to solve the same issue
for key in list(y.keys()):
    # add all potential keys you want to remain in the final dictionary
    if key == "id" or key == "name":
        continue
    else:
        del y[key]
However, iterating over a dictionary is pretty slow. You could assign values to temporary variables and then remake the dictionary like this:
temp_id = y['id']
temp_name = y['name']
y.clear()
y['id'] = temp_id
y['name'] = temp_name
This should be faster than iterating over a dictionary.
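A simpler variant of the same idea, assuming you know up front which keys to keep, is a dict comprehension:

keep = ("id", "name")
y = {k: y[k] for k in keep if k in y}
print(y)  # {'id': '04', 'name': 'sunil'}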

Convert two CSV tables with one-to-many relation to JSON with embedded list of subdocuments

I have two CSV files which have one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
  {
    "main_id": "1",
    "name": "foobar",
    "attributes": [
      {
        "id": "100",
        "name": "color",
        "value": "red",
        "updated_at": "2020-10-10"
      },
      {
        "id": "101",
        "name": "shape",
        "value": "square",
        "updated_at": "2020-10-10"
      },
      {
        "id": "102",
        "name": "size",
        "value": "small",
        "updated_at": "2020-10-10"
      }
    ]
  }
]
I tried using Python and Pandas like:
import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)

attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"

main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)

main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running it on my whole dataset, I can load the individual CSV files without problems, but when trying to modify the data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, the use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you let pandas do the breakout:
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
Now, if you do the standard groupby:
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(list)
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
The second method: I would change it to use Dask and its parallelization. For background, see the discussion of what Dask is and how it differs from Pandas.
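As a rough illustration of what the Dask route might look like (a sketch, assuming dask[dataframe] is installed, mirroring the question's attributes.csv; not tested at scale):

import dask.dataframe as dd

attributes = dd.read_csv("attributes.csv")

# meta tells Dask the name and dtype of the object the lambda produces per group.
grouped = attributes.groupby("main_id").apply(
    lambda g: g.drop(columns="main_id").to_dict(orient="records"),
    meta=("attributes", "object"),
)
print(grouped.compute())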
The third is algorithmic: load up the main file, then go ID by ID, loading only the data for that ID, balancing what is in memory against what is on disk, and saving out partial results as they become available.
So in my case I was able to load the original tables into memory, but doing the embedding exploded the size so that it did not fit into memory anymore. I ended up still using Pandas to load the CSV files, but then I iteratively generate the output row by row, saving each row into a separate JSON document. This means I do not hold one large data structure in memory for one large JSON.
Another important realization: make the related column an index, and sort it, so that querying it is fast (because generally there are duplicate entries in the related column).
I made the following two helper functions:
def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame and not a Series when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
import json
import sys

import pandas

main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)

# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')

columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")
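If you drop indent=2 so each record lands on a single line, the output is the JSON Lines format, which pandas can stream back in later. A sketch, assuming the records were written to a hypothetical out.jsonl:

import pandas as pd

# Each line of out.jsonl holds one JSON object.
df = pd.read_json("out.jsonl", lines=True)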

Loading JSON data into pandas data frame and creating custom columns

Here is the example JSON I'm working with:
{
  ":#computed_region_amqz_jbr4": "587",
  ":#computed_region_d3gw_znnf": "18",
  ":#computed_region_nmsq_hqvv": "55",
  ":#computed_region_r6rf_p9et": "36",
  ":#computed_region_rayf_jjgk": "295",
  "arrests": "1",
  "county_code": "44",
  "county_code_text": "44",
  "county_name": "Mifflin",
  "fips_county_code": "087",
  "fips_state_code": "42",
  "incident_count": "1",
  "lat_long": {
    "type": "Point",
    "coordinates": [
      -77.620031,
      40.612749
    ]
  }
}
I have been able to pull out the select columns I want, except I'm having trouble with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However, 'lat_long' is added to the data frame as: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought once I figured out how to properly add the coordinates to the data frame, I would then create two separate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I haven't misunderstood your requirements, then you can try this approach with json_normalize. I've added a demo for a single JSON object; you can use apply or a lambda for multiple records.
import pandas as pd

record = {":#computed_region_amqz_jbr4": "587", ":#computed_region_d3gw_znnf": "18", ":#computed_region_nmsq_hqvv": "55", ":#computed_region_r6rf_p9et": "36", ":#computed_region_rayf_jjgk": "295", "arrests": "1", "county_code": "44", "county_code_text": "44", "county_name": "Mifflin", "fips_county_code": "087", "fips_state_code": "42", "incident_count": "1", "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]}}

# pandas.io.json.json_normalize is deprecated; pd.json_normalize is the current spelling.
df = pd.json_normalize(record)
# .copy() avoids SettingWithCopyWarning when adding columns to the selection.
df_modified = df[['county_name', 'incident_count', 'lat_long.type']].copy()
# GeoJSON points are stored as [longitude, latitude].
df_modified['lng'] = df['lat_long.coordinates'][0][0]
df_modified['lat'] = df['lat_long.coordinates'][0][1]
print(df_modified)
Here is how you can do it as well:
df1 = pd.json_normalize(record)
# GeoJSON coordinates are [longitude, latitude], hence the column names.
result = pd.concat(
    [df1, df1['lat_long.coordinates'].apply(pd.Series).rename(columns={0: 'long', 1: 'lat'})],
    axis=1,
).drop(columns=['lat_long.coordinates', 'lat_long.type'])
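Another option, assuming every row has a two-element coordinates list, is to expand the column in one step with tolist() instead of apply(pd.Series), which is usually faster:

import pandas as pd

record = {
    "county_name": "Mifflin",
    "incident_count": "1",
    "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]},
}

df1 = pd.json_normalize(record)
# GeoJSON points are stored as [longitude, latitude].
df1[["long", "lat"]] = pd.DataFrame(df1["lat_long.coordinates"].tolist(), index=df1.index)
df1 = df1.drop(columns=["lat_long.coordinates", "lat_long.type"])
print(df1)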
