The problem is normalizing JSON with a nested array of JSON objects. I have looked at similar questions and tried to use their solutions, to no avail.
This is what my JSON object looks like:
{
"results": [
{
"_id": "25",
"Product": {
"Description": "3 YEAR",
"TypeLevel1": "INTEREST",
"TypeLevel2": "LONG"
},
"Settlement": {},
"Xref": {
"SCSP": "96"
},
"ProductSMCP": [
{
"SMCP": "01"
}
]
},
{
"_id": "26",
"Product": {
"Description": "10 YEAR",
"TypeLevel1": "INTEREST",
"Currency": "USD",
"Operational": true,
"TypeLevel2": "LONG"
},
"Settlement": {},
"Xref": {
"BBT": "CITITYM9",
"TCK": "ZN"
},
"ProductSMCP": [
{
"SMCP": "01"
},
{
"SMCP2": "02"
}
]
}
]
}
Here is my code for normalizing the JSON object:
import json
import pandas as pd

data = json.load(j)  # j is an open file handle to the JSON file
data = data['results']
print(pd.io.json.json_normalize(data))
The result that I WANT should look like this:
   id Description TypeLevel1 TypeLevel2 Currency  \
0  25      3 YEAR   INTEREST       LONG      NaN
1  26     10 YEAR   INTEREST       LONG      USD

        BBT  TCK SMCP SMCP2 SCSP
0       NaN  NaN   01   NaN   96
1  CITITYM9   ZN   01    02  NaN
However, the result I get is this:
Product.Currency Product.Description Product.Operational Product.TypeLevel1 \
0 NaN 3 YEAR NaN INTEREST
1 USD 10 YEAR True INTEREST
Product.TypeLevel2 ProductSMCP Xref.BBT Xref.SCSP \
0 LONG [{'SMCP': '01'}] NaN 96
1 LONG [{'SMCP': '01'}, {'SMCP2': '02'}] CITITYM9 NaN
Xref.TCK _id
0 NaN 25
1 ZN 26
As you can see, the issue is with ProductSMCP: the array is not completely flattened.
Once we get past the first normalization, I'd apply a lambda to finish the job.
import pandas as pd
from cytoolz.dicttoolz import merge

pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop('ProductSMCP', axis=1).join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
)
Product.Currency Product.Description Product.Operational Product.TypeLevel1 Product.TypeLevel2 Xref.BBT Xref.SCSP Xref.TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
Trim Column Names
import re

pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop('ProductSMCP', axis=1).join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
).rename(columns=lambda x: re.sub(r'(Product|Xref)\.', '', x))
Currency Description Operational TypeLevel1 TypeLevel2 BBT SCSP TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
Given CSV files with the following tables:
marks:
test_id student_id mark
1 1 78
2 1 87
3 1 95
4 1 32
5 1 65
6 1 78
7 1 40
1 2 78
2 2 87
3 2 15
6 2 78
7 2 40
1 3 78
2 3 87
3 3 95
4 3 32
5 3 65
6 3 78
7 3 40
courses:
id name
1 A
2 B
3 C
tests:
id course_id weight
1 1 10
2 1 40
3 1 50
4 2 40
5 2 60
6 3 90
7 3 10
students:
id name
1 A
2 B
3 C
Note: weight is how much of the student's final grade the test is worth. For example, if a test has a weight of 50, that test is worth 50% of the final grade for the course.
I need to convert them to JSON in this format:
{
"students": [
{
"id": 1,
"name": "A",
"totalAverage": 72.03,
"courses": [
{
"id": 1,
"name": "Biology",
"teacher": "Mr. D",
"courseAverage": 90.1
},
{
"id": 3,
"name": "Math",
"teacher": "Mrs. C",
"courseAverage": 74.2
},
{
"id": 2,
"name": "History",
"teacher": "Mrs. P",
"courseAverage": 51.8
}
]
},
{
"id": 2,
"name": "B",
"totalAverage": 62.15,
"courses": [
{
"id": 1,
"name": "Biology",
"teacher": "Mr. D",
"courseAverage": 50.1
},
{
"id": 3,
"name": "Math",
"teacher": "Mrs. C",
"courseAverage": 74.2
}
]
},
{
"id": 3,
"name": "C",
"totalAverage": 72.03,
"courses": [
{
"id": 1,
"name": "Biology",
"teacher": "Mr. D",
"courseAverage": 90.1
},
{
"id": 2,
"name": "History",
"teacher": "Mrs. P",
"courseAverage": 51.8
},
{
"id": 3,
"name": "Math",
"teacher": "Mrs. C",
"courseAverage": 74.2
}
]
}
]
}
I'm new to this type of problem, so I'm looking for how to pull values from the different tables to calculate courseAverage and totalAverage, and for how to assemble the result into JSON accordingly.
Your problem is a pretty typical data processing workload: combine data from various sources, perform some calculations, and output the result in a defined format.
Let's first define the input dataframes:
import pandas as pd
from io import StringIO
marks = pd.read_csv(StringIO('''
test_id student_id mark
1 1 78
2 1 87
3 1 95
4 1 32
5 1 65
6 1 78
7 1 40
1 2 78
2 2 87
3 2 15
6 2 78
7 2 40
1 3 78
2 3 87
3 3 95
4 3 32
5 3 65
6 3 78
7 3 40
'''), sep=r'\s+')
courses = pd.read_csv(StringIO('''
id name
1 A
2 B
3 C
'''), sep=r'\s+')
tests = pd.read_csv(StringIO('''
id course_id weight
1 1 10
2 1 40
3 1 50
4 2 40
5 2 60
6 3 90
7 3 10
'''), sep=r'\s+')
students = pd.read_csv(StringIO('''
id name
1 A
2 B
3 C
'''), sep=r'\s+')
And the processing:
# Combine (aka join or merge) the 4 tables into one
# `id` has different meaning for each table so we will disambiguate
# it by renaming it `student_id`, `course_id`, etc.
combined = (
students.add_prefix('student_')
.merge(marks, on='student_id')
.merge(tests.rename(columns={'id': 'test_id'}), on='test_id')
.merge(courses.add_prefix('course_'), on='course_id')
)
combined['weighted_mark'] = combined['mark'] * combined['weight'] / 100
# Build a student-course level summary
# You didn't provide a `teacher_name` column in the `courses` dataframe
course_summary = (
combined.groupby(['student_id', 'student_name', 'course_id', 'course_name'], as_index=False)
.agg(course_average=('weighted_mark', 'sum'))
)
# Assemble the summary into a dictionary
course_summary['course'] = course_summary.apply(lambda row: {
'id': row['course_id'],
'name': row['course_name'],
'courseAverage': round(row['course_average'], 1)
}, axis=1)
# Build a student-level summary
student_summary = (
course_summary.groupby(['student_id', 'student_name'], as_index=False)
.agg(
# Aggregate all courses taken by each student into a list
courses=('course', lambda c: list(c)),
# Student's average is the mean of all course averages
total_average=('course_average', 'mean')
)
)
# Assemble the summary into a dictionary
student_summary['student'] = student_summary.apply(lambda row: {
'id': row['student_id'],
'name': row['student_name'],
'totalAverage': round(row['total_average'], 2),
'courses': row['courses']
}, axis=1)
# The final output:
import json
with open('output.json', 'w') as fp:
json.dump({
'students': student_summary['student'].to_list()
}, fp, indent=2)
I have a function that returns a flattened series from a Pandas Series using json_normalize.
def extract(self, mydf, field_name):
    return json_normalize(mydf[field_name])
It returns: TypeError: '>' not supported between instances of 'str' and 'int'
I verified that mydf[field_name] works and that I can get to the Series values, so it is not something related to the DataFrame.
Example of my json (Series row):
'{\'myfield\': \'XXXX\', \'fieldA\': \'ValueA\'}'
I have found related issues about this error, but nothing involving json_normalize.
The full set of values in the Series:
0 = {str} '{\'my_type\': \'Earn\', \'r\': {\'a\': \'275\', \'t\': \'F\'}}'
1 = {str} '{\'my_type\': \'Log\', \'m\': \'first\'}'
2 = {str} '{\'my_type\': \'Earn\', \'r\': {\'a\': \'3\', \'t\': \'Ess\'}}'
3 = {str} '{\'my_type\': \'Earn\', \'r\': {\'a\': \'20\', \'t\': \'E\'}}'
4 = {str} '{\'my_type\': \'Match\', \'d\': \'5\', \'t\': \'p\', \'name\': \'3-3\'}'
I'm assuming your JSON looks like the one below:
[
{
"my_type": "Earn",
"r": {
"a": "275",
"t": "F"
}
},
{
"my_type": "Log",
"m": "first"
},
{
"my_type": "Earn",
"r": {
"a": "3",
"t": "Ess"
}
},
{
"my_type": "Earn",
"r": {
"a": "20",
"t": "E"
}
},
{
"my_type": "Match",
"d": "5",
"t": "p",
"name": "3-3"
}
]
Python code to convert it to DataFrame:
import pandas as pd
import json
with open('1.json', 'r') as f:
data = json.load(f)
df = pd.json_normalize(data)
print(df)
my_type r.a r.t m d t name
0 Earn 275 F NaN NaN NaN NaN
1 Log NaN NaN first NaN NaN NaN
2 Earn 3 Ess NaN NaN NaN NaN
3 Earn 20 E NaN NaN NaN NaN
4 Match NaN NaN NaN 5 p 3-3
I want to implement machine learning on a somewhat complex dataset. I want to work with pandas and then use some of the built-in models in scikit-learn.
The data is given in a JSON file; a sample looks like the one below:
{
"demo_Profile": {
"sex": "male",
"age": 98,
"height": 160,
"weight": 139,
"bmi": 5,
"someinfo1": [
"some_more_info1"
],
"someinfo2": [
"some_more_inf2"
],
"someinfo3": [
"some_more_info3"
]
},
"event": {
"info_personal": {
"info1": 219.59,
"info2": 129.18,
"info3": 41.15,
"info4": 94.19,
},
"symptoms": [
{
"name": "name1",
"socrates": {
"associations": [
"associations1"
],
"onsetType": "onsetType1",
"timeCourse": "timeCourse1"
}
},
{
"name": "name2",
"socrates": {
"timeCourse": "timeCourse2"
}
},
{
"name": "name3",
"socrates": {
"onsetType": "onsetType2"
}
},
{
"name": "name4",
"socrates": {
"onsetType": "onsetType3"
}
},
{
"name": "name5",
"socrates": {
"associations": [
"associations2"
]
}
}
],
"labs": [
{
"name": "name1 ",
"value": "valuelab"
}
]
}
}
I want to create a pandas DataFrame that handles this kind of "nested data", but I don't know how to build one that takes "nested parameters" into account in addition to "single parameters".
For example, I don't know how to merge "demo_Profile", which contains single parameters, with symptoms, which is a list of dictionaries holding single values in some cases and lists in others.
Does anybody know a way to deal with this issue?
EDIT*********
The JSON shown above is just one example; in other cases the number of values in the lists will be different, as will the number of symptoms. So the example shown above is not fixed for every case.
Consider pandas' json_normalize. However, because there are even deeper nests, consider processing the data in pieces separately, then concatenating the pieces together with a forward fill on the "normalized" columns.
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('myfile.json', 'r') as f:
data = json.loads(f.read())
final_df = pd.concat([json_normalize(data['demo_Profile']),
json_normalize(data['event']['symptoms']),
json_normalize(data['event']['info_personal']),
json_normalize(data['event']['labs'])], axis=1)
# FLATTEN NESTED LISTS
n_list = ['someinfo1', 'someinfo2', 'someinfo3', 'socrates.associations']
final_df[n_list] = final_df[n_list].apply(
    lambda col: col.apply(lambda x: x if pd.isnull(x) else x[0])
)
# FILLING FORWARD
norm_list = ['age', 'bmi', 'height', 'weight', 'sex', 'someinfo1', 'someinfo2', 'someinfo3',
'info1', 'info2', 'info3', 'info4', 'name', 'value']
final_df[norm_list] = final_df[norm_list].ffill()
Output
print(final_df)
# age bmi height sex someinfo1 someinfo2 someinfo3 weight name socrates.associations socrates.onsetType socrates.timeCourse info1 info2 info3 info4 name value
# 0 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name1 associations1 onsetType1 timeCourse1 219.59 129.18 41.15 94.19 name1 valuelab
# 1 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name2 NaN NaN timeCourse2 219.59 129.18 41.15 94.19 name1 valuelab
# 2 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name3 NaN onsetType2 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 3 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name4 NaN onsetType3 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 4 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name5 associations2 NaN NaN 219.59 129.18 41.15 94.19 name1 valuelab
A quick and easy way to flatten your JSON data is to use the flatten_json package, which can be installed via pip:
pip install flatten_json
I expect that you have a list of many entries which look like the one you have provided. Therefore the following code will give you the desired result:
import pandas as pd
from flatten_json import flatten
json_data = [{...patient1...}, {patient2...}, ...]
flattened = (flatten(entry) for entry in json_data)
df = pd.DataFrame(flattened)
In the flattened data, the list entries get suffixed with numbers (I added another patient with an additional entry in the "labs" list):
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| index demo_Profile_age demo_Profile_bmi demo_Profile_height demo_Profile_sex demo_Profile_someinfo1_0 demo_Profile_someinfo2_0 demo_Profile_someinfo3_0 demo_Profile_weight event_info_personal_info1 event_info_personal_info2 event_info_personal_info3 event_info_personal_info4 event_labs_0_name event_labs_0_value event_labs_1_name event_labs_1_value event_symptoms_0_name event_symptoms_0_socrates_associations_0 event_symptoms_0_socrates_onsetType event_symptoms_0_socrates_timeCourse event_symptoms_1_name event_symptoms_1_socrates_timeCourse event_symptoms_2_name event_symptoms_2_socrates_onsetType event_symptoms_3_name event_symptoms_3_socrates_onsetType event_symptoms_4_name event_symptoms_4_socrates_associations_0 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0 98 5 160 male some_more_info1 some_more_inf2 some_more_info3 139 219.59 129.18 41.15 94.19 name1 valuelab NaN NaN name1 associations1 onsetType1 timeCourse1 name2 timeCourse2 name3 onsetType2 name4 onsetType3 name5 associations2 |
| 1 98 5 160 male some_more_info1 some_more_inf2 some_more_info3 139 219.59 129.18 41.15 94.19 name1 valuelab name2 valuelabr2 name1 associations1 onsetType1 timeCourse1 name2 timeCourse2 name3 onsetType2 name4 onsetType3 name5 associations2 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The flatten method contains additional parameters to remove unwanted columns or prefixes.
Note: While this method gives you a flattened DataFrame as desired, I expect that you will run into other problems when feeding the dataset into a machine learning algorithm, depending on what will be your prediction target and how you want to encode the data as features.
I am trying to iterate over a list of unique column values to create three different keys, each mapping to a list of dictionaries, inside one dictionary. This is the code I have now:
import pandas as pd
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for small amounts of data, but as the amount of data increases it can take almost one second (!!!!).
I've tried groupby in pandas, which is even slower, and also map, but I don't want to collect the results into a list. How can I build what I want in a faster way? I am using Python 3.6.
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
"INDEX": [
{
"DATETIME": 1519759710000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
},
{
"DATETIME": 1519759770000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
}],
"SCORE": [
{
"DATETIME": 1519759710000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
}],
"VALUE": [
{
"DATETIME": 1519759710000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
}]
}
One possible solution:
from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)
The same idea can be compressed into a dict comprehension, though it abuses the comprehension for its append side effect:
a = defaultdict(list)
_ = {x['METRIC']: a[x['METRIC']].append(x) for x in frontendFrame.to_dict('records')}
a = dict(a)
Slow:
dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()
I have a list of dictionaries which looks like this:
L=[
{
"timeline": "2014-10",
"total_prescriptions": 17
},
{
"timeline": "2014-11",
"total_prescriptions": 14
},
{
"timeline": "2014-12",
"total_prescriptions": 8
},
{
"timeline": "2015-1",
"total_prescriptions": 4
},
{
"timeline": "2015-3",
"total_prescriptions": 10
},
{
"timeline": "2015-4",
"total_prescriptions": 3
}
]
This is basically the result of a SQL query which, given a start date and an end date, gives the count of total prescriptions for each month from the start date through the end month. However, for months where the prescription count is 0 (Feb 2015), it skips that month completely. Is it possible, using pandas or numpy, to alter this list so that it adds an entry for the missing month, with 0 as the total prescriptions, as follows:
[
{
"timeline": "2014-10",
"total_prescriptions": 17
},
{
"timeline": "2014-11",
"total_prescriptions": 14
},
{
"timeline": "2014-12",
"total_prescriptions": 8
{
"timeline": "2015-1",
"total_prescriptions": 4
},
{
"timeline": "2015-2", # 2015-2 to be inserted for missing month
"total_prescriptions": 0 # 0 to be inserted for total prescription
},
{
"timeline": "2015-3",
"total_prescriptions": 10
},
{
"timeline": "2015-4",
"total_prescriptions": 3
}
]
What you are talking about is called "resampling" in pandas. First convert your timeline to a datetime and set it as your index:
import pandas as pd

df = pd.DataFrame(L)
df.index = pd.to_datetime(df.timeline, format='%Y-%m')
df
timeline total_prescriptions
timeline
2014-10-01 2014-10 17
2014-11-01 2014-11 14
2014-12-01 2014-12 8
2015-01-01 2015-1 4
2015-03-01 2015-3 10
2015-04-01 2015-4 3
Then you can add in your missing months with resample('MS') ('MS' stands for "month start"). Aggregating the numeric column leaves NaN for the missing month:
df = df[['total_prescriptions']].resample('MS').mean()
df
total_prescriptions
timeline
2014-10-01 17
2014-11-01 14
2014-12-01 8
2015-01-01 4
2015-02-01 NaN
2015-03-01 10
2015-04-01 3
To convert back to your original format, fill the gap with fillna(0), convert the datetime index back to strings, and then export using to_dict('records'):
df['total_prescriptions'] = df['total_prescriptions'].fillna(0)
df['timeline'] = df.index.astype(str)
df.to_dict('records')
[{'timeline': '2014-10-01', 'total_prescriptions': 17.0},
{'timeline': '2014-11-01', 'total_prescriptions': 14.0},
{'timeline': '2014-12-01', 'total_prescriptions': 8.0},
{'timeline': '2015-01-01', 'total_prescriptions': 4.0},
{'timeline': '2015-02-01', 'total_prescriptions': 0.0},
{'timeline': '2015-03-01', 'total_prescriptions': 10.0},
{'timeline': '2015-04-01', 'total_prescriptions': 3.0}]