Convert CSV to JSON according to the requirements - Python

Given CSV files with the following tables:
marks:
test_id student_id mark
1 1 78
2 1 87
3 1 95
4 1 32
5 1 65
6 1 78
7 1 40
1 2 78
2 2 87
3 2 15
6 2 78
7 2 40
1 3 78
2 3 87
3 3 95
4 3 32
5 3 65
6 3 78
7 3 40
courses:
id name
1 A
2 B
3 C
tests:
id course_id weight
1 1 10
2 1 40
3 1 50
4 2 40
5 2 60
6 3 90
7 3 10
students:
id name
1 A
2 B
3 C
Note: weight is how much of the student's final grade the test is worth. For example, if a test is worth 50, that test counts for 50% of the final grade in that course.
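For example, student 1 in course 1 took tests 1, 2 and 3 (weights 10, 40, 50) with marks 78, 87 and 95, so the course average is 0.10 * 78 + 0.40 * 87 + 0.50 * 95 = 7.8 + 34.8 + 47.5 = 90.1, matching the courseAverage in the expected output below.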
I need to convert them to JSON in this format:
{
  "students": [
    {
      "id": 1,
      "name": "A",
      "totalAverage": 72.03,
      "courses": [
        {
          "id": 1,
          "name": "Biology",
          "teacher": "Mr. D",
          "courseAverage": 90.1
        },
        {
          "id": 3,
          "name": "Math",
          "teacher": "Mrs. C",
          "courseAverage": 74.2
        },
        {
          "id": 2,
          "name": "History",
          "teacher": "Mrs. P",
          "courseAverage": 51.8
        }
      ]
    },
    {
      "id": 2,
      "name": "B",
      "totalAverage": 62.15,
      "courses": [
        {
          "id": 1,
          "name": "Biology",
          "teacher": "Mr. D",
          "courseAverage": 50.1
        },
        {
          "id": 3,
          "name": "Math",
          "teacher": "Mrs. C",
          "courseAverage": 74.2
        }
      ]
    },
    {
      "id": 3,
      "name": "C",
      "totalAverage": 72.03,
      "courses": [
        {
          "id": 1,
          "name": "Biology",
          "teacher": "Mr. D",
          "courseAverage": 90.1
        },
        {
          "id": 2,
          "name": "History",
          "teacher": "Mrs. P",
          "courseAverage": 51.8
        },
        {
          "id": 3,
          "name": "Math",
          "teacher": "Mrs. C",
          "courseAverage": 74.2
        }
      ]
    }
  ]
}
I'm new to this type of problem, so I'm looking for how to pull values from the different tables to calculate courseAverage and totalAverage, and how to arrange the result in JSON accordingly.

Your problem is a pretty typical data processing workload: combine data from various sources, perform some calculations, and output the result in a defined format.
Let's first define the input dataframes:
import pandas as pd
from io import StringIO
marks = pd.read_csv(StringIO('''
test_id student_id mark
1 1 78
2 1 87
3 1 95
4 1 32
5 1 65
6 1 78
7 1 40
1 2 78
2 2 87
3 2 15
6 2 78
7 2 40
1 3 78
2 3 87
3 3 95
4 3 32
5 3 65
6 3 78
7 3 40
'''), sep=r'\s+')
courses = pd.read_csv(StringIO('''
id name
1 A
2 B
3 C
'''), sep=r'\s+')
tests = pd.read_csv(StringIO('''
id course_id weight
1 1 10
2 1 40
3 1 50
4 2 40
5 2 60
6 3 90
7 3 10
'''), sep=r'\s+')
students = pd.read_csv(StringIO('''
id name
1 A
2 B
3 C
'''), sep=r'\s+')
And the processing:
# Combine (aka join or merge) the 4 tables into one
# `id` has different meaning for each table so we will disambiguate
# it by renaming it `student_id`, `course_id`, etc.
combined = (
    students.add_prefix('student_')
    .merge(marks, on='student_id')
    .merge(tests.rename(columns={'id': 'test_id'}), on='test_id')
    .merge(courses.add_prefix('course_'), on='course_id')
)
combined['weighted_mark'] = combined['mark'] * combined['weight'] / 100
# Build a student-course level summary
# The target format also has a `teacher` field, but you didn't provide
# a teacher column in the `courses` dataframe, so it is omitted here
course_summary = (
    combined.groupby(['student_id', 'student_name', 'course_id', 'course_name'], as_index=False)
    .agg(course_average=('weighted_mark', 'sum'))
)
# Assemble each course row into a dictionary
course_summary['course'] = course_summary.apply(lambda row: {
    'id': int(row['course_id']),  # plain int: NumPy int64 is not JSON serializable
    'name': row['course_name'],
    'courseAverage': round(float(row['course_average']), 1)
}, axis=1)
# Build a student-level summary
student_summary = (
    course_summary.groupby(['student_id', 'student_name'], as_index=False)
    .agg(
        # Aggregate all courses taken by each student into a list
        courses=('course', list),
        # Student's average is the mean of all course averages
        total_average=('course_average', 'mean')
    )
)
# Assemble each student row into a dictionary
student_summary['student'] = student_summary.apply(lambda row: {
    'id': int(row['student_id']),  # plain int for JSON serialization
    'name': row['student_name'],
    'totalAverage': round(float(row['total_average']), 2),
    'courses': row['courses']
}, axis=1)
# The final output:
import json
with open('output.json', 'w') as fp:
    json.dump({
        'students': student_summary['student'].to_list()
    }, fp, indent=2)
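As a quick sanity check, you can read the file back and compare a couple of figures by hand; for example, student 1's course averages are 90.1, 51.8 and 74.2, so the totalAverage should be (90.1 + 51.8 + 74.2) / 3 ≈ 72.03:
with open('output.json') as fp:
    result = json.load(fp)

student_1 = result['students'][0]
print(student_1['totalAverage'])                           # 72.03
print([c['courseAverage'] for c in student_1['courses']])  # [90.1, 51.8, 74.2]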

Related

How to normalize uneven JSON structures in pandas?

I am using the Google Maps Distance Matrix API to get several distances from multiple origins. The API response comes in a JSON structured like:
{
  "destination_addresses": [
    "Destination 1",
    "Destination 2",
    "Destination 3"
  ],
  "origin_addresses": [
    "Origin 1",
    "Origin 2"
  ],
  "rows": [
    {
      "elements": [
        {
          "distance": {"text": "8.7 km", "value": 8687},
          "duration": {"text": "19 mins", "value": 1129},
          "status": "OK"
        },
        {
          "distance": {"text": "223 km", "value": 222709},
          "duration": {"text": "2 hours 42 mins", "value": 9704},
          "status": "OK"
        },
        {
          "distance": {"text": "299 km", "value": 299156},
          "duration": {"text": "4 hours 17 mins", "value": 15400},
          "status": "OK"
        }
      ]
    },
    {
      "elements": [
        {
          "distance": {"text": "216 km", "value": 215788},
          "duration": {"text": "2 hours 44 mins", "value": 9851},
          "status": "OK"
        },
        {
          "distance": {"text": "20.3 km", "value": 20285},
          "duration": {"text": "21 mins", "value": 1283},
          "status": "OK"
        },
        {
          "distance": {"text": "210 km", "value": 210299},
          "duration": {"text": "2 hours 45 mins", "value": 9879},
          "status": "OK"
        }
      ]
    }
  ],
  "status": "OK"
}
Note the rows array has the same number of elements as origin_addresses (2), while each elements array has the same number of elements as destination_addresses (3).
Is one able to use the pandas API to normalize everything inside rows while fetching the corresponding data from origin_addresses and destination_addresses?
The output should be:
status distance.text distance.value duration.text duration.value origin_addresses destination_addresses
0 OK 8.7 km 8687 19 mins 1129 Origin 1 Destination 1
1 OK 223 km 222709 2 hours 42 mins 9704 Origin 1 Destination 2
2 OK 299 km 299156 4 hours 17 mins 15400 Origin 1 Destination 3
3 OK 216 km 215788 2 hours 44 mins 9851 Origin 2 Destination 1
4 OK 20.3 km 20285 21 mins 1283 Origin 2 Destination 2
5 OK 210 km 210299 2 hours 45 mins 9879 Origin 2 Destination 3
If pandas does not provide a relatively simple way to do it, how would one accomplish this operation?
If data contains the dictionary from the question, you can try:
df = pd.DataFrame(data["rows"])
df["origin_addresses"] = data["origin_addresses"]
df = df.explode("elements")
df = pd.concat([df.pop("elements").apply(pd.Series), df], axis=1)
df = pd.concat(
    [df.pop("distance").apply(pd.Series).add_prefix("distance."), df], axis=1
)
df = pd.concat(
    [df.pop("duration").apply(pd.Series).add_prefix("duration."), df], axis=1
)
df["destination_addresses"] = data["destination_addresses"] * len(
    data["origin_addresses"]
)
print(df)
Prints:
duration.text duration.value distance.text distance.value status origin_addresses destination_addresses
0 19 mins 1129 8.7 km 8687 OK Origin 1 Destination 1
0 2 hours 42 mins 9704 223 km 222709 OK Origin 1 Destination 2
0 4 hours 17 mins 15400 299 km 299156 OK Origin 1 Destination 3
1 2 hours 44 mins 9851 216 km 215788 OK Origin 2 Destination 1
1 21 mins 1283 20.3 km 20285 OK Origin 2 Destination 2
1 2 hours 45 mins 9879 210 km 210299 OK Origin 2 Destination 3
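For what it's worth, pd.json_normalize can flatten the nested elements lists in a single call; a sketch assuming data holds the dictionary from the question (origins vary slowest, matching the row order above):
import numpy as np
import pandas as pd

# one row per (origin, destination) pair, with distance/duration flattened
df = pd.json_normalize(data["rows"], "elements")
df["origin_addresses"] = np.repeat(
    data["origin_addresses"], len(data["destination_addresses"])
)
df["destination_addresses"] = data["destination_addresses"] * len(
    data["origin_addresses"]
)
print(df)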

How do you explode more than one list column with different lengths

I checked whether any similar questions had been posted, but all of the solutions refer to lists with the same number of items or to a single list column, whereas my dataset contains two list columns of different lengths.
Lets say I have this dataset:
{
    "_id": 43,
    "userId": 5,
    "Ids": ["10", "59", "1165", "1172"],
    "roles": ["5f84d38", "6245d38"]
}
Current Dataframe:
_id userId Ids roles
43 5 [10,59,1165,1172] [5f84d38,6245d38]
How do I explode both lists to give the output below?
Desired Dataframe:
_id userId Ids roles
43 5 10 5f84d38
43 5 59 5f84d38
43 5 1165 5f84d38
43 5 1172 5f84d38
43 5 10 6245d38
43 5 59 6245d38
43 5 1165 6245d38
43 5 1172 6245d38
Try this:
import pandas as pd

d = {
    "_id": 43,
    "userId": 5,
    "Ids": ["10", "59", "1165", "1172"],
    "roles": ["5f84d38", "6245d38"]
}

# DataFrame.append was removed in pandas 2.0; collect plain dicts
# and build the frame in one go instead
rows = []
for role in d['roles']:
    for _id in d['Ids']:
        rows.append({"_id": d["_id"], "userId": d["userId"], "Ids": _id, "roles": role})
df = pd.DataFrame(rows)
print(df)
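Alternatively, with pandas 0.25+ you can put the dict into a one-row frame and chain explode, one call per list column; a sketch with the same d (exploding roles first so each role repeats over all Ids, matching the desired order):
df = (
    pd.DataFrame([d])
    .explode("roles")
    .explode("Ids")
    .reset_index(drop=True)
)
print(df)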

Problems with flattening nested JSON list to Pandas DataFrame, because of unequal data length

I'm currently trying to work with a JSON file with the following format:
response = {
    "leads": [{
        "id": 208827181,
        "campaignId": 2595,
        "contactId": 2919361,
        "contactAttempts": 1,
        "contactAttemptsInvalid": 0,
        "lastModifiedTime": "2017-03-14T13:37:20Z",
        "nextContactTime": "2017-03-15T14:37:20Z",
        "created": "2017-03-14T13:16:42Z",
        "updated": "2017-03-14T13:37:20Z",
        "lastContactedBy": 1271,
        "status": "automaticRedial",
        "active": True,
        "masterData": [
            {"id": 2054, "label": "Firmanavn", "value": "Firma_1"},
            {"id": 2055, "label": "Adresse", "value": "Gadenavn_1"},
            {"id": 2056, "label": "Postnr.", "value": "2000"},
            {"id": 2057, "label": "Bydel", "value": "Frederiksberg"},
            {"id": 2058, "label": "Telefonnummer", "value": "25252525"}
        ]
    }]
}
masterData is a nested list that also varies in length: each row/entry can have different columns assigned to it. I want to keep a specific column or columns for each entry, but with my current indexing, the differing lengths of the nested lists break it.
This is my code:
leads = json_normalize(response['leads'])
df = pd.concat([leads.drop('masterData', 1),
                pd.DataFrame(list(pd.DataFrame(list(leads['masterData']))[4]))
                    .drop(['id', 'label'], 1)
                    .rename(columns={"value": "tlf"})], axis=1)
The desired output is:
active campaignId contactAttempts contactAttemptsInvalid contactId created id lastContactedBy lastModifiedTime nextContactTime resultData status updated tlf
0 True 2595 1 0 2919361 2017-03-14T13:16:42Z 208827181 1271.0 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z [] automaticRedial 2017-03-14T13:37:20Z 37373737
1 True 2595 2 0 2919359 2017-03-14T13:16:42Z 208827179 1271.0 2017-03-14T13:33:30Z 2017-03-15T14:33:30Z [] privateRedial 2017-03-14T13:33:30Z 55555555
2 True 2595 1 0 2919360 2017-03-14T13:16:42Z 208827180 1271.0 2017-03-14T13:36:06Z None [] success 2017-03-14T13:36:06Z 22222222
3 True 2595 1 0 2919362 2017-03-14T13:16:42Z 208827182 1271.0 2017-03-14T13:56:39Z None [] success 2017-03-14T13:56:39Z 34343434
Where "tlf" is the added column from "masterData".
Use json_normalize alone, specifying the meta column names in a list (in pandas 1.0+ it is available as pd.json_normalize):
L = ['active', 'campaignId', 'contactAttempts', 'contactAttemptsInvalid',
     'contactId', 'created', 'id', 'lastContactedBy', 'lastModifiedTime',
     'nextContactTime', 'status', 'updated']
df = json_normalize(response['leads'], 'masterData', L, record_prefix='masterData.')
print(df)
masterData.id masterData.label masterData.value active campaignId \
0 2054 Firmanavn Firma_1 True 2595
1 2055 Adresse Gadenavn_1 True 2595
2 2056 Postnr. 2000 True 2595
3 2057 Bydel Frederiksberg True 2595
4 2058 Telefonnummer 25252525 True 2595
contactAttempts contactAttemptsInvalid contactId created \
0 1 0 2919361 2017-03-14T13:16:42Z
1 1 0 2919361 2017-03-14T13:16:42Z
2 1 0 2919361 2017-03-14T13:16:42Z
3 1 0 2919361 2017-03-14T13:16:42Z
4 1 0 2919361 2017-03-14T13:16:42Z
id lastContactedBy lastModifiedTime nextContactTime \
0 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
1 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
2 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
3 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
4 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
status updated
0 automaticRedial 2017-03-14T13:37:20Z
1 automaticRedial 2017-03-14T13:37:20Z
2 automaticRedial 2017-03-14T13:37:20Z
3 automaticRedial 2017-03-14T13:37:20Z
4 automaticRedial 2017-03-14T13:37:20Z
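If, as in your desired output, you only want the phone number per lead, you can then filter the normalized frame; a sketch assuming the label is always 'Telefonnummer':
tlf = (
    df[df['masterData.label'] == 'Telefonnummer']
    .rename(columns={'masterData.value': 'tlf'})
    .drop(columns=['masterData.id', 'masterData.label'])
)
print(tlf)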

get value of nested lists and dictionaries of a json

I'm trying to get the value of 'description' and the first 'x','y' related to that description from a JSON file, so I used pandas.io.json.json_normalize and followed an example, but I get this error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('description',))
How can I get the 'description' values "Play" and "Game", and the first 'x','y' related to each description, (0, 2) and (1, 2) respectively, from the following JSON, and save the result as a DataFrame?
I edited the code and want to get this as the result:
0 1 2 3
0 Play Game
1
2
3
4
but Game is not at the x,y where it should be.
import pandas as pd
from pandas.io.json import json_normalize
data = [
    {
        "responses": [
            {
                "text": [
                    {
                        "description": "Play",
                        "bounding": {
                            "vertices": [
                                {"x": 0, "y": 2},
                                {"x": 513, "y": -5},
                                {"x": 513, "y": 73},
                                {"x": 438, "y": 73}
                            ]
                        }
                    },
                    {
                        "description": "Game",
                        "bounding": {
                            "vertices": [
                                {"x": 1, "y": 2},
                                {"x": 307, "y": 29},
                                {"x": 307, "y": 55},
                                {"x": 201, "y": 55}
                            ]
                        }
                    }
                ]
            }
        ]
    }
]
# w is columns, h is rows
w, h = 4, 5
Matrix = [[' ' for j in range(w)] for i in range(h)]
for row in data:
    for response in row["responses"]:
        for entry in response["text"]:
            Description = entry["description"]
            x = entry["bounding"]["vertices"][0]["x"]
            y = entry["bounding"]["vertices"][0]["y"]
            Matrix[x][y] = Description
df = pd.DataFrame(Matrix)
print(df)
You need to pass data[0]['responses'][0]['text'] to json_normalize, like this:
df = json_normalize(data[0]['responses'][0]['text'],[['bounding','vertices']], 'description')
which will result in
x y description
0 438 -5 Play
1 513 -5 Play
2 513 73 Play
3 438 73 Play
4 201 29 Game
5 307 29 Game
6 307 55 Game
7 201 55 Game
I hope this is what you are expecting.
EDIT:
df.groupby('description').get_group('Play').iloc[0]
will give you the first item of the 'Play' group:
x 438
y -5
description Play
Name: 0, dtype: object
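If you want the first row of every description at once rather than one group at a time, a one-liner against the same df:
print(df.groupby('description', sort=False).head(1))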

Using Pandas json_normalize on nested Json with arrays

The problem is normalizing a JSON document with a nested array of JSON objects. I have looked at similar questions and tried to use their solutions, to no avail.
This is what my JSON object looks like:
{
    "results": [
        {
            "_id": "25",
            "Product": {
                "Description": "3 YEAR",
                "TypeLevel1": "INTEREST",
                "TypeLevel2": "LONG"
            },
            "Settlement": {},
            "Xref": {
                "SCSP": "96"
            },
            "ProductSMCP": [
                {"SMCP": "01"}
            ]
        },
        {
            "_id": "26",
            "Product": {
                "Description": "10 YEAR",
                "TypeLevel1": "INTEREST",
                "Currency": "USD",
                "Operational": true,
                "TypeLevel2": "LONG"
            },
            "Settlement": {},
            "Xref": {
                "BBT": "CITITYM9",
                "TCK": "ZN"
            },
            "ProductSMCP": [
                {"SMCP": "01"},
                {"SMCP2": "02"}
            ]
        }
    ]
}
Here is my code for normalizing the JSON object:
data = json.load(j)
data = data['results']
print(pd.io.json.json_normalize(data))
The results that I WANT should be like this
id Description TypeLevel1 TypeLevel2 Currency \
25 3 YEAR US INTEREST LONG NAN
26 10 YEAR US INTEREST NAN USD
BBT TCT SMCP SMCP2 SCSP
NAN NAN 521 NAN 01
M9 ZN 01 02 NAN
However, the result I get is this:
Product.Currency Product.Description Product.Operational Product.TypeLevel1 \
0 NaN 3 YEAR NaN INTEREST
1 USD 10 YEAR True INTEREST
Product.TypeLevel2 ProductSMCP Xref.BBT Xref.SCSP \
0 LONG [{'SMCP': '01'}] NaN 96
1 LONG [{'SMCP': '01'}, {'SMCP2': '02'}] CITITYM9 NaN
Xref.TCK _id
0 NaN 25
1 ZN 26
As you can see, the issue is with ProductSMCP: the array is not completely flattened.
Once we get past the first normalization, I'd apply a lambda to finish the job.
import re
import pandas as pd
from cytoolz.dicttoolz import merge

pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop('ProductSMCP', axis=1).join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
)
Product.Currency Product.Description Product.Operational Product.TypeLevel1 Product.TypeLevel2 Xref.BBT Xref.SCSP Xref.TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
Trim Column Names
pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop('ProductSMCP', axis=1).join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
).rename(columns=lambda x: re.sub(r'(Product|Xref)\.', '', x))
Currency Description Operational TypeLevel1 TypeLevel2 BBT SCSP TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
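If you'd rather not add the cytoolz dependency, a plain dict comprehension merges each row's list of dicts just as well; a sketch of the same idea using the modern pd.json_normalize:
import re
import pandas as pd

pd.json_normalize(data).pipe(
    lambda x: x.drop('ProductSMCP', axis=1).join(
        # merge each row's list of single-key dicts into one Series
        x.ProductSMCP.apply(lambda y: pd.Series({k: v for item in y for k, v in item.items()}))
    )
).rename(columns=lambda c: re.sub(r'(Product|Xref)\.', '', c))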
