I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.
Here's the JSON file I am reading:
{
"data": [{
"name": "george",
"age": 16,
"exams": [{
"subject": "geometry",
"score": 56
},
{
"subject": "poetry",
"score": 88
}
]
}, {
"name": "nora",
"age": 7,
"exams": [{
"subject": "geometry",
"score": 87
},
{
"subject": "poetry",
"score": 94
}
]
}]
}
Here is the resulting DataFrame I would like:

  name  age exam_subject  exam_score
george   16     geometry          56
george   16       poetry          88
  nora    7     geometry          87
  nora    7       poetry          94
Here's how I'd accomplish this with pandas:
df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")
pd.concat([exploded[["name", "age"]].reset_index(drop=True), pd.json_normalize(exploded["exams"])], axis=1)
Dask doesn't have json_normalize, so what's the best way to accomplish this task?
If the file contains JSON Lines, the most scalable approach is to use dask.bag and map the pandas snippet across each bag partition.
If the file is one large JSON document, the opening/closing brackets will cause problems, so an additional function is needed to strip them before parsing the text as JSON.
Rough pseudo-code:
import json
import dask.bag as db

bag = db.read_text("students3.json")

# if the file is JSON Lines
option1 = bag.map(json.loads).map(pandas_fn)

# if the file is a single JSON document
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)
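To make the JSON Lines branch concrete, here is a minimal sketch of the per-record work each mapped function would do. The two sample lines and the `record_to_frame` name are illustrative; with dask.bag you would pass `record_to_frame` to `bag.map` after `json.loads`:

```python
import json
import pandas as pd

def record_to_frame(record: dict) -> pd.DataFrame:
    """Flatten one student record into one row per exam."""
    return pd.json_normalize(record, record_path="exams", meta=["name", "age"])

# hypothetical JSON Lines content: one student object per line
lines = [
    '{"name": "george", "age": 16, "exams": [{"subject": "geometry", "score": 56}, {"subject": "poetry", "score": 88}]}',
    '{"name": "nora", "age": 7, "exams": [{"subject": "geometry", "score": 87}, {"subject": "poetry", "score": 94}]}',
]

frames = [record_to_frame(json.loads(line)) for line in lines]
df = pd.concat(frames, ignore_index=True)
```

Each partition of the bag would produce one such small DataFrame, which dask then stitches together.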
Use pd.json_normalize
import json
import pandas as pd
with open('students3.json', 'r', encoding='utf-8') as f:
data = json.loads(f.read())
df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])
subject score name age
0 geometry 56 george 16
1 poetry 88 george 16
2 geometry 87 nora 7
3 poetry 94 nora 7
Pydantic offers excellent JSON validation and ingestion. Several Pydantic models (one for each top-level JSON entry) can be converted to Python dictionaries in a loop, producing a List[Dict] that can be passed to the DataFrame constructor.
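A minimal sketch of that idea, assuming Pydantic-style BaseModel classes (the `Exam`/`Student` model names are illustrative, and attribute access is used so the snippet works with both Pydantic v1 and v2):

```python
from typing import List

import pandas as pd
from pydantic import BaseModel

class Exam(BaseModel):
    subject: str
    score: int

class Student(BaseModel):
    name: str
    age: int
    exams: List[Exam]

raw = {"data": [
    {"name": "george", "age": 16,
     "exams": [{"subject": "geometry", "score": 56},
               {"subject": "poetry", "score": 88}]},
]}

# validate each top-level entry into a model
students = [Student(**entry) for entry in raw["data"]]

# one dict per (student, exam) pair -> List[Dict] -> DataFrame
rows = [
    {"name": s.name, "age": s.age, "subject": e.subject, "score": e.score}
    for s in students
    for e in s.exams
]
df = pd.DataFrame(rows)
```

Validation errors (wrong types, missing keys) surface at model construction time, before anything reaches the DataFrame.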
I was inspired by the other answers to come up with this solution.
import dask.dataframe as dd
import pandas as pd

ddf = dd.read_json("students3.json", orient="split")

def pandas_fn(df):
    # explode one row per exam, then flatten each exam dict into columns
    exploded = df.explode("exams")
    return pd.concat(
        [
            exploded[["name", "age"]].reset_index(drop=True),
            pd.json_normalize(exploded["exams"]),
        ],
        axis=1,
    )

res = ddf.map_partitions(
    pandas_fn,
    meta=[
        ("name", "object"),
        ("age", "int64"),
        ("subject", "object"),
        ("score", "int64"),
    ],
)
print(res.compute()) gives this output:
name age subject score
0 george 16 geometry 56
1 george 16 poetry 88
2 nora 7 geometry 87
3 nora 7 poetry 94
Related
I have a dataframe with these values:
df
name rank subject marks age
tom 123 math 25 10
mark 124 math 50 10
How can I insert the dataframe into MongoDB using pymongo, with the first two columns as regular fields and the other three nested inside a "scores" array, like this:
{
    "_id": "507f1f77bcf86cd799439011",
    "name": "tom",
    "rank": "123",
    "scores": [{
        "subject": "math",
        "marks": 25,
        "age": 10
    }]
}
{
    "_id": "507f1f77bcf86cd799439012",
    "name": "mark",
    "rank": "124",
    "scores": [{
        "subject": "math",
        "marks": 50,
        "age": 10
    }]
}
I tried this:
convert_dict = df.to_dict("records")
mydb.school_data.insert_many(convert_dict)
I use this solution
convert_dict = df.to_dict(orient="records")
mydb.school_data.insert_many(convert_dict)
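Note that to_dict("records") produces flat documents, not the nested shape shown in the question. A sketch of regrouping each row into a "scores" array before the insert (the pymongo call is left commented so the reshaping runs on its own; column names are taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["tom", "mark"],
    "rank": [123, 124],
    "subject": ["math", "math"],
    "marks": [25, 50],
    "age": [10, 10],
})

# move subject/marks/age into a one-element "scores" array per row
docs = [
    {
        "name": row["name"],
        "rank": row["rank"],
        "scores": [{"subject": row["subject"],
                    "marks": row["marks"],
                    "age": row["age"]}],
    }
    for row in df.to_dict("records")
]

# with pymongo: mydb.school_data.insert_many(docs)
```

MongoDB then generates the `_id` values on insert, as in the desired documents.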
I am working on a large dataset where I want to replace the values of one column based on the values of another column. I have tried different combinations but am not satisfied; is there a simple one-liner?
Sample code with my current solution:
import pandas as pd
people = pd.DataFrame(
{
"name": ["Ram", "Sham", "Ghanu", "Dhanu", "Jeetu"],
"age": [25, 30, 25, 31, 31],
"loc": ['Vashi', 'Nerul', 'Airoli', 'Panvel', 'CBD'],
},)
print(people)
areacode = pd.DataFrame(
{
"loc": ['Vashi', 'Nerul', 'CBD', 'Panvel'],
"pin": [400703, 400706, 421504, 410206],
},)
print()
print(areacode)
people = pd.merge(people, areacode, how='left', on='loc').drop(columns='loc').fillna('')
people.rename(columns={'pin':'loc'}, inplace=True)
print(people)
output of people Dataframe before change:
name age loc
0 Ram 25 Vashi
1 Sham 30 Nerul
2 Ghanu 25 Airoli
3 Dhanu 31 Panvel
4 Jeetu 31 CBD
output of areacode Dataframe:
loc pin
0 Vashi 400703
1 Nerul 400706
2 CBD 421504
3 Panvel 410206
output of people Dataframe after change:
name age loc
0 Ram 25 400703.0
1 Sham 30 400706.0
2 Ghanu 25
3 Dhanu 31 410206.0
4 Jeetu 31 421504.0
I don't like this approach because (1) it's long and (2) I get floats in the loc column where I need ints. Please help.
people = pd.DataFrame(
{
"name": ["Ram", "Sham", "Ghanu", "Dhanu", "Jeetu"],
"age": [25, 30, 25, 31, 31],
"loc": ['Vashi', 'Nerul', 'Airoli', 'Panvel', 'CBD'],
},)
print(people)
areacode = pd.DataFrame(
{
"loc": ['Vashi', 'Nerul', 'CBD', 'Panvel'],
"pin": [400703, 400706, 421504, 410206],
},)
print()
print(areacode)
d = dict(zip(areacode["loc"], areacode["pin"]))
people["loc"] = people["loc"].apply(lambda x: int(d[x]) if x in d else "")
print(people)
I see no issue with your approach; just cast loc to an integer.
An alternative would be map, but I suspect it would be slower, and you would still cast loc to an integer anyway:
people = people.assign(loc=people['loc'].map(dict(zip(areacode['loc'], areacode['pin']))).fillna('0').astype(int))
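If the unmatched rows may stay as missing values rather than empty strings, pandas' nullable Int64 dtype avoids the float upcast entirely; a sketch of the map-based variant under that assumption:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ram", "Sham", "Ghanu", "Dhanu", "Jeetu"],
    "age": [25, 30, 25, 31, 31],
    "loc": ["Vashi", "Nerul", "Airoli", "Panvel", "CBD"],
})
areacode = pd.DataFrame({
    "loc": ["Vashi", "Nerul", "CBD", "Panvel"],
    "pin": [400703, 400706, 421504, 410206],
})

# map loc -> pin; unmatched rows become <NA> instead of forcing float64
people["loc"] = (
    people["loc"]
    .map(dict(zip(areacode["loc"], areacode["pin"])))
    .astype("Int64")
)
```

The pins stay integers, and the row with no match ("Airoli") holds pd.NA.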
Hi, how are things? I have a dataframe that looks like a recursive table, and my idea is to transform it into a nested ("matryoshka"-style) JSON. I'm using Python.
my example:
Dataframe:

id  name        relations
1   config      0
2   buttons     1
3   accept      2
4   delete      2
5   descripton  1
6   title       1
7   juan        0
and the JSON that I want is:
[{
    "id": "1",
    "name": "config",
    "relations": [{
        "id": "2",
        "name": "buttons",
        "relations": [{
            "id": "3",
            "name": "accept"
        },
        {
            "id": "4",
            "name": "delete"
        }]
    },
    {
        "id": "5",
        "name": "descripton",
        "relations": []
    },
    {
        "id": "6",
        "name": "title",
        "relations": []
    }]
},
{
    "id": "7",
    "name": "juan",
    "relations": []
}]
As you can see, the "relations" column refers to each row's parent id (0 meaning no parent).
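A recursive builder along these lines could produce that shape. This is a sketch; it assumes relations holds the parent id with 0 for roots, and it emits an empty relations list on leaf nodes (the question's sample JSON omits the key for some leaves):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["config", "buttons", "accept", "delete", "descripton", "title", "juan"],
    "relations": [0, 1, 2, 2, 1, 1, 0],
})

def build_tree(df: pd.DataFrame, parent: int = 0) -> list:
    """Recursively nest rows whose 'relations' column points at `parent`."""
    nodes = []
    for row in df[df["relations"] == parent].to_dict("records"):
        nodes.append({
            "id": str(row["id"]),
            "name": row["name"],
            "relations": build_tree(df, parent=row["id"]),
        })
    return nodes

tree = build_tree(df)
```

`json.dumps(tree, indent=4)` would then serialize the nested structure.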
I have come across this issue, and when I checked for similar questions, all of the solutions referred to lists with the same number of items or to a single list column, whereas my dataset contains two list columns of different lengths.
Let's say I have this dataset:
{
"_id" : 43,
"userId" : 5,
"Ids" : [
"10",
"59",
"1165",
"1172"
],
"roles" : [
"5f84d38", "6245d38"
]
}
Current Dataframe:
_id userId Ids roles
43 5 [10,59,1165,1172] [5f84d38,6245d38]
How do I explode both lists so that it will give this output below.
Desired Dataframe:
_id userId Ids roles
43 5 10 5f84d38
43 5 59 5f84d38
43 5 1165 5f84d38
43 5 1172 5f84d38
43 5 10 6245d38
43 5 59 6245d38
43 5 1165 6245d38
43 5 1172 6245d38
Try this:
import pandas as pd
d = {
"_id" : 43,
"userId" : 5,
"Ids" : [
"10",
"59",
"1165",
"1172"
],
"roles" : [
"5f84d38", "6245d38"
]
}
# DataFrame.append was removed in pandas 2.0, so collect rows first
rows = []
for role in d['roles']:
    for _id in d['Ids']:
        rows.append({"_id": d["_id"], "userId": d["userId"], "Ids": _id, "roles": role})
df = pd.DataFrame(rows)
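With pandas >= 0.25 the same result can also be reached by wrapping the dict in a one-row frame and calling explode twice; note the row order differs from the loop above (roles vary fastest here):

```python
import pandas as pd

d = {
    "_id": 43,
    "userId": 5,
    "Ids": ["10", "59", "1165", "1172"],
    "roles": ["5f84d38", "6245d38"],
}

# one row whose Ids/roles cells hold lists, then explode each list column
df = (
    pd.DataFrame([d])
    .explode("Ids")
    .explode("roles")
    .reset_index(drop=True)
)
```

Each explode multiplies the rows by the length of that list column, giving the full 4 x 2 = 8-row cross product.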
This question already has answers here:
Pandas dataframe to json without index
(4 answers)
Closed 5 years ago.
I created a table like below using pandas pivot table.
print(pd_pivot_table)
category_id name
3 name3 0.329204
24 name24 0.323727
31 name31 0.319526
19 name19 0.008992
23 name23 0.005897
I want to create the JSON below from this pivot table, but I do not know how.
[
{
"category_id": 3,
"name": "name3",
"score": 0.329204
},
{
"category_id": 24,
"name": "name24",
"score": 0.323727
},
{
"category_id": 31,
"name": "name31",
"score": 0.319526
},
{
"category_id": 19,
"name": "name19",
"score": 0.008992
},
{
"category_id": 23,
"name": "name23",
"score": 0.005897
}
]
Also, I do not know how to get the category_id and name values in the first place.
Even with the code below you cannot get the desired result:
for data in pd_pivot_table:
print(data) # 0.329204
print(data["category_id"]) # *** IndexError: invalid index to scalar variable.
You can use Series.reset_index to convert the Series into a DataFrame first, and then call DataFrame.to_json:
print (df)
category_id name
3 name3 0.329204
24 name24 0.323727
31 name31 0.319526
19 name19 0.008992
23 name23 0.005897
Name: score, dtype: float64
print (type(df))
<class 'pandas.core.series.Series'>
json = df.reset_index().to_json(orient='records')
print (json)
[{"category_id":3,"name":"name3","score":0.329204},
{"category_id":24,"name":"name24","score":0.323727},
{"category_id":31,"name":"name31","score":0.319526},
{"category_id":19,"name":"name19","score":0.008992},
{"category_id":23,"name":"name23","score":0.005897}]
If need output to file:
df.reset_index().to_json('file.json', orient='records')
Details:
print (df.reset_index())
category_id name score
0 3 name3 0.329204
1 24 name24 0.323727
2 31 name31 0.319526
3 19 name19 0.008992
4 23 name23 0.005897
print (type(df.reset_index()))
<class 'pandas.core.frame.DataFrame'>