I'm trying to parse JSON I've received from an API into a pandas DataFrame. The JSON is hierarchical: in this example I have a city code, a line name and a list of stations for that line. Unfortunately I can't "unpack" it. I would be grateful for help and an explanation.
Json:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская', <------Line name
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино', <------Station 1
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево', <------Station 2
'order': 1},
etc.
I'm trying to take everything from the lowest level and then add all the higher-level information (starting from the line name):
c = r.content
j = simplejson.loads(c)
tmp=[]
i=0
data1=pd.DataFrame(tmp)
data2=pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
data2 = data2.append(pd.DataFrame(station['stations'], station['name']),ignore_index=True)
data2
Once more - the questions are:
How to make it work?
Is this solution optimal, or are there functions I should know about?
Update:
The Json parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва
In addition to MaxU's answer, I think you still need the highest level id, this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')
Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская',
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино',
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево',
'order': 1}]}]}
Solution: use pandas.io.json.json_normalize (available as pd.json_normalize since pandas 1.0):
In [71]: pd.io.json.json_normalize(data['lines'],
['stations'],
['name', 'id'],
meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
['lines', 'stations'],
['id', ['lines', 'name']],
meta_prefix='Line_')
.assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва
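For readers on a current pandas, here is a minimal sketch of the same flattening with the top-level pd.json_normalize. It assumes the top-level object also carries a 'name' key (the question's json_normalize(j) output shows one, but the sample dict above does not), and the 'top_' prefix and the final column names are only illustrative choices.

```python
import pandas as pd

# Sample payload; the top-level "name" key is an assumption for illustration.
data = {
    "id": "1",
    "name": "Москва",
    "lines": [
        {
            "hex_color": "FFCD1C",
            "id": "8",
            "name": "Калининская",
            "stations": [
                {"id": "8.189", "lat": 55.745113, "lng": 37.864052,
                 "name": "Новокосино", "order": 0},
                {"id": "8.88", "lat": 55.752237, "lng": 37.814587,
                 "name": "Новогиреево", "order": 1},
            ],
        }
    ],
}

# One row per station; meta pulls values from the enclosing levels.
# meta_prefix avoids a clash between the station "id"/"name" columns
# and the top-level "id"/"name" keys.
flat = pd.json_normalize(
    data,
    record_path=["lines", "stations"],
    meta=["id", "name", ["lines", "name"]],
    meta_prefix="top_",
)

# Rename the meta columns to the names used in the question.
flat = flat.rename(columns={
    "top_id": "Id_Top",
    "top_name": "Name_Top",
    "top_lines.name": "Line_Name",
})
print(flat)
```

The prefix matters here: without it, json_normalize refuses to add meta columns whose names collide with record columns.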
Good day all!
I am trying to flatten some nested JSON using json_normalize, but the output I keep getting is not what I need.
Here's my code so far:
df1 = pd.read_csv('data_file.csv')
groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)
The bit where the dictionary gets created seems to be working as seen here:
groups_dict.info()
groups_dict.head()
<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count Dtype
-------------- -----
19 non-null object
dtypes: object(1)
memory usage: 280.0+ bytes
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
But when I try to normalize the dictionary, I get the following output:
df2 = pd.json_normalize(groups_dict)
df2.head()
0
1
2
3
4
I need to have each item from the groups column listed as its own column to complete my project. Please see the example below for a sample data file (CSV format) and what I am trying to accomplish.
CSV:
campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0
Before script:
df1['groups'].head()
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
After script:
df2.head()
group_id name
0 798800 Clickers 1
1 798803 Clickers 2
2 848426 Colin Safe Brow...
3 798804 Clickers 3
4 855348 Email Whitelist...
Anyone have pointers on how I should proceed?
Any assistance would be greatly appreciated. Thanks!
You need to first extract the nested dict from its str representation by using eval or ast.literal_eval from the ast module.
You can then create a separate dataframe from the column you want by doing:
import ast
df1['groups'] = df1['groups'].apply(ast.literal_eval)
However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.
df1['groups'] = df1['groups'].apply(lambda l: l[0])
df2 = df1['groups'].apply(pd.Series)
Then you can access individual columns such as group_id and name using:
df2['group_id']
df2['name'] # etc.
group_id
0 798800
1 798803
2 848426
3 798804
4 855348
Similarly for other columns within your nested dict.
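If some rows could hold more than one group, taking l[0] would silently drop the rest. A possible variation, sketched here on a two-row stand-in for the CSV (the frame below is made up from the sample rows, and it assumes every row has at least one group), is to explode the parsed lists and then normalize:

```python
import ast

import pandas as pd

# Stand-in for pd.read_csv('data_file.csv'): the groups column arrives as strings.
df1 = pd.DataFrame({
    "campaign_id": [201644, 201645],
    "groups": [
        "[{'group_id': 798800, 'name': 'Clickers 1 '}]",
        "[{'group_id': 798803, 'name': 'Clickers 2'}]",
    ],
})

# Turn each string into a real list of dicts (literal_eval only accepts literals).
parsed = df1["groups"].apply(ast.literal_eval)

# One row per dict; the original row label is repeated for multi-group rows.
exploded = parsed.explode()

# Expand the dicts into columns and keep the original row labels.
df2 = pd.json_normalize(exploded.tolist())
df2.index = exploded.index
print(df2)
```

Because df2 keeps the labels of df1, it can be joined back with df1.join(df2, rsuffix='_group') if the other campaign columns are needed alongside.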
I have a dataframe as follows:
lat  long  city  nameDisease  numberCases
0    2     rio   Dengue       1
0    2     rio   Chicungunha  2
1    3     sp    Dengue       3
1    3     sp    COVID        4
I want to aggregate the rows with same (lat,long,city) and generate a json as follows:
[{lat:0,long:2,city:"rio",diseases:[{nameDisease:"Dengue",numberCases:1},{nameDisease:"Chicungunha",numberCases:2}]},{lat:1,long:3,city:"sp",diseases:[{nameDisease:"Dengue",numberCases:3},{nameDisease:"COVID",numberCases:4}]}]
How can I do this kind of transformation with pandas?
A few to_dict + groupby calls:
cols = ['lat', 'long', 'city']  # the columns that identify a group
json = df.groupby(cols).apply(lambda g: g.drop(cols, axis=1).to_dict('records')).reset_index().rename({0:'diseases'}, axis=1).to_dict('records')
Output:
>>> json
[{'lat': 0,
'long': 2,
'city': 'rio',
'diseases': [{'nameDisease': 'Dengue', 'numberCases': 1},
{'nameDisease': 'Chicungunha', 'numberCases': 2}]},
{'lat': 1,
'long': 3,
'city': 'sp',
'diseases': [{'nameDisease': 'Dengue', 'numberCases': 3},
{'nameDisease': 'COVID', 'numberCases': 4}]}]
>>> json == expected_output
True
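The same grouping can also be written as a plain loop over the groups, which some may find easier to read than the chained one-liner. This is only a sketch: the frame is rebuilt from the question's table, and the name records is an arbitrary choice.

```python
import pandas as pd

# Data rebuilt from the table in the question.
df = pd.DataFrame({
    "lat": [0, 0, 1, 1],
    "long": [2, 2, 3, 3],
    "city": ["rio", "rio", "sp", "sp"],
    "nameDisease": ["Dengue", "Chicungunha", "Dengue", "COVID"],
    "numberCases": [1, 2, 3, 4],
})

cols = ["lat", "long", "city"]

# For each (lat, long, city) group, keep the key values once and
# nest the remaining columns as a list of dicts under "diseases".
records = [
    {**dict(zip(cols, key)), "diseases": group.drop(columns=cols).to_dict("records")}
    for key, group in df.groupby(cols, sort=False)
]
print(records)
```

Note that the group keys may come back as NumPy scalars, so they may need converting to plain int before being passed to the standard json module.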
So I'm working on a movie genre data set and the dataset has all the genres in a single column but I want to split them.
Here's what the data set looks like:
genres
----------------------------------------------
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
So what I want to do is get only the first genre so the new column should look like:
genres
_____________
Animation
Comedy
Comedy
Comedy
Action
I hope this is clear enough to understand my problem.
Use Series.apply (df['genres'] is a single column, so this is the Series method).
The first dictionary in the list is selected in each cell. From that dictionary the name field is selected:
df['genres']=df['genres'].apply(lambda x: x[0]['name'])
print(df)
ID genres
0 0 Animation
1 1 Comedy
2 2 Comedy
3 3 Comedy
4 4 Action
or
df['genres']=df['genres'].apply(lambda x: eval(x)[0]['name'])
Try this:
def decode_str_dict(x):
    try:
        out = eval(x)[0]['name']
    except Exception:
        try:
            out = eval(x)['name']
        except Exception:
            try:
                out = eval(x)
            except Exception:
                out = x
    return out
df['genres'].apply(decode_str_dict)
df['genres'] = df['genres'].map(lambda x:[i['name'] for i in x])
df['first_genre'] = df['genres'].str[0]  # first name in each row's list, not row 0 of the column
df = df[['name','first_genre']]
This works if the values are considered a string.
from ast import literal_eval
df['genres'] = df.genres.map(lambda x: literal_eval(x)[0]['name'])
Result:
Out[294]:
ID genres
1 0 Animation
2 1 Comedy
3 2 Comedy
4 3 Comedy
5 4 Action
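Both variants above raise on a row with an empty genre list. A small helper that accepts either strings or already-parsed lists and returns None for empty cells might look like this (first_genre is a hypothetical name, and the third row is added only to show the empty case):

```python
from ast import literal_eval

import pandas as pd


def first_genre(cell):
    """Return the name of the first genre in a cell, or None if there is none."""
    # Parse only when the cell is still a string; leave real lists alone.
    items = literal_eval(cell) if isinstance(cell, str) else cell
    return items[0]["name"] if items else None


df = pd.DataFrame({
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        [{"id": 35, "name": "Comedy"}],  # already a list
        "[]",                            # no genres at all
    ]
})

df["first_genre"] = df["genres"].apply(first_genre)
print(df["first_genre"])
```

literal_eval is used instead of eval because it only evaluates Python literals, which is the safer choice for data read from a file.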
I have a dataframe named matchdf. It is a huge one so I'm showing the 1st 3 rows and columns of the dataframe:
print(matchdf.iloc[:3,:3])
Unnamed: 0 athletesInvolved awayScore
0 0 [{'id': '39037', 'name': 'Azhar Ali', 'shortNa... 0
1 1 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
2 2 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
I was working with athletesInvolved column and as you can see it contains a list which is of form:
print(matchdf['athletesInvolved'][0])
[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]
However, the datatype of this object is str rather than list. How can we convert it to a list?
We can use ast.literal_eval:
import ast
df.c=df.c.apply(ast.literal_eval)
I am trying to extract the name from the below dictionary:
df = df[[x.get('Name') for x in df['Contact']]]
Here is what my DataFrame looks like:
data = [{'emp_id': 101,
'name': {'Name': 'Kevin',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
{'emp_id': 102,
'name': {'Name': 'Scott',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'
If there are no NaNs, use json_normalize.
pd.io.json.json_normalize(df.name.tolist())['Name']
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.io.json.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
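If keeping the missing rows in place is acceptable, an alternative sketch is to guard the lookup with an isinstance check, which also sidesteps the AttributeError from the question. The three-row frame below is a reduced stand-in for the one shown above.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for the frame above: the middle row has no contact dict.
df = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "name": [
        {"Name": "Kevin", "attributes": {"type": "Contact"}},
        np.nan,
        {"Name": "Scott", "attributes": {"type": "Contact"}},
    ],
})

# Only call .get on real dicts; anything else (NaN, None) becomes a missing value.
names = df["name"].apply(lambda d: d.get("Name") if isinstance(d, dict) else None)
print(names)
```

The result stays aligned with df, so it can be assigned straight back as a new column.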
Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If you don't need a list and want a Series instead, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())
Try the following line:
names = [name.get('Name') for name in df['name']]