Extract values from column of dictionaries using pandas - python

I am trying to extract the name from the below dictionary:
df = df[[x.get('Name') for x in df['Contact']]]
Below is what my DataFrame looks like:
data = [{'emp_id': 101,
'name': {'Name': 'Kevin',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
{'emp_id': 102,
'name': {'Name': 'Scott',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'

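If there are no NaNs, use json_normalize (exposed as pd.json_normalize since pandas 1.0; the older pd.io.json.json_normalize path was deprecated in that release).
pd.json_normalize(df.name.tolist())['Name']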
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
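The drop-then-restore-index steps above can be sketched end to end as follows. This is a minimal illustration, assuming pandas 1.0 or later (for pd.json_normalize); the trimmed sample data and the contact_name column are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical, trimmed version of the frame above, with one missing value.
df = pd.DataFrame({
    'emp_id': [101, 102, 103],
    'name': [{'Name': 'Kevin'}, np.nan, {'Name': 'Scott'}],
})

# Keep only the rows that actually hold a dict, remembering their index.
valid = df['name'].dropna()

# Flatten the dicts, then put the original row labels back.
names = pd.json_normalize(valid.tolist())['Name']
names.index = valid.index

# Assignment aligns on the index, so row 1 simply stays NaN.
df['contact_name'] = names
```

Because the assignment aligns on the index, no explicit merge is needed to get the names back next to their rows.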

Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If you want a Series rather than a list, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())

Try the following line:
names = [name.get('Name') for name in df['name']]
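Note that calling .get on every cell fails as soon as a cell is None or NaN, which is one plausible cause of the AttributeError in the question. A guarded variant might look like this; the sample data is hypothetical and the isinstance check is just one way to skip non-dict cells.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in the dict column.
df = pd.DataFrame({
    'emp_id': [101, 102, 103],
    'name': [{'Name': 'Kevin'}, np.nan, {'Name': 'Scott'}],
})

# Only call .get on real dicts; anything else (None, NaN) maps to None.
names = [cell.get('Name') if isinstance(cell, dict) else None
         for cell in df['name']]
```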

Related

Create dataframe columns from JSON within CSV column

Good day all!
I am trying to flatten some nested JSON using json_normalize, but the output I keep getting is not what I need.
Here's my code so far:
df1 = pd.read_csv('data_file.csv')
groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)
The bit where the dictionary gets created seems to be working as seen here:
groups_dict.info()
groups_dict.head()
<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count Dtype
-------------- -----
19 non-null object
dtypes: object(1)
memory usage: 280.0+ bytes
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
But when I try to normalize the dictionary, I get the following output:
df2 = pd.json_normalize(groups_dict)
df2.head()
0
1
2
3
4
I need to have each item from the groups column listed as its own column to complete my project. Please see the example below for a sample data file (CSV format) and what I am trying to accomplish.
CSV:
campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0
Before script:
df1['groups'].head()
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
After script:
df2.head()
group_id name
0 798800 Clickers 1
1 798803 Clickers 2
2 848426 Colin Safe Brow...
3 798804 Clickers 3
4 855348 Email Whitelist...
Anyone have pointers on how I should proceed?
Any assistance would be greatly appreciated. Thanks!
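The groups column is read from the CSV as strings, so you first need to parse each string back into Python objects with ast.literal_eval from the ast module (eval would also work, but it executes arbitrary code, so literal_eval is the safer choice).
You can then create a separate DataFrame from the column you want by doing: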
import ast
df1['groups'] = df1['groups'].apply(ast.literal_eval)
However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.
df1['groups'] = df1['groups'].apply(lambda l: l[0])
df2 = df1['groups'].apply(pd.Series)
Then you can access individual columns such as group_id and name using:
df2['group_id']
df2['name'] # etc.
group_id
0 798800
1 798803
2 848426
3 798804
4 855348
Similarly for other columns within your nested dict.
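As a rough alternative to taking l[0] and applying pd.Series, the parsed lists can be exploded and passed to pd.json_normalize, which also copes with rows holding more than one group. This is a sketch under assumptions: the two sample rows are copied from the CSV above, every list is assumed to be non-empty, and pandas 1.0 or later is assumed.

```python
import ast
import pandas as pd

# Two rows from the sample CSV; 'groups' arrives as a string.
df1 = pd.DataFrame({
    'campaign_id': [201644, 201645],
    'groups': ["[{'group_id': 798800, 'name': 'Clickers 1 '}]",
               "[{'group_id': 798803, 'name': 'Clickers 2'}]"],
})

# Turn each string into a real list of dicts.
parsed = df1['groups'].apply(ast.literal_eval)

# One dict per row; the index repeats if a list held several dicts.
exploded = parsed.explode()

# Flatten the dicts into columns and restore the original row labels.
df2 = pd.json_normalize(exploded.tolist())
df2.index = exploded.index
```

Keeping the original index on df2 makes it straightforward to join the new columns back onto df1 if needed.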

How to transpose column values to headers, while getting values from another column?

I am currently struggling with extracting/flattening data from a deeply nested dictionary: Flattening a nested dictionary with unique keys for each dictionary? .
I received a somewhat acceptable response, but I now have problems applying that approach to another dictionary.
So far I have gotten to a point where I have the following DataFrame.
First I would concatenate the values of "this_should_be_columns" + '_' + "child_column_name", (not a problem)
What I want is for all the unique values in ("this_should_be_columns"_"child_column_name") to become headers, and the rows should be their corresponding values (column "0").
Any ideas/solutions would be much appreciated!
FYI, my dictionary looks as follows:
{'7454':
{'coach':
{'wyId': 562711, 'shortName': 'Name1', 'firstName': 'N1', 'middleName': '', 'lastName': 'N2',
'birthDate': None,
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7454, 'gender': 'male', 'status': 'active'}},
'7453':
{'coach':
{'wyId': 56245, 'shortName': 'Name2', 'firstName': 'N3', 'middleName': '', 'lastName': 'N4',
'birthDate': 'yyyy-mm-dd',
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7453, 'gender': 'male', 'status': 'active'}}}
The code looks as follows:
df_test = pd.DataFrame(pd.Series(responses).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).reset_index().rename(columns={'level_0': 'teamId', 'level_1': 'type', 'level_2': 'this_should_be_columns', 'level_3': 'child_column_name', 'level_4': 'firstname', 'level_5' :'middleName', 'level_6' : 'ignore'}))
del df_test['firstname']
del df_test['middleName']
del df_test['ignore']
print(df_test)
The problem is that your dictionaries have a different number of levels. 'birthArea' and 'passportArea' contain dictionaries while the other keys simply contain values. You can use pd.json_normalize() to flatten the keys of the innermost dictionary as described in Flatten nested dictionaries, compressing keys.
In [37]: pd.DataFrame(responses).stack().apply(lambda x: pd.json_normalize(x, sep='_').to_dict(orient='records')[0]).apply(pd.Series).stack().reset_index()
Out[37]:
level_0 level_1 level_2 0
0 coach 7454 wyId 562711
1 coach 7454 shortName Name1
2 coach 7454 firstName N1
3 coach 7454 middleName
4 coach 7454 lastName N2
.. ... ... ... ...
28 coach 7453 birthArea_name Denmark
29 coach 7453 passportArea_id 208
30 coach 7453 passportArea_alpha2code DK
31 coach 7453 passportArea_alpha3code DNK
32 coach 7453 passportArea_name Denmark
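If the goal is one row per team with the flattened keys as headers, one possible shortcut (not the approach shown above) is to hand the coach dictionaries to pd.json_normalize directly. The sketch below uses a trimmed copy of the dictionary from the question and assumes every team has a 'coach' entry.

```python
import pandas as pd

# Trimmed copy of the structure from the question (two teams, one coach each).
responses = {
    '7454': {'coach': {'wyId': 562711, 'shortName': 'Name1',
                       'birthArea': {'id': 208, 'name': 'Denmark'},
                       'currentTeamId': 7454}},
    '7453': {'coach': {'wyId': 56245, 'shortName': 'Name2',
                       'birthArea': {'id': 208, 'name': 'Denmark'},
                       'currentTeamId': 7453}},
}

# One dict per team; nested keys become 'parent_child' column names.
coaches = [payload['coach'] for payload in responses.values()]
wide = pd.json_normalize(coaches, sep='_')

# Label the rows with the team ids (the outer dictionary keys).
wide.index = pd.Index(list(responses.keys()), name='teamId')
```

This skips the long intermediate table entirely, so no pivot step is needed afterwards.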

How do I split a column into separate columns in a CSV file?

So I'm working on a movie genre data set that has all the genres in a single column, but I want to split them.
Here's what the data set looks like:
genres
----------------------------------------------
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
So what I want to do is get only the first genre so the new column should look like:
genres
_____________
Animation
Comedy
Comedy
Comedy
Action
I hope this is clear enough to understand my problem.
Use Series.apply on the genres column.
The first dictionary in the list is selected in each cell. From that dictionary the name field is selected:
df['genres']=df['genres'].apply(lambda x: x[0]['name'])
print(df)
ID genres
0 0 Animation
1 1 Comedy
2 2 Comedy
3 3 Comedy
4 4 Action
or, if the cells are stored as strings:
df['genres']=df['genres'].apply(lambda x: eval(x)[0]['name'])
Try this:
def decode_str_dict(x):
    try:
        out=eval(x)[0]['name']
    except Exception:
        try:
            out=eval(x)['name']
        except Exception:
            try:
                out=eval(x)
            except Exception:
                out=x
    return out
df['genres'].apply(decode_str_dict)
df['genres'] = df['genres'].map(lambda x:[i['name'] for i in x])
df['first_genre'] = df['genres'].str[0]  # first name in each row's list, not the first row
df = df[['name','first_genre']]
This works if the values are stored as strings.
from ast import literal_eval
df['genres'] = df.genres.map(lambda x: literal_eval(x)[0]['name'])
Result:
Out[294]:
ID genres
1 0 Animation
2 1 Comedy
3 2 Comedy
4 3 Comedy
5 4 Action
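The answers above differ mainly in whether the cells are strings or already lists. A small helper can cover both cases; this is only a sketch, with a hypothetical first_genre function and made-up sample rows, and it returns None for empty or missing cells.

```python
from ast import literal_eval
import pandas as pd

# Made-up sample: string cells, including one empty list.
df = pd.DataFrame({'genres': [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
]})

def first_genre(cell):
    """Return the name of the first genre, or None if there is none."""
    # Parse string cells; leave already-parsed cells untouched.
    items = literal_eval(cell) if isinstance(cell, str) else cell
    if isinstance(items, list) and items:
        return items[0]['name']
    return None

df['first_genre'] = df['genres'].apply(first_genre)
```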

Convert dataframe column values into a list

I have a dataframe named matchdf. It is a huge one so I'm showing the 1st 3 rows and columns of the dataframe:
print(matchdf.iloc[:3,:3])
Unnamed: 0 athletesInvolved awayScore
0 0 [{'id': '39037', 'name': 'Azhar Ali', 'shortNa... 0
1 1 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
2 2 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
I was working with athletesInvolved column and as you can see it contains a list which is of form:
print(matchdf['athletesInvolved'][0])
[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]
However, the datatype of this object is str rather than list. How can I convert it to a list?
You can use ast.literal_eval:
import ast
matchdf['athletesInvolved'] = matchdf['athletesInvolved'].apply(ast.literal_eval)
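To make the conversion concrete, here is a small sketch using the first row shown in the question. The parse_cell helper is hypothetical; it only parses cells that are still strings, so running it twice does no harm.

```python
import ast
import pandas as pd

# First row of 'athletesInvolved' from the question, stored as a string.
matchdf = pd.DataFrame({'athletesInvolved': [
    "[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, "
    "{'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]",
]})

def parse_cell(cell):
    # Strings are parsed into Python objects; other values pass through.
    return ast.literal_eval(cell) if isinstance(cell, str) else cell

matchdf['athletesInvolved'] = matchdf['athletesInvolved'].apply(parse_cell)

# Once the cells are real lists, ordinary list operations work.
athlete_names = matchdf['athletesInvolved'].apply(
    lambda players: [p['name'] for p in players])
```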

Python: Json unpacking into a dataframe

I'm trying to parse JSON I've received from an API into a pandas DataFrame. The JSON is hierarchical; in this example I have a city code, a line name and a list of stations for that line. Unfortunately I can't "unpack" it. I would be grateful for help and an explanation.
Json:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская', <------Line name
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино', <------Station 1
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево', <------Station 2
'order': 1},
etc.
I'm trying to retrieve everything from the lowest level and then add all the higher-level information (starting from the line name):
c = r.content
j = simplejson.loads(c)
tmp=[]
i=0
data1=pd.DataFrame(tmp)
data2=pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
    data2 = data2.append(pd.DataFrame(station['stations'], station['name']),ignore_index=True)
data2
Once more - the questions are:
How to make it work?
Is this solution an optimal one, or are there some functions I should know about?
Update:
The Json parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва
In addition to MaxU's answer, I think you still need the highest level id, this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')
Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская',
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино',
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево',
'order': 1}]}]}
Solution: use pandas.io.json.json_normalize (available as pd.json_normalize since pandas 1.0):
In [71]: pd.io.json.json_normalize(data['lines'],
['stations'],
['name', 'id'],
meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
['lines', 'stations'],
['id', ['lines', 'name']],
meta_prefix='Line_')
.assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва
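For readers on pandas 1.0 or later, the same idea can be sketched with pd.json_normalize and keyword arguments, followed by a rename to the column names requested in the question. The top-level 'name' field is an assumption (it appears in the question's json_normalize(j) output but not in the sample dictionary), and the 'top_' prefix is an arbitrary choice for illustration.

```python
import pandas as pd

data = {
    'id': '1',
    'name': 'Москва',  # assumed top-level field, based on the question's update
    'lines': [{
        'hex_color': 'FFCD1C',
        'id': '8',
        'name': 'Калининская',
        'stations': [
            {'id': '8.189', 'lat': 55.745113, 'lng': 37.864052,
             'name': 'Новокосино', 'order': 0},
            {'id': '8.88', 'lat': 55.752237, 'lng': 37.814587,
             'name': 'Новогиреево', 'order': 1},
        ],
    }],
}

# record_path walks down to the station dicts; meta pulls in fields from
# the levels above. The prefix avoids clashes with the station 'id'/'name'.
res = pd.json_normalize(
    data,
    record_path=['lines', 'stations'],
    meta=['id', 'name', ['lines', 'name']],
    meta_prefix='top_',
)

# Rename the prefixed meta columns to the names asked for in the question.
res = res.rename(columns={
    'top_id': 'Id_Top',
    'top_name': 'Name_Top',
    'top_lines.name': 'Line_Name',
})
```

Pulling Name_Top from the data rather than hard-coding it with assign keeps the result correct if the response ever covers more than one city.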
