I would like to count the number of themes after normalizing a nested column.
Here is a sample of my data:
0 [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}]
1 [{'code': '1', 'name': 'Economic management'}, {'code': '6', 'name': 'Social protection and risk management'}]
2 [{'code': '5', 'name': 'Trade and integration'}, {'code': '2', 'name': 'Public sector governance'}, {'code': '11', 'name': 'Environment and natural resources management'}, {'code': '6', 'name': 'Social protection and risk management'}]
3 [{'code': '7', 'name': 'Social dev/gender/inclusion'}, {'code': '7', 'name': 'Social dev/gender/inclusion'}]
4 [{'code': '5', 'name': 'Trade and integration'}, {'code': '4', 'name': 'Financial and private sector development'}]
Name: mjtheme_namecode, dtype: object
This is what I have tried:
from pandas.io.json import json_normalize
result = json_normalize(json_file, 'mjtheme_namecode').name.value_counts()
However this returns the error
TypeError: string indices must be integers
I think the issue is how the JSON file is read: mjtheme_namecode should be one flat list of dicts, not a list of lists or something like that. Try passing max_level=0. Another possibility is a problem with the empty name field; try supplying a default value (see: Pandas json_normalize and null values in JSON).
I managed to get the result like this:
from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json import path is deprecated
mjtheme_namecode =[{'code':'8','name':'Humandevelopment'},{'code':'11','name':''},{'code':'1','name':'Economicmanagement'},{'code':'6','name':'Socialprotectionandriskmanagement'},
{'code':'5','name':'Tradeandintegration'},{'code':'2','name':'Publicsectorgovernance'},{'code':'11','name':'Environmentandnaturalresourcesmanagement'},{'code':'6','name':'Socialprotectionandriskmanagement'},
{'code':'7','name':'Socialdev/gender/inclusion'},{'code':'7','name':'Socialdev/gender/inclusion'},
{'code':'5','name':'Tradeandintegration'},{'code':'4','name':'Financialandprivatesectordevelopment'}]
print(mjtheme_namecode)
result = json_normalize(mjtheme_namecode).name.value_counts()
print(result)
Socialdev/gender/inclusion 2
Socialprotectionandriskmanagement 2
Tradeandintegration 2
Humandevelopment 1
Publicsectorgovernance 1
Environmentandnaturalresourcesmanagement 1
Financialandprivatesectordevelopment 1
Economicmanagement 1
1
Name: name, dtype: int64
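If the column holds one list of dicts per row, as in the sample from the question, the data does not have to be retyped as a single flat list. The following is a minimal sketch, assuming pandas 1.0 or later (for pd.json_normalize) and that every row is a non-empty list. The Series s is rebuilt by hand from part of the sample and stands in for the asker's actual column.

```python
import pandas as pd

# Hand-built stand-in for the mjtheme_namecode column (one list of dicts per row).
s = pd.Series(
    [
        [{"code": "8", "name": "Human development"}, {"code": "11", "name": ""}],
        [{"code": "1", "name": "Economic management"},
         {"code": "6", "name": "Social protection and risk management"}],
        [{"code": "7", "name": "Social dev/gender/inclusion"},
         {"code": "7", "name": "Social dev/gender/inclusion"}],
    ],
    name="mjtheme_namecode",
)

# explode() turns each list element into its own row, giving one dict per row.
records = s.explode().tolist()

# json_normalize accepts a flat list of dicts and returns one column per key.
themes = pd.json_normalize(records)
counts = themes["name"].value_counts()
print(counts)
```

Note that the empty name is counted as its own category, which matches the unlabeled row in the output above.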
Good day all!
I am trying to flatten some nested JSON using json_normalize, but the output I keep getting is not what I need.
Here's my code so far:
df1 = pd.read_csv('data_file.csv')
groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)
The bit where the dictionary gets created seems to be working as seen here:
groups_dict.info()
groups_dict.head()
<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count Dtype
-------------- -----
19 non-null object
dtypes: object(1)
memory usage: 280.0+ bytes
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
But when I try to normalize the dictionary, I get the following output:
df2 = pd.json_normalize(groups_dict)
df2.head()
0
1
2
3
4
I need to have each item from the groups column listed as its own column to complete my project. Please see the example below for a sample data file (CSV format) and what I am trying to accomplish.
CSV:
campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0
Before script:
df1['groups'].head()
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
After script:
df2.head()
group_id name
0 798800 Clickers 1
1 798803 Clickers 2
2 848426 Colin Safe Brow...
3 798804 Clickers 3
4 855348 Email Whitelist...
Anyone have pointers on how I should proceed?
Any assistance would be greatly appreciated. Thanks!
You need to first extract the nested dict from its str representation by using eval or ast.literal_eval from the ast module.
You can then create a separate dataframe from the column you want by doing:
import ast
df1['groups'] = df1['groups'].apply(ast.literal_eval)
However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.
df1['groups'] = df1['groups'].apply(lambda l: l[0])
df2 = df1['groups'].apply(pd.Series)
Then you can access individual columns such as group_id and name using:
df2['group_id']
df2['name'] # etc.
group_id
0 798800
1 798803
2 848426
3 798804
4 855348
Similarly for other columns within your nested dict.
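Taking only the first element works when every list holds exactly one dict. As a hedged alternative, the sketch below keeps every dict by exploding the parsed lists before normalizing. It assumes pandas 1.0 or later, and the two-row df1 is a cut-down stand-in built from the sample CSV rather than the real file.

```python
import ast
import pandas as pd

# Cut-down stand-in for the CSV: the groups column arrives as strings.
df1 = pd.DataFrame({
    "campaign_id": [201644, 201645],
    "groups": [
        "[{'group_id': 798800, 'name': 'Clickers 1 '}]",
        "[{'group_id': 798803, 'name': 'Clickers 2'}]",
    ],
})

# Parse each string into a real list of dicts.
parsed = df1["groups"].apply(ast.literal_eval)

# One dict per row; the original row label is repeated for multi-item lists.
exploded = parsed.explode()

# Flatten the dicts into columns and restore the original row labels.
df2 = pd.json_normalize(exploded.tolist())
df2.index = exploded.index

# Join back on the index so each group stays attached to its campaign.
result = df1[["campaign_id"]].join(df2)
print(result)
```

Rows with an empty list would become NaN after explode() and would need to be dropped before the json_normalize call.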
I have a dataframe whose column values are lists of dicts. My intention is to normalize the entire column (all rows). I found a way to normalize a single row, but I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [ [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae', }]}], [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae', }]}], [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae', }]}] ] }
source_df = pd.DataFrame(data)
source_df looks like below :
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')
Output for above code :
I would like to know how we can achieve desired output for the entire column
Expected output:
EDIT:
#lostCode , below is the input with nan and empty list
You can do:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index]).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; here, the list is built by calling json_normalize on each row of source_df.
To skip rows that are not lists (for example NaN), you can check the type of each entry with isinstance:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index if isinstance(source_df['COLUMN'][key],list)]).reset_index(drop=True)
Target_df=source_df.apply(json_normalize)
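The same filtering idea can also be written with a single json_normalize call by collecting the inner dicts first. This is only a sketch, assuming pandas 1.0 or later; the NaN and empty-list rows are made up here to mimic the input described in the edit, since that input is not shown.

```python
import numpy as np
import pandas as pd

# Stand-in data: two valid rows plus a NaN row and an empty-list row (assumed).
data = {"COLUMN": [
    [{"name": "WAG 01", "id": "105F", "state": "available", "nodes": 3,
      "volumes": [{"state": "available", "id": "330172", "name": "q_-4144d4e"},
                  {"state": "available", "id": "275192", "name": "p_3089d821ae"}]}],
    np.nan,
    [],
    [{"name": "FEC 01", "id": "382E", "state": "available", "nodes": 4,
      "volumes": [{"state": "unavailable", "id": "830172", "name": "w_-4144d4e"},
                  {"state": "unavailable", "id": "223192", "name": "g_3089d821ae"}]}],
]}
source_df = pd.DataFrame(data)

# Keep only cells that are lists, then flatten them into one list of dicts.
# NaN cells are skipped by isinstance; empty lists simply contribute nothing.
records = [rec
           for cell in source_df["COLUMN"] if isinstance(cell, list)
           for rec in cell]

# One call: each entry of "volumes" becomes a row, parent fields are repeated.
target_df = pd.json_normalize(
    records,
    record_path="volumes",
    meta=["name", "id", "state", "nodes"],
    record_prefix="volume_",
)
print(target_df)
```

The record_prefix matters here: without it the volume keys (state, id, name) would clash with the meta fields of the same name.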
I have a dataframe named matchdf. It is a huge one so I'm showing the 1st 3 rows and columns of the dataframe:
print(matchdf.iloc[:3,:3])
Unnamed: 0 athletesInvolved awayScore
0 0 [{'id': '39037', 'name': 'Azhar Ali', 'shortNa... 0
1 1 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
2 2 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
I was working with athletesInvolved column and as you can see it contains a list which is of form:
print(matchdf['athletesInvolved'][0])
[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]
However the datatype for this object is str as opposed to a list. How can we convert the above datatype to a list
We can use ast:
import ast
df.c=df.c.apply(ast.literal_eval)
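If the dataframe comes from a CSV file, the parsing can also happen at load time through the converters argument of read_csv. The sketch below is an assumption about how matchdf was created (the question does not say); the in-memory CSV text is a shortened, made-up stand-in for the real file.

```python
import ast
import io
import pandas as pd

# Shortened stand-in for the real file; the list is stored as a quoted string.
csv_text = '''athletesInvolved,awayScore
"[{'id': '39037', 'name': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh'}]",0
'''

# converters maps a column name to a function applied to each raw string value.
matchdf = pd.read_csv(
    io.StringIO(csv_text),
    converters={"athletesInvolved": ast.literal_eval},
)

first = matchdf["athletesInvolved"][0]
print(type(first))
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for data read from a file.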
I have the code below. It is part of a bigger program, and I am providing just a snippet to show the problem. When I run it I get the error AttributeError: 'str' object has no attribute 'values'. df['URL'].values[0] runs fine. I want to copy the text values from the URL field into a new field called pdf_text, one value at a time, which is why I am using a function. In my real code, I take values from the URL column, open those files and do further processing.
sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
{'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
{'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf' }]
df = pd.DataFrame(sales)
def pdf2text(url):
    url=url.values[0]
    return url
#
abc= (df.assign(pdf_text = df['URL'].apply(pdf2text)))
You just want the name of the PDF without the file extension?
>>> import pandas as pd
>>> sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
... {'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
... {'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf' }]
>>>
>>> df = pd.DataFrame(sales)
>>> df.head()
Feb Jan URL account
0 200 .jones 150 jones ea2018-001.pdf credit cards
1 210 Jones 1
2 90 50 ea2017-104.pdf 1
>>> df['your_column'] = df.URL.map(lambda x: x.split(".")[0])
>>> df.head()
Feb Jan URL account your_column
0 200 .jones 150 jones ea2018-001.pdf credit cards ea2018-001
1 210 Jones 1
2 90 50 ea2017-104.pdf 1 ea2017-104
>>>
It raises AttributeError because url is a single string (not the whole Series), and a str object has no values attribute.
When you call apply on a Series, your function pdf2text receives one element per call, here the PDF file name.
df['URL'] = df['URL'].apply(pdf2text)
is equivalent to
urls = []
for url in df['URL']:
    # `url` equals something like this -> 'ea2018-001.pdf'
    urls.append(pdf2text(url))
df['URL'] = pd.Series(urls)
but it's slower and inefficient
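Putting this together, a minimal corrected version of the question's snippet could look like the following. The body of pdf2text is a placeholder that just returns the file name, since the real file-processing code is not shown.

```python
import pandas as pd

sales = [{'account': 'credit cards', 'Jan': '150 jones', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf'},
         {'account': '1', 'Jan': 'Jones', 'Feb': '210', 'URL': ''},
         {'account': '1', 'Jan': '50', 'Feb': '90', 'URL': 'ea2017-104.pdf'}]
df = pd.DataFrame(sales)

def pdf2text(url):
    # Series.apply calls this once per element, so `url` is already a str.
    # Placeholder body: the real code would open the file and extract its text.
    return url

# assign returns a new dataframe with the extra pdf_text column.
abc = df.assign(pdf_text=df['URL'].apply(pdf2text))
print(abc[['URL', 'pdf_text']])
```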
I have a pandas data frame.
mac_address no. of co_visit no. of random_visit
0 00:02:1a:11:b0:b9 1 2
1 00:02:71:d6:04:84 1 1
2 00:05:33:34:2f:f2 1 3
3 00:08:22:04:c4:fb 1 4
4 00:08:22:06:7b:41 1 1
5 00:08:22:07:48:15 1 1
6 00:08:22:08:a8:54 1 3
7 00:08:22:0e:0a:fc 1 1
I want to convert it into a dictionary keyed by mac_address, where each value is itself a dictionary with 'no. of co_visit' and 'no. of random_visit' as keys and that row's values as values. So my output for the first 2 rows would look like this:
00:02:1a:11:b0:b9:{no. of co_visit:1, no. of random_visit: 2}
00:02:71:d6:04:84:{no. of co_visit:1, no. of random_visit: 1}
I am using python2.7. Thank you.
I was able to set mac_address as key but the values were being added as list inside key, not key inside key.
You can use pandas.DataFrame.T and to_dict().
df.set_index('mac_address').T.to_dict()
Output:
{'00:02:1a:11:b0:b9': {'no. of co_visit': '1', 'no. of random_visit': '2'},
'00:02:71:d6:04:84': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:05:33:34:2f:f2': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:04:c4:fb': {'no. of co_visit': '1', 'no. of random_visit': '4'},
'00:08:22:06:7b:41': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:07:48:15': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:08:a8:54': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:0e:0a:fc': {'no. of co_visit': '1', 'no. of random_visit': '1'}}
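An equivalent form that skips the transpose is to_dict(orient='index'), which maps each index label to a dict of that row's values. The two-row dataframe below is a hand-built stand-in for the question's data. It assumes integer counts, so the values come out as numbers rather than the strings shown in the output above.

```python
import pandas as pd

# Hand-built stand-in for the first two rows of the question's dataframe.
df = pd.DataFrame({
    "mac_address": ["00:02:1a:11:b0:b9", "00:02:71:d6:04:84"],
    "no. of co_visit": [1, 1],
    "no. of random_visit": [2, 1],
})

# orient="index" yields {index_label: {column: value, ...}, ...}
result = df.set_index("mac_address").to_dict(orient="index")
print(result)
```

Both forms assume the mac_address values are unique; duplicate keys cannot be represented in a dictionary.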