Pandas - Extracting values from a Dataframe column - python

I have a Dataframe in the below format:
cust_id, cust_details
101, [{'self': 'https://website.com/rest/api/2/customFieldOption/1', 'value': 'Type-A', 'id': '1'},
{'self': 'https://website.com/rest/api/2/customFieldOption/2', 'value': 'Type-B', 'id': '2'},
{'self': 'https://website.com/rest/api/2/customFieldOption/3', 'value': 'Type-C', 'id': '3'},
{'self': 'https://website.com/rest/api/2/customFieldOption/4', 'value': 'Type-D', 'id': '4'}]
102, [{'self': 'https://website.com/rest/api/2/customFieldOption/5', 'value': 'Type-X', 'id': '5'},
{'self': 'https://website.com/rest/api/2/customFieldOption/6', 'value': 'Type-Y', 'id': '6'}]
For every cust_id, I am trying to extract all of the cust_details values.
Expected output:
cust_id, new_value
101,Type-A, Type-B, Type-C, Type-D
102,Type-X, Type-Y

Easy answer:
df['new_value'] = df.cust_details.apply(lambda ds: [d['value'] for d in ds])
More complex, potentially better answer:
Rather than storing lists of dictionaries in the first place, I'd recommend making each dictionary a row in the original dataframe.
df = pd.concat([
    df['cust_id'],
    pd.DataFrame(
        df['cust_details'].explode().values.tolist(),
        index=df['cust_details'].explode().index
    )
], axis=1)
If you need to group values by id, you can do so via standard groupby methods:
df.groupby('cust_id')['value'].apply(list)
This may seem more complex, but depending on your use case it might save you effort in the long run.
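For completeness, here is a self-contained sketch of the explode approach (data abridged and URLs shortened; cust_id is attached with .loc rather than concat, to sidestep alignment on the duplicated exploded index):

```python
import pandas as pd

# Abridged version of the question's data.
df = pd.DataFrame({
    'cust_id': [101, 102],
    'cust_details': [
        [{'self': 'url/1', 'value': 'Type-A', 'id': '1'},
         {'self': 'url/2', 'value': 'Type-B', 'id': '2'}],
        [{'self': 'url/5', 'value': 'Type-X', 'id': '5'}],
    ],
})

# One dictionary per row; the exploded index repeats for multi-item lists.
exploded = df['cust_details'].explode()
flat = pd.DataFrame(exploded.tolist(), index=exploded.index)
flat['cust_id'] = df.loc[exploded.index, 'cust_id'].values

# Regroup the values per customer.
print(flat.groupby('cust_id')['value'].apply(list).to_dict())
```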

Related

pandas split list like object

Hi I have this column of data named labels:
[{'id': 123456,
'name': 'John',
'age': 22,
'pet': None,
'gender': 'male',
'result': [{'id': 'vEo0PIYPEE',
'type': 'choices',
'value': {'choices': ['Same Person']},
'to_name': 'image',
'from_name': 'person_evaluation'}]}]
[{'id': 123457,
'name': 'May',
'age': 21,
'pet': None,
'gender': 'female',
'result': [{'id': 'zTHYuKIOQ',
'type': 'choices',
'value': {'choices': ['Different Person']},
'to_name': 'image',
'from_name': 'person_evaluation'}]}]
......
Not sure what type this is, and I would like to break it down to extract the value [Same Person]; the outcome should look something like this:
0 [Same Person]
1 [Different Person]
....
How should I achieve this?
Based on the limited data that you have provided, would this work?
df['labels_new'] = df['labels'].apply(lambda x: x[0].get('result')[0].get('value').get('choices'))
labels labels_new
0 [{'id': 123456, 'name': 'John', 'age': 22, 'pe... [Same Person]
1 [{'id': 123457, 'name': 'May', 'age': 21, 'pet... [Different Person]
You can use the following as well, but I find dict.get() to be more versatile (it can return default values, for example) and to have better exception handling.
df['labels'].apply(lambda x: x[0]['result'][0]['value']['choices'])
You could consider using pd.json_normalize, but for the current state of your column it's going to be more complex to extract the data that way than with a simple lambda function.
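For reference, here is what the json_normalize route could look like on abridged data (assuming each cell is already a parsed list of dicts, not a string):

```python
import pandas as pd

# Two rows shaped like the question's `labels` column (abridged).
df = pd.DataFrame({'labels': [
    [{'id': 123456, 'name': 'John',
      'result': [{'id': 'vEo0PIYPEE', 'type': 'choices',
                  'value': {'choices': ['Same Person']}}]}],
    [{'id': 123457, 'name': 'May',
      'result': [{'id': 'zTHYuKIOQ', 'type': 'choices',
                  'value': {'choices': ['Different Person']}}]}],
]})

# One outer dict per row, then flatten the nested `result` lists;
# the nested value dict becomes a dotted column name.
flat = pd.json_normalize(df['labels'].explode().tolist(), record_path='result')
print(flat['value.choices'].tolist())
```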

Converting data frame into a nested dictionary

Below is my subsetted data frame, I am having a hard time trying to convert it into my desired output as I am fairly new to Python. Essentially I want to create a nested dictionary inside a list, with the column names as a value, and then another nested dictionary inside a list. Is this doable?
import pandas as pd
Sector Community Name
0 centre: 10901.0 park: 3238.0
1 northeast: 6958.0 heights: 1955.0
Desired output:
[{'column': 'Sector',
'value': [{'name': 'centre', 'value': 10901.0},
{'name': 'northeast', 'value': 6958.0}]},
{'column': 'Community Name',
'value': [{'name': 'park', 'value': 3238.0},
{'name': 'heights', 'value': 1955.0},
{'name': 'hill', 'value': 1454.0}]}]
Building on @sushanth's answer, I would add this solution. Assume that your dataframe variable is defined as df.
result = []
for header in list(df):
    column_values = df[header].to_list()
    result.append({
        "column" : header,
        "value" : [dict(zip(['name', 'value'], str(value).split(":"))) for value in column_values]
    })
Using pandas in the above case might be overkill. Here is a solution using Python built-in functions that you can try:
input_ = {"Sector": ["centre: 10901.0", "northeast: 6958.0"],
          "Community Name": ["park: 3238.0", "heights: 1955.0"]}
result = []
for k, v in input_.items():
    result.append({
        "column" : k,
        "value" : [dict(zip(['name', 'value'], vv.split(":"))) for vv in v]
    })
print(result)
[{'column': 'Sector',
'value': [{'name': 'centre', 'value': ' 10901.0'},
{'name': 'northeast', 'value': ' 6958.0'}]},
{'column': 'Community Name',
'value': [{'name': 'park', 'value': ' 3238.0'},
{'name': 'heights', 'value': ' 1955.0'}]}]
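Note the leading space in each extracted value, left over from splitting on ':' alone. If you want clean names and numeric values, a small variation (sketch) strips and casts:

```python
input_ = {"Sector": ["centre: 10901.0", "northeast: 6958.0"],
          "Community Name": ["park: 3238.0", "heights: 1955.0"]}

result = []
for k, v in input_.items():
    pairs = (vv.split(":") for vv in v)
    result.append({
        "column": k,
        # strip() drops the leading space; float() makes the value numeric
        "value": [{"name": name.strip(), "value": float(num)} for name, num in pairs],
    })
print(result)
```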

Json_Normalize, targeting nested columns within a specific column?

I'm working with an API, trying to pull data out of it. The challenge I'm having is that the majority of the columns are straightforward and not nested, with the exception of a CustomFields column, which holds all the various custom fields in a list per record.
Using json_normalize is there a way to target a nested column to flatten it? I'm trying to fetch and use all the data available from the API but one nested column in particular is causing a headache.
The JSON data when retrieved from the API looks like the following. This is just for one customer profile,
[{'EmailAddress': 'an_email#gmail.com', 'Name': 'Al Smith', 'Date': '2020-05-26 14:58:00', 'State': 'Active', 'CustomFields': [{'Key': '[Location]', 'Value': 'HJGO'}, {'Key': '[location_id]', 'Value': '34566'}, {'Key': '[customer_id]', 'Value': '9051'}, {'Key': '[status]', 'Value': 'Active'}, {'Key': '[last_visit.1]', 'Value': '2020-02-19'}]}]
Using json_normalize,
payload = json_normalize(payload_json['Results'])
When I run the above code, the top-level fields flatten fine, but CustomFields remains a column of lists of dictionaries. Ideally, each custom field Key would become its own column in the final result.
I think I just need to work with the record_path and meta parameters but I'm not totally understanding how they work.
Any ideas? Or would using json_normalize not work in this situation?
Try this. You have square brackets in your JSON keys; that's why you see those [ ]:
d = [{'EmailAddress': 'an_email#gmail.com', 'Name': 'Al Smith', 'Date': '2020-05-26 14:58:00', 'State': 'Active', 'CustomFields': [{'Key': '[Location]', 'Value': 'HJGO'}, {'Key': '[location_id]', 'Value': '34566'}, {'Key': '[customer_id]', 'Value': '9051'}, {'Key': '[status]', 'Value': 'Active'}, {'Key': '[last_visit.1]', 'Value': '2020-02-19'}]}]
df = pd.json_normalize(d, record_path=['CustomFields'], meta=[['EmailAddress'], ['Name'], ['Date'], ['State']])
df = df.pivot_table(columns='Key', values='Value', index=['EmailAddress', 'Name'], aggfunc='sum')
print(df)
Output:
Key [Location] [customer_id] [last_visit.1] [location_id] [status]
EmailAddress Name
an_email#gmail.com Al Smith HJGO 9051 2020-02-19 34566 Active
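To see what record_path and meta are doing, here is a minimal sketch on an abridged single-profile payload (aggfunc='first' and reset_index swapped in so the result is a plain one-row frame):

```python
import pandas as pd

# One record shaped like the API payload (abridged; hypothetical values).
d = [{'EmailAddress': 'an_email@gmail.com', 'Name': 'Al Smith', 'State': 'Active',
      'CustomFields': [{'Key': '[Location]', 'Value': 'HJGO'},
                       {'Key': '[status]', 'Value': 'Active'}]}]

# record_path: the nested list to explode into rows.
# meta: top-level fields to repeat alongside each exploded row.
df = pd.json_normalize(d, record_path='CustomFields',
                       meta=['EmailAddress', 'Name', 'State'])
print(df)

# Pivot back to one row per profile, one column per custom field key.
wide = df.pivot_table(columns='Key', values='Value',
                      index=['EmailAddress', 'Name'],
                      aggfunc='first').reset_index()
print(wide)
```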

Append Dates in Chronological Order

This is the JSON:
[{'can_occur_before': False,
'categories': [{'id': 8, 'name': 'Airdrop'}],
'coins': [{'id': 'cashaa', 'name': 'Cashaa', 'symbol': 'CAS'}],
'created_date': '2018-05-26T03:34:05+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'Unsold Token Distribution',
'twitter_account': None,
'vote_count': 125},
{'can_occur_before': False,
'categories': [{'id': 4, 'name': 'Exchange'}],
'coins': [{'id': 'tron', 'name': 'TRON', 'symbol': 'TRX'}],
'created_date': '2018-06-04T03:54:59+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'Indodax Listing',
'twitter_account': '#PutraDwiJuliyan',
'vote_count': 75},
{'can_occur_before': False,
'categories': [{'id': 5, 'name': 'Conference'}],
'coins': [{'id': 'modum', 'name': 'Modum', 'symbol': 'MOD'}],
'created_date': '2018-05-26T03:18:03+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'SAPPHIRE NOW',
'twitter_account': None,
'vote_count': 27},
{'can_occur_before': False,
'categories': [{'id': 4, 'name': 'Exchange'}],
'coins': [{'id': 'apr-coin', 'name': 'APR Coin', 'symbol': 'APR'}],
'created_date': '2018-05-29T17:45:16+01:00',
'date_event': '2018-06-05T00:00:00+01:00',
'title': 'TopBTC Listing',
'twitter_account': '#cryptoalarm',
'vote_count': 23}]
I want to take all the date_event values and append them to a list in chronological order. I currently have this code and am not sure how to order them chronologically.
date = []
for i in getevents:
    date.append(i['date_event'][:10])
Thanks for any help !
A simple way is to compose a list and then apply the sort() method:
data = json.load(open('filename.json', 'r'))
dates = [item['date_event'] for item in data]
dates.sort()
Using your example data with the field 'created_date' (the 'date_event' values are all the same) we'll get:
['2018-05-26T03:18:03+01:00',
'2018-05-26T03:34:05+01:00',
'2018-05-29T17:45:16+01:00',
'2018-06-04T03:54:59+01:00']
First of all, the date_event values in your array of objects are all the same, so there is not much sense in sorting them. Also, your approach will not get you far; you need to convert the dates to native date/time objects so that you can sort them with a sorting function.
The easiest way to parse properly formatted date/times is to use dateutil.parser.parse, and sorting an existing list is done by list.sort() - I made a quick example of how to use these tools, and changed the date_event values to showcase it: https://repl.it/repls/BogusSpecificRate
After you have decoded the JSON string (json.loads) and have a Python list to work with, you can proceed with sorting the list:
from dateutil import parser

# Ascending
events.sort(key=lambda e: parser.parse(e['date_event']))
print([":".join([e['title'], e['date_event']]) for e in events])

# Descending
events.sort(key=lambda e: parser.parse(e['date_event']), reverse=True)
print([":".join([e['title'], e['date_event']]) for e in events])
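If you would rather avoid the dateutil dependency, datetime.fromisoformat (Python 3.7+) also parses these timestamps, offset included; a sketch using the created_date values from the question:

```python
from datetime import datetime

# Abridged events from the question (date_event is identical across
# records, so created_date is used to show the ordering).
events = [
    {'title': 'Unsold Token Distribution', 'created_date': '2018-05-26T03:34:05+01:00'},
    {'title': 'Indodax Listing', 'created_date': '2018-06-04T03:54:59+01:00'},
    {'title': 'SAPPHIRE NOW', 'created_date': '2018-05-26T03:18:03+01:00'},
]

# fromisoformat understands the trailing UTC offset, so the sort key
# is a real timezone-aware datetime rather than a string.
events.sort(key=lambda e: datetime.fromisoformat(e['created_date']))
print([e['title'] for e in events])
```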

Python - Extract Json file into headers

I have a json file that I imported into pandas. The first column is filled with cells that are in json format. Below is the first cell of 10K cells or so...
df = pd.read_json("test_file.json") # import data
print (df['test_column'].iloc[0]) # print first cell
{'data': [{'time': '2016-03-25', 'id': '54', 'stop': {'length': 38, 'fun_time': False, 'before': '2015-03-24', 'id': '10xd9'}}], 'dataType': 'life', 'weird': '2013-06-15', '_id': 'dirt', '_type': 'what', 'trace': '32', 'timestamp': 1418193255, 'teller': 'jeff', 'work': '1', 'eventCategory': 'so_true', 'eventType': 'complete', 'city': 'CHI', 'type': 'some_type', 'value': '32', 'data': 'river'}
The code above is an approximation of the real data in each cell
Is there a quick way to extract all the keys in the JSON data, append them as headers for new columns in the pandas dataframe, and then add each value to the appropriate row?
Thanks
Try
pd.io.json.json_normalize(df.test_column.apply(pd.io.json.loads))
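Note that the pd.io.json helpers used above are deprecated or removed in recent pandas releases; the standard-library json module plus pd.json_normalize covers the same ground. A sketch on abridged, hypothetical data, assuming each cell is a JSON string (if read_json already gave you dicts, drop the json.loads step):

```python
import json
import pandas as pd

# A column whose cells are JSON strings (abridged stand-in for the question's data).
df = pd.DataFrame({'test_column': [
    '{"dataType": "life", "city": "CHI", "value": "32"}',
    '{"dataType": "work", "city": "NYC", "value": "7"}',
]})

# Parse each cell into a dict, then spread the keys into columns.
flat = pd.json_normalize(df['test_column'].apply(json.loads).tolist())
print(flat)
```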
