Extract value from JSON data of a pandas column with conditional apply - python

I need to extract a value from a JSON string stored in a pandas column and assign it to a column, applying the extraction only to rows with null values.
import numpy as np
import pandas as pd

# note: the col1 values must be strings; an int literal with a leading zero
# (e.g. 06010) is a SyntaxError in Python 3
df = pd.DataFrame({'col1': ['06010', np.nan, '06020', np.nan],
                   'json_col': [{'Id': '060',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06037'}},
                                {'Id': '061',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06038'}},
                                {'Id': '062',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06039'}},
                                {'Id': '063',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06040'}}],
                   })
# Check for the null condition and extract Code from the JSON column
# (my attempt; it doesn't work)
df['col1'] = df[df['col1'].isnull()].apply(lambda x: [x['json_col'][i]['value']['Code'] for i in x])
Expected result:
col1
06010
06038
06020
06040

To extract a field from a dictionary column, you can use the .str accessor. For instance, to extract json_col -> value -> Code you can use df.json_col.str['value'].str['Code']. Then use fillna to replace NaN in col1:
df.col1.fillna(df.json_col.str['value'].str['Code'])
0 06010
1 06038
2 06020
3 06040
Name: col1, dtype: object
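To write the result back to the column, a minimal end-to-end sketch (assuming the corrected DataFrame above):
# fill only the null rows of col1 with the extracted Code
df['col1'] = df['col1'].fillna(df.json_col.str['value'].str['Code'])
print(df['col1'].tolist())  # ['06010', '06038', '06020', '06040']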

Try this:
>>> df['col1'].fillna(df['json_col'].map(lambda x: x['value']['Code']))
0 06010
1 06038
2 06020
3 06040
Name: col1, dtype: object
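If some rows might be missing the nested keys (an assumption; the sample data always has them), a defensive variant of the same idea uses dict.get:
>>> df['col1'].fillna(df['json_col'].map(lambda x: x.get('value', {}).get('Code') if isinstance(x, dict) else None))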

Related

How to group two items with two dates and get durations in pandas?

I usually work with data that looks like this: {'id': '1', 'start_date': '2012-04-8', 'end_date': '2012-08-06'}, but now I have something very different: a list of records where each pair of records represents one item
data = [
{'id': '1', 'field': 'end_tmie', 'value': '2012-08-06'},
{'id': '1', 'field': 'start_date', 'value': '2012-04-8'},
{'id': '2', 'field': 'end_tmie', 'value': '2012-01-06'},
{'id': '2', 'field': 'start_date', 'value': '2012-03-8'},
]
Goal 1: get the duration end_time - start_time for each pair of data points with the same id in pandas:
df = [
{'id': '1', 'durations': '2012-08-06 - 2012-04-8'},
{'id': '2', 'durations': '2012-01-06 - 2012-03-8'},
]
Goal 2: reshape the data to look like this:
df = [
{'id':'1', 'start':'2012-04-8', 'end':'2012-08-06'},
{'id':'2', 'start':'2012-03-8', 'end':'2012-01-06'},
]
First create the DataFrame with the DataFrame constructor, then use DataFrame.pivot and rename the columns; for the duration, subtract the columns and convert the timedeltas to days with Series.dt.days:
import pandas as pd

df = pd.DataFrame(data)
df['value'] = pd.to_datetime(df['value'])
df = df.pivot(index='id', columns='field', values='value').rename(columns={'start_date': 'start', 'end_tmie': 'end'})
df['durations'] = df['end'].sub(df['start']).dt.days
Finally, for the exports, filter the columns and use DataFrame.to_dict:
d1 = df['durations'].reset_index().to_dict('records')
print (d1)
[{'id': '1', 'durations': 120}, {'id': '2', 'durations': -62}]
d2 = df[['start','end']].apply(lambda x: x.dt.strftime('%Y-%m-%d')).reset_index().to_dict('records')
print (d2)
[{'id': '1', 'start': '2012-04-08', 'end': '2012-08-06'},
{'id': '2', 'start': '2012-03-08', 'end': '2012-01-06'}]
Assuming that there are no multiple values of start_date and end_tmie for each id, pd.pivot_table() should do the job.
>>> import pandas as pd
>>> data = [
... {'id': '1', 'field': 'end_tmie', 'value': '2012-08-06'},
... {'id': '1', 'field': 'start_date', 'value': '2012-04-8'},
... {'id': '2', 'field': 'end_tmie', 'value': '2012-01-06'},
... {'id': '2', 'field': 'start_date', 'value': '2012-03-8'},
... ]
>>> df = pd.DataFrame(data)
>>> df.pivot_table('value', 'id', 'field', lambda x: x).sort_index(ascending=False, axis=1).assign(duration=lambda x: pd.to_datetime(x['end_tmie']) - pd.to_datetime(x['start_date']))
field start_date end_tmie duration
id
1 2012-04-8 2012-08-06 120 days
2 2012-03-8 2012-01-06 -62 days
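The aggfunc lambda x: x works here only because each (id, field) pair occurs once; a more conventional spelling under the same uniqueness assumption is aggfunc='first':
>>> df.pivot_table(values='value', index='id', columns='field', aggfunc='first')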

Pandas to_dict data structure, using column as dictionary index

This is just a very specific data structure transformation that I'm trying to achieve with pandas, so if you know how to do it, please share :)
Imagine I have a dataframe that looks like this
id  value  date
1   1      2021-04-01
1   5      2021-04-02
1   10     2021-04-03
2   3      2021-04-01
2   4      2021-04-02
2   11     2021-04-03
Now I want to transform this into an object, where the keys are the ids, and the values are arrays of information about that id. So it would look like this...
{
'1': [
{ 'value': 1, 'date': '2021-04-01' },
{ 'value': 5, 'date': '2021-04-02' },
{ 'value': 10, 'date': '2021-04-03' }
],
'2': [
{ 'value': 3, 'date': '2021-04-01' },
{ 'value': 4, 'date': '2021-04-02' },
{ 'value': 11, 'date': '2021-04-03' }
],
}
I imagine I have to use .to_dict() somehow, but I can't quite figure out how to do it?
Thoughts?
Edit: I've already figured out a brute-force way of doing it, I'm looking for something more elegant ;)
You can use groupby() on id and then apply() with to_dict() on each group:
df.groupby('id').apply(lambda x: x[['value', 'date']].to_dict(orient='records')).to_dict()
{1: [{'value': 1, 'date': '2021-04-01'}, {'value': 5, 'date': '2021-04-02'}, {'value': 10, 'date': '2021-04-03'}], 2: [{'value': 3, 'date': '2021-04-01'}, {'value': 4, 'date': '2021-04-02'}, {'value': 11, 'date': '2021-04-03'}]}
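Note the keys in this output are integers because id is numeric; if you want string keys as in the desired structure, one option (a small sketch) is to cast the grouping key first:
df.groupby(df['id'].astype(str)).apply(lambda x: x[['value', 'date']].to_dict(orient='records')).to_dict()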
You can use a list comprehension after converting the dataframe to a dict object, as in the sketch below.
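One way that might look (a sketch; the original answer doesn't spell it out):
# group the records by id with a dict comprehension over the unique ids
records = df.to_dict('records')
result = {i: [{'value': r['value'], 'date': r['date']} for r in records if r['id'] == i]
          for i in df['id'].unique()}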
But here's a more pandas way, if your id column is a real column of the dataframe (note this assumes the ids are unique; to_dict() fails on duplicate keys):
df = df.set_index('id').T.to_dict()
If you meant id as the index of the dataframe, just use:
df = df.T.to_dict()

Extract dictionary value from a list with one element in a pandas column

I have a pandas dataframe with a column which is a list containing a single dictionary.
For example:
col1
[{'type': 'yellow', 'id': 2, ...}]
[{'type': 'brown', 'id': 13, ...}]
...
I need to extract the value associated with the 'type' key. There are different ways to do it, but since my dataframe is huge (several million rows) I need an efficient way, and I am not sure which method is best.
Let us try this:
import numpy as np
import pandas as pd

data = {
'col': [[{'type': 'yellow', 'id': 2}], [{'type': 'brown', 'id': 13}], np.nan]
}
df = pd.DataFrame(data)
print(df)
col
0 [{'type': 'yellow', 'id': 2}]
1 [{'type': 'brown', 'id': 13}]
2 NaN
Use explode and the .str accessor:
df['result'] = df.col.explode().str['type']
output:
col result
0 [{'type': 'yellow', 'id': 2}] yellow
1 [{'type': 'brown', 'id': 13}] brown
2 NaN NaN
Accessing any single element in most data structures is an O(1) operation, and a pandas DataFrame is no different. The only issue you will face is looping through the rows; there's probably no way around it.
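For a column of single-element lists, a plain list comprehension over the column is often one of the faster options (a sketch, assuming the df above and that non-null cells are one-element lists):
# fall back to None for NaN / empty cells
df['result'] = [row[0]['type'] if isinstance(row, list) and row else None for row in df['col']]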

Using pandas, how can I group/aggregate summing cases where boolean columns are true?

I have a DataFrame constructed from a database query. Each row in the frame has a database id, date, job, an issue boolean, and a fixed boolean. For example:
import pandas as pd

data = [
{'id': 1, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': False},
{'id': 2, 'date': '2020-02-01', 'job': 'ABC', 'issue': False, 'fixed': False},
{'id': 3, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': True},
{'id': 4, 'date': '2020-02-01', 'job': 'DEF', 'issue': True, 'fixed': True}
]
data_df = pd.DataFrame(data)
I want to do a groupby and agg where I am grouping by job and date, and getting the count of 'issues' and 'fixed' that are True. Something like:
result_data = [
{'date': '2020-02-01', 'job': 'ABC', 'issue': 2, 'fixed': 1},
{'date': '2020-02-01', 'job': 'DEF', 'issue': 1, 'fixed': 1}
]
result_df = pd.DataFrame(result_data)
The code would look something like:
result_df = data_df.groupby(['date', 'job']).agg({'issue': 'sum-true', 'fixed': 'sum-true'})
but I am not sure what 'sum-true' should be. Note, I can't just filter the whole DF by one column being True and then sum, as issue might be True while fixed is False.
How about this?
>>> data_df.groupby(['date', 'job'])[['issue', 'fixed']].sum()
issue fixed
date job
2020-02-01 ABC 2.0 1.0
DEF 1.0 1.0
Simply summing a boolean vector returns the count of True values.
And if you want the data in the exact format you specified above, just reset_index:
>>> data_df.groupby(['date', 'job'])[['issue', 'fixed']].sum().reset_index()
date job issue fixed
0 2020-02-01 ABC 2.0 1.0
1 2020-02-01 DEF 1.0 1.0
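If the sums come back as floats, as in the output above, one option (a small sketch) is to cast the boolean columns to integers before grouping:
>>> data_df.astype({'issue': int, 'fixed': int}).groupby(['date', 'job'])[['issue', 'fixed']].sum().reset_index()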

list of dictionary column in a dataframe

A column in my dataframe is a list of dictionaries. How can I filter the rows that have a specific value for the id key in the tag column? For instance, the rows that contain {"id": 18}.
Since your tag column is list-valued, you could use explode if you are on pandas 0.25+:
import pandas as pd

# toy data
df = pd.DataFrame({'type': ['df', 'fg', 'ff'],
                   'tag': [[{"id": 12}, {"id": 13}],
                           [{"id": 12}],
                           [{'id': 10}]]})
# make each row contain exactly one dict: {id: val}
s = df['tag'].explode()
# the indexes of the rows of interest
idx = s.index[pd.DataFrame(s.to_list())['id'].values == 12]
# output
df.loc[idx]
Output:
type tag
0 df [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
Sample DataFrame:
df = pd.DataFrame({'type': ['dg', 'fg', 'ff'],
                   'tag': [[{"id": 12}, {"id": 13}], [{"id": 12}], [{"id": 29}]]})
print(df)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
2 ff [{'id': 29}]
Then you can use Series.apply to check each cell:
df_filtered = df[df['tag'].apply(lambda x: pd.Series([d['id'] for d in x]).eq(12).any())]
print(df_filtered)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
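Building a pandas Series per cell just to call .eq().any() is relatively heavy; a lighter variant (a sketch) uses plain any():
df_filtered = df[df['tag'].apply(lambda x: any(d['id'] == 12 for d in x))]
This avoids constructing a pandas object for every row, which matters when the frame is large.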
