This is just a very specific data structure transformation that I'm trying to achieve with pandas, so if you know how to do it, please share :)
Imagine I have a dataframe that looks like this:
id  value  date
1   1      2021-04-01
1   5      2021-04-02
1   10     2021-04-03
2   3      2021-04-01
2   4      2021-04-02
2   11     2021-04-03
Now I want to transform this into an object, where the keys are the ids, and the values are arrays of information about that id. So it would look like this...
{
'1': [
{ 'value': 1, 'date': '2021-04-01' },
{ 'value': 5, 'date': '2021-04-02' },
{ 'value': 10, 'date': '2021-04-03' }
],
'2': [
{ 'value': 3, 'date': '2021-04-01' },
{ 'value': 4, 'date': '2021-04-02' },
{ 'value': 11, 'date': '2021-04-03' }
],
}
I imagine I have to use .to_dict() somehow, but I can't quite figure out how to do it.
Thoughts?
Edit: I've already figured out a brute-force way of doing it, I'm looking for something more elegant ;)
You can use groupby() on id and then apply to_dict(orient='records') to each group:
df.groupby('id').apply(lambda x: x[['value', 'date']].to_dict(orient='records')).to_dict()
{1: [{'value': 1, 'date': '2021-04-01'}, {'value': 5, 'date': '2021-04-02'}, {'value': 10, 'date': '2021-04-03'}], 2: [{'value': 3, 'date': '2021-04-01'}, {'value': 4, 'date': '2021-04-02'}, {'value': 11, 'date': '2021-04-03'}]}
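If you'd rather avoid apply (which can be slow, and in recent pandas versions warns about operating on the grouping column), a plain dict comprehension over the groups gives the same result; a minimal sketch assuming the df above:
{k: g[['value', 'date']].to_dict(orient='records') for k, g in df.groupby('id')}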
You can use a list comprehension after converting the dataframe to a dict object.
But here's a more pandas way. If your id column is a real column of the dataframe:
df = df.set_index('id').T.to_dict()
If you meant id as the index of the dataframe, just use:
df = df.T.to_dict()
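Note that to_dict here maps each id to a single dict rather than to a list of dicts, so it assumes the ids are unique; a quick sketch with a hypothetical deduplicated frame:
small = pd.DataFrame({'id': [1, 2], 'value': [1, 3], 'date': ['2021-04-01', '2021-04-01']})
small.set_index('id').T.to_dict()
# {1: {'value': 1, 'date': '2021-04-01'}, 2: {'value': 3, 'date': '2021-04-01'}}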
I need to extract a value from a JSON string stored in a pandas column and assign it to another column, with a conditional apply to the rows with null values only.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['06010', np.nan, '06020', np.nan],
                   'json_col': [{'Id': '060',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06037'}
                                 },
                                {'Id': '061',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06038'}
                                 },
                                {'Id': '062',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06039'}
                                 },
                                {'Id': '063',
                                 'Date': '20210908',
                                 'value': {'Id': '060',
                                           'Code': '06040'}
                                 }],
                   })
# Check for the null condition and extract Code from the json column
df['col1'] = df[df['col1'].isnull()].apply(lambda x: [x['json_col'][i]['value']['Code'] for i in x])
Expected result:
col1
06010
06038
06020
06040
To extract a field from a dictionary column, you can use the .str accessor. For instance, to extract json_col -> value -> Code you can use df.json_col.str['value'].str['Code']. And then use fillna to replace the NaNs in col1:
df.col1.fillna(df.json_col.str['value'].str['Code'])
0 06010
1 06038
2 06020
3 06040
Name: col1, dtype: object
Try with this:
>>> df['col1'].fillna(df['json_col'].map(lambda x: x['value']['Code']))
0 06010
1 06038
2 06020
3 06040
Name: col1, dtype: object
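Note that in this example json_col holds Python dicts, not JSON strings. If the column held actual JSON text, you would need to parse it first; a minimal sketch using the standard json module:
import json

df['col1'] = df['col1'].fillna(df['json_col'].map(lambda s: json.loads(s)['value']['Code']))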
I have two lists of dictionaries:
a =[{ 'id': "1", 'date': "2017-01-24" },{ 'id': "2", 'date': "2018-01-24" },{ 'id': "3", 'date': "2019-01-24" }]
b =[{ 'id': "1", 'name': "abc" },{ 'id': "2",'name': "xyz"},{ 'id': "4",'name': "ijk"}]
I want to merge these dictionaries based on id and the result should be:
[{ 'id': "1", 'date': "2017-01-24",'name': "abc" },{ 'id': "2", 'date': "2018-01-24",'name': "xyz" },{ 'id': "3", 'date': "2019-01-24" },{ 'id': "4",'name': "ijk"}]
How can I do this without iterating in Python?
Since the dicts are stored in lists, you'll either have to iterate or use a vectorized approach such as pandas. For example:
import pandas as pd
a =[{ 'id': "1", 'date': "2017-01-24" },{ 'id': "2", 'date': "2018-01-24" }]
b =[{ 'id': "1", 'name': "abc" },{ 'id': "2",'name': "xyz"}]
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
out = df1.merge(df2, on='id').to_dict('records')
result:
[{'id': '1', 'date': '2017-01-24', 'name': 'abc'}, {'id': '2', 'date': '2018-01-24', 'name': 'xyz'}]
Without testing, I'm not sure how this compares speed-wise to simply iterating. Iterating may take long, but pandas also has to construct the dataframes and convert the output back to dicts, so there's a tradeoff.
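To reproduce the exact result asked for (which keeps the ids that appear in only one of the lists), a minimal sketch using an outer merge and then dropping the NaN entries from each record:
import pandas as pd

merged = pd.DataFrame(a).merge(pd.DataFrame(b), on='id', how='outer')
out = [{k: v for k, v in row.items() if pd.notna(v)} for row in merged.to_dict('records')]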
I have a pandas dataframe with a column whose values are lists, each containing a single dictionary.
For example:
col1
[{'type': 'yellow', 'id': 2, ...}]
[{'type': 'brown', 'id': 13, ...}]
...
I need to extract the value associated with the 'type' key. There are different ways to do it, but since my dataframe is huge (several million rows) I need an efficient one, and I am not sure which method is best.
Let us try this:
import numpy as np
import pandas as pd

data = {
    'col': [[{'type': 'yellow', 'id': 2}], [{'type': 'brown', 'id': 13}], np.nan]
}
df = pd.DataFrame(data)
print(df)
col
0 [{'type': 'yellow', 'id': 2}]
1 [{'type': 'brown', 'id': 13}]
2 NaN
Use explode and str accessor:
df['result'] = df.col.explode().str['type']
output:
col result
0 [{'type': 'yellow', 'id': 2}] yellow
1 [{'type': 'brown', 'id': 13}] brown
2 NaN NaN
Accessing a single element in most data structures is an O(1) operation, and a pandas DataFrame is no different. The real cost here is looping through the rows, and there's probably no way around that.
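For very large frames, a plain list comprehension over the column is often faster than the accessor-based approaches, since it avoids per-row pandas overhead; a minimal sketch, assuming every non-null cell is a one-element list as above:
import numpy as np

df['result'] = [d[0]['type'] if isinstance(d, list) else np.nan for d in df['col']]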
I have a DataFrame constructed from a database query. Each row in the frame has a database id, date, job, an issue boolean, and a fixed boolean. For example:
import pandas as pd

data = [
    {'id': 1, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': False},
    {'id': 2, 'date': '2020-02-01', 'job': 'ABC', 'issue': False, 'fixed': False},
    {'id': 3, 'date': '2020-02-01', 'job': 'ABC', 'issue': True, 'fixed': True},
    {'id': 4, 'date': '2020-02-01', 'job': 'DEF', 'issue': True, 'fixed': True}
]
data_df = pd.DataFrame(data)
I want to do a groupby and agg where I am grouping by job and date, and getting the count of 'issues' and 'fixed' that are True. Something like:
result_data = [
{'date': '2020-02-01', 'job': 'ABC', 'issue': 2, 'fixed': 1},
{'date': '2020-02-01', 'job': 'DEF', 'issue': 1, 'fixed': 1}
]
result_df = pd.DataFrame(result_data)
The code would look something like:
result_df = data_df.groupby(['date', 'job']).agg({'issue': 'sum-true', 'fixed': 'sum-true'})
but I am not sure what 'sum-true' should be. Note, I can't just filter the whole DF by one column being True and then sum, as issue might be True while fixed is False.
How about this?
>>> data_df.groupby(['date', 'job'])[['issue', 'fixed']].sum()
issue fixed
date job
2020-02-01 ABC 2.0 1.0
DEF 1.0 1.0
Simply summing a boolean column counts its True values.
And if you want the data in the exact format you specified above, just reset_index:
>>> data_df.groupby(['date', 'job'])[['issue', 'fixed']].sum().reset_index()
date job issue fixed
0 2020-02-01 ABC 2.0 1.0
1 2020-02-01 DEF 1.0 1.0
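An equivalent using named aggregation, which keeps flat columns and skips the separate reset_index; a minimal sketch:
result_df = data_df.groupby(['date', 'job'], as_index=False).agg(
    issue=('issue', 'sum'),
    fixed=('fixed', 'sum'),
)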
I've got a pandas dataset with a column of comma-separated strings, e.g. 1,2,3,10:
import pandas as pd

data = [
    { 'id': 1, 'score': 9, 'topics': '11,22,30' },
    { 'id': 2, 'score': 7, 'topics': '11,18,30' },
    { 'id': 3, 'score': 6, 'topics': '1,12,30' },
    { 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
I'd like to get a count and a mean score for each value in topics. So:
topic_id,count,mean
1,2,5
11,2,8
12,1,6
et cetera. How can I do this?
I've got as far as:
df['topic_ids'] = df.topics.str.split(',')
But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?
Unnest, then groupby and agg:
import numpy as np
df.topics = df.topics.str.split(',')
New_df = pd.DataFrame({'topics': np.concatenate(df.topics.values),
                       'id': df.id.repeat(df.topics.apply(len)),
                       'score': df.score.repeat(df.topics.apply(len))})
New_df.groupby('topics').score.agg(['count', 'mean'])
Out[1256]:
count mean
topics
1 2 5.0
11 2 8.0
12 1 6.0
18 2 5.5
22 1 9.0
30 4 6.5
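Since pandas 0.25 the same unnesting can be done with DataFrame.explode, which is usually cleaner; a minimal sketch on the original df:
(df.assign(topics=df.topics.str.split(','))
   .explode('topics')
   .groupby('topics')
   .score.agg(['count', 'mean']))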
Another take, computing per-row statistics (the mean topic id and the number of topics in each row):
In [111]: def mean1(x): return np.array(x).astype(int).mean()
In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
mean1 len
0 21.000000 3
1 19.666667 3
2 14.333333 3
3 16.333333 3
This is one way. Reindex & stack, then groupby & agg.
import pandas as pd
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
.stack()\
.reset_index(name='topics')\
.drop(columns='level_2')
df2.groupby('topics').score.agg(['count', 'mean']).reset_index()