Keeping additional column when normalizing list of dicts - python

I have a dataframe containing id and list of dicts:
df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
})
and I want to normalize it like this:
id a b
0 100 1 2
0 100 11 22
1 200 3 4
1 200 33 44
This gets most of the way:
pd.concat([
    pd.DataFrame.from_dict(item)
    for item in df.list_of_dicts
])
but is missing the id column.
I'm most interested in readability.

How about something like this:
d = {
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
}
df = pd.DataFrame([pd.Series(x) for ld in d['list_of_dicts'] for x in ld])
# repeat each id once per dict in its list (named ids to avoid shadowing the builtin id)
ids = [[x] * len(l) for l, x in zip(d['list_of_dicts'], d['id'])]
df['id'] = pd.Series([x for l in ids for x in l])
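EDIT - Here's a simpler version:
# pair each id with its own list of dicts; nesting the two loops independently would produce a cross product
t = [[('id', i)] + list(l.items()) for i, ll in zip(d['id'], d['list_of_dicts']) for l in ll]
df = pd.DataFrame([dict(x) for x in t])
The id column comes first on Python 3.7+, where dicts keep insertion order. On older versions, you can change dict to OrderedDict from the collections module.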

This is what I call an incomprehension
pd.DataFrame(
    *list(map(list, zip(
        *[(d, i) for i, l in zip(df.id, df.list_of_dicts) for d in l]
    )))
).rename_axis('id').reset_index()
id a b
0 100 1 2
1 100 11 22
2 200 3 4
3 200 33 44
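Since the question asks for readability, another option is to build one flat record per dict and hand the list to the DataFrame constructor. This is a minimal sketch, assuming the df from the question; the names records and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200],
})

# One flat dict per inner dict, with the id of its parent row placed first.
records = [
    {'id': row_id, **item}
    for row_id, items in zip(df['id'], df['list_of_dicts'])
    for item in items
]
result = pd.DataFrame(records)
print(result)
```

Putting 'id' first in each record relies on dicts keeping insertion order (Python 3.7+).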

Related

Pandas: How to group by column values when column values are dicts?

I am doing an exercise in which the current requirement is to "Find the top 10 major project themes (using column 'mjtheme_namecode')".
My first thought was to do a groupby, then count and sort the groups.
However, the values in this column are lists of dicts, e.g.
[{'code': '1', 'name': 'Economic management'},
{'code': '6', 'name': 'Social protection and risk management'}]
and I can't (apparently) group these, at least not with groupby. I get an error:
TypeError: unhashable type: 'list'
Is there a trick? I'm guessing something along the lines of this question.
(I can group by another column that has string values and matches 1:1 with this column, but the exercise is specific.)
df.head()
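There are two steps to solve your problem (using pandas>=0.25, which added explode):
Flatten the list of dicts
Transform the dicts into columns
Step 1
# reset the index so that the join in step 2 aligns row by row
df = df.explode('mjtheme_namecode').reset_index(drop=True)
Step 2
df = df.join(pd.DataFrame(df['mjtheme_namecode'].values.tolist()))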
Added: if the dicts are nested (multiple levels), you can try using json_normalize:
from pandas.io.json import json_normalize  # available as pd.json_normalize on pandas >= 1.0
df = df.join(json_normalize(df['mjtheme_namecode'].values.tolist()))
The only issue here is that df.explode repeats the values of all other columns for each exploded row (in case that is an issue).
Using sample data:
x = [
    [1, 2, [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]],
    [1, 3, [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]]
]
df = pd.DataFrame(x, columns=['col1', 'col2', 'col3'])
Out[1]:
col1 col2 col3
0 1 2 [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
1 1 3 [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
## Step 1 (reset the index so that the join in step 2 aligns row by row)
df = df.explode('col3').reset_index(drop=True)
Out[2]:
col1 col2 col3
0 1 2 {'a': 1, 'b': 3}
1 1 2 {'a': 2, 'b': 4}
2 1 3 {'a': 5, 'b': 6}
3 1 3 {'a': 7, 'b': 8}
## Step 2
df = df.join(pd.DataFrame(df['col3'].values.tolist()))
Out[3]:
col1 col2 col3 a b
0 1 2 {'a': 1, 'b': 3} 1 3
1 1 2 {'a': 2, 'b': 4} 2 4
2 1 3 {'a': 5, 'b': 6} 5 6
3 1 3 {'a': 7, 'b': 8} 7 8
## Now you can group with the new variables
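To connect this back to the original exercise, the flattened frame can then be grouped by theme name and counted. The following is only a sketch: the sample rows and the project column are hypothetical, since the real data is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the exercise's real dataset.
df = pd.DataFrame({
    'project': ['p1', 'p2', 'p3'],
    'mjtheme_namecode': [
        [{'code': '1', 'name': 'Economic management'},
         {'code': '6', 'name': 'Social protection and risk management'}],
        [{'code': '6', 'name': 'Social protection and risk management'}],
        [{'code': '1', 'name': 'Economic management'},
         {'code': '6', 'name': 'Social protection and risk management'}],
    ],
})

# Step 1: one row per dict; a fresh index keeps the later join aligned.
exploded = df.explode('mjtheme_namecode').reset_index(drop=True)

# Step 2: expand each dict into its own columns ('code' and 'name').
themes = pd.DataFrame(exploded['mjtheme_namecode'].tolist())
flat = exploded.drop(columns='mjtheme_namecode').join(themes)

# Count projects per theme and keep the ten most frequent.
top_themes = flat.groupby('name').size().sort_values(ascending=False).head(10)
print(top_themes)
```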

Splitting dictionary into existing columns

Suppose I have the dataframe pd.DataFrame([{'a':nan, 'b':nan, 'c':{'a':1, 'b':2}}, {'a':4, 'b':7, 'c':nan}, {'a':nan, 'b':nan, 'c':{'a':6, 'b':7}}]). I want to take the values from the keys of the dictionary in column c and write them into columns a and b.
Expected output is:
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
I know how to do this to create new columns, but that is not the task I need for this, since a and b have relevant information needing updates from c. I have not been able to find anything relevant to this task.
Any suggestions for an efficient method would be most welcome.
** EDIT **
The real problem is that I have the following dataframe, which I reduced to the above (in several, no doubt, extraneous steps):
a b c
0 nan nan [{'a':1, 'b':2}, {'a':6, 'b':7}]
1 4 7 nan
and I need to have output, in as few steps as possible, as per
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
Thanks!
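This works, assuming c holds dicts (or their string representations):
import ast
def func(x):
    d = x['c']
    if isinstance(d, str):  # c may hold the dict as a string
        d = ast.literal_eval(d)
    if isinstance(d, dict):  # rows where c is NaN are left unchanged
        x['a'] = d['a']
        x['b'] = d['b']
    return x
df = df.apply(func, axis=1)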
How about this:
for t in d['c'].keys():
    d[t] = d['c'][t]
Here is an example:
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> d
{'a': '', 'b': '', 'c': {'a': 1, 'b': 2}}
>>> d.keys()
dict_keys(['a', 'b', 'c'])
>>> d['c'].keys()
dict_keys(['a', 'b'])
>>> for t in d['c'].keys():
...     d[t] = d['c'][t]
...
>>> d
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>>
We can turn it into a function (returning the updated dict so that the REPL displays it):
>>> def updateDict(data, sourceKey):
...     for targetKey in data[sourceKey].keys():
...         data[targetKey] = data[sourceKey][targetKey]
...     return data
...
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2, 'z':1000}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2, 'z': 1000}, 'z': 1000}
>>>
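For the dataframe in the question's EDIT, where c holds a list of dicts, a pandas-only route is to explode c and then fill a and b from the dicts. This is a rough sketch under the assumption that a and b should only be filled where they are missing; the names exploded and from_c are illustrative, and the row order differs from the expected output in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 4],
    'b': [np.nan, 7],
    'c': [[{'a': 1, 'b': 2}, {'a': 6, 'b': 7}], np.nan],
})

# One row per dict in c; rows where c is NaN are kept as they are.
exploded = df.explode('c').reset_index(drop=True)

# Values taken from the dicts; non-dict cells (NaN) contribute an empty row.
from_c = pd.DataFrame(
    [cell if isinstance(cell, dict) else {} for cell in exploded['c']],
    index=exploded.index,
)

# Fill only the missing values of a and b from the dict values.
for col in ['a', 'b']:
    exploded[col] = exploded[col].fillna(from_c[col])

print(exploded)
```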

python: How to modify dictionaries in a DataFrame?

How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is shown below.
Each row of the DataFrame holds multiple dictionaries in one list.
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4},{'a':5 ,'b':6, 'c':7, 'd':8}]
1 1 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':13 ,'b':14, 'c':15, 'd':16}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':17 ,'b':18, 'c':19, 'd':20}]
If I want to keep only the 'a' and 'c' key/value pairs in options, how can I modify the DataFrame? The expected result would be:
Dataframe df:
id options
0 0 [{'a':1 ,'c':3},{'a':5 ,'c':7}]
1 1 [{'a':9, 'c':11},{'a':13,'c':15}]
2 2 [{'a':9 ,'c':11},{'a':17,'c':19}]
I tried filtering, but I could not assign the value back to the dataframe:
for x in totaldf['options']:
    for y in x:
        y = {'a': y['a'], 'c': y['c']} ...?
Using a nested list comprehension:
df['options'] = [[{'a': y['a'], 'c': y['c']} for y in x] for x in df['options']]
If you wanted to use a for loop, it would be something like:
new_options = []
for x in df['options']:
    row = []
    for y in x:
        row.append({'a': y['a'], 'c': y['c']})
    new_options.append(row)
df['options'] = new_options
# An alternative using apply (it still runs a Python function per row, so it is not truly vectorized).
df.options = df.options.apply(lambda x: [{k: v for k, v in e.items() if k in ['a', 'c']} for e in x])
Out[398]:
id options
0 0 [{'a': 1, 'c': 3}, {'a': 5, 'c': 7}]
1 1 [{'a': 9, 'c': 11}, {'a': 13, 'c': 15}]
2 2 [{'a': 9, 'c': 11}, {'a': 17, 'c': 19}]
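If the set of keys to keep changes often, the filtering can be pulled into a small helper. The sketch below is one way to do it; keep_keys is a hypothetical name, not a pandas function, and the data is copied from the question.

```python
import pandas as pd

def keep_keys(dicts, keys):
    """Return new dicts restricted to the given keys (hypothetical helper)."""
    return [{k: d[k] for k in keys if k in d} for d in dicts]

df = pd.DataFrame({
    'id': [0, 1, 2],
    'options': [
        [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 5, 'b': 6, 'c': 7, 'd': 8}],
        [{'a': 9, 'b': 10, 'c': 11, 'd': 12}, {'a': 13, 'b': 14, 'c': 15, 'd': 16}],
        [{'a': 9, 'b': 10, 'c': 11, 'd': 12}, {'a': 17, 'b': 18, 'c': 19, 'd': 20}],
    ],
})

# Replace each list with a filtered copy; the original dicts are not mutated.
df['options'] = df['options'].map(lambda opts: keep_keys(opts, ['a', 'c']))
print(df)
```

Keys that are missing from a dict are skipped rather than raising a KeyError, which may or may not be what you want for incomplete JSON.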

pandas how to fill NaN/None values based on the other columns?

Given the following, how can I set the NaN/None values of the B column based on the other columns? Should I use apply?
from decimal import Decimal
import pandas as pd
d = [
    {'A': 2, 'B': Decimal('628.00'), 'C': 1, 'D': 'blue'},
    {'A': 1, 'B': None, 'C': 3, 'D': 'orange'},
    {'A': 3, 'B': None, 'C': 1, 'D': 'orange'},
    {'A': 2, 'B': Decimal('575.00'), 'C': 2, 'D': 'blue'},
    {'A': 4, 'B': None, 'C': 1, 'D': 'blue'},
]
df = pd.DataFrame(d)
# Make sure types are correct
df['B'] = df['B'].astype('float')
df['C'] = df['C'].astype('int')
In : df
Out:
A B C D
0 2 628 1 blue
1 1 NaN 3 orange
2 3 NaN 1 orange
3 2 575 2 blue
4 4 NaN 1 blue
In : df.dtypes
Out:
A int64
B float64
C int64
D object
dtype: object
Here is an example of the "rules" to set B when the value is None:
def make_B(c, d):
    """When B is None, the value of B depends on C and D."""
    if d == 'blue':
        return Decimal('1400.89') * 1 * c
    elif d == 'orange':
        return Decimal('2300.57') * 2 * c
    raise ValueError(d)  # a bare raise has no active exception to re-raise here
Here is the way I solved it.
I define make_B as below:
def make_B(x):
    """When B is None, the value of B depends on C and D."""
    if np.isnan(x['B']):
        if x['D'] == 'blue':
            return Decimal('1400.89') * 1 * x['C']
        elif x['D'] == 'orange':
            return Decimal('2300.57') * 2 * x['C']
    else:
        return x['B']
Then I use apply and assign the result back to B:
df['B'] = df.apply(make_B, axis=1)
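When the rule is a per-colour multiplier, as in the question, apply can be avoided by computing a fallback column and passing it to fillna. This is a sketch only: it uses plain floats instead of Decimal, and the rates dict is an illustrative name built from the numbers in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [2, 1, 3, 2, 4],
    'B': [628.0, np.nan, np.nan, 575.0, np.nan],
    'C': [1, 3, 1, 2, 1],
    'D': ['blue', 'orange', 'orange', 'blue', 'blue'],
})

# Multiplier per colour, taken from the make_B rules (floats, not Decimal).
rates = {'blue': 1400.89 * 1, 'orange': 2300.57 * 2}

# Value B would take on every row if it were missing.
fallback = df['D'].map(rates) * df['C']

# Existing B values are kept; only the NaN entries are replaced.
df['B'] = df['B'].fillna(fallback)
print(df)
```

A colour that is not in rates maps to NaN, so B stays missing for that row instead of raising an error.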

Reorganizing the data in a dataframe

I have data in the following format:
data = [
    {'data1': [{'sub_data1': 0}, {'sub_data2': 4}, {'sub_data3': 1}, {'sub_data4': -5}]},
    {'data2': [{'sub_data1': 1}, {'sub_data2': 1}, {'sub_data3': 1}, {'sub_data4': 12}]},
    {'data3': [{'sub_data1': 3}, {'sub_data2': 0}, {'sub_data3': 1}, {'sub_data4': 7}]},
]
How should I reorganize it so that when I save it to HDF with
a = pd.DataFrame(data, columns=['data1', 'data2', 'data3'])
a.to_hdf('my_data.hdf', key='data')  # to_hdf needs a key; 'data' is an arbitrary name
I get a dataframe in the following format:
data1 data2 data3
_________________________________________
sub_data1 0 1 3
sub_data2 4 1 0
sub_data3 1 1 1
sub_data4 -5 12 7
update1: after following the advice given below, saving it to an HDF file, and reading it back, I got this, which is not what I want:
data1 data2 data3
0 {u'sub_data1': 22} {u'sub_data1': 33} {u'sub_data1': 44}
1 {u'sub_data2': 0} {u'sub_data2': 11} {u'sub_data2': 44}
2 {u'sub_data3': 12} {u'sub_data3': 16} {u'sub_data3': 19}
3 {u'sub_data4': 0} {u'sub_data4': 0} {u'sub_data4': 0}
Well, if you convert your data into a dictionary of dictionaries, you can then create the DataFrame very easily (on Python 2, use .iteritems() instead of .items()):
In [25]: data2 = {k: {m: n for i in v for m, n in i.items()} for x in data for k, v in x.items()}
In [26]: data2
Out[26]:
{'data1': {'sub_data1': 0, 'sub_data2': 4, 'sub_data3': 1, 'sub_data4': -5},
'data2': {'sub_data1': 1, 'sub_data2': 1, 'sub_data3': 1, 'sub_data4': 12},
'data3': {'sub_data1': 3, 'sub_data2': 0, 'sub_data3': 1, 'sub_data4': 7}}
In [27]: pd.DataFrame(data2)
Out[27]:
data1 data2 data3
sub_data1 0 1 3
sub_data2 4 1 0
sub_data3 1 1 1
sub_data4 -5 12 7
