python: How to modify dictionaries in a DataFrame?

How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is shown below.
Each cell in the options column holds a list of several dictionaries.
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4},{'a':5 ,'b':6, 'c':7, 'd':8}]
1 1 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':13 ,'b':14, 'c':15, 'd':16}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':17 ,'b':18, 'c':19, 'd':20}]
If I want to keep only the 'a' and 'c' keys and values in options, how can I modify the DataFrame? The expected result would be
Dataframe df:
id options
0 0 [{'a':1 ,'c':3},{'a':5 ,'c':7}]
1 1 [{'a':9, 'c':11},{'a':13,'c':15}]
2 2 [{'a':9 ,'c':11},{'a':17,'c':19}]
I tried filtering, but I could not assign the result back to the DataFrame:
for x in totaldf['options']:
    for y in x:
        y = {'a': y['a'], 'c': y['c']} ...?

Using a nested list comprehension:
df['options'] = [[{'a': y['a'], 'c': y['c']} for y in x] for x in df['options']]
If you wanted to use a for loop, it would be something like:
new_options = []
for x in df['options']:
    row = []
    for y in x:
        row.append({'a': y['a'], 'c': y['c']})
    new_options.append(row)
df['options'] = new_options

# An alternative using apply (a per-row Python call rather than a truly vectorized operation).
df.options = df.options.apply(lambda x: [{k: v for k, v in e.items() if k in ['a', 'c']} for e in x])
Out[398]:
id options
0 0 [{'a': 1, 'c': 3}, {'a': 5, 'c': 7}]
1 1 [{'a': 9, 'c': 11}, {'a': 13, 'c': 15}]
2 2 [{'a': 9, 'c': 11}, {'a': 17, 'c': 19}]
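If some dictionaries might not contain every key you want to keep, the filtering step can be wrapped in a small helper. The following is a minimal sketch rather than part of the answers above: keep_keys is a hypothetical helper name, and the sample frame is made up so that one dictionary lacks 'c'.

```python
import pandas as pd


def keep_keys(dicts, keys=("a", "c")):
    """Return new dicts restricted to `keys`, skipping keys that are absent."""
    return [{k: d[k] for k in keys if k in d} for d in dicts]


# Hypothetical sample data; the last dictionary has no 'c' key.
df = pd.DataFrame({
    "id": [0, 1],
    "options": [
        [{"a": 1, "b": 2, "c": 3, "d": 4}, {"a": 5, "b": 6, "c": 7, "d": 8}],
        [{"a": 9, "b": 10, "c": 11, "d": 12}, {"a": 13, "b": 14}],
    ],
})

# Series.map calls keep_keys once per cell and builds a new column.
df["options"] = df["options"].map(keep_keys)
print(df)
```

Because the helper builds new dictionaries, the original objects are left untouched, which avoids the problem in the question where rebinding y inside the loop had no effect on the DataFrame.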

Related

Pandas: How to group by column values when column values are dicts?

I am doing an exercise in which the current requirement is to "Find the top 10 major project themes (using column 'mjtheme_namecode')".
My first thought was to do a groupby, then count and sort the groups.
However, the values in this column are lists of dicts, e.g.
[{'code': '1', 'name': 'Economic management'},
{'code': '6', 'name': 'Social protection and risk management'}]
and I can't (apparently) group these, at least not with groupby. I get an error:
TypeError: unhashable type: 'list'
Is there a trick? I'm guessing something along the lines of this question.
(I can group by another column that has string values and matches 1:1 with this column, but the exercise is specific.)
df.head()
Using pandas 0.25 or later (where DataFrame.explode was added), there are two steps to solve your problem:
1. Flatten the list of dicts.
2. Transform the dicts into columns.
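Step 1
# explode repeats index labels, so reset the index to keep the join in step 2 aligned row by row
df = df.explode('mjtheme_namecode').reset_index(drop=True)
Step 2
df = df.join(pd.DataFrame(df['mjtheme_namecode'].values.tolist()))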
Added: if the dicts are nested, you can try using json_normalize (in pandas 1.0 and later it is available as pd.json_normalize):
from pandas.io.json import json_normalize
df = df.join(json_normalize(df['mjtheme_namecode'].values.tolist()))
The only issue here is that DataFrame.explode will duplicate the values of all other columns (in case that is an issue).
Using sample data:
x = [
[1,2,[{'a':1, 'b':3},{'a':2, 'b':4}]],
[1,3,[{'a':5, 'b':6},{'a':7, 'b':8}]]
]
df = pd.DataFrame(x, columns=['col1','col2','col3'])
Out[1]:
col1 col2 col3
0 1 2 [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
1 1 3 [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
## Step 1 (reset the index: explode repeats index labels, which would misalign the join in step 2)
df = df.explode('col3').reset_index(drop=True)
Out[2]:
col1 col2 col3
0 1 2 {'a': 1, 'b': 3}
1 1 2 {'a': 2, 'b': 4}
2 1 3 {'a': 5, 'b': 6}
3 1 3 {'a': 7, 'b': 8}
## Step 2
df = df.join(pd.DataFrame(df['col3'].values.tolist()))
Out[3]:
col1 col2 col3 a b
0 1 2 {'a': 1, 'b': 3} 1 3
1 1 2 {'a': 2, 'b': 4} 2 4
2 1 3 {'a': 5, 'b': 6} 5 6
3 1 3 {'a': 7, 'b': 8} 7 8
## Now you can group with the new variables
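To finish the original exercise, the flattened frame can be grouped by theme name and counted. This is a sketch under assumptions: the projects frame and its project column are made up for illustration, only the two theme entries quoted in the question are used, and pandas 0.25 or later is assumed for explode.

```python
import pandas as pd

# Hypothetical sample data built from the two theme entries shown in the question.
projects = pd.DataFrame({
    "project": ["P1", "P2", "P3"],
    "mjtheme_namecode": [
        [{"code": "1", "name": "Economic management"},
         {"code": "6", "name": "Social protection and risk management"}],
        [{"code": "6", "name": "Social protection and risk management"}],
        [{"code": "1", "name": "Economic management"},
         {"code": "6", "name": "Social protection and risk management"}],
    ],
})

# Step 1: one row per dict, with a fresh index so the join below lines up.
themes = projects.explode("mjtheme_namecode").reset_index(drop=True)

# Step 2: turn the dicts into 'code' and 'name' columns.
themes = themes.join(pd.DataFrame(themes["mjtheme_namecode"].tolist()))

# Group by the now hashable 'name' column, count, and keep the ten largest groups.
top = themes.groupby("name").size().sort_values(ascending=False).head(10)
print(top)
```

Grouping on the extracted name column avoids the "unhashable type" error, because the group keys are now plain strings instead of lists.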

Splitting dictionary into existing columns

Suppose I have the dataframe pd.DataFrame([{'a': nan, 'b': nan, 'c': {'a': 1, 'b': 2}}, {'a': 4, 'b': 7, 'c': nan}, {'a': nan, 'b': nan, 'c': {'a': 6, 'b': 7}}]). I want to take the values from the keys of the dictionary in column c and write them into columns a and b.
Expected output is:
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
I know how to do this to create new columns, but that is not the task I need for this, since a and b have relevant information needing updates from c. I have not been able to find anything relevant to this task.
Any suggestions for an efficient method would be most welcome.
** EDIT **
The real problem is that I have the following dataframe, which I reduced to the above (in several, no doubt, extraneous steps):
a b c
0 nan nan [{'a':1, 'b':2}, {'a':6, 'b':7}]
1 4 7 nan
and I need to have output, in as few steps as possible, as per
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
Thanks!
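This works when column c holds dictionaries (rows where c is NaN are left unchanged):
def func(x):
    # if c holds strings such as "{'a': 1, 'b': 2}", parse them first with ast.literal_eval
    d = x['c']
    if isinstance(d, dict):  # skip rows where c is NaN
        x['a'] = d['a']
        x['b'] = d['b']
    return x
df = df.apply(func, axis=1)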
How about this, for a plain dictionary d:
for t in d['c'].keys():
    d[t] = d['c'][t]
Here is an example:
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> d
{'a': '', 'b': '', 'c': {'a': 1, 'b': 2}}
>>> d.keys()
dict_keys(['a', 'b', 'c'])
>>> d['c'].keys()
dict_keys(['a', 'b'])
>>> for t in d['c'].keys():
...     d[t] = d['c'][t]
...
>>> d
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>>
We can turn it into a function:
>>> def updateDict(d, sourceKey):
...     for targetKey in d[sourceKey].keys():
...         d[targetKey] = d[sourceKey][targetKey]
...     return d
...
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2, 'z':1000}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2, 'z': 1000}, 'z': 1000}
>>>
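For the DataFrame in the question's EDIT, a pandas-only route is also possible. The following is a sketch, assuming pandas 0.25 or later for explode; the names out, records and from_c are introduced here for illustration. Note that the rows come out grouped by their original row, so the order differs from the expected output in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 4],
    'b': [np.nan, 7],
    'c': [[{'a': 1, 'b': 2}, {'a': 6, 'b': 7}], np.nan],
})

# One row per dictionary; rows whose 'c' is NaN pass through unchanged.
out = df.explode('c').reset_index(drop=True)

# Collect the values stored in 'c' (an empty dict stands in where 'c' is NaN).
records = [d if isinstance(d, dict) else {} for d in out['c']]
from_c = pd.DataFrame(records, index=out.index, columns=['a', 'b'])

# Fill the gaps in the existing columns, keeping values that are already present.
for col in ['a', 'b']:
    out[col] = out[col].combine_first(from_c[col])

print(out)
```

combine_first only fills missing entries, so the existing values 4 and 7 survive; if the values in c should always win instead, swap the two operands.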

Flatten lists of dicts on specified keys

Goal: to read data from SQL table where a column contains JSON (arrays), extract certain keys/values from the JSON into new columns to then write to a new table. One of the joys of the original data format is that some data records are JSON arrays and some are not arrays (just JSON). Thus we may start with:
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
for x in testcase:
    print(x)
(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}])
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200, 'd': 300})
Note the first element of each tuple is the record id. The first record is an array of length two, the second and third records are not arrays. The desired output is (as a DataFrame):
a b data
1 1 2 '{"c": 3}'
1 11 12 '{"c": 13}'
2 30 40 '{}'
3 100 200 '{"d": 300}'
Here you can see I've extracted keys 'a' and 'b' from the dicts into new columns, leaving the remaining keys/values in situ. The empty dict for id=2 is desirable behaviour.
First, I extracted the id and the data into separate lists. I take this opportunity to make the dict into a list of dicts (of length 1) so the types are now consistent:
id = [x[0] for x in testcase]
data_col = [x[1] if type(x[1]) == list else [x[1]] for x in testcase]
for x in data_col:
    print(x)
[{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]
[{'a': 30, 'b': 40}]
[{'a': 100, 'b': 200, 'd': 300}]
It feels a bit of a clunky extra step to have to extract id and data_col as separate lists, although at least we have the nice property that we're not copying data:
id[0] is testcase[0][0]
True
data_col[0] is testcase[0][1]
True
And, as I say, I had to deal with the issue that some records contained arrays of dicts and some just dicts, so this makes them all consistent.
The main nitty gritty happens here, where I perform a dict comprehension in a double list comprehension to iterate over each dict:
popped = [(id, {key: element.pop(key, None) for key in ['a', 'b']})
          for id, row in zip(id, data_col) for element in row]
for x in popped:
    print(x)
(1, {'a': 1, 'b': 2})
(1, {'a': 11, 'b': 12})
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200})
I need to be able to relate each new row with its original id, and the above achieves that, correctly reproducing the appropriate id value (1, 1, 2, 3). With a bit of housekeeping, I can then get all my target rows lined up:
import pandas as pd
from psycopg2.extras import Json
id2 = [x[0] for x in popped]
cols = [x[1] for x in popped]
data = [Json(item) for sublist in data_col for item in sublist]
popped_df = pd.DataFrame(cols, index=id2)
popped_df['data'] = data
And this gives me the desired DataFrame as shown above. But ... is all my messing about with lists necessary? I couldn't do a simple json_normalize because I don't want to extract all keys and it falls over with the combination of arrays and non-arrays.
It also needs to be as performant as possible as it's going to be processing multi-millions of rows. For this reason, I actually convert the DataFrame to a list using:
list(popped_df.itertuples())
to then pass to psycopg2.extras' execute_values()
so I may yet not bother constructing the DataFrame and just build the output list, but in this post I'm really asking if there's a cleaner, faster way to extract these specific keys from the dicts into new columns and rows, robust to whether the record is an array or not and keeping track of the associated record id.
I shied away from an end-to-end pandas approach, reading the data using pd.read_sql() as I was reading that DataFrame.to_sql() was relatively slow.
You could do something like this:
import pandas as pd
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
def split_dict(d, keys=('a', 'b')):
    """Split the dictionary by keys"""
    preserved = {key: value for key, value in d.items() if key in keys}
    complement = {key: value for key, value in d.items() if key not in keys}
    return preserved, complement
def get_row(val):
    preserved, complement = split_dict(val)
    preserved['data'] = complement
    return preserved
rows = []
index = []
for i, values in testcase:
    if isinstance(values, list):
        for value in values:
            rows.append(get_row(value))
            index.append(i)
    else:
        rows.append(get_row(values))
        index.append(i)
df = pd.DataFrame.from_records(rows, index=index)
print(df)
Output
a b data
1 1 2 {'c': 3}
1 11 12 {'c': 13}
2 30 40 {}
3 100 200 {'d': 300}
Your data is messy, since the second element of your testcase can be either a list or a dict. In this case, you can construct a list via a for loop, then feed to the pd.DataFrame constructor:
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
L = []
for idx, data in testcase:
    for d in ([data] if isinstance(data, dict) else data):
        # string conversion not strictly necessary below
        others = str({k: v for k, v in d.items() if k not in ('a', 'b')})
        L.append((idx, d['a'], d['b'], others))
df = pd.DataFrame(L, columns=['index', 'a', 'b', 'data']).set_index('index')
print(df)
a b data
index
1 1 2 {'c': 3}
1 11 12 {'c': 13}
2 30 40 {}
3 100 200 {'d': 300}
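Since the question mentions possibly skipping the DataFrame and building the output list directly, here is a sketch of that idea using a generator. It is an illustration under assumptions, not a benchmarked solution: iter_rows is a hypothetical name, json.dumps from the standard library stands in for psycopg2's Json wrapper, and missing keys become None.

```python
import json


def iter_rows(records, keys=("a", "b")):
    """Yield (id, a, b, remaining-json) tuples, one per dict in each record."""
    for record_id, payload in records:
        # Normalise a bare dict to a one-element list so both shapes are handled alike.
        elements = payload if isinstance(payload, list) else [payload]
        for element in elements:
            extracted = tuple(element.get(key) for key in keys)
            rest = {k: v for k, v in element.items() if k not in keys}
            yield (record_id, *extracted, json.dumps(rest))


testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]

rows = list(iter_rows(testcase))
for row in rows:
    print(row)
```

Unlike the pop-based version in the question, this does not mutate the input dictionaries, and the resulting list of tuples has the shape that a bulk insert helper would typically expect.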

Keeping additional column when normalizing list of dicts

I have a dataframe containing id and list of dicts:
df = pd.DataFrame({
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
})
and I want to normalize it like this:
id a b
0 100 1 2
0 100 11 22
1 200 3 4
1 200 33 44
This gets most of the way:
pd.concat([
pd.DataFrame.from_dict(item)
for item in df.list_of_dicts
])
but is missing the id column.
I'm most interested in readability.
How about something like this:
d = {
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
}
df = pd.DataFrame([pd.Series(x) for ld in d['list_of_dicts'] for x in ld])
id = [[x]*len(l) for l,x in zip(d['list_of_dicts'],d['id'])]
df['id'] = pd.Series([x for l in id for x in l])
EDIT - Here's a simpler version (each id is paired with its own list via zip):
t = [[('id', i)]+list(l.items()) for i, ll in zip(d['id'], d['list_of_dicts']) for l in ll]
df = pd.DataFrame([dict(x) for x in t])
On Python versions before 3.7, where dict does not guarantee insertion order, you can change dict to OrderedDict from the collections module if you really want the id column first.
This is what I call an incomprehension
pd.DataFrame(
*list(map(list, zip(
*[(d, i) for i, l in zip(df.id, df.list_of_dicts) for d in l]
)))
).rename_axis('id').reset_index()
id a b
0 100 1 2
1 100 11 22
2 200 3 4
3 200 33 44
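Since the question asks for readability, another option is to explode the list column and then expand the dictionaries. This is a sketch that assumes pandas 0.25 or later (for explode); the names exploded and result are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
})

# One row per dict; the id is repeated for every dict in its list.
exploded = df.explode('list_of_dicts').reset_index(drop=True)

# Expand the dicts into columns and put them next to the id column.
expanded = pd.DataFrame(exploded['list_of_dicts'].tolist())
result = pd.concat([exploded[['id']], expanded], axis=1)
print(result)
```

Resetting the index before concatenating matters here: explode repeats the original index labels, and concat along axis=1 aligns on the index.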

Convert a Pandas DataFrame to a dictionary

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary, with the elements of the first column as keys and the elements of the other columns in the same row as values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
simplest way would be to do:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
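Note that dict(df.values) relies on the frame having exactly two columns. A sketch of two alternatives that name the key and value columns explicitly, using the same sample frame (by_zip and by_series are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})

# Pair the two columns explicitly.
by_zip = dict(zip(df['a'], df['b']))

# Or index by 'a' and let Series.to_dict build the mapping.
by_series = df.set_index('a')['b'].to_dict()

print(by_zip)
print(by_series)
```

Both forms keep working if more columns are added to the frame later, because the columns involved are selected by name.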
Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set ID columns as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient=index parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'C': 2}, 'p': {'A': 1, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each row as a list, choose the column order and run the following code:
column_order = ["A", "B", "C"]  # Determine your preferred order of columns
d = {}  # Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found @user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
which can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where an ID can appear multiple times in the DataFrame. If IDs can be duplicated in df, use a list to store the values for each ID (that is, a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
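To make the shape of that result concrete, here is a small sketch with a duplicated ID. The sample frame is made up for illustration, and per_row is an assumed variation (one inner list per original row) rather than part of the answer above.

```python
import pandas as pd

# Hypothetical sample data in which the ID 'p' appears twice.
df = pd.DataFrame([['p', 1, 3, 2], ['q', 4, 3, 2], ['p', 4, 0, 9]],
                  columns=['ID', 'A', 'B', 'C'])

# The expression above: one inner list per column (A, B, C) for each ID.
per_column = {k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()]
              for k, g in df.groupby('ID')}

# Variation: one inner list per original row for each ID.
per_row = {k: g[['A', 'B', 'C']].values.tolist() for k, g in df.groupby('ID')}

print(per_column)
print(per_row)
```

Which layout is preferable depends on whether later code needs all values of one column together or each original row kept intact.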
A dictionary comprehension with the iterrows() method can also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the DataFrame's column names become the keys and each column's values (as a list) become the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
col1 col2
a 1 0.50
b 2 0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the pandas documentation for DataFrame.to_dict for details.
