pandas - create key value pair from grouped by data frame

pandas - create key value pair from grouped by data frame - python

I have a data frame with three columns, I would like to create a dictionary after applying groupby function on first and second column.I can do this by for loops, but is there any pandas way of doing it?
DataFrame:
Col X Col Y Sum
A a 3
A b 2
A c 1
B p 5
B q 6
B r 7
After grouping by on Col X and Col Y : df.groupby(['Col X','Col Y']).sum()
Sum
Col X Col Y
A a 3
b 2
c 1
B p 5
q 6
r 7
Dictionary I want to create
{A:{'a':3,'b':2,'c':1}, B:{'p':5,'q':6,'r':7}}

Use a dictionary comprehension while iterating via a groupby object
{name: dict(zip(g['Col Y'], g['Sum'])) for name, g in df.groupby('Col X')}
{'A': {'a': 3, 'b': 2, 'c': 1}, 'B': {'p': 5, 'q': 6, 'r': 7}}
If you insisted on using to_dict somewhere, you could do something like this:
s = df.set_index(['Col X', 'Col Y']).Sum
{k: s.xs(k).to_dict() for k in s.index.levels[0]}
{'A': {'a': 3, 'b': 2, 'c': 1}, 'B': {'p': 5, 'q': 6, 'r': 7}}
Keep in mind, that the to_dict method is just using some comprehension under the hood. If you have a special use case that requires something more than what the orient options provide for... there is no shame in constructing your own comprehension.

You can iterate over the MultiIndex series:
>>> s = df.set_index(['ColX', 'ColY'])['Sum']
>>> {k: v.reset_index(level=0, drop=True).to_dict() for k, v in s.groupby(level=0)}
{'A': {'a': 3, 'b': 2, 'c': 1}, 'B': {'p': 5, 'q': 6, 'r': 7}}

#A to_dict() solution
d = df.groupby(['Col X','Col Y']).sum().reset_index().pivot(columns='Col X',values='Sum').to_dict()
Out[70]:
{'A': {0: 3.0, 1: 2.0, 2: 1.0, 3: nan, 4: nan, 5: nan},
'B': {0: nan, 1: nan, 2: nan, 3: 5.0, 4: 6.0, 5: 7.0}}
#if you need to get rid of the nans:
{k1:{k2:v2 for k2,v2 in v1.items() if pd.notnull(v2)} for k1,v1 in d.items()}
Out[73]: {'A': {0: 3.0, 1: 2.0, 2: 1.0}, 'B': {3: 5.0, 4: 6.0, 5: 7.0}}

Related

Updating a nested dictionary whose root keys match the index of a certain dataframe with said dataframe’s values

I have a nested dict that is uniform throughout (i.e. each 2nd level dict will have the same keys).
{
'0': {'a': 1, 'b': 2},
'1': {'a': 3, 'b': 4},
'2': {'a': 5, 'b': 6},
}
and the following data frame
c
0 9
1 6
2 4
Is there a way (without for loops) to update/map the dict/key-values such that I get
{
'0': {'a': 1, 'b': 2, 'c': 9},
'1': {'a': 3, 'b': 4, 'c': 6},
'2': {'a': 5, 'b': 6, 'c': 4},
}

Try this
# input
my_dict = {
'0': {'a': 1, 'b': 2},
'1': {'a': 3, 'b': 4},
'2': {'a': 5, 'b': 6},
}
my_df = pd.DataFrame({'c': [9, 6, 4]})
# build df from my_dict
df1 = pd.DataFrame.from_dict(my_dict, orient='index')
# append my_df as a column to df1
df1['c'] = my_df.values
# get dictionary
df1.to_dict('index')
But a simple loop is much more efficient here. I tested on a sample with 1mil entries and the loop is 2x faster.1
for d, c in zip(my_dict.values(), my_df['c']):
d['c'] = c
my_dict
{'0': {'a': 1, 'b': 2, 'c': 9},
'1': {'a': 3, 'b': 4, 'c': 6},
'2': {'a': 5, 'b': 6, 'c': 4}}
1: Constructing a dataframe is expensive, so unless you want a dataframe (and possibly do other computations later), it's not worth it to construct one for a task such as this one.

Python How to add all values in a 2d Dictionary and return a single dictionary with summed values?

Part of the program I am developing has a 2D dict of length n.
Dictionary Example:
test_dict = {
0: {'A': 2, 'B': 1, 'C': 5},
1: {'A': 3, 'B': 1, 'C': 2},
2: {'A': 1, 'B': 1, 'C': 1},
3: {'A': 4, 'B': 2, 'C': 5}
}
All of the dictionaries have the same keys but different values. I need to sum all the values as to equal below.
I have tried to merge the dictionaries using the following:
new_dict = {}
for k, v in test_dict.items():
new_dict.setdefault(k, []).append(v)
I also tried using:
new_dict = {**test_dict[0], **test_dict[1], **test_dict[2], **test_dict[3]}
Unfortuntly I have not had any luck in getting the desired outcome.
Desired Outcome: outcome = {'A': 10, 'B': 5, 'C': 13}
How can I add all the values into a single dictionary?

Solution using pandas
Convert your dict to pandas.DataFrame and then do summation on columns and convert it back to dict.
import pandas as pd
df = pd.DataFrame.from_dict(test_dict, orient='index')
print(df.sum().to_dict())
Output:
{'A': 10, 'B': 5, 'C': 13}
Alternate solution
Use collections.Counter which allows you to add the values of same keys within dict
from collections import Counter
d = Counter()
for _,v in test_dict.items():
d.update(v)
print(d)

subtracting values of a list in a dictionary using a dataframe

I have a dataframe below, with the products purchased by users.
DataSet :
user age maritalstatus product
A Young married 111
B young married 222
C young Single 111
D old single 222
E old married 111
F teen married 222
G teen married 555
H adult single 444
I adult single 333
dictionary:
{A:[111,222], B:[111,222], C:[111], D:[222], G:[222,555], X:[222,444] }
Expected output:
{A:[222], B:[111], C:[], D:[], G:[222], X:[222,444] }
The dictionary should look into the dataframe and remove the products already purchased by the users.

You can use a dict comprehension:
{k:[e for e in v if e not in df.loc[df.user.eq(k), 'product'].tolist()] for k,v in d.items()}
Out[292]: {'A': [222], 'B': [111], 'C': [], 'D': [], 'G': [222], 'X': [222, 444]}
A slightly more verbose solution for easier understanding:
First to build a user-product dict:
user_prod = df.groupby('user')['product'].apply(list).to_dict()
{'A': [111],
'B': [222],
'C': [111],
'D': [222],
'E': [111],
'F': [222],
'G': [555],
'H': [444],
'I': [333]}
Then, use a dict comprehension to remove elements which are in the user_prod dict.
{k:[e for e in v if e not in user_prod.get(k,[])] for k,v in d.items()}
Out[319]: {'A': [222], 'B': [111], 'C': [], 'D': [], 'G': [222], 'X': [222, 444]}
The use of user_prod.get is necessary because the user may not exist and .get will avoid an exception by returning an empty list.

Here is one intuitive way to implement your logic. You can optimize via sets and comprehensions, but for reasonable size datasets the below method should be adequate.
products = df.groupby('user')['product'].apply(list)
d = {'A':[111,222], 'B':[111,222], 'C':[111], 'D':[222], 'G':[222,555], 'X':[222,444] }
for k, v in d.items():
p = products.get(k)
if p:
for i in p:
d[k].remove(i)
# {'A': [222], 'B': [111], 'C': [], 'D': [], 'G': [222], 'X': [222, 444]}

product user
0 1 10
1 2 11
2 1 12
3 1 13
4 2 14
new_purchase = frame.set_index('user')['product'].to_dict()
{10: 1, 11: 2, 12: 1, 13: 1, 14: 2}
{10: [2, 1], 11: [2], 12: [], 13: [22], 14: [1]}
result = {}
for k, v in prev_purchase.items():
result[k] = [item for item in v if item not in [new_purchase[k]]]
{10: [2], 11: [], 12: [], 13: [22], 14: [1]}

Convert redundant array to dict (or JSON)?

Suppose I have an array:
[['a', 10, 1, 0.1],
['a', 10, 2, 0.2],
['a', 20, 2, 0.3],
['b', 10, 1, 0.4],
['b', 20, 2, 0.5]]
And I want a dict (or JSON):
{
'a': {
10: {1: 0.1, 2: 0.2},
20: {2: 0.3}
}
'b': {
10: {1: 0.4},
20: {2: 0.5}
}
}
Is there any good way or some library for this task?
In this example the array is just 4-column, but my original array is more complicated (7-column).
Currently I implement this naively:
import pandas as pd
df = pd.DataFrame(array)
grouped1 = df.groupby('column1')
for column1 in grouped1.groups:
group1 = grouped1.get_group(column1)
grouped2 = group1.groupby('column2')
for column2 in grouped2.groups:
group2 = grouped2.get_group(column2)
...
And defaultdict way:
d = defaultdict(lambda x: defaultdict(lambda y: defaultdict ... ))
for row in array:
d[row[0]][row[1]][row[2]... = row[-1]
But I think neither is smart.

I would suggest this rather simple solution:
from functools import reduce
data = [['a', 10, 1, 0.1],
['a', 10, 2, 0.2],
['a', 20, 2, 0.3],
['b', 10, 1, 0.4],
['b', 20, 2, 0.5]]
result = dict()
for row in data:
reduce(lambda v, k: v.setdefault(k, {}), row[:-2], result)[row[-2]] = row[-1]
print(result)
{'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
An actual recursive solution would be something like this:
def add_to_group(keys: list, group: dict):
if len(keys) == 2:
group[keys[0]] = keys[1]
else:
add_to_group(keys[1:], group.setdefault(keys[0], dict()))
result = dict()
for row in data:
add_to_group(row, result)
print(result)

Introduction
Here is a recursive solution. The base case is when you have a list of 2-element lists (or tuples), in which case, the dict will do what we want:
>>> dict([(1, 0.1), (2, 0.2)])
{1: 0.1, 2: 0.2}
For other cases, we will remove the first column and recurse down until we get to the base case.
The code:
from itertools import groupby
def rows2dict(rows):
if len(rows[0]) == 2:
# e.g. [(1, 0.1), (2, 0.2)] ==> {1: 0.1, 2: 0.2}
return dict(rows)
else:
dict_object = dict()
for column1, groupped_rows in groupby(rows, lambda x: x[0]):
rows_without_first_column = [x[1:] for x in groupped_rows]
dict_object[column1] = rows2dict(rows_without_first_column)
return dict_object
if __name__ == '__main__':
rows = [['a', 10, 1, 0.1],
['a', 10, 2, 0.2],
['a', 20, 2, 0.3],
['b', 10, 1, 0.4],
['b', 20, 2, 0.5]]
dict_object = rows2dict(rows)
print dict_object
Output
{'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
Notes
We use the itertools.groupby generator to simplify grouping of similar rows based on the first column
For each group of rows, we remove the first column and recurse down
This solution assumes that the rows variable has 2 or more columns. The result is unpreditable for rows which has 0 or 1 column.

Convert a Pandas DataFrame to a dictionary

I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}

The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}

Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
simplest way would be to do:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)

Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set ID columns as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient=index parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'D': 2}, 'p': {'A': 1, 'B': 3, 'D': 2}, 'r': {'A': 4, 'B': 0, 'D': 9}}
3. If you need to have each sample as a list run the following code. Determine the column order
column_order= ["A", "B", "C"] # Determine your preferred order of columns
d = {} # Initialize the new dictionary as an empty dictionary
for k in dictionary:
d[k] = [dictionary[k][column_name] for column_name in column_order]

Try to use Zip
df = pd.read_csv("file")
d= dict([(i,[a,b,c ]) for i, a,b,c in zip(df.ID, df.A,df.B,df.C)])
print d
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}

For my use (node names with xy positions) I found #user4179775's answer to the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
that can then can be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }

Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the Dataframe df you want to use a list to store the values (a.k.a a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}

Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}

df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

With this method, columns of dataframe will be the keys and series of dataframe will be the values.`
data_dict = dict()
for col in dataframe.columns:
data_dict[col] = dataframe[col].values.tolist()

DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
col1 col2
a 1 0.1
b 2 0.2
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See this Documentation for details

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas - create key value pair from grouped by data frame - python

You can iterate over the MultiIndex series: >>> s = df.set_index(['ColX', 'ColY'])['Sum'] >>> {k: v.reset_index(level=0, drop=True).to_dict() for k, v in s.groupby(level=0)} {'A': {'a': 3, 'b': 2, 'c': 1}, 'B': {'p': 5, 'q': 6, 'r': 7}}

Related

Updating a nested dictionary whose root keys match the index of a certain dataframe with said dataframe’s values

Python How to add all values in a 2d Dictionary and return a single dictionary with summed values?

subtracting values of a list in a dictionary using a dataframe

Convert redundant array to dict (or JSON)?

Convert a Pandas DataFrame to a dictionary

Categories

Resources