Related
I would like to parallelize following function, where A and B are just some columns from my input data frame. I would like to provide the output dictionary as an input so it is filled within the function (pass by reference)
outdict = {}
inputdict1 = {'id1': 100, 'id2': 200, 'id3': 0}
inputdict2 = {'id1': ['cat'], 'id2': ['dog', 'rabbit'], 'id3': []}
inputdf = pd.DataFrame({'A': ['cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'rabbit', 'rabbit'],
'B': ['a', 'b', 'b', 'c', 'c', 'd', 'e', 'f']})
def processing(outdict, inputdict1, inputdict2, inputdf):
for key, _ in tqdm(inputdict1.items()):
outdict[key] = inputdf[inputdf.A.isin(inputdict2[key])].B.nunique()
processing(outdict, inputdict1, inputdict2, inputdf)
print(outdict)
{'id1': 3, 'id2': 4, 'id3': 0}
After some research I have tried the following approach
from multiprocessing import Pool
def processing(outdict, inputdict1, inputdict2, inputdf):
for key, value in tqdm(inputdict1.items()):
outdict[key] = inputdf[inputdf.A.isin(inputdict2[key])].B.nunique()
outdict = {}
pool = Pool()
pool.starmap(processing, zip(outdict, inputdict1, inputdict2, inputdf))
print(outdict)
{}
I have a list containing many dictionaries with same keys but different values.
What I would like to do is to group/merge dictionaries based on the values of some of the keys.
It's probably faster to show an example rather than trying to explain:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
So, what I want to achieve is merging the first and third dictionary, since they have the same "zone", "hour" and "weekday", summing the values in C1 and C2:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 10, 'C2': 27},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
Any help here? :) I've been struggling with this for a couple of days, I've got a bad unscalable solution, but I'm sure there is something far more pythonic that I could put in place.
Thanks!
Sort then group by the relevant keys; iterate over the groups and create new dictionaries with summed values.
import operator
import itertools
keys = operator.itemgetter('zone','weekday','hour')
c1_c2 = operator.itemgetter('C1','C2')
# data is your list of dicts
data.sort(key=keys)
grouped = itertools.groupby(data,keys)
new_data = []
for (zone,weekday,hour),g in grouped:
c1,c2 = 0,0
for d in g:
c1 += d['C1']
c2 += d['C2']
new_data.append({'zone':zone,'weekday':weekday,
'hour':hour,'C1':c1,'C2':c2})
That last loop could also be written as:
for (zone,weekday,hour),g in grouped:
cees = map(c1_c2,g)
c1,c2 = map(sum,zip(*cees))
new_data.append({'zone':zone,'weekday':weekday,
'hour':hour,'C1':c1,'C2':c2})
By using a defaultdict you can merge them in linear time.
from collections import defaultdict
res = defaultdict(lambda : defaultdict(int))
for d in dictionaries:
res[(d['zone'],d['weekday'],d['hour'])]['C1']+= d['C1']
res[(d['zone'],d['weekday'],d['hour'])]['C2']+= d['C2']
The drawback is that you need another pass to have the output as you've defined it.
I've gone ahead and written a slightly longer solution, making use of nametuples as keys of the dictionary:
from collections import namedtuple
zones = [{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
results = dict()
for zone in zones:
zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
if zone_time in results:
results[zone_time]['C1'] += zone['C1']
results[zone_time]['C2'] += zone['C2']
else:
results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
print(results)
This uses a namedtuple of (zone, weekday, hour) as the key to each dictionary. Then it's fairly trivial to either add to it if it already exists within results, or create a new entry in the dictionary.
You can definitely make this shorter and "smarter", but it may become less readable.
Edit: Run Time Comparison
My original answer (see below) was not a good one, but I think I had a useful contribution by doing a little bit of run time analysis on the other answers so I've edited that portion and put it at the top. Here I include the three other solutions, along with the required transformations to produce the desired output. For completeness I also include a version using pandas, which assumes that the user is working with a DataFrame (transforming from list of dicts to data frame and back was not even close to worth it). Comparison times vary a little depending on the random data generated, but these are fairly representative:
>>> run_timer(100)
Times with 100 values
...with defaultdict: 0.1496697600000516
...with namedtuple: 0.14976404899994122
...with groupby: 0.0690777249999428
...with pandas: 3.3165711250001095
>>> run_timer(1000)
Times with 1000 values
...with defaultdict: 1.267153091999944
...with namedtuple: 0.9605341750000207
...with groupby: 0.6634409229998255
...with pandas: 3.5146895360001054
>>> run_timer(10000)
Times with 10000 values
...with defaultdict: 9.194478484000001
...with namedtuple: 9.157486462000179
...with groupby: 5.18553969300001
...with pandas: 4.704001281000046
>>> run_timer(100000)
Times with 100000 values
...with defaultdict: 59.644778522000024
...with namedtuple: 89.26688319799996
...with groupby: 93.3517027989999
...with pandas: 14.495209061999958
Take aways:
working with pandas data frames pays off big time for large datasets
NOTE: I do not include conversion between list of dicts and data frame, which is definitely significant
otherwise the accepted solution (by wwii) wins for small to medium datasets, but for very large ones it may be the slowest
changing the sizes of the groups (e.g., by decreasing the number of zones) has a huge effect which is not examined here
Here is the script I used to generate the above.
import random
import pandas
from timeit import timeit
from functools import partial
from itertools import groupby
from operator import itemgetter
from collections import namedtuple, defaultdict
def with_pandas(df):
return df.groupby(['zone', 'weekday', 'hour']).agg(sum).reset_index()
def with_groupby(data):
keys = itemgetter('zone', 'weekday', 'hour')
# data is your list of dicts
data.sort(key=keys)
grouped = groupby(data, keys)
new_data = []
for (zone, weekday, hour), g in grouped:
c1, c2 = 0, 0
for d in g:
c1 += d['C1']
c2 += d['C2']
new_data.append({'zone': zone, 'weekday': weekday,
'hour': hour, 'C1': c1, 'C2': c2})
return new_data
def with_namedtuple(zones):
ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
results = dict()
for zone in zones:
zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
if zone_time in results:
results[zone_time]['C1'] += zone['C1']
results[zone_time]['C2'] += zone['C2']
else:
results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
return [
{
'zone': key[0],
'weekday': key[1],
'hour': key[2],
**val
}
for key, val in results.items()
]
def with_defaultdict(dictionaries):
res = defaultdict(lambda: defaultdict(int))
for d in dictionaries:
res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
return [
{
'zone': key[0],
'weekday': key[1],
'hour': key[2],
**val
}
for key, val in res.items()
]
def gen_random_vals(num):
return [
{
'zone': random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
'weekday': random.randint(1, 7),
'hour': random.randint(0, 23),
'C1': random.randint(1, 50),
'C2': random.randint(1, 50),
}
for idx in range(num)
]
def run_timer(num_vals=1000, timeit_num=1000):
vals = gen_random_vals(num_vals)
df = pandas.DataFrame(vals)
p_fmt = "\t...with %s: %s"
times = {
'defaultdict': timeit(stmt=partial(with_defaultdict, vals), number=timeit_num),
'namedtuple': timeit(stmt=partial(with_namedtuple, vals), number=timeit_num),
'groupby': timeit(stmt=partial(with_groupby, vals), number=timeit_num),
'pandas': timeit(stmt=partial(with_pandas, df), number=timeit_num),
}
print("Times with %d values" % num_vals)
for key, val in times.items():
print(p_fmt % (key, val))
where
with_groupby uses the solution by wwii
with_namedtuple uses the solution by Jose Salvatierra
with_defaultdict uses the solution by abc
with_pandas uses the solution proposed by Alexander Cécile in comments
assumes data is already in a DataFrame and produces a DataFrame as result
Original answer:
Just for fun, here's a completely different approach using groupby. Granted, it's not the prettiest, but it should be fairly quick.
from itertools import groupby
from operator import itemgetter
from pprint import pprint
vals = [
{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}
]
ordered = sorted(
[
(
(row['zone'], row['weekday'], row['hour']),
row['C1'], row['C2']
)
for row in vals
]
)
def invert_columns(grp):
return zip(*[g_row[1:] for g_row in grp])
merged = [
{
'zone': key[0],
'weekday': key[1],
'hour': key[2],
**dict(
zip(["C1", "C2"], [sum(col) for col in invert_columns(grp)])
)
}
for key, grp in groupby(ordered, itemgetter(0))
]
pprint(merged)
which yields
[{'C1': 10, 'C2': 27, 'hour': 12, 'weekday': 1, 'zone': 'A'},
{'C1': 5, 'C2': 27, 'hour': 6, 'weekday': 2, 'zone': 'B'},
{'C1': 2, 'C2': 13, 'hour': 8, 'weekday': 5, 'zone': 'C'}]
I have a list of lists of data:
[[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379], ...]
and a list of keys:
['a', 'b', 'c', 'd', 'e']
I want to combine them to a dictionary of lists so it looks like:
['a': [1422029700000, 1422029800000], 'b': [230.84, 231.84], ...]
I can do this using loops but I am looking for a pythonic way.
It is quite simple:
In [1]: keys = ['a','b','c']
In [2]: values = [[1,2,3],[4,5,6],[7,8,9]]
In [7]: dict(zip(keys, zip(*values)))
Out[7]: {'a': (1, 4, 7), 'b': (2, 5, 8), 'c': (3, 6, 9)}
If you need lists as values:
In [8]: dict(zip(keys, [list(t) for t in zip(*values)]))
Out[8]: {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
or:
In [9]: dict(zip(keys, map(list, zip(*values))))
Out[9]: {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
Use:
{k: [d[i] for d in data] for i, k in enumerate(keys)}
Example:
>>> data=[[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379]]
>>> keys = ["a", "b", "c"]
>>> {k: [d[i] for d in data] for i, k in enumerate(keys)}
{'c': [230.42, 231.42], 'a': [1422029700000, 1422029800000], 'b': [230.84, 231.84]}
Your question has everything in a list so if you want a list of dicts:
l1= [[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379]]
l2 = ['a', 'b', 'c', 'd', 'e',"f"] # added f to match length of sublists
print([{a:list(b)} for a,b in zip(l2,zip(*l1))])
[{'a': [1422029700000, 1422029800000]}, {'b': [230.84, 231.84]}, {'c': [230.42, 231.42]}, {'d': [230.31, 231.31]}, {'e': [230.32, 231.32]}, {'f': [378, 379]}]
If you actually want a dict use a dict comprehension with zip:
print({a:list(b) for a,b in zip(l2,zip(*l1))})
{'f': [378, 379], 'e': [230.32, 231.32], 'a': [1422029700000, 1422029800000], 'b': [230.84, 231.84], 'c': [230.42, 231.42], 'd': [230.31, 231.31]}
You example also has a list of keys shorter than the length of your sublists so zipping will actually mean you lose values from your sublists so you may want to address that.
If you are using python2 you can use itertools.izip:
from itertools import izip
print({a:list(b) for a,b in izip(l2,zip(*l1))
I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
simplest way would be to do:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set ID columns as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient=index parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'D': 2}, 'p': {'A': 1, 'B': 3, 'D': 2}, 'r': {'A': 4, 'B': 0, 'D': 9}}
3. If you need to have each sample as a list run the following code. Determine the column order
column_order= ["A", "B", "C"] # Determine your preferred order of columns
d = {} # Initialize the new dictionary as an empty dictionary
for k in dictionary:
d[k] = [dictionary[k][column_name] for column_name in column_order]
Try to use Zip
df = pd.read_csv("file")
d= dict([(i,[a,b,c ]) for i, a,b,c in zip(df.ID, df.A,df.B,df.C)])
print d
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found #user4179775's answer to the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
that can then can be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the Dataframe df you want to use a list to store the values (a.k.a a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, columns of dataframe will be the keys and series of dataframe will be the values.`
data_dict = dict()
for col in dataframe.columns:
data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
col1 col2
a 1 0.1
b 2 0.2
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See this Documentation for details
My goal is to first select the first 3 items in the dictionary below. I would also like to select items with values greater than 1.
dic=Counter({'school': 4, 'boy': 3, 'old': 3, 'the': 1})
My attempt:
1.>>> {x:x for x in dic if x[1]>1}
{'boy': 'boy', 'the': 'the', 'old': 'old', 'school': 'school'}
2.>>>dic[:3]
TypeError: unhashable type
Desired output: Counter({'school': 4, 'boy': 3, 'old': 3})
Thanks for your suggestions.
For items with count greater than one:
>>> [x for x in dic if dic[x] > 1]
['boy', 'school', 'old']
For the three most common items:
>>> [x for x, freq in dic.most_common(3)]
['school', 'boy', 'old']
To get dictionaries:
>>> {x: freq for x,freq in dic.items() if freq > 1}
{'boy': 3, 'school': 4, 'old': 3}
>>> {x: freq for x,freq in dic.most_common(3)}
{'boy': 3, 'school': 4, 'old': 3}
Note: Those are ordinary dictionaries. Use Counter(result) to turn them back into Counters. Alternatively to the dictionary comprehension you can also use the builtin dict function to turn a list of tuples into a dictionary, and then make a Counter from that.
>>> Counter(dict(dic.most_common(3)))
Counter({'school': 4, 'boy': 3, 'old': 3})