I have a list containing many dictionaries with the same keys but different values.
What I would like to do is to group/merge dictionaries based on the values of some of the keys.
It's probably faster to show an example rather than trying to explain:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
 {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
 {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
 {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
So, what I want to achieve is merging the first and third dictionary, since they have the same "zone", "hour" and "weekday", summing the values in C1 and C2:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 10, 'C2': 27},
 {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
 {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
Any help here? :) I've been struggling with this for a couple of days, I've got a bad unscalable solution, but I'm sure there is something far more pythonic that I could put in place.
Thanks!
Sort then group by the relevant keys; iterate over the groups and create new dictionaries with summed values.
import operator
import itertools

keys = operator.itemgetter('zone', 'weekday', 'hour')
c1_c2 = operator.itemgetter('C1', 'C2')

# data is your list of dicts
data.sort(key=keys)
grouped = itertools.groupby(data, keys)

new_data = []
for (zone, weekday, hour), g in grouped:
    c1, c2 = 0, 0
    for d in g:
        c1 += d['C1']
        c2 += d['C2']
    new_data.append({'zone': zone, 'weekday': weekday,
                     'hour': hour, 'C1': c1, 'C2': c2})
That last loop could also be written as:
for (zone, weekday, hour), g in grouped:
    cees = map(c1_c2, g)
    c1, c2 = map(sum, zip(*cees))
    new_data.append({'zone': zone, 'weekday': weekday,
                     'hour': hour, 'C1': c1, 'C2': c2})
By using a defaultdict you can merge them in linear time.
from collections import defaultdict

res = defaultdict(lambda: defaultdict(int))
for d in dictionaries:
    res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
    res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
The drawback is that you need another pass to have the output as you've defined it.
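That extra pass is cheap, though. A sketch of rebuilding the requested list of flat dicts, using the sample data from the question:

```python
from collections import defaultdict

dictionaries = [{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
                {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
                {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
                {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]

res = defaultdict(lambda: defaultdict(int))
for d in dictionaries:
    res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
    res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']

# the extra pass: unpack the key tuples back into flat dicts
out = [{'zone': z, 'weekday': w, 'hour': h, **counts}
       for (z, w, h), counts in res.items()]
```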
I've gone ahead and written a slightly longer solution, making use of namedtuples as the keys of the dictionary:
from collections import namedtuple
zones = [{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
         {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
         {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
         {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])

results = dict()
for zone in zones:
    zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    if zone_time in results:
        results[zone_time]['C1'] += zone['C1']
        results[zone_time]['C2'] += zone['C2']
    else:
        results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}

print(results)
This uses a namedtuple of (zone, weekday, hour) as the key to each dictionary. Then it's fairly trivial to either add to it if it already exists within results, or create a new entry in the dictionary.
You can definitely make this shorter and "smarter", but it may become less readable.
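As an aside, one way to shorten it (my own sketch, not part of the answer above) is dict.setdefault with a Counter, which folds the if/else into a single line:

```python
from collections import Counter, namedtuple

zones = [{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
         {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
         {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12}]

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])

results = {}
for z in zones:
    key = ZoneTime(z['zone'], z['weekday'], z['hour'])
    # setdefault returns the existing Counter, or inserts an empty one;
    # Counter.update with keyword arguments adds to the counts
    results.setdefault(key, Counter()).update(C1=z['C1'], C2=z['C2'])
```

Whether that is actually more readable than the explicit branch is a matter of taste.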
Edit: Run Time Comparison
My original answer (see below) was not a good one, but I think I made a useful contribution by doing a little run-time analysis of the other answers, so I've edited that portion and moved it to the top. Here I include the three other solutions, along with the transformations required to produce the desired output. For completeness I also include a version using pandas, which assumes the user is working with a DataFrame (transforming from a list of dicts to a data frame and back was not even close to worth it). Comparison times vary a little depending on the random data generated, but these are fairly representative:
>>> run_timer(100)
Times with 100 values
...with defaultdict: 0.1496697600000516
...with namedtuple: 0.14976404899994122
...with groupby: 0.0690777249999428
...with pandas: 3.3165711250001095
>>> run_timer(1000)
Times with 1000 values
...with defaultdict: 1.267153091999944
...with namedtuple: 0.9605341750000207
...with groupby: 0.6634409229998255
...with pandas: 3.5146895360001054
>>> run_timer(10000)
Times with 10000 values
...with defaultdict: 9.194478484000001
...with namedtuple: 9.157486462000179
...with groupby: 5.18553969300001
...with pandas: 4.704001281000046
>>> run_timer(100000)
Times with 100000 values
...with defaultdict: 59.644778522000024
...with namedtuple: 89.26688319799996
...with groupby: 93.3517027989999
...with pandas: 14.495209061999958
Takeaways:
working with pandas data frames pays off big time for large datasets
NOTE: I do not include conversion between list of dicts and data frame, which is definitely significant
otherwise the accepted solution (by wwii) wins for small to medium datasets, but for very large ones it may be the slowest
changing the sizes of the groups (e.g., by decreasing the number of zones) has a huge effect which is not examined here
Here is the script I used to generate the above.
import random
import pandas
from timeit import timeit
from functools import partial
from itertools import groupby
from operator import itemgetter
from collections import namedtuple, defaultdict
def with_pandas(df):
    return df.groupby(['zone', 'weekday', 'hour']).agg(sum).reset_index()

def with_groupby(data):
    keys = itemgetter('zone', 'weekday', 'hour')
    # data is your list of dicts
    data.sort(key=keys)
    grouped = groupby(data, keys)
    new_data = []
    for (zone, weekday, hour), g in grouped:
        c1, c2 = 0, 0
        for d in g:
            c1 += d['C1']
            c2 += d['C2']
        new_data.append({'zone': zone, 'weekday': weekday,
                         'hour': hour, 'C1': c1, 'C2': c2})
    return new_data

def with_namedtuple(zones):
    ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
    results = dict()
    for zone in zones:
        zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
        if zone_time in results:
            results[zone_time]['C1'] += zone['C1']
            results[zone_time]['C2'] += zone['C2']
        else:
            results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in results.items()
    ]

def with_defaultdict(dictionaries):
    res = defaultdict(lambda: defaultdict(int))
    for d in dictionaries:
        res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
        res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in res.items()
    ]

def gen_random_vals(num):
    return [
        {
            'zone': random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            'weekday': random.randint(1, 7),
            'hour': random.randint(0, 23),
            'C1': random.randint(1, 50),
            'C2': random.randint(1, 50),
        }
        for idx in range(num)
    ]

def run_timer(num_vals=1000, timeit_num=1000):
    vals = gen_random_vals(num_vals)
    df = pandas.DataFrame(vals)
    p_fmt = "\t...with %s: %s"
    times = {
        'defaultdict': timeit(stmt=partial(with_defaultdict, vals), number=timeit_num),
        'namedtuple': timeit(stmt=partial(with_namedtuple, vals), number=timeit_num),
        'groupby': timeit(stmt=partial(with_groupby, vals), number=timeit_num),
        'pandas': timeit(stmt=partial(with_pandas, df), number=timeit_num),
    }
    print("Times with %d values" % num_vals)
    for key, val in times.items():
        print(p_fmt % (key, val))
where
with_groupby uses the solution by wwii
with_namedtuple uses the solution by Jose Salvatierra
with_defaultdict uses the solution by abc
with_pandas uses the solution proposed by Alexander Cécile in comments
assumes data is already in a DataFrame and produces a DataFrame as result
Original answer:
Just for fun, here's a completely different approach using groupby. Granted, it's not the prettiest, but it should be fairly quick.
from itertools import groupby
from operator import itemgetter
from pprint import pprint
vals = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}
]

ordered = sorted(
    [
        (
            (row['zone'], row['weekday'], row['hour']),
            row['C1'], row['C2']
        )
        for row in vals
    ]
)

def invert_columns(grp):
    return zip(*[g_row[1:] for g_row in grp])

merged = [
    {
        'zone': key[0],
        'weekday': key[1],
        'hour': key[2],
        **dict(
            zip(["C1", "C2"], [sum(col) for col in invert_columns(grp)])
        )
    }
    for key, grp in groupby(ordered, itemgetter(0))
]

pprint(merged)
which yields
[{'C1': 10, 'C2': 27, 'hour': 12, 'weekday': 1, 'zone': 'A'},
 {'C1': 5, 'C2': 27, 'hour': 6, 'weekday': 2, 'zone': 'B'},
 {'C1': 2, 'C2': 13, 'hour': 8, 'weekday': 5, 'zone': 'C'}]
I am struggling to create a nested dictionary with the following data:
Team, Group, ID, Score, Difficulty
OneTeam, A, 0, 0.25, 4
TwoTeam, A, 1, 1, 10
ThreeTeam, A, 2, 0.64, 5
FourTeam, A, 3, 0.93, 6
FiveTeam, B, 4, 0.5, 7
SixTeam, B, 5, 0.3, 8
SevenTeam, B, 6, 0.23, 9
EightTeam, B, 7, 1.2, 4
Once imported as a Pandas Dataframe, I turn each feature into these lists:
teams, group, id, score, diff.
Using this stack overflow answer Create a complex dictionary using multiple lists I can create the following dictionary:
{'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2},
 'FiveTeam': {'diff': 7, 'id': 4, 'score': 0.5},
 'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93},
 'OneTeam': {'diff': 4, 'id': 0, 'score': 0.25},
 'SevenTeam': {'diff': 9, 'id': 6, 'score': 0.23},
 'SixTeam': {'diff': 8, 'id': 5, 'score': 0.3},
 'ThreeTeam': {'diff': 5, 'id': 2, 'score': 0.64},
 'TwoTeam': {'diff': 10, 'id': 1, 'score': 1.0}}
using the code:
{team: {'id': i, 'score': s, 'diff': d} for team, i, s, d in zip(teams, id, score, diff)}
But what I'm after is having 'Group' as the main key, then team, and then id, score and difficulty within the team (as above).
I have tried:
{g: {team: {'id': i, 'score': s, 'diff': d}} for g, team, i, s, d in zip(group, teams, id, score, diff)}
but this doesn't work and results in only one team per group within the dictionary:
{'A': {'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93}},
 'B': {'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2}}}
Below is how the dictionary should look, but I'm not sure how to get there - any help would be much appreciated!
{'A': {'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93},
       'OneTeam': {'diff': 4, 'id': 0, 'score': 0.25},
       'ThreeTeam': {'diff': 5, 'id': 2, 'score': 0.64},
       'TwoTeam': {'diff': 10, 'id': 1, 'score': 1.0}},
 'B': {'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2},
       'FiveTeam': {'diff': 7, 'id': 4, 'score': 0.5},
       'SevenTeam': {'diff': 9, 'id': 6, 'score': 0.23},
       'SixTeam': {'diff': 8, 'id': 5, 'score': 0.3}}}
A dict comprehension may not be the best way of solving this if your data is stored in a table like this.
Try something like
from collections import defaultdict

groups = defaultdict(dict)
for g, team, i, s, d in zip(group, teams, id, score, diff):
    groups[g][team] = {'id': i, 'score': s, 'diff': d}
By using defaultdict, if groups[g] already exists, the new team is added as a key, if it doesn't, an empty dict is automatically created that the new team is then inserted into.
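A tiny illustration of that auto-creation behaviour (illustrative sketch, not part of the original answer):

```python
from collections import defaultdict

groups = defaultdict(dict)
groups['A']['OneTeam'] = {'id': 0}  # 'A' did not exist: an empty dict is created first
groups['A']['TwoTeam'] = {'id': 1}  # 'A' exists now: the team is added as a new key
```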
Edit: you edited your question to say that your data is in a pandas dataframe. You can definitely skip the step of turning the columns into lists. Instead you could, for example, do:
from collections import defaultdict

groups = defaultdict(dict)
for row in df.itertuples():
    groups[row.Group][row.Team] = {'id': row.ID, 'score': row.Score, 'diff': row.Difficulty}
If you absolutely want to use comprehension, then this should work:
z = list(zip(teams, group, id, score, diff))  # list() so z survives multiple passes (zip is a one-shot iterator in Python 3)
s = set(group)
d = {  # outer dict, one entry for each different group
    group: ({  # inner dict, one entry per team, filtered for group
        team: {'id': i, 'score': s, 'diff': d}
        for team, g, i, s, d in z
        if g == group
    })
    for group in s
}
I added linebreaks for clarity
EDIT:
After the comment, to better clarify my intention and out of curiosity, I ran a comparison:
from collections import defaultdict
import timeit

teams = ['OneTeam', 'TwoTeam', 'ThreeTeam', 'FourTeam', 'FiveTeam', 'SixTeam', 'SevenTeam', 'EightTeam']
group = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
id = [0, 1, 2, 3, 4, 5, 6, 7]
score = [0.25, 1, 0.64, 0.93, 0.5, 0.3, 0.23, 1.2]
diff = [4, 10, 5, 6, 7, 8, 9, 4]

def no_comprehension():
    global group, teams, id, score, diff
    groups = defaultdict(dict)
    for g, team, i, s, d in zip(group, teams, id, score, diff):
        groups[g][team] = {'id': i, 'score': s, 'diff': d}

def comprehension():
    global group, teams, id, score, diff
    z = list(zip(teams, group, id, score, diff))  # list() so z can be re-iterated for each group
    s = set(group)
    d = {group: ({team: {'id': i, 'score': s, 'diff': d} for team, g, i, s, d in z if g == group}) for group in s}

print("no comprehension:")
print(timeit.timeit(lambda: no_comprehension(), number=10000))
print("comprehension:")
print(timeit.timeit(lambda: comprehension(), number=10000))
Output:
no comprehension:
0.027287796139717102
comprehension:
0.028979241847991943
They do look the same in terms of performance. With my sentence above, I was just highlighting this as an alternative to the solution already posted by @JohnO.
I have an input list:
inlist = [{"id":123,"hour":5,"groups":"1"},{"id":345,"hour":3,"groups":"1;2"},{"id":65,"hour":-2,"groups":"3"}]
I need to group the dictionaries by 'groups' value. After that I need to add keys for the min and max hour within each group. The output should look like this:
outlist = [(1, [{"id": 123, "hour": 5, "min_group_hour": 3, "max_group_hour": 5},
                {"id": 345, "hour": 3, "min_group_hour": 3, "max_group_hour": 5}]),
           (2, [{"id": 345, "hour": 3, "min_group_hour": 3, "max_group_hour": 3}]),
           (3, [{"id": 65, "hour": -2, "min_group_hour": -2, "max_group_hour": -2}])]
So far I have managed to group the input list:
new_list = []
for domain in test:
    for group in domain['groups'].split(';'):
        d = dict()
        d['id'] = domain['id']
        d['group'] = group
        d['hour'] = domain['hour']
        new_list.append(d)

for k, v in itertools.groupby(new_list, key=itemgetter('group')):
    print(int(k), max(list(v), key=itemgetter('hour')))
And the output is:
('1', [{'group': '1', 'id': 123, 'hour': 5}])
('2', [{'group': '2', 'id': 345, 'hour': 3}])
('3', [{'group': '3', 'id': 65, 'hour': -2}])
I don't know how to aggregate the values by group. And is there a more pythonic way of grouping dictionaries by a key value that needs to be split?
Start by creating a dict that maps group numbers to dictionaries:
from collections import defaultdict

dicts_by_group = defaultdict(list)
for dic in inlist:
    groups = map(int, dic['groups'].split(';'))
    for group in groups:
        dicts_by_group[group].append(dic)
This gives us a dict that looks like
{1: [{'id': 123, 'hour': 5, 'groups': '1'},
     {'id': 345, 'hour': 3, 'groups': '1;2'}],
 2: [{'id': 345, 'hour': 3, 'groups': '1;2'}],
 3: [{'id': 65, 'hour': -2, 'groups': '3'}]}
Then iterate over the grouped dicts and set the min_group_hour and max_group_hour for each group:
outlist = []
for group in sorted(dicts_by_group.keys()):
    dicts = dicts_by_group[group]
    min_hour = min(dic['hour'] for dic in dicts)
    max_hour = max(dic['hour'] for dic in dicts)
    dicts = [{'id': dic['id'], 'hour': dic['hour'], 'min_group_hour': min_hour,
              'max_group_hour': max_hour} for dic in dicts]
    outlist.append((group, dicts))
Result:
[(1, [{'id': 123, 'hour': 5, 'min_group_hour': 3, 'max_group_hour': 5},
      {'id': 345, 'hour': 3, 'min_group_hour': 3, 'max_group_hour': 5}]),
 (2, [{'id': 345, 'hour': 3, 'min_group_hour': 3, 'max_group_hour': 3}]),
 (3, [{'id': 65, 'hour': -2, 'min_group_hour': -2, 'max_group_hour': -2}])]
IIUC: Here is another way to do it in pandas:
import pandas as pd

# note: the name 'input' would shadow a builtin, so use something else
data = [{"id": 123, "hour": 5, "group": "1"},
        {"id": 345, "hour": 3, "group": "1;2"},
        {"id": 65, "hour": -2, "group": "3"}]
df = pd.DataFrame(data)

# Get minimum
dfmi = df.groupby('group').apply(min)
# Rename hour column as min_hour
dfmi.rename(columns={'hour': 'min_hour'}, inplace=True)

# Get maximum
dfmx = df.groupby('group').apply(max)
# Rename hour column as max_hour
dfmx.rename(columns={'hour': 'max_hour'}, inplace=True)

# Merge min df with main df
df = df.merge(dfmi, on='group', how='outer')
# Merge max df with main df
df = df.merge(dfmx, on='group', how='outer')

output = list(df.apply(lambda x: x.to_dict(), axis=1))

# Dictionary of dictionaries
dict_out = df.to_dict(orient='index')
I'm pulling data from the database, and assuming I have something like this:
Product Name Quantity
a 3
a 5
b 2
c 7
I want to sum the Quantity based on Product name, so this is what i want:
product = {'a':8, 'b':2, 'c':7 }
Here's what I'm trying to do after fetching the data from the database:
for row in result:
    product[row['product_name']] += row['quantity']
but this will give me: 'a'=5 only, not 8.
Option 1: pandas
This is one way, assuming you begin with a pandas dataframe df. This solution has O(n log n) complexity.
product = df.groupby('Product Name')['Quantity'].sum().to_dict()
# {'a': 8, 'b': 2, 'c': 7}
The idea is you can perform a groupby operation, which produces a series indexed by "Product Name". Then use the to_dict() method to convert to a dictionary.
Option 2: collections.Counter
If you begin with a list or iterator of results, and wish to use a for loop, you can use collections.Counter for O(n) complexity.
from collections import Counter

result = [['a', 3],
          ['a', 5],
          ['b', 2],
          ['c', 7]]

product = Counter()
for row in result:
    product[row[0]] += row[1]

print(product)
# Counter({'a': 8, 'c': 7, 'b': 2})
Option 3: itertools.groupby
You can also use a dictionary comprehension with itertools.groupby. This requires sorting beforehand.
from itertools import groupby
res = {i: sum(list(zip(*j))[1]) for i, j in groupby(sorted(result), key=lambda x: x[0])}
# {'a': 8, 'b': 2, 'c': 7}
If you insist on using loops, you can do this:
# fake data to make the script runnable
result = [
    {'product_name': 'a', 'quantity': 3},
    {'product_name': 'a', 'quantity': 5},
    {'product_name': 'b', 'quantity': 2},
    {'product_name': 'c', 'quantity': 7}
]

# solution with defaultdict and loops
from collections import defaultdict

d = defaultdict(int)
for row in result:
    d[row['product_name']] += row['quantity']

print(dict(d))
The output:
{'a': 8, 'b': 2, 'c': 7}
Since you mention pandas
df.set_index('ProductName').Quantity.sum(level=0).to_dict()
Out[20]: {'a': 8, 'b': 2, 'c': 7}
Use tuples to store the rows, then accumulate the sums in a dict.
Edit:
Not clear if the data mentioned is really a dataframe.
If yes then li = [tuple(x) for x in df.to_records(index=False)]
li = [('a', 3), ('a', 5), ('b', 2), ('c', 7)]

d = dict()
for key, val in li:
    val_old = 0
    if key in d:
        val_old = d[key]
    d[key] = val + val_old

print(d)
Output
{'a': 8, 'b': 2, 'c': 7}
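For what it's worth, the same accumulation can be written more compactly with dict.get, which returns a default when the key is missing (a sketch of the same idea, not part of the original answer):

```python
li = [('a', 3), ('a', 5), ('b', 2), ('c', 7)]

d = {}
for key, val in li:
    d[key] = d.get(key, 0) + val  # 0 when the key is not present yet
```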
I am trying to gather summary statistics to generate a boxplot.
I have a dictionary where the keys are variables to be plotted on the y-axis and the values are their count in the data.
d = {16: 5,
     21: 9,
     44: 2,
     2: 1}
I am wondering if there is a way to generate statistics such as median, Q1, Q3, etc. from the counts alone - I don't want to turn it into a list like [16, 16, 16, 16, 16, 21, 21, ...] and calculate from that. This is due to me trying to save a considerable amount of memory and not having to store the individual observations in memory.
EDIT
To be more concrete. Given an input
d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
I would like something that outputs
{'q1': 4, 'q2': 10.5, 'q3': 15, 'max': 18, 'min': 3}
Here is an idea. I have not dealt with all situations (e.g. when median index is not a whole number), but since get_val returns the result of a generator it should be memory-efficient.
from collections import OrderedDict
from itertools import accumulate

d = {16: 5,
     21: 9,
     44: 4,
     2: 2}

d = OrderedDict(sorted(d.items()))
size = sum(d.values())

idx = {'q1': size / 4,
       'q2': size / 2,
       'q3': size * 3 / 4}
# {'q1': 5.0, 'q2': 10.0, 'q3': 15.0}

def get_val(d, i):
    return next(k for k, x in zip(d, accumulate(d.values())) if i < x)

res = {k: get_val(d, v) for k, v in idx.items()}
# {'q1': 16, 'q2': 21, 'q3': 21}
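The question's desired output also includes min and max; those come straight from the sorted keys, with no expansion needed (a sketch using the question's example data):

```python
d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}

keys = sorted(d)            # distinct observed values, in order
lo, hi = keys[0], keys[-1]  # min and max of the underlying data
```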
Very new to Python and programming in general, so bear with me. The basic function loops over a simple dictionary, checks the values and, if one of them is 0, replaces that value with the mean of the rest of the group. This works!
def replace_zero(group):
    for k, v in group.iteritems():
        if v == '-':
            print 'there was a - value!'
            group[k] = 0
            new_mean = sum(group.itervalues()) / (len(group.keys()) - 1)
            group[k] = new_mean
            return group[k]
        elif v == 0:
            print 'there was a zero value!'
            group[k] = 0
            new_mean = sum(group.itervalues()) / (len(group.keys()) - 1)
            group[k] = new_mean
            return group[k]
But due to my huge dataset I don't want to call this function 36 times, so I made 12 dictionaries, which contain 3 dictionaries each.
gr_ctr_0 = {'distance': {'A1': sheet['E5'].value, 'A12': sheet['E16'].value,
                         'B1': sheet['E17'].value, 'B12': sheet['E28'].value,
                         'C1': sheet['E29'].value, 'C12': sheet['E40'].value,
                         'D1': sheet['E41'].value, 'D12': sheet['E52'].value},
            'speed': {'A1': sheet['F5'].value, 'A12': sheet['F16'].value,
                      'B1': sheet['F17'].value, 'B12': sheet['F28'].value,
                      'C1': sheet['F29'].value, 'C12': sheet['F40'].value,
                      'D1': sheet['F41'].value, 'D12': sheet['F52'].value},
            'time': {'A1': sheet['G5'].value, 'A12': sheet['G16'].value,
                     'B1': sheet['G17'].value, 'B12': sheet['G28'].value,
                     'C1': sheet['G29'].value, 'C12': sheet['G40'].value,
                     'D1': sheet['G41'].value, 'D12': sheet['G52'].value}}
I would now like to extend my function so that I pass it one dictionary (gr_ctr_0) and it still replaces any 0 value with the mean of the nested dictionary it belongs to (e.g. inside 'distance').
I've read all the related questions and thought it would be a relatively simple change where I just add something along the lines of
def replace_zero(main_dict):
    for group in main_dict:
        for k, v in group.iteritems():
but it doesn't work, like at all :( Alternatively, I read something about recursive functions, but I would have no idea how to implement that! Thank you all in advance!
EDIT !!!
taking both comments into account, I came up with this solution
def replace_zero_stackoverflow(group):
    average = float(sum(group.itervalues())) / (len(group) - sum(v == 0 for v in group.itervalues()))
    for k, v in group.iteritems():
        if v == 0:
            group[k] = average
    return group

res = {name: replace_zero_stackoverflow(group) for name, group in gr_ctr_0.iteritems()}
Your function stops the first time a match is found, because it has a return, so it won't check every value. It would also fail if multiple "-" values are inside, or a combination of "-" and 0 (or any other combination of numbers and strings, for that matter).
For example
gr_ctr_0 = {'distance': {'A1': 1, 'A12': 1, 'B1': '-', 'B12': 5,
                         'C1': 5, 'C12': 4, 'D1': 6, 'D12': '-'},
            'speed': {'A1': 0, 'A12': 6, 'B1': 4, 'B12': 4,
                      'C1': 1, 'C12': 6, 'D1': 6, 'D12': 1},
            'time': {'A1': 5, 'A12': 2, 'B1': 2, 'B12': 4,
                     'C1': 0, 'C12': 3, 'D1': 2, 'D12': '-'}
            }
With your original function this results in:
>>> replace_zero(gr_ctr_0["distance"])
there was a - value!
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
replace_zero(gr_ctr_0["distance"])
File "C:\Users\David\Documents\Python Scripts\stackoverflow_test.py", line 53, in replace_zero
new_mean = sum(group.itervalues()) / (len(group.keys())-1)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>>
You first need to replace all the "-" values (or all strings) with numbers, and then you can replace one zero or all zeros as desired.
To accomplish this, we can create another function that does the first step:
def clean_data(data, default=0):
    for k, v in data.iteritems():
        if v == "-":  # isinstance(v, str):
            data[k] = default
    return data
The commented part is for detecting any string; use that instead if necessary.
Now replace_zero can call the clean function first to ensure success, like this:
def replace_zero(group):
    group = clean_data(group)
    for k, v in group.iteritems():
        if v == 0:
            print 'there was a zero value!'
            group[k] = sum(group.itervalues()) / (len(group) - 1)
            break
    return group
This function only replaces the first zero it finds; to replace all of them, remove the break. Note that in that case each zero would end up with a different value. If you want to replace them all with the same value, you need to calculate it first, like this:
def replace_zero(group):
    group = clean_data(group)
    average = sum(group.itervalues()) / (len(group) - 1)
    for k, v in group.iteritems():
        if v == 0:
            print 'there was a zero value!'
            group[k] = average
    return group
And in this case, if you want to ignore the zeros when computing the average, change (len(group) - 1) to (len(group) - sum(v == 0 for v in group.itervalues())); the sum counts how many zeros there are.
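Restated in Python 3 as a quick check (a sketch, not part of the answer's Python 2 code): ignoring zeros in the denominator is equivalent to averaging only the non-zero values, since zeros contribute nothing to the sum.

```python
group = {'A1': 0, 'A12': 6, 'B1': 4, 'B12': 2}

# the answer's formula: total divided by the count of non-zero entries
n_zeros = sum(v == 0 for v in group.values())
average = sum(group.values()) / (len(group) - n_zeros)

# equivalent: average of the non-zero values only
nonzero = [v for v in group.values() if v != 0]
assert average == sum(nonzero) / len(nonzero)
```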
And finally, you can do as Mike Müller shows to get the final result, or alternatively:
def replace_zero_in_groups(data):
    for k, v in data.iteritems():
        data[k] = replace_zero(v)
    return data
Testing it with the example above (the variant that replaces all zeros with the same average):
>>> replace_zero_in_groups(gr_ctr_0)
{'distance': {'A1': 1,
              'A12': 1,
              'B1': 3.6666666666666665,
              'B12': 5,
              'C1': 5,
              'C12': 4,
              'D1': 6,
              'D12': 3.6666666666666665},
 'speed': {'A1': 4.0,
           'A12': 6,
           'B1': 4,
           'B12': 4,
           'C1': 1,
           'C12': 6,
           'D1': 6,
           'D12': 1},
 'time': {'A1': 5,
          'A12': 2,
          'B1': 2,
          'B12': 4,
          'C1': 3.0,
          'C12': 3,
          'D1': 2,
          'D12': 3.0}}
>>>
(also, to ensure a correct division, add from __future__ import division at the start of the code)
Modify your function:
def replace_zero(group):
    for k, v in group.iteritems():
        if v == '-' or v == 0:
            group[k] = float(sum(group.itervalues())) / (len(group.keys()) - 1)
    return group
and this should work:
res = {name: replace_zero(group) for name, group in gr_ctr_0.items()}
You can apply your function to each group in your dictionary gr_ctr_0 and create a result dictionary res.
Test:
gr_ctr_0 = {'distance': {'A1': 4, 'A12': 0, 'A13': 7},
            'speed': {'A1': 0, 'A12': 45, 'A13': 5.7},
            'time': {'A1': 3, 'A12': 40, 'A13': 20}}

res = {name: replace_zero(group) for name, group in gr_ctr_0.items()}
print(res)
Output:
{'distance': {'A1': 4, 'A13': 7, 'A12': 5.5},
 'speed': {'A1': 25.35, 'A13': 5.7, 'A12': 45},
 'time': {'A1': 3, 'A13': 20, 'A12': 40}}