Python summary statistics from counts dictionary

I am trying to gather summary statistics to generate a boxplot.
I have a dictionary where the keys are variables to be plotted on the y-axis and the values are their count in the data.
d = {16: 5,
     21: 9,
     44: 2,
     2: 1}
I am wondering if there is a way to generate statistics such as the median, Q1, Q3, etc. from the counts alone - I don't want to expand it into a list like [16, 16, 16, 16, 16, 21, 21, ...] and calculate from that, because I am trying to save a considerable amount of memory by not storing the individual observations.
EDIT
To be more concrete: given an input
d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
I would like something that outputs
{'q1': 4, 'q2': 10.5, 'q3': 15, 'max': 18, 'min': 3}

Here is an idea. I have not dealt with all situations (e.g. when the median index is not a whole number), but since get_val evaluates a generator lazily it should be memory-efficient.
from collections import OrderedDict
from itertools import accumulate

d = {16: 5,
     21: 9,
     44: 4,
     2: 2}

d = OrderedDict(sorted(d.items()))
size = sum(d.values())

idx = {'q1': size / 4,
       'q2': size / 2,
       'q3': size * 3 / 4}
# {'q1': 5.0, 'q2': 10.0, 'q3': 15.0}

def get_val(d, i):
    # walk the running total of counts until it first exceeds index i
    return next(k for k, x in zip(d, accumulate(d.values())) if i < x)

res = {k: get_val(d, v) for k, v in idx.items()}
# {'q1': 16, 'q2': 21, 'q3': 21}
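To cover the fractional-index case left open above, one option (a sketch, not the only convention) is to interpolate linearly between the two neighbouring observations, still only walking the cumulative counts. Note this matches numpy's default 'linear' quantile method, so for the question's example it returns q3 = 14.5 rather than the 15 in the expected output, which corresponds to a different quartile convention (Tukey's hinges):

def quantile_from_counts(counts, q):
    # q in [0, 1]; linear interpolation over a {value: count} dict,
    # without ever expanding the individual observations
    items = sorted(counts.items())
    n = sum(c for _, c in items)
    pos = q * (n - 1)          # index into the (virtual) sorted data
    lo = int(pos)
    frac = pos - lo

    def value_at(i):
        # value of the i-th (0-based) observation, via the running count total
        cum = 0
        for value, count in items:
            cum += count
            if i < cum:
                return value

    if frac == 0:
        return value_at(lo)
    return value_at(lo) + (value_at(lo + 1) - value_at(lo)) * frac

example = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
{'min': min(example),
 'q1': quantile_from_counts(example, 0.25),
 'q2': quantile_from_counts(example, 0.5),
 'q3': quantile_from_counts(example, 0.75),
 'max': max(example)}
# {'min': 3, 'q1': 4.0, 'q2': 10.5, 'q3': 14.5, 'max': 18}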

Related

Nested dictionary prints differently after sorting by inner dictionary values in Python

I was trying to sort a nested dict by the inner dict's values. The sorting went well, but when I checked my result, I found that the original dict was printed when I just used the variable (d2), while it gave me the correct result when I used print(d2).
import operator

d2 = {1: {1: 4, 2: 5, 3: 6},
      2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18},
      3: {1: 1, 2: 9, 3: 4}}

# sorting by inner dict value
for keys in d2.keys():
    sorted_tuples = sorted(d2[keys].items(), key=operator.itemgetter(1), reverse=True)
    d2[keys] = {k: v for k, v in sorted_tuples}

print(d2)
# {1: {3: 6, 2: 5, 1: 4}, 2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13}, 3: {2: 9, 3: 4, 1: 1}}

d2
# {1: {1: 4, 2: 5, 3: 6},
#  2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18},
#  3: {1: 1, 2: 9, 3: 4}}
Why is the output different when I use d2 versus print(d2)?
Did you use the pretty print module to print the results of d2? I was only able to replicate your behavior using the pretty print module. Pretty print alphabetically sorts a dictionary before printing it, which can be disabled.
I originally (and wrongly) suspected the different output between d2 and print(d2) was a result of dictionaries being unordered collections of data; I suspected dict.__str__ and dict.__repr__ differed just enough. I would recommend using an OrderedDict over a standard dictionary if you wish to guarantee order, even though Python preserves dictionary insertion order as of 3.7.
Below is my code and conclusions.
After initialization, d2 and print(d2) printed the same values:
❯ python
Python 3.7.12 (default, Sep 10 2021, 17:29:55)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d2 = {1: {1: 4, 2: 5, 3: 6},
...       2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18},
...       3: {1: 1, 2: 9, 3: 4}}
>>> d2
{1: {1: 4, 2: 5, 3: 6}, 2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18}, 3: {1: 1, 2: 9, 3: 4}}
>>> print(d2)
{1: {1: 4, 2: 5, 3: 6}, 2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18}, 3: {1: 1, 2: 9, 3: 4}}
After sorting, d2 and print(d2) printed the same values.
>>> import operator
>>> for keys in d2.keys():
...     sorted_tuples = sorted(d2[keys].items(), key=operator.itemgetter(1), reverse=True)
...     d2[keys] = {k: v for k, v in sorted_tuples}
...
>>> print(d2)
{1: {3: 6, 2: 5, 1: 4}, 2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13}, 3: {2: 9, 3: 4, 1: 1}}
>>> d2
{1: {3: 6, 2: 5, 1: 4}, 2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13}, 3: {2: 9, 3: 4, 1: 1}}
However, while using the pretty print module, I was able to replicate your behavior.
>>> from pprint import pprint as pp
>>> pp(print(d2))
{1: {3: 6, 2: 5, 1: 4}, 2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13}, 3: {2: 9, 3: 4, 1: 1}}
>>> pp(d2)
{1: {1: 4, 2: 5, 3: 6},
2: {7: 13, 8: 14, 9: 15, 10: 16, 11: 17, 12: 18},
3: {1: 1, 2: 9, 3: 4}}
Once I disabled dictionary sorting in the pretty print module, I was able to obtain your desired output.
>>> import pprint
>>> pprint.sorted = lambda x, key=None: x
>>> pp(d2)
{1: {3: 6, 2: 5, 1: 4},
2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13},
3: {2: 9, 3: 4, 1: 1}}
>>> pp(print(d2))
{1: {3: 6, 2: 5, 1: 4}, 2: {12: 18, 11: 17, 10: 16, 9: 15, 8: 14, 7: 13}, 3: {2: 9, 3: 4, 1: 1}}
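As a side note, Python 3.8 added an official sort_dicts flag to pprint, so on newer versions the monkey-patch above is unnecessary:

>>> from pprint import pprint
>>> pprint(d2, sort_dicts=False)  # Python 3.8+: keeps insertion order instead of sorting keys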

How to get the count of a column with key-value pairs in Pandas

I am new to Pandas and working on an exercise.
The question is to find the number of items that have more than 3 types. I am confused about how to get the types (keys) from the type column.
Besides, is it proper for Pandas to store key-value pairs in a single column? Thanks!
shopid  name   type
1       item1  {S: 10, M: 10, L: 10}
2       item2  {S: 10, M: 10}
2       item3  {S: 10, M: 10, L: 10, XL: 10}
3       item1  {S: 10, M: 10, L: 10}
3       item2  {S: 10, M: 10}
4       item3  {S: 10, M: 10, L: 10, XL: 10}
4       item1  {S: 10, M: 10, L: 10}
4       item2  {S: 10, M: 10}
4       item4  {S: 10, M: 10, L: 10, XL: 10, XXL: 10}
Expected output
2
where item3 and item4 have more than 3 types
Here's another way, using the str accessor to look at the dictionary in column 'type', then use nunique to count the number of unique names:
df.loc[df['type'].str.len() > 3, 'name'].nunique()
Output:
2
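Note that .str.len() works here because, for object dtype, pandas falls back to Python's len(), so each dict cell reports its number of keys. If you'd rather not rely on the .str accessor for non-strings, mapping len directly should be equivalent (a sketch, same assumed frame):

# equivalent: len() of each dict cell, no .str accessor involved
df.loc[df['type'].map(len) > 3, 'name'].nunique()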
IIUC, considering the below as your dataframe:
d = {'shopid': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 4, 7: 4, 8: 4},
     'name': {0: 'item1',
              1: 'item2',
              2: 'item3',
              3: 'item1',
              4: 'item2',
              5: 'item3',
              6: 'item1',
              7: 'item2',
              8: 'item4'},
     'type': {0: {'S': 10, 'M': 10, 'L': 10},
              1: {'S': 10, 'M': 10},
              2: {'S': 10, 'M': 10, 'L': 10, 'XL': 10},
              3: {'S': 10, 'M': 10, 'L': 10},
              4: {'S': 10, 'M': 10},
              5: {'S': 10, 'M': 10, 'L': 10, 'XL': 10},
              6: {'S': 10, 'M': 10, 'L': 10},
              7: {'S': 10, 'M': 10},
              8: {'S': 10, 'M': 10, 'L': 10, 'XL': 10, 'XXL': 10}}}

df = pd.DataFrame(d)
You can convert the dictionary column to a DataFrame and group on name, taking the first values (which ignores NaN), then take the sum of notna on axis=1 and compare:
output = (pd.DataFrame(df['type'].tolist()).groupby(df['name']).first()
            .notna().sum(1).gt(3).sum())
print(output)
# 2
It looks like your column type is not a proper dict, but a str. There is a trick you can try, though, observing that the number of types actually equals the number of : characters:
df = df[df['type'].str.count(':') > 3]
This should help you select the rows that have more than three types.
Additionally, I agree with the comments about the datatype in a dataframe; having key-value dicts in a column is not a good option at times.
Note: Assuming your data is in a dataframe called df
Firstly, for your purpose, you can remove duplicates in your DF by doing:
df.drop_duplicates()
Optionally reset the index using df.drop_duplicates().reset_index().
Then you can add a slack column:
df['type_count'] = df.apply(lambda r: len(r['type']), axis=1)
Then you can crunch, delete the slack column, and find the number of rows like so:
print(df[df['type_count'] > 3].iloc[:, :-1].shape[0])
df[df['type_count'] > 3] finds all items with more than 3 dictionary elements, .iloc[:, :-1] removes the slack column, and shape[0] counts the rows.
Hope this helps.

Python - group/merge dictionaries based on key/values identity

I have a list containing many dictionaries with same keys but different values.
What I would like to do is to group/merge dictionaries based on the values of some of the keys.
It's probably faster to show an example rather than trying to explain:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
So, what I want to achieve is merging the first and third dictionary, since they have the same "zone", "hour" and "weekday", summing the values in C1 and C2:
[{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 10, 'C2': 27},
{'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
{'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]
Any help here? :) I've been struggling with this for a couple of days, I've got a bad unscalable solution, but I'm sure there is something far more pythonic that I could put in place.
Thanks!
Sort then group by the relevant keys; iterate over the groups and create new dictionaries with summed values.
import operator
import itertools

keys = operator.itemgetter('zone', 'weekday', 'hour')
c1_c2 = operator.itemgetter('C1', 'C2')

# data is your list of dicts
data.sort(key=keys)
grouped = itertools.groupby(data, keys)

new_data = []
for (zone, weekday, hour), g in grouped:
    c1, c2 = 0, 0
    for d in g:
        c1 += d['C1']
        c2 += d['C2']
    new_data.append({'zone': zone, 'weekday': weekday,
                     'hour': hour, 'C1': c1, 'C2': c2})
That last loop could also be written as:
for (zone, weekday, hour), g in grouped:
    cees = map(c1_c2, g)
    c1, c2 = map(sum, zip(*cees))
    new_data.append({'zone': zone, 'weekday': weekday,
                     'hour': hour, 'C1': c1, 'C2': c2})
By using a defaultdict you can merge them in linear time.
from collections import defaultdict

res = defaultdict(lambda: defaultdict(int))
for d in dictionaries:
    res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
    res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
The drawback is that you need another pass to have the output as you've defined it.
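For completeness, a sketch of that second pass, flattening the keyed totals back into the requested list of dicts:

merged = [
    {'zone': zone, 'weekday': weekday, 'hour': hour, **dict(counts)}
    for (zone, weekday, hour), counts in res.items()
]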
I've gone ahead and written a slightly longer solution, making use of namedtuples as the keys of the dictionary:
from collections import namedtuple

zones = [{'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
         {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
         {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
         {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}]

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])

results = dict()
for zone in zones:
    zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    if zone_time in results:
        results[zone_time]['C1'] += zone['C1']
        results[zone_time]['C2'] += zone['C2']
    else:
        results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}

print(results)
This uses a namedtuple of (zone, weekday, hour) as the key to each dictionary. Then it's fairly trivial to either add to it if it already exists within results, or create a new entry in the dictionary.
You can definitely make this shorter and "smarter", but it may become less readable.
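For instance, one way to shorten it (a sketch; arguably less readable) is a defaultdict of Counter, since Counter.update() with a mapping adds values for matching keys instead of replacing them:

from collections import Counter, defaultdict

results = defaultdict(Counter)
for zone in zones:
    key = (zone['zone'], zone['weekday'], zone['hour'])
    results[key].update({'C1': zone['C1'], 'C2': zone['C2']})  # sums C1/C2 per key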
Edit: Run Time Comparison
My original answer (see below) was not a good one, but I think I had a useful contribution by doing a little bit of run-time analysis on the other answers, so I've edited that portion and put it at the top. Here I include the three other solutions, along with the transformations required to produce the desired output. For completeness I also include a version using pandas, which assumes that the user is working with a DataFrame (transforming from a list of dicts to a data frame and back was not even close to worth it). Comparison times vary a little depending on the random data generated, but these are fairly representative:
>>> run_timer(100)
Times with 100 values
...with defaultdict: 0.1496697600000516
...with namedtuple: 0.14976404899994122
...with groupby: 0.0690777249999428
...with pandas: 3.3165711250001095
>>> run_timer(1000)
Times with 1000 values
...with defaultdict: 1.267153091999944
...with namedtuple: 0.9605341750000207
...with groupby: 0.6634409229998255
...with pandas: 3.5146895360001054
>>> run_timer(10000)
Times with 10000 values
...with defaultdict: 9.194478484000001
...with namedtuple: 9.157486462000179
...with groupby: 5.18553969300001
...with pandas: 4.704001281000046
>>> run_timer(100000)
Times with 100000 values
...with defaultdict: 59.644778522000024
...with namedtuple: 89.26688319799996
...with groupby: 93.3517027989999
...with pandas: 14.495209061999958
Takeaways:
working with pandas data frames pays off big time for large datasets
NOTE: I do not include conversion between list of dicts and data frame, which is definitely significant
otherwise the accepted solution (by wwii) wins for small to medium datasets, but for very large ones it may be the slowest
changing the sizes of the groups (e.g., by decreasing the number of zones) has a huge effect which is not examined here
Here is the script I used to generate the above.
import random
import pandas
from timeit import timeit
from functools import partial
from itertools import groupby
from operator import itemgetter
from collections import namedtuple, defaultdict

def with_pandas(df):
    return df.groupby(['zone', 'weekday', 'hour']).agg(sum).reset_index()

def with_groupby(data):
    keys = itemgetter('zone', 'weekday', 'hour')
    # data is your list of dicts
    data.sort(key=keys)
    grouped = groupby(data, keys)
    new_data = []
    for (zone, weekday, hour), g in grouped:
        c1, c2 = 0, 0
        for d in g:
            c1 += d['C1']
            c2 += d['C2']
        new_data.append({'zone': zone, 'weekday': weekday,
                         'hour': hour, 'C1': c1, 'C2': c2})
    return new_data

def with_namedtuple(zones):
    ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
    results = dict()
    for zone in zones:
        zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
        if zone_time in results:
            results[zone_time]['C1'] += zone['C1']
            results[zone_time]['C2'] += zone['C2']
        else:
            results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in results.items()
    ]

def with_defaultdict(dictionaries):
    res = defaultdict(lambda: defaultdict(int))
    for d in dictionaries:
        res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
        res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in res.items()
    ]

def gen_random_vals(num):
    return [
        {
            'zone': random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            'weekday': random.randint(1, 7),
            'hour': random.randint(0, 23),
            'C1': random.randint(1, 50),
            'C2': random.randint(1, 50),
        }
        for idx in range(num)
    ]

def run_timer(num_vals=1000, timeit_num=1000):
    vals = gen_random_vals(num_vals)
    df = pandas.DataFrame(vals)
    p_fmt = "\t...with %s: %s"
    times = {
        'defaultdict': timeit(stmt=partial(with_defaultdict, vals), number=timeit_num),
        'namedtuple': timeit(stmt=partial(with_namedtuple, vals), number=timeit_num),
        'groupby': timeit(stmt=partial(with_groupby, vals), number=timeit_num),
        'pandas': timeit(stmt=partial(with_pandas, df), number=timeit_num),
    }
    print("Times with %d values" % num_vals)
    for key, val in times.items():
        print(p_fmt % (key, val))
where
with_groupby uses the solution by wwii
with_namedtuple uses the solution by Jose Salvatierra
with_defaultdict uses the solution by abc
with_pandas uses the solution proposed by Alexander Cécile in the comments; it assumes the data is already in a DataFrame and produces a DataFrame as the result
Original answer:
Just for fun, here's a completely different approach using groupby. Granted, it's not the prettiest, but it should be fairly quick.
from itertools import groupby
from operator import itemgetter
from pprint import pprint

vals = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13}
]

ordered = sorted(
    [
        (
            (row['zone'], row['weekday'], row['hour']),
            row['C1'], row['C2']
        )
        for row in vals
    ]
)

def invert_columns(grp):
    return zip(*[g_row[1:] for g_row in grp])

merged = [
    {
        'zone': key[0],
        'weekday': key[1],
        'hour': key[2],
        **dict(
            zip(["C1", "C2"], [sum(col) for col in invert_columns(grp)])
        )
    }
    for key, grp in groupby(ordered, itemgetter(0))
]

pprint(merged)
which yields
[{'C1': 10, 'C2': 27, 'hour': 12, 'weekday': 1, 'zone': 'A'},
{'C1': 5, 'C2': 27, 'hour': 6, 'weekday': 2, 'zone': 'B'},
{'C1': 2, 'C2': 13, 'hour': 8, 'weekday': 5, 'zone': 'C'}]

Creating a complex nested dictionary from multiple lists in Python

I am struggling to create a nested dictionary with the following data:
Team, Group, ID, Score, Difficulty
OneTeam, A, 0, 0.25, 4
TwoTeam, A, 1, 1, 10
ThreeTeam, A, 2, 0.64, 5
FourTeam, A, 3, 0.93, 6
FiveTeam, B, 4, 0.5, 7
SixTeam, B, 5, 0.3, 8
SevenTeam, B, 6, 0.23, 9
EightTeam, B, 7, 1.2, 4
Once imported as a Pandas DataFrame, I turn each feature into a list: teams, group, id, score, diff.
Using this stack overflow answer Create a complex dictionary using multiple lists I can create the following dictionary:
{'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2},
'FiveTeam': {'diff': 7, 'id': 4, 'score': 0.5},
'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93},
'OneTeam': {'diff': 4, 'id': 0, 'score': 0.25},
'SevenTeam': {'diff': 9, 'id': 6, 'score': 0.23},
'SixTeam': {'diff': 8, 'id': 5, 'score': 0.3},
'ThreeTeam': {'diff': 5, 'id': 2, 'score': 0.64},
'TwoTeam': {'diff': 10, 'id': 1, 'score': 1.0}}
using the code:
{team: {'id': i, 'score': s, 'diff': d} for team, i, s, d in zip(teams, id, score, diff)}
But what I'm after is having 'Group' as the main key, then team, and then id, score and difficulty within the team (as above).
I have tried:
{g: {team: {'id': i, 'score': s, 'diff': d}} for g, team, i, s, d in zip(group, teams, id, score, diff)}
but this doesn't work and results in only one team per group within the dictionary:
{'A': {'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93}},
'B': {'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2}}}
Below is how the dictionary should look, but I'm not sure how to get there - any help would be much appreciated!
{'A': {'EightTeam': {'diff': 4, 'id': 7, 'score': 1.2},
'FiveTeam': {'diff': 7, 'id': 4, 'score': 0.5},
'FourTeam': {'diff': 6, 'id': 3, 'score': 0.93},
'OneTeam': {'diff': 4, 'id': 0, 'score': 0.25}},
'B': {'SevenTeam': {'diff': 9, 'id': 6, 'score': 0.23},
'SixTeam': {'diff': 8, 'id': 5, 'score': 0.3},
'ThreeTeam': {'diff': 5, 'id': 2, 'score': 0.64},
'TwoTeam': {'diff': 10, 'id': 1, 'score': 1.0}}}
A dict comprehension may not be the best way of solving this if your data is stored in a table like this.
Try something like
from collections import defaultdict

groups = defaultdict(dict)
for g, team, i, s, d in zip(group, teams, id, score, diff):
    groups[g][team] = {'id': i, 'score': s, 'diff': d}
By using defaultdict, if groups[g] already exists, the new team is added as a key; if it doesn't, an empty dict is automatically created, and the new team is then inserted into it.
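As a quick sanity check (a sketch, assuming the lists were built from the question's table), converting the defaultdict back to a plain dict shows the requested shape:

print(dict(groups))
# {'A': {'OneTeam': {'id': 0, 'score': 0.25, 'diff': 4},
#        'TwoTeam': {'id': 1, 'score': 1, 'diff': 10},
#        ...},
#  'B': {'FiveTeam': {'id': 4, 'score': 0.5, 'diff': 7},
#        ...}}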
Edit: you edited your question to say that your data is in a pandas dataframe. You can definitely skip the step of turning the columns into lists. Instead you could, for example, do:
from collections import defaultdict

groups = defaultdict(dict)
for row in df.itertuples():
    groups[row.Group][row.Team] = {'id': row.ID, 'score': row.Score, 'diff': row.Difficulty}
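If you prefer to stay entirely in pandas, a groupby plus to_dict('index') should produce the same nested shape. This is a sketch assuming the column names from the question:

nested = {
    grp: sub.set_index('Team')[['ID', 'Score', 'Difficulty']]
            .rename(columns={'ID': 'id', 'Score': 'score', 'Difficulty': 'diff'})
            .to_dict('index')
    for grp, sub in df.groupby('Group')
}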
If you absolutely want to use comprehension, then this should work:
z = list(zip(teams, group, id, score, diff))  # materialized: a bare zip iterator would be exhausted after the first group
s = set(group)
d = {  # outer dict, one entry for each different group
    group: {  # inner dict, one entry per team, filtered for group
        team: {'id': i, 'score': s, 'diff': d}
        for team, g, i, s, d in z
        if g == group
    }
    for group in s
}
I added linebreaks for clarity
EDIT:
After the comment, to better clarify my intention and out of curiosity, I ran a comparison:
from collections import defaultdict
import timeit

teams = ['OneTeam', 'TwoTeam', 'ThreeTeam', 'FourTeam', 'FiveTeam', 'SixTeam', 'SevenTeam', 'EightTeam']
group = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
id = [0, 1, 2, 3, 4, 5, 6, 7]
score = [0.25, 1, 0.64, 0.93, 0.5, 0.3, 0.23, 1.2]
diff = [4, 10, 5, 6, 7, 8, 9, 4]

def no_comprehension():
    global group, teams, id, score, diff
    groups = defaultdict(dict)
    for g, team, i, s, d in zip(group, teams, id, score, diff):
        groups[g][team] = {'id': i, 'score': s, 'diff': d}

def comprehension():
    global group, teams, id, score, diff
    z = list(zip(teams, group, id, score, diff))  # list, so it can be iterated once per group
    s = set(group)
    d = {group: {team: {'id': i, 'score': s, 'diff': d} for team, g, i, s, d in z if g == group} for group in s}

print("no comprehension:")
print(timeit.timeit(lambda: no_comprehension(), number=10000))
print("comprehension:")
print(timeit.timeit(lambda: comprehension(), number=10000))
executable version
Output:
no comprehension:
0.027287796139717102
comprehension:
0.028979241847991943
They do look the same in terms of performance. With my sentence above, I was just highlighting this as an alternative solution to the one already posted by @JohnO.

Convert redundant array to dict (or JSON)?

Suppose I have an array:
[['a', 10, 1, 0.1],
['a', 10, 2, 0.2],
['a', 20, 2, 0.3],
['b', 10, 1, 0.4],
['b', 20, 2, 0.5]]
And I want a dict (or JSON):
{
    'a': {
        10: {1: 0.1, 2: 0.2},
        20: {2: 0.3}
    },
    'b': {
        10: {1: 0.4},
        20: {2: 0.5}
    }
}
Is there any good way or some library for this task?
In this example the array is just 4-column, but my original array is more complicated (7-column).
Currently I implement this naively:
import pandas as pd

df = pd.DataFrame(array)
grouped1 = df.groupby('column1')
for column1 in grouped1.groups:
    group1 = grouped1.get_group(column1)
    grouped2 = group1.groupby('column2')
    for column2 in grouped2.groups:
        group2 = grouped2.get_group(column2)
        ...
And the defaultdict way (note that defaultdict factories take no arguments):
d = defaultdict(lambda: defaultdict(lambda: defaultdict(...)))
for row in array:
    d[row[0]][row[1]][row[2]]... = row[-1]
But I think neither is smart.
I would suggest this rather simple solution:
from functools import reduce

data = [['a', 10, 1, 0.1],
        ['a', 10, 2, 0.2],
        ['a', 20, 2, 0.3],
        ['b', 10, 1, 0.4],
        ['b', 20, 2, 0.5]]

result = dict()
for row in data:
    # walk (and create, via setdefault) a nested dict for every column except
    # the last two, then assign the value to the innermost key
    reduce(lambda v, k: v.setdefault(k, {}), row[:-2], result)[row[-2]] = row[-1]

print(result)
# {'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
An actual recursive solution would be something like this:
def add_to_group(keys: list, group: dict):
    if len(keys) == 2:
        group[keys[0]] = keys[1]
    else:
        add_to_group(keys[1:], group.setdefault(keys[0], dict()))

result = dict()
for row in data:
    add_to_group(row, result)

print(result)
Introduction
Here is a recursive solution. The base case is when you have a list of 2-element lists (or tuples), in which case, the dict will do what we want:
>>> dict([(1, 0.1), (2, 0.2)])
{1: 0.1, 2: 0.2}
For other cases, we will remove the first column and recurse down until we get to the base case.
The code:
from itertools import groupby

def rows2dict(rows):
    if len(rows[0]) == 2:
        # e.g. [(1, 0.1), (2, 0.2)] ==> {1: 0.1, 2: 0.2}
        return dict(rows)
    else:
        dict_object = dict()
        for column1, grouped_rows in groupby(rows, lambda x: x[0]):
            rows_without_first_column = [x[1:] for x in grouped_rows]
            dict_object[column1] = rows2dict(rows_without_first_column)
        return dict_object
if __name__ == '__main__':
    rows = [['a', 10, 1, 0.1],
            ['a', 10, 2, 0.2],
            ['a', 20, 2, 0.3],
            ['b', 10, 1, 0.4],
            ['b', 20, 2, 0.5]]
    dict_object = rows2dict(rows)
    print(dict_object)
Output
{'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
Notes
We use the itertools.groupby generator to simplify grouping of similar rows based on the first column
For each group of rows, we remove the first column and recurse down
This solution assumes that the rows variable has 2 or more columns. The result is unpreditable for rows which has 0 or 1 column.
