I have some data stored in a list that I would like to group based on a value.
For example, if my data is
data = [(1, 'a'), (2, 'x'), (1, 'b')]
and I want to group it by the first value in each tuple to get
result = [(1, 'ab'), (2, 'x')]
how would I go about it?
More generally, what's the recommended way to group data in Python? Is there a recipe that can help me?
The go-to data structure to use for all kinds of grouping is the dict. The idea is to use something that uniquely identifies a group as the dict's keys, and store all values that belong to the same group under the same key.
As an example, your data could be stored in a dict like this:
{1: ['a', 'b'],
2: ['x']}
The integer that you're using to group the values is used as the dict key, and the values are aggregated in a list.
We use a dict because it maps keys to values in (on average) constant O(1) time, which makes the grouping process both efficient and easy. The general structure of the code is always the same for all kinds of grouping tasks: you iterate over your data and gradually fill a dict with grouped values. Using a defaultdict instead of a regular dict makes the whole process even easier, because we don't have to worry about initializing the dict with empty lists.
import collections
groupdict = collections.defaultdict(list)
for value in data:
group = value[0]
value = value[1]
groupdict[group].append(value)
# result:
# {1: ['a', 'b'],
# 2: ['x']}
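For comparison, here is a minimal sketch of the same loop with a plain dict and dict.setdefault, which is the boilerplate that defaultdict saves you from writing:
groupdict = {}
for value in data:
    group = value[0]
    value = value[1]
    # setdefault creates the empty list the first time a group is seen
    groupdict.setdefault(group, []).append(value)
# result:
# {1: ['a', 'b'],
#  2: ['x']}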
Once the data is grouped, all that's left is to convert the dict to your desired output format:
result = [(key, ''.join(values)) for key, values in groupdict.items()]
# result: [(1, 'ab'), (2, 'x')]
The Grouping Recipe
The following section will provide recipes for different kinds of inputs and outputs, and show how to group by various things. The basis for everything is the following snippet:
import collections
groupdict = collections.defaultdict(list)
for value in data: # input
group = ??? # group identifier
value = ??? # value to add to the group
groupdict[group].append(value)
result = groupdict # output
Each of the commented lines can, or must, be customized depending on your use case.
Input
The format of your input data dictates how you iterate over it.
In this section, we're customizing the for value in data: line of the recipe.
A list of values
More often than not, all the values are stored in a flat list:
data = [value1, value2, value3, ...]
In this case we simply iterate over the list with a for loop:
for value in data:
Multiple lists
If you have multiple lists with each list holding the value of a different attribute like
firstnames = [firstname1, firstname2, ...]
middlenames = [middlename1, middlename2, ...]
lastnames = [lastname1, lastname2, ...]
use the zip function to iterate over all lists simultaneously:
for value in zip(firstnames, middlenames, lastnames):
This will make value a tuple of (firstname, middlename, lastname).
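Putting the pieces together, a short sketch (the name lists are made up for illustration) that groups people by the first letter of their last name:
import collections

firstnames = ['Ada', 'Alan', 'Grace']
lastnames = ['Lovelace', 'Turing', 'Hopper']

groupdict = collections.defaultdict(list)
for value in zip(firstnames, lastnames):
    group = value[1][0]  # first letter of the last name
    groupdict[group].append(value)
# groupdict: {'L': [('Ada', 'Lovelace')], 'T': [('Alan', 'Turing')], 'H': [('Grace', 'Hopper')]}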
Multiple dicts or a list of dicts
If you want to combine multiple dicts like
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 5}
First put them all in a list:
dicts = [dict1, dict2]
And then use two nested loops to iterate over all (key, value) pairs:
for dict_ in dicts:
for value in dict_.items():
In this case, the value variable will take the form of a 2-element tuple like ('a', 1) or ('b', 2).
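Putting it together, a small sketch that collects the values of both dicts from above under their shared keys:
import collections

dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 5}
dicts = [dict1, dict2]

groupdict = collections.defaultdict(list)
for dict_ in dicts:
    for value in dict_.items():
        group = value[0]  # the key, e.g. 'a' or 'b'
        value = value[1]  # the value, e.g. 1 or 5
        groupdict[group].append(value)
# groupdict: {'a': [1], 'b': [2, 5]}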
Grouping
Here we'll cover various ways to extract group identifiers from your data.
In this section, we're customizing the group = ??? line of the recipe.
Grouping by a list/tuple/dict element
If your values are lists or tuples like (attr1, attr2, attr3, ...) and you want to group them by the nth element:
group = value[n]
The syntax is the same for dicts, so if you have values like {'firstname': 'foo', 'lastname': 'bar'} and you want to group by the first name:
group = value['firstname']
Grouping by an attribute
If your values are objects like datetime.date(2018, 5, 27) and you want to group them by an attribute, like year:
group = value.year
Grouping by a key function
Sometimes you have a function that returns a value's group when it's called. For example, you could use the len function to group values by their length:
group = len(value)
Grouping by multiple values
If you wish to group your data by more than a single value, you can use a tuple as the group identifier. For example, to group strings by their first letter and their length:
group = (value[0], len(value))
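As a quick sketch (the word list is invented for this example):
import collections

words = ['apple', 'avocado', 'ant', 'berry', 'bean']

groupdict = collections.defaultdict(list)
for value in words:
    group = (value[0], len(value))  # e.g. ('a', 5) for 'apple'
    groupdict[group].append(value)
# groupdict: {('a', 5): ['apple'], ('a', 7): ['avocado'], ('a', 3): ['ant'],
#             ('b', 5): ['berry'], ('b', 4): ['bean']}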
Grouping by something unhashable
Because dict keys must be hashable, you will run into problems if you try to group by something that can't be hashed. In such a case, you have to find a way to convert the unhashable value to a hashable representation.
sets: Convert sets to frozensets, which are hashable:
group = frozenset(group)
dicts: Dicts can be represented as sorted (key, value) tuples:
group = tuple(sorted(group.items()))
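For example, a small sketch (the records are invented for illustration) that groups amounts by a dict-valued location:
import collections

records = [
    ({'city': 'Berlin', 'country': 'DE'}, 10),
    ({'city': 'Berlin', 'country': 'DE'}, 20),
    ({'city': 'Paris', 'country': 'FR'}, 30),
]

groupdict = collections.defaultdict(list)
for location, amount in records:
    # dicts aren't hashable, so use a sorted tuple of their items as the key
    group = tuple(sorted(location.items()))
    groupdict[group].append(amount)
# groupdict: {(('city', 'Berlin'), ('country', 'DE')): [10, 20],
#             (('city', 'Paris'), ('country', 'FR')): [30]}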
Modifying the aggregated values
Sometimes you will want to modify the values you're grouping. For example, if you're grouping tuples like (1, 'a') and (1, 'b') by the first element, you might want to remove the first element from each tuple to get a result like {1: ['a', 'b']} rather than {1: [(1, 'a'), (1, 'b')]}.
In this section, we're customizing the value = ??? line of the recipe.
No change
If you don't want to change the value in any way, simply delete the value = ??? line from your code.
Keeping only a single list/tuple/dict element
If your values are lists like [1, 'a'] and you only want to keep the 'a':
value = value[1]
Or if they're dicts like {'firstname': 'foo', 'lastname': 'bar'} and you only want to keep the first name:
value = value['firstname']
Removing the first list/tuple element
If your values are lists like [1, 'a', 'foo'] and [1, 'b', 'bar'] and you want to discard the first element of each list to get a group like [['a', 'foo'], ['b', 'bar']], use slicing syntax:
value = value[1:]
Removing/Keeping arbitrary list/tuple/dict elements
If your values are lists like ['foo', 'bar', 'baz'] or dicts like {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'} and you want to delete or keep only some of these elements, start by creating a set of elements you want to keep or delete. For example:
indices_to_keep = {0, 2}
keys_to_delete = {'firstname', 'middlename'}
Then choose the appropriate snippet from this list:
To keep list elements: value = [val for i, val in enumerate(value) if i in indices_to_keep]
To delete list elements: value = [val for i, val in enumerate(value) if i not in indices_to_delete]
To keep dict elements: value = {key: val for key, val in value.items() if key in keys_to_keep}
To delete dict elements: value = {key: val for key, val in value.items() if key not in keys_to_delete}
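For instance, a small sketch (the person dict is made up for illustration) that keeps only two of the keys:
keys_to_keep = {'firstname', 'lastname'}
value = {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'}
value = {key: val for key, val in value.items() if key in keys_to_keep}
# value: {'firstname': 'foo', 'lastname': 'baz'}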
Output
Once the grouping is complete, we have a defaultdict filled with lists. But the desired result isn't always a (default)dict.
In this section, we're customizing the result = groupdict line of the recipe.
A regular dict
To convert the defaultdict to a regular dict, simply call the dict constructor on it:
result = dict(groupdict)
A list of (group, value) pairs
To get a result like [(group1, value1), (group1, value2), (group2, value3)] from the dict {group1: [value1, value2], group2: [value3]}, use a list comprehension:
result = [(group, value) for group, values in groupdict.items()
for value in values]
A nested list of just values
To get a result like [[value1, value2], [value3]] from the dict {group1: [value1, value2], group2: [value3]}, use dict.values:
result = list(groupdict.values())
A flat list of just values
To get a result like [value1, value2, value3] from the dict {group1: [value1, value2], group2: [value3]}, flatten the dict with a list comprehension:
result = [value for values in groupdict.values() for value in values]
Flattening iterable values
If your values are lists or other iterables like
groupdict = {group1: [[list1_value1, list1_value2], [list2_value1]]}
and you want a flattened result like
result = {group1: [list1_value1, list1_value2, list2_value1]}
you have two options:
Flatten the lists with a dict comprehension:
result = {group: [x for iterable in values for x in iterable]
for group, values in groupdict.items()}
Avoid creating a list of iterables in the first place, by using list.extend instead of list.append. In other words, change
groupdict[group].append(value)
to
groupdict[group].extend(value)
And then just set result = groupdict.
A sorted list
Before Python 3.7, dicts made no guarantee about iteration order; since 3.7 they preserve insertion order, but that is often still not the order you want. If you don't care about the order, you can use the recipes shown above. But if you do care about the order, you have to sort the output accordingly.
I'll use the following dict to demonstrate how to sort your output in various ways:
groupdict = {'abc': [1], 'xy': [2, 5]}
Keep in mind that this is a bit of a meta-recipe that may need to be combined with other parts of this answer to get exactly the output you want. The general idea is to sort the dictionary keys before using them to extract the values from the dict:
groups = sorted(groupdict.keys())
# groups = ['abc', 'xy']
Keep in mind that sorted accepts a key function in case you want to customize the sort order. For example, if the dict keys are strings and you want to sort them by length:
groups = sorted(groupdict.keys(), key=len)
# groups = ['xy', 'abc']
Once you've sorted the keys, use them to extract the values from the dict in the correct order:
# groups = ['abc', 'xy']
result = [groupdict[group] for group in groups]
# result = [[1], [2, 5]]
Remember that this can be combined with other parts of this answer to get different kinds of output. For example, if you want to keep the group identifiers:
# groups = ['abc', 'xy']
result = [(group, groupdict[group]) for group in groups]
# result = [('abc', [1]), ('xy', [2, 5])]
For your convenience, here are some commonly used sort orders:
Sort by number of values per group:
groups = sorted(groupdict.keys(), key=lambda group: len(groupdict[group]))
result = [groupdict[group] for group in groups]
# result = [[2, 5], [1]]
Counting the number of values in each group
To count the number of elements associated with each group, use the len function:
result = {group: len(values) for group, values in groupdict.items()}
If you want to count the number of distinct elements, use set to eliminate duplicates:
result = {group: len(set(values)) for group, values in groupdict.items()}
An example
To demonstrate how to piece together a working solution from this recipe, let's try to turn an input of
data = [["A",0], ["B",1], ["C",0], ["D",2], ["E",2]]
into
result = [["A", "C"], ["B"], ["D", "E"]]
In other words, we're grouping lists by their 2nd element.
The first two lines of the recipe are always the same, so let's start by copying those:
import collections
groupdict = collections.defaultdict(list)
Now we have to find out how to loop over the input. Since our input is a simple list of values, a normal for loop will suffice:
for value in data:
Next we have to extract the group identifier from the value. We're grouping by the 2nd list element, so we use indexing:
group = value[1]
The next step is to transform the value. Since we only want to keep the first element of each list, we once again use list indexing:
value = value[0]
Finally, we have to figure out how to turn the dict we generated into a list. What we want is a list of values, without the groups. We consult the Output section of the recipe to find the appropriate dict flattening snippet:
result = list(groupdict.values())
Et voilĂ :
data = [["A",0], ["B",1], ["C",0], ["D",2], ["E",2]]
import collections
groupdict = collections.defaultdict(list)
for value in data:
group = value[1]
value = value[0]
groupdict[group].append(value)
result = list(groupdict.values())
# result: [["A", "C"], ["B"], ["D", "E"]]
itertools.groupby
There is a general-purpose grouping tool in itertools: groupby().
A schema of this recipe can be given in this form:
[(k, aggregate(g)) for k, g in groupby(sorted(data, key=extractKey), extractKey)]
The two relevant parts to change in the recipe are:
define the grouping key (extractKey): in this case getting the first item of the tuple:
lambda x: x[0]
aggregate grouped results (if needed) (aggregate): g contains all the matching tuples for each key k (e.g. (1, 'a') and (1, 'b') for key 1, and (2, 'x') for key 2); here we want to take only the second item of each tuple and concatenate them all into one string:
''.join(x[1] for x in g)
Example:
from itertools import groupby
extractKey = lambda x: x[0]
aggregate = lambda g: ''.join(x[1] for x in g)
[(k, aggregate(g)) for k, g in groupby(sorted(data, key=extractKey), extractKey)]
# [(1, 'ab'), (2, 'x')]
Sometimes, extractKey, aggregate, or both can be inlined into a one-liner (we can also omit the sort key here, since sorting the tuples directly already orders them by their first element):
[(k, ''.join(x[1] for x in g)) for k, g in groupby(sorted(data), lambda x: x[0])]
# [(1, 'ab'), (2, 'x')]
Pros and cons
Comparing this recipe with the recipe using defaultdict there are pros and cons in both cases.
groupby() tends to be slower (about twice as slow in my tests) than the defaultdict recipe.
On the other hand, groupby() has an advantage in memory-constrained cases where the values are produced on the fly (and already arrive sorted by key): you can process the groups in a streaming fashion without storing them all, whereas defaultdict requires enough memory to hold the entire grouped result.
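As a rough sketch of that streaming case (assuming the values already arrive sorted, or at least grouped, by key, which groupby requires; the generator below is a made-up stand-in for such a source):
from itertools import groupby

def read_sorted_records():
    # stand-in for a source that yields (key, payload) pairs in key order
    yield (1, 'a')
    yield (1, 'b')
    yield (2, 'x')

for key, group in groupby(read_sorted_records(), key=lambda record: record[0]):
    # each group is consumed and discarded without holding the others in memory
    print(key, ''.join(payload for _, payload in group))
# 1 ab
# 2 x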
Pandas groupby
This isn't a recipe as such, but an intuitive and flexible way to group data using a function. In this case, the function is str.join.
import pandas as pd
data = [(1, 'a'), (2, 'x'), (1, 'b')]
# create dataframe from list of tuples
df = pd.DataFrame(data)
# group by first item and apply str.join
grp = df.groupby(0)[1].apply(''.join)
# create list of tuples from index and value
res = list(zip(grp.index, grp))
print(res)
[(1, 'ab'), (2, 'x')]
Advantages
Fits nicely into workflows that only require list output at the end of a sequence of vectorisable steps.
Easily adaptable by changing ''.join to list or another reducing function (see the sketch below).
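For example, a sketch of the same pipeline collecting lists instead of joined strings (reusing the df defined above):
grp = df.groupby(0)[1].apply(list)
res = list(zip(grp.index, grp))
# [(1, ['a', 'b']), (2, ['x'])]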
Disadvantages
Overkill for an isolated task: requires list -> pd.DataFrame -> list conversion.
Introduces dependency on a 3rd party library.
Multiple-parse list comprehension
This is inefficient compared to the dict and groupby solutions.
However, for small lists where performance is not a concern, you can use a list comprehension that re-scans the list once for each unique identifier.
res = [(i, ''.join([j[1] for j in data if j[0] == i]))
for i in set(list(zip(*data))[0])]
[(1, 'ab'), (2, 'x')]
The solution can be split into 2 parts:
set(list(zip(*data))[0]) extracts the unique set of identifiers which we iterate via a for loop within the list comprehension.
(i, ''.join([j[1] for j in data if j[0] == i])) applies the logic we require for the desired output.
Related
I have a list of tuples which look like below
entities = [('tmp', 'warm'), ('loc', 'blr'), ('cap', 'blr'), ('aps', 'yes'), ('date', '12-10-2018')]
I want to store those tuples which have the same second values. As you can see, the tuples ('loc', 'blr') and ('cap', 'blr') have the same second value.
I want these two tuples to be stored in a list for me to refer.
This is what I tried but it doesn't work as expected
duplicate = []
for i in range(len(entities)):
for j in range(1, len(entities)):
if entities[i][1] == entities[j][1]:
duplicate.append([entities[i][1], entities[j][1]])
break
But I get all the tuples, as if every tuple had the same second value. How can I accomplish this?
Desired output
('loc', 'blr'), ('cap', 'blr')
You could group together lists with common second elements in the following way:
from itertools import groupby

s = sorted(entities, key=lambda x: x[1])
[list(v) for k, v in groupby(s, key=lambda x: x[1])]
[[('date', '12-10-2018')],
[('loc', 'blr'), ('cap', 'blr')],
[('tmp', 'warm')],
[('aps', 'yes')]]
If performance is an issue consider using operator.itemgetter:
from operator import itemgetter
s = sorted(entities, key = itemgetter(1))
[list(v) for k,v in groupby(s, key = itemgetter(1))]
[[('date', '12-10-2018')],
[('loc', 'blr'), ('cap', 'blr')],
[('tmp', 'warm')],
[('aps', 'yes')]]
Now, if you only want to keep the cases where two or more tuples share a second element, assign the grouped list above to a variable, say l, and filter it:
l = [list(v) for k, v in groupby(s, key=itemgetter(1))]
[i for i in l if len(i) > 1]
[[('loc', 'blr'), ('cap', 'blr')]]
I proposed this answer because it extends naturally to n tuples sharing a common second element, since you may have more than 2.
You can use O(n log n) itertools.groupby (requires pre-sorting your input list), but O(n) collections.Counter is sufficient:
from collections import Counter
from operator import itemgetter
# construct dictionary mapping second value to count
c = Counter(map(itemgetter(1), entities))
# filter for second values with count > 1
dups = {value for value, count in c.items() if count > 1}
# filter entities with second value in dups
res = [entity for entity in entities if entity[1] in dups]
print(res)
# [('loc', 'blr'), ('cap', 'blr')]
You can group these tuples using a plain dict for the more general case:
grouped = {}
for k, v in entities:
    # group each tuple under its second value
    grouped.setdefault(v, []).append((k, v))
for _, tuples in grouped.items():
    if len(tuples) > 1:
        print(tuples)
All tuples that share a second value end up grouped under the same key, and each distinct second value gets its own key.
I want to sort a list of dictionaries based on the presence of keys. Let's say I have a list of keys [key2, key3, key1]; I need to order the list in such a way that dictionaries with key2 come first, those with key3 come second, and those with key1 come last.
I saw this answer (Sort python list of dictionaries by key if key exists) but it refers to only one key
The sorting is not based on the value stored under a key; it depends only on the presence of the keys, and on a predefined list of keys.
Just use sorted with a list like [key1 in dict, key2 in dict, ...] as the key to sort by. Remember to reverse the result, since True (i.e. the key is in the dict) sorts after False.
>>> dicts = [{1:2, 3:4}, {3:4}, {5:6, 7:8}]
>>> keys = [5, 3, 1]
>>> sorted(dicts, key=lambda d: [k in d for k in keys], reverse=True)
[{5: 6, 7: 8}, {1: 2, 3: 4}, {3: 4}]
This uses all the keys to break ties; i.e. in the above example, there are two dicts that have the key 3, but one also has the key 1, so it sorts second.
I'd do this with:
sorted_list = sorted(dict_list, key = lambda d: next((i for (i, k) in enumerate(key_list) if k in d), len(key_list) + 1))
That uses a generator expression to find the index in the key list of the first key that's in each dictionary, then uses that index as the sort key; dicts that contain none of the keys get len(key_list) + 1 as their sort key, so they're sorted to the end.
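For example, with a hypothetical dict_list and key_list (the names and data here are assumptions for illustration):
dict_list = [{'key1': 1}, {'key3': 3}, {'other': 0}]
key_list = ['key2', 'key3', 'key1']

sorted_list = sorted(dict_list,
                     key=lambda d: next((i for (i, k) in enumerate(key_list) if k in d),
                                        len(key_list) + 1))
# sorted_list: [{'key3': 3}, {'key1': 1}, {'other': 0}]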
How about something like this
def sort_key(dict_item, sort_list):
    key_idx = [sort_list.index(key) for key in dict_item.keys() if key in sort_list]
if not key_idx:
return len(sort_list)
return min(key_idx)
dict_list.sort(key=lambda x: sort_key(x, sort_list))
If a given dictionary in the list contains more than one of the keys in the sorting list, it will use the one with the lowest index. If a dictionary contains none of the keys from the sorting list, it is sent to the end of the result.
Dictionaries that contain the same "best" key (i.e. lowest index) are considered equal in terms of order. If this is a problem, it wouldn't be too hard to have the sort_key function consider all the keys rather than just the best.
To do that, simply return the whole key_idx (sorted) instead of min(key_idx), and return [len(sort_list)] instead of len(sort_list) so that the sort keys remain comparable, as sketched below.
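A minimal sketch of that variant, assuming the same dict_list and sort_list as above; key_idx is sorted so that the lists compare by their best (lowest) index first:
def sort_key_all(dict_item, sort_list):
    key_idx = [sort_list.index(key) for key in dict_item.keys() if key in sort_list]
    if not key_idx:
        # dicts with none of the keys sort to the end
        return [len(sort_list)]
    return sorted(key_idx)

dict_list.sort(key=lambda x: sort_key_all(x, sort_list))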
It is a bit hard for me to explain it in words, so I'll show an example:
What I have (data is a dict instance):
data = {'a':[4,5,3], 'b':[1,0,2], 'c':[6,7,8]}
What I need (ordered_data is an OrderedDict instance):
ordered_data = {'b': [0, 1, 2], 'a': [3, 4, 5], 'c': [6, 7, 8]}
The order of the keys should follow the order of the items in the (sorted) nested lists.
from collections import OrderedDict

tmp = {k: sorted(v) for k, v in data.items()}
ordered_data = OrderedDict((k, v) for k, v in sorted(tmp.items(), key=lambda i: i[1]))
First sort the values. If you don't need the original data, it's OK to do this in place, but I made a temporary variable.
key is a function that returns a key to be sorted on. In this case, the key is the second element of the item tuple (the list), and since lists are comparable, that's good enough.
You can build an OrderedDict by sorting both the items and the values:
>>> from operator import itemgetter
>>> from collections import OrderedDict
>>> d = OrderedDict(sorted([(k, sorted(v)) for k, v in data.items()], key=itemgetter(1)))
>>> d
OrderedDict([('b', [0, 1, 2]), ('a', [3, 4, 5]), ('c', [6, 7, 8])])
Usually, you should not worry about the data order in the dictionary itself; instead, just order it when you retrieve the dictionary's contents (i.e. when you iterate over it):
data = {'a':[4,5,3], 'b':[1,0,2], 'c':[6,7,8]}
for datum in sorted(data.items(), key=lambda item: item[1]):
...
Say I have a list nested within a key of a dict. So something like this:
d = {'people':['John', 'Carry', 'Joe', 'Greg', 'Carl', 'Gene']}
And I want to compare the people in the list with each other so that I can make a graph connecting names that start with the same first letter.
I came up with a nested for loop to try and solve this:
for subject in d.keys():
    for word1 in d[subject]:
        for word2 in d[subject]:
            if word1[0] == word2[0]:
                g.connectThem(word1, word2)
But the nested for loop is redundant, since it makes every comparison twice (and compares each name with itself). Is there any way to avoid the redundant comparisons?
You can iterate through pairs using itertools.combinations
import itertools

for pair in itertools.combinations(d['people'], 2):
    first, second = pair
    if first[0] == second[0]:
        g.connectThem(first, second)
These are the pairs that are produced from combinations
[('John', 'Carry'), ('John', 'Joe'), ('John', 'Greg'), ('John', 'Carl'), ('John', 'Gene'),
('Carry', 'Joe'), ('Carry', 'Greg'), ('Carry', 'Carl'), ('Carry', 'Gene'),
('Joe', 'Greg'), ('Joe', 'Carl'), ('Joe', 'Gene'),
('Greg', 'Carl'), ('Greg', 'Gene'),
('Carl', 'Gene')]
Notice that you do not get repeats (the same pair with its order reversed).
Assuming your connectThem function works, this should produce your desired behavior.
If you want to compare the people in the list with each other so that you can make a graph connecting names that start with the same first letter, then use a dict and a single pass over d["people"], with the first letter of each name as the key. That solution is O(n) and markedly more efficient than the quadratic approach of generating all combinations, which creates mostly unnecessary pairings:
d = {"people":['John', 'Carry', 'Joe', 'Greg', 'Carl', 'Gene']}
from collections import defaultdict
my_d = defaultdict(list)
for v in d["people"]:
my_d[v[0]].append(v)
print(my_d)
defaultdict(<class 'list'>, {'J': ['John', 'Joe'], 'C': ['Carry', 'Carl'], 'G': ['Greg', 'Gene']})
You can now pass complete lists of names with common first names to a method to add to the graph by just iterating over the values of my_d.
If some groups can contain more than two names, generating combinations from each group (rather than from the original list) saves making many unnecessary pairings: it only combines names that actually share a first letter.
So, to handle duplicate names, only make combinations from names with a common first letter, and only consider groups that actually have links, i.e. names with non-unique first letters:
from collections import defaultdict
from itertools import combinations

# store all names in groups, keyed by the first letter of each name
my_d = defaultdict(set)
for v in d["people"]:
    # adding to a set automatically ignores duplicate names
    my_d[v[0]].add(v)

for group in my_d.values():
    # names with a unique first letter have nothing to connect
    if len(group) > 1:
        for n1, n2 in combinations(group, 2):
            g.connectThem(n1, n2)
Without using itertools at all: because our linear pass creates a dict of groupings, we can simply loop over each group in our dict and create the unique pairings ourselves:
for group in my_d.values():
    group = sorted(group)  # sets can't be sliced, so make an ordered list first
    for ind, n1 in enumerate(group):
        for n2 in group[ind + 1:]:
            print((n1, n2))
('Joe', 'John')
('Carl', 'Carry')
('Gene', 'Greg')
I want to copy pairs from this dictionary based on their values so they can be assigned to new variables. From my research it seems easy to do this based on keys, but in my case the values are what I'm tracking.
things = ({'alpha': 1, 'beta': 2, 'cheese': 3, 'delta': 4})
And in made-up language I can assign variables like so -
smaller_things = all values =3 in things
You can use .items() to iterate over the pairs and pick out the ones that match, like this:
smaller_things = {}
for k, v in things.items():
if v == 3:
smaller_things[k] = v
If you want a one liner and only need the keys back, list comprehension will do it:
smaller_things = [k for k, v in things.items() if v == 3]
>>> things = { 'a': 3, 'b': 2, 'c': 3 }
>>> [k for k, v in things.items() if v == 3]
['a', 'c']
You can just reverse the dictionary and pull from that:
keys_values = { 1:"a", 2:"b"}
values_keys = dict(zip(keys_values.values(), keys_values.keys()))
print(values_keys)
# {'a': 1, 'b': 2}
That way you can do whatever you need to with standard dictionary syntax.
The potential drawback is non-unique values in the original dictionary: items with the same value will collide on the same key in the reversed dictionary, so you can't control which of the original keys ends up as the new value. Also, some values may be unhashable (such as lists) and can't become keys at all.
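A tiny sketch of that collision (made-up data); whichever pair comes last silently wins:
keys_values = {1: 'a', 2: 'b', 3: 'a'}
values_keys = dict(zip(keys_values.values(), keys_values.keys()))
# values_keys: {'a': 3, 'b': 2} -- the original key 1 is lost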
Unless you have a compulsive need to be clever, iterating over items is easier:
for key, val in my_dict.items():
if matches_condition(val):
do_something(key)
This answer is based on my understanding of your question.
A dictionary is a kind of hash table; its main purpose is to provide non-integer indexing of values. The keys of a dictionary are like indexes.
Consider an array: its elements are addressed by their index, and we have an index for each element, not an element for each index. In the same way, a dictionary's keys (its non-integer indexes) address its values.
One further implication: dictionary values don't have to be hashable. Values may be mutable and can change at any time, while keys must be immutable (hashable).
So it is simply not a good approach to address anything by its values in a dictionary.
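A small illustration of that point (hypothetical data): values may be mutable objects, but keys must be hashable:
d = {'scores': [1, 2, 3]}  # a mutable value is fine
d['scores'].append(4)      # values can change at any time

try:
    bad = {[1, 2]: 'scores'}  # a mutable (unhashable) key is not allowed
except TypeError as exc:
    print(exc)  # unhashable type: 'list'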