Generator to merge sorted dictionary-like iterables

This is a variation on "Generator to yield gap tuples from zipped iterables".
I wish to design a generator function that:
Accepts an arbitrary number of iterables
Each input iterable yields zero or more (k, v), k not necessarily unique
Input keys are assumed to be sorted in ascending order
Output should yield (k, (v1, v2, ...))
Output keys are unique, and appear in the same order as the input
The number of output tuples is equal to the number of unique keys in the input
The output values correspond to all input tuples matching the output key
Since the inputs and outputs are potentially large, they should be treated as iterables and not loaded as an in-memory dict or list.
As an example,
i1 = ((2, 'a'), (3, 'b'), (5, 'c'))
i2 = ((1, 'd'), (2, 'e'), (3, 'f'))
i3 = ((1, 'g'), (3, 'h'), (5, 'i'), (5, 'j'))
result = sorted_merge(i1, i2, i3)
print(list(result))
This would output:
[(1, ('d', 'g')), (2, ('a', 'e')), (3, ('b', 'f', 'h')), (5, ('c', 'i', 'j'))]
If I'm not mistaken, there's nothing built into the Python standard library to do this out of the box.

While there isn't a single standard library function to do what you want, there are enough building blocks to get you most of the way:
from heapq import merge
from itertools import groupby
from operator import itemgetter
def sorted_merge(*iterables):
    for key, group in groupby(merge(*iterables), itemgetter(0)):
        yield key, [pair[1] for pair in group]
Example:
>>> i1 = ((2, 'a'), (3, 'b'), (5, 'c'))
>>> i2 = ((1, 'd'), (2, 'e'), (3, 'f'))
>>> i3 = ((1, 'g'), (3, 'h'), (5, 'i'), (5, 'j'))
>>> result = sorted_merge(i1, i2, i3)
>>> list(result)
[(1, ['d', 'g']), (2, ['a', 'e']), (3, ['b', 'f', 'h']), (5, ['c', 'i', 'j'])]
Note that the version of sorted_merge above yields (key, list) pairs for the sake of producing readable output. There's nothing to stop you changing the relevant line to
yield key, (pair[1] for pair in group)
if you want to yield (key, generator) pairs instead. In that case each generator must be consumed before the next pair is requested, because groupby shares the underlying iterator between groups.
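If the values are not mutually comparable, or if the tuple-shaped output from the question is wanted, a small variation may help. The sketch below assumes Python 3.5 or later (where heapq.merge accepts a key argument) and compares only the keys; the name sorted_merge_by_key is made up for illustration.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def sorted_merge_by_key(*iterables):
    """Yield (key, (v1, v2, ...)) from iterables of (key, value) pairs.

    Assumes every input is already sorted by key in ascending order.
    """
    first = itemgetter(0)
    # key=first makes merge compare keys only, never the values themselves.
    merged = merge(*iterables, key=first)
    for key, group in groupby(merged, key=first):
        yield key, tuple(value for _, value in group)


i1 = ((2, 'a'), (3, 'b'), (5, 'c'))
i2 = ((1, 'd'), (2, 'e'), (3, 'f'))
i3 = ((1, 'g'), (3, 'h'), (5, 'i'), (5, 'j'))
print(list(sorted_merge_by_key(i1, i2, i3)))
# [(1, ('d', 'g')), (2, ('a', 'e')), (3, ('b', 'f', 'h')), (5, ('c', 'i', 'j'))]
```

Only one pair per input is buffered by the merge, plus the values of the key currently being grouped.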

Something a little different (note that this builds the whole result in memory, so it only suits inputs that fit in memory):
from collections import defaultdict
def sorted_merged(*items):
    result = defaultdict(list)
    for t in items:
        for k, v in t:
            result[k].append(v)
    return sorted(result.items())
i1 = ((2, 'a'), (3, 'b'), (5, 'c'))
i2 = ((1, 'd'), (2, 'e'), (3, 'f'))
i3 = ((1, 'g'), (3, 'h'), (5, 'i'), (5, 'j'))
result = sorted_merged(i1, i2, i3)

Related

Convert a list of tuples to a dictionary, based on tuple values

I have a little problem, maybe dumb, but it seems I can't solve it.
I have a list of objects that have members, but let's say my list is this:
l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
I want to "gather" all elements based on a value I choose, and put them into a dictionary keyed on that value (it can be either the first or the second value of the tuple).
For example, if I want to gather the values based on the first element, I want something like that:
{1: [(1, 'a'), (1, 'b'), (1, 'c')], 2: [(2, 'a')], 3: [(3, 'a')]}
However, what I have achieved so far is this:
>>> {k:v for k,v in zip([e[0] for e in l], l)}
{1: (1, 'c'), 2: (2, 'a'), 3: (3, 'a')}
Can somebody please help me out?

My first thought would be using defaultdict(list) (efficient, in linear time), which does exactly what you were trying to do:
from collections import defaultdict
dic = defaultdict(list)
l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
for item in l:
    dic[item[0]].append(item)
Output:
defaultdict(list, {1: [(1, 'a'), (1, 'b'), (1, 'c')], 2: [(2, 'a')], 3: [(3, 'a')]})
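Since the question asks for grouping on either tuple position, the same idea can be wrapped in a small helper. This is only a sketch; the function name gather and its index parameter are assumptions for illustration, not part of the original answer.

```python
from collections import defaultdict


def gather(pairs, index=0):
    """Group tuples into a dict keyed on the element at position `index`."""
    groups = defaultdict(list)
    for item in pairs:
        groups[item[index]].append(item)
    # Return a plain dict so missing keys raise KeyError as usual.
    return dict(groups)


l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
print(gather(l, 0))
# {1: [(1, 'a'), (1, 'b'), (1, 'c')], 2: [(2, 'a')], 3: [(3, 'a')]}
print(gather(l, 1))
# {'a': [(1, 'a'), (2, 'a'), (3, 'a')], 'b': [(1, 'b')], 'c': [(1, 'c')]}
```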

Here you go:
l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
output_dict = dict()
for item in l:
    if item[0] in output_dict:
        output_dict[item[0]].append(item)
        continue
    output_dict[item[0]] = [item]
print(output_dict)

Here is a one-liner using a list comprehension (it rescans the list for every element, so it takes quadratic time):
l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
print(dict([(x[0], [y for y in l if y[0] == x[0]]) for x in l]))
Output:
{1: [(1, 'a'), (1, 'b'), (1, 'c')], 2: [(2, 'a')], 3: [(3, 'a')]}

First create a dictionary (dico) with an empty list for each key:
l = [(1, 'a'), (2, 'a'), (1, 'b'), (1, 'c'), (3, 'a')]
dico = {}
for key, _ in l:
    dico[key] = []
Then fill the dictionary:
for i in l:
    dico[i[0]].append(i)

Join strings from consecutive equal tuples

Given a list which looks like this:
[(1, 'a'), (1, 'b'), (2, 'c'), (1, 'd')]
I want to join the consecutive tuples inside the list if they have same first value, so the result looks like following:
[(1, 'ab'), (2, 'c'), (1, 'd')]
They should only be joined if they are next to each other.
If the key is None, as below, the item should be merged into the previous item.
[(1, 'a'), (1, 'b'), (None, 'e'), (2, 'c'), (1, 'd')]
The result should be:
[(1, 'abe'), (2, 'c'), (1, 'd')]

You can use itertools.groupby to group consecutive tuples with the same first value, and join the strings from the corresponding grouped tuples:
from itertools import groupby
l = [(1, 'a'), (1, 'b'), (2, 'c'), (1, 'd')]
[(k,''.join([i for _,i in v])) for k,v in groupby(l, key=lambda x:x[0])]
# [(1, 'ab'), (2, 'c'), (1, 'd')]
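The groupby version treats None as a key of its own, so it does not cover the second requirement. One possible way to handle it is a plain loop that folds an item into the previous one when its key is None or equal to the previous key. The function name join_consecutive is an assumption for illustration, and a leading None key is simply kept as is.

```python
def join_consecutive(pairs):
    """Join strings of adjacent tuples that share a key; None merges backwards."""
    result = []
    for key, text in pairs:
        if result and (key is None or key == result[-1][0]):
            # Extend the string of the previous tuple, keeping its key.
            prev_key, prev_text = result[-1]
            result[-1] = (prev_key, prev_text + text)
        else:
            result.append((key, text))
    return result


print(join_consecutive([(1, 'a'), (1, 'b'), (2, 'c'), (1, 'd')]))
# [(1, 'ab'), (2, 'c'), (1, 'd')]
print(join_consecutive([(1, 'a'), (1, 'b'), (None, 'e'), (2, 'c'), (1, 'd')]))
# [(1, 'abe'), (2, 'c'), (1, 'd')]
```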

Change `itertools.product` iteration order

Given some iterables, itertools.product iterates from back to front, trying all choices of the last iterable before advancing the second-to-last iterable, and trying all choices of the last two iterables before advancing the third-to-last iterable, etc. For instance,
>>> list(itertools.product([2,1,0],['b','c','a']))
[(2, 'b'), (2, 'c'), (2, 'a'), (1, 'b'), (1, 'c'), (1, 'a'), (0, 'b'), (0, 'c'), (0, 'a')]
I would like to iterate over the product in a different manner: tuples should be produced in order of the sum of the indices of the elements they contain, i.e., before producing a tuple whose elements' indices in their respective iterables sum to k, produce all tuples whose elements' indices sum to k-1. For example, after producing the tuple containing the first element (index 0) of every iterable, the next tuples produced should each contain the second element from a single iterable and the first from the rest; after that, the tuples produced should contain the third element from one iterable or the second element from two iterables, etc. Using the above example,
>>> my_product([2,1,0],['b','c','a'])
[(2, 'b'), # element 0 from both iterables
(2, 'c'), (1, 'b'), # elements 0,1 and 1,0 (sums to 1)
(2, 'a'), (1, 'c'), (0, 'b'), # elements 0,2 and 1,1 and 2,0 (sums to 2)
(1, 'a'), (0, 'c'), # elements 1,2 and 2,1 (sums to 3)
(0, 'a')] # elements 2,2 (sums to 4)

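I solved this with sorting:
import itertools
def my_product(*args):
    # Pair every element with its index, sort by the index sum, then drop the indices
    return [tuple(i[1] for i in p) for p in
            sorted(itertools.product(*map(enumerate, args)),
                   key=lambda x: (sum(y[0] for y in x), x))]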
Test:
>>> my_product([0,1,2],[3,4,5])
[(0, 3),
(0, 4), (1, 3),
(0, 5), (1, 4), (2, 3),
(1, 5), (2, 4),
(2, 5)]
It also works with non-sorted, non-numeric items:
>>> my_product(['s0','b1','k2'],['z3','a4','c5'])
[('s0', 'z3'),
('s0', 'a4'), ('b1', 'z3'),
('s0', 'c5'), ('b1', 'a4'), ('k2', 'z3'),
('b1', 'c5'), ('k2', 'a4'),
('k2', 'c5')]
>>> my_product([2,1,0],['b','c','a'])
[(2, 'b'),
(2, 'c'), (1, 'b'),
(2, 'a'), (1, 'c'), (0, 'b'),
(1, 'a'), (0, 'c'),
(0, 'a')]
And with more than two arguments:
>>> my_product([2,1,0],['b','c','a'],['x','y','z'])
[(2, 'b', 'x'),
(2, 'b', 'y'), (2, 'c', 'x'), (1, 'b', 'x'),
(2, 'b', 'z'), (2, 'c', 'y'), (2, 'a', 'x'), (1, 'b', 'y'), (1, 'c', 'x'), (0, 'b', 'x'),
(2, 'c', 'z'), (2, 'a', 'y'), (1, 'b', 'z'), (1, 'c', 'y'), (1, 'a', 'x'), (0, 'b', 'y'), (0, 'c', 'x'),
(2, 'a', 'z'), (1, 'c', 'z'), (1, 'a', 'y'), (0, 'b', 'z'), (0, 'c', 'y'), (0, 'a', 'x'),
(1, 'a', 'z'), (0, 'c', 'z'), (0, 'a', 'y'),
(0, 'a', 'z')]

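Given that you need to check that sum anyway, why don't you just sort? Note that this sorts by the sum of the values themselves rather than their indices, so it only gives the requested order when the values are numbers equal to their own positions (as in [0,1,2]):
import itertools
def my_product(*args):
    return sorted(itertools.product(*args), key=lambda x: (sum(x), x))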

As an alternative to sorting, this solution makes multiple passes over the result of itertools.product().
Note that it uses the same decorate-manipulate-undecorate pattern that other answers use.
import itertools
# TESTED on Python3
def my_product(*args):
    args = [list(enumerate(arg)) for arg in args]
    for sum_indexes in range(sum(len(item) for item in args)):
        for partial in itertools.product(*args):
            indexes, values = zip(*partial)
            if sum(indexes) == sum_indexes:
                yield values
assert list(my_product([2,1,0],['b','c','a'])) == [
    (2, 'b'),                      # element 0 from both iterables
    (2, 'c'), (1, 'b'),            # elements 0,1 and 1,0 (sums to 1)
    (2, 'a'), (1, 'c'), (0, 'b'),  # elements 0,2 and 1,1 and 2,0 (sums to 2)
    (1, 'a'), (0, 'c'),            # elements 1,2 and 2,1 (sums to 3)
    (0, 'a')]                      # elements 2,2 (sums to 4)
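The answers above either sort the full product or rescan it once per index sum. A possible alternative is to enumerate the index tuples directly, one sum at a time, so that each output tuple is built exactly once. This is a sketch rather than a tuned implementation: the name product_by_index_sum is made up, each input is still copied into a tuple (as itertools.product does), and the recursion can explore some dead-end prefixes.

```python
def product_by_index_sum(*iterables):
    """Yield product tuples ordered by the sum of their element indices."""
    pools = [tuple(it) for it in iterables]
    if any(len(pool) == 0 for pool in pools):
        return  # an empty input makes the whole product empty

    def index_tuples(pos, remaining):
        # Yield index tuples for pools[pos:] whose entries sum to `remaining`.
        if pos == len(pools):
            if remaining == 0:
                yield ()
            return
        highest = min(remaining, len(pools[pos]) - 1)
        for i in range(highest + 1):
            for rest in index_tuples(pos + 1, remaining - i):
                yield (i,) + rest

    max_sum = sum(len(pool) - 1 for pool in pools)
    for total in range(max_sum + 1):
        for indices in index_tuples(0, total):
            yield tuple(pool[i] for pool, i in zip(pools, indices))


print(list(product_by_index_sum([2, 1, 0], ['b', 'c', 'a'])))
# [(2, 'b'), (2, 'c'), (1, 'b'), (2, 'a'), (1, 'c'), (0, 'b'), (1, 'a'), (0, 'c'), (0, 'a')]
```

Within one index sum, tuples come out with the smaller index in the first iterable first, which matches the order shown in the question.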

How to do a full outer join / merge of iterators by key?

I have multiple sorted iterators that yield keyed data, representable by lists:
a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])
I want to merge them, using the key and keeping track of which iterator had a value for a key. This should be equivalent to a full outer join in SQL:
>>> list(full_outer_join(a, b, key=lambda x: x[0]))
[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]
I tried using heapq.merge and itertools.groupby, but with merge I already lose information about the iterators:
>>> list(heapq.merge(a, b, key=lambda x: x[0]))
[(1, 'a'), (1, 'a'), (2, 't'), (3, 'g'), (4, 'c'), (4, 'g')]
So what I could use is a tag generator:
def tagged(it, tag):
    for item in it:
        yield (tag, *item)
and merge the tagged iterators, group by the key and create a dict using the tag:
from heapq import merge
from itertools import groupby
merged = merge(tagged(a, 'a'), tagged(b, 'b'), key=lambda x: x[1])
grouped = groupby(merged, key=lambda x: x[1])
[(key, {g[0]: g[2] for g in group}) for key, group in grouped]
Which gives me this usable output:
[(1, {'a': 'a', 'b': 'a'}),
(2, {'a': 't'}),
(3, {'b': 'g'}),
(4, {'a': 'c', 'b': 'g'})]
However, I think creating a dict for every group is quite costly performance-wise, so maybe there is a more elegant way?
Edit:
To clarify, the dataset is too big to fit into memory, so I definitely need to use generators/iterators.
Edit 2:
To further clarify, a and b should only be iterated over once, because they represent huge files that are slow to read.

You can alter your groupby solution by tagging each pair with the position of its source inside a generator function (note that this sorts the combined data in memory, so it assumes a and b are lists that fit in memory):
from itertools import groupby
def group_data(a, b):
    # Tag every pair with the output slot of its source: 0 for a, 1 for b
    tagged = [(k, 0, v) for k, v in a] + [(k, 1, v) for k, v in b]
    sorted_data = sorted(tagged, key=lambda x: x[0])
    for key, group in groupby(sorted_data, key=lambda x: x[0]):
        row = [None, None]
        for _, pos, val in group:
            row[pos] = val
        yield (key, *row)
print(list(group_data([(1, 'a'), (2, 't'), (4, 'c')], [(1, 'a'), (3, 'g'), (4, 'g')])))
Output:
[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]

Here is one solution via dictionaries. I provide it here as it's not clear to me that dictionaries are inefficient in this case.
I believe dict_of_lists can be replaced by an iterator, but I use it in the below solution for demonstration purposes.
a = [(1, 'a'), (2, 't'), (4, 'c')]
b = [(1, 'a'), (3, 'g'), (4, 'g')]
dict_of_lists = {'a': a, 'b': b}
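def gen_results(dict_of_lists):
    # Collect every key that appears in any of the lists
    keys = {num for k, v in dict_of_lists.items()
            for num, val in v}
    # Sets are unordered, so sort to get the keys in ascending order
    for key in sorted(keys):
        d = {k: val for k, v in dict_of_lists.items()
             for num, val in v if num == key}
        yield (key, d)
Result: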
list(gen_results(dict_of_lists))
[(1, {'a': 'a', 'b': 'a'}),
(2, {'a': 't'}),
(3, {'b': 'g'}),
(4, {'a': 'c', 'b': 'g'})]
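For two inputs, the join can also be done in a single pass without tagging or per-group dicts, in the manner of the merge step of a merge sort. The following is a minimal sketch, assuming both inputs are sorted ascending by key and that keys are unique within each input; the value parameter is an addition that is not part of the signature in the question.

```python
def full_outer_join(a, b, key=lambda x: x[0], value=lambda x: x[1]):
    """Yield (key, value_from_a, value_from_b), with None where a side is missing."""
    done = object()  # sentinel marking an exhausted iterator
    a, b = iter(a), iter(b)
    x, y = next(a, done), next(b, done)
    while x is not done or y is not done:
        if y is done or (x is not done and key(x) < key(y)):
            # Key present only in a
            yield (key(x), value(x), None)
            x = next(a, done)
        elif x is done or key(y) < key(x):
            # Key present only in b
            yield (key(y), None, value(y))
            y = next(b, done)
        else:
            # Same key on both sides
            yield (key(x), value(x), value(y))
            x, y = next(a, done), next(b, done)


a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])
print(list(full_outer_join(a, b, key=lambda x: x[0])))
# [(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]
```

Each iterator is advanced exactly once per item and only one item per side is held at a time, which fits the constraint that the inputs are huge files.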

Python tuple list

Can you help me convert this Python list:
[(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e')]
so that:
(1, 'a') is index 0
(2, 'b'), (2, 'c') are both index 1
(3, 'd'), (3, 'e') are both index 2
Simply put, all tuples whose element[0] is equal share the same index.
Thank you.

itertools.groupby to the rescue:
import itertools
lst = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e')]
lst.sort(key=lambda x: x[0])  # Only necessary if your list isn't sorted already.
new_lst = [list(v) for k, v in itertools.groupby(lst, key=lambda x: x[0])]
You could use operator.itemgetter(0) instead of the lambda if you wanted...
demo:
>>> import itertools
>>> lst = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e')]
>>> lst.sort(key=lambda x: x[0])
>>> new_lst = [list(v) for k,v in itertools.groupby(lst,key=lambda x:x[0])]
>>> new_lst
[[(1, 'a')], [(2, 'b'), (2, 'c')], [(3, 'd'), (3, 'e')]]
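If the group index itself is needed next to each tuple, enumerate can be wrapped around groupby. This is a sketch under the assumption that the list is already sorted (or at least grouped) by the first element; the name with_group_index is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def with_group_index(pairs):
    """Yield (group_index, pair); adjacent pairs sharing pair[0] get the same index."""
    groups = groupby(pairs, key=itemgetter(0))
    for index, (_, group) in enumerate(groups):
        for pair in group:
            yield index, pair


lst = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e')]
print(list(with_group_index(lst)))
# [(0, (1, 'a')), (1, (2, 'b')), (1, (2, 'c')), (2, (3, 'd')), (2, (3, 'e'))]
```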

It's not clear what you want, but this will group the items into lists according to their first element (items is your list of tuples):
import collections
groups = collections.defaultdict(list)
for x, y in items:
    groups[x].append(y)
