Merge lists together based on common values - Python

I have a list of tuples:
a = [(1,2),(1,3),(4,5),(6,7),(8,7)]
I want to merge the tuples that share a common value into groups, so I get:
b = [(1,2,3),(4,5),(6,7,8)]
The order doesn't matter, but the grouping based on connectivity does. I haven't been able to figure out a way to do it; any help is appreciated!

You can use set intersection to test if there's any value in common between two sets, and you can use set union to merge the two sets:
b = []
for p in map(set, a):
    for i, s in enumerate(b):
        if s & p:
            b[i] |= p
            break
    else:
        b.append(p)
b becomes:
[{1, 2, 3}, {4, 5}, {8, 6, 7}]
You can then convert it to your desired list of sorted tuples if you want:
b = [tuple(sorted(s)) for s in b]
b becomes:
[(1, 2, 3), (4, 5), (6, 7, 8)]

Some for loops will do the job:
a = [(1, 2), (1, 3), (4, 5), (6, 7), (8, 7)]
# index pairs of tuples that share at least one value
unions = [[i1, i2] for i1, x in enumerate(a) for i2, y in enumerate(a) for z in x if z in y and i2 != i1]
# drop mirrored duplicates such as [1, 0] when [0, 1] is already present
for c in unions:
    if c[::-1] in unions:
        unions.remove(c[::-1])
# tuples that take part in no union stay as they are
b = [e for i, e in enumerate(a) if i not in [y for x in unions for y in x]]
# merge each overlapping pair of tuples
for c in unions:
    b.append(tuple(set(a[c[0]] + a[c[1]])))
print(sorted(b))

Related

How does enumerate(zip(*k_fold(dataset, folds))) work?

If we have:
a = ['a', 'aa', 'aaa']
b = ['b', 'bb', 'bbb']
for i, (x, y) in enumerate(zip(a, b)):
    print(i, x, y)
then the code prints:
0 a b
1 aa bb
2 aaa bbb
Since zip stops at the end of the shorter iterable, the two lists must have the same size if you want to iterate over all of their elements.
Now, if we have the following snippet:
for fold, (train_idx, test_idx, val_idx) in enumerate(zip(*k_fold(dataset, folds))):
    pass
where len(dataset) = 1000 and folds = 3, how does the code work in terms of *k_fold(dataset, folds)?
EDIT:
I have added a reference for the snippet my question is about; it is line 31 of this code.
Python's enumerate function
Enumeration is used to iterate through an iterable whilst keeping an integer count of the number of iterations, so:
>>> for number, value in enumerate(["a", "b", "c"]):
...     print(number, value)
0 a
1 b
2 c
Python's zip function
The built-in function zip is used to combine two iterables like so:
>>> a = [1, 2]
>>> b = [3, 4]
>>> list(zip(a, b))
[(1, 3), (2, 4)]
When zip is given iterables of different lengths, it returns a zip object whose length matches the shortest iterable. So:
>>> a = [1, 2, 5, 6]
>>> b = [3, 4]
>>> list(zip(a, b))
[(1, 3), (2, 4)]
Python's unpacking operator
Python uses the * to unpack iterables. Looking through the GitHub repository, it seems that k_fold returns a tuple with 3 elements. The * is used here so that the three sequences returned by k_fold are passed to zip as separate arguments.
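For illustration only, here is a sketch of why that matters; the names and values below are made up, since all I know is that k_fold returns a 3-tuple:
# hypothetical stand-in for the 3-tuple returned by k_fold(dataset, folds)
train_folds, test_folds, val_folds = [[0, 1], [2, 3]], [[4], [5]], [[6], [7]]
folds_tuple = (train_folds, test_folds, val_folds)
# zip(*folds_tuple) is the same as zip(train_folds, test_folds, val_folds)
for fold, (train_idx, test_idx, val_idx) in enumerate(zip(*folds_tuple)):
    print(fold, train_idx, test_idx, val_idx)
# 0 [0, 1] [4] [6]
# 1 [2, 3] [5] [7]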
bonus example:
a = [1, 2, 5, 6, 8, 9, 10 , 11]
b = [3, 4, 12, 13 ]
c = [ 14, 15 ]
for i in enumerate(zip(a, b, c)):
    print(i)
output:
(0, (1, 3, 14))
(1, (2, 4, 15))
Each of these tuples corresponds to fold, (train_idx, test_idx, val_idx) in the snippet.
I am not sure what train_idx, test_idx and val_idx are in the code on GitHub: they are lists, but I don't know what they are filled with.

Find common elements in two lists in linear time complexity

I have two unsorted lists of integers without duplicates. Both contain the same elements, but not in the same order, and I want to find the indices of the common elements between the two lists in the lowest time complexity. For example:
a = [1, 8, 5, 3, 4]
b = [5, 4, 1, 3, 8]
the output should be:
list1[0] With list2[2]
list1[1] With list2[4]
list1[2] With list2[0]
and so on
I have thought of using set.intersection and then finding the index using the index function, but I didn't know how to print the output in the right way.
This is what I've tried:
b = set(list1).intersection(list2)
ina = [list1.index(x) for x in b]
inb = [list2.index(x) for x in b]
print(ina, inb)
To find them in linear time you should use some kind of hashing. The easiest way in Python is to use a dict:
list1 = [1, 8, 5, 3, 4]
list2 = [5, 4, 1, 3, 8]
common = set(list1).intersection(list2)
dict2 = {e: i for i, e in enumerate(list2) if e in common}
result = [(i, dict2[e]) for i, e in enumerate(list1) if e in common]
The result will be
[(0, 2), (1, 4), (2, 0), (3, 3), (4, 1)]
You can use something like this to format and print it:
for i1, i2 in result:
    print(f"list1[{i1}] with list2[{i2}]")
you get:
list1[0] with list2[2]
list1[1] with list2[4]
list1[2] with list2[0]
list1[3] with list2[3]
list1[4] with list2[1]
Create a dictionary that maps elements of one list to their indexes. Then update it to have the indexes of the corresponding elements of the other list. Then any element that has two indices is in the intersection.
intersect = {x: [i] for i, x in enumerate(list1)}
for i, x in enumerate(list2):
    if x in intersect:
        intersect[x].append(i)
for l in intersect.values():
    if len(l) == 2:
        print(f'list1[{l[0]}] with list2[{l[1]}]')
a = [1, 8, 5, 3, 4]
b = [5, 4, 1, 3, 8]
e2i = {e : i for (i, e) in enumerate(b)}
for i, e in enumerate(a):
    if e in e2i:
        print('list1[%d] with list2[%d]' % (i, e2i[e]))
Building on the excellent answers here, you can squeeze a little more juice out of the lemon by not bothering to record the indices of a. (Those indices are just 0 through len(a) - 1 anyway and you can add them back later if needed.)
e2i = {e : i for (i, e) in enumerate(b)}
output = [e2i.get(e) for e in a]
output
# [2, 4, 0, 3, 1]
With len(a) == len(b) == 5000 on my machine this code runs a little better than twice as fast as Björn Lindqvist's code (after I modified his code to store the output rather than print it).
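If you want to reproduce that kind of comparison, a rough timeit sketch along these lines should work; the two functions below are my own approximations of the approaches above, not the authors' exact code, and the numbers will vary by machine:
import random
import timeit

n = 5000
a = random.sample(range(10 * n), n)
b = random.sample(a, n)  # same elements, shuffled order

def with_both_indices():
    e2i = {e: i for i, e in enumerate(b)}
    return [(i, e2i[e]) for i, e in enumerate(a) if e in e2i]

def with_b_indices_only():
    e2i = {e: i for i, e in enumerate(b)}
    return [e2i.get(e) for e in a]

print(timeit.timeit(with_both_indices, number=100))
print(timeit.timeit(with_b_indices_only, number=100))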

Python: vertical binning of two lists

I have two lists of the same size:
A = [1, 1, 2, 2, 3, 3, 4, 5]
B = [a, b, c, d, e, f, g, h] # numeric values
How do I do a vertical binning?
Output desired:
C = [ 1, 2, 3, 4, 5] # len = 5
D = [a + b, c + d, e + f, g, h] # len = 5
i.e. a mapping of each unique value in A to the cumulative sum of the corresponding values in B (vertical binning?).
I assume a, b, ... are numeric variables:
bins = dict()
for b, x in zip(A, B):
    bins[b] = bins.setdefault(b, 0) + x
C = [key for key in bins]
D = [bins[key] for key in bins]
If a, b, ... are of another type, you would have to adjust the default value in bins.setdefault(b, ...).
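For example, if the values in B were strings, one plausible adjustment would be an empty string as the default:
bins = dict()
for b, x in zip(A, B):
    bins[b] = bins.setdefault(b, "") + x
# with B = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'] this gives
# {1: 'ab', 2: 'cd', 3: 'ef', 4: 'g', 5: 'h'}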
This is a perfect case for the use of itertools.groupby (note that groupby groups consecutive equal keys, which is fine here because A is already sorted):
from itertools import groupby
from operator import itemgetter
fst = itemgetter(0)
A = [1,1,2,2,3,3,4,5]
B = [1,3,4,6,7,7,8,8]
C = []
D = []
for k, v in groupby(zip(A, B), key=fst):
    C.append(k)
    D.append(sum(item[-1] for item in v))
C
>>[1, 2, 3, 4, 5]
D
>>[4, 10, 14, 8, 8]
If B is a list of strings then your summation operation becomes:
D.append(''.join(item[-1] for item in v))
You can use a dictionary; since Python 3.6 the insertion order is preserved, so you get your C as the keys and D as the values:
A = [1,1,2,2,3,3,4,5]
B = ["a","b","c","d","e","f","g","h"]
from random import randint
rename_to_B_for_numeric = [randint(0, 255) for _ in A]  # rename this to B to try the numeric case
result = {}
for idx, item in enumerate(A):
    if item not in result:
        # not sure about the type, so...
        result[item] = "" if isinstance(B[idx], str) else 0
    result[item] += B[idx]
print(result)
# {1: 'ab', 2: 'cd', 3: 'ef', 4: 'g', 5: 'h'}
print(list(result.keys()))
# [1, 2, 3, 4, 5]
print(list(result.values()))
# ['ab', 'cd', 'ef', 'g', 'h']
Obviously, if the items in B are neither strings nor numbers (int in this case), you'll need to modify the code a little to pick a suitable default. Or just use else:
if item not in result:
    result[item] = B[idx]
else:
    result[item] += B[idx]
Here, C is the unique values of A:
C = sorted(set(A))
gives:
[1, 2, 3, 4, 5]
Now, D is the vertical binning of B w.r.t. A (if B's elements are strings):
D = [''.join(B[i] for i in range(len(B)) if A[i] == j) for j in C]
If B's elements are numeric:
D = [sum(B[i] for i in range(len(B)) if A[i] == j) for j in C]
gives:
['ab', 'cd', 'ef', 'g', 'h']
Note:
A = [1,1,2,2,3,3,4,5]
B = ['a','b','c','d','e','f','g','h']
If a, b, c, ... are numeric, go for the second equation :)

Python equivalent of R "split"-function

In R, you could split a vector according to the factors of another vector:
> a <- 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> b <- rep(1:2,5)
[1] 1 2 1 2 1 2 1 2 1 2
> split(a,b)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
Thus, it groups a list (in Python terms) according to the values of another list (following the order of the factors).
Is there anything handy in python like that, except from the itertools.groupby approach?
From your example, it looks like each element in b contains the 1-indexed list in which the corresponding element will be stored. Python lacks the automatic numeric variables that R seems to have, so we'll return a tuple of lists. If you can use zero-indexed lists and you only need two of them (i.e., for your R use case, 1 and 2 are the only values; in Python they'll be 0 and 1), you can do:
>>> a = range(1, 11)
>>> b = [0,1] * 5
>>> split(a, b)
([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
Then you can use itertools.compress:
import itertools

def split(x, f):
    return list(itertools.compress(x, f)), list(itertools.compress(x, (not i for i in f)))
If you need more general input (multiple numbers), something like the following will return an n-tuple:
def split(x, f):
    count = max(f) + 1
    return tuple(list(itertools.compress(x, (el == i for el in f))) for i in range(count))
>>> split([1,2,3,4,5,6,7,8,9,10], [0,1,1,0,2,3,4,0,1,2])
([1, 4, 8], [2, 3, 9], [5, 10], [6], [7])
Edit: warning, this is a groupby solution, which is not what the OP asked for, but it may be of use to someone looking for a less specific way to split the R way in Python.
Here's one way with itertools.
import itertools
# make your sample data
a = range(1,11)
b = list(zip(*zip(range(len(a)), itertools.cycle((1, 2)))))[1]
{k: list(zip(*g))[1] for k, g in itertools.groupby(sorted(zip(b, a)), lambda x: x[0])}
# {1: (1, 3, 5, 7, 9), 2: (2, 4, 6, 8, 10)}
This gives you a dictionary, which is analogous to the named list that you get from R's split.
As a long time R user I was wondering how to do the same thing. It's a very handy function for tabulating vectors. This is what I came up with:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
from collections import defaultdict
def split(x, f):
    res = defaultdict(list)
    for v, k in zip(x, f):
        res[k].append(v)
    return res
>>> split(a, b)
defaultdict(list, {1: [1, 3, 5, 7, 9], 2: [2, 4, 6, 8, 10]})
You could try:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
split_1 = [a[k] for k in (i for i,j in enumerate(b) if j == 1)]
split_2 = [a[k] for k in (i for i,j in enumerate(b) if j == 2)]
results in:
In [22]: split_1
Out[22]: [1, 3, 5, 7, 9]
In [24]: split_2
Out[24]: [2, 4, 6, 8, 10]
To generalise this, you can simply iterate over the unique elements in b:
splits = {}
for index in set(b):
    splits[index] = [a[k] for k in (i for i, j in enumerate(b) if j == index)]

Numpy - group data into sum values

Say I have an array of values:
a = np.array([1,5,4,2,4,3,1,2,4])
and three 'sum' values:
b = 10
c = 9
d = 7
Is there a way to group the values in a into sets where the values combine to equal b, c and d? For example:
b: [5,2,3]
c: [4,4,1]
d: [4,2,1]
b: [5,4,1]
c: [2,4,3]
d: [4,2,1]
b: [4,2,4]
c: [5,4]
d: [1,1,2,3]
Note that the sum of b, c and d should remain the same (== 26). Perhaps this operation already has a name?
Here's a naive implementation using itertools
from itertools import chain, combinations
def group(n, iterable):
    s = list(iterable)
    return [c for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
            if sum(c) == n]
group(5, range(5))
yields
[(1, 4), (2, 3), (0, 1, 4), (0, 2, 3)]
Note, this probably will be very slow for large lists because we're essentially constructing and filtering through the power set of that list.
You could use this for
sum_vals = [10, 9, 7]
a = [1, 5, 4, 2, 4, 3, 1, 2, 4]
list(map(lambda x: group(x, a), sum_vals))
and then zip them together.
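As a rough sketch of that combined usage (candidates is a name I'm introducing for illustration; the subsets found for the different targets are computed independently, so choosing a combination that uses each element of a exactly once would still be a separate step):
sum_vals = [10, 9, 7]
a = [1, 5, 4, 2, 4, 3, 1, 2, 4]

# one list of candidate subsets per target sum
candidates = [group(x, a) for x in sum_vals]

for target, subsets in zip(sum_vals, candidates):
    print(target, subsets[:3])  # a few candidate groupings for each target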
