Concatenate lists and remove overlaps

Concatenate lists and remove overlaps - python

For two lists I want
A = [ 1,2,3,4,5]
B = [4,5,6,7]
result
C = [1,2,3,4,5,6,7]
if I specify an overlap of 2.
Code so far:
concat_list = []
word_overlap = 2
for lst in [lst1, lst2, lst3]:
if (len(concat_list) != 0):
if (concat_list[-word_overlap:] != lst[:word_overlap]):
concat_list += lst
elif ([concat_list[-word_overlap:]] == lst[:word_overlap]):
raise SystemExit
else:
concat_list += lst
doing it for lists of strings, but should be the same thing.
EDIT:
What I want my code to do is, first, check if there is any overlap (of 1, of 2, etc), then concatenate lists, eliminating the overlap (so I don't get double elements).
[1,2,3,4,5] + [4,5,6,7] = [1,2,3,4,5,6,7]
but
[1,2,3] + [4,5,6] = [1,2,3,4,5,6]
I want it to also check for any overlap smaller than my set word_overlap.

Here's a naïve variant:
def concat_nooverlap(a,b):
maxoverlap=min(len(a),len(b))
for overlap in range(maxoverlap,-1,-1):
# Check for longest possible overlap first
if a[-overlap:]==b[:overlap]:
break # Found an overlap, don't check any shorter
return a+b[overlap:]
It would be more efficient with types that support slicing by reference, such as buffers or numpy arrays.
One quite odd thing this does is, upon reaching overlap=0, it compares the entirety of a (sliced, which is a copy for a list) with an empty slice of b. That comparison will fail unless they were empty, but it still leaves overlap=0, so the return value is correct. We can handle this case specifically with a slight rewrite:
def concat_nooverlap(a,b):
maxoverlap=min(len(a),len(b))
for overlap in range(maxoverlap,0,-1):
# Check for longest possible overlap first
if a[-overlap:]==b[:overlap]:
return a+b[overlap:]
else:
return a+b

You can use set and union
s.union(t): new set with elements from both s and t
>> list(set(A) | set(B))
[1, 2, 3, 4, 5, 6, 7]
But you can't have the exact number you need to overlap this way.
To answer you question, you will have to ruse and use a combination of sets:
get a new list with elements from both A and B
get new list with elements common to A and B
get only the number of elements you need in this list using slicing
get new list with elements in either A or B but not both
OVERLAP = 1
A = [1, 2, 3, 4, 5]
B = [4, 5, 6, 7]
C = list(set(A) | set(B)) # [1, 2, 3, 4, 5, 6, 7]
D = list(set(A) & set(B)) # [4, 5]
D = D[OVERLAP:] # [5]
print list(set(C) ^ set(D)) # [1, 2, 3, 4, 6, 7]
just for fun, a one-liner could give this:
list((set(A) | set(B)) ^ set(list(set(A) & set(B))[OVERLAP:])) # [1, 2, 3, 4, 6, 7]
Where OVERLAP is the constant where you need you reunion.

Not sure if I correctly interpreted your question, but you could do it like this:
A = [ 1,2,3,4,5]
B = [4,5,6,7]
overlap = 2
print A[0:-overlap] + B
If you want to make sure they have the same value, your check could be along the lines of:
if(A[-overlap:] == B[:overlap]):
print A[0:-overlap] + B
else:
print "error"

assuming that both lists will be consecutive, and list a will always have smaller values than list b. I come up with this solution.
This will also help you detect overlap.
def concatenate_list(a,b):
max_a = a[len(a)-1]
min_b = b[0]
if max_a >= min_b:
print 'overlap exists'
b = b[(max_a - min_b) + 1:]
else:
print 'no overlap'
return a + b
For strings you can do this also
def concatenate_list_strings(a,b):
count = 0
for i in xrange(min(len(a),len(b))):
max_a = a[len(a) - 1 - count:]
min_b = b[0:count+1]
if max_a == min_b:
b = b[count +1:]
return 'overlap count ' + str(count), a+b
count += 1
return a + b

Related

Filter Function in Python - Difference between solutions

The goal is to identify the mutual elements between both lists below:
a = [1,2,3,5,7,9]
b = [2,3,5,6,7,8]
I tried the below:
a = [1,2,3,5,7,9]
b = [2,3,5,6,7,8]
for i in a:
filt1=list(filter(lambda x:x==i,b))
print(filt1)
The expected result is:
[2, 3, 5, 7]
The following code works:
a = [1,2,3,5,7,9]
b = [2,3,5,6,7,8]
filt2=(list(filter(lambda x: x in a,b)))
print(filt2)
Doesn't it perform like the one I've been trying? What's the difference between them?

filt1 is being overridden on every iteration of your for loop.
you end up with [] because the last number (9) is not in b.
define it beforehand, and add the filter lists to it.
try this:
a = [1, 2, 3, 5, 7, 9]
b = [2, 3, 5, 6, 7, 8]
filt1 = []
for i in a:
filt1 += list(filter(lambda x: x == i, b))
print(filt1)
Output:
[2, 3, 5, 7]

Using filter won't be an efficient solution in such case, sets intersection is a reasonable way:
a = [1,2,3,5,7,9]
b = [2,3,5,6,7,8]
common_nums = list(set(a) & set(b))
print(common_nums) # [2, 3, 5, 7]
As for issues:
in your 1st approach you are reassigning filt1 on each iteration so you won't get an expected result (a common items won't be accumulated).
for i in a:
filt1=list(filter(lambda x:x==i,b)) # <---
But I wouldn't advice to fix it but eliminate it in favour of set approach.

No, your first solution is not the same as the second one.
You are iterating through a list, and for each elemnt, you are creating a new list.
And the result is just list for the last element.
You can fix it by creating additional list:
temp = []
for i in a:
filt1=list(filter(lambda x:x==i,b))
temp.extend(filt1)
print(temp)

Compare two lists to return a list with all elements as 0 except the ones which matched while keeping index?

I am a bit stuck on this:
a = [1,2,3,2,4,5]
b = [2,5]
I want to compare the two lists and generate a list with the same items as a, but with any items that don't occur in b set to 0. Valid outputs would be these:
c = [0,2,0,0,0,5]
# or
c = [0,0,0,2,0,5]
I would not know the number elements in either list beforehand.
I tried for loops but
['0' for x in a if x not in b]
It removes all instances of 2. Which I only want to remove once(it occurs once in b for the moment). I need to add a condition in the above loop to keep elements which match.

The following would work:
a = [1,2,3,2,4,5]
b = [2, 5]
output = []
for x in a:
if x in b:
b.remove(x)
output.append(x)
else:
output.append(0)
or for a one-liner, using the fact that b.remove(x) returns None:
a = [1,2,3,2,4,5]
b = {2, 5}
output = [(b.remove(x) or x) if x in b else 0 for x in a]

If the elements in b are unique, this is best done with a set, because sets allow very efficient membership testing:
a = [1,2,3,2,4,5]
b = {2, 5} # make this a set
result = []
for num in a:
# If this number occurs in b, remove it from b.
# Otherwise, append a 0.
if num in b:
b.remove(num)
result.append(num)
else:
result.append(0)
# result: [0, 2, 0, 0, 0, 5]
If b can contain duplicates, you can replace the set with a Counter, which represents a multiset:
import collections
a = [1,2,3,2,4,5]
b = collections.Counter([2, 2, 5])
result = []
for num in a:
if b[num] > 0:
b[num] -= 1
result.append(num)
else:
result.append(0)
# result: [0, 2, 0, 2, 0, 5]

Here's one way using set. Downside is the list copy operation and initial set conversion. Upside is O(1) removal and lookup operations.
a = [1,2,3,2,4,5]
b = [2,5]
b_set = set(b)
c = a.copy()
for i in range(len(c)):
if c[i] in b_set:
b_set.remove(c[i])
else:
c[i] = 0
print(c)
[0, 2, 0, 0, 0, 5]

Remove sublist from list

I want to do the following in Python:
A = [1, 2, 3, 4, 5, 6, 7, 7, 7]
C = A - [3, 4] # Should be [1, 2, 5, 6, 7, 7, 7]
C = A - [4, 3] # Should not be removing anything, because sequence 4, 3 is not found
So, I simply want to remove the first appearance of a sublist (as a sequence) from another list. How can I do that?
Edit: I am talking about lists, not sets. Which implies that ordering (sequence) of items matter (both in A and B), as well as duplicates.

Use sets:
C = list(set(A) - set(B))
In case you want to mantain duplicates and/or oder:
filter_set = set(B)
C = [x for x in A if x not in filter_set]

If you want to remove exact sequences, here is one way:
Find the bad indices by checking to see if the sublist matches the desired sequence:
bad_ind = [range(i,i+len(B)) for i,x in enumerate(A) if A[i:i+len(B)] == B]
print(bad_ind)
#[[2, 3]]
Since this returns a list of lists, flatten it and turn it into a set:
bad_ind_set = set([item for sublist in bad_ind for item in sublist])
print(bad_ind_set)
#set([2, 3])
Now use this set to filter your original list, by index:
C = [x for i,x in enumerate(A) if i not in bad_ind_set]
print(C)
#[1, 2, 5, 6, 7, 7, 7]
The above bad_ind_set will remove all matches of the sequence. If you only want to remove the first match, it's even simpler. You just need the first element of bad_ind (no need to flatten the list):
bad_ind_set = set(bad_ind[0])
Update: Here is a way to find and remove the first matching sub-sequence using a short circuiting for loop. This will be faster because it will break out once the first match is found.
start_ind = None
for i in range(len(A)):
if A[i:i+len(B)] == B:
start_ind = i
break
C = [x for i, x in enumerate(A)
if start_ind is None or not(start_ind <= i < (start_ind + len(B)))]
print(C)
#[1, 2, 5, 6, 7, 7, 7]

I considered this question was like one substring search, so KMP, BM etc sub-string search algorithm could be applied at here. Even you'd like support multiple patterns, there are some multiple pattern algorithms like Aho-Corasick, Wu-Manber etc.
Below is KMP algorithm implemented by Python which is from GitHub Gist.
PS: the author is not me. I just want to share my idea.
class KMP:
def partial(self, pattern):
""" Calculate partial match table: String -> [Int]"""
ret = [0]
for i in range(1, len(pattern)):
j = ret[i - 1]
while j > 0 and pattern[j] != pattern[i]:
j = ret[j - 1]
ret.append(j + 1 if pattern[j] == pattern[i] else j)
return ret
def search(self, T, P):
"""
KMP search main algorithm: String -> String -> [Int]
Return all the matching position of pattern string P in S
"""
partial, ret, j = self.partial(P), [], 0
for i in range(len(T)):
while j > 0 and T[i] != P[j]:
j = partial[j - 1]
if T[i] == P[j]: j += 1
if j == len(P):
ret.append(i - (j - 1))
j = 0
return ret
Then use it to calcuate out the matched position, finally remove the match:
A = [1, 2, 3, 4, 5, 6, 7, 7, 7, 3, 4]
B = [3, 4]
result = KMP().search(A, B)
print(result)
#assuming at least one match is found
print(A[:result[0]:] + A[result[0]+len(B):])
Output:
[2, 9]
[1, 2, 5, 6, 7, 7, 7, 3, 4]
[Finished in 0.201s]
PS: You can try other algorithms also. And #Pault 's answers is good enough unless you care about the performance a lot.

Here is another approach:
# Returns that starting and ending point (index) of the sublist, if it exists, otherwise 'None'.
def findSublist(subList, inList):
subListLength = len(subList)
for i in range(len(inList)-subListLength):
if subList == inList[i:i+subListLength]:
return (i, i+subListLength)
return None
# Removes the sublist, if it exists and returns a new list, otherwise returns the old list.
def removeSublistFromList(subList, inList):
indices = findSublist(subList, inList)
if not indices is None:
return inList[0:indices[0]] + inList[indices[1]:]
else:
return inList
A = [1, 2, 3, 4, 5, 6, 7, 7, 7]
s1 = [3,4]
B = removeSublistFromList(s1, A)
print(B)
s2 = [4,3]
C = removeSublistFromList(s2, A)
print(C)

Remove intersection from two lists in python

Given two lists, what is the best way to remove the intersection of the two? For example, given:
a = [2,2,2,3]
b = [2,2,5]
I want to return:
a = [2,3]
b = [5]

Let's assume you wish to handle the general case (same elements appear more than once in each list), the so called multiset.
You can use collections.Counter:
from collections import Counter
intersection = Counter(a) & Counter(b)
multiset_a_without_common = Counter(a) - intersection
multiset_b_without_common = Counter(b) - intersection
new_a = list(multiset_a_without_common.elements())
new_b = list(multiset_b_without_common.elements())
For your values of a, b, you'll get:
a = [2,2,2,3]
b = [2,2,5]
new_a = [2, 3]
new_b = [5]
Note that for a special case of each element appearing exactly once, you can use the standard set, as the other answers are suggesting.

You can loop through the two lists and remove elements as you find an intersect point as the following:
a = [2, 2, 2, 3]
b = [2, 2, 5]
delete = []
for c in a:
for n in b:
if n == c:
delete.append(c)
delete.append(n)
break
a.remove(delete[0])
b.remove(delete[1])
delete = []
print a
print b
output:
[2, 3]
[5]

a = [2,2,2,3]
b = [2,2,5]
for i in list(b): #I call list() on b because otherwise I can't remove from it during the for loop.
if i in a:
a.remove(i)
b.remove(i)
Output:
a = [2, 3]
b = [5]

Pythonic way to merge two overlapping lists, preserving order

Alright, so I have two lists, as such:
They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
The lists are not necessarily ordered nor unique. [9, 1, 1, 8, 7], [8, 6, 7].
I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
def merge(master, addition):
n = 1
while n < len(master):
if master[-n:] == addition[:n]:
return master + addition[n:]
n += 1
return master + addition
What I would like to know is - is there a more efficient way of doing this? It works, but I'm slightly leery of this, because it can run into large runtimes in my application - I'm merging large lists of strings.
EDIT: I'd expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be: [1,3,9,8,3,4,5,7,8]. For clarity, I've highlighted the overlapping portion.
[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]

You can try the following:
>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]
>>> matches = (i for i in xrange(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]
The idea is we check the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, starting from the length of b until 1 (xrange(len(b), 0, -1)) because we want to match as much as possible. We take the first such i by using next and if we don't find it we use the zero value (next(..., 0)). From the moment we found the i, we add to a the elements of b from index i.

There are a couple of easy optimizations that are possible.
You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]
If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:
This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
def merge(master, addition):
first = addition[0]
n = max(len(master) - len(addition), 1) # (1)
while 1:
try:
n = master.index(first, n) # (2)
except ValueError:
return master + addition
if master[-n:] == addition[:n]:
return master + addition[n:]
n += 1

This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.
def merge(a, b):
max_offset = len(b) # can't overlap with greater size than len(b)
for i in reversed(range(max_offset+1)):
# checks for equivalence of decreasing sized slices
if a[-i:] == b[:i]:
break
return a + b[i:]
We can test with your test data by doing:
test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
{'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]
all(merge(test['a'], test['b']) == test['result'] for test in test_data)
This runs through every possible combination of slices that could result in an overlap and remembers the result of the overlap if one is found. If nothing is found, it uses the last result of i which will always be 0. Either way, it returns all of a plus everything past b[i] (in the overlap case, that's the non overlapping portion. In the non-overlap case, it's everything)
Note that we can make a couple optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any solution. You could add a quick check at the beginning that might short circuit that worst case
def merge(a, b):
if a[-1] not in b:
return a + b
...
In fact you could take that solution one step further and probably make your algorithm much faster
def merge(a, b):
while True:
try:
idx = b.index(a[-1]) + 1 # leftmost occurrence of a[-1] in b
except ValueError: # a[-1] not in b
return a + b
if a[-idx:] == b[:idx]:
return a + b[:idx]
However this might not find the longest overlap in cases like:
a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
You could fix that be using rindex instead of index to match the longest slice instead of the shortest, but I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also memoize the results and return the shortest result, which might be a better idea.
def merge(a, b):
results = []
while True:
try:
idx = b.index(a[-1]) + 1 # leftmost occurrence of a[-1] in b
except ValueError: # a[-1] not in b
results.append(a + b)
break
if a[-idx:] == b[:idx]:
results.append(a + b[:idx])
return min(results, key=len)
Which should work since merging the longest overlap should produce the shortest result in all cases.

One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.
Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:
def merge(master, addition):
overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
for overlap_len in overlap_lens:
for i in range(overlap_len):
if master[-overlap_len + i] != addition[i]:
break
else:
return master + addition[overlap_len:]
return master + addition
The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.
The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.
It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.

I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.
def longest_common_substring(s1, s2):
m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
longest, x_longest = 0, 0
for x in xrange(1, 1 + len(s1)):
for y in xrange(1, 1 + len(s2)):
if s1[x - 1] == s2[y - 1]:
m[x][y] = m[x - 1][y - 1] + 1
if m[x][y] > longest:
longest = m[x][y]
x_longest = x
else:
m[x][y] = 0
return x_longest - longest, x_longest
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
print master[:s] + addition
master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
print master[:s] + addition
else:
print master + addition
[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]

First of all and for clarity, you can replace your while loop with a for loop:
def merge(master, addition):
for n in xrange(1, len(master)):
if master[-n:] == addition[:n]:
return master + addition[n:]
return master + addition
Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:
def merge(master, addition):
indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
for n in indices:
if master[-n:] == addition[:n]:
return master + addition[n:]
return master + addition
So instead of comparing slices like this:
1234123141234
3579
3579
3579
3579
3579
3579
3579
3579
3579
3579
3579
3579
3579
you are only doing these comparisons:
1234123141234
| | |
| | 3579
| 3579
3579
How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.
You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons.

Based on https://stackoverflow.com/a/30056066/541208:
def join_two_lists(a, b):
index = 0
for i in xrange(len(b), 0, -1):
#if everything from start to ith of b is the
#same from the end of a at ith append the result
if b[:i] == a[-i:]:
index = i
break
return a + b[index:]

All the above solutions are similar in terms of using a for / while loop for the merging task. I first tried the solutions by #JuniorCompressor and #TankorSmash, but these solutions are way too slow for merging two large-scale lists (e.g. lists with about millions of elements).
I found using pandas to concatenate lists with large size is much more time-efficient:
import pandas as pd, numpy as np
trainCompIdMaps = pd.DataFrame( { "compoundId": np.random.permutation( range(800) )[0:80], "partition": np.repeat( "train", 80).tolist()} )
testCompIdMaps = pd.DataFrame( {"compoundId": np.random.permutation( range(800) )[0:20], "partition": np.repeat( "test", 20).tolist()} )
# row-wise concatenation for two pandas
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)
mergedCompIds = np.array(compoundIdMaps["compoundId"])

What you need is a sequence alignment algorithm like Needleman-Wunsch.
Needleman-Wunsch is a global sequence alignment algorithm based on dynamic programming:
I found this nice implementation to merge arbitrary object sequences in python:
https://github.com/ajnisbet/paired
import paired
seq_1 = 'The quick brown fox jumped over the lazy dog'.split(' ')
seq_2 = 'The brown fox leaped over the lazy dog'.split(' ')
alignment = paired.align(seq_1, seq_2)
print(alignment)
# [(0, 0), (1, None), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7)]
for i_1, i_2 in alignment:
print((seq_1[i_1] if i_1 is not None else '').ljust(15), end='')
print(seq_2[i_2] if i_2 is not None else '')
# The The
# quick
# brown brown
# fox fox
# jumped leaped
# over over
# the the
# lazy lazy
# dog dog

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Concatenate lists and remove overlaps - python

Related

Filter Function in Python - Difference between solutions

Compare two lists to return a list with all elements as 0 except the ones which matched while keeping index?

Remove sublist from list

Remove intersection from two lists in python

Pythonic way to merge two overlapping lists, preserving order

Categories

Resources