Longest common substring between two long lists - python

I have two really long lists, and for each element of the first list I want to find the element of the second list with the longest common substring.
A simplified example is:
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2 ("z_d_e_y"). The best match, for me, is the string that shares the longest run of common characters.
I looked at examples of the Levenshtein distance, which works just fine for small lists (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163,531 elements and it hasn't found even one match in the last 15 minutes.
I do not have a CS background; can someone point me to a better algorithm (or, even better, an implementation of one)? Thanks a ton.
Current code (copied from the link above and from another Stack Overflow answer):
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
for string in L1:
print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]
Edit: I just hit Ctrl+C and it gave me an incorrect (but close) answer after 15 minutes. That's only for the first string, and there are a lot of them left.

Use the difflib module:
>>> from functools import partial
>>> from difflib import SequenceMatcher
>>> def func(x, y):
...     s = SequenceMatcher(None, x, y)
...     return s.find_longest_match(0, len(x), 0, len(y)).size
...
>>> for item in L1:
...     f = partial(func, item)
...     print max(L2, key=f)
...
xy_b_c
z_d_e_y

You can also take a look at the Levenshtein Python C extension module. I tested it on a random string from your example, and it appeared to be about 150 times faster than the pure-Python implementation. And use max as shown by Ashwini Chaudhary.
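For reference, a minimal sketch of that combination (assuming the python-Levenshtein package, which exposes Levenshtein.ratio; with a distance function you would take min instead of max):

import Levenshtein  # assumed: the python-Levenshtein C extension

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]

for target in L1:
    # highest similarity ratio wins
    best = max(L2, key=lambda candidate: Levenshtein.ratio(candidate, target))
    print(best)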

Related

determine if list is periodic python

I am curious to find a function that checks whether a given list is periodic and, if so, returns the repeating elements. The lists are not loaded up front; their elements are generated and added on the fly, in case that makes the algorithm easier.
For example, if the input to the function is [1,2,1,2,1,2,1,2], the output should be (1,2).
I am looking for some tips and hints on the easier ways to achieve this.
Thanks in advance.
This problem can be solved with the Knuth-Morris-Pratt algorithm for string matching. Please get familiar with the way the fail links are calculated before you proceed.
Let's treat the list as a sequence of values (like a string), and let the size of the list/sequence be n.
Then, you can:
Find the length of the longest proper prefix of your list which is also a suffix. Let the length of this longest proper prefix-suffix be len.
If len > 0 and n is divisible by n - len, then the list is periodic and the period has size n - len; in that case you can print the first n - len values, as in the sketch below.
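A minimal sketch of that check, assuming a standard KMP failure-function computation (the helper name kmp_period is mine, not from the linked articles):

def kmp_period(lst):
    """Return the repeating unit of lst if it is periodic, otherwise None."""
    n = len(lst)
    if n == 0:
        return None
    # fail[i] = length of the longest proper prefix of lst[:i+1] that is also a suffix
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and lst[i] != lst[k]:
            k = fail[k - 1]
        if lst[i] == lst[k]:
            k += 1
        fail[i] = k
    longest = fail[-1]      # this is "len" in the steps above
    period = n - longest
    if longest > 0 and n % period == 0:
        return tuple(lst[:period])
    return None

print(kmp_period([1, 2, 1, 2, 1, 2, 1, 2]))  # (1, 2)
print(kmp_period([1, 2, 1, 2, 1]))           # None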
More info:
GeeksForGeeks article.
Knuth-Morris-Pratt algorithm
NOTE: the original question had python and python-3.x tags; they were edited by someone other than the OP, which is why my answer is in Python.
I use itertools.cycle and zip to determine if the list is k-periodic for a given k, then just iterate all possible k values (up to half the length of the list).
try this:
from itertools import cycle

def is_k_periodic(lst, k):
    if len(lst) < k * 2:  # we want the returned part to repeat at least twice... otherwise every list is periodic (one period of its full self)
        return False
    return all(x == y for x, y in zip(lst, cycle(lst[:k])))

def is_periodic(lst):
    for k in range(1, (len(lst) // 2) + 1):
        if is_k_periodic(lst, k):
            return tuple(lst[:k])
    return None

print(is_periodic([1, 2, 1, 2, 1, 2, 1, 2]))
Output:
(1, 2)
Thank you all for answering my question. Nevertheless, I came up with an implementation that suits my needs.
I will share it here with you, and I look forward to your input on optimizing it for better performance.
The algorithm is:
assume the input list is periodic.
initialize a pattern list.
go over the list up to its half, for each element i in this first half:
add the element to the pattern list.
check if the pattern is matched throughout the list.
if it matches, declare success and return the pattern list.
else break and start the loop again adding the next element to the pattern list.
If a pattern list is found, also check the last k elements of the list, where k is len(list) modulo the length of the pattern list, against the first k elements of the pattern; if they match, return the pattern list, otherwise declare failure.
The code in python:
def check_pattern(nums):
    p = []
    i = 0
    pattern = True
    while i < len(nums) // 2:
        p.append(nums[i])
        for j in range(0, len(nums) - (len(nums) % len(p)), len(p)):
            if nums[j:j + len(p)] != p:
                pattern = False
                break
            else:
                pattern = True
        # print(nums[-(len(nums) % len(p)):], p[:(len(nums) % len(p))])
        if pattern and nums[-(len(nums) % len(p)) if (len(nums) % len(p)) > 0 else -len(p):] == \
                p[:(len(nums) % len(p)) if (len(nums) % len(p)) > 0 else len(p)]:
            return p
        i += 1
    return 0
This algorithm might be inefficient in terms of performance, but it checks the list even if the last elements do not form a complete period.
Any hints or suggestions are highly appreciated.
Thanks in advance.
Let L be the list. The classic method is: use your favorite algorithm to search for the second occurrence of the sublist L in the list L + L. If the list is found at index k, then the period is L[:k]:
L                 L
1 2 1 2 1 2 1 2 | 1 2 1 2 1 2 1 2
    1 2 1 2 1 2 1 2
(This is conceptually identical to #KonstantinYovkov's answer.) In Python, an example with strings (because Python has no builtin sublist-search method):
>>> L = "12121212"
>>> k = (L+L).find(L, 1) # skip the first occurrence
>>> L[:k]
'12'
But:
>>> L = "12121"
>>> k = (L+L).find(L, 1)
>>> L[:k] # k == len(L), so the whole list is returned (no shorter period)
'12121'
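A rough list-based version of the same trick, using a naive sublist search (my own sketch, not code from the answer above):

def find_period(lst):
    """Return the repeating unit of lst via the L + L trick (the whole list if aperiodic)."""
    n = len(lst)
    doubled = lst + lst
    # naive sublist search: find the first re-occurrence of lst in doubled, skipping index 0
    for k in range(1, n + 1):
        if doubled[k:k + n] == lst:
            return lst[:k]
    return lst  # not reached: lst always reappears at index n

print(find_period([1, 2, 1, 2, 1, 2, 1, 2]))  # [1, 2]
print(find_period([1, 2, 1, 2, 1]))           # [1, 2, 1, 2, 1] (no shorter period)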

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all substrings which contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not cover every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'

def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i, l) if (j+1) - i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]

print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. It is maybe not as fast and nice as chrisz's answer, but it may be a little simpler for beginners to read and understand.
DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    letters = ['A', 'G', 'T', 'C']
    j = i
    seq = []
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except:
            pass
        j += 1
    if len(letters) == 0:
        toSave.append(seq)

print(toSave)
Since the substring you are looking for may be of almost any length, a sliding window (a FIFO queue) seems to work: append one letter at a time, check whether the window contains at least one of each letter, and if it does, yield it; then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count, chars)):
            yield "".join(cur_str)
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        print(x)
        s, c = s[1:], len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT).
The while loop iterates over each possible prefix of s for which c is no larger than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exists in the substring, we print the result, reset c to its default value, and slice one character off the start of s.
The loop continues until the string s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c, r = len(d), []
while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        r.append(x)
        s, c = s[1:], len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of consecutive repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just the number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items

# --------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']

Finding Palindrome from a permutation in Python

I have a string, and I need to find all palindromic subsequences of length 4 (any choice of 4 indices), where the indices are in ascending order (index1 < index2 < index3 < index4).
My code works fine for a small string like mystr, but for a large string it takes a long time.
from itertools import permutations

# Mystr
mystr = "kkkkkkz"  # "ghhggh"
# Another Mystr
# mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"

l = len(mystr)
mylist = permutations(range(l), 4)
cnt = 0
for i in filter(lambda i: i[0] < i[1] < i[2] < i[3] and (mystr[i[0]] + mystr[i[1]] + mystr[i[2]] + mystr[i[3]] == mystr[i[3]] + mystr[i[2]] + mystr[i[1]] + mystr[i[0]]), mylist):
    # print(i)
    cnt += 1
print(cnt)  # Number of palindromes found
If you want to stick with the basic structure of your current algorithm, a few things will speed it up. Use combinations instead of permutations: it yields the tuples in sorted order, so you don't need to check that the indexes are ascending. Secondly, you can speed up the palindrome check by simply comparing the first two characters with the last two characters reversed (instead of comparing the whole thing against its reversed self).
from itertools import combinations

mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"

cnt = 0
for m in combinations(mystr, 4):
    if m[:2] == m[:1:-1]:
        cnt += 1
print cnt
Or if you want to simplify that last bit to a one-liner:
print len([m for m in combinations(mystr, 4) if m[:2] == m[:1:-1]])
I didn't do rigorous timing, but on my system this method takes about 6.3 seconds to run with your really long string, which is significantly faster than your method.
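If that is still too slow, here is a rough sketch of a counting approach that avoids enumerating every quadruple: a length-4 palindromic subsequence is just an inner pair of equal characters wrapped by an outer pair of equal characters, so per-character prefix counts are enough. The function name count_pal4 is mine, not from the question:

from collections import Counter

def count_pal4(s):
    """Count index quadruples i < j < k < l with s[i] == s[l] and s[j] == s[k]."""
    n = len(s)
    # prefix[i][ch] = number of occurrences of ch in s[:i]
    prefix = [Counter()]
    for ch in s:
        nxt = prefix[-1].copy()
        nxt[ch] += 1
        prefix.append(nxt)
    total = prefix[n]
    count = 0
    for j in range(n):
        for k in range(j + 1, n):
            if s[j] != s[k]:
                continue
            # choose i < j and l > k with s[i] == s[l]
            for ch, left in prefix[j].items():
                right = total[ch] - prefix[k + 1][ch]
                count += left * right
    return count

print(count_pal4("kkkkkkz"))  # 15, the same count the brute force gives (C(6, 4))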

difflib returns different ratio depending on order of sequences

Does anyone know why these two return different ratios?
>>> import difflib
>>> difflib.SequenceMatcher(None, '10101789', '11426089').ratio()
0.5
>>> difflib.SequenceMatcher(None, '11426089', '10101789').ratio()
0.625
This gives some ideas of how matching works.
>>> import difflib
>>>
>>> def print_matches(a, b):
...     s = difflib.SequenceMatcher(None, a, b)
...     for block in s.get_matching_blocks():
...         print "a[%d] and b[%d] match for %d elements" % block
...     print s.ratio()
...
>>> print_matches('01017', '14260')
a[0] and b[4] match for 1 elements
a[5] and b[5] match for 0 elements
0.2
>>> print_matches('14260', '01017')
a[0] and b[1] match for 1 elements
a[4] and b[2] match for 1 elements
a[5] and b[5] match for 0 elements
0.4
It looks as if it matches as much as it can of the first sequence against the second and continues from those matches. In the first case ('01017', '14260'), the right-hand match is on the 0, the last character, so no further matches to the right are possible. In the second case ('14260', '01017'), the 1s match and the 0 is still available to match on the right, so two matches are found.
I think the matching algorithm is commutative against sorted sequences.
I was working with difflib lately, and though this answer is late, I thought it might add a little spice to the answer provided by hughdbrown as it shows what's happening visually.
Before I go to the code snippet, let me quote the documentation
The idea is to find the longest contiguous matching subsequence that
contains no "junk" elements; these "junk" elements are ones that are
uninteresting in some sense, such as blank lines or whitespace.
(Handling junk is an extension to the Ratcliff and Obershelp
algorithm.) The same idea is then applied recursively to the pieces of
the sequences to the left and to the right of the matching
subsequence. This does not yield minimal edit sequences, but does tend
to yield matches that “look right” to people.
I think comparing the first string against the second one and then finding matches looks right enough to people. This is explained nicely in the answer by hughdbrown.
Now try and run this code snippet:
from difflib import SequenceMatcher

def show_matching_blocks(a, b):
    s = SequenceMatcher(None, a, b)
    m = s.get_matching_blocks()
    seqs = [a, b]
    new_seqs = []
    for select, seq in enumerate(seqs):
        i, n = 0, 0
        new_seq = ''
        while i < len(seq):
            if i == m[n][select]:
                new_seq += '{' + seq[m[n][select]:m[n][select] + m[n].size] + '}'
                i += m[n].size
                n += 1
            elif i < m[n][select]:
                new_seq += seq[i:m[n][select]]
                i = m[n][select]
        new_seqs.append(new_seq)
    for seq, n in zip(seqs, new_seqs):
        print('{} --> {}'.format(seq, n))
    print('')

a, b = '10101789', '11426089'
show_matching_blocks(a, b)
show_matching_blocks(b, a)
Output:
10101789 --> {1}{0}1017{89}
11426089 --> {1}1426{0}{89}
11426089 --> {1}{1}426{0}{89}
10101789 --> {1}0{1}{0}17{89}
The parts inside braces ({}) are the matching parts. I just used SequenceMatcher.get_matching_blocks() to put the matching blocks within braces for better visibility. You can clearly see the difference when the order is reversed. With the first order there are 4 matching characters, so the ratio is 2*4/16 = 0.5; when the order is reversed there are 5 matching characters, so the ratio becomes 2*5/16 = 0.625. The ratio is calculated as 2.0*M/T, where M is the number of matches and T is the total number of elements in both sequences, as given in the documentation.
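To verify that arithmetic directly, a small snippet (my own) that sums the matching-block sizes and recomputes the ratio:

from difflib import SequenceMatcher

a, b = '10101789', '11426089'

matched = sum(block.size for block in SequenceMatcher(None, a, b).get_matching_blocks())
print("%d matches, ratio %.3f" % (matched, 2.0 * matched / (len(a) + len(b))))  # 4 matches, ratio 0.500

matched = sum(block.size for block in SequenceMatcher(None, b, a).get_matching_blocks())
print("%d matches, ratio %.3f" % (matched, 2.0 * matched / (len(a) + len(b))))  # 5 matches, ratio 0.625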

Removing a character in a string one at a time

Basically I want to remove a character from a string one occurrence at a time, if it occurs multiple times.
For example, if I have the word abaccea and the character 'a', then the output of the function should be baccea, abacce, abccea.
I read that I can use maketrans with 'a' and an empty string, but that replaces every a in the string.
Is there an efficient way to do this besides noting all the positions in a list and then replacing and generating the words?
Here is a quick way of doing it:
In [6]: s = "abaccea"
In [9]: [s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]
Out[10]: ['baccea', 'abccea', 'abacce']
This has the added benefit that it can be turned into a generator by simply replacing the square brackets with round ones.
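For example, the generator form of that comprehension would look like this:

s = "abaccea"
gen = (s[:key] + s[key+1:] for key, val in enumerate(s) if val == "a")
for variant in gen:
    print(variant)  # baccea, then abccea, then abacce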
You could try the following script. It provides a simple function to do what you ask. The use of list comprehensions [x for x in y if something(x)] is well worth learning.
#!/usr/bin/python
word = "abaccea"
letter = "a"

def single_remove(word, letter):
    """Remove one occurrence of letter from word at a time."""
    indexes = [c for c in xrange(len(word)) if word[c] == letter]
    return [word[:i] + word[i + 1:] for i in indexes]

print single_remove(word, letter)
returns ['baccea', 'abccea', 'abacce']
Cheers
I'd say that your approach sounds good - it is a reasonably efficient way to do it and it will be clear to the reader what you are doing.
However a slightly less elegant but possibly faster alternative is to use the start parameter of the find function.
i = 0
while True:
    j = word.find('a', i)
    if j == -1:
        break
    print word[:j] + word[j+1:]
    i = j + 1
The find function is likely to be highly optimized in C, so this may give you a performance improvement compared to iterating over the characters in the string yourself in Python. Whether you want to do this though depends on whether you are looking for efficiency or elegance. I'd recommend going for the simple and clear approach first, and only optimizing it if performance profiling shows that efficiency is an important issue.
Here are some performance measurements showing that the code using find can run faster:
>>> method1 = '[s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]'
>>> method2 = '''
result = []
i = 0
while True:
    j = s.find('a', i)
    if j == -1:
        break
    result.append(s[:j] + s[j+1:])
    i = j + 1
'''
>>> timeit.timeit(method1, init, number=100000)
2.5391986271997666
>>> timeit.timeit(method2, init, number=100000)
1.1471052885212885
How about this?
>>> def replace_a(word):
...     word = word[1:8]
...     return word
...
>>> replace_a("abaccea")
'baccea'
>>>
