Matching Strings in Python?

Matching Strings in Python? - python

Using Python, how can I check whether 3 consecutive chars within a string (A) are also contained in another string (B)? Is there any built-in function in Python?
EXAMPLE:
A = FatRadio
B = fradio
Assuming that I have defined a threshold of 3, the python script should return true as there are three consecutive characters in B which are also included in A (note that this is the case for 4 and 5 consecutive characters as well).

How about this?
char_count = 3 # Or whatever you want
if len(A) >= char_count and len(B) >= char_count :
for i in range(0, len(A) - char_count + 1):
some_chars = A[i:i+char_count]
if some_chars in B:
# Huray!

You can use the difflib module:
import difflib
def have_common_triplet(a, b):
matcher = difflib.SequenceMatcher(None, a, b)
return max(size for _,_,size in matcher.get_matching_blocks()) >= 3
Result:
>>> have_common_triplet("FatRadio", "fradio")
True
Note however that SequenceMatcher does much more than finding the first common triplet, hence it could take significant more time than a naive approach. A simpler solution could be:
def have_common_group(a, b, size=3):
first_indeces = range(len(a) - len(a) % size)
second_indeces = range(len(b) - len(b) % size)
seqs = {b[i:i+size] for i in second_indeces}
return any(a[i:i+size] in seqs for i in first_indeces)
Which should perform better, especially when the match is at the beginning of the string.

I don't know about any built-in function for this, so I guess the most simple implementation would be something like this:
a = 'abcdefgh'
b = 'foofoofooabcfoo'
for i in range(0,len(a)-3):
if a[i:i+3] in b:
print 'then true!'
Which could be shorten to:
search_results = [i for in range(0,len(a)-3) if a[i:i+3] in b]

Related

Count max substring of the same character

i want to write a function in which it receives a string (s) and a single letter (s). the function needs to return the length of the longest substring of this letter. i dont know why the function i wrote doesn't work
for exmaple: print(count_longest_repetition('eabbaaaacccaaddd', 'a') supposed to return '4'
def count_longest_repetition(s, c):
n= len(s)
lst=[]
length_charachter=0
for i in range(n-1):
if s[i]==c and s[i+1]==c:
if s[i] in lst:
lst.append(s[i])
length_charachter= len(lst)
return length_charachter

Due to the condition if s[i] in lst, nothing will be appended to 'lst' as originally 'lst' is empty and the if condition will never be satisfied. Also, to traverse through the entire string you need to use range(n) as it generates numbers from 0 to n-1. This should work -
def count_longest_repetition(s, c):
n= len(s)
length_charachter=0
max_length = 0
for i in range(n):
if s[i] == c:
length_charachter += 1
else:
length_charachter = 0
max_length = max(max_length, length_charachter)
return max_length

I might suggest using a regex approach here with re.findall:
def count_longest_repetition(s, c):
matches = re.findall(r'' + c + '+', s)
matches = sorted(matches, key=len, reverse=True)
return len(matches[0])
cnt = count_longest_repetition('eabbaaaacccaaddd', 'a')
print(cnt)
This prints: 4
To better explain the above, given the inputs shown, the regex used is a+, that is, find groups of one or more a characters. The sorted list result from the call to re.findall is:
['aaaa', 'aa', 'a']
By sorting descending by string length, we push the longest match to the front of the list. Then, we return this length from the function.

Your function doesn't work because if s[i] in lst: will initially return false and never gets to add anything to to the lst list (so it will remain false throughout the loop).
You should look into regular expressions for this kind of string processing/search:
import re
def count_longest_repetition(s, c):
return max((0,*map(len,re.findall(f"{re.escape(c)}+",s))))
If you're not allowed to use libraries, you could compute repetitions without using a list by adding matches to a counter that you reset on every mismatch:
def count_longest_repetition(s, c):
maxCount = count = 0
for b in s:
count = (count+1)*(b==c)
maxCount = max(count,maxCount)
return maxCount

This can also be done by groupby
from itertools import groupby
def count_longest_repetition(text,let):
return max([len(list(group)) for key, group in groupby(list(text)) if key==let])
count_longest_repetition("eabbaaaacccaaddd",'a')
#returns 4

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For examle from sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in pyhton 2.7. I was trying with regular expressions but it was not searching for every variants.
How can I achive that?

You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
l, b = len(s), set('ATCG')
options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

This is another way you can do it. Maybe not as fast and nice as chrisz answere. But maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
letters=['A','G','T','C']
j=i
seq=[]
while len(letters)>0 and j<(len(DNA)):
seq.append(DNA[j])
try:
letters.remove(DNA[j])
except:
pass
j+=1
if len(letters)==0:
toSave.append(seq)
print(toSave)

Since the substring you are looking for may be of about any length, a LIFO queue seems to work. Append each letter at a time, check if there are at least one of each letters. If found return it. Then remove letters at the front and keep checking until no longer valid.
def find_agtc_seq(seq_in):
chars = 'AGTC'
cur_str = []
for ch in seq_in:
cur_str.append(ch)
while all(map(cur_str.count,chars)):
yield("".join(cur_str))
cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG

I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
print(x)
s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s such that c is smaller than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
r.append(x)
s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
flagged_items = []
for item in input_list:
# Create itertools.groupby object
groups = groupby(str(item))
# Create list of tuples: (digit, number of repeats)
result = [(label, sum(1 for _ in group)) for label, group in groups]
# Extract just number of repeats
char_lens = np.array([x[1] for x in result])
# Append to flagged items
if any(char_lens >= n_repeats):
flagged_items.append(item)
# Return flagged items
return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Is there a simpler way to do string_match from CodingBat in Python?

This is the problem. http://codingbat.com/prob/p182414
To summarize, given two strings (a and b) return how many times a substring of 2, from string a is in string b. For example, string_match('xxcaazz', 'xxbaaz') → 3.
def string_match(a, b):
amount = 0
for i in range(len(a)):
if (len(a[i:i+2]) == 2) and a[i:i+2] == b[i:i+2]:
amount += 1
return amount

Not much simpler, but if you limit the range of i to len(a)-1, you don't need to check if it defines a long enough substring of a.

My approach is something like this:
def string_match(a, b):
count = 0
for i in range(len(a)-1):
if a[i:i+2]==b[i:i+2]:
count += 1
return count

How to find number or 'xx' pairs in a barcode recursively (python)

I am using python.
Im having trouble with this recursion problem, I am trying to find how many pairs of characters are the same in a string. For example, 'xx' would return 1 and 'xxx' would also return one because the pairs are not allowed to overlap. 'aabbb' would return 2.
I am completely stuck. I thought of breaking the word up into length 2 strings and recursing through the string like that, but then cases like 'aaa' would result in incorrect output.
Thanks.

Not sure why you want to do this recursively. If you wish to avoid regex, you can still just scan the string from left to right. For example, using itertools.groupby
>>> from itertools import groupby
>>> s = 'aabbb'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
2
>>> s = 'yyourr ssstringg'
>>> sum(sum(1 for i in g)//2 for k,g in groupby(s))
4
sum(1 for i in g) is used to find the length of the group. If the groups are not very long you can use len(list(g)) instead

You can use regex for that:
import re
s = 'yyourr ssstringg'
print len(re.findall(r'(\w)\1', s))
[OUTPUT]
4
This also takes care of your "overlaps-not-allowed" problem as you can see in the above example it prints 4 and not 5.
For a recursion approach, you can do it as:
st = 'yyourr ssstringg'
def get_double(s):
if len(s) < 2:
return 0
else:
for i,k in enumerate(s):
if k==s[i+1]:
return 1 + get_double(s[i+2:])
>>> print get_double(st)
4
And without a for loop:
st = 'yyourr sstringg'
def get_double(s):
if len(s) < 2:
return 0
elif s[0]==s[1]:
return 1 + get_double(s[2:])
else:
return 0 + get_double(s[1:])
>>> print get_double(st)
4

I would evaluate it by 2's.
for example "sskkkj" would be looked at as two sets of two char strings:
"ss", "kk", "kj" # from 0 index
"sk", "kk" # offset by 1
look at the two sets at the same time and add only one to the count if either has a pair.

Longest common substring between two long lists

I have two really long lists, and I want to find the longest common sub string for each element of the first list in the second list.
A simplified example is
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2("z_d_e_y"). Best match for me is the string with the longest common characters. In
I looked at examples for Levenshtein Distance which works for small lists just fine (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and It hasn't been able to find even one match for the last 15 minutes..
I do not have a CS background, can someone point me to some better algorithm (or even better, its implementation? :) ) Thanks a ton.
Current code (copied off the link and someone else from stackoverflow):
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
for string in L1:
print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]
edit- just hit control+C and it gave me an incorrect(but close) answer after 15 minutes. Thats only for the first string and there's a lot of them left..

Use the difflib module:
>>> from functools import partial
>>> from difflib import SequenceMatcher
def func(x, y):
s = SequenceMatcher(None, x, y)
return s.find_longest_match(0, len(x), 0, len(y)).size
...
for item in L1:
f = partial(func, item)
print max(L2, key=f)
...
xy_b_c
z_d_e_y

You can also take a look at The Levenshtein Python C extension module. If I tested it on a random string in your example, it appeared to be about 150 times faster than the python implementation. And use max as shown by Ashwini Chaudhary.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Matching Strings in Python? - python

How about this? char_count = 3 # Or whatever you want if len(A) >= char_count and len(B) >= char_count : for i in range(0, len(A) - char_count + 1): some_chars = A[i:i+char_count] if some_chars in B: # Huray!

Related

Count max substring of the same character

Finding regular expression with at least one repetition of each letter

Is there a simpler way to do string_match from CodingBat in Python?

How to find number or 'xx' pairs in a barcode recursively (python)

Longest common substring between two long lists

Categories

Resources