find pattern in a string without using regex - python

I'm trying to find a repeating pattern in a string. Example:
trail = 'AABACCCACCACCACCACCACC'
Here one can see the "ACC" repetition after a prefix of AAB, so the result should be AAB(ACC).
How can I do this without using regex (no 'import re')? What I did so far:
def get_pattern(trail):
    for j in range(0,len(trail)):
        k = j+1
        while k<len(trail) and trail[j]!=trail[k]:
            k+=1
        if k==len(trail)-1:
            continue
        window = ''
        stop = trail[j]
        m = j
        while m<len(trail) and k<len(trail) and trail[m]==trail[k]:
            window+=trail[m]
            m+=1
            k+=1
            if trail[m]==stop and len(window)>1:
                break
        if len(window)>1:
            prefix=''
            if j>0:
                prefix = trail[0:j]
            return prefix+'('+window+')'
    return False
This (almost) does the trick, but it fails for an input like this:
"AAAAAAAAAAAAAAAAAABDBDBDBDBDBDBDBDBDBDBDBDBDBDBDBD"
the result is AA but it should be: AAAAAAAAAAAAAAAAAA(BD)

The issue with your code is that once you find a repetition that is of length 2 or greater, you don't check forward to make sure it's maintained. In your second example, this causes it to grab onto the 'AA' without seeing the 'BD's that follow.
Since we know we're dealing with cases of prefix + window, it makes sense to instead look from the end rather than the beginning.
def get_pattern(string):
    str_len = len(string)
    splits = [[string[i-rep_length: i] for i in range(str_len, 0, -rep_length)] for rep_length in range(1, str_len//2)]
    reps = [[window == split[0] for window in split].index(False) for split in splits]
    prefix_lengths = [str_len - (i+1)*rep for i,rep in enumerate(reps)]
    shortest_prefix_length = min(prefix_lengths)
    indices = [i for i, pre_len in enumerate(prefix_lengths) if pre_len == shortest_prefix_length]
    reps = list(map(reps.__getitem__, indices))
    splits = list(map(splits.__getitem__, indices))
    max_reps = max(reps)
    window = splits[reps.index(max_reps)][0]
    prefix = string[0:shortest_prefix_length]
    return f'{prefix}({window})' if max_reps > 1 else None
splits uses list comprehension to create a list of lists where each sublist splits the string into rep_length sized pieces starting from the end.
For each sublist split, the first split[0] is our proposed pattern and we see how many times that it's repeated. This is easily done by finding the first instance of False when checking window == split[0] using the list.index() function. We also want to calculate the size of the prefix. We want the shortest prefix with the largest number of reps. This is because of nasty edge cases like jeifjeiAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBB where the window has B that repeats more than the window itself. Additionally, anything that repeats 4 times can also be seen as a double-sized window repeated twice.
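To make the splitting step concrete, here is a small illustration for a single rep_length. The string 'XYABAB' and the variable names are made up for this example only; it is a sketch of the idea, not a replacement for get_pattern():

```python
string = "XYABAB"
rep_length = 2

# Cut the string into rep_length-sized pieces, walking backwards from the end.
split = [string[i - rep_length:i] for i in range(len(string), 0, -rep_length)]
print(split)  # ['AB', 'AB', 'XY']

# split[0] is the candidate window; the index of the first mismatch
# is the number of times the window repeats at the end of the string.
reps = [window == split[0] for window in split].index(False)
print(reps)  # 2

# Whatever the repeats do not cover is the prefix.
prefix_length = len(string) - rep_length * reps
print(string[:prefix_length])  # XY
```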
If you want to deal with an additional suffix, we can do a hacky solution by just trimming from the end until get_pattern() returns a pattern and then just append what was trimmed:
def get_pattern_w_suffix(string):
    for i in range(len(string), 0, -1):
        pattern = get_pattern(string[0:i])
        suffix = string[i:]
        if pattern is not None:
            return pattern + suffix
    return None
However, this assumes that the suffix doesn't have a pattern itself.
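For comparison, a plain brute-force alternative (not the approach above) is to try every prefix length and every window length and check whether the rest of the string is a whole number of copies of the window. The function name find_prefix_and_window is hypothetical, and the sketch assumes there is no trailing suffix after the repeats:

```python
def find_prefix_and_window(s):
    """Return 'prefix(window)' for the shortest prefix whose remainder is
    at least two whole copies of some window, or None if there is none."""
    for prefix_len in range(len(s)):
        rest = s[prefix_len:]
        # Try the shortest window first, so 'BDBDBD' is reported as (BD).
        for window_len in range(1, len(rest) // 2 + 1):
            if len(rest) % window_len:
                continue  # the window must divide the remainder evenly
            window = rest[:window_len]
            if window * (len(rest) // window_len) == rest:
                return f"{s[:prefix_len]}({window})"
    return None


print(find_prefix_and_window("XYABABAB"))  # XY(AB)
print(find_prefix_and_window("ABC"))       # None
```

This does far more comparisons than necessary on long inputs, but it is easy to reason about and can serve as a reference when testing a faster version.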

Related

The longest prefix that is also suffix of two lists

So I have two lists:
def function(w,w2): # => this is how I want to define my function (no more inputs than this 2 lists)
I want to know the biggest prefix of w which is also suffix of w2.
How can I do this only with logic (without importing anything)
I can try and help get you started on this problem, but it sort of sounds like a homework question so I won't give you a complete answer (per these guidelines).
If I were you I'd start with a small case and build up from there. Let's start with:
w = "ab"
w2 = "ba"
The function for this might look like:
def function(w,w2):
    prefix = ""
    # Does the first letter of w equal the last letter of w2?
    if w[0] == w2[-1]:
        prefix += w[0]
        # What about the second letter?
        if w[1] == w2[-2]:
            prefix += w[1]
    return prefix
Then when you run print(function(w,w2)) you get ab.
This code should work for 2 letter words, but what if the words are longer? This is when we would introduce a loop.
def function(w,w2):
    prefix = ""
    for i in range(0, len(w)):
        if w[i] == w2[(i+1)*-1]:
            prefix+= w[i]
        else:
            return prefix
    return prefix
Hopefully this code will offer a good starting place for you! One issue with what I have written: if w2 is shorter than w, you will get an IndexError. There are a few ways to solve this, but one way is to make sure that w is always the shorter word. Best of luck, and feel free to DM me if you have other questions.
A simple iterative approach could be:
Start from the longest possible prefix (i.e. all of w), and test it against a w2 suffix of the same length.
If they match, you can return it immediately, since it must be the longest possible match.
If they don't match, shorten it by one, and repeat.
If you never find a match, the answer is an empty string.
In code, this looks like:
>>> def function(w, w2):
...     for i in range(len(w), 0, -1):
...         if w[:i] == w2[-i:]:
...             return w[:i]
...     return ''
...
>>> function("asdfasdf", "qwertyasdf")
'asdf'
The slice operator (w[:i] for a prefix of length i, w2[-i:] for a suffix of length i) gracefully handles mismatched lengths by just giving you a shorter string if i is out of the range of the given string (which means they won't match, so the iteration is forced to continue until the lengths do match).
>>> function("aaaaaba", "ba")
'a'
>>> function("a", "abbbaababaa")
'a'
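Since the question is about lists, note that the same slicing idea carries over to them. The sketch below is a small variation, with a hypothetical name longest_prefix_suffix: it starts from the shorter of the two lengths (longer candidates can never match) and returns an empty slice of w, so the result has the same type as the input:

```python
def longest_prefix_suffix(w, w2):
    """Longest prefix of w that is also a suffix of w2 (works for str and list)."""
    for i in range(min(len(w), len(w2)), 0, -1):
        if w[:i] == w2[-i:]:
            return w[:i]
    return w[:0]  # empty str or empty list, matching the input type


print(longest_prefix_suffix([1, 2, 3], [0, 1, 2]))       # [1, 2]
print(longest_prefix_suffix("asdfasdf", "qwertyasdf"))   # asdf
print(longest_prefix_suffix([1], [2]))                   # []
```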

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm implementing string-level functions and I can do it by hand, but is there an easier way to do this? A regex, or something similar?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. set applied to a string returns the collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
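The difference between the two readings of the question shows up on an input where the same letter appears in several separate runs. A minimal sketch, with hypothetical helper names:

```python
from itertools import groupby


def count_any_repeats(word):
    # Order does not matter: every occurrence beyond the first counts.
    return len(word) - len(set(word))


def count_sequential_repeats(word):
    # Only repeats inside a run of identical adjacent characters count.
    return sum(len(list(group)) - 1 for _, group in groupby(word))


print(count_any_repeats("aooooaooaoo"))         # 9
print(count_sequential_repeats("aooooaooaoo"))  # 5
print(count_sequential_repeats("loooooveee"))   # 6
```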
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (a-zA-Z0-9_, plus other Unicode word characters when matching str in Python 3) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
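A short sketch of how the two expressions differ on input that contains repeated punctuation. The helper name count_repeats and the sample string "wow!!!" are made up for this example:

```python
import re


def count_repeats(pattern, text):
    # Each match is a run of one repeated character; a run of
    # length n contributes n - 1 repetitions.
    return sum(m.end() - m.start() - 1 for m in re.finditer(pattern, text))


print(count_repeats(r'(\w)\1+', "loooooveee"))  # 6
print(count_repeats(r'(\w)\1+', "wow!!!"))      # 0, '!' is not a word character
print(count_repeats(r'(.)\1+', "wow!!!"))       # 2
```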
try this:
word = input('something:')
total = 0  # renamed from sum so the built-in is not shadowed
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each repeated item
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to check edge cases such as an empty string, which is good practice in any case.
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.
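One caveat with the loop above is that text[0] raises an IndexError on an empty string. A compact variation that avoids this, sketched here under the hypothetical name count_adjacent_repeats, compares each character with its successor:

```python
def count_adjacent_repeats(text):
    # zip pairs every character with the one that follows it;
    # each equal pair is one sequential repetition.
    return sum(a == b for a, b in zip(text, text[1:]))


print(count_adjacent_repeats("loooooveee"))  # 6
print(count_adjacent_repeats(""))            # 0
```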

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one occurrence of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. It may not be as fast and neat as chrisz's answer, but it may be a little simpler for beginners to read and understand.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # this letter was already seen in the current sequence
            pass
        j+=1
    if len(letters)==0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, yield the current substring, then remove letters from the front and keep checking until it is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
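On long sequences, calling count on the whole window and popping from the front of a list both cost time proportional to the window size. A possible variation, sketched with collections.deque and collections.Counter (the function name find_covering_windows is hypothetical), keeps running counts instead:

```python
from collections import Counter, deque


def find_covering_windows(seq, alphabet="ACGT"):
    window = deque()    # current substring, cheap to pop from the left
    counts = Counter()  # how many of each letter the window holds
    results = []
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # While every letter is present, record the window and shrink it.
        while all(counts[letter] > 0 for letter in alphabet):
            results.append("".join(window))
            counts[window.popleft()] -= 1
    return results


print(find_covering_windows("AAGTCCTAG"))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```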
I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop examines successive prefixes s[:c] of s for as long as c does not exceed the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exists in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (digit, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']
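NumPy is only used here for the element-wise comparison, so the same check can be written with the standard library alone. A minimal sketch, using the hypothetical name find_repeats_plain:

```python
from itertools import groupby


def find_repeats_plain(input_list, n_repeats):
    # Keep an item if any run of identical adjacent characters
    # in it is at least n_repeats long.
    return [
        item for item in input_list
        if any(sum(1 for _ in group) >= n_repeats
               for _, group in groupby(str(item)))
    ]


print(find_repeats_plain(['aatcg', 'ctagg', 'catcg'], n_repeats=2))
# ['aatcg', 'ctagg']
```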

Python script to make every combination of a string with placed characters

I'm looking for help in creating a script to add periods to a string in every place but first and last, using as many periods as needed to create as many combinations as possible:
The output for the string 1234 would be:
["1234", "1.234", "12.34", "123.4", "1.2.34", "1.23.4" etc. ]
And obviously this needs to work for all lengths of string.
You should solve this type of problem yourself; these are simple data-manipulation algorithms that you should know how to come up with.
However, here is the solution (long version for more clarity):
my_str = "1234" # original string
# recursive function for constructing dots
def construct_dot(s, t):
    # s - the string to put dots
    # t - number of dots to put
    # zero dots will return the original string in a list (stop criteria)
    if t==0: return [s]
    # allocation for results list
    new_list = []
    # iterate the next dot location, considering the remaining dots.
    for p in range(1,len(s) - t + 1):
        new_str = str(s[:p]) + '.' # put the dot in the location
        res_str = str(s[p:]) # crop the string from the dot to the end
        sub_list = construct_dot(res_str, t-1) # make a list with t-1 dots (recursive)
        # append concatenated strings
        for sl in sub_list:
            new_list.append(new_str + sl)
    # we result with a list of the string with the dots.
    return new_list
# now we will iterate the number of the dots that we want to put in the string.
# 0 dots will return the original string, and we can put maximum of len(string) -1 dots.
all_list = []
for n_dots in range(len(my_str)):
    all_list.extend(construct_dot(my_str,n_dots))
# and see the results
print(all_list)
Output is:
['1234', '1.234', '12.34', '123.4', '1.2.34', '1.23.4', '12.3.4', '1.2.3.4']
A concise solution without recursion: using binary combinations (think of 0, 1, 10, 11, etc) to determine where to insert the dots.
Between each letter, put a dot when there's a 1 at this index and an empty string when there's a 0.
your_string = "1234"
def dot_combinations(string):
    i = 0
    combinations = []
    # Iter while the binary representation length is smaller than the string size
    while i.bit_length() < len(string):
        current_word = []
        for index, letter in enumerate(string):
            current_word.append(letter)
            # Append a dot if there's a 1 in this position
            if (1 << index) & i:
                current_word.append(".")
        i+=1
        combinations.append("".join(current_word))
    return combinations
print(dot_combinations(your_string))
Output:
['1234', '1.234', '12.34', '1.2.34', '123.4', '1.23.4', '12.3.4', '1.2.3.4']
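The same "dot or no dot in each gap" idea can also be sketched with itertools.product, which enumerates every choice of separator for the len(s) - 1 gaps. The function name dot_combinations_product is hypothetical, and the results come out in a different order than in the versions above:

```python
from itertools import product


def dot_combinations_product(s):
    if not s:
        return [s]
    results = []
    # One separator ("" or ".") for every gap between adjacent characters.
    for seps in product(("", "."), repeat=len(s) - 1):
        # zip stops after the last gap, so the final character is added by hand.
        body = "".join(ch + sep for ch, sep in zip(s, seps))
        results.append(body + s[-1])
    return results


print(dot_combinations_product("1234"))
# ['1234', '123.4', '12.34', '12.3.4', '1.234', '1.23.4', '1.2.34', '1.2.3.4']
```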

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Detecting an n-length pattern in a string, even without knowing where the pattern starts, can be done easily by creating a list of n-length substrings and checking whether, from some point on, all the remaining items in the list are the same. Without any information other than the string itself, is brute-forcing through all lengths the only way to recognize the pattern, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses creates a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.
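A short usage sketch of that expression on made-up input strings. Note that re.findall returns the contents of the capture group (one copy of the pattern), not the whole repeated run, and that the greedy quantifier prefers the longest unit that still repeats:

```python
import re

pattern = re.compile(r'(.{2,})\1+')

# findall returns what the capture group matched for each match.
print(pattern.findall("xxabcabcabcyy"))  # ['abc']

# The full repeated run is available from the match object.
print(pattern.search("xxabcabcabcyy").group(0))  # abcabcabc

# A doubled single character is not long enough to count.
print(pattern.findall("ss"))  # []

# Greedy matching: four copies of 'abc' are reported as two copies of 'abcabc'.
print(pattern.findall("abcabcabcabc"))  # ['abcabc']
```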
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
Given a string what you could do is -
You start with length = 1, and take two pointer variables i and j which you shall use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
    i++,j++ // till j not equal to length of string
else:
    length = length + 1
    //increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer: the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm over some of your test cases and let me know the results.
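One way to read the pseudocode above in Python is sketched below. The function name smallest_period is hypothetical, and the sketch assumes the pattern starts at index 0 and runs to the end of the string; a string with no repetition simply gets its own length back:

```python
def smallest_period(s):
    """Smallest length such that s[i] == s[i + length] for every valid i."""
    n = len(s)
    for length in range(1, n + 1):
        # i and j = i + length walk the string together; any mismatch
        # means this length fails and the next one is tried from i = 0.
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    return n  # only reached for the empty string


print(smallest_period("abcdeabcde"))  # 5
print(smallest_period("aaaa"))        # 1
print(smallest_period("abcd"))        # 4
```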
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']
