How to determine the minimum period of a periodic series - python

I am doing text mining and trying to clean bullet screen (弹幕) data. (A bullet screen is a kind of comment on video websites.) My data contains repeated expressions such as "LOL LOL LOL" and "LMAOLMAOLMAOLMAO", and I want to reduce them to "LOL" and "LMAO".
In most cases, I want to find the minimum period of a sequence.
CORNER CASE: The tail of the input sequence can be seen as a part of the periodic subsequence.
"eat an apple eat an apple eat an" # input
"eat an apple" # output
There are some other test cases:
cases = [
    "abcd",         # 4 abcd
    "ababab",       # 2 ab
    "ababcababc",   # 5 ababc
    "abcdabcdabc",  # 4 abcd
]
NOTE: For the last case, "abcdabcdabc", "abcd" is a better answer than "abcdabcdabc" because the last three characters "abc" are the beginning of another "abcd".
def solve(x):
    n = len(x)
    d = dict()
    T = 0
    k = 0
    while k < n:
        w = x[k]
        if w not in d:
            d[w] = T
            T += 1
        else:
            while k < n and d.get(x[k], None) == k%T:
                k += 1
            if k < n:
                T = k+1
        k += 1
    return T, x[:T]
It outputs the correct answers for the first two cases but does not handle all of them.

The Z-algorithm handles this efficiently (it runs in linear time):
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S, i.e. the maximum k such that
S[j] = S[i + j] for all 0 ≤ j < k. Note that Z[i] = 0 means that
S[0] ≠ S[i]. For easier terminology, we will refer to substrings which
are also a prefix as prefix-substrings.
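Compute the Z-array for your string and find the smallest position i > 0 such that i + Z[i] == len and len % i == 0 (len is the string length). That i is the period length. To accept a truncated final repetition, as in the corner case from the question, drop the len % i == 0 condition.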
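The Z-array approach could be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's own code: the names z_array and min_period and the allow_partial_tail flag are hypothetical, chosen here for illustration. With allow_partial_tail=True the divisibility check is skipped so that a truncated final repetition is accepted.

```python
def z_array(s):
    """Return Z, where Z[i] is the length of the longest prefix of s starting at i."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current rightmost prefix-match window [left, right)
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position in the prefix
            z[i] = min(right - i, z[i - left])
        # Extend the match character by character
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def min_period(s, allow_partial_tail=True):
    """Return the shortest prefix whose repetition reproduces s.

    allow_partial_tail is an illustrative flag: when False, the period
    length must divide len(s) exactly.
    """
    n = len(s)
    z = z_array(s)
    for i in range(1, n):
        if i + z[i] == n and (allow_partial_tail or n % i == 0):
            return s[:i]
    return s


for case in ["abcd", "ababab", "ababcababc", "abcdabcdabc"]:
    period = min_period(case)
    print(case, len(period), period)
```

Because the loop returns at the first qualifying i, the result is always the shortest period.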

I'm not fluent in Python, but can easily describe the algorithm you need:
found <- false
length <- inputString.length
size <- 1
output <- inputString
while (not found) and (size <= length / 2) do
    if (length % size = 0) then
        chunk <- inputString.substring(0, size)
        found <- true
        for (j <- 1, length / size - 1) do
            if (not inputString.substring(j * size, size).equals(chunk)) then
                found <- false
            if end
        for end
        if found then
            output <- chunk
        if end
    if end
    size <- size + 1
while end
The idea is to take increasingly long prefixes of the string, starting with length 1, and to keep increasing the length until a repeating cycle is found or it is evidently no longer feasible, that is, until half of the input length has been passed. In each iteration you compare the length of the prefix with the length of the input string: if the input length is not divisible by the current prefix length, the prefix cannot tile the input string. (An optimization would be to compute the divisors of the input length first and check only those lengths, but I avoided such optimizations for the sake of understandability.) If the input length is divisible by the current size, you take the prefix of that size and check whether it is repeated throughout the string. The first time you find such a pattern you can stop the loop, because you have found the solution. If no such pattern is found, the input string itself is the smallest repeating unit, occurring exactly once.
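In Python, the exact-repetition idea described above might look like the following minimal sketch. The function name smallest_exact_period is hypothetical, and the repeated-prefix comparison replaces the explicit inner loop of the pseudocode.

```python
def smallest_exact_period(text):
    """Return the shortest prefix that, repeated a whole number of times, equals text."""
    n = len(text)
    for size in range(1, n // 2 + 1):
        # Only prefix lengths that divide the input length can tile it exactly
        if n % size == 0 and text[:size] * (n // size) == text:
            return text[:size]
    # No shorter repeating unit exists: the text itself is the period
    return text


print(smallest_exact_period("ababab"))       # ab
print(smallest_exact_period("abcdabcdabc"))  # abcdabcdabc (a truncated tail is not tolerated here)
```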
EDIT
If you want to tolerate the last occurrence being only a part of the pattern, cut off by the end of inputString, then the algorithm can be changed like this:
found <- false
length <- inputString.length
size <- 1
output <- inputString
while (not found) and (size <= length / 2) do
    chunk <- inputString.substring(0, size)
    found <- true
    for (j <- 1, length / size) do
        if (not inputString.substring(j * size, size).equals(chunk)) then
            found <- found and (chunk.indexOf(inputString.substring(j * size)) = 0)
        if end
    for end
    if found then
        output <- chunk
    if end
    size <- size + 1
while end
In this case, the key line is
found <- found and (chunk.indexOf(inputString.substring(j * size)) = 0)
so, in the case of a mismatch, we check whether our chunk starts with the remaining part of the string. If so, then we are at the end of the input string and the pattern is partially matched up to the end of the string, so found keeps its value. If not, found becomes false. The "found and" part ensures that a mismatch earlier in the string is not overwritten by a matching tail.
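A compact Python sketch of the tail-tolerant variant is shown here. The function name smallest_period_with_tail is hypothetical, and instead of comparing chunk by chunk it repeats the candidate prefix, truncates it to the input length, and compares once. It keeps the same size <= length / 2 bound as the pseudocode.

```python
def smallest_period_with_tail(text):
    """Return the shortest prefix whose repetition, possibly cut off at the end, equals text."""
    n = len(text)
    for size in range(1, n // 2 + 1):
        chunk = text[:size]
        repeats = -(-n // size)  # ceiling division: enough copies to cover the text
        if (chunk * repeats)[:n] == text:
            return chunk
    return text


print(smallest_period_with_tail("abcdabcdabc"))  # abcd
```

For the sentence example from the question the returned period is "eat an apple " including the trailing space, so a caller may want to strip whitespace afterwards.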

You could do it this way:
def solve(string):
    foundPeriods = {}
    for x in range(len(string)):
        # Tested substring
        substring = string[0:len(string)-x]
        # Frequency count
        occurence_count = string.count(substring)
        # Compare against the original string
        if substring * occurence_count in string:
            foundPeriods[occurence_count] = substring
    return foundPeriods[max(foundPeriods.keys())]

for x in cases:
    print(x ,'===> ' , solve(x), "#" , len(solve(x)))
    print()
Output (note that the first case yields "a" rather than the expected "abcd"):
abcd ===> a # 1
ababab ===> ab # 2
ababcababc ===> ababc # 5
abcdabcdabc ===> abcd # 4
EDIT :
Answer edited to consider the following in the question
"abcdabcdabc", "abcd" is better than "abcdabcdabc" because it comes more naturally

Related

Fast python way to check if subsequence exists in a string and allow for 1-3 mismatches?

I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.
For example the string: "ATGCTGCTGA"
The subsequence "ATGCC" would be acceptable and return true, because it differs from the leading "ATGCT" by a single mismatch (the last letter).
I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.
Computation speed would be prioritized here.
Note: by mismatch I don't mean gaps, but one nucleotide (the letter A, T, C or G) being substituted for another one.
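You can zip the two strings, compare the paired characters, and count the comparisons that come out False.
Here the count is just 1. This should not be slow. Note that it only compares the subsequence against the start of the string.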
st = "ATGCTGCTGA"
s = "ATGCC"
[ x==y for (x,y) in zip(st,s)].count(False)
1
Try:
ALLOWED_MISMATCHES = 3
s = "ATGCTGCTGA"
subsequence = "ATGCC"
for i in range(len(s) - len(subsequence) + 1):
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    print("No Match")
Prints:
Match
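Since speed matters for 1,000 strings and 10,000 subsequences each, one small refinement of the sliding-window idea is to stop comparing a window as soon as it exceeds the mismatch budget. The following is only a sketch in pure Python; the function name find_with_mismatches is hypothetical, and no benchmark is claimed.

```python
def find_with_mismatches(text, pattern, max_mismatches):
    """Return the first offset where pattern matches text with at most
    max_mismatches substitutions, or -1 if there is no such offset."""
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window is already over budget
        else:
            # The inner loop finished without breaking: the window is acceptable
            return start
    return -1


print(find_with_mismatches("ATGCTGCTGA", "ATGCC", 1))  # 0
print(find_with_mismatches("ATGCTGCTGA", "ATGCC", 0))  # -1
```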
This is possible with a regex if you use PyPI's regex module, which supports fuzzy matching:
(?:ATGCC){s<=3:[ATGC]}
(?:ATGCC) - Look for 'ATGCC'
{s<=3:[ATGC]} - Allow up to three substitutions, each of which must be a nucleotide character from the class [ATGC]. (The error type s means substitutions; i would mean insertions, which the question rules out.)
For example:
import regex as re
s = 'ATGCTGCTGA'
print(bool(re.search(r'(?:ATGCC){s<=3:[ATGC]}', s)))
Prints:
True
Especially if your subsequences all have the same length, you could try k-mer matching with Biotite, a package I am a developer of.
To allow mismatches, you can generate similar subsequences in the matching process using a SimilarityRule. In this example the rule and substitution matrix are set so that all subsequences with up to MAX_MISMATCH mismatches are enumerated. Since the k-mer matching is implemented in C, it should run quite fast.
import numpy as np
import biotite.sequence as seq
import biotite.sequence.align as align

K = 5
MAX_MISMATCH = 1

database_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCTGCTGA",
        # ...
    ]
]
query_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCC",
        # ...
    ]
]

kmer_table = align.KmerTable.from_sequences(K, database_sequences)

# The alphabet for the substitution matrix
# should not contain ambiguous symbols, such as 'N'
alphabet = seq.NucleotideSequence.alphabet_unamb
matrix_entries = np.full((len(alphabet), len(alphabet)), -1)
np.fill_diagonal(matrix_entries, 0)
matrix = align.SubstitutionMatrix(alphabet, alphabet, matrix_entries)
print(matrix)
print()

# The similarity rule will allow up to MAX_MISMATCH mismatches
similarity_rule = align.ScoreThresholdRule(matrix, -MAX_MISMATCH)

for i, sequence in enumerate(query_sequences):
    matches = kmer_table.match(sequence, similarity_rule)
    # Columns:
    # 0. Sequence position of match in query (first position of k-mer)
    # 1. Index of DB sequence in list
    # 2. Sequence position of match in DB sequence (first position of k-mer)
    print(matches)
    print()
    index_of_matches = np.unique(matches[:, 1])
    for j in index_of_matches:
        print(f"Match of subsequence {i} to sequence {j}")
    print()
Output:
    A   C   G   T
A   0  -1  -1  -1
C  -1   0  -1  -1
G  -1  -1   0  -1
T  -1  -1  -1   0
[[0 0 0]]
Match of subsequence 0 to sequence 0
If your subsequences have different lengths but are all quite short (up to about length 10), you could create a KmerTable for each subsequence length (K = length) and match each table against the subsequences of the respective length. If your subsequences are much longer than that, this approach probably will not work due to the memory requirements of a KmerTable.

How to encode (replace) parts of word from end to beginning for some N value (like abcabcab to cbacbaba for n=3)?

I would like to create a program for encoding and decoding words.
Specifically, the program should take a part of the word (n characters at a time) and reverse it.
This repeats until the whole word is encoded.
First I compute the number of groups: the number of full groups of n characters plus a possible remainder group.
(For example, Language with n = 3 has 3 parts: two parts of 3 characters and one remainder of 2 characters.) I call this count general.
Then, for each group, I loop n times, taking one character at a time and adding it to group (so group has n characters).
At the end of each group I append the group to new_word in reverse order and reset group.
The goal is, for example, to encode the word Language with n = 2 to aLgnaueg,
or Language with n = 3 to naL aug eg, and so on.
Another example is the word abcabcab with n = 3, which should become cba cba ba.
My code doesn't do this correctly. The output for n = 3 is "naLaugeg".
How can I improve it? Is there a simpler Python function for this?
My code is here:
n = 3
word = "Language"
new_word = ""
group = ""
divisions = (len(word)//n)
residue = (len(word)%n)
general = divisions + residue
for i in range(general):
    j=2
    for l in range(n):
        group += word[i+j]
        print(word[i+j], l)
        j=j-1
    for j in range((len(group)-1),-1,-1):
        new_word += group[j]
        print(word[j])
    group = ""
    print(group)
print(new_word)
You can split the word into chunks with textwrap and reverse each chunk:
import textwrap
n = 3
word = "Language"
chunks = textwrap.wrap(word, n)
reversed_chunks = [chunk[::-1] for chunk in chunks]
print(' '.join(reversed_chunks))
Output:
naL aug eg
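Because textwrap breaks text at whitespace, plain slicing may be a safer fit when the input can contain spaces. Here is a minimal sketch using only slices; the function name reverse_chunks and the sep parameter are hypothetical. Applying the function twice with the same n and an empty separator restores the original word, so it serves for decoding as well.

```python
def reverse_chunks(word, n, sep=""):
    """Split word into consecutive chunks of n characters and reverse each chunk."""
    chunks = [word[i:i + n] for i in range(0, len(word), n)]
    return sep.join(chunk[::-1] for chunk in chunks)


print(reverse_chunks("Language", 2))           # aLgnaueg
print(reverse_chunks("Language", 3, sep=" "))  # naL aug eg
print(reverse_chunks("abcabcab", 3, sep=" "))  # cba cba ba
```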

How to find the most amount of shared characters in two strings? (Python)

yamxxopd
yndfyamxx
Output: 5
I am not sure how to find the length of the longest run of characters shared by two strings. For the strings above, the longest shared run is "yamxx", which is 5 characters long.
"xx" would not be a solution because it is not the longest shared run; the longest is "yamxx", so the output would be 5.
I am quite new to Python and Stack Overflow, so any help would be much appreciated!
Note: the characters should be in the same order in both strings.
Here is a simple, efficient solution using dynamic programming.
def longest_substring(X, Y):
    m, n = len(X), len(Y)
    # LCSuff[i][j] is the length of the longest common suffix of X[:i] and Y[:j]
    LCSuff = [[0 for k in range(n+1)] for l in range(m+1)]
    result = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if (i == 0 or j == 0):
                LCSuff[i][j] = 0
            elif (X[i-1] == Y[j-1]):
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1
                result = max(result, LCSuff[i][j])
            else:
                LCSuff[i][j] = 0
    print(result)

longest_substring("abcd", "arcd")          # prints 2
longest_substring("yammxdj", "nhjdyammx")  # prints 5
This solution starts with sub-strings of longest possible lengths. If, for a certain length, there are no matching sub-strings of that length, it moves on to the next lower length. This way, it can stop at the first successful match.
s_1 = "yamxxopd"
s_2 = "yndfyamxx"
l_1, l_2 = len(s_1), len(s_2)
found = False
sub_length = l_1                                  # Let's start with the longest possible sub-string
while (not found) and sub_length:                 # Loop, over decreasing lengths of sub-string
    for start in range(l_1 - sub_length + 1):     # Loop, over all start-positions of sub-string
        sub_str = s_1[start:(start+sub_length)]   # Get the sub-string at that start-position
        if sub_str in s_2:                        # If found a match for the sub-string, in s_2
            found = True                          # Stop trying with smaller lengths of sub-string
            break                                 # Stop trying with this length of sub-string
    else:                                         # If no matches found for this length of sub-string
        sub_length -= 1                           # Let's try a smaller length for the sub-strings
print(f"Answer is {sub_length}" if found else "No common sub-string")
Output:
Answer is 5
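If the shared run itself is wanted rather than just its length, the standard library's difflib can find it. This is a minimal sketch; the wrapper name longest_common_substring is hypothetical, while SequenceMatcher and find_longest_match are part of difflib. autojunk=False turns off the heuristic that ignores very frequent elements in long inputs.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


shared = longest_common_substring("yamxxopd", "yndfyamxx")
print(shared, len(shared))  # yamxx 5
```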
s1 = "yamxxopd"
s2 = "yndfyamxx"
# initializing counter
counter = 0
# building a string with the distinct characters of s1 (no repetition)
s = ""
for x in s1:
    if x not in s:
        s = s + x
# counting the distinct characters of s1 that also occur in s2
for x in s:
    if x in s2:
        counter = counter + 1
# display the number of distinct characters shared by s1 and s2
# (this ignores order and adjacency)
print(counter)  # displays 5

Infinite string

We are given N words, each of length at most 50. All words consist of lowercase letters and digits. We concatenate all N words to form a bigger string A. An infinite string S is built by performing infinitely many steps on A recursively: in the i-th step, A is concatenated with '$' repeated i times, followed by the reverse of A. For example, let N be 3 and the words be '1', '2' and '3'. After concatenating we get A = 123, whose reverse is 321.
After the first recursion A = 123$321, after the second recursion A = 123$321$$123$321, and so on. The infinite string obtained this way is S. Now, after the i-th recursion, we have to find the character at a given index k. The number of recursions can be as large as pow(10,4), N can be as large as pow(10,4), and each word has length at most 50, so in the worst case the starting string has length 5*(10**5). That is huge, so actually performing the recursion and building the string won't work.
What I came up with is that the string is a palindrome after the first recursion, so if I can calculate the positions of the '$' runs, I can calculate any index, since the string before and after each run is a palindrome. I came up with a pattern that looks like this:
string='123'
k=len(string)
recursion=100
lis=[]
for i in range(1,recursion+1):
    x=(2**(i-1))
    y=x*(k+1)+(x-i)
    lis.append(y)
print(lis[:10])
Output:
[4, 8, 17, 36, 75, 154, 313, 632, 1271, 2550]
Now I have two problems with it. First, I also want to add the positions of the adjacent '$' characters to the list, because after position 8 (the 2nd recursion) there is (2-1) = 1 more '$' at position 9, and likewise after position 17 (the 3rd recursion) there are (3-1) = 2 more '$' at positions 18 and 19, and so on up to the i-th recursion. For that I would have to add a while loop, which would make my algorithm exceed the time limit (TLE):
string='123'
k=len(string)
recursion=100
lis=[]
for i in range(1,recursion+1):
    x=(2**(i-1))
    y=x*(k+1)+(x-i)
    lis.append(y)
    count=1
    while(count<i):
        y=y+1
        lis.append(y)
        count+=1
print(lis[:10])
Output: [4, 8, 9, 17, 18, 19, 36, 37, 38, 39]
The idea behind finding the positions of '$' is that the string before and after it is a palindrome: if the index of the '$' is odd, the elements before and after it are the last element of the string, and if it is even, they are the first element of the string.
The number of dollar signs that S will have in each group of them follows the following sequence:
1 2 1 3 1 2 1 4 1 2 1 ...
This corresponds to the number of trailing zeroes that i has in its binary representation, plus one:
bin(i) | dollar signs
--------+-------------
00001 | 1
00010 | 2
00011 | 1
00100 | 3
00101 | 1
00110 | 2
... ...
With that information you can use a loop that subtracts from k the size of the original words and then subtracts the number of dollars according to the above observation. This way you can detect whether k points at a dollar or within a word.
Once k has been "normalised" to an index within the limits of the original total words length, there only remains a check to see whether the characters are in their normal order or reversed. This depends on the number of iterations done in the above loop, and corresponds to i, i.e. whether it is odd or even.
This leads to this code:
def getCharAt(words, k):
    size = sum([len(word) for word in words])  # sum up the word sizes
    i = 0
    while k >= size:
        i += 1
        # Determine number of dollars: corresponds to one more than the
        # number of trailing zeroes in the binary representation of i
        b = bin(i)
        dollars = len(b) - b.rindex("1")
        k -= size + dollars
        if k < 0:
            return '$'
    if i%2:  # if i is odd, then look in reversed order
        k = size - 1 - k
    # Get the character at the k-th index
    for word in words:
        if k < len(word):
            return word[k]
        k -= len(word)
You would call it like so:
print (getCharAt(['1','2','3'], 13)) # outputs 3
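To gain confidence in the dollar-count observation, the lookup can be compared against a string that is actually built for a few steps. The sketch below uses hypothetical names build_prefix and char_at, assumes the words have already been joined into a single string a, and computes the dollar count with the bit trick (i & -i).bit_length(), which equals the number of trailing zero bits of i plus one.

```python
def build_prefix(a, steps):
    """Explicitly build the string after the given number of steps (small inputs only)."""
    s = a
    for i in range(1, steps + 1):
        s = s + "$" * i + s[::-1]
    return s


def char_at(a, k):
    """Return the character at 0-based index k of the infinite string built from a."""
    size = len(a)
    i = 0
    while k >= size:
        i += 1
        dollars = (i & -i).bit_length()  # trailing zero bits of i, plus one
        k -= size + dollars
        if k < 0:
            return "$"
    # Blocks with an odd number alternate to the reversed orientation
    return a[size - 1 - k] if i % 2 else a[k]


prefix = build_prefix("123", 4)
print(all(char_at("123", k) == prefix[k] for k in range(len(prefix))))
```

The brute-force builder doubles the string at every step, so it is only meant for checking small cases.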
Generator Version
When you need to request multiple characters like that, it might be more interesting to create a generator, which just keeps producing the next character as long as you keep iterating:
def getCharacters(words):
    i = 0
    while True:
        i += 1
        if i%2:
            for word in words:
                yield from word
        else:
            for word in reversed(words):
                yield from reversed(word)
        b = bin(i)
        dollars = len(b) - b.rindex("1")
        yield from "$" * dollars
If for instance you want the first 80 characters from the infinite string that would be built from "a", "b" and "cd", then call it like this:
import itertools
print ("".join(itertools.islice(getCharacters(['a', 'b', 'cd']), 80)))
Output:
abcd$dcba$$abcd$dcba$$$abcd$dcba$$abcd$dcba$$$$abcd$dcba$$abcd$dcba$$$abcd$dcba$
Here is my solution to the problem (the index starts at 1 for findIndex). I am basically counting recursively to find the value of the findIndex-th element.
def findInd(k,n,findIndex,orientation):
    temp = k  # no. of characters covered.
    tempRec = n  # no. of dollars to be added
    bool = True  # keeps track of if dollar or reverse of string is to be added.
    while temp < findIndex:
        if bool:
            temp += tempRec
            tempRec += 1
            bool = not bool
        else:
            temp += temp - (tempRec - 1)
            bool = not bool
    # print(temp,findIndex)
    if bool:
        if findIndex <= k:
            if orientation:  # checks if string must be reversed.
                return A[findIndex - 1]
            else:
                return A[::-1][findIndex - 1]  # the string reverses when there is a single dollar so this is necessary
        else:
            if tempRec-1 == 1:
                return findInd(k,1,findIndex - (temp+tempRec-1)//2,False)  # we send a false for orientation as we want a reverse in case we encounter a single dollar sign.
            else:
                return findInd(k,1,findIndex - (temp+tempRec-1)//2,True)
    else:
        return "$"

A = "123"  # change to suit your need
findIndex = 24  # the index to be found # change to suit your need
k = len(A)  # length of the string.
print(findInd(k,1,findIndex,True))
I think this will also satisfy your time constraint, as I do not go through each element.

extract substring pattern

I have a long file with about 1200 sequences, like this:
>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
I want to extract every pattern that has a cysteine (C) in the middle, preceded by five characters and followed by five characters, such as xxxxxCxxxxx.
The output should look like this:
QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM
This program only gives the position of C. It does not do what I want:
pos=[]
def find(ch,string1):
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos
z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')
print(z)
You need to return outside the loop; you are returning on the first match, so you only ever get a single index in your list:
def find(ch,string1):
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos  # outside
You can also use enumerate with a list comp in place of your range logic:
def indexes(ch, s1):
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]
Each index in the list comp is the character index and each char is the actual character, so we keep each index where char is equal to ch.
If you want the five chars that are both sides:
In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"
In [25]: inds = indexes("C",s)
In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']
I added checking the index as we obviously cannot get five chars before C if the index is < 5 and the same from the end.
You can do it all in a single function, yielding a slice when you find a match:
def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i - 5:i + 6]
Presuming the data in your question is actually two lines from your file, like:
s=""">3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""
Running:
for line in s.splitlines():
    print(list(find("C" ,line)))
would output:
['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
This gives six matches, not the four your expected output suggests, so I presume you did not include all possible matches.
You can also speed up the code using str.find, starting the first search at index 5 (an earlier match has no five characters before it) and each subsequent search at the last match index + 1:
def find(ch, s):
    ln, i = len(s) - 6, s.find(ch, 5)
    while 5 <= i <= ln:
        yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)
This will give the same output. Of course, if the matches cannot overlap, you can start looking for the next match much further along the string each time.
My solution is based on regex and finds all possible matches, including overlapping ones, using a regex inside a while loop. Thanks to @Smac89 for improving it by transforming it into a generator:
import re
string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""

# Generator
def find_cysteine2(string):
    # Create a loop that will apply the regex repeatedly
    # in order to capture overlapping matches
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})', string)
        # If a match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)
            # Shrink the string so that the search resumes
            # just past the start of the previous result
            location = data.start() + 1
            string = string[location:]
        # If there are no matches, stop the loop
        else:
            break

print([x for x in find_cysteine2(string)])
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
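Since the file in the question is in FASTA format with sequences wrapped over several lines, it may help to join each record's lines before searching, so that windows spanning a line break are not missed. The following is a sketch under that assumption: parse_fasta and cysteine_windows are hypothetical helper names, and the pattern uses a lookahead so that overlapping windows are all reported by a single re.findall call.

```python
import re

# A zero-width lookahead lets overlapping 11-character windows all be captured
WINDOW = re.compile(r"(?=([A-Z]{5}C[A-Z]{5}))")


def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def cysteine_windows(sequence):
    """Return every window of five residues, a C, and five more residues."""
    return WINDOW.findall(sequence)


sample = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
"""
for name, sequence in parse_fasta(sample).items():
    print(name, cysteine_windows(sequence))
```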
