Finding the largest repeating substring

I wrote a function that takes the text of a very long file, such as an entire textbook, finds any repeating substrings, and outputs the longest one. Right now it doesn't work, however: it just outputs the string I put in.
For example, if a typo caused an entire sentence to be repeated, it would output that sentence, provided it is the longest repeat in the whole file. If an entire paragraph was typed twice, it would output the paragraph.
The algorithm takes the first character, looks for any matches, and, if there is a match and its length is the largest so far, stores the substring. Then it takes the first 2 characters and repeats, then the first 3 characters, and so on. Then it starts over from the 2nd character instead of the 1st, and then again from the 3rd character, all the way through the string.
def largest_substring(string):
    length = 0
    x,y=0,0
    for y in range(len(string)): #start at string[0, ]
        for x in range(len(string)): #start at string[ ,0]
            substring = string[y:x] #substring is [0,0] first, then [0,1], then [0,2]... then [1,1] then [1,2] then [1,3]... then [2,2] then [2,3]... etc.
            if substring in string: #if substring found and length is longest so far, save the substring and proceed.
                if len(substring) > length:
                    match = substring
                    length = len(substring)

I think your logic is flawed: the test if substring in string is always true, because every slice taken from the string is, by construction, contained in it. The function therefore just keeps the longest slice it tries. Instead, you need to check whether the substring occurs more than once in the entire string, and only then update the longest match.
Here is an example of a brute-force algorithm that solves it:
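import re

def largest_substring(string):
    match = ''  # longest repeated substring found so far
    length = 0
    for y in range(len(string)):
        for x in range(y + 1, len(string) + 1):
            substring = string[y:x]
            # re.escape stops regex metacharacters in the text from being interpreted.
            # finditer counts non-overlapping occurrences only.
            if len(substring) > length and len(list(re.finditer(re.escape(substring), string))) > 1:
                match = substring
                length = len(substring)
    return match

print(largest_substring("this is repeated is repeated is repeated"))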
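Note that re.finditer only counts non-overlapping matches, so repeats that overlap each other are not detected by the code above. Below is a minimal sketch of an alternative, assuming overlapping repeats should count. It uses only str.find; the function name longest_repeated_substring is hypothetical and not from the original post.

```python
def longest_repeated_substring(text):
    """Return the longest substring that occurs at least twice in `text`.

    Occurrences are allowed to overlap. Illustrative sketch only.
    """
    best = ""
    n = len(text)
    for start in range(n):
        # Only try candidates that are longer than the best match so far.
        for end in range(start + len(best) + 1, n + 1):
            candidate = text[start:end]
            # Look for a second occurrence that begins after `start`.
            if text.find(candidate, start + 1) == -1:
                # If this candidate does not repeat, no longer one
                # beginning at `start` can repeat either.
                break
            best = candidate
    return best


print(longest_repeated_substring("banana"))
```

This is still brute force and will be slow on a whole textbook; for inputs that large, a suffix-array or suffix-automaton approach is the usual direction.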

Related

length of the longest substring of given string so that rearrangement of its characters form PALINDROME

Only lower case string as input.
Only words as input
Invalid if characters like "#","#"... are present
Find the length of the longest substring of the given string whose characters can be rearranged to form a palindrome.
Output the length.
I am unable to express this in Python.
Please help.
My line of thinking was to keep an initial counter of 1 (since even a word with completely different letters has 1 by default),
then add 2 for every matching pair of letters.
Sample input: "letter"
Sample output: 5 # 1 (by default) + 2 (for the 2 "t"s) + 2 (for the 2 "e"s)

You can use this code to figure out the length of the longest substring which can be rearranged to form a palindrome:
def longestSubstring(s: str):
    # To keep track of the last
    # index of each xor
    n = len(s)
    index = dict()
    # Initialize answer with 0
    answer = 0
    mask = 0
    index[mask] = -1
    # Now iterate through each character
    # of the string
    for i in range(n):
        # Convert the character from
        # [a, z] to [0, 25]
        temp = ord(s[i]) - 97
        # Turn the temp-th bit on if
        # character occurs odd number
        # of times and turn the temp-th
        # bit off if the character occurs
        # even number of times
        mask ^= (1 << temp)
        # If a mask is present in the index
        # Therefore a palindrome is
        # found from index[mask] to i
        if mask in index.keys():
            answer = max(answer,
                         i - index[mask])
        # If x is not found then add its
        # position in the index dict.
        else:
            index[mask] = i
        # Check for the palindrome of
        # odd length
        for j in range(26):
            # We cancel the occurrence
            # of a character if it occurs
            # odd number times
            mask2 = mask ^ (1 << j)
            if mask2 in index.keys():
                answer = max(answer,
                             i - index[mask2])
    return answer
This algorithm runs in O(N*26) time. The XOR tracks whether the count of each character is even or odd. For every position in the string, the mask records which characters have appeared an odd number of times and which an even number of times up to that point. If the same mask has already been encountered earlier, then every character occurs an even number of times between that earlier position and the current one, because the parities have come back to where they started. When every character between two points occurs an even number of times, those characters can be rearranged into a palindrome. The odd-length check handles palindromes of odd length: it flips each of the 26 bits in turn, which treats one character with an odd count as if its count were even, to account for the single character in the middle of an odd-length palindrome.
Edit: here is the link to the original code and explanation.
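For short inputs, the result of the bitmask approach can be cross-checked with a brute-force version. The sketch below relies on the fact that characters can be rearranged into a palindrome exactly when at most one of them has an odd count. The function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def can_form_palindrome(text):
    """True if the characters of `text` can be rearranged into a palindrome."""
    odd_counts = sum(1 for count in Counter(text).values() if count % 2)
    return odd_counts <= 1


def longest_rearrangeable_bruteforce(s):
    """Length of the longest substring of `s` that can be rearranged
    into a palindrome, found by checking every substring."""
    best = 0
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if j - i > best and can_form_palindrome(s[i:j]):
                best = j - i
    return best


print(longest_rearrangeable_bruteforce("letter"))  # 5, matching the sample
```

This checks every substring, so it is only practical as a test oracle for small strings.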

Explanation about split in python

I have this task.
st = 'print only the words that sstart with an s in the sstatement'
and the solution would be
for word in st.split():
    if word[0] == 's':
        print word
Why won't it work with
for word in st.split():
    if word[1] == 's':
        print word
I kind of understand what that zero stands for, but how can I print the words whose second letter is 's'?

The problem is that a word is not guaranteed to be long enough. A word with a single character (such as the 's' in this sentence) has no index 1, so word[1] raises an IndexError.
A quick fix is to use a length check:
for word in st.split():
    if len(word) > 1 and word[1] == 's':
        print word
Or you can, as @idjaw says, use slicing, which yields an empty string if the index is out of range:
for word in st.split():
    if word[1:2] == 's':
        print word
If you have a string st, you can obtain a substring with st[i:j], where i is the first index (inclusive) and j is the end index (exclusive). If the indices are out of range, that is not a problem: you simply obtain the empty string. So we construct a slice that starts at index 1 and stops before index 2. If no such index exists, we obtain the empty string (which is not equal to 's'); otherwise we obtain a string with exactly one character: the one at index 1.
If, however, you need to check against more complicated patterns, you can use a regex:
import re
rgx = re.compile(r'\b\ws\w*\b')
rgx.findall('print only the words that sstart with an s in the sstatement')
Here we match anything between word boundaries (\b) that is a sequence of word characters (\w) whose second character is an s:
>>> rgx.findall('print only the words that sstart with an s in the sstatement')
['sstart', 'sstatement']
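The slicing approach described above can be wrapped in a small helper. This is a sketch only; the function name words_with_second_letter is an assumption made for illustration.

```python
def words_with_second_letter(sentence, letter):
    """Return the words of `sentence` whose second character is `letter`."""
    # word[1:2] is '' for one-character words, so no IndexError is raised.
    return [word for word in sentence.split() if word[1:2] == letter]


st = 'print only the words that sstart with an s in the sstatement'
print(words_with_second_letter(st, 's'))  # ['sstart', 'sstatement']
```

The one-letter word 's' is skipped silently here, whereas word[1] would raise an IndexError on it.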

How to get left and right most indexes from matching word inside of string

With:
phrase = "this is string example....wow!!!"
word = "example"
I want to know the left-most and right-most indexes of a matching word inside the phrase. How can I do it with a minimum of code?
The output should be two integer values. The first is the index of the first letter of word: "e". The second is the index of the last letter of word: the same character "e" (in "example" the first and the last letters are the same). We need to find where in the phrase "this is string example....wow!!!" the word "example" is.

This is a way to do it without using the re package. It will also return the beginning/end indices of all occurrences of word:
phrase = "this is string example....wow!!!"
word = "example"
word_len = len(word)      # word length is 7
phrase_len = len(phrase)  # phrase length is 32
# We loop through the phrase using a "window" whose size equals the word length.
# If we find a match, we print the first and last index of the "current" window.
for i in range(phrase_len - word_len + 1):
    current = phrase[i:i + word_len]
    if current == word:
        print i, i + word_len - 1
# prints 15 21

I assume you only want to find the first occurrence of word in phrase. If that's the case, just use str.index to get the position of the first character. Then, add len(word) - 1 to it to get the position of the last character.
start = phrase.index(word) # 15
end = start + len(word) - 1 # 21
If you need to find indexes of all occurrences of word, it's much easier to use the re module:
import re
for m in re.finditer(word, "example example"):
print(m.start(), m.end() - 1)
Prints
0 6
8 14
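One caveat with re.finditer is that the word is interpreted as a regular expression, so a word containing characters such as "." or "*" would match more than intended. A minimal sketch that guards against this with re.escape follows; the helper name all_spans is hypothetical.

```python
import re


def all_spans(phrase, word):
    """Return (first_index, last_index) for every occurrence of `word`."""
    # re.escape makes the word match literally, even if it contains
    # regex metacharacters such as '.' or '*'.
    pattern = re.escape(word)
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, phrase)]


print(all_spans("example example", "example"))  # [(0, 6), (8, 14)]
```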

How about this:
import re
phrase = "this example is example string example....wow example!!!"
word = "example"
start, end = min([(m.start(0), m.end(0)) for m in re.finditer(word, phrase)])
print start, end - 1 # 5 11
print phrase[start:end] # example
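If "left-most and right-most" is taken to mean the first and the last occurrence of the word, str.find and str.rfind cover both without the re module. This is a sketch under that assumption; the function name first_and_last_span is made up for illustration.

```python
def first_and_last_span(phrase, word):
    """Return ((start, end), (start, end)) for the first and last
    occurrence of `word` in `phrase`, or None if it does not occur."""
    first = phrase.find(word)
    if first == -1:
        return None
    last = phrase.rfind(word)
    width = len(word) - 1  # distance from the first to the last character
    return (first, first + width), (last, last + width)


phrase = "this example is example string example....wow example!!!"
print(first_and_last_span(phrase, "example"))  # ((5, 11), (46, 52))
```

Unlike str.index, str.find returns -1 instead of raising ValueError when the word is missing, which is why the sketch can return None in that case.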

How to traverse a list of words and search each word and count the occurrences of a given substring within the word?

I need to define a function called freq_count(substr,list). This function accepts a str and a list of words as arguments. It traverses the list of words and searches each word and counts the occurrences of the substring substr within the word. Print each word along with the number of substring occurrences found.
Here is my code:
def freq_count(substr,list):
    start_po = 0
    count = 0
    for word in list:
        if word.find(str(substr)) != -1:
            start_po = word.find(str,start_po)
            count = count + 1
    return(str(word) + str(count))

If I understand correctly, you have to iterate over the list of words, count the number of occurrences of substr in each word, and print them.
Note that printing and returning are two different tasks. The task statement already defines a skeleton for the function you aim to implement:
def freq_count(substr,list):
    for word in list:
        count = 0
        while ...: #we find an occurrence
            count += 1
        print(word+str(count))
The only thing we still have to implement is an occurrence check. You already partly solved this by using the find method. The only problem is that we need to iterate until we find no more occurrences. We thus need a variable that keeps track of the position we currently have in the word, a variable we will call pos. pos is added to the structure as follows:
def freq_count(substr,list):
    for word in list:
        count = 0
        pos = word.find(substr)
        while pos >= 0: #we find an occurrence
            count += 1
            pos += len(substr)
            pos = word.find(substr,pos)
        print(word+str(count))
How does this work? The first time we call word.find(substr), it searches for the first occurrence of substr in word. If no such substring exists, pos will be equal to -1. In that case the while condition fails immediately, no counting is done, and the result is zero.
If we did find an occurrence, pos will be equal to the index where the substring starts. We increment the count because we found one occurrence, and we update pos: first we add the length len(substr) of the substring, so that we do not find any occurrence that partly overlaps the previous one. Next we call find again, but now we pass it the start position pos, so we only look for occurrences that start at or after pos. The while loop keeps going as long as there are more occurrences to be found.
Demo (using python3):
>>> freq_count('aba',['','a','aba','ababa','abaaba','abababa','abacaba'])
0
a0
aba1
ababa1
abaaba2
abababa2
abacaba2
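The find loop above counts non-overlapping occurrences from left to right, which is also what the built-in str.count does. A shorter sketch using it is shown below. It returns the pairs instead of printing them, which is a deliberate difference from the task statement, and the name freq_count_pairs is hypothetical.

```python
def freq_count_pairs(substr, words):
    """Return (word, count) pairs, where count is the number of
    non-overlapping occurrences of `substr` in `word`."""
    # str.count scans left to right and skips past each match,
    # just like advancing pos by len(substr) in the loop above.
    return [(word, word.count(substr)) for word in words]


for word, count in freq_count_pairs('aba', ['aba', 'ababa', 'abaaba']):
    print(word + str(count))
```

An empty substr is not handled specially here: str.count('') returns len(word) + 1, so callers may want to reject it.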

Python word counter

I'm taking a Python 2.7 course at school, and they told us to create the following program:
Assume s is a string of lower case characters.
Write a program that prints the longest substring of s in which the letters occur in alphabetical order.
For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring.
For example, if s = 'abcbcd', then your program should print
Longest substring in alphabetical order is: abc
I wrote the following code:
s = 'czqriqfsqteavw'
string = ''
tempIndex = 0
prev = ''
curr = ''
index = 0
while index < len(s):
    curr = s[index]
    if index != 0:
        if curr < prev:
            if len(s[tempIndex:index]) > len(string):
                string = s[tempIndex:index]
            tempIndex=index
        elif index == len(s)-1:
            if len(s[tempIndex:index]) > len(string):
                string = s[tempIndex:index+1]
    prev = curr
    index += 1
print 'Longest substring in alphabetical order is: ' + string
The teacher also gave us a series of test strings to try:
onyixlsttpmylw
pdxukpsimdj
yamcrzwwgquqqrpdxmgltap
dkaimdoviquyazmojtex
abcdefghijklmnopqrstuvwxyz
evyeorezmslyn
msbprjtwwnb
laymsbkrprvyuaieitpwpurp
munifxzwieqbhaymkeol
lzasroxnpjqhmpr
evjeewybqpc
vzpdfwbbwxpxsdpfak
zyxwvutsrqponmlkjihgfedcba
vzpdfwbbwxpxsdpfak
jlgpiprth
czqriqfsqteavw
All of them work fine, except the last one, which produces the following answer:
Longest substring in alphabetical order is: cz
But it should say:
Longest substring in alphabetical order is: avw
I've checked the code a thousand times, and found no mistake. Could you please help me?

These lines:
    if len(s[tempIndex:index]) > len(string):
        string = s[tempIndex:index+1]
don't agree with each other. If the new best string is s[tempIndex:index+1], then that is the string whose length you should be comparing in the if condition. Changing them to agree with each other fixes the problem:
    if len(s[tempIndex:index+1]) > len(string):
        string = s[tempIndex:index+1]
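The same idea can be written so that the end of the string needs no special case, by checking the current run after every character. This is a sketch, not the asker's code, and the function name longest_ordered_substring is an assumption.

```python
def longest_ordered_substring(s):
    """Return the first longest substring of `s` whose letters are in
    non-decreasing alphabetical order."""
    best_start, best_len = 0, 0
    run_start = 0
    for i in range(len(s)):
        # A letter smaller than its predecessor starts a new run.
        if i > 0 and s[i] < s[i - 1]:
            run_start = i
        run_len = i - run_start + 1
        # Strict '>' keeps the earliest run when there is a tie.
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return s[best_start:best_start + best_len]


print('Longest substring in alphabetical order is: '
      + longest_ordered_substring('czqriqfsqteavw'))
```

Because the run length is compared on every iteration, a longest run that ends at the last character (as in 'czqriqfsqteavw') is picked up like any other.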

Indices are your friend.
Below is simple code for the problem.
longword = ''
for x in range(len(s)):
    for y in range(x + 1, len(s) + 1):
        word = s[x:y]
        # A slice is in alphabetical order if sorting it changes nothing.
        if word == ''.join(sorted(word)):
            if len(word) > len(longword):
                longword = word
print ('Longest substring in alphabetical order is: '+ longword)

I see that user5402 has nicely answered your question, but this particular problem intrigued me, so I decided to re-write your code. :) The program below uses essentially the same logic as your code with a couple of minor changes.
It is considered more Pythonic to avoid using indices when practical, and to iterate directly over the contents of strings (or other container objects). This generally makes the code easier to read since we don't have to keep track of both the indices and the contents.
In order to get access to both the current & previous character in the string we zip together two copies of the input string, with one of the copies offset by inserting a space character at the start. We also append a space character to the end of the other copy so that we don't have to do special handling when the longest ordered sub-sequence occurs at the end of the input string.
#! /usr/bin/env python
''' Find longest ordered substring of a given string
From http://stackoverflow.com/q/27937076/4014959
Written by PM 2Ring 2015.01.14
'''
data = [
"azcbobobegghakl",
"abcbcd",
"onyixlsttpmylw",
"pdxukpsimdj",
"yamcrzwwgquqqrpdxmgltap",
"dkaimdoviquyazmojtex",
"abcdefghijklmnopqrstuvwxyz",
"evyeorezmslyn",
"msbprjtwwnb",
"laymsbkrprvyuaieitpwpurp",
"munifxzwieqbhaymkeol",
"lzasroxnpjqhmpr",
"evjeewybqpc",
"vzpdfwbbwxpxsdpfak",
"zyxwvutsrqponmlkjihgfedcba",
"vzpdfwbbwxpxsdpfak",
"jlgpiprth",
"czqriqfsqteavw",
]
def longest(s):
    ''' Return longest ordered substring of s
        s consists of lower case letters only.
    '''
    found, temp = [], []
    for prev, curr in zip(' ' + s, s + ' '):
        if curr < prev:
            if len(temp) > len(found):
                found = temp[:]
            temp = []
        temp += [curr]
    return ''.join(found)

def main():
    msg = 'Longest substring in alphabetical order is:'
    for s in data:
        print s
        print msg, longest(s)
        print

if __name__ == '__main__':
    main()
output
azcbobobegghakl
Longest substring in alphabetical order is: beggh
abcbcd
Longest substring in alphabetical order is: abc
onyixlsttpmylw
Longest substring in alphabetical order is: lstt
pdxukpsimdj
Longest substring in alphabetical order is: kps
yamcrzwwgquqqrpdxmgltap
Longest substring in alphabetical order is: crz
dkaimdoviquyazmojtex
Longest substring in alphabetical order is: iquy
abcdefghijklmnopqrstuvwxyz
Longest substring in alphabetical order is: abcdefghijklmnopqrstuvwxyz
evyeorezmslyn
Longest substring in alphabetical order is: evy
msbprjtwwnb
Longest substring in alphabetical order is: jtww
laymsbkrprvyuaieitpwpurp
Longest substring in alphabetical order is: prvy
munifxzwieqbhaymkeol
Longest substring in alphabetical order is: fxz
lzasroxnpjqhmpr
Longest substring in alphabetical order is: hmpr
evjeewybqpc
Longest substring in alphabetical order is: eewy
vzpdfwbbwxpxsdpfak
Longest substring in alphabetical order is: bbwx
zyxwvutsrqponmlkjihgfedcba
Longest substring in alphabetical order is: z
vzpdfwbbwxpxsdpfak
Longest substring in alphabetical order is: bbwx
jlgpiprth
Longest substring in alphabetical order is: iprt
czqriqfsqteavw
Longest substring in alphabetical order is: avw

I have come across this question myself and thought I would share my answer.
My solution works 100% of the time.
The question is to help new Python coders understand loops without having to dig deep into other complex solutions. This bit of code is flatter and uses variable names to make easy reading for new coders.
I added comments to explain the code steps. Without the comments it is very clean and readable.
s = 'czqriqfsqteavw'
test_char = s[0]
temp_str = str('')
longest_str = str('')
for character in s:
    if temp_str == "": # if empty = we are working with a new string
        temp_str += character # assign first char to temp_str
        longest_str = test_char # it will be the longest_str for now
    elif character >= test_char[-1]: # compare each char to the previously stored test_char
        temp_str += character # add char to temp_str
        test_char = character # change the test_char to the 'for' looping char
        if len(temp_str) > len(longest_str): # test if temp_str stores the longest found string
            longest_str = temp_str # if yes, assign to longest_str
    else:
        test_char = character # DONT SWAP THESE TWO LINES.
        temp_str = test_char # OR IT WILL NOT WORK.
print("Longest substring in alphabetical order is: {}".format(longest_str))

My solution is similar to Nim J's, but it performs fewer iterations.
res = ""
for n in range(len(s)):
    for i in range(1, len(s)-n+1):
        if list(s[n:n+i]) == sorted(s[n:n+i]):
            if len(list(s[n:n+i])) > len(res):
                res = s[n:n+i]
print("Longest substring in alphabetical order is:", res)
