Finding the longest substring of repeating characters in a string

(this is the basis for this codeforces problem)
I try not to get help with Codeforces problems unless I'm really, really stuck, which happens to be now.
Your first mission is to find the password of the Martian database. To achieve this, your best secret agents have already discovered the following facts:
The password is a substring of a given string composed of a sequence of non-decreasing digits
The password is as long as possible
The password is always a palindrome
A palindrome is a string that reads the same backwards. racecar, bob, and noon are famous examples.
Given those facts, can you find all possible passwords of the database?
Input
The first line contains n, the length of the input string (1 ≤ n ≤ 10^5).
The next line contains a string of length n. Every character of this string is a digit.
The digits in the string are in non-decreasing order.
Output
On the first line, print the number of possible passwords, k.
On the next k lines, print the possible passwords in alphabetical order.
My observations are:
A palindrome in a non-decreasing string is simply a string of repeating characters (e.g. "4444" or "11").
For a character c, (index of the last occurrence of c) - (index of the first occurrence of c) + 1 = the length of its run of repeats.
Keeping track of the max password length and then filtering out every item that is shorter than it guarantees that the passwords printed are of max length.
My solution based on these observations is:
n,s = [input() for i in range(2)]#input
maxlength = 0
results = []
for i in s:
    length = (s.rfind(i)-s.find(i))+1
    if int(i*(length)) not in results and length>=maxlength:
        results.append(int(i*(length)))
        maxlength = length
#filter everything lower than the max password length out
results = [i for i in results if len(str(i))>=maxlength]
#output
print(len(results))
for y in results:
    print(y)
Unfortunately, this solution is wrong: it fails on the 4th test case. I do not understand what is wrong with the code, so I cannot fix it. Can someone help with this?
Thanks for reading!

Your program will fail on:
4
0011
It will return just 11.
The problem is that the length of str(int('00')) is equal to 1.
You could fix it by removing the int and str calls from your program (i.e. saving the answers as strings instead of ints).
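A minimal sketch of that fix, assuming the input is non-decreasing as the problem guarantees. It swaps the find/rfind arithmetic from the question for itertools.groupby, and the function name longest_runs is made up for illustration. Every candidate stays a string, so leading zeros survive:

```python
from itertools import groupby


def longest_runs(s):
    """Return the longest runs of one repeated character, as strings."""
    # groupby splits s into maximal runs of equal adjacent characters.
    runs = ["".join(group) for _, group in groupby(s)]
    best = max(len(run) for run in runs)
    # In a non-decreasing string each digit forms exactly one run, and the
    # runs already appear in alphabetical order.
    return [run for run in runs if len(run) == best]


passwords = longest_runs("0011")
print(len(passwords))
print(*passwords, sep="\n")
```

For the failing input 0011 this prints 2, then 00 and 11.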

Peter de Rivaz seems to have identified the problem with your code, however, if you are interested in a different way to solve this problem consider using a regular expression.
import sys
import re
next(sys.stdin) # length not needed in Python
s = next(sys.stdin)
repeats = r'(.)\1+'
for match in re.finditer(repeats, s):
    print(match.group())
The pattern (.)\1+ will find all substrings of repeated digits. Output for input
10
3445556788
would be:
44
555
88
If re.finditer() finds that there are no repeating digits then either the string is empty, or it consists of a sequence of increasing non-repeating digits. The first case is excluded since n must be greater than 0. For the second case the input is already sorted alphabetically, so just output the length and each digit.
Putting it together gives this code:
import sys
import re
next(sys.stdin) # length not needed in Python
s = next(sys.stdin).strip()
repeats = r'(.)\1+'
passwords = sorted((m.group() for m in re.finditer(repeats, s)),
                   key=len, reverse=True)
passwords = [p for p in passwords if len(p) == len(passwords[0])]
if len(passwords) == 0:
    passwords = list(s)
print(len(passwords))
print(*passwords, sep='\n')
Note that the matching substrings are extracted from the match object and then sorted by length descending. The code relies on the fact that digits in the input must not decrease so a second alphabetic sort of the candidate passwords is not required.

Related

Given some string and index, find longest repeated string

I apologize if this question has been answered elsewhere on this site, but I have searched for a while and have not found a similar question. For some slight context, I am working with RNA sequences.
Without diving into the Bio aspect, my question boils down to this:
Given a string and an index/position, I want to find the largest matching substring based on that position.
For example:
Input
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
position = 13 # the f in the substring 'stackoverflow'
Desired Output
rflo
So basically, despite 'stackov' being the longest repeated substring within the string, I only want the largest repeated substring based on the index given.
Any help is appreciated. Thanks!
Edit
I appreciate the answers provided thus far. However, I intentionally made position equal to 13 in order to show that I want to search and expand on either side of the starting position, not just to the right.

We iteratively check longer and longer substrings starting at position, simply checking whether each one occurs in the remainder of the string using the in keyword. The code treats position as 1-based, so index = position - 1 is the 0-based start. j is the length of the substring currently being tested, string[index:index+j], and longest keeps track of the longest matching length seen so far. We can break as soon as the substring starting at position no longer occurs in the remainder for the current length j.
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
position = 13
index=position-1
longest=0
for j in range(1, (len(string)-index)//2):
    if string[index:index+j] in string[index+j:]:
        longest=j
    else:
        break
print(longest)
print(string[index:index+longest])
Output:
4
rflo

Use the in keyword to check for presence in the remainder of the string, like this:
string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
# Python string indices start at 0
position = 12
for sub_len in range(1, len(string) - position):
    # Simply check if the string is present in the remainder of the string
    sub_string = string[position:position + sub_len]
    if sub_string in string[position + sub_len:] or sub_string in string[0:position]:
        continue
    break
# The last iteration of the loop did not return any occurrences, print the longest match
print(string[position:position + sub_len - 1])
Output:
rflo
If you set position = 32, this returns stackov, showing how it searches from the beginning as well.
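Neither snippet grows the match to the left of the starting position, which is what the question's edit asks for. Here is a brute-force sketch of one way to do that. It assumes position is a 0-based index, that overlapping occurrences count as repeats, and that ties go to the leftmost start; the function name longest_repeat_around is made up for illustration:

```python
def longest_repeat_around(string, position):
    """Longest substring covering string[position] that also occurs elsewhere."""
    best = ""
    for start in range(position + 1):
        for end in range(position + 1, len(string) + 1):
            sub = string[start:end]
            # A second occurrence is any match starting at a different index.
            elsewhere = (string.find(sub) != start
                         or string.find(sub, start + 1) != -1)
            if not elsewhere:
                break  # extending a non-repeating substring cannot help
            if len(sub) > len(best):
                best = sub
    return best


string = "fsalstackoverflowwqiovmnrflofmnastackovsnv"
print(longest_repeat_around(string, 13))
```

With position = 13 (the f, counting from 0) this prints rflo. It tries every start and end around the position, so it is only suitable for short strings; a suffix-based structure would be the usual choice for long sequences.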

Building a RegEx, 12 letters without order, fixed number of individual letters

I have a complex case where I can't get any further. The goal is to check a string via RegEx for the following conditions:
Exactly 12 letters
The letters W,S,I,O,B,A,R and H may only appear exactly once in the string
The letters T and E may only occur exactly 2 times in the string.
Important! The order must not matter
Example matches:
WSITTOBAEERH
HREEABOTTISW
WSITOTBAEREH
My first attempt:
results = re.match(r"^W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}$", word)
The problem with this first attempt is that it only matches if the order of the letters in the RegEx has been followed. That violates condition 4
My second attempt:
results = re.match(r"^[W{1}S{1}I{1}T{2}O{1}B{1}A{1}E{2}R{1}H{1}]{12}$", word)
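The problem with the second attempt: the order no longer matters, but the exact count of each letter is ignored. Everything inside [...] is treated as a set of single characters, so the {1} and {2} there are literal characters rather than quantifiers.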
I can only do the basics of RegEx so far and can't get any further here. If anyone has an idea what a regular expression looks like that fits the four rules mentioned above, I would be very grateful.

One possibility, although I still think regex is inappropriate for this. Checks that all letters appear the desired amount and that it's 12 letters total (so there's no room left for any more/other letters):
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch('(?=.*W)(?=.*S)(?=.*I)(?=.*O)'
                       '(?=.*B)(?=.*A)(?=.*R)(?=.*H)'
                       '(?=.*T.*T)(?=.*E.*E).{12}', s))
Another, checking that none other than T and E appear twice, that none appear thrice, and that we have only the desired letters, 12 total:
import re
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(re.fullmatch(r'(?!.*([^TE]).*\1)'
                       r'(?!.*(.).*\1.*\1)'
                       r'[WSIOBARHTE]{12}', s))
A simpler way:
for s in 'WSITTOBAEERH', 'HREEABOTTISW', 'WSITOTBAEREH':
    print(sorted(s) == sorted('WSIOBARHTTEE'))
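The same non-regex idea can be written with collections.Counter, which makes the required letter counts explicit. This is only a sketch: the names REQUIRED and is_valid are made up for illustration, and the last two test words are invented counterexamples, not taken from the question:

```python
from collections import Counter

# Required letter counts, taken from the rules in the question:
# W, S, I, O, B, A, R, H once each; T and E twice each.
REQUIRED = Counter("WSIOBARHTTEE")


def is_valid(word):
    """True if word uses exactly the required letters, in any order."""
    return Counter(word) == REQUIRED


for word in ("WSITTOBAEERH", "HREEABOTTISW", "WSITTOBAEERW", "WSITOBAERH"):
    print(word, is_valid(word))
```

The first two words pass; the third has two W and no H, and the fourth is only 10 letters long, so both are rejected.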

Match Regex permutations without repeating but with a twist

It seems that I can't find a solution for this perhaps an easy problem: I want to be able to match with a simple regex all possible permutations of 5 specified digits, without repeating, where all digits must be used. So, for this sequence:
12345
the valid permutation is:
54321
but
55555
is not valid.
However, if the provided digits have the same number once or more, only in that case the accepted permutations will have those repeated digits, but each digit must be used only once. For example, if the provided number is:
55432
we see that 5 is provided 2 times, so it must be also present two times in each permutation, and some of the accepted answers would be:
32545
45523
but this is wrong:
55523
(not all original digits are used and 5 is repeated more than twice)
I came very close to solving this using:
(?:([43210])(?!.*\1)){5}
but unfortunately it doesn't work when the provided digits contain duplicates (like 43211).

One way to solve this is to make a character class out of the search digits and build a regex to search for as many digits in that class as are in the search string. Then you can filter the regex results based on the sorted match string being the same as the sorted search string. For example:
import re
def find_perms(search, text):
    search = sorted(search)
    regex = re.compile(rf'\b[{"".join(search)}]{{{len(search)}}}\b')
    matches = [m for m in regex.findall(text) if sorted(m) == search]
    return matches
print(find_perms('54321', '12345 54321 55432'))
print(find_perms('23455', '12345 54321 55432'))
print(find_perms('24455', '12345 54321 55432'))
Output:
['12345', '54321']
['55432']
[]
Note I've included word boundaries (\b) in the regex so that (for example) 12345 won't match 654321. If you want to match substrings as well, just remove the word boundaries from the regex.

The mathematical term for this is a multiset. In Python, this is handled by the Counter data type. For example,
from collections import Counter
target = '55432'
candidate = '32545'
Counter(candidate) == Counter(target)
If you want to generate all of the multisets, here's one question dealing with that: How to generate all the permutations of a multiset?
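As a sketch of how the Counter comparison could be used to search a longer text, the snippet below slides a window of the target's length across the text and keeps every window with the same digit counts. It matches substrings anywhere (no word boundaries), and the function name find_perm_windows and the sample text are made up for illustration:

```python
from collections import Counter


def find_perm_windows(target, text):
    """Return every substring of text that is a permutation of target."""
    wanted = Counter(target)
    size = len(target)
    return [text[i:i + size]
            for i in range(len(text) - size + 1)
            if Counter(text[i:i + size]) == wanted]


print(find_perm_windows("55432", "9325459"))
```

This prints ['32545']. Rebuilding a Counter for every window is simple but not the fastest option; for long texts the counts could be updated incrementally as the window moves.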

How do I get a program to print the number of words in a sentence and each word in order

I need to print how many characters there are in a sentence the user specifies, print how many words there are in a sentence the user specifies and print each word, the number of letters in the word, and the first and last letter in the word. Can this be done?

Take your time to understand what is going on in the code below; I suggest reading these resources.
http://docs.python.org/3/library/re.html
http://docs.python.org/3/library/functions.html#len
http://docs.python.org/3/library/functions.html
http://docs.python.org/3/library/stdtypes.html#str.split
import re
def count_letter(word):
    """(str) -> int
    Return the number of letters in a word.
    >>> count_letter('cat')
    3
    >>> count_letter('cat1')
    3
    """
    return len(re.findall('[a-zA-Z]', word))
if __name__ == '__main__':
    sentence = input('Please enter your sentence: ')
    # Raw string so that \w is passed to the regex engine unchanged.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # The number of characters in the sentence.
    print(len(sentence))
    # The number of words in the sentence.
    print(len(words))
    # Print all the words in the sentence, the number of letters, the first
    # and last letter.
    for i in words:
        print(i, count_letter(i), i[0], i[-1])
Please enter your sentence: hello user
10
2
hello 5 h o
user 4 u r

Please read Python's string documentation; it is self-explanatory. Here is a short explanation of the different parts with some comments.
We know that a sentence is composed of words, each of which is composed of letters. What we have to do first is split the sentence into a list of words. Each entry in this list is a word, and each word is stored as a sequence of characters that we can access individually.
sentence = "This is my sentence"
# split the sentence
words = sentence.split()
# use len() to obtain the number of elements (words) in the list words
print('There are {} words in the given sentence'.format(len(words)))
# go through each word
for word in words:
    # len() counts the number of elements again,
    # but this time it's the chars in the string
    print('There are {} characters in the word "{}"'.format(len(word), word))
    # python is a 0-based language, in the sense that the first element is indexed at 0
    # you can go backward in an array too using negative indices.
    #
    # However, notice that the last element is at -1 and second to last is -2,
    # it can be a little bit confusing at the beginning when we know that the second
    # element from the start is indexed at 1 and not 2.
    print('The first being "{}" and the last "{}"'.format(word[0], word[-1]))

We don't do your homework for you on stack overflow... but I will get you started.
The most important method you will need is one of these two (depending on the version of python):
Python3.X - input([prompt]),.. If the prompt argument is present, it is written
to standard output without a trailing newline. The function then
reads a line from input, converts it to a string (stripping a
trailing newline), and returns that. When EOF is read, EOFError is
raised. http://docs.python.org/3/library/functions.html#input
Python2.X raw_input([prompt]),... If the prompt argument is
present, it is written to standard output without a trailing newline.
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that. When EOF is read,
EOFError is raised. http://docs.python.org/2.7/library/functions.html#raw_input
You can use them like
>>> my_sentance = raw_input("Do you want us to do your homework?\n")
Do you want us to do your homework?
yes
>>> my_sentance
'yes'
As you can see, the text that was typed is stored in the my_sentance variable.
To get the number of characters in a string, you need to understand that a string is a sequence of characters, much like a list! So if you want to know the number of characters you can use:
len(s),... Return the length (the number of items) of an object.
The argument may be a sequence (string, tuple or list) or a mapping
(dictionary). http://docs.python.org/3/library/functions.html#len
I'll let you figure out how to use it.
Finally you're going to need to use a built in function for a string:
str.split([sep[, maxsplit]]),...Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done (thus, the list will have at most
maxsplit+1 elements). If maxsplit is not specified or -1, then there
is no limit on the number of splits (all possible splits are made).
http://docs.python.org/2/library/stdtypes.html#str.split

Can't convert 'list'object to str implicitly Python

I am trying to import the alphabet but split it so that each character is in one array but not one string. splitting it works but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go around solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).

There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string
year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of characters that are ASCII letters:
# total_sum == 12
# Number of characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
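If installing a package is not an option, the distance can also be computed in plain Python. The following is a minimal sketch of the standard dynamic-programming approach, not the package's own implementation; the function name levenshtein is chosen here for illustration:

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on match)
            ))
        previous = current
    return previous[-1]


correct = "because"
student = "becuase"
distance = levenshtein(correct, student)
mark = (1 - distance / len(correct)) * 10
print(distance, round(mark, 2))
```

This prints 2 7.14, matching the values in the comments above. Swapping two adjacent letters counts as two substitutions under this definition.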

I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

join is a method of str (an ordinary instance method, called on the separator), so you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)

To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))

I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])

import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)
def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2

While join recreates the string from the split list, you do not have to do that, since you can call find with the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting looks for splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (a result of -1).
If what you are trying to do is to get the indices of all the letters of the word list within the alphabet, then you would need to handle it as
for each letter in the word of the word list, determine the index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print(j)
On the other hand if you are attempting to find the index of each character within the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print(l)
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()
