Compare adjacent characters in string for differing case - python

I am working through a coding challenge in Python. The rule is to take a string and delete any two adjacent letters that are the same character but differ in case. The process is repeated until no such pair remains side by side. Finally, the length of the string should be printed. I have written a solution below that iterates left to right, although I have been told there are better, more efficient ways.
list_of_elves=list(line)
n2=len(list_of_elves)
i=0
while i < len(list_of_elves):
    # note: when i == 0, index i-1 is -1, so the first letter is compared with the last one
    if list_of_elves[i-1].lower()==list_of_elves[i].lower() and list_of_elves[i-1] != list_of_elves[i]:
        del list_of_elves[i]
        del list_of_elves[i-1]
        if i<2:
            i-=1
        else:
            i-=2
        if len(list_of_elves)<2:
            break
    else:
        i+=1
        if len(list_of_elves)<2:
            break
print(len(list_of_elves))
I have made some pseudo code as well
PROBLEM STATEMENT
Take a given string of alphabetical characters
Build a process to count the initial string length and store to variable
Build a process to iterate through the list and identify the following rule:
Two adjacent matching letters && Of differing case
Delete the pair
Repeat process
Count final length of string
For example, if we had a string with 'aAa' then 'aA' would be deleted, leaving 'a' behind.

In Python, if you want to do it with a regex, use
re.sub(r"([a-zA-Z])(?=(?!\1)(?i:\1))", "", s) # For ASCII only letters
re.sub(r"([^\W\d_])(?=(?!\1)(?i:\1))", "", s) # For any Unicode letters
See the Python demo
Details
([^\W\d_]) - Capturing group 1: any Unicode letter (or any ASCII letter if ([a-zA-Z]) is used)
(?=(?!\1)(?i:\1)) - a positive lookahead that requires the same char as matched in the first capturing group (case insensitive) (see (?i:\1)) that is not the same char as matched in Group 1 (see (?!\1))

This is very similar to the problem of matching parentheses, but instead of a match being an opening/closing pair, a match is the upper- and lowercase form of the same letter. You can use a similar technique of maintaining a stack: iterate through the string and compare the current letter with the top of the stack. If they match, pop the element off the stack; if they don't, append the letter to the stack. In the end, the length of the stack will be your answer:
line = "cABbaC"
stack = []
match = lambda m, n: m != n and m.upper() == n.upper()
for c in line:
    if len(stack) == 0 or not match(c, stack[-1]):
        stack.append(c)
    else:
        stack.pop()
stack
# stack is empty because `Bb` `Aa` and `Cc` get deleted.
Similarly line = "cGBbgaCF" would result in a stack of ['c', 'a', 'C', 'F'] because Bb, then Gg are deleted.
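As a minimal sketch, the same stack idea can be wrapped in a function that returns only the final length, which is what the challenge asks for. The name reduced_length is an assumption made for this illustration and is not part of the answer above.

```python
def reduced_length(line):
    """Length of `line` after repeatedly deleting adjacent same-letter, opposite-case pairs."""
    stack = []
    for c in line:
        # A pair cancels when the letters are equal ignoring case but are not identical.
        if stack and stack[-1] != c and stack[-1].lower() == c.lower():
            stack.pop()
        else:
            stack.append(c)
    return len(stack)


print(reduced_length("cABbaC"))    # 0
print(reduced_length("cGBbgaCF"))  # 4
print(reduced_length("aAa"))       # 1
```

Each character is pushed and popped at most once, so the whole reduction runs in a single pass over the input.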

A method that should be very fast:
result = 1
pairs = zip(string, string[1:])
for a, b in pairs:
    if a.lower() == b.lower() and a != b:
        # skip the next pair; the default avoids StopIteration at the end of the string
        next(pairs, None)
    else:
        result += 1
print(result)
First we create a zip of the input with the input shifted by one position. This gives us an iterator that yields all the adjacent pairs in the string, in order.
Then, for every pair that doesn't match, we increment the result; for every pair that does match, we advance the iterator by one so that we skip the pair that overlaps the match.
The result is then the length of what the reduced string would be. We don't need to store the reduced string itself, since its length is the only thing that has to be returned.
Note that this is a single left-to-right pass: it does not catch pairs that only become adjacent after an earlier deletion (as in "abBA"), and it over-counts by one when the last two characters of the string form a pair.

You really only need a single assertion in the regex to match the pair and delete it.
re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
Code sample :
>>> import re
>>> strs = ["aAa","aaa","aAaAA"]
>>> for target in strs:
...     modtarg = re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
...     print( target, "\t--> (", len(modtarg), ") ", modtarg )
...
aAa --> ( 1 ) a
aaa --> ( 3 ) aaa
aAaAA --> ( 1 ) A
Info :
(?-i: # Disable Case insensitive if on
( [a-zA-Z] ) # (1), upper or lower case
(?! \1 ) # Not the same cased letter
(?i: \1 ) # Enable Case insensitive, must be the opposite cased letter
)
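A single re.sub pass only removes the pairs that are adjacent in the string as given; pairs that become adjacent after a deletion (as in "cABbaC") need further passes. A minimal sketch, assuming the same pattern as above and a hypothetical helper name fully_reduce, repeats the substitution until nothing changes:

```python
import re

# Same pattern as above: a letter followed by the same letter in the opposite case.
PAIR = re.compile(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))")


def fully_reduce(s):
    """Apply the pair-deleting substitution until no pair is left."""
    while True:
        s, count = PAIR.subn("", s)  # subn also reports how many pairs were removed
        if count == 0:
            return s


print(len(fully_reduce("cABbaC")))  # 0
print(fully_reduce("cGBbgaCF"))     # caCF
```

This can take many passes on deeply nested input, so it is meant to show the idea rather than to be the fastest option.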

Related

Find multiple longest common leading substrings with length >= 4

In Python, I am trying to extract all the longest common leading substrings that contain at least 4 characters from a list. For example, in the list called "data" below, the 2 longest common substrings that fit my criteria are "johnjack" and "detc". I know how to find the single longest common substring with the code below, which returns nothing (as expected) because there is no substring common to every item. But I am struggling to build a script that can detect multiple common substrings within a list, where each common substring must have a length of 4 or more.
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
    for i in data:
        while not i.startswith(prefix) and len(prefix) > 0:
            prefix = prefix[:-1]
    print(prefix)
ls(data)
Here's one, but I think it's probably not the fastest or most efficient. Let's start with just the data and a container for our answer:
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []
Note I added a dupe for chunganh -- that's a common edge case we should be handling.
See How do I find the duplicates in a list and create another list with them?
So to capture the duplicates in the data
seen = {}
dupes = []
for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
for dupe in dupes:
    substrings.append(dupe)
Now let's record the unique values in the data as-is
# Capture the unique values in the data
last = set(data)
From here, we can loop through our set, popping characters off the end of each unique value. If the length of our set changes, we've found a unique substring.
# Handle strings up to 10000 characters long
for k in [0-b for b in range(1, 10000)]:
    # Use negative indexing to start from the longest
    last, middle = set([i[:k] for i in data]), last
    # Unique substring found
    if len(last) != len(middle):
        for prefix in last:
            count = 0
            for word in middle:
                # leading substrings only, so test the start of the word
                if word.startswith(prefix):
                    count += 1
            if count > 1:
                substrings.append(prefix)
    # Early stopping
    if len(last) == 1:
        break
Finally, you mentioned needing only substrings of length 4 or more.
list(filter(lambda x: len(x) >= 4, substrings))
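As a hedged alternative sketch, the strings can first be grouped by their first 4 characters, and the standard-library os.path.commonprefix can then give the longest shared prefix of each group. The function name common_leading_substrings and the min_len parameter are assumptions for this illustration. It reports one prefix per group, so it will not split a group further if some of its members share an even longer prefix.

```python
import os.path
from collections import defaultdict


def common_leading_substrings(data, min_len=4):
    """Longest common prefix of every group of 2+ strings sharing their first min_len chars."""
    groups = defaultdict(list)
    for s in data:
        if len(s) >= min_len:
            groups[s[:min_len]].append(s)
    # commonprefix compares character by character, so it works on any strings.
    return [os.path.commonprefix(members)
            for members in groups.values() if len(members) > 1]


data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
print(common_leading_substrings(data))  # ['johnjack', 'detc']
```

Because every group already agrees on its first min_len characters, each returned prefix is guaranteed to be at least that long and no final length filter is needed.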

Consecutive values in strings, getting indices

The following is a Python string with a length of over 1000.
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
len(string1) ## 1311
I would like to know the index of where the consecutive X's end and the non-X characters begin. Reading this string from left to right, the first non-X character is at index location 22, and the first non-X character from the right is at index location 1306.
How does one find these indices?
My guess would be:
for x in string1:
    if x != "X":
        print(string1.index(x))
The problem with this is it outputs all indices that are not X. It does not give me the index where the consecutive X's end.
Even more confusing for me is how to "check" for consecutive X's. Let's say I have this string:
string2 = "XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
Here, the consecutive X's end at index 4, not index 7. How could I check several characters ahead whether this is really no longer consecutive?
Using a regex, split off the first and last groups of Xs, then use their lengths to construct the indices.
import re
mystr = 'XXXXAAXAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX'
# split on every run of non-X characters, leaving only the runs of Xs
xs = re.split('[^X]+', mystr)
indices = (len(xs[0]), len(mystr) - len(xs[-1]) - 1)
# (4, 47)
I simply need the outputs for the indices. I'm then going to put them in randint(first_index, second_index)
It's possible to pass the indices to the function like this:
randint(*indices)
However, I suspect that you want to use the output of randint(first_index, last_index) to select a random character from the middle, this would be a shorter alternative.
from random import choice
randchar = choice(mystr.strip('X'))
If I understood your question correctly, you can just do:
def getIndexs(string):
    lst =[]
    flag = False
    for i, char in enumerate(string):
        if char == "x":
            flag = True
        if ((char != "x") and flag):
            lst.append(i-1)
            flag = False
    return lst
print(getIndexs("xxxxbbbxxxxaaaxxxbb"))
[3, 10, 16]
If the sequences are, as you say, only in the beginning and at the end of your string, a simple loop / reversed loop would suffice:
string1 = "XXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB........AAAAXXXXX"
left_index = 0
for char in string1:
    left_index += 1
    if char != "X":
        break
right_index = len(string1)
for char in reversed(string1):
    if char != "X":
        break
    right_index -= 1
print(left_index) # 22
print(right_index) # 65
Regex can lookahead and identify characters that don't match the pattern:
>>> [match.span() for match in re.finditer(r'X{2,}((?=[^X])|$)', string2)]
[(0, 4), (48, 53)]
Breaking this down:
X - the character we're matching
{2,} - need to see at least two in a row to consider a match
((?=[^X])|$) - two conditions will satisfy the match
(?=[^X]) - lookahead for anything but an X
$ - the end of the string
As a result, finditer returns each instance where there are multiple X's, followed by a non-X or an end of line. match.span() extracts the position information from each match from the string.
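If only the leading and trailing runs matter, a smaller sketch can anchor the pattern at each end of the string instead of scanning for every run. The helper name x_run_bounds and its ch parameter are assumptions for this illustration; it returns the indices of the first and last characters that are not ch.

```python
import re


def x_run_bounds(s, ch="X"):
    """Return (first, last): indices of the first and last characters that are not `ch`."""
    run = re.escape(ch) + "*"
    lead = re.match(run, s).end()             # length of the leading run
    trail = re.search(run + r"\Z", s).start()  # where the trailing run begins
    return lead, trail - 1


print(x_run_bounds("XXABCDXXXEFGHXXXXX"))  # (2, 12)
```

For a string made up only of ch, the first value equals the length of the string and the second is -1, so callers may want to check for that case before using the result.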
This will give you the first index and last index (of non-'X' character).
s = 'XXABCDXXXEFGHXXXXX'
first_index = len(s) - len(s.lstrip('X'))
last_index = len(s.rstrip('X')) - len(s) - 1
print(first_index, last_index)
2 -6
How it works:
For first_index:
We strip all the 'X' characters at the beginning of our string. Finding the difference in length between the original and shortened string gives us the index of the first non-'X' character.
For last_index:
Similarly, we strip the 'X' characters at the end of our string. We also subtract 1 from the difference, since reverse indexing in Python starts from -1.
Note:
If you just want to randomly select one of the characters between first_index and last_index, you can do:
import random
shortened_s = s.strip('X')
random.choice(shortened_s)

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Writing a program to detect an n-length pattern in a string, even without knowing where the pattern starts, is easy: create a list of n-length substrings and check whether, from some starting point, the same item repeats through the rest of the list. Without any information other than the string itself, is the only way to recognize the pattern to brute-force through all lengths and check each one, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses creates a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.
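A short sketch of how that pattern behaves in practice; the helper name repeated_blocks is an assumption for this illustration. Because the quantifier is greedy, the regex reports the longest block that repeats from a given position, which is not always the shortest repeating unit.

```python
import re


def repeated_blocks(s):
    """Return each block of 2+ characters that is immediately followed by a copy of itself."""
    return [m.group(1) for m in re.finditer(r'(.{2,})\1+', s)]


print(repeated_blocks("zabcabcabcq"))  # ['abc']
print(repeated_blocks("ss"))           # []
print(repeated_blocks("abababab"))     # ['abab'], the greedy match, not 'ab'
```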
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
Given a string what you could do is -
Start with length = 1 and use two pointer variables, i and j, to traverse the string.
Set i = 0 and j = i + length.
if str[i] == str[j]:
    i++, j++   // repeat until j equals the length of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer, which is the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm over some of your test cases and let me know the results.
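The two-pointer procedure described above can be sketched in Python as follows. The function name smallest_period is an assumption for this illustration, and the sketch assumes the pattern starts at index 0, as in the abcdeabcde example.

```python
def smallest_period(s):
    """Smallest length p such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    for length in range(1, n):
        i, j = 0, length
        # Advance both pointers while the characters `length` apart agree.
        while j < n and s[i] == s[j]:
            i += 1
            j += 1
        if j == n:  # reached the end without a mismatch
            return length
        # Otherwise fall through, try the next length and restart from i = 0.
    return n


print(smallest_period("abcdeabcde"))  # 5
print(smallest_period("abcabcab"))    # 3
```

In the worst case this compares on the order of n characters for each of up to n candidate lengths, so it is quadratic in the length of the string.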
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Find symmetric words in a text [duplicate]

This question already has answers here:
how to find words that made up of letter exactly facing each other? (python) [closed]
(4 answers)
Closed 9 years ago.
I have to write a function which takes one argument, text, containing a block of text in the form of a str, and returns a sorted list of “symmetric” words. A symmetric word is defined as a word where, for all values i, the letter i positions from the start of the word and the letter i positions from the end of the word are equidistant from the respective ends of the alphabet. For example, bevy is a symmetric word as: b (1 position from the start of the word) is the second letter of the alphabet and y (1 position from the end of the word) is the second-last letter of the alphabet; and e (2 positions from the start of the word) is the fifth letter of the alphabet and v (2 positions from the end of the word) is the fifth-last letter of the alphabet.
For example:
>>> symmetrics("boy bread aloz bray")
['aloz','boy']
>>> symmetrics("There is a car and a book;")
['a']
This is all I have come up with so far, but it doesn't run because it's wrong:
def symmetrics(text):
    func_char= ",.?!:'\/"
    for letter in text:
        if letter in func_char:
            text = text.replace(letter, ' ')
    alpha1 = 'abcdefghijklmnopqrstuvwxyz'
    alpha2 = 'zyxwvutsrqponmlkjihgfedcba'
    sym = []
    for word in text.lower().split():
        n = range(0,len(word))
        if word[n] == word[len(word)-1-n]:
            sym.append(word)
    return sym
The code above doesn't take the positions in alpha1 and alpha2 into account, as I don't know how to do that. Can anyone help me?
Here is a hint:
In [16]: alpha1.index('b')
Out[16]: 1
In [17]: alpha2.index('y')
Out[17]: 1
An alternative way to approach the problem is by using the str.translate() method:
def is_sym(word):
    alpha1 = 'abcdefghijklmnopqrstuvwxyz'
    alpha2 = 'zyxwvutsrqponmlkjihgfedcba'
    # str.maketrans is the Python 3 spelling; Python 2 used string.maketrans
    tr = str.maketrans(alpha1, alpha2)
    n = len(word) // 2
    return word[:n] == word[::-1][:n].translate(tr)
print(is_sym('aloz'))
print(is_sym('boy'))
print(is_sym('bread'))
(The building of the translation table can be easily factored out.)
The for loop could be modified as:
    for word in text.lower().split():
        for n in range(0,len(word)//2):
            if alpha1.index(word[n]) != alpha2.index(word[len(word)-1-n]):
                break
        else:
            sym.append(word)
    return sym
According to your symmetric rule, we may verify a symmetric word with the following is_symmetric_word function:
def is_symmetric_word(word):
    alpha1 = 'abcdefghijklmnopqrstuvwxyz'
    alpha2 = 'zyxwvutsrqponmlkjihgfedcba'
    length = len(word)
    # integer division, so that range() also works on Python 3
    for i in range(length // 2):
        if alpha1.index(word[i]) != alpha2.index(word[length - 1 - i]):
            return False
    return True
And then the whole function to get all unique symmetric words out of a text can be defined as:
def is_symmetrics(text):
    func_char= ",.?!:'\/;"
    for letter in text:
        if letter in func_char:
            text = text.replace(letter, ' ')
    sym = []
    for word in text.lower().split():
        if is_symmetric_word(word) and not (word in sym):
            sym.append(word)
    return sym
The following are two test cases from you:
is_symmetrics("boy bread aloz bray") #['boy', 'aloz']
is_symmetrics("There is a car and a book;") #['a']
Code first. Discussion below the code.
import string
# get alphabet and reversed alphabet
try:
    # Python 2.x
    alpha1 = string.lowercase
except AttributeError:
    # Python 3.x and newer
    alpha1 = string.ascii_lowercase
alpha2 = alpha1[::-1] # use slicing to reverse alpha1
# make a dictionary where the key, value pairs are symmetric
# for example symd['a'] == 'z', symd['b'] == 'y', and so on
_symd = dict(zip(alpha1, alpha2))
def is_symmetric_word(word):
    if not word:
        return False # zero-length word is not symmetric
    i1 = 0
    i2 = len(word) - 1
    while True:
        if i1 >= i2:
            return True # we have checked the whole string
        # get a pair of chars
        c1 = word[i1]
        c2 = word[i2]
        if _symd[c1] != c2:
            return False # the pair wasn't symmetric
        i1 += 1
        i2 -= 1
# note, added a space to list of chars to filter to a space
_filter_to_space = ",.?!:'\/ "
def _filter_ch(ch):
    if ch in _filter_to_space:
        return ' ' # return a space
    elif ch in alpha1:
        return ch # it's an alphabet letter so return it
    else:
        # It's something we don't want. Return empty string.
        return ''
def clean(text):
    return ''.join(_filter_ch(ch) for ch in text.lower())
def symmetrics(text):
    # filter text: keep only chars in the alphabet or spaces
    for word in clean(text).split():
        if is_symmetric_word(word):
            # use of yield makes this a generator.
            yield word
lst = list(symmetrics("The boy...is a yob."))
print(lst) # prints: ['boy', 'a', 'yob']
No need to type the alphabet twice; we can reverse the first one.
We can make a dictionary that pairs each letter with its symmetric letter. This will make it very easy to test whether any given pair of letters is a symmetric pair. The function zip() makes pairs from two sequences; they need to be the same length, but since we are using a string and a reversed copy of the string, they will be the same length.
It's best to write a simple function that does one thing, so we write a function that does nothing but check if a string is symmetric. If you give it a zero-length string it returns False, otherwise it sets i1 to the first character in the string and i2 to the last. It compares characters as long as they continue to be symmetric, and increments i1 while decrementing i2. If the two meet or pass each other, we know we have seen the whole string and it must be symmetric, in which case we return True; if it ever finds any pair of characters that are not symmetric, it returns False. We have to do the check for whether i1 and i2 have met or passed at the top of the loop, so it won't try to check if a character is its own symmetric character. (A character can't be both 'a' and 'z' at the same time, so a character is never its own symmetric character!)
Now we write a wrapper that filters out the junk, splits the string into words, and tests each word. Not only does it convert the chosen punctuation characters to spaces, but it also strips out any unexpected characters (anything not an approved punctuation char, a space, or a letter). That way we know nothing unexpected will get through to the inner function. The wrapper is "lazy"... it is a generator that yields up one word at a time, instead of building the whole list and returning that. It's easy to use list() to force the generator's results into a list. If you want, you can easily modify this function to just build a list and return it.
If you have any questions about this, just ask.
EDIT: The original version of the code didn't do the right thing with the punctuation characters; this version does. Also, as @heltonbiker suggested, why type the alphabet when Python has a copy of it you can use? So I made that change too.
EDIT: @heltonbiker's change introduced a dependency on Python version! I left it in with a suitable try:/except block to handle the problem. It appears that Python 3.x has improved the name of the lowercase ASCII alphabet to string.ascii_lowercase instead of plain string.lowercase.
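The question asks for a sorted list of unique words, which none of the versions above returns directly. A compact Python 3 sketch that does is given below. The name sorted_symmetrics is an assumption for this illustration, and it treats every non-letter character as a word separator, which differs slightly from the filtering rules used above. The middle letter of an odd-length word is ignored, as the expected output for 'boy' implies.

```python
import re
import string

# Maps a->z, b->y, ..., z->a.
MIRROR = str.maketrans(string.ascii_lowercase, string.ascii_lowercase[::-1])


def sorted_symmetrics(text):
    """Sorted list of the unique symmetric words found in `text`."""
    found = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        n = len(word) // 2
        # Compare the first half with the mirrored, reversed second half.
        if word[:n] == word[::-1][:n].translate(MIRROR):
            found.add(word)
    return sorted(found)


print(sorted_symmetrics("boy bread aloz bray"))         # ['aloz', 'boy']
print(sorted_symmetrics("There is a car and a book;"))  # ['a']
```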

python regex letter must be followed by another letter

A string consists of letters and numbers, but if it contains a 'c', the character following the 'c' must be either 'h' or 'k'. Does anyone know how to write such a regex for Python?
I would suggest the following:
^(?!.*c(?![hk]))[^\W_]+$
Explanation:
^ # Start of string
(?! # Assert that it's not possible to match...
.* # Any string, followed by
c # the letter c
(?! # unless that is followed by
[hk] # h or k
) # (End of inner negative lookahead)
) # (End of outer negative lookahead).
[^\W_]+ # Match one or more letters or digits.
$ # End of string
[^\W_] means "Match any character that's matched by \w, excluding the _".
>>> import re
>>> strings = ["test", "check", "tick", "pic", "cow"]
>>> for item in strings:
...     print("{0} is {1}".format(item,
...         "valid" if re.match(r"^(?!.*c(?![hk]))[^\W_]+$", item)
...         else "invalid"))
...
test is valid
check is valid
tick is valid
pic is invalid
cow is invalid
The expression ^([^\Wc]*(c[hk])*)*$ also works. It says the whole string (from ^ to $) must consist of repetitions of blocks where each block has any number of non-c characters, [^\Wc]*, and any number of ch or ck pairs, (c[hk])* .
For example:
re.search(r'^([^\Wc]*(c[hk])*)*$', 'checkchek').group()
gives
'checkchek'
If you don't want to match the empty string, replace the last * with a +. Ordinarily, to avoid errors like the one mentioned in a comment when the input string doesn't match, assign the search result to a variable and test that it is not None:
In [88]: y = re.search(r'^([^\Wc]*(c[hk])*)*$', 'ca')
In [89]: if y:
   ....:     print(y.group())
   ....: else:
   ....:     print('No match')
   ....:
No match
The following code detects the presence of "c not followed by h or k" in myinputstring, and if so it prints "problem":
import re
# Python lists have no .length attribute; use len() instead
if len(re.findall(r'c(?!(h|k))', myinputstring)) > 0:
    print("problem")
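For comparison, the same 'c' rule can be checked without a regex. This is only a sketch: the function name c_rule_ok is an assumption for this illustration, and it checks the 'c' rule alone, not that the string consists only of letters and digits.

```python
def c_rule_ok(s):
    """True if every 'c' in `s` is immediately followed by 'h' or 'k'."""
    for i, ch in enumerate(s):
        # s[i + 1:i + 2] is '' when 'c' is the last character, which fails the test.
        if ch == "c" and s[i + 1:i + 2] not in ("h", "k"):
            return False
    return True


for item in ["test", "check", "tick", "pic", "cow"]:
    print(item, "is", "valid" if c_rule_ok(item) else "invalid")
```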
