Python regex: letter must be followed by another letter

A string consists of letters and numbers, but if it contains a 'c', the letter following the 'c' must be either 'h' or 'k'. Does anyone know how to write such a regex for Python?

I would suggest the following:
^(?!.*c(?![hk]))[^\W_]+$
Explanation:
^ # Start of string
(?! # Assert that it's not possible to match...
.* # Any string, followed by
c # the letter c
(?! # unless that is followed by
[hk] # h or k
) # (End of inner negative lookahead)
) # (End of outer negative lookahead).
[^\W_]+ # Match one or more letters or digits.
$ # End of string
[^\W_] means "Match any character that's matched by \w, excluding the _".
>>> import re
>>> strings = ["test", "check", "tick", "pic", "cow"]
>>> for item in strings:
... print("{0} is {1}".format(item,
... "valid" if re.match(r"^(?!.*c(?![hk]))[^\W_]+$", item)
... else "invalid"))
...
test is valid
check is valid
tick is valid
pic is invalid
cow is invalid
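The same check can be wrapped in a small helper. This is only a sketch: the names VALID and is_valid are made up for illustration, and the pattern is the one from the answer above.

```python
import re

# Pattern from the answer above: only letters/digits, and every 'c'
# must be followed by 'h' or 'k'.
VALID = re.compile(r"^(?!.*c(?![hk]))[^\W_]+$")

def is_valid(s):
    """Return True if s passes the 'c must be followed by h or k' rule."""
    return VALID.match(s) is not None

for item in ["test", "check", "tick", "pic", "cow"]:
    print(item, "valid" if is_valid(item) else "invalid")
```

Compiling the pattern once avoids re-parsing it on every call, which may matter if many strings are checked.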

The expression ^([^\Wc]*(c[hk])*)*$ also works. It says the whole string (from ^ to $) must consist of repetitions of blocks where each block has any number of non-c characters, [^\Wc]*, and any number of ch or ck pairs, (c[hk])* .
For example:
re.search(r'^([^\Wc]*(c[hk])*)*$', 'checkchek').group()
gives
'checkchek'
If you don't want to match the empty string, replace the last * with a +. To avoid an AttributeError when the input string doesn't match (as mentioned in a comment), assign the search result to a variable and check that it is not None before calling group():
In [88]: y = re.search(r'^([^\Wc]*(c[hk])*)*$', 'ca')
In [89]: if y:
   ....:     print(y.group())
   ....: else:
   ....:     print('No match')
   ....:
No match

The following code detects the presence of "c not followed by h or k" in myinputstring, and if so it prints "problem":
import re
# re.search returns None when there is no offending 'c'
if re.search(r'c(?![hk])', myinputstring):
    print("problem")

Related

Remove text between () and [] based on condition in Python?

I'm trying to remove the characters between the parentheses and brackets based on the length of characters inside the parentheses and brackets.
Using this:
def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0 # keep it
                else: # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)
I'm able to remove the characters between the parentheses and brackets, but I cannot figure out how to remove the characters based on the length of characters inside.
I want to remove the text together with its parentheses or brackets if the enclosed length is <= 4; if it is > 4, only the parentheses or brackets should be removed.
Sample Text:
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
Output:
print(remove_text_inside_brackets(text))
This is a sentence.
Desired Output:
This is a sentence. Once a day twice a day
You can use a simple regex with re.sub and a function as replacement to check the length of the match:
import re
out = re.sub(r'\(.*?\)|\[.*?\]',
             lambda m: '' if len(m.group()) <= (4+2) else m.group()[1:-1],
             text)
Output:
'This is a sentence.  Once a day twice a day '
This gives you the logic for more complex checks, in which case you might want to define a named function rather than a lambda.
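As a rough sketch of the named-function variant, the replacement logic could look like the following. The names BRACKETED, unwrap_or_drop and clean are hypothetical, and the final whitespace normalization is an extra step that the answer above does not include.

```python
import re

# Same pattern as above: a (...) group or a [...] group, non-greedy.
BRACKETED = re.compile(r'\(.*?\)|\[.*?\]')

def unwrap_or_drop(match, max_len=4):
    """Drop short bracketed text entirely; keep longer text without brackets."""
    inner = match.group()[1:-1]  # strip the surrounding bracket pair
    return '' if len(inner) <= max_len else inner

def clean(text):
    result = BRACKETED.sub(unwrap_or_drop, text)
    # Assumption: collapsing leftover runs of spaces is desired.
    return ' '.join(result.split())

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(clean(text))
```

A named function leaves room for further conditions (for example, different limits per bracket type) without cramming them into a lambda.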
How about splitting on [ and looking for ], then measuring the length? Each piece that contains ] is one character longer than its inner text, so the threshold of 4 becomes 5:
def remove_text_inside_brackets(string):
    my_str = string.replace('(','[').replace(')',']')
    out = []
    for s in my_str.split('['):
        if ']' in s and len(s) > 5:
            s1 = s.rstrip().rstrip(']') + ' '
        elif ']' in s and len(s) <= 5:
            s1 = ['']
        else:
            s1 = s
        out.extend(s1)
    return ''.join(out).strip()
remove_text_inside_brackets(text)
Output:
'This is a sentence. RMVE Once a day twice a day'
Someone will hopefully improve on this, but as an alternative, this nested regular expression can work:
re.sub(r'\[([^\]]{5,})\]', r'\g<1>',
    re.sub(r'\(([^)]{5,})\)', r'\g<1>',
        re.sub(r'\[[^\]]{,4}\]', '',
            re.sub(r'\([^)]{,4}\)', '', text))))
The output of this is slightly different from your expected output. Note the extra spaces after the period and at the end of the line:
'This is a sentence.  Once a day twice a day '
It completely removes the text and its surrounding brackets when the length is 4 or shorter, while it replaces the match with just the inner text when the length is 5 or longer.
Note that nested brackets, e.g., ((some text) more text) or [(four)] may fail.
I would just use string.find, rather than go character by character. Too much state to track. Note that this will explode if there is an unmatched open paren or open bracket. That's not hard to catch.
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
def remove_text_inside_brackets(text):
    i = 0
    while i >= 0:
        # Try for parens.
        i = text.find('(')
        j = text.find(')')
        if i < 0:
            # No parens, try for brackets.
            i = text.find('[')
            j = text.find(']')
        if i >= 0:
            if j-i > 5:
                text = text[:i] + text[i+1:j] + text[j+1:]
            else:
                text = text[:i] + text[j+1:]
    return text
print(remove_text_inside_brackets(text))
Regular expressions can solve this:
import re
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
text = re.sub(r'(\(|\[)[a-zA-Z]{1,4}(\)|\])', '', text)
print(re.sub(r'\[|\]|\(|\)', '', text))
output: "This is a sentence. Once a day twice a day"
Here the first regular expression matches 1 to 4 letters inside brackets, together with the brackets themselves. The character class can be extended to match digits and other special characters too.

Compare adjacent characters in string for differing case

I am working through a coding challenge in Python. The rule is to take a string and delete any two adjacent letters that are the same character but differ in case. The process is repeated until no such pairs remain side by side. Finally, the length of the string should be printed. I have made a solution below that iterates left to right, although I have been told there are better, more efficient ways.
list_of_elves=list(line)
n2=len(list_of_elves)
i=0
while i < len(list_of_elves):
if list_of_elves[i-1].lower()==list_of_elves[i].lower() and list_of_elves[i-1] != list_of_elves[i]:
del list_of_elves[i]
del list_of_elves[i-1]
if i<2:
i-=1
else:
i-=2
if len(list_of_elves)<2:
break
else:
i+=1
if len(list_of_elves)<2:
break
print(len(list_of_elves))
I have made some pseudo code as well
PROBLEM STATEMENT
Take a given string of alphabetical characters
Build a process to count the initial string length and store to variable
Build a process to iterate through the list and identify the following rule:
Two adjacent matching letters && Of differing case
Delete the pair
Repeat process
Count final length of string
For example, if we had a string with 'aAa' then 'aA' would be deleted, leaving 'a' behind.
In Python, if you want to do it with a regex, use
re.sub(r"([a-zA-Z])(?=(?!\1)(?i:\1))", "", s) # For ASCII only letters
re.sub(r"([^\W\d_])(?=(?!\1)(?i:\1))", "", s) # For any Unicode letters
Details
([^\W\d_]) - Capturing group 1: any Unicode letter (or any ASCII letter if ([a-zA-Z]) is used)
(?=(?!\1)(?i:\1)) - a positive lookahead that requires the same char as matched in the first capturing group (case insensitive) (see (?i:\1)) that is not the same char as matched in Group 1 (see (?!\1))
This is a very similar problem to matching parenthesis, but instead of a match being opposite pairs, the match is upper/lower case. You can use a similar technique of maintaining a stack. Then iterate through and compare the current letter with the top of the stack. If they match pop the element off the stack; if they don't append the letter to the stack. In the end, the length of the stack will be your answer:
line = "cABbaC"
stack = []
match = lambda m, n: m != n and m.upper() == n.upper()
for c in line:
    if len(stack) == 0 or not match(c, stack[-1]):
        stack.append(c)
    else:
        stack.pop()
stack
# stack is empty because `Bb` `Aa` and `Cc` get deleted.
Similarly line = "cGBbgaCF" would result in a stack of ['c', 'a', 'C', 'F'] because Bb, then Gg are deleted.
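Since the challenge only asks for the final length, the stack technique above could be packaged as a function. This is a minimal sketch; the name reduced_length is made up for illustration.

```python
def reduced_length(line):
    """Length of line after repeatedly deleting adjacent same-letter,
    different-case pairs (e.g. 'aA' or 'Bb')."""
    stack = []
    for c in line:
        # A pair reacts when the letters differ only in case.
        if stack and c != stack[-1] and c.upper() == stack[-1].upper():
            stack.pop()
        else:
            stack.append(c)
    return len(stack)

print(reduced_length("cABbaC"))
print(reduced_length("cGBbgaCF"))
```

Each character is pushed and popped at most once, so the whole string is handled in a single left-to-right pass, including pairs that only become adjacent after an earlier deletion.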
A method that should be very fast:
result = 1
pairs = zip(string, string[1:])
for a, b in pairs:
    if a.lower() == b.lower() and a != b:
        next(pairs)
    else:
        result += 1
print(result)
First we create a zip of the input with the input sliced by 1 position. This gives us an iterator that returns all the adjacent pairs in the string in order.
Then for every pair that doesn't match we increment the result; for every pair that does match we just advance the iterator by one so that we skip the matching pair.
The result is then the length of the reduced string. We don't actually need to store the reduced string, as we can calculate its length as we go along, and the length is the only thing that needs to be returned. Two caveats: this single pass does not catch pairs that only become adjacent after a deletion, and next(pairs) raises StopIteration when the matching pair is the last one in the string.
You really only need a single assertion in the regex to match the pair and delete it.
re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
Code sample :
>>> import re
>>> strs = ["aAa","aaa","aAaAA"]
>>> for target in strs:
... modtarg = re.sub(r"(?-i:([a-zA-Z])(?!\1)(?i:\1))", "", target)
... print( target, "\t--> (", len(modtarg), ") ", modtarg )
...
aAa --> ( 1 ) a
aaa --> ( 3 ) aaa
aAaAA --> ( 1 ) A
Info :
(?-i: # Disable Case insensitive if on
( [a-zA-Z] ) # (1), upper or lower case
(?! \1 ) # Not the same cased letter
(?i: \1 ) # Enable Case insensitive, must be the opposite cased letter
)
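A single re.sub pass does not catch pairs that only become adjacent after an earlier removal (for example "abBA" becomes "aA" after one pass). One way to handle that, sketched here under the assumption that repeating the substitution until nothing changes is acceptable, is a loop around re.subn. The names PAIR and react are hypothetical, and the pattern assumes Python 3.6+ for the scoped (?i:...) flag.

```python
import re

# A letter followed by the same letter in the opposite case.
PAIR = re.compile(r"([a-zA-Z])(?!\1)(?i:\1)")

def react(s):
    """Remove opposite-case pairs repeatedly until none are left."""
    while True:
        s, count = PAIR.subn("", s)
        if count == 0:  # nothing removed in this pass: done
            return s

for target in ["aAa", "aaa", "abBA", "cABbaC"]:
    print(target, "->", repr(react(target)))
```

Each pass rescans the whole string, so for long inputs with deeply nested pairs the stack approach is likely to be faster.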

Find all floats or ints in a given string

Given a string, "Hello4.2this.is random 24 text42", I want to return all ints or floats, [4.2, 24, 42]. All the other questions have solutions that return just 24. I want to return a float even if non-digit characters are next to the number. Since I am new to Python, I am trying to avoid regex or other complicated imports. I have no idea how to start. Please help. Here are some research attempts: Python: Extract numbers from a string, this didn't work since it doesn't recognize 4.2 and 42. There are other questions like the one mentioned, none of which sadly recognize 4.2 and 42.
A regex from perldoc perlretut:
import re
re_float = re.compile(r"""(?x)
^
[+-]?\ * # first, match an optional sign *and space*
( # then match integers or f.p. mantissas:
\d+ # start out with a ...
(
\.\d* # mantissa of the form a.b or a.
)? # ? takes care of integers of the form a
|\.\d+ # mantissa of the form .b
)
([eE][+-]?\d+)? # finally, optionally match an exponent
$""")
m = re_float.match("4.5")
print(m.group(0))
# -> 4.5
To get all numbers from a string:
str = "4.5 foo 123 abc .123"
print(re.findall(r"[+-]? *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?", str))
# -> ['4.5', ' 123', ' .123']
Using regular expressions is likely to give you the most concise code for this problem. It is hard to beat the conciseness of
re.findall(r"[+-]? *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?", str)
from pythad's answer.
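To get actual numbers rather than strings from that regex, the matches can be converted afterwards. The sketch below is an illustration only: the names NUMBER and find_numbers are made up, and the optional spaces after the sign are dropped from the pattern so that matches carry no leading whitespace.

```python
import re

# Same idea as the pattern above, without the " *" after the sign.
NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")

def find_numbers(text):
    """Return all ints and floats found in text, in order of appearance."""
    numbers = []
    for token in NUMBER.findall(text):
        # A decimal point or exponent means the token is a float.
        if any(ch in token for ch in ".eE"):
            numbers.append(float(token))
        else:
            numbers.append(int(token))
    return numbers

print(find_numbers("Hello4.2this.is random 24 text42"))
```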
However, you say "I am trying to avoid regex", so here's a solution that does not use regular expressions. It is obviously a bit longer than a solution using a regular expression (and probably much slower), but it is not complicated.
The code loops through the input character by character.
As it pulls each character from the string, it appends it to current (a string that holds the number currently being parsed) if appending it still maintains a valid number. When it encounters a character that cannot be appended to current, current is saved to a list of numbers, but only if current itself isn't one of '', '.', '-' or '-.'; these are strings that could potentially begin a number but are not themselves valid numbers.
When current is saved, a trailing 'e', 'e-' or 'e+' is removed. That will happen with a string such as '1.23eA'. While parsing that string, current will eventually become '1.23e', but then 'A' is encountered, which means the string does not contain a valid exponential part, so the 'e' is discarded.
After saving current, it is reset. Usually current is reset to '', but when the character that triggered current to be saved was '.' or '-', current is set to that character, because those characters could be the beginning of a new number.
Here's the function extract_numbers(s). The line before return numbers converts the list of strings to a list of integers and floating point values. If you want just the strings, remove that line.
def extract_numbers(s):
    """
    Extract numbers from a string.

    Examples
    --------
    >>> extract_numbers("Hello4.2this.is random 24 text42")
    [4.2, 24, 42]
    >>> extract_numbers("2.3+45-99")
    [2.3, 45, -99]
    >>> extract_numbers("Avogadro's number, 6.022e23, is greater than 1 million.")
    [6.022e+23, 1]
    """
    numbers = []
    current = ''
    for c in s.lower() + '!':
        if (c.isdigit() or
            (c == 'e' and ('e' not in current) and (current not in ['', '.', '-', '-.'])) or
            (c == '.' and ('e' not in current) and ('.' not in current)) or
            (c == '+' and current.endswith('e')) or
            (c == '-' and ((current == '') or current.endswith('e')))):
            current += c
        else:
            if current not in ['', '.', '-', '-.']:
                if current.endswith('e'):
                    current = current[:-1]
                elif current.endswith('e-') or current.endswith('e+'):
                    current = current[:-2]
                numbers.append(current)
            if c == '.' or c == '-':
                current = c
            else:
                current = ''
    # Convert from strings to actual python numbers.
    numbers = [float(t) if ('.' in t or 'e' in t) else int(t) for t in numbers]
    return numbers
If you want to get integers or floats from a string, follow pythad's approach.
If you want to get both integers and floats from a single string, do this:
import re
string = "These are floats: 10.5, 2.8, 0.5; and these are integers: 2, 1000, 1975, 308 !! :D"
numbers = []
for token in string.split():
    if "." in token:
        # token contains a decimal point: extract floats
        numbers += re.findall(r'\d+\.\d+', token)
    else:
        numbers += re.findall(r'\d+', token)

Replace values in iterable with values of another iterable to the same value

This is a convoluted example, but it shows what I'm attempting to do. Say I have a string:
from string import ascii_uppercase, ascii_lowercase, digits
s = "Testing123"
I would like to replace all values in s that appear in ascii_uppercase with "L" for capital letter, all values that appear in ascii_lowercase with "l" for lowercase letter, and those in digits with "n" for a number.
I'm currently doing:
def getpattern(data):
    pattern = ""
    for c in data:
        if c in ascii_uppercase: pattern += "L"; continue
        if c in ascii_lowercase: pattern += "l"; continue
        if c in digits: pattern += "n"; continue
        pattern += "?"
    return pattern
However, this is tedious with several more lists to replace. I'm usually better at finding map-type algorithms for things like this, but I'm stumped. I can't have it replace anything that was already replaced. For example, if I run the digits one and replace it with "n", the next iteration might replace that with "l" because "n" is a lowercase letter.
getpattern("Testing123") == "Lllllllnnn"
You can create a translation table that maps all upper case letters to 'L', all lower case letters to 'l' and all digits to 'n'. Once you have such a map, you can pass it to str.translate().
from string import ascii_uppercase, ascii_lowercase, digits, maketrans
s = "Testing123"
intab = ascii_uppercase + ascii_lowercase + digits
outtab = ('L' * 26) + ('l' * 26) + ('n' * 10)
trantab = maketrans(intab, outtab)
print s.translate(trantab)
Note that in Python 3 there is no string.maketrans function. Instead, use the static method str.maketrans().
I'm not exactly certain of the internals of str.translate(), but my educated guess is that the table is a 256-character string with one entry per possible character. As translate() passes over your string, it maps \x00 to \x00, \x01 to \x01, and so on, but A to L. That way it doesn't have to check whether each character is in a translation dictionary. I presume that blindly translating all characters with no branches results in better performance. Print trantab and compare it with ''.join(chr(i) for i in range(256)) to see this.
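For Python 3, the same idea might look like the sketch below, using str.maketrans. The names TABLE and getpattern are chosen for illustration. Unlike the loop in the question, characters outside the three groups are left unchanged rather than replaced with "?".

```python
from string import ascii_uppercase, ascii_lowercase, digits

# Map every upper-case letter to 'L', lower-case to 'l', digit to 'n'.
TABLE = str.maketrans(
    ascii_uppercase + ascii_lowercase + digits,
    "L" * 26 + "l" * 26 + "n" * 10,
)

def getpattern(data):
    # translate() looks at each original character once, so an 'n'
    # produced from a digit is never re-translated to 'l'.
    return data.translate(TABLE)

print(getpattern("Testing123"))
```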
They're in different 32-blocks of ASCII, so you can do this:
>>> ''.join(' nLl'[ord(c) // 32] for c in s)
'Lllllllnnn'
Your example suggests that you don't have other characters, but if you do, this should work:
>>> s = "Testing123 and .?#!-+ äöüß"
>>> ''.join(' nLl'[ord(c) // 32] if c <= 'z' and c.isalnum() else '?' for c in s)
'Lllllllnnn?lll????????????'
Just in case you need to process unicode data:
import unicodedata
cat = {'Lu':'L', 'Ll':'l', 'Nd':'n'}
def getpattern(data):
    return ''.join(cat.get(unicodedata.category(c),c) for c in data)

Python: Removing whitespace from multiple lines of a string

So I need the output of my program to look like:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The largest run of consecutive whitespace characters was 47.
But what I am getting is:
ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end
The longest run of consecutive whitespace characters was 47.
When looking further into the code I wrote, I found with the print(c) statement that this happens:
['ababa', '', 'ab ba ', '', ' xxxxxxxxxxxxxxxxxxx', 'that is it followed by a lot of spaces .', ' no dot at the end']
Between some of the lines there is an empty string, '', which is probably why my print statement won't work.
How would I remove them? I've tried using different list functions but I keep getting syntax errors.
This is the code I made:
a = '''ababa
ab ba
xxxxxxxxxxxxxxxxxxx
that is it followed by a lot of spaces .
no dot at the end'''
c = a.splitlines()
print(c)
#d = c.remove(" ") #this part doesnt work
#print(d)
for row in c:
print(' '.join(row.split()))
last_char = ""
current_seq_len = 0
max_seq_len = 0
for d in a:
if d == last_char:
current_seq_len += 1
if current_seq_len > max_seq_len:
max_seq_len = current_seq_len
else:
current_seq_len = 1
last_char = d
#this part just needs to count the whitespace
print("The longest run of consecutive whitespace characters was",str(max_seq_len)+".")
Regex time:
import re
print(re.sub(r"([\n ])\1*", r"\1", a))
#>>> ababa
#>>> ab ba
#>>> xxxxxxxxxxxxxxxxxxx
#>>> that is it followed by a lot of spaces .
#>>> no dot at the end
re.sub(matcher, replacement, target_string)
Matcher is r"([\n ])\1*", which means:
([\n ]) → match either "\n" or " " and put it in a group (#1)
\1* → match whatever group #1 matched, 0 or more times
And the replacement is just
\1 → group #1
You can get the longest whitespace sequence with
max(len(match.group()) for match in re.finditer(r"([\n ])\1*", a))
This uses the same matcher, but collects the length of each match and then takes the maximum.
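Both steps could be bundled into two small helpers, sketched below. The names RUN, collapse and longest_run are made up for this example, and returning 0 when the text contains no spaces or newlines is an assumption about the desired behavior.

```python
import re

# A space or newline followed by any number of repeats of that same character.
RUN = re.compile(r"([\n ])\1*")

def collapse(text):
    """Replace each run of identical spaces/newlines with a single one."""
    return RUN.sub(r"\1", text)

def longest_run(text):
    """Length of the longest run of identical spaces or newlines."""
    return max((len(m.group()) for m in RUN.finditer(text)), default=0)

sample = "ab   ba\n\n\nxxx"
print(repr(collapse(sample)))
print(longest_run(sample))
```

Because the backreference repeats only the character that was captured, a run of spaces followed directly by newlines is counted as two separate runs.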
From what I can tell, your easiest solution would be using list comprehension:
c= [item for item in a.splitlines() if item != '']
If you wish to make it slightly more robust by also removing strings that only contain whitespace such as ' ', then you can alter it as follows:
c= [item for item in a.splitlines() if item.strip() != '']
You can then also join the list back together as follows:
output = '\n'.join(c)
This can be easily solved with the built-in filter function:
c = filter(None, a.splitlines())
# or, more explicit
c = filter(lambda x: x != "", a.splitlines())
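The first variant keeps all elements from the list returned by a.splitlines() that do not evaluate to False, which drops the empty strings. In Python 2 filter returns a list; in Python 3 it returns an iterator, so wrap it in list(...) if you need a list.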
The second variant creates a small anonymous function (using lambda) that checks if a given element is the empty string and returns False if that is the case. This is more explicit than the first variant.
Another option would be to use a list comprehension that achieves the same thing:
c = [string for string in a.splitlines() if string]
# or, more explicit
c = [string for string in a.splitlines() if string != ""]
