RegEx pattern not behaving as wanted - python

I am using this regex pattern:
[^A-Za-z](email,|help|BGN|won't|go|corner|issues|disconected|We|group|No|send|Bv|connecting|has|Pittsburgh,|Many|(Akustica,|Toluca|cannot|Restarting|they|not|PI2|one|condition|entire|LAN|experincing|bar|Exchange,|server|Are|PA)|OutLook|right|says|Rose|Montalvo|back|computer|are|Jane|thier|Disconnected|Nrd|and/or|network|for|Appears|e-mail|unable|Connected|then|Broadview,|issue|email|shows|available|be|we|exchange|error|address|based|My|Microsoft|received|working|created|receive|impacted|WIFI|through|connection|including|or|IL|outlook|via|facility|Everyone's|servers|Also|message|"The|your|Status|doesn't|service|SI-MBX82.de.bosch.com,|next|appears|"disconnected"|Encryption|eMail/file|today|"Waiting|"send/receive"|but|it|trying|SAP|disconnected|e-mails|this|getting|can|of|connect|Incorrect|manually|is|site|an|folder"|cant|Other|have|in|Receiving|if|Plant|no|SI-MBX80.de.bosch.com|that|when|online|persists."|Customer|administrator|users|update|applications|"Disconnected"|SI-MBX81.de.bosch.com|The|on|lower|Some|It|contact|In|the|having)[^A-Za-z]
When I apply it, it fails to find "Jane" in the sentence
"Issue with eMail/file Encryption Incorrect email address created for Jane Rose Montalvo."
even though Jane is present in the pattern that I am using.
What could be the reason?

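The problem is that your regex consumes the non-letter character before and after each word, because [^A-Za-z] is part of the match rather than just a condition.
Hello Jane
Once Hello has been matched together with the space that follows it, only Jane is left, and it cannot be matched because the character before it has already been consumed. You should make the boundary an assertion rather than something that is matched.
Use the lookbehind (?<=[^a-zA-Z]) in place of the leading [^a-zA-Z] (the trailing one can likewise become the lookahead (?=[^a-zA-Z])). See demo.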
http://regex101.com/r/lU7jH1/9
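A minimal sketch of the difference, assuming a shortened word list (the four words and the variable names are chosen for illustration and are not the full pattern from the question):

```python
import re

sentence = "Incorrect email address created for Jane Rose Montalvo."
# Hypothetical, shortened word list standing in for the long alternation.
words = ["for", "Jane", "Rose", "Montalvo"]
alternation = "|".join(re.escape(w) for w in words)

# Original style: the separators are consumed as part of each match.
consuming = re.compile(r"[^A-Za-z](" + alternation + r")[^A-Za-z]")
# Lookaround style: the separators are only checked, never consumed.
asserting = re.compile(r"(?<=[^A-Za-z])(" + alternation + r")(?=[^A-Za-z])")

consumed_hits = consuming.findall(sentence)
asserted_hits = asserting.findall(sentence)
print(consumed_hits)  # ['for', 'Rose']
print(asserted_hits)  # ['for', 'Jane', 'Rose', 'Montalvo']
```

With the consuming version, the space after "for" is used up, so "Jane" has no separator left in front of it; the same happens to "Montalvo" after "Rose".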

This happens because the matches overlap on the separator characters. Put the whole pattern, capturing group included, inside a lookahead in order to capture the overlapping matches:
(?=[^A-Za-z](email,|help|BGN|won't|go|corner|issues|disconected|We|group|No|send|Bv|connecting|has|Pittsburgh,|Many|(Akustica,|Toluca|cannot|Restarting|they|not|PI2|one|condition|entire|LAN|experincing|bar|Exchange,|server|Are|PA)|OutLook|right|says|Rose|Montalvo|back|computer|are|Jane|thier|Disconnected|Nrd|and/or|network|for|Appears|e-mail|unable|Connected|then|Broadview,|issue|email|shows|available|be|we|exchange|error|address|based|My|Microsoft|received|working|created|receive|impacted|WIFI|through|connection|including|or|IL|outlook|via|facility|Everyone's|servers|Also|message|"The|your|Status|doesn't|service|SI-MBX82\.de\.bosch\.com,|next|appears|"disconnected"|Encryption|eMail/file|today|"Waiting|"send/receive"|but|it|trying|SAP|disconnected|e-mails|this|getting|can|of|connect|Incorrect|manually|is|site|an|folder"|cant|Other|have|in|Receiving|if|Plant|no|SI-MBX80\.de\.bosch\.com|that|when|online|persists\."|Customer|administrator|users|update|applications|"Disconnected"|SI-MBX81\.de\.bosch.com|The|on|lower|Some|It|contact|In|the|having)[^A-Za-z])
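A small sketch of the same idea, assuming a shortened alternation (the word list is illustrative only). Because the lookahead itself matches nothing, every candidate position is tried and the capturing group still reports the word:

```python
import re

sentence = "created for Jane Rose Montalvo."
# The entire pattern sits inside (?=...), so no separator is ever consumed.
overlapping = re.compile(r"(?=[^A-Za-z](for|Jane|Rose|Montalvo)[^A-Za-z])")

hits = overlapping.findall(sentence)
print(hits)  # ['for', 'Jane', 'Rose', 'Montalvo']
```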

If for some reason you cannot or do not want to modify your pattern and you have overlapping matches that you want to capture, you can use re.search in a loop - moving the starting point for the search to the character just after the beginning of the previous match.
import re
# s is the sentence and p is the pattern string from the question
# recursive
def foo(s, p, start=0):
    m = p.search(s, start)
    if not m:
        return ''
    return m.group() + foo(s, p, m.start() + 1)
# iterative
def foo1(s, p):
    result = ''
    m = p.search(s, 0)
    while m:
        result += m.group()
        m = p.search(s, m.start() + 1)
    return result
print(foo(s, re.compile(p)))
print(foo1(s, re.compile(p)))
>>>
eMail/file Encryption Incorrect email address created for Jane Rose Montalvo.
eMail/file Encryption Incorrect email address created for Jane Rose Montalvo.
>>>

Related

Regular expression to replace second occurrence of dot

String is Hello.world.hello. I wanted to replace the second occurrence of the dot with '_'.
str = "Hello. world. Hello!"
x = re.sub(r'^((.){1}).', r'\1_', str)
#x = str.find(str.find('.')
print(x)
The output I am getting is 'H_llo. world. Hello!'. What would be the correct solution?
You can use
import re
text = "Hello. world. Hello!"
print( re.sub(r'^([^.]*\.[^.]*)\.', r'\1_', text) )
# => Hello. world_ Hello!
See the Python demo and the regex demo.
Details:
^ - start of string
([^.]*\.[^.]*) - Group 1: any zero or more chars other than a ., a dot and again any 0+ non-dots
\. - a dot.
The replacement is Group 1 value + _.
It is also possible to do without a regex:
text = "Hello. world. Hello!"
chunks = text.split('.', 2)  # split on the first two dots only
if len(chunks) > 2:  # there are at least two dots
    # replace the second dot
    print( fr'{".".join(chunks[0:2])}_{chunks[2]}' )
else:
    print(text)  # fewer than two dots: print the original
# => Hello. world_ Hello!
See the Python demo.
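If the occurrence number needs to vary, the same idea can be generalized. The helper below is a hypothetical sketch (replace_nth and its parameters are not part of the answers here); it locates every occurrence with re.finditer and rebuilds the string around the chosen one:

```python
import re

def replace_nth(text, target, replacement, n):
    """Replace only the n-th (1-based) occurrence of the literal `target`."""
    occurrences = list(re.finditer(re.escape(target), text))
    if len(occurrences) < n:
        return text  # not enough occurrences: leave the text unchanged
    hit = occurrences[n - 1]
    return text[:hit.start()] + replacement + text[hit.end():]

print(replace_nth("Hello. world. Hello!", ".", "_", 2))  # Hello. world_ Hello!
```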
With your shown samples, you could try the following. Written and tested in Python 3.8:
import re
str1 = "Hello. world. Hello!"
re.sub(r'^(.*?\.)([^.]*)\.(.*)$', r'\1\2_\3', str1)
'Hello. world_ Hello!'
Explanation: import the re module and store the value in str1. Then call re.sub to replace the second dot with _. The first argument is a regex that matches the whole string in 3 capturing groups, leaving only the second dot outside the groups; the replacement puts the groups back with _ in place of that dot.
Explanation of regex:
^(.*?\.) ## 1st capturing group: match from the start of the value up to and including the 1st dot.
([^.]*) ## 2nd capturing group: match everything before the next (2nd) dot.
\. ## Match the literal 2nd dot.
(.*)$ ## 3rd capturing group: keep everything else up to the end of the value.

Cant understand why index for joining re.findall starts at 1 instead of 0

Code from the book "Automate the boring stuff with python"
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)
# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)
# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
It extracts emails and phone numbers from your clipboard. My problem is with the lines
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
If I have understood it correctly, the findall method returns a list of tuples, where each tuple holds the groups of the regex, something like
[(area code, separator, first 3 digits, separator, last 4 digits, extension), (area code, separator, first 3 digits, separator, last 4 digits, extension)]
But since lists and tuples are indexed from 0 and I want to join the first, third and fifth items of each tuple, why isn't that line
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[0], groups[2], groups[4]])
In Python (and most other regex engines, to be honest), the re match objects always have at least one group after a successful match. The first of them (index 0) is always the full match.
To illustrate what I mean by "full match", here's a simple regex: r'hello\s+world'. This will match strings like hello world, the same two words separated by several spaces, and even foo hello world bar. Check out the demo.
There will be 3 matches on that demo, and for each of them you'll notice that the right-hand side says "full match" and lists the matched text, which in every case is just the hello world part, with however many spaces the string had.
That is the full match. It's just the full regex match without any capturing whatsoever.
Now here's another regex that matches and captures: r'hello\s+(world)'. Check the demo for this one.
Notice that each match now has 2 fields: the full match, which is the same as last time, and Group 1, which is our captured group, world.
In conclusion, after a successful match, a match object has the full match at index 0, followed by the captured groups. Note that findall itself returns only the captured groups; in this script the entire phone pattern is wrapped in an outer pair of parentheses, so the first item of each tuple (groups[0]) is the whole number and the area code is groups[1].
Read the docs for more info.
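The effect of the outer parentheses can be checked with a small sketch (the simplified phone pattern and the sample text are made up for illustration and are not the book's regex):

```python
import re

text = "call 555-1234 now"

# Whole pattern wrapped in an extra group, as in the book's phoneRegex.
wrapped = re.compile(r"((\d{3})-(\d{4}))")
# Same pattern without the outer group.
plain = re.compile(r"(\d{3})-(\d{4})")

wrapped_tuples = wrapped.findall(text)
plain_tuples = plain.findall(text)
print(wrapped_tuples)  # [('555-1234', '555', '1234')]
print(plain_tuples)    # [('555', '1234')]

# A match object, unlike findall, always exposes the full match as group 0.
m = plain.search(text)
print(m.group(0), m.groups())  # 555-1234 ('555', '1234')
```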

Python add space

We have repeated words like Mr and Mrs in a text. We would like to add a space before and after the keywords Mr and Mrs. However, Mr is also matched inside Mrs. Please help us solve this:
Input:
Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.
import re
s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")
def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile(r'\S{}'.format(word))
        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))
        # pattern to match any non-space char after the word
        patt2 = re.compile(r'{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))
    return string
print(add_spaces(s, words))
Present Output:
Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.
Expected Output:
Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
You didn't specify anything after the letter 'r', so your pattern matches 'Mr' followed by any non-space character, including the 's' of Mrs. That's why a space is added in the middle of Mrs.
A better pattern would be r'\bMr\b'.
'\b' matches word boundaries; see the docs for further explanation: https://docs.python.org/3/library/re.html
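A minimal sketch of the word-boundary approach, assuming it is acceptable to collapse any existing whitespace around the keyword (the function name add_spaces_at_boundaries is hypothetical). Note that \b does not separate Mrs from a name glued directly onto it, such as MrsPamela, so that case is left untouched:

```python
import re

def add_spaces_at_boundaries(text, words):
    # Longest first is not strictly required with \b, but keeps the alternation unambiguous.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # Swallow surrounding whitespace so exactly one space ends up on each side.
    pattern = re.compile(r"\s*\b(" + alternation + r")\b\s*")
    return pattern.sub(r" \1 ", text).strip()

result = add_spaces_at_boundaries("Mrs.Pamela.Mr.Sam", ("Mr", "Mrs"))
print(result)  # Mrs .Pamela. Mr .Sam
```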
I do not have very extensive knowledge of the re module, but I came up with a solution that extends to any number of words and strings and works well (tested in Python 3), although it is fairly long and you may find something more optimized and much more concise.
On the other hand, the procedure is not very difficult to understand:
To begin with, the program sorts the words list by descending length.
Then it finds the matches of the longer words first and records the sections where matches were already made, so that they are not changed again. (Note that this introduces a limitation, but it is necessary because the program cannot know whether you want to allow a word in the variable words to be contained in another one. In any case, it does not affect your case.)
Once it has recorded all matches of a word (in a non-blocked part of the string), it adds the corresponding spaces and corrects the blocked indexes (they have moved because of the inserted spaces).
Finally, it trims runs of multiple spaces.
Note: I used a list for the variable words instead of a tuple.
import re
def add_spaces(string, words):
    # Get the length of the longest word
    max_length = 0
    for word in words:
        if len(word)>max_length:
            max_length = len(word)
    print("max_length = ", max_length)
    # Order words in descending length
    ordered_words = []
    i = max_length
    while i>0:
        for word in words:
            if len(word)==i:
                ordered_words.append(word)
        i -= 1
    print("ordered_words = ", ordered_words)
    # Iterate over words adding spaces with each match and "blocking" the match section so not to modify it again
    blocked_sections=[]
    for word in ordered_words:
        matches = [match.start() for match in re.finditer(word, string)]
        print("matches of ", word, " are: ", matches)
        spaces_position_to_add = []
        for match in matches:
            blocked = False
            for blocked_section in blocked_sections:
                if match>=blocked_section[0] and match<=blocked_section[1]:
                    blocked = True
            if not blocked:
                # Block section and store position to modify after
                blocked_sections.append([match,match+len(word)])
                spaces_position_to_add.append([match,match+len(word)+1])
        # Add the spaces and update the existing blocked_sections
        spaces_added = 0
        for new_space in spaces_position_to_add:
            # Add space before and after the word
            string = string[:new_space[0]+spaces_added]+" "+string[new_space[0]+spaces_added:]
            spaces_added += 1
            string = string[:new_space[1]+spaces_added]+" "+string[new_space[1]+spaces_added:]
            spaces_added += 1
            # Update existing blocked_sections
            for blocked_section in blocked_sections:
                if new_space[0]<blocked_section[0]:
                    blocked_section[0] += 2
                    blocked_section[1] += 2
    # Trim extra spaces
    string = re.sub(' +', ' ', string)
    return string
### MAIN ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]
    print(s)
    print(add_spaces(s,words))

Python Regex - get words around match

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Index range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90
Index range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
Which yields no matches.
In the second pattern you have to escape the w+, i.e. write \w+.
Apart from that, there is a newline in your example, which you can match by adding another \s after the match.
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
Regex demo
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))
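Put together as a runnable sketch (the variable names are illustrative; the string is the one from the question), the three-group pattern yields the word before, the hit, and the word after:

```python
import re

my_string = "this. is 12my90\nExample string"
# Group 1: word before, group 2: the hit, group 3: word after.
pattern = re.compile(r"(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)", re.IGNORECASE)

m = pattern.search(my_string)
if m:
    before, hit, after = m.group(1), m.group(2), m.group(3)
    print("Hit: " + hit)                                  # Hit: 12my90
    print("Words around match: " + before + " " + after)  # Words around match: is Example
```

The second \s is what lets the pattern step over the newline between 12my90 and Example.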
The newline character is missing from your pattern:
regx = r"(\w+)\s12(\w+)90\n(\w+)"

regEx: To match two groups of chars

I want a regEx to match some text that contains both alpha and numeric chars. But I do NOT want it to match only alpha or numbers.
E.g. in python:
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
#               ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print('sub=', rr.sub('########', s))
print('findall=', rr.findall(s))
generates following output:
sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']
I want it to be:
sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']
Any ideas... ??
In this case it's exactly 8 chars always, it would be even more wonderful to have a regEx that doesn't have {8} in it, i.e. it can match even if there are more or less than 8 chars.
-- edit --
Question is more to understand if there is a way to write a regEx such that I can combine 2 patterns (in this case [0-9] and [a-z]) and ensure the matched string matches both patterns, but number of chars matched from each set is variable. E.g. s could also be
s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'
-- answer --
Thanks to all for the answers. All of them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer, although jerry explains it the best. :)
If anyone is a stickler for performance, there is nothing to choose between them; they all perform the same.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
#               ^^^^^^^^ <- I want something that'll only match this part.
def testIt(regEx):
    from timeit import timeit
    s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
    # check the regex under test rather than a hard-coded one
    assert (re.sub(regEx, '########', s) ==
            '[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
    # NOTE: the setup code embeds regEx in a non-raw '%s' literal, so \b reaches
    # re.compile as a backspace there; r'%s' would keep it a word boundary.
    print('sub() with \'', regEx, '\': ', timeit('rr.sub(\'########\', s)', number=500000, setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    ))
    print('findall() with \'', regEx, '\': ', timeit('rr.findall(s)', setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    ))
testIt('\\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\\b')
testIt('\\b[a-z\\d]*(?:\\d[a-z]|[a-z]\\d)[a-z\\d]*\\b')
testIt('\\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\\b')
testIt('\\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\\b')
produced:
sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.347522144274
Try following regex:
\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b
This will match a word containing a digit followed by a letter, or vice versa.
Hence it covers the complete set of words that contain at least one digit and one letter.
Note: Although it is not the case with python, I have observed that not all varieties of tools support lookahead and lookbehind. So I prefer to avoid them if possible.
You need to use the look ahead (?=...).
This one matches all words with at least one out of [123] and [abc].
>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', ' 123abc 123 abc')
['123abc']
This way you can AND several constraints on the same string.
>>> help(re)
(?=...) Matches if ... matches next, but doesn't consume the string.
Another way is to spell it out: a word with one of [abc] and one of [123] must contain at least one [123][abc] or [abc][123] pair somewhere, resulting in
>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', ' 123abc 123 abc')
['123abc']
Not the most beautiful regular expression, but it works:
\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b
If the format is the same each time, that is:
[########: ########]: STARTED at ########ng job number ########
You can use:
([^\]\s]+)\]
With re.findall, or re.search and getting .group(1) if you use re.search.
[^\]\s]+ is a negated class and will match any character except space (and family) or closing square bracket.
The regex basically looks for characters (except ] or spaces) up until a closing square bracket.
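As a quick check of the bracket-based pattern on the sample line from the question (a sketch only; it relies on the task id being the token directly before the closing bracket):

```python
import re

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
bracket_pattern = r'([^\]\s]+)\]'

found = re.findall(bracket_pattern, s)
first = re.search(bracket_pattern, s).group(1)
print(found)  # ['3fee46d2']
print(first)  # 3fee46d2
```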
If you want to match any string containing both alpha and numeric characters, you will need a lookahead:
\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b
Used like so:
result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)
re.I is for ignorecase.
\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or start/end of string).
(?=[0-9]*[a-z]) is a positive lookahead and makes sure there's at least 1 alpha in the part to be matched.
(?=[a-z]*[0-9]) is a similar lookahead but checks for digits.
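Applied to the sample line from the question, the two lookaheads can be sketched as follows (the variable names are illustrative; re.I is left out here, so only lowercase letters count):

```python
import re

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# Both lookaheads must succeed at the start of a word:
# one finds a letter after optional digits, the other a digit after optional letters.
mixed = re.compile(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b')

found = mixed.findall(s)
masked = mixed.sub('########', s)
print(found)   # ['3fee46d2']
print(masked)  # [mytaskid: ########]: STARTED at processing job number 10022001
```

Pure-letter tokens such as mytaskid fail the second lookahead, and pure-digit tokens such as 10022001 fail the first, so only the mixed token survives.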
You can use a more specific regular expression and skip the findall.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
mo = re.search(r':\s+(\w+)', s)
print(mo.group(1))
