What is the best method of processing optional group in Python Regex? - python

I'm trying to write a function that enforces capitalization on certain words, and adds "'s" to certain words if they are followed by " s". For example, it should take grace s and transform that to Grace's.
r"(\b)(grace)( (s|S))?\b": possessive_name,
{...}
def possessive_name(match: Match) -> str:
    result = match.group(2).title()
    result = result.replace(" ", "'")
    return result # type: ignore
I'm correctly "titlizing" it but can't figure out how to reference the optional ( (s|S)) group so that the ( 's) can be added if it's needed, and I'd like to avoid adding an additional regex... Is this possible?
*edited names for clarity

Yes, like this.
import re
test_str = "This is grace s apple."
def fix_names(match):
    name, s = match.groups()
    name = name.title()
    if s:
        name = f"{name}'s"
    return name
p = re.compile(r"\b(grace)(\s[sS])?\b")
print(p.sub(fix_names, test_str))
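The answer above relies on the fact that an optional group which does not take part in the match is reported as None, so a simple test on it is enough. A minimal sketch, assuming the same pattern; the helper name fix_name and the two sample sentences are only illustrative:

```python
import re

# Same pattern as above: group 2 (" s" or " S") is optional.
pattern = re.compile(r"\b(grace)(\s[sS])?\b")

def fix_name(match):
    name = match.group(1).title()
    # group(2) is None when the optional part did not take part in the match.
    if match.group(2) is None:
        return name
    return name + "'s"

with_s = pattern.sub(fix_name, "This is grace s apple.")
without_s = pattern.sub(fix_name, "Her name is grace.")
print(with_s)
print(without_s)
```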

import re
lines = (
    'a grace s apple',
    'the apple is grace s',
    'take alice s and steve s',
)
for line in lines:
    result = re.sub(r'(\w+)\s+s($|\s)', lambda m: m.group(1).title()+"'s"+m.group(2), line, flags=re.I|re.S)
    print(result)
you'll get:
a Grace's apple
the apple is Grace's
take Alice's and Steve's

You could capture 1+ word characters in group 1 followed by matching a space and either s or S using a character class.
In the replacement use the .title() on group 1 and add 's
(?<!\S)(\w+) [sS](?!\S)
Explanation
(?<!\S) Left whitespace boundary
(\w+) Capture group 1, match 1+ word chars
 [sS] Match a space followed by either s or S
(?!\S) Right whitespace boundary
Code example
import re
test_str = "grace s"
regex = r"(?<!\S)(\w+) [sS](?!\S)"
result = re.sub(regex, lambda match: match.group(1).title()+"'s", test_str)
print(result)
Output
Grace's
If you want to match grace specifically, you could use an optional group. If you want to match more words, you could use an alternation such as (?:grace|anotherword).
(?<!\S)(grace)(?: ([sS]))?\b
Example code
import re
strings = [
    "grace s",
    "Her name is grace."
]
pattern = r"(?<!\S)(grace)(?: ([sS]))?\b"
regex = re.compile(pattern)
for s in strings:
    print(
        regex.sub(
            lambda m: "{}{}".format(m.group(1).title(), "'s" if m.group(2) else ''),
            s)
    )
Output
Grace's
Her name is Grace.
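The alternation idea mentioned above can be sketched by building the pattern from a list of names. The list here is hypothetical, and re.escape is used only as a precaution in case a name contains regex metacharacters:

```python
import re

# Hypothetical list of names to handle; not part of the original question.
names = ["grace", "alice", "steve"]
alternation = "|".join(re.escape(name) for name in names)
pattern = re.compile(rf"(?<!\S)({alternation})(?: ([sS]))?\b")

def fix_name(m):
    # group(2) is None when the optional " s" part is absent.
    suffix = "'s" if m.group(2) else ""
    return m.group(1).title() + suffix

result = pattern.sub(fix_name, "take alice s and steve s, says grace.")
print(result)
```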

Related

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove these two substrings that start with the character '#' by using re.sub function but it didn't work.
How do I remove the substring that starts with this character?
Try this regex :
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?
This should work.
# matches the character #
\w+ matches any word character as many times as possible, so it stops at blank character
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
output:
XXX What a wonderful day, isn't it XXX ?
It's possible to do it with re.sub(), it would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches alphanumerical (uppercase and lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches whitespaces
Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)
I think this would work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a single whitespace character.
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)
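The patterns above that end in a space or \s need whitespace after the tag, so a tag at the very end of the string would be left in place. A minimal sketch of the difference, assuming a sample sentence with an extra trailing tag (#end is made up for illustration):

```python
import re

# The original sentence with a hypothetical tag appended at the very end.
text = "#Natalija What a wonderful day, isn't it #Kristina123 ? #end"

# Requires whitespace after the tag, so the final "#end" survives.
strict = re.sub(r"#\w+\s", "", text)

# \s* makes the trailing whitespace optional, so every tag is removed.
relaxed = re.sub(r"#\w+\s*", "", text)

print(strict)
print(relaxed.strip())
```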

Replace symbol before match using regex in Python

I have strings such as:
text1 = ('SOME STRING,99,1234 FIRST STREET,9998887777,ABC')
text2 = ('SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF')
text3 = ('ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI')
Desired output:
SOME STRING 99,1234 FIRST STREET,9998887777,ABC
SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF
ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI
My idea: Use regex to find occurrences of 1-5 digits, possibly preceded by a symbol, that are between two commas and not followed by a space and letters, then replace by this match without the preceding comma.
Something like:
text.replace(r'(,\d{0,5},)','.........')
If you use the regex module instead of re, then possibly:
import regex
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(regex.sub(r'(?<!^.*,.*),(?=#?\d+,\d+)', ' ', text))
You might be able to use re if you are sure that no other substring matches the pattern in the lookahead.
import re
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(re.sub(r',(?=#?\d+,\d+)', ' ', text))
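To check the lookahead-only pattern against all three strings from the question, a short sketch like the following may help. The list name samples is an assumption; the strings and the pattern are taken from the text above:

```python
import re

samples = [
    "SOME STRING,99,1234 FIRST STREET,9998887777,ABC",
    "SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF",
    "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI",
]

# Replace a comma only when it is followed by an optional '#', digits,
# another comma, and at least one more digit.
pattern = re.compile(r",(?=#?\d+,\d+)")
results = [pattern.sub(" ", s) for s in samples]
for line in results:
    print(line)
```

The second string is left unchanged because none of its commas is followed by a short number and another comma-separated number.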
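An easier-to-read alternative, if SOME STRING, SOME OTHER STRING, and ANOTHER STRING never contain commas:
text1.replace(",", " ", 1)
which just replaces the first comma with a space. Note that this would also replace the first comma in text2, which the desired output leaves unchanged.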
Simple, yet effective:
import re
my_pattern = r"(,)(\W?\d{0,5},)"
p = re.compile(my_pattern)
p.sub(r" \2", text1) # 'SOME STRING 99,1234 FIRST STREET,9998887777,ABC'
p.sub(r" \2", text2) # 'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF'
p.sub(r" \2", text3) # 'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI'
Secondary pattern with non-capturing group and verbose compilation:
my_pattern = r"""
(?:,) # Non-capturing group for single comma.
(\W?\d{0,5},) # Capture zero or one non-word character, zero to five digits, and a comma
"""
# re.X (verbose mode) ignores whitespace in the pattern and allows inline comments
p = re.compile(my_pattern, flags = re.X)
# This time we will use \1 to reference the first captured group
p.sub(r" \1", text1)
p.sub(r" \1", text2)
p.sub(r" \1", text3)

Split with regex but with first character of delimiter

I have a regex like this: "[a-z|A-Z|0-9]: " that matches one alphanumeric character, a colon, and a space. I wonder how to split the string while keeping the alphanumeric character in the first element of the result. I cannot change the regex because in some cases the string has a special character before the colon and space.
Example:
line = re.split("[a-z|A-Z|0-9]: ", "A: ") # Result: ['A', '']
line = re.split("[a-z|A-Z|0-9]: ", ":: )5: ") # Result: [':: )5', '']
line = re.split("[a-z|A-Z|0-9]: ", "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
Update:
Actually, my problem is splitting from a review file. Suppose I have a file that every line has this pattern: [title]: [review]. I want to get the title and review, but some of the titles have a special character before a colon and space, and I don't want to match them. However, it seems that the character before a colon and space that I want to match apparently is an alphanumeric one.
You could split using a negative lookbehind with a single colon or use a character class [:)] where you can specify which characters should not occur directly to the left.
(?<!:):[ ]
In parts
(?<!:) Negative lookbehind, assert what is on the left is not a colon
:[ ] Match a colon followed by a space (Added square brackets only for clarity)
For example
import re
pattern = r"(?<!:): "
line = re.split(pattern, "A: ") # Result: ['A', '']
print(line)
line = re.split(pattern, ":: )5: ") # Result: [':: )5', '']
print(line)
line = re.split(pattern, "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
print(line)
Output
['A', '']
[':: )5', '']
['Delicious :)', 'I want to eat this again']
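For the [title]: [review] lines described in the update, the review text itself may contain ": ". A minimal sketch, assuming the lookbehind pattern above and a hypothetical helper split_review, uses maxsplit=1 so that only the first separator is used:

```python
import re

pattern = re.compile(r"(?<!:): ")

def split_review(line):
    # maxsplit=1 keeps any later ": " inside the review text.
    parts = pattern.split(line, maxsplit=1)
    title = parts[0]
    review = parts[1] if len(parts) > 1 else ""
    return title, review

print(split_review("A: "))
print(split_review("Delicious :): I want to eat this: again"))
```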
Solution
First of all, as you show in your examples, you need to match characters other than a-zA-Z0-9, so we should just use the . matcher, which matches any character except a newline.
So I think the expression you're looking for might be this one:
(.*?):(?!.*:) (.*)
You can use it like so:
import re
pattern = r"(.*?):(?!.*:) (.*)"
matcher = re.compile(pattern)
txt1 = "A: "
txt2 = ":: )5: "
txt3 = "Delicious :): I want to eat this again"
result1 = matcher.search(txt1).groups() # ('A', '')
result2 = matcher.search(txt2).groups() # (':: )5', '')
result3 = matcher.search(txt3).groups() # ('Delicious :)', 'I want to eat this again')
Explanation
We use capture groups (the parentheses) to get the different parts in the string into different groups, search then finds these groups and outputs them in the tuple.
The (?!.*:) part is called a negative lookahead; we use it to make sure the colon we split on is the last one on the line.
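The effect of the lookahead can be seen on a line whose title itself contains a colon followed by a space. A small sketch, assuming the same pattern; the sample line is invented for illustration:

```python
import re

matcher = re.compile(r"(.*?):(?!.*:) (.*)")

# Hypothetical line with an extra ": " inside the title.
title, review = matcher.search("Note: great: Loved it").groups()
print(title)
print(review)
```

Because the lookahead rejects every colon that still has another colon to its right, the split happens at the last colon rather than the first.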
Edit
BTW, if, as you mentioned, you have many lines each containing a review, you can use this snippet to get all of the reviews separated by title and body at once:
import re
pattern = r"(.*?):(?!.*:) (.*)\n?"
matcher = re.compile(pattern)
# Written with explicit "\n" so the trailing space after each colon stays visible
reviews = "A: \n:: )5: \nDelicious :): I want to eat this again\n"
parsed_reviews = matcher.findall(reviews) # [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
simply put if it starts with a ':' I'm trying to ignore it.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I don't see my mistake; could anyone please tell me where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
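If the goal is simply to drop a colon whenever a line starts with one, a more general sketch is possible with re.MULTILINE. This is a different technique from the capture-group substitution above, and the simplified sample string is an assumption:

```python
import re

# Simplified version of the question's string (no blank lines in between).
st = "emp:firstinfo\n:secondinfo\nthirdinfo\n"

# With re.MULTILINE, ^ matches at the start of every line, so only a colon
# that begins a line (after optional spaces or tabs) is removed.
cleaned = re.sub(r"^[ \t]*:", "", st, flags=re.MULTILINE)
print(cleaned)
```

The colon in emp:firstinfo is kept because it is not at the start of its line.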
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
# Import the re module
import re
# Remove every lowercase letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)

How to python regex match the following?

1<assume tab here>Algebra I<assume tab here>START
1.1 What are the Basic Numbers? 1-1
For each of the two lines above, how do I regex match only the number up to and including the "?". In essence, I want the following groups:
["1", "Algebra I"]
["1.1", "What are the Basic Numbers?"]
Matching everything up to and including a question mark, or up to a "tab character".
How can I do this with a single regex?
Here's an easy regex:
^([\d.]+)\s*([^\t?]+\??)
Group 1 is the numbers, Group 2 contains the text.
To retrieve one single match:
match = re.search(r"^([\d.]+)\s*([^\t?]+\??)", s)
if match:
    mynumbers = match.group(1)
    myline = match.group(2)
To iterate over the matches, get groups 1 and 2 from:
reobj = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)
for match in reobj.finditer(s):
    # matched text: match.group()
    print(match.group(1), match.group(2))
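As a self-contained check of this pattern, the two lines from the question can be run through finditer. This sketch assumes the fields are separated by tab characters, which the question only states for the first line:

```python
import re

# Assumed sample: both lines from the question, with tab-separated fields.
s = "1\tAlgebra I\tSTART\n1.1\tWhat are the Basic Numbers?\t1-1"

reobj = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)
groups = [[m.group(1), m.group(2)] for m in reobj.finditer(s)]
print(groups)
```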
Here you go:
(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)
For explanation: (\d(?:\.\d)*) matches a digit followed by zero or more .\d sequences. This is followed by one or more whitespace characters and then a capture group that lazily matches either text ending in ? or any text; the enclosing non-capturing group requires a tab right after it.
Output:
import re
pattern = r"(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)"
# Fields are assumed to be tab-separated, as in the question
string1 = "1.1\tWhat are the Basic Numbers?\t1-1"
string2 = '1\tAlgebra I\tSTART'
m = re.match(pattern, string2)
m.group(1)
#'1'
m.group(2)
#'Algebra I'
m = re.match(pattern, string1)
m.group(1)
#'1.1'
m.group(2)
#'What are the Basic Numbers?'
EDIT: added non-capturing groups.
EDIT#2: fixed it to include question mark
EDIT#3 fixed no of groups.
