regex dealing with brackets - python

I have multiple strings like
string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]"""
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""
strings = [string1, string2, string3]
Every string contains one or more "[br]"s.
Each string may or may not include annotations.
Every annotation starts with "[*" and ends with "]". It may include double brackets("[[" and "]]"), but never single ones("[" and "]"), so there won't be any confusion (e.g. [* some annotation with [[brackets]]]).
The text I want to replace is everything between the first "[br]" and the first annotation (or the end of the string, if there is no annotation), which is
word1 = """팔짱낄 공''':'''"""
word2 = """낟알 과'''-'''"""
word3 = """둘레 곽[br]클 확"""
So I tried
for string in strings:
    print(re.sub(r"\[br\](.)+?(\[\*)+", "AAAA", string))
expecting something like
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
The logic for the regex was
\[br\] : the first "[br]"
(.)+? : one or more characters that I want to replace, lazy
(\[\*)+ : one or more "[*"s
But the result was
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[顆|{{{#!html}}}]]AAAA some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]]AAAA another annotation.][* another annotation.]
instead. I also tried r"\[br\](.)+?(\[\*)*" but it still does not work. How can I fix this?

You could use
^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)
The pattern matches
^ Start of string
(.*?\[br]) Capture group 1, match as few chars as possible until the first occurrence of [br]
.+? Match any char 1+ times, as few as possible
(?= Positive lookahead, assert at the right
\[\*.*?](?<!].)(?!]) Match [* up to a ] that is neither preceded nor followed by another ]
| Or
$ Assert end of string
) Close lookahead
Replace with capture group 1 and AAAA like \1AAAA
Example code
import re

pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"
s = ("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''\n"
     "[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', \") and brackets(\"(\", \")\", \"[[\", \"]]\").]\n"
     "[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]")
# re.MULTILINE lets ^ and $ match at each line, so every line is handled separately
result = re.sub(pattern, r"\1AAAA", s, flags=re.MULTILINE)
print(result)
Output
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
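If the text being replaced can be assumed never to contain "[*" itself, a simpler variant may be enough. The following is only a sketch under that assumption, not the pattern above: it anchors on the first "[br]" with a lookbehind and stops at the first "[*" or at the end of the string. The names WORD and replace_word are made up for illustration.

```python
import re

# Assumption: the text being replaced never contains "[*" itself.
WORD = re.compile(r"(?<=\[br\]).+?(?=\[\*|$)")

def replace_word(s, new="AAAA"):
    # count=1 restricts the substitution to the text after the first "[br]".
    return WORD.sub(new, s, count=1)

samples = [
    "[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''",
    "[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]",
]
for sample in samples:
    print(replace_word(sample))
```

Because the lookbehind and lookahead consume nothing, no capture group is needed in the replacement.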

The best I could come up with is first checking if there are any annotations:
import re

r = re.compile(r'''
    (\[br])
    (.*?)
    (\[\*.*\]$)
    ''', re.VERBOSE)
annotation = re.compile(r'''
    (\[\*.*]$)
    ''', re.VERBOSE)

def replace(m):
    return m.group(1) + "AAAA" + m.group(3)

for s in string1, string2, string3:
    print()
    print(s)
    if annotation.search(s):
        print(r.sub(replace, s))
    else:
        print(re.sub(r'\[br](.*)', '[br]AAAA', s))
which gives the expected output:
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
I suppose you could move the if into the replace function, but I'm not sure if that would be much of an improvement. It would look something like:
import re

r = re.compile(r'''
    ^(?P<prefix>.*)
    (?P<br>\[br].*?)
    (?P<annotation>\[\*.*\])?
    (?P<rest>[^\[]*)$
    ''', re.VERBOSE)

def replace(m):
    g = m.groupdict()
    # the prefix will contain all but the last [br], thus the split...
    head = g['prefix'].split('[br]')[0]
    if g['annotation'] is None:
        # without an annotation, 'rest' holds the text being replaced, so drop it
        return head + "[br]AAAA"
    return head + "[br]AAAA" + g['annotation'] + g['rest']

for s in string1, string2, string3:
    print()
    print(s)
    print(r.sub(replace, s))
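For comparison, the same split can be done without a regex at all. This is a sketch of a different technique (plain string methods), assuming as above that the replaced text never contains "[*"; the function name replace_after_br is hypothetical.

```python
def replace_after_br(s, new="AAAA"):
    # Split at the first "[br]"; sep is empty when the marker is missing.
    head, sep, tail = s.partition("[br]")
    if not sep:
        return s
    # Everything from the first "[*" onward is treated as annotations.
    start = tail.find("[*")
    annotations = tail[start:] if start != -1 else ""
    return head + sep + new + annotations

print(replace_after_br("[[顆|x]][br]낟알 과[* note with [[brackets]]]"))
print(replace_after_br("[[拱|x]][br]팔짱낄 공"))
```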

Related

What is the best method of processing optional group in Python Regex?

I'm trying to write a function that enforces capitalization on certain words, and adds "'s" to certain words if they are followed by " s". For example, it should take grace s and transform that to Grace's.
r"(\b)(grace)( (s|S))?\b": possessive_name,
{...}
def possessive_name(match: Match) -> str:
    result = match.group(2).title()
    result = result.replace(" ", "'")
    return result  # type: ignore
I'm correctly "titlizing" it but can't figure out how to reference the optional ( (s|S)) group so that the ( 's) can be added if it's needed, and I'd like to avoid adding an additional regex... Is this possible?
*edited names for clarity
Yes, like this.
import re

test_str = "This is grace s apple."

def fix_names(match):
    name, s = match.groups()
    name = name.title()
    if s:
        name = f"{name}'s"
    return name

p = re.compile(r"\b(grace)(\s[sS])?\b")
print(p.sub(fix_names, test_str))
lines = (
    'a grace s apple',
    'the apple is grace s',
    'take alice s and steve s',
)
for line in lines:
    result = re.sub(r'(\w+)\s+s($|\s)', lambda m: m.group(1).title()+"'s"+m.group(2), line, flags=re.I|re.S)
    print(result)
You'll get:
a Grace's apple
the apple is Grace's
take Alice's and Steve's
You could capture 1+ word characters in group 1 followed by matching a space and either s or S using a character class.
In the replacement use the .title() on group 1 and add 's
(?<!\S)(\w+) [sS](?!\S)
Explanation
(?<!\S) Left whitespace boundary
(\w+) Capture group 1, match 1+ word chars
 [sS] Match a space followed by either s or S
(?!\S) Right whitespace boundary
Code example
import re
test_str = "grace s"
regex = r"(?<!\S)(\w+) [sS](?!\S)"
result = re.sub(regex, lambda match: match.group(1).title()+"'s", test_str)
print(result)
Output
Grace's
If you want to match grace specifically, you could use an optional group. If you want to match more words, you could use an alternation such as (?:grace|anotherword)
(?<!\S)(grace)(?: ([sS]))?\b
Example code
import re

strings = [
    "grace s",
    "Her name is grace."
]
pattern = r"(?<!\S)(grace)(?: ([sS]))?\b"
regex = re.compile(pattern)
for s in strings:
    print(
        regex.sub(
            lambda m: "{}{}".format(m.group(1).title(), "'s" if m.group(2) else ''),
            s)
    )
Output
Grace's
Her name is Grace.
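The alternation idea can be sketched for several names at once. The list of names below is hypothetical, and re.escape is used so that a name containing regex metacharacters would not break the pattern.

```python
import re

# Hypothetical list of names to capitalize.
names = ["grace", "alice", "steve"]
alternation = "|".join(re.escape(name) for name in names)
pattern = re.compile(rf"(?<!\S)({alternation})(?: ([sS]))?\b", re.IGNORECASE)

def fix_name(m):
    # Group 2 is None when the optional " s" part did not take part in the match.
    return m.group(1).title() + ("'s" if m.group(2) else "")

result = pattern.sub(fix_name, "take alice s and steve s, then ask grace.")
print(result)  # take Alice's and Steve's, then ask Grace.
```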

Getting pattern matched from match object

I'm working with Python regex and I'm trying to get the pattern that matched from a match object, not the matched text itself.
I have some patterns to replace and I'm doing this:
import re

patterns = {
    r'^[ |\n]+': '',
    r'[ |\n]+$': '',
    r'[ |\n]+': ' '
}
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join(patterns.keys()),
              lambda match: patterns[ match.group(0) ],
              text)
But this is the wrong solution, because match.group(0) returns the matched text, so it will never equal any key of the patterns dict.
I tried match.pattern but got an exception, and I tried match.re, but that gives the whole compiled regex object, whose pattern for this problem is '^[ |\n]+|[ |\n]+$|[ |\n]+'.
EDIT: based on Barmar solution I got this:
import re

patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]

def getreplacement(match):
    for i, group in enumerate(match.groups()):
        if group:
            return patterns[ i ][ 1 ]

text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[ 0 ] + ')' for p in patterns), getreplacement, text)
print(text)
But this is still not a way to always get the pattern from a match group.
I don't think there's a way to find out directly which alternative matched.
Use a list instead of a dictionary, and put each pattern in a capture group. Then you can see which capture group matched, and use that as the index to get the corresponding replacement.
Note that this won't work if there are any capture groups in the patterns. If groups are needed, make sure they're non-capturing.
import re

patterns = [
    (r'^[ |\n]+', ''),
    (r'[ |\n]+$', ''),
    (r'[ |\n]+', ' ')
]

def getreplacement(match):
    # match.re.groups is the number of capture groups in the combined pattern
    for i in range(1, match.re.groups + 1):
        if match.group(i):
            return patterns[i-1][1]

text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[0] + ')' for p in patterns), getreplacement, text)
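A variation on the same idea is to give each alternative a named group and read match.lastgroup, which holds the name of the last capture group that matched. This is a sketch only: the rule names are invented, the pipe has been dropped from the character class, and it assumes the sub-patterns contain no capture groups of their own.

```python
import re

# (name, pattern, replacement); the names are hypothetical labels.
rules = [
    ("leading", r"^[ \n]+", ""),
    ("trailing", r"[ \n]+$", ""),
    ("inner", r"[ \n]+", " "),
]
combined = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat, _ in rules))
replacements = {name: repl for name, _, repl in rules}

def get_replacement(m):
    # With one named group per alternative, lastgroup names the one that matched.
    return replacements[m.lastgroup]

text = "  Hello there,  I\n need help  "
cleaned = combined.sub(get_replacement, text)
print(repr(cleaned))  # 'Hello there, I need help'
```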
If I got it right, you want to strip leading and trailing spaces and reduce the ones in the middle to just one.
First, your code likely has a bug: [ |\n] will match a space ( ), a pipe (|), or a new line. You probably don't want to match a pipe, but you might want to match all whitespace characters, like tabs (\t), for example.
Second, styling: keep your lines under 80 chars and no spaces around indices in brackets.
Third, removing the leading and trailing spaces is simply done with str.strip. The only thing remaining to replace now is sequences of two or more whitespaces, which is easily matched with \s{2,} (\s = "whitespace", {2,} = "two or more").
Here is a modification of your code:
import re

patterns = [
    (r"^[ |\n]+", ""),
    (r"[ |\n]+$", ""),
    (r"[ |\n]+", " "),
]

def get_replacement(m: re.Match) -> str:
    return next(
        patterns[i][1]
        for i, group in enumerate(m.groups())
        if group is not None
    )

text = (
    "\n"
    " \t Hello there, I\n need your help here plase :) \t \n"
    " \t Hello there, I\n need your help here plase :) \t "
    "\n"
)
result1 = re.sub(
    "|".join(f"({p})" for p, _ in patterns),
    get_replacement,
    text,
)
result2 = re.sub(r"[ \n]{2,}", " ", text.strip())
result3 = re.sub(r"\s{2,}", " ", text.strip())
print(repr(result1))
print(repr(result2))
print(repr(result3))
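If the goal really is just to trim the ends and collapse every run of whitespace to one space, a regex may not be needed. As a sketch of a different technique from the code above: str.split() with no arguments splits on any run of whitespace and ignores leading and trailing whitespace, so joining the pieces with a single space normalizes the text.

```python
text = " \t Hello there, I\n need your   help \t \n"

# split() with no arguments splits on runs of any whitespace
# and discards empty pieces at both ends.
normalized = " ".join(text.split())
print(repr(normalized))  # 'Hello there, I need your help'
```

Unlike the \s{2,} substitution, this also turns a single newline or tab into a space.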

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re

with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
Probably more efficient to do it #TigerhawkT3's way. Though they produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re

rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\'  match any quote not preceded by a word character
# |          or
# \'(?!\w)   match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s))  # here is some stuff now there are quotes now there's not
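Putting the pieces together for the exact input in the question, a minimal sketch might look like the following. It assumes ASCII letters only, as in the question's own character class, and the function name normalize is made up for illustration.

```python
import re

def normalize(line):
    line = line.lower()
    # Replace everything except letters, spaces and quotes with a space.
    line = re.sub(r"[^a-z' ]+", " ", line)
    # Drop quotes that are not between two letters.
    line = re.sub(r"(?<![a-z])'|'(?![a-z])", "", line)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(line.split())

cleaned = normalize("Here is some stuff. 'Now there are quotes.' Now there's not.")
print(cleaned)  # here is some stuff now there are quotes now there's not
```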
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is removed unless it sits between two word characters (letters, digits or _).
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is removed unless it sits between two letters (ASCII-only letters in the first pattern, any Unicode letter in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    ok = 0
    for x in text: ok += x.isalnum()
    return ok > 0

def stripit(text, ofwhat):
    # Strip each unwanted character from both ends, one character at a time
    for x in ofwhat: text = text.strip(x)
    return text

def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)

>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.

Python Regex: Symbol + in every letter in the same word

I am using Python.
I want to make a regex that allows the following examples:
Day
Dday
Daay
Dayy
Ddaay
Ddayy
...
So, each letter of a word, one or more times.
How can I write it easily? Is there an expression that makes it easy?
I have a lot of words.
Thanks
We can try using the following regex pattern:
^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$
This matches and captures a single letter, followed by any number of occurrences of this letter. The \1 you see in the above pattern is a backreference which represents the previous matched letter (and so on for \2 and \3).
Code:
import re

word = "DdddddAaaaYyyyy"
matchObj = re.match(r'^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$', word, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
    print("matchObj.group(3) : ", matchObj.group(3))
else:
    print("No match!!")
To match a character one or more times you can use the + quantifier. To build the full pattern dynamically you would need to split the word to characters and add a + after each of them:
pattern = "".join(char + "+" for char in word)
Then just match the pattern case insensitively.
Demo:
>>> import re
>>> word = "Day"
>>> pattern = "".join(char + "+" for char in word)
>>> pattern
'D+a+y+'
>>> words = ["Dday", "Daay", "Dayy", "Ddaay", "Ddayy"]
>>> all(re.match(pattern, word, re.I) for word in words)
True
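Two refinements may be worth considering when this is applied to many words; the sketch below is an assumption about what is wanted rather than part of the answer above. re.escape protects characters that are special in a regex, and fullmatch rejects input with extra trailing characters, which re.match alone would accept. The helper name stretched_pattern is hypothetical.

```python
import re

def stretched_pattern(word):
    # Each character may repeat one or more times; escape it in case it is
    # a regex metacharacter such as "." or "+".
    return re.compile("".join(re.escape(ch) + "+" for ch in word), re.IGNORECASE)

day = stretched_pattern("Day")
print(bool(day.fullmatch("Ddaay")))  # True
print(bool(day.fullmatch("Days")))   # False
```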
Try /d+a+y+/gi:
d+ Matches d one or more times.
a+ Matches a one or more times.
y+ Matches y one or more times.
As per my original comment, the below does exactly what I explain.
Since you want to be able to use this on many words, I think this is what you're looking for.
import re

word = "day"
regex = r"^"+("+".join(list(word)))+"+$"
test_str = ("Day\n"
            "Dday\n"
            "Daay\n"
            "Dayy\n"
            "Ddaay\n"
            "Ddayy")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(
        matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
This works by converting the string into a list, then converting it back to string, joining it on +, and appending the same. The resulting regex will be ^d+a+y+$. Since the input you presented is separated by newline characters, I've added re.MULTILINE.

replacing using regex python

I have a sentence like this
s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"
I need to replace every word that has a tag with just its tag. The output should be
s = "NN VBD IN NN is NN"
I tried using regex replace like this
tup = re.sub( r"\s*/$" , "", s)
but this is not giving me the correct output. Please help.
This gives the output you want:
tup = re.sub( r"\b\w+/" , "", s)
\b matches a word boundary, followed by \w+ (at least one word character, a-zA-Z0-9_) and finally the slash.
Try:
tup = re.sub( r"[a-z]*/" , "", s)
In [1]: s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"
In [2]: tup = re.sub( r"[a-z]*/" , "", s)
In [3]: print(tup)
NN VBD IN NN is NN
The \s character group matches all whitespace characters, which doesn't seem to be what you want. I think you want the opposite case, all non-whitespace characters. You can also be more specific about what a tag is, for example:
tup = re.sub( r"\S+/([A-Z]+)" , r"\1", s)
This replaces all non-whitespace characters, followed by a slash and then a sequence of uppercase letters with just the uppercase letters.
tup = re.sub( r"\b\w+/(\w+)\b", r"\1", s)
On either side of my regex is \b, meaning "word boundary"; then on either side of "/" I have \w+, meaning "word characters". On the right we group them by putting them in parentheses.
The second expression, r"\1", means "the first group", which gets the text matched inside the parentheses.
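For comparison, the same result can be sketched without a regex, using plain string methods instead of re.sub. It assumes tokens are separated by whitespace and that the tag is whatever follows the last "/" in a token; tokens without a slash are kept unchanged.

```python
s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"

# rsplit("/", 1) splits once from the right; [-1] is the tag,
# or the whole token when there is no slash.
tags = " ".join(token.rsplit("/", 1)[-1] for token in s.split())
print(tags)  # NN VBD IN NN is NN
```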
