Unicode Substitutions using Regex, Python

I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to put a space on either side of each Unicode character matched by my regex.
Desired output string:
out = "heylisten \uff08 there is something \uff09 to say \uffa9 "
I used re.findall to get all the matches and then replaced them one by one:
p1 = re.findall(r'\uff[0-9a-e][0-9]', str1, flags = re.U)
out = str1
for item in p1:
    print item
    print out
    out = re.sub(item, r" " + item + r" ", out)
And this outputs:
'heylisten\\ uff08 there is something\\ uff09 to say \\ uffa9 '
What is wrong with the above that it prints an extra "\" and also separates it from uff? I even tried with re.search but it seems to only separate \uff08. Is there a better way?

I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to replace the unicode values ...
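You don't have any Unicode values. In Python 2, a literal without the u prefix is a bytestring, so each \uff08 is six separate characters. Use a unicode literal and match the characters with a character class: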
str1 = u"heylisten\uff08there is something\uff09to say \uffa9"
...
p1 = re.sub(ur'([\uff00-\uffe9])', r' \1 ', str1)
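For readers on Python 3, where every str literal is already Unicode and the ur prefix no longer exists, the same character-class substitution might look like the sketch below. The helper name pad_fullwidth is made up for illustration; the character range is copied from the answer above.

```python
import re

# Hypothetical helper name; the range U+FF00..U+FFE9 is the one used in the answer.
FULLWIDTH = re.compile(r'([\uff00-\uffe9])')

def pad_fullwidth(text):
    """Surround each character in U+FF00..U+FFE9 with one space on each side."""
    return FULLWIDTH.sub(r' \1 ', text)

str1 = "heylisten\uff08there is something\uff09to say \uffa9"
out = pad_fullwidth(str1)
print(ascii(out))  # ascii() shows the non-ASCII characters as \uXXXX escapes
```

Because the input already has a space before the last character, the result contains two spaces at that point; collapse them afterwards if that matters.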

print(re.sub(r"(\\uff[0-9a-e][0-9])", r" \1 ", str1))
You can use this re.sub call directly, without the findall loop. See the demo:
http://regex101.com/r/sU3fA2/67
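import re
# The pattern matches the literal six-character text \uffXX, so the input must
# contain literal backslashes (a raw string here), not decoded characters.
p = re.compile(r'(\\uff[0-9a-e][0-9])')
test_str = r"heylisten\uff08there is something\uff09to say \uffa9"
subst = r" \1 "  # raw string, so \1 is a backreference and not chr(1)
result = p.sub(subst, test_str)
print(result)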
Output:
heylisten \uff08 there is something \uff09 to say \uffa9

Related

regex dealing with brackets

I have multiple strings like
string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]"""
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""
strings = [string1, string2, string3]
Every string contains one or more "[br]"s.
Each string may or may not include annotations.
Every annotation starts with "[*" and ends with "]". It may include double brackets ("[[" and "]]"), but never single ones ("[" and "]"), so there is no ambiguity (e.g. [* some annotation with [[brackets]]]).
The words I want to replace are those between the first "[br]" and the first annotation (or the end of the string if there is none), which are
word1 = """팔짱낄 공''':'''"""
word2 = """낟알 과'''-'''"""
word3 = """둘레 곽[br]클 확"""
So I tried
for string in strings:
    print(re.sub(r"\[br\](.)+?(\[\*)+", "AAAA", string))
expecting something like
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
The logic for the regex was
\[br\] : the first "[br]"
(.)+? : one or more characters that I want to replace, lazy
(\[\*)+ : one or more "[*"s
But the result was
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[顆|{{{#!html}}}]]AAAA some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]]AAAA another annotation.][* another annotation.]
instead. I also tried r"\[br\](.)+?(\[\*)*", but that does not work either. How can I fix this?

You could use
^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)
The pattern matches
^ Start of string
(.*?\[br]) Capture group 1, match as few characters as possible up to and including the first occurrence of [br]
.+? Match any character 1+ times, as few as possible
(?= Positive lookahead, assert that what is directly to the right is
\[\*.*?](?<!].)(?!]) [* up to the nearest ] that is neither preceded nor followed by another ]
| Or
$ Assert end of string
) Close lookahead
Replace with capture group 1 and AAAA like \1AAAA
Example code
import re
pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"
s = ("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''\n"
"[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', \") and brackets(\"(\", \")\", \"[[\", \"]]\").]\n"
"[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]")
subst = r"\1AAAA"
result = re.sub(pattern, subst, s, flags=re.MULTILINE)
print(result)
Output
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]

The best I could come up with is first checking if there are any annotations:
import re
r = re.compile(r'''
    (\[br])
    (.*?)
    (\[\*.*\]$)
    ''', re.VERBOSE)
annotation = re.compile(r'''
    (\[\*.*]$)
    ''', re.VERBOSE)
def replace(m):
    return m.group(1) + "AAAA" + m.group(3)
for s in string1, string2, string3:
    print()
    print(s)
    if annotation.search(s):
        print(r.sub(replace, s))
    else:
        print(re.sub(r'\[br](.*)', '[br]AAAA', s))
which gives the expected output:
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
I suppose you could move the if into the replace function, but I'm not sure if that would be much of an improvement. It would look something like:
import re
r = re.compile(r'''
    ^(?P<prefix>.*)
    (?P<br>\[br].*?)
    (?P<annotation>\[\*.*\])?
    (?P<rest>[^\[]*)$
    ''', re.VERBOSE)
def replace(m):
    g = m.groupdict()
    # the greedy prefix holds everything before the last [br], so keep only
    # the part before the first [br]
    head = g['prefix'].split('[br]')[0]
    if g['annotation'] is None:
        # without an annotation, 'rest' holds the words being replaced, so drop it
        return head + "[br]AAAA"
    return head + "[br]AAAA" + g['annotation'] + g['rest']
for s in string1, string2, string3:
    print()
    print(s)
    print(r.sub(replace, s))

regex and python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to replace the "T" character in the timestamp with a space, i.e. change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?

You can use lookarounds (positive lookbehind and -ahead):
(?<=\d)T(?=\d)
See a demo on regex101.com.
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
See a demo on ideone.com.
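The reason the original attempt removed too much is that a non-capturing group (?:...) only skips creating a numbered group; its text is still part of the overall match, and re.sub replaces the whole match. A small sketch comparing the two approaches (the variable names are illustrative):

```python
import re

my_string = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups still consume text, so the whole match is "-23T23:".
consumed = re.search(r'(?:-\d{2})T(?:\d{2}:)', my_string).group(0)

# Lookarounds only assert what surrounds the T, so the match is just "T".
asserted = re.search(r'(?<=\d)T(?=\d)', my_string).group(0)

fixed = re.sub(r'(?<=\d)T(?=\d)', ' ', my_string)
print(repr(consumed), repr(asserted))
print(fixed)
```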

A regex seems like overkill here:
newString = myString.replace("T", " ")
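Note that str.replace substitutes every T in the string. That is fine for the sample, but not if the ID part can also contain a T. A quick sketch of the difference, assuming an ID such as 123ATC:

```python
import re

sample = "123ATC,'2009-12-23T23:45:58.544-04:00'"  # assumed ID containing a T

# str.replace touches both T characters.
naive = sample.replace("T", " ")

# Restricting the match to a T between digits leaves the ID alone.
targeted = re.sub(r'(?<=\d)T(?=\d)', ' ', sample)

print(naive)
print(targeted)
```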

I'd use capturing groups; unanchored lookbehinds are costly in terms of regex performance:
(\d)T(\d)
Then replace with the r'\1 \2' replacement pattern, whose backreferences put back the digits before and after the T.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)

That T sits between digits and is always the last T in the string, so you could use rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ABC,'2009-12-23 23:45:58.544-04:00'"
Trying this on a string with another T earlier on:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ATC,'2009-12-23 23:45:58.544-04:00'"

Finding items in quotes, but not escaped quotes, in python using re

Suppose there is a series of strings. Important items are enclosed in quotes, but other items are enclosed in escaped quotes. How can you return only the important items?
Example where both are returned:
import re
testString = 'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = '"([^\\\"]*)"'
print re.findall( pattern, testString)
Result prints
['one', 'two']
How can I get python's re to only print
['one']

You can use negative lookbehinds to ensure there's no backslash before the quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
#           ^^^^^^^        ^^^^^^^
print re.findall(pattern, testString)
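On Python 3 the same lookbehind approach can be wrapped in a small function. The name unescaped_quoted is hypothetical, and this sketch assumes quoted items never contain quote characters themselves.

```python
import re

# (?<!\\)" matches a double quote that is not preceded by a backslash.
UNESCAPED_QUOTED = re.compile(r'(?<!\\)"([^"]*)(?<!\\)"')

def unescaped_quoted(text):
    """Return the contents of "..." pairs whose quotes are not backslash-escaped."""
    return UNESCAPED_QUOTED.findall(text)

test_string = r'this, is a test "one" it should only return the first item \"two\" and not the second'
print(unescaped_quoted(test_string))
```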

Although you wrote \" to mark the other items, in a normal Python string literal \" is just a double quote, so the string actually contains "two". Use a raw string, in which \" stays as the two characters \ and ".
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'"(\w*)"'
print re.findall( pattern, testString)
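To see why the raw-string prefix matters, compare what the two literals actually contain. In a normal literal the backslash in \" is dropped by Python before the regex ever runs (a minimal sketch with a shortened sample string):

```python
plain = 'item \"two\" here'   # \" is just an escape for a bare double quote
raw = r'item \"two\" here'    # the raw string keeps both backslashes

print(plain)
print(raw)
print(len(raw) - len(plain))  # the raw string is longer by the two backslashes
```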

Find all strings that are in between two sub strings

I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.

Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions (the end-of-string anchor), so you need to escape it as \$ to match a literal dollar sign.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
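The effect of the unescaped $ can be seen by running both patterns side by side. With the anchor, the lazy group has to stretch all the way to the end of the string, so only one item comes back (a small sketch; the variable names are illustrative):

```python
import re

s = '## cat $$ ##dog$^'

# Unescaped: $ is the end-of-string anchor, so the group runs to the end.
anchored = re.findall(r'##(.*?)$', s)

# Escaped: \$ matches a literal dollar sign, so each item stops at the first $.
literal = re.findall(r'##(.*?)\$', s)

print(anchored)
print(literal)
```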
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the context has a possibility of spanning across newlines, you may consider using negation.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)

Change a text between two strings in Python with Regex

I found several similar questions, but I cannot fit my problem to any of them. I am trying to find and replace a string between two other strings in a text.
reg = "%s(.*?)%s" % (str1,str2)
r = re.compile(reg,re.DOTALL)
result = r.sub(newstring, originaltext)
The problem is that the code above also replaces str1 and str2, whereas I want to replace only the text between them. Am I missing something obvious?
Update:
A simplified example:
text = 'abcdefghijklmnopqrstuvwxyz'
str1 = 'gh'
str2 = 'op'
newstring = 'stackexchange'
reg = "%s(.*?)%s" % (str1,str2)
r = re.compile(reg,re.DOTALL)
result = r.sub(newstring, text)
print result
The result is abcdefstackexchangeqrstuvwxyz whereas I need abcdefghstackexchangeopqrstuvwxyz

Use a combination of lookarounds in your regular expression.
reg = "(?<=%s).*?(?=%s)" % (str1,str2)
Explanation:
Lookarounds are zero-width assertions. They don't consume any characters on the string.
(?<= # look behind to see if there is:
gh # 'gh'
) # end of look-behind
.*? # any character (0 or more times, as few as possible); with re.DOTALL this includes \n
(?= # look ahead to see if there is:
op # 'op'
) # end of look-ahead
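Putting this together with the simplified example from the question gives the sketch below. Two additions are assumptions rather than part of the answer: re.escape protects delimiters that contain regex metacharacters, and the replacement is passed as a function so that backslashes in the new text are not read as group references. Python's re also requires lookbehinds to be fixed-width, which holds for an escaped literal. The function name replace_between is hypothetical.

```python
import re

def replace_between(text, start, end, new):
    """Replace the text between start and end, keeping both delimiters."""
    # re.escape is an added safeguard for delimiters such as "(" or "$".
    pattern = "(?<=%s).*?(?=%s)" % (re.escape(start), re.escape(end))
    # A function replacement inserts `new` verbatim, backslashes included.
    return re.sub(pattern, lambda m: new, text, flags=re.DOTALL)

text = 'abcdefghijklmnopqrstuvwxyz'
result = replace_between(text, 'gh', 'op', 'stackexchange')
print(result)
```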
