Raw string (r'...') in a regular expression in Python doesn't work?

Below is my raw string (r'...') test in Python.
import re
a = re.compile('\d')
b = re.compile('\\d')
c = re.compile(r'\d')
d = re.compile(r'\\d')
print a.search("1") # (O)
print a.search("\d")
print a.search("\1")
print b.search("1") # (O)
print b.search("\d")
print b.search("\1")
print c.search("1") # (O)
print c.search("\d")
print c.search("\1")
print d.search("1")
print d.search("\d") # (O)
print d.search("\1")
But it seems like the raw string doesn't work.
For example, regular expression b should match the string composed of "backslash + letter d", but it matches only the digit '1'.
And according to the meaning of the r prefix, regular expression c should also match the string composed of "backslash + letter d", but it doesn't.
Could anyone explain this?
Thanks

Your first three strings are exactly the same.
>>> '\d' == '\\d' == r'\d'
True
Thus, when run through the regex engine, they all match only a single digit. This is true because '\d' has no interesting behavior in the way that '\n' does, so parsing the backslash as literal is the only reasonable way for the Python interpreter to respond (barring a parse error -- which I'd argue might have been a better idea, but couldn't be implemented now without breaking compatibility).
By contrast, the same is not true of \n:
>>> '\n' == '\\n'
False
>>> '\\n' == r'\n'
True
Your fourth string, r'\\d', is the same as '\\\\d'; thus, that it matches only the literal string \d should be no surprise.
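The comparisons above can be checked without writing the bare '\d' literal, which recent Python versions flag with a warning as an unrecognized escape. A minimal Python 3 sketch follows; the variable names are illustrative and not from the question:

```python
import re

escaped = '\\d'      # backslash escaped by hand: two characters, \ and d
raw = r'\d'          # raw literal: the same two characters
double_raw = r'\\d'  # raw literal with two backslashes: three characters

assert escaped == raw
assert len(raw) == 2
assert double_raw == '\\\\d' and len(double_raw) == 3

digit = re.compile(raw)           # regex \d: any digit
literal = re.compile(double_raw)  # regex \\d: a backslash followed by d

print(digit.search("1") is not None)      # True
print(digit.search(r"\d") is not None)    # False
print(literal.search(r"\d") is not None)  # True
print(literal.search("1") is not None)    # False
```

The regex engine only ever sees the final string, so what matters is how many backslashes survive the string literal, not which literal syntax produced them.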

Related

How to properly unescape select sequences in python

I'm escaping certain characters in strings (e.g., \n, \\) with double backslashes, like this: text.replace("\\", "\\\\").replace("\n", "\\n")
Naïvely, I tried to unescape using: text.replace("\\n", "\n").replace("\\\\", "\\")
However, this fails on strings like:
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> ftext = etext.replace("\\n", "\n").replace("\\\\", "\\")
>>> print(ftext)
\
\
>>>
As you can see the original string doesn't survive the round trip.
Even changing the order of replaces around would not solve the issue.
The only way to correctly unescape is to do the replacements in one go.
Python's str has maketrans and translate to achieve a similar effect, but they only work with single characters as keys.
re.sub also does not work since the substitution would need to distinguish the case somehow. (\1 does not work since if the second character is n we want the newline character as output instead of n)
A correct (but slow) solution would be:
def unescape(text: str) -> str:
    res: list[str] = []
    in_escape = False
    for c in text:
        if in_escape:
            in_escape = False
            if c == "\\":
                res.append("\\")
                continue
            if c == "n":
                res.append("\n")
                continue
        if c == "\\":
            in_escape = True
            continue
        res.append(c)
    return "".join(res)
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> print(unescape(etext))
\
\n
>>>
Is there a proper/canonical/fast way of escaping (only certain sequences in) strings?
(EDIT: To answer why a subset of escapes is preferred: in my case other escapes are not needed, and it's easy to permanently corrupt your data by escaping things that don't need escaping. For example, off the top of my head I can think of three different escape functions in Python alone that all escape completely different subsets of characters. Even the str.escape function changes what it escapes between Python versions. Most of the time an unescape function can handle a wider set of escape sequences than its corresponding escape function, but this is not always the case. None of this even takes into account trying to load the escaped data in a different language.)
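One way to do both replacements in a single pass is to give re.sub a function instead of a replacement string, so that each matched escape sequence is looked up individually. This is only a sketch, assuming that backslash and newline are the only characters ever escaped; the names escape, unescape, _UNESCAPE and _ESCAPE_SEQ are illustrative:

```python
import re

# Maps each escape sequence to the character it stands for.
_UNESCAPE = {"\\\\": "\\", "\\n": "\n"}
# Matches a backslash followed by either a backslash or the letter n.
_ESCAPE_SEQ = re.compile(r"\\[\\n]")


def escape(text: str) -> str:
    # Backslashes are doubled first so the ones added for newlines stay single.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(text: str) -> str:
    # re.sub scans left to right and never revisits consumed characters,
    # so an escaped backslash followed by a plain n is not read as a newline.
    return _ESCAPE_SEQ.sub(lambda m: _UNESCAPE[m.group(0)], text)


text = "\\\n\\n"
assert unescape(escape(text)) == text
```

Because the pattern consumes two characters at a time, the order problem of chained str.replace calls does not arise.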

How come these strings are not equal?

I have been trying out (for my own personal use) some people's solutions to timed keyboard inputs and the only one that has worked was one by Alex Martelli/martineau here. I used their second block of code (starting with import msvcrt) and it worked great for pretty much everything but comparisons. I replaced the return of None with an empty string if no input is entered in time and I used some test lines as shown below:
import msvcrt
import time
def raw_input_with_timeout(prompt, timeout):
    print prompt,
    finishat = time.time() + timeout
    result = []
    while True:
        if msvcrt.kbhit():
            result.append(msvcrt.getche())
            if result[-1] == '\r':   # or \n, whatever Win returns;-)
                return ''.join(result)
            time.sleep(0.1)          # just to yield to other processes/threads
        else:
            if time.time() > finishat:
                return ""
textVar = raw_input_with_timeout("Enter here: \n", 5)
print str(textVar) # to make sure the string is being stored
print type(str(textVar)) # to make sure it is of type string and can be compared
print str(str(textVar) == "test")
time.sleep(10) # so I can see the output
After I compile that with pyinstaller, run it, and type test into the window, I get this output:
Enter here:
test
test
<type 'str'>
False
I originally thought the comparison was returning False because the function appends characters to an array and that may have had something to do with it not doing a proper comparison with a string, but after looking further into the way Python works (namely, SilentGhost's response here), I really have no idea why the comparison will not return True. Any response is appreciated. Thank you!
You won't be able to see why the strings are different just by printing. String values can contain bytes that are not (easily) visible on a console when printed.
Use the repr() function to produce a debugging-friendly representation instead. This representation will format the string as a Python string literal, using only printable ASCII characters and escape sequences:
>>> foo = 'test\t\n'
>>> print foo
test
>>> foo == 'test'
False
>>> print repr(foo)
'test\t\n'
In your case, you are including the \r carriage return character in your return value:
if result[-1] == '\r':
    return ''.join(result)
That last \r is still there, so you get, at the very least, the value 'test\r', but \r won't show up when printing:
>>> print 'test\r'
test
>>> print repr('test\r')
'test\r'
You could just exclude that last character when joining, by slicing the string:
return ''.join(result[:-1])
or you could use str.strip() to remove all whitespace characters from both the start and end of the string (including that \r character):
return ''.join(result).strip()
Note that there is no point in using str() calls here. You return a str object, so str(textVar) is redundant. Moreover, print will call str() on anything not a string object yet.
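The effect can be reproduced without msvcrt (which is Windows-only) by using a stand-in value. A small sketch in Python 3 syntax, where typed is an assumed example of what the function returns after the user types test and presses Enter:

```python
typed = "test\r"  # stand-in for the returned value, trailing \r included

print(repr(typed))                     # 'test\r'
print(typed == "test")                 # False
print(typed[:-1] == "test")            # True
print(typed.strip() == "test")         # True
print(typed.rstrip("\r\n") == "test")  # True
```

rstrip("\r\n") removes only trailing line-ending characters, which may be preferable if leading or trailing spaces in the input are meaningful.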
If you consider this fragment of code:
result = []
while True:
    if msvcrt.kbhit():
        result.append(msvcrt.getche())
        if result[-1] == '\r':   # or \n, whatever Win returns;-)
            return ''.join(result)
You can see that when building the input string, the final character that the user enters must be \r, which is an unprintable character corresponding to the carriage return. Therefore, the returned input string looks like:
test\r
I think you need to rework the code to discard the final unprintable character from the input.
You probably have some unseen bytes after the string. Try print([c for c in textVar]), and if it shows characters like '\r' and '\n', try str(textVar).strip() == "test" or remove those characters manually.

How to check if a given character is considered as 'special' by the Python regex engine?

Is there an easy way to verify that the given character has a special regex function?
Of course I can collect regex characters in a list like ['.', "[", "]", etc.] to check that, but I guess there is a more elegant way.
You could use re.escape. For example:
>>> re.escape("a") == "a"
True
>>> re.escape("[") == "["
False
The idea is that if a character is a special one, then re.escape returns the character with a backslash in front of it. Otherwise, it returns the character itself.
You can use re.escape inside the all function as follows:
>>> def checker(st):
...     return all(re.escape(i) == i for i in st)
...
>>> checker('aab]')
False
>>> checker('aab')
True
>>> checker('aa.b3')
False
Per the documentation, re.escape will (emphasis mine):
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
So it tells you whether a character could be a meaningful one, not whether it is. For example:
>>> re.escape('&') == '&'
False
This is useful for processing arbitrary strings, as it ensures that all control characters are escaped, but not for telling you which actually needed to be. The simplest approach, in my view, is the one dismissed in the question:
char in set(r'.^$*+?{}[]\|()')
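To make the difference between the two checks concrete, here is a short Python 3 sketch that compares an explicit set with re.escape. The names REGEX_METACHARS and is_metachar are illustrative, and the set is assumed to cover the metacharacters of non-verbose patterns:

```python
import re

# Assumption: these are the metacharacters of a non-verbose pattern.
REGEX_METACHARS = set(r'.^$*+?{}[]\|()')


def is_metachar(char):
    # Plain membership test against the explicit list.
    return char in REGEX_METACHARS


for ch in ['a', '[', '&', '.']:
    # Columns: character, explicit-set answer, "re.escape changed it" answer.
    print(ch, is_metachar(ch), re.escape(ch) != ch)
```

For '&' the two answers differ: re.escape backslashes it even though it is not in the explicit set.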
Elegance lies in the eyes of the beholder, however (IMHO) this (below) is the most generic/"timeproof" way of checking if a character is considered to be special by the Python Regex engine -
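def isFalsePositive(char):
    # A wildcard such as '.' also matches an unrelated character.
    other = 'b' if char == 'a' else 'a'
    m = re.match(char, other)
    return m is not None and m.end() == 1
def isSpecial(char):
    try:
        m = re.match(char, char)
    except re.error:
        # The character is not a valid pattern on its own, e.g. '(' or '*'.
        return True
    if m is not None and m.end() == 1:
        return isFalsePositive(char)
    # No match, or a zero-width match such as '^' or '$'.
    return True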
P.S. -
isFalsePositive() may be overkill to check the special case of '.' (dot). :-)

python regular expression : How can I filter only special characters?

I want to check whether given words contain a special character or not.
Below is my Python code.
The literal 'a#bcd' contains '#', so it is matched, which is fine.
But 'a1bcd' has no special character, and it is matched too!!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd'):
    print 'matched!! nich catch!!'
if regexp.search('a1bcd'):
    print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it works like above..someone help me..T_T;;;
thanks~
Move the dash in your regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read with the surrounding characters as )-_ and since it is inside [] it is interpreted as asking to match a range from ) to _. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
re will be relatively slow for this. I'd suggest trying:
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)
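The two-argument form word.translate(None, specialchars) only exists on Python 2 str objects. A rough Python 3 sketch of the set-based check follows; it uses string.punctuation as a stand-in for the question's character list, which is an assumption, and the names SPECIAL_CHARS and has_special are illustrative:

```python
import string

# Assumption: ASCII punctuation approximates the question's "special" characters.
SPECIAL_CHARS = frozenset(string.punctuation)


def has_special(word):
    # True if at least one character of word is in SPECIAL_CHARS.
    return not SPECIAL_CHARS.isdisjoint(word)


print(has_special('a#bcd'))  # True
print(has_special('a1bcd'))  # False
```

Since no character class is involved, the accidental )-_ range from the original pattern cannot occur here.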

How do I get the raw representation of a string in Python?

I am making a class that relies heavily on regular expressions.
Let's say my class looks like this:
class Example:
    def __init__(self, regex):
        self.regex = regex
    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))
And let's say I use it like this:
import re
example = Example(re.compile(r'\d+'))
If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. Note the extra backslashes: they are there so that the value appears correctly when printed. I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether or not this always works. I can't think of an edge case off the top of my head, though. Is there a more formal way of doing this?
EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.
The problem with a raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if you had a line break in your string, you would have to literally break the string onto the next line, because it cannot be represented otherwise in a raw string.
That said, the actual way to get rawstring representation is what you already gave:
"r'{}'".format(regex.pattern)
The definition of rawstrings is that there are no rules applied except that they end at the quotation character they start with and that you can escape said quotation character using a backslash. Thus, for example, you cannot store the equivalent of a string like "\" in raw string representation (r"\" yields SyntaxError and r"\\" yields "\\\\").
If you really want to do this, you should use a wrapper like:
def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    # Control characters cannot be written in a raw literal.
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    # A raw literal cannot end with an odd number of backslashes.
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)
Tests:
>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
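A different technique from the rule-based wrapper above is to let Python's own parser decide: build a candidate raw literal and accept it only if ast.literal_eval parses it back to the original string. This is a sketch, not a drop-in replacement; the function name raw_repr is illustrative:

```python
import ast


def raw_repr(s):
    """Return an r'' literal for s if one round-trips, else repr(s)."""
    for quote in ("'", '"'):
        candidate = "r" + quote + s + quote
        try:
            # Accept the candidate only if Python parses it back to s.
            if ast.literal_eval(candidate) == s:
                return candidate
        except (SyntaxError, ValueError):
            pass  # not a valid literal with this quote character
    return repr(s)


print(raw_repr(r"\d+"))       # r'\d+'
print(raw_repr("\\"))         # '\\'
print(raw_repr("foobar \n"))  # 'foobar \n'
print(raw_repr("'"))          # r"'"
```

The round-trip check means the result tracks whatever the running interpreter accepts as a raw literal, at the cost of parsing each candidate.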
