I am making a class that relies heavily on regular expressions.
Let's say my class looks like this:
class Example:
    def __init__(self, regex):
        self.regex = regex
    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))
And let's say I use it like this:
import re
example = Example(re.compile(r'\d+'))
If I do repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. (Note the extra backslashes: these are the values as the interactive prompt echoes them, so each one appears correctly once printed.) I suppose I could implement it to return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether or not this always works, and I can't think of an edge case off the top of my head. Is there a more formal way of doing this?
EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.
The problem with raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if you had a line break in your string, you would have to literally break the string onto the next line, because a line break cannot be written as an escape inside a raw string.
That said, the actual way to get rawstring representation is what you already gave:
"r'{}'".format(regex.pattern)
The definition of raw strings is that no escape rules are applied, except that they end at the quotation character they start with and that a backslash in front of that quotation character keeps it from ending the literal (the backslash itself stays in the string). Thus, for example, you cannot write a string consisting of a single backslash ("\\") as a raw string literal: r"\" is a SyntaxError, and r"\\" is two backslashes (its repr is '\\\\').
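A few quick checks of these rules, as a minimal sketch (the variable names are only for illustration):

```python
# A backslash before the quote keeps the raw literal open,
# but the backslash itself stays in the resulting string.
escaped_quote = r"\""
assert escaped_quote == "\\\"" and len(escaped_quote) == 2

# An even run of trailing backslashes is fine: r"\\" is two backslashes.
two_backslashes = r"\\"
assert two_backslashes == "\\\\" and len(two_backslashes) == 2

# A lone trailing backslash cannot be written as a raw literal.
# The source text compiled here is exactly: r"\"
try:
    compile('r"\\"', "<example>", "eval")
    lone_backslash_ok = True
except SyntaxError:
    lone_backslash_ok = False
assert not lone_backslash_ok
```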
If you really want to do this, you should use a wrapper like:
def rawstr(s):
    """
    Return the raw string representation (using r'' literals) of the string
    *s* if it is available. If any invalid characters are encountered (or a
    string which cannot be represented as a rawstr), the default repr() result
    is returned.
    """
    # control characters cannot be written portably in a raw literal
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    # an odd number of trailing backslashes would escape the closing quote
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    # pick a quote character that does not occur in s, if there is one
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)
Tests:
>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
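One way to gain confidence that the wrapper "always works" is to check that its output evaluates back to the original string. The sketch below repeats the same logic in condensed form so that it runs on its own, and uses ast.literal_eval for the round trip; the sample list and the round_trips helper are assumptions made for illustration, not an exhaustive proof.

```python
import ast

def rawstr(s):
    """Same logic as the wrapper above, repeated so this block runs alone."""
    if any(ord(ch) < 32 for ch in s):
        return repr(s)
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    if '"' in s and "'" in s:
        return repr(s)
    quote = '"' if "'" in s else "'"
    return "r{0}{1}{0}".format(quote, s)

def round_trips(s):
    # The literal produced by rawstr must evaluate back to s itself.
    return ast.literal_eval(rawstr(s)) == s

samples = ["\\", "foobar \n", r"a \valid rawstring", "foo \\\\\\",
           r"foo \\", "'", '"', r"\d+", "it's \"quoted\"", ""]
failures = [s for s in samples if not round_trips(s)]
print(failures)
```

Any string that ends up in failures would be an edge case the wrapper does not cover.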
Related
I'm escaping certain characters in strings (e.g., \n, \\) with double backslashes, like this: text.replace("\\", "\\\\").replace("\n", "\\n")
Naïvely, I tried to unescape using: text.replace("\\n", "\n").replace("\\\\", "\\")
However, this fails on strings like:
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> ftext = etext.replace("\\n", "\n").replace("\\\\", "\\")
>>> print(ftext)
\
\
>>>
As you can see the original string doesn't survive the round trip.
Even changing the order of replaces around would not solve the issue.
The only way to correctly unescape is to do the replacements in one go.
Python's str has maketrans and translate to achieve a similar effect
but they only work on single characters as keys.
re.sub with a plain replacement string also does not work, since the substitution would need to distinguish the cases somehow. (\1 does not work since if the second character is n we want the newline character as output instead of n.)
A correct (but slow) solution would be:
def unescape(text: str) -> str:
    res: list[str] = []
    in_escape = False
    for c in text:
        if in_escape:
            in_escape = False
            if c == "\\":
                res.append("\\")
                continue
            if c == "n":
                res.append("\n")
                continue
        if c == "\\":
            in_escape = True
            continue
        res.append(c)
    return "".join(res)
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> print(unescape(etext))
\
\n
>>>
Is there a proper/canonical/fast way of escaping (only certain sequences in) strings?
(EDIT: To answer why a subset of escapes is preferred: in my case other escapes are not needed, and it's easy to permanently corrupt your data by escaping things that don't need to be escaped. For example, off the top of my head I can think of three different escape functions in Python alone that all escape completely different subsets of characters. Even the re.escape function changes what it escapes between Python versions. Most of the time an unescape function can handle a wider set of escape sequences than its corresponding escape function, but this is not always the case. And none of this takes into account trying to load the escaped data in a different language.)
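For what it's worth, re.sub accepts a function as the replacement, which allows all substitutions to happen in a single left-to-right pass. The following is a minimal sketch under that assumption; the names _UNESCAPE, _ESCAPED and unescape_one_pass are made up for this example.

```python
import re

# Map each two-character escape sequence to the character it stands for.
_UNESCAPE = {"\\\\": "\\", "\\n": "\n"}
# A backslash followed by either a backslash or the letter n.
_ESCAPED = re.compile(r"\\[\\n]")

def escape(text: str) -> str:
    # Backslashes must be doubled first, then newlines are encoded.
    return text.replace("\\", "\\\\").replace("\n", "\\n")

def unescape_one_pass(text: str) -> str:
    # The callback sees each escape exactly once, scanning left to right,
    # so the output of one replacement is never re-scanned.
    return _ESCAPED.sub(lambda m: _UNESCAPE[m.group(0)], text)

text = "\\\n\\n"
assert unescape_one_pass(escape(text)) == text
```

Unlike the loop-based version, this sketch leaves unknown sequences such as \x untouched instead of dropping the backslash.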
Below is my raw string (r'...') test in Python.
import re
a = re.compile('\d')
b = re.compile('\\d')
c = re.compile(r'\d')
d = re.compile(r'\\d')
print(a.search("1")) # (O)
print(a.search("\d"))
print(a.search("\1"))
print(b.search("1")) # (O)
print(b.search("\d"))
print(b.search("\1"))
print(c.search("1")) # (O)
print(c.search("\d"))
print(c.search("\1"))
print(d.search("1"))
print(d.search("\d")) # (O)
print(d.search("\1"))
But it seems like the raw string doesn't work.
For example, regular expression 'b' should match a string composed of "backslash + the letter d", but it only matches the digit '1'.
And according to the meaning of 'r', regular expression 'c' should also match a string composed of "backslash + the letter d", but it doesn't.
Could anyone explain this?
Thanks
Your first three strings are exactly the same.
>>> '\d' == '\\d' == r'\d'
True
Thus, when run through the regex engine, they all match only a single digit. This is true because '\d' has no interesting behavior in the way that '\n' does, so parsing the backslash as literal is the only reasonable way for the Python interpreter to respond (barring a parse error -- which I'd argue might have been a better idea, but couldn't be implemented now without breaking compatibility).
By contrast, the same is not true of \n:
>>> '\n' == '\\n'
False
>>> '\\n' == r'\n'
True
Your fourth string, r'\\d', is the same as '\\\\d'; thus, that it matches only the literal string \d should be no surprise.
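The claims above can be checked directly. This is a small sketch; the dictionary and its keys are only illustrative, and the non-raw pattern is written as "\\d" to avoid the invalid-escape warning that newer Python versions emit for '\d'.

```python
import re

patterns = {
    "escaped": "\\d",      # backslash + d, written with an escaped backslash
    "raw": r"\d",          # the same two characters as a raw literal
    "raw_double": r"\\d",  # two backslashes + d
}

# The first two are the same string, so they are the same regex.
assert patterns["escaped"] == patterns["raw"]

digit = re.compile(patterns["raw"])
literal = re.compile(patterns["raw_double"])

matches = {
    "digit_on_1": bool(digit.search("1")),
    "digit_on_backslash_d": bool(digit.search(r"\d")),
    "literal_on_1": bool(literal.search("1")),
    "literal_on_backslash_d": bool(literal.search(r"\d")),
}
print(matches)
```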
I have a file of strings one per line in which non-ascii characters have been escaped with decimal code points. One example line is:
mj\\195\\164ger
(The double backslashes are in the file exactly as printed)
I would like to process this string to produce mjäger. Conventionally, Python uses hexadecimal escapes rather than decimal escapes (e.g., the above string would be written as mj\xc3\xa4ger), which Python can decode:
>>> by=b'mj\xc3\xa4ger'
>>> by.decode('utf-8')
'mjäger'
Python, however, doesn't recognize the decimal escape right away.
I have written a method that correctly manipulates the strings to produce hexadecimal escapes, but these escapes are themselves escaped. How can I get python to process these hexadecimal escapes to create the final string?
import re
hexconst=["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f"]
escapes=re.compile(r"\\[0-9]{3}")
def dec2hex(matchobj):
    dec=matchobj.group(0)
    dec=int(dec[1:])
    digit1=dec//16 #integer division
    digit2=dec%16
    hex="\\x" + hexconst[digit1] + hexconst[digit2]
    return hex
line=r'mj\195\164ger'
print(escapes.sub(dec2hex,line)) #Outputs mj\xc3\xa4ger
What is the final step I'm missing to convert the output of the above from mj\xc3\xa4ger to mjäger? Thanks!
It's much easier. re.sub() can take a callback function instead of a replacement string as an argument. In Python 2:
>>> import re
>>> line=r'mj\195\164ger'
>>> def replace(match):
...     return chr(int(match.group(1)))
...
>>> regex = re.compile(r"\\(\d{1,3})")
>>> new = regex.sub(replace, line)
>>> new
'mj\xc3\xa4ger'
>>> print new
mjäger
In Python 3, strings are Unicode strings, so if you're working with encoded input (like UTF-8 encoded content), then you need to use the proper type which is bytes:
>>> line = rb'mj\195\164ger'
>>> regex = re.compile(rb"\\(\d{1,3})")
>>> def replace(match):
...     return int(match.group(1)).to_bytes(1, byteorder="big")
...
>>> new = regex.sub(replace, line)
>>> new
b'mj\xc3\xa4ger'
>>> print(new.decode("utf-8"))
mjäger
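Wrapped up as a function for processing a file line by line, the Python 3 approach might look like the sketch below. The function name decode_decimal_escapes and the choice to accept a run of one or more backslashes before the digits (to cover the doubled backslashes mentioned in the question) are assumptions of this example.

```python
import re

# One or more backslashes followed by one to three decimal digits.
_DECIMAL_ESCAPE = re.compile(rb"\\+(\d{1,3})")

def decode_decimal_escapes(line: str, encoding: str = "utf-8") -> str:
    raw = line.encode(encoding)
    # bytes([n]) raises ValueError for n > 255, which flags a malformed escape.
    unescaped = _DECIMAL_ESCAPE.sub(lambda m: bytes([int(m.group(1))]), raw)
    return unescaped.decode(encoding)

print(decode_decimal_escapes(r"mj\195\164ger"))
print(decode_decimal_escapes(r"mj\\195\\164ger"))
```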
(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I have seen methods here and here that don't work. I'm working in a 100% UTF-8 environment.
import codecs
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what's going wrong within encode() and decode() in Python 3, but my friend solved this problem when we were writing some tools.
What we did was to bypass encode("utf_8") after the unescaping is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, its code points are exactly the correct bytes of your string (in UTF-8 encoding), in this case b'\xe2\x82\xac\n'.
So we neither print the str object directly nor call encode("utf_8") on it; instead we use ord() on each character to create the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from this bytes object by passing it to str() together with the encoding.
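The ord() loop is equivalent to encoding with Latin-1, because unicode_escape maps every non-escape byte to the code point of the same value. A minimal sketch of the same idea follows; unescape_utf8 is a name invented for this example.

```python
def unescape_utf8(s: str) -> str:
    # 1. encode to UTF-8 bytes
    # 2. let unicode_escape interpret the backslash sequences
    #    (it treats every other byte as a Latin-1 character)
    # 3. map those characters back to the same byte values
    # 4. decode the bytes as UTF-8 again
    return (s.encode("utf-8")
             .decode("unicode_escape")
             .encode("latin-1")
             .decode("utf-8"))

print(repr(unescape_utf8("€\\n")))
```

Note that this only works while every escape in the input denotes a value up to 255; an escape such as \u20ac would make the Latin-1 step raise UnicodeEncodeError.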
BTW, the tool my friend and I wanted to make is a wrapper that allows the user to input a C-like string literal and converts the escape sequences automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which represent 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful way for a user to input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
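As a quick check of the decode-on-input, encode-on-output round trip described above (a sketch; the variable names wire and text are only illustrative):

```python
# Bytes as they might arrive from a file that claims to be Unicode-escaped.
wire = b"\\u20ac\\n"

# Decode on the way in ...
text = wire.decode("unicode_escape")
assert text == "€\n"

# ... and encode on the way out, which reproduces the original bytes.
assert text.encode("unicode_escape") == wire
```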
import string
printable = string.printable
printable = printable + '€'
def cod(c):
    return c.encode('unicode_escape').decode('ascii')
def unescape(s):
    return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
Is it possible to visualize non-printable characters in a python string with its hex values?
e.g. If I have a string with a newline inside I would like to replace it with \x0a.
I know there is repr() which will give me ...\n, but I'm looking for the hex version.
I don't know of any built-in method, but it's fairly easy to do using a comprehension:
import string
printable = string.ascii_letters + string.digits + string.punctuation + ' '
def hex_escape(s):
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)
I'm kind of late to the party, but if you need it for simple debugging, I found that this works:
string = "\n\t\nHELLO\n\t\n\a\17"
procd = [c for c in string]
print(procd)
# Prints ['\n', '\t', '\n', 'H', 'E', 'L', 'L', 'O', '\n', '\t', '\n', '\x07', '\x0f']
While just list(string) is simpler, a comprehension makes it easier to add in filtering/mapping if necessary.
You'll have to make the translation manually; go through the string with a regular expression for example, and replace each occurrence with the hex equivalent.
import re
replchars = re.compile(r'[\n\r]')
def replchars_to_hex(match):
    return r'\x{0:02x}'.format(ord(match.group()))
replchars.sub(replchars_to_hex, inputtext)
The above example only matches newlines and carriage returns, but you can expand what characters are matched, including using \x escape codes and ranges.
>>> inputtext = 'Some example containing a newline.\nRight there.\n'
>>> replchars.sub(replchars_to_hex, inputtext)
'Some example containing a newline.\\x0aRight there.\\x0a'
>>> print(replchars.sub(replchars_to_hex, inputtext))
Some example containing a newline.\x0aRight there.\x0a
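To illustrate expanding the character class with \x escape codes and ranges, here is a sketch that covers all ASCII control characters plus DEL. The names _CONTROL and control_to_hex are assumptions of this example.

```python
import re

# \x00-\x1f covers the ASCII control characters, \x7f is DEL.
_CONTROL = re.compile(r"[\x00-\x1f\x7f]")

def control_to_hex(text):
    # Replace each matched character with its two-digit hex escape.
    return _CONTROL.sub(lambda m: "\\x{0:02x}".format(ord(m.group())), text)

print(control_to_hex("tab\there\nbell\a"))
```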
Modifying ecatmur's solution to handle non-printable non-ASCII characters makes it less trivial and more obnoxious:
def escape(c):
    if c.isprintable():
        return c
    c = ord(c)
    if c <= 0xff:
        return r'\x{0:02x}'.format(c)
    elif c <= 0xffff:
        return r'\u{0:04x}'.format(c)
    else:
        return r'\U{0:08x}'.format(c)
def hex_escape(s):
    return ''.join(escape(c) for c in s)
Of course if str.isprintable isn't exactly the definition you want, you can write a different function. (Note that it's a very different set from what's in string.printable: besides handling non-ASCII printable and non-printable characters, it also considers \n, \r, \t, \x0b, and \x0c as non-printable.)
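The difference between the two definitions can be seen on a handful of characters. In this sketch the sample list is arbitrary, and each entry of the table holds the pair (str.isprintable result, membership in string.printable):

```python
import string

# newline, tab, a letter, the euro sign, and a zero-width space
samples = ["\n", "\t", "a", "€", "\u200b"]

table = {c: (c.isprintable(), c in string.printable) for c in samples}

for char, (is_printable, in_string_printable) in table.items():
    print(ascii(char), is_printable, in_string_printable)
```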
You can make this more compact; this is explicit just to show all the steps involved in handling Unicode strings. For example:
def escape(c):
    if c.isprintable():
        return c
    elif c <= '\xff':
        return r'\x{0:02x}'.format(ord(c))
    else:
        return c.encode('unicode_escape').decode('ascii')
Really, no matter what you do, you're going to have to handle \r, \n, and \t explicitly, because all of the built-in and stdlib functions I know of will escape them via those special sequences instead of their hex versions.
I did something similar once by deriving a str subclass with a custom __repr__() method which did what I wanted. It's not exactly what you're looking for, but may give you some ideas.
# -*- coding: iso-8859-1 -*-
# special string subclass to override the default
# representation method. main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 128
class MsgStr(str):
    def __repr__(self):
        # use double quotes unless there are more of them within the string than
        # single quotes
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"
        rep = [quotechar]
        for ch in self:
            # control char?
            if ord(ch) < ord(' '):
                # remove the single quotes around the escaped representation
                rep += repr(str(ch)).strip("'")
            # embedded quote matching quotechar being used?
            elif ch == quotechar:
                rep += "\\"
                rep += ch
            # else just use others as they are
            else:
                rep += ch
        rep += quotechar
        return "".join(rep)
if __name__ == "__main__":
    s1 = '\tWürttemberg'
    s2 = MsgStr(s1)
    print "str s1:", s1
    print "MsgStr s2:", s2
    print "--only the next two should differ--"
    print "repr(s1):", repr(s1), "# uses built-in string 'repr'"
    print "repr(s2):", repr(s2), "# uses custom MsgStr 'repr'"
    print "str(s1):", str(s1)
    print "str(s2):", str(s2)
    print "repr(str(s1)):", repr(str(s1))
    print "repr(str(s2)):", repr(str(s2))
    print "MsgStr(repr(MsgStr('\tWürttemberg'))):", MsgStr(repr(MsgStr('\tWürttemberg')))
There is also a way to print non-printable characters in the sense that they take effect as commands within the string even though they are not visible in it. Their presence can be observed by measuring the length of the string with len, or by putting the cursor at the start of the string and counting how many times you have to tap the arrow key to get from start to finish: some seemingly single characters take three key presses to cross, which looks perplexing at first. (Not sure if this was already demonstrated in prior answers.)
In the example below, I pasted a 135-bit string that has a certain structure and format (which I had to create manually beforehand for certain bit positions and its overall length) so that it is interpreted as ASCII by the particular program I'm running. The resulting printed string contains non-printable characters such as the form feed (new page), which causes an extra, entirely blank line in the printed output (see below):
[Screenshot: example of printing non-printable characters that appear in the printed string]
Input a string:100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000
HPQGg]+\,vE!:@
>>> len('HPQGg]+\,vE!:@')
17
>>>
In the above code excerpt, try to copy-paste the string HPQGg]+\,vE!:@ straight from this site and see what happens when you paste it into Python's IDLE.
Hint: You have to tap the arrow key three times to get across the two letters from P to Q even though they appear next to each other, as there is actually a File Separator ASCII control character in between them.
However, even though we get the same starting value when converting the byte string to hex, if we convert that hex back to bytes the result looks different, because the bytes repr shows the non-printable characters as \x escapes. Either way, the above output of the program prints non-printable characters (I came across this by chance while trying to develop a compression method/experiment).
>>> bytes(b'HPQGg]+\,vE!:@').hex()
'48501c514767110c5d2b5c2c7645213a40'
>>> bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
b'HP\x1cQGg\x11\x0c]+\\,vE!:@'
>>> (0x48501c514767110c5d2b5c2c7645213a40 == 0b100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000)
True
>>>
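The hidden characters can also be located programmatically from the hex value shown above. This is a small sketch; the names data, control and visible are made up for illustration.

```python
data = bytes.fromhex("48501c514767110c5d2b5c2c7645213a40")

# 17 bytes; the leading zero bit of 0x48 is dropped, leaving 135 bits.
assert len(data) == 17
assert int.from_bytes(data, "big").bit_length() == 135

# Position and hex escape of every control character (byte value below 32).
control = [(i, "\\x{0:02x}".format(b)) for i, b in enumerate(data) if b < 32]
print(control)

# The same data with only the non-printable bytes shown as hex escapes.
visible = "".join(
    chr(b) if 32 <= b < 127 else "\\x{0:02x}".format(b) for b in data
)
print(visible)
```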
In the above 135-bit string, the first 16 groups of 8 bits from the big-endian side encode each character (including the non-printable ones), whereas the last group of 7 bits results in the @ symbol, as seen below:
[Figure: technical breakdown of the format of the above 135-bit string]
And here as text is the breakdown of the 135-bit string:
10010000 = H (72)
10100000 = P (80)
00111000 = x1c (28 for File Separator) *
10100010 = Q (81)
10001110 = G (71)
11001110 = g (103)
00100010 = x11 (17 for Device Control 1) *
00011000 = x0c (12 for NP form feed, new page) *
10111010 = ] (93 for right bracket ‘]’)
01010110 = + (43 for + sign)
10111000 = \ (92 for backslash)
01011000 = , (44 for comma ‘,’)
11101100 = v (118)
10001010 = E (69)
01000010 = ! (33 for exclamation)
01110100 = : (58 for colon ‘:’)
1000000 = @ (64 for ‘@’ sign)
So in closing, to answer the sub-question about showing non-printable characters as hex: in the bytes object further above, the escape \x1c denotes the File Separator control character that was also noted in the hint. The bytes object could be considered a string if you ignore the b prefix on the left, and this character is also present in the printed string, although it is invisible there (its presence can be observed with the hint and with len, as demonstrated above).