Why is \x00 not converted to \0 by repr - python

Here is an interesting oddity about Python's repr:
The tab character \x09 is represented as \t. However this convention does not apply for the null terminator.
Why is \x00 represented as \x00, rather than \0?
Sample code:
# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True
>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0

The short answer: because that's not a specific escape that is used. String representations only use the single-character escapes \\, \n, \r, \t, (plus \' when both " and ' characters are present) because there are explicit tests for those.
The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).
Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '0' through '7' (so bytes([0x00, 0x31]), or bytes([0x00, 0x32]), etc.), you can't just use \0 in the output without then also having to escape the following digit. '\01' is a single octal escape sequence, and not the same value as '\x001', which is two bytes. While forcing the output to always use three octal digits (e.g. '\0001') could be a work-around, it is simpler to stick to one standardised escape sequence format. Scanning ahead to see if the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\0001'?)
The output is always consistent. Apart from the single quote (which can appear either as ' or \', depending on the presence of " characters), Python will always use the same escape sequence style for a given codepoint.
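To make the octal ambiguity concrete, here is a small Python 3 sketch (the variable names are just for illustration). It shows that '\01' and '\x001' are different strings, and that the \x00 form produced by repr() can be read back unchanged:

```python
import ast

one_char = '\01'      # a single octal escape: the character U+0001
two_chars = '\x001'   # NUL followed by the digit '1'

print(len(one_char), len(two_chars))   # 1 2
print(repr(two_chars))                 # '\x001'

# The repr() output is a valid literal that evaluates back to the same string.
round_tripped = ast.literal_eval(repr(two_chars))
print(round_tripped == two_chars)      # True

# In octal, only the full three-digit form keeps the following digit separate.
print('\0001' == two_chars)            # True
```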
If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses
/* Escape quotes and backslashes */
if ((ch == quote) || (ch == '\\')) {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, ch);
    continue;
}
/* Map special whitespace to '\t', \n', '\r' */
if (ch == '\t') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 't');
}
else if (ch == '\n') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'n');
}
else if (ch == '\r') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'r');
}
for single-character escapes, followed by additional checks for longer escapes below. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.
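As a rough illustration of that order of checks, here is a simplified Python 3 sketch. The function simple_repr is hypothetical, not part of CPython; it assumes ASCII-only input and a fixed quote character, whereas the real implementation also chooses the quote style and handles non-ASCII code points:

```python
def simple_repr(s, quote="'"):
    """Simplified imitation of str.__repr__ for ASCII-only input."""
    out = [quote]
    for ch in s:
        if ch == quote or ch == '\\':
            # escape the active quote character and backslashes
            out.append('\\' + ch)
        elif ch == '\t':
            out.append('\\t')
        elif ch == '\n':
            out.append('\\n')
        elif ch == '\r':
            out.append('\\r')
        elif ord(ch) < 0x20 or ord(ch) == 0x7f:
            # every other control character, including NUL, gets \xhh
            out.append('\\x{:02x}'.format(ord(ch)))
        else:
            out.append(ch)
    out.append(quote)
    return ''.join(out)


sample = '\x00\t1\\\x7f'
print(simple_repr(sample))   # '\x00\t1\\\x7f'
print(repr(sample))          # '\x00\t1\\\x7f'
```

There is simply no branch for \0, which is why NUL falls through to the generic \xhh case.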

If it tried to use \0, then it would have to special-case digits that immediately follow it, to prevent them from being interpreted as part of an octal escape. Always using \x00 is simpler and always correct.

Related

Why does an escape character turn into a string like \x0 in python?

I have a simple question here:
mylist=[1,2,3]
mylist.insert(0, '\0')
print(mylist)
gives us:
['\x00', 1, 2, 3]
in other words, why and how is the escape character \ turning into \x0?
Is there some purpose here? Is this some encoding thing?
Why does python return one representation over another?
So \<various single characters> have special meanings, and \xYY is a character which is hex value YY. The following (and others) are equivalent:
>>> assert '\0' == '\x00' # null
>>> assert '\t' == '\x09' # tab
>>> assert '\r' == '\x0d' # carriage return
>>> assert '\n' == '\x0a' # line feed
This is because a tab is encoded as 9, etc. And in general, Python will show you the most general representation, which is the hexadecimal value for a character, if it can not show you the character itself.
(String Literal Docs)
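A short Python 3 sketch of what happens in the question: printing a list shows the repr() of each element, so the single NUL character is displayed as its \x00 escape even though the string itself still holds just one character.

```python
mylist = [1, 2, 3]
mylist.insert(0, '\0')

shown = str(mylist)          # the text that print(mylist) writes
print(shown)                 # ['\x00', 1, 2, 3]
print(len(mylist[0]))        # 1
print(ord(mylist[0]))        # 0
```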

Creating \x Single Char Hex Values in Python

How do you dynamically create single char hex values?
For instance, I tried
a = "ff"
"\x{0}".format(a)
and
a = "ff"
"\x" + a
I ultimately was looking for something like
\xff
However, neither of the combinations above appear to work.
Additionally, I was originally using chr to obtain single char hex representations of integers but I noticed that chr(63) would return ? (as that is its ascii representation).
Is there another function aside from chr that will return chr(63) as \x_ _ where _ _ is its single char hex representation? In other words, a function that only produces single char hex representations.
When you write "\x{0}", Python sees \x as the start of a hex escape and expects the next two characters to be hexadecimal digits, but here they are not. Refer to the table of escape sequences in the string literal documentation:
\xhh Character with hex value hh (4,5)
4 . Unlike in Standard C, exactly two hex digits are required.
5 . In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.
So, you have to escape \ in \x, like this
print "\\x{0}".format(a)
# \xff
Try str.decode with 'hex' encoding:
In [204]: a.decode('hex')
Out[204]: '\xff'
Besides, chr returns a single-char string, you don't need to worry about the output of this string:
In [219]: c = chr(31)
In [220]: c
Out[220]: '\x1f'
In [221]: print c #invisible printout
In [222]:
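The answers above are Python 2 (a.decode('hex') and the print statement do not exist in Python 3). A rough Python 3 equivalent might look like the sketch below; hex_escape_code is a hypothetical helper name, not a built-in:

```python
a = "ff"

as_char = chr(int(a, 16))     # one character, U+00FF
as_byte = bytes.fromhex(a)    # one byte, b'\xff'
as_text = "\\x" + a           # four characters: backslash, x, f, f


def hex_escape_code(n):
    """Hypothetical helper: always return the \\xhh text for 0 <= n <= 255."""
    return "\\x{:02x}".format(n)


print(repr(as_char))          # 'ÿ'
print(as_byte)                # b'\xff'
print(as_text)                # \xff
print(hex_escape_code(63))    # \x3f
```

Unlike chr(63), which gives '?', the helper always produces the escaped text form.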

Python unicode.splitlines() triggers at non-EOL character

Trying to do this in Python 2.7:
>>> s = u"some\u2028text"
>>> s
u'some\u2028text'
>>> l = s.splitlines(True)
>>> l
[u'some\u2028', u'text']
\u2028 is the Left-To-Right Embedding character, not \r or \n, so that line should not be split. Is this a bug or just my misunderstanding?
\u2028 is LINE SEPARATOR, left-to-right embedding is \u202A:
>>> import unicodedata
>>> unicodedata.name(u'\u2028')
'LINE SEPARATOR'
>>> unicodedata.name(u'\u202A')
'LEFT-TO-RIGHT EMBEDDING'
The list of codepoints considered linebreaks is easy (not that easy to find though) to see in python source (python 2.7, comments by me):
/* Returns 1 for Unicode characters having the line break
 * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
 * type 'B', 0 otherwise.
 */
int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    // Basic Latin
    case 0x000A: // LINE FEED
    case 0x000B: // VERTICAL TABULATION
    case 0x000C: // FORM FEED
    case 0x000D: // CARRIAGE RETURN
    case 0x001C: // FILE SEPARATOR
    case 0x001D: // GROUP SEPARATOR
    case 0x001E: // RECORD SEPARATOR
    // Latin-1 Supplement
    case 0x0085: // NEXT LINE
    // General punctuation
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return 1;
    }
    return 0;
}
U+2028 is LINE SEPARATOR. Both U+2028 and U+2029 (PARAGRAPH SEPARATOR) should be treated as newlines, so Python is doing the right thing.
Of course it is sometimes perfectly reasonable to want to split on a non-standard list of newline characters. But you can't do that with splitlines. You will have to use split—and, if you need the additional features of splitlines, you'll have to implement them yourself. For example:
return [line.rstrip(sep) for line in s.split(sep)]
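For instance, if you only want to break on the classic ASCII line endings, one possible approach (a Python 3 sketch; the pattern is my own choice, not something splitlines offers) is to split with a regular expression:

```python
import re

s = "some\u2028text\nmore"

# splitlines() treats U+2028 as a line boundary
all_breaks = s.splitlines()
print(all_breaks)      # ['some', 'text', 'more']

# split only on \r\n, \r or \n and leave U+2028 inside the line
ascii_lines = re.split(r'\r\n|\r|\n', s)
print(ascii_lines)     # ['some\u2028text', 'more']
```

Note that re.split drops the separators, so the keepends behaviour of splitlines(True) would still have to be added by hand.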

Show non printable characters in a string

Is it possible to visualize non-printable characters in a python string with its hex values?
e.g. If I have a string with a newline inside I would like to replace it with \x0a.
I know there is repr() which will give me ...\n, but I'm looking for the hex version.
I don't know of any built-in method, but it's fairly easy to do using a comprehension:
import string
printable = string.ascii_letters + string.digits + string.punctuation + ' '
def hex_escape(s):
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)
I'm kind of late to the party, but if you need it for simple debugging, I found that this works:
string = "\n\t\nHELLO\n\t\n\a\17"
procd = [c for c in string]
print(procd)
# Prints ['\n', '\t', '\n', 'H', 'E', 'L', 'L', 'O', '\n', '\t', '\n', '\x07', '\x0f']
While just list is simpler, a comprehension makes it easier to add in filtering/mapping if necessary.
You'll have to make the translation manually; go through the string with a regular expression for example, and replace each occurrence with the hex equivalent.
import re
replchars = re.compile(r'[\n\r]')
def replchars_to_hex(match):
    return r'\x{0:02x}'.format(ord(match.group()))
replchars.sub(replchars_to_hex, inputtext)
The above example only matches newlines and carriage returns, but you can expand what characters are matched, including using \x escape codes and ranges.
>>> inputtext = 'Some example containing a newline.\nRight there.\n'
>>> replchars.sub(replchars_to_hex, inputtext)
'Some example containing a newline.\\x0aRight there.\\x0a'
>>> print(replchars.sub(replchars_to_hex, inputtext))
Some example containing a newline.\x0aRight there.\x0a
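As an example of expanding the character class with \x escapes and ranges, this sketch matches every ASCII control character (0x00 to 0x1f, plus 0x7f). The names control_chars and to_hex are illustrative only:

```python
import re

# all ASCII control characters, written as a range plus DEL
control_chars = re.compile(r'[\x00-\x1f\x7f]')


def to_hex(match):
    # replace the matched character with its \xhh text form
    return r'\x{0:02x}'.format(ord(match.group()))


result = control_chars.sub(to_hex, 'tab\there\x00end\n')
print(result)   # tab\x09here\x00end\x0a
```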
Modifying ecatmur's solution to handle non-printable non-ASCII characters makes it less trivial and more obnoxious:
def escape(c):
    if c.isprintable():
        return c
    c = ord(c)
    if c <= 0xff:
        return r'\x{0:02x}'.format(c)
    elif c <= 0xffff:
        return r'\u{0:04x}'.format(c)
    else:
        return r'\U{0:08x}'.format(c)

def hex_escape(s):
    return ''.join(escape(c) for c in s)
Of course if str.isprintable isn't exactly the definition you want, you can write a different function. (Note that it's a very different set from what's in string.printable: besides handling non-ASCII printable and non-printable characters, it also considers \n, \r, \t, \x0b, and \x0c as non-printable.)
You can make this more compact; this is explicit just to show all the steps involved in handling Unicode strings. For example:
def escape(c):
    if c.isprintable():
        return c
    elif c <= '\xff':
        return r'\x{0:02x}'.format(ord(c))
    else:
        return c.encode('unicode_escape').decode('ascii')
Really, no matter what you do, you're going to have to handle \r, \n, and \t explicitly, because all of the built-in and stdlib functions I know of will escape them via those special sequences instead of their hex versions.
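A quick Python 3 check of that last point: both the unicode_escape codec and repr() emit \n for a newline, while a character without a short escape (here NUL) gets the \xhh form.

```python
text = "a\nb\x00"

via_codec = text.encode('unicode_escape').decode('ascii')
via_repr = repr(text)[1:-1]   # drop the surrounding quotes

print(via_codec)   # a\nb\x00
print(via_repr)    # a\nb\x00
```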
I did something similar once by deriving a str subclass with a custom __repr__() method which did what I wanted. It's not exactly what you're looking for, but may give you some ideas.
# -*- coding: iso-8859-1 -*-
# special string subclass to override the default
# representation method. main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 128
class MsgStr(str):
    def __repr__(self):
        # use double quotes unless there are more of them within the string than
        # single quotes
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"
        rep = [quotechar]
        for ch in self:
            # control char?
            if ord(ch) < ord(' '):
                # remove the single quotes around the escaped representation
                rep += repr(str(ch)).strip("'")
            # embedded quote matching quotechar being used?
            elif ch == quotechar:
                rep += "\\"
                rep += ch
            # else just use others as they are
            else:
                rep += ch
        rep += quotechar
        return "".join(rep)

if __name__ == "__main__":
    s1 = '\tWürttemberg'
    s2 = MsgStr(s1)
    print "str s1:", s1
    print "MsgStr s2:", s2
    print "--only the next two should differ--"
    print "repr(s1):", repr(s1), "# uses built-in string 'repr'"
    print "repr(s2):", repr(s2), "# uses custom MsgStr 'repr'"
    print "str(s1):", str(s1)
    print "str(s2):", str(s2)
    print "repr(str(s1)):", repr(str(s1))
    print "repr(str(s2)):", repr(str(s2))
    print "MsgStr(repr(MsgStr('\tWürttemberg'))):", MsgStr(repr(MsgStr('\tWürttemberg')))
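The code above is Python 2. A hypothetical Python 3 port of the same idea could look like the sketch below. It keeps the original logic, including its limitation that backslashes in the data are not escaped, and in Python 3 the built-in repr() already leaves printable non-ASCII characters such as ü alone, so the visible difference is mainly the preferred quote character:

```python
class MsgStr(str):
    """Hypothetical Python 3 port of the str subclass above."""

    def __repr__(self):
        # prefer double quotes unless the string contains more of them
        # than single quotes
        quotechar = '"' if self.count("'") >= self.count('"') else "'"
        parts = [quotechar]
        for ch in self:
            if ord(ch) < ord(' '):
                # reuse the built-in escape for control characters,
                # without its surrounding quotes
                parts.append(repr(ch)[1:-1])
            elif ch == quotechar:
                parts.append('\\' + ch)
            else:
                parts.append(ch)
        parts.append(quotechar)
        return ''.join(parts)


s1 = '\tWürttemberg'
s2 = MsgStr(s1)
builtin_form = repr(s1)
custom_form = repr(s2)
print(builtin_form)   # '\tWürttemberg'
print(custom_form)    # "\tWürttemberg"
```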
Non-printable characters can also be present in a printed string without being visible: they still act as control commands when the string is printed. Their presence can be observed by measuring the length of the string with len, or by putting the cursor at the start of the string and counting how many times you have to tap the arrow key to get from start to finish; what looks like a single character can take three key presses, which seems perplexing at first. (Not sure if this was already demonstrated in prior answers.)
In the example below, I pasted a 135-bit string that has a certain structure and format (which I had to create manually beforehand for certain bit positions and its overall length) so that it is interpreted as ASCII by the particular program I'm running. The resulting printed string contains non-printable characters such as a form feed (new page), which produces an extra blank line in the printed output (see below):
Example of printing non-printable characters that appear in printed string
Input a string:100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000
HPQGg]+\,vE!:@
>>> len('HPQGg]+\,vE!:@')
17
>>>
In the above code excerpt, try to copy-paste the string HPQGg]+\,vE!:@ straight from this site and see what happens when you paste it into the Python IDLE.
Hint: You have to tap the arrow/cursor three times to get across the two letters from P to Q even though they appear next to each other, as there is actually a File Separator ascii command in between them.
Converting the pasted string (which still contains the invisible control characters) to hex and back gives the same bytes, but the result looks different from the printed string, because the bytes repr shows the control characters as \x escapes instead of emitting them. Either way, the program output above does print non-printable characters (I came across this by chance while trying to develop a compression method/experiment).
>>> bytes(b'HPQGg]+\,vE!:@').hex()
'48501c514767110c5d2b5c2c7645213a40'
>>> bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
b'HP\x1cQGg\x11\x0c]+\\,vE!:@'
>>> (0x48501c514767110c5d2b5c2c7645213a40 == 0b100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000)
True
>>>
The 135-bit string above is the 17 bytes with the leading zero bit of the first byte dropped, so reading from the big-endian side the first group has only 7 bits and each of the remaining 16 groups has 8 bits. Every group encodes one character, including the non-printable ones (marked with *), as seen below:
Technical breakdown of the format of the above 135-bit string
And here as text is the breakdown of the 135-bit string:
1001000 = H (72)
01010000 = P (80)
00011100 = x1c (28 for File Separator) *
01010001 = Q (81)
01000111 = G (71)
01100111 = g (103)
00010001 = x11 (17 for Device Control 1) *
00001100 = x0c (12 for NP form feed, new page) *
01011101 = ] (93 for right bracket ‘]’)
00101011 = + (43 for + sign)
01011100 = \ (92 for backslash)
00101100 = , (44 for comma, ‘,’)
01110110 = v (118)
01000101 = E (69)
00100001 = ! (33 for exclamation)
00111010 = : (58 for colon ‘:’)
01000000 = @ (64 for ‘@’ sign)
So in closing, to answer the sub-question about showing non-printable characters as hex: the bytes repr further above contains \x1c, which denotes the File Separator control character that was also noted in the hint. Apart from the b prefix on the left side, the bytes repr reads like a string, and the same character is present in the printed string, although it is invisible there (its presence can be observed as demonstrated above with the hint and len).
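For anyone who wants to reproduce the decoding without the original program, here is a minimal Python 3 sketch using only built-ins. It assumes the bit string is one big-endian integer whose leading zero bits were dropped; the variable names are mine:

```python
bits = ("1001000010100000001110001010001010001110110011100010001000011000"
        "1011101001010110101110000101100011101100100010100100001001110100"
        "1000000")

value = int(bits, 2)
# round the bit length up to whole bytes; the missing leading zero is restored
raw = value.to_bytes((len(bits) + 7) // 8, 'big')

print(len(bits))   # 135
print(len(raw))    # 17
print(raw)         # b'HP\x1cQGg\x11\x0c]+\\,vE!:@'
print(raw.hex())   # 48501c514767110c5d2b5c2c7645213a40
```

The bytes repr makes the three control characters visible as \x1c, \x11 and \x0c.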

Python issue with incorrectly formated strings that contains \x

At some point our python script receives string like that:
In [1]: ab = 'asd\xeffe\ctive'
In [2]: print ab
asd�fe\ctve \ \\ \\\k\\\
The data is damaged: we need the \x escape to be output literally as \x, while \c, which has no special meaning in a string, must stay intact.
So far the closest solution I found is do something like:
In [1]: ab = 'asd\xeffe\ctve \\ \\\\ \\\\\\k\\\\\\'
In [2]: print ab.encode('string-escape').replace('\\\\', '\\').replace("\\'", "'")
asd\xeffe\ctve \ \\ \\\k\\\
The output is taken from IPython. I assumed that ab is a byte string, not a unicode string (in the latter case we would have to do something like this):
def escape_string(s):
    if isinstance(s, str):
        s = s.encode('string-escape').replace('\\\\', '\\').replace("\\'", "'")
    elif isinstance(s, unicode):
        s = s.encode('unicode-escape').replace('\\\\', '\\').replace("\\'", "'")
    return s
\xhh is an escape sequence and \x is seen as the start of this escape.
'\\' is the same as '\x5c'. It is just two different ways to write the backslash character as a Python string literal.
These literal strings: r'\c', '\\c', '\x5cc', '\x5c\x63' are identical str objects in memory.
'\xef' is a single byte (239 as an integer), but r'\xef' (same as '\\xef') is a 4-byte string: '\x5c\x78\x65\x66'.
If s[0] returns '\xef' then it is what s object actually contains. If it is wrong then fix the source of the data.
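These equivalences are easy to check. The sketch below is Python 3, where '\xef' is one code point rather than one byte, but the lengths come out the same:

```python
# four spellings of the same two-character string: backslash followed by 'c'
forms = [r'\c', '\\c', '\x5cc', '\x5c\x63']
all_equal = all(f == forms[0] for f in forms)

print(all_equal)        # True
print(len('\xef'))      # 1
print(len(r'\xef'))     # 4
print(len('\\'))        # 1
print(repr('\\'))       # '\\'
```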
Note: string-escape also escapes \n and the like:
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('unicode-escape')
\xef\\c\\\u2603"'\u2603\u2603"'\n\xa0
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('string-escape')
\xef\\c\\\\N{SNOWMAN}"\'\xe2\x98\x83\\u2603"\'\n\xa0
backslashreplace is used only on characters that cause UnicodeEncodeError:
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''
ï\c\☃"'☃☃"'
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''
�\c\\N{SNOWMAN}"'☃\u2603"'
�
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('ascii', 'backslashreplace')
\xef\c\\u2603"'\u2603\u2603"'
\xa0
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.decode('latin1').encode('ascii', 'backslashreplace')
\xef\c\\N{SNOWMAN}"'\xe2\x98\x83\u2603"'
\xa0
Backslashes introduce "escape sequences". \x specifically allows you to specify a byte, which is given as two hexadecimal digits after the x. ef are two hexadecimal digits, hence you get no error. Double the backslash to escape it, or use a raw string r"\xeffective".
Edit: While the Python console may show you '\\', this is precisely what you expect. You just say you expect something else because you confuse the string and its representation. It's a string containing a single backslash. If you were to output it with print, you'd see a single backslash.
But the string literal '\' is ill-formed (not closed because \' is an apostrophe, not a backslash and end-of-string-literal), so repr, which formats the results at the interactive shell, does not produce it. Instead it produces a string literal which you could paste into Python source code and get the same string object. For example, len('\\') == 1.
The \x escape sequence signifies a character given by its hex code (a single byte in a Python 2 str), and ef is being interpreted as that hex code. You can sanitize the string by adding an additional \, or else make it a raw string (r'\xeffective').
>>> r'\xeffective'[0]
'\\'
EDIT: You could convert an existing string using the following hack:
>>> a = '\xeffective'
>>> b = repr(a).strip("'")
>>> b
'\\xeffective'
