I'm having some difficulty in understanding how python converts between int and byte data types and specifically why it isn't consistent with representing it as hexadecimal numbers.
Consider the following where I convert the number 13 into a 2 byte representation:
>>> (13).to_bytes(2, byteorder='big')
b'\x00\r'
Why does it use the character r in the second byte location?
In this case I would have expected it to output:
b'\x00\xD'
Doing the reverse in both cases outputs the correct answer.
>>> int.from_bytes(b'\x00\x0D', byteorder='big')
13
>>> int.from_bytes(b'\x00\r', byteorder='big')
13
And both have the correct number of bytes
>>> len(b'\x00\x0D')
2
>>> len(b'\x00\r')
2
There is a difference between bytes and a hexadecimal representation. bytes is a datatype; hexadecimal is a way of representing bit patterns on the screen.
A bytes is an immutable sequence of 8-bit values. The interpreter displays it, where possible, as characters or as string escape sequences, and where not possible, in hexadecimal. The corresponding literal is called a bytestring. In other words, hexadecimal is a sort of last resort. You can construct the bytes b'ABC' using hexadecimal notation: b'\x41\x42\x43' but the interpreter will still report it as b'ABC'. That is no different from the way quotes are handled:
>>> a = "ABC"
>>> a
'ABC'
>>> a = 'AB\'C'
>>> a
"AB'C"
The interpreter has a standard way of displaying its data and takes no account of the way you entered that data in the first place. This isn't a roundtrip failure, because no information is lost. You are seeing an equivalent representation rather than the same representation, is all.
If you want to see a hexadecimal representation then you should ask for it explicitly, instead of relying on the interpreter's default way of displaying a particular datatype.
>>> fmt = '\\x{0:02X}\\x{1:02X}'
>>> print(fmt.format(*((13).to_bytes(2, byteorder='big'))))
\x00\0D
There are some special escape sequences which are so common that they are not explicitly represented with their hex values:
\a <-> \x07 alert
\b <-> \x08 backspace
\t <-> \x09 tab (horizontal)
\n <-> \x0A new line
\v <-> \x0B vertical tab
\f <-> \x0C formfeed
\r <-> \x0D carriage return
\" <-> \x22 "
\' <-> \x27 '
\\ <-> \x5C \
you find a full list of escape sequences and how they work here.
Note that there is also \0 <-> \x00 in the C language.
Related
In reading this tutorial I came across the following difference between __unicode__ and __str__ method:
Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.
How exactly is a "character" and "byte" be defined here? For example, in C a char is one byte, so wouldn't a char = a byte? Or, is this referring to (potentially) unicode characters, which could be multiple bytes? For example, if we took the following:
Ω (omega symbol)
03 A9 or u'\u03a9'
In python, would this be considered one character (Ω) and two bytes, or two characters(03 A9) and two bytes? Or maybe I am confusing the difference between char and character ?
In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.
One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.
>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
The two-byte sequence '\xce\xa9', however, can be interpret as the UTF-8 encoding of U+03A9.
>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'
>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'
In Python 3, that would be (with UTF-8 being the default encoding scheme)
>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
Other byte sequences can be decoded to U+03A9 as well:
>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
In Python, if I type
euro = u'\u20AC'
euroUTF8 = euro.encode('utf-8')
print(euroUTF8, type(euroUTF8), len(euroUTF8))
the output is
('\xe2\x82\xac', <type 'str'>, 3)
I have two questions:
1. it looks like euroUTF8 is encoded over 3 bytes, but how do I get its binary representation to see how many bits it contain?
2. what does 'x' in '\xe2\x82\xac' mean? I don't think 'x' is a hex number. And why there are three '\'?
In Python 2, print is a statement, not a function. You are printing a tuple here. Print the individual elements by removing the (..):
>>> euro = u'\u20AC'
>>> euroUTF8 = euro.encode('utf-8')
>>> print euroUTF8, type(euroUTF8), len(euroUTF8)
€ <type 'str'> 3
Now you get the 3 individual objects written as strings to stdout; my terminal just happens to be configured to interpret anything written to it as UTF-8, so the bytes correctly result in the € Euro symbol being displayed.
The \x<hh> sequences are Python string literal escape sequences (see the reference documentation); they are the default output for the repr() applied to a string with non-ASCII, non-printable bytes in them. You'll see the same thing when echoing the value in an interactive interpreter:
>>> euroUTF8
'\xe2\x82\xac'
>>> euroUTF8[0]
'\xe2'
>>> euroUTF8[1]
'\x82'
>>> euroUTF8[2]
'\xac'
They provide you with ASCII-safe debugging output. The contents of all Python standard library containers use this format; including lists, tuples and dictionaries.
If you want to format to see the bits that make up these values, convert each byte to an integer by using the ord() function, then format the integer as binary:
>>> ' '.join([format(ord(b), '08b') for b in euroUTF8])
'11100010 10000010 10101100'
Each letter in each encoding are represented using different number of bits. UTF-8 is a 8 bit encoding, so there is no need to get a binary representation to know each bit count of each character. (If you still want to present bits, refer to Martijn's answer.)
\x means that the following value is a byte. So x is not something like a hex number that you should convert or read. It identifies the following value, which is you are interested in. \'s are used to escape that x's because they are not a part of the value.
I have a unicode string as a result : u'splunk>\xae\uf001'
How can I get the substring 'uf001'
as a simple string in python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
Since you want the actual string (as seen from comments) , just get the last character [-1] index , Example -
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€
If you want the unicode representation (\uf001) , then take repr(a[-1]) , Example -
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single unicode character (not multiple strings) , so you can directly get that character as above.
You see \uf001 because you are checking the results of repr() on the string, if you print it, or use it somewhere else (like for files, etc) it will be the correct \uf001 character.
u'' it is how a Unicode string is represented in Python source code. REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])
If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could write it as u'' (it is necessary to declare the character encoding of your source file on Python 2 if you use non-ascii characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
I'm trying to make byte frame which I will send via UDP. I have class Frame which has attributes sync, frameSize, data, checksum etc. I'm using hex strings for value representation. Like this:
testFrame = Frame("AA01","0034","44853600","D43F")
Now, I need to concatenate this hex values together and convert them to byte array like this?!
def convertToBits(self):
stringMessage = self.sync + self.frameSize + self.data + self.chk
return b16decode(self.stringMessage)
But when I print converted value I don't get the same values or I don't know to read python notation correctly:
This is sync: AA01
This is frame size: 0034
This is data:44853600
This is checksum: D43F
b'\xaa\x01\x004D\x856\x00\xd4?'
So, first word is converted ok (AA01 -> \xaa\x01) but (0034 -> \x004D) it's not the same. I tried to use bytearray.fromhex because I can use spaces between bytes but I got same result. Can you help me to send same hex words via UDP?
Python displays any byte that can represent a printable ASCII character as that character. 4 is the same as \x34, but as it opted to print the ASCII character in the representation.
So \x004 is really the same as \x00\x34, D\x856\x00 is the same as \x44\x85\x36\x00, and \xd4? is the same as \xd4\x3f, because:
>>> b'\x34'
'4'
>>> b'\x44'
'D'
>>> b'\x36'
'6'
>>> b'\x3f'
'?'
This is just the representation of the bytes value; the value is entirely correct and you don't need to do anything else.
If it helps, you can visualise the bytes values as hex again using binascii.hexlify():
>>> import binascii
>>> binascii.hexlify(b'\xaa\x01\x004D\x856\x00\xd4?')
b'aa01003444853600d43f'
and you'll see that 4, D, 6 and ? are once again represented by the correct hexadecimal characters.
How do you dynamically create single char hex values?
For instance, I tried
a = "ff"
"\x{0}".format(a)
and
a = "ff"
"\x" + a
I ultimately was looking for something like
\xff
However, neither of the combinations above appear to work.
Additionally, I was originally using chr to obtain single char hex representations of integers but I noticed that chr(63) would return ? (as that is its ascii representation).
Is there another function aside from chr that will return chr(63) as \x_ _ where _ _ is its single char hex representation? In other words, a function that only produces single char hex representations.
When you say \x{0}, Python escapes x and thinks that the next two characters will be hexa-decimal characters, but they are actually not. Refer the table here.
\xhh Character with hex value hh (4,5)
4 . Unlike in Standard C, exactly two hex digits are required.
5 . In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.
So, you have to escape \ in \x, like this
print "\\x{0}".format(a)
# \xff
Try str.decode with 'hex' encoding:
In [204]: a.decode('hex')
Out[204]: '\xff'
Besides, chr returns a single-char string, you don't need to worry about the output of this string:
In [219]: c = chr(31)
In [220]: c
Out[220]: '\x1f'
In [221]: print c #invisible printout
In [222]: