I was trying to solve a pwnable with Python 3. For that I need to print some characters that are not in the ASCII range.
Python 3 is converting these characters into some weird Unicode.
For example if I print "\xff" in Python 3, I get this:
root@kali:~# python3 -c 'print("\xff")' | xxd
00000000: c3bf 0a ...
\xff gets converted to \xc3\xbf
But in Python 2 it works as expected, like this:
root@kali:~# python -c 'print("\xff")' | xxd
00000000: ff0a ..
So how can I print it like that in Python 3?
In Python 2, print '\xff' writes a bytes string directly to the terminal, so you get the byte you print.
In Python 3, print('\xff') encodes the Unicode character U+00FF to the terminal using the default encoding...in your case UTF-8.
To directly output bytes to the terminal in Python 3 you can't use print, but you can use the following to skip encoding and write a byte string:
python3 -c "import sys; sys.stdout.buffer.write(b'\xff')"
In Python 2, str and bytes were the same thing, so when you wrote '\xff', the result contained the actual byte 0xFF.
In Python 3, str is closer to Python 2's unicode object, and is not an alias for bytes. \xff is no longer a request to insert a byte, but rather a request to insert a Unicode character whose code can be represented in 8 bits. The string is printed with your default encoding (probably UTF-8), in which character U+00FF is encoded as the bytes \xc3\xbf. \x is basically the one-byte version of \u when it appears in a str. It's still the same thing as before when it appears in a bytes literal, though.
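To see the difference concretely, here is a short Python 3 sketch comparing the two literals:

```python
# "\xff" is the Unicode character U+00FF; encoding it with UTF-8
# produces two bytes, while b"\xff" is already the single raw byte.
text = "\xff"
raw = b"\xff"

print(text.encode("utf-8"))    # b'\xc3\xbf' -- two bytes under UTF-8
print(text.encode("latin-1"))  # b'\xff'     -- Latin-1 maps U+00FF to 0xFF
print(raw)                     # b'\xff'     -- already a raw byte
```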
Now for a solution. If you just want some bytes, do
b'\xff'
That will work the same as in Python 2. You can write these bytes to a binary file, but you can't then print directly, since everything you print gets converted to str. The problem with printing is that everything gets encoded in text mode. Luckily, sys.stdout has a buffer attribute that lets you output bytes directly:
sys.stdout.buffer.write(b'\xff\n')
This will only work if you don't replace sys.stdout with something fancy that doesn't have a buffer.
Related
How to write binary data to stdout in python 3?
I am trying to print the byte d0 to stdout in Python 3.7.2. I have the following code:
print("\xd0", end = "")
However, when I execute this, it outputs the bytes c390.
$ python -c 'print("\xd0", end = "")' | xxd
00000000: c3 90
Why is it not outputting the byte \xd0?
"\xd0" is an str object, which in Python 3 is a Unicode string (= a sequence of Unicode code points) containing the Unicode code point U+00D0 (i.e. 208 i.e. Ð); when writing it with print, Python has to convert it from Unicode (str) to bytes (bytes), so it has to use an encoding (an "abstract codepoints" to bytes converter).
In your case, as often, it happens to be UTF-8, where codepoint U+00D0 is encoded as the code-units (= bytes) sequence c3 90.
If you want to output literally the byte 0xd0, you have to use a byte string and write straight to the byte stream underneath sys.stdout:
import sys
sys.stdout.buffer.write(b'\xd0')
It’s printing the character U+00D0 (“Ð”) as UTF-8. If you want to output a byte string to stdout, use sys.stdout.buffer:
import sys
sys.stdout.buffer.write(b"\xd0")
"\xd0" is a string of Unicode codepoints, but b"\xd0" is a string of bytes.
I have a question that I am having a hard time turning into code, so I will explain it as best I can. I am trying to find a control byte and replace it with another control byte, but the program needs to be able to tell the different bytes apart. For example, hex code 00 is NUL and hex code 01 is SOH. Let's say I wanted to write code to replace one with the other. Code example:
TextFile1 = Line.Replace('NUL','SOH')
TextFile2.write(TextFile1)
Yes, I have read a lot of different posts trying to understand how to turn this into working code. The first problem is that I can't just copy and paste the hex 00 character into a Python module; it just won't paste. From what I've read, formats like 0x00 are used to represent it, but I'm having trouble finding the correct representation for Python 3.x:
print('\x00')
# output: nothing shows. I'm trying to get output of 'NUL', or '.' as a hex editor would show it; either works fine. --Edited
So how do I get the module to understand that I'm trying to represent hex 00 ('NUL') and show it as '.', and do the same for SOH? I'm not limited to those two characters; I'm just using them as examples, because I want to use all 256 byte values and still be able to tell them apart when pasting into another program, just like a hex editor does. Maybe I need to get the two programs onto the same encoding; I'm not really sure. I just need a simple example of how to search for non-printable hexadecimal characters and replace them in Notepad or Notepad++ (from what I have read, only Notepad++ can do this).
If you are on Python 3, you should really work with bytes objects. Python 3 strings are sequences of unicode code points. To work with byte-strings, use bytes (which is pretty much the same as a Python 2 string, which used the "sequence of bytes" model).
>>> bytes([97, 98, 99])
b'abc'
Note, to write a bytes literal, prepend a b before the opening quote in your string.
To answer your question, to find the representation of 0x00 and 0x01 just look at:
>>> bytes([0x00, 0x01])
b'\x00\x01'
Note, 0x00 and 0 are the same value; they are just different literal syntaxes (hex literal versus decimal literal).
>>> bytes([0, 1])
b'\x00\x01'
I have no idea what you mean with regards to Notepad++.
Here is an example, though, of replacing a null byte with something else:
>>> byte_string = bytes([97, 98, 0, 99])
>>> byte_string
b'ab\x00c'
>>> print(byte_string)
b'ab\x00c'
>>> byte_string.replace(b'\x00', b'NONE')
b'abNONEc'
>>> print(byte_string.replace(b'\x00', b'NONE'))
b'abNONEc'
Another equivalent way to get the value of \x00 in Python is chr(0); I like that a little better than the literal versions.
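Putting the answer together for the original ask, here is a sketch that swaps NUL (0x00) and SOH (0x01) in one pass. bytes.translate is the right tool for a simultaneous swap; two chained .replace() calls would clobber each other:

```python
# Build a 256-entry translation table that swaps byte values
# 0x00 and 0x01 and leaves every other value unchanged.
table = bytearray(range(256))
table[0x00], table[0x01] = 0x01, 0x00

data = b"a\x00b\x01c"
swapped = data.translate(bytes(table))
print(swapped)  # b'a\x01b\x00c'
```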
I understand that Python 2.7 byte strings only take ASCII characters, and I wonder why the following works. It looks like ü was encoded in some other format; can you explain?
>>> s = "Flügel"
>>> s
'Fl\x81gel'
I understand that Python 2.7 byte strings only take ASCII characters,
You misunderstood. Python byte strings take any valid bytes. Bytes are basically integer values in the range 0 through 255 (ASCII covers 0 through 127).
When you open the interactive interpreter prompt in a terminal or console, the configuration of that terminal or console determines what bytes you can type and send to Python. You appear to be using one configured for a DOS code page (several of them, including CP437 and CP850, send 0x81 for ü). Python stored that byte in the bytestring.
You can check what codec was used by looking at sys.stdin.encoding.
Mine is configured to handle UTF-8, which uses two bytes to encode the same character (U+00FC LATIN SMALL LETTER U WITH DIAERESIS):
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> s = 'Flügel'
>>> s
'Fl\xc3\xbcgel'
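You can reproduce all three byte sequences from Python 3 by naming the codec explicitly (cp437 is my assumption about the asker's console; it is one of several DOS code pages that put ü at 0x81):

```python
ch = "\u00fc"  # ü, U+00FC LATIN SMALL LETTER U WITH DIAERESIS

print(ch.encode("cp437"))    # b'\x81'     -- DOS code page
print(ch.encode("latin-1"))  # b'\xfc'
print(ch.encode("utf-8"))    # b'\xc3\xbc' -- two bytes
```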
FOR PYTHON 2.7 (I took a shot at using encode in 3 and am all confused now... would love some advice on how to replicate this test in Python 3...)
For the Euro character (€) I looked up what its utf8 Hex code point was using this tool. It said it was 0x20AC.
For Latin1 (again using Python2 2.7), I used decode to get its Hex code point:
>>> import unicodedata
>>> p = '€'
>>> # notably, \x80 seems to correspond to the Windows CP-1252 Euro
>>> p.decode('latin-1')
u'\x80'
Then I used this print statement for both of them, and this is what I got:
for utf8:
>>> print unichr(0x20AC).encode('utf-8')
â‚¬
for latin-1:
>>> print unichr(0x80).encode('latin-1')
€
What in the heck happened? Why did encode return 'â‚¬' for utf-8? Also... it seems that Latin-1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes differently -- says that Latin-1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me... HOWEVER, the reason why Python 2.7 is reading the Windows CP1252 '\x80' is a real mystery to me... is this the standard for latin-1 in Python 2.7??
You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
And that explains your confusion here:
Why did encode return 'â‚¬' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you â‚¬.
Meanwhile, when you write this:
p='€'
The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 for the Euro sign is \x80. So, this is the same as typing:
p='\x80'
But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
p='€'
… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:
p='\u20ac'
… or this:
p=b'\x80'.decode(sys.stdin.encoding)
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A Python 2 unicode string is a sequence of code points; a Python 2 str is a sequence of bytes, not code points. Hex is just a way of writing a number: the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
Finally:
Also... it seems that Latin-1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes differently -- says that Latin-1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII, and Unicode is a superset of the printable subset of Latin-1. But UTF-8 only encodes code points up to U+7F as a single byte with the same value as the code point; code points U+80 through U+FF take two bytes in UTF-8. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
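All of this can be verified in Python 3, where every conversion names its codec; the last line reproduces the mojibake that a CP-1252 console displays for UTF-8 bytes:

```python
euro = "\u20ac"  # the Euro sign

print(euro.encode("utf-8"))   # b'\xe2\x82\xac' -- three bytes
print(euro.encode("cp1252"))  # b'\x80'

# Latin-1 has no Euro sign at all:
try:
    euro.encode("latin-1")
except UnicodeEncodeError:
    print("no Euro sign in Latin-1")

# Reading the UTF-8 bytes as if they were CP-1252:
print(euro.encode("utf-8").decode("cp1252"))  # â‚¬
```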
I am confused about hex representation of Unicode.
I have an example file with a single mathematical integral sign character in it. That is U+222B
If I cat the file or edit it in vi I get an integral sign displayed.
A hex dump of the file shows its hex content is 88e2 0aab
In Python I can create an integral Unicode character and print p, which renders an integral sign on my terminal.
>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫
What confuses me is I can open a file with the integral sign in it, get the integral symbol but the hex content is different.
>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫
One is a Unicode object and one is a plain string, but what is the relationship between the two hex representations of apparently the same character? How would I manually convert one to the other?
The plain string has been encoded using UTF-8, one of a variety of ways to represent Unicode code points in bytes. UTF-8 is a multibyte encoding which has the often useful feature that it is a superset of ASCII - the same byte encodes any ASCII character in UTF-8 or in ASCII.
In Python 2.x, use the encode method on a Unicode object to encode it, and decode or the unicode constructor to decode it:
>>> u'\u222b'.encode('utf8')
'\xe2\x88\xab'
>>> '\xe2\x88\xab'.decode('utf8')
u'\u222b'
>>> unicode('\xe2\x88\xab', 'utf8')
u'\u222b'
print, when given a Unicode argument, implicitly encodes it. On my system:
>>> sys.stdout.encoding
'UTF-8'
See this answer for a longer discussion of print's behavior:
Why does Python print unicode characters when the default encoding is ASCII?
Python 3 handles things a bit differently; the changes are documented here:
http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
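For completeness, the same round-trip in Python 3, where encode goes from str to bytes and decode goes back (a direct translation of the Python 2 session above):

```python
s = "\u222b"                 # the integral sign, as a str
encoded = s.encode("utf-8")  # str -> bytes

print(encoded)                       # b'\xe2\x88\xab'
print(encoded.decode("utf-8") == s)  # True
```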
Okay, I have it. Thanks for the answers. I wanted to see how to do the conversion by hand rather than convert a string using Python.
The conversion works this way.
Suppose you have a Unicode character, in my example the integral symbol.
A hex dump (od -x) produces
echo -n "∫"|od -x
0000000 88e2 00ab
od prints 16-bit words with the bytes swapped, so the actual byte sequence is
e2 88 ab (the trailing 00 is just od padding out the last 16-bit word)
The first hex digit of the first byte is E (binary 1110). A lead byte starting with the bits 1110 means the character takes three bytes (24 bits) to encode.
The first two bits of each remaining byte are thrown away (the 10 prefix just marks them as continuation bytes). The full bit stream is
111000101000100010101011
Throw away the first 4 bits of the lead byte and the first two bits of each continuation byte:
0010001000101011
Re-expressing this in hex
222B
There you have it!
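The hand decoding above can be sketched as code. This helper (my own name, not a library function) handles only the 3-byte UTF-8 form described here:

```python
def decode_utf8_3byte(b: bytes) -> int:
    """Recover the code point from a 3-byte UTF-8 sequence:
    1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz (16 bits)."""
    assert len(b) == 3
    assert b[0] >> 4 == 0b1110                      # 3-byte lead marker
    assert b[1] >> 6 == 0b10 and b[2] >> 6 == 0b10  # continuation markers
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

print(hex(decode_utf8_3byte(b"\xe2\x88\xab")))  # 0x222b
```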