Python printing bytes strangely

So, I have this very simple code:
from construct import VarInt

a = VarInt.build(12345)
print(a)

with open('hello.txt', 'wb') as file:
    file.write(a)
When I print a, the first byte shows as I expect, but the last byte for some reason does not: it prints b'\xb9`' when I would expect b'\xb9\x60'. When I look at the file I stored, though, the bytes are saved exactly as they should be, with no issue there. Does anybody know what's going on? Also, some integers print properly, but this one, for example, does not.

It's not missing a byte; the output is
b'\xb9`'
Do you see the backtick (`) after \xb9? That's the character encoded as 0x60. If you want to display all the bytes as hexadecimal, you can try this snippet:
print(" ".join(hex(n) for n in a))

Why do you think that it does not print properly? Assuming an ASCII-derived code page (ASCII, UTF-8 or Latin-1), the code for the backquote (`) is 96, or 0x60.
As \xb9 does not map to a printable ASCII character, it is escaped, but as \x60 does map to the backquote, the representation just uses that character.
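A minimal demonstration of this repr behaviour (nothing here is specific to construct):
print(bytes([0xb9, 0x60]))  # b'\xb9`'    -> 0x60 is printable ASCII, shown as `
print(bytes([0xb9, 0x01]))  # b'\xb9\x01' -> 0x01 is not printable, shown escaped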

Related

Converting a string formatted in hex in Python to binary data correctly

I have a string formatted as '\x00\x00\x00\x00' and need it to be formatted such that, when printed, it appears in the console as b'\x00\x00\x00\x00'
How do I do this?
edit: I had a different version of the code printing out a string formatted as b'\xf4\x00\x00\x00' etc., and on my computer it prints '\xf4\x00\x00\x00'.
Just add the b prefix before the string literal; that way you'll be defining the string as bytes:
# example
s = b"\x00\x00\x00\x00"
print(s)
If instead you're receiving the string from somewhere else, and you're not writing it manually, you can just encode the string into bytes:
# another example
# let's pretend that we received the value from, say, a function
s = "\x00\x00\x00\x00".encode() # again, pretend that this value has just been retrieved from a function
print(s)
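One hedged caveat, not in the original answer: str.encode() defaults to UTF-8, so characters above U+007F (such as the '\xf4' from the edit) become two bytes each. If the goal is a byte-for-byte copy of such a string, 'latin-1' maps the code points 0-255 straight to the same byte values:
s = "\xf4\x00\x00\x00".encode("latin-1")
print(s)  # b'\xf4\x00\x00\x00'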

How to get a character from its UTF-16 code points in Python 3?

I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.
For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵. But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.
I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave the surrogates alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I can actually get valid bytes for the character, but they're encoded in... CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.
I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.
Is there a reasonable way to do this in Python 3?
The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:
a, b = 55357, 56501
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')
print(x.decode('utf-16-le'))  # byte order made explicit, matching the 'little' above
This can be generalized for any number of integers:
data = [55357, 56501]
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
result = b.decode('utf-16-le')  # again explicit, so it doesn't depend on platform byte order
The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding: it works on the raw Unicode code points, so here you are combining two distinct characters (the two halves of a surrogate pair). As the other answer points out, you then have to encode this two-character string and re-decode it, or just get the bytes and decode once, as I'm suggesting.
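For intuition, the pairing the UTF-16 decoder performs can also be done by hand. A minimal sketch of the standard surrogate-pair arithmetic (pure code-point math, no codec involved):
high, low = 55357, 56501  # 0xD83D, 0xDCB5
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(cp), chr(cp))  # 0x1f4b5 💵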
The following code works:
cp1 = 55357
cp2 = 56501
(chr(cp1) + chr(cp2)).encode('utf-16', 'surrogatepass').decode('utf-16')
#💵

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag around the first letter of a unicode string, and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (Hebrew goes right to left):
>>> first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices work over each byte when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you need to first let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behaviour abstracted, i.e. you take a str and you return a str.
Ideally you should convert all the data from raw UTF-8 bytes to unicode objects (I am assuming your source encoding is UTF-8, because that is the standard used by most applications these days) at the very beginning of your code, and convert back to raw bytes at the tail end of the code, e.g. when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII, because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is a Unicode object, so you don't have to do anything explicitly.
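For contrast, a hedged Python 3 sketch (bold_first is a hypothetical helper, not from the question; str is already Unicode, so indexing is per character):
def bold_first(s):
    return "<b>" + s[0] + "</b>" + s[1:]

print(bold_first("הקדמה"))  # <b>ה</b>קדמה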
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character; Unicode strings use one code unit per character (at least for characters in the BMP, i.e. the first 65536 code points, on Python 2):
#coding:utf8
import io  # Python 2's built-in open() has no encoding parameter; io.open does
s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as: (screenshot from the original answer not preserved)

Character Encoding, XML, Excel, python

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not windows-1252, because when I try to use that encoding I wind up with a lot of errors.
The specific word that is causing me trouble right now is "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather has a stroke through it).
I have tried a few things; I'll mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.
(3)
Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
Is there a way for me to convert this string to utf-8?
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string, and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8") just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode are more obviously parallel, but you don't have to; the two are equivalent.)
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which defaults to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.
There's a pretty easy way to test this: make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.
>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena
>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
There are two ways around this:
1. Print Unicode strings and let Python take care of it.
2. Print strings converted to the appropriate encoding.
For option 1:
print page.decode("utf-8") # of unicode(page, "utf-8")
For option 2, it's going to be one of the following:
print page.decode("utf-8").encode("windows-1252")
print page.decode("utf-8").encode(sys.getdefaultencoding())
Of course if you keep the intermediate Unicode string around, you don't need all those decode calls:
upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")
print upage

Python: Convert Unicode-Hex-String to Unicode

I have a hex-string made from a unicode string with this function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c)  # at least 2 hex digits, can be more
    return res
hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of six Chinese characters.
But
u"Kn枚del"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original unicode. The Chinese characters seem to have a 2-byte representation, while the second example has 1-byte representations for all characters, so I can't just use unichr() on each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with different formats. Never got the original unicode string.
Thank you a lot in advance!
This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
Comes back to the original object. You can do the same for the Chinese text if it's encoded properly; however, ord(x) has already destroyed the text you started from. You'll need to encode it first and only then treat it as a string of bytes.
Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Kn枚del".encode('utf-8').encode('hex')
'4b6ec3b664656c'
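For anyone on Python 3, a hedged equivalent of the same round trip (the 'hex' str codec above is Python 2 only):
data = "Knödel".encode("utf-8").hex()       # '4b6ec3b664656c'
print(bytes.fromhex(data).decode("utf-8"))  # Knödel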
When I was working with Unicode in a VB app a while ago, the first 1 or 2 digits would be removed if they were a "0", meaning "&H00A2" would automatically be converted to "&HA2". I just created a small function to check the length of the string and, if it was less than 4 chars, add the missing 0's. I'm not sure if this is what's happening to you, but I thought I would mention it as something to be aware of.
