How to compare unicode and string in Python?

I have two variables (let's say x and y) that have the following values:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
They presumably encode the same name, but in different ways. The first variable is unicode and the second one is a byte string.
Is there a way to transform the string into unicode (or unicode into a string) and check whether they are really the same?
I tried to use encode:
x.encode('utf-8')
It returns something new (the third version):
'Ko\xc5\xa1ick\xc3\xbd'
And using the following:
print x.encode('utf-8')
returns yet another version:
KošickÛ
So, I am totally confused. Is there a way to keep everything in the same format?

You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.
if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))
With your given example, this is not true; but perhaps y is in a different encoding.
In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).

You need to know the encoding of the byte string. It looks like windows-1252:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
print x == y.decode('windows-1252')
print x.encode('windows-1252') == y
Output:
True
True
Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
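As a minimal sketch of that pattern (the input encoding here is taken from the answer above; the variable names are made up):
raw = 'Ko\x9aick\xfd'                # bytes as they arrived, e.g. read from a file
text = raw.decode('windows-1252')    # decode once, at the input boundary
assert text == u'Ko\u0161ick\xfd'    # all processing happens on unicode
output = text.encode('utf-8')        # encode once, when persisting or sending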

Well, UTF-8 is now the de facto standard for interchange, and the default in the Linux world, but there are plenty of other encodings.
Common examples are latin1, latin9 (the same, but with the € symbol), and cp1252, a Windows variant of them.
In your case:
>>> x.encode('cp1252')
'Ko\x9aick\xfd'
So the y string seems to be cp1252-encoded.
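If you are not sure which legacy encoding a byte string is in, one rough way to narrow it down is to probe a few candidates (a sketch; note that several single-byte encodings can match the same bytes):
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
for enc in ('utf-8', 'latin1', 'cp1250', 'cp1252'):
    try:
        if y.decode(enc) == x:
            print enc, 'matches'   # here both cp1250 and cp1252 match
    except UnicodeDecodeError:
        pass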

Related

How to get a character from its UTF-16 code points in Python 3?

I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.
For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵 But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.
I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave it alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I can actually get valid bytes for the character, but it's encoded in... CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.
I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.
Is there a reasonable way to do this in Python 3?
The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:
a, b = 55357, 56501
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')
print(x.decode('UTF-16'))
This can be generalized for any number of integers:
data = [55357, 56501]
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
result = b.decode('utf-16')
The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding. It works on the raw Unicode code points, so you are combining two distinct characters. As the other answer points out, you then have to encode this two character string and re-decode it, or just get the bytes and decode once as I'm suggesting.
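For illustration, here is the same idea wrapped in a small helper (the function name decode_utf16_units is made up for this sketch):
def decode_utf16_units(units):
    # pack each 16-bit code unit little-endian, then decode them all at once
    raw = b''.join(u.to_bytes(2, 'little') for u in units)
    return raw.decode('utf-16-le')

print(decode_utf16_units([55357, 56501]))  # 💵
Spelling the endianness out as utf-16-le avoids any dependence on a BOM or the platform's native byte order.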
The following code works:
cp1 = 55357
cp2 = 56501
(chr(cp1) + chr(cp2)).encode('utf-16', 'surrogatepass').decode('utf-16')
#💵

python unicode encode not showing readable text

I got a list that looks like this:
myList = [u'\u0442\u043e\u0432\u0447', u'\u0442\u043e\u0432\u0447']
Then I did this:
for x in myList:
    print(x.encode('utf-8'))
so I got:
'\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
'\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
I tried many encoding and decoding standards; none of them helped me. How can I get readable text?
Your strings are already unicode (they start with u'; the \u.... parts are Unicode codepoints, which Python turns into a unicode object for you), so you don't need to encode them (encoding produces a byte str, which you only need when writing the text out).
You just need to print them:
myList = [u'\u0442\u043e\u0432\u0447', u'\u0442\u043e\u0432\u0447']
for x in myList:
    print(x)
You might need to tell your terminal to use UTF-8 (e.g. export LC_ALL=en_US.UTF-8) if you run into issues while printing x.
Your strings are a sequence of codepoints, and each code point is a fixed symbol. To convert them to bytes, use encode and supply it with an encoding (usually utf-8). To get symbols back from bytes (e.g. from a file on disk) you need to decode, and you should know the encoding in advance.
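A quick round trip with the strings from the question shows the symmetry:
u = u'\u0442\u043e\u0432\u0447'
b = u.encode('utf-8')          # unicode -> bytes: '\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
assert b.decode('utf-8') == u  # decoding the bytes restores the original
print u                        # товч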
Use print u' '.join(myList)

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text, parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this :
\xc2\x9ea
instead of the characters, and then I have problems storing the data in the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I somehow convert this back to the original characters?
Try:
text = "\xc2\x9ea"
# the bytes are UTF-8 for u'\x9ea'; recover the raw bytes via latin-1, then decode as windows-1250
print text.decode('utf-8').encode('latin-1').decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.
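If the page was fetched with requests (which new.text suggests), it may also be worth handing lxml the raw bytes instead of an already-guessed-at decoding, so it can honour the page's own charset declaration. A sketch, assuming a requests response object:
import requests
from lxml import html

new = requests.get(url)              # url is a placeholder
tree = html.fromstring(new.content)  # bytes in; lxml reads the declared charset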
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those directly:
>>> list(map((lambda n: format(n, '02x')), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: format(n, '02x')), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: format(n, '02x')), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
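For illustration, such a guarded helper might look like this (the name checked_ord is made up):
def checked_ord(ch):
    # refuse characters that cannot round-trip through a single byte
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('character %r does not fit in one byte' % ch)
    return n

raw = bytes.fromhex(''.join(format(checked_ord(c), '02x') for c in '\x9ea'))
print(raw.decode('cp1250'))  # ža, while '\x9e\u724b\x61' would now raise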

Python: Convert Unicode-Hex-String to Unicode

I have a hex-string made from a unicode string with that function:
def toHex(s):
    res = ""
    for c in s:
        res += "%02X" % ord(c)  # at least 2 hex digits, can be more
    return res
hex_str = toHex(u"...")
This returns a string like this one:
"80547CFB4EBA5DF15B585728"
That's a sequence of 6 Chinese characters.
But
u"Knödel"
converts to
"4B6EF664656C"
What I need now is a function to convert this back to the original unicode. The chinese symbols seem to have a 2-byte representation while the second example has 1-byte representations for all characters. So I can't just use unichr() for each 1- or 2-byte block.
I've already tried
binascii.unhexlify(hex_str)
but this seems to convert byte-by-byte and returns a string, not unicode. I've also tried
binascii.unhexlify(hex_str).decode(...)
with different formats. Never got the original unicode string.
Thank you a lot in advance!
This seems to work just fine:
binascii.unhexlify(binascii.hexlify(u"Knödel".encode('utf-8'))).decode('utf-8')
Comes back to the original object. You can do the same for the Chinese text if it's encoded properly; however, ord(x) has already destroyed the text you started from. You'll need to encode it first and only then treat it as a string of bytes.
Can't be done. Using %02X loses too much information. You should be using something like UTF-8 first and converting that, instead of inventing a broken encoding.
>>> u"Knödel".encode('utf-8').encode('hex')
'4b6ec3b664656c'
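Going back the other way from that hex string is then straightforward (using the value shown above):
import binascii
raw = binascii.unhexlify('4b6ec3b664656c')  # the UTF-8 bytes
print raw.decode('utf-8')                   # Knödel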
When I was working with Unicode in a VB app a while ago, the first 1 or 2 digits would be removed if they were a "0", meaning "&H00A2" would automatically be converted to "&HA2". I just created a small function to check the length of the string and, if it was less than 4 chars, add the missing 0's. I'm not sure if this is what's happening to you, but I thought I would offer it as something to be aware of.

How do you store raw bytes as text without losing information in python 2.x?

Suppose I have any data stored in bytes. For example:
0110001100010101100101110101101
How can I store it as printable text? The obvious way would be to convert every 0 to the character '0' and every 1 to the character '1'. In fact this is what I'm currently doing. I'd like to know how I could pack them more tightly, without losing information.
I thought of converting bits in groups of eight to ASCII, but some bit combinations are not accepted in that format. Any other ideas?
What about an encoding that only uses "safe" characters like base64?
http://en.wikipedia.org/wiki/Base64
EDIT: That is assuming that you want to safely store the data in text files and such?
In Python 2.x, strings should be fine (Python doesn't use null terminated strings, so don't worry about that).
Otherwise, in 3.x, check out the bytes and bytearray objects.
http://docs.python.org/3.0/library/stdtypes.html#bytes-methods
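For example, a minimal Python 3 illustration (the byte values are arbitrary):
data = bytes([0x63, 0x15, 0x97, 0x5d])  # immutable sequence of raw bytes
buf = bytearray(data)                   # mutable variant of the same data
buf[0] = 0xff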
Not sure what you're talking about.
>>> sample = "".join( chr(c) for c in range(256) )
>>> len(sample)
256
>>> sample
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\
x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABC
DEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83
\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97
\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab
\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf
\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3
\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7
\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb
\xfc\xfd\xfe\xff'
The string sample contains all 256 distinct bytes. There is no such thing as a "bit combinations ... not accepted".
To make it printable, simply use repr(sample) -- non-ASCII characters are escaped. As you see above.
Try the standard array module or the struct module. These support storing bytes in a space-efficient way, but they don't support bits directly.
You can also try http://cobweb.ecn.purdue.edu/~kak/dist/BitVector-1.2.html or http://ilan.schnell-web.net/prog/bitarray/
For Python 2.x, your best bet is to store them in a string. Once you have that string, you can encode it into safe ASCII values using the base64 module that comes with python.
import base64
encoded = base64.b64encode(bytestring)  # bytestring holds your raw bytes
This will be much more condensed than storing "1" and "0".
For more information on the base64 module, see the python docs.
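Putting the two steps together, here is a Python 2 sketch that packs the bit string from the question into bytes and then base64-encodes it (the zero-padding to a byte boundary is an assumption; you would need to store the original bit length to reverse it exactly):
import base64

bits = '0110001100010101100101110101101'
padded = bits + '0' * (-len(bits) % 8)   # pad to a whole number of bytes
packed = ''.join(chr(int(padded[i:i + 8], 2))
                 for i in xrange(0, len(padded), 8))
print base64.b64encode(packed)           # 'YxWXWg=='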
