I got a list that looks like this:
myList = [u'\u0442\u043e\u0432\u0447', u'\u0442\u043e\u0432\u0447']
Then I did this:
for x in myList:
    print (x.encode('utf-8'))
so I got:
'\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
'\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
I have tried many encodings and decodings; none of them helped. How can I get readable text?
Your strings are already unicode (they start with u'; the \u.... parts are Unicode code points, which Python converts to a unicode object for you), so you don't need to encode them (you only encode str).
You just need to print them:
myList = [u'\u0442\u043e\u0432\u0447', u'\u0442\u043e\u0432\u0447']
for x in myList:
    print(x)
You might need to specify your terminal's character set, e.g. with export LC_ALL=en_US.UTF-8, if you run into issues while printing x.
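If you're unsure what your terminal reports, a quick sketch for checking it:

import sys

# If this prints None or something other than UTF-8, the terminal
# may not be able to render the characters correctly.
print(sys.stdout.encoding)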
Your strings are a sequence of code points. Each code point is a fixed symbol. To convert them to bytes, use encode and supply it with an encoding (usually utf-8). To get symbols back from bytes (e.g. a file on disk) you need to decode (you should know the encoding in advance).
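A minimal round-trip sketch of that relationship (Python 2, assuming a UTF-8 terminal):

text = u'\u0442\u043e\u0432\u0447'  # unicode object: a sequence of code points
raw = text.encode('utf-8')          # str object: the bytes '\xd1\x82\xd0\xbe\xd0\xb2\xd1\x87'
back = raw.decode('utf-8')          # bytes -> code points again
assert back == text                 # the round trip is lossless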
Use print u' '.join(myList) (calling str() on a non-ASCII unicode object would raise UnicodeEncodeError in Python 2).
Related
I have a list of UTF-16 code points that I need to convert to the actual characters they represent programmatically. This seems unbelievably hard to do in Python 3.
For example, I have the numbers 55357 and 56501 for one character, which I know is this banknote emoji: 💵 But I have no idea how to convert that in Python. I first tried chr(55357) + chr(56501), but Python seems to assume that it is UTF-8 encoded and thus gives me broken Unicode.
I then tried re-encoding the string, but since it's broken UTF-8, it gives me what seems to be broken UTF-16. If I tell it to leave it alone with (chr(55357) + chr(56501)).encode('utf-8', 'surrogatepass'), I can actually get valid bytes for the character, but they're encoded in...CESU-8, for reasons I cannot yet grasp. This is not an encoding Python supports natively, and I can't find a codec to convert it.
I think I could probably write these to the disk and then read them with the right encoding, but that sounds really terrible.
Is there a reasonable way to do this in Python 3?
The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:
a, b = 55357, 56501
# Each 16-bit code unit becomes two little-endian bytes; together they
# form the UTF-16 encoding of the character, which decode() can read.
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')
print(x.decode('UTF-16'))
This can be generalized for any number of integers:
data = [55357, 56501]
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
result = b.decode('utf-16')
The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding: it works on raw Unicode code points, so you are combining two distinct (surrogate) characters. As the other answer points out, you then have to encode this two-character string and re-decode it, or just build the bytes and decode once, as I'm suggesting.
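For reference, the surrogate-pair arithmetic that the UTF-16 decode performs can be written out by hand; a sketch (Python 3):

hi, lo = 55357, 56501  # high and low surrogate code units
# UTF-16 stores code points above U+FFFF as two surrogates; reversing
# that takes 10 bits from each and adds back the 0x10000 offset.
code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(code_point), chr(code_point))  # 0x1f4b5 💵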
The following code works:
cp1 = 55357
cp2 = 56501
(chr(cp1) + chr(cp2)).encode('utf-16', 'surrogatepass').decode('utf-16')
#💵
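If you have a whole list of UTF-16 code units rather than one pair, the same trick generalizes; a sketch:

code_units = [55357, 56501]
# Build a string of lone surrogates, then round-trip it through UTF-16;
# 'surrogatepass' lets encode() emit the surrogates verbatim.
print(''.join(map(chr, code_units)).encode('utf-16', 'surrogatepass').decode('utf-16'))
#💵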
For example, I can print a Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of unicode code points and I wish to display them on console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
    for codePoint in uniCodeFile:
        print codePoint #I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a four-byte big-endian byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0',16)). Convert the hex string to an integer and print its Unicode codepoint.
Caveat: On Windows codepoints > U+FFFF won't work.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
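Putting that together for the file from the question, a minimal Python 3 sketch:

with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for line in fh:
        line = line.strip()
        if line:                       # skip blank lines
            print(chr(int(line, 16)))  # hex string -> int -> character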
These are Unicode code points, but they lack the \u Python unicode-escape prefix. So, just put it in:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
    for codePoint in uniCodeFile:
        print ("\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If it's a Windows code page and the characters are not in its range, you'll still get funky errors.
In Python 3, that prefix would be b"\\u".
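A Python 3 sketch of the same approach; note that the \u escape only covers four hex digits, so a code point like 1F1E6 would need the eight-digit \U form instead:

with open("someFileWithAListOfUnicodeCodePoints", "rb") as fh:
    for codePoint in fh:
        # works for 4-hex-digit code points such as 00E4
        print((b"\\u" + codePoint.strip()).decode("unicode-escape"))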
I have two variables (let's say x and y) that have the following values:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
They presumably encode the same name, but in different ways. The first variable is unicode and the second one is a string.
Is there a way to transform the string into unicode (or unicode into a string) and check if they are really the same?
I tried to use encode:
x.encode('utf-8')
It returns something new (the third version):
'Ko\xc5\xa1ick\xc3\xbd'
And using the following:
print x.encode('utf-8')
returns yet another version:
KošickÛ
So, I am totally confused. Is there a way to keep everything in the same format?
You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.
if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))
With your given example, this is not true; but perhaps y is in a different encoding.
In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
You need to know the encoding of the byte string. It looks like windows-1252:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
print x == y.decode('windows-1252')
print x.encode('windows-1252') == y
Output:
True
True
Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
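As a sketch of that pipeline (the file names and the windows-1252 source encoding here are just assumptions for illustration):

with open('input.txt', 'rb') as f:
    text = f.read().decode('windows-1252')  # bytes -> Unicode at the boundary

text = text.upper()                          # process in Unicode only

with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))            # Unicode -> bytes on the way out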
Well, UTF-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.
Common examples are latin1, latin9 (the same with the € symbol), and cp1252, a Windows variant of them.
In your case:
>>> x.encode('cp1252')
'Ko\x9aick\xfd'
So the y string seems to be cp1252-encoded.
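If you aren't sure which encoding a byte string uses, one low-tech check is to try a few candidates against a known-good Unicode value; a sketch:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
for enc in ('utf-8', 'latin1', 'cp1252'):
    try:
        if y.decode(enc) == x:
            print enc  # prints: cp1252
    except UnicodeDecodeError:
        pass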
Hello, I am experimenting with Python and LXML, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the actual characters. Then I have problems storing this into the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I convert these codes back to the original characters somehow?
Try (undoing the stray UTF-8 decoding first, then reinterpreting the raw byte as windows-1250):
text = "\xc2\x9ea"
print text.decode('utf-8').encode('latin-1').decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be:
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
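One possible shape for that guard, as a sketch (Python 3):

def checked_ord(ch):
    # Reject characters that don't fit in one byte, so multi-byte
    # code points can't be silently mangled.
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('code point %#x does not fit in one byte' % n)
    return n

Using checked_ord in place of ord in the map calls above would make a string like '\x9e\u724b\x61' fail loudly instead of producing garbage.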
I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> s = u'ABRA\xc3O JOS\xc9'
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found this library very useful: https://pypi.org/project/Unidecode/
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was giving UnicodeEncodeError, and Python's built-in encoding methods didn't work for me because they replace the characters in the string with their corresponding hex values, which changes the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
This removes the unicode part from the string and keeps all the data intact.
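For what it's worth, when every code point is below 256 that comprehension produces the same bytes as encoding with Latin-1 (and chr() raises ValueError otherwise in Python 2), so a more direct spelling might be this sketch:

import hashlib

u = u'caf\xe9'             # all code points below 256
raw = u.encode('latin-1')  # same bytes as ''.join(chr(ord(c)) for c in u)
print hashlib.md5(raw).hexdigest()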