Unicode encoding/decoding - python

I have a string that looks like this.
st = '/M\xe4rzen'
I would like to covert this to unicode. How can I do this? I've tried:
st.decode('utf-8')
unicode(t, 'utf-8')
The original file is utf-8 encoded, but I can't seem to get the unicode representation of the string.

Your data is not UTF8 encoded; more likely it is using the Latin-1 encoding:
>>> print st.decode('latin1')
/Märzen
Calling .decode() is enough, no need to also call unicode().

Related

How can I convert a string to "gbk" encoding?

I'm trying to convert some Chinese words into bytes with Python. For example, I have this word: 自 and I tried to convert it by doing this:
"自".encode()
But I only get this:
b'\xe8\x87\xaa'
Looking on the web I think that it needs to be converted with "gbk" encoding but if I try to do it I only get:
b'\xd7\xd4'
What I need is it to be converted into this:
\u81ea
Here you can see a reference to the character I'm talking about: https://charbase.com/81EA
\u81ea is a unicode code point not gbk bytes.
You can convert to this with:
"自".encode("unicode_escape")
# b'\\u81ea'
b'\xd7\xd4' is the gbk encoding of that code point, b'\xe8\x87\xaa' is the utf-8 encoding of the same code point.

Python convert utf-8 bytes to string

I have difficulties converting those bytes to string:
x = b'<strong>\xc5\xb7\xc3\xc0\xd0\xd4\xb8\xd0\xd0\xb1\xc1\xec\xb5\xa5\xbc\xe7\xb3\xa4\xd0\xe4\xb2\xbb\xb9\xe6\xd4\xf2\xc1\xac\xd2\xc2\xc8\xb9\xa3\xac\xb4\xf2\xd4\xec\xd1\xe7\xbb\xe1\xa1\xa2\xca\xb1\xc9\xd0\xb8\xd0\xca\xae\xd7\xe3\xa3\xac\xd5\xc3\xcf\xd4\xc5\xae\xd0\xd4\xf7\xc8\xc1\xa6\xa3\xac\xb4\xf3\xc1\xbf\xcf\xd6\xbb\xf5\xa3\xac\xbb\xb6\xd3\xad\xd0\xc2\xc0\xcf\xbf\xcd\xbb\xa7\xc4\xc3\xd1\xf9\xb2\xc9\xb9\xba\xa3\xa1</strong>'
if i decode via unicode-escape i got weird characters like:
'<strong>Å·ÃÀÐÔ¸ÐбÁìµ¥¼ç³¤Ðä²»¹æÔòÁ¬ÒÂȹ£¬´òÔìÑç»á¡¢Ê±ÉиÐÊ®×㣬ÕÃÏÔÅ®ÐÔ÷ÈÁ¦£¬´óÁ¿ÏÖ»õ£¬»¶Ó\xadÐÂÀÏ¿Í»§ÄÃÑù²É¹º£¡</strong>'
instead of chinese charaters like 欧美性感斜领单肩长袖不规则连衣裙
You seem to be using the wrong encoding. The right encoding seem to be 'GB2312'.
>>> x.decode('GB2312')
'<strong>欧美性感斜领单肩长袖不规则连衣裙... more symbols</strong>'

Decoding Python Unicode strings that contain double blackslashes

My strings look like this \\xec\\x88\\x98, but if I print them they look like this \xec\x88\x98, and when I decode them they look like this \xec\x88\x98
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.
If I x.decode('unicode-escape') it removes the double slashes, but when decoding the value returned by x.decode('unicode-escape'), the value I get is ì.
How would I go about decoding the original \\xec\\x88\\x98, so that I get the value correct output?
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character so I can't print it. So instead I'll print its name and it's representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')
print repr(s)
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
output
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.

How to get the Unicode representation of Arabic strings in Django?

I'm wondering how to get the Unicode representation of Arabic strings like سلام in Python?
The result should be \u0633\u0644\u0627\u0645
I need that so that I can compare texts retrieved from mysql db and data stored in redis cache.
Assuming you have an actual Unicode string, you can do
# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')
output
\u0633\u0644\u0627\u0645
The # -*- coding: utf-8 -*- directive is purely to tell the interpreter that the source code is UTF-8 encoded, it has no bearing on how the script itself handles Unicode.
If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:
\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85
You can convert that to Unicode like this:
data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')
output
سلام
\u0633\u0644\u0627\u0645
Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
Note that
'\u0633\u0644\u0627\u0645'
is a plain (byte) string containing 24 bytes, whereas
u'\u0633\u0644\u0627\u0645'
is a Unicode string containing 4 Unicode characters.
You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
Since you're using Python 2.x, you'll not be able to use encode. You'll need to use the unicode function to cast the string to a unicode object.
> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll
# keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام
I'm not sure what library you're using to fetch the content, but you might be able to fetch the data as unicode initially.
> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
For python 2.7
string = 'سلام'
new_string = unicode(string)
Prepend your string with u in python 2.x, which makes your string a unicode string. Then you can call the encode method of a unicode string.
arabic_string = u'سلام'
arabic_string.encode('utf-8')
Output:
print arabic_string.encode('utf-8')
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

Python: convert string from UTF-8 to Latin-1

I feel stacked here trying to change encodings with Python 2.5
I have XML response, which I encode to UTF-8: response.encode('utf-8'). That is fine, but the program which uses this info doesn't like this encoding and I have to convert it to other code page. Real example is that I use ghostscript python module to embed pdfmark data to a PDF file - end result is with wrong characters in Acrobat.
I've done numerous combinations with .encode() and .decode() between 'utf-8' and 'latin-1' and it drives me crazy as I can't output correct result.
If I output the string to a file with .encode('utf-8') and then convert this file from UTF-8 to CP1252 (aka latin-1) with i.e. iconv.exe and embed the data everything is fine.
Basically can someone help me convert i.e. character á which is UTF-8 encoded as hex: C3 A1 to latin-1 as hex: E1?
Instead of .encode('utf-8'), use .encode('latin-1').
data="UTF-8 data"
udata=data.decode("utf-8")
data=udata.encode("latin-1","ignore")
Should do it.
Can you provide more details about what you are trying to do? In general, if you have a unicode string, you can use encode to convert it into string with appropriate encoding. Eg:
>>> a = u"\u00E1"
>>> type(a)
<type 'unicode'>
>>> a.encode('utf-8')
'\xc3\xa1'
>>> a.encode('latin-1')
'\xe1'
If the previous answers do not solve your problem, check the source of the data that won't print/convert properly.
In my case, I was using json.load on data incorrectly read from file by not using the encoding="utf-8". Trying to de-/encode the resulting string to latin-1 just does not help...

Categories