I'm trying to convert some Chinese words into bytes with Python. For example, I have this word: 自 and I tried to convert it by doing this:
"自".encode()
But I only get this:
b'\xe8\x87\xaa'
Looking on the web I think that it needs to be converted with "gbk" encoding but if I try to do it I only get:
b'\xd7\xd4'
What I need is it to be converted into this:
\u81ea
Here you can see a reference to the character I'm talking about: https://charbase.com/81EA
\u81ea is a Unicode code point, not GBK bytes.
You can convert to this with:
"自".encode("unicode_escape")
# b'\\u81ea'
b'\xd7\xd4' is the GBK encoding of that code point; b'\xe8\x87\xaa' is the UTF-8 encoding of the same code point.
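To make the distinction concrete, here is a minimal Python 3 sketch that prints the code point and the three byte representations mentioned above for the same character. The variable names are only illustrative.

```python
# One character, one code point, several byte representations (Python 3).
ch = "自"

code_point = ord(ch)                    # integer value of the code point
utf8_bytes = ch.encode("utf-8")         # UTF-8 encoding
gbk_bytes = ch.encode("gbk")            # GBK encoding
escaped = ch.encode("unicode_escape")   # ASCII bytes spelling out \u81ea

print(hex(code_point))          # 0x81ea
print(utf8_bytes)               # b'\xe8\x87\xaa'
print(gbk_bytes)                # b'\xd7\xd4'
print(escaped)                  # b'\\u81ea'
print(escaped.decode("ascii"))  # \u81ea
```

The doubled backslash in b'\\u81ea' is only how Python displays a single backslash inside a bytes literal; the value itself is six ASCII characters.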
I have difficulties converting those bytes to string:
x = b'<strong>\xc5\xb7\xc3\xc0\xd0\xd4\xb8\xd0\xd0\xb1\xc1\xec\xb5\xa5\xbc\xe7\xb3\xa4\xd0\xe4\xb2\xbb\xb9\xe6\xd4\xf2\xc1\xac\xd2\xc2\xc8\xb9\xa3\xac\xb4\xf2\xd4\xec\xd1\xe7\xbb\xe1\xa1\xa2\xca\xb1\xc9\xd0\xb8\xd0\xca\xae\xd7\xe3\xa3\xac\xd5\xc3\xcf\xd4\xc5\xae\xd0\xd4\xf7\xc8\xc1\xa6\xa3\xac\xb4\xf3\xc1\xbf\xcf\xd6\xbb\xf5\xa3\xac\xbb\xb6\xd3\xad\xd0\xc2\xc0\xcf\xbf\xcd\xbb\xa7\xc4\xc3\xd1\xf9\xb2\xc9\xb9\xba\xa3\xa1</strong>'
If I decode via unicode-escape, I get weird characters like:
'<strong>Å·ÃÀÐÔ¸ÐбÁìµ¥¼ç³¤Ðä²»¹æÔòÁ¬ÒÂȹ£¬´òÔìÑç»á¡¢Ê±ÉиÐÊ®×㣬ÕÃÏÔÅ®ÐÔ÷ÈÁ¦£¬´óÁ¿ÏÖ»õ£¬»¶Ó\xadÐÂÀÏ¿Í»§ÄÃÑù²É¹º£¡</strong>'
instead of Chinese characters like 欧美性感斜领单肩长袖不规则连衣裙
You seem to be using the wrong encoding. The right encoding seems to be 'GB2312'.
>>> x.decode('GB2312')
'<strong>欧美性感斜领单肩长袖不规则连衣裙... more symbols</strong>'
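As a small illustration of why the codec matters, the sketch below decodes the first eight bytes of the string from the question in two ways. It assumes Python 3. The unicode-escape codec maps every non-escape byte to the code point with the same number, which is the same thing Latin-1 does, hence the mojibake.

```python
# First eight bytes of the payload from the question (four GB2312 characters).
raw = b'\xc5\xb7\xc3\xc0\xd0\xd4\xb8\xd0'

text = raw.decode('gb2312')        # correct codec: two bytes per character
mojibake = raw.decode('latin-1')   # wrong codec: one character per byte

print(text)      # 欧美性感
print(mojibake)  # Å·ÃÀÐÔ¸Ð
```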
I'm using Python 3.5.
I have a couple of byte strings representing text that is encoded in various codecs, e.g. b'mybytesstring'. Some are UTF-8 encoded, others are latin1, and so on. What I want to do, in the following order, is:
transform the bytes string into an ascii like string.
transform the ascii like string back to a bytes string.
decode the bytes string with correct codec.
The problem is that I have to move the bytes string into something that does not accept bytes objects so I'm looking for a solution that lets me do bytes -> ascii -> bytes safely.
x = x.decode().encode('ascii',errors='ignore')
You use the encode and decode methods for this, and supply the desired encoding to them. It's not clear to me if you know the encoding beforehand. If you don't know it you're in trouble. You may have to guess the encoding in some way, risking garbage output.
OK I found a solution which is much easier than I thought
mybytes = 'ëýđþé'.encode()
str_mybytes = str(mybytes)
again_mybytes = eval(str_mybytes)
decoded = again_mybytes.decode('utf8')
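A word of caution on the snippet above: eval executes arbitrary code, so it is risky if the string ever comes from an untrusted source. Below is a hedged sketch of two alternatives for the bytes -> ASCII -> bytes round trip, assuming Python 3: ast.literal_eval, which only parses literals, and Base64, which produces a plain ASCII string. The variable names are illustrative.

```python
import ast
import base64

original = 'ëýđþé'.encode('utf-8')

# Option 1: repr() round trip, parsed with ast.literal_eval instead of eval.
as_text = repr(original)               # ASCII-only str, e.g. "b'\\xc3\\xab...'"
restored = ast.literal_eval(as_text)   # accepts literals only, runs no code

# Option 2: Base64, a plain ASCII representation of arbitrary bytes.
b64_text = base64.b64encode(original).decode('ascii')
restored_b64 = base64.b64decode(b64_text)

print(restored.decode('utf-8'))      # ëýđþé
print(restored_b64.decode('utf-8'))  # ëýđþé
```

Either way, the final decode still needs the correct codec for each byte string; the round trip only preserves the bytes.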
Not sure if this is exactly the problem, but I'm trying to insert a tag around the first letter of a unicode string and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices work over each byte when you are dealing with raw bytes, i.e. the str type in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted, i.e. you receive a byte string and you return a byte string.
Ideally you should convert all the data from UTF-8 encoded bytes to Unicode objects (I am assuming your source encoding is UTF-8 because that is the standard used by most applications these days) at the very beginning of your code, and convert back to raw bytes at the very end, e.g. when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
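def bold_letters(string, index):
    # Python 2: decode the UTF-8 byte string so indexing works per character
    string = string.decode('utf8')
    string = "<b>"+string[0]+"</b>"+string[index:]
    # Encode back to UTF-8 bytes so callers still receive a byte string
    return string.encode('utf8')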
This will also work for ASCII because UTF-8 is a superset of ASCII. For a better understanding of how Unicode works, and how it works in Python specifically, read http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is a Unicode string, so you don't have to do anything explicitly.
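For comparison, a minimal Python 3 sketch of the same idea might look like the following. The helper name bold_first_letter is hypothetical. No decode or encode step is needed because indexing a str already works per character rather than per byte.

```python
def bold_first_letter(text: str) -> str:
    """Wrap the first character of text in <b> tags (hypothetical helper)."""
    if not text:
        return text
    return "<b>" + text[0] + "</b>" + text[1:]


word = "הקדמה"
result = bold_first_letter(word)
print(result)

# Each Hebrew letter is one character but two bytes in UTF-8,
# which is why slicing the raw bytes cuts a letter in half.
print(len(word), len(word.encode("utf-8")))  # 5 10
```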
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character. Unicode strings use one index position per character (at least for characters in the BMP on Python 2, i.e. the first 65536 code points):
#coding:utf8
import io  # the built-in open() in Python 2 has no encoding argument
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
with io.open('out.htm','w',encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:
I have hebrew data such that \xe0 is the hebrew aleph,
and wish to convert it into utf-8
In general in Python, if you have a byte string you need to use decode first to convert it to the internal representation, afterwards you can encode it to UTF-8. Of course, you need to know the coding of \xe0 for this to work (I assume your character is encoded using ISO-8859-8):
'\xe0'.decode('iso-8859-8').encode('utf-8')
EDIT:
A side note:
Make sure to use the internal representation in your program as long as possible. In general: decode first (on input), encode last (on output).
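The one-liner above is Python 2 syntax. As a hedged sketch of the same "decode first, encode last" pattern in Python 3, where only bytes objects have a decode method, and still assuming the input is ISO-8859-8:

```python
raw = b'\xe0'                      # aleph in ISO-8859-8 (assumed input codec)

text = raw.decode('iso-8859-8')    # decode on input: bytes -> str
utf8 = text.encode('utf-8')        # encode on output: str -> bytes

print(text)   # א
print(utf8)   # b'\xd7\x90'
```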
You can use the "decode" call to transform it into unicode:
y = x.decode('iso8859-8')
where x is your 8-bit string and y is the unicode string.
Then you can convert it to UTF-8 using the encode call:
z = y.encode('utf-8')
I feel stuck here trying to change encodings with Python 2.5.
I have an XML response, which I encode to UTF-8: response.encode('utf-8'). That is fine, but the program which uses this info doesn't like this encoding and I have to convert it to another code page. A real example is that I use the ghostscript python module to embed pdfmark data into a PDF file, and the end result shows wrong characters in Acrobat.
I've tried numerous combinations of .encode() and .decode() between 'utf-8' and 'latin-1' and it drives me crazy, as I can't get correct output.
If I output the string to a file with .encode('utf-8') and then convert this file from UTF-8 to CP1252 (close to latin-1) with e.g. iconv.exe and embed the data, everything is fine.
Basically, can someone help me convert e.g. the character á, which is UTF-8 encoded as hex C3 A1, to latin-1 as hex E1?
Instead of .encode('utf-8'), use .encode('latin-1').
data="UTF-8 data"
udata=data.decode("utf-8")
data=udata.encode("latin-1","ignore")
Should do it.
Can you provide more details about what you are trying to do? In general, if you have a unicode string, you can use encode to convert it into a byte string with the appropriate encoding. E.g.:
>>> a = u"\u00E1"
>>> type(a)
<type 'unicode'>
>>> a.encode('utf-8')
'\xc3\xa1'
>>> a.encode('latin-1')
'\xe1'
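The session above is Python 2. A rough Python 3 equivalent of the UTF-8 to latin-1 conversion from the question is sketched below. It also encodes to cp1252, since the question mentions that code page: the two codecs agree for á but differ for some other characters, such as the euro sign.

```python
utf8_bytes = b'\xc3\xa1'                 # á encoded as UTF-8

text = utf8_bytes.decode('utf-8')        # bytes -> str
latin1_bytes = text.encode('latin-1')    # str -> latin-1 bytes
cp1252_bytes = text.encode('cp1252')     # same result for this character

print(latin1_bytes)   # b'\xe1'
print(cp1252_bytes)   # b'\xe1'

# The euro sign exists in cp1252 but not in latin-1.
euro_cp1252 = '\u20ac'.encode('cp1252')
print(euro_cp1252)    # b'\x80'
```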
If the previous answers do not solve your problem, check the source of the data that won't print/convert properly.
In my case, I was using json.load on data that had been read incorrectly from a file because I had not passed encoding="utf-8". Trying to de-/encode the resulting string to latin-1 just does not help...
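A small Python 3 sketch of that pitfall follows. It writes a UTF-8 JSON file to a temporary directory and reads it back twice, once with the right encoding and once with latin-1 standing in for a wrong default. The file name sample.json and the data are made up for illustration.

```python
import json
import os
import tempfile

data = {"char": "á"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")   # hypothetical file name

    # Write the JSON as UTF-8, keeping non-ASCII characters unescaped.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

    # Correct: read with the encoding the file was written in.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

    # Incorrect: a mismatched encoding silently produces mojibake.
    with open(path, encoding="latin-1") as f:
        misread = json.load(f)

print(loaded["char"])    # á
print(misread["char"])   # Ã¡
```

Passing the encoding explicitly when opening the file avoids depending on the platform default.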