convert Chinese ascii string to Chinese language string - python

I tried to use the sys module to set the default encoding and convert the string, but it does not work.
The string is:
`\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf`
It means 益民核心增长混合 in Chinese. But how can I convert it to a Chinese language string?
I tried this:
>>> string = '\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf'
>>> print string.decode("gbk")
益民核心增长混合 # As you can see here, got the right answer
>>> new_str = string.decode("gbk")
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # It returns another encoding type.
>>> another = u"益民核心增长混合"
>>> another
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # same as new_str
So I am confused by this situation: why can I print string.decode("gbk"), but new_str in my Python console just comes back looking like another encoding?
My OS is Windows 10, my Python version is Python 2.7. Thank you very much!

You are doing it correctly.
In this case, new_str is actually a unicode string as denoted by the u prefix.
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # It returns another encoding type.
When you decode the GBK encoded string, you get a unicode string. Each character of this string is a unicode code point, e.g.
>>> u'\u76ca'
u'\u76ca'
>>> print u'\u76ca'
益
>>> import unicodedata
>>> unicodedata.name(u'\u76ca')
'CJK UNIFIED IDEOGRAPH-76CA'
>>> print new_str
益民核心增长混合
>>> print repr(new_str)
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408'
This is how the interactive interpreter displays unicode strings: it uses repr to show the value. But when you print the string, Python encodes it to your terminal's encoding (sys.stdout.encoding), and that's why the string is displayed as you expect.
So it's not a different encoding of the string; it's just the way Python displays unicode strings in the interpreter.
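If the goal is to get the readable text out of the interpreter (for example into a file), the same distinction applies: keep the value as a unicode object internally and encode it explicitly on the way out. Below is a minimal Python 2 sketch; the output file name and the UTF-8 target encoding are arbitrary choices for illustration, not something from the question:
import io
import sys

gbk_bytes = '\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf'
text = gbk_bytes.decode('gbk')      # a unicode object: u'\u76ca\u6c11...'

print text                          # encoded with sys.stdout.encoding for display
print repr(text)                    # the same value, shown as escaped code points
print sys.stdout.encoding           # e.g. 'cp936' on a Chinese-locale Windows console

# To keep the text outside the interpreter, encode it explicitly on the way out.
# 'out.txt' and the UTF-8 target are choices made for this sketch only.
with io.open('out.txt', 'w', encoding='utf-8') as fh:
    fh.write(text)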

Related

How to print Unicode like “u{variable}” in Python 2.7?

For example, I can print Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of unicode code points and I wish to display them on the console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
for codePoint in uniCodeFile:
print codePoint #I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0',16)): convert the hex string to an integer and print the character at that Unicode code point.
Caveat: on Windows (a narrow Python 2 build), code points > U+FFFF won't work with unichr.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
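Putting that suggestion together with the file from the question, a minimal Python 3.3+ sketch might look like this (the file name comes from the question; skipping blank lines is an assumption about the input):
with open("someFileWithAListOfUnicodeCodePoints") as code_point_file:
    for line in code_point_file:
        line = line.strip()
        if line:                          # skip blank lines, if any
            print(chr(int(line, 16)))     # e.g. '00E4' -> 'ä', '1F1E6' -> '🇦'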
These are unicode code points, but they lack the \u Python unicode-escape prefix. So just put it in:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
    for codePoint in uniCodeFile:
        print ("\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If it's a Windows code page and the characters are not in its range, you'll still get encoding errors.
In Python 3 that would be b"\\u".
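For reference, a Python 3 version of the same idea might look like the sketch below (the file name comes from the question). One caveat, which applies to the Python 2 code as well: \u consumes exactly four hex digits, so five-digit code points such as 1F1E6 need the \U escape padded to eight digits. The guard below is an addition of this sketch, not part of the original answer.
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uni_code_file:
    for code_point in uni_code_file:
        code_point = code_point.strip()
        if not code_point:
            continue
        if len(code_point) <= 4:
            escape = b"\\u" + code_point.rjust(4, b"0")   # e.g. b'\\u00e4'
        else:
            escape = b"\\U" + code_point.rjust(8, b"0")   # e.g. b'\\U0001F1E6'
        print(escape.decode("unicode-escape"))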

How to use Python to convert a unicode string to the real string [duplicate]

This question already has answers here: Chinese and Japanese character support in python (3 answers). Closed 7 years ago.
I have used Python to get some info through urllib2, but the info is a string full of unicode escape codes.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法,删除存储在
You need to convert the string to a unicode string.
First of all, in a plain (byte) string literal the \u sequences are not interpreted, so a contains literal backslashes (repr shows them doubled):
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use a unicode literal or convert the existing string into a unicode string.
To use a unicode literal, just add a u prefix in front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert the existing string into a unicode string, you can call unicode with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法,删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python-specific encoding which is used to:
Produce a string that is suitable as Unicode literal in Python source code.
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法,删除存储在
This involves recreating a valid JSON-encoded string by wrapping the given string in double quotes, then decoding the JSON string into a Python unicode string.
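If the data really does come from a JSON API fetched with urllib2, it is usually cleaner to let the json module do the decoding end to end, since json.load already returns unicode strings. A hedged Python 2 sketch; the URL and field name below are placeholders, not anything taken from the question:
import json
import urllib2

url = "http://example.com/api"      # placeholder URL
response = urllib2.urlopen(url)
data = json.load(response)          # string values arrive as unicode objects
print data[u"some_field"]           # prints readable text, no manual unescaping needed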

Behaviour of unicode strings in Python

I have seen this question. I have doubts about how I can convert a variable to unicode at run time.
Is it right to use the unicode function?
Is there another way to convert a string at run time?
print(u'Cami\u00f3n') # prints with right special char
name=unicode('Cami\u00f3n')
print(name) # prints bad ===> Cami\u00f3n
name.encode('latin1')
print(name.decode('latin1')) # prints bad ===> Cami\u00f3n
encoded_id = u'abcd\xc3\x9f'
encoded_id.encode('latin1').decode('utf8')
print encoded_id.encode('latin1').decode('utf8') # prints right
I have seen a lot of Python unicode questions on Stack Overflow, but I can't understand this behaviour.
It's just because, if you don't specify any encoding for the unicode function, then:
unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.
So you'll get a unicode object that still contains the literal backslash-u text (note the escaped backslash in the repr):
>>> name=unicode('Cami\u00f3n')
>>> print(name)
Cami\u00f3n
>>> name
u'Cami\\u00f3n'
^
To get rid of this problem, you can pass 'unicode-escape' as the encoding, so the escape sequences are actually interpreted instead of kept as literal text:
>>> name=unicode('Cami\u00f3n','unicode-escape')
>>> name
u'Cami\xf3n'
>>> print(name)
Camión
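If it helps to see that the run-time conversion and the unicode literal end up with the same value, here is a short Python 2 check using the literals from the question:
>>> unicode('Cami\u00f3n', 'unicode-escape') == u'Cami\u00f3n'
True
>>> unicode('Cami\u00f3n', 'unicode-escape')
u'Cami\xf3n'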

How can I print the decimal representation of a unicode string?

I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? i.e.:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the output you specified, I just found this:
>>> s.encode("unicode_escape")
b'\\u2744'
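Since the original goal was debugging failed comparisons, it can be handy to wrap this in a small helper that lists every code point. A hypothetical Python 3 sketch (the name unicode_print just echoes the function imagined in the question):
def unicode_print(s):
    """Return a readable listing of the code points in s, e.g. 'U+2744'."""
    return " ".join("U+{:04X}".format(ord(ch)) for ch in s)

print(unicode_print("❄"))         # U+2744
print(unicode_print("e\u0301"))   # U+0065 U+0301 -- makes combining characters visible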

How do I convert unicode code to string in Python?

One of my string variables contains unicode code \u0631\u064e\u062c\u0627.
I want to convert it to a string and see what characters were encoded by it.
Is it possible and how can I do it in python?
That's just an internal representation. If you print it, you will get what you want:
>>> print("\u0631\u064e\u062c\u0627")
رَجا
In short, \u0631\u064e\u062c\u0627 is just the escaped notation for the characters رَ and جا. If you tell Python to print them, you'll see them in their human-readable form.
Decode using unicode_escape encoding:
In Python 2.x:
>>> text = r'\u0631\u064e\u062c\u0627'
>>> print(text)
\u0631\u064e\u062c\u0627
>>> print(text.decode('unicode-escape'))
رَجا
In Python 3.x:
>>> text = r'\u0631\u064e\u062c\u0627'
>>> print(text.encode().decode('unicode-escape'))
رَجا
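One caveat about the Python 3 form above (this is an extra observation, not part of the original answer): .encode() uses UTF-8, so if the string already mixes real non-ASCII characters with literal \u sequences, those characters get mangled. A slightly more defensive variant is to encode with latin-1 and the backslashreplace error handler:
>>> text = "café " + r"\u0631\u064e\u062c\u0627"    # real characters mixed with escapes
>>> text.encode("latin-1", "backslashreplace").decode("unicode-escape")
'café رَجا'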
>>> print u"\u0631\u064e\u062c\u0627"
رَجا
