How do I convert unicode code to string in Python?

One of my string variables contains the Unicode escape sequences \u0631\u064e\u062c\u0627.
I want to convert them to a string and see which characters they encode.
Is this possible, and how can I do it in Python?

Those are just escape sequences, a way of writing the characters in source code. If you print the string, you will get what you want:
>>> print("\u0631\u064e\u062c\u0627")
رَجا
In short, the \uXXXX escapes are another way of writing the characters رَ and جا in a string literal. If you tell Python to print the string, you'll see them in their human-readable form.

If the variable holds literal backslashes (as a raw string does), decode it using the unicode_escape encoding:
In Python 2.x:
>>> text = r'\u0631\u064e\u062c\u0627'
>>> print(text)
\u0631\u064e\u062c\u0627
>>> print(text.decode('unicode-escape'))
رَجا
In Python 3.x:
>>> text = r'\u0631\u064e\u062c\u0627'
>>> print(text.encode().decode('unicode-escape'))
رَجا
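One caveat with the Python 3 one-liner above: unicode_escape decodes its input bytes as Latin-1, so text that already contains real non-ASCII characters next to the escapes can come out mangled. A minimal sketch of a helper that avoids this, assuming Python 3 (the name decode_escapes is made up for illustration):

```python
def decode_escapes(text: str) -> str:
    """Turn literal backslash escapes such as \\u0631 into the characters they name."""
    # unicode_escape reads bytes as Latin-1, so encode with Latin-1 and let
    # backslashreplace rewrite anything above U+00FF as an escape first.
    raw = text.encode("latin-1", errors="backslashreplace")
    return raw.decode("unicode_escape")


escaped = r"\u0631\u064e\u062c\u0627"   # literal backslashes: 4 escapes of 6 characters each
print(len(escaped))                     # 24
print(decode_escapes(escaped))          # the Arabic word from the question
print(len(decode_escapes(escaped)))     # 4
```

Any backslash in the input is treated as the start of an escape, so this only suits text that is known to be escape-encoded.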

In Python 2.x, use a unicode literal (note the u prefix):
>>> print u"\u0631\u064e\u062c\u0627"
رَجا

Related

convert Chinese ascii string to Chinese language string

I tried to use the sys module to set the default encoding in order to convert the string, but it does not work.
The string is:
`\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf`
It means 益民核心增长混合 in Chinese. How can I convert it to a Chinese string?
I tried this:
>>> string = '\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf'
>>> print string.decode("gbk")
益民核心增长混合 # As you can see here, got the right answer
>>> new_str = string.decode("gbk")
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # This looks like a different encoding.
>>> another = u"益民核心增长混合"
>>> another
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # same as new_str
So I am confused by this: why does print string.decode("gbk") show the Chinese text, while evaluating new_str in my Python console shows what looks like another encoding?
My OS is Windows 10 and my Python version is 2.7. Thank you very much!
You are doing it correctly.
In this case, new_str is actually a unicode string as denoted by the u prefix.
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408'
When you decode the GBK encoded string, you get a unicode string. Each character of this string is a unicode code point, e.g.
>>> u'\u76ca'
u'\u76ca'
>>> print u'\u76ca'
益
>>> import unicodedata
>>> unicodedata.name(u'\u76ca')
'CJK UNIFIED IDEOGRAPH-76CA'
>>> print new_str
益民核心增长混合
>>> print repr(new_str)
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408'
This is how Python displays unicode strings in the interpreter - it is using repr to display it. But when you print the string, Python converts to the encoding for your terminal (sys.stdout.encoding), and that's why the string is displayed as you expect.
So, it's not a different encoding of the string, it's just the way that Python displays the string in the interpreter.
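The same distinction can be reproduced in Python 3, where byte strings and text are separate types. The sketch below is an illustration, not part of the original answer: it decodes the GBK bytes from the question and uses the built-in ascii() to get the escape notation that Python 2's repr showed for unicode strings.

```python
raw = b"\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf"  # GBK bytes from the question
text = raw.decode("gbk")          # bytes -> str (a sequence of code points)

print(text)                       # the readable Chinese text
print(ascii(text))                # escape notation, like Python 2's repr of a unicode string
print(len(raw), len(text))        # 16 8

# Both spellings denote the same eight-character string.
same = text == "\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408"
print(same)
```

In a Python 3 console, evaluating text directly already shows the Chinese characters, because repr no longer escapes printable non-ASCII characters.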

How to use Python to convert a unicode string to the real string

I have used Python to get some info through urllib2, but the info comes back as a string of unicode escapes.
I've tried the following:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法，删除存储在
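You need to convert the string to a unicode string.
First of all, a is a plain byte string, and Python 2 does not interpret \u escapes in byte-string literals, so the backslashes stay in the string as literal characters: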
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # In the interpreter, shows '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So encoding or decoding this string with a regular codec makes no difference.
You can either use a unicode literal or convert the string into a unicode string.
To use a unicode literal, just add a u in front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert an existing string into a unicode string, call unicode with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法，删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, unicode_escape is a Python-specific encoding which, according to the documentation, is used to "produce a string that is suitable as Unicode literal in Python source code".
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法，删除存储在
This involves recreating a valid JSON-encoded string by wrapping the string a in double quotes, then decoding that JSON string into a Python unicode string.
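For reference, the same trick can be sketched in Python 3, where a raw string literal reproduces the literal backslashes. The payload at the end is a made-up example, added only to show that parsing a whole JSON document decodes the escapes without any manual wrapping.

```python
import json

a = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"   # literal backslashes, as received

# Wrap the fragment in double quotes so it becomes a valid JSON string, then parse it.
decoded = json.loads('"{}"'.format(a))
print(decoded)

# If the fragment came out of a larger JSON document, parsing the whole document
# decodes every escape at once. This payload is hypothetical.
payload = '{"title": "\\u65b9\\u6cd5"}'
print(json.loads(payload)["title"])
```

The wrapping step fails if the fragment itself contains an unescaped double quote, which is one reason to parse the original response rather than a piece cut out of it.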

Behaviour of unicode strings in Python

I have seen this question, but I still have doubts: how can I convert a variable to unicode at runtime?
Is it right to use the unicode function?
Are there other ways to convert a string at runtime?
print(u'Cami\u00f3n') # prints with right special char
name=unicode('Cami\u00f3n')
print(name) # prints bad ===> Cami\u00f3n
name.encode('latin1')
print(name.decode('latin1')) # prints bad ===> Cami\u00f3n
encoded_id = u'abcd\xc3\x9f'
encoded_id.encode('latin1').decode('utf8')
print encoded_id.encode('latin1').decode('utf8') # prints right
I have seen a lot of Python unicode questions on Stack Overflow, but I can't understand this behaviour.
That's because, if you don't specify an encoding for the unicode function, then (quoting the documentation):
unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.
Because 'Cami\u00f3n' is a plain byte string, Python 2 leaves the \u escape uninterpreted, so unicode() simply returns those same characters, backslash included:
>>> name=unicode('Cami\u00f3n')
>>> print(name)
Cami\u00f3n
>>> name
u'Cami\\u00f3n'
      ^
To get rid of this problem, pass 'unicode-escape' as the encoding so that the escape sequence is decoded into the character it names:
>>> name=unicode('Cami\u00f3n','unicode-escape')
>>> name
u'Cami\xf3n'
>>> print(name)
Camión
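The last snippet in the question (encoded_id) works for a different reason: the string holds UTF-8 bytes that were mistakenly decoded as Latin-1, and encoding back to Latin-1 recovers those bytes unchanged. A short Python 3 sketch of that round trip, offered as an illustration rather than as part of the original answer:

```python
mangled = "abcd\xc3\x9f"                       # UTF-8 bytes for "ß" that were decoded as Latin-1
repaired = mangled.encode("latin-1").decode("utf-8")

print(repaired)                                 # abcdß
print(len(mangled), len(repaired))              # 6 5
```

This works because Latin-1 maps code points 0 to 255 one-to-one onto byte values, so no information is lost on the way back to bytes.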

How can I print the decimal representation of a unicode string?

I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? i.e.:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the output you specified, use the unicode_escape codec:
>>> s.encode("unicode_escape")
b'\\u2744'
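Since the question asks for decimal values and mentions look-alike characters, a per-character listing may be easier to debug with than one escaped string. A minimal sketch, assuming Python 3; the helper name describe is made up for illustration:

```python
import unicodedata


def describe(text: str) -> list:
    """Return one (decimal, hex, name) tuple per code point in text."""
    return [
        (ord(ch), "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for row in describe("\u2744"):
    print(row)        # (10052, 'U+2744', 'SNOWFLAKE')

# Two strings that render alike but differ in code points.
print(describe("\u00e9"))      # one precomposed character
print(describe("e\u0301"))     # letter e followed by a combining accent
```

Comparing the two listings side by side shows where otherwise identical-looking strings diverge.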

How can I convert Arabic Unicode

I'm working with an Arabic light stemmer package in Python.
I want to display the result of any operation as Arabic letters rather than Unicode escapes.
My code:
import tashaphyne
from tashaphyne import *
>>> text = u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
I want it to display "العربية", not its Unicode escapes.
You can encode a unicode string into any supported encoding using the encode() method, like so:
text.encode('utf8')
The Python 2.7 documentation lists the available encodings.
You see u'\u0627\u0644\u0639\u0631\u0628\u064a\u0629' instead of "العربية" because the repr() of a unicode string is meant to be displayable even on 7-bit terminals.
To see the actual letters instead of the escapes, either do print _ right after your call to strip_tashkeel(), or do print strip_tashkeel(text) directly.
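The difference between the two views can be tried without the stemmer package. The sketch below assumes Python 3 and uses a standard-library stand-in that drops combining marks; it is not tashaphyne's strip_tashkeel, and the helper name strip_combining_marks is hypothetical.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop every character with a non-zero canonical combining class."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# A diacritized spelling like the one in the question, written with escapes.
word = "\u0627\u0644\u0652\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651\u0629\u064f"
bare = strip_combining_marks(word)

print(ascii(bare))   # escape notation
print(bare)          # the same string shown as Arabic letters
```

Both print calls show the same seven-character string; only the notation differs.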
