I would like to display Unicode characters without using print. For example:
>>> print 'é'
é
The character is displayed correctly with print, but when I evaluate the string without print I get an unwanted result:
>>> 'é'
'\xc3\xa9'
And the expected result is 'é'
EDIT
The reason I need this is that I'm writing a scraper with the Scrapy framework, and I'm crawling a website with Unicode characters. When I start crawling, the log displays something like this:
\u06a9\u06cc\u0644\u0648 \u0645\u062a\u0631 \u0628\u0631 \u0633\u0627\u0639\u062a\r\n\r\n
I've tried the built-in unicode function, and I've added the header
# -*- coding: utf-8 -*-
But without any results
Python 3 could be your solution: its str type is Unicode, and the interactive prompt echoes printable non-ASCII characters as they are, so 'é' is displayed as 'é'.
I'm not sure why you want to do this, but the print statement converts objects to strings using str(), whereas the interactive prompt echoes the value of an expression using repr().
For a Python 2 byte string, repr() escapes every non-ASCII byte, which is why you see '\xc3\xa9' (the UTF-8 encoding of é) instead of the character itself.
https://docs.python.org/2/reference/simple_stmts.html#grammar-token-print_stmt
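The difference between what print shows and what the prompt echoes can be reproduced with str() and repr(). The following is a minimal sketch that assumes Python 3 (not the Python 2 used in the question), where ascii() gives an escaped form similar in spirit to what Python 2 echoes. The variable names are illustrative only.

```python
# Python 3 sketch: what print() writes versus what the prompt echoes.
text = '\u00e9'                     # the character é

printed = str(text)                 # what print(text) would write
echoed = repr(text)                 # what the prompt shows for the bare expression
escaped = ascii(text)               # ASCII-only repr with escapes
utf8_bytes = text.encode('utf-8')   # the two bytes behind Python 2's '\xc3\xa9'

print(escaped)      # '\xe9'
print(utf8_bytes)   # b'\xc3\xa9'
```

In Python 3, echoed is the three-character string 'é' including its quotes, which is the behaviour the question asks for.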
I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib.parse
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces are converted properly, but I expected the "è" to be converted into %E8, and it is returned as %C3%A8. What am I missing?
I am using Python 3.6.
quote() encodes your string as UTF-8 by default, and the URL-encoded string reflects this.
0xC3 0xA8 is the UTF-8 encoding of the Unicode code point U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
To get the string you are after, you need to tell Python which code page to use, like this:
my_text=bytes("City of Liège",'cp1252')
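As an alternative to converting to bytes first, quote() accepts an encoding argument. The sketch below assumes Python 3 and that the receiving server really expects cp1252-encoded URLs, which is an assumption about the target site and not something the question confirms.

```python
from urllib.parse import quote

my_text = "City of Li\u00e8ge"   # "City of Liège"

# Default behaviour: the text is encoded as UTF-8 before percent-encoding.
utf8_url = quote(my_text, safe='')
# Explicit single-byte code page: è becomes the single byte 0xE8.
cp1252_url = quote(my_text, safe='', encoding='cp1252')

print(utf8_url)     # City%20of%20Li%C3%A8ge
print(cp1252_url)   # City%20of%20Li%E8ge
```

Both results are valid percent-encodings; which one is correct depends on the encoding the server uses to decode the URL.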
My strings look like this \\xec\\x88\\x98, but if I print them they look like this \xec\x88\x98, and when I decode them they look like this \xec\x88\x98
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.
If I x.decode('unicode-escape') it removes the double slashes, but when decoding the value returned by x.decode('unicode-escape'), the value I get is ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')
print repr(s)
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
output
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
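Python 3 has no 'string-escape' codec, so the same repair takes a slightly different route there. The following is a sketch under that assumption, using 'unicode_escape' followed by a Latin-1 round trip; the variable names are illustrative.

```python
import unicodedata

src = '\\xec\\x88\\x98'   # 12 characters with literal backslashes, not 3 bytes

# Step 1: interpret the backslash escapes; each \xNN becomes code point U+00NN.
unescaped = src.encode('ascii').decode('unicode_escape')
# Step 2: Latin-1 maps code points 0-255 straight back to the same byte values.
raw = unescaped.encode('latin-1')
# Step 3: those bytes are really UTF-8, so decode them as such.
char = raw.decode('utf-8')

print(repr(raw))                # b'\xec\x88\x98'
print(unicodedata.name(char))   # HANGUL SYLLABLE SU
print(ascii(char))              # '\uc218'
```

Stopping after step 1 and treating the result as text is what produces the unwanted "ì" described in the question.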
I'm wondering how to get the Unicode representation of Arabic strings like سلام in Python?
The result should be \u0633\u0644\u0627\u0645
I need that so that I can compare texts retrieved from mysql db and data stored in redis cache.
Assuming you have an actual Unicode string, you can do
# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')
output
\u0633\u0644\u0627\u0645
The # -*- coding: utf-8 -*- directive only tells the interpreter that the source code is UTF-8 encoded; it has no bearing on how the script itself handles Unicode.
If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:
\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85
You can convert that to Unicode like this:
data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')
output
سلام
\u0633\u0644\u0627\u0645
Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
Note that
'\u0633\u0644\u0627\u0645'
is a plain (byte) string containing 24 bytes, whereas
u'\u0633\u0644\u0627\u0645'
is a Unicode string containing 4 Unicode characters.
You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
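The distinction between the escaped text and the real Unicode string can be checked directly. This is a small sketch assuming Python 3, where every str is a Unicode string, so the u prefix is no longer needed.

```python
s = '\u0633\u0644\u0627\u0645'   # the four-character Arabic word from the question

# The escaped form is plain ASCII text: 4 escapes of 6 characters each.
escaped = s.encode('unicode_escape').decode('ascii')
print(escaped)          # \u0633\u0644\u0627\u0645
print(len(s))           # 4
print(len(escaped))     # 24

# The UTF-8 bytes a database or cache would typically hand back.
utf8_bytes = s.encode('utf-8')
print(utf8_bytes)       # b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# Decoding the escapes restores the original string.
round_trip = escaped.encode('ascii').decode('unicode_escape')
print(round_trip == s)  # True
```

For comparing values from two stores, decoding both sides to Unicode strings first is usually simpler than comparing escaped forms.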
Since you're using Python 2.x, calling encode on a plain byte string won't help: Python first tries to decode it as ASCII, which fails for non-ASCII bytes. Instead, use the unicode function to convert the byte string to a unicode object.
> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll
# keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام
I'm not sure what library you're using to fetch the content, but you might be able to fetch the data as unicode initially.
> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
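For Python 2.7, pass the encoding explicitly (assuming the byte string is UTF-8 encoded); without it, unicode() assumes ASCII and raises UnicodeDecodeError on this string:
string = 'سلام'
new_string = unicode(string, 'utf-8')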
Prepend your string with u in python 2.x, which makes your string a unicode string. Then you can call the encode method of a unicode string.
arabic_string = u'سلام'
arabic_string.encode('utf-8')
Output at the interactive prompt:
>>> arabic_string.encode('utf-8')
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
I am trying to use the string Düsseldorf. When I do this:
# -*- coding: utf-8 -*-
print "Düsseldorf"
it prints strange characters. Could anyone help me, please?
Thank you very much.
>>> print u"Düsseldorf"
Düsseldorf
"Unicode In Python, Completely Demystified"
Most likely, your editor is not set to produce UTF-8 output. Setting it to output UTF-8 should fix the problem.
Alternatively, use unicode escapes:
print u"D\u00FCsseldorf"
Note that string literals in Python 2.x should be prefixed with a u (for unicode). Unprefixed literals (like "Düsseldorf") generate str objects, which are byte arrays (despite the name), not strings. Therefore, in Python 2.x with a correctly configured editor, you want:
print u"Düsseldorf"
In Python 3.x, the situation has been rectified by letting str objects represent, well, strings, and introducing the bytes type for byte arrays, as in b'D\xc3\xbcsseldorf'.
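The str and bytes split can be seen in a few lines. This sketch assumes Python 3; the Latin-1 decode at the end is only an illustration of how the "strange characters" can arise when UTF-8 bytes are read with the wrong codec, not a diagnosis of the asker's setup.

```python
city = 'D\u00fcsseldorf'          # 'Düsseldorf' as a Python 3 str

# Encoding turns the text into bytes; ü takes two bytes in UTF-8.
encoded = city.encode('utf-8')
print(encoded)                    # b'D\xc3\xbcsseldorf'
print(len(city), len(encoded))    # 10 11

# Decoding with the wrong codec yields two wrong characters in place of ü.
garbled = encoded.decode('latin-1')
print(ascii(garbled))             # 'D\xc3\xbcsseldorf'
```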
In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
Î±
I was hoping for 'α'.
After some digging, I found that if I type the character directly, I get the raw bytes:
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and Unicode handling knowledge really needs some help...
Try putting the Unicode prefix u before your string, e.g. u'YOUR_ALPHA_CHAR', and check your database encoding, because Django itself supports UTF-8.
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1, so p.name should have displayed as u'\u03b1' or, on a Unicode-capable terminal, print should have shown the alpha symbol itself. Note the difference between u'\xce\xb1' and u'\u03b1': the former is a two-character string and the latter is a single-character string. The two code points 0xCE and 0xB1 are exactly the bytes of alpha's UTF-8 encoding, which suggests the UTF-8 bytes were decoded with a single-byte encoding such as Latin-1 somewhere along the way.
You can turn any byte sequence into the internal unicode representation with the decode method:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
Î±
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.
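The Latin-1 round trip above can be written out step by step. This is a minimal sketch assuming Python 3 and assuming the stored value really is UTF-8 bytes that were misread as Latin-1; if a different single-byte encoding was involved, the first step would need that codec instead.

```python
stored = '\u00ce\u00b1'           # two characters, like u'\xce\xb1' in Python 2

# Latin-1 maps each code point below 256 back to the byte with the same value.
raw = stored.encode('latin-1')    # b'\xce\xb1'
# Those two bytes are the UTF-8 encoding of a single character.
repaired = raw.decode('utf-8')

print(len(stored), len(repaired)) # 2 1
print(ascii(repaired))            # '\u03b1'
```

This repairs data that is already damaged; finding where the wrong decoding happens remains the better long-term fix.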