I'm wondering how to get the Unicode representation of Arabic strings like سلام in Python?
The result should be \u0633\u0644\u0627\u0645
I need that so that I can compare texts retrieved from mysql db and data stored in redis cache.
Assuming you have an actual Unicode string, you can do
# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')
output
\u0633\u0644\u0627\u0645
The # -*- coding: utf-8 -*- directive only tells the interpreter that the source code is UTF-8 encoded; it has no bearing on how the script itself handles Unicode.
If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:
\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85
You can convert that to Unicode like this:
data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')
output
سلام
\u0633\u0644\u0627\u0645
Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
Note that
'\u0633\u0644\u0627\u0645'
is a plain (byte) string containing 24 bytes, whereas
u'\u0633\u0644\u0627\u0645'
is a Unicode string containing 4 Unicode characters.
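For comparison, here is a minimal sketch of the same distinction in Python 3, where str is always Unicode text and bytes is a separate type. The variable names are illustrative only and are not part of the original answer.

```python
# Python 3 sketch: str holds Unicode characters, bytes holds raw data.
text = '\u0633\u0644\u0627\u0645'        # the 4 characters of the Arabic word
escaped = text.encode('unicode-escape')  # ASCII bytes spelling out \uXXXX escapes

print(len(text))                 # 4
print(len(escaped))              # 24
print(escaped.decode('ascii'))   # \u0633\u0644\u0627\u0645

# Decoding the escaped bytes restores the original text.
assert escaped.decode('unicode-escape') == text
```

The escaped form is plain ASCII, which makes it convenient for comparing or logging text regardless of terminal settings.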
You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
Since you're using Python 2.x, you can't call encode directly on the byte string: Python 2 first tries to decode it as ASCII, which fails for non-ASCII bytes. Use the unicode function to convert the byte string to a unicode object instead.
> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll
# keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام
I'm not sure what library you're using to fetch the content, but you might be able to fetch the data as unicode initially.
> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
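For Python 2.7, pass the encoding explicitly (unicode(string) on its own assumes ASCII and fails on these bytes):
string = 'سلام'
new_string = unicode(string, 'utf-8')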
Prepend your string with u in python 2.x, which makes your string a unicode string. Then you can call the encode method of a unicode string.
arabic_string = u'سلام'
arabic_string.encode('utf-8')
Output:
print repr(arabic_string.encode('utf-8'))
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
    print 'naïve'
    print '男孩'
fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into its equivalent that follows it? That is, only unicode sequences should be written in escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound, but it will break in Python 3 and will cause headaches when you manipulate and write your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
Note the u prefix.
Your code is now encoding-agnostic.
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. There is no need for the coding statement (although it is REALLY strange that you don't want to use it... it's just a comment). The coding statement lets Python know the encoding of the source file, so it can decode Unicode strings correctly (u'xxxxx'). If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in comments and string literals in Python 2. Python 3, however, allows non-ASCII in variable names, so this wouldn't work there.
import io
import re
def escape(m):
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)
with io.open('sample.py',encoding='utf8') as f:
    content = f.read()
new_content = re.sub(r'[^\x00-\x7f]',escape,content)
with io.open('sample_new.py','w',encoding='utf8') as f:
    f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
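If the converter itself has to run under Python 3, the same approach might look roughly like the sketch below. The function name escape_non_ascii is hypothetical, and the sketch works on a string rather than on files. The main difference is that iterating over a bytes object in Python 3 yields integers, so ord() is not needed.

```python
import re

def escape_non_ascii(source_text):
    """Replace each non-ASCII character with \\xNN escapes of its UTF-8 bytes."""
    def escape(match):
        utf8_bytes = match.group(0).encode('utf-8')
        # Iterating over bytes yields ints in Python 3.
        return ''.join('\\x{:02x}'.format(byte) for byte in utf8_bytes)
    return re.sub(r'[^\x00-\x7f]', escape, source_text)

line = "print 'na\u00efve'"
print(escape_non_ascii(line))   # print 'na\xc3\xafve'
```

The escaped output is pure ASCII, so it can be written to the new file with any ASCII-compatible encoding.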
Question 1:
Try using:
print u'naïve'
print u'长者'
Question 2:
If you type the sentences with a keyboard and Chinese input software, everything should be OK. But if you copy and paste sentences from web pages, you should consider other encodings such as GBK, GB2312 and GB18030.
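This snippet of Python 3 should convert your program to an ASCII-only form for Python 2. Note that the \u and \U escapes it emits are only interpreted inside unicode literals (u'...') in Python 2; in plain byte strings they stay as literal backslash sequences.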
def convertchar(char): #converts individual characters
    if 32<=ord(char)<=126 or char=="\n": return char #if normal character, return it
    h=hex(ord(char))[2:]
    if ord(char)<256: #if control or Latin-1 character
        h=h.zfill(2) #pad with zeros; padding with spaces gives an invalid escape
        return "\\x"+h
    elif ord(char)<65536: #if short unicode
        h=h.zfill(4)
        return "\\u"+h
    else: #if long unicode
        h=h.zfill(8)
        return "\\U"+h
def converttext(text): #converts a chunk of text
    newtext=""
    for char in text:
        newtext+=convertchar(char)
    return newtext
def convertfile(oldfilename,newfilename): #converts a file
    oldfile=open(oldfilename,"r",encoding="utf-8") #assumes the source is UTF-8
    oldtext=oldfile.read()
    oldfile.close()
    newtext=converttext(oldtext)
    newfile=open(newfilename,"w")
    newfile.write(newtext)
    newfile.close()
convertfile("FILE_TO_BE_CONVERTED","FILE_TO_STORE_OUTPUT")
First, a simple remark: as you are using byte strings in a Python 2 script, the # -*- coding: utf-8 -*- line simply has no effect. It would only help to convert the source byte string to a unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve' # source code is the bytestring 'na\xc3\xafve'
# but utxt must become the unicode string u'na\xefve'
At most, it might be interpreted by clever editors to automatically use a UTF-8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: identifying what is in a comment and what is in a string in a source file simply requires a Python parser... And AFAIK, if you use the parser from the ast module, you will lose your comments except for docstrings.
But in Python 2, non-ASCII characters are only allowed in comments and literal strings! So if the source file is a correct Python 2 script containing no literal unicode strings(*), you can safely transform any non-ASCII character into its Python representation.
A possible Python function reading a raw source file from a file object and writing it after encoding in another file object could be:
def src_encode(infile, outfile):
    while True:
        c = infile.read(1)
        if len(c) < 1: break # stop on end of file
        if ord(c) > 127: # transform high characters
            c = "\\x{:02x}".format(ord(c))
        outfile.write(c)
A nice property is that it works whatever encoding you use, provided the source file is acceptable to a Python interpreter and does not contain high characters in unicode literals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode literals in an encoding other than Latin1, because the above function will behave as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if the original encoding is latin1, but as u'\xc3\xa9' (not what is expected...) if the original encoding is utf8. I cannot imagine a way to correctly process both literal byte strings and literal unicode strings without fully parsing the source file...
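The footnote's point can be checked with a short Python 3 sketch (the variable names are illustrative only): the same character has different byte sequences in Latin-1 and UTF-8, so a byte-by-byte escape only reproduces the right unicode literal in the Latin-1 case.

```python
# Python 3 sketch: one character, two encodings, different bytes.
char = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = char.encode('latin-1')
utf8_bytes = char.encode('utf-8')

print(latin1_bytes)   # b'\xe9'
print(utf8_bytes)     # b'\xc3\xa9'

# Treating the UTF-8 bytes as Latin-1 (which is what a byte-wise escape
# inside a unicode literal amounts to) yields two characters, not one.
misread = utf8_bytes.decode('latin-1')
print(len(misread))   # 2
```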
This topic is already on StackOverflow but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server, and I have some hardcoded strings in the code which I'd like to match against. I understand why I can't just use ==, but I haven't succeeded in converting them properly (I don't care whether I have to do str -> unicode or unicode -> str).
I tried encode and decode, but it didn't give any result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is German!)
How can I make them equal in Python 2?
First make sure you declare the encoding of your Python source file at the top of the file. Eg. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, eg:
# use codecs.open to open a text file
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
Code which compares byte strings with Unicode strings will often fail at random, depending on system settings, or whatever encoding happens to be used for a text file. Don't rely on it, always make sure you compare either two unicode strings or two byte strings.
Python 3 changed this behaviour; it will not try to convert any strings. 'a' and b'a' are considered objects of different types, and comparing them with == always returns False.
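A minimal Python 3 sketch of that rule, assuming the bytes are UTF-8 encoded (the variable names echo the question but are otherwise illustrative):

```python
# Python 3 sketch: bytes and str never compare equal, so decode first.
from_server = 'F\u00fchrerschein n\u00f6tig'   # text (str)
from_code = from_server.encode('utf-8')        # raw bytes, assumed UTF-8

print(from_code == from_server)                  # False
print(from_code.decode('utf-8') == from_server)  # True
```

The same principle applies in Python 2: decode the byte string with its real encoding, then compare two unicode objects.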
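Tested on 2.7.
This assumes the byte string is latin-1 encoded (for example, a source file saved and declared as latin-1); decode with whichever encoding the bytes actually use.
if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
    print('yes....')
yes....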
code
a = "한글" #korean language
a_list = []
a_list.append({'key': a})
print a_list
result
[{'key': u'"\ud55c\uae00"'}]
I don't want to convert unicode.
How can I stay in korean language
I wish to print like this
[{'key': '한글'}]
Your code from the question produces:
[{'key': '\xed\x95\x9c\xea\xb8\x80'}]
This output is different from what you have shown in the question.
To produce: [{"key": "한글"}] you could use json:
print json.dumps(a_list, ensure_ascii=False, encoding=your_source_code_encoding)
Full example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
a = "한글" # you should use u"" literals to work with Unicode strings
a_list = []
a_list.append({'key': a})
print json.dumps(a_list, ensure_ascii=False) # "utf-8" encoding is default
Output
[{"key": "한글"}]
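For reference, a rough Python 3 equivalent is sketched below. In Python 3, json.dumps takes no encoding argument because str is already Unicode; the variable names simply mirror the example above.

```python
import json

a = '\ud55c\uae00'  # the two Hangul syllables from the question
a_list = [{'key': a}]

# ensure_ascii=False keeps the characters as-is instead of \uXXXX escapes.
readable = json.dumps(a_list, ensure_ascii=False)
escaped = json.dumps(a_list)

print(readable)
print(escaped)   # [{"key": "\ud55c\uae00"}]
```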
You wrote:
I dont want to convert unicode. How can I stay in korean language
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The main takeaway is if you are working with text you must specify its encoding.
The most convenient and reliable way is to use Unicode strings throughout your program i.e., decode bytes that you read to Unicode strings as early as possible on input and encode to bytes while writing Unicode strings as late as possible on output.
To enforce that convention all strings are Unicode in Python 3. Python 2 unfortunately allows you to use bytestrings for both text and data with all confusion that it causes.
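The "decode early, encode late" convention can be sketched as follows in Python 3. The function name shout and the choice of UTF-8 are assumptions made purely for illustration.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: upper-case text that arrives and leaves as bytes."""
    text = raw_bytes.decode(encoding)   # decode as early as possible on input
    result = text.upper()               # work on Unicode text only
    return result.encode(encoding)      # encode as late as possible on output

incoming = 'k\u00f6ppler'.encode('utf-8')
print(shout(incoming).decode('utf-8'))
```

Keeping all processing on Unicode text means operations such as upper() act on whole characters rather than on individual bytes.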
What difference does it make to your application if you have unicode string? If you do not want the u prefix, you could use Python3 where the strings by default are unicode.
Python file
# -*- coding: UTF-8 -*-
a = 'Köppler'
print a
print a.__class__.__name__
mydict = {}
mydict['name'] = a
print mydict
print mydict['name']
Output:
Köppler
str
{'name': 'K\xc3\xb6ppler'}
Köppler
It seems that the name remains the same, but only when printing a dictionary I get this strange escaped character string. What am I looking at then? Is that the UTF-8 representation?
The reason for that behavior is that the __repr__ function in Python 2 escapes non-ASCII unicode characters. As the link shows, this is fixed in Python 3.
Yes, that's the UTF-8 representation of ö (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS). It consists of a 0xC3 octet followed by a 0xB6 octet. UTF-8 is a very elegant encoding, I think, and worth reading up on. The history of its design (on a placemat in a diner) is described here by Rob Pike.
As far as I know, there are two functions in Python for displaying objects: str() and repr(). str() is used internally by print; however, dict's str() apparently uses repr() for keys and values.
As it has been mentioned: repr() escapes unicode characters.
It seems you are using Python 2.x, where you have to specify that the object is actually a unicode string and not a plain byte string. You declared that the source is UTF-8, so you actually typed 2 bytes for your ö, and as it is a regular string, you got the 2 escaped bytes.
Try declaring it as unicode: a = u'Köppler'. You may need to encode it before printing, depending on your console encoding: print a.encode('utf-8')
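The str() versus repr() behaviour described in these answers can be reproduced in Python 3 with a bytes object, which plays the role of Python 2's plain str. This is only a sketch for illustration; the variable name is arbitrary.

```python
# Python 3 sketch: bytes behaves like Python 2's byte string here.
name = 'K\u00f6ppler'.encode('utf-8')

print(repr(name))             # b'K\xc3\xb6ppler'
print(str({'name': name}))    # {'name': b'K\xc3\xb6ppler'}

# The data itself is intact; decoding gives back the readable text.
print(name.decode('utf-8'))
```

The escaped output is only how the container displays its contents; the stored value is unchanged.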
I have a string that looks like so:
6Â 918Â 417Â 712
The clear-cut way to trim this string (as I understand Python), assuming the string is in a variable called s, is simply:
s.replace('Â ', '')
That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.
I never quite could understand how to switch between different encodings.
Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:
#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
The code:
f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
#making a print 's' here goes well. it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)
It gets no further than s.replace...
Throw out all characters that can't be interpreted as ASCII:
def remove_non_ascii(s):
    return "".join(c for c in s if ord(c)<128)
Keep in mind that this is guaranteed to work with the UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
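The property relied on here can be checked with a small Python 3 sketch working on bytes directly. The helper name remove_non_ascii_bytes is hypothetical.

```python
def remove_non_ascii_bytes(data):
    """Drop every byte with the high bit set from a UTF-8 byte string."""
    return bytes(b for b in data if b < 128)

sample = 'na\u00efve \u7537\u5b69'.encode('utf-8')

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so filtering
# byte by byte never leaves half of a character behind.
non_ascii = [b for b in sample if b >= 0x80]
print(len(non_ascii))                  # 8
print(remove_non_ascii_bytes(sample))  # b'nave '
```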
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use quotes. In Python 2, you can use from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware that this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you're using the return value as well.
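The last point is easy to miss, so here is a minimal Python 3 sketch of it. The sample value assumes the separators are NO-BREAK SPACE characters (U+00A0).

```python
s = '6\u00a0918\u00a0417\u00a0712'   # digits separated by NO-BREAK SPACE

s.replace('\u00a0', '')              # result discarded; s is unchanged
print(s == '6918417712')             # False

s = s.replace('\u00a0', '')          # keep the returned string
print(s)                             # 6918417712
```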
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
The following code will replace all non ASCII characters with question marks.
"".join([x if ord(x) < 128 else '?' for x in s])
Using Regex:
import re
strip_unicode = re.compile("([^-_a-zA-Z0-9!##%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string as s.decode('utf-8') (\xa0 displays as a space when decoded incorrectly as Windows-1252 or latin-1):
Example (Python 3)
s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE
Output
6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "")
print s
This will print out 6 918 417 712
I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).
Usage : str.translate(table[, deletechars])
>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )
>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6 918 417 712'
Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.
With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely : unicode_string.encode("ascii", "ignore")
As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:
trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)
The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
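As a rough illustration of that idea in Python 3, where str.translate takes a dict keyed by code point, a tiny mapping might look like this. The table below is a hypothetical, deliberately incomplete example and would need extending for real data.

```python
# Hypothetical mapping from accented code points to ASCII replacements.
accent_table = {
    0x00E9: 'e',  # e with acute
    0x00E8: 'e',  # e with grave
    0x00F6: 'o',  # o with diaeresis
    0x00FC: 'u',  # u with diaeresis
}

def unaccent(text):
    """Replace the mapped accented characters; leave everything else alone."""
    return text.translate(accent_table)

print(unaccent('R\u00e9sultat'))   # Resultat
```

Characters missing from the table pass through unchanged, so the result is only guaranteed to be ASCII if the table covers everything in the input.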
s.replace(u'Â ', '') # u before string is important
and save your .py file in a Unicode encoding such as UTF-8.
This is a dirty hack, but may work.
s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i
For what it's worth, my character set was utf-8 and I had included the classic "# -*- coding: utf-8 -*-" line.
However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.
My text had two words, separated by "\r\n". I was only splitting on the \n and replacing the "\n".
Once I looped through and saw the character set in question, I realized the mistake.
So, it could also be within the ASCII character set, but a character that you didn't expect.
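A short Python 3 sketch of that pitfall, with made-up sample text: splitting on "\n" alone leaves a stray "\r" behind, whereas splitlines() handles both conventions. Listing the code points of non-alphanumeric characters is one way to spot the culprit.

```python
text = 'first\r\nsecond'

print(text.split('\n'))    # ['first\r', 'second']
print(text.splitlines())   # ['first', 'second']

# Show the code points of anything that is not a letter or digit.
suspects = [hex(ord(c)) for c in text if not c.isalnum()]
print(suspects)            # ['0xd', '0xa']
```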
My 2 pennies, with Beautiful Soup:
string='<span style="width: 0px> dirty text begin ( ĀĒēāæśḍṣ <0xa0> ) dtext end </span></span>'
string=string.encode().decode('ascii',errors='ignore')
print(string)
will give
<span style="width: 0px> dirty text begin ( ) dtext end </span></span>