I am currently learning Python and I came across the following code:
text=raw_input()
for letter in text:
x=[alpha_dic[letter]]
print x
When I write an umlaut (in the dictionary by the way) it gives me an error like -KeyError: '\xfc'- (for ü in this case) because the umlauts are saved internally in this way! I saw some solutions with unicode encoding or utf but either I am not skilled enough to apply it correctly or maybe it simply does not work that way.
You get some trouble from multiple shortcomings in Python (2.x).
raw_input() gives you raw bytes from the system with no encoding info
Native encoding for python strings is 'ascii', which cannot represent 'ü'
The encoding of the literal in your script is either ascii or needs to be declared in a header at the top of the file
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input(). It does what it says it does, providing you raw input from the console. Just bytes. But an 'ü' can be represented with a lot of different bytes 0xfc for ISO-8859-1, CP1252, CP850 etc., 0xc3 + 0xbc in UTF-8, 0x00 + 0xfc or 0xfc + 0x00 in UTF-16, and so on.
So your code has two issues with that:
for letter in text:
If text happens to be a simple byte string in a multibyte encoding (e.g UTF-8, UTF-16, some others), one-byte is not equal to one letter, so iterating like that over the string will not do what you expect. For a very simplified view of letter you might be able to do that kind of iteration with the python unicode strings (if properly normalized). So you need to make sure text is a unicode string first.
How to convert from a byte string to unicode? A bytestring offers the decode() method, which takes an encoding. A good first guess for that encoding is the piece of code here sys.stdin.encoding or locale.getpreferredencoding(True))
Putting things together:
alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over unicode string, not really letters...
for letter in utext:
x=[alpha_dic[letter]]
print x
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale
alpha_dict = {u"ü":"umlaut"}
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
x=[alpha_dict[unicode(letter)]]
print x
>>> ü
>>> ['umlaut']
Python 2 and unicode are not for the feint of heart...
Related
I have this function in python
Str = "ü";
print Str
def correctText( str ):
str = str.upper()
correctedText = str.decode('UTF8').encode('Windows-1252')
return correctedText;
corText = correctText(Str);
print corText
It works and converts characters like ü and é however it fails when i try � and ¶
Is there a way i can fix it?
According to UTF8, à and ¶ are not valid characters, meaning that don't have a number of bytes divisible by 4 (usually). What you need to do is either use some other kind of encoding or strip out errors in your str by using the unicode() function. I recommend using the ladder.
What you are trying to do is to compose valid UTF-8 codes by several consecutive Windows-1252 codes.
For example, for ü, the Windows-1252 code of à is C3 and for ¼ it's BC. Together the code C3BC happens to be the UTF-8 code of ü.
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is à followed by a non-printable character (non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For ¶ the Windows-1252 encoding is C2B6. Shouldn't it be ö, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
print 'naïve'
print '男孩'
fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into its equivalent that
follows it? That is, only unicode sequences should be written in
escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound but will break in Python 3 and will cause a headache when you manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
note the u prefix
Your code is now encoding agnostic.
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. No need for the coding statement (although REALLY strange that you don't want to use it...it's just a comment). The coding statement let's Python know the encoding of the source file, so it can decode Unicode strings correctly (u'xxxxx'). If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in string literals and constants in Python 2. Python 3, however, allows non-ASCII in variable names so this wouldn't work there.
import io
import re
def escape(m):
char = m.group(0).encode('utf8')
return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)
with io.open('sample.py',encoding='utf8') as f:
content = f.read()
new_content = re.sub(r'[^\x00-\x7f]',escape,content)
with io.open('sample_new.py','w',encoding='utf8') as f:
f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
question 1:
try to use:
print u'naïve'
print u'长者'
question 2:
If you type the sentences by keyboard and Chinese input software, everything should be OK. But if you copy and paste sentence from some web pages, you should consider other encode format such as GBK,GB2312 and GB18030
This snippet of Python 3 should convert your program correctly to work in Python 2.
def convertchar(char): #converts individual characters
if 32<=ord(char)<=126 or char=="\n": return char #if normal character, return it
h=hex(ord(char))[2:]
if ord(char)<256: #if unprintable ASCII
h=" "*(2-len(h))+h
return "\\x"+h
elif ord(char)<65536: #if short unicode
h=" "*(4-len(h))+h
return "\\u"+h
else: #if long unicode
h=" "*(8-len(h))+h
return "\\U"+h
def converttext(text): #converts a chunk of text
newtext=""
for char in text:
newtext+=convertchar(char)
return newtext
def convertfile(oldfilename,newfilename): #converts a file
oldfile=open(oldfilename,"r")
oldtext=oldfile.read()
oldfile.close()
newtext=converttext(oldtext)
newfile=open(newfilename,"w")
newfile.write(newtext)
newfile.close()
convertfile("FILE_TO_BE_CONVERTED","FILE_TO_STORE_OUTPUT")
First a simple remarl: as you are using byte strings in a Python2 script, the # -*- coding: utf-8 -*- has simply no effect. It only helps to convert the source byte string to an unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve' # source code is the bytestring `na\xc3\xafve'
# but utxt must become the unicode string u'na\xefve'
Simply it might be interpreted by clever editors to automatically use a utf8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: idenfying in a source file what is in a comment and in a string simply requires a Python parser... And AFAIK, if you use the parser of ast modules you will lose your comments except for docstrings.
But in Python 2, non ASCII characters are only allowed in comments and litteral strings! So you can safely assume that if the source file is a correct Python 2 script containing no litteral unicode string(*), you can safely transform any non ascii character in its Python representation.
A possible Python function reading a raw source file from a file object and writing it after encoding in another file object could be:
def src_encode(infile, outfile):
while True:
c = infile.read(1)
if len(c) < 1: break # stop on end of file
if ord(c) > 127: # transform high characters
c = "\\x{:2x}".format(ord(c))
outfile.write(c)
An nice property is that it works whatever encoding you use, provided the source file is acceptable by a Python interpreter and does not contain high characters in unicode litterals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode litterals in an encoding other that Latin1, because the above function will behave as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if original encoding is latin1 but as u'\xc3\xc9' (not what is expected...) if original encoding is utf8, and I cannot imagine a way to process correctly both litteral byte strings and unicode byte strings without fully parsing the source file...
I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?
Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.
As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
I already tried all previous answers and solution.
I am trying to use this value, which gave me encoding related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes error but changes the original content
original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' which converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' while using encode
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.
Suppose if I had a string with some unicode characters inside it, and we needed to do operations on it, what would be the best way to do so?
s = u"blah ascii_word etc شاهد word1 word 2" # Delimited by spaces
words = s.split(u' ')
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in
position 91: ordinal not in range(128)
Any clues?
Also, If I wanted to write this code into a text file and read it back later, what would be the procedure?
When you declare variable the way you do Python assumes it is in your default system encoding you have to add u before the string to make it unicode and add encoding declaration at the top of your file, if you do this you won't get any errors:
# -*- coding: utf-8 -*-
s = u"blah ascii_word etc شاهد word1 word 2"
words = s.split(u' ')
print words
# no error even tough my default system's encoding is ascii
I've checked this now and you don't even need the u - adding encoding is enough to fix the problem.
If you want to do things with unicode strings in the termainal you have to check your system encoding and change it if necessary:
>>> import sys
>>> sys.getdefaultencoding()
'ascii' #I have ascii
You can then manipulate this by using sys.setdefaultencoding(). But this is a tricky issue which depends on your operating system.