I have this function in python
Str = "ü";
print Str
def correctText( str ):
str = str.upper()
correctedText = str.decode('UTF8').encode('Windows-1252')
return correctedText;
corText = correctText(Str);
print corText
It works and converts characters like ü and é however it fails when i try � and ¶
Is there a way i can fix it?
According to UTF8, à and ¶ are not valid characters, meaning that don't have a number of bytes divisible by 4 (usually). What you need to do is either use some other kind of encoding or strip out errors in your str by using the unicode() function. I recommend using the ladder.
What you are trying to do is to compose valid UTF-8 codes by several consecutive Windows-1252 codes.
For example, for ü, the Windows-1252 code of à is C3 and for ¼ it's BC. Together the code C3BC happens to be the UTF-8 code of ü.
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is à followed by a non-printable character (non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For ¶ the Windows-1252 encoding is C2B6. Shouldn't it be ö, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
Related
I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?
Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.
As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
I use python 2.7.10.
On dealing with character encoding, and after reading a lot of stack-overflow etc. etc. on the subject, I encountered this behaviour which looks strange to me. Python interpreter input
>>>u'\u00b0'
results in the following output:
u'\xb0'
I could repeat this behaviour using a dos window, the idle console, and the wing-ide python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
Python doc say, a string literal with u-prefix is encoded using unicode.
Question: Why is the result converted to a unicode-string-literal with a byte-escape-sequence which matches the latin-1 encoding, instead of persisting the unicode escape sequence ?
Thanks in advance for any help.
Python uses some rules for determining what to output from repr for each character. The rule for Unicode character codepoints in the 0x0080 to 0x00ff range is to use the sequence \xdd where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters will be displayed without converting to a hex code.
As for why it looks like Latin-1 encoding, it's because Unicode started with Latin-1 as the base. All the codepoints up to 0xff match their Latin-1 counterpart.
I am currently learning Python and I came across the following code:
text=raw_input()
for letter in text:
x=[alpha_dic[letter]]
print x
When I write an umlaut (in the dictionary by the way) it gives me an error like -KeyError: '\xfc'- (for ü in this case) because the umlauts are saved internally in this way! I saw some solutions with unicode encoding or utf but either I am not skilled enough to apply it correctly or maybe it simply does not work that way.
You get some trouble from multiple shortcomings in Python (2.x).
raw_input() gives you raw bytes from the system with no encoding info
Native encoding for python strings is 'ascii', which cannot represent 'ü'
The encoding of the literal in your script is either ascii or needs to be declared in a header at the top of the file
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input(). It does what it says it does, providing you raw input from the console. Just bytes. But an 'ü' can be represented with a lot of different bytes 0xfc for ISO-8859-1, CP1252, CP850 etc., 0xc3 + 0xbc in UTF-8, 0x00 + 0xfc or 0xfc + 0x00 in UTF-16, and so on.
So your code has two issues with that:
for letter in text:
If text happens to be a simple byte string in a multibyte encoding (e.g UTF-8, UTF-16, some others), one-byte is not equal to one letter, so iterating like that over the string will not do what you expect. For a very simplified view of letter you might be able to do that kind of iteration with the python unicode strings (if properly normalized). So you need to make sure text is a unicode string first.
How to convert from a byte string to unicode? A bytestring offers the decode() method, which takes an encoding. A good first guess for that encoding is the piece of code here sys.stdin.encoding or locale.getpreferredencoding(True))
Putting things together:
alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over unicode string, not really letters...
for letter in utext:
x=[alpha_dic[letter]]
print x
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale
alpha_dict = {u"ü":"umlaut"}
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
x=[alpha_dict[unicode(letter)]]
print x
>>> ü
>>> ['umlaut']
Python 2 and unicode are not for the feint of heart...
With ord(ch) you can get a numerical code for character ch up to 127. Is there any function that returns a number from 0-255, so to cover also ISO 8859-1 characters?
Edit: Follows my last version of code and error I get
#!/usr/bin/python
# coding: iso-8859-1
import sys
reload(sys)
sys.setdefaultencoding('iso-8859-1')
print sys.getdefaultencoding() # prints "iso-8859-1"
def char_code(c):
return ord(c.encode('iso-8859-1'))
print char_code(u'à')
I get an error:
TypeError: ord() expected a character, but string of length 2 found
When you're starting with a Unicode string, you need to encode rather than decode.
>>> def char_code(c):
return ord(c.encode('iso-8859-1'))
>>> print char_code(u'à')
224
For ISO-8859-1 in particular, you don't even need to encode it at all, since Unicode uses the ISO-8859-1 characters for its first 256 code points.
>>> print ord(u'à')
224
Edit: I see the problem now. You've given a source code encoding comment that indicates the source is in ISO-8859-1. However, I'll bet that your editor is actually working in UTF-8. The source code will be mis-interpreted, and the single-character string you think you created will actually be two characters. Try the following to see:
print len(u'à')
If your encoding is correct, it will return 1, but in your case it's probably 2.
You can get ord() for anything. As you might expect, ord(u'💩') works fine, provided you can represent the character properly in your source, and/or read it in a known encoding.
Your error message vaguely suggests that coding: iso-8859-1 is not actually true, and the file's encoding is actually something else (UTF-8 or UTF-16 would be my guess).
The canonical must-read on character encoding in Python is http://nedbatchelder.com/text/unipain.html
You can still use ord(), but you have to decode it.
Like this:
def char_code(c):
return ord(c.decode('iso-8859-1'))
y0 I have this problem that characters that include ñ or ŕ í á ú etc are discarded when I apply
text = text.encode('ascii', 'ignore')
to a function that needs the input to be ascii.
is there a way to force it to ascii without losing those characters or should I change the function to accept unicode characters?
http://dpaste.com/601417/
The 'ascii' encoding can't represent the characters you refer to. You have to choose a different encoding — perhaps 'cp850' or 'latin_1' — but then you have to be sure that your output terminal interprets 8-bit codes using the relevant code page.
On balance, life is easier if you just go Unicode all the way.
Yes, you should go for another encoding, if you need those characters (for example Unicode). See ascii table for all chars that are included in ascii.