Eliminate accents in python

I have this function to remove accents from a word:
import string
import unicodedata

def remove_accents(word):
    return ''.join(x for x in unicodedata.normalize('NFKD', word) if x in string.ascii_letters)
But when I run it, it raises an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 3: ordinal not in range(128)
The character in position 3 is: ó

If your input is a unicode string, it works:
>>> remove_accents(u"foóbar")
u'foobar'
If it isn't, it doesn't. I don't get the error you describe; I get a TypeError instead, and I only get the UnicodeDecodeError if I try to convert the input to unicode first:
>>> remove_accents(unicode("foóbar"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If that is your problem, i.e. you have Python 2 str objects as an input, you can solve it by decoding it as utf-8 first:
>>> remove_accents("foóbar".decode("utf-8"))
u'foobar'
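For readers on Python 3, where str is always Unicode, a minimal sketch of the same idea follows. It swaps the ascii_letters filter for unicodedata.combining, so spaces, digits and punctuation survive. The bytes branch and its utf-8 default are assumptions added for illustration, not part of the original function.

```python
import unicodedata


def remove_accents(word, encoding="utf-8"):
    """Return word with combining accent marks stripped (Python 3 sketch)."""
    # Assumption: byte input is UTF-8 unless the caller says otherwise.
    if isinstance(word, bytes):
        word = word.decode(encoding)
    # NFKD splits "ó" into "o" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks and keep every other character.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("foóbar"))
print(remove_accents("foóbar".encode("utf-8")))
```

Characters that have no decomposition (for example "ø" or "ß") pass through unchanged with this approach, so it is not a full transliteration.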

Related

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3: invalid start byte

I am using Python 3, but I found this error in the server log while converting between strings and bytes:
b'\x00\x01_\x97'.decode()
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
b'\x00\x01_\x97'.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3: invalid start byte
How can I convert a string to its byte value?

You need to specify the encoding explicitly, for example Latin-1 (for which "Latin" is an accepted alias):
>>> b'\x00\x01_\x97'.decode("Latin")
'\x00\x01_\x97'
>>> type(b'\x00\x01_\x97'.decode("Latin"))
<class 'str'>
>>>
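Decoding as Latin-1 always succeeds because every byte value maps to a code point, but that does not by itself mean the result is the intended text. The sketch below (Python 3) compares three options for the same bytes. That the data might have come from a Windows-1252 source is purely an assumption for illustration.

```python
raw = b"\x00\x01_\x97"

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so decoding never fails and encoding restores the original bytes.
as_latin1 = raw.decode("latin-1")
assert as_latin1.encode("latin-1") == raw

# If the data really came from a Windows-1252 source (an assumption),
# byte 0x97 is an em dash (U+2014) in that code page.
as_cp1252 = raw.decode("cp1252")

# Keep UTF-8 but substitute U+FFFD for bytes that are not valid UTF-8.
as_utf8_lossy = raw.decode("utf-8", errors="replace")

print(ascii(as_latin1), ascii(as_cp1252), ascii(as_utf8_lossy))
```

If the bytes are binary data rather than text, it may be better to keep them as bytes and not decode them at all.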

Resolving ascii codec can't decode byte in position ordinal not in range

I've seen all of the other posts and done quite a bit of research but I am still scratching my head.
Here is the problem:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'My Mate\u2019s'
>>> b='\xe2\x80\x99s BBQ'
>>> print a
My Mate’s
>>> print b
’s BBQ
So each variable prints fine on its own, but printing their concatenation:
>>> print a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
gives a decode error. So, I try to decode the string:
>>> print a.decode('utf-8')+b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
The error changes to an encode error. So I try a few ways to specify the encoding:
>>> print a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('ascii','ignore')+b.decode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>>
The error persists no matter what I try.
I suppose the problem might be very simple. I'd appreciate someone helping with an explanation of what's going on, and how to resolve this.
I have Python 2.7 on Ubuntu.

b is encoded as UTF-8 so you have to .decode it to Unicode.
print a + b.decode('utf-8')
Tested in Python 2.7.6 on Ubuntu.
If you want both in UTF-8 you can do:
print a.encode('utf-8') + b
Here is why each of your attempts doesn't work:
a + b # the implicit decoding of b uses ascii, which cannot decode UTF-8 bytes
a.decode('utf-8')+b.decode('utf-8') # you don't need to decode Unicode
In the next attempt you again decode Unicode, which you don't need to do. Calling .decode on a unicode object makes Python 2 first encode it with the ascii codec, and that implicit step is what raises the UnicodeEncodeError:
a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
Here too you are decoding Unicode, when you should either encode a or decode b:
a.decode('ascii','ignore')+b.decode('ascii','ignore')
And in the last attempt you again try to decode Unicode. The point is that UTF-8 is an encoding: you decode from UTF-8 to Unicode.
a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
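The two fixes above carry over to Python 3, with one difference: Python 3 never decodes implicitly, so mixing the two types fails with a TypeError instead of a UnicodeDecodeError. A minimal sketch, reusing the values from the question:

```python
a = "My Mate\u2019s"          # text (str)
b = b"\xe2\x80\x99s BBQ"      # UTF-8 encoded bytes

# Option 1: decode the bytes, then join text with text.
as_text = a + b.decode("utf-8")

# Option 2: encode the text, then join bytes with bytes.
as_bytes = a.encode("utf-8") + b

# Mixing str and bytes directly is rejected outright in Python 3.
try:
    a + b
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(ascii(as_text))
print(as_bytes)
```

Either option works. Which one fits depends on whether the rest of the program handles text or bytes at that point.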

ast.literal_eval somehow throwing UnicodeDecodeError

OK so, as explained in "Python UnicodeDecodeError - Am I misunderstanding encode?":
a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
a Python 2.x string gets decoded to a Unicode string
I've got this python 2.7 code
try:
    print '***'
    print type(relationsline)
    relationsline = relationsline.decode("ascii", "ignore")
    print type(relationsline)
    relationsline = relationsline.encode("ascii", "ignore")
    print type(relationsline)
    relations = ast.literal_eval(relationsline)
except ValueError:
    return
except UnicodeDecodeError:
    return
The last line in the code above sometimes throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
I would think that this would (1) start with a string in some (unknown) encoding, (2) decode it into a unicode object, ignoring every byte that can't be decoded as ASCII, and (3) encode that unicode object back into an ASCII string, ignoring every character that can't be represented in ASCII.
Here is the full stack trace:
Traceback (most recent call last):
File "outputprocessor.py", line 69, in <module>
getPersonRelations(lines, fname)
File "outputprocessor.py", line 41, in getPersonRelations
relations = ast.literal_eval(relationsline)
File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/lib/python2.7/ast.py", line 37, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
^
SyntaxError: invalid syntax
But that is clearly wrong somewhere. Even more perplexing is that the except UnicodeDecodeError clause is not catching the UnicodeDecodeError. What am I missing? Maybe this is the problem? http://bugs.python.org/issue22221

Look at the stack trace closer. It is throwing a SyntaxError.
You are trying to literal_eval the string "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)". You can encode/decode that string all you want, but ast won't know what to do with it - that's clearly not a valid python literal.
See:
>>> import ast
>>> ast.literal_eval('''UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)''')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/lib/python2.7/ast.py", line 37, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 341: ordinal not in range(128)
^
SyntaxError: invalid syntax
I would look at the source of whatever is passing these strings to your function; it's generating some bogus input.

You are trying to literal_eval a traceback message that is contained in the passed-in string.
You will need to move your literal_eval call into its own try/except, catch the exception (SyntaxError) in your original try block, or filter the input somehow.
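The suggestion to give literal_eval its own try/except could look roughly like this. It is a sketch in Python 3 syntax, and parse_relations is a hypothetical helper name, not something from the original code.

```python
import ast


def parse_relations(line):
    """Return the Python literal in line, or None if it is not one.

    parse_relations is a hypothetical helper, not part of the original code.
    """
    try:
        return ast.literal_eval(line)
    except (ValueError, SyntaxError):
        # ValueError: valid syntax but not a plain literal, e.g. "foo()".
        # SyntaxError: not parseable as an expression, e.g. an error message.
        return None


print(parse_relations("[('alice', 'bob')]"))
print(parse_relations("UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc"))
```

Returning None for bad input is just one choice. Note that it makes a line containing the literal None indistinguishable from a rejected line, so logging or re-raising may suit some callers better.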

How to read oslo config value in unicode

I have some Japanese characters in my nova.conf file:
クラ
After reading the value from the config file I have to decode it from UTF-8, like this:
my_data = CONF.test.test
my_data = my_data.decode('utf-8')
When I use the variable without decoding it, I get a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
Is there any way to read data from config directly in decoded form?

Is there any way to read data from config directly in decoded form?
Not in Python 2.7. Before Python 3, plain strings (str) are byte strings, and any implicit conversion to Unicode uses the ASCII codec, whereas in Python 3 strings are Unicode by default. So basically:
>>> mydata = "クラ"
>>> print mydata.decode('utf-8')
クラ
>>> print mydata
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
whereas in python 3:
>>> mydata = "クラ"
>>> print(mydata)
クラ
So if you want to handle Unicode strings painlessly, it's time for you to make the switch.
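Until such a switch is possible, one way to keep the decoding in a single place is a small wrapper applied to every value read from the config. The sketch below is generic: to_text is a hypothetical helper, the UTF-8 default is an assumption about how the file is encoded, and no oslo.config API is used.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it only if it is a byte string.

    Hypothetical helper; assumes the config file is UTF-8 encoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"\xe3\x82\xaf\xe3\x83\xa9"   # the UTF-8 bytes for "クラ"
my_data = to_text(raw)
assert my_data == u"\u30af\u30e9"
assert to_text(my_data) == my_data   # already text: returned unchanged
```

Because bytes is an alias of str in Python 2.7, the same function should behave the same way there, returning a unicode object for byte input.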

Why do some strings encode in utf-16, while others only encode in utf-8?

>>> unicode('восстановление информации', 'utf-16')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
>>> unicode('восстановление информации', 'utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
Why do these Russian words encode in UTF-8 fine, but not UTF-16?

You are asking the unicode function to decode a byte string and then giving it the wrong encoding.
Pasting your string into Python 2.7 on OS X gives:
>>> 'восстановление информации'
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
At this stage it is already a UTF-8 encoded byte string (your terminal's encoding probably determined this), so you can decode it by specifying the utf-8 codec:
>>> 'восстановление информации'.decode('utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
But not with UTF-16, because those bytes are not valid UTF-16:
>>> 'восстановление информации'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
If you want to encode a unicode string to UTF-8 or UTF-16, then use .encode:
>>> u'восстановление информации'.encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04B\x040\x04=\x04>\x042\x04;\x045\x04=\x048\x045\x04 \x008\x04=\x04D\x04>\x04#\x04<\x040\x04F\x048\x048\x04'
>>> u'восстановление информации'.encode('utf-8')
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
Notice the input strings are unicode (have a u at the front), but the outputs here are byte-strings (they don't have u at the start) which contain the unicode data encoded in the respective formats.
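The same distinction in Python 3 terms, as a short sketch: str is text, bytes is encoded data, and bytes must be decoded with the codec that produced them.

```python
text = "восстановление информации"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")   # starts with a byte order mark

# Decoding with the codec that produced the bytes round-trips cleanly.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16") == text

# Decoding UTF-8 bytes as UTF-16 fails: the 49 bytes cannot be split
# evenly into 2-byte code units, hence "truncated data".
try:
    utf8_bytes.decode("utf-16")
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = exc

print(len(utf8_bytes), wrong_codec_error)
```

Had the UTF-8 byte count been even, the UTF-16 decode might have succeeded and silently produced unrelated characters, which is arguably worse than an exception.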
