UnicodeDecodeError in Python with codecs module - python

I have a text file which comprises unicode strings "aBiyukÙwa", "varcasÙva" etc. When I try to decode them in the python interpreter using the following code, it works fine and decodes to u'aBiyuk\xd9wa':
"aBiyukÙwa".decode("utf-8")
But when I read it from a file in a python program using the codecs module in the following code it throws a UnicodeDecodeError.
file = codecs.open('/home/abehl/TokenOutput.wx', 'r', 'utf-8')
for row in file:
Following is the error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd9 in position 8: invalid continuation byte
Any ideas what is causing this strange behavior?

Your file is not encoded in UTF-8. Find out what it is encoded in, and then use that.
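As a rough illustration of that advice, the sketch below writes the word from the question to a temporary file as Latin-1 bytes (an assumption; the real file could be in another single-byte encoding) and shows that reading it as UTF-8 fails while reading it as Latin-1 works. It uses the built-in open() of Python 3, which takes the same encoding names as codecs.open; the sample bytes and temporary file are hypothetical.

```python
import os
import tempfile

# Hypothetical sample: the word from the question, stored as Latin-1 bytes.
raw = b"aBiyuk\xd9wa\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".wx") as tmp:
    tmp.write(raw)
    path = tmp.name

# Reading as UTF-8 fails: 0xd9 announces a two-byte sequence,
# but the next byte ("w") is not a continuation byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the encoding the bytes were written in succeeds.
with open(path, "r", encoding="latin-1") as f:
    rows = [row.rstrip("\n") for row in f]

os.remove(path)
print(utf8_ok, rows)
```

If Latin-1 produces the expected characters for your data, that is a strong hint, but it is worth confirming how the file was actually produced.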

Related

How to open/read a DBF file on python?

I am getting an error when I write this code
from dbfread import DBF
for record in DBF("filename.dbf"):
    print(record)
and the error that I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 1: ordinal not in range(128)
The error message you're encountering suggests that the 'ascii' codec is unable to decode some characters in the file you're trying to read. This is most likely because the file uses a different encoding that contains characters not representable in the ASCII character set.
To resolve this issue, you can try specifying the encoding used in the file when you create the DBF object. For example:
from dbfread import DBF
for record in DBF("filename.dbf", encoding='utf8'):
    print(record)
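Note that the actual encoding of your file might be different. Byte 0xfc is never valid in UTF-8, so for this particular file a legacy code page such as cp1252 (where 0xfc is ü) is a more likely candidate. Check how the file was produced and pass the matching encoding.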
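When the encoding is unknown, one way to narrow it down is to decode a sample of the raw bytes with a few candidates and look at the results. The sketch below uses only the standard library; try_encodings and the sample bytes are hypothetical names and data, not part of dbfread.

```python
def try_encodings(raw, candidates):
    """Return {encoding: decoded text} for each candidate that decodes raw without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; skip it.
            pass
    return results


sample = b"M\xfcller"  # hypothetical field value containing byte 0xfc
decoded = try_encodings(sample, ["ascii", "utf-8", "cp1252", "cp850"])
for name, text in decoded.items():
    print(name, repr(text))
```

A decode that succeeds is not proof that the encoding is right: single-byte code pages accept almost any byte, so you still have to check that the decoded text looks sensible.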

I am facing an error when I try to load a file in Python 3

f = open(path,'r',encoding='utf8')
This is the code I'm trying to run but it outputs 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte as the error. What might be the reason for this?
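'utf8' and 'utf-8' are aliases for the same codec, so changing the spelling will not help. Byte 0x80 can never start a UTF-8 sequence, so the file is not encoded in UTF-8: it is either in another encoding or not a text file at all.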
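To see what the file really contains, it can help to read the first bytes in binary mode, which never decodes anything. The sketch below does that on a hypothetical temporary file that starts with 0x80, and also shows errors="replace", which lets a text-mode read finish by substituting U+FFFD for undecodable bytes.

```python
import os
import tempfile

# Hypothetical file whose first byte is 0x80, as in the error message.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x80abc")
    path = tmp.name

# Binary mode never decodes, so it is safe for inspecting the raw bytes.
with open(path, "rb") as f:
    head = f.read(16)

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    text = f.read()

os.remove(path)
print(head, repr(text))
```

errors="replace" is a diagnostic aid rather than a fix, because the replaced bytes are lost. Once you know the real encoding, pass that instead.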

Python Encoding Error when writing to file

I want to write some strings to a file. They are not in English; they are in the Azeri language. Even though I encode them as UTF-8, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)
The piece of my code that writes to the file is the following:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w.decode('utf-8'))
new_file.write('\n')
EDIT
Even if I make code as:
t_w = text_list[y].encode('ascii',errors='ignore')
new_file.write(t_w)
new_file.write('\n')
I get following error which is :
TypeError: write() argument must be str, not bytes
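From what I can tell, encode('utf-8') followed by decode('utf-8') is a round trip that gives back the original string, so it does not change what gets written. The error comes from write() itself: the file was opened in text mode with a default encoding that is apparently ASCII, which cannot represent some Azeri characters. Either open the file with an explicit encoding, e.g. open(path, 'w', encoding='utf-8'), and write text_list[y] directly, or open it in binary mode ('wb') and write the encoded bytes with new_file.write(t_w).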
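A minimal sketch of the text-mode approach in Python 3, assuming text_list holds ordinary str values (the sample words and the output path are made up for illustration):

```python
import os
import tempfile

# Hypothetical stand-in for text_list from the question.
text_list = ["Azərbaycan", "gözəl", "şəhər"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.txt")

    # An explicit encoding makes write() independent of the locale default.
    with open(path, "w", encoding="utf-8") as new_file:
        for line in text_list:
            new_file.write(line)
            new_file.write("\n")

    # Read the file back with the same encoding to check the round trip.
    with open(path, encoding="utf-8") as f:
        written = f.read().splitlines()

print(written)
```

With the encoding given to open(), no manual encode() or decode() calls are needed.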

decode following urls in python

I have a URL like this:
http://idebate.org/debatabase/debates/constitutional-governance/house-supports-dalai-lama%E2%80%99s-%E2%80%98third-way%E2%80%99-tibet
Then I used following script in python to decode this url:
full_href = urllib.unquote(full_href.encode('ascii')).decode('utf-8')
However, I got an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 89: ordinal not in range(128)
when trying to write to a file.
Just as @KevinJ.Chase pointed out, you were most likely trying to write a string containing non-ASCII characters to a file that expects ASCII.
You can either change the encoding used when writing the file, or keep full_href as an undecoded byte string, something like this:
# don't decode again to utf-8
full_href = urllib.unquote(url.encode('ascii'))
# ... then write full_href to your file stream
or,
...
# encode your string to a compatible encoding on write, i.e. utf-8
with open('yourfilenamehere', 'w') as f:
    f.write(full_href.encode('utf-8'))
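The snippets above are Python 2. For comparison, here is a sketch of the same idea in Python 3, where urllib.unquote has moved to urllib.parse.unquote and decodes percent-escapes as UTF-8 by default. The output file name is a placeholder.

```python
import os
import tempfile
from urllib.parse import unquote

url = ("http://idebate.org/debatabase/debates/constitutional-governance/"
       "house-supports-dalai-lama%E2%80%99s-%E2%80%98third-way%E2%80%99-tibet")

# unquote() returns str and interprets the escaped bytes as UTF-8.
full_href = unquote(url)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder file name; the explicit encoding avoids the ASCII default.
    out_path = os.path.join(tmp_dir, "urls.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(full_href + "\n")
    with open(out_path, encoding="utf-8") as f:
        saved = f.read().strip()

print(full_href)
```

The curly quotes survive because both the decoding step and the file write agree on UTF-8.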

Want the code to read ' instead of ’

I am trying to convert a csv file to a json file. The whole code runs fine but when I encounter the statement:
json.dump(DictName, out_file)
I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 15: invalid start byte
Would someone please be able to help?
TIA.
I found the solution. While parsing the string, I converted the string to unicode using the unicode() function:
unicode(stringname, errors='replace')
and it replaced all the erroneous symbols.
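Note that errors='replace' substitutes the replacement character U+FFFD for each bad byte, so the apostrophe itself is lost. If the CSV was saved as Windows-1252 (an assumption, though byte 0x92 is the curly apostrophe ’ in that code page), decoding with cp1252 keeps the character, and it can then be mapped to a plain '. A sketch in Python 3 with a hypothetical cell value:

```python
import json

raw = b"It\x92s here"  # hypothetical CSV cell containing byte 0x92

# cp1252 maps 0x92 to U+2019, the right single quotation mark.
text = raw.decode("cp1252")

# Swap the curly apostrophe for a plain ASCII one.
plain = text.replace("\u2019", "'")

dumped = json.dumps({"cell": plain})
print(dumped)
```

When reading the CSV from disk, the same effect can be had by opening the file with encoding="cp1252" before handing it to the csv module.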
