UnicodeDecodeError 'utf8' codec can't decode byte 0xb0 - python

I have code that goes recursively through some folders, along the lines of
for root, subFolders, files in os.walk(str(rootdir)):
Running the program I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 37: invalid start byte.
I have changed the rootdir path to narrow down where the error starts, and found that some folders within the path I actually want to use are completely fine while others return the error. The thing is, all subdirectories either only contain folders or contain basically the same files, so I don't know where the error is coming from or how to fix it.
Please help.
The error appears on a line where I use an outside package, but the package imports fine, the code is fine, and it works whenever the Unicode error doesn't appear. The failing line reads a .xml file in the folder; is that file the one with the problem? (It shouldn't be, since they are all created with the same program, and if one were wrong then all should be wrong, not just a few.)
Edit: to actually test my code you'd have to install pymatgen (you can with pip) and get a vasprun.xml file. Highly improbable, hence why I didn't include it at the beginning.
Code (last line with the error)
from pymatgen.electronic_structure.dos import CompleteDos, add_densities, Dos
from pymatgen.electronic_structure.core import Spin, Orbital
from pymatgen.io.vasp.outputs import Vasprun, Procar
vasprun = Vasprun(root+"/vasprun.xml")
Error:
Traceback (most recent call last):
File "an.py", line 196, in <module>
vasprun = Vasprun(root+"/vasprun.xml")
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 383, in __init__
self.update_potcar_spec(parse_potcar_file)
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 829, in update_potcar_spec
potcar = get_potcar_in_path(os.path.split(self.filename)[0])
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 813, in get_potcar_in_path
pc = Potcar.from_file(os.path.join(p, fn))
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/inputs.py", line 1704, in from_file
fdata = reader.read()
File "/usr/lib64/python2.7/codecs.py", line 314, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 37: invalid start byte

The file is apparently not UTF-8 encoded. If it has an XML declaration that specifies UTF-8 (or does not specify an encoding), then you need to replace it. If there is no XML declaration, you should try adding one.
A correct XML declaration will need to specify the actual character set, probably <?xml version="1.0" encoding="iso-8859-1" ?>, or possibly some other ISO encoding.
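As a concrete illustration of the above, here is a minimal sketch, assuming the file really is ISO-8859-1, that rewrites (or prepends) the XML declaration; root comes from the asker's os.walk loop:

# A minimal sketch, assuming the file is actually ISO-8859-1: make the XML
# declaration name the real encoding so the parser stops assuming UTF-8.
path = root + "/vasprun.xml"

with open(path, "rb") as f:
    data = f.read()

decl = b'<?xml version="1.0" encoding="iso-8859-1" ?>'
if data.lstrip().startswith(b"<?xml"):
    # Replace the existing declaration (everything up to the first "?>").
    data = decl + data[data.index(b"?>") + 2:]
else:
    data = decl + b"\n" + data

with open(path, "wb") as f:
    f.write(data)

Note that in this particular traceback the decode error is actually raised inside Potcar.from_file, i.e. while pymatgen reads a POTCAR-style file found next to vasprun.xml, so that neighbouring file may be the one that is not UTF-8; passing parse_potcar_file=False to Vasprun (the argument visible in the traceback) should skip that step.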

Related

How to avoid a UTF-8 encoding error

I've got a problem reading text files. When I start the program and add a file, it throws an error:
Traceback (most recent call last):
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 38, in <module>
main_func()
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 32, in main_func
read_file()
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 15, in read_file
for i in f.read():
File "C:\Users\Marcin\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 19: invalid start byte
In my code there is a line with encoding="UTF-8". How do I solve this problem? The code is below:
files = input("File name: ")
try:
    with open(files, "r", encoding="UTF-8") as f:
        for i in f.read():
            print(i, end='')
except FileNotFoundError:
    print("FileNotFoundError")
There is nothing wrong with the program itself. You are getting this error because you are trying to read a file that is not UTF-8-encoded as though it were UTF-8. You have to either convert the contents of the file to UTF-8 or specify a different encoding (the one the file actually uses) in the call to open.
The file is not encoded as UTF-8; try encoding="iso-8859-1".
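A minimal variant of the snippet above, as a sketch; substitute whichever encoding the file actually uses (iso-8859-1 is the suggestion above):

files = input("File name: ")
try:
    # Same loop as above, but decoding with the encoding the file actually uses
    # (iso-8859-1 here; swap in another codec if the text comes out garbled).
    with open(files, "r", encoding="iso-8859-1") as f:
        for i in f.read():
            print(i, end='')
except FileNotFoundError:
    print("FileNotFoundError")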

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

import os
import shutil
import codecs

directory = '~/Desktop/ra/clean_tokenized/1987'

for filename in os.listdir(directory):
    full_name = directory + '/' + filename
    with open(full_name, 'r') as article:
        for line in article:
            print(line)
Here's the traceback:
Traceback (most recent call last):
File "~/Desktop/corpus_filter/01_corpus.py", line 11, in
for line in article:
File "~/.conda/envs/MangerRA/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
The files contain Japanese characters and I'm just trying to make a CSV file with all the words that come up in them, but I can't get past this error.
Python is trying to open your file using the UTF-8 encoding (which is the default most of the time these days). Unfortunately, your file is using some other encoding (or is otherwise corrupted), and so the decoding fails.
Unfortunately, I can't tell what encoding your file uses. You'll have to investigate that yourself. You might try another encoding like Shift JIS (using open(full_name, 'r', encoding='shift-jis')), and see if you get valid text or mojibake.
If all else fails, you can open the file in binary mode ('rb' rather than just 'r'), and check out what is located at byte 3131 and immediately afterwards. It may be just a messed up bit of data in the file that you can delete or fix manually.
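A sketch of both suggestions, reusing the directory layout from the question (the Shift JIS guess is from the answer above; byte 3131 is the position reported in the traceback):

import os

# os.path.expanduser is needed because open() does not expand '~' by itself.
directory = os.path.expanduser('~/Desktop/ra/clean_tokenized/1987')

for filename in os.listdir(directory):
    full_name = os.path.join(directory, filename)
    try:
        # First guess: Shift JIS, since the files contain Japanese text.
        with open(full_name, 'r', encoding='shift_jis') as article:
            for line in article:
                print(line)
    except UnicodeDecodeError:
        # Fall back to binary mode and inspect the bytes around the failure point.
        with open(full_name, 'rb') as article:
            data = article.read()
        print(filename, data[3120:3140])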

Error encoding a 'UCS-2 Little Endian' file to 'utf8' using Python

I'm trying to convert a UCS-2 Little Endian file to UTF-8 using Python, and I'm getting a weird error.
The code I'm using:
file=open("C:/AAS01.txt", 'r', encoding='utf8')
lines = file.readlines()
file.close()
And I'm getting the following error:
Traceback (most recent call last):
File "C:/Users/PycharmProjects/test.py", line 18, in <module>
main()
File "C:/Users/PycharmProjects/test.py", line 7, in main
lines = file.readlines()
File "C:\Python34\lib\codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I tried using the codecs module, but that didn't work either.
Any idea what I can do?
The encoding argument to open sets the input encoding. Use encoding='utf_16_le'.
If you're trying to read UCS-2, why are you telling Python it's UTF-8? The 0xff is most likely the first byte of a little-endian byte-order mark (BOM):
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
UCS-2 is also deprecated, for the simple reason that Unicode outgrew it. The typical replacement would be UTF-16.
More info linked in Python 3: reading UCS-2 (BE) file
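A minimal sketch of the conversion (the output file name is hypothetical). With encoding='utf_16_le', as suggested above, the BOM survives as a leading '\ufeff' character in the text; the generic 'utf-16' codec consumes it instead:

# Read the UCS-2 / UTF-16 LE file; 'utf-16' auto-detects the BOM and strips it.
# (encoding='utf_16_le' also works, but keeps the BOM as a leading '\ufeff'.)
with open("C:/AAS01.txt", "r", encoding="utf-16") as src:
    lines = src.readlines()

# Write the same text back out as UTF-8 (hypothetical output path).
with open("C:/AAS01_utf8.txt", "w", encoding="utf-8") as dst:
    dst.writelines(lines)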

Finding the encoding when opening a CSV file in Python

I have problems understanding how to detect the proper encoding of a csv file.
I created a small CSV file as a sample for testing by cutting and pasting some rows from one of the original files I want to process, and saved it from my local Excel as CSV.
My program can handle this or similar files without problem, but when I try to open a file sent to me from another computer, the program exits with an error.
The section of the code that opens the file:
with open(file_path, 'r') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    reader = csv.DictReader(f, fieldnames=['RUT', 'Nombre', 'Telefono'], dialect=dialect)
    for row in reader:
        numeros.append(row['Telefono'])
The error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 22, in <module>
for row in reader:
File "C:\Program Files\Python35\lib\csv.py", line 110, in __next__
row = next(self.reader)
File "C:\Program Files\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6392: character maps to <undefined>
Process finished with exit code 1
My locale.getpreferredencoding() is 'cp1252'
I did a couple of attempts to guess the encoding:
with open(file_path,'r', encoding='cp1252') as f:
It works with my locally generated CSV, but not with the ones I'm sent.
with open(file_path,'r', encoding='utf-8') as f:
Doesn't work with any file, but it generates a different error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 19, in <module>
dialect = csv.Sniffer().sniff(f.read(1024))
File "C:\Program Files\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1670: invalid continuation byte
Process finished with exit code 1
I also tried adding newline='' to the open() call, but it doesn't make a difference.
Following an answer from Stack Overflow, I opened the files with Notepad and checked the encoding under 'Save as'; both my local files and the ones I receive by email show 'ANSI' as the encoding.
Do I need to figure out the encoding by myself, or can Python do that for me? Is there something wrong with my code?
I'm using Python 3.5, and the files are most likely created on computers with a Spanish OS.
Update: I have been doing some more testing. Almost all the CSV files open without problems and the program runs correctly, but there are two files that cause an error when I try to open them. If I use Excel or Notepad, these files look normal. I suspect they were created or saved on a computer with an uncommon OS or language.
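One rough way to narrow down what the two problem files use is to attempt the decode with a few candidate encodings and see which one gets through; a minimal sketch using the asker's file_path variable (the candidate list is an assumption, chosen for Spanish-language Windows files):

# Assumed candidates; note that latin-1 maps every byte, so it always "succeeds"
# even when it is not the real encoding - check the decoded text by eye.
candidates = ['utf-8', 'cp1252', 'latin-1']

for enc in candidates:
    try:
        with open(file_path, 'r', encoding=enc, newline='') as f:
            f.read()
        print('decodes cleanly as', enc)
        break
    except UnicodeDecodeError as e:
        print(enc, 'failed:', e)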

While reading a file in Python, I got a UnicodeDecodeError. What can I do to resolve this?

This is one of my own projects. It will later benefit other people in a game I am playing (AssaultCube). Its purpose is to break down the log file and make it easier for users to read.
I keep getting this issue. Does anyone know how to fix it? Currently, I am not planning to write or create the file; I just want this error fixed.
The line that triggered the error is a blank line (it stopped on line 66346).
This is what the relevant part of my script looks like:
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r')
for line in log:
and the exception is:
Traceback (most recent call last):
File "C:\Users\Owner\Desktop\Exodus Logs\Log File Translater.py", line 159, in <module>
main()
File "C:\Users\Owner\Desktop\Exodus Logs\Log File Translater.py", line 7, in main
for line in log:
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3074: character maps to <undefined>
Try:
enc = 'utf-8'
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r', encoding=enc)
If that doesn't work, try:
enc = 'utf-16'
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r', encoding=enc)
you could also try it with
enc = 'iso-8859-15'
also try:
enc = 'cp437'
which is very old, but it also has "ü" at 0x81, which would fit the string "üßer" that I found on the AssaultCube homepage.
If none of these encodings work, try contacting some of the AssaultCube developers or, as mentioned in a comment, have a look at https://pypi.python.org/pypi/chardet
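A minimal sketch of the chardet suggestion, using the path from the question; detection is only a guess, so treat the result as a starting point rather than ground truth:

import chardet

path = '/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt'

# Feed the raw bytes to chardet and use its best guess to reopen the file as text.
with open(path, 'rb') as raw:
    guess = chardet.detect(raw.read())
print(guess)  # a dict with 'encoding' and 'confidence'

with open(path, 'r', encoding=guess['encoding']) as log:
    for line in log:
        pass  # process each line as before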
