Finding the encoding when opening a CSV file in Python

I'm having trouble understanding how to detect the proper encoding of a CSV file.
I created a small CSV file as a test sample by cutting and pasting some rows from one of the original files I want to process, and saved it from my local Excel as CSV.
My program can handle this and similar files without problems, but when I try to open a file sent to me from another computer, the program exits with an error.
The section of the code that opens the file:
with open(file_path, 'r') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    reader = csv.DictReader(f, fieldnames=['RUT', 'Nombre', 'Telefono'], dialect=dialect)
    for row in reader:
        numeros.append(row['Telefono'])
The error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 22, in <module>
for row in reader:
File "C:\Program Files\Python35\lib\csv.py", line 110, in __next__
row = next(self.reader)
File "C:\Program Files\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6392: character maps to <undefined>
Process finished with exit code 1
My locale.getpreferredencoding() is 'cp1252'
I made a couple of attempts to guess the encoding:
with open(file_path,'r', encoding='cp1252') as f:
It works with my locally generated CSV, but not with the ones I'm sent.
with open(file_path,'r', encoding='utf-8') as f:
This doesn't work with any of the files, but it generates a different error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 19, in <module>
dialect = csv.Sniffer().sniff(f.read(1024))
File "C:\Program Files\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1670: invalid continuation byte
Process finished with exit code 1
I also tried adding newline='' to the open() call, but it doesn't make a difference.
Following an answer from Stack Overflow, I opened the files in Notepad and checked the encoding under 'Save as'; both my local files and the ones I receive by email show 'ANSI' as the encoding.
Do I need to figure out the encoding by myself, or can Python do that for me? Is there something wrong with my code?
I'm using Python 3.5, and the files were most likely created on computers with a Spanish OS.
Update: I've been doing some more testing. Almost all of the CSV files open without problems and the program runs correctly, but there are 2 files that cause an error when I try to open them. If I open these files in Excel or Notepad they look normal. I suspect that they were created or saved on a computer with an uncommon OS or language setting.
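One thing I'm considering, just as a rough sketch (the candidate list is a guess based on the files coming from Spanish-language Windows machines, and read_phone_numbers is a made-up helper): try a few likely encodings and keep the first one that decodes cleanly. latin-1 goes last because it accepts any byte sequence, so it acts as a fallback.

import csv

candidates = ['utf-8-sig', 'cp1252', 'latin-1']   # guesses, not a definitive list

def read_phone_numbers(file_path):
    for enc in candidates:
        try:
            with open(file_path, 'r', encoding=enc, newline='') as f:
                dialect = csv.Sniffer().sniff(f.read(1024))
                f.seek(0)
                reader = csv.DictReader(f, fieldnames=['RUT', 'Nombre', 'Telefono'],
                                        dialect=dialect)
                return [row['Telefono'] for row in reader], enc
        except UnicodeDecodeError:
            continue   # wrong guess, try the next candidate
    raise ValueError('none of the candidate encodings worked for ' + file_path)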

Related

unable to debug error produced while reading csv in python

I am trying to read a CSV file. The code I've written below gives an error (shown after the code block). I'm not sure what I am missing or doing wrong.
import csv
file = open('AlfaRomeo.csv')
csvreader = csv.reader(file)
for j in csvreader:
    print(j)
Traceback (most recent call last):
File "C:\Users\Pratik\PycharmProjects\AkraScraper\Transform_Directory\Developer_Sandbox.py", line 39, in
for j in csvreader:
File "C:\Users\Pratik\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 402: character maps to <undefined>
The error means you have a byte in your input file which fails the Unicode decode test. Its value is 0x8d (141 decimal), and it sits at position 402 in the file. I suggest loading the file in a text editor and searching forward until you find it. So you know what you're looking for, it's in the Extended ASCII section of https://www.asciitable.com/.
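If you'd rather locate the byte from Python than scroll through a text editor, a quick sketch (reusing the AlfaRomeo.csv name from the question) is to read the raw bytes and print some context around the first 0x8d:

with open('AlfaRomeo.csv', 'rb') as f:   # binary mode, so nothing gets decoded
    data = f.read()

pos = data.find(b'\x8d')                 # offset of the first offending byte
print(pos, data[max(0, pos - 20):pos + 20])

Once you can see the surrounding text, you can decide whether to clean the file or to pass an explicit encoding= to open() instead of relying on the cp1252 default.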

UnicodeDecodeError 'utf8' codec can't decode byte 0xb0

I have code that goes recursively through some folders, along the lines of
for root, subFolders, files in os.walk(str(rootdir)):
Running the program, I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 37: invalid start byte.
I have changed the rootdir path to see where the error starts, and found that some folders within the path I actually want to use are completely fine while others return the error. The thing is, all the subdirectories either only contain folders or contain basically the same files, so I don't know where the error is coming from or how to fix it.
Please help.
The error appears in a line where I use an outside package, but the package imports fine, the code is fine, and it works when the Unicode error doesn't appear. The failing line loads a .xml file from the folder; is that file the one with the problem? (It shouldn't be, since they are all created with the same program, and if one were wrong then all should be wrong, not just a few.)
Edit: to actually test my code you'd have to install pymatgen (you can with pip) and get a vasprun.xml file. Highly improbable, hence why I didn't put it at the beginning.
Code (the error occurs on the last line):
from pymatgen.electronic_structure.dos import CompleteDos, add_densities, Dos
from pymatgen.electronic_structure.core import Spin, Orbital
from pymatgen.io.vasp.outputs import Vasprun, Procar
vasprun = Vasprun(root+"/vasprun.xml")
Error:
Traceback (most recent call last):
File "an.py", line 196, in <module>
vasprun = Vasprun(root+"/vasprun.xml")
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 383, in __init__
self.update_potcar_spec(parse_potcar_file)
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 829, in update_potcar_spec
potcar = get_potcar_in_path(os.path.split(self.filename)[0])
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/outputs.py", line 813, in get_potcar_in_path
pc = Potcar.from_file(os.path.join(p, fn))
File "/usr/lib64/python2.7/site-packages/pymatgen/io/vasp/inputs.py", line 1704, in from_file
fdata = reader.read()
File "/usr/lib64/python2.7/codecs.py", line 314, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 37: invalid start byte
The file is apparently not UTF-8 encoded. If it has an XML declaration that specifies UTF-8 (or does not specify an encoding), then you need to replace it. If there is no XML declaration, you should try adding one.
A correct XML declaration will need to specify the actual character set, probably <?xml version="1.0" encoding="iso-8859-1" ?>, or possibly some other ISO encoding.
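As a rough sketch of that, assuming the bytes really are ISO-8859-1 (this is plain file handling, not pymatgen functionality): print whatever declaration the file currently has, then rewrite it with an explicit one.

import re

path = root + "/vasprun.xml"   # 'root' comes from the os.walk loop in the question
with open(path, 'rb') as f:
    data = f.read()

print(data[:60])               # inspect the current declaration, if any

decl = b'<?xml version="1.0" encoding="iso-8859-1" ?>\n'
body = re.sub(br'^\s*<\?xml[^>]*\?>\s*', b'', data)   # drop any existing declaration
with open(path, 'wb') as f:    # rewrite in place with an explicit declaration
    f.write(decl + body)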

Unicode problems in Python again

I have a strange problem. A friend of mine sent me a text file. If I copy the text and paste it into my text editor and save it, the following code works. If I choose the option to save the file directly from the browser, the following code breaks. What's going on? Is it the browser's fault for saving invalid characters?
This is an example line.
When I save it, the line says
What�s going on?
When I copy/paste it, the line says
What’s going on?
This is the code:
import codecs

def do_stuff(filename):
    with codecs.open(filename, encoding='utf-8') as f:
        def process_line(line):
            return line.strip()

        lines = f.readlines()
        for line in lines:
            line = process_line(line)
            print line

do_stuff('stuff.txt')
This is the traceback I get:
Traceback (most recent call last):
File "test-encoding.py", line 13, in <module>
do_stuff('stuff.txt')
File "test-encoding.py", line 8, in do_stuff
lines = f.readlines()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 679, in readlines
return self.reader.readlines(sizehint)
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 588, in readlines
data = self.read()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: invalid start byte
What can I do in such cases?
How can I distribute the script if I don't know what encoding the user who runs it will use?
Fixed:
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
The "file-oriented" part of the browser works with raw bytes, not characters. The specific encoding used by the page should be specified either in the HTTP headers or in the HTML itself. You must use this encoding instead of assuming that you have UTF-8 data.

Python: File encoding errors

For a few days I've been struggling with this annoying file-encoding problem in my little Python program.
I work a lot with MediaWiki; recently I've been converting documents from .doc to Wikisource.
A document in Microsoft Word format is opened in LibreOffice and then exported to a .txt file in Wikisource format. My program searches for the [[Image:]] tag and replaces it with an image name taken from a list, and that mechanism works really well (big thanks for the help, brjaga!).
When I did some tests on .txt files created by me, everything worked just fine, but when I feed it a .txt file with Wikisource content, the whole thing is not so funny anymore :D
I got this message from Python:
Traceback (most recent call last):
File "C:\Python33\final.py", line 15, in <module>
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>
And this is my Python code:
li = [
"[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
]
with open ("C:\\124_BPP_PL_PL.txt") as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')
for item in li:
s = s.replace("[[Image:]]", item, 1)
dest.write(s)
dest.close()
OK, so I did some research and found that this is an encoding problem. So I installed Notepad++ and changed the encoding of my Wikisource .txt file to UTF-8 and saved it. Then I made a change in my code:
with open ("C:\\124_BPP_PL_PL.txt", encoding="utf8') as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
But I got this new error message:
Traceback (most recent call last):
File "C:\Python33\final.py", line 22, in <module>
dest.write(s)
File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
And I'm really stuck on this one. I thought that if I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.
Please help, Thank You in advance.
When Python 3 opens a text file, it uses the default encoding for your system when trying to decode the file in order to give you full Unicode text (the str type is fully Unicode aware). It does the same when writing out such Unicode text values.
You already solved the input side; you specified an encoding when reading. Do the same when writing: specify a codec that can handle Unicode, including the U+FEFF character at the start of your data (that is the UTF-8 byte order mark, read back in as an ordinary character). UTF-8 is usually a good default choice:
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
You can use the with statement when writing too and save yourself the .close() call:
for item in li:
    s = s.replace("[[Image:]]", item, 1)

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
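A related option, assuming the U+FEFF really is the byte order mark that Notepad++ put at the start of the file when converting it to UTF-8: read with the utf-8-sig codec, which strips a leading BOM, so the character never ends up in s in the first place.

with open("C:\\124_BPP_PL_PL.txt", encoding='utf-8-sig') as myfile:   # utf-8-sig drops a leading BOM
    s = ' '.join(line.replace('\n', '') for line in myfile)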

While reading a file in Python, I got a UnicodeDecodeError. What can I do to resolve this?

This is one of my own projects. It will later benefit other people in a game I play (AssaultCube). Its purpose is to break down the log file and make it easier for users to read.
I keep getting this issue. Does anyone know how to fix it? Currently I am not planning to write/create the file; I just want this error to be fixed.
The line that triggered the error is a blank line (it stopped on line 66346).
This is what the relevant part of my script looks like:
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r')
for line in log:
and the exception is:
Traceback (most recent call last):
File "C:\Users\Owner\Desktop\Exodus Logs\Log File Translater.py", line 159, in <module>
main()
File "C:\Users\Owner\Desktop\Exodus Logs\Log File Translater.py", line 7, in main
for line in log:
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3074: character maps to <undefined>
Try:
enc = 'utf-8'
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r', encoding=enc)
If that doesn't work, try:
enc = 'utf-16'
log = open('/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt', 'r', encoding=enc)
You could also try it with
enc = 'iso-8859-15'
Also try:
enc = 'cp437'
which is very old but also has "ü" at 0x81, which would match the string "üßer" that I found on the AssaultCube homepage.
If all of these encodings are wrong, try contacting some of the people developing AssaultCube, or, as mentioned in a comment, have a look at https://pypi.python.org/pypi/chardet
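A minimal chardet sketch, reusing the log path from the question; chardet guesses the encoding from raw bytes, so read the file in binary mode first:

import chardet

path = '/Users/Owner/Desktop/Exodus Logs/DIRTYLOGS/serverlog_20130430_00.15.21.txt'
with open(path, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess)

log = open(path, 'r', encoding=guess['encoding'])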
