I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to undefined
When I run the code:
import csv
input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11,13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed
with open(input_file, "r") as source:
reader = csv.reader(source)
with open(output_file, "w", newline='') as result:
writer = csv.writer(result)
for row in reader:
row_count += 1
print('\r{0}'.format(row_count), end='')
for col_index in cols_to_remove:
del row[col_index]
writer.writerow(row)
What am I doing wrong?
In Python 3, the csv module processes the file as unicode strings, and because of that has to first decode the input file. You can use the exact encoding if you know it, or just use Latin1 because it maps every byte to the unicode character with same code point, so that decoding+encoding keep the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
reader = csv.reader(source)
with open(output_file, "w", newline='', encoding='Latin1') as result:
...
Add encoding="utf8" while opening file. Try below instead:
with open(input_file, "r", encoding="utf8") as source:
reader = csv.reader(source)
with open(output_file, "w", newline='', encoding="utf8") as result:
Try pandas
input_file = pandas.read_csv('input.csv')
output_file = pandas.read_csv('output.csv')
Try saving the file again as CSV UTF-8
Related
I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to undefined
When I run the code:
import csv
input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11,13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed
with open(input_file, "r") as source:
reader = csv.reader(source)
with open(output_file, "w", newline='') as result:
writer = csv.writer(result)
for row in reader:
row_count += 1
print('\r{0}'.format(row_count), end='')
for col_index in cols_to_remove:
del row[col_index]
writer.writerow(row)
What am I doing wrong?
In Python 3, the csv module processes the file as unicode strings, and because of that has to first decode the input file. You can use the exact encoding if you know it, or just use Latin1 because it maps every byte to the unicode character with same code point, so that decoding+encoding keep the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
reader = csv.reader(source)
with open(output_file, "w", newline='', encoding='Latin1') as result:
...
Add encoding="utf8" while opening file. Try below instead:
with open(input_file, "r", encoding="utf8") as source:
reader = csv.reader(source)
with open(output_file, "w", newline='', encoding="utf8") as result:
Try pandas
input_file = pandas.read_csv('input.csv')
output_file = pandas.read_csv('output.csv')
Try saving the file again as CSV UTF-8
Below codes were used in Python 2 to combine all txt files in a folder. It worked fine.
import os
base_folder = "C:\\FDD\\"
all_files = []
for each in os.listdir(base_folder):
if each.endswith('.txt'):
kk = os.path.join(base_folder, each)
all_files.append(kk)
with open(base_folder + "Combined.txt", 'w') as outfile:
for fname in all_files:
with open(fname) as infile:
for line in infile:
outfile.write(line)
When in Python 3, it gives an error:
Traceback (most recent call last):
File "C:\Scripts\thescript.py", line 26, in <module>
for line in infile:
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0xe4 in position 53: No mapping for the Unicode character exists in the target code page.
I made this change:
with open(fname) as infile:
to
with open(fname, 'r', encoding = 'latin-1') as infile:
It gives me “MemoryError”.
How can I correct this error in Python 3? Thank you.
As #transilvlad suggested here, use the open method from the codecs module to read in the file:
import codecs
with codecs.open(fname, 'r', encoding = 'utf-8',
errors='ignore') as infile:
This will strip out (ignore) the characters in the error returning the string without them.
I'm currently trying to run following code:
import csv
with open("C:\\Users\\User\\Downloads\\stu21617.rw2.xlsx", 'r', encoding='utf-8') as infile:
with open("C:\\Users\\User\\Documents\\jens\\komma.xlsx", 'w', encoding='utf-8') as outfile:
tabel = []
writer = csv.writer(outfile, delimiter = ';')
for rij in csv.reader(infile, delimiter = ';'):
writer.writerow(rij.replace('.', ','))
But I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte
I am trying to performing text analysis on Chinese texts. The program is provided below. I got the result with unreadable characters such as 浜烘皯鏃ユ姤绀捐. And if I change the output file result.csv to result.txt, the characters are correct as 人民日报社论. So what's wrong with this? I can not figure out. I tried several ways including add decoder and encoder.
# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs
segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]
jieba.load_userdict("customized_dict.txt")
for name in file_name:
all_text = ""
multi_line_text = ""
with open(raw_data_path + name + ".txt", "r") as file:
for line in file:
if line != '\n':
multi_line_text += line
templist = multi_line_text.split('\n')
for text in templist:
all_text += text
seg_list = jieba.cut(all_text,cut_all=False)
temp_text = []
for item in seg_list:
temp_text.append(item.encode('utf-8'))
stop_list = []
with open("stopwords.txt", "r") as stoplistfile:
for item in stoplistfile:
stop_list.append(item.rstrip('\r\n'))
text_without_stopwords = []
for word in temp_text:
if word not in stop_list:
text_without_stopwords.append(word)
segList.append(text_without_stopwords)
with open("results/result.csv", 'wb') as f:
writer = csv.writer(f)
writer.writerows(segList)
For UTF-8 encoding, Excel requires a BOM (byte order mark) codepoint written at the start of the file or it will assume an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's an example that will open in Excel correctly:
#!python2
#coding:utf8
import csv
data = [[u'American', u'美国人'],
[u'Chinese', u'中国人']]
with open('results.csv','wb') as f:
f.write(u'\ufeff'.encode('utf8'))
w = csv.writer(f)
for row in data:
w.writerow([item.encode('utf8') for item in row])
Python 3 makes this easier. Use 'w', newline='', encoding='utf-8-sig' parameters instead of 'wb' which will accept Unicode strings directly and automatically write a BOM:
#!python3
#coding:utf8
import csv
data = [['American', '美国人'],
['Chinese', '中国人']]
with open('results.csv', 'w', newline='', encoding='utf-8-sig') as f:
w = csv.writer(f)
w.writerows(data)
There is also a 3rd–party unicodecsv module that makes Python 2 easier to use as well:
#!python2
#coding:utf8
import unicodecsv
data = [[u'American', u'美国人'],
[u'Chinese', u'中国人']]
with open('results.csv', 'wb') as f:
w = unicodecsv.writer(f ,encoding='utf-8-sig')
w.writerows(data)
Here is another way kinda tricky:
#!python2
#coding:utf8
import csv
data = [[u'American',u'美国人'],
[u'Chinese',u'中国人']]
with open('results.csv','wb') as f:
f.write(u'\ufeff'.encode('utf8'))
w = csv.writer(f)
for row in data:
w.writerow([item.encode('utf8') for item in row])
This code block generate csv file encoded utf-8 .
open file with notepad++ (or other Editor with encode feature)
Encoding -> convert to ANSI
save
Open file with Excel, it's OK.
Seems there are a lot of posts on this subject and my solution is in line with what the most common answer seems to be, however I'm encountering an encoding error that I don't know how to address.
>>> def Excel2CSV(ExcelFile, SheetName, CSVFile):
import xlrd
import csv
workbook = xlrd.open_workbook(ExcelFile)
worksheet = workbook.sheet_by_name(SheetName)
csvfile = open(CSVFile, 'wb')
wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
for rownum in xrange(worksheet.nrows):
wr.writerow(worksheet.row_values(rownum))
csvfile.close()
>>> Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations",
r"C:\Temp\StoreList.csv")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations", r"C:\Temp\StoreList.csv")
File "<pyshell#1>", line 10, in Excel2CSV
wr.writerow(worksheet.row_values(rownum))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 14:
ordinal not in range(128)
>>>
Any help or insight is greatly appreciated.
As #davidism points out, the Python 2 csv module doesn't work with unicode. You can work around this by converting all of your unicode objects to str objects before submitting them to csv:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
import xlrd
import csv
workbook = xlrd.open_workbook(ExcelFile)
worksheet = workbook.sheet_by_name(SheetName)
csvfile = open(CSVFile, 'wb')
wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
for rownum in xrange(worksheet.nrows):
wr.writerow(
list(x.encode('utf-8') if type(x) == type(u'') else x
for x in worksheet.row_values(rownum)))
csvfile.close()
The Python 2 csv module has some problems with unicode data. You can either encode everything to UTF-8 before writing, or use the unicodecsv module to do it for you.
First pip install unicodecsv. Then, instead of import csv, just import unicodecsv as csv. The API is the same (plus encoding options), so no other changes are needed.
Another fashion for doing this: cast to string, so as you have a string, you may codify it as "utf-8".
str(worksheet.row_values(rownum)).encode('utf-8')
The whole function:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
import xlrd
import csv
workbook = xlrd.open_workbook(ExcelFile)
worksheet = workbook.sheet_by_name(SheetName)
csvfile = open(CSVFile, 'wb')
wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
for rownum in xrange(worksheet.nrows):
wr.writerow(str(worksheet.row_values(rownum)).encode('utf-8'))
csvfile.close()