ASCII Encode/Decode Error - python

I have CSV files encoded as UTF-8. I need to convert them to an Excel workbook with the same encoding, but I have been unable to do so. I have tried many things without success. Here is the code snippet.
Note: I am using the xlsxwriter package.
def csv_to_excel(input_file_path, output_file_path):
    file_path = input_file_path
    excel_file_path = output_file_path
    wb = Workbook(excel_file_path.encode('utf-8', 'ignore'), {'encoding': 'utf-8'})
    sheet1 = wb.add_worksheet(("anyname1").encode('utf-8','ignore'))
    sheet2 = wb.add_worksheet(("anyname2").encode('utf-8','ignore'))
    for filename in glob.glob(file_path):
        (f_path, f_name) = os.path.split(filename)
        w_tab = str(f_name.split('_')[2]).split('.')[0]
        if (w_tab=="anyname1"):
            w_sheet = sheet1
        elif (w_tab=="anyname2"):
            w_sheet = sheet2
        spamReader = csv.reader(open(filename, "rb"), delimiter=',',quotechar='"')
        row_count = 0
        for row in spamReader:
            for col in range(len(row)):
                w_sheet.write(row_count,col,row[col])
            row_count +=1
    try:
        os.remove(excel_file_path)
    except:
        pass
    wb.close()
    print "Converted CSVs to Excel File"
Errors:
Case 1: When I open the UTF-8 encoded CSV file as follows:
spamReader = csv.reader(io.open(filename, "r", encoding='utf-8'), delimiter=',', quotechar='"')
I get this error while iterating over the spamReader object:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 92: ordinal not in range(128)
Case 2: When I open the same CSV file in binary mode, as in the code snippet above, I cannot save it as a UTF-8 encoded Excel file. Calling wb.close() raises:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
I have just started learning Python, so this may not be a big issue, but please help me with it.

Related

csv read raises "UnicodeDecodeError: 'charmap' codec can't decode..."

I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to <undefined>
When I run the code:
import csv
input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11, 13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed
with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)
What am I doing wrong?
In Python 3, the csv module works with Unicode strings, so the input file has to be decoded first. Use the exact encoding if you know it. Otherwise, Latin1 works because it maps every byte to the Unicode character with the same code point, so decoding and then encoding leaves the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
        ...
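The byte-preserving property of Latin1 described above can be checked directly. This is a minimal sketch, assuming Python 3; the variable names are illustrative and not part of the original answer.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 decodes byte N to the character with code point N, so this never fails.
text = all_bytes.decode("latin-1")
round_tripped = text.encode("latin-1")

# The same bytes are not valid UTF-8: 0x80 is a stray continuation byte.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_tripped == all_bytes)  # True
print(utf8_ok)  # False
```

If the file is really in some other encoding, the decoded text may look wrong on screen, but the bytes written back out are the ones that were read. Dropping columns should still work as long as the real encoding is ASCII-compatible, since the delimiter and quote characters are ASCII.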
Add encoding="utf8" when opening the files. Try this instead:
with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:
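When it is not known in advance whether a file is UTF-8 or a legacy code page, one option is to try several encodings in order. The sketch below assumes Python 3; `read_rows`, its default encoding list, and the demo file are hypothetical and only illustrate the idea.

```python
import csv
import os
import tempfile

def read_rows(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return all CSV rows, trying each encoding in turn."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc, newline="") as source:
                return list(csv.reader(source))
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next codec
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error

# Demo file: "café" stored as cp1252, which is not valid UTF-8.
demo_path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(demo_path, "wb") as f:
    f.write(b"id,name\r\n1,caf\xe9\r\n")

rows = read_rows(demo_path)
print(rows)  # [['id', 'name'], ['1', 'café']]
```

Order matters here: a permissive codec can decode bytes without raising and still produce the wrong characters, so stricter encodings such as UTF-8 are better tried first.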
Try pandas:
import pandas
df = pandas.read_csv('input.csv')
df.to_csv('output.csv', index=False)
Try saving the file again as CSV UTF-8

UnicodeDecodeError while trying to change points into commas in Excel

I'm currently trying to run the following code:
import csv
with open("C:\\Users\\User\\Downloads\\stu21617.rw2.xlsx", 'r', encoding='utf-8') as infile:
    with open("C:\\Users\\User\\Documents\\jens\\komma.xlsx", 'w', encoding='utf-8') as outfile:
        tabel = []
        writer = csv.writer(outfile, delimiter = ';')
        for rij in csv.reader(infile, delimiter = ';'):
            writer.writerow(rij.replace('.', ','))
But I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte

Write numpy.ndarray with Russian characters to file

I am trying to write a numpy.ndarray to a file.
I use
unique1 = np.unique(df['search_term'])
unique1 = unique1.tolist()
and then try
1)
edf = pd.DataFrame()
edf['term'] = unique1
writer = pd.ExcelWriter(r'term.xlsx', engine='xlsxwriter')
edf.to_excel(writer)
writer.close()
and 2)
thefile = codecs.open('domain.txt', 'w', encoding='utf-8')
for item in unique:
    thefile.write("%s\n" % item)
But both raise UnicodeDecodeError: 'utf8' codec can't decode byte 0xd7 in position 9: invalid continuation byte
The second example should work if you encode the strings as UTF-8.
The following works in Python 2 with a UTF-8 encoded source file:
# -*- coding: utf-8 -*-
import pandas as pd
edf = pd.DataFrame()
edf['term'] = ['foo', 'bar', u'русском']
writer = pd.ExcelWriter(r'term.xlsx', engine='xlsxwriter')
edf.to_excel(writer)
writer.save()

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); the full json-to-csv.py file is below.
The conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8','ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see bad Cyrillic characters, for example \xe0\xf1; English text is fine.
I experimented with encode('cp1251','ignore'), but then I got this error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv
##
# This function converts an item like
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14",
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item
    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+str(i), sub_item)
            i=i+1
    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+str(sub_key), value[sub_key])
    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251','ignore')
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]
        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)
        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)
        header = list(set(header))
        header.sort()
        with open(csv_file_path, 'wt+') as f:#wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)
        print("Just completed writing csv file with %d columns" % len(header))
How do I convert Cyrillic correctly? I also want to skip bad characters.
You need to know which Cyrillic encoding the file you are opening uses.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3, the data variable is then a Unicode str, already decoded from the file's encoding. In Python 2 there might be a problem feeding the input to json.
Also try printing a line in the Python interpreter to see whether the symbols are right. Without the input file it is hard to tell if everything is correct. Are you also sure that this is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that the input and output are right.
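On the Excel side of the question: in Python 3, calling .encode() on a value before handing it to csv makes the writer emit the bytes repr (for example b'\xe0\xf1') instead of text, which matches the symptom described. A minimal sketch of the alternative follows, assuming Python 3: keep the values as str and let open() do the encoding. The "utf-8-sig" codec is chosen here because Excel generally uses the byte order mark to recognise UTF-8; the sample rows and the temporary path are made up for the demo.

```python
import csv
import os
import tempfile

# Sample data standing in for the reduced JSON items.
rows = [["id", "name"], ["1", "Привет"]]

# Hypothetical output path in a temporary directory, used only for this demo.
csv_path = os.path.join(tempfile.mkdtemp(), "cyrillic.csv")

# "utf-8-sig" prepends a byte order mark (EF BB BF) to the file.
with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

with open(csv_path, "rb") as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf'

# Reading back with the same codec strips the mark again.
with open(csv_path, "r", encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```

To skip characters that cannot be encoded rather than fail, open() also accepts an errors argument such as errors="ignore"; with a UTF-8 output file that should rarely be needed.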
Maybe you could use chardet to detect the file's encoding.
import json
import chardet
File='arq.GeoJson'
enc=chardet.detect(open(File,'rb').read())['encoding']
with open(File,'r', encoding = enc) as f:
    data=json.load(f)
This avoids having to guess the encoding.

Python Convert Excel to CSV

Seems there are a lot of posts on this subject and my solution is in line with what the most common answer seems to be, however I'm encountering an encoding error that I don't know how to address.
>>> def Excel2CSV(ExcelFile, SheetName, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(SheetName)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        wr.writerow(worksheet.row_values(rownum))
    csvfile.close()
>>> Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations",
r"C:\Temp\StoreList.csv")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations", r"C:\Temp\StoreList.csv")
  File "<pyshell#1>", line 10, in Excel2CSV
    wr.writerow(worksheet.row_values(rownum))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 14: ordinal not in range(128)
>>>
Any help or insight is greatly appreciated.
As @davidism points out, the Python 2 csv module doesn't work with unicode. You can work around this by converting all of your unicode objects to str objects before passing them to csv:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(SheetName)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        wr.writerow(
            list(x.encode('utf-8') if type(x) == type(u'') else x
                 for x in worksheet.row_values(rownum)))
    csvfile.close()
The Python 2 csv module has some problems with unicode data. You can either encode everything to UTF-8 before writing, or use the unicodecsv module to do it for you.
First pip install unicodecsv. Then, instead of import csv, just import unicodecsv as csv. The API is the same (plus encoding options), so no other changes are needed.
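For comparison, on Python 3 neither workaround should be needed, because the csv module works on text and the file object does the encoding. The sketch below assumes Python 3 and uses an in-memory buffer instead of a real file; the sample row is a made-up stand-in for worksheet.row_values(rownum).

```python
import csv
import io

# Hypothetical row values; u"\xa0" is the non-breaking space from the traceback.
row = [u"Store\xa0List", 42.0, u"caf\xe9"]

# The text wrapper encodes to UTF-8, so csv only ever sees str values.
buffer = io.BytesIO()
text_stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
csv.writer(text_stream, quoting=csv.QUOTE_ALL).writerow(row)
text_stream.flush()

encoded = buffer.getvalue()
print(encoded)
```

With a real file, the equivalent would be open(CSVFile, 'w', encoding='utf-8', newline='') in place of the 'wb' mode used in the Python 2 versions.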
Another way to do this: convert each value to a string, and then encode that string as UTF-8.
[unicode(x).encode('utf-8') for x in worksheet.row_values(rownum)]
The whole function:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(SheetName)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        wr.writerow([unicode(x).encode('utf-8') for x in worksheet.row_values(rownum)])
    csvfile.close()
