Python Convert Excel to CSV

Seems there are a lot of posts on this subject and my solution is in line with what the most common answer seems to be, however I'm encountering an encoding error that I don't know how to address.
>>> def Excel2CSV(ExcelFile, SheetName, CSVFile):
        import xlrd
        import csv
        workbook = xlrd.open_workbook(ExcelFile)
        worksheet = workbook.sheet_by_name(SheetName)
        csvfile = open(CSVFile, 'wb')
        wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        for rownum in xrange(worksheet.nrows):
            wr.writerow(worksheet.row_values(rownum))
        csvfile.close()
>>> Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations",
              r"C:\Temp\StoreList.csv")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    Excel2CSV(r"C:\Temp\Store List.xls", "Open_Locations", r"C:\Temp\StoreList.csv")
  File "<pyshell#1>", line 10, in Excel2CSV
    wr.writerow(worksheet.row_values(rownum))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 14: ordinal not in range(128)
>>>
Any help or insight is greatly appreciated.

As @davidism points out, the Python 2 csv module doesn't work with unicode. You can work around this by encoding all of your unicode objects to str objects before passing them to csv:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(SheetName)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        wr.writerow(
            list(x.encode('utf-8') if type(x) == type(u'') else x
                 for x in worksheet.row_values(rownum)))
    csvfile.close()

The Python 2 csv module has some problems with unicode data. You can either encode everything to UTF-8 before writing, or use the unicodecsv module to do it for you.
First pip install unicodecsv. Then, instead of import csv, just import unicodecsv as csv. The API is the same (plus encoding options), so no other changes are needed.
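For readers on Python 3, where the csv module works with text rather than bytes, the encoding step moves into open(). What follows is only a minimal sketch, assuming the rows have already been read from the worksheet (for example with worksheet.row_values). The function name write_rows_utf8 and the sample data are hypothetical and not part of the original post:

```python
import csv
import os
import tempfile


def write_rows_utf8(rows, csv_path):
    """Write an iterable of rows to csv_path as UTF-8, quoting every field."""
    # newline='' lets the csv module control line endings itself.
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerows(rows)


# Hypothetical sample rows; '\xa0' is the non-breaking space from the traceback.
sample_rows = [['Store\xa0Name', 'Id'], ['Main\xa0Street', 1.0]]
sample_path = os.path.join(tempfile.mkdtemp(), 'StoreList.csv')
write_rows_utf8(sample_rows, sample_path)
```

Because the file object does the encoding, no per-cell encode() call is needed in this variant.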

Another way of doing this: encode each text cell as UTF-8 before writing, and leave numbers and other cell types as they are.
[c.encode('utf-8') if isinstance(c, unicode) else c for c in worksheet.row_values(rownum)]
The whole function:
def Excel2CSV(ExcelFile, SheetName, CSVFile):
    import xlrd
    import csv
    workbook = xlrd.open_workbook(ExcelFile)
    worksheet = workbook.sheet_by_name(SheetName)
    csvfile = open(CSVFile, 'wb')
    wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for rownum in xrange(worksheet.nrows):
        # Encode only unicode cells; csv converts the other types itself.
        wr.writerow([c.encode('utf-8') if isinstance(c, unicode) else c
                     for c in worksheet.row_values(rownum)])
    csvfile.close()

Related

Why is csv not defined?

I am trying to do a relatively simple parse of a csv file, and I don't understand why the csv module is not working. Here is my code:
import csv

def getFromCSV(fileName):
    with open(fileName, 'r') as f:
        reader = csv.reader(f)
        data = list(reader)
    return data

def append_row(fileName, my_list):
    with open(fileName, 'a') as output:
        writer = csv.writer(output)
        writer.writerow(my_list)

data = getFromCSV('dh_internal_all.csv')
for row in data:
    if '25252' not in row:
        print(row)
        append_row('parsed.csv',[row])
This returns:
dh-dfbhv2l:Documents jwr38$ python3 remove_bad_data.py
Traceback (most recent call last):
  File "remove_bad_data.py", line 13, in <module>
    data = getFromCSV('dh_internal_all.csv')
  File "remove_bad_data.py", line 3, in getFromCSV
    reader = csv.reader(f)
NameError: name 'csv' is not defined
Thank you in advance for any tips.
EDIT: when I run python3 in terminal, then import csv, and then csv, it seems to recognize it, it returns:
<module 'csv' from '/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/csv.py'>
You pasted the wrong code. In your traceback, the faulting line is 3, but in this code, it's 5 - the two missing lines are probably the "import csv" lines.

csv read raises "UnicodeDecodeError: 'charmap' codec can't decode..."

I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to undefined
When I run the code:
import csv
input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11, 13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed
with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)
What am I doing wrong?
In Python 3, the csv module processes the file as unicode strings, and because of that has to first decode the input file. You can use the exact encoding if you know it, or just use Latin1 because it maps every byte to the unicode character with same code point, so that decoding+encoding keep the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
        ...
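The round-trip property described above can be checked directly. This is a small sketch, not part of the original answer. It also assumes the 'charmap' codec in the error message is cp1252, a common Windows default, which the question does not actually confirm:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to the code point with the same number,
# so decoding never fails...
text = all_bytes.decode('latin1')

# ...and encoding the result gives back exactly the original bytes.
round_tripped = text.encode('latin1')

# Assumption: the failing codec was cp1252. It has no mapping for 0x8d,
# which is the byte named in the error message.
try:
    b'\x8d'.decode('cp1252')
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
```

Note that Latin-1 only guarantees the bytes survive unchanged. If the file is really in another encoding, non-ASCII characters will look wrong while they are held as Python strings.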
Add encoding="utf8" when opening the files. Try this instead:
with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:
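Try pandas, passing the file's encoding explicitly (Latin1 here as an example):
import pandas
df = pandas.read_csv('input.csv', encoding='latin1')
df.to_csv('output.csv', index=False, encoding='latin1')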
Try saving the file again as CSV UTF-8

Read, then Write CSV with "Non-ISO extended-ASCII" text Encoding

My csv has strings like:
TîezÑnmidnan
I'm trying to use the code below to set up a reader and a writer:
import csv
# File that will be written to
csv_output_file = open(file, 'w', encoding='utf-8')
# File that will be read in
csv_file = open(filename, encoding='utf-8', errors='ignore')
# Define reader
csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
# Define writer
csv_writer = csv.writer(csv_output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
Then iterating over the information read in
# Iterate over the rows in the csv
for idx, row in enumerate(csv_reader):
    csv_writer.writerow(row[0:30])
Problem is in my output file I can't get it to show up with that same string. According to my mac, the csv file type has the encoding "Non-ISO extended-ASCII"
I tried various encodings and some would just remove the special characters while others just wouldn't work.
It's weird because I can hard code that string above into a variable and use it without problems, so I assume it's something to do with how I'm reading in the file. If I breakpoint before it writes it shows up as the following in the debugger.
T�ez�nmidnan
I can't convert the file before running it, so the python code has to handle any conversions itself.
The expected output I want would be for it to remain in the output file looking like
TîezÑnmidnan
Adding a link to a sample csv that shows the issue along with a complete version of my code (with some details removed)
Example file to run with this
import tkinter as tk
from tkinter.filedialog import askopenfilename
import csv
import os
root = tk.Tk()
root.withdraw()
# Ask for file
filename = os.path.abspath(askopenfilename(initialdir="/", title="Select csv file", filetypes=(("CSV Files", "*.csv"),)))
# Set output file name
output_name = filename.rsplit('.')
del output_name[len(output_name) - 1]
output_name = "".join(output_name)
output_name += "_processed.csv"
# Using the file that will be written to
csv_output_file = open(os.path.abspath(output_name), 'w', encoding='utf-8')
# The file that will be read in
csv_file = open(filename, encoding='utf-8', errors='ignore')
# Define reader with , delimiter
csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
# Define writer to put quotes around input values with a comma in them
csv_writer = csv.writer(csv_output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
header_row = []
# Iterate over the rows in the csv
for idx, row in enumerate(csv_reader):
    if idx != 0:
        csv_writer.writerow(row)
    else:
        header_row = row
        csv_writer.writerow(header_row)
csv_file.flush()
csv_output_file.flush()
csv_file.close()
csv_output_file.close()
Expected results
Header1,Header2
Value1,TîezÑnmidnan
Actual results
Header1,Header2
Value1,Teznmidnan
Edit:
chardetect gave me "utf-8 with confidence 0.99"
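One possible explanation, offered as a sketch rather than a confirmed diagnosis: if the file is actually stored in a single-byte encoding such as Latin-1 (an assumption, since the question does not establish the real encoding), then decoding it as UTF-8 with errors='ignore' silently drops the bytes for î and Ñ. That would reproduce the "Actual results" shown above. The raw bytes below are constructed for illustration and are not taken from the sample file:

```python
# Hypothetical raw bytes for the sample string, assuming Latin-1 encoding.
raw = 'TîezÑnmidnan'.encode('latin1')

# Bytes 0xEE and 0xD1 are not valid UTF-8 here, so 'ignore' discards them.
dropped = raw.decode('utf-8', errors='ignore')

# 'replace' substitutes U+FFFD instead, as seen in the debugger output.
replaced = raw.decode('utf-8', errors='replace')

# Decoding with the encoding the bytes were written in keeps the characters.
kept = raw.decode('latin1')
```

Under that assumption, opening the input with the file's real encoding and keeping encoding='utf-8' on the output would preserve the characters.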

Python CSV write to file unreadable in Excel (Chinese characters)

I am trying to perform text analysis on Chinese texts. The program is provided below. The result contains unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters are correct: 人民日报社论. What's wrong here? I can't figure it out. I have tried several approaches, including adding a decoder and an encoder.
# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs
segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]
jieba.load_userdict("customized_dict.txt")
for name in file_name:
    all_text = ""
    multi_line_text = ""
    with open(raw_data_path + name + ".txt", "r") as file:
        for line in file:
            if line != '\n':
                multi_line_text += line
    templist = multi_line_text.split('\n')
    for text in templist:
        all_text += text
    seg_list = jieba.cut(all_text,cut_all=False)
    temp_text = []
    for item in seg_list:
        temp_text.append(item.encode('utf-8'))
    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))
    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)
    segList.append(text_without_stopwords)
with open("results/result.csv", 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(segList)
For UTF-8 encoding, Excel requires a BOM (byte order mark) codepoint written at the start of the file or it will assume an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's an example that will open in Excel correctly:
#!python2
#coding:utf8
import csv
data = [[u'American', u'美国人'],
[u'Chinese', u'中国人']]
with open('results.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])
Python 3 makes this easier. Open the file with 'w', newline='', encoding='utf-8-sig' instead of 'wb'; the file then accepts Unicode strings directly and a BOM is written automatically:
#!python3
#coding:utf8
import csv
data = [['American', '美国人'],
['Chinese', '中国人']]
with open('results.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    w.writerows(data)
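A quick way to confirm what utf-8-sig does, as a sketch that writes to a temporary path chosen for illustration (the path and variable names are not from the original answer): the file starts with the three BOM bytes EF BB BF, and reading it back with the same codec strips them again.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'results.csv')
data = [['American', '美国人'], ['Chinese', '中国人']]

# Writing with utf-8-sig prepends the BOM automatically.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(data)

# Inspect the raw bytes to see the BOM at the start of the file.
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(b'\xef\xbb\xbf')

# Reading with utf-8-sig removes the BOM so it does not leak into the data.
with open(path, newline='', encoding='utf-8-sig') as f:
    rows_back = list(csv.reader(f))
```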
There is also a third-party unicodecsv module that makes this easier in Python 2 as well:
#!python2
#coding:utf8
import unicodecsv
data = [[u'American', u'美国人'],
[u'Chinese', u'中国人']]
with open('results.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8-sig')
    w.writerows(data)
Here is another, somewhat tricky, way:
#!python2
#coding:utf8
import csv
data = [[u'American',u'美国人'],
[u'Chinese',u'中国人']]
with open('results.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])
This code block generates a CSV file encoded as UTF-8. Then:
Open the file with Notepad++ (or another editor with an encoding feature).
Choose Encoding -> Convert to ANSI.
Save.
Open the file with Excel; it now displays correctly.
