How to convert an Excel file to CSV and prevent UTF-8 encoding - Python

I have 5 Excel files that have to be compiled into one CSV file that can be uploaded to our website for our affiliated stores database. Until now, someone has manually cut and pasted the rows of each file into one master CSV file in Excel and then uploaded that file to the website.
I've been trying to use Python to consolidate the files so the user would just have to run the Python script that would do this for her. The problem is that the Excel files are encoded in Shift-JIS and when I use CSV writer in Python they get converted to UTF-8. The website we upload them to will only accept files in Shift-JIS, so I have to keep all of this data in Shift-JIS.
Since DOS automatically defaults to ascii encoding, I first have to run this:
import codecs, sys, xlrd, csv
reload(sys)
sys.setdefaultencoding('shift_jis')
Here is a sample of the code for one of the Excel files, which has data on 2 separate worksheets:
with xlrd.open_workbook('Circle.xls') as wb:
    for sheet in wb.sheets():
        fn = 'store-'
        print "Converting files.."
        with open(fn + sheet.name + ".csv", "wb") as f:
            c = csv.writer(f, dialect="excel")
            for r in range(sheet.nrows):
                c.writerow(sheet.row_values(r))
The conversion runs until it finds a UTF-8 character that doesn't exist in shift-JIS, then it errors out.
Is there a way to convert from Excel to a csv purely in shift-JIS?
(If my question has a flaw, please ask me to edit it before marking it down! I will edit it!)
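One possible approach, sketched here for Python 3 (an assumption, since the question uses Python 2), is to open the output file with an explicit encoding so that the csv module writes Shift-JIS bytes directly. The helper name `write_sjis_csv` is hypothetical, and `cp932` (Microsoft's extension of Shift-JIS) is assumed to be what the website accepts; pass `encoding="shift_jis"` if it is stricter. In the real script the rows would come from `sheet.row_values(r)`.

```python
import csv

def write_sjis_csv(path, rows, encoding="cp932"):
    """Write rows to a CSV file encoded as Shift-JIS (cp932 by default)."""
    # newline="" lets the csv module control line endings itself.
    # errors="replace" writes "?" for characters that have no mapping in the
    # target encoding instead of raising UnicodeEncodeError mid-file.
    with open(path, "w", encoding=encoding, errors="replace", newline="") as f:
        writer = csv.writer(f, dialect="excel")
        for row in rows:
            writer.writerow(row)
```

Replacing unmappable characters silently loses data, so it may be preferable to use `errors="strict"` first to find and clean the offending cells.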

Related

Python = XML to table form in Excel using win32com

When attempting to open my xml files, I am prompted with three choices:
"Please select how you would like to open this file:
As an XML table
As a read-only workbook
Use the XML Source task pane."
Option 1 basically opens the xml file in a table in an excel worksheet, which is beautiful and easy to work with. More importantly, I have done all of my data cleaning in R on this format.
Now, I would like to automate the process of simply opening the xml file, selecting open "as an XML table" and then saving as an xlsx file.
I am very close to doing this using win32com, but the issue is that the file seems to be saved in the "As a read-only workbook" mode which means all of the column headers are messed up.
My question: I want to convert an XML file into a table the way that Excel feature does, and then immediately save it in xlsx format (without even opening the XML file), preferably using win32com.
My current code:
import os
import win32com.client
excel = win32com.client.DispatchEx('Excel.Application')
# excel.visible = True
xml1 = excel.Workbooks.Open(os.getcwd() + "\\" + "1.xml")
xml1.SaveAs(os.getcwd() + "\\" + "1.xlsx")
xml1.Close()
I do not want to open every XML file; it is not very time-efficient.
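A possible direction, as an untested sketch: Excel's object model has a `Workbooks.OpenXML` method whose `LoadOption` argument selects how the XML is loaded, and the value 2 (`xlXmlLoadImportToList`) corresponds to the "As an XML table" choice. The helper names below are hypothetical, and the conversion function only runs on Windows with Excel and pywin32 installed.

```python
import os

# Values of Excel's XlXmlLoadOption and XlFileFormat enumerations.
XL_XML_LOAD_IMPORT_TO_LIST = 2   # load the XML data as an XML table (list)
XL_OPEN_XML_WORKBOOK = 51        # .xlsx file format

def xlsx_path_for(xml_path):
    """Return the absolute .xlsx path that sits next to the given XML file."""
    base, _ext = os.path.splitext(os.path.abspath(xml_path))
    return base + ".xlsx"

def convert_xml_to_xlsx(xml_path):
    """Load xml_path as an XML table in a hidden Excel instance, save as .xlsx."""
    import win32com.client  # imported lazily: needs Windows, Excel and pywin32

    excel = win32com.client.DispatchEx("Excel.Application")
    excel.DisplayAlerts = False  # suppress Excel's confirmation dialogs
    try:
        workbook = excel.Workbooks.OpenXML(
            os.path.abspath(xml_path),
            LoadOption=XL_XML_LOAD_IMPORT_TO_LIST,
        )
        workbook.SaveAs(xlsx_path_for(xml_path), FileFormat=XL_OPEN_XML_WORKBOOK)
        workbook.Close(SaveChanges=False)
    finally:
        excel.Quit()
```

Excel still has to load each file internally, but no window is shown and no prompt needs answering, so the function can be called in a loop over many files.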

I am facing a problem with the .CSV format in pandas

I will explain in detail:
I have an Excel file, and my client uses a tool that reads only .csv files.
I open the Excel file in Excel and save it in .CSV format using the Save As option. Call this File_1.
I also wrote Python code using the pandas module to convert the same Excel file to CSV. Call this File_2.
My client's tool is able to read File_1 but not File_2. Why? What could the problem be?
My observations:
When I read File_1 (converted to .CSV manually) in pandas, I have to specify encoding = "ISO-8859-1"; otherwise it raises a Unicode error.
Ex: pd.read_csv("File_1.csv", encoding = "ISO-8859-1")
But when I read File_2 in pandas, it reads without any error.
Ex: pd.read_csv("File_2.csv")
So why can't the client tool read File_2? Is it a Unicode problem or something else?
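One plausible explanation, offered as an assumption because the client tool's requirements are not stated: Excel's plain CSV export writes the system ANSI code page (often cp1252 on Western Windows, which is why ISO-8859-1 reads it), while pandas `to_csv` writes UTF-8 by default. A minimal sketch that re-encodes an existing UTF-8 CSV follows; the function name and the cp1252 default are hypothetical choices.

```python
def reencode_csv(src_path, dst_path, src_encoding="utf-8", dst_encoding="cp1252"):
    """Copy a CSV file, changing only its character encoding."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# With pandas itself, the encoding can be passed when writing instead:
#     df.to_csv("File_2.csv", encoding="cp1252")
```

If the re-encoded file still fails, the difference may lie elsewhere, for example in line endings or an extra index column.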

How to convert a CSV to an xlsx file in Python

I am trying to convert a CSV to the xlsx file format because I have code that is meant to read an Excel file, but it ended up getting a CSV. Is there a way to convert a CSV file to a temporary Excel file and have it not destroyed until the reading process is done? I have tried using openpyxl, but it does not work and throws an error saying it's not a good zip file. I even tried converting the CSV to text and then storing it in a dictionary, but writing it to Excel using the xlrd package did not work either. I was wondering if there is a way do it in a cc
It seems you are opening the file in text mode. Try opening the file like this:
open('sample.csv', "rt", encoding="utf8")
or
open('sample.csv', "rt", encoding="ascii")
depending on the encoding of the file.
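The "not a good zip file" error is consistent with the fact that an .xlsx workbook is a ZIP container, which a plain-text CSV is not. A minimal sketch, with hypothetical helper names, that checks what kind of file actually arrived and reads the CSV directly instead of converting it:

```python
import csv
import zipfile

def sniff_spreadsheet_kind(path):
    """Return "xlsx" if the file is a ZIP container, otherwise "csv"."""
    # An .xlsx workbook is a ZIP archive, so a plain-text CSV fails this
    # check regardless of what its file extension says.
    return "xlsx" if zipfile.is_zipfile(path) else "csv"

def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

The existing Excel-reading code could then be called only when the kind is "xlsx", with `read_csv_rows` handling the other case.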

Excel fails to open Python-generated CSV files

I have many Python scripts that output CSV files. It is occasionally convenient to open these files in Excel. After installing OS X Mavericks, Excel no longer opens these files properly: Excel doesn't parse the files and it duplicates the rows of the file until it runs out of memory. Specifically, when Excel attempts to open the file, a prompt appears that reads: "File not loaded completely."
Example of code I'm using to generate the CSV files:
import csv
with open('csv_test.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
Even the simple file generated by the above code fails to load in Excel. However, if I open the CSV file in a text editor and copy/paste the text into Excel, parse it with text to columns, and then save as CSV from Excel, then I can reopen the CSV file in Excel without issue. Do I need to pass an additional parameter in my scripts to make Excel parse the CSV files the same way it used to? Or is there some setting I can change in OS X Mavericks or Excel? Thanks.
I may have had a similar problem: the error message "SYLK: File format is not valid" appeared when opening a CSV file generated by Python. The cause is rather funny: the first two characters of the file must not be an uppercase I and D (ID). Also see the article '"SYLK: File format is not valid" error message when you open file'.
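A quick way to check a generated file for this condition, as a minimal sketch (the function name is hypothetical):

```python
def starts_like_sylk(path):
    """Return True if the file begins with the uppercase bytes "ID"."""
    with open(path, "rb") as f:
        return f.read(2) == b"ID"
```

If it returns True, renaming the first header (for example to lowercase "id") or quoting it should avoid the SYLK detection.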
Possible solution 1: use *.txt instead of *.csv. In this case Excel (at least 2010) will show you an import data wizard where you can specify delimiters, character encoding, field types, etc.
UPD: Solution 2:
The Python "csv" module has a "dialect" feature. For example, the following modification of your code generates a valid CSV file in my environment (Python 2.7, Excel 2010, Windows 7, locale with ";" as the list delimiter):
import csv
with open('csv_test2.csv', 'wb') as f:
    csv.excel.delimiter=';'
    writer = csv.writer(f, dialect=csv.excel)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
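Note that assigning to `csv.excel.delimiter` changes the dialect for every later writer in the process. A variant for Python 3 (an assumption; the answer targets Python 2.7) that overrides the delimiter for a single writer only, with a hypothetical function name:

```python
import csv

def write_semicolon_csv(path, rows):
    """Write rows as a ';'-delimited CSV without changing csv.excel globally."""
    # In Python 3 the file is opened in text mode with newline="" so the
    # csv module controls the line endings.
    with open(path, "w", newline="") as f:
        # Passing delimiter here overrides the dialect for this writer only.
        writer = csv.writer(f, dialect="excel", delimiter=";")
        writer.writerows(rows)
```

For example, `write_semicolon_csv("csv_test2.csv", [[1, 2, 3], [4, 5, 6]])` writes the same two rows as the code above.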

How do I extract specific lines of data from a huge Excel sheet using Python?

I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?
I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
How big is the file in MB? ["Huge" is not a useful answer]
What software created the file?
How much memory do you have on your computer?
Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message
What operating system, what version of Python, what version of xlrd?
Do you know how many worksheets there are in the file?
It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.
Assuming the first case:
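import xlrd
wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'
wb = xlrd.open_workbook(wb_path)
ws = wb.sheets()[0] # assuming you want to work with the first sheet in the workbook
# binary mode, so the '\r\n' written below is not translated again on Windows
with open(output_path, 'wb') as output_file:
    for i in xrange(ws.nrows):
        row = [cell.value for cell in ws.row(i)]
        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            # cell values may be numbers, so convert each one to text before joining
            line = u'\t'.join(unicode(value) for value in row)
            output_file.write(line.encode('utf-8') + '\r\n')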
This will give you a tab-delimited output file that should open in Excel.
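The question asks for rows that contain certain names rather than a fixed value in the first column. A minimal Python 3 sketch of that filter follows; the function names are hypothetical, and `rows` is assumed to be any iterable of cell-value lists, such as the `row` lists built in the loop above.

```python
def matching_rows(rows, keywords):
    """Yield rows (as lists of strings) in which any cell contains a keyword."""
    for row in rows:
        # Cell values may be numbers, so convert everything to text first.
        cells = [str(value) for value in row]
        if any(keyword in cell for cell in cells for keyword in keywords):
            yield cells

def write_tab_delimited(path, rows):
    """Write rows of strings to a tab-delimited text file."""
    with open(path, "w", encoding="utf-8", newline="") as out:
        for cells in rows:
            out.write("\t".join(cells) + "\r\n")
```

Because `matching_rows` is a generator, the matches are written out one at a time and only the source rows themselves need to fit in memory.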
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
I haven't used it, but xlrd looks like it does a good job reading Excel data.
Your problem is that you are using Excel 2003. You need to use a more recent version to be able to read this file. Excel 2003 will not open worksheets with more than 65,536 rows.
