Python: converting a corrupt .xls file

I downloaded a few sales datasets from an SAP application. SAP automatically exported the data as .xls files. Whenever I open one with the pandas library I get the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
When I open the .xls file in MS Excel, a popup says that the file is corrupt or has an unsupported extension and asks whether I want to continue. When I click 'Yes', it shows the correct data. If I then save the file again as .xls from MS Excel, I can read it with pandas.
So I tried renaming the file using os.rename(), but that didn't work. I also tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but that didn't work either.
The only solution so far is to open MS Excel and save the file again as .xls manually. Is there any way to automate this? Kindly help.

Finally I converted the corrupt .xls to a correct .xls file. The following is the code:
# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open the corrupt Excel file
import io
filename = r'SALEJAN17.xls'
# Opening the file using 'utf-16' encoding
with io.open(filename, "r", encoding="utf-16") as file1:
    data = file1.readlines()
# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here:
    # removing the '\n' left at the end of each line by readlines()
    # and getting the values by splitting on '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Saving the file as an excel file
xldoc.save('myexcel.xls')
import pandas as pd
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
No errors.
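A note on why this works: the bytes \xff\xfe at the start of the error message are a UTF-16 little-endian byte order mark, which suggests the SAP export is really tab-separated text with an .xls extension. If that assumption holds for your file, pandas may be able to read it directly as delimited text, without going through xlwt. A minimal sketch (the file written here is only a stand-in for SALEJAN17.xls, and its column names are made up):

```python
import os
import tempfile

import pandas as pd

# Stand-in for the SAP export: tab-separated text saved as UTF-16 with a BOM.
# The column names and values are illustrative only.
sample_path = os.path.join(tempfile.mkdtemp(), "SALEJAN17.xls")
with open(sample_path, "w", encoding="utf-16", newline="") as f:
    f.write("Material\tQuantity\nA100\t5\nB200\t7\n")

# Read the file as delimited text rather than as an Excel workbook.
df = pd.read_csv(sample_path, sep="\t", encoding="utf-16")
print(df)
```

If the real export has extra header lines above the table, read_csv's skiprows parameter could be used to skip them.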

The other way to solve this problem is using win32com.client library:
import win32com.client
import os
o = win32com.client.Dispatch("Excel.Application")
o.Visible = False
filename = os.path.join(os.getcwd(), 'SALEJAN17.xls')
output = os.path.join(os.getcwd(), 'myexcel.xlsx')
wb = o.Workbooks.Open(filename)
# 51 is the xlOpenXMLWorkbook (.xlsx) file format constant
wb.ActiveSheet.SaveAs(output, 51)
My example saves to .xlsx format, but you can save as .xls as well.

Related

Convert .xlsx to .txt with python? or format .txt file to fix columns indentation?

I have an Excel file with many rows and columns. When I convert it directly from .xlsx to .txt with Excel, the output ends up with uneven spacing (the columns are not aligned the way they are in Excel), and due to some requirements I really need them to be aligned.
So, is there a better way to write from Excel to txt using Python, or to format the txt file so that the columns align perfectly?
I found this code in a previous question but I am getting the following error:
TypeError: a bytes-like object is required, not 'str'
Code:
import xlrd
import csv
# open the output csv
with open('my.csv', 'wb') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in range(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
is there a better way to write from excel to txt using python?
I'm not sure if it's a better way, but you could write the contents of xlsx file to txt this way:
import pandas as pd
with open('test.txt', 'w') as file:
    pd.read_excel('test.xlsx').to_string(file, index=False)
Edit:
To convert a date column to a desired format, you could try the following:
with open('test.txt', 'w') as file:
    df = pd.read_excel('test.xlsx')
    df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y%m%d')
    df.to_string(file, index=False, na_rep='')
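To see what to_string does to the column alignment without needing an Excel file, here is a small sketch that uses an in-memory DataFrame in place of pd.read_excel('test.xlsx'). The data and column names are invented for illustration:

```python
import io

import pandas as pd

# Illustrative data standing in for pd.read_excel('test.xlsx').
df = pd.DataFrame({"name": ["apple", "kiwi"], "price": [1.5, 12.25]})

buffer = io.StringIO()
# to_string pads each column to a fixed width, so the columns line up
# when the text is viewed in a monospaced font.
df.to_string(buffer, index=False)
text = buffer.getvalue()
print(text)
```

The alignment relies on spaces, so it only holds in a monospaced font and would not survive a conversion to tabs.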
The problem lies in this line:
with open('my.csv', 'wb') as myCsvfile:
'wb' suggests you will be writing bytes, but in reality you will be writing regular characters. Change it to 'w' (in Python 3, also pass newline='', as the csv module documentation recommends). It is probably also good practice to use a with block for the Excel file:
import xlrd
import csv
# open the output csv in text mode
with open('my.csv', 'w', newline='') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    with xlrd.open_workbook('myfile.xlsx') as myXlsxfile:
        # get a sheet
        mysheet = myXlsxfile.sheet_by_index(0)
        # write the rows
        for rownum in range(mysheet.nrows):
            wr.writerow(mysheet.row_values(rownum))
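The difference between the two modes can be reproduced without any Excel file. The following sketch (Python 3, with a throwaway temporary file and made-up rows) shows the binary-mode write failing with the TypeError from the question and the text-mode write succeeding:

```python
import csv
import os
import tempfile

rows = [["a", 1], ["b", 2]]
path = os.path.join(tempfile.mkdtemp(), "my.csv")

# Binary mode: csv.writer produces str, so Python 3 rejects the write.
try:
    with open(path, "wb") as f:
        csv.writer(f, delimiter="\t").writerow(rows[0])
    binary_mode_error = None
except TypeError as exc:
    binary_mode_error = str(exc)

# Text mode with newline="" so the csv module controls the line endings.
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

with open(path, newline="") as f:
    written = f.read()

print(binary_mode_error)
print(repr(written))
```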
import pandas as pd
read_file = pd.read_excel(r'your excel file name.xlsx', sheet_name='your sheet name')
read_file.to_csv(r'Path to store the txt file\File name.txt', index=None, header=True)

Python: Converting multiple files from xls to csv

I'm trying to write a script in Python 2.7 that converts all .xls and .xlsx files in the current directory into .csv files while preserving their original file names.
With help from other similar questions here (sadly, not sure who to credit for the pieces of code I borrowed), here's what I've got so far:
import xlrd
import csv
import os
def csv_from_excel(xlfile):
    wb = xlrd.open_workbook(xlfile)
    sh = wb.sheet_by_index(0)
    your_csv_file = open(os.path.splitext(xlfile)[0], 'wb')
    wr = csv.writer(your_csv_file, dialect='excel', quoting=csv.QUOTE_ALL)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()

for file in os.listdir(os.getcwd()):
    if file.lower().endswith(('.xls','.xlsx')):
        csv_from_excel(file)
I have two questions:
1) I can't figure out why the program, when run, converts only one file and doesn't iterate through all the files in the current directory.
2) I can't figure out how to keep the original filename through the conversion, i.e. so that each output file has the same name as its input.
Thank you
One possible solution would be using glob and pandas.
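import os
from glob import glob
import pandas as pd

excel_files = glob('*.xls*')
for excel in excel_files:
    # splitext keeps any dots inside the base name intact
    out = os.path.splitext(excel)[0] + '.csv'
    df = pd.read_excel(excel, 'Sheet1')
    df.to_csv(out)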
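For the file-name part of the question, a pathlib-based variant may be easier to reason about. This is only a sketch in Python 3 (the question targets 2.7): the helper names csv_path_for and convert_all are made up for illustration, and it assumes the first sheet is the one to export and that pandas has an Excel engine installed for the file types involved.

```python
from pathlib import Path

import pandas as pd


def csv_path_for(excel_path):
    """Return the input path with its final extension replaced by .csv."""
    return Path(excel_path).with_suffix(".csv")


def convert_all(directory="."):
    """Convert the first sheet of every .xls/.xlsx file in `directory`."""
    converted = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() in (".xls", ".xlsx"):
            # sheet_name=0 selects the first sheet regardless of its name.
            df = pd.read_excel(path, sheet_name=0)
            out = csv_path_for(path)
            df.to_csv(out, index=False)
            converted.append(out)
    return converted
```

Calling convert_all() would process the current directory. Note that a.xls and a.xlsx would both map to a.csv, so the second conversion would overwrite the first.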

How to convert a CSV to xlsx file in python

I am trying to convert a CSV to the xlsx file format because I have code that is meant to read an Excel file, but I ended up getting a CSV. Is there a way to convert a CSV file to a temporary Excel file and keep it from being destroyed until the reading process is done? I have tried using openpyxl, but it ends up not working and throws an error saying it's not a good zip file. I even tried converting the CSV to text and then storing it in a dictionary, but writing it to Excel using the xlrd package did not work either. I was wondering if there is a way to do it in a cc
Seems like you open the file in text mode. Try this to open file
open('sample.csv', "rt", encoding="utf8")
or
open('sample.csv', "rt", encoding="ascii")
depending on the encoding of the file
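Some background on the zip error: an .xlsx file is a ZIP archive, which is why openpyxl complains about a bad zip file when it is handed plain CSV text. One way to guard against that, sketched below, is to check the container with zipfile.is_zipfile before choosing a reader. The helper name read_rows and the sample file are made up for illustration.

```python
import csv
import os
import tempfile
import zipfile


def read_rows(path, encoding="utf8"):
    """Return rows from a CSV file, refusing files that are really ZIP-based."""
    if zipfile.is_zipfile(path):
        raise ValueError("looks like an .xlsx (ZIP) file; use an Excel reader")
    with open(path, "rt", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Demonstration with a throwaway CSV file.
sample = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample, "w", encoding="utf8", newline="") as f:
    f.write("id,name\n1,widget\n")

is_zip = zipfile.is_zipfile(sample)
rows = read_rows(sample)
print(is_zip)
print(rows)
```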

how to convert Excel file to CSV and prevent UTF-8 encoding

I have 5 Excel files that have to be compiled into one csv file that can be uploaded to our website for our affiliated stores database. Until now we've had someone manually cut and paste the rows of each file into one master csv file in Excel then they upload that file to the website.
I've been trying to use Python to consolidate the files so the user would just have to run the Python script that would do this for her. The problem is that the Excel files are encoded in Shift-JIS and when I use CSV writer in Python they get converted to UTF-8. The website we upload them to will only accept files in Shift-JIS, so I have to keep all of this data in Shift-JIS.
Since DOS automatically defaults to ascii encoding, I first have to run this:
import codecs, sys, xlrd, csv
reload(sys)
sys.setdefaultencoding('shift_jis')
Here is a sample of the code for one of the Excel files, which has data on 2 separate worksheets:
with xlrd.open_workbook('Circle.xls') as wb:
    for sheet in wb.sheets():
        fn = 'store-'
        print "Converting files.."
        with open(fn + sheet.name + ".csv","wb") as f:
            c = csv.writer(f,dialect="excel")
            for r in range(sheet.nrows):
                c.writerow(sheet.row_values(r))
The conversion runs until it finds a UTF-8 character that doesn't exist in shift-JIS, then it errors out.
Is there a way to convert from Excel to a csv purely in shift-JIS?
(If my question has a flaw, please ask me to edit it before marking it down! I will edit it!)
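For comparison, in Python 3 the output encoding can be chosen when the file is opened, so no change to the default encoding is needed. The sketch below is not the asker's Python 2 setup; it uses invented rows in place of the xlrd sheet data and assumes that replacing unencodable characters with '?' is acceptable for the upload.

```python
import csv
import os
import tempfile

# Illustrative rows; the last cell holds a character Shift-JIS cannot encode.
rows = [["店名", "住所"], ["テスト", "東京 \U0001F600"]]

out_path = os.path.join(tempfile.mkdtemp(), "store-sample.csv")
# errors="replace" substitutes '?' for unencodable characters instead of
# raising UnicodeEncodeError; errors="strict" would surface them instead.
with open(out_path, "w", encoding="shift_jis", errors="replace", newline="") as f:
    csv.writer(f, dialect="excel").writerows(rows)

with open(out_path, "rb") as f:
    decoded = f.read().decode("shift_jis")
print(decoded)
```

If silently losing characters is not acceptable, keeping errors="strict" and catching UnicodeEncodeError would at least identify which cells need manual attention.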

Excel fails to open Python-generated CSV files

I have many Python scripts that output CSV files. It is occasionally convenient to open these files in Excel. After installing OS X Mavericks, Excel no longer opens these files properly: Excel doesn't parse the files and it duplicates the rows of the file until it runs out of memory. Specifically, when Excel attempts to open the file, a prompt appears that reads: "File not loaded completely."
Example of code I'm using to generate the CSV files:
import csv
with open('csv_test.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
Even the simple file generated by the above code fails to load in Excel. However, if I open the CSV file in a text editor and copy/paste the text into Excel, parse it with text to columns, and then save as CSV from Excel, then I can reopen the CSV file in Excel without issue. Do I need to pass an additional parameter in my scripts to make Excel parse the CSV files the same way it used to? Or is there some setting I can change in OS X Mavericks or Excel? Thanks.
I may have had a similar problem: the error message "SYLK: File format is not valid" when opening a Python-generated CSV file. The cause is really funny: the first two characters of the file must not be an uppercase I and D (ID). See also the article '"SYLK: File format is not valid" error message when you open file'.
Possible solution 1: use *.txt instead of *.csv. In this case Excel (at least Excel 2010) shows an import data wizard where you can specify delimiters, character encoding, field types, etc.
UPD: Solution 2:
The Python "csv" module has a "dialect" feature. For example, the following modification of your code generates a valid CSV file in my environment (Python 2.7, Excel 2010, Windows 7, locale with ";" as the list delimiter):
import csv
with open('csv_test2.csv', 'wb') as f:
    csv.excel.delimiter=';'
    writer = csv.writer(f, dialect=csv.excel)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
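Assigning to csv.excel.delimiter changes the dialect for every writer in the process. A variant that limits the change to one writer is to pass delimiter as a keyword argument. The sketch below is written for Python 3 (text mode with newline='') and writes to a temporary path chosen only for the demonstration:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "csv_test2.csv")

# newline="" lets the csv module write its own '\r\n' line endings.
with open(path, "w", newline="") as f:
    # Passing delimiter here overrides the dialect for this writer only,
    # leaving csv.excel unchanged for the rest of the program.
    writer = csv.writer(f, dialect="excel", delimiter=";")
    writer.writerow([1, 2, 3])
    writer.writerow([4, 5, 6])

with open(path, "rb") as f:
    content = f.read()
print(content)
```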
