Python: Converting multiple files from xls to csv

I'm trying to write a script in Python 2.7 that would convert all .xls and .xlsx files in the current directory into .csv while preserving their original file names.
With help from other similar questions here (sadly, not sure who to credit for the pieces of code I borrowed), here's what I've got so far:
import xlrd
import csv
import os

def csv_from_excel(xlfile):
    wb = xlrd.open_workbook(xlfile)
    sh = wb.sheet_by_index(0)
    your_csv_file = open(os.path.splitext(sxlfile)[0], 'wb')
    wr = csv.writer(your_csv_file, dialect='excel', quoting=csv.QUOTE_ALL)
    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()

for file in os.listdir(os.getcwd()):
    if file.lower().endswith(('.xls', '.xlsx')):
        csv_from_excel(file)
I have two questions:
1) I can't figure out why the program, when run, only converts one file and doesn't iterate through all the files in the current directory.
2) I can't figure out how to keep the original file name through the conversion, i.e. so that each output file has the same name as its input.
Thank you

One possible solution would be using glob and pandas.
from glob import glob
import pandas as pd

excel_files = glob('*xls*')
for excel in excel_files:
    out = excel.split('.')[0] + '.csv'
    df = pd.read_excel(excel, 'Sheet1')
    df.to_csv(out)
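Alternatively, if you'd rather stay with the xlrd approach from the question, two small changes address both issues: sxlfile looks like a typo for the xlfile parameter (as written it raises a NameError, which stops the loop), and appending '.csv' to the stripped base name keeps the original file name. A minimal sketch for Python 2.7:

import xlrd
import csv
import os

def csv_from_excel(xlfile):
    wb = xlrd.open_workbook(xlfile)
    sh = wb.sheet_by_index(0)
    # keep the original base name and add a .csv extension
    csv_name = os.path.splitext(xlfile)[0] + '.csv'
    with open(csv_name, 'wb') as your_csv_file:  # 'wb' is fine for csv on Python 2
        wr = csv.writer(your_csv_file, dialect='excel', quoting=csv.QUOTE_ALL)
        for rownum in xrange(sh.nrows):
            wr.writerow(sh.row_values(rownum))

for fname in os.listdir(os.getcwd()):
    if fname.lower().endswith(('.xls', '.xlsx')):
        csv_from_excel(fname)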

Related

How to convert multiple .txt files into a .csv file in Python

I'm trying to convert multiple text files into a single .csv file using Python. My current code is this:
import pandas
import glob

# Collects the file names of all .txt files in a given directory.
file_names = glob.glob("./*.txt")

# [Middle step] Merges the text files into a single file titled 'output_file'.
with open('output_file.txt', 'w') as out_file:
    for i in file_names:
        with open(i) as in_file:
            for j in in_file:
                out_file.write(j)

# Reading the merged file and creating a dataframe.
data = pandas.read_csv("output_file.txt", delimiter='/')

# Store the dataframe in a csv file.
data.to_csv("convert_sample.csv", index=None)
So as you can see, I'm reading from all the files and merging them into a single .txt file. Then I convert it into a single .csv file. Is there a way to accomplish this without the middle step? Is it necessary to concatenate all my .txt files into a single .txt to convert it to .csv, or is there a way to directly convert multiple .txt files to a single .csv?
Thank you very much.
Of course it is possible. And you really don't need to involve pandas here, just use the standard library csv module. If you know the column names ahead of time, the most painless way is to use csv.DictWriter and csv.DictReader objects:
import csv
import glob

column_names = ['a', 'b', 'c']  # or whatever

with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.DictWriter(target, fieldnames=column_names)
    writer.writeheader()  # if you want a header
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            reader = csv.DictReader(source, delimiter='/', fieldnames=column_names)
            writer.writerows(reader)
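If the column names are not known ahead of time, plain csv.reader and csv.writer objects work just as well; a minimal sketch under that assumption (the file names and the '/' delimiter are taken from the question):

import csv
import glob

with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.writer(target)
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            # each input line is split on '/' and written out as a CSV row
            reader = csv.reader(source, delimiter='/')
            writer.writerows(reader)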

Convert .xlsx to .txt with python? or format .txt file to fix columns indentation?

I have an Excel file with many rows/columns, and when I convert the file directly from .xlsx to .txt with Excel, the file ends up with weird indentation (the columns are not perfectly aligned like in an Excel file), and due to some requirements I really need them to be.
So, is there a better way to write from excel to txt using python? Or to format the txt file so the columns align perfectly?
I found this code in a previous question but I am getting the following error:
TypeError: a bytes-like object is required, not 'str'
Code:
import xlrd
import csv

# open the output csv
with open('my.csv', 'wb') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in range(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
is there a better way to write from excel to txt using python?
I'm not sure if it's a better way, but you could write the contents of xlsx file to txt this way:
import pandas as pd

with open('test.txt', 'w') as file:
    pd.read_excel('test.xlsx').to_string(file, index=False)
Edit:
To convert the date column to a desired format, you could try the following:
with open('test.txt', 'w') as file:
    df = pd.read_excel('test.xlsx')
    df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y%m%d')
    df.to_string(file, index=False, na_rep='')
The problem lies in this row:
with open('my.csv', 'wb') as myCsvfile:
'wb' suggests you will be writing bytes, but in reality you will be writing regular characters. Change it to 'w'. Perhaps the best practice would be to also use a with block for the Excel file:
import xlrd
import csv

# open the output csv
with open('my.csv', 'w') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    with xlrd.open_workbook('myfile.xlsx') as myXlsxfile:
        # get a sheet
        mysheet = myXlsxfile.sheet_by_index(0)
        # write the rows
        for rownum in range(mysheet.nrows):
            wr.writerow(mysheet.row_values(rownum))
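One small addition, not part of the original answer: on Python 3 it is also worth passing newline='' to open(), which stops the csv module from inserting blank lines between rows on Windows:

with open('my.csv', 'w', newline='') as myCsvfile:
    wr = csv.writer(myCsvfile, delimiter="\t")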
import pandas as pd

read_file = pd.read_excel(r'your excel file name.xlsx', sheet_name='your sheet name')
read_file.to_csv(r'Path to store the txt file\File name.txt', index=None, header=True)
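Since the target here is a .txt file, note that to_csv writes comma-separated values by default; if tab-separated text is wanted instead (an assumption about the desired layout, not something this answer states), pass a sep argument:

read_file.to_csv(r'Path to store the txt file\File name.txt', sep='\t', index=None, header=True)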

Adding an xls filename to a column in python

I have some code that carries out this task for .csv (thanks to Michal K for assistance).
Any ideas on how I could change this to work on a directory of .xls files rather than .csv files?
import csv
import os

for file_name in os.listdir("c:/projects/files"):
    with open(file_name, 'r') as csvinput:
        reader = csv.reader(csvinput)
        all = []

        row = next(reader)
        row.append('FileName')
        all.append(row)

        for row in reader:
            row.append(file_name)
            all.append(row)

    with open(file_name, 'w') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        writer.writerows(all)
Excel spreadsheets are slightly more complicated than CSV files, so I'd recommend using an imported module such as openpyxl. Note that openpyxl reads the newer .xlsx format; for legacy binary .xls files you would need xlrd or a conversion step.
This allows you to get the worksheets (tabs) from the file, and manipulate the columns and rows as you see fit.
The general program structure would look something like this:
import os
import openpyxl

for file_name in os.listdir("c:/projects/files"):
    if file_name.endswith('.xls'):
        workbook = openpyxl.load_workbook(file_name)
        # Get worksheets
        # Manipulate columns and rows
        workbook.save(file_name)
There's a really good tutorial on using openpyxl here
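To make that structure concrete for this particular task (appending the file name as a new column), a rough sketch, assuming the files are actually in .xlsx format and that the first row holds headers:

import os
import openpyxl

folder = "c:/projects/files"
for file_name in os.listdir(folder):
    if file_name.endswith('.xlsx'):
        path = os.path.join(folder, file_name)
        workbook = openpyxl.load_workbook(path)
        sheet = workbook.active
        new_col = sheet.max_column + 1
        # header for the new column
        sheet.cell(row=1, column=new_col, value='FileName')
        # file name on every data row
        for row_num in range(2, sheet.max_row + 1):
            sheet.cell(row=row_num, column=new_col, value=file_name)
        workbook.save(path)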
For reading and writing Excel and CSV files, pandas is very convenient.
import pandas as pd
CSV:
csv_data = pd.read_csv(csv_filename, header=0)  # you can define the exact csv format with further arguments
csv_data['filename'] = csv_filename  # adds a column with the filename
Excel:
excel_data = pd.read_excel(excel_filename)
excel_data['filename'] = excel_filename
Export:
csv_data.to_csv(output_csv)
excel_data.to_excel(output_excel)
You can also export the csv to Excel or vice versa:
excel_data.to_csv(output_excel_csv)
csv_data.to_excel(output_csv_excel)
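Putting those pieces together for the original question (a directory of .xls files, each getting its own name in a new column), a rough sketch; the folder path comes from the question, the new-file suffix is my own choice, and the output is written as .xlsx because recent pandas versions can no longer write legacy .xls files:

import os
import pandas as pd

folder = "c:/projects/files"
for file_name in os.listdir(folder):
    if file_name.lower().endswith('.xls'):
        path = os.path.join(folder, file_name)
        df = pd.read_excel(path)    # reading .xls requires the xlrd package
        df['FileName'] = file_name  # new column holding the source file name
        out_path = os.path.splitext(path)[0] + '_with_name.xlsx'
        df.to_excel(out_path, index=False)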

python: converting corrupt xls file

I have downloaded a few sales datasets from a SAP application. SAP automatically converted the data to a .XLS file. Whenever I open it using the pandas library, I get the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
When I open the .XLS file using MS Excel, it shows a popup saying the file is corrupt or has an unsupported extension and asking whether I want to continue; when I click 'Yes', it shows the correct data. When I save the file again as .xls using MS Excel, I am able to use it with pandas.
So I tried renaming the file using os.rename(), but it didn't work. I also tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but that didn't work either.
The workaround is to open MS Excel and save the file again as .xls manually; is there any way to automate this? Kindly help.
I finally managed to convert the corrupt .xls to a proper .xls file. The following is the code:
# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as an excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open the corrupt excel file
import io

filename = r'SALEJAN17.xls'

# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()

# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)

# Iterating and saving the data to the sheet
for i, row in enumerate(data):
    # Two things are done here:
    # removing the '\n' which comes while reading the file using io.open,
    # and getting the values after splitting on '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)

# Saving the file as an excel file
xldoc.save('myexcel.xls')
import pandas as pd
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
No errors.
Another way to solve this problem is to use the win32com.client library:
import win32com.client
import os
o = win32com.client.Dispatch("Excel.Application")
o.Visible = False
filename = os.getcwd() + '/' + 'SALEJAN17.xls'
output = os.getcwd() + '/' + 'myexcel.xlsx'
wb = o.Workbooks.Open(filename)
wb.ActiveSheet.SaveAs(output,51)
In my example I save to the .xlsx format, but you can save as .xls as well.
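One housekeeping detail worth adding to the win32com approach (my addition, not part of the answer): the workbook and the Excel COM instance should be closed explicitly, otherwise a hidden Excel process keeps running after the script ends:

wb.Close(False)  # close the workbook without prompting to save again
o.Quit()         # shut down the hidden Excel instance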

Pasting content of CSV files into an excel sheet in python 2.5

So basically I have an .xlsm document; it contains a sheet "data" and a graph, and the graph is generated from the data.
I have 5 CSV files.
I need to erase the content of "data", then fill it with the 5 CSV files.
Is it possible to do this without putting my CSV files into an array and writing them line by line (very time-consuming)?
Can I just open my CSV files and sort of paste them into the data sheet?
Thanks
You can just open CSV files in Excel, and you can join your CSV files into one larger one with:
my_files = ['file_1.csv', 'file_2.csv', 'file_3.csv', 'file_4.csv', 'file_5.csv']
with open('output.csv', 'w') as oo:
    for file in my_files:
        with open(file, 'r') as io:
            oo.writelines(io.readlines())
You can then open output.csv in Excel.
If you don't want to manually list all your input files, you could use glob:
>>> my_files = glob.glob('file_*.csv')
>>> my_files
... ['file_1.csv', 'file_2.csv', 'file_3.csv', 'file_4.csv', 'file_5.csv']
There is even an Excel read/write Python module, xlwt, which you could use to write directly to an Excel file:
import xlwt

w = xlwt.Workbook()
ws = w.add_sheet('First sheet')

i = 0
for file in my_files:
    with open(file, 'r') as fo:
        for line in fo.readlines():
            for j, col in enumerate(line.split(',')):
                ws.write(i, j, col)
            i += 1
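A small refinement to consider (my addition, reusing the my_files list from above): splitting each line on ',' breaks on quoted fields that themselves contain commas, so it can be safer to let the csv module do the parsing, and the workbook still needs to be saved at the end to actually produce a file:

import csv
import xlwt

w = xlwt.Workbook()
ws = w.add_sheet('First sheet')

i = 0
for file in my_files:
    with open(file, 'rb') as fo:  # binary mode for the csv module on Python 2
        for row in csv.reader(fo):
            for j, col in enumerate(row):
                ws.write(i, j, col)
            i += 1
w.save('combined.xls')  # output file name is illustrative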
