Save file name and extension based on another file with .to_csv()

Save file name and extension based on another file with .to_csv() - python

I have an input file name file_a.xml
I already created a function to parse out the xml and save it as a df. Then I used df.to_csv
to save the output file name file_a.csv
Is there a way to do this automatically with default filename and extension?
I need to iterate over a folder with lots of .xml files, so I like to the output filename & extension it based on the input xml file.
xml_file = open ('file/path/dir/file_a.xml','r').read()
def XML_to_CSV(xml_file):
...code to parse out xml...
return df
csv_data = df.to_csv('file/path/dir/file_a.csv',index = False)

Try something like this:
import os
import pandas as pd
from pathlib import Path
for file in os.listdir("your dir"):
if file.endswith(".xml"):
...make the xml turn df.
df.to_csv(Path(file).stem + '.csv', index=False)

Related

SAX Parser in Python

I am parsing xml files in a folder using Python SAX Parser and writing the output in CSV using pandas, But I am getting only the data from last file in the CSV.
I am new to Python and this is for the first time trying SAX Parsing
File read:
for dirpath, dirs, files in os.walk(fp1):
for filename in files:
print(files)
fname = os.path.join(dirpath,filename)
if fname.endswith('.xml'):
print(fname)
#for count in files:
parser.parse(fname)
def characters(self, content):
rows = []
cols = ["ReporterCite","DecisionDate","CaseName","FileNum","CourtLocation","CourtName","CourtAbbrv","Judge","CaseLength","CourtCite","ParallelCite","CitedCount","UCN"]
#ReporteCite, DecisionDate, CaseName, FileNum, CourtLocation, CourtName, CourtAbbrv, Judge, CaseLength, CourtCite, ParallelCite, CitedCount, UCN
rows.append({"ReporterCite":self.rc,
"DecisionDate": self.dd,
"CaseName": self.can,
"FileNum": self.fn,
"CourtLocation": self.loc,
"CourtName": self.cn,
"CourtAbbrv": self.ca,
"Judge": self.j,
"CaseLength": self.cl,
"CourtCite": self.cc,
"ParallelCite": self.pc,
"CitedCount": self.cd,
"UCN": self.rn})
#print(rows)
df = pd.DataFrame(rows, columns=cols)
df.to_csv(fp2,index=False)

I assume you will always overwrite your previous result. This is a pandas question, not a SAX question. You would like append to the existing csv, right? If this is the case you have to use the mode = ‘a’, like
df.to_csv('filename.csv',mode = 'a')
More options, see Doc
'w' open for writing, truncating the file first (default)
'x' open for exclusive creation, failing if file already exists
'a' open for writing, appending to the end of file if it exists

Extracting files from zip file from GET request

I currently have a GET request to a URL that returns three things: .zip file, .zipsig file, and a .txt file.
I'm only interested in the .zip file which has dozens of .json files. I would like to extract all these .json files, preferable directly into a single pandas data frame, but extracting them into a folder also works.
Code so far, mostly stolen:
license = requests.get(url, headers={'Authorization': "Api-Token " + 'blah'})
z = zipfile.ZipFile(io.BytesIO(license.content))
billingRecord = z.namelist()[0]
z.extract(billingRecord, path = "C:\\Users\\Me\\Downloads\\Json license")
This extracts the entire .zip file to the path. I would like to extract the individual .json files from said .zip file to the path.

import io
import zipfile
import pandas as pd
import json
dfs = []
with zipfile.ZipFile(io.BytesIO(license.content)) as zfile:
for info in zfile.infolist():
if info.filename.endswith('.zip'):
zfiledata = io.BytesIO(zfile.read(info.filename))
with zipfile.ZipFile(zfiledata) as json_zips:
for info in json_zips.infolist():
if info.filename.endswith('.json'):
json_data = pd.json_normalize(json.loads(json_zips.read(info.filename)))
dfs.append(json_data)
df = pd.concat(dfs, sort=False)
print(df)

I would do something like this. Obviously this is my test.zip file but the steps are:
List the files from the archive using the .infolist() method on your z archive
Check if the filename ends with the json extension using .endswith('.json')
Extract that filename with .extract(info.filename, info.filename)
Obviously you've called your archive z but mine is archive bu that should get you started.
Example code:
import zipfile
with zipfile.ZipFile("test.zip", mode="r") as archive:
for info in archive.infolist():
print(info.filename)
if info.filename.endswith('.png'):
print('Match: ', info.filename)
archive.extract(info.filename, info.filename)

Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

I am writing some automated scripts to process Excel files in Python, some are in XLS format. Here's a code snippet of my attempting to do so with Pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.
In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.
Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

Or can someone help me take the sheet from xlrd and convert it into a
Pandas dataframe?
pd.read_excel can take a book...
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
some column headers
0 1 some foo
1 2 strings bar
2 3 here yes
3 4 too no
I'll include the code below that may help if you want to check/handle Excel file types. Maybe you can adapt it for your needs.
The code loops through a local folder and shows the file and extension but then uses python-magic to drill into it. It also has a column showing guessing from mimetypes but that isn't as good. Do zoom into the image of the frame and see that some .xls are not what the extension says. Also a .txt is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
name, extension = os.path.splitext(file)
data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
if magic.from_file(file, mime=True) == xlsx_file_format:
print('xlsx')
# DO SOMETHING SPECIAL WITH XLSX FILES
elif magic.from_file(file, mime=True) == xls_file_format:
print('xls')
# DO SOMETHING SPECIAL WITH XLS FILES
else:
continue
dfs = []
for file in all_files:
if (magic.from_file(file, mime=True) == xlsx_file_format) or \
(magic.from_file(file, mime=True) == xls_file_format):
# who cares, it all works with this for the demo...
df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
dfs.append(df)
print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4

Need some assistance on a DBF "File not found" error in Python when looping through a directory?

I would like to ask for help with a Python script that is supposed to loop through a directory on a drive. Basically, what I want to do is convert over 10,0000 DBF files to CSV. So far, I can achieve this on an individual dbf file by using using the dbfread and Pandas packages. Running this script over 10,000 individual times is obviously not feasible, hence why I want automate the task by writing a script that will loop through each dbf file in the directory.
Here is what I would like to do.
Define the directory
Write a for loop that will loop through each file in the directory
Only open a file with the extension '.dbf'
Convert to Pandas DataFrame
Define the name for the output file
Write to CSV and place file in a new directory
Here is the code that I was using to test whether I could convert an individual '.dbf' file to a CSV.
from dbfread import DBF
import pandas as pd
table = DBF('Name_of_File.dbf')
#I originally kept receiving a unicode decoding error
#So I manually adjusted the attributes below
table.encoding = 'utf-8' # Set encoding to utf-8 instead of 'ascii'
table.char_decode_errors = 'ignore' #ignore any decode errors while reading in the file
frame = pd.DataFrame(iter(table)) #Convert to DataFrame
print(frame) #Check to make sure Dataframe is structured proprely
frame.to_csv('Name_of_New_File')
The above code worked exactly as it was intended.
Here is my code to loop through the directory.
import os
from dbfread import DBF
import pandas as pd
directory = 'Path_to_diretory'
dest_directory = 'Directory_to_place_new_file'
for file in os.listdir(directory):
if file.endswith('.DBF'):
print(f'Reading in {file}...')
dbf = DBF(file)
dbf.encoding = 'utf-8'
dbf.char_decode_errors = 'ignore'
print('\nConverting to DataFrame...')
frame = pd.DataFrame(iter(dbf))
print(frame)
outfile = frame.os.path.join(frame + '_CSV' + '.csv')
print('\nWriting to CSV...')
outfile.to_csv(dest_directory, index = False)
print('\nConverted to CSV. Moving to next file...')
else:
print('File not found.')
When I run this code, I receive a DBFNotFound error that says it couldn't find the first file in the directory. As I am looking at my code, I am not sure why this is happening when it worked in the first script.
This is the code from the dbfread package from where the exception is being raised.
class DBF(object):
"""DBF table."""
def __init__(self, filename, encoding=None, ignorecase=True,
lowernames=False,
parserclass=FieldParser,
recfactory=collections.OrderedDict,
load=False,
raw=False,
ignore_missing_memofile=False,
char_decode_errors='strict'):
self.encoding = encoding
self.ignorecase = ignorecase
self.lowernames = lowernames
self.parserclass = parserclass
self.raw = raw
self.ignore_missing_memofile = ignore_missing_memofile
self.char_decode_errors = char_decode_errors
if recfactory is None:
self.recfactory = lambda items: items
else:
self.recfactory = recfactory
# Name part before .dbf is the table name
self.name = os.path.basename(filename)
self.name = os.path.splitext(self.name)[0].lower()
self._records = None
self._deleted = None
if ignorecase:
self.filename = ifind(filename)
if not self.filename:
**raise DBFNotFound('could not find file {!r}'.format(filename))** #ERROR IS HERE
else:
self.filename = filename
Thank you any help provided.

os.listdir returns the file names inside the directory, so you have to join them to the base path to get the full path:
for file_name in os.listdir(directory):
if file_name.endswith('.DBF'):
file_path = os.path.join(directory, file_name)
print(f'Reading in {file_name}...')
dbf = DBF(file_path)

xls to csv converter

I am using win32.client in python for converting my .xlsx and .xls file into a .csv. When I execute this code it's giving an error. My code is:
def convertXLS2CSV(aFile):
'''converts a MS Excel file to csv w/ the same name in the same directory'''
print "------ beginning to convert XLS to CSV ------"
try:
import win32com.client, os
from win32com.client import constants as c
excel = win32com.client.Dispatch('Excel.Application')
fileDir, fileName = os.path.split(aFile)
nameOnly = os.path.splitext(fileName)
newName = nameOnly[0] + ".csv"
outCSV = os.path.join(fileDir, newName)
workbook = excel.Workbooks.Open(aFile)
workbook.SaveAs(outCSV, c.xlCSVMSDOS) # 24 represents xlCSVMSDOS
workbook.Close(False)
excel.Quit()
del excel
print "...Converted " + nameOnly + " to CSV"
except:
print ">>>>>>> FAILED to convert " + aFile + " to CSV!"
convertXLS2CSV("G:\\hello.xlsx")
I am not able to find the error in this code. Please help.

I would use xlrd - it's faster, cross platform and works directly with the file.
As of version 0.8.0, xlrd reads both XLS and XLSX files.
But as of version 2.0.0, support was reduced back to only XLS.
import xlrd
import csv
def csv_from_excel():
wb = xlrd.open_workbook('your_workbook.xls')
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open('your_csv_file.csv', 'wb')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in xrange(sh.nrows):
wr.writerow(sh.row_values(rownum))
your_csv_file.close()

I would use pandas. The computationally heavy parts are written in cython or c-extensions to speed up the process and the syntax is very clean. For example, if you want to turn "Sheet1" from the file "your_workbook.xls" into the file "your_csv.csv", you just use the top-level function read_excel and the method to_csv from the DataFrame class as follows:
import pandas as pd
data_xls = pd.read_excel('your_workbook.xls', 'Sheet1', index_col=None)
data_xls.to_csv('your_csv.csv', encoding='utf-8')
Setting encoding='utf-8' alleviates the UnicodeEncodeError mentioned in other answers.

Maybe someone find this ready-to-use piece of code useful. It allows to create CSVs from all spreadsheets in Excel's workbook.
Python 2:
# -*- coding: utf-8 -*-
import xlrd
import csv
from os import sys
def csv_from_excel(excel_file):
workbook = xlrd.open_workbook(excel_file)
all_worksheets = workbook.sheet_names()
for worksheet_name in all_worksheets:
worksheet = workbook.sheet_by_name(worksheet_name)
with open(u'{}.csv'.format(worksheet_name), 'wb') as your_csv_file:
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in xrange(worksheet.nrows):
wr.writerow([unicode(entry).encode("utf-8") for entry in worksheet.row_values(rownum)])
if __name__ == "__main__":
csv_from_excel(sys.argv[1])
Python 3:
import xlrd
import csv
from os import sys
def csv_from_excel(excel_file):
workbook = xlrd.open_workbook(excel_file)
all_worksheets = workbook.sheet_names()
for worksheet_name in all_worksheets:
worksheet = workbook.sheet_by_name(worksheet_name)
with open(u'{}.csv'.format(worksheet_name), 'w', encoding="utf-8") as your_csv_file:
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in range(worksheet.nrows):
wr.writerow(worksheet.row_values(rownum))
if __name__ == "__main__":
csv_from_excel(sys.argv[1])

I'd use csvkit, which uses xlrd (for xls) and openpyxl (for xlsx) to convert just about any tabular data to csv.
Once installed, with its dependencies, it's a matter of:
python in2csv myfile > myoutput.csv
It takes care of all the format detection issues, so you can pass it just about any tabular data source. It's cross-platform too (no win32 dependency).

First read your excel spreadsheet into pandas, below code will import your excel spreadsheet into pandas as a OrderedDict type which contain all of your worksheet as dataframes. Then simply use worksheet_name as a key to access specific worksheet as a dataframe and save only required worksheet as csv file by using df.to_csv(). Hope this will workout in your case.
import pandas as pd
df = pd.read_excel('YourExcel.xlsx', sheet_name=None)
df['worksheet_name'].to_csv('YourCsv.csv')
If your Excel file contain only one worksheet then simply use below code:
import pandas as pd
df = pd.read_excel('YourExcel.xlsx')
df.to_csv('YourCsv.csv')
If someone want to convert all the excel worksheets from single excel workbook to the different csv files, try below code:
import pandas as pd
def excelTOcsv(filename):
df = pd.read_excel(filename, sheet_name=None)
for key, value in df.items():
return df[key].to_csv('%s.csv' %key)
This function is working as a multiple Excel sheet of same excel workbook to multiple csv file converter. Where key is the sheet name and value is the content inside sheet.

#andi I tested your code, it works great, BUT
In my sheets there's a column like this
2013-03-06T04:00:00
date and time in the same cell
It gets garbled during exportation, it's like this in the exported file
41275.0416667
other columns are ok.
csvkit, on the other side, does ok with that column but only exports ONE sheet, and my files have many.

xlsx2csv is faster than pandas and xlrd.
xlsx2csv -s 0 crunchbase_monthly_.xlsx cruchbase
excel file usually comes with n sheetname.
-s is sheetname index.
then, cruchbase folder will be created, each sheet belongs to xlsx will be converted to a single csv.
p.s. csvkit is awesome too.

Quoting an answer from Scott Ming, which works with workbook containing multiple sheets:
Here is a python script getsheets.py (mirror), you should install pandas and xlrd before you use it.
Run this:
pip3 install pandas xlrd # or `pip install pandas xlrd`
How does it works?
$ python3 getsheets.py -h
Usage: getsheets.py [OPTIONS] INPUTFILE
Convert a Excel file with multiple sheets to several file with one sheet.
Examples:
getsheets filename
getsheets filename -f csv
Options:
-f, --format [xlsx|csv] Default xlsx.
-h, --help Show this message and exit.
Convert to several xlsx:
$ python3 getsheets.py goods_temp.xlsx
Sheet.xlsx Done!
Sheet1.xlsx Done!
All Done!
Convert to several csv:
$ python3 getsheets.py goods_temp.xlsx -f csv
Sheet.csv Done!
Sheet1.csv Done!
All Done!
getsheets.py:
# -*- coding: utf-8 -*-
import click
import os
import pandas as pd
def file_split(file):
s = file.split('.')
name = '.'.join(s[:-1]) # get directory name
return name
def getsheets(inputfile, fileformat):
name = file_split(inputfile)
try:
os.makedirs(name)
except:
pass
df1 = pd.ExcelFile(inputfile)
for x in df1.sheet_names:
print(x + '.' + fileformat, 'Done!')
df2 = pd.read_excel(inputfile, sheetname=x)
filename = os.path.join(name, x + '.' + fileformat)
if fileformat == 'csv':
df2.to_csv(filename, index=False)
else:
df2.to_excel(filename, index=False)
print('\nAll Done!')
CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'])
#click.command(context_settings=CONTEXT_SETTINGS)
#click.argument('inputfile')
#click.option('-f', '--format', type=click.Choice([
'xlsx', 'csv']), default='xlsx', help='Default xlsx.')
def cli(inputfile, format):
'''Convert a Excel file with multiple sheets to several file with one sheet.
Examples:
\b
getsheets filename
\b
getsheets filename -f csv
'''
if format == 'csv':
getsheets(inputfile, 'csv')
else:
getsheets(inputfile, 'xlsx')
cli()

We can use Pandas lib of Python to conevert xls file to csv file
Below code will convert xls file to csv file .
import pandas as pd
Read Excel File from Local Path :
df = pd.read_excel("C:/Users/IBM_ADMIN/BU GPA Scorecard.xlsx",sheetname=1)
Trim Spaces present on columns :
df.columns = df.columns.str.strip()
Send Data frame to CSV file which will be pipe symbol delimted and without Index :
df.to_csv("C:/Users/IBM_ADMIN/BU GPA Scorecard csv.csv",sep="|",index=False)

Python is not the best tool for this task. I tried several approaches in Python but none of them work 100% (e.g. 10% converts to 0.1, or column types are messed up, etc). The right tool here is PowerShell, because it is an MS product (as is Excel) and has the best integration.
Simply download this PowerShell script, edit line 47 to enter the path for the folder containing the Excel files and run the script using PowerShell.

Using xlrd is a flawed way to do this, because you lose the Date Formats in Excel.
My use case is the following.
Take an Excel File with more than one sheet and convert each one into a file of its own.
I have done this using the xlsx2csv library and calling this using a subprocess.
import csv
import sys, os, json, re, time
import subprocess
def csv_from_excel(fname):
subprocess.Popen(["xlsx2csv " + fname + " --all -d '|' -i -p "
"'<New Sheet>' > " + 'test.csv'], shell=True)
return
lstSheets = csv_from_excel(sys.argv[1])
time.sleep(3) # system needs to wait a second to recognize the file was written
with open('[YOUR PATH]/test.csv') as f:
lines = f.readlines()
firstSheet = True
for line in lines:
if line.startswith('<New Sheet>'):
if firstSheet:
sh_2_fname = line.replace('<New Sheet>', '').strip().replace(' - ', '_').replace(' ','_')
print(sh_2_fname)
sh2f = open(sh_2_fname+".csv", "w")
firstSheet = False
else:
sh2f.close()
sh_2_fname = line.replace('<New Sheet>', '').strip().replace(' - ', '_').replace(' ','_')
print(sh_2_fname)
sh2f = open(sh_2_fname+".csv", "w")
else:
sh2f.write(line)
sh2f.close()

I've tested all anwers, but they were all too slow for me. If you have Excel installed you can use the COM.
I thought initially it would be slower since it will load everything for the actual Excel application, but it isn't for huge files. Maybe because the algorithm for opening and saving files runs a heavily optimized compiled code, Microsoft guys make a lot of money for it after all.
import sys
import os
import glob
from win32com.client import Dispatch
def main(path):
excel = Dispatch("Excel.Application")
if is_full_path(path):
process_file(excel, path)
else:
files = glob.glob(path)
for file_path in files:
process_file(excel, file_path)
excel.Quit()
def process_file(excel, path):
fullpath = os.path.abspath(path)
full_csv_path = os.path.splitext(fullpath)[0] + '.csv'
workbook = excel.Workbooks.Open(fullpath)
workbook.Worksheets(1).SaveAs(full_csv_path, 6)
workbook.Saved = 1
workbook.Close()
def is_full_path(path):
return path.find(":") > -1
if __name__ == '__main__':
main(sys.argv[1])
This is very raw code and won't check for errors, print help or anything, it will just create a csv file for each file that matches the pattern you entered in the function so you can batch process a lot of files only launching excel application once.

As much as I hate to rely on Windows Excel proprietary software, which is not cross-platform, my testing of csvkit for .xls, which uses xlrd under the hood, failed to correctly parse dates (even when using the commandline parameters to specify strptime format).
For example, this xls file, when parsed with csvkit, will convert cell G1 of 12/31/2002 to 37621, whereas when converted to csv via excel -> save_as (using below) cell G1 will be "December 31, 2002".
import re
import os
from win32com.client import Dispatch
xlCSVMSDOS = 24
class CsvConverter(object):
def __init__(self, *, input_dir, output_dir):
self._excel = None
self.input_dir = input_dir
self.output_dir = output_dir
if not os.path.isdir(self.output_dir):
os.makedirs(self.output_dir)
def isSheetEmpty(self, sheet):
# https://archive.is/RuxR7
# WorksheetFunction.CountA(ActiveSheet.UsedRange) = 0 And ActiveSheet.Shapes.Count = 0
return \
(not self._excel.WorksheetFunction.CountA(sheet.UsedRange)) \
and \
(not sheet.Shapes.Count)
def getNonEmptySheets(self, wb, as_name=False):
return [ \
(sheet.Name if as_name else sheet) \
for sheet in wb.Sheets \
if not self.isSheetEmpty(sheet) \
]
def saveWorkbookAsCsv(self, wb, csv_path):
non_empty_sheet_names = self.getNonEmptySheets(wb, as_name=True)
assert (len(non_empty_sheet_names) == 1), \
"Expected exactly 1 sheet but found %i non-empty sheets: '%s'" \
%(
len(non_empty_sheet_names),
"', '".join(name.replace("'", r"\'") for name in non_empty_sheet_names)
)
wb.Worksheets(non_empty_sheet_names[0]).SaveAs(csv_path, xlCSVMSDOS)
wb.Saved = 1
def isXlsFilename(self, filename):
return bool(re.search(r'(?i)\.xls$', filename))
def batchConvertXlsToCsv(self):
xls_names = tuple( filename for filename in next(os.walk(self.input_dir))[2] if self.isXlsFilename(filename) )
self._excel = Dispatch('Excel.Application')
try:
for xls_name in xls_names:
csv_path = os.path.join(self.output_dir, '%s.csv' %os.path.splitext(xls_name)[0])
if not os.path.isfile(csv_path):
workbook = self._excel.Workbooks.Open(os.path.join(self.input_dir, xls_name))
try:
self.saveWorkbookAsCsv(workbook, csv_path)
finally:
workbook.Close()
finally:
if not len(self._excel.Workbooks):
self._excel.Quit()
self._excel = None
if __name__ == '__main__':
self = CsvConverter(
input_dir='C:\\data\\xls\\',
output_dir='C:\\data\\csv\\'
)
self.batchConvertXlsToCsv()
The above will take an input_dir containing .xls and output them to output_dir as .csv -- it will assert that there is exactly 1 non-empty sheet in the .xls; if you need to handle multiple sheets into multiple csv then you'll need to edit saveWorkbookAsCsv.

I was trying to use xlrd library in order to convert the format xlsx into csv, but I was getting error: xlrd.biffh.XLRDError: Excel xlsx file; not supported. That was happening because this package is no longer reading any other format unless xls, according to xlrd documentation.
Following the answer from Chris Withers I was able to change the engine for the function read_excel() from pandas, then I was able to a create a function that is converting any sheet from your Excel spreadsheet you want to successfully.
In order to work the function below, don't forget to install the openpyxl library from here.
Function:
import os
import pathlib
import pandas as pd
# Function to convert excel spreadsheet into csv format
def Excel_to_csv():
# Excel file full path
excel_file = os.path.join(os.path.sep, pathlib.Path(__file__).parent.resolve(), "Excel_Spreadsheet.xlsx")
# Excel sheets
excel_sheets = ['Sheet1', 'Sheet2', 'Sheet3']
for sheet in excel_sheets:
# Create dataframe for each sheet
df = pd.DataFrame(pd.read_excel(excel_file, sheet, index_col=None, engine='openpyxl'))
# Export to csv. i.e: sheet_name.csv
df.to_csv(os.path.join(os.path.sep, pathlib.Path(__file__).parent.resolve(), sheet + '.csv'), sep=",", encoding='utf-8', index=False, header=True)
# Runs the excel_to_csv function:
Excel_to_csv()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Save file name and extension based on another file with .to_csv() - python

Try something like this: import os import pandas as pd from pathlib import Path for file in os.listdir("your dir"): if file.endswith(".xml"): ...make the xml turn df. df.to_csv(Path(file).stem + '.csv', index=False)

Related

SAX Parser in Python

Extracting files from zip file from GET request

Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

Need some assistance on a DBF "File not found" error in Python when looping through a directory?

xls to csv converter

Categories

Resources