I am writing some automated scripts to process Excel files in Python, some of which are in XLS format. Here's a code snippet of my attempt to do so with Pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.
In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.
Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
Or can someone help me take the sheet from xlrd and convert it into a
Pandas dataframe?
pd.read_excel can take a book...
import pandas as pd
import xlrd
# Open the workbook with xlrd, then hand the Book object to pandas.
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
   some   column  headers
0     1     some      foo
1     2  strings      bar
2     3     here      yes
3     4      too       no
I'll include the code below that may help if you want to check/handle Excel file types. Maybe you can adapt it for your needs.
The code loops through a local folder and shows each file and its extension, then uses python-magic to inspect the actual content. It also has a column showing the guess from mimetypes, but that is less reliable because it only looks at the file name. In the resulting frame, some .xls files are not what the extension says, and a .txt file is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        # DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        # DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue
dfs = []
for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
       (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)
print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4
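If the files arrive as bytes (for example from S3, as in the question) rather than as paths on disk, a similar check can be done on the first few bytes without python-magic. This is only a rough sketch: sniff_container and the two constants are names made up for this example, and the signatures identify the container (OLE2 or ZIP), not Excel specifically, so a .doc or .docx file would match too.

```python
# Leading bytes that identify the two container formats (not Excel-specific).
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound file, e.g. legacy .xls
ZIP_SIGNATURE = b"PK\x03\x04"                          # zip archive, e.g. .xlsx/.xlsm/.xlsb


def sniff_container(contents: bytes) -> str:
    """Return 'ole2', 'zip', or 'unknown' based on the leading bytes only."""
    if contents.startswith(OLE2_SIGNATURE):
        return "ole2"
    if contents.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"


# Made-up byte strings standing in for real file contents.
print(sniff_container(OLE2_SIGNATURE + b"\x00" * 16))
print(sniff_container(b"PK\x03\x04rest-of-archive"))
print(sniff_container(b"some,column,headers\n1,2,3"))
```

A result of 'unknown' for something named .xls would suggest the file is really CSV or HTML with the wrong extension, which is one common reason for the "not a recognized excel file" error.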
I am trying to convert an xlsb file to xlsx using Python, but I have not been able to figure out the problem after several unsuccessful attempts.
Code:
import pandas as pd
import os
import glob
source='C:\\Users\\JS Developer\\sample.xlsb'
dest= 'C:\\Users\\JS Developer\\Desktop\\New folder'
os.chdir(source)
for file in glob.glob("*.xlb"):
    df.to_csv(dest+file+'.csv', index=False)
    os.remove(file)
for file in glob.glob("*.xlsb"):
    df = pd.read_excel(file)
    df.to_csv(dest+file+'.csv', index=False)
    os.remove(file)
Once you have read the Excel file and stored it in a pandas DataFrame, save it with:
df.to_excel(r'Path\name.xlsx')
Try:
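for file in glob.glob("*.xlsb"):
    # Reading .xlsb requires the pyxlsb package (supported by pandas 1.0+).
    df = pd.read_excel(file, engine='pyxlsb')
    # Use os.path.join so the output lands inside dest, and drop the .xlsb suffix.
    out_name = os.path.splitext(file)[0] + '.xlsx'
    df.to_excel(os.path.join(dest, out_name), index=False, header=True)
    os.remove(file)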
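One more thing to watch: string concatenation such as dest+file+'.csv' in the question glues the folder name and the file name together with no path separator, so the output does not end up inside the destination folder. Here is a small sketch of building the target path instead. xlsx_target is a hypothetical helper name, and PureWindowsPath is used only so the example behaves the same on any OS.

```python
from pathlib import PureWindowsPath


def xlsx_target(dest_dir: str, source_file: str) -> str:
    """Build '<dest_dir>\\<stem>.xlsx' from a source file name."""
    src = PureWindowsPath(source_file)
    # src.stem drops the original extension, e.g. 'sample.xlsb' -> 'sample'
    return str(PureWindowsPath(dest_dir) / (src.stem + ".xlsx"))


dest = 'C:\\Users\\JS Developer\\Desktop\\New folder'

# Plain concatenation: no separator, and the old extension is kept.
naive = dest + 'sample.xlsb' + '.xlsx'
print(naive)

# Joined path: the file lands inside the destination folder.
print(xlsx_target(dest, 'sample.xlsb'))
```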
For some reason, when I attempt to read an HDF file from S3 using the pandas.read_hdf() method, I get a FileNotFoundError when I pass an S3 URL. The file definitely exists, and I have tried using the pandas.read_csv() method with a CSV file in the same S3 directory and that works. Is there something else I need to be doing? Here's the code:
import boto3
import h5py
import s3fs
import pandas as pd
csvDataframe = pd.read_csv('s3://BUCKET_NAME/FILE_NAME.csv')
print("Csv data:")
print(csvDataframe)
dataframe = pd.read_hdf('s3://BUCKET_NAME/FILE_NAME.h5', key='df')
print("Hdf data:")
print(dataframe)
Here is the error:
FileNotFoundError: File s3://BUCKET_NAME/FILE_NAME.h5 does not exist
In the actual code, BUCKET_NAME and FILE_NAME are replaced with their actual strings.
Please make sure the file extension is .h5.
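If the extension is already correct, the likely cause is that read_hdf does not handle S3 URLs the way read_csv does. As far as I know it goes through PyTables, which wants a real local file. One workaround is to copy the object to a local temporary file first and read that. The sketch below shows only the copy-then-read pattern: read_via_tempfile is a made-up helper, and an in-memory buffer stands in for the S3 object so it runs without a bucket.

```python
import io
import os
import shutil
import tempfile


def read_via_tempfile(remote_file, reader, suffix=".h5"):
    """Copy a binary file-like object to a local temp file and call reader(path)."""
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            shutil.copyfileobj(remote_file, tmp)
        return reader(tmp.name)
    finally:
        # Clean up the local copy whether or not the reader succeeded.
        os.remove(tmp.name)


# Stand-ins for demonstration: an in-memory "remote" object and a reader
# that only reports the size of the local copy.
fake_remote = io.BytesIO(b"not a real hdf5 file")
size = read_via_tempfile(fake_remote, os.path.getsize)
print(size)
```

With s3fs and pandas this might be called as read_via_tempfile(s3fs.S3FileSystem().open('BUCKET_NAME/FILE_NAME.h5', 'rb'), lambda p: pd.read_hdf(p, key='df')), though that call is untested here.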
I want to do something like this:
import openpyxl as x
wb = x.load_workbook(filename)
# do some edit to the workbook
wb.save(filename)
The file specified by filename is open in Excel. Excel is locking the file, so I get a permission denied error when running the above code. Is there a way to edit and save it?
from openpyxl import load_workbook
ifile = 'Whales.xlsx'
wb = load_workbook(filename=ifile)
# do some edit to the workbook
wb.save(ifile)
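While Excel has the workbook open, saving to the same path will most likely keep failing with PermissionError, so either close the file in Excel first or write the edited copy under a different name. A minimal sketch of the second option follows. save_with_fallback and the "_edited" tag are made up for this example, and fake_save stands in for wb.save so the snippet runs without openpyxl.

```python
import os


def save_with_fallback(save, path, tag="_edited"):
    """Call save(path); if the file is locked, save to a sibling name instead.

    Returns the path that was actually written.
    """
    try:
        save(path)
        return path
    except PermissionError:
        root, ext = os.path.splitext(path)
        fallback = root + tag + ext
        save(fallback)
        return fallback


# Stand-in for wb.save that pretends 'Whales.xlsx' is locked by Excel.
def fake_save(path):
    if path == "Whales.xlsx":
        raise PermissionError(path)


written = save_with_fallback(fake_save, "Whales.xlsx")
print(written)
```

With a real workbook the call would be save_with_fallback(wb.save, ifile).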
Is it possible to copy a macro-enabled Excel workbook? For example, I have:
from xlutils.copy import copy
from xlrd import open_workbook
from tempfile import TemporaryFile
book = open_workbook("book.xlsm")
book_copy = copy(book)
book_copy.save("bookcopy.xlsm")
book_copy.save(TemporaryFile())
However, when I then click on bookcopy.xlsm to open it, I get the following error:
"Excel cannot open the file 'bookcopy.xlsm' because the file format or the file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file."
I don't get this error when I save it as "bookcopy.xls", but I need it to be .xlsm. Does anyone have an idea what the problem is?