Trying to read a directory of .xlsm files in pandas - python

I (a noob) am currently trying to read a directory of .xlsm files into a pandas dataframe, with the intention of merging them all together into one big file. I've done similar tasks in the past with .csv files and had no problems, but this has me at a loss.
I'm currently running this:
import pandas as pd
import glob
import openpyxl
df = [pd.read_excel(filename,engine="openpyxl") for filename in glob.glob(r'\\data\Designer\BI_Development\BI_2022_Objective\BIDataLake\MTT\Automation\TimeTrackingSheets_Automation\TimeTrackingSheets_Automation\TM_TimeTrackingSheets\*.xlsm')]
This solution has worked for me in the past. But here, when I run the above code, I get the following error:
zipfile.BadZipFile: File is not a zip file
Which is confusing me, because the file that I'm trying to access is not a zip file. Granted, there is a zip file with that same name in the same directory, but when I rename the file I'm referencing in my program to distinguish it from the zip file, I get the same error.
Anyone have any ideas? I've lurked for a long time and this is my first question, so apologies if it's not formatted in the proper way. Happy to provide more information as necessary. Thank you in advance!
UPDATE
This was fixed by excluding hidden files in the script, something I was unaware were even present.
import os

path = r'\\data\Designer\BI_Development\BI_2022_Objective\BIDataLake\MTT\Automation\TimeTrackingSheets_Automation\TimeTrackingSheets_Automation\TM_TimeTrackingSheets'
# read all the files with extension .xlsm, skipping Excel's hidden ~$ lock files
filenames = glob.glob(path + r"\[!~]*.xlsm")
# print('File names:', filenames)
# collect one dataframe per workbook, then concatenate once at the end
frames = []
# for loop to iterate over all the Excel files
for file in filenames:
    # read the "BW_TimeSheet" sheet of each workbook
    df = pd.read_excel(file, sheet_name="BW_TimeSheet")
    df['Username'] = os.path.basename(file)
    frames.append(df)
# merged output of all the Excel files
outputxlsx = pd.concat(frames, ignore_index=True, sort=False)
outputxlsx.to_excel(path + "/Output.xlsx", index=False)
print('Final Excel sheet now generated at the same location.')
Thanks everyone for your help!

Please remove the encryption from the file.
engine="openpyxl"
openpyxl does not support reading encrypted files.

I referred to this issue. The problem is related to Excel and openpyxl; the best workaround is reading and writing CSV instead.

Related

Merge 200 + Excel Files into 1

I have a folder with 200+ Excel files. I have the respective path and sheet name for each file in the folder. Is it possible to merge all of these files into one or a few large Excel files via Python? If so, which libraries would be good to start reading up on for this type of script?
I am trying to condense the files into 1-8 Excel files in total, not 200+.
Thank you!
For example, suppose there are a.xlsx, b.xlsx, c.xlsx.
Using os (import os) and the endswith method, you can collect all the xlsx files. (You will easily find how to do it.)
Then read each xlsx file with pandas in a loop and write it into a new ExcelWriter like below
e.g.
import os
import pandas as pd

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('goal.xlsx', engine='xlsxwriter')
xlsx_files = [f for f in os.listdir('.') if f.endswith('.xlsx')]
for sheet_number, file in enumerate(xlsx_files, start=1):
    df = pd.read_excel(file)
    # Write each dataframe to a different worksheet.
    df.to_excel(writer, sheet_name='Sheet{}'.format(sheet_number))
writer.save()
Go to the directory where all the CSV files are located (e.g. cd C:\Temp).
Click in the folder's address bar and type "cmd"; this will open a command prompt at the file location.
Type "copy *.csv combine.csv"
(Replace "csv" with the type of file you have; however, this will probably work best with csv files.)
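Note that the copy trick concatenates the files verbatim, headers included. If the repeated header rows are a problem, a small Python sketch can skip them; the helper name and behavior here are my own, assuming simple CSVs that all share one header row:

```python
import glob
import os

def combine_csvs(folder, out_name="combine.csv"):
    """Concatenate every .csv in `folder`, keeping only the first header row."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    lines_out = []
    for i, path in enumerate(paths):
        with open(path) as f:
            lines = f.read().splitlines()
        # skip the header row on every file after the first
        lines_out.extend(lines if i == 0 else lines[1:])
    out_path = os.path.join(folder, out_name)
    with open(out_path, "w") as f:
        f.write("\n".join(lines_out) + "\n")
    return out_path
```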

PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

I am not sure why I am getting this error although sometimes my code works fine!
Excel file format cannot be determined, you must specify an engine manually.
Here below is my code with steps:
1- List of possible customer-ID column names:
customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]
2- The code to find all xlsx files in a folder and read them:
l = []  # use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"]  # to have only one column once concatenated
    l.append(df)
all_data = pd.concat(l, ignore_index=True)  # concat all data
I added the engine openpyxl
df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)
Now I got a different error:
BadZipFile: File is not a zip file
pandas version: 1.3.0
python version: python3.9
os: MacOS
Is there a better way to read all the xlsx files from a folder?
Found it. When an Excel file is open (for example in MS Excel), a hidden temporary file is created in the same directory:
~$datasheet.xlsx
So when I ran the code to read all the files in the folder, it gave me the error:
Excel file format cannot be determined, you must specify an engine manually.
When all files are closed and there are no hidden temporary ~$filename.xlsx files in the directory, the code works perfectly.
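As an alternative to closing Excel or deleting the lock files by hand, they can be excluded at glob time, since their names always start with ~$. A quick sketch (the file names below are made up for the demo):

```python
import glob
import os
import tempfile

# create a sample folder containing a real workbook name and an Excel lock file
folder = tempfile.mkdtemp()
for name in ("report.xlsx", "~$report.xlsx"):
    open(os.path.join(folder, name), "w").close()

# the [!~] character class skips any name beginning with '~'
visible = glob.glob(os.path.join(folder, "[!~]*.xlsx"))
print([os.path.basename(p) for p in visible])  # ['report.xlsx']
```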
Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by file type (note that the file object is passed directly, not file.read()):
if file_extension == 'xlsx':
    df = pd.read_excel(file, engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file)
elif file_extension == 'csv':
    df = pd.read_csv(file)
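The snippet above assumes file_extension is already available; one way to derive it is via pathlib (a sketch of my own, not from the answer; the function name and the string mapping are illustrative):

```python
from pathlib import Path

def reader_name_for(path):
    # map a (case-insensitive) file suffix to the matching pandas reader
    suffix = Path(path).suffix.lower()
    return {".xlsx": "read_excel", ".xls": "read_excel", ".csv": "read_csv"}.get(suffix)

print(reader_name_for("sales.XLSX"))  # read_excel
print(reader_name_for("notes.txt"))   # None
```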
https://stackoverflow.com/a/32241271/17411729
Link to an answer on how to remove hidden files.
Mac: go to the folder and press cmd + shift + . to show the hidden file, delete it, then run the code again.
In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the invisible file (which is not an xlsx, and thus would trigger the error):
for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)
You can filter out the unwanted temp files by checking whether the file name starts with "~".
import os
folder_path = "."  # the directory containing the workbooks
for file in os.listdir(folder_path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)
I also got an 'Excel file format...' error when I manually changed the 'CSV' suffix to 'XLS'. All I had to do was open the file in Excel and save it in the format I wanted.
Looks like an easy fix for this one. Go to your Excel file, whether it is xls, xlsx, or any other extension, and do "Save As" from the File menu. When prompted with options, save it as CSV UTF-8 (Comma delimited) (*.csv).
In my case, I used xlrd. So in the terminal:
pip install xlrd
If pandas is not installed, install it:
pip install pandas
Now read the excel file this way:
import pandas as pd
df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')

Reading a csv file using pandas read_csv in a for loop

I am using Macbook with MAC OS X catalina and the latest anaconda installation.
I have a list of files I want read from a folder containing many files. The list of files is contained in an Excel sheet in the following format.
This file is called list.xlsx
The directory and sub directories of csv files are located in a folder as follows "/Users/XXX/Documents/test/data"
There are many other files in the directory which I don't want to use, therefore I want to cycle through this list.xlsx of files I have.
When I do df = pd.read_csv("/Users/XXX/Documents/test/data/A/ABCS.csv"), the file is read perfectly fine. This is the first file in my list.
However, when I load the file this way, so I can perform a for loop,
filelist = pd.read_excel("/Users/XXX/Documents/test/list.xlsx")
df = pd.read_csv(f"/Users/XXX/Documents/test/data/{filelist.File[0]}")
I get a 'FileNotFoundError: [Errno 2] File /Users/XXX/Documents/test/data/A/ABCS.csv does not exist: /Users/XXX/Documents/test/data/A/ABCS.csv'
even though it shows the exact location I used above. Why is this happening, and how do I fix it? It seems that when I load the file names using pandas, they can't be read properly.
This would be a workaround:
filelist = pd.read_excel("/Users/XXX/Documents/test/list.xlsx")
DF = []
for i in range(len(filelist)):
    file = str(filelist.File[i])
    df = pd.read_csv(file, index_col=None, header=0)
    DF.append(df)
# combine all files
DF = pd.concat(DF, axis=0, ignore_index=True)
Serge Ballesta said I should not blindly trust the printed string. I ran the command print([(i, hex(ord(i))) for i in filelist.File[0]]) as he suggested and found a run of trailing spaces after the file names, which was breaking the read function.
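The same diagnostic works for any suspicious string, and repr() is often the quickest check. A small sketch (the sample name below is illustrative, not taken from the actual spreadsheet):

```python
# a filename that prints normally can still carry invisible padding
name = "ABCS.csv   "
print([(c, hex(ord(c))) for c in name[-3:]])  # three (' ', '0x20') pairs
print(repr(name))          # quotes make the trailing spaces visible
print(repr(name.strip()))  # 'ABCS.csv'
```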

how to get the name of an unknown .XLS file into a variable in Python 3.7

I'm using Python 3.7.
I have to download an excel file (.xls) that has a unique filename every time I download it into a specific downloads folder location.
Then with Python and Pandas, I then have to open the excel file and read/convert it to a dataframe.
I want to automate the process, but I'm having trouble telling Python to get the full name of the XLS file as a variable, which will then be used by pandas:
# add dependencies and set location for downloads folder
import os
import glob
import pandas as pd
download_dir = '/Users/Aaron/Downloads/'
# change working directory to download directory
os.chdir(download_dir)
# get filename of excel file to read into pandas
excel_files = glob.glob('*.xls')
blah = str(excel_files)
blah
So then for example, the output for "blah" is:
"['63676532355861.xls']"
I have also tried "blah = print(excel_files)" for the above block, instead of the str() call, and assigning that to a variable, which still doesn't work.
And then the rest of the process would do the following:
# open excel (XLS) file with unknown filename in pandas as a dataframe
data_df = pd.read_excel('WHATEVER.xls', sheet_name=None)
And then after I convert it to a data frame, I want to DELETE the excel file.
So far, I have spent a lot of time reading about fnames, io, open, os.path, and other libraries.
I still don't know how to get the name of the unknown .XLS file into a variable, and then later deleting that file.
Any suggestions would be greatly appreciated.
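One note on the attempt above: str(excel_files) stringifies the whole list, which is why blah prints with brackets and quotes. Indexing the list gives a bare, usable filename. A minimal demonstration (reusing the sample output from the question):

```python
excel_files = ['63676532355861.xls']  # what glob.glob('*.xls') returned

blah = str(excel_files)
print(blah)           # ['63676532355861.xls'] -- the list rendered as text

first = excel_files[0]
print(first)          # 63676532355861.xls -- a filename pandas can open
```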
This code finds the .xls files in your specified path, reads each one, and deletes it. If your directory contains more than one .xls file, the dataframe ends up holding the last one read. You can perform whatever operation you want if you find more than one .xls file.
import os
import pandas as pd

for filename in os.listdir(os.getcwd()):
    if filename.endswith(".xls"):
        print(filename)
        # do your operation
        data_df = pd.read_excel(filename, sheet_name=None)
        os.remove(filename)
Check this:
lst = os.listdir()
matching = [s for s in lst if '.xls' in s]
matching will hold the list of all Excel files.
Since you have only one Excel file, you can save it in a variable like file_name = matching[0]
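Since the download produces a brand-new file each time, another option is to sort candidates by modification time and take the newest entry. A sketch with synthetic timestamps (set via os.utime so the demo is deterministic; the folder and names are made up):

```python
import glob
import os
import tempfile

folder = tempfile.mkdtemp()
for name, stamp in (("old.xls", 1_000_000_000), ("new.xls", 1_100_000_000)):
    path = os.path.join(folder, name)
    open(path, "w").close()
    os.utime(path, (stamp, stamp))  # force distinct modification times

# the most recently modified .xls is the one just downloaded
newest = max(glob.glob(os.path.join(folder, "*.xls")), key=os.path.getmtime)
print(os.path.basename(newest))  # new.xls
```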

Converting .xls to .csv before recombining multiple files into a .xls

I am working on a webscraper tool which downloads excel files from a website. Of course, those .xls files are actually just renamed .csv files, which prevents me from just combining the .xls files together. Instead, I need to convert them all to .csv, them use pyexcel's pyexcel.merge_csv_to_a_book(filelist, outfilename='merged.xls') function to create a excel book from these .csv files.
Here is what I tried:
def concatenate_excel_files():
    indexer = 0
    excel_file_list = []
    for file in glob.glob(os.getcwd()+'\Reports\*.'):
        pyexcel.save_as(file_name=file, dest_file_name=str(indexer)+'.csv')
        excel_file_list[indexer] = file
        indexer += 1
    pyexcel.merge_csv_to_a_book(excel_file_list, outfilename='merged.xls')
This fails to even convert the files to .csv (IndexError: list index out of range).
Any help rewriting this would be appreciated.
Answer by chfw:
For pyexcel to work properly, it needs to know the file extension, but in your case the file extension is missing (the glob pattern ends in '*.'). It would also be more helpful if the full stack trace were shown.
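Separately, the IndexError in the question comes from assigning into an empty list by position; list.append grows the list instead. A minimal demonstration:

```python
excel_file_list = []
try:
    excel_file_list[0] = "report.csv"   # fails: the empty list has no index 0
except IndexError as exc:
    print("IndexError:", exc)

excel_file_list.append("report.csv")    # grows the list instead
print(excel_file_list)  # ['report.csv']
```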
