Converting .xls to .csv before recombining multiple files into a .xls - python

I am working on a webscraper tool which downloads Excel files from a website. Of course, those .xls files are actually just renamed .csv files, which prevents me from simply combining the .xls files together. Instead, I need to convert them all to .csv, then use pyexcel's pyexcel.merge_csv_to_a_book(filelist, outfilename='merged.xls') function to create an Excel book from these .csv files.
Here is what I tried:
def concatenate_excel_files():
    indexer = 0
    excel_file_list = []
    for file in glob.glob(os.getcwd()+'\Reports\*.'):
        pyexcel.save_as(file_name=file, dest_file_name=str(indexer)+'.csv')
        excel_file_list[indexer] = file
        indexer += 1
    pyexcel.merge_csv_to_a_book(excel_file_list, outfilename='merged.xls')
This fails to even convert the files to .csv (IndexError: list index out of range error.)
Any help rewriting this would be appreciated.

Answer by chfw:
For pyexcel to work properly, it needs to know the file extension, but in your case the file extension is missing. It would also be more helpful if the full stack trace were shown.
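Separately from the extension issue, the IndexError in the question comes from assigning to excel_file_list[indexer] while the list is still empty; list.append avoids that. Below is a minimal sketch of the collection step using only the standard library. It assumes the downloads really are CSV text, and the helper name copy_reports_as_csv and its two folder arguments are hypothetical:

```python
import glob
import os
import shutil


def copy_reports_as_csv(src_dir, dst_dir):
    """Copy every file in src_dir into dst_dir as 0.csv, 1.csv, ... and return the new paths."""
    os.makedirs(dst_dir, exist_ok=True)
    csv_paths = []
    # sorted() keeps the numbering stable between runs
    for src in sorted(glob.glob(os.path.join(src_dir, "*"))):
        if not os.path.isfile(src):
            continue  # skip sub-directories
        dst = os.path.join(dst_dir, f"{len(csv_paths)}.csv")
        shutil.copyfile(src, dst)
        # append grows the list; assigning to an index of an empty list raises IndexError
        csv_paths.append(dst)
    return csv_paths
```

The returned list could then be passed to pyexcel.merge_csv_to_a_book(csv_paths, outfilename='merged.xls') as in the question.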

Related

Trying to read a directory of .xlsm files in pandas

I (a noob) am currently trying to read a directory of .xlsm files into a pandas dataframe, with the intention of merging them all together into one big file. I've done similar tasks in the past with .csv files and had no problems, but this has me at a loss.
I'm currently running this:
import pandas as pd
import glob
import openpyxl
df = [pd.read_excel(filename,engine="openpyxl") for filename in glob.glob(r'\\data\Designer\BI_Development\BI_2022_Objective\BIDataLake\MTT\Automation\TimeTrackingSheets_Automation\TimeTrackingSheets_Automation\TM_TimeTrackingSheets\*.xlsm')]
This solution has worked for me in the past. But here, when I run the above code, I get the following error:
zipfile.BadZipFile: File is not a zip file
Which is confusing me, because the file that I'm trying to access is not a zip file. Granted, there is a zip file with that same name in the same directory, but when I rename the file I'm referencing in my program to distinguish it from the zip file, I get the same error.
Anyone have any ideas? I've lurked for a long time and this is my first question, so apologies if it's not formatted in the proper way. Happy to provide more information as necessary. Thank you in advance!
UPDATE
This was fixed by excluding hidden files in the script, something I was unaware was happening.
import os
path = r'\\data\Designer\BI_Development\BI_2022_Objective\BIDataLake\MTT\Automation\TimeTrackingSheets_Automation\TimeTrackingSheets_Automation\TM_TimeTrackingSheets'
# read all the files with extension .xlsm, skipping names that start with "~"
filenames = glob.glob(path + r"\[!~]*.xlsm")
# print('File names:', filenames)
# empty data frame for the new output excel file with the merged excel files
outputxlsx = pd.DataFrame()
# for loop to iterate all excel files
for file in filenames:
    # read_excel() with a list of sheet names returns a dict of data frames;
    # concat flattens that dict into a single frame
    df = pd.concat(pd.read_excel(file, ["BW_TimeSheet"]), ignore_index=True, sort=False)
    df['Username'] = os.path.basename(file)
    # appending data of excel files (DataFrame.append was removed in pandas 2.0, so use concat)
    outputxlsx = pd.concat([outputxlsx, df], ignore_index=True)
print('Final Excel sheet now generated at the same location:')
outputxlsx.to_excel(path + "/Output.xlsx", index=False)
Thanks everyone for your help!
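For context, Excel writes a temporary lock file whose name starts with ~$ next to any workbook that is currently open, and such a file is not a zip archive, which would match the BadZipFile error above. A hedged sketch of a pre-filter follows; the helper name readable_workbooks is hypothetical, it uses only the standard library, and it leaves the pandas call out:

```python
import glob
import os
import zipfile


def readable_workbooks(folder, pattern="*.xlsm"):
    """Return workbook paths in folder, skipping lock files and anything that is not a zip archive."""
    keep = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        name = os.path.basename(path)
        if name.startswith("~$"):
            continue  # lock file left by an open workbook
        if not zipfile.is_zipfile(path):
            continue  # .xlsx/.xlsm files are zip archives; anything else cannot be one
        keep.append(path)
    return keep
```

Each returned path could then be handed to pd.read_excel as in the update above.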
Please remove the encryption from the file.
engine="openpyxl"
This engine does not support reading encrypted files.
I refer to this issue.
The problem is related to Excel and openpyxl. The best workaround is to try reading and writing CSV instead.

Read .xls file with Python pandas read_excel not working, says it is a .xlsb file

I'm trying to read several .xls files, saved on a NAS folder, with Apache Airflow, using the read_excel python pandas function.
This is the code I'm using:
df = pd.read_excel('folder/sub_folder_1/sub_folder_2/file_name.xls', sheet_name=April, usecols=[0,1,2,3], dtype=str, engine='xlrd')
This worked for a time, but recently I have been getting this error for several of those files:
Excel 2007 xlsb file; not supported
[...]
xlrd.biffh.XLRDError: Excel 2007 xlsb file; not supported
These files are clearly .xls files, yet my code seems to detect them as .xlsb files, which are not supported. I would prefer a way to specify that they are .xls files, or alternatively, a way to read .xlsb files.
Not sure if this is relevant, but these files are updated by an external team, who may have modified some parameter of these files without me knowing so, but I think that if this was the case, I would be getting a different error.
Try:
import openpyxl
xls = pd.ExcelFile('data.xls', engine='openpyxl')
df = pd.read_excel(xls)
xlrd has removed the ability to read some Excel formats, such as .xlsx; since version 2.0 it reads only the legacy .xls format.
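The error message suggests the files are binary .xlsb workbooks saved under an .xls name, so one option is to look at a file's contents before choosing an engine. The following is a rough sketch under stated assumptions: guess_excel_engine is a hypothetical helper, the signature bytes are those of an OLE2 compound file, and the returned strings are pandas engine names whose backing packages (xlrd, openpyxl, pyxlsb) would have to be installed separately. Encrypted workbooks are also OLE2 containers, so the guess can be wrong for those.

```python
import zipfile

# first eight bytes of an OLE2 compound file, the container used by legacy .xls
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def guess_excel_engine(path):
    """Guess a pandas read_excel engine from the file's contents, not its extension."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head == OLE2_MAGIC:
        return "xlrd"  # legacy binary .xls
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
        if "xl/workbook.bin" in names:
            return "pyxlsb"  # binary workbook (.xlsb)
        if "xl/workbook.xml" in names:
            return "openpyxl"  # .xlsx or .xlsm
    return None  # unknown format, e.g. plain CSV text
```

A call such as pd.read_excel(path, engine=guess_excel_engine(path)) would then pick the reader per file.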

Is there a faster way to edit an xlsm file other than openpyxl?

import openpyxl
file = 'excel.xlsm'
wb = openpyxl.load_workbook(filename=file, read_only=False, keep_vba=True)
sheet = wb['Template']
rowx = ['x', 'y', 'z']
rows = sheet.max_row
# write rowx into the first empty row, one value per column
for j, value in enumerate(rowx):
    sheet.cell(row=rows+1, column=j+1).value = value
wb.save(file)
I have an xlsm file and I have tried to edit it using openpyxl. But there is a problem. When I try to edit an xlsm which has 4 or 5 templates, the size of the file is 4-5 MB, which takes a lot of time to load when using openpyxl. Is there any way that I can modify my current file faster without having to create a new one?
An .xlsx or .xlsm file is basically a bunch of XML files wrapped up in a zip file.
In my scripts repo I have an example (unlock-excel.py) where I use the zipfile module to open and modify an excel file. (In this case removing the <sheetProtect> and <workbookProtect> elements from the pages)
What I learned about the excel file format for creating this program is documented in this article on my website. The highlights:
An xlsx (or xlsm) file is basically a zip-file with a standard directory structure and a gaggle of XML files. When unpacking these files, I generally found the following directories.
> mkdir foo
> cd foo
> unzip ../foo.xlsx
> find . -type d|sort
.
./_rels
./docProps
./xl
./xl/_rels
./xl/printerSettings
./xl/theme
./xl/worksheets
./xl/worksheets/_rels
The _rels directories don't matter for this purpose.
The docProps directory contains two files; app.xml and core.xml. The app.xml file basically contained a list of the titles as seen on the tabs on the bottom of the worksheets. The titles are listed in this file in the sequence they appear in the xlsx file from left to right. They are bracketed between <vt:lpstr> and </vt:lpstr> tags.
The workbook.xml file in the xl directory contains a number of sheet definitions. These link the name of the worksheet to several numbers. Each sheet has a single tag with attributes, like this.
<sheet name="template" sheetId="4" r:id="rId1"/>
In the subdirectory xl/worksheets there is a number of XML files named sheetN.xml, where N is a number. These are the actual worksheets. One might expect that N corresponds with the sheetId. But that turns out not to be the case. The sheet number N actually is the number in the r:id attribute after the rId text. So in the example above, the worksheet named template is xl/worksheets/sheet1.xml.
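The name-to-file lookup described above can be sketched with the standard library. Instead of relying on the number inside the r:id value, this sketch resolves each r:id through xl/_rels/workbook.xml.rels, the part of the package that records the target of each relationship. The function name sheet_name_to_part is hypothetical, and the code is a minimal illustration, not the method used in unlock-excel.py:

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_name_to_part(xlsx_path):
    """Map each worksheet name to the XML file that holds it inside the archive."""
    with zipfile.ZipFile(xlsx_path) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
    # relationship id (e.g. "rId1") -> target path
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    mapping = {}
    for sheet in workbook.iter(f"{{{MAIN_NS}}}sheet"):
        target = targets.get(sheet.get(f"{{{REL_NS}}}id"))
        if target is None:
            continue
        # targets are either absolute ("/xl/...") or relative to the xl directory
        part = target[1:] if target.startswith("/") else "xl/" + target
        mapping[sheet.get("name")] = part
    return mapping
```

The XML for a given sheet can then be read from the same archive with ZipFile.read(part).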

Zipfile namelist() missing members from archive

I'm currently trying to open an .xlsx file with zipfile on Python, finding all files with namelist(), then using .count() to find all images in .png format within the archive.
My problem is that the list returned by namelist() contains only 1,680 elements.
After saving the xlsx file as an html, I am able to view all images contained in the excel spreadsheet and the total file count is 3,352 files.
I checked documentation for zipfile and exhausted the best Google searches I could muster. I appreciate any hints or advice!
Here's the snippet of code I'm using:
import zipfile as zf
xlsx = 'myfile.xlsx'
xlsx_file = zf.ZipFile(xlsx)
fileList = xlsx_file.namelist()
Maybe convert it to a wheel file? Wheel works well for me.
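To see what the archive in the question actually contains, the entries returned by namelist() can be tallied by file extension. This is a small sketch with a hypothetical helper name, count_members_by_suffix; it only counts archive entries and does not by itself explain the difference from the HTML export:

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_suffix(archive):
    """Count the entries of a zip archive per lower-cased file extension."""
    with zipfile.ZipFile(archive) as zf:
        # zip member names always use forward slashes, hence PurePosixPath
        return Counter(PurePosixPath(name).suffix.lower() for name in zf.namelist())
```

For the question's file, count_members_by_suffix('myfile.xlsx')['.png'] would give the number of .png entries regardless of upper or lower case in the extension.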

Python Iterating program

I am a beginner programmer who is working on my first project. I am trying to create a script that unzips two files and extracts a folder that contains .csv files to a temp directory. I am hoping to import and format those .csv with Xlsx lib. My code is able to do the first part, it unzips all the .csv's perfectly.
I need some pointers on how to iterate over all the .csv's on the temp folder and copy the data of each .csv to an excel spreadsheet. I must note that the .csv files have only one row with 5 columns of data. Here is what I have:
for zfiles in glob.glob("*.zip"):
    with zipfile.ZipFile(zfiles, 'r') as myS:
        myS.extractall(tempDir)
os.chdir(tempDir)
for z in glob.glob("*.zip"):
    with zipfile.ZipFile(z, 'r') as mySecondS:
        mySecondS.extractall()
Thank you!
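One way to approach the remaining step is to gather the rows from every extracted .csv first and write them to the spreadsheet afterwards. A minimal sketch using only the standard library follows; the helper name collect_csv_rows is hypothetical, and the spreadsheet-writing part is left out because it depends on the Excel library in use:

```python
import csv
import glob
import os


def collect_csv_rows(folder):
    """Read every .csv in folder and return (filename, row) pairs in filename order."""
    collected = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # newline="" is what the csv module expects for files it reads
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                if row:  # skip blank lines
                    collected.append((os.path.basename(path), row))
    return collected
```

Each pair holds one source file name and its list of column values, so the five values per file could be written to one worksheet row each.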
