Check if an Excel sheet is empty - python

I have an Excel workbook with multiple sheets. I need to delete the sheets that are completely empty, because my processing code fails when it encounters a blank sheet.
import glob
import os
os.chdir(path)
list_file = []
for file in glob.glob("*.xlsx"):
    print(file)
    list_file.append(file)
This lists all the available files.
AB_list=[s for s in list_file if "India" in s]
CD_list=[s for s in list_file if "Japan" in s]
Then I store the file names in lists as required. Now I need to delete the empty sheets from those Excel files before I load them into dataframes. After that, I loop through the files and read each one into its own dataframe.

ws.max_row and ws.max_column should give you the position of the last used cell. Based on that, you can determine whether the sheet is empty. Also check whether ws.calculate_dimension(), which returns the used range, works for you.
All the functions are from openpyxl which you are already familiar with.
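As a minimal sketch of that idea (the helper name sheet_is_empty and the sample sheets are made up for illustration): openpyxl reports max_row and max_column as 1 for an untouched sheet, which is the same as for a sheet whose only content is cell A1, so the value of A1 is checked too.

```python
from openpyxl import Workbook


def sheet_is_empty(ws):
    """Return True when the worksheet holds no values at all (hypothetical helper)."""
    # A blank sheet and a sheet with data only in A1 both report 1 x 1,
    # so the value of A1 has to be inspected as well.
    return ws.max_row == 1 and ws.max_column == 1 and ws["A1"].value is None


wb = Workbook()                      # starts with one blank sheet named "Sheet"
blank = wb.active
filled = wb.create_sheet("Filled")
filled["B3"] = "data"

blank_dimension = blank.calculate_dimension()    # used range of the blank sheet
filled_dimension = filled.calculate_dimension()  # used range of the filled sheet
print(blank_dimension, sheet_is_empty(blank))
print(filled_dimension, sheet_is_empty(filled))
```

A sheet that contains only formatted cells without values may still count as non-empty under this check, so treat it as a starting point rather than a complete test.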

You've tagged openpyxl so I assume you're using that.
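# workbook is an Excel workbook opened with openpyxl.load_workbook
empty_sheets = []
for name in workbook.sheetnames:
    sheet = workbook[name]
    # A blank sheet reports max_row == max_column == 1 and has no value in A1.
    if sheet.max_row == 1 and sheet.max_column == 1 and sheet["A1"].value is None:
        empty_sheets.append(sheet)
# Remove after the scan so the sheet list is not modified while iterating.
for sheet in empty_sheets:
    workbook.remove(sheet)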

You can easily do this with pandas, which I'm using too.
The code looks like this:
import pandas as pd
df = pd.read_csv(filename)  # for a CSV file
# or, for an Excel file:
df = pd.read_excel(filename)
df.empty  # True when the DataFrame has no data
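A possible variation on the pandas approach, sketched under the assumption that skipping empty sheets at load time is acceptable instead of deleting them: passing sheet_name=None makes read_excel return every sheet as a dict, and df.empty can filter that dict. The in-memory workbook and its sheet names are invented for the demo.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Build a small in-memory workbook: one sheet with data, one left blank.
wb = Workbook()
data_ws = wb.active
data_ws.title = "Data"
data_ws.append(["city", "sales"])
data_ws.append(["Pune", 10])
wb.create_sheet("Blank")
buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# sheet_name=None returns a dict mapping each sheet name to its DataFrame.
frames = pd.read_excel(buffer, sheet_name=None)

# Keep only the sheets that actually contain data.
non_empty = {name: df for name, df in frames.items() if not df.empty}
print(list(non_empty))
```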

Related

Removing the Indexed Column when Merging 2 Excel Spreadsheets into a new Sheet in an existing Excel Spreadsheet using Pandas

I want to automate comparing two Excel spreadsheets: updating old data (in a spreadsheet called Old_Data.xlsx) with new data (from a different Excel document called New_Data.xlsx) and placing the updated data into a different sheet in Old_Data.xlsx.
I can successfully create the new sheet in Old_Data.xlsx and see the changes between the two data sets. However, in the new sheet an index appears, labeling the rows of data from 0 to n. I've tried hiding this index so that every sheet in Old_Data.xlsx looks the same, but I cannot get rid of it. See the code below:
from openpyxl import load_workbook
# import xlwings as xl
import pandas as pd
import jinja2
# Load the workbook that is going to be updated with new information.
wb = load_workbook('OldData.xlsx')
# Define the file path for all of the old and new data.
old_path = 'OldData.xlsx'
new_path = 'NewData.xlsx'
# Load the data frames for each Spreadsheet.
df_old = pd.read_excel(old_path)
print(df_old)
df_new = pd.read_excel(new_path)
print(df_new)
# Keep all original information while showing the differences in information and write
# to a new sheet in the workbook.
difference = pd.merge(df_old, df_new, how='right')
difference = difference.style.format.hide()
print(difference)
# Append the difference to an existing Excel File
with pd.ExcelWriter('OldData.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
    difference.to_excel(writer, sheet_name="1-25-2023")
This is an image of the table in the second sheet that I am creating. (https://i.stack.imgur.com/7Amdf.jpg)
I've tried adding the code:
difference = difference.style.format.hide
To get rid of the row, but I have not succeeded.
Pass index=False as an argument in the last line of your code. It should look something like this:
with pd.ExcelWriter('OldData.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
    difference.to_excel(writer, sheet_name="1-25-2023", index=False)
I think this should solve your problem.
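To see the effect in isolation, here is a small round trip through an in-memory buffer. It assumes difference is a plain DataFrame (the style.format.hide line from the question is left out), and the sample columns are invented.

```python
from io import BytesIO

import pandas as pd

# Stand-in for the merged result from the question (made-up data).
difference = pd.DataFrame({"id": [1, 2], "status": ["old", "new"]})

buffer = BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    # index=False keeps the 0..n row labels out of the sheet.
    difference.to_excel(writer, sheet_name="1-25-2023", index=False)

# Read the sheet back to confirm that only the data columns were written.
buffer.seek(0)
round_trip = pd.read_excel(buffer, sheet_name="1-25-2023")
print(list(round_trip.columns))
```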

Python: how to make a loop to copy data from different Excel files into a new one in an iterative way with pandas

I need to copy data from different Excel files into a new one. I would like to just tell the program to take all the files into a specific folder and copy two columns from each of them into a new Excel file. I tried a for loop but it overwrites data coming from different files and I get a new Excel file with just one sheet with data copied from the last file read by the program. Could you help me, please?
Here is my code:
import os.path
import pandas as pd
folder=r'C:\\Users\\PycharmProjects\\excelfile\\'
for fn in os.listdir(folder):
    fx = pd.read_excel(os.path.join(folder, fn), usecols='H,E')
    with pd.ExcelWriter('Output.xlsx') as writer:
        ws = os.path.splitext(fn)[0]
        fx.to_excel(writer, sheet_name=ws)
You should open the output file in append mode like so:
    with pd.ExcelWriter("Output.xlsx", engine='openpyxl', mode='a') as writer:
        ws = os.path.splitext(fn)[0]
        fx.to_excel(writer, sheet_name=ws)
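An alternative sketch, not taken from the answer above: append mode needs the output file to exist already, so another option is to open the writer once, outside the loop, and add one sheet per input file. The function name combine_workbooks, the demo file names, and the column names are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def combine_workbooks(folder, output_path, usecols=None):
    """Copy the first sheet of every .xlsx file in folder into one workbook."""
    # The writer is opened once, so each file adds a sheet instead of
    # replacing the whole output file.
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        for fn in sorted(os.listdir(folder)):
            if not fn.endswith(".xlsx"):
                continue
            fx = pd.read_excel(os.path.join(folder, fn), usecols=usecols)
            sheet = os.path.splitext(fn)[0][:31]  # Excel caps sheet names at 31 characters
            fx.to_excel(writer, sheet_name=sheet, index=False)


# Demo with two throwaway input files (names and columns are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "inputs")
    os.mkdir(src)
    for name in ("first", "second"):
        pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}).to_excel(
            os.path.join(src, name + ".xlsx"), index=False)
    out = os.path.join(tmp, "Output.xlsx")
    combine_workbooks(src, out, usecols=["a", "b"])
    sheet_names = list(pd.read_excel(out, sheet_name=None))
    columns = list(pd.read_excel(out, sheet_name="first").columns)
```

The output file is written outside the input folder here so that it is not picked up as an input on a later run.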

Importing and writing multiple excel sheets with Pandas

I am trying to import Excel files which have multiple sheets. Currently, my code (below) only imports the first sheet. The remainder of the code performs calculations on only one sheet (currently the first, since I moved it there to make it work, but it would be a bonus if I could avoid this step).
Ideally, I would like to import all the sheets, perform calculations on the one sheet, and export all sheets again to an Excel file. The majority of the sheets would be imported and exported with no changes, while the one sheet with a specific, consistent name would have calculations performed on it and also be exported. I'm not sure what functions to look into. Thanks!
df = pd.read_excel("excelfilename.xlsx")
df.head()
# other code present here performing calculations
df.to_excel(r'newfilename.xlsx', index = False)
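Load the Excel file using pandas, get the sheet names (with xlrd for .xls files, or with pd.ExcelFile for .xlsx files, since xlrd 2.0 and later reads only .xls), and then save the modified data back.
import pandas as pd
file_name = "using_excel.xlsx"
# For an .xls file, xlrd.open_workbook(file_name, on_demand=True).sheet_names() returns the same list.
sheet_names_ = pd.ExcelFile(file_name).sheet_names
for sheet_name in sheet_names_:
    df_sheet = pd.read_excel(file_name, sheet_name=sheet_name)
    # do something
    if you_want_to_write_back_to_same_sheet_in_same_file:
        # Append mode keeps the other sheets; the default write mode would replace the whole file.
        with pd.ExcelWriter(file_name, engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
            df_sheet.to_excel(writer, sheet_name=sheet_name)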
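For the workflow described in the question (import every sheet, change one, export them all), a sketch along these lines may be enough. The function name rewrite_workbook, the sheet names, and the sample transform are hypothetical; sheet_name=None is what makes read_excel return all sheets at once.

```python
from io import BytesIO

import pandas as pd


def rewrite_workbook(source, target, sheet_to_change, transform):
    """Copy every sheet from source to target, applying transform to one of them."""
    sheets = pd.read_excel(source, sheet_name=None)  # dict: sheet name -> DataFrame
    with pd.ExcelWriter(target, engine="openpyxl") as writer:
        for name, df in sheets.items():
            if name == sheet_to_change:
                df = transform(df)
            df.to_excel(writer, sheet_name=name, index=False)


# Demo: a two-sheet workbook built in memory (made-up data).
source = BytesIO()
with pd.ExcelWriter(source, engine="openpyxl") as writer:
    pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="Calc", index=False)
    pd.DataFrame({"note": ["keep"]}).to_excel(writer, sheet_name="Other", index=False)
source.seek(0)

target = BytesIO()
rewrite_workbook(source, target, "Calc", lambda df: df.assign(double=df["x"] * 2))
target.seek(0)
result = pd.read_excel(target, sheet_name=None)
print(list(result))
```

Because the sheets pass through DataFrames, only cell values are carried over; formatting and formulas in the untouched sheets are not preserved by this approach.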

Python - what's the most efficient way to read large multi sheet spreadsheets into a pandas dataframe

I have a directory full of large spreadsheets.
My plan is to read each of the sheets into a dataframe, drop what I don't need and remove duplicates, then append to a master dataframe that I will then save as an Excel file.
My current method is like the following:
for workbook in filelist:
    for sheet in workbook:
        df = pd.read_excel(workbook, sheet)
        ## Do table manipulation and append to master df
My problem is that it takes a long time. I'm concerned that every time I loop, it is opening and closing the workbook.
Is there a way I can open the workbook once and then cycle through each sheet, saving it to a dataframe?
Note, the column headers are the same on each sheet.
Apologies for the shorthand code up there, I'm afk.
You can open the workbook once and read sheets from it. I don't know if this is really any faster, but it's worth a try:
import pandas as pd
for filename in filelist:
    workbook = pd.ExcelFile(filename)
    for sheet in workbook.sheet_names:
        df = workbook.parse(sheet)
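Another way to open each workbook only once, sketched here with a hypothetical helper load_all_sheets: read_excel with sheet_name=None parses all sheets in one call, and the resulting frames can be stacked and de-duplicated in a single step. It relies on the sheets sharing the same column headers, as stated in the question; whether it is faster on large files would need measuring.

```python
from io import BytesIO

import pandas as pd


def load_all_sheets(workbooks):
    """Read every sheet of every workbook once and stack them into one frame."""
    frames = []
    for wb in workbooks:
        # sheet_name=None parses the whole workbook in one call.
        frames.extend(pd.read_excel(wb, sheet_name=None).values())
    master = pd.concat(frames, ignore_index=True)
    return master.drop_duplicates().reset_index(drop=True)


# Demo: one in-memory workbook whose two sheets overlap in one row (made-up data).
book = BytesIO()
with pd.ExcelWriter(book, engine="openpyxl") as writer:
    pd.DataFrame({"id": [1, 2]}).to_excel(writer, sheet_name="A", index=False)
    pd.DataFrame({"id": [2, 3]}).to_excel(writer, sheet_name="B", index=False)
book.seek(0)

master = load_all_sheets([book])
print(master["id"].tolist())
```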

pandas read excel values not formulas

Is there a way to have pandas read in only the values from excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the excel file before running the code. I am just working with the basic read excel function of pandas,
import pandas as pd
df = pd.read_excel(filename, sheet_name="Sheet1")
This will read the values if I have gone in and saved the file prior to running the code. But after running the code to update a new sheet, if I don't go in and save the file after doing that and try to run this again, it will read the formulas as NaN instead of just the values. Is there a work around that anyone knows of that will just read values from excel with pandas?
That is strange. The normal behaviour of pandas is to read values, not formulas. The problem is likely in your Excel files: probably your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve it by setting explicit NaN values in read_excel:
pd.read_excel(path, sheet_name="Sheet1", na_values=[your na identifiers])
As for the first case, and as a workaround solution to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl
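def df_from_excel(path):
    # Open and save the file in Excel so the formula results are recalculated and stored.
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path)
df = df_from_excel(path)  # path: the path to your file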
If you want to keep those formulas in your excel file just save the file in a different location (book.save(different location)). Then you can get rid of the temporary files with shutil.
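One plausible reason why opening and saving in Excel helps, shown as a small openpyxl sketch (an illustration under the assumption that the file was last written by code rather than by Excel): a formula written by a library is stored without a calculated result, so a reader that asks for cached values gets nothing until Excel has recalculated and saved the file.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = 3
ws["A3"] = "=A1+A2"   # formula written by code, never evaluated by Excel
buffer = BytesIO()
wb.save(buffer)

# Default load: the formula text itself is returned.
buffer.seek(0)
formula = load_workbook(buffer)["Sheet"]["A3"].value

# data_only=True asks for the cached result, which does not exist yet.
buffer.seek(0)
cached = load_workbook(buffer, data_only=True)["Sheet"]["A3"].value

print(formula, cached)
```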
I had this problem and I resolved it by moving a graph below the first row I was reading. It looks like the position of the graphs may cause problems.
You can use xlrd to read the values.
First, refresh the Excel sheet if you are also updating its values automatically with Python. You can use the function below:
file = "myxl.xls"
import xlrd
import win32com.client
import os
def refresh_file(file):
    # Open the file in Excel, refresh and recalculate it, then save the results.
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file is refreshed, you can start reading the content:
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need and apply conditions while reading it, which gives you a lot more flexibility.
