Python script to parse a big workbook

Python script to parse a big workbook - python

I have an extra sized excel file and I need to automate a task I do everyday: Add rows to the bottom with the day's date, save a new workbook, crop the old ones and save as a new file with the day's date.
An example is today only having rows with date 04-10-2016 and the filename would be [sheetname]04102016H12 or [sheetname]04102016H16if it has passed 12 pm.
I've tried xldr, doing this in VBA and so on but I can't get along with VBA and it is slow. So I'd rather use Python here - lightweight, does the job and so on.
Anyway, so far, I have done the follwing:
import xlsxwriter, datetime, xlrd
import pandas as pd
# Parsing main excel sheet to save the correct
with xlrd.open_workbook(r'D:/path/to/file/file.xlsx', on_demand=True) as xls:
for sheet in xls.parse(xls.sheet_names([0])):
dfs = pd.read_excel(xls, sheet, header = 1)
now = datetime.date.today()
df[df['Data'] != now]
if datetime.time()<datetime.time(11,0,0,0):
df.to_excel(r'W:\path\I\need'+str(sheet)+now+'H12.xlsx', index=False)
else:
df.to_excel(r'W:\path\I\need'+str(sheet)+now+'H16.xlsx', index=False)
Unfortunately, this does not separate the main file into as many files as worksheets the workbook contains. It outputs TypeError: 'list' object is not callable, regarding this in xls.parse(xls.sheet_names([0])).

Based on comments below I am updating my answer. Just do:
xls.sheet_names()[0]
However, if you want to loop through the sheets, then you may want all sheet names instead of just the first one.

Related

Removing the Indexed Column when Merging 2 Excel Spreadsheets into a new Sheet in an existing Excel Spreadsheet using Pandas

I wanted to automate comparing two excel spreadsheets and updating old data (call this spreadsheet Old_Data.xlsx) with new data (from a different excel document; called New_Data.xlsx) and placing the updated data into a different sheet on on Old_Data.xlsx.
I am able to successfully create the new sheet in Old_Data.xlsx and see the changes between the two data sets, however, in the new sheet an index appears labeling the rows of data from 0-n. I've tried hiding this index so the information on each sheet in Old_Data.xlsx appears the same, however, I cannot successfully seem to get rid of the addition of the index. See the code below:
from openpyxl import load_workbook
# import xlwings as xl
import pandas as pd
import jinja2
# Load the workbook that is going to updated with new information.
wb = load_workbook('OldData.xlsx')
# Define the file path for all of the old and new data.
old_path = 'OldData.xlsx'
new_path = 'NewData.xlsx'
# Load the data frames for each Spreadsheet.
df_old = pd.read_excel(old_path)
print(df_old)
df_new = pd.read_excel(new_path)
print(df_new)
# Keep all original information why showing the differences in information and write
# to a new sheet in the workbook.
difference = pd.merge(df_old, df_new, how='right')
difference = difference.style.format.hide()
print(difference)
# Append the difference to an existing Excel File
with pd.ExcelWriter('OldData.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
difference.to_excel(writer, sheet_name="1-25-2023")
This is an image of the table of the second sheet that I creating. (https://i.stack.imgur.com/7Amdf.jpg)
I've tried adding the code:
difference = difference.style.format.hide
To get rid of the row, but I have not succeeded.

pass index = False as an argument in last line of you code. It should be something like this :-
with pd.ExcelWriter('OldData.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
difference.to_excel(writer, sheet_name="1-25-2023", index = False)
I think this should solve your problem.

Importing and writing multiple excel sheets with Panda

I am trying to import excel files which have multiple sheets. Currently, my code (below) is only importing the first sheet. The remainder of the code is preforming calculations from only one sheet (currently the first since I moved it there to make it work-but bonus if I can avoid this step).
Ideally, I would like to import all the sheets, preform calculations on the one sheet, and export all sheets again in an excel file. A majority of the sheets would be import/export with no changes, while the one sheet with a specific/consistent name would have calculations preformed on it and also exported. Not sure what functions to look into. Thanks!
df = pd.read_excel("excelfilename.xlsx")
df.head()
#other code present here preforming calculations
df.to_excel(r'newfilename.xlsx', index = False)

Load Excel file using pandas, then get sheet names using xlrd, and then save modified data back.
import xlrd
file_name = "using_excel.xlsx"
sheet_names_ = xlrd.open_workbook(file_name, on_demand=True).sheet_names()
for sheet_name in sheet_names_:
df_sheet = pd.read_excel(file_name, sheet_name=sheet_name)
# do something
if you_want_to_write_back_to_same_sheet_in_same_file:
writer = pd.ExcelWriter(file_name)
df_sheet.to_excel(writer, sheet_name=sheet_name)
writer.save()

Python - whats the most efficient way to read large multi sheet spreadsheets into a pandas dataframe

I have a directory full of large spreadsheets.
My plan is to read each of the sheets into a dataframe, drop what I dont need and remove duplicates, then append to a master dataframe that I will then save as an excell file.
My current method like the following...
for workbook in filelist:
For sheet in workbook:
Df = pd.read_excell(workbook, sheet)
## Do table manipulation and append to master df
My problem is it takes a long time, I'm concerned that everytime I loop it is opening and closing the workbook.
Is there a way I can open the workbook and then cycle through each sheet saving it to a dataframe?
Note, the column headers are the same on each sheet.
Apologise for the shorthand code up there,I'm afk.

You can open the workbook once and read sheets from it. I don't know if this is really any faster, but worth a try
import pandas as pd
for filename in filelist:
workbook = pd.ExcelFile()
for sheet in workbook.sheet_names:
df = workbook.parse(sheet)

Python transfer excel formatting between two Excel documents

I'd like to copy the formatting between two Excel sheets in python.
Here is the situation:
I have a script that effectively "alters" (ie overwrites) an excel file by opening it using pd.ExcelWriter, then updates values in the rows. Finally, file is overwritten using ExcelWriter.
The Excel file is printed/shared/read by humans between updates done by the code. Humans will do things like change number formatting, turn on/off word wrap, and alter column widths.
My goal is the code updates should only alter the content of the file, not the formatting of the columns.
Is there a way I can read/store/write the sheet format within python so the output file has the same column formatting as the input file?
Here's the basic idea of what I am doing right now:
df_in= pd.read_excel("myfile.xlsx")
# Here is where I'd like to read in format of the first sheet of this file
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
# Here is where I'd like to apply the format I read earlier to the sheet
xlwriter.save()
Note: I have played with xlsxwriter.set_column and add_format. As far as I can tell, these don't help me read the format from the current file

Pandas uses xlrd package for parsing Excel documents to DataFrames.
Interoperability between other xlsx packages and xlrd could be problematic when it comes to the data structure used to represent formatting information.
I suggest using openpyxl as your engine when instantiating pandas.ExcelWriter. It comes with reader and writer classes that are interoperable.
import pandas as pd
from openpyxl.styles.stylesheet import apply_stylesheet
from openpyxl.reader.excel import ExcelReader
xlreader = ExcelReader('myfile.xlsx', read_only=True)
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='openpyxl')
df_in = pd.read_excel("myfile.xlsx")
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
apply_stylesheet(xlreader.archive, xlwriter.book)
xlwriter.save()

Using Python to load template excel file, insert a DataFrame to specific lines and save as a new file

I'm having troubles writing something that I believe should be relatively easy.
I have a template excel file, that has some visualizations on it with a few spreadsheets. I want to write a scripts that loads the template, inserts an existing dataframe rows to specific cells on each sheet, and saves the new excel file as a new file.
The template already have all the cells designed and the visualization, so i will want to insert this data only without changing the design.
I tried several packages and none of them seemed to work for me.
Thanks for your help! :-)

I have written a package for inserting Pandas DataFrames to Excel sheets (specific rows/cells/columns), it's called pyxcelframe:
https://pypi.org/project/pyxcelframe/
It has very simple and short documentation, and the method you need is insert_frame
So, let's say we have a Pandas DataFrame called df which we have to insert in the Excel file ("MyWorkbook") sheet named "MySheet" from the cell B5, we can just use insert_frame function as follows:
from pyxcelframe import insert_frame
from openpyxl import load_workbook
workbook = load_workbook("MyWorkbook.xlsx")
worksheet = workbook["MySheet"]
insert_frame(worksheet=worksheet,
dataframe=df,
row_range=(5, 0),
col_range=(2, 0))
0 as the value of the second element of row_range or col_range means that there is no ending row or column specified, if you need specific ending row/column you can replace 0 with it.

Sounds like a job for xlwings. You didn't post any test data, but modyfing below to suit your needs should be quite straight-forward.
import xlwings as xw
wb = xw.Book('your_excel_template.xlsx')
wb.sheets['Sheet1'].range('A1').value = df[your_selected_rows]
wb.save('new_file.xlsx')
wb.close()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python script to parse a big workbook - python

Based on comments below I am updating my answer. Just do: xls.sheet_names()[0] However, if you want to loop through the sheets, then you may want all sheet names instead of just the first one.

Related

Removing the Indexed Column when Merging 2 Excel Spreadsheets into a new Sheet in an existing Excel Spreadsheet using Pandas

Importing and writing multiple excel sheets with Panda

Python - whats the most efficient way to read large multi sheet spreadsheets into a pandas dataframe

Python transfer excel formatting between two Excel documents

Using Python to load template excel file, insert a DataFrame to specific lines and save as a new file

Categories

Resources