Overwriting subsection of Excel file with pandas dataframe

I have a large (90-120 MB) Excel file that generates a monthly report. Every month new data comes in that needs to be inserted into columns A:K on one sheet, with formulas in J onward that create some sub-report items. In the past, new data was cleaned manually in Excel; however, I have been writing a Python script that will do this heavy lifting for me. Currently the data is cleaned, but I cannot figure out how to export it to the specific range it needs to go into without overwriting all the other 'data'.
I've tried:
import openpyxl

wb = openpyxl.load_workbook('workbook.xlsx')
ws = wb['sheet']
# Clear columns A:K (1-11) from row 5 down to the last used row
for row in ws.iter_rows(min_row=5, max_row=ws.max_row, min_col=1, max_col=11):
    for cell in row:
        cell.value = None
wb.save('workbook.xlsx')
However, my issue here is that the file cannot be loaded via openpyxl. The command will run for roughly an hour before throwing me a memory error, and I know that this code works on a smaller file.
I've read nearly every thread I could find via Google and nothing has worked yet. I am aware openpyxl has a write_only mode, but as far as I can find in the documentation it allows neither the use of a pre-existing file nor the targeting of specific cells.
I've been able to do similar things via R and STATA which I am more familiar with, but for this specific project Python is mandatory for the automation.
Any help would be appreciated.
Python 3.9.7
Pandas 1.3.5
Openpyxl 3.0.9
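
For the "write into a specific range without touching anything else" part, a minimal sketch is below. It assumes ws is an openpyxl worksheet (only its cell(row=..., column=..., value=...) method is used) and that the cleaned data is available as an iterable of row tuples; write_rows, start_row and start_col are hypothetical names chosen for illustration. It does not address the memory error on load_workbook: the workbook still has to be opened somehow.

```python
def write_rows(ws, rows, start_row=5, start_col=1):
    """Write row tuples into ws, starting at (start_row, start_col).

    Only the cells covered by `rows` are assigned; every other cell on the
    sheet (for example formula columns to the right) is left alone.
    Returns the number of rows written.
    """
    written = 0
    for r_offset, row in enumerate(rows):
        for c_offset, value in enumerate(row):
            # Rows and columns are 1-based, as in openpyxl's ws.cell().
            ws.cell(
                row=start_row + r_offset,
                column=start_col + c_offset,
                value=value,
            )
        written += 1
    return written
```

With pandas, the rows could come from df.itertuples(index=False, name=None), e.g. write_rows(wb['sheet'], df.itertuples(index=False, name=None)), followed by wb.save(...). If a column inside the block holds formulas, that column would have to be dropped from the frame or written in two separate calls.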

Related

Excel sheet concatenated with pandas cannot be read by processing software without manually clearing formats

I am fairly new to programming and wrote a simple Python code to concatenate many Excel sheets into one big one to make processing it faster.
The Excel sheets contain a name and are then processed by a web application to compare the entries with existing data in a database. Entries that are in the Excel file and the database are then displayed. So far, so good.
I have written the following code:
import glob

import openpyxl as xl
import pandas as pd

inputpath = "C:/Users/Me/myinputfolder"
filenames = glob.glob(inputpath + "/*.xlsx")
files = []
# I only need to use the first two columns
fields = [0, 1]
# create new data frame
output = pd.DataFrame(columns=['Name', 'ID(*)', 'Alternative Name', 'Version', 'Nationality', 'State', 'City'])
for file in filenames:
    files.append(pd.read_excel(file, sheet_name="Data", usecols=fields,
                               names=["Name", "ID(*)"], skiprows=[i for i in range(1, 5)]))
for excl_file in files:
    output = pd.concat([output, excl_file], ignore_index=True)
outpath = "C:/Users/Me/outputfolde/Output.xlsx"
output.to_excel(outpath, index=False, sheet_name="Data")
This code runs just fine. It takes all xlsx files from a specified folder and generates a new xlsx with the data I want. The only problem occurs when I upload it to the web application. The sheet uploads just fine and no errors are displayed, but it seems like the application cannot read any data from the sheet. Once I manually clear all formatting in the Output.xlsx file, it works.
I was wondering if there is a way to implement the "clear formatting" in my code as well.
Here is what I added to the end of my previous code:
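wb = xl.load_workbook(outpath)  # reopen the file pandas just wrote
ws = wb.worksheets[0]
# Reset every cell to the built-in 'Normal' style
for row in ws.iter_rows():
    for cell in row:
        cell.style = 'Normal'
wb.save(outpath)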
The result looks fine, just like a file with manually cleared formatting, but it still cannot be read by the application. Does anyone have any idea why this still does not work?
Unfortunately, I have no information on how the web application reads or processes the data, and the person in charge does not reply. I would appreciate any ideas or suggestions to solve my issue.

Write data with Python into existing excel file keeping it intact as much as possible

We have a rather complicated Excel-based VBA tool that is to be replaced, step by step, by a proper database and a Python-based application.
There will be a transition period in which the not yet complete Python tool and the existing VBA solution coexist.
To allow interoperability, the Python tool must be able to export the database values into the Excel VBA tool while keeping it intact. This means that not only must all VBA code work as expected, but shapes, special formats, checkboxes, etc. must also work after the export.
Currently a simple:
from openpyxl import load_workbook
wb = load_workbook(r'Tool.xlsm', keep_vba=True)
# Write some data (this step is not even required to destroy the file)
wb["SomeSheet"]["A1"] = "SomeValue"  # placeholder sheet name and cell
wb.save(r"Tool_filled.xlsm")
will destroy the file, i.e. shapes won't work, and neither will checkboxes. (The resulting file is only 5 MB, down from the original 8 MB, showing that something went quite wrong.)
Is there a way to modify only the data of an Excel sheet, keeping everything else intact and untouched?
As far as I know, an Excel file is just a set of zipped .xml files. So it should be possible to edit only the related sheets, correct?
Is there a more comfortable way than writing everything from scratch to modify only the data of an existing Excel file?
Note: The solution has to work in Linux, so simple remote Excel calls are not an option.
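
The "zipped .xml files" idea can be sketched with the standard library alone. The snippet below is only an illustration under assumptions: list_package_parts and replace_part are hypothetical helper names, and the part names used when calling them (such as xl/worksheets/sheet1.xml) are typical for .xlsx/.xlsm packages but should be checked against the actual file. It copies every part byte for byte except the one being replaced; producing valid sheet XML for the replacement (cell types, shared strings, and so on) is the hard part and is not covered here.

```python
import zipfile


def list_package_parts(source):
    """Return the sorted names of all parts inside an .xlsx/.xlsm package."""
    with zipfile.ZipFile(source) as archive:
        return sorted(archive.namelist())


def replace_part(src, dst, part_name, new_bytes):
    """Copy the package from src to dst, swapping the content of one part.

    Every other part (VBA project, drawings, form controls, ...) is copied
    unchanged, so nothing is re-serialized by a spreadsheet library.
    src and dst may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename == part_name:
                data = new_bytes
            else:
                data = zin.read(item.filename)
            # Reusing the ZipInfo keeps the part's name and compression type.
            zout.writestr(item, data)
```

Calling list_package_parts("Tool.xlsm") first shows which sheet parts exist; replace_part("Tool.xlsm", "Tool_filled.xlsm", part_name, new_xml_bytes) would then write the modified copy. This runs on Linux since it needs no Excel installation.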

Can a pycel object be saved as an Excel Workbook

After parsing Excel file to Python and evaluating the workbook using pycel, can the pycel object be saved as an Excel file maintaining all original formatting, etc? I.e. only values need to be updated.

TL;DR
No, you cannot save a pycel object back into Excel.
Why not?
The basic problem is that pycel is based on openpyxl. Openpyxl is used to read (and write if needed) Excel spreadsheets. However, while openpyxl has the computed values available for formula cells for a workbook it read in, it does not really allow those computed values to be saved back into a workbook it writes. It doesn't really make sense to save a different computed value for a formula cell, since the cell's value will be recomputed once it is opened back up in Excel.
While it is true that pycel has the information available to properly populate a new value when the workbook is written, it evidently is not a use case that was important to the openpyxl authors or contributors.
Please note that the openpyxl maintainers gladly took pull requests to make it run better with pycel. It seems likely they would be open to discussing a PR for writing values into workbooks.

How to programmatically copy excel worksheet sheet to a new one

I have created a Python script that parses some data and uses it to generate a few Excel workbook files (xlsx). (I am not very knowledgeable about Excel, by the way.)
The original workbook I am reading the data from has a main sheet used to fill in lots of info, which is then distributed across many other sheets that use formulas to calculate some results.
My Python script does the processing I want by automatically filling in this main sheet, which then produces the results in the other sheets. I then save the workbook and everything looks fine.
Now, I want to split those sheets into individual workbooks without including the main sheet in any of them. This appears to be pretty challenging.
I first tried using the data_only argument of load_workbook, but I quickly discovered that in order to get the values rather than the formulas (or just None) back, I'd have to manually open and save each one of the files in Excel so that the cached values are created. That won't really do in my case, since I want the whole thing to be automated.
Among other things, the one that came closer is this piece of code:
import pandas as pd
import xlwings as xw

workbook = xw.Book('generatedFiles/generated.xlsx')
# Read the computed values (not the formulas) of the used range
sheet1 = workbook.sheets['PRIM'].used_range.value
df = pd.DataFrame(sheet1)
df.to_excel('generatedFiles/fixed-generated.xlsx', index=False, header=False)
This does indeed generate a spreadsheet with the values, via a pandas dataframe, but the problem is that it doesn't preserve any information about the types of the values. So, for example, an integer is treated as a string when saved.
This breaks the processing done by an external parser that I feed these files to.
Any ideas on how to fix that? What would be the best way to go about doing something like that?
Thanks!

OpenPyXl removing formulas on load

I'm trying to use OpenPyXL to:
1. Open an .xlsx file
2. Read a cell that I know contains a number
3. Write a different number to that cell
4. Save the result to the same or a differently named .xlsx file
However even if I only perform the first and last steps the resulting .xlsx file has all formulas removed. The simplest version of my code goes like:
from openpyxl import load_workbook
wb = load_workbook(filename=file_path, data_only=False, guess_types=False)
wb.save(file_path_new)
However even without changing anything I still lose all the formulas. I have tried different values for the options. My biggest problem is that only yesterday, the full code (including reading and writing a numerical cell) was working and the saved result had the new number in that cell (when viewed in excel).
I updated from 1.8.5 to 2.0.2 at some point but can't remember if this was before or after the original code worked.
