I have created a Python script that parses some data and uses it to generate a few Excel worksheet files (.xlsx). (I am not very knowledgeable in Excel, by the way.)
The original workbook I am reading the data from has a main sheet used to fill in lots of info, which is then distributed across many other sheets that use formulas to calculate results.
My Python script does the processing I want by automatically filling in this main sheet, which then produces the results in the other sheets. I then save the workbook and everything looks fine.
Now, I want to split those sheets into individual files without including the main sheet in any of them. This turns out to be pretty challenging.
I first tried using the data_only argument of load_workbook, but I quickly discovered that in order to get the values rather than the formulas in the spreadsheet (instead of just getting None back), I'd have to manually open and save each of the files in Excel so that a cached result is stored. That won't really do in my case, since I want the whole thing to be automated.
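To illustrate what I mean, a minimal reproduction (demo.xlsx is just a throwaway file name): openpyxl stores only the formula text, never a computed result, so data_only=True returns None for a formula cell unless Excel itself has saved the file.

```python
from openpyxl import Workbook, load_workbook

# Build a small file containing a formula; openpyxl writes the formula
# text only and never computes a cached value (only Excel does, on save)
wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = "=A1*2"
wb.save("demo.xlsx")

# With data_only=True a formula cell yields the last cached value,
# which is None here because Excel never opened and saved this file
wb2 = load_workbook("demo.xlsx", data_only=True)
print(wb2.active["A2"].value)  # None
```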
Among other things I tried, the one that came closest is this piece of code:
import xlwings as xw
import pandas as pd

workbook = xw.Book('generatedFiles/generated.xlsx')
sheet1 = workbook.sheets['PRIM'].used_range.value
df = pd.DataFrame(sheet1)
df.to_excel('generatedFiles/fixed-generated.xlsx', index=False, header=False)
This does indeed generate a spreadsheet with the values, via a pandas DataFrame, but the problem is that it doesn't preserve any information about the types of the values. So, for example, an integer is treated as a string when saved.
This messes up the processing done by an external parser that I feed these files to.
Any ideas on how to fix that? What would be the best way to go about something like this?
Thanks!
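Edit: one workaround I'm experimenting with, assuming the round trip through the DataFrame is what turns the numbers into strings, is to coerce each column back to a numeric dtype where possible before saving. This is only a sketch; the column names are made up:

```python
import pandas as pd

def coerce_numeric(df):
    """Convert columns to numeric dtypes where every value allows it."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave genuinely non-numeric columns untouched
    return out

raw = pd.DataFrame({"id": ["1", "2"], "name": ["a", "b"], "price": ["2.5", "3.0"]})
# 'id' becomes int64, 'price' becomes float64, 'name' stays object
fixed = coerce_numeric(raw)
```

With numeric dtypes restored, to_excel writes the cells as numbers rather than text.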
I have a large (90–120 MB) Excel file that generates a monthly report I do. Every month, new data comes in that needs to be inserted into columns A:K on one sheet, with formulas in column J that create some sub-report items. In the past, the new data was cleaned manually in Excel; however, I have been writing a Python script that will do this heavy lifting for me. Currently the data is cleaned, but I cannot figure out how to export it into the specific range it needs to go into without overwriting all the other data.
I've tried:
import openpyxl

wb = openpyxl.load_workbook('workbook.xlsx')
ws = wb['sheet']
for row in ws.iter_rows(min_row=5, max_row=ws.max_row, min_col=1, max_col=11):
    for cell in row:
        cell.value = None
wb.save('workbook.xlsx')
However, my issue here is that the file cannot be loaded via openpyxl: the command runs for roughly an hour before throwing a memory error, and I know that this code works on a smaller file.
I've read nearly every thread I could find via Google and nothing has worked yet. I am aware that openpyxl has a write_only mode, but as far as I can tell from the documentation, it allows neither the use of a pre-existing file nor the targeting of specific cells.
I've been able to do similar things via R and Stata, which I am more familiar with, but for this specific project Python is mandatory for the automation.
Any help would be appreciated.
Python 3.9.7
Pandas 1.3.5
Openpyxl 3.0.9
We have a rather complicated Excel-based VBA tool that is to be replaced, step by step, by a proper database and Python-based application.
There will be a transition period during which the not-yet-complete Python tool and the already existing VBA solution will coexist.
To allow interoperability, the Python tool must be able to export the database values into the Excel VBA tool while keeping it intact, meaning that not only does all the VBA code have to work as expected, but shapes, special formats, checkboxes, etc. also have to work after the export.
Currently, a simple:
from openpyxl import load_workbook

wb = load_workbook(r'Tool.xlsm', keep_vba=True)
# Write some data, e.g. (not even required to break the file):
wb["SomeSheet"]["SomeCell"] = "SomeValue"
wb.save(r"Tool_filled.xlsm")
will corrupt the file: shapes won't work, and neither will checkboxes. (The resulting file is only 5 MB, down from the original 8 MB, showing that something went quite wrong.)
Is there a way to modify only the data of an Excel sheet, keeping everything else intact/untouched?
As far as I know, an Excel workbook is just a zip archive of .xml files, so it should be possible to edit only the relevant sheet parts. Correct?
Is there a more comfortable way than writing everything from scratch to modify only the data of an existing Excel file?
Note: The solution has to work on Linux, so simple remote Excel calls are not an option.
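Building on the zipped-XML observation: one low-level workaround is to copy the archive member by member and swap in a rewritten part, leaving the VBA project, shapes, styles, etc. byte-for-byte identical. A minimal sketch, with the caveat that the part name (e.g. xl/worksheets/sheet1.xml) and the replacement XML payload are illustrative, and generating valid sheet XML is the hard part this sketch does not solve:

```python
import zipfile

def replace_part(src, dst, part_name, new_bytes):
    """Copy an xlsx/xlsm archive, replacing a single internal part.

    Every other part (vbaProject.bin, drawings, styles, ...) is copied
    unchanged, so nothing else in the file is touched.
    """
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = new_bytes if item.filename == part_name else zin.read(item.filename)
            zout.writestr(item, data)
```

Hypothetical call: replace_part('Tool.xlsm', 'Tool_filled.xlsm', 'xl/worksheets/sheet1.xml', xml_bytes).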
After parsing an Excel file into Python and evaluating the workbook using pycel, can the pycel object be saved back as an Excel file, maintaining all the original formatting, etc.? I.e. only the values need to be updated.
TL;DR
No, you cannot save a pycel object back into Excel.
Why not?
The basic problem is that pycel is based on openpyxl. Openpyxl is used to read (and write if needed) Excel spreadsheets. However, while openpyxl has the computed values available for formula cells for a workbook it read in, it does not really allow those computed values to be saved back into a workbook it writes. It doesn't really make sense to save a different computed value for a formula cell, since the cell's value will be recomputed once it is opened back up in Excel.
While it is true that pycel has the information available to properly populate a new value when the workbook is written, it evidently is not a use case that was important to the openpyxl authors or contributors.
Please note that the openpyxl maintainers gladly took pull requests to make it run better with pycel. It seems likely they would be open to discussing a PR for writing values into workbooks.
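That said, when the file already carries cached results (i.e. it was last saved by Excel itself), a common openpyxl-level workaround, not a pycel feature, is to load the workbook twice and overwrite each formula with its cached value. The demo file below is built in-code, so its cache is empty and the formula cell comes back as None, which also demonstrates the limitation:

```python
from openpyxl import Workbook, load_workbook

# Stand-in file; in practice this would be a workbook last saved by Excel,
# so that cached formula results are present
wb = Workbook()
wb.active["A1"] = 2
wb.active["A2"] = "=A1*2"
wb.save("book.xlsx")

wb_formulas = load_workbook("book.xlsx")                # formulas as text
wb_values = load_workbook("book.xlsx", data_only=True)  # cached results

for name in wb_formulas.sheetnames:
    ws_f, ws_v = wb_formulas[name], wb_values[name]
    for row in ws_f.iter_rows():
        for cell in row:
            if cell.data_type == "f":  # formula cell
                cell.value = ws_v[cell.coordinate].value
wb_formulas.save("book_values.xlsx")
```

The saved copy keeps styles and layout (openpyxl rewrites them) but, as discussed above, anything openpyxl cannot represent is still lost.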
I have an Excel workbook with multiple sheets. Some contain lots of data (e.g. 6,000,000 cells) and some do not. I'm attempting to read one of the significantly smaller sheets, a simple 2-column, 500-row sheet, using the following line of code:
df = pd.read_excel('C:/Data.xlsx', sheetname='Contracts')
However, this read takes an incredibly long time, whereas opening the sheet standalone in Excel does not. Is there a reason for this?
I tried looking at the API to understand how the function processes the file, but didn't come up with anything big. A few things of note:
1) Assuming you are using 0.21.0 onwards, you want to use sheet_name instead of sheetname.
2) According to https://realpython.com/working-with-large-excel-files-in-pandas/, the speed of pandas processing correlates directly with your system RAM.
3) The read_excel function opens the entire Excel file and then selects the specific sheet, so you end up loading those super-long sheets as well. You can test this by moving the short sheet into a separate Excel file and running read_excel on the new file.
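A related experiment, if splitting the file is inconvenient: stream just the one sheet with openpyxl's read-only mode and build the frame yourself. Whether this is actually faster depends on the file (the shared-strings table still has to load), and the file/sheet names below are small stand-ins for the ones in the question:

```python
import pandas as pd
from openpyxl import Workbook, load_workbook

# Small stand-in for C:/Data.xlsx with a "Contracts" sheet
wb = Workbook()
ws = wb.active
ws.title = "Contracts"
ws.append(["contract", "value"])
ws.append(["A-1", 100])
wb.save("Data.xlsx")

# read_only=True streams rows lazily instead of building the full cell tree
wb_ro = load_workbook("Data.xlsx", read_only=True)
rows = wb_ro["Contracts"].values  # generator of row tuples
header = next(rows)
df = pd.DataFrame(rows, columns=header)
wb_ro.close()
```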
Hope this helps
I have an Excel file with four sheets. One sheet, sheet 4, contains data in simple CSV form, and the others read the data of this sheet and produce various calculations and graphs. In my Python application I would like to open the Excel file, open sheet 4, and replace the data. I know you technically can't open and edit Excel files however you like with Python, due to the complex file structure of XLS (previous relevant answer), but is there a workaround for this specific case? Remember, the only thing I want to do is open the data sheet, write to it, and ignore the others...
Note: Previous answers to related questions have suggested using the copy function in xlutils, but that doesn't work in this case, as the rest of the sheets are rather complex. The graphs, for example, can't be preserved with the copy function.
I used to use pyExcelerator. It certainly did a good job, but I'm not sure whether it is still maintained.
https://pypi.python.org/pypi/pyExcelerator/
hth.