pandas - read_excel efficiency on multiple large sheets - python

I have an Excel workbook with multiple sheets. Some contain lots of data (e.g. 6,000,000 cells), and some do not. I'm attempting to read one of the significantly smaller sheets, a simple 2-column, 500-row sheet, using the following line of code:
df = pd.read_excel('C:/Data.xlsx', sheetname='Contracts')
However, this read takes an incredibly long time, whereas reading the same sheet from a standalone Excel file does not. Is there a reason for this?

I looked at the API documentation to see how the function processes the file but didn't come up with anything big. A few things of note:
1) Assuming you are using pandas 0.21.0 onwards, you want to use sheet_name instead of sheetname.
2) According to https://realpython.com/working-with-large-excel-files-in-pandas/, the speed of pandas processing directly correlates with your system RAM.
3) The read_excel function opens the entire Excel file and then selects the specific sheet, so you end up loading those very long sheets as well. You can test this by saving the short sheet as a separate Excel file and then running read_excel on the new file.
Hope this helps
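One way to experiment with point 3 is to bypass read_excel and pull the single sheet through openpyxl's read-only mode, which reads rows lazily. The following is only a sketch: the helper name read_single_sheet is made up for illustration, it assumes the first row of the sheet holds the column names, and whether it is actually faster depends on the workbook.

```python
import pandas as pd
from openpyxl import load_workbook


def read_single_sheet(path, sheet_name):
    """Read one worksheet into a DataFrame via openpyxl's read-only mode.

    Hypothetical helper, not part of pandas or openpyxl.
    """
    # read_only=True reads rows lazily instead of materialising every cell
    # of every sheet; data_only=True returns cached values, not formulas.
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        rows = ws.iter_rows(values_only=True)
        header = next(rows)  # assumption: first row contains column names
        return pd.DataFrame(list(rows), columns=list(header))
    finally:
        wb.close()  # read-only workbooks keep the file handle open
```

Usage would look like df = read_single_sheet('C:/Data.xlsx', 'Contracts'); timing it against pd.read_excel on the same file shows whether the large sheets are the bottleneck.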

Related

Excel sheet concatenated with pandas cannot be read by processing software without manually clearing formats

I am fairly new to programming and wrote a simple Python script to concatenate many Excel sheets into one big one to make processing them faster.
The Excel sheets contain a name and are then processed by a web application to compare the entries with existing data in a database. Entries that are in the Excel file and the database are then displayed. So far, so good.
I have written the following code:
import openpyxl as xl
import glob
import pandas as pd
inputpath="C:/Users/Me/myinputfolder"
filenames=glob.glob(inputpath+"/*.xlsx")
files=[]
#I only need to use the first two columns
fields=[0,1]
# create new data frame
output=pd.DataFrame(columns=['Name', 'ID(*)','Alternative Name','Version', 'Nationality', 'State', 'City'])
for file in filenames:
    files.append(pd.read_excel(file, sheet_name="Data", usecols=fields,
                               names=["Name", "ID(*)"], skiprows=[i for i in range(1,5)]))
for excl_file in files:
    output=pd.concat([output,excl_file],ignore_index=True)
outpath="C:/Users/Me/outputfolde/Output.xlsx"
output.to_excel(outpath, index=False, sheet_name="Data")
This code runs just fine. It takes all xlsx files from a specified folder and generates a new xlsx with the data I want. The only problem occurs when I upload it to the web application. The sheet uploads just fine and no errors are displayed, but it seems like the application cannot read any data from the sheet. Once I manually clear all formatting in the Output.xlsx file, it works.
I was wondering if there is a way to implement the "clear formatting" in my code as well.
Here is what I added to the end of my previous code:
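# reopen the file written by to_excel (assumed; the snippet as posted
# does not show how wb was created; xl is the openpyxl alias imported above)
wb = xl.load_workbook(outpath)
ws = wb.worksheets[0]
for row in ws.iter_rows():
    for cell in row:
        # reset every cell to the built-in 'Normal' style
        cell.style='Normal'
wb.save(outpath)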
The result looks fine, just like a file with manually cleared formatting, but it still cannot be read by the application. Does anyone have any idea why this still does not work?
Unfortunately, I do not have any information on how the web application reads/processes the data, and the person in charge does not reply. I would appreciate any ideas/suggestions to solve my issue.
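One thing that could be tried here is to avoid styled cells in the first place: DataFrame.to_excel writes the header row with bold text and borders, whereas appending rows directly through openpyxl leaves every cell with the default style. The sketch below assumes that this header styling is what the application trips over, which is not confirmed anywhere in the question; write_plain_xlsx is a made-up helper name.

```python
import pandas as pd
from openpyxl import Workbook


def write_plain_xlsx(df, path, sheet_name="Data"):
    """Write a DataFrame to xlsx with no cell styling (hypothetical helper)."""
    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    # Header row as plain strings, without the bold/border style pandas adds
    ws.append([str(col) for col in df.columns])
    # Data rows; missing values (NaN/None) become empty cells
    for row in df.itertuples(index=False, name=None):
        ws.append([None if pd.isna(value) else value for value in row])
    wb.save(path)
```

In the script above, this would stand in for the output.to_excel(outpath, index=False, sheet_name="Data") call, for example write_plain_xlsx(output, outpath).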

Load multiple Excel files with different structures into named sheets of a single Excel file using Python

I need some help with Azure Data Factory.
I have multiple Excel files on blob storage. They have different numbers of columns and different structures, and I need to load them all into one Excel file: not into one sheet, but into multiple sheets. I am looking for a way to solve this. It does not matter whether I run a Python script to do that or use some other way.
If somebody knows how, I would appreciate the help.
Thank you.
Your question is a little light on details. However, if I understand what you're trying to do correctly, you should be able to copy the worksheets from the source workbooks and merge them into a target workbook using Pandas. It would look something like this:
import pandas as pd
#this targets a single worksheet but you could
#iterate over a list of sheets in a workbook
df = pd.read_excel(source_workbook, sheet_name="my worksheet")
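#push to target workbook
#to add several worksheets in a loop, pass a pd.ExcelWriter instead of a path;
#calling to_excel on the same path repeatedly overwrites the file each time
df.to_excel(target_workbook, sheet_name="new worksheet")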
That is a very simple example but something similar to that should work for merging worksheets into a new workbook.
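As a slightly fuller sketch of the same idea, the loop can write through a single pd.ExcelWriter so that every source sheet ends up as its own sheet in the target workbook. The function name merge_workbooks and the shape of the sources mapping are assumptions made for illustration, and the files are assumed to be available locally (downloading them from blob storage is not shown).

```python
import pandas as pd


def merge_workbooks(sources, target_path):
    """Copy sheets from several workbooks into one target workbook.

    sources: dict mapping the new sheet name to a
             (source workbook path, source sheet name) tuple.
    Hypothetical helper; only cell values are copied, not formatting.
    """
    # One writer for the whole loop, so sheets accumulate in one file
    with pd.ExcelWriter(target_path, engine="openpyxl") as writer:
        for new_name, (src_path, src_sheet) in sources.items():
            df = pd.read_excel(src_path, sheet_name=src_sheet)
            # Excel limits sheet names to 31 characters
            df.to_excel(writer, sheet_name=new_name[:31], index=False)
```

Because each sheet is read and written independently, the source files can have different columns and structures.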

Overwriting subsection of Excel file with pandas dataframe

I have a large (90-120 MB) Excel file that generates a monthly report I do. Every month new data comes in that needs to be inserted into columns A:K on one sheet, with formulas from column J onward that create some sub-report items. In the past, new data was cleaned manually in Excel; however, I have been writing a Python script that will do this heavy lifting for me. Currently the data is cleaned, but I cannot figure out how to export it to the specific range I need it to go into without overwriting all other 'data'.
I've tried:
import openpyxl

wb = openpyxl.load_workbook('workbook.xlsx')
ws = wb['sheet']
# clear columns A:K from row 5 down to the last used row
for row in ws.iter_rows(min_row=5, max_row=ws.max_row, min_col=1, max_col=11):
    for cell in row:
        cell.value = None
wb.save('workbook.xlsx')
However, my issue here is that the file cannot be loaded via openpyxl. The command runs for roughly an hour before throwing a memory error, and I know that this code works on a smaller file.
I've read nearly every thread I could find via Google and nothing has worked yet. I am aware openpyxl has a write_only mode, but as far as I can find in the documentation it allows neither the use of a pre-existing file nor the targeting of specific cells.
I've been able to do similar things in R and Stata, which I am more familiar with, but for this specific project Python is mandatory for the automation.
Any help would be appreciated.
Python 3.9.7
Pandas 1.3.5
Openpyxl 3.0.9

How to programmatically copy an Excel worksheet to a new one

I have created a Python script that parses some data and uses it to generate a few Excel (xlsx) workbooks (I am not very knowledgeable about Excel, by the way).
The original workbook I am reading the data from has a main sheet used to fill in lots of info, which is then distributed across many other sheets that use formulas to calculate some results.
My Python script does the processing I want by automatically filling in this main sheet, which then produces the results in the other sheets. I then save the workbook and everything looks fine.
Now, I want to split those worksheets into individual files without including the main sheet in any of them. This appears to be pretty challenging.
I first tried using the data_only argument of load_workbook, but I quickly discovered that in order to get the values rather than the formulas from the spreadsheet (instead of just getting None back), I'd have to manually open and save each one of the files so that a temporary cache is created. That won't work in my case since I want the whole thing to be automated.
Of the things I tried, the one that came closest is this piece of code:
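import pandas as pd
import xlwings as xw

# open the generated workbook through Excel and read the evaluated values
workbook = xw.Book('generatedFiles/generated.xlsx')
sheet1 = workbook.sheets['PRIM'].used_range.value
df = pd.DataFrame(sheet1)
df.to_excel('generatedFiles/fixed-generated.xlsx', index=False, header=False)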
This does indeed generate a spreadsheet with the values, using a pandas DataFrame, but the problem is that it doesn't preserve any information about the types of the values. For example, an integer is treated as a string when saved.
This breaks the processing done by an external parser that I feed these files to.
Any ideas on how to fix that? What would be the best way to go about doing something like that?
Thanks!

How do I write to one sheet in an already existing excel sheet in Python?

I have an Excel file that has four sheets. One sheet, sheet 4, contains data in simple CSV form, and the others read the data from this sheet and make different calculations and graphs. In my Python application I would like to open the Excel file, open sheet 4, and replace the data. I know you technically can't open and edit Excel files however you like with Python, due to the complex file structure of XLS (previous relevant answer), but is there a workaround for this specific case? Remember, the only thing I want to do is to open the data sheet, write to it, and ignore the others...
Note: Previous answers to relevant questions have suggested using the copy function in xlutils. But that doesn't work in this case, as the rest of the sheets are rather complex. The graphs, for example, can't be preserved with the copy function.
I used to use pyExcelerator. It certainly did a good job, but I'm not sure if it is still maintained.
https://pypi.python.org/pypi/pyExcelerator/
hth.
