I have a big problem here with Python, openpyxl, and Excel files. My objective is to write some calculated data to a preconfigured Excel template: I load the template and write the data to it. There are two problems:
I'm talking about writing Excel workbooks with more than 2 million cells, divided across several sheets.
I do this successfully, but the waiting time is unacceptable.
I don't know any other way to solve this problem. Maybe openpyxl is not the solution. I have tried writing to xlsb, but I don't think openpyxl supports that format. I have also tried the optimised reader and writer, but the problem comes when I save, because of the amount of data. And yet the output file size is 10 MB at most. I'm very stuck with this. Do you know another way to do this?
Thanks in advance.
The file size isn't really the issue when it comes to memory use; what matters is the number of cells in memory. Your use case really will push openpyxl to its limits: at the moment it is designed to support either optimised reading or optimised writing, but not both at the same time. One thing you might try is reading in openpyxl with use_iterators=True; this will give you a generator that you can feed to xlsxwriter, which should be able to write a new file for you. xlsxwriter is currently significantly faster than openpyxl at creating files. The solution isn't perfect, but it might work for you.
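A rough sketch of that pipeline, assuming a recent openpyxl (where the flag is spelled read_only=True rather than use_iterators=True, and values_only needs version 2.6+); the file names are placeholders:

    # Stream rows out of the template with openpyxl's read-only mode and
    # write them with xlsxwriter, which never keeps the whole sheet in memory.
    import openpyxl
    import xlsxwriter

    wb_in = openpyxl.load_workbook("template.xlsx", read_only=True)
    wb_out = xlsxwriter.Workbook("output.xlsx", {"constant_memory": True})

    for ws_in in wb_in.worksheets:
        ws_out = wb_out.add_worksheet(ws_in.title)
        for row_idx, row in enumerate(ws_in.iter_rows(values_only=True)):
            ws_out.write_row(row_idx, 0, row)

    wb_out.close()
    wb_in.close()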
Related
I've got a very hard task to do: I need to process an Excel file of 6336 rows x 53 columns. My task is to create a program which:
Reads data from the input Excel file.
Sorts all rows by a specific column, e.g. by A1:A(last).
Places columns in the new output Excel file in a given order, e.g.:
    SaleCity:   from old file, Col[A1:A(last)]
    Branch:     derived, e.g. SaleCity='Oklahoma' -> Branch='OKL GamesShop'
    CustomerID: from old file, Col[M1:M(last)]
    InvoiceNum: merge old file cols Col[K1:K(last) & B1:B(last)]
Saves the new Excel file.
Excel sample: (screenshot omitted)
(None of the data in this post is real, so don't try to hack anyone or anything :D)
I know that I did not provide any code, but to be honest I tried solving it by myself, and I don't even know which module I should use. I tried using openpyxl and pandas, but there was too much data for my abilities.
Thank you in advance for any help. If I asked the question in the wrong place, please direct me to the right one.
Edit:
To be clear: I'm not asking for a full solution here. What I'm asking for is guidance and mentoring.
I would recommend using PySpark. It is more difficult than pandas, but the parallelism it provides will help with your large Excel files.
Or you could use Python's multiprocessing library to parallelize the pandas functions:
https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1
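A minimal sketch of the multiprocessing idea from that article; the file name, the column name, and the per-chunk transformation are invented for illustration:

    # Split the DataFrame into chunks and process them in parallel workers.
    from multiprocessing import Pool

    import numpy as np
    import pandas as pd

    def process_chunk(chunk):
        # Placeholder per-chunk work; swap in your real transformation.
        return chunk.assign(Branch=chunk["Branch"].str.upper())

    if __name__ == "__main__":
        df = pd.read_excel("input.xlsx")      # 6336 rows x 53 columns
        chunks = np.array_split(df, 4)        # one chunk per worker
        with Pool(processes=4) as pool:
            parts = pool.map(process_chunk, chunks)
        result = pd.concat(parts)
        result.to_excel("output.xlsx", index=False)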
Morning,
I have dynamic data in Excel which is updated daily, weekly, or monthly (this Excel feed is the only API link). For use in Python, is it better to keep the data stored in Excel, or to transfer it to SQLite and access it from there?
Or is there a more efficient way of managing this process?
Thanks
It depends on what you really need (see below regarding formulae). The KISS way (keep it simple, stupid) is often the best one.
Python libraries such as xlwt and xlrd can write and read Excel files:
http://www.python-excel.org/
But xlwt and xlrd can't evaluate formulae. If you need to work with formulae, try openpyxl (it can read the values Excel last calculated for formula cells): http://openpyxl.readthedocs.org/en/2.5/
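For illustration, openpyxl can give you either the formula text or the cached value Excel last calculated for it (openpyxl does not evaluate formulae itself); the file name and cell are placeholders:

    from openpyxl import load_workbook

    wb_formulas = load_workbook("report.xlsx")                # C1 -> "=A1+B1"
    wb_values = load_workbook("report.xlsx", data_only=True)  # C1 -> e.g. 42

    print(wb_formulas.active["C1"].value)
    print(wb_values.active["C1"].value)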
I'm trying to figure out which one is generally faster for a similar task: using VBA or openpyxl.
I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 columns wide and 150,000 rows tall and I want to copy it from workbook A to workbook B.
Any thoughts on whether Python will do better, or whether Excel is better at dealing with itself?
My gut tells me that Python should be noticeably faster, for a few reasons:
In order for a VBA sub to copy from one workbook to another, both have to be open and running, whereas with Python I can simply load both;
VBA has to deal with a lot of clutter for most tasks, and it takes A LOT of system resources.
Besides that, I'd like to know whether I can make further improvements to an openpyxl script, like multithreading or perhaps using NumPy along with it.
Thanks for the help!
TBH the fastest approach would probably be remote-controlling Excel using xlwings, because this can take advantage of Excel's own optimisations. VBA might be able to hook into that as well, but I've never found VBA to be fast.
Python will have to convert from XML to Python and back to XML. You've got around 7,500,000 cells, so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this and keep memory use low.
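A minimal sketch of that read-only/write-only combination, with placeholder file names:

    # Stream rows from workbook A straight into workbook B; neither sheet
    # is ever held fully in memory.
    from openpyxl import Workbook, load_workbook

    wb_a = load_workbook("workbook_a.xlsx", read_only=True)
    wb_b = Workbook(write_only=True)

    ws_a = wb_a.active
    ws_b = wb_b.create_sheet(ws_a.title)

    for row in ws_a.iter_rows(values_only=True):
        ws_b.append(row)

    wb_b.save("workbook_b.xlsx")
    wb_a.close()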
If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.
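Very roughly, the transplant could look like the following; this is only a sketch that assumes the two workbooks have identical sheet layouts and purely numeric data (no shared strings), and real files may need more metadata surgery than this:

    # Rebuild target.xlsx with sheet1.xml taken from source.xlsx.
    import zipfile

    with zipfile.ZipFile("source.xlsx") as src:
        sheet_xml = src.read("xl/worksheets/sheet1.xml")

    with zipfile.ZipFile("target.xlsx") as src, \
         zipfile.ZipFile("patched.xlsx", "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name == "xl/worksheets/sheet1.xml":
                dst.writestr(name, sheet_xml)  # the transplanted sheet
            else:
                dst.writestr(name, src.read(name))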
TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.
Long version
If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.
Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not be only reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.
It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.
Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL: joins, sorting, grouping, aggregates.
See here for an example using ADO and VBA.
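And a comparable sketch in Python with pyodbc; the driver name below is the one usually installed with the Access Database Engine on Windows but may differ on your machine, and the file, sheet, and column names are placeholders:

    import pyodbc

    conn_str = (
        r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
        r"DBQ=C:\data\sales.xlsx;"
    )
    conn = pyodbc.connect(conn_str, autocommit=True)
    cur = conn.cursor()
    # A worksheet is addressed as [SheetName$]
    cur.execute("SELECT SaleCity, COUNT(*) AS n FROM [Sheet1$] GROUP BY SaleCity")
    for row in cur.fetchall():
        print(row.SaleCity, row.n)
    conn.close()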
With openpyxl ...
This link was really helpful for me:
https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/
1. Use read_only=True when opening the file if all you're doing is reading.
2. Use the built-in iterators! I cannot stress this enough: the iterators are fast, crazy fast.
3. Call functions as infrequently as possible and store intermediate data in variables. It may bulk the code up a bit, but it tends to be more efficient and also allows your code to be more readable (but this is icing on the cake compared to points 1 and 2). Python can also be ambiguous as to what is a variable and what is a function; as a general rule, intermediate variables are good for multiple function calls.
I was doing some reading of values in a particular workbook, and I did this initially:
from openpyxl import load_workbook

wb = load_workbook(filename)  # loads the entire workbook into memory
And that would take nearly 80 seconds. Caching the workbook between actions with it was helpful but still painful every time I reloaded my script.
I switched to read-only mode.
wb = load_workbook(filename, data_only=True, read_only=True)
Now it only takes < 0.1 seconds.
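For illustration, here is what tip 2 looks like in practice; the file name is a placeholder:

    # Sum every numeric cell using the built-in iterator instead of
    # indexing cells one by one.
    from openpyxl import load_workbook

    wb = load_workbook("big_report.xlsx", read_only=True, data_only=True)
    ws = wb.active
    total = 0
    for row in ws.iter_rows(values_only=True):
        total += sum(v for v in row if isinstance(v, (int, float)))
    print(total)
    wb.close()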
My problem is rather simple: I have an Excel sheet that does calculations and creates a graph based on the values of two cells in the sheet. I also have two lists of inputs in text files. I would like to loop through those text files, add the values to the Excel sheet, refresh the sheet, and print the resulting graph to a PDF file or an Excel file named something like 'input1 - input2.xlsx'.
My programming knowledge is limited. I am decent with Python and have looked into Python libraries that work with Excel, such as openpyxl, but most of them don't seem to work for me, for various reasons: openpyxl deletes the graphs when opening an Excel file; XlsxWriter can only write files, not read from them; and xlwings won't work for me.
Should I use Python, which I'm familiar with, or would VBA work better for this kind of problem? Have any of you ever done something of the sort?
Thanks in advance
As a more transitional approach to what m. wasowski wrote above, I'd suggest you do the following.
Install the pandas package, and see how easy it is to load a file using read_excel. Then, read 10 Minutes to Pandas, and manipulate the data.
You state that the Excel sheet is complex. In general, the more complex it is, the more this approach will eventually simplify things. But you don't have to switch everything immediately; you can still do parts in Excel and parts in pandas.
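A minimal sketch of that first step, with invented file and column names:

    import pandas as pd

    df = pd.read_excel("calculations.xlsx")   # load the sheet's data
    print(df.head())                          # inspect it
    # ...manipulate the data in pandas...
    df.to_excel("output.xlsx", index=False)   # write the result back out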
I think you should consider win32com for Excel operations in Python instead of openpyxl or XlsxWriter.
You can read and write Excel files, create charts, and format workbooks using win32com without any limitation.
And for creating charts you can also consider matplotlib; after creating a chart there, you can save it to a PDF file as well.
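A rough sketch of the win32com route (Windows only, with Excel installed); the paths and the input cell are placeholders:

    # Drive the real Excel application: set an input, recalculate, export to PDF.
    import win32com.client

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    wb = excel.Workbooks.Open(r"C:\data\calculations.xlsx")
    ws = wb.Worksheets(1)
    ws.Range("B1").Value = 3.14               # write an input cell
    excel.CalculateFull()                     # refresh formulas and the graph
    wb.ExportAsFixedFormat(0, r"C:\data\input1 - input2.pdf")  # 0 = xlTypePDF
    wb.Close(SaveChanges=False)
    excel.Quit()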
I have a really big database which I want to write to an xlsx/xls file. I already tried to use xlwt, but it allows writing only 65536 rows (some of my tables have more than 72k rows). I also found openpyxl, but it works too slowly and uses a huge amount of memory for big spreadsheets. Are there any other ways to write Excel files?
Edit:
Following kennym's advice I used the optimised reader and writer. It consumes less memory now, but it is still time-consuming: exporting takes more than an hour (for really big tables of up to 10^6 rows). Are there any other possibilities? Maybe it is possible to export a whole table from the HDF5 database file to Excel, instead of doing it row after row, as my code does now?
Try XlsxWriter in constant memory mode:
It is only for writing Excel 2007 xlsx/xlsm files.
It works much faster than openpyxl.
It provides a constant memory mode: http://xlsxwriter.readthedocs.org/working_with_memory.html
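A minimal sketch of constant memory mode; fetch_records() below is a stand-in for whatever yields your rows (an HDF5 or database cursor, for example):

    # Each row is flushed to disk as it is written, so memory stays flat
    # even for ~10^6 rows. Rows must be written in order in this mode.
    import xlsxwriter

    def fetch_records():
        # Stand-in for your real data source.
        for i in range(1_000_000):
            yield (i, i * 2.0, "row %d" % i)

    wb = xlsxwriter.Workbook("export.xlsx", {"constant_memory": True})
    ws = wb.add_worksheet()

    for row_idx, record in enumerate(fetch_records()):
        ws.write_row(row_idx, 0, record)

    wb.close()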
For .xls files, I fear there's no memory-optimised way. Did you find any?
Use the Optimized Reader and Writer of the openpyxl package. The optimized reader and writer run much faster and use far less memory than the standard openpyxl methods.
XlsxWriter worked for me. I tried openpyxl, but it raised an error on my data (22k rows x 400 columns).