I'm trying to figure out which one is generally faster for a similar task: using VBA or openpyxl.
I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 cells wide and 150,000 cells tall and I want to copy it from woorkbook A to workbook B.
Any thoughts on whether python will do better or if Excel is better in dealing with itself?
My guts tell me that python should be fairly faster for some reasons:
In order for a sub to copy from a workbook to another, both should be open and running, whereas with python I can simply load both;
VBA has to deal with a lot of clutter with most tasks and it takes A LOT of system resources
Besides that, I'd like to know if I can make some further improvements to a openpyxl script, like multithreading or perhaps using NumPy along with it.
Thanks for the help!
TBH the fastest approach would probably be remote controlling Excel using xlwings, because this can take advantage of Excel's optimisation. VBA might be able to hook into that as well but I've never found VBA to be fast.
Python will have to convert from XML to Python and back to XML. You've got around 5,000,000 million cells so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this to keep memory use low.
If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.
TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.
Long version
If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.
Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not be only reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.
It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.
Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL -- joins, sorting, grouping, aggregates.
See here for an example using ADO and VBA.
With openpyxl ...
This link was really helpful for me:
https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/
Use read_only when opening the file if all you're doing is reading.
Use the built in iterators!
I cannot stress this enough - the iterators are fast, crazy fast.
Call functions as infrequently as possible and store intermediate
data in variables. It may bulk the code up a bit, but it tends to be
more efficient and also allows your code to be more readable (but this
is icing on the cake compared to points 1 and 2). Python can also be
ambiguous as to what is a variable and what is a function; but as a
general rule intermediate variables are good for multiple function
calls.
I was doing some reading of values in a particular workbook, and I did this initially:
wb = load_workbook(filename)
And that would take nearly 80 seconds. Caching the workbook between actions with it was helpful but still painful every time I reloaded my script.
I switched to reading only.
wb = load_workbook(filename, data_only=True, read_only=True)
Now it only takes < 0.1 seconds.
Related
We have a rather complicated Excel based VBA Tool that shall be replaced by a proper Database and Python based application step by step.
There will be time of the transition between were the not yet completely ready Python tool and the already existing VBA solution will coexist.
To allow interoperability the Python tool must be able to export the database values into the Excel VBA Tool keeping it intact. Meaning that not only all VBA codes have to work as expected but also Shapes, Special Formats etc, Checkboxes etc. have to work after the export.
Currently a simple:
from openpyxl import load_workbook
wb = load_workbook(r'Tool.xlsm', keep_vba=True)
# Write some data i.e. (not required to destroy the file)
wb["SomeSheet!SomeCell"] = "SomeValue"
wb.save(r"Tool_filled.xlsm")
will destroy the file, i.e. shapes won't work, checkboxes neither. (The resulting file is only 5 MB from originally 8 MB, showing that something went quite wrong).
Is there a way to only modify only the data of an ExcelSheet keeping everything else intact/untouched?
As far I know an Excel Sheet are only zipped .xml files. So it should be possible to edit only the related sheets? Correct?
Is there a more comfortable way as writing everything from scratch to only modify the data of an existing Excel file?
Note: The solution has to work in Linux, so simple remote Excel calls are not an option.
I would like to process a large data set of a mechanical testing device with Python. The software of this device only allows to export the data as an Excel file. Therefore, I use the xlrd package which works fine for small *.xlsx files.
The problem I have is, that when I want to open a common data set (3-5 MB) by
xlrd.open_workbook(path_wb)
the access time is about 30s to 60s. Is there any more effecitve and faster way to access Excel files?
You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: I have not been able to successfully create or modify Excel files.). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
I just figured out that it wasn't actually the problem with the access time but I created an object in the same step. Now, by creating the object separately everything works fast and nice.
I am using Flask to make a small webapp to manage a group project, in this website I need to manage attendances, and also meetings reports. I don't have the time to get into SQLAlchemy, so I need to know what might be the bad things about using CSV as a database.
Just don't do it.
The problem with CSV is …
a, concurrency is not possible: What this means is that when two people access your app at the same time, there is no way to make sure that they don't interfere with each other, making changes to each other's data. There is no way to solve this with when using a CSV file as a backend.
b, speed: Whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file is eating up both memory and time.
Databases were made to solve this issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for are ORM - Object-relational mapping - who translate Python code into SQL and manage the SQL databases for you.
PeeweeORM and PonyORM. Both are easy to use and translate all SQL into Python and vice versa. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend PeeweeORM. You can start using SQLite as a backend with Peewee, or if your app grows larger, you can plug in MySQL or PostGreSQL easily.
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it. I.e. make sure you have a clean separation between what you write and how you implement i . That will bloat your code a bit, but it will make sure you can swap your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in you flask app. Use initPersistence(). Don't write "csvFile.appendRecord()". Use "persister.saveNewReport()". When and if you actually realise CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits like you don't have to use a mock library in tests to make them faster. You just provide another persister.
I am absolutely baffled by how many people discourage using CSV as an database storage back-end format.
Concurrency: There is NO reason why CSV can not be used with concurrency. Just like how a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file. Databases can do EXACTLY the same thing with CSV files. Just as a journal is used to maintain the atomic nature of individual transactions, the same exact thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when the database can do what it does for ALL other database storage formats, look up the starting byte of a record in an index file and SEEK to it in constant time and overwrite the data and comment out anything left over and record the free space for latter use in a separate index file, just like a database could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file... I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments... etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spread sheet program, but that is no different than having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc./etc..
I have a big problem here with python, openpyxl and Excel files. My objective is to write some calculated data to a preconfigured template in Excel. I load this template and write the data on it. There are two problems:
I'm talking about writing Excel books with more than 2 millions of cells, divided into several sheets.
I do this successfully, but the waiting time is unthinkable.
I don't know other way to solve this problem. Maybe openpyxl is not the solution. I have tried to write in xlsb, but I think openpyxl does not support this format. I have also tried with optimized writer and reader, but the problem comes when I save, due to the big data. However, the output file size is 10 MB, at most. I'm very stuck with this. Do you know if there is another way to do this?
Thanks in advance.
The file size isn't really the issue when it comes to memory use but the number of cells in memory. Your use case really will push openpyxl to the limits at the moment which is currently designed to support either optimised reading or optimised writing but not both at the same time. One thing you might try would be to read in openpyxl with use_iterators=True this will give you a generator that you can call from xlsxwriter which should be able to write a new file for you. xlsxwriter is currently significantly faster than openpyxl when creating files. The solution isn't perfect but it might work for you.
I have an Excel spreadsheet with calculations I would like to use in a Django web application. I do not need to present the spreadsheet as it appears in Excel. I only want to use the formulae embedded in it. What is the best way to do this?
You can control Excel with Python via COM. See this thread: Driving Excel from Python in Windows
It might be a challenge to get this to work reliably as part of a Django app.
In addition to the COM solution, xlrd is cross-platform. That might be more suitable, since I believe Linux is still the most common deployment environment for django. It's also a lighter-weight solution than pyUno.
I think the only thing you can do is use some python/excel mechanism (the only one I could find was this: http://www.python-excel.org/; the tutorial makes me think it might be doable) to read and write from an excel spreadsheet.
You would write to certain cells that would be used by the spreadsheet formulas and then read the results from the formulas from other cells.
Django per-se has nothing to help you with this.
I'll retag your question to include python so that, maybe, someone with Python-excel experience can comment...
You need to use Excel to calculate the results? I mean, maybe you could run the Excel sheet from OpenOffice and use a pyUNO macro, which is somehow "native" python.
A different approach will be to create a macro to generate some more friendly code to python, if you want Excel to perform the calculation is easy you end up with a very slow process.