How can I reduce the access time on large Excel files? - python

I would like to process a large data set from a mechanical testing device with Python. The device's software only allows exporting the data as an Excel file. Therefore, I use the xlrd package, which works fine for small *.xlsx files.
The problem I have is that when I want to open a typical data set (3-5 MB) with
xlrd.open_workbook(path_wb)
the access time is about 30 to 60 seconds. Is there a more effective and faster way to access Excel files?

You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: "I have not been able to successfully create or modify Excel files."). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
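If you want to try the PyPyODBC route, here is a minimal sketch, assuming a Windows machine with the Microsoft Excel ODBC driver installed and a worksheet named Sheet1 (the driver string, file path and sheet name are placeholders to adapt):

import pypyodbc

# Connection string assumes the Microsoft Excel ODBC driver is installed;
# the exact driver name depends on your Office/Access Database Engine version.
conn_str = (
    r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\data\measurements.xlsx;"
)

conn = pypyodbc.connect(conn_str)
cursor = conn.cursor()

# Each worksheet is exposed as a table named "<sheet name>$"
cursor.execute("SELECT * FROM [Sheet1$]")
rows = cursor.fetchall()

cursor.close()
conn.close()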

I just figured out that the access time wasn't actually the problem; I was also creating another object in the same step. Now that I create the object separately, everything works quickly and nicely.

Related

How to version-control a set of input data along with its processing scripts?

I am working with a set of Python scripts that take data from an Excel file that is set up to behave as a pseudo-database. Excel is used instead of SQL software due to compatibility and access requirements for the other people I work with, who aren't familiar with databases.
I have a set of about 10 tables with multiple records in each and relational keys linking them all (again in a pseudo-linking kind of way, using some flimsy data validation).
The scripts I am using are version controlled with Git, and I know the pitfalls of adding a .xlsx file to a repo, so I have kept it out. Since the data is a bit vulnerable, I want to make sure I have a way of keeping track of any changes we make to it. My thought was to have a script that breaks the Excel file into .csv tables and adds those to the repo, i.e.:
import pandas as pd
from pathlib import Path

excel_input_file = Path(r"<...>")   # path to the Excel pseudo-database
output_path = Path(r"<...>")        # directory for the exported .csv tables

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}
tables_dict = pd.read_excel(excel_input_file, sheet_name=None)
for sheet_name, table in tables_dict.items():
    table.to_csv(output_path / (sheet_name + '.csv'), index=False)
Would this be a typically good method for keeping track of the input files at each stage?
Git tends to work better with text files than with binary files, as you've noted, so this would be a better choice than just checking in an Excel file. Specifically, Git can diff and merge these CSV files, whereas it cannot merge an Excel file natively.
Typically the way that people handle this sort of situation is to take one or more plain text input files (e.g., CSV or SQL) and then build them into the usable output format (e.g., Excel or database) as part of the build or test step, depending on where they're needed. I've done similar things by using a Git fast-export dump to create test Git repositories, and it generally works well.
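For the Excel case, that build step can simply be the mirror image of the export script in your question; a minimal sketch (the paths are placeholders, and one worksheet is created per CSV file):

import pandas as pd
from pathlib import Path

csv_dir = Path(r"<...>")            # directory containing the versioned .csv tables
excel_output_file = Path(r"<...>")  # rebuilt workbook, not tracked by Git

# Write each CSV back as a worksheet named after the file
with pd.ExcelWriter(excel_output_file) as writer:
    for csv_file in sorted(csv_dir.glob("*.csv")):
        pd.read_csv(csv_file).to_excel(writer, sheet_name=csv_file.stem, index=False)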
If you had just one input file, which you don't in this case, you could also use smudge and clean filters to turn the source file in the repository into a different format in the checkout. You can read about this with man gitattributes.

Need code for importing .csv file via python or ruby code to Cassandra 3.11.3 DB (Production use)

We have a 7-node Cassandra 3.11.3 production cluster. We receive a ticket-details dump on a mid server, and I need to read this .csv file and import its data into a Cassandra table. I tried Ruby code, which was easy for me to write, but it does not handle all the column values correctly (the .csv contains special characters, embedded line breaks, UTF encoding issues, and long free-text descriptions straight from the ticketing tool), and the data varies from row to row.
I want to know whether Ruby or Python is better suited to this activity in production, and whether anyone has good sample code for mitigating the issues mentioned above in a production environment.
Both Ruby and Python are perfectly suited to this kind of task, but if your source file is in a bad format then any tool can fail - there is no magic-button tool that can deduce the context from a broken data file and fix all the problems for you automatically.
I'd suggest splitting the task into two parts: 1) fix the encoding and data quality problem(s) (and perform any data transformations if necessary) and then 2) import clean data.
Task 2 can easily be done with almost any programming language that has an appropriate Cassandra driver available, but if you have a well-formatted CSV source you probably don't need any custom code at all (depending on the use case, of course) - Cassandra's cqlsh supports a COPY ... FROM command that imports data from CSV directly (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html).
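If you do end up scripting it in Python, here is a minimal sketch using the standard csv module and the DataStax cassandra-driver; the contact points, keyspace, table and column names are made up for illustration, and the CSV is assumed to already be clean UTF-8 with a header row:

import csv
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # contact points are placeholders
session = cluster.connect("ticketing")        # keyspace name is a placeholder

# Prepared statement; table and column names are assumptions
insert = session.prepare(
    "INSERT INTO tickets (ticket_id, status, description) VALUES (?, ?, ?)"
)

with open("tickets.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)            # handles quoted fields with embedded commas/newlines
    next(reader, None)                # skip the header row
    for ticket_id, status, description in reader:
        session.execute(insert, (ticket_id, status, description))

cluster.shutdown()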

Using Python / pandaSDMX to process & upload to Postgresql

I am seeking a way to take SDMX files (like here: http://www12.statcan.gc.ca/datasets/Alternative.cfm?PID=105929&EXT=SDMX) and process them into a Postgresql table.
I can use rsdmx (https://cran.r-project.org/web/packages/rsdmx/index.html) for smaller datasets but for large ones we reach a number of limitations in R.
PandaSDMX (https://pandasdmx.readthedocs.io/en/latest/) appears to resolve some of these issues, but I am not experienced in Python and can't seem to get the syntax to work. I'm able to use Response.get() to load a local file as a response object, but I am not sure where to go from there.
I know I need to apply the code tables (structure file), but I'm not sure how to do that, or how to then use odo (http://odo.pydata.org/en/latest/) to send the result to PostgreSQL.
Hoping someone can help out or suggest another method to pursue.
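For reference, one possible route is to get a pandas DataFrame out of pandaSDMX and push it to PostgreSQL with pandas' own to_sql instead of odo. This is only a rough sketch: the exact pandaSDMX call that yields a DataFrame depends on the library version (Response.write() in older releases, read_sdmx()/to_pandas() in newer ones), and the file, table and connection names here are placeholders:

import pandasdmx as sdmx
import sqlalchemy

# Parse a local SDMX-ML data file (file name is a placeholder)
msg = sdmx.read_sdmx("Generic_105929.xml")

# Convert the first dataset to pandas and flatten the index into columns;
# the attribute/function names may differ in older pandaSDMX versions.
df = sdmx.to_pandas(msg.data[0]).reset_index()

# Load into PostgreSQL via SQLAlchemy (connection string and table name are placeholders)
engine = sqlalchemy.create_engine("postgresql://user:password@localhost/mydb")
df.to_sql("statcan_105929", engine, if_exists="replace", index=False)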

Efficiency: openpyxl or VBA?

I'm trying to figure out which one is generally faster for a similar task: using VBA or openpyxl.
I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 cells wide and 150,000 cells tall and I want to copy it from workbook A to workbook B.
Any thoughts on whether python will do better or if Excel is better in dealing with itself?
My gut tells me that Python should be noticeably faster, for a couple of reasons:
In order for a VBA sub to copy from one workbook to another, both have to be open and running in Excel, whereas with Python I can simply load both;
VBA has to deal with a lot of clutter for most tasks, and it takes a lot of system resources.
Besides that, I'd like to know if I can make further improvements to an openpyxl script, like multithreading or perhaps using NumPy along with it.
Thanks for the help!
TBH the fastest approach would probably be remote controlling Excel using xlwings, because this can take advantage of Excel's optimisation. VBA might be able to hook into that as well but I've never found VBA to be fast.
Python will have to convert from XML to Python objects and back to XML. You've got around 7.5 million cells, so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this to keep memory use low.
If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.
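As an illustration of combining read-only and write-only modes, here is a minimal sketch (file and sheet names are placeholders); it streams rows from workbook A into workbook B without holding either fully in memory:

from openpyxl import load_workbook, Workbook

src_wb = load_workbook("workbook_a.xlsx", read_only=True, data_only=True)
src_ws = src_wb.active

dst_wb = Workbook(write_only=True)          # write-only workbooks stream rows to disk
dst_ws = dst_wb.create_sheet("copied")

for row in src_ws.iter_rows(values_only=True):
    dst_ws.append(row)                      # append takes a tuple/list of cell values

dst_wb.save("workbook_b.xlsx")
src_wb.close()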
TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.
Long version
If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.
Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not be only reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.
It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.
Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL -- joins, sorting, grouping, aggregates.
See here for an example using ADO and VBA.
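A rough Python counterpart of that ADO/VBA example, using pyodbc (this assumes Windows with the Microsoft Excel ODBC driver installed; the sheet and column names are made up for illustration):

import pyodbc

conn_str = (
    r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\data\workbook_a.xlsx;"
)
conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()

# Treat the worksheet as a table and let the driver do the grouping/aggregation;
# "Sheet1", "Category" and "Value" are assumed names.
cursor.execute(
    "SELECT Category, COUNT(*), AVG(Value) FROM [Sheet1$] GROUP BY Category"
)
for category, n, avg_value in cursor.fetchall():
    print(category, n, avg_value)

conn.close()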
With openpyxl ...
This link was really helpful for me:
https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/
Use read_only when opening the file if all you're doing is reading.
Use the built-in iterators! I cannot stress this enough - the iterators are fast, crazy fast.
Call functions as infrequently as possible and store intermediate data in variables. It may bulk the code up a bit, but it tends to be more efficient and also allows your code to be more readable (but this is icing on the cake compared to points 1 and 2). Python can also be ambiguous as to what is a variable and what is a function; but as a general rule intermediate variables are good for multiple function calls.
I was doing some reading of values in a particular workbook, and I did this initially:
wb = load_workbook(filename)
And that would take nearly 80 seconds. Caching the workbook between actions with it was helpful but still painful every time I reloaded my script.
I switched to read-only mode:
wb = load_workbook(filename, data_only=True, read_only=True)
Now it takes less than 0.1 seconds.

Something wrong with using CSV as a database for a webapp?

I am using Flask to make a small webapp to manage a group project; in this webapp I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the downsides of using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a. Concurrency is not possible: when two people access your app at the same time, there is no way to make sure that they don't interfere with each other and overwrite each other's changes. There is no way to solve this when using a CSV file as a backend.
b. Speed: whenever you make changes to a CSV file, you need to rewrite and reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - an object-relational mapper - which translates Python code into SQL and manages the database for you.
Two good choices are Peewee and Pony ORM. Both are easy to use and translate all SQL into Python and vice versa. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend, and if your app grows larger you can plug in MySQL or PostgreSQL easily.
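To give a sense of how little setup Peewee needs, here is a minimal sketch with SQLite as the backend (the model and field names are made up for illustration):

import datetime
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField, ForeignKeyField

db = SqliteDatabase("project.db")   # single file, no database server required

class BaseModel(Model):
    class Meta:
        database = db

class Member(BaseModel):
    name = CharField()

class Attendance(BaseModel):
    member = ForeignKeyField(Member, backref="attendances")
    date = DateField()
    present = BooleanField(default=False)

db.connect()
db.create_tables([Member, Attendance])

alice = Member.create(name="Alice")
Attendance.create(member=alice, date=datetime.date(2024, 1, 15), present=True)

# Query back: everyone marked present
for a in Attendance.select().where(Attendance.present == True):
    print(a.member.name, a.date)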
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with it as long as you abstract it away, i.e. make sure you have a clean separation between what you store and how you store it. That will bloat your code a bit, but it will make sure you can swap out the CSV storage in a matter of days.
That is, pretend that you can persist your data as if you were keeping it in memory. Don't write openCSVFile() in your Flask app; use initPersistence(). Don't write csvFile.appendRecord(); use persister.saveNewReport(). When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits like you don't have to use a mock library in tests to make them faster. You just provide another persister.
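A minimal sketch of that kind of seam, with made-up names (Persister and CsvPersister are illustrative, not a specific library):

import csv
from abc import ABC, abstractmethod

class Persister(ABC):
    """The Flask app talks only to this interface, never to a file format."""

    @abstractmethod
    def save_new_report(self, report):
        ...

    @abstractmethod
    def all_reports(self):
        ...

class CsvPersister(Persister):
    """CSV-backed implementation; a database-backed persister could replace it later."""

    def __init__(self, path):
        self.path = path

    def save_new_report(self, report):
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([report["date"], report["summary"]])

    def all_reports(self):
        with open(self.path, newline="") as f:
            return [{"date": d, "summary": s} for d, s in csv.reader(f)]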
I am absolutely baffled by how many people discourage using CSV as an database storage back-end format.
Concurrency: There is NO reason why CSV cannot be used with concurrency. Just as a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file, a database can do EXACTLY the same thing with CSV files. And just as a journal is used to maintain the atomic nature of individual transactions, the exact same thing can be done with CSV.
Speed: Why on earth would a database read and write a whole file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, mark anything left over as free, and record the free space in a separate index file for later use, just as it could zero out the bytes of any unneeded areas of a binary "row"? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you changed it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted, deleted, or out of sync.
