How to write a big set of data to an xls file? - python

I have a really big database which I want to write to an xlsx/xls file. I already tried xlwt, but it only allows 65536 rows per sheet (some of my tables have more than 72k rows). I also found openpyxl, but it is too slow and uses a huge amount of memory for big spreadsheets. Are there any other ways to write Excel files?
edit:
Following kennym's advice I used the Optimised Reader and Writer. It consumes less memory now, but it is still time consuming: exporting takes more than an hour for really big tables (up to 10^6 rows). Are there any other options? Maybe it is possible to export a whole table from an HDF5 database file to Excel, instead of doing it row by row, as my code does now?

Try XlsxWriter in constant memory mode.
It only writes Excel 2007 xlsx/xlsm files.
It works much faster than openpyxl.
It provides a constant memory mode: http://xlsxwriter.readthedocs.org/working_with_memory.html
For .xls files I fear there's no memory-optimised way. Did you find any?
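A minimal sketch of the constant-memory write for the xlsx case (the data source below is just a placeholder generator, not the HDF5 table from the question):

import xlsxwriter

# constant_memory flushes each row to disk as it is written, so memory
# use stays flat regardless of row count; rows must be written in order.
workbook = xlsxwriter.Workbook('big_table.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet()

rows = ((i, i * 2, 'text') for i in range(1000000))  # placeholder data source

for row_num, row in enumerate(rows):
    for col_num, value in enumerate(row):
        worksheet.write(row_num, col_num, value)

workbook.close()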

Use the Optimized Reader and Writer of the openpyxl package. The optimized reader and writer run much faster and use far less memory than the standard openpyxl methods.
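In current openpyxl versions the optimised writer is exposed as write-only mode; roughly:

from openpyxl import Workbook

# write_only mode streams rows straight to disk instead of keeping
# every cell in memory; append() is the only way to add data here.
wb = Workbook(write_only=True)
ws = wb.create_sheet()

for i in range(1000000):  # placeholder rows
    ws.append([i, i * 2, 'text'])

wb.save('big_table.xlsx')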

XlsxWriter worked for me. I tried openpyxl but it raised an error on a sheet of roughly 22k rows by 400 columns.

Related

pandas.read_csv vs other csv libraries for loading CSV into a Postgres Database

I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?
It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:
The data might have errors (some rows may not be parseable), the data might be duplicated, the data might be really large.
Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas will I run into problems if rows are unparseable, if the data is really big, or if data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-dupping).
I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.
Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
A quick Google search didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).
There are only memory issues for very large datasets, but these are not the library's fault. How to read a 6 GB csv file with pandas has instructions on how to process a large .csv in chunks with pandas.
Regarding "The data might have errors": read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (the specific course of action depends on whether, and how much, corruption you're supposed to be able to recover from).
I had a school project just last week that required me to load data from a CSV and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, if your database is all one string datatype, forget what I said, you're golden. But if you have a CSV with an assortment of datatypes, either you get to sniff them yourself or you can use pandas, which does it efficiently and automatically. Plus pandas has a nifty write-to-SQL method (to_sql) which can easily be adapted to work with Postgres via a SQLAlchemy connection, too.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
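The to_sql route mentioned above could look something like this (the connection string and table name are placeholders, and a Postgres driver such as psycopg2 is assumed to be installed):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')

df = pd.read_csv('data.csv').drop_duplicates()
# pandas maps the sniffed dtypes to SQL column types for you
df.to_sql('my_table', engine, if_exists='append', index=False)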

Excel api, but where to store the data

Morning,
I have dynamic data which is updated either daily, weekly or monthly in Excel (this is the only API link). However, for use in Python, is it better to keep the data stored in Excel or transfer it to SQLite and access it from there?
Or is there a more efficient way of managing this process?
thanks
It depends on what you really need (see the note on formulae below). The KISS way (keep it simple, stupid) is often the good one.
Some Python libraries like xlwt and xlrd can read and write Excel files:
http://www.python-excel.org/
But xlwt and xlrd can't evaluate formulae. If you need formulae, try openpyxl: http://openpyxl.readthedocs.org/en/2.5/
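If the SQLite route from the question is chosen, a minimal refresh script might look roughly like this (paths, sheet name and the two-column layout are made up; data_only=True returns the cached results of formulae rather than evaluating them):

import sqlite3
from openpyxl import load_workbook

wb = load_workbook('source.xlsx', data_only=True, read_only=True)
ws = wb['Sheet1']

with sqlite3.connect('data.db') as conn:
    conn.execute('CREATE TABLE IF NOT EXISTS measurements (day TEXT, value REAL)')
    # skip the header row and insert the remaining (day, value) pairs
    conn.executemany(
        'INSERT INTO measurements VALUES (?, ?)',
        ws.iter_rows(min_row=2, values_only=True),
    )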

Efficiency: openpyxl or VBA?

I'm trying to figure out which one is generally faster for a similar task: using VBA or openpyxl.
I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 cells wide and 150,000 cells tall and I want to copy it from workbook A to workbook B.
Any thoughts on whether python will do better or if Excel is better in dealing with itself?
My gut tells me that Python should be fairly faster, for a couple of reasons:
In order for a sub to copy from one workbook to another, both have to be open and running, whereas with Python I can simply load both;
VBA has to deal with a lot of clutter for most tasks and it takes A LOT of system resources.
Besides that, I'd like to know if I can make some further improvements to an openpyxl script, like multithreading or perhaps using NumPy along with it.
Thanks for the help!
TBH the fastest approach would probably be remote controlling Excel using xlwings, because this can take advantage of Excel's optimisation. VBA might be able to hook into that as well but I've never found VBA to be fast.
Python will have to convert from XML to Python and back to XML. You've got around 7,500,000 cells, so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this to keep memory use low.
If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.
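A rough sketch of the read-only/write-only combination described above (file names are placeholders; only values are copied, not formatting):

from openpyxl import load_workbook, Workbook

# Stream cells out of workbook A and straight into workbook B;
# neither workbook is ever held fully in memory.
src = load_workbook('workbook_a.xlsx', read_only=True, data_only=True)
dst = Workbook(write_only=True)

src_ws = src.active
dst_ws = dst.create_sheet(title=src_ws.title)

for row in src_ws.iter_rows(values_only=True):
    dst_ws.append(row)

dst.save('workbook_b.xlsx')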
TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.
Long version
If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.
Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not be only reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.
It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.
Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL -- joins, sorting, grouping, aggregates.
See here for an example using ADO and VBA.
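For the Python side, a pyodbc version of that idea might look roughly like this; it assumes Windows with the Microsoft Access Database Engine installed (which provides the Excel ODBC driver), and the path and sheet name are placeholders:

import pyodbc

conn_str = (
    r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'
    r'DBQ=C:\data\workbook_a.xlsx;'
)

conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()
# a worksheet is addressed as [SheetName$]; normal SQL (WHERE, GROUP BY, joins) works
cursor.execute('SELECT * FROM [Sheet1$]')
for row in cursor.fetchall():
    print(row)
conn.close()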
With openpyxl ...
This link was really helpful for me:
https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/
1. Use read_only when opening the file if all you're doing is reading.
2. Use the built-in iterators! I cannot stress this enough - the iterators are fast, crazy fast.
3. Call functions as infrequently as possible and store intermediate data in variables. It may bulk the code up a bit, but it tends to be more efficient and also allows your code to be more readable (but this is icing on the cake compared to points 1 and 2). Python can also be ambiguous as to what is a variable and what is a function; but as a general rule intermediate variables are good for multiple function calls.
I was doing some reading of values in a particular workbook, and I did this initially:
from openpyxl import load_workbook
wb = load_workbook(filename)
And that would take nearly 80 seconds. Caching the workbook between actions with it was helpful but still painful every time I reloaded my script.
I switched to reading only.
wb = load_workbook(filename, data_only=True, read_only=True)
Now it only takes < 0.1 seconds.

openpyxl: writing large excel files with python

I have a big problem here with python, openpyxl and Excel files. My objective is to write some calculated data to a preconfigured template in Excel. I load this template and write the data on it. There are two problems:
I'm talking about writing Excel workbooks with more than 2 million cells, divided into several sheets.
I do this successfully, but the waiting time is unthinkable.
I don't know any other way to solve this problem. Maybe openpyxl is not the solution. I have tried to write in xlsb, but I think openpyxl does not support this format. I have also tried the optimized writer and reader, but the problem comes when I save, due to the amount of data. However, the output file size is 10 MB at most. I'm very stuck with this. Do you know if there is another way to do this?
Thanks in advance.
The file size isn't really the issue when it comes to memory use; what matters is the number of cells in memory. Your use case really will push openpyxl to its limits: at the moment it is designed to support either optimised reading or optimised writing, but not both at the same time. One thing you might try would be to read in openpyxl with use_iterators=True; this will give you a generator that you can call from xlsxwriter, which should be able to write a new file for you. xlsxwriter is currently significantly faster than openpyxl when creating files. The solution isn't perfect but it might work for you.
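Something along these lines, assuming a recent openpyxl where the optimised reader is read_only=True (use_iterators=True in older versions); note that only values are carried over, not the template's formatting:

import xlsxwriter
from openpyxl import load_workbook

# stream rows out of the source file and write them with xlsxwriter,
# which is faster at creating new files
src = load_workbook('template.xlsx', read_only=True, data_only=True)
out = xlsxwriter.Workbook('output.xlsx', {'constant_memory': True})

for sheet in src.worksheets:
    out_sheet = out.add_worksheet(sheet.title)
    for row_num, row in enumerate(sheet.iter_rows(values_only=True)):
        for col_num, value in enumerate(row):
            if value is not None:
                out_sheet.write(row_num, col_num, value)

out.close()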

Python - Reading a spreadsheet

What I need to know is, can I get Python to read a spreadsheet (preferably Microsoft Excel), then parse the information and input it into an equation?
It's for a horse-racing program, where the information for several horses will be in one excel spreadsheet, in different rows or columns. I need to know if I can run a calculation for each of those horses separately and then calculate a score for the given horse.
My suggestion is:
Save the Excel file as a CSV (comma separated values) file, which is a plain text format and much easier to work with.
Use Python's built-in csv module to work with the data in csv format.
You can work with Excel files directly in Python (Excel 2003 format supported via the third party modules xlwt, xlrd) but this is much harder than working with CSV.
OpenPyXL ("A Python library to read/write Excel 2007 xlsx/xlsm files") has a very nice and Pythonic API.
Use xlrd package. It's on PyPI, so you can just easy_install xlrd
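A quick sketch of reading the sheet with xlrd (the file name is a placeholder; note that xlrd 2.x only reads the old .xls format, so use openpyxl for .xlsx files):

import xlrd

book = xlrd.open_workbook('horses.xls')
sheet = book.sheet_by_index(0)

for row_index in range(1, sheet.nrows):  # skip the header row
    values = sheet.row_values(row_index)
    print(values)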
You can export the spreadsheet as a .csv and read it in as a text file, then process it. I have a niggling feeling there might even be a CSV parsing Python library.
AFAIK there isn't a .xls parser, although I might be wrong.
EDIT: I was wrong: http://www.python-excel.org/
