Combine tab-separated value (TSV) files into an Excel 2007 (XLSX) spreadsheet - python

I need to combine several tab-separated value (TSV) files into an Excel 2007 (XLSX) spreadsheet, preferably using Python. There is not much cleverness needed in combining them - just copying each TSV file onto a separate sheet in Excel will do. Of course, the data needs to be split into columns and rows same as Excel does when I manually copy-paste the data into the UI.
I've had a look at the raw XML file Excel 2007 generates and it's huge and complex, so writing that from scratch doesn't seem realistic. Are there any libraries available for this?

Looks like xlwt may serve your needs -- you can read each TSV file with Python's standard library csv module (which DOES do tab-separated as well as comma-separated etc, don't worry!-) and use xlwt (maybe via this cheatsheet;-) to create an XLS file, make sheets in it, build each sheet from the data you read via csv, etc. Not sure about XLSX vs plain XLS support but maybe the XLS might be enough...?
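A minimal sketch of the csv + xlwt approach described above. The function names (read_tsv, sheet_name_for, tsv_files_to_xls) are made up for illustration, the input is assumed to be UTF-8, and every cell is written as text with no type detection. The output is a legacy .xls workbook, not XLSX.

```python
import csv
import os


def read_tsv(path):
    """Return the rows of a tab-separated file as a list of lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def sheet_name_for(path):
    """Derive a sheet name from a file name (Excel caps sheet names at 31 characters)."""
    return os.path.splitext(os.path.basename(path))[0][:31]


def tsv_files_to_xls(tsv_paths, out_path):
    """Write each TSV file to its own sheet of one legacy .xls workbook."""
    import xlwt  # third-party; imported here so the helpers above work without it

    workbook = xlwt.Workbook()
    for path in tsv_paths:
        sheet = workbook.add_sheet(sheet_name_for(path))
        for r, row in enumerate(read_tsv(path)):
            for c, value in enumerate(row):
                sheet.write(r, c, value)  # every cell is stored as text
    workbook.save(out_path)
```

Calling tsv_files_to_xls(["a.tsv", "b.tsv"], "combined.xls") would produce one sheet per input file. Keep in mind that the XLS format allows at most 65,536 rows and 256 columns per sheet.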

The best python module for directly creating Excel files is xlwt, but it doesn't support XLSX.
As I see it, your options are:
1. If you only have "several", you could just do it by hand.
2. Use pythonwin to control Excel through COM. This requires you to run the code on a Windows machine with Excel 2007 installed.
3. Use python to do some preprocessing on the TSV to produce a format that will make step (1) easier. I'm not sure if Excel reads TSV, but it will certainly read CSV files directly.

Note that Excel 2007 will quite happily read "legacy" XLS files (those written by Excel 97-2003 and by xlwt). You need XLSX files because .....?
If you want to go with the defaults that Excel will choose when deciding whether each piece of your data is a number, a date, or some text, use pythonwin to drive Excel 2007. If the data is in a fixed layout such that other than a possible heading row, each column contains data that is all of one known type, consider using xlwt.
You may wish to approach xlwt via http://www.python-excel.org which contains an up-to-date tutorial for xlrd, xlwt, and xlutils.

Related

How do xlrd, xlwt, xlutils work with Excel in the low level?

They are all open source Python packages to control Excel (see python-excel). I am still trying to understand their code. Could anyone give a hint: how do they connect to Excel at a low level? Via XML, the Excel API, or some other basic Python packages?
If we are talking about reading and writing XLS files, xlrd and xlwt basically follow the OpenOffice.org document/specification describing Excel's format and its BIFF (Binary Interchange File Format) records. If you inspect the xlwt source code, you will find that it manipulates BIFF records for everything that needs to be written: creating the workbook and worksheets, writing data, formatting, alignment, etc.
With XLSX the story is a bit different. To read XLSX, xlrd relies on the openxmlformats XML schemas and uses the ElementTree XML parsers built into Python (cElementTree if available, otherwise ElementTree) to parse the XLSX file, which is, to simplify, a zip archive containing XML files. Here is a good overview of what is inside the archive:
Anatomy of OOXML - xlsx
I would also recommend studying the xlsxwriter module. From my point of view, the package is much better documented, and its code is much cleaner and more readable than that of xlwt or xlrd.
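To see the zip-of-XML structure for yourself, here is a small sketch that uses only the standard library. The function names are made up for illustration, and it assumes the workbook part sits at the conventional xl/workbook.xml path that Excel writes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML schema used inside xl/workbook.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def list_parts(xlsx_file):
    """Return the names of the files stored inside the XLSX zip archive."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return archive.namelist()


def sheet_names(xlsx_file):
    """Read the sheet names, in workbook order, from xl/workbook.xml."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter("{%s}sheet" % MAIN_NS)]
```

Both functions accept a path or a file-like object. Libraries such as xlrd do the same kind of parsing, just for every part of the archive and with far more care.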

How to parse cell format data from a .xlsx file using xlrd

I am relatively new to python and I am trying to read information from an excel sheet to generate a graph. So far I am using the most current version of the xlrd library (0.9.4) in a nested for loop to grab the value from each cell. However, I am unsure how to access the formatting information for each cell.
For example, if a cell were formatted to display as currency in the excel file, using the standard sheet.cell(row, column).value from xlrd would only return 5.0 instead of $5.00
I found here that you can set the formatting_info parameter to true when opening the workbook in order to see some of the format information, however I am primarily using excel 2013 and my excel sheets are being saved by default as .xlsx files. According to this issue on GitHub, support for formatting_info has not yet been implemented for .xlsx files.
Is there any way around using the formatting_info flag, or any other way that I can detect when a format, currency specifically, has been used in order to reflect that in my graphs? I am aware that it is possible to convert .xlsx files to .xls files such as shown here, but I am concerned about information/formatting loss.
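One possible workaround, sketched here as a suggestion rather than an answer from the thread, is to read the file with openpyxl, which exposes each cell's format code as number_format. The currency check below is a hypothetical heuristic of my own: it looks for a currency symbol after dropping locale-only tags such as [$-409], which also appear in date formats.

```python
import re

# Locale-only tags like [$-409] contain "$" but do not denote a currency.
_LOCALE_ONLY_TAG = re.compile(r"\[\$-[0-9A-Fa-f]+\]")


def looks_like_currency(number_format):
    """Heuristic: treat a format code that contains a currency symbol as currency."""
    stripped = _LOCALE_ONLY_TAG.sub("", number_format)
    return any(symbol in stripped for symbol in "$€£¥")


def currency_cells(path):
    """Yield (coordinate, value, number_format) for cells formatted as currency."""
    from openpyxl import load_workbook  # third-party; imported lazily

    sheet = load_workbook(path).active
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value is not None and looks_like_currency(cell.number_format):
                yield cell.coordinate, cell.value, cell.number_format
```

The symbol list is deliberately short and would need extending for other currencies.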

XLSX to XML with schema map

I have built a couple basic workflows using XML tools on top of XLSX workbooks that are mapped to an XML schema. You would enter data into the spreadsheet, export the XML and I had some scripts that would then work with the data.
Now I'm trying to eliminate that step and build a more integrated and portable tool that others could use easily by moving from XSLT/XQuery to Python. I would still like to use Excel for the data entry, but have the Python script read the XLSX file directly.
I found a bunch of easy to use libraries to read from Excel but they need to explicitly state what cells the data is in, like range('A1:C2') etc. The useful thing about using the XML maps was that users could resize or even move tables to fit different rows and rename sheets. Is there a library that would let me select tables as units?
Another approach I tried was to just uncompress the XLSX and just parse the XML directly. The problem with that is that our data is quite complex (taking up to 30-50 sheets) and parsing that in the uncompressed XLSX structure is really daunting. I did find my XML schema within the uncompressed XLSX, so is there any way to reformat the data into this schema outside of Excel? (basically what Excel does when I save a workbook as an .xml file)
The Excel format is pretty complicated, with dependencies between components. You can't, for example, be sure that the order of the worksheets in the worksheets folder has any bearing on what the file looks like in Excel.
I don't really understand exactly what you're trying to do but the existing libraries present an interface for client code that hides the XML layer. If you don't want that you'll have to root around for the parts that you find useful. In openpyxl you want to look at the stuff in openpyxl/reader specifically worksheet.py.
However, you might have better luck using lxml, as this (using libxml2 in the background) will allow you to load a single XML file into Python and manipulate it directly using the lxml.objectify module. We don't do this in openpyxl because XML trees consume a lot of memory (and many people have very large worksheets), but the library for working with Powerpoint shows just how easy this can be.
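As a rough illustration of rooting around in the archive for the useful parts, the sketch below reads the table definitions that Excel stores under xl/tables/ and returns each table's current cell range, so a moved or resized table can still be located by name. It uses only the standard library, the function name is hypothetical, and it does not resolve which worksheet owns each table (that requires following the relationship files, which is omitted here).

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def table_ranges(xlsx_file):
    """Map each table's display name to its current cell range."""
    ranges = {}
    with zipfile.ZipFile(xlsx_file) as archive:
        for part in archive.namelist():
            if part.startswith("xl/tables/") and part.endswith(".xml"):
                root = ET.fromstring(archive.read(part))
                # Each table part has a single <table> root carrying its range.
                if root.tag == "{%s}table" % MAIN_NS:
                    ranges[root.get("displayName")] = root.get("ref")
    return ranges
```

The returned range string (for example "B2:D10") can then be handed to whichever reading library you prefer.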

How to programmatically import csv into excel and use excel formatting?

I have a very large (> 2 million rows) csv file that is being generated and viewed in an internal web service. The problem is that when users of this system want to export this csv to run custom queries, they open these files in excel. Excel is formatting the numbers the best it can, but there are some requests to have the data in xlsx format with filters and whatnot.
The question boils down to: Using python2.7, how can I read a large csv file (>2 million rows) into excel (or multiple excel files) and control the formatting? (dates, numbers, autofilters, etc)
I am open to python and internal excel solutions.
Without more information about the data types in the csv, or your exact issue with EXCEL properly handling those data types, it's hard to give you an exact answer.
However, I recommend looking at this module (https://xlsxwriter.readthedocs.org/), which can be used in Python to create xlsx files. I haven't used it, but it seems to have more than the features you need, especially if you need to split between multiple files or workbooks. And it looks like you can pre-create the filters and have total control over the formatting.
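A rough sketch of how that could look with xlsxwriter, written for Python 3 rather than 2.7. A worksheet holds at most 1,048,576 rows, so a file with more than 2 million rows has to be split. The function names are made up for illustration, the CSV is assumed to be UTF-8 with a header row, and all values are written as text.

```python
import csv
from itertools import islice

EXCEL_MAX_ROWS = 1048576  # rows per worksheet in the XLSX format


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def csv_to_xlsx(csv_path, xlsx_path, rows_per_sheet=EXCEL_MAX_ROWS - 1):
    """Copy a large CSV into one workbook, starting a new sheet when one fills up."""
    import xlsxwriter  # third-party; imported lazily

    # constant_memory flushes each row to disk once the next row is started.
    workbook = xlsxwriter.Workbook(xlsx_path, {"constant_memory": True})
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        for chunk in chunks(reader, rows_per_sheet):
            sheet = workbook.add_worksheet()
            sheet.write_row(0, 0, header)  # repeat the header on every sheet
            for r, row in enumerate(chunk, start=1):
                sheet.write_row(r, 0, row)
            sheet.autofilter(0, 0, len(chunk), len(header) - 1)
    workbook.close()
```

Controlling date and number formats would mean converting each value yourself and passing a format object created with workbook.add_format() to the write calls.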

Python - Reading a spreadsheet

What I need to know is, can I get Python to read a spreadsheet (preferably Microsoft Excel), then parse the information and input it into an equation?
It's for a horse-racing program, where the information for several horses will be in one excel spreadsheet, in different rows or columns. I need to know if I can run a calculation for each of those horses separately and then calculate a score for the given horse.
My suggestion is:
Save the Excel file as a csv comma separated value file, which is a plain text format and much easier to work with.
Use Python's built-in csv module to work with the data in csv format.
You can work with Excel files directly in Python (Excel 2003 format supported via the third party modules xlwt, xlrd) but this is much harder than working with CSV.
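A small sketch of the CSV route, assuming one horse per row. The column names (name, wins, races) and the scoring rule are invented purely for illustration; substitute your own equation.

```python
import csv
import io


def score(wins, races):
    """Hypothetical scoring rule: fraction of races won."""
    return wins / races if races else 0.0


def score_horses(csv_file):
    """Return {horse name: score} from a CSV with 'name', 'wins' and 'races' columns."""
    reader = csv.DictReader(csv_file)
    return {row["name"]: score(int(row["wins"]), int(row["races"])) for row in reader}


# In-memory sample standing in for a file exported from Excel.
sample = io.StringIO("name,wins,races\nComet,3,12\nBramble,0,0\n")
print(score_horses(sample))  # {'Comet': 0.25, 'Bramble': 0.0}
```

With a real export you would pass open("horses.csv", newline="") instead of the in-memory sample.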
OpenPyXL ("A Python library to read/write Excel 2007 xlsx/xlsm files") has a very nice and Pythonic API.
Use xlrd package. It's on PyPI, so you can just easy_install xlrd
You can export the spreadsheet as a .csv and read it in as a text file, then process it. I have a niggling feeling there might even be a CSV parsing python library.
AFAIK there isn't a .xls parser, although I might be wrong.
EDIT: I was wrong: http://www.python-excel.org/
