How would you parse a Microsoft OLE compound document using Python?
Edit: Sorry, I forgot to say that I need write support too. In short, I have an OLE compound file (made with a CAD application) that I have to read, modify slightly, and write back to disk.
I just found OleFileIO_PL, which originally had no write support. Update: as of version 0.40 (2014) it does have write support.
Edit: It looks like there is also a way (though Windows-only) that supports writing: the pywin32 extensions (the StgOpenStorage function and related calls).
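Before choosing a library, it can help to confirm that the file really is an OLE compound document and to see its basic layout. The following is a minimal sketch using only the standard struct module. The field offsets follow the published compound file header layout; the function name read_ole_header and the returned dictionary keys are my own choices for illustration, not part of any library.

```python
import struct

# Every OLE compound document starts with this 8-byte signature.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_ole_header(data: bytes) -> dict:
    """Parse the fixed-size fields at the start of a compound file header.

    Hypothetical helper for illustration; it does not follow sector chains.
    """
    if len(data) < 76 or data[:8] != OLE_MAGIC:
        raise ValueError("not an OLE compound document")
    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift and mini-sector shift (all little-endian uint16).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offsets 40..75: nine consecutive little-endian uint32 fields.
    (num_dir, num_fat, first_dir, _transaction, mini_cutoff,
     first_minifat, num_minifat, first_difat, num_difat) = struct.unpack_from(
        "<9I", data, 40
    )
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat,
        "first_dir_sector": first_dir,
        "mini_stream_cutoff": mini_cutoff,
    }
```

To use it, read the first 512 bytes of the file in binary mode and pass them in. Anything beyond this (walking the FAT, reading directory entries, writing streams) is better left to one of the libraries mentioned in this thread.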
An alternative: The xlrd package has a reader. The xlwt package (a fork of pyExcelerator) has a writer. They handle file sizes of hundreds of MB cheerfully; the packages have been widely used for about 4 years. The compound document modules are targeted at getting "Workbook" streams into and out of Excel .xls files as efficiently as possible, but are reasonably general-purpose. Unlike OleFileIO_PL, they don't provide access to the internals of Property streams.
http://pypi.python.org/pypi/xlrd
http://pypi.python.org/pypi/xlwt
If you decide to use them and need help, ask in this forum:
http://groups.google.com/group/python-excel
For completeness: on Linux there's also the GNOME Structured File Library (but the default package for Debian/Ubuntu has Python support disabled, since the Python bindings are unsupported since 2006) and the POIFS Java library.
Related
Is there a way to import .yxdb (Alteryx database files) into Pandas/Python, without using Alteryx as a go-between?
The short answer is no, not at this time.
Longer answer: the raw C++ code for .yxdb support is available on GitHub; it was open sourced in order to adhere to R licensing when Alteryx hooked into R. See this link where Ned Harding explains it all in his blog. So basically, everything is there for someone to build Python support on top of the open source C++, but nobody has done so just yet.
Not Python-specific, but there is a YXDB-to-SQLite command-line tool based on the C++ library open sourced by Alteryx.
Limitations:
Not a Python module: use subprocess to invoke the command, then pandas/sqlite3 to read from the SQLite database file.
Read YXDB into SQLite only: writing to YXDB is not implemented (although the Alteryx library would allow it).
Disclaimer: I'm the author of the fork.
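The workflow described in the limitations above (invoke the converter with subprocess, then read the SQLite file) could look roughly like this. It is only a sketch: the executable name and its argument order are assumptions, since the tool's actual command line is not given here, so check its documentation before relying on convert_yxdb. The reading half uses only the standard sqlite3 module.

```python
import sqlite3
import subprocess
from contextlib import closing


def convert_yxdb(tool, yxdb_path, sqlite_path):
    """Run the external converter.

    ASSUMPTION: the tool takes "input.yxdb output.sqlite" as positional
    arguments. The real command line may differ.
    """
    subprocess.run([tool, yxdb_path, sqlite_path], check=True)


def list_tables(sqlite_path):
    """Return the names of all tables in the SQLite file, sorted."""
    with closing(sqlite3.connect(sqlite_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def read_table(sqlite_path, table):
    """Return (column_names, rows) for one table."""
    quoted = '"' + table.replace('"', '""') + '"'  # quote the identifier
    with closing(sqlite3.connect(sqlite_path)) as conn:
        cur = conn.execute(f"SELECT * FROM {quoted}")
        columns = [desc[0] for desc in cur.description]
        return columns, cur.fetchall()
```

If pandas is available, pandas.read_sql_query("SELECT * FROM some_table", conn) on the same connection gives a DataFrame directly instead of a list of tuples.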
In Python, is it possible at run time to convert a Google Protocol Buffers .proto file into a python class that reads that data? Python is a very dynamic language. When you use protoc to convert a .proto file to python source code, the generated code makes a lot of use of python metaclasses, so it's already very dynamic.
Ideally, I'm thinking of something like this:
import whatever
module = whatever.load_from_file("myfile.proto")
Is this possible?
(I am new to protocol buffers, please let me know if my question makes no sense)
In theory, all the pieces exist to make this work. The Python protobuf implementation could call the C++ .proto parser library (libprotoc) as a C extension to get Descriptors, and then could feed those into the metaclasses.
However, as far as I know, no one has quite tied it all together. (Disclaimer: My knowledge is a few years old, so I may have missed a new development, but I don't see anything in the docs.)
Incidentally, Cap'n Proto's Python implementation does do what you describe, proving it is possible. But that doesn't help you if you need to work with Protobuf format.
(Disclosure: I was the author of most of Google's open source Protobuf code, and I am also the author of Cap'n Proto.)
The most elegant solution to this problem is to use __init__.py. See: What is __init__.py for?
You can invoke your script to generate Python protobuf classes from __init__.py. When the package is imported, that script is invoked automatically and generates the Python protobuf classes.
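A rough sketch of that idea, assuming the protoc compiler is on PATH and the protobuf runtime is installed. The helper names protoc_command and load_proto are made up for this example; --proto_path and --python_out are standard protoc options, and protoc names the generated module after the .proto file with a _pb2 suffix.

```python
import importlib
import subprocess
import sys
from pathlib import Path


def protoc_command(proto_file, out_dir, protoc="protoc"):
    """Build the protoc command line that generates Python code."""
    proto_file = Path(proto_file)
    return [
        protoc,
        f"--proto_path={proto_file.parent}",
        f"--python_out={out_dir}",
        str(proto_file),
    ]


def pb2_module_name(proto_file):
    """myfile.proto -> myfile_pb2 (the name protoc gives the module)."""
    return Path(proto_file).stem + "_pb2"


def load_proto(proto_file, out_dir):
    """Compile a .proto file and import the generated module.

    ASSUMPTION: protoc is installed and out_dir already exists.
    """
    subprocess.run(protoc_command(proto_file, out_dir), check=True)
    if str(out_dir) not in sys.path:
        sys.path.insert(0, str(out_dir))
    return importlib.import_module(pb2_module_name(proto_file))
```

Calling load_proto from a package's __init__.py would regenerate the classes on every import, so in practice you may want to skip the protoc call when the generated file is newer than the .proto file.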
I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it tough to find one that meets all of these requirements:
cross platform
supports DOC, DOCX and PDF formats at once
easy to use with python
can be set up in a major shared host
For PDFs, I recommend PDFminer.
Try the docx module (I have not used it myself)
I am not aware of any pure python module that can read .doc files.
There are command-line tools to extract text from .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. Available on Windows via Cygwin.
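Shelling out to these tools might look like the sketch below. It assumes that antiword or catdoc is on PATH and that running the tool with the file name as its only argument prints the extracted text to standard output. The function names are my own, and the output encoding depends on how the tool is configured, so the UTF-8 decoding here is a guess.

```python
import shutil
import subprocess


def find_doc_tool(candidates=("antiword", "catdoc"), which=shutil.which):
    """Return the path of the first installed tool, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def doc_to_text(doc_path):
    """Extract text from a .doc file using whichever tool is available."""
    tool = find_doc_tool()
    if tool is None:
        raise RuntimeError("neither antiword nor catdoc is installed")
    result = subprocess.run([tool, doc_path], capture_output=True, check=True)
    # ASSUMPTION: the tool emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.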
Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and execute using subprocess.
If you can use OpenOffice on the server side, then you can use unoconv, which converts between any document formats supported by OpenOffice.
One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it doesn't require any external tools except for network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
Textract wraps the usual extraction tool for each kind of file behind a single interface.
https://github.com/deanmalmgren/textract
Is it somehow possible to extract .cab files in python?
Not strictly answering what you asked, but if you are running on a windows platform you could spawn a process to do it for you.
Taken from Wikipedia:
Microsoft Windows provides two command-line tools for creation and extraction of CAB files. They are MAKECAB.EXE (included within Windows packages such as 'ie501sp2.exe' and 'orktools.msi'; also available from the SDK, see below) and EXTRACT.EXE (included on the installation CD), respectively. Windows XP also provides the EXPAND.EXE command.
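Spawning EXPAND.EXE from Python could be sketched as follows. This is Windows-only and assumes expand is on PATH; the "-F:*" argument asks it to extract every file in the cabinet. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def expand_command(cab_path, dest_dir):
    """Build the EXPAND.EXE command line that extracts all files."""
    return ["expand", str(cab_path), "-F:*", str(dest_dir)]


def extract_cab(cab_path, dest_dir):
    """Extract a .cab file into dest_dir (Windows only)."""
    # Create the destination folder up front so the tool has somewhere to write.
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(expand_command(cab_path, dest_dir), check=True)
```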
I had the same problem last week so I implemented this in python. Comments, additions and especially pull requests welcome: https://github.com/hughsie/python-cabarchive
Oddly, the msilib can only create or append to .CAB files, but not extract them. :(
However, the hachoir parser module can apparently read & edit Cabinets. (I have not used it, though, so I couldn't tell you how fitting it is or not!)
I'm interested in using Python to hack on the data in Flash swf files. There is good documentation available on the format of swf files, and I am considering writing my own Python lib to parse that data out using the standard Python struct lib.
Does anybody know of a Python project that already does this? I would also be interested in any available solutions that use Perl, Ruby, Haskell, etc.
Well, unless you're doing it for fun (in which case, go for it!), why not use Ming? It supposedly has python wrappers...
I found another option in SWF Tools. They provide a Python wrapper that supports generating SWF files in Python.
I'm not sure if either SWF Tools or Ming actually supports parsing in and modifying an existing swf file, however. Both seem geared more towards generating swf files from scratch.
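If you do go the struct route mentioned in the question, the file header is a reasonable place to start. The sketch below follows the published SWF header layout (signature, version, file length, then a bit-packed frame rectangle, frame rate and frame count). It handles uncompressed (FWS) and zlib-compressed (CWS) files only, and the function name and dictionary keys are my own.

```python
import struct
import zlib


def read_swf_header(data: bytes) -> dict:
    """Parse the SWF header; tags after the header are not touched."""
    if len(data) < 8:
        raise ValueError("too short to be a SWF file")
    signature, version, file_length = struct.unpack_from("<3sBI", data, 0)
    if signature == b"FWS":
        body = data[8:]
    elif signature == b"CWS":
        body = zlib.decompress(data[8:])  # everything after byte 8 is zlib
    else:
        raise ValueError(f"unsupported signature {signature!r}")

    # Frame size is a RECT: a 5-bit field width followed by four signed
    # values of that width (xmin, xmax, ymin, ymax), measured in twips.
    nbits = body[0] >> 3
    rect_bytes = (5 + 4 * nbits + 7) // 8
    bits = int.from_bytes(body[:rect_bytes], "big")
    total = rect_bytes * 8

    def field(index):
        shift = total - 5 - (index + 1) * nbits
        raw = (bits >> shift) & ((1 << nbits) - 1)
        if nbits and raw >> (nbits - 1):  # sign-extend negative values
            raw -= 1 << nbits
        return raw

    xmin, xmax, ymin, ymax = (field(i) for i in range(4))
    # Frame rate is 8.8 fixed point; frame count is a plain uint16.
    rate_raw, frame_count = struct.unpack_from("<HH", body, rect_bytes)
    return {
        "compressed": signature == b"CWS",
        "version": version,
        "file_length": file_length,
        "width_px": (xmax - xmin) / 20,   # 20 twips per pixel
        "height_px": (ymax - ymin) / 20,
        "frame_rate": rate_raw / 256,
        "frame_count": frame_count,
    }
```

The tag stream starts immediately after the frame count, so a full parser would continue reading tag headers from that offset in the decompressed body.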