Decoding KeyNote IWA protobuf data with Python

Good afternoon,
I am looking for a bit of insight into working with KeyNote files (~2017 ver 8.x).
My objective is fairly basic: I just want to extract the text and images from about 3000 KeyNote files. I am working in Python 2.7 due to the age of many of the tools, but I would like to upgrade to 3.x eventually. Despite a lot of reading and experimenting, I seem to have hit a wall extracting messages from the IWA objects.
I have been experimenting with various approaches and have also been trying to deconstruct the IWA files by hand using the protobuf encoding documentation. However, something just does not add up. Testing with messages created using the Protobuf sample code, I can deconstruct them 100%, but .IWA blocks from KeyNote files end up with invalid wire types, repeated field numbers, or field sizes that don't make sense (e.g. larger than the size of the IWA block).
What I think I know:
1/ The .key files are a grouping of objects that are zipped and can be unzipped using a generic module like zipfile.
Once unzipped, the key file can be separated, giving access to the Index branch and the constituent IWA objects.
2/ The IWA files have a 4-byte little-endian header, and the rest should follow the Google protobuf encoding.
3/ The protobuf encoding does hold for some aspects of the IWA files, e.g. recognized blocks of text have the correct tags. However, other parts of the IWA do not seem to follow the rules, resulting either in invalid wire-type codes (e.g. wire-type = 6) or in field numbers that are zero or reused. (A rough sketch of the reader I have been experimenting with is below.)
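This is roughly what I have been trying, just to unzip the .key and dump each IWA's 4-byte header from point 2 (the file name is a placeholder):
import struct
import zipfile

archive = zipfile.ZipFile("presentation.key")  # placeholder name
for name in archive.namelist():
    if not name.endswith(".iwa"):
        continue
    data = archive.read(name)
    chunk_type = ord(data[0:1])
    # three length bytes, padded to four so struct can read a little-endian int
    length = struct.unpack("<I", data[1:4] + b"\x00")[0]
    print("%s  type=%d  length=%d  file size=%d" % (name, chunk_type, length, len(data)))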
What I would appreciate is if:
A/ Someone could confirm that the KeyNote encoding does comply with the Google protobuf encoding, or point me at a valid encoding scheme that I can use.
B/ Someone could clarify whether the IWA objects are individually compressed in addition to the compression applied to the whole .key file. The documentation is unclear, and my attempts to further decompress the IWA objects were not successful.
C/ Someone could direct me to a functional Python library that can extract data from KeyNote files.
As much as I am having fun playing with file deconstruction at the byte and bit level, I still have an objective to achieve :-)
Any insights gratefully accepted.
Thank you,
Rusty

I know this is a relatively old question, but I came across it and thought I would offer up some information.
The page
https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md#iwa
seems to have a lot of info on the format. In particular, it seems (from what I gather from that page) that the IWA data does not follow the standard ProtoBuf encoding exactly, which is probably the cause of your invalid wire types and nonsensical field lengths.
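If I am reading that page correctly, each .iwa is a sequence of chunks, each with the 4-byte header you describe (a type byte plus a 24-bit little-endian length), and the chunk payload is Snappy-compressed before you get to any Protobuf bytes, using a framing that differs slightly from standard Snappy. A rough, untested sketch of what that implies (assuming the python-snappy package) would be:
import struct
import snappy  # pip install python-snappy

def iwa_payload(data):
    # Concatenate the decompressed payloads of every chunk in an IWA blob.
    out = b""
    pos = 0
    while pos < len(data):
        length = struct.unpack("<I", data[pos + 1:pos + 4] + b"\x00")[0]
        chunk = data[pos + 4:pos + 4 + length]
        out += snappy.uncompress(chunk)  # may need tweaking for the non-standard framing
        pos += 4 + length
    return out

def read_varint(buf, pos):
    # Decode a Protobuf base-128 varint starting at pos; returns (value, new_pos).
    result, shift = 0, 0
    while True:
        byte = ord(buf[pos:pos + 1])
        result |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:
            return result, pos
        shift += 7
If that page is right, the decompressed stream then contains varint-length-prefixed archive messages, at which point the wire types and field numbers should start making sense again.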

Related

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will include watermarks, for example, later).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and a valid file at the end of it, I want to avoid complex libraries that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2), especially when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it, but I'm not overly happy with this as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and generating something only to convert it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
Borrowed from ask.yahoo:
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
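To make the offset bookkeeping concrete, here is a rough sketch that hand-builds a one-page "Hello, world" PDF and fills in the xref offsets and the startxref number itself; treat it as an illustration of the layout described above (no compression, built-in Helvetica only), not a proper generator:
def build_minimal_pdf(text="Hello, world!"):
    content = "BT /F1 24 Tf 72 720 Td (%s) Tj ET" % text
    objects = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        ("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
         "/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>"),
        "<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = "%PDF-1.4\n"
    offsets = []                          # byte offset of each "n 0 obj" line
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += "%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_start = len(out)                 # the number that follows "startxref"
    out += "xref\n0 %d\n" % (len(objects) + 1)
    out += "0000000000 65535 f \n"        # the required free-list entry
    for off in offsets:
        out += "%010d 00000 n \n" % off   # each xref entry is exactly 20 bytes
    out += "trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += "startxref\n%d\n%%%%EOF\n" % xref_start
    return out

with open("hello.pdf", "wb") as fh:       # placeholder output path
    fh.write(build_minimal_pdf().encode("latin-1"))
The only fiddly part is exactly what was said above: the xref entries and the startxref value are byte offsets from the start of the file, which is why the file is assembled as a string and measured as it grows.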
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello-world PDF from scratch.
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
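For a sense of scale, the ReportLab equivalent of a one-page "hello world" is only a few lines (the output path is a placeholder):
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)  # placeholder output path
c.drawString(72, 720, "Hello, world!")
c.showPage()
c.save()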
I recommend you use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.

Can't Read Encoded Text in Visual FoxPro DBF Files

I recently acquired a ton of data stored in Visual FoxPro 9.0 databases. The text I need is in Cyrillic (Russian), but of the 1000 .dbf files (complete with .fpt and .cdx files), only 4 or 5 return readable text. The rest (usually in the form of memos) return something like this:
??9Y?u?
yL??x??itZ?????zv?|7?g?̚?繠X6?~u?ꢴe}
?aL1? Ş6U?|wL(Wz???8???7?#R?
.FAc?TY?H???#f U???K???F&?w3A??hEڅԦX?MiOK?,?AZ&GtT??u??r:?q???%,NCGo0??H?5d??]?????O{??
z|??\??pq?ݑ?,??om???K*???lb?5?D?J+z!??
?G>j=???N ?H?jѺAs`c?HK\i
??9a*q??
For the life of me, I can't figure out how this is encoded. I have tried all kinds of online decoders, opened up the .dbfs in many database programs, and used Python to open and manipulate them. All of them return messiness similar to the above, but never readable Russian.
Note: I know that these databases are not corrupt, because they came accompanied by enterprise software that can open, query and read them successfully. However, that software will not export the data, so I am left working directly with the .dbfs.
Happy to share an example .dbf if it would help get to the bottom of this.
If it is a FoxPro database, I would expect the Russian there to be encoded in some pre-Unicode encoding for Russian, as was usual for most Eastern European languages in the old days.
For example: Windows-1251 or ISO 8859-5.
'?' characters don't convey much. Try looking at the contents of the memo fields as hex, and see whether what you're seeing looks anything like text in any encoding. (Apologies if you've already tried this using Python.) Of course, if it is actually encrypted, you may be out of luck unless you can find out the key and method.
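A quick-and-dirty way to do that from Python, if you haven't already (the file name, offsets and codec list are just starting guesses; as far as I remember, .fpt memo files have a 512-byte header and each memo block starts with an 8-byte header):
with open("example.fpt", "rb") as fh:  # placeholder file name
    raw = fh.read(4096)

# hex dump of a slice just past the header
print(" ".join("%02x" % b for b in bytearray(raw[512:576])))

# then try a few likely Cyrillic codecs on a sample of the memo data
sample = raw[520:1000]
for enc in ("cp1251", "cp866", "koi8-r", "iso8859-5"):
    print("%s -> %r" % (enc, sample.decode(enc, "replace")[:60]))
If one of those suddenly reads as Russian, you have your answer; if every codec gives equally random glyphs, encryption looks more likely.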
There are two possibilities:
the encoding has not been correctly stored in the dbf file
the dbf file has been encrypted
If it's been encrypted I can't help you. If it's a matter of finding the correct encoding, my dbf package may be of use. Feel free to send me a sample dbf file if you get stuck.

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can be opened in Excel and contains the information from the feature class's table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
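To give you a head start, here is a rough cut of such a dissector for the polyline case, based on my reading of the ESRI shapefile technical description (the file name is a placeholder; error handling, null shapes and other shape types are simply skipped):
import struct

def read_polyline_shp(path):
    with open(path, "rb") as f:
        header = f.read(100)                                   # fixed 100-byte main header
        file_len = struct.unpack(">i", header[24:28])[0] * 2   # length is stored in 16-bit words
        shapes = []
        while f.tell() < file_len:
            rec_num, content_len = struct.unpack(">ii", f.read(8))
            shape_type = struct.unpack("<i", f.read(4))[0]
            if shape_type != 3:                                # 3 = PolyLine
                f.read(content_len * 2 - 4)                    # skip anything else
                continue
            f.read(32)                                         # bounding box: 4 doubles
            num_parts, num_points = struct.unpack("<ii", f.read(8))
            parts = struct.unpack("<%di" % num_parts, f.read(4 * num_parts))
            coords = struct.unpack("<%dd" % (2 * num_points), f.read(16 * num_points))
            points = list(zip(coords[0::2], coords[1::2]))
            shapes.append({"record": rec_num, "parts": parts, "points": points})
        return shapes

for shape in read_polyline_shp("roads.shp"):                   # placeholder file name
    print("record %d: %d vertices" % (shape["record"], len(shape["points"])))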
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure-Python shapefile library. I've been using it for a couple of months now and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
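For comparison, getting polyline vertices with pyshp is only a few lines (the file name is a placeholder):
import shapefile  # pyshp

sf = shapefile.Reader("roads.shp")  # placeholder path
for shape in sf.shapes():
    print("type %d: %d points, first two: %s" % (shape.shapeType, len(shape.points), shape.points[:2]))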
It might be a long shot, but you could check out ctypes and maybe use the .dll file that came with a program (if one even exists, lol) that can read that type of file. In my experience, things get weird when you start digging around in .dlls.

HTML to RTF string using Python

I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this job? I get HTML content dynamically in my project and need it rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and then trying to use PyRTF for the conversion to RTF format. Is there a better way this can be done? Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters I've found that are worth a look. If you're using a Mac, the textutil command-line program is actually one of the better and simpler tools I've seen:
textutil -convert rtf filename.html -output filename.rtf
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
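Driving it headless from Python is straightforward, something like this (assuming soffice is on your PATH; the file names are placeholders):
import subprocess

subprocess.check_call([
    "soffice", "--headless",
    "--convert-to", "rtf",
    "--outdir", "converted",
    "input.html",
])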
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
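The Beautiful Soup half of that looks roughly like the sketch below; the PyRTF side is left as placeholder comments, since the objects you build depend on your document (this assumes bs4, the current incarnation of Beautiful Soup):
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>First paragraph</p><p>Second paragraph</p>"
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(["h1", "p", "table"]):
    text = tag.get_text()
    if tag.name == "h1":
        pass  # build a PyRTF heading paragraph from text here
    elif tag.name == "p":
        pass  # build a PyRTF body paragraph from text here
    else:
        pass  # walk tag.find_all("tr") / find_all("td") and build a PyRTF table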
Hope this helps.

Any good pdf417 Barcode libraries for Python?

I'm looking for a good python module to generate pdf417 barcodes. Has anyone used one they liked?
Ideally I would like one with as few dependencies as possible, and one that runs on both linux and MacOSX.
We recently had to approach this problem as well, and being a Python shop we wanted a Python solution. It became clear that elaphe was the project with the potential to actually produce PDF417 barcodes.
However, what we found was that it errors out by today's standards, so we set out to fix the library. It turns out elaphe generates an outdated form of .eps PostScript that can't be interpreted by Ghostscript, and this is where the barcode generation fails.
Fortunately, elaphe uses a common library behind the scenes called Barcode Writer in Pure PostScript: http://bwipp.terryburton.co.uk
This common backend library has many projects in multiple languages using it to generate barcodes. The fix for us, specifically, was to fork elaphe and correct its .eps file generation.
To determine what is broken in the .eps, look at this other site, built on postscriptbarcode, which lets you generate a PDF417 barcode online (as well as other formats): http://www.terryburton.co.uk/barcodewriter/generator/
Once you generate a PDF417 barcode, it gives you the option to download the .png, the .jpg and, yes, the .eps file!
Using this .eps file, you can pipe it through Ghostscript and tweak the parameters to get the exact PDF417 barcode you are looking for. Then take that result, integrate it into the elaphe library, and actually get a pull request in on that thing.
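In case it helps, the Ghostscript step can be driven from Python with something like this (the file names and resolution are placeholders, and gs needs to be on the PATH):
import subprocess

subprocess.check_call([
    "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
    "-sDEVICE=png16m", "-r300",
    "-sOutputFile=barcode.png",
    "barcode.eps",
])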
It seems like a bit of work, but nothing that can't be knocked out in an afternoon. Ideally, the elaphe library itself would be brought back into shape so these can be generated without everyone having to make this fix.
Please note that the performance of this approach for us is a few seconds per barcode, because it creates a ~2000-line .eps file and pipes it to Ghostscript, which generates another image file that we send back as the final barcode result. This is not as performant as Code 128 with ReportLab.
There is perhaps room for optimization: is Pillow faster than PIL in any way? Do we need all the parts of the .eps file to generate a PDF417 barcode? Are there other ways to optimize?
Anyway, great question Ken and I hope you find this to be a great answer.
I guess the issue in elaphe reported by Matteius in 2013 has been fixed, since the issues and commit logs show updates on the pdf417 topic since then.
Anyway, there are now a few other options (I got the list with either pip search elaphe or pip search pdf417):
elaphe ;
elaphe3 (a fork of elaphe tested against Python 3) ;
candybar (no documentation? also available as a webservice) ;
pdf417gen ;
treepoem (about the name: barcode -> bark ode -> tree poem =D) — edit: I didn't dig into the issue, but as of today generation of PDF417 seems broken.
All but pdf417gen support several types of barcodes.
Note that the documentation of bwipp (on which elaphe and treepoem are based) only mentions 5 levels of error correction (1 to 5), while pdf417gen claims to support 9 security levels (0 to 8).
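For example, generating a code with pdf417gen boils down to something like this, as far as I understand its API (the data and options are placeholders):
from pdf417gen import encode, render_image

codes = encode("Hello, PDF417!", columns=6, security_level=2)
image = render_image(codes, scale=3, ratio=3)  # returns a PIL/Pillow image
image.save("barcode.png")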
Reportlab does have an extension called rlbarcode, but it does not include support for PDF417 codes. I do not know of any other Reportlab extension that includes support for PDF417 barcodes.
Anyway, if you are interested in generation of pdf417 codes from python, you may be interested in this project: elaphe.
I have still not tested it (in fact, I need to generate PDF417 from Python myself, which is how I found this thread as well as the elaphe project page); I am going to download the elaphe tools in order to test it right now.
