Documenting Python to MS Word readable format

I'm trying to produce a post-development "interface control document" for a not particularly well documented, smallish Python codebase. To fit into an existing document schema it needs to be delivered as a flat Word document, and it needs to be extracted mostly from the code.
So far my only candidate is doxygen's RTF output, and doxygen is only so-so with Python, and so-so with RTF. So: can anyone suggest a better solution?

The Python community is a strong supporter of the reStructuredText (reST) format. I've grown to love it after having to deal with massive Word documents. Sphinx is a great documentation tool that can be used to document your projects, and it powers http://readthedocs.org, although it may be a little much to set up for a small project.
So I would recommend writing in reST and seeing how it converts to Word via pandoc.
And really, pandoc supports converting the following formats to Word and many other targets, so you have a lot of formats you can write documentation in if you want to avoid writing in Word (see the sketch after the list):
markdown
reStructuredText (reST)
textile
HTML
DocBook
LaTeX
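For the actual conversion step, here is a minimal sketch of driving pandoc from Python; the file names are made up for the example, and pandoc must be installed and on your PATH:

    # Hedged sketch: convert a reST source file to .docx by shelling out to pandoc.
    # "icd.rst" and "icd.docx" are hypothetical names.
    import subprocess

    subprocess.run(
        ["pandoc", "--from=rst", "--to=docx", "--output=icd.docx", "icd.rst"],
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )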

Related

Use PLY to read YACC file (file.y)

PLY has a somewhat complex system for defining tokens, lexemes, grammars, etc., but I would like to create a parse tree using Ruby's already existing grammar file, parse.y.
Is there a way to read the file parse.y and create a parse tree for a Ruby program in PLY?
Short answer: no.
That file contains 13,479 lines; the actual grammar is 769 lines, including 46 mid-rule actions (MRAs), so there are close to 13,000 lines of C code which would have to be rewritten in Python in order to reproduce the functionality. That functionality includes the lexical analyser, which is about a thousand lines of C code plus supporting functions. (If you think Ply's method of defining lexical analysis is complicated, wait until you try to reproduce a hand-written analyser written in C. :-) )
I extracted the grammar from that file using bison (although I had to edit the file a bit so that bison wouldn't choke on it; I don't know where the Makefile is in that source repository, but I presume it includes a preprocessing step that turns parse.y into a valid bison grammar file). So you could do that, too, and use the result as the basis of a Ply grammar. You might be able to automate the construction of the grammar, but my guess is that you would still have to do quite a lot of work by hand, and if you don't have at least some experience writing parsers, that work is not going to be simple. (It may be educational, though.)
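To give a sense of scale, here is roughly what the Ply side looks like. This is a hedged toy sketch (assuming ply is installed, e.g. via pip), not anything derived from parse.y:

    # Toy Ply grammar: a left-recursive rule for sums, building a parse tree
    # in the rule actions, much as bison's mid-rule actions build semantic values.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ('NUMBER', 'PLUS')
    t_PLUS = r'\+'
    t_ignore = ' \t'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)  # skip characters the lexer doesn't recognize

    def p_expr_plus(p):
        'expr : expr PLUS NUMBER'
        p[0] = ('+', p[1], p[3])  # build a tree node, like a bison action

    def p_expr_number(p):
        'expr : NUMBER'
        p[0] = p[1]

    def p_error(p):
        raise SyntaxError(p)

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse('1+2+3', lexer=lexer))  # ('+', ('+', 1, 2), 3)

Now multiply this by 769 grammar lines plus a hand-written C lexer, and the size of the job becomes clear.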
Good luck with your project.

Generate a pdf with python

I'm trying to develop a small script that generates a complete new PDF file, mainly text and tables, as its result.
I'm searching for the best way to do it.
I've read about ReportLab, which seems pretty good. It has only one drawback as far as I can see: it is quite hard to write a template without the commercial version, and the code seems to be hard to maintain.
So I've searched for a more suitable approach and found xhtml2pdf, but this software is quite old and cannot generate tables that span two or more pages.
The last solution on my mind is to generate a TeX file with a template framework, and then call pdftex as a subprocess.
I would implement that last one and go with LaTeX. Would you do the same, or do you have better ideas?
I would suggest using the LaTeX approach. It is cross-platform, works in many different languages and is easy to maintain. Plus it's non-commercial!
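For example, here is a minimal sketch of the template-plus-subprocess route described in the question; the template and file names are made up, and pdflatex must be on your PATH:

    # Hedged sketch: fill a LaTeX template with Python, then run pdflatex on it.
    import subprocess
    from string import Template

    TEMPLATE = Template(r"""
    \documentclass{article}
    \begin{document}
    \section*{$title}
    \begin{tabular}{ll}
    $rows
    \end{tabular}
    \end{document}
    """)

    rows = r"foo & 1 \\ bar & 2 \\"
    with open("report.tex", "w") as f:
        f.write(TEMPLATE.substitute(title="Example report", rows=rows))

    # Run twice if you need cross-references resolved.
    subprocess.run(["pdflatex", "-interaction=nonstopmode", "report.tex"],
                   check=True)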
Pisa is an HTML/CSS to PDF converter. It's a great tool for developing PDFs from scratch using Python.
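A hedged sketch of that route, assuming the xhtml2pdf package (older releases shipped the same API under the name ho.pisa); the file name and HTML are made up:

    # Render an HTML string, including a table, straight to PDF via Pisa.
    from xhtml2pdf import pisa

    html = "<h1>Report</h1><table><tr><td>foo</td><td>1</td></tr></table>"
    with open("out.pdf", "wb") as f:
        status = pisa.CreatePDF(html, dest=f)
    print("failed" if status.err else "ok")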
If you just need to append PDF pages together or search through PDF data, then I'd suggest pyPdf; it is free, pretty well documented, and easy to use.
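For instance, appending PDFs page by page with the old pyPdf API looks roughly like this (the file names are made up; in later forks the same API lives on as PyPDF2):

    # Merge two PDFs by copying every page into a writer.
    from pyPdf import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    for name in ("part1.pdf", "part2.pdf"):
        reader = PdfFileReader(open(name, "rb"))
        for page_num in range(reader.getNumPages()):
            writer.addPage(reader.getPage(page_num))

    with open("merged.pdf", "wb") as out:
        writer.write(out)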
You may check the http://pypi.python.org/pypi/z3c.rml/ package as an implementation of ReportLab's RML.
"best way" means? What are you requirements? Some PDF requirements can be accomplished with "cheap" open-source generators or you may end up with some commercial PDF converter. Higher quality means higher price.

Is there an existing library or api I can use to separate words in character based languages?

I'm working on a little hobby Python project that involves creating dictionaries for various languages using large bodies of text written in that language. For most languages this is relatively straightforward because I can use the space delimiter between words to tokenize a paragraph into words for the dictionary, but for example, Chinese does not use a space character between words. How can I tokenize a paragraph of Chinese text into words?
My searching has found that this is a somewhat complex problem, so I'm wondering if there are off-the-shelf solutions for this in Python or elsewhere, via an API or any other language. This must be a common problem, because any search engine made for Asian languages would need to overcome this issue in order to provide relevant results.
I tried to search around using Google, but I'm not even sure what this type of tokenizing is called, so my results aren't finding anything. Maybe just a nudge in the right direction would help.
Language tokenization is a key aspect of Natural Language Processing (NLP). This is a huge topic for major corporations and universities and has been the subject of numerous PhD theses.
I just submitted an edit to your question to add the 'nlp' tag. I suggest you take a look at the "about" page for the 'nlp' tag. You'll find links to sites such as the Natural Language Tool Kit, which includes a Python-based tokenizer.
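(The usual name for what you're describing, by the way, is "word segmentation" - searching for "Chinese word segmentation" will turn up the literature.) As a starting point, NLTK's tokenizer handles space-delimited languages out of the box; this hedged sketch assumes NLTK and its 'punkt' model are installed, and note that Chinese needs a dedicated segmenter rather than this tokenizer:

    # NLTK tokenization sketch: fine for space-delimited languages.
    # Chinese requires a trained segmenter instead (dictionary/statistical).
    import nltk
    nltk.download('punkt', quiet=True)  # one-time tokenizer model download
    from nltk.tokenize import word_tokenize

    print(word_tokenize("This is a test sentence."))
    # ['This', 'is', 'a', 'test', 'sentence', '.']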
You can also search Google for terms like: "language tokenization" AND NLP.

Generating parser in Python language from JavaCC source?

I really do mean the question mark in the title, because I'm not exactly sure what to ask for. Let me explain the situation.
I'm not a computer science student and I never took a compilers course. Until now I used to think that compiler writers, or students who took a compilers course, must be outstanding, because they had to write the parser component of the compiler in whatever language the compiler was being written in. That's not an easy job, right?
I'm dealing with an information retrieval problem. My desired programming language is Python.
Parser Nature:
http://ir.iit.edu/~dagr/frDocs/fr940104.0.txt is the sample corpus. This file contains around 50 documents with some XML-style markup (you can see it at the link above). I need to note down some other values like <DOCNO> FR940104-2-00001 </DOCNO> and <PARENT> FR940104-2-00001 </PARENT>, and I only need to index the <TEXT> </TEXT> portion of each document, which contains some varying tags that I need to strip out, a lot of <!-- --> comments to be ignored, and some character entities like &hyph; and &space;. I don't know why the corpus contains things like this, when it's clearly neither meant to be rendered by a browser nor to be a proper XML document.
I thought of using a Python XML parser to extract the desired text. But after a little searching I found JavaCC parser source code (Parser.jj) for the same corpus I'm using here. A quick lookup of JavaCC, followed by "compiler-compiler", revealed that compiler writers aren't quite as heroic as I thought: they use a compiler-compiler to generate parser code in the desired language. Wikipedia says the input to a compiler-compiler is a grammar (usually in BNF). This is where I'm lost.
Is Parser.jj the grammar (the input to the compiler-compiler called JavaCC)? It's definitely not BNF. What is this grammar format called? Why does this grammar contain Java code? Isn't there any universal grammar language?
I want a Python parser for parsing the corpus. Is there any way I can translate Parser.jj to get a Python equivalent? If yes, what is it? If no, what are my other options?
By any chance, does anyone know what this corpus is? Where is its original source? I would like to see some description of it. It is distributed on the internet under the name frDocs.tar.gz.
Why do you call this "XML-style" markup? This looks like pretty standard, basic XML to me.
Try ElementTree or lxml. Instead of writing a parser, use one of the stable, well-hardened libraries that are already out there.
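Since the corpus is not quite well-formed XML (undefined entities like &hyph;, SGML-style comments), a light pre-clean may be needed before, or instead of, a strict XML parser. A hedged regex sketch, assuming TREC-style <DOC> wrappers and the tag names from the question:

    # Pull DOCNO, PARENT and the stripped TEXT body out of each <DOC> record.
    import re

    def parse_docs(raw):
        raw = re.sub(r'<!--.*?-->', '', raw, flags=re.S)          # drop comments
        raw = raw.replace('&hyph;', '-').replace('&space;', ' ')  # fake entities
        for block in re.findall(r'<DOC>(.*?)</DOC>', raw, flags=re.S):
            def field(tag):
                m = re.search(r'<{0}>(.*?)</{0}>'.format(tag), block, flags=re.S)
                return m.group(1).strip() if m else None
            yield {
                'docno':  field('DOCNO'),
                'parent': field('PARENT'),
                # strip any inline tags left inside <TEXT>
                'text':   re.sub(r'<[^>]+>', ' ', field('TEXT') or ''),
            }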
You can't build a parser - let alone a whole compiler - from a(n E)BNF grammar alone - it's just the grammar, i.e. the syntax (and some syntax, like Python's indentation-based block rules, can't be modeled in it at all), not the semantics. Either you use separate tools for these aspects, or you use a more advanced framework (like Boost::Spirit in C++ or Parsec in Haskell) that unifies both.
JavaCC (like yacc) is responsible for generating a parser, i.e. the subprogram that makes sense of the tokens read from the source code. For this, they mix an (E)BNF-like notation with code written in the language the resulting parser will be in (e.g. for building a parse tree) - in this case, Java. Of course it would be possible to make up another language for that - but since the existing languages can handle those tasks relatively well, it would be rather pointless. And since other parts of the compiler might be written by hand in the same language, it makes sense to leave the "I got ze tokens, what do I do wit them?" part to the person who will write those other parts ;)
I've never heard of "PythonCC", and Google hasn't either (well, there's a "pythoncc" project on Google Code, but its description just says "pythoncc is a program that tries to generate optimized machine Code for Python scripts." and there has been no commit since March). Do you mean one of the existing Python parsing libraries/tools? Either way, I don't think there's a way to automatically convert the JavaCC code to a Python equivalent - but the whole thing looks rather simple, so if you dive in and learn a bit about parsing via JavaCC and the Python library/tool of your choice, you might be able to translate it...

wiki/docbook/latex documentation template system

I'm searching for a documentation template system, or rather I will be creating one.
It should support the following features:
Create output in PDF and HTML
Support for large & complicated (LaTeX) formulas
References between documents
Bibliographies
Templates will be filled by a Python script
I've tried LaTeX with various TeX-to-HTML converters but I'm not satisfied with the results.
I've been using DocBook for a while, but I find that DocBook is not easy to write, and its support for formulas is not yet sufficient.
The main problem is that there will be users of this system who do not know LaTeX syntax or DocBook. For these users I've thought about an alternative: an editing option based on wiki syntax, converted by Python to LaTeX.
Let's sum up: I want HTML and PDF output from at least LaTeX and wiki input. DocBook could be used as an intermediate format.
Has anybody had a similar problem, or can anyone give me advice on which tools and file formats I should use?
We use Sphinx: https://www.sphinx-doc.org
It does almost all of that.
Your Python script, or your users, or whoever (I can't quite follow the question) can create content using reST markup, which is perhaps the easiest of markup languages. You run it through Sphinx and you get HTML and LaTeX.
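As a concrete starting point, a minimal Sphinx conf.py might look like this (the project name and layout are made up; the mathjax extension covers large LaTeX formulas in the HTML output):

    # Minimal Sphinx configuration sketch.
    # HTML build:  sphinx-build -b html . _build/html
    # PDF build:   sphinx-build -b latex . _build/latex  (then run pdflatex there)
    project = 'Example Docs'
    master_doc = 'index'                 # top-level document, index.rst
    extensions = ['sphinx.ext.mathjax']  # LaTeX math rendered via MathJax in HTML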
I created a LaTeX pre-processor and Python module that allows you to embed Python or SQL inside a LaTeX file. The Python and/or SQL is executed and the output is folded in.
With latex2html or latex2rtf you can then use the LaTeX code to produce HTML and RTF.
I've posted it for you at http://simson.net/pylatex/
Arbortext supports LaTeX natively. You can send the publishing engine or print composer LaTeX and it'll pass it through directly.
It supports a lot of other composition languages as well, and even gives you the opportunity to do page-layout manipulation like you'd see in InDesign (without the headache and overhead of ID).
I think that AsciiDoc is better targeted at what you are trying to achieve. It is a simple markup language that allows LaTeX formulas, and it generates DocBook documents from which you can then generate the readable HTML or LaTeX representation.
