Extracting text from text documents with Poedit - python

I am making a quiz app which reads data from text files. The app works fine, but I now want to translate it into English (from my native language). I can do that for strings defined in source files (.py), such as text on buttons, but I have trouble extracting the text that needs translating from the text documents that hold all my questions and possible answers.
I am using module gettext with Python and am using operator _ or _( to indicate translatable strings (which I have set in Poedit under Properties - Sources Keywords).
I have also set paths of my translatable sources to . (all files in that directory) and even tried setting those .txt files specifically for extracting.
My text file looks like this (one line of one file):
_(Koliko je 2/0?);_(0):_(ni definirano):_(2);_(ni definirano)
I tried to find out which document types Poedit extracts text from, but did not find anything other than "from source". Should it be able to extract from .txt files or not? If not, how should I name them?
As I said, it does extract strings from my .py files, so it is working otherwise.

Poedit can't magically know the syntax of your homegrown file format, so simply adding .txt files can't possibly do anything. You'll have to write a custom extractor (see how xgettext works for reference) or switch to some standard syntax:
Be sufficiently similar to a supported programming language, such as C (where as luck would have it, ; and : are both valid syntax elements, although using e.g. , instead of : would be safer):
_("Koliko je 2/0?");_("0"):_("ni definirano"):_("2");_("ni definirano")
Use an XML-based format, for which xgettext supports extraction rules described with ITS.
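If you go the custom-extractor route instead, a minimal sketch in Python might look like the block below. It assumes the `_(...)` markers from the question and that no marked string contains a closing parenthesis; `extract_msgids` and `to_pot` are hypothetical helper names, not part of gettext or Poedit. The output is only the body of a .pot template, so a real file would still need the usual header entry.

```python
import re

# Matches the _( ... ) markers used in the question's quiz files.
# Assumption: a marked string never contains a closing parenthesis.
MARKER = re.compile(r"_\(([^)]*)\)")


def extract_msgids(text):
    """Return the marked strings in order of first appearance, without duplicates."""
    seen = []
    for match in MARKER.finditer(text):
        msgid = match.group(1)
        if msgid and msgid not in seen:
            seen.append(msgid)
    return seen


def to_pot(msgids):
    """Render msgids as entries for the body of a gettext template (.pot)."""
    entries = []
    for msgid in msgids:
        # Escape backslashes and double quotes as the PO format requires.
        escaped = msgid.replace("\\", "\\\\").replace('"', '\\"')
        entries.append('msgid "%s"\nmsgstr ""\n' % escaped)
    return "\n".join(entries)


line = "_(Koliko je 2/0?);_(0):_(ni definirano):_(2);_(ni definirano)"
print(to_pot(extract_msgids(line)))
```

The app would then have to strip the markers itself and pass each string through gettext when it loads the file.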

I was confronted with the same problem when trying to extract strings to be translated from an options.txt file in a WordPress plugin. The only solution I found was to copy that options.txt file to options.php, which PoEdit was able to search for strings. When the translation operation is finished, the options.txt file can then be deleted.

Related

Creating a kindle dictionary

I am trying to create a Kindle dictionary that can be used for offline lookup. I already have the words and their inflections, but turning this into a working dictionary is difficult.
There is some documentation about this provided by Amazon. It basically says that you should:
Create an XHTML file with their special markup specifying all inflections etc.
Turn it into an epub
Open it with Kindle Previewer
Export it with Kindle Previewer to MOBI
So I created a large XHTML file (23 MB or so) according to the Amazon specifications and opened it in Kindle Previewer, and it looked fine. However, Kindle Previewer does not let you export XHTML files to MOBI. They want you to create an intermediate epub file.
I tried using Pandoc to do the conversion, which did not work because it stripped out all the specific HTML tags and left only paragraphs. Then I tried using calibre. The normal XHTML -> epub conversion failed because the XHTML file was too large, according to an error message. Calibre suggests turning on the "heuristic mode" if you run into this error, which I tried, but it had not finished after hours of runtime.
Then I attempted to create the epub file myself, using a sample file taken from this tutorial. I discovered that this is not trivial, and a check using epubcheck revealed many hard-to-understand errors in my generated file. The generation of the epub file is also a bit complicated by the fact that you probably need to split the XHTML files into many smaller files, which should maybe be 250 kb in size, because e-readers tend to struggle with parsing larger files.
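For the splitting step, a rough sketch of the grouping logic could look like this. It assumes each dictionary entry is already available as its own XHTML string; `chunk_entries` is a hypothetical helper, and the 250 kB default is just the figure mentioned above, not a documented limit.

```python
def chunk_entries(entries, max_bytes=250_000):
    """Group entry strings into lists whose UTF-8 size does not exceed max_bytes.

    An entry that is larger than max_bytes on its own still gets its own chunk.
    """
    chunks, current, size = [], [], 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        # Start a new chunk when adding this entry would exceed the budget.
        if current and size + entry_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(entry)
        size += entry_size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be wrapped in the XHTML header and footer and written to its own file.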
So I thought there might be an easier way to do this, or a library that helps with it. Maybe it would even be a good idea to output the words and inflections in some other, simpler dictionary format and then convert that to MOBI using an existing library, leaving out the XHTML generation completely. Currently I am using Python, but I'd also use other languages if necessary. What could I try?
Edit: To add to the things I have tried: there is an apparently closed source script here that unfortunately doesn't support inflections, so does not work. And there are instructions here that advise converting the file to PRC using Mobipocket Creator and then opening it with Kindle Previewer. The problem with this approach is that Kindle Previewer throws the error:
Kindle Previewer does not support this file, which has either been created using an older version of KindleGen or a third party application. We recommend using EPUB or DOCX format directly for previewing and publishing your book on Kindle.
There are also more detailed instructions for Mobipocket Creator here, which tell you to move the generated .prc file directly onto the Kindle. I tried that, but it is not recognized as a dictionary.
I figured it out by myself. First I implemented a solution myself, then I found the pyglossary library (right now the code below only works with the version from Github and not from pip) and used it like this:
from pyglossary.glossary import Glossary

Glossary.init()
glos = Glossary()
defiFormat = "h"  # definitions are given as HTML
# get_base_forms, get_inflections and get_definition are my own helpers (not shown)
base_forms = get_base_forms()
for canonical_form in base_forms:
    inflections = get_inflections(canonical_form)
    definitions = get_definition(canonical_form)
    # Wrap every definition in its own paragraph
    definitionhtml = ""
    for definition in definitions:
        definitionhtml += "<p>" + definition + "</p>"
    # The headword comes first, followed by its inflected forms
    all_forms = [canonical_form]
    all_forms.extend(inflections)
    glos.addEntryObj(glos.newEntry(all_forms, definitionhtml, defiFormat))
glos.setInfo("title", "Russian-English Dictionary")
glos.setInfo("author", "Vuizur")
glos.sourceLangName = "Russian"
glos.targetLangName = "English"
glos.write("test.mobi", format="Mobi", keep=True, kindlegen_path="path/to/kindlegen.exe")

How to use doxygen in a project containing both C++ and Python (multi-programming language project)

I am creating documentation for my project, which has code written in both C++ and Python. I can generate documentation with doxygen for my C++ code, but I find no way to do the same for my Python code once I have the former files. And if I need both sets of documentation to appear under the same index.html, i.e. merged into one uniform doxygen output, how can I do this? What is the convention that everyone uses?
First off, thank you albert and shawnhcorey for your quick responses.
I tried what you suggested, and made the following modifications to parse documentation comments from files written in both languages:
In Doxyfile
INPUT = include/my_package/ scripts/my_package/
Both directories are separated by space.
Then,
FILE_PATTERNS = *.h *.py
The two wildcards patterns are also separated by a space.
Go back to the directory where you placed your Doxyfile, then run
doxygen
And off you go!
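For the Python side, doxygen picks up comment blocks that start with `##`, and its special commands such as `@param` work inside them. A small sketch of what a documented file might look like (the file name `scale.py` and the function are made up for illustration):

```python
## @file scale.py
#  @brief Hypothetical example module showing Doxygen-style comments.


## Multiply a value by a factor.
#  @param value  Number to scale.
#  @param factor Multiplier applied to value.
#  @return The scaled number.
def scale(value, factor=2):
    return value * factor
```

With the INPUT and FILE_PATTERNS settings above, a file like this should end up in the same generated index as the C++ headers.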

How to convert XML Word Documents to DOCX?

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document

folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc=Document(os.path.join(folderPath,xmlFileName))
    docxFileName=xmlFileName.replace('.xml','.docx')
    currentDoc.save(os.path.join(folderPath,docxFileName))
    currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files; it only reads and writes them.
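To find out which kind of file you actually have, a quick check along these lines might help. This is only a sketch: `classify_word_file` is a hypothetical helper, and it assumes the usual root elements, namely `w:wordDocument` in the Word 2003 XML namespace and `pkg:package` for the newer "flat OPC" XML format.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces assumed for the two XML flavours Word can save.
WORD_2003_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
FLAT_OPC_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"


def root_tag_of(path):
    """Return the namespaced tag of the root element without parsing the whole file."""
    with open(path, "rb") as handle:
        for _event, element in ET.iterparse(handle, events=("start",)):
            return element.tag
    return None


def classify_word_file(path):
    """Return a rough label for the kind of Word file stored at path."""
    if zipfile.is_zipfile(path):
        return "zip package (.docx-style)"
    try:
        tag = root_tag_of(path)
    except ET.ParseError:
        return "unknown"
    if tag == "{%s}wordDocument" % WORD_2003_NS:
        return "Word 2003 XML"
    if tag == "{%s}package" % FLAT_OPC_NS:
        return "Flat OPC XML"
    return "unknown"
```

Only the first label describes something python-docx is designed to open; the other two need converting first.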

Create PDF from Word template in Python

I need to generate reports in python. The report needs to contain a header and footer on each page, some text that won't change, some text that's dynamic, and some charts.
I've created a template using Word, and I'm looking for a way of replacing placeholders such as [+my_placeholder+] with text content/charts/whatever.
Is there anything that can let me use Word documents (or something Word can write) as a template to create a PDF in Python? Since I've already created sample reports in Word I want to reuse what I've got instead of having to recreate them using ReportLab or HTML (I know about ReportLab, pyPDF and also xhtml2pdf).
About the first part, editing Word templates, you can have a look at this question: Reading/Writing MS Word files in Python
There is a package for editing Word files, called python-docx.
For converting Word files into PDF documents, there are a variety of tools, including command-line tools. So you could use Python to edit your document and then call a converter tool from your Python code to create the PDF.
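As a rough sketch of those two steps: the first function below fills `[+name+]` placeholders in a plain string (with python-docx you would apply it to the text of each paragraph or run), and the second builds a command line for LibreOffice, which is just one converter I am assuming here; the answer does not name a specific tool. Both function names are made up for illustration.

```python
import re

# Matches placeholders of the form [+my_placeholder+] from the question.
PLACEHOLDER = re.compile(r"\[\+(\w+)\+\]")


def fill_placeholders(text, values):
    """Replace [+name+] markers with values[name]; unknown names are left as-is."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return PLACEHOLDER.sub(substitute, text)


def pdf_command(docx_path, out_dir):
    """Build a LibreOffice command line that converts docx_path to PDF.

    Assumption: LibreOffice is installed and `soffice` is on the PATH.
    """
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, docx_path]


print(fill_placeholders("Report for [+customer+]", {"customer": "ACME"}))
```

The command list can be passed to `subprocess.run` once the filled-in document has been saved.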

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects, which were done in PHP, we used PHPExcel, PHPWord and tcpdf for PDF.
What Python libraries would you recommend for creating these kinds of files? (For Excel and Word it's important to use the Open XML file formats, xlsx and docx.)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly developed tools for manipulating Word documents. I've found the Java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP, I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc, a Haskell library which supports a much wider range of output and input formats than Docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx or xlsx file to zip and see the magic!
Generating Open XML files is as simple as generating a couple of XML files (you can use templates) and zipping them.
The simplest way to generate a PDF is to generate HTML (with CSS and images) and convert it using the wkhtmltopdf tool.
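To illustrate the "XML files plus zip" idea, here is a minimal sketch that writes a bare-bones .docx using only the standard library. `write_minimal_docx` is a made-up helper name, and the three parts below are, as far as I know, the smallest set a word processor expects; I have not checked the result against every application.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points from the package root to the main document part.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)


def write_minimal_docx(path, paragraphs):
    """Write a .docx at path containing one plain paragraph per string."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", document)
```

Anything beyond plain paragraphs (styles, tables, images) needs more parts and relationships, which is where a library or a template starts to pay off.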
