Creating a kindle dictionary - python

I am trying to create a Kindle dictionary that can be used for offline lookup. I already have the words and their inflections, but turning this into a working dictionary is difficult.
There is some documentation about this provided by Amazon. It basically says that you should:
Create an XHTML file with their special markup specifying all inflections etc.
Turn it into an epub
Open it with Kindle Previewer
Export it with Kindle Previewer to MOBI
So I created a large XHTML file (23 MB or so) according to the Amazon specifications and opened it in Kindle Previewer, and it looked fine. However, Kindle Previewer does not let you export XHTML files to MOBI. They want you to create an intermediate epub file.
I tried using Pandoc to do the conversion, which did not work because it stripped out all the specific HTML tags and only left in paragraphs. Then I tried using calibre. The normal XHTML -> epub conversion failed because the XHTML file was too large, according to an error message. Calibre suggests to turn on the "heuristic mode" if you run into this error, which I tried, but which did not finish running after hours of runtime.
Then I attempted to create the epub file myself, using a sample file taken from this tutorial. I discovered that this is not trivial, and a check using epubcheck revealed many hard-to-understand errors in my generated file. The generation of the epub file is also a bit complicated by the fact that you probably need to split the XHTML files into many smaller files, which should maybe be 250 kb in size, because e-readers tend to struggle with parsing larger files.
So I thought there should maybe be an easier way to do this, or maybe a library that helps doing this. Maybe it would even be a good idea to output the words + inflections into some other easier dictionary format and then convert it to a MOBI using an existing library and leaving out the XHTML generation completely. Currently I am using Python, but I'd also use other languages if it is necessary. What could I try?
Edit: To add to the things I have tried: there is an apparently closed source script here that unfortunately doesn't support inflections, so does not work. And there are instructions here that advise converting the file to PRC using Mobipocket Creator and then opening it with Kindle Previewer. The problem with this approach is that Kindle Previewer throws the error:
Kindle Previewer does not support this file, which has either been created using an older version of KindleGen or a third party application. We recommend using EPUB or DOCX format directly for previewing and publishing your book on Kindle.
There are also more detailed instructions for Mobipocket Creator here, which tell you to directly move the generated .prc file onto the kindle. I tried that but it is not being recognized as a dictionary.

I figured it out by myself. First I implemented a solution myself, then I found the pyglossary library (right now the code below only works with the version from Github and not from pip) and used it like this:
from pyglossary.glossary import Glossary
Glossary.init()
glos = Glossary()
defiFormat = "h"
base_forms = get_base_forms()
for canonical_form in base_forms:
inflections = get_inflections(canonical_form)
definitions = get_definition(canonical_form)
definitionhtml = ""
for definition in definitions:
definitionhtml += "<p>" + gloss + "</p>"
all_forms = [canonical_form]
all_forms.extend(inflections)
glos.addEntryObj(glos.newEntry(all_forms, glosshtml, defiFormat))
glos.setInfo("title", "Russian-English Dictionary")
glos.setInfo("author", "Vuizur")
glos.sourceLangName = "Russian"
glos.targetLangName = "English"
glos.write("test.mobi", format="Mobi", keep=True, kindlegen_path="path/to/kindlegen.exe")

Related

How to convert fb2 file in to txt using Python

Hello I am struggling to convert hundreds of fb2 files to txt using Python. I find pyandoc and EbookLib but I didn't find in their functionality this option, or I didn't search carefully.
Can someone suggest me something relevant in my case ? Maybe free API, but I think there could be a library.
something relevant in my case
I did look for fb2 and FictionBook2 at PyPI and found 2 potentially useful to you: catpandoc and FB2. 1st does Cat multiple documents to the terminal. and support numerous file formats. 2nd is Python package for working with FictionBook2. For FB2 example is given how to create FB2 file, but not how to read. I do not know if it means documentation is poor or it does not have read support at all.
EDIT: After some research I found that FictionBook2 files are XML files. Example can be seen here. That being said I encourage you to first try existing FB2 tools and only if they fail to deliver desired result implement extraction by XML parsing.

Python 3 - Data mining from PDF

I'm working on a project that requires obtaining data from some PDF documents.
Currently I'm using Foxit toolkit (calling it from the script) to convert the document to txt and then I iterate through it. I'm pretty happy with it, but 100$ it's just something I can't afford for such a small project.
I've tested all the free converters that I could find (like xpdf, pdftotext) but they just don't cut it, they mess up the format in a way that i cant use the words to locate the data.
I've tried some Python modules like pdfminer but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF because I get them from a phone carrier.
I'm looking for a way of getting the data from the PDF or a converter that at least follow the newlines properly.
Update:
PyPDF2 is not grabbing any text whatsoever from the pdf document.
The PyPDF2 seems to be the best one available for Python3
It's well documented and the API is simple to use.
It also can work with encrypted files, retrieve metadata, merge documents, etc
A simple use case for extracting the text:
from PyPDF2 import PdfFileReader
with open("test.pdf",'rb') as f:
if f:
ipdf = PdfFileReader(f)
text = [p.extractText() for p in ipdf.pages]
I don't believe that there is a good free python pdf converter sadly, however pdf2html although it is not a python module, works extremely well and provides you with much more structured data(html) compared to a simple text file. And from there you can use python tools such as beautiful soup to scrape the html file.
link - http://coolwanglu.github.io/pdf2htmlEX/
Hope this helps.
Here is an example of pyPDF2 codes:
from PyPDF2 import PdfFileReader
pdfFileObj = open("FileName", "rb")
pdfReader = PdfFileReader(pdfFileObj,strict = False)
data=[page.extractText() for page in pdfReader.pages]
more information on pyPDF2 here.
I had the same problem when I wanted to do some deep inspection of PDFs for security analysis - I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc so I could get at the "raw data":
https://github.com/opticaliqlusion/pypdf
It's not a feature complete solution, but it is meant to be used in a pure python context where you can define your own visitors to iterate over all the streams, text, id nodes, etc in the PDF tree:
class StreamIterator(PdfTreeVisitor):
'''For deflating (not crossing) the streams'''
def visit_stream(self, node):
print(node.value)
pass
...
StreamIterator().visit(tree)
Anyhow, I dont know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.
Cheers!

HTML to RTF string using Python

I am looking for a way to convert HTML text to RTF string. Is there any libraries that does this job. I get html content dynamically in my project and need it to be rendered in RTF format. I am using HTML parser to convert HTML text to normal string and then have trying to use PyRTF for conversion to RTF format. Is there any better way that this can be done.Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found are worth a look: If you're using a Mac, the textutil command line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories