Fast Python PDF metadata reader - python

I'm looking for a very fast, lightweight Python library to read PDF metadata. I don't need any write capabilities. It would be better if only the metadata information is loaded, not the entire file.
I realise an interpreted language like Python isn't the best choice for speed, but as this solution needs to be cross platform and work with an existing Python application there doesn't seem to be much of a choice.
I checked out pyPdf and some other libraries, but am ideally looking for something lighter and faster, suitable for processing tens of thousands of files in one go.

pdfrw can read the metadata without reading parsing the entire file. (Disclaimer: I am the author of pdfrw.) For example:
>>> from pdfrw import PdfReader
>>> PdfReader('pdf_reference_1-7.pdf').Info
{'/Title': '(PDF Reference, version 1.7)',
'/CreationDate': '(D:20061017081020Z)',
'/Producer': '(Acrobat Distiller 7.0.5 \\(Windows\\))',
'/Creator': '(FrameMaker 7.2)',
'/ModDate': "(D:20061118211043-02'30')",
'/Author': '(Adobe Systems Incorporated)',
'/Subject': '(Adobe Portable Document Format \\(PDF\\))'}

Here's something I just put together, built on top of the Python PDFMiner library. You can extract both "Info" and XMP type metadata with it.

Have you seen this answer to a similar question? It suggests using fopen and manually parsing the metadata. If the metadata is all you need, you can parse it yourself and make it as fast as you like.

It's a little Raw, but this should get you the meta data
f = open('file.pdf', 'r')
pdfdata=f.read()
metas=re.findall('<</Metadata(.*?)>>',pdfdata)

Related

How to convert fb2 file in to txt using Python

Hello I am struggling to convert hundreds of fb2 files to txt using Python. I find pyandoc and EbookLib but I didn't find in their functionality this option, or I didn't search carefully.
Can someone suggest me something relevant in my case ? Maybe free API, but I think there could be a library.
something relevant in my case
I did look for fb2 and FictionBook2 at PyPI and found 2 potentially useful to you: catpandoc and FB2. 1st does Cat multiple documents to the terminal. and support numerous file formats. 2nd is Python package for working with FictionBook2. For FB2 example is given how to create FB2 file, but not how to read. I do not know if it means documentation is poor or it does not have read support at all.
EDIT: After some research I found that FictionBook2 files are XML files. Example can be seen here. That being said I encourage you to first try existing FB2 tools and only if they fail to deliver desired result implement extraction by XML parsing.

Python 3 - Data mining from PDF

I'm working on a project that requires obtaining data from some PDF documents.
Currently I'm using Foxit toolkit (calling it from the script) to convert the document to txt and then I iterate through it. I'm pretty happy with it, but 100$ it's just something I can't afford for such a small project.
I've tested all the free converters that I could find (like xpdf, pdftotext) but they just don't cut it, they mess up the format in a way that i cant use the words to locate the data.
I've tried some Python modules like pdfminer but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF because I get them from a phone carrier.
I'm looking for a way of getting the data from the PDF or a converter that at least follow the newlines properly.
Update:
PyPDF2 is not grabbing any text whatsoever from the pdf document.
The PyPDF2 seems to be the best one available for Python3
It's well documented and the API is simple to use.
It also can work with encrypted files, retrieve metadata, merge documents, etc
A simple use case for extracting the text:
from PyPDF2 import PdfFileReader
with open("test.pdf",'rb') as f:
if f:
ipdf = PdfFileReader(f)
text = [p.extractText() for p in ipdf.pages]
I don't believe that there is a good free python pdf converter sadly, however pdf2html although it is not a python module, works extremely well and provides you with much more structured data(html) compared to a simple text file. And from there you can use python tools such as beautiful soup to scrape the html file.
link - http://coolwanglu.github.io/pdf2htmlEX/
Hope this helps.
Here is an example of pyPDF2 codes:
from PyPDF2 import PdfFileReader
pdfFileObj = open("FileName", "rb")
pdfReader = PdfFileReader(pdfFileObj,strict = False)
data=[page.extractText() for page in pdfReader.pages]
more information on pyPDF2 here.
I had the same problem when I wanted to do some deep inspection of PDFs for security analysis - I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc so I could get at the "raw data":
https://github.com/opticaliqlusion/pypdf
It's not a feature complete solution, but it is meant to be used in a pure python context where you can define your own visitors to iterate over all the streams, text, id nodes, etc in the PDF tree:
class StreamIterator(PdfTreeVisitor):
'''For deflating (not crossing) the streams'''
def visit_stream(self, node):
print(node.value)
pass
...
StreamIterator().visit(tree)
Anyhow, I dont know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.
Cheers!

HTML to RTF string using Python

I am looking for a way to convert HTML text to RTF string. Is there any libraries that does this job. I get html content dynamically in my project and need it to be rendered in RTF format. I am using HTML parser to convert HTML text to normal string and then have trying to use PyRTF for conversion to RTF format. Is there any better way that this can be done.Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found are worth a look: If you're using a Mac, the textutil command line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories