Python 3 - Data mining from PDF

I'm working on a project that requires extracting data from some PDF documents.
Currently I'm using the Foxit toolkit (calling it from the script) to convert each document to txt and then iterating through the text. I'm pretty happy with it, but $100 is just something I can't afford for such a small project.
I've tested all the free converters I could find (like xpdf and pdftotext), but they just don't cut it: they mess up the format in a way that I can't use the words to locate the data.
I've tried some Python modules like pdfminer, but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF, because I get it from a phone carrier.
I'm looking for a way of getting the data out of the PDF, or a converter that at least follows the newlines properly.
Update:
PyPDF2 is not grabbing any text whatsoever from the PDF document.

PyPDF2 seems to be the best option available for Python 3.
It's well documented and the API is simple to use.
It can also work with encrypted files, retrieve metadata, merge documents, etc.
A simple use case for extracting the text:
from PyPDF2 import PdfFileReader

with open("test.pdf", "rb") as f:
    ipdf = PdfFileReader(f)
    # Extract the text of every page into a list of strings
    text = [p.extractText() for p in ipdf.pages]

Sadly, I don't believe there is a good free Python PDF converter. However, pdf2htmlEX, although not a Python module, works extremely well and gives you much more structured data (HTML) than a simple text file. From there you can use Python tools such as Beautiful Soup to scrape the HTML file.
Link: http://coolwanglu.github.io/pdf2htmlEX/
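A rough sketch of that pipeline, assuming pdf2htmlEX is installed and on your PATH (the file names are placeholders):

import subprocess
from bs4 import BeautifulSoup

# Convert the PDF to a single self-contained HTML file
subprocess.run(["pdf2htmlEX", "input.pdf", "output.html"], check=True)

# Parse the HTML; pdf2htmlEX typically renders each text line as its own <div>
with open("output.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
lines = [div.get_text() for div in soup.find_all("div")]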
Hope this helps.

Here is an example of PyPDF2 code:
from PyPDF2 import PdfFileReader

pdfFileObj = open("FileName", "rb")
pdfReader = PdfFileReader(pdfFileObj, strict=False)
data = [page.extractText() for page in pdfReader.pages]
More information on PyPDF2 is available in its documentation.

I had the same problem when I wanted to do some deep inspection of PDFs for security analysis. I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc., so I could get at the "raw data":
https://github.com/opticaliqlusion/pypdf
It's not a feature-complete solution, but it is meant to be used in a pure-Python context where you can define your own visitors to iterate over all the streams, text, and ID nodes in the PDF tree:
class StreamIterator(PdfTreeVisitor):
    '''For deflating (not crossing) the streams'''
    def visit_stream(self, node):
        print(node.value)

...
StreamIterator().visit(tree)
Anyhow, I don't know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.
Cheers!

Related

Creating a Kindle dictionary

I am trying to create a Kindle dictionary that can be used for offline lookup. I already have the words and their inflections, but turning this into a working dictionary is difficult.
There is some documentation about this provided by Amazon. It basically says that you should:
Create an XHTML file with their special markup specifying all inflections etc.
Turn it into an epub
Open it with Kindle Previewer
Export it with Kindle Previewer to MOBI
So I created a large XHTML file (23 MB or so) according to the Amazon specifications and opened it in Kindle Previewer, and it looked fine. However, Kindle Previewer does not let you export XHTML files to MOBI. They want you to create an intermediate epub file.
I tried using Pandoc to do the conversion, which did not work because it stripped out all the specific HTML tags and only left in paragraphs. Then I tried using calibre. The normal XHTML -> epub conversion failed because the XHTML file was too large, according to an error message. Calibre suggests to turn on the "heuristic mode" if you run into this error, which I tried, but which did not finish running after hours of runtime.
Then I attempted to create the epub file myself, using a sample file taken from this tutorial. I discovered that this is not trivial, and a check using epubcheck revealed many hard-to-understand errors in my generated file. The generation of the epub file is also complicated by the fact that you probably need to split the XHTML into many smaller files of maybe 250 KB each, because e-readers tend to struggle with parsing larger files.
So I thought there should maybe be an easier way to do this, or maybe a library that helps with it. Maybe it would even be a good idea to output the words + inflections in some other, easier dictionary format and then convert it to MOBI using an existing library, leaving out the XHTML generation completely. Currently I am using Python, but I'd also use other languages if necessary. What could I try?
Edit: To add to the things I have tried: there is an apparently closed-source script here that unfortunately doesn't support inflections, so it does not work. And there are instructions here that advise converting the file to PRC using Mobipocket Creator and then opening it with Kindle Previewer. The problem with this approach is that Kindle Previewer throws the error:
Kindle Previewer does not support this file, which has either been created using an older version of KindleGen or a third party application. We recommend using EPUB or DOCX format directly for previewing and publishing your book on Kindle.
There are also more detailed instructions for Mobipocket Creator here, which tell you to move the generated .prc file directly onto the Kindle. I tried that, but it is not recognized as a dictionary.
I figured it out by myself. First I implemented a solution myself, then I found the pyglossary library (right now the code below only works with the version from GitHub, not the one from pip) and used it like this:
from pyglossary.glossary import Glossary

Glossary.init()
glos = Glossary()
defiFormat = "h"  # definitions are HTML

base_forms = get_base_forms()  # my own helper functions
for canonical_form in base_forms:
    inflections = get_inflections(canonical_form)
    definitions = get_definition(canonical_form)
    definitionhtml = ""
    for definition in definitions:
        definitionhtml += "<p>" + definition + "</p>"
    all_forms = [canonical_form]
    all_forms.extend(inflections)
    glos.addEntryObj(glos.newEntry(all_forms, definitionhtml, defiFormat))

glos.setInfo("title", "Russian-English Dictionary")
glos.setInfo("author", "Vuizur")
glos.sourceLangName = "Russian"
glos.targetLangName = "English"
glos.write("test.mobi", format="Mobi", keep=True, kindlegen_path="path/to/kindlegen.exe")

Why is PyPDF2 showing this output when printing extractText?

I am trying to extract text from a PDF using PyPDF2, but instead of showing the actual text, the output shows something else. What could be the reason behind it?
Here is my code
import PyPDF2

xfile = open('filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(xfile)
num = pdfReader.numPages
pageobj = pdfReader.getPage(0)
print(pageobj.extractText())
When I run the above program I get the output below. What could be the reason?
!"#$%#&'(%!#)
(((((((((((((((((((((((((((((((((((((((((((((((((!"#$%#&'(%!#)*+,-./0!$1(230
4444444444445674+8,8,9:+*8
4&*)+!,$-.
4,*7;44444444444444444444444444
4$/012/($/3414546(78(,69:/7;7<=(>"#)?#(A2B2/231
(444<(4=&2#4$>4?&#!0$24A>/$>&&#$>/B4?CDEF4+(;8
4,*7,444*B62C;2/0(#B(%69(%9:77;#("1;23D5B
((((?C<GA47,H#B48:(,*I
4,*7*444E2F2:2B(.2G702=2(A10=2;2=2#("1;23D5B
((((?<GA47*H#B4?CDEF46(8
44%'$HH%(!.*($.,&I&%,%
PDF is a file format oriented around page layout, so the text in a PDF can be stored in various ways. It is not guaranteed that your PDF stores its text in a form readable by PyPDF2.
Moving forward: try extracting text from other PDFs before concluding that there is a fault with your PyPDF2 code.
You can also try OCR with pytesseract and see if your result improves; a sketch follows below.
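A minimal OCR sketch, assuming the pdf2image package (which needs Poppler) and pytesseract (which needs a Tesseract install); the file name is a placeholder:

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then run OCR over it
pages = convert_from_path("filename.pdf", dpi=300)
text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)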
From PyPDF2's documentation:
This works well for some PDF files, but poorly for others, depending on the generator used.
Your PDF might be of the latter category and you are SOL...
With PyPDF2 no longer being actively developed (no updates to the PyPI package since 2016), maybe try a more up-to-date package like pdftotext.
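For reference, basic pdftotext usage looks roughly like this (a sketch assuming the package and its Poppler dependency are installed; the file name is a placeholder):

import pdftotext

with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# The PDF object behaves like a sequence of per-page strings
print("\n\n".join(pdf))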

How to remove all the unnecessary tags and signs from HTML files?

I am trying to extract "only" text information from 10-K reports (e.g. company's proxy reports) on SEC's EDGAR system by using Python's BeautifulSoup or HTMLParser. However, the parsers that I am using do not seem to work well onto the 'txt'-format files, including a large portion of meaningless signs and tags along with some xbrl information, which is not needed at all. However, when I apply the parser directly onto 'htm'-format files, which are more or less free from the issues of meaningless tags, the parser seems works relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is that I am in a position to rely on the 'txt'-format files, not the 'htm' ones. So my question is: is there any way to remove all the meaningless signs and tags from these files and extract only the text information, like the text extracted directly from the 'htm' files? I am relatively new to parsing with Python, so if you have any idea on this, it would be of great help. Thank you in advance!
The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the effort of resolving the entire DTS as well as dealing with the complexity of the low-level syntax.
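As a rough illustration only, loading an instance with the open-source Arelle engine's Python API can look something like the sketch below (hedged: entry points and attribute names may differ between Arelle versions, and the file name is a placeholder for a local copy of the instance):

from arelle import Cntlr

cntlr = Cntlr.Cntlr()  # headless Arelle controller
model_xbrl = cntlr.modelManager.load("zeci-20151231.xml")

for fact in model_xbrl.facts:
    # Keep only facts whose concept is typed as a text block in the taxonomy
    if fact.concept is not None and "textBlock" in str(fact.concept.typeQname or ""):
        print(fact.qname, fact.value[:100])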
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, recursively navigating them and looking for text block concepts, deal with namespaces, links, and so on (which an XBRL processor shields you from)
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ), with a few hacks (such as taking XML elements whose names end with TextBlock), but this is at your own risk and not recommended, as it bypasses the taxonomy (a sketch of this hack follows below).
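A minimal sketch of that instance-only hack, using only the standard library (one assumption: SEC's servers may reject requests without a User-Agent header, so fill in your own contact string):

import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

# The instance document linked above
url = "https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml"
req = Request(url, headers={"User-Agent": "name@example.com"})
tree = ET.parse(urlopen(req))

for elem in tree.iter():
    tag = elem.tag.split("}")[-1]  # drop the XML namespace prefix
    if tag.endswith("TextBlock") and elem.text:
        print(tag, elem.text[:100])  # first 100 characters of each text block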
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc.) rather than at the data files of the SEC filing. However, in this case it defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach such as running natural language processing algorithms. All I am saying is that this is then outside of the scope of XBRL.
There is an HTML tag stripper on the pyparsing wiki's Examples page. It does not try to build an HTML document; it merely looks for HTML and script tags and strips them out.
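A condensed sketch of that idea, using pyparsing 2.x's module-level helpers (hedged: these names may differ in other pyparsing versions):

from pyparsing import anyOpenTag, anyCloseTag, commonHTMLEntity, replaceHTMLEntity

def strip_tags(html):
    # Suppress any opening/closing tag, then replace HTML entities with their characters
    stripper = (anyOpenTag | anyCloseTag).suppress()
    entities = commonHTMLEntity.setParseAction(replaceHTMLEntity)
    return entities.transformString(stripper.transformString(html))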

Fast Python PDF metadata reader

I'm looking for a very fast, lightweight Python library to read PDF metadata. I don't need any write capabilities. It would be better if only the metadata information is loaded, not the entire file.
I realise an interpreted language like Python isn't the best choice for speed, but as this solution needs to be cross platform and work with an existing Python application there doesn't seem to be much of a choice.
I checked out pyPdf and some other libraries, but am ideally looking for something lighter and faster, suitable for processing tens of thousands of files in one go.
pdfrw can read the metadata without parsing the entire file. (Disclaimer: I am the author of pdfrw.) For example:
>>> from pdfrw import PdfReader
>>> PdfReader('pdf_reference_1-7.pdf').Info
{'/Title': '(PDF Reference, version 1.7)',
'/CreationDate': '(D:20061017081020Z)',
'/Producer': '(Acrobat Distiller 7.0.5 \\(Windows\\))',
'/Creator': '(FrameMaker 7.2)',
'/ModDate': "(D:20061118211043-02'30')",
'/Author': '(Adobe Systems Incorporated)',
'/Subject': '(Adobe Portable Document Format \\(PDF\\))'}
Here's something I just put together, built on top of the Python PDFMiner library. You can extract both "Info" and XMP-type metadata with it.
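For reference, reading the "Info" dictionary with PDFMiner looks roughly like this (a sketch using the pdfminer.six module paths; the file name is a placeholder):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

with open("file.pdf", "rb") as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    print(doc.info)  # a list of Info dictionaries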
Have you seen this answer to a similar question? It suggests using fopen and manually parsing the metadata. If the metadata is all you need, you can parse it yourself and make it as fast as you like.
It's a little raw, but this should get you the metadata:

import re

with open('file.pdf', 'rb') as f:
    pdfdata = f.read()

metas = re.findall(rb'<</Metadata(.*?)>>', pdfdata)

How to include a page from one PDF in another PDF document in Python

I am using the reportlab toolkit in Python to generate some reports in PDF format. I want to include some predefined parts of documents already published in PDF format in the generated PDF file. Is it possible (and how) to accomplish this in reportlab or in another Python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs, and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (because the originating program followed the PDF spec imperfectly, for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault-tolerant.
Also, pdfrw is completely agnostic to the actual content of the PDF page you are placing. So, for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine; a minimal sketch of the basic pattern follows.
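A minimal sketch of that pattern, adapted from the pdfrw example scripts (hedged: the file names are placeholders):

from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.pdfgen.canvas import Canvas

# Wrap an existing PDF page as a form XObject and draw it onto a new canvas
page = pagexobj(PdfReader("source.pdf").pages[0])
canvas = Canvas("output.pdf")
canvas.doForm(makerl(canvas, page))
canvas.showPage()
canvas.save()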
There is an add-on for ReportLab — PageCatcher.
