Python Slate Library: PDF text extraction concatenating words

I'm just trying to extract text from a PDF in Python, using the Slate library and PyPDF2. Unfortunately, some PDFs come out with multiple words merged/concatenated together. This seems to happen intermittently: for some PDFs the spaces between words are extracted correctly, whereas for others they are not.
One example of a PDF where words are not extracted correctly is included and available for download (SO wouldn't let me upload it) here. The output from
slate.PDF(open(name, 'rb')).text()
is (or at least a segment is):
,notonadhocprocedures,andcanbeusedwithdatacollectedatmul-tiplespatialresolutions(Kulldorff1999).Ifdataontheabundanceofataxonovertimeareavailable,thesedatacanbeincorporatedintoanSTPSanalysistoincreasethesensitivityandreliabilityofthemodeltodetectsightingclusters,
where of course the first comma-separated token should be not on ad hoc procedures.
Does anybody know why this is happening, or have a better idea of a library to use for PDF text extraction?
Thanks for the help!

Related

Python - Split multi-document PDF by keyword and save each to its own PDF file

I have to take a PDF file that contains multiple documents, identify where each document starts and ends using provided key phrases, and then save each document as a separate PDF. I'm pretty sure I need to use the PyPDF2 module to split the PDF into separate PDFs, but I'm not sure how to approach identifying the start and end of the documents by the key phrases. Code would help, but I mostly need clarity on the approach.
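One possible approach: the splitting decision only depends on each page's extracted text, so the grouping logic can be separated from the PDF I/O. Below is a minimal sketch of that grouping step; the key phrases (`START_PHRASES`) are made-up examples, and the page texts would in practice come from something like pypdf's `page.extract_text()`.

```python
# Hypothetical key phrases that mark the first page of a new document.
START_PHRASES = ("INVOICE", "STATEMENT")

def split_by_phrase(page_texts, start_phrases=START_PHRASES):
    """Group page indices into documents; a new document starts on any
    page whose text contains one of the key phrases."""
    docs, current = [], []
    for i, text in enumerate(page_texts):
        if any(p in text for p in start_phrases) and current:
            docs.append(current)   # close the previous document
            current = []
        current.append(i)
    if current:
        docs.append(current)
    return docs

pages = ["INVOICE 001 ...", "page 2 of invoice", "STATEMENT for May", "INVOICE 002"]
print(split_by_phrase(pages))  # → [[0, 1], [2], [3]]
```

Each group of page indices can then be written out with PyPDF2's `PdfWriter` (add the pages of one group, write to its own file).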

Reading an online PDF file in Python and separating data into columns - OSError

I'm having an issue getting an online PDF file into Python. Below is the code I wrote:
import PyPDF2
import pandas as pd
from PyPDF2 import PdfReader

reader = PdfReader(r"http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
and this gives me an error:
OSError: [Errno 22] Invalid argument: 'http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf'
If we fix this, how do we separate the extracted data into separate columns using pandas?
There are three tables in this PDF file; I need the first one. I have tried so many tutorials but none of them helped me. Can anyone help me in this regard please?
Thanks,
Snyder
Part one of your question is how to access the PDF content for extraction.
In order to view, modify, or extract the contents, the bitstream needs to be saved as an editable file. That's why a binary DTP / printout file needs downloading to view: every character on your browser screen was downloaded as text, then converted from local byte storage into graphics.
The simplest method is
curl -O http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf
which saves a working copy locally as 20221004MERGED.pdf.
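The same download can be done in Python itself. The root cause of the OSError is that `open()` (and therefore `PdfReader` given a string) treats the URL as a local file path; fetching the bytes first fixes it. A minimal sketch, assuming a reachable URL (`fetch_pdf` is an illustrative helper, not part of PyPDF2):

```python
import io
import urllib.request

URL = "http://www.meteo.gov.lk/images/mergepdf/20221004MERGED.pdf"

def fetch_pdf(url):
    # Download the PDF into memory; PdfReader accepts any binary
    # file-like object, so the buffer can be passed straight in.
    with urllib.request.urlopen(url) as resp:
        return io.BytesIO(resp.read())

# Passing the URL string straight to open() fails because it is
# treated as a (nonexistent) local path -- the OSError from the question:
try:
    open(URL, "rb")
except OSError:
    print("OSError: a URL is not a local file path")

# reader = PdfReader(fetch_pdf(URL))  # then loop over reader.pages as before
```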
The next issue is that multi-language files are a devil to extract, and that one has many errors that need editing before extraction.
In Acrobat or other viewers we can see failures where eastern characters are mixed in with western ones due to the language mixing, so they need corrective editing. Also, the underlying text seen by PDF readers for extraction consists of western characters that get translated inside the PDF by glyph mapping, but as extractable, searchable text they are just garbled plain text. This is what Adobe sees for search in that first line: k`l²zìq&` m[&Sw`n — you can see the W as the third character from the right.
Seriously, there are so many subset-related problems to address that it is easiest to open the PDF in an editor and reset the fonts to what they should be in each language.
The fonts you need in Word, Office, etc. are Kandy (which I used to correct that word) plus these others:

How can I extract Chinese text from a PDF using a simple 'with open'?

I need to extract PDF text using Python, but pdfminer and others are too big to use. When using the simple "with open xxx as xxx" method, I met a problem: the content part didn't extract appropriately. The text looks like bytes because it starts with b'. My code and the result screenshot:
with open(r"C:\Users\admin\Desktop\aaa.pdf", "rb") as file:
    aa = file.readlines()
    for a in aa:
        print(a)
Output Screenshot:
To generate an answer from the comments...
when using simple "with open xxx as xxx" method, I met a problem , the content part didn't extract appropriately
The reason is that PDF is not a plain text format but instead a binary format whose contents may be compressed and/or encrypted. For example the object you posted a screenshot of,
4 0 obj
<</Filter/FlateDecode/Length 210>>
stream
...
endstream
endobj
contains FLATE compressed data between stream and endstream (which is indicated by the Filter value FlateDecode).
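FlateDecode is ordinary zlib/deflate compression, so a round trip with the standard-library zlib module shows both why raw `readlines()` output looks like noise and how the stream content can be recovered (the content-stream bytes below are a made-up minimal example):

```python
import zlib

# A tiny example of what a page content stream might contain.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# This is the kind of data that sits between stream and endstream:
compressed = zlib.compress(content)
print(compressed[:8])  # binary garbage to a text reader

# Decompressing restores the original instructions:
print(zlib.decompress(compressed) == content)
```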
But even if it was not compressed or encrypted, you might still not recognize any text displayed because each PDF font object can use its own, completely custom encoding. Furthermore, glyphs you see grouped in a text line do not need to be drawn by the same drawing instruction in the PDF, you may have to arrange all the strings in drawing instructions by coordinate to be able to find the text of a text line.
(For some more details and backgrounds read this answer which focuses on the related topic of replacement of text in a PDF.)
Thus, when you say
pdfminer and others are too big to use
please consider that they are so big for a reason: They are so big because you need that much code for adequate text extraction. This is in particular true for Chinese text; for simple PDFs with English text there are some short cuts working in benign circumstances, but for PDFs with CJK text you should not expect such short cuts.
If you want to try nonetheless and implement text extraction yourself, grab a copy of ISO 32000-1 or ISO 32000-2 (Google for pdf32000 for a free copy of the former) and study that pdf specification. Based on that information you can step by step learn to parse those binary strings to pdf objects, find content streams therein, parse the instructions in those content streams, retrieve the text pieces drawn by those instructions, and arrange those pieces correctly to a whole text.
Don't expect your solution to be much smaller than pdfminer etc...

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Is there a library able to do it?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using MacOS, there is a built-in text summarizing Service. Right click on any selected text and click "Summarize" to activate. Though it seems hard to incorporate this into any automated process.
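To make the second half of the pipeline concrete, here is a toy frequency-based summarizer in pure Python — not one of the libraries linked above, just a sketch of the common scoring idea (rank sentences by the document-wide frequency of their words):

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Split into sentences, score each by the document-wide frequency
    # of its words, and keep the n highest-scoring sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    return scored[:n]

doc = ("PDF text extraction is hard. Extraction order matters. "
       "Summarizing extracted text is a separate problem.")
print(summarize(doc, 1))
```

A real summarizer would add stop-word removal, normalization, and sentence-position weighting, which is exactly the kind of complexity the dedicated libraries handle.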

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

Regularly I have to go through a list of PDF files, search for specific data, and add it to an Excel sheet for later review. As there are around 50 PDF files per month, doing it manually is both time-consuming and frustrating.
Can the process be automated in windows by python or any other scripting language? I require to have all the pdf files in a folder and run the script which will generate an excel sheet with all the data added. The pdf files with which I work are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph, or for the third sentence in the fifth paragraph. You can ask a library for all of the text, or all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format, because it isn't even guaranteed that you can copy and paste, or otherwise extract, understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time, or the layout on the page is totally different every time, then the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.
You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite
