How to parse this kind of PDF with python

I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.
However, after about page 12, there seems to be an internal change in the document encoding. For example, this section from page 18:
It looks like text, but when you copy and paste it, it becomes:
%A>&1;<81
FB9#4AH4EL
%BJ8XF8#C?BL874CCEBK<#4G8?L
9H??G<#84FFB6<4G8F4A7
C4EGG<#84FFB6<4G8F
CE<#4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<#8
F84FBA4?
4A79H??G<#8CBF<G<BAFGB9H?9<??G;8F84FBA4?78#4A7B9BHE,CE<A:F84FBA
<A6E84F8778#4A77HE<A:G;8(/"C4A78#<6
4F6HFGB#8EF9B6HF87BA;B#8<#CEBI8#8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB#B7<9LBHEFGBE8?4LBHG
What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating the stuff above back to text?

Pages 13 to 100 have been imported with other odd practices, which suggests you will get 12 good pages, then need to OCR pages 13-100, then probably get 3 good pages again from 101 to 104; see https://stackoverflow.com/a/68627207/10802527
The majority of pages 13-100 contain structured text whose font is described as Roman, and coincidentally the Romans were fond of encoding messages by sliding the alphabet a few steps to the right or left. That is exactly what is happening here: by character sliding we can recover much of the corrupted text using chars + n, so read
A and replace it with n
B and replace it with o
C and replace it with p
etc., but I will leave it there, as I have little time to do 90 pages of analysis on a bad file's font definition.
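As a sketch of that sliding idea (the +45 offset is my own guess from the samples above; it recovers the lowercase range, while capitals and punctuation clearly sit at other offsets and come out wrong):

```python
def slide(encoded, shift=45):
    """Shift every character's code point by `shift`.

    With shift=45, '4' -> 'a', 'A' -> 'n', 'B' -> 'o', etc., which
    recovers the lowercase letters in this file; capitals and
    punctuation use different offsets and are not handled here.
    """
    return "".join(chr(ord(c) + shift) for c in encoded)

# One of the garbled lines from page 18:
print(slide("J88KC4A787BHEJBE>9BE68"))  # -> weexpandedourworkforce
```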
I tried Acrobat and Exchange plus others; all agreed the text was defined as a reasonable form of Times Roman, thus there was nothing to fix, yet the content is meaningless. Nevertheless, selecting the characters for "We" (08) generally jumped to another instance, suggesting some slight possibility of redemption, but then again the same two characters stopped on occasion at "ai", which is what's needed, so I would say the file is borked.
In theory the corruption should be recoverable by remapping that font in the PDF (at least for those pages); with good character remapping, adding or subtracting accordingly, the plain text may be more easily recovered.

Related

PyPDF2 Extract from field or location

I have a Python script running fine; it scans a folder and collects data based on text line position, which could work great, but if any lines have missing data it obviously throws my numbers off.
I have looked in the PDF file using iText RUPS and I can find a reference to one set of the data I need:
BT
582 -158.78 Td
(213447) Tj
ET
The information I want is in the brackets; can I somehow use the coordinates? If all fails, I might be able to get people to agree to start the info I need to collect with a flag like XX12345 or YY12345, then I could easily pick out the data from the text extraction, but I'd rather find a better way.
I haven't added code examples as that part works fine; it's just the next step I'm struggling with, but I can add them if anyone wishes.
Many thanks
I tried to use just text extraction, but missing inputs throw my numbering scheme off.
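For what it's worth, if the content stream is stored uncompressed (as the RUPS snippet suggests), the Td coordinates and Tj strings can be paired with a small regex. This is only a sketch: it ignores compressed streams, the Tm/TD operators, and escaped parentheses, all of which a real PDF library handles for you.

```python
import re

# Pair each "x y Td" positioning operator with the "(text) Tj" that follows.
TD_TJ = re.compile(r"(-?[\d.]+)\s+(-?[\d.]+)\s+Td\s*\((.*?)\)\s*Tj", re.S)

def text_with_coords(stream: str):
    return [(float(x), float(y), s) for x, y, s in TD_TJ.findall(stream)]

snippet = "BT\n582 -158.78 Td\n(213447) Tj\nET"
print(text_with_coords(snippet))  # -> [(582.0, -158.78, '213447')]
```

With the coordinates in hand you can filter for the known position of the field you want instead of relying on line order.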

How can I extract Chinese text from PDF using a simple ‘with open’?

I need to extract PDF text using Python, but pdfminer and the others are too big to use. When using the simple "with open xxx as xxx" method, I met a problem: the content part didn't extract appropriately. The text looks like bytes because it starts with b'. My code and the result screenshot:
with open(r"C:\Users\admin\Desktop\aaa.pdf", "rb") as file:
    aa = file.readlines()
for a in aa:
    print(a)
Output Screenshot:
To generate an answer from the comments...
when using simple "with open xxx as xxx" method, I met a problem , the content part didn't extract appropriately
The reason is that PDF is not a plain text format but instead a binary format whose contents may be compressed and/or encrypted. For example the object you posted a screenshot of,
4 0 obj
<</Filter/FlateDecode/Length 210>>
stream
...
endstream
endobj
contains FLATE compressed data between stream and endstream (which is indicated by the Filter value FlateDecode).
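To see what that filter does, you can round-trip a content stream through Python's zlib module (the bytes between stream and endstream are raw zlib data; real files may chain further filters on top):

```python
import zlib

# What a PDF writer does: deflate the content stream...
raw = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(raw)

# ...and what a reader must undo before it can parse the operators.
print(zlib.decompress(compressed))  # -> b'BT /F1 12 Tf (Hello) Tj ET'
```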
But even if it was not compressed or encrypted, you might still not recognize any text displayed because each PDF font object can use its own, completely custom encoding. Furthermore, glyphs you see grouped in a text line do not need to be drawn by the same drawing instruction in the PDF, you may have to arrange all the strings in drawing instructions by coordinate to be able to find the text of a text line.
(For some more details and backgrounds read this answer which focuses on the related topic of replacement of text in a PDF.)
Thus, when you say
pdfminer and others are too big to use
please consider that they are so big for a reason: They are so big because you need that much code for adequate text extraction. This is in particular true for Chinese text; for simple PDFs with English text there are some short cuts working in benign circumstances, but for PDFs with CJK text you should not expect such short cuts.
If you want to try nonetheless and implement text extraction yourself, grab a copy of ISO 32000-1 or ISO 32000-2 (Google for pdf32000 for a free copy of the former) and study that pdf specification. Based on that information you can step by step learn to parse those binary strings to pdf objects, find content streams therein, parse the instructions in those content streams, retrieve the text pieces drawn by those instructions, and arrange those pieces correctly to a whole text.
Don't expect your solution to be much smaller than pdfminer etc...

PyPDF2 can't read non-English characters, returns empty string on extractText()

I'm working on a script that will extract data from a large PDF file (40-60+ pages long) that isn't in English; the file contains Greek characters. All seems good until I run the extractText() function of PyPDF2 to get the given page's contents: it returns an empty string.
I'm new to this library and I don't know what to do to fix this problem!
PyPDF2's "Extract Text" looks like it will either Work Just Fine or Fail Completely. There are no parameters you can pass in to try to get things to work properly. It'll work or it won't.
You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third party PDF viewer, use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.
There are three different things in a PDF that can look like letters to the human eye.
1. Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
2. Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bezier curves to define its curves. Not terribly related to your question, but interesting.
3. Bitmaps (JPEG/GIF/etc. images) that define a grid of pixels.
In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).
With broken type 1s, and all of 2 and 3, the only thing you can do is to run OCR on the PDF. OCR: Optical Character Recognition. There are several open source OCR projects out there, as well as commercial ones.

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary from the first page.
Does a library exist that is able to do this?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
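If you just want to see the idea behind such summarizers, a toy frequency-based version fits in a few lines (illustrative only; the libraries above do far more):

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Split into sentences, then score each sentence by the corpus-wide
    # frequency of the words it contains; keep the n highest-scoring.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())))
    return " ".join(scored[:n])

print(summarize("Cats are great. Cats sleep a lot. Dogs bark.", 1))
# -> Cats sleep a lot.
```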
If you're using MacOS, there is a built-in text summarizing Service. Right click on any selected text and click "Summarize" to activate. Though it seems hard to incorporate this into any automated process.

Find width of pdf form field in python

I have a fillable PDF with fields that need to be filled out by the user. I am trying to auto-generate responses for those fields with Python, but I need to know the width/length of the form fields in order to know whether my responses will fit in the field.
How do I find the width of these fields, or at least test whether a possible response will fit?
I was thinking that if I knew the font and font size of the field, that might help.
Edit: I just realized that the pdf is encrypted, so interfacing with the pdf in a programmatic way may be impossible. Any suggestions for a quick and dirty solution are welcome though.
Link to form: http://static.e-publishing.af.mil/production/1/af_a1/form/af910/af910.pdf
I need to know the width of the comments blocks.
After some quick digging around in pdf files and one of Adobe's pdf references (source) it turns out that a text field may have a key "MaxLen" whose value is an integer representing the maximum length of the field's text, in characters (See page 444 in the reference mentioned). It appears that if no such key is present, there is no maximum length.
What one could do then is simply search the PDF file for the "MaxLen" keys (if there are multiple text fields; otherwise you could just search for one) and return their values. E.g.:
import re

with open('your_file.pdf', 'r', errors='ignore') as pdf_file:
    content = pdf_file.read()

# Matches every integer that is preceded by "/MaxLen "
regexp = r'(?<=/MaxLen )\d+'
max_lengths = [int(match) for match in re.findall(regexp, content)]
(If the file is huge you may not be able to read it all into memory at once. If that's the case, reading it line by line could be a solution.)
max_lengths will then be a list of all the "MaxLen" values, ordered by occurrence in the file (the first occurrence comes first, etc.).
However, depending on what you need, you may have to extend the search and add more conditionals to my code. For example, if a file contains multiple text fields but not all of them have a maximum length, you may not know which length corresponds to which field. Also, if a PDF file has been modified and saved (not using "Save As"), the modifications are appended to the old file instead of overwriting it completely. I'm not sure exactly how this works, but I suppose it could make you get the max lengths of previously removed fields if you're not careful and don't check for that.
(Working with pdf's in this way is very new to me, please correct me if I'm wrong about anything. I'm not saying that there's no library that can do this for you, maybe PDFMiner can, though it will probably be more advanced.)
Update 23-10-2017
I'm afraid the problem just got a lot harder. I believe you still should be able to deduce the widths of the text fields by parsing the right parts of the pdf file. Why? Because Adobe's software can render it properly (at least Adobe Acrobat Pro DC) without requiring some password to decrypt it first. The problem is that I don't know how to parse it. Dig deep enough and you may find out, or not.
I suppose you could solve the problem in a graphical way, opening every pdf with some viewer which can read them properly, then measuring the widths of the text fields. But, this would be fairly slow and I'm not sure how you would go about recognizing text fields.
It doesn't help that the forms don't use a monospaced font, but that's a smaller problem that definitely can be solved (find which font the text fields use, look up the width of all the characters in that font and use that information in your calculations).
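As a sketch of that last calculation (the glyph widths below are made-up placeholders, not real font metrics; in a real PDF the per-glyph widths come from the font dictionary's /Widths array, expressed in thousandths of an em):

```python
def text_width_pts(s, glyph_widths, font_size, default=500):
    # glyph_widths maps character -> advance width in 1/1000 em units;
    # multiplying by the font size (in points) gives the rendered width.
    return sum(glyph_widths.get(c, default) for c in s) * font_size / 1000.0

widths = {'A': 722, 'v': 500, ' ': 250}  # hypothetical metrics
print(text_width_pts("Av A", widths, 12))  # -> 26.328 (points)
```

Comparing that width against the field's rectangle (from its /Rect entry) would tell you whether a response fits.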
If you do manage to solve the problem, please share. :)
