I have a fillable pdf with fields that need to be filled out by the user. I am trying to auto-generate responses for those fields with Python, but I need to know the width/length of the form fields in order to know whether my responses will fit in the field.
How do I find the width of these fields, or at least test whether a possible response will fit?
I was thinking that if I knew the font and font size of the field, that might help.
Edit: I just realized that the pdf is encrypted, so interfacing with the pdf in a programmatic way may be impossible. Any suggestions for a quick and dirty solution are welcome though.
Link to form: http://static.e-publishing.af.mil/production/1/af_a1/form/af910/af910.pdf
I need to know the width of the comments blocks.
After some quick digging around in pdf files and one of Adobe's pdf references (source) it turns out that a text field may have a key "MaxLen" whose value is an integer representing the maximum length of the field's text, in characters (See page 444 in the reference mentioned). It appears that if no such key is present, there is no maximum length.
What one could do then is simply search the pdf file for the "MaxLen" keys (all of them if there are multiple text fields, else you could just search for one) and return their values. E.g.:
import re

with open('your_file.pdf', 'r', errors='ignore') as pdf_file:
    content = pdf_file.read()

# Matches every integer "n" that is directly preceded by "/MaxLen "
regexp = r'(?<=/MaxLen )\d+'
max_lengths = [int(match) for match in re.findall(regexp, content)]
(If the file is huge you may not be able to read it all into memory at once. If that's the case, reading it line by line could be a solution.)
max_lengths will then be a list of all the "MaxLen" values, ordered after occurrence in the file (the first occurrence will be first etc.).
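For the huge-file case mentioned above, a line-by-line variant could look like this (a sketch; it assumes a /MaxLen key and its value never get split across a line break):

import re

regexp = re.compile(r'(?<=/MaxLen )\d+')
max_lengths = []
with open('your_file.pdf', 'r', errors='ignore') as pdf_file:
    for line in pdf_file:
        max_lengths.extend(int(match) for match in regexp.findall(line))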
However, depending on what you need, you may need to further your search and add more conditionals to my code. For example, if a file contains multiple text fields but not all of them have a maximum length, you may not know which length corresponds to which field. Also, if a pdf file has been modified and saved (not using "Save As"), the modifications will be appended to the old file instead of overwriting it completely. I'm not sure exactly how this works, but I suppose it could make you get the max lengths of previously removed fields etc. if you're not careful to check for that.
(Working with pdf's in this way is very new to me, please correct me if I'm wrong about anything. I'm not saying that there's no library that can do this for you, maybe PDFMiner can, though it will probably be more advanced.)
Update 23-10-2017
I'm afraid the problem just got a lot harder. I believe you still should be able to deduce the widths of the text fields by parsing the right parts of the pdf file. Why? Because Adobe's software can render it properly (at least Adobe Acrobat Pro DC) without requiring some password to decrypt it first. The problem is that I don't know how to parse it. Dig deep enough and you may find out, or not.
I suppose you could solve the problem in a graphical way, opening every pdf with some viewer which can read them properly, then measuring the widths of the text fields. But, this would be fairly slow and I'm not sure how you would go about recognizing text fields.
It doesn't help that the forms don't use a monospaced font, but that's a smaller problem that definitely can be solved (find which font the text fields use, look up the width of all the characters in that font and use that information in your calculations).
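If you do get that far, the character-width bookkeeping itself is straightforward with a font-metrics helper. Here is a minimal sketch using reportlab's pdfmetrics, assuming you already know the field width in points and that the field uses a standard font such as Helvetica at a known size (both are assumptions, not facts about this form):

from reportlab.pdfbase.pdfmetrics import stringWidth

def fits(text, field_width_pts, font_name='Helvetica', font_size=10):
    # True if the rendered width of `text` is no wider than the field.
    return stringWidth(text, font_name, font_size) <= field_width_pts

print(fits('Some candidate response', 144))  # 144 pt = 2 inches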
If you do manage to solve the problem, please share. :)
Related
I have a Python script running fine; it scans a folder and collects data based on text line position, which could work great, but if any lines have missing data it obviously throws my numbering off.
I have looked in the pdf file using iText RUPS and I can find a reference to one set of the data I need:
BT
582 -158.78 Td
(213447) Tj
ET
The information I want is in the brackets; can I somehow use the coordinates? If all else fails, I might be able to get people to agree to start the info I need to collect with a flag like XX12345 or YY12345, so that I can easily pick out the data from the text extraction, but I'd rather find a better way.
I haven't added code examples as that part works fine; it's just the next step I'm struggling with, but I can if anyone wishes.
Many thanks
I tried to use just text extraction, but missing inputs throw my numbering scheme off.
There is a case in my job where I have to remove a specific section (Glossary) from thousands of pdf documents.
The text I want to remove has a different font from the other parts:
Example:
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
Can you please suggest a way how to do it faster?
One of the possible ways to solve this problem is to find the section you want to delete using regex, then use one of the Python libraries for pdf editing to delete that section.
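For example, a rough sketch of that idea with PyMuPDF (the fitz module), which is just one of the possible libraries; the glossary-entry regex and the file names below are assumptions to adapt to your documents:

import re
import fitz  # PyMuPDF

# Hypothetical pattern: a quoted term followed by the rest of its definition line.
entry_pattern = re.compile(r'"[^"]+"[^\n]+')

doc = fitz.open('input.pdf')
for page in doc:
    for match in entry_pattern.finditer(page.get_text()):
        # Mark every occurrence of the matched entry on this page for redaction.
        for rect in page.search_for(match.group()):
            page.add_redact_annot(rect)
    page.apply_redactions()  # physically removes the marked text
doc.save('output.pdf')

Since the glossary text reportedly uses a different font, another option would be to walk page.get_text('dict') and filter spans by their font name instead of matching by regex.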
I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.
However, after about page 12, there seems to be an internal change in the document encoding. For example, there is a section on page 18 that looks like normal text in the viewer, but when you copy and paste it, it becomes:
%A>&1;<81
FB9#4AH4EL
%BJ8XF8#C?BL874CCEBK<#4G8?L
9H??G<#84FFB6<4G8F4A7
C4EGG<#84FFB6<4G8F
CE<#4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<#8
F84FBA4?
4A79H??G<#8CBF<G<BAFGB9H?9<??G;8F84FBA4?78#4A7B9BHE,CE<A:F84FBA
<A6E84F8778#4A77HE<A:G;8(/"C4A78#<6
4F6HFGB#8EF9B6HF87BA;B#8<#CEBI8#8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB#B7<9LBHEFGBE8?4LBHG
What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating that the stuff above back to text?
Pages 13 to 100 have been imported, and there are other odd practices too, which suggests you will get 12 good pages, then need to OCR pages 13-100, then probably get 3 good pages again from 101-104; see https://stackoverflow.com/a/68627207/10802527
The majority of pages 13-100 contain structured text that is described as Roman, and coincidentally the Romans were fond of encoding messages by sliding the alphabet a few steps to the right or left. That's exactly what's happening here: by character sliding we could extract much of the corrupted text using chars+n, so read
A and replace with n
B and replace with o
C and replace with p
etc., but I will leave it there as I have little time to do 90 pages of analysis on a bad file's font definition.
I tried Acrobat and Exchange plus others; all agreed the text was defined as a reasonable form of Times Roman, thus there was nothing to fix, yet the content is meaningless. Nevertheless, selecting the characters for "We" (08) generally jumped to another instance, suggesting there could be some slight possibility of redemption, but then yet again the same two characters stopped on occasion at "ai", which is what's needed, so I would say the file is borked.
In theory the corruption should be recoverable in the PDF by remapping that font (at least for those pages), and with a good character remapping, adding or subtracting offsets accordingly, the plain text may be more easily converted.
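To make the character-sliding idea concrete, here is a minimal sketch of such a remapping, assuming a constant offset of 45 (which is what A→n, B→o, C→p amounts to in ASCII, and which also turns the sample "F84FBA4?" into "seasonal"); the real file would very likely need a more careful per-glyph map:

def shift_decode(garbled, offset=45):
    # Shift every character code up by a fixed offset (an assumption, not a verified CMap).
    return ''.join(chr(ord(c) + offset) for c in garbled)

print(shift_decode('F84FBA4?'))  # -> 'seasonal' with the assumed offset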
I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.
I'm having a problem where not all of the document's text appears when iterating over the doc.paragraphs
For debugging I have tried
from docx import Document

doc = Document(input_file)
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))
Which only seems to print out about half of the file's contents.
There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?
Edit: the missing text is contained within a mail merge field if that makes any difference
The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.
This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).
The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx can be handy, like Paragraph, etc. once you've gotten hold of the elements you want.
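As a rough illustration of that kind of XML navigation (a sketch; the file name is hypothetical and the XPath may need tuning for your document), the underlying lxml elements in python-docx already have the w: namespace prefix bound for xpath():

from docx import Document

doc = Document('input.docx')  # hypothetical file name

# Collect every w:t text node in the body, including ones nested inside
# field containers that doc.paragraphs skips over.
text_nodes = doc.element.body.xpath('.//w:t')
full_text = ''.join(node.text or '' for node in text_nodes)
print(full_text)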
I am processing large text files using Python. Each line of a file is a complete JSON message, and might be very long. I need to insert information about each line into a database. This info is very simple: the length of the line plus a unique ID which each message contains. So each line has the form
{"field1":"val1", ..., "ID":"12345", ..., "fieldK":"valK"}
and I need to extract "12345" from the message.
Right now I load the entire string using json.loads() then find the ID and ignore the rest.
My code is too slow and I need to speed it up. I am trying to see if there is a way of extracting "ID" faster than loading the whole string. One option is to search the string for "ID" and then process the :"12345" that follows. But it might be brittle if it so happens that there is a substring "ID" someplace else in the message.
So is there a way of somehow partially loading the line to find ID, which would be as robust as, but also faster than, loading the whole line?
I would recommend a couple of paths:
If your input is very large, it may be that loading it wholly into memory is wasteful. It may be faster to load/parse each line separately.
If the above won't help, then devising some way to search for the right ID in the file isn't a bad idea. Just verify that the input is kosher when you actually find the right ID: number. So you should:
Search (regex or otherwise) for the ID you expect.
For a match, actually parse the line and make sure it's valid. If it isn't (say, just ID: embedded in some string), drop it and keep searching.
Since non-legit occurrences of (2) should be rare, the verification doesn't have to be very efficient.
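A sketch of that recipe (the "ID" key and the digits-only value come from the question; the file name, the exact pattern, and the rule for when to fall back to a full parse are assumptions to adapt):

import json
import re

# Cheap pre-filter: the "ID" key followed by a quoted run of digits.
ID_PATTERN = re.compile(r'"ID"\s*:\s*"(\d+)"')

def extract_id(line):
    matches = ID_PATTERN.findall(line)
    if len(matches) == 1:
        # Unambiguous match: trust the cheap path.
        return matches[0]
    # No match, or an ambiguous one: fall back to the full (slow but rare) parse.
    return json.loads(line).get('ID')

with open('messages.jsonl') as f:  # hypothetical file name
    for line in f:
        print(len(line), extract_id(line))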