I have a python script running fine, it scans a folder and collects data based on text line position which could work great but if any lines have missing data it throws my numbers off obviously.
I have looked in the pdf file using iText RUPS and I can find a reference to one set of the data I need
BT
582 -158.78 Td
(213447) Tj
ET
the information I want is in the brackets, can I somehow use the coordinates? if all fails, I might be able to get people to agree to start the info I need to collect with a flag XX12345 or YY12345 then I can easily pick out the data from the text extraction, but I'd rather find a better way.
Not added code examples as that works fine it's just the next step I'm struggling with, but I can if anyone wishes.
Many thanks
I tried to use just text extraction, but missing inputs throw my numbering scheme off.
Related
There is a case in my job where l have to remove a specific section (Glossary) from thousands of pdf documents.
The text l want to remove has a different font from the other parts:
Example:
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
Can you please suggest a way how to do it faster?
One of the possible ways to solve this problem is to find the section you want to delete using regex. Then using one of the libraries for pdf editing in python to delete this section.
I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.
However, after about page 12, there seems to be an internal change in the document encoding. For example, this section from page 18:
It looks like text, but when you copy and paste it, it becomes:
%A>&1;<81
FB9#4AH4EL
%BJ8XF8#C?BL874CCEBK<#4G8?L
9H??G<#84FFB6<4G8F4A7
C4EGG<#84FFB6<4G8F
CE<#4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<#8
F84FBA4?
4A79H??G<#8CBF<G<BAFGB9H?9<??G;8F84FBA4?78#4A7B9BHE,CE<A:F84FBA
<A6E84F8778#4A77HE<A:G;8(/"C4A78#<6
4F6HFGB#8EF9B6HF87BA;B#8<#CEBI8#8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB#B7<9LBHEFGBE8?4LBHG
What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating that the stuff above back to text?
Pages 13 to 100 have been imported also there are other odd practices thus suggest you will get 12 good pages then need to OCR 13-100 then probably good 3 pages from 101-104 again see https://stackoverflow.com/a/68627207/10802527
The majority of Pages 13-100 contain structured text that is described as Roman, and coincidentally the Romans were fond of encoding messages by sliding the alphabet a few step to the right or left and that's exactly what's happening here by character sliding we could extract much of the corrupted text using chars+n so read
A and replace with n
B and replace with o
C and replace with p
etc. but I will leave it there as I have little time to do 90 pages of analysis on a bad file font definition.
I tried Acrobat and Exchange plus others all agreed the text was defined as a reasonable form of Times Roman thus nothing to fix but content is meaningless nevertheless Selecting the characters for "We" (08) generally jumped to another instance suggesting there could be some slight possibility of redemption but then yet again the same two characters stopped on occasion at "ai" which is what's needed so I would say the file is Borked.
In theory the corruption should be recoverable in the PDF by remapping that font (at least for those pages), and with good Char remapping by adding or subtracting accordingly the plain text may be more easily converted.
I am looking for a kind of database which can search in separate files eg. pdf, xls, doc that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list should act as links that open the file with the item number selected when selecting one from the list.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used vba in connection with Excel, so maybe it's too complicated for me. But is it possible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!
I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of #nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.
I'm looking to generate PDF's from a Python application.
They start relatively simple but some may become more complex (Essentially letter like documents but will include watermarks for example later)
I've worked in raw postscript before and providing I can generate the correct headers etc and file at the end of it I want to avoid use of complex libs that may not do entirely what I want. Some seem to have got bitrot and no longer supported (pypdf and pypdf2) Especially when I know PDF/Postscript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated postscript) fine by just writing the appropriate text headers to file and my postscript code. But Inspecting PDF's there is a lil binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to #LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world pdf from scratch
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.