Extracting text from specific coordinates of a PDF in Python

I have some pre-determined coordinates in a PDF from which I want to extract text (a region near the top of the page). I've been trying to use the library pdfminer.six, but it seems like the smallest unit for processing and extracting elements is a page.
I was thinking that, in order to get text from just a small part of a page, it could get a little inefficient to analyse the entire page when there is a large number of documents to process.
Is there any way to do this? Or is there some other library that works for this use case, where I can pass in coordinates? Or am I fundamentally getting the concept wrong?
Thanks!

You can use visitor functions to do that:
https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer
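For instance, here is a minimal sketch based on that linked example, using PyPDF2's visitor_text callback; the filename and the y-coordinate cut-offs are placeholders you'd tune to your own pages:
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")  # placeholder filename
page = reader.pages[0]
parts = []

def visitor_body(text, cm, tm, font_dict, font_size):
    y = tm[5]  # vertical position from the text matrix
    if 700 < y < 800:  # placeholder cut-offs: keep only text near the top
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
print("".join(parts))
Note that the whole page still gets parsed; the visitor only filters what is collected, so this is more about convenience than about skipping work.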

Related

Extracting Data from Unstructured PDFs in Python

Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I'm not finding any solution that fits the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text PDF if necessary. The short-term goal is to grab a few values from the PDF and store them. The larger goal is to take a large number of these PDFs and perform this task automatically. I know how to store the data once I get it out of the PDF; my problem is actually getting the values out.
I'm not at liberty to display the PDF, below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried a few tools, notably Azure Cognitive Services and PyPDF2, but the issue I usually run into is that the output treats each group of words as an individual line, which does not work if the title of a form field is above it, like the example table below:
left | center | right
One  | Two    | Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching three rows below "right" would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to try, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages; however, I only really need 1-2, and I only have one scanned with Tesseract. The format stays relatively the same, although each PDF is independently scanned, so there could be minor changes there.
Any and all help is greatly appreciated.
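One hedged suggestion for the scanned case: since Tesseract can emit a bounding box per recognized word, you could match a value to the label above it by coordinates instead of by line order. A minimal sketch with pytesseract, where the image name is a placeholder and the PDF page is assumed to have been rendered to an image first:
import pytesseract
from PIL import Image
from pytesseract import Output

# image_to_data returns one bounding box per recognized word.
data = pytesseract.image_to_data(Image.open("page1.png"), output_type=Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip():
        # A value can be matched to the label above it by comparing
        # left/top coordinates instead of relying on line order.
        print(word, data["left"][i], data["top"][i], data["width"][i], data["height"][i])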

Use Python to search online database and download images

I'd like my Python code to iterate through a numpy array and search an online database with each of these numbers. This seems fairly complicated, since it must enter the number into a certain search box after navigating to the URL, and then I would like it to return all the images that appear after searching for each number (there are multiple images in each search result). Is there a straightforward way to do this?
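A minimal sketch of one way this could look, assuming the site accepts the search term as a simple query parameter; the URL, the parameter name, and the image selector below are all hypothetical placeholders:
import numpy as np
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

numbers = np.array([12345, 67890])  # placeholder data

for number in numbers:
    # Hypothetical search endpoint; a JavaScript-driven search box would
    # need a browser-automation tool such as Selenium instead.
    resp = requests.get("https://example.com/search", params={"q": int(number)})
    soup = BeautifulSoup(resp.text, "html.parser")
    for i, img in enumerate(soup.find_all("img")):
        src = img.get("src")
        if not src:
            continue
        image = requests.get(urljoin(resp.url, src))  # resolve relative URLs
        with open(f"{number}_{i}.jpg", "wb") as f:
            f.write(image.content)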

Convert to PDF's Last Page using GraphicsMagick with Python

Converting a range of pages, say the 1st to the 5th, of a multipage PDF into single images is fairly straightforward using:
convert file.pdf[0-4] file.jpg
But how do I convert, say, the 5th page to the last page when I don't know the number of pages in the PDF?
In ImageMagick, "-1" represents the last page, so
convert file.pdf[4--1] file.jpg
works, great stuff, but it doesn't work in GraphicsMagick.
Is there a way of doing this easily, or do I need to find the number of pages?
PS: I need to use GraphicsMagick instead of ImageMagick.
Thank you so much in advance.
Future readers: if you're experiencing the same dilemma in GraphicsMagick, here's the easy solution.
Simply write a big number to represent the "last page", something like:
convert file.pdf[4-99999] +adjoin file%02d.jpg
This will convert from the 5th PDF page to the last PDF page into JPGs.
Note: "+adjoin" and "%02d" have to do with getting all the images rather than just the last one. You'll see what I mean if you try it.

Working on tables in pdf using python [duplicate]

I am working on a PDF file. There are a number of tables in that PDF.
Given the table names in the PDF, I want to fetch the data from those tables using Python.
I have worked on HTML and XML parsing but never with PDFs.
Can anyone tell me how to fetch tables from a PDF using Python?
I think you need a Python parser library. The most famous is PDFMiner.
According to the documentation:
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
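As a quick illustration, a minimal sketch with the pdfminer.six fork's high-level API ("file.pdf" is a placeholder):
from pdfminer.high_level import extract_text

# Extracts all text from the document in one call.
text = extract_text("file.pdf")
print(text)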
This is a very complex problem and not solvable in general.
The reason for this is simply that the PDF format is too flexible. Some PDFs are only bitmaps (you would have to do your own OCR then, which is obviously not our topic here); some are a bunch of letters literally spilled out over the pages, meaning that by parsing the text information in the PDF you get single characters placed at certain coordinates. In some cases these come in an orderly fashion (line by line, from left to right), but in other cases you will get rather random-looking distributions, where special characters, characters in a different font, etc. can come way out of line.
The only proper approach is to place all characters according to their coordinates on a page model and then use heuristics to find out what the lines are.
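A rough sketch of that page-model idea with pdfminer.six; the 2 pt bucketing tolerance is an arbitrary heuristic, and "file.pdf" is a placeholder:
from collections import defaultdict
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

lines = defaultdict(list)
for page_layout in extract_pages("file.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for char in text_line:
                    if isinstance(char, LTChar):
                        # Bucket characters into lines by rounded baseline.
                        lines[round(char.y0 / 2)].append((char.x0, char.get_text()))
    break  # first page only, for the sketch

for y in sorted(lines, reverse=True):  # top of the page first
    print("".join(ch for _, ch in sorted(lines[y])))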
I propose to have a look at your PDFs and the tables therein you want to parse before starting. Maybe they are alike all the time and well-parsable.
Good luck!
I had a similar problem recently, and wrote a library to help solve it: pdfquery.
PDFQuery creates an element tree from the PDF (using pdfminer, with some extra sugar) and lets you fetch elements from the page using jQuery or XPath selectors, based mostly on the text contents or locations of the elements. So to parse a table, you would first find where it is in the document by searching for its label:
import pdfquery
pdf = pdfquery.PDFQuery("file.pdf")  # placeholder filename
pdf.load()
label = pdf.pq(':contains("Name of your table")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
Then you would keep searching for rows underneath the table until the search stops returning results:
page = label.closest('LTPage')
while True:
    row = pdf.extract([
        ('column_1', ':in_bbox("%s,%s,%s,%s")' % (left_corner+10, bottom_corner-40, left_corner+50, bottom_corner-20)),
        ('column_2', ':in_bbox("%s,%s,%s,%s")' % (left_corner+50, bottom_corner-40, left_corner+80, bottom_corner-20)),
    ], page)
    if not (row['column_1'] or row['column_2']):  # no results: past the last row
        break
    print("Got row:", row)
    bottom_corner -= 20
This assumes that your rows are 20 pts high, the first one starts 20 pts below the label, the first column spans from 10 to 50 points from the left edge of the label, and the second column spans from 50 to 80 pts from the left edge of the label.
If you have blank lines or lines with varying heights, this is going to get more annoying. You may also need to use the merge_tags=None option to select individual characters rather than words, if the entries in the table are close enough to make the parser think it's just one line. But hopefully this gets you closer ...
You can use Camelot to extract tabular data from your PDF and export it to your favorite format. Currently, CSV, Excel, JSON and HTML are supported. You can check out the documentation at http://camelot-py.readthedocs.io. It would be helpful if you could post a link to your PDF. Here's a generic code example:
>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_csv('file.csv')
Disclaimer: I'm the author of the library.
