Convert to PDF's Last Page using GraphicsMagick with Python

Convert to PDF's Last Page using GraphicsMagick with Python - python

To convert a range of say the 1st to 5th page of a multipage pdf into single images is fairly straight forward using:
convert file.pdf[0-4] file.jpg
But how do i convert say the 5th to the last page when i dont know the number of pages in the pdf?
In ImageMagick "-1"represents the last page, so:
convert file.pdf[4--1] file.jpg works, great stuff,
but it doesnt work in GraphicsMagick.
Is there a way of doing this easily or do i need to find the number of pages?
PS: need to use graphicsmagick instead of imagemagick.
Thank you so much in advance.

Future readers of this, if you're experiencing the same dilemma in GraphicsMagick. Here's the easy solution:
Simply write a big number to represent the "last page".
That is: something like:
convert file.pdf[4-99999] +adjoin file%02d.jpg
will work to convert from the 5th pdf page to the last pdf page, into jpgs.
Note: "+adjoin" & "%02d" have to do with getting all the images rather than just the last. You'll see what i mean if you try it.

Related

Extracting Data from Unstructured PDFs in Python

Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I'm not finding any solution that fit the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text pdf if necessary. The goal in the short term is to grab a few values from the PDF and store them. The large scale goal is to get a large number of these PDFs and perform this task automatically. I know how to store the data if I can get it out of the PDF, my problem is actually getting the values out.
I'm not at liberty to display the PDF, below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried using a few tools, notably Azure Cognitive Services and PyPDF2, however the issues I'm usually running into is that the output has each group of words as an individual line in the output, which does not work if the title of a form field is above it, like the example table below
left
center
right
One
Two
Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching for 3 rows below right would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to find, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages, however I only really need 1-2, and I only have 1 scanned with Tesseract. The format stays relatively the same, although each pdf is independently scanned in so there could be minor changes there.
Any and all help is greatly appreciated.

Extracting text from specific coordinates of a PDF in python

I have some pre-determined coordinates that I want to look into a PDF to extract text from (some part on the top of the page). I've been trying to use the library pdfminer.six but it seems like the smallest unit for processing and extracting elements is a page.
I was thinking that in order to just get text from a small part of a page, it could get a little inefficient to go through and analyse the entire page when there are large number of documents to process.
Is there any way to do so? Or is there some other library that can work with this use case, where I can pass in coordinates? Or am I getting the concept wrong fundamentally?
Thanks!

You can use visitor functions to do that:
https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer

Use Python to search online database and download images

I'd like my Python code to iterate through a numpy array and search an online database with each of these numbers. This seems fairly complicated, since it must enter the number into a certain search box after navigating to the URL, and then I would like it to return all the images that appear after searching for each number (there are multiple images in each search result). Is there a straightforward way to do this?

Capturing screenshot and parsing data from the captured image

I need to write a desktop application that performs the following operations. I'm thinking of using Python as the programming language, but I'd be more than glad to switch, if there's an appropriate approach or library in any other languages.
The file I wish to capture is an HWP file, that only certain word processors can run.
Capture the entire HWP document in an image, might span multiple pages (>10 and <15)
The HWP file contains an MCQ formatted quiz
Parse the data from the image that is separate out the questions and answers and save them as separate image files.
I have looked into the following python library, but am still not able to figure out how to perform both 1 and 3.
https://pypi.python.org/pypi/pyscreenshot
Any help would be appreciated.

If i got it correctly , you need to extract text from image.
For this one you should use an OCR like tesseract.
Before using an OCR, try to clear noises from image.
To split the image try to add some unique strings to distinguish between the quiz Q/A

best way to print data in columnar format?

I am using Python to read in data in a user-unfriendly format and transform it into an easier-to-read format. The records I am outputting are usually going to be just a last name, first name, and room code. I
I would like to output a series of pages, each containing a contiguous subset of the total records, divided into multiple columns, each of which contains a contiguous subset of the total records on the page. (So in other words, you'd read down the first column, move to the next column, move to the next column, etc., and then start over on the next page...)
The problem I am facing now is that for output formats, I'm almost certainly limited to HTML (and Javascript, CSS, etc.) What is the best way to get the data into this columnar format? If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally, for instance, I could easily print tables of 5x20, but I don't know if there's a way to indicate a page break -- and I don't know if there's any way to calculate programmatically how many records will fit on the page.
How would you approach this?
EDIT: The reason I said that I was limited in output: I have to produce the file on one computer, then bring it to a different computer upon which we cannot install new software and on which the selection of existing software is not optimal. The file itself is only going to be used to make a physical printout (which is what the end users will actually work with), but my time on the computer that I can print from is going to be limited, so I need to have the file all ready to go and print right away without a lot of tweaking.
Right now I've managed to find a word processor that I can use on the target machine, so I'm going to see if I can target a format that the word processor uses.
EDIT: Once I knew there was a word processor I could use, I made a simple skeleton file with the settings that I wanted (column and tab settings, monospaced font in a small point size, etc.) and then measured how many characters I got per line of a column and how many lines I got per column. I've watched the runs pretty carefully to make sure that there weren't some strange lines that somehow overflowed the characters-per-line guideline (which shouldn't happen with monospaced font, of course, but how many times do you end up having to figure out why that thing that "shouldn't" happen is happening anyways?)
If there hadn't been a word processor on the target machine that I could use, I probably would have looked at PDF as an output format.

"If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally"
You do know that.
You know the size of your paper. You know the size of your font. You can easily do the math.
"almost certainly limited to HTML..." doesn't make much sense. Is this a web application? The page can have a "Previous" and "Next" button to step through the pages? Pick a size that looks good to you and display one page full with "Previous" and "Next" buttons.
If it's supposed to be one HTML page that prints correctly, that's hard. There are CSS things you can do, but you'll be happier creating a PDF file.
Get PyX or ReportLab and create a PDF that prints properly.
I -- personally -- have no patience with any of this. I try put this kind of thing into a CSV file. My users can then open CSV with a tool spreadsheet (Open Office Org has a good one) and then adjust the columns and print with it.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.