Extracting specific page from multiple PDF files - python

I have hundreds of PDF files with the same format but different content.
I need to extract the second page from each file individually.
For example:
original1.pdf -> 2_original1.pdf
original2.pdf -> 2_original2.pdf
original3.pdf -> 2_original3.pdf
I'm trying to use PyPDF2, but I cannot work out the code from what I found on Google.
So far I have used the commands below, but I don't think they are correct.
cd C:\Users\ukil.yeom\Desktop\pdf_extracts
convert -density 150 *.pdf[2] only-page-2.pdf
Please show me how to do this.
Thanks in advance.
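
One possible way to do this in Python is sketched below. It assumes the pypdf package (the maintained successor of PyPDF2; PyPDF2 3.x exposes the same PdfReader and PdfWriter names). The helper names second_page_name, extract_second_page and extract_all are made up for illustration.

```python
from pathlib import Path


def second_page_name(source: Path) -> Path:
    """Map original1.pdf to 2_original1.pdf in the same folder."""
    return source.with_name(f"2_{source.name}")


def extract_second_page(source: Path) -> bool:
    """Write page 2 of source to its 2_ file; return False if there is no page 2."""
    # Assumption: pypdf is installed. Imported here so the naming helper
    # above can be used without the dependency.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    if len(reader.pages) < 2:
        return False
    writer = PdfWriter()
    writer.add_page(reader.pages[1])  # pages are zero-indexed, so [1] is page 2
    with open(second_page_name(source), "wb") as out:
        writer.write(out)
    return True


def extract_all(folder: Path) -> None:
    """Process every PDF in folder, skipping files produced by an earlier run."""
    for source in sorted(folder.glob("*.pdf")):
        if source.name.startswith("2_"):
            continue
        if not extract_second_page(source):
            print(f"skipped {source.name}: fewer than two pages")
```

Calling extract_all(Path(r"C:\Users\ukil.yeom\Desktop\pdf_extracts")) would then write one 2_*.pdf file next to each original.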

Related

Python Pandas PDF/Web Scrape

I am trying to extract the top two PDFs from this page, under the current product range.
https://www.intermediary.natwest.com/intermediary-solutions/products.html
I have managed to create a function that uses Selenium to click the download links and download the two PDFs into a temporary location. However, I am struggling to find a viable way to read in the tables with minimal cleaning required.
Can anyone help with potential solutions for downloading these two PDF tables and exporting them to CSV files? I have tried pdfplumber, but it converts the data into a list of lists, which is a nightmare to clean. I have also tried PyPDF2, which is also very messy and needs hundreds of lines of code to clean the data. I would just like to find a good best-practice solution for reading the PDFs in as they are and converting them to CSV files.
Any help would be immensely appreciated.
:)
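
For the list-of-lists shape mentioned above, a minimal sketch using only the standard library is shown here. It assumes each table arrives as a list of rows whose cells are strings or None; the function name table_to_csv and the sample rows are hypothetical and not taken from the actual PDFs.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a list-of-lists table to CSV text, treating None cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Collapse line breaks and repeated spaces inside a cell; blank out missing cells.
        writer.writerow(
            [" ".join(str(cell).split()) if cell is not None else "" for cell in row]
        )
    return buffer.getvalue()


# Hypothetical sample data in the list-of-lists shape.
sample = [["Product", "Rate"], ["2 Year\nFixed", None]]
print(table_to_csv(sample))
```

Writing the returned text to a file gives one CSV per table; any heavier cleaning would still depend on how the tables are laid out in the source PDFs.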

How to lock a pdf selecting text not available

I need to convert a PDF document so that the user can't select any text.
I thought about overlaying the PDF with a transparent image. Is that possible in Python 3?
Maybe converting every page to a JPEG and then putting them all together as a PDF is a better idea?
Thanks
The PDF standard includes a set of access permissions that can be used (among other things) to restrict users from copying text and images to the clipboard or otherwise extracting text from the PDF. See https://helpx.adobe.com/acrobat/using/securing-pdfs-passwords.html for more information.

Python : Convert multiple images as multiple pages in pdf for windows

How can I convert multiple images (JPEG) into a single PDF file with multiple pages on Windows?
Using the Image library, I can convert each image to a single-page PDF and then merge those files into one PDF using pdfminer, but that is a two-step process.
I tried to download ImageMagick, but couldn't find a binary for Windows. Is it possible to achieve this using PIL?
I'm not totally sure, but you could create a report with JasperReports and then export it as a PDF. I believe Python can also work with JasperReports.
What do you think? Maybe it is too much work.
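
On the PIL question: Pillow (the maintained fork of PIL) can write a multi-page PDF in one step through the save_all and append_images options of Image.save. The sketch below assumes Pillow is installed and that sorting the files by name gives the desired page order; jpeg_paths and images_to_pdf are illustrative names.

```python
from pathlib import Path


def jpeg_paths(folder):
    """Return the .jpg/.jpeg files in folder, sorted by name to fix the page order."""
    return sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in {".jpg", ".jpeg"}
    )


def images_to_pdf(folder, pdf_path):
    """Write every JPEG in folder as one page of a single PDF."""
    # Assumption: Pillow is installed. Imported here so jpeg_paths() stays
    # usable without it.
    from PIL import Image

    pages = [Image.open(p).convert("RGB") for p in jpeg_paths(folder)]
    if not pages:
        raise ValueError(f"no JPEG files found in {folder}")
    # save_all plus append_images makes Pillow emit one PDF page per image.
    pages[0].save(pdf_path, format="PDF", save_all=True, append_images=pages[1:])
```

This avoids the intermediate single-page PDFs and the separate merge step.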

How can I extract the tables, text and the pictures in ODT(OpenDocumentText) format using Python?

How can I extract the tables, text and the pictures in an ODT(OpenDocumentText) file to output them to another ODT file using Python on Ubuntu?
OOoPy seems to be a good fit. I've never used it, but it comes with documentation and code examples, and it can read and write ODT files.
An easy way is to rename foo.odt to foo.zip and then extract it. The extracted directory contains many files, including a Pictures folder.
However, I think it's better to convert the file to docx and then extract that instead, because the images come out with better names (image1, image2, ...).
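
Because an ODT file is a ZIP archive, the rename-and-extract approach above can also be done from Python with the standard zipfile module, without renaming anything. This is a minimal sketch for the pictures only; extract_pictures is an illustrative name. The text and tables live in content.xml inside the same archive and would need XML parsing, which is not shown here.

```python
import zipfile
from pathlib import Path


def extract_pictures(odt_path, out_dir):
    """Copy every file under Pictures/ in the ODT archive into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(odt_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside Pictures/.
            if name.startswith("Pictures/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

For example, extract_pictures("foo.odt", "foo_pictures") would return the list of image paths it wrote.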

Manipulate and print to PDF files in a script

I have several PDF files of lecture slides. I want to print each PDF to another PDF with 6 slides per page, then merge all the resulting files into one big file, making sure that every original file starts on an odd page number and adding blank pages where necessary. (Edit: obviously, it will be printed in duplex.)
Is that possible?
Edit: For those interested, this is for printing a LOT of course material for an exam... And I need to do this for a lot of courses.
If it were me, I would use PDFjam or a similar tool to perform the 6-up on each of the source documents.
I would then use PyPDF to calculate the number of pages in each, add a blank page if necessary, and merge the rest of the pages. Something like:
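from pypdf import PdfReader, PdfWriter  # legacy PyPDF names: PdfFileReader, PdfFileWriter

blank_page = PdfReader('blank.pdf').pages[0]
dest = PdfWriter()
for source in sources:  # sources: list of paths to the 6-up PDFs
    pdf = PdfReader(source)
    for page in pdf.pages:  # add every page of this source
        dest.add_page(page)
    if len(pdf.pages) % 2:  # odd number of pages in source
        dest.add_page(blank_page)  # pad so the next file starts on an odd page
with open('merged.pdf', 'wb') as out:  # 'merged.pdf' is a placeholder name
    dest.write(out)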
It appears PyPDF also supports merging pages with resizing and relocation, so in theory it should also work for creating an n-up document, though I see no example code for that.
For putting multiple slides on one page, pdfnup from the PDFjam package is your friend.
For inserting the blank pages, I'm not sure; maybe you can convince pdfjam to do this as well. But can't you just turn off duplexing in the print settings?
