Folder of images to individual PDF reports - python

I have various folders containing images. Inside each folder the size of the images are the same, but the sizes vary slightly between the various folders.
I am trying to automate processing the folders by taking each image and placing it within a Letter template file (existing PDF file) then saving it before moving to the next image in the folder. The template file is just a simple header and footer document.
I have thought about using python to replace the image within a HTML file or as an overlay in the exiting PDF template, but not sure if my approach is correct or if there is a more simpler option.
I have already looked at:
pdfkit
wkhtmltopdf
ReportLab
Just looking for some suggestions at this point.

Related

Tesseract OCR Arabic Scanned PDF

Just wondering if anyone did manage to use Python script to utilize Tesseract OCR Engine to extract Arabic text from PDF scanned images without the process of converting the file into PNG images, if these lines of codes are available please do share it with me, also please do provide me which URL should be added into the PATH environment variables.
If this option is not available, then is it possible to batch process multiple images in Tesseract instead of one by one?
Thanks,
Medo Hamdani
Tried these as a general process of dealing with PDF Arabic files:
Use this line of code
python split-image.py pet.png 1 1
Althougth it automatically splitted every single image insdie the folder.
Some times you will have to add --load-large-images to load large images
# e.g. split_image("bridge.jpg", 2, 2, True, False)
# https://pypi.org/project/split-image/
# split_image(image_path, rows, cols, should_square, should_cleanup, [output_dir])
===================================================================================
For Tesseract use this line of code
tesseract 5.png outtput -l ara
#no need to add python in the front
To download for Windows
https://github.com/UB-Mannheim/tesseract/wiki
===================================================================================
Combine image
Ensure all the images are in one folder. For example 10 images
Use this line of code
Python combine-images.py
It will create an output.png file

How can I separate images into subfolders based on metadata in a CSV file (with python)?

I have a folder of 10,000 jpg images that all have a unique name. A CSV file has the name of all of these files (called image_id) as well as metadata. I want to put every image in a folder based on the value of the "dx" class in the CSV.
I am a beginner in python and everything I tried has not worked at all.

In plone, With collective.documentviewer, will space occupied be more if jpgs are uploaded or corresponding pdfs?

I have a plone site, where I upload jpgs(scanned copies) of pdf files. Will there be an impact on the space occupied when I use collective.documentviewer, if otherwise I had used pdfs itself? Using plone 4.3
not sure if i understand your question correct:
collective.documentviewer will require space for
the pdf itself
the jpeg version of each page
SearchableText index of the pdf content
(see this related answer)
uploading jpeg files, will only require space for the images.

text extract from scanned pdfs

My problem is that I have a bunch of PDF files and I want to convert them to text files. Some of them are pure PDFs while others have scanned pages inside. I am writing a program in python so I am using pdftotext to convert them to TXTs.
I am using the command below
filename = glob.glob(src) //src is my directory with my files
for file in filename:
subprocess.call(["pdftotext", file])
What I would like to ask is if there is a way to check for scanned pages before the conversion so that I can use ghostscript commands with pdftotext to manipulate them.
For now I have a treshold to check the size of the .txt file and if it is below that treshold I am using ghostscript commands to manipulate them.
The problem is that for big-sized files with 50 or 60 scanned out of 90 pages even with pdftotext the size of the file is always above the treshold.
A 'pure' PDF file can have images in it....
There's no easy way to tell whether a PDF file is a scanned page or not. Your best bet, I think, would be to analyse the page content streams to see if they consist of nothing but images (some scanners break up the single scanned page into multiple images). You could assume that they are scanned pages, in any event you won't get any text out of them with Ghostscript.
Another approach would be to use the pdf_info.ps program for Ghostscript and have it list fonts uses. No fonts == no text, though potentially there may be fonts present and still no text. Also I don't think this works on a page by page basis.

How to generate a PDF index page?

I have a python script that exports 772 pdfs and combines them into a multi-page pdf binder. While exporting each PDF, it also adds the name of the current pdf as an entry in a text file. After the whole binder is created, the text file has an entry for each PDF page in the same order as the PDF binder. I need to use this text file to create an index page at the beginning of the PDF, preferably linking to each page in the document.
If I have to do this task manually, I will (and I'm open to suggestions), but I hope to find a way to automate this.
Also, this doesn't have to be done in Python, but it would be nice to fit it in with my current script.
Thanks for the feedback,
Tanner
Poking around in the docs for arcpy.mapping, I can see that you weren't kidding about "it's limited".
Rather than adding new pages, have you considered adding bookmarks to the PDF?
And the only Python software I could dig up that can add bookmarks was pdfrecylce. It's in version 0.05, so I'm gonna go out on a limb and guess it might not be too stable.
If you're willing to use Java or C# there's iText and iTextSharp (but I'm biased). There are quite a few other PDF libraries floating around capable of manipulating existing PDFs... pick a language and start googling.
PDFsam will merge PDFs and create an index with links based on each individual PDF file name or title.
I initially downloaded PDFsam Basic because it will auto organize the PDFs to be merged in order of folder structure instead of only alphabetically. To add multiple PDFs from various folders I go to a directory, search "." to locate and select all the PDFs to add. I think the PDFsam Enhanced allows you to simply drag and drop an entire folder directory. Highly recommend.

Categories