I have 100 PDF files in A0 format in a folder, and I'd like to scale them down to A3.
I have tried various methods but none of them worked. My goal is to create a script (preferably in Python) to scale all the files to A3.
I tried to use PIL, but my version of Python doesn't support it.
You can write commands in the Windows command processor to perform these functions:
- Merge PDF files together, or split them apart
- Encrypt and decrypt
- Scale, crop and rotate pages
- Read and set document info and metadata
- Copy, add or remove bookmarks
- Stamp logos, text, dates, page numbers
- Add or remove attachments
- Losslessly compress PDF files
It also works on other operating systems, such as macOS, Ubuntu, and other Linux distributions.
Scale command example:
cpdf -scale-to-fit a3portrait in.pdf -o out.pdf
This tool is the Coherent PDF Command Line Tools (cpdf); you can download it from the project's website.
You can also use a Python script to execute the command for you in the command processor / terminal, as in the sketch below.
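For example, here is a minimal sketch that loops over every PDF in a folder and shells out to the cpdf command shown above (the "pdfs" and "scaled" directory names are placeholders):

import subprocess
from pathlib import Path

# Scale every PDF in "pdfs" from A0 to fit A3 and write the results to "scaled".
src = Path("pdfs")
dst = Path("scaled")
dst.mkdir(exist_ok=True)

for pdf in src.glob("*.pdf"):
    subprocess.run(
        ["cpdf", "-scale-to-fit", "a3portrait", str(pdf), "-o", str(dst / pdf.name)],
        check=True,  # raise if cpdf reports an error
    )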
Hope my post helped you, good luck!
You could try the Linux pdftk program (although it might even be available on Windows, I am not sure). Note, though, that the command below is actually ImageMagick's convert rather than pdftk; once installed, you can use it to convert an A0 PDF to an A3 PDF:
convert -page a0 infile.pdf -page a3 outfile.pdf
Currently, I'm working on a python3 script that helps me sort the Google Photos takeout files. The Takeout service for Google Photos actually strips all the metadata of an image/video into a separate JSON file.
This script helps me merge the timestamp present in the JSON file back into the corresponding photo or video. To achieve this, I'm currently using ExifTool by Phil Harvey, a Perl executable, which I call in a subprocess to edit the date tags in the EXIF/metadata.
This process is quite heavy and takes a large amount of time. Then I realised that most of my photos are JPGs and my videos are MP4s; it is very easy to edit the EXIF data of JPG files in Python using existing libraries, and for the smaller proportion of other formats like PNG I can keep using ExifTool.
This has drastically improved the runtime of my script. Now I want to know: is there any way to edit the creation dates of MP4 files natively in Python, which could in theory be faster than the subprocess approach?
Please help! Thanks in advance.
I'm not too familiar with it, but ffmpeg seems like an option for just the MP4s.
import shlex
import subprocess

# Copy the audio/video streams untouched and rewrite the container metadata.
# (For MP4 files, ffmpeg's standard key for this is creation_time rather than timestamp.)
cmd = 'ffmpeg -i "file.mp4" -codec copy -metadata timestamp="new_time_here" "output.mp4"'
subprocess.call(shlex.split(cmd))
modified from:
https://www.reddit.com/r/learnpython/comments/3yotj2/comment/cyfiyb7/?utm_source=share&utm_medium=web2x&context=3
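If you want to stay purely in Python, one option is to patch the timestamps stored in the MP4 movie header (mvhd) box directly. This is only a minimal sketch under some assumptions: it patches the first mvhd match in the file (which holds for typical single-movie MP4s, but a byte search can in principle hit media data), and set_mp4_creation_time is a hypothetical helper name:

import struct
from datetime import datetime, timezone

# Seconds between the MP4 epoch (1904-01-01) and the Unix epoch (1970-01-01).
MP4_EPOCH_OFFSET = 2082844800

def set_mp4_creation_time(path, when):
    # mvhd stores times as seconds since 1904-01-01 UTC.
    ts = int(when.replace(tzinfo=timezone.utc).timestamp()) + MP4_EPOCH_OFFSET
    with open(path, "r+b") as f:
        data = f.read()
        i = data.find(b"mvhd")  # assume the first match is the movie header box
        if i < 0:
            raise ValueError("no mvhd box found")
        version = data[i + 4]   # 1 version byte and 3 flag bytes follow the type
        f.seek(i + 8)           # creation/modification times start here
        if version == 0:
            f.write(struct.pack(">II", ts, ts))  # 32-bit timestamps
        else:
            f.write(struct.pack(">QQ", ts, ts))  # version 1 uses 64-bit

set_mp4_creation_time("output.mp4", datetime(2016, 1, 1, 12, 0, 0))

Be aware that other boxes (tkhd, mdhd) carry their own timestamps, so different tools may still show the old dates.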
I need to remove all text information from a PDF file. The file I want to get should be like a scan: only images wrapped as a PDF, no text that you can copy or select. Right now I'm using this Ghostscript command:
import os
...
# -dFILTERTEXT tells the pdfwrite device to drop all text drawing
# operations while rewriting the PDF.
os.system(f"gs -o {output_path} -sDEVICE=pdfwrite -dFILTERTEXT {input_path}")
Unfortunately, with some documents it removes not only the text layer but the actual pixels of the characters! Presumably in those files the characters are real text objects rather than scanned images, so filtering the text removes them completely. Sometimes I cannot see any text on the page at all, which is not what I need.
Is there a stable and fast solution in Python or with pip utilities? It would be wonderful if I could solve this with PyMuPDF (fitz), but I couldn't find anything about it.
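For what it's worth, one PyMuPDF approach that behaves like a scan is to rasterize each page and wrap the images back into a fresh PDF. A minimal sketch, assuming a reasonably recent PyMuPDF (the dpi argument to get_pixmap needs 1.19.2+) and placeholder filenames:

import fitz  # PyMuPDF

src = fitz.open("input.pdf")
out = fitz.open()  # new, empty PDF
for page in src:
    pix = page.get_pixmap(dpi=200)  # render the page to a raster image
    new_page = out.new_page(width=page.rect.width, height=page.rect.height)
    new_page.insert_image(new_page.rect, pixmap=pix)  # wrap the image as the page
out.save("output.pdf")

This guarantees no selectable text remains, at the cost of larger files and losing vector quality.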
I would like to extract text, including tables, from a PDF file.
I tried Camelot, but it can only get table data, not text.
I also tried PyPDF2, however it can't read Chinese characters.
Here is the PDF sample to read.
Are there any recommended text-extraction Python packages?
By far the simplest way is to extract the text with one shell command using the Poppler PDF utilities (often bundled with Python libraries), then modify that output in your Python script as required.
>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
NOTE: some of the text is embedded to the right of the logo image; that part can be extracted separately by rendering with pdftoppm -png or pdfimages, then passing those smaller areas to an OCR tool (expect lower output quality).
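A minimal sketch of driving that same command from Python (the filename is a placeholder; "-" tells pdftotext to write to stdout):

import subprocess

# Extract page 1 with layout preserved, as UTF-8 text.
result = subprocess.run(
    ["pdftotext", "-layout", "-f", "1", "-l", "1", "-enc", "UTF-8",
     "sample.pdf", "-"],
    capture_output=True,
    check=True,
)
text = result.stdout.decode("utf-8")
print(text)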
I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from many .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊ݢቹៜϐѦჾѱ॥ᓀϩӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought maybe it was a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to ship with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Adobe Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.
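That print-to-image-then-OCR workaround can at least be automated from Python. A minimal sketch, assuming the tesseract binary with Chinese language data and Poppler are installed; the filename and the chi_tra language code are assumptions (use chi_sim for simplified Chinese):

import pytesseract                       # pip install pytesseract
from pdf2image import convert_from_path  # pip install pdf2image

# Render each PDF page to an image, then OCR it as traditional Chinese.
pages = convert_from_path("sample.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(p, lang="chi_tra") for p in pages)
print(text)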
I want to get the content (text only) of a .ppt file. How do I do that?
(By analogy: if I want to get the content of a .txt file, I just need to open and read it. What do I need to do to get information out of .ppt files?)
By the way, I know there is win32com on Windows systems, but I am working on Linux; is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc.), and the Apache Tika project (which appears to be the 5,000lb gorilla in this solution space).
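If you'd rather stay in Python, python-pptx can read the text directly; note it only handles the newer .pptx format, so a legacy .ppt would need converting first (e.g. with LibreOffice headless). A minimal sketch with a placeholder filename:

from pptx import Presentation  # pip install python-pptx

prs = Presentation("slides.pptx")
for slide in prs.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:          # skip pictures, charts, etc.
            print(shape.text_frame.text)  # all paragraphs in this shape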