Convert multiple multipage PDFs to JPGs in subfolders

Simple use case:
A folder with many (mostly multipage) PDF files.
A script should convert each PDF page to JPG and store it in a subfolder named after the PDF filename. (e.g. #33.pdf to folder #33)
Single JPG files should also have this filename plus a counter mirroring the sequential page number in the PDF. (e.g. #33_001.jpg)
I found a bunch of related questions, but nothing that quite does what I want, e.g.
How do I convert multiple PDFs into images from the same folder in Python?
A python script would work fine, but also any other way to do this in Win10 (imagemagick, e.g.) is cool with me.

Your comment asks how a batch file can do what is required. For simplicity, the following processes only a single file, so Python will need to loop through a folder and call the batch file with each name in turn. That could also be done by adding a "for" loop to the batch file, but first see where problems arise, as many of my single test files threw differing errors.
I have tried to cover several failure cases in this batch file on my system, but there can still be issues, such as a file that has no valid fonts to display.
For the most recent Poppler Windows 64-bit utilities, see https://github.com/oschwartz10612/poppler-windows/releases/. For 32-bit, use the latest Xpdf version from http://www.xpdfreader.com/download.html; that ships pdftopng.exe directly, so the batch file needs a few edits.
pdf2dir.bat
@echo off
set "bin=C:\Apps\PDF\poppler\22.11.0\Library\bin"
set "res=200"
REM for type use one of 3 i.e. png jpeg jpegcmyk (PNG is best for documents)
set "type=png"
if exist "%~dpn1\*.%type%" echo: &echo Files already exist in "%~dpn1" skipping overwrite&goto pause
if not exist "%~dpn1.pdf" echo: &echo "%~dpn0" File "%~dpn1.pdf" not found&goto pause
if not exist "%~dpn1\*.*" md "%~dpn1"
REM following line deliberately opens folder to show progress delete it or prefix with REM for blind running
explorer "%~dpn1"
"%bin%\pdftoppm.exe" -%type% -r %res% "%~dpn1.pdf" "%~dpn1\%~n1"
if %errorlevel%==1 echo: &echo Build of %type% files failed&goto pause
if not exist "%~dpn1\*.%type%" echo: &echo Build of %type% files failed&goto pause
:pause
echo:
pause
:end
It requires the path to the Poppler binaries (the folder containing pdftoppm) to be set correctly in the second line.
It can be placed wherever desired, e.g. a work folder or the desktop.
Dragging and dropping one PDF onto it should work without needing to run it in a console.
It can be run in a command console: type its name followed by a space, then drag and drop a single filename; any name containing spaces must be "double quoted".
It can be run from any shell or OS command as "path to/batchfile.bat" "c:\path to\file.pdf".
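The folder loop mentioned above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested replacement for the batch file: it assumes pdftoppm is reachable as "pdftoppm" (or that you pass its full path), it uses the -jpeg option because the question asks for JPGs, and the function names plan_conversion, rename_pages and convert_folder are made up for illustration. pdftoppm names its output "<prefix>-<page>.jpg", so a rename step turns that into the "#33_001.jpg" style from the question.

```python
import subprocess
from pathlib import Path


def plan_conversion(pdf_path, pdftoppm="pdftoppm", dpi=200):
    """Return (output_dir, command) for one PDF without running anything."""
    pdf_path = Path(pdf_path)
    out_dir = pdf_path.with_suffix("")   # "#33.pdf" -> folder "#33"
    prefix = out_dir / pdf_path.stem     # pdftoppm appends "-<page>.jpg"
    command = [str(pdftoppm), "-jpeg", "-r", str(dpi),
               str(pdf_path), str(prefix)]
    return out_dir, command


def rename_pages(out_dir, stem):
    """Rename "<stem>-<n>.jpg" files to "<stem>_<nnn>.jpg"."""
    for image in sorted(Path(out_dir).glob("*.jpg")):
        page = image.stem.rsplit("-", 1)[-1]
        if page.isdigit():               # skip files that are already renamed
            image.rename(image.with_name(f"{stem}_{int(page):03d}.jpg"))


def convert_folder(folder, pdftoppm="pdftoppm", dpi=200):
    """Convert every PDF in `folder` into a subfolder named after the PDF."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out_dir, command = plan_conversion(pdf, pdftoppm, dpi)
        out_dir.mkdir(exist_ok=True)
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Conversion failed for {pdf.name}")
            continue
        rename_pages(out_dir, pdf.stem)
```

Calling convert_folder(r"C:\path to\pdfs") would then process the whole folder; files that pdftoppm cannot convert are reported and skipped rather than stopping the run.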

Related

Unknown problem in Python. Printing does not work and files are not saved correctly

When I try to write something, such as variables, the code is renamed to the file name on the computer.
For example, if I write:
a = 20
f = 15
print(a+f)
then the code file will automatically be renamed to the first line, i.e. "a = 20"
Then, when I try to run the code, the program outputs nothing but "Python" and some incomprehensible words.
What could it be related to?
I installed the latest version of Visual Studio Code with Python. Both are fresh installs, so there should be no problems, but this time something went wrong.
After reinstalling the program, the problem remains.

First of all, unless you have a special requirement, please do not use Code Runner to run the script; the official Python extension is a better choice.
In addition, the dot on your file tab means that you have not saved the file. You can add the following setting to enable automatic saving:
"files.autoSave": "afterDelay",
You may have created the file via File --> New File... --> Python File. At that point the file has been neither named nor saved, which is why it does not appear in the Explorer file list.
That is why the file tab shows the first line of code. This is a feature of VS Code; you can refer to this link. And because the file has not been saved, executing the script causes problems.
You can rename the script file directly (F2), or VS Code will prompt you to name the file when saving. Another way to create a file is to right-click, choose New File..., and enter a filename ending with the .py extension.

Saving retrieved documents from API

I am currently trying to download a cif file from materialsproject.org which is only possible via an API. They told me to use Mybinder.org to run their code:
from mp_api.client import MPRester
from pymatgen.analysis.diffraction.xrd import XRDCalculator
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
with MPRester(api_key='8dI6UZHs3Nc9lxTp75RrJcPdwvPn6jZb') as mpr:
    # first retrieve the relevant structure
    structure = mpr.get_structure_by_material_id('mp-980949')
# important to use the conventional structure to ensure
# that peaks are labelled with the conventional Miller indices
sga = SpacegroupAnalyzer(structure)
conventional_structure = sga.get_conventional_standard_structure()
# this example shows how to obtain an XRD diffraction pattern
# these patterns are calculated on-the-fly from the structure
calculator = XRDCalculator(wavelength='CuKa')
pattern = calculator.get_pattern(conventional_structure)
When I run the code it tells me "Retrieving MaterialsDoc documents: 100%". How do I go on from here? I assume it has only retrieved the document and not yet downloaded it onto my PC. I have exactly zero knowledge about programming and APIs. It also doesn't need to download as a cif file; a simple .txt file would also help, as I could create my own cif file from that.
I tried running the code from my PC with Python, but nothing really happens. After using Google to find ways to download data retrieved from APIs, I copied some code that others used for that, but it also didn't work.

The little you've provided (which may be all they told you?) assumes prior knowledge of or familiarity with Python and/or MyBinder. If they didn't provide you that then perhaps they directed you to other materials and resources on their sites?
Here's how you'd accomplish this, I think, without the authors helping further:
Click here to start up a temporary session that has all the necessary dependencies needed to run your provided code already installed.
Start a new notebook and enter your code. (Alternatively to have that code and further steps already be in a notebook, in the new notebook cell enter !curl -OL https://gist.githubusercontent.com/fomightez/8ef2f588965fbc10f19db79ee0035094/raw/778f399652a08a1f4be70400c5592c59e59cd876/demo_use_mp_api_with_MaterialsProject.ipynb. Give it a few seconds to fetch that notebook file & then double-click that demo_use_mp_api_with_MaterialsProject.ipynb file that should now appear in the file browser panel on the left. Open that notebook and then select from the 'File menu' area > 'Run' > 'Run All Cells' and look there for the rest of what I write about here.)
I noted that it didn't generate any files even though it said 'Retrieving MaterialsDoc documents'. So the retrieved contents must be among the Python objects now active, it seems.
To see what those are, I looked at the assigned variables, entered each one as the last line of a Jupyter notebook cell, and ran those cells. For example, I entered the following in a cell:
structure
Ran that cell and then I saw:
Structure Summary
Lattice
abc : 7.086166087833533 7.086166087833533 8.514034991951231
angles : 54.50319138768853 54.50319138768853 54.77885193228324
volume : 264.22028210282747
A : 3.259891 6.291809 0.0
B : -3.259891 6.291809 0.0
C : 0.0 5.567899 6.441063
pbc : True True True
PeriodicSite: W (1.9253, 6.2918, 0.0000) [0.7953, 0.2047, 0.0000]
PeriodicSite: W (-1.9253, 6.2918, 0.0000) [0.2047, 0.7953, 0.0000]
PeriodicSite: Cl (0.0000, 13.1912, 5.4807) [0.6718, 0.6718, 0.8509]
PeriodicSite: Cl (0.0000, 7.1635, 5.2908) [0.2058, 0.2058, 0.8214]
PeriodicSite: Cl (1.7149, 10.5162, 4.5739) [0.7845, 0.2585, 0.7101]
PeriodicSite: Cl (-1.7149, 10.5162, 4.5739) [0.2585, 0.7845, 0.7101]
PeriodicSite: Cl (1.7149, 7.6353, 1.8672) [0.7415, 0.2155, 0.2899]
PeriodicSite: Cl (-1.7149, 7.6353, 1.8672) [0.2155, 0.7415, 0.2899]
PeriodicSite: Cl (0.0000, 10.9880, 1.1503) [0.7942, 0.7942, 0.1786]
PeriodicSite: Cl (0.0000, 4.9603, 0.9603) [0.3282, 0.3282, 0.1491]
I also did that with pattern and conventional_structure.
(Note that the last line in a Jupyter notebook cell is special: the REPL evaluates whatever is on that line and displays the corresponding output.)
You can select and copy what the output shows for those, then make files on your local machine by pasting the clipboard into your favorite code editor. Maybe that is sufficient for your needs? If you read on, I describe how you can send the values assigned to the variables to text files that you can download without copying and pasting.
If you want to make text files from the contents assigned to each variable, you can use the Jupyter/IPython %store magic to send the values of the variables to a text file. I'll use pattern as the example this time. Enter the following in a cell to save the value of pattern to a text file.
%store pattern >pattern.txt
You'll see pattern.txt show up in the file browser a few seconds later after it automatically updates. You can run ls in a cell if you don't want to wait. It should be listed among the files present in the current working directory.
Anything useful that you make needs to be downloaded from the temporary session before it times out. So if you made the text files using the %store step above, you'll want to download them from the remote session to your local machine. If you did this in a notebook, you may want to download that as well. For example, that's how I got this, which I suggested you could optionally fetch and run in your session in the notebook step above.
To download files showing in the file browser on the left, locate them in the list, right-click each one, and select 'Download' from the menu that comes up. You'll be prompted for where to save the files on your local machine.
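If you would rather not rely on the %store magic, plain Python can produce a similar text file. The following is only a sketch: save_as_text is a hypothetical helper name, and it simply writes whatever str() returns for the object, which for structure should be the same summary text shown in the cell output above.

```python
from pathlib import Path


def save_as_text(obj, filename):
    """Write the printable form of any Python object to a UTF-8 text file."""
    path = Path(filename)
    path.write_text(str(obj) + "\n", encoding="utf-8")
    return path


# In the notebook, after the code from the question has run, you could call:
# save_as_text(structure, "structure.txt")
# save_as_text(pattern, "pattern.txt")
```

The resulting files appear in the file browser and can be downloaded the same way as described above. I believe pymatgen structures can also write CIF files directly, but check the pymatgen documentation for the exact call rather than taking my word for it.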

ValueError: need more than 0 values to unpack (Python 2)

I am trying to replicate another researcher's findings by using the Python file that he added as a supplement to his paper. It is the first time I am diving into Python, so the error might be extremely simple to fix, yet after two days I still haven't managed it. For context, in the Readme file there's the following instruction:
"To run the script, make sure Python2 is installed. Put all files into one folder designated as “cf_dir”."
In the script I get an error at the following lines:
if __name__ == '__main__':
    cf_dir, cf_file, cf_phys_file = sys.argv[1:4]
    os.chdir(cf_dir)
    cf = pd.read_csv(cf_file)
    cf_phys = pd.read_csv(cf_phys_file)
ValueError: need more than 0 values to unpack
The "cf_file" and "cf_phys_file" are the two major files among all the files in the folder named "cf_dir". The "cf_phys_file" relates only to two survey questions (Q22 and Q23), and the "cf_file" includes all other questions, 1-21. It seems that the code is meant to retrieve those two files from the directory? Only for the "cf_phys_file" the columns 1:4 are needed. The current working directory is already set to the right location.
The path where I located "cf_dir" is as follows:
C:\Users\Marc-Marijn Ossel\Documents\RSM\Thesis\Data\Suitable for ML\Data en Artikelen\Per task Suitability for Machine Learning score readme\cf_dir
Alternative option in readme file,
In the readme file there's this option, but also here I cannot understand how to direct the path to the right location:
"Run the following command in an open terminal (substituting for file names
below): python cfProcessor_AEAPnP.py cf_dir cf_file cf_phys_file task_file jobTaskRatingFile
jobDataFile OESfile
This should generate the data and plots as necessary."
When I run that in "Command Prompt", I get the following error, and I am not sure how to set the working directory correctly.
- python: can't open file 'cfProcessor_AEAPnP.py': [Errno 2] No such file or directory
Thanks for the reading, and I hope there's someone who could help me!
Best regards & stay safe out there during Corona!!
Marc

cf_dir, cf_file, cf_phys_file = sys.argv[1:4]
means that the Python file expects a few arguments when it is called. Without them, sys.argv[1:4] is an empty list, hence "need more than 0 values to unpack".
In order to run
python cfProcessor_AEAPnP.py cf_dir cf_file cf_phys_file task_file jobTaskRatingFile jobDataFile OESfile
the command prompt must be in the folder that contains the script.
So, open a command prompt and type
cd path_to_the_folder_where_ur_python_file_is_located
Now you are in the folder of the Python file.
Also, make sure you give the full path, in double quotes, for each argument.
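To make the failure easier to see, here is a small sketch of the same unpacking with an explicit check in front of it. It is not part of the researcher's script; parse_args is a hypothetical helper, and the usage text is just the command from the readme.

```python
import sys


def parse_args(argv):
    """Return (cf_dir, cf_file, cf_phys_file) or exit with a readable message."""
    args = argv[1:4]  # argv[0] is the script name itself
    if len(args) < 3:
        raise SystemExit(
            "usage: python cfProcessor_AEAPnP.py cf_dir cf_file cf_phys_file ..."
        )
    cf_dir, cf_file, cf_phys_file = args
    return cf_dir, cf_file, cf_phys_file


# In the script this would be called as: parse_args(sys.argv)
```

Running the script with no arguments then prints the usage line instead of the ValueError, which makes it clearer that the three names have to be supplied on the command line.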

In Python 3 on Windows, how can I set NTFS compression on a file? Nothing I've googled has gotten me even close to an answer

(Background: On an NTFS partition, files and/or folders can be set to "compressed", like it's a file attribute. They'll show up in blue in Windows Explorer, and will take up less disk space than they normally would. They can be accessed by any program normally, compression/decompression is handled transparently by the OS - this is not a .zip file. In Windows, setting a file to compressed can be done from a command line with the "Compact" command.)
Let's say I've created a file called "testfile.txt", put some data in it, and closed it. Now, I want to set it to be NTFS compressed. Yes, I could shell out and run Compact, but is there a way to do it directly in Python code instead?

In the end, I ended up cheating a bit and simply shelling out to the command line Compact utility. Here is the function I ended up writing. Errors are ignored, and it returns the output text from the Compact command, if any.
def ntfscompress(filename):
    import subprocess
    _compactcommand = 'Compact.exe /C /I /A "{}"'.format(filename)
    try:
        _result = subprocess.run(_compactcommand, timeout=86400,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT, text=True)
        return(_result.stdout)
    except:
        return('')
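A possible variation, offered only as a sketch: passing the command as a list of arguments lets subprocess handle the quoting, so a filename that itself contains quote characters does not break the command string. The names compact_command and ntfs_compress are my own, the switches are the ones used above, and I have not tried this on every Windows version.

```python
import subprocess


def compact_command(filename):
    """Build the argument list for Compact with the same switches as above."""
    return ["compact.exe", "/C", "/I", "/A", str(filename)]


def ntfs_compress(filename, timeout=86400):
    """Run Compact on one file; return its output text, or '' on failure."""
    try:
        result = subprocess.run(compact_command(filename), timeout=timeout,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
    except (OSError, subprocess.SubprocessError):
        # Covers a missing compact.exe as well as a timeout.
        return ""
    return result.stdout
```

Catching only OSError and subprocess errors, instead of using a bare except, still ignores the failures that matter here but no longer swallows things like KeyboardInterrupt.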

Python for Data Analysis, Chapter 2, first example

I'm following along with the examples in a translated version of Wes McKinney's "Python for Data Analysis" and I got stuck on the first example of Chapter 2.
I think my problem arose because I saved the data file in the wrong path. Is that right?
I stored the file, usagov_bitly_data2012-03-16-1331923249.txt, in C:\Users\HRR
and also stored the folder, pydata-book-mater, which can be downloaded from http://github.com/pydata-book, in C:\Users\HRR\Anaconda2\Library\bin.

Depends.
You could change the location where you save your file, or edit the path you give to your code in line 10. Since you're using relative paths, I guess your script runs in C:\Users\HRR\Anaconda2\Library\bin, which means you have to go back up to C:\Users\HRR or use an absolute path. You could also move the file, but you don't want to move a file every time you want to open it (like moving Word files into the MS Office folder to open them), so try changing the path.
And always try harder ;)

In Python, open() resolves a relative path against the current working directory unless it is given a full path (on Linux that starts with /, on Windows with a drive letter such as C:\). In your case the command means: open the folder ch02 in the directory the script is running from, and then open usagov_bitly_data2012-03-16-1331923249.txt in that folder.
Since you are storing the text file at C:\Users\HRR\usagov_bitly_data2012-03-16-1331923249.txt and did not specify the directory of the script, I recommend the following command instead: open("C:\\Users\\HRR\\usagov_bitly_data2012-03-16-1331923249.txt")
Note: the double \ escapes the backslash, so that sequences such as \t and \n are not read as tabs and newlines in the path.
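As a small illustration of the escaping point, the sketch below writes the same path three ways and checks that they agree. It uses PureWindowsPath so that it behaves the same on any operating system; nothing here opens the file, and the variable names are only for the example.

```python
import os
from pathlib import PureWindowsPath

# Doubled backslashes, a raw string, and pathlib joining give the same path.
escaped = "C:\\Users\\HRR\\usagov_bitly_data2012-03-16-1331923249.txt"
raw = r"C:\Users\HRR\usagov_bitly_data2012-03-16-1331923249.txt"
joined = PureWindowsPath("C:/Users/HRR") / "usagov_bitly_data2012-03-16-1331923249.txt"

print(escaped == raw == str(joined))  # prints True

# Relative paths are resolved against this directory:
print(os.getcwd())
```

Printing os.getcwd() at the top of the book's example is a quick way to find out which folder a relative path such as the ch02 one is actually being resolved from.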
