Walk directories and create pdf from images - python

I have a folder Alpha which contains a series of folders named Beta1, Beta2, ..., Beta 397. Each of the Beta folders contains a variable number of alphanumerically numbered images of different file formats.
My goal is to run a script that crawls all these Beta folders, selectively chooses images of jpeg/png format, and merges them to a pdf (per Beta folder) after name-sort.
My code is stored alongside the Beta folders and reads:-
import glob
import re
import img2pdf
import os
_nsre = re.compile('([0-9]+)')
def natural_sort_key(s):
return [int(text) if text.isdigit() else text.lower()
for text in re.split(_nsre, s)]
for X in range(1, 397):
dirname = os.path.join('./','BetaX', '')
output = os.path.join('./','BetaX', '/output.pdf')
# Get all the filenames per image format
filenames1 = [f for f in glob.iglob(f'{dirname}*.jpg')]
filenames2 = [f for f in glob.iglob(f'{dirname}*.png')]
# Merges the 2 lists
filenames3 = filenames1 + filenames2
# Sort the list alphanumerically
filenames3.sort(key=natural_sort_key)
# Print to pdf
with open(output,"wb") as f:
f.write(img2pdf.convert(filenames3))
print(f'Finished converting {output}')
filenames1.clear()
filenames2.clear()
filenames3.clear()
If I remove the for loop line and type the value of X, the pdf is outputted without any fuss, on an individual-folder basis. However, I am looking for ways to treat X as a loop-variable from the range and batch-process all the folders at once.

The way your code is currently:
for X in range(1, 397):
dirname = os.path.join('./','BetaX', '')
output = os.path.join('./','BetaX', '/output.pdf')
The X is just a character in the string BetaX. You need to cause X to be treated like an integer value and then you need to concatenate that value onto Beta to come up with your full folder name.
Also, you don't want the slashes in what you're passing to os.path.join. The point of the join call is to hide the details of the path separator character. The value of output will be just /output.pdf with what you have, as the third parameter will be considered an absolute path because of the slash at the front of it.
Here's that part of your code with both of these issues addressed:
for X in range(1, 397):
dirname = os.path.join('.','Beta' + str(X), '')
output = os.path.join('.','Beta' + str(X), 'output.pdf')

Related

Save part of a file name in seperate panda df columns

I would like to save different positions of a file name in different panda df columns.
For example my file names look like this:
001015io.png
position 0-2 in column 'y position' in this case '001'
position 3-5 in column 'x position' in this case '015'
position 6-7 in column 'status' in this case 'io'
My folder contains about 400 of these picture files. I'm a beginner in programming, so I don't know how I should start to solve this.
If the parts of the file names that you need are consistent (same position and length in all files), you can use string slicing to create new columns from the pieces of the file name like this:
import pandas as pd
df = pd.DataFrame({'file_name': ['001015io.png']})
df['y position'] = df['file_name'].str[0:3]
df['x position'] = df['file_name'].str[3:6]
df['status'] = df['file_name'].str[6:8]
This results in the dataframe:
file_name y position x position status
0 001015io.png 001 015 io
Note that when you slice a string you give a start position and a stop position like [0:3]. The start position is inclusive, but the stop position is not, so [0:3] gives you the substring from 0-2.
You can do this with slicing. A string is basically a list of character, so you can slice this string into the parts you need. See the example below.
filename = '001015io.png'
x = filename[0:3]
y = filename[3:6]
status = filename[6:8]
print(x, y, status)
output
001 015 io
As for getting the list of files, there's an absurdly complete answer for that here.
I have this function below in my personal library which I reuse whenever I need to generate a list of files.
def get_files_from_path(path: str = ".", ext=None) -> list:
"""Find files in path and return them as a list.
Gets all files in folders and subfolders
See the answer on the link below for a ridiculously
complete answer for this. I tend to use this one.
note that it also goes into subdirs of the path
https://stackoverflow.com/a/41447012/9267296
Args:
path (str, optional): Which path to start on.
Defaults to '.'.
ext (str/list, optional): Optional file extention.
Defaults to None.
Returns:
list: list of full file paths
"""
result = []
for subdir, dirs, files in os.walk(path):
for fname in files:
filepath = f"{subdir}{os.sep}{fname}"
if ext == None:
result.append(filepath)
elif type(ext) == str and fname.lower().endswith(ext.lower()):
result.append(filepath)
elif type(ext) == list:
for item in ext:
if fname.lower().endswith(item.lower()):
result.append(filepath)
return result
There's one thing you need to take into account here, this function will give the full filepath, fe: path/to/file/001015io.png
You can use the code below to get just the filename:
import os
print(os.path.basename('path/to/file/001015io.png'))
ouput
001015io.png
Use what Bill the Lizard said to turn it into a df

How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently create folder. I am using a program that will output a new run## folder each time it is executed (i.e run01, run02, run03 and so on). Within any one run## folder resides a data file that I want analyze (file-i-want.txt).
folder_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, '..\data\directory',run_numb,'file-i-want.txt')
In short I want to skip having to hardcode in run## and just get the directory of a file within the most recently created run## folder.
You can get the creation date with os.stat
path = '/a/b/c'
#newest
newest = max([f for f in os.listdir(path)], key=lambda x: os.stat(os.path.join(path,x)).st_birthtime)
# all files sorted
sorted_files = sorted([f for f in os.listdir(path)],key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)
pathlib is the recommeded over os for filesystem related tasks.
reference
You can try:
filepath = Path(__file__).parent / 'data/directory'
fnames = sorted(list(Path(filepath).rglob('file-i-want.txt')), key=lambda x: Path.stat(x).st_mtime, reverse=True)
filepath = str(fnames[0])
filepath
glob.glob('run*') will return the list of files/directories that match the pattern ordered by name.
so if you want the latest run your code will be:
import glob
print(glob.glob('run*')[-1]) # raises index error if there are no runs
IMPORTANT, the files are ordered by name, in this case, for example, 'run21' will come AFTER 'run100', so you will need to use a high enough number of digits to not see this error. or just count the number of matched files and recreate the name of the folder with this number.
you can use glob to check the number of files with the same name pattern:
import glob
n = len(glob.glob('run*')) # number of files which name starts with 'run'
new_run_name = 'run' + str(n)
Note: with this code the file names starts from 0, if you want to start from 1 just add 1 to n.
if you want always double digit run number (00, 01, 02) instead of 'str(n)' use 'str(n).zfill(2)'
example:
import glob
n = len(glob.glob('run*')) # number of files which name starts with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)

Finding all subfolders that contain two files that end with certain strings

So I have a folder, say D:\Tree, that contains only subfolders (names may contain spaces). These subfolders contain a few files - and they may contain files of the form "D:\Tree\SubfolderName\SubfolderName_One.txt" and "D:\Tree\SubfolderName\SubfolderName_Two.txt" (in other words, the subfolder may contain both of them, one, or neither). I need to find every occurence where a subfolder contains both of these files, and send their absolute paths to a text file (in a format explained in the following example). Consider these three subfolders in D:\Tree:
D:\Tree\Grass contains Grass_One.txt and Grass_Two.txt
D:\Tree\Leaf contains Leaf_One.txt
D:\Tree\Branch contains Branch_One.txt and Branch_Two.txt
Given this structure and the problem mentioned above, I'd to like to be able to write the following lines in myfile.txt:
D:\Tree\Grass\Grass_One.txt D:\Tree\Grass\Grass_Two.txt
D:\Tree\Branch\Branch_One.txt D:\Tree\Branch\Branch_Two.txt
How might this be done? Thanks in advance for any help!
Note: It is very important that "file_One.txt" comes before "file_Two.txt" in myfile.txt
import os
folderPath = r'Your Folder Path'
for (dirPath, allDirNames, allFileNames) in os.walk(folderPath):
for fileName in allFileNames:
if fileName.endswith("One.txt") or fileName.endswith("Two.txt") :
print (os.path.join(dirPath, fileName))
# Or do your task as writing in file as per your need
Hope this helps....
Here is a recursive solution
def findFiles(writable, current_path, ending1, ending2):
'''
:param writable: file to write output to
:param current_path: current path of recursive traversal of sub folders
:param postfix: the postfix which needs to match before
:return: None
'''
# check if current path is a folder or not
try:
flist = os.listdir(current_path)
except NotADirectoryError:
return
# stores files which match given endings
ending1_files = []
ending2_files = []
for dirname in flist:
if dirname.endswith(ending1):
ending1_files.append(dirname)
elif dirname.endswith(ending2):
ending2_files.append(dirname)
findFiles(writable, current_path+ '/' + dirname, ending1, ending2)
# see if exactly 2 files have matching the endings
if len(ending1_files) == 1 and len(ending2_files) == 1:
writable.write(current_path+ '/'+ ending1_files[0] + ' ')
writable.write(current_path + '/'+ ending2_files[0] + '\n')
findFiles(sys.stdout, 'G:/testf', 'one.txt', 'two.txt')

Looping through a directory, importing and processing text files in a chronology order using Python

Hoping I could get help with my python code, currently I have to change the working directory manually every time I run my code that loops through all the .txt files in chronological order, since they are numbered 1_Ix_100.txt, 2_Ix_99.txt etc etc until 201_Ix_-100.txt. all the text files are in the same directory i.e. C:/test/Ig=*/340_TXT what changes is the starred folder which goes from 340 to 1020 in increments of 40 i.e. C:/test/Ig=340/340_TXT, C:/test/Ig=380/340_TXT etc etc etc until C:/test/Ig=1020/340_TXT.
I'm looking for a way to automate this process so that the code loops through the different /Ig=*/ folder, process the text files and save the outcome as csv file in the /Ig=/
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import re
import os
import glob
D = []
E = []
F = []
os.chdir('C:/test/**Ig=700**/340_TXT') #Need to loop through the different folders in bold, these go from Ig=340 to Ig=1020 in incruments of 40
numbers = re.compile(r'(\d+)')
def numericalSort(value):
parts = numbers.split(value)
parts[1::2] = map(int, parts[1::2])
return parts
for infile in sorted(glob.glob('*.txt'), key=numericalSort):
name=['1', '2']
results = pd.read_table(infile, sep='\s+', names=name)
#process files here with output [D], [E], [F]
ArrayMain = []
ArrayMain = np.column_stack((D,E,F))
np.savetxt("C:/test/**Ig=700**/Grey_Zone.csv", ArrayMain, delimiter=",", fmt='%.9f') #save output in this directory which is one less than the working directory
I really hope the way I have worded it makes sense and I appreciate any help at all, thank you
Using a simple loop and some string manipulation you can create a list of the paths you want and then iterate over them.
Ig_txts = []
i=340
while i <= 1020:
Ig_txts.append( 'Ig='+str(i) )
i += 40
for Ig_txt in Ig_txts:
path = 'C:/test/'+Ig_txt+'/340_TXT'
out_file = 'C:/test/'+Ig_txt+'/Grey_Zone.csv'
os.chdir(path)
...
...
EDIT: Gabriel brought up that my range is a little off. Check the second code blurb for the modification.
I would first put your script into a function that takes, as one of its arguments, a path. The details are up to you, this code just details how to loop through the file names.
for root, _, files in os.walk('C:/test/'):
for f in files:
os.chdir(os.path.join(root, f))
#You now have the paths you need to open, close, etc.
Now, if there are other garbage files in 'C:/test/', then you could use a range based loop:
min_file_num = 340
max_file_num = 1020
for dir_num in range(min_file_num, max_file_num+1, 40):
path = 'C:/test/Ig=' + str(dir_num) + '/'
for root, _, files in os.walk(path):
for f in files:
os.chdir(os.path.join(root, f))
#You now have the paths you need to open, close, etc.

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like to them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete python noob, but looking at the docs i'm guessing I could use map to create a new list filtering out only the numerical part, and then sort that list, then iterate that? With over 100K files this could be heavy. Any tips welcome!
import re
thenum = re.compile('^file-(\d+)\.png$')
def bynumber(fn):
mo = thenum.match(fn)
if mo: return int(mo.group(1))
allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"): # loop through each file
newname = make_new_filename(filename) # create a function that does step 2, above
os.rename(filename, newname)
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os
files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
os.rename(filename, str(index).zfill(7)+'.png')
index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two step process. Parse all the files and rename with padded numbers and then run another script that takes those files, which are sorted correctly now, and renames them so they're contiguous?
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
def renamer():
for iname in os.listdir('.'):
first, second = iname.replace(" ", "").split("-")
number, ext = second.split('.')
first, number, ext = first.strip(), number.strip(), ext.strip()
number = '0'*(6-len(number)) + number # pad the number to be 7 digits long
oname = first + "-" + number + '.' + ext
os.rename(iname, oname)
print "Done"
Hope this helps
The simplest method is given below. You can also modify for recursive search this script.
use os module.
get filenames
os.rename
import os
class Renamer:
def __init__(self, pattern, extension):
self.ext = extension
self.pat = pattern
return
def rename(self):
p, e = (self.pat, self.ext)
number = 0
for x in os.listdir(os.getcwd()):
if str(x).endswith(f".{e}") == True:
os.rename(x, f'{p}_{number}.{e}')
number+=1
return
if __name__ == "__main__":
pattern = "myfile"
extension = "txt"
r = Renamer(pattern=pattern, extension=extension)
r.rename()

Categories