How to loop reading files group by group? - python

Generally, I loop through files one by one in Python. Now I want to loop through them group by group. How do I read them efficiently?
Here's an example to explain my question.
Given files like these:
group1: m2000_01, m2000_02,..., m2000_12
group2: m2001_01, m2001_02,...., m2001_12
.....
group17: m2016_01, m2016_02,...., m2016_12
I want to read all the files for the same year together for one calculation, looping along the time series for batching. Pseudo-code as follows:
for year in [2000, 2001, ..., 2016]:
    A = open(m2000_01), B = open(m2000_02), C = open(m2000_03), ...  # reading files section
    mean2000 = (A + B + C + ...) / 12
    # calculation body; how do I set a variable for each file, such as A=m2000_01, B=m2000_02, ...?
    # should I use a dict to hold these files?
print(mean2000, mean2001, ..., mean2016)  # the result I want
Maybe I could make a list, then loop over its elements to match (sieve) and extract each group of files. But what if there are many groups of files and the group keywords (such as 2000 in the example above) are irregular? Is there a common method for solving similar problems? I assume there is a proven approach, but I don't know how to describe or search for it. Please forgive me if this question is simple.

This will do it:
import os

path = "your\\path"
all_files = [x for x in os.listdir(path) if os.path.isfile(path + "\\" + x)]
for year in range(2000, 2017):
    for file_name in [y for y in all_files if str(year) in y]:
        sub_file_path = path + "\\" + file_name
        # read the file here; insert the appropriate code yourself

You can find and group the files for processing using os.listdir(), along with the re regex module, and the itertools.groupby() function to do something along these lines:
from itertools import groupby
import os
import re

folder_path = 'data_folder'
pattern = r'm\d\d\d\d_\d\d'
filenames = [filename for filename in sorted(os.listdir(folder_path))
             if re.match(pattern, filename)]
for k, g in groupby(filenames, lambda filename: filename.split('_')[0]):
    year = int(k[1:])
    year_files = list(g)
    print('{}: {}'.format(year, year_files))
Sample output:
2000: ['m2000_01', 'm2000_02', 'm2000_03', 'm2000_04', 'm2000_05', 'm2000_06', 'm2000_07', 'm2000_08', 'm2000_09', 'm2000_10', 'm2000_11', 'm2000_12']
2001: ['m2001_01', 'm2001_02', 'm2001_03', 'm2001_04', 'm2001_05', 'm2001_06', 'm2001_07', 'm2001_08', 'm2001_09', 'm2001_10', 'm2001_11', 'm2001_12']
2002: ['m2002_01', 'm2002_02', 'm2002_03', 'm2002_04', 'm2002_05', 'm2002_06', 'm2002_07', 'm2002_08', 'm2002_09', 'm2002_10', 'm2002_11', 'm2002_12']
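Building on that grouping, here is a minimal sketch of the per-year mean the question asks for. It assumes, hypothetically, that each file holds one number per line; the sample folder below is generated only to make the sketch self-contained.

```python
import os
import re
import tempfile
from itertools import groupby

# Build a tiny sample folder so the sketch is self-contained:
# the hypothetical file for month m just holds the value m.
folder_path = tempfile.mkdtemp()
for month in range(1, 13):
    with open(os.path.join(folder_path, 'm2000_%02d' % month), 'w') as f:
        f.write('%d\n' % month)

pattern = r'm\d{4}_\d{2}'
filenames = [fn for fn in sorted(os.listdir(folder_path)) if re.match(pattern, fn)]
means = {}
for k, g in groupby(filenames, lambda fn: fn.split('_')[0]):
    values = []
    for fn in g:
        with open(os.path.join(folder_path, fn)) as fh:
            values.extend(float(line) for line in fh)
    means[int(k[1:])] = sum(values) / len(values)
print(means)  # {2000: 6.5}
```

Because the filenames are sorted before groupby, each year's files arrive contiguously, which groupby requires.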

Related

How to count the number of files of each extension in a directory using Python?

I'm fairly new to Python and I came across this problem.
I want to write a Python script that counts the number of files of each extension in a directory and outputs the following details:
First row shows the image count.
Second row shows the file names in padded format.
Third row shows the frame-number continuity.
Example:
files in the directory:-
alpha.txt
file01_0040.rgb
file01_0041.rgb
file01_0042.rgb
file01_0043.rgb
file02_0044.rgb
file02_0045.rgb
file02_0046.rgb
file02_0047.rgb
Output:-
1 alpha.txt
4 file01_%04d.rgb 40-43
4 file02_%04d.rgb 44-47
I'd suggest you have a look at Python's built-in pathlib library (its Path objects come with glob).
Here's a first idea of how you could do it (surely this can be improved, but it should provide you with the basics):
from pathlib import Path
from itertools import groupby

basepath = Path(r"path-to-the-folder")
# get all filenames that follow the pattern
files = basepath.glob("file*_[0-9][0-9][0-9][0-9].rgb")
# split each stem so that we get the two numbers separately;
# sort so groupby sees each group contiguously
patterns = sorted(i.stem.lstrip("file").split("_") for i in files)
# group by the first number
groups = groupby(patterns, key=lambda x: x[0])
filenames = []
# loop through the obtained groups, counting files and extracting min/max frame ids
for file_num, group in groups:
    file_ids = [int(i[1]) for i in group]
    filenames.append(f"{len(file_ids)} file{file_num}_%04d.rgb {min(file_ids)}-{max(file_ids)}")
print(*filenames, sep="\n")
You can use the glob library to search through directories like this:
import glob
glob.glob('*.rgb')
This will return the filenames of all files ending with .rgb in a list for you to sort and process.
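For the counting-by-extension part of the question, here is a small sketch using collections.Counter. The names below are taken from the question's example; a real run would list the directory instead.

```python
from collections import Counter
from pathlib import Path

# Sample names taken from the question; a real run would use
# [p.name for p in Path(folder).iterdir()] instead of this list.
names = ['alpha.txt', 'file01_0040.rgb', 'file01_0041.rgb', 'file01_0042.rgb',
         'file01_0043.rgb', 'file02_0044.rgb', 'file02_0045.rgb',
         'file02_0046.rgb', 'file02_0047.rgb']
counts = Counter(Path(n).suffix for n in names)
print(counts)  # Counter({'.rgb': 8, '.txt': 1})
```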

Walk directories and create pdf from images

I have a folder Alpha which contains a series of folders named Beta1, Beta2, ..., Beta397. Each Beta folder contains a variable number of alphanumerically numbered images in different file formats.
My goal is to run a script that crawls all these Beta folders, selectively chooses images of jpeg/png format, and merges them to a pdf (per Beta folder) after name-sort.
My code is stored alongside the Beta folders and reads:-
import glob
import re
import img2pdf
import os

_nsre = re.compile('([0-9]+)')
def natural_sort_key(s):
    return [int(text) if text.isdigit() else text.lower()
            for text in re.split(_nsre, s)]

for X in range(1, 397):
    dirname = os.path.join('./','BetaX', '')
    output = os.path.join('./','BetaX', '/output.pdf')
    # Get all the filenames per image format
    filenames1 = [f for f in glob.iglob(f'{dirname}*.jpg')]
    filenames2 = [f for f in glob.iglob(f'{dirname}*.png')]
    # Merge the 2 lists
    filenames3 = filenames1 + filenames2
    # Sort the list alphanumerically
    filenames3.sort(key=natural_sort_key)
    # Print to pdf
    with open(output, "wb") as f:
        f.write(img2pdf.convert(filenames3))
    print(f'Finished converting {output}')
    filenames1.clear()
    filenames2.clear()
    filenames3.clear()
If I remove the for-loop line and type in the value of X, the pdf is output without any fuss, on an individual-folder basis. However, I am looking for a way to treat X as a loop variable from the range and batch-process all the folders at once.
The way your code is currently:
for X in range(1, 397):
    dirname = os.path.join('./','BetaX', '')
    output = os.path.join('./','BetaX', '/output.pdf')
The X is just a character in the string 'BetaX'. You need X to be treated as an integer value, and then you need to concatenate that value onto 'Beta' to come up with your full folder name.
Also, you don't want the slashes in what you're passing to os.path.join. The point of the join call is to hide the details of the path separator character. The value of output will be just /output.pdf with what you have, as the third parameter will be considered an absolute path because of the slash at the front of it.
Here's that part of your code with both of these issues addressed:
for X in range(1, 398):  # range's end is exclusive, so 398 is needed to include Beta397
    dirname = os.path.join('.', 'Beta' + str(X), '')
    output = os.path.join('.', 'Beta' + str(X), 'output.pdf')
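As a variation (a sketch, not the answer's method), you could discover the Beta folders with pathlib instead of hard-coding the range. Note that a plain sort is lexicographic, so Beta10 sorts before Beta2; that's harmless here because each folder is processed independently. The temp folder below is a hypothetical stand-in for Alpha so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the Alpha folder, created so the sketch runs anywhere
base = Path(tempfile.mkdtemp())
for i in (1, 2, 10):
    (base / f'Beta{i}').mkdir()

# Discover the Beta folders instead of hard-coding range(1, 397)
folders = sorted(p for p in base.glob('Beta*') if p.is_dir())
outputs = [folder / 'output.pdf' for folder in folders]
print([f.name for f in folders])  # ['Beta1', 'Beta10', 'Beta2'] - lexicographic order
```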

Python glob regex file search for a single result from multiple matches

In Python, I am trying to find a specific file in a directory, let's say, 'file3.txt'. The other files in the directory are 'flie1.txt', 'File2.txt', 'file_12.txt', and 'File13.txt'. The number is unique, so I need to search by a user supplied number.
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + f'{file_num}' + '.txt')
Problem is, that returns both 'file3.txt' and 'File13.txt'. If I try lookbehind, I get no files:
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + r'(?<![1-9]*)' + f'{file_num}' + '.txt')
How do I only get 'file3.txt'?
glob accepts Unix wildcards, not regexes. Those are less powerful but what you're asking can still be achieved. This:
glob.glob("/path/to/file/*[!0-9]3.txt")
filters to the files whose names end in 3 with no digit immediately before it.
For other cases, you can use a list comprehension and regex:
[x for x in glob.glob("/path/to/file/*") if re.match(some_regex,os.path.basename(x))]
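To sanity-check that wildcard, you can run it through fnmatch (the module glob uses for matching) against the names from the question:

```python
import fnmatch

# the names from the question
names = ['flie1.txt', 'File2.txt', 'file3.txt', 'file_12.txt', 'File13.txt']
matched = fnmatch.filter(names, '*[!0-9]3.txt')
print(matched)  # ['file3.txt']
```

'File13.txt' is rejected because the character before the 3 is a digit, which `[!0-9]` forbids.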
The problem with glob is that its patterns are shell wildcards, not full regexes. For instance, you can't express "[a-z_]+" with glob.
So it's better to write your own regex, like this:
import re
import os
file_num = 3
file_re = r"[a-z_]+{file_num}\.txt".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match
work_dir = "C:/Path_to_dir/"
names = list(filter(match_file, os.listdir(work_dir)))
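Checked against the names from the question (standing in for os.listdir), only 'file3.txt' survives the filter:

```python
import re

file_num = 3
file_re = r"[a-z_]+{file_num}\.txt".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match

# the names from the question stand in for os.listdir(work_dir)
names = ['flie1.txt', 'File2.txt', 'file3.txt', 'file_12.txt', 'File13.txt']
matched = list(filter(match_file, names))
print(matched)  # ['file3.txt']
```

'File13.txt' fails because `[a-z_]+` cannot consume the digit 1, so the 3 can never line up.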

Keep latest file and delete all other

In my folder there are many pdf files named with a date-timestamp format, as shown below.
I would like to keep only the latest file for each day and delete the rest for that day. How can I do this in Python?
2012-07-13-15-13-27_1342167207.pdf
2012-07-13-15-18-22_1342167502.pdf
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-14-45-12_1343029512.pdf
2012-07-23-14-56-48_1343030208.pdf
2012-07-23-16-03-45_1343034225.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-07-27-19_1343262439.pdf
2012-07-26-07-33-27_1343262807.pdf
2012-07-26-07-51-59_1343263919.pdf
2012-07-26-22-38-30_1343317110.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-10-43-27_1343360607.pdf
2012-07-27-10-58-40_1343361520.pdf
2012-07-27-11-03-19_1343361799.pdf
2012-07-27-11-04-14_1343361854.pdf
Should I fill a list and then sort it out? The desired output is:
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-11-04-14_1343361854.pdf
Thanks
Your desired list can also be achieved using groupby:
from itertools import groupby
import os

names = sorted(os.listdir('.'))
filtered_list = []
for key, group in groupby(names, lambda x: x[:10]):  # group on the first 10 characters (the date)
    filtered_list.append(list(group)[-1])  # pick the last - i.e. latest - file from each group
print(filtered_list)
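Run against a subset of the question's names, this approach keeps exactly the latest file of each day:

```python
from itertools import groupby

# a subset of the names from the question
names = sorted([
    '2012-07-13-15-13-27_1342167207.pdf',
    '2012-07-13-15-18-22_1342167502.pdf',
    '2012-07-13-15-18-33_1342167513.pdf',
    '2012-07-23-14-45-12_1343029512.pdf',
    '2012-07-23-16-04-23_1343034263.pdf',
])
# one latest file per 10-character date prefix
keep = [list(g)[-1] for _, g in groupby(names, lambda x: x[:10])]
print(keep)
# ['2012-07-13-15-18-33_1342167513.pdf', '2012-07-23-16-04-23_1343034263.pdf']
```

Because the timestamps are ISO8601, the lexicographic sort is also chronological, so the last entry of each group really is the latest.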
Sort the list and delete each file if the next file in the list is from the same day:
import glob
import os

files = glob.glob("*.pdf")
files.sort()
for ifl, fl in enumerate(files[:-1]):
    if files[ifl+1].startswith(fl[:10]):  # check if the next file is from the same day
        os.unlink(fl)  # it is - delete the current file
Edit:
As the OP's question became clearer, it became evident that not just the last file of the list is required, but the latest file of each day - to achieve this I included a "same day" conditioned unlinking.
You could do it that way. The following code is untested, but may work:
import os

names = os.listdir()
names.sort()
for f in names[:-1]:
    os.unlink(f)
Fortunately your file names use ISO8601 date format so the textual sort achieves the desired result with no need to parse the dates.
The following snippet works with the test case given:
import os

files = os.listdir(".")
days = set(fname[:10] for fname in files)  # the date part of each name
for d in days:
    f = [i for i in files if i[:10] == d]
    for x in sorted(f)[:-1]:
        os.remove(x)
Using a dictionary you can keep just one value per day. This is a quick-and-dirty solution, maybe not the best:
#!/usr/bin/env python
import os
import shutil
from shutil import copyfile

lst = []
dc = {}
for files in os.listdir("."):
    if files.endswith(".pdf"):
        lst.append(files)
lst.sort()  # sorted, so later entries for a day overwrite earlier ones
for x in lst:
    dc[int(x[0:10].replace("-", ""))] = x  # key like 20120713 -> latest file of that day
flist = list(dc.values())
tmpdir = "tmpdir"
if not os.path.exists(tmpdir):
    os.makedirs(tmpdir)
# copy the keepers aside, wipe the pdfs, then copy them back
for x in flist:
    copyfile(x, tmpdir + "/" + x)
for files in os.listdir("."):
    if files.endswith(".pdf"):
        os.unlink(files)
os.chdir("./tmpdir")
for files in os.listdir("."):
    if files.endswith(".pdf"):
        copyfile(files, "../" + files)
os.chdir("../")
shutil.rmtree(os.path.abspath(".") + "/tmpdir")

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably, they come up as
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to iterate in order of the numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list containing only the numerical part, then sort that list and iterate over it. With over 100K files this could be heavy. Any tips welcome!
import os
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    if mo:
        return int(mo.group(1))
    return -1  # non-matching names sort first

allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop:
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...), padded as you wish, in the destination name.
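Putting it together, here is a sketch of the rename mapping (using 6-digit padding, as in the question's desired output). The dict is built first so the mapping can be inspected before any os.rename call:

```python
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    return int(mo.group(1)) if mo else -1

# a few of the names from the question, deliberately out of order
allnames = ['file-640.png', 'file-21.png', 'file-2130.png', 'file-22.png']
allnames.sort(key=bynumber)

# map each old name to its padded, contiguous new name
renames = {fn: 'file-%06d.png' % i for i, fn in enumerate(allnames)}
print(renames)
# a real run would then call os.rename(old, new) for each pair
```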
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"):  # loop through each file
    newname = make_new_filename(filename)  # create a function that does step 2, above
    os.rename(filename, newname)
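For completeness, a hypothetical make_new_filename helper wrapping step 2 could look like this (it assumes exactly one '-' in the base name):

```python
import os

def make_new_filename(filename):
    # wraps step 2 above; assumes exactly one '-' in the base name
    fnpart, ext = os.path.splitext(filename)
    stub, num = fnpart.split("-")
    return "%s-%s%s" % (stub, num.rjust(5, "0"), ext)

print(make_new_filename("file-338.txt"))  # file-00338.txt
```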
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my file list, then iterating over it to rename. This was one of the suggested answers, but for some reason it has disappeared now, so I cannot mark it as accepted!
import os

files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7) + '.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, then run another script that takes those files, which are now sorted correctly, and renames them so they're contiguous.
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
import os

def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0' * (6 - len(number)) + number  # pad the number to 6 digits
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print("Done")
Hope this helps
The simplest method is given below. You can also modify this script for recursive search.
use the os module
get the filenames
os.rename them
import os

class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern

    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if x.endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1

if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
