Grouping and deleting Files - python

I have to come up with a solution to delete all files but the newest 2 in a directory structure of our owncloud. To be exact, it's the file versioning folder. There are files in one folder with the following structure:
Filename.Ext.v[random_Number]
The hard part is that there are different files in one folder I need to keep.
IE: Content of folder A:
HelloWorld.txt.v123
HelloWorld.txt.v555
HelloWorld.txt.v666
OtherFile.pdf.v143
OtherFile.pdf.v1453
OtherFile.pdf.v123
OtherFile.pdf.v14345
YetOtherFile.docx.v11113
In this case we have 3 "basefiles". And I would have to keep the newest 2 files of each "basefile".
I tried Python3 with os.walk and regex to filter out the basename. I tried built-in Linux tools like find with -ctime. I could also use bash.
But my real problem is more the logic. How would you approach this task?
EDIT 2:
Here my progress:
import os
from itertools import groupby
directory = 'C:\\Users\\x41\\Desktop\\Test\\'
def sorted_ls(directory):
    mtime = lambda f: os.stat(os.path.join(directory, f)).st_mtime
    return list(sorted(os.listdir(directory), key=mtime))
print(sorted_ls(directory))
for basename, group in groupby(sorted_ls(directory), lambda x: x.rsplit('.')[0]):
    for i in basename:
        finallist = []
        for a in group:
            finallist.append(a)
        print(finallist[:-2])
I am almost there. The function sorts the files in the directory based on the mtime value. The suggested groupby() function calls my custom sort function.
Now the problem here is that I have to dump the sort() before the groupby() because this would reset my custom sort. But it now also returns more groups than anticipated.
If my sorted list looks like this:
['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']
I would get 3 groups. A, B, and A again.
Any suggestions?
FINAL RESULT
Here is my final version with added recursiveness:
import os
from itertools import groupby
directory = r'C:\Users\x41\Desktop\Test'
for dirpath, dirs, files in os.walk(directory):
    output = []
    # Sort by name first so groupby sees each base name as one consecutive run
    for basename, group in groupby(sorted(files), lambda x: x.rsplit('.')[0]):
        # Within a group, sort by mtime and keep everything except the newest two
        output.extend(sorted(group, key=lambda x: os.stat(os.path.join(dirpath, x)).st_mtime)[:-2])
    for file in output:
        os.remove(os.path.join(dirpath, file))

You need to do a simple sort first on the file names so that they are in alphabetical order to allow the groupby function to work correctly.
With each of the resulting file groups, you can then sort using your os.stat key as follows:
import os
from itertools import groupby
directory = r'C:\Users\x41\Desktop\Test'
output = []
for basename, group in groupby(sorted(os.listdir(directory)), lambda x: x.rsplit('.')[0]):
    # Sort each group by modification time and keep the two newest files
    output.extend(sorted(group, key=lambda x: os.stat(os.path.join(directory, x)).st_mtime)[-2:])
print(output)
This will produce a single list containing the latest two files from each group.
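To see why the plain sort matters, here is a minimal sketch using the example list from the question. groupby only merges consecutive items with the same key, so an unsorted list yields a repeated group. The helper name base_key is made up for this illustration.

```python
from itertools import groupby

def base_key(name):
    # Hypothetical helper: same key as in the code above (text before the first dot).
    return name.rsplit('.')[0]

names = ['A.txt.1', 'B.txt.2', 'B.txt.1', 'B.txt.3', 'A.txt.2']

# groupby starts a new group every time the key changes.
unsorted_keys = [key for key, _ in groupby(names, base_key)]
sorted_keys = [key for key, _ in groupby(sorted(names), base_key)]

print(unsorted_keys)  # ['A', 'B', 'A']
print(sorted_keys)    # ['A', 'B']
```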

The logic isn't extremely hard here, if that's the only thing you're looking for.
You'd group files by base name, in a Python dictionary for example, where the key is your "base filename" such as "HelloWorld.txt" and the value is a list of all files with that base name, sorted newest first by ctime (or some other timestamp, depending on how you define newest). Then you delete every file in each list from index 2 onwards.
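The dictionary approach described here could be sketched roughly as follows. This is only an illustration under some assumptions: the function name files_to_delete and the keep parameter are invented, the base name is taken to be everything before the last dot, and mtime is used as the definition of "newest" (swap in os.path.getctime if that fits better). It only returns the candidates, so you can inspect the list before removing anything.

```python
import os
from collections import defaultdict

def files_to_delete(directory, keep=2):
    """Return paths of all but the `keep` newest versions of each base file."""
    groups = defaultdict(list)
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # 'HelloWorld.txt.v123' -> 'HelloWorld.txt'
        base = name.rsplit('.', 1)[0]
        groups[base].append(path)

    doomed = []
    for paths in groups.values():
        # Newest first, so everything from index `keep` onwards is old.
        paths.sort(key=os.path.getmtime, reverse=True)
        doomed.extend(paths[keep:])
    return doomed
```

Calling os.remove on each returned path would then do the actual deletion.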

Related

Python 3.6 - enumerate files

I am trying to loop over a series of jpg files in a folder. I found example code for that:
for n, image_file in enumerate(os.scandir(image_folder)):
which will loop through the image files in image_folder. However, it does not seem to follow any sequence. My files are named like 000001.jpg, 000002.jpg, 000003.jpg,... and so on. But when the code runs, it does not follow the sequence:
000213.jpg
000012.jpg
000672.jpg
....
What seems to be the issue here?
Here's the relevant bit on os.scandir():
os.scandir(path='.')
Return an iterator of os.DirEntry objects
corresponding to the entries in the directory given by path. The
entries are yielded in arbitrary order, and the special entries '.'
and '..' are not included.
You should not expect it to be in any particular order. The same goes for listdir() if you were considering this as an alternative.
If you strictly need them to be in order, consider sorting them first:
scanned = sorted(os.scandir(image_folder), key=lambda f: f.name)
for n, image_file in enumerate(scanned):
    # ... rest of your code
I prefer to use glob:
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are
returned in arbitrary order. No tilde expansion is done, but *, ?, and
character ranges expressed with [] will be correctly matched.
You will need this if you handle more complex file structures, so starting with glob isn't a bad idea. For your case you can also use os.scandir() as mentioned above.
Reference: glob module
import glob
files = sorted(glob.glob(r"C:\Users\Fabian\Desktop\stack\img\*.jpg"))
for key, myfile in enumerate(files):
    print(key, myfile)
Note that even if there are other files, such as .txt files, they won't be in your list.
Output:
C:\Users\Fabian\Desktop\stack>python c:/Users/Fabian/Desktop/stack/img.py
0 C:\Users\Fabian\Desktop\stack\img\img0001.jpg
1 C:\Users\Fabian\Desktop\stack\img\img0002.jpg
2 C:\Users\Fabian\Desktop\stack\img\img0003.jpg
....

rename files in order from a folder using python

I have a folder with files that are named from 0.txt to 100.txt.
They are created in order from a list L.
I want to rename the files in that folder with the names from the list. However, they are renamed in the "wrong" order, meaning the new names are not assigned in the order of the list.
My code is like:
import os
folder = r'D:\my_files'
os.chdir(folder)
for i,j in zip(os.listdir(folder), L):
    os.rename(i, j + ".txt")
where L is the list with names for the files.
How do I keep the order of files in the directory to match my names in the L list, so the files are renamed according to my list?
As per the Python documentation:
os.listdir(path='.')
Return a list containing the names of the entries
in the directory given by path. The list is in arbitrary order, and
does not include the special entries '.' and '..' even if they are
present in the directory.
Therefore, you need to sort your files before you use zip:
for i,j in zip(sorted(os.listdir(folder), key=lambda x: int(x.split('.')[0])), L):
    # logic to rename file
With sorted, the parameter key=lambda x: int(x.split('.')[0]) will ensure the ordering is correct.
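A small sketch of why the numeric key is needed, using made-up file names: a plain sorted() compares strings character by character, so 10.txt lands before 2.txt.

```python
# Hypothetical sample of names in the 0.txt ... 100.txt scheme.
names = ['10.txt', '2.txt', '0.txt', '100.txt', '1.txt']

# Plain string sort: '10.txt' comes before '2.txt'.
lexicographic = sorted(names)

# Sort by the integer value of the part before the first dot.
numeric = sorted(names, key=lambda x: int(x.split('.')[0]))

print(lexicographic)  # ['0.txt', '1.txt', '10.txt', '100.txt', '2.txt']
print(numeric)        # ['0.txt', '1.txt', '2.txt', '10.txt', '100.txt']
```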

Read only the first file from a given image sequence path

I have an image sequence path that is as follows : /host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg
In a pythonic way, is it possible for me to code and have it read the first file based on the above file path given?
If not, can I have it list the entire sequence, but only the files with that naming? Assume there is another sequence called basecolor_default_beta.*.jpg in the same directory.
For #2, if I use os.listdir('/host_server/master/images/set01a/env_basecolor_default_v001'), it lists the files of both image sequences.
The simplest solution seems to be to use several functions.
1) To get ALL of the full filepaths, use
import os
main_path = "/host_server/master/images/set01a/env_basecolor_default_v001/"
all_files = [os.path.join(main_path, filename) for filename in os.listdir(main_path)]
2) To choose only those of a certain kind, use a filter.
beta_files = list(filter(lambda x: "beta" in x, all_files))
beta_files.sort()
read the first file based on the above file path given?
With the memory-efficient glob.iglob(pathname, recursive=False) (if you need the name/path of the first file found):
import glob
path = '/host_server/master/images/set01a/env_basecolor_default_v001/basecolor_default.*.jpg'
it = glob.iglob(path)
first = next(it)
glob.iglob() - Return an iterator which yields the same values as
glob() without actually storing them all simultaneously.
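One caveat: glob results come back in arbitrary order, so next(it) gives the first match found, which is not necessarily the lowest frame number. If "first" should mean the lowest frame, a minimal sketch could take the minimum instead. This assumes zero-padded frame numbers, so that string order matches numeric order; the function name first_in_sequence is made up for the example.

```python
import glob

def first_in_sequence(pattern):
    """Return the lexicographically smallest path matching pattern, or None."""
    # min() consumes the iterator without building a full list first.
    return min(glob.iglob(pattern), default=None)
```

Note that the pattern basecolor_default.*.jpg requires a literal dot right after "default", so files from a basecolor_default_beta.*.jpg sequence are not matched.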
Try using glob. Something like:
import glob
import os
path = '/host_server/master/images/set01a/env_basecolor_default_v001'
pattern = 'basecolor_default.*.jpg'
filenames = glob.glob(os.path.join(path, pattern))
# read filenames[0]

move folders from folder list to other folder list using python

Hello, I want to move or copy many folders from one folder list to another folder list. I use the glob and shutil libraries for this work.
First I create a folder list:
import glob
#paths from source folder
sourcepath='C:/my/store/path/*'
paths = glob.glob(sourcepath)
my_file='10'
selected_path = filter(lambda x: my_file in x, paths)
#paths from destination folder
destpath='C:/my/store/path/*'
paths2 = glob.glob(destpath)
my_file1='20'
selected_path1 = filter(lambda x: my_file1 in x, paths2)
Now I have two lists of paths (selected_path, selected_path1).
I want to move or copy the folders from the first list (selected_path) to the second list (selected_path1).
Finally, I tried this code to move the folders, but without success:
import shutil
for I,j in zip(selected_path,selected_path1)
shutil.move(i, j)
But that doesn't work. Any idea how to get my code to work?
First, your use of lambda with filter isn't needed: the glob function can perform this filtering itself through its pattern. That is what glob is for, so the extra function calls only add clutter and overhead.
Look at this example, identical to yours:
import glob
# Find all .py files
sourcepath= 'C:/my/store/path/*.py'
paths = glob.glob(sourcepath)
# Find files that end with 'codes'
destpath= 'C:/my/store/path/*codes'
paths2 = glob.glob(destpath)
Second, the second glob call may or may not return a list of directories to move your directories/files into. This makes your code dependent on what C:/my/store/path contains. That is, you must guarantee that C:/my/store/path contains only directories and never files, so that glob returns only directories to be used in shutil.move. If a user later adds a file, not a folder, to C:/my/store/path whose name happens to end with 'codes' and has no extension (unlike codes.txt or codes.py), then you'll find this file in the list glob returns in paths2. Of course, guaranteeing that a directory contains only subdirectories is fragile and not a good idea. You can test for directories with os.path.isdir.
Notice something, you're using lambda with the help of filter to filter out any string that doesn't contain 10 in your first call to filter, something you can achieve with glob itself:
glob.glob('C:/my/store/path/*10*')
Now any file or subdirectory of C:/my/store/path that contains 10 in its name will be collected in the returned list of the glob function.
Third, zip truncates to the shortest iterable in its argument list. In other words, if you would like to move each path in paths to the corresponding path in paths2, you need len(paths) == len(paths2), so that each file or directory in paths has a directory in paths2 to be moved to.
Fourth, you missed the colon at the end of the for statement, and in the call to shutil.move you used i instead of I. Python is a case-sensitive language, and uppercase I is not the same name as lowercase i:
import shutil
for I,j in zip(selected_path,selected_path1) # missing :
shutil.move(i, j) # i not I
Corrected code:
import shutil
for i, j in zip(selected_path, selected_path1):  # colon added
    shutil.move(i, j)  # same lowercase i as in the loop header
Presumably, paths2 contains only subdirectories of C:/my/store/path directory, this is a better approach to write your code, but definitely not the best:
import glob
#paths from source folder
sourcepath='C:/my/store/path/*10*'
paths = glob.glob(sourcepath)
#paths from destination folder
destpath='C:/my/store/path/*20*'
paths2 = glob.glob(destpath)
import shutil
for i,j in zip(paths,paths2):
    shutil.move(i, j)
*Still some of the previous issues that I mentioned above apply to this code.
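To address the remaining issues (destinations that are not directories, and lists of different length), a rough sketch could look like the one below. The function name move_pairs is invented for this example, and pairing the sorted sources with the sorted destinations one-to-one is only an assumption about what the question intends.

```python
import glob
import os
import shutil

def move_pairs(source_pattern, dest_pattern):
    """Move each path matching source_pattern into the corresponding destination directory."""
    # Sort both lists so the pairing is deterministic (glob order is arbitrary).
    sources = sorted(glob.glob(source_pattern))
    # Keep only real directories as destinations.
    dests = sorted(p for p in glob.glob(dest_pattern) if os.path.isdir(p))
    # zip would silently drop the extras, so fail loudly instead.
    if len(sources) != len(dests):
        raise ValueError('%d sources but %d destination directories'
                         % (len(sources), len(dests)))
    for src, dst in zip(sources, dests):
        # dst is an existing directory, so src is moved inside it.
        shutil.move(src, dst)
    return list(zip(sources, dests))
```

With the patterns from the question this would be called as move_pairs('C:/my/store/path/*10*', 'C:/my/store/path/*20*').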
And now that you finished the long marathon of reading this answer, what would you like to do to improve your code? I'll be glad to help if you still find something ambiguous.
Good luck :)

Find duplicate filenames, and only keep newest file using python

I have +20 000 files, that look like this below, all in the same directory:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames, while ignoring the file extension?
Clarification: I refer to a duplicate being a file with the same filename while ignoring the file extension. I do not care if the file is not 100% the same (ex. hashsize or anything like that)
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files, check if file already exists. If so then use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory, and I'm wondering if there is a more pythonic way of doing things...
Python 2.6
Windows 7
You can do it with O(n) complexity. The solutions with sort have O(n*log(n)) complexity.
import os
from collections import namedtuple
directory = '.'  # placeholder: replace with the file directory
os.chdir(directory)
newest_files = {}
Entry = namedtuple('Entry',['date','file_name'])
for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date,file_name)
    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
newest_files is a dictionary whose keys are file names without extensions and whose values are named tuples holding the full file name and the modification date. If a newly encountered file's name is already in the dictionary, its date is compared with the stored one, and the entry is replaced if the new file is newer.
In the end you have a dictionary with the most recent files.
Then you may use this dictionary to perform the second pass. Note that lookup complexity in the dictionary is O(1) on average, so the overall complexity of looking up all n files in the dictionary is O(n).
For example, if you want to leave only the newest files with the same name and delete the other, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file_name = newest_files.get(name).file_name
    if file_name != cashed_file_name: #it's not the newest with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, just by adding one line of the code:
    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
            os.remove(cashed_file.file_name) #this line added
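One thing to watch with the single-pass variant: the added line only removes the cached file when a newer one turns up. If an older duplicate is encountered after the newest file has already been cached, it is left on disk. A self-contained sketch that handles both cases might look like this. The function name keep_newest is made up, and it uses full paths instead of os.chdir.

```python
import os

def keep_newest(directory):
    """Delete every file that shares its stem with a newer file; return the survivors."""
    newest = {}  # stem -> (mtime, path)
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        stem = os.path.splitext(file_name)[0]
        mtime = os.path.getmtime(path)
        cached = newest.get(stem)
        if cached is None:
            newest[stem] = (mtime, path)
        elif mtime > cached[0]:
            os.remove(cached[1])  # the cached file is the older one
            newest[stem] = (mtime, path)
        else:
            os.remove(path)  # the current file is older (or the same age)
    return sorted(os.path.basename(p) for _, p in newest.values())
```

os.listdir returns a complete list up front, so removing files inside the loop does not disturb the iteration.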
First, get a list of file names and sort them. This will put any duplicates next to each other.
Then, strip off the file extension and compare to neighbors, os.path.splitext() and itertools.groupby() may be useful here.
Once you have grouped the duplicates, pick the one you want to keep using os.stat().
In the end your code might look something like this:
import os, itertools
files = os.listdir(base_directory)
files.sort()
for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(g)
    if len(dups) > 1:
        # figure out which file(s) to remove
        pass
You shouldn't have to worry about memory here; you're looking at something on the order of a couple of megabytes.
For the filename counter you could use a defaultdict that stores how many times each file appears:
import os
from collections import defaultdict
counter = defaultdict(int)
for file_name in file_names:
    file_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[file_name] += 1
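Once the counter is filled, the duplicates are simply the names that were counted more than once. A minimal sketch, wrapped in a function whose name (duplicate_stems) is invented for the example:

```python
import os
from collections import defaultdict

def duplicate_stems(file_names):
    """Return the extension-less names that occur more than once."""
    counter = defaultdict(int)
    for file_name in file_names:
        stem = os.path.splitext(os.path.basename(file_name))[0]
        counter[stem] += 1
    return sorted(stem for stem, count in counter.items() if count > 1)

print(duplicate_stems(['8003825.pdf', '8003825.tif', '8006826.tif']))  # ['8003825']
```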
