I have a directory of files that contain records. I just got access to a new directory that has the same records plus additional files, but the additional files are buried deep inside other folders and I can't find them.
So my solution would be to have a Python program run over both directories (and their subdirectories), delete every file in the new directory that is a duplicate of one in the original, and leave the others intact, which would give me the "new files" I'm looking for.
I have seen a couple of programs that find duplicates, but I'm not really sure how they work, and they haven't been helpful.
Is there any way I can accomplish what I'm looking for?
Thanks!
Possible approach:
Create a set of MD5 hashes from your original folder.
Recursively MD5 hash the files in your new folder, deleting any files that generate hashes already present in your set.
The caveat to the above is that there is a chance two different files can generate the same hash (a collision). How different are the files?
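A minimal sketch of that approach, with placeholder paths for the two trees; it prints duplicates rather than deleting them, so you can do a dry run before switching to os.remove:

    import hashlib
    import os

    def file_md5(path, chunk_size=1 << 20):
        # hash in chunks so large files are never read into memory at once
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    # collect the hashes of everything in the original tree
    original_hashes = set()
    for path, dirs, files in os.walk('/path/to/original'):   # placeholder root
        for name in files:
            original_hashes.add(file_md5(os.path.join(path, name)))

    # walk the new tree and flag anything whose hash is already known
    for path, dirs, files in os.walk('/path/to/new'):        # placeholder root
        for name in files:
            full = os.path.join(path, name)
            if file_md5(full) in original_hashes:
                print('duplicate:', full)   # replace with os.remove(full) after a dry run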
Use fslint or some similar software. fslint can, for example, give you a list of the duplicate files and either hard-link the copies together or delete the duplicates. Another option is to just use a diff-like program to diff the directories, if their internal structure is the same.
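As an aside, if you would rather stay in Python than install fslint, the standard library's filecmp.dircmp is one such diff-like option (the roots below are placeholders):

    import filecmp

    # report_full_closure() recursively prints files that are identical,
    # different, or present in only one of the two trees
    cmp = filecmp.dircmp('/path/to/original', '/path/to/new')
    cmp.report_full_closure()
    print(cmp.right_only)   # top-level entries that exist only in the new tree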
Do the duplicate files in both directories have the same name/path? If I understand correctly, you want to find duplicate filenames rather than duplicate file contents? If so, a 'synchronised' call to os.walk in both trees might be helpful; see the sketch below.
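A minimal sketch of that idea, with placeholder roots; it walks the two trees separately and compares relative paths rather than interleaving the walks, but the result is the same:

    import os

    def relative_paths(top):
        # every file under `top`, as a path relative to `top`
        rel = set()
        for path, dirs, files in os.walk(top):
            for name in files:
                rel.add(os.path.relpath(os.path.join(path, name), top))
        return rel

    old = relative_paths('/path/to/original')   # placeholder roots
    new = relative_paths('/path/to/new')
    print(sorted(new - old))    # paths that appear only in the new tree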
Related
I am trying to loop through the CSV files in a directory so that they are read in the same sequence as they are stored in the directory. I am aware of methods like os.walk or os.listdir. To preserve the sequence I used sorted(os.listdir(...)), but the sequence is still not the same. I have attached two images for ease of understanding. I do not want to rename the files now, because this data was generated by different simulation software.
Try this statement: sorted(os.listdir(dir_unresolved), key=len)
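In context, with dir_unresolved as a placeholder path; the second form is a common refinement (not part of the answer above) for when names of equal length also need a deterministic order:

    import os

    dir_unresolved = '/path/to/csvs'   # placeholder path
    files = sorted(os.listdir(dir_unresolved), key=len)

    # refinement: break ties between names of equal length alphabetically
    files = sorted(os.listdir(dir_unresolved), key=lambda name: (len(name), name))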
I am trying to automate a search-and-delete operation for specific files and folders underneath a specific folder. Below is the folder structure I have:
The primary directory is MasterFolder, which includes multiple subdirectories, the child folders Fol1, Fol2, Fol3, Fol4; the subdirectories may vary from folder to folder.
The subfolders have more files and subfolders. Ex: Fol1 holds someFilesFolder, sometext.txt, and AnotherFilesFolder; the same applies to Fol2, Fol3, and the other subdirectories under the MasterFolder.
Now what I would like to do is scan the MasterFolder, go through every ChildFolder, look for one file named someText.txt and one folder named someFilesFolder under every child folder, and remove them. Ideally the folder name and file name I want to delete are the same under every ChildFolder, so the find should happen only one level down from the MasterFolder. I checked multiple articles, but everything describes deleting a specific file or directory using shutil.rmtree under one folder, whereas I am looking for something that will find and delete recursively.
To get you started:
Ideally the folder name and file name I want to delete are the same under every ChildFolder, so the find should happen only one level down from the MasterFolder.
One easy way to go through every child folder under MasterFolder is to loop over os.listdir('/path/to/MasterFolder'). This will give you both files and child folders. You could check each one with os.path.isdir, but it's much simpler (and more efficient, and cleaner) to just try to operate on them as if they were all folders, and handle the exceptions on non-folders by doing nothing/logging/whatever seems appropriate.
The list you get back from listdir is just bare names, so you will need os.path.join to join each name to /path/to/MasterFolder. And you'll need it to join "someText.txt" and "someFilesFolder" onto each child path as well, of course.
Finally, while you could call listdir again on each child directory and only delete the file and subdirectory if they exist, again, it's simpler (and cleaner and more efficient) to just try each one. You apparently already know how to use shutil.rmtree and os.unlink, so… you're done. A sketch of the whole thing follows.
If that "ideally" isn't actually guaranteed, then instead of os.listdir you will have to use os.walk. That is slightly more complicated, but if you look at the examples, then come back up and read the docs above the examples for the details, it's not hard to figure out.
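Putting those pieces together, a minimal sketch of the listdir approach (the MasterFolder path is a placeholder):

    import os
    import shutil

    master = '/path/to/MasterFolder'   # placeholder path

    for name in os.listdir(master):
        child = os.path.join(master, name)
        try:
            os.unlink(os.path.join(child, 'someText.txt'))
        except OSError:
            pass   # no such file, or `child` is a file rather than a folder
        try:
            shutil.rmtree(os.path.join(child, 'someFilesFolder'))
        except OSError:
            pass   # no such folder, or `child` is a file rather than a folder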
I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modification dates of the files I want to check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The above approach won't really scale and might take several minutes.
Is there a better approach to probe for modifications? Is there something like a checksum on folders that I could use to narrow down the number of folders I have to check, or anything like that?
A second part of my problem is also finding newly created files.
I am looking for a Python-based solution that is Windows-compatible.
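For reference, a minimal sketch of the "trivial" recursive scan described above, with a placeholder root and cutoff; the question of avoiding the full walk remains open:

    import os
    import time

    def modified_since(top, cutoff):
        # yield files under `top` whose modification time is newer than `cutoff`
        for path, dirs, files in os.walk(top):
            for name in files:
                full = os.path.join(path, name)
                if os.path.getmtime(full) > cutoff:
                    yield full

    # everything modified in the last hour (placeholder root)
    for f in modified_since(r'C:\data', time.time() - 3600):
        print(f)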
I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example, sketched below. This is about as efficient as you are going to get without making a bunch of assumptions about your file structure layout.
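The gen-find example is a short generator along these lines; the pattern and root here are chosen to match the layout in the question:

    import fnmatch
    import os

    def gen_find(filepat, top):
        # lazily yield matching paths instead of building the whole list up front
        for path, dirlist, filelist in os.walk(top):
            for name in fnmatch.filter(filelist, filepat):
                yield os.path.join(path, name)

    # fnmatch is case-sensitive on POSIX, so match the extension as stored
    for jpg in gen_find('*.JPG', '/library/Modified'):
        print(jpg)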
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to perform the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give every file a numeric name from 1 to N (where N is the number of files in the directory) and keep a special file ".count" that holds the N for each directory. Then access files directly by names generated from a random generator.
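A minimal sketch of that scheme, under the assumptions above (note that it discards the original file names, including extensions, as described):

    import os
    import random

    def number_files(directory):
        # one-time rename: numeric names 1..N, with N recorded in ".count"
        names = [n for n in os.listdir(directory)
                 if n != '.count' and os.path.isfile(os.path.join(directory, n))]
        for i, name in enumerate(names, start=1):
            os.rename(os.path.join(directory, name), os.path.join(directory, str(i)))
        with open(os.path.join(directory, '.count'), 'w') as f:
            f.write(str(len(names)))

    def random_file(directory):
        # O(1) access: read N, pick a number, build the name directly
        with open(os.path.join(directory, '.count')) as f:
            n = int(f.read())
        return os.path.join(directory, str(random.randint(1, n)))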
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directories/files into a text file first using a batch file, then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file itself.
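A minimal sketch of that idea, assuming Windows and cmd's dir command (the paths and file names are placeholders):

    import subprocess

    # `dir /s /b` prints one full path per line, recursively; redirect it to a file
    subprocess.run(r'dir /s /b > filelist.txt', shell=True, cwd=r'C:\some\tree')

    with open(r'C:\some\tree\filelist.txt') as f:
        paths = [line.strip() for line in f if line.strip()]
    print(len(paths), 'entries')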