Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg and file_abc(1).jpg, i.e. with a "(1)" suffix on the end). I am trying to develop a relatively simple method to search through a folder, identify duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of the file name, so long as the original also exists.
I can identify duplicates okay; however, I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", but using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at answers such as Finding duplicate files and removing them; however, this seems to be far more sophisticated than what I need.
If there are better (and simpler) ways to do this then let me know; however, I only have around 10,000 files in total across 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os

file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print(file_list)

for file in file_list:
    if ("(1)" in file):
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: " + str(file_list.index(file)))
        file_remove = ('r"%s' %file_path+"'\'"+file+'"')
        print("The text string is: " + file_remove)
        os.remove(file_remove)

Your code is just a little more complex than necessary, and you didn't use a proper way to build a file path out of a directory path and a file name. Also, I think you should not remove files which have no original (i.e. which aren't duplicates even though their name looks like it).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind, though, that this doesn't work properly for files which have multiple occurrences of (1) in their name, and files with (2) or higher numbers aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should then check the contents of these few files to be sure that not just two of them are accidentally the same size without being identical. If you are sure you have a group of identical ones, remove all but the one with the simplest name (e.g. without suffixes like (1)).
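A minimal sketch of that proposition might look like the following; the helper name iter_duplicate_groups and the use of SHA-256 as the content check are illustrative choices, not something prescribed above:

import hashlib
import os
from itertools import groupby

def iter_duplicate_groups(root_dir_path):
    # Collect every file below the starting directory.
    all_paths = []
    for dir_path, _dir_names, file_names in os.walk(root_dir_path):
        for file_name in file_names:
            all_paths.append(os.path.join(dir_path, file_name))

    # Sort by size so that potential duplicates become neighbours in the list.
    all_paths.sort(key=os.path.getsize)

    # Walk linearly through the size groups and confirm real duplicates by hashing.
    for _size, group in groupby(all_paths, key=os.path.getsize):
        by_hash = {}
        for path in group:
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
        for paths in by_hash.values():
            if len(paths) > 1:
                yield paths  # a small list of files with identical contents

# Example use: keep the file with the shortest (simplest) name in each group.
# for group in iter_duplicate_groups(r"C:\Data\temp"):
#     keep = min(group, key=len)
#     for path in group:
#         if path != keep:
#             os.remove(path)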
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

Related

deleting duplicate files with parenthesis in the name, only if original exists

I didn't really find anything that suits my needs, and I am not sure how to proceed.
I have lots of photos, scattered across different folders, and many of them are just duplicates. For example:
20180514.jpg(1).jpg
20180514.jpg
or
20180514(1).jpg
Now, I want to create a Python script which looks for files with a parenthesis in the name, checks if a related file without the parenthesis exists, and deletes the file with the parenthesis. With my lack of Python skills, I managed to search for all files with wildcards:
parenthesisList = glob.glob('**/*(*)*', recursive=True)
and could theoretically delete them from there, but with 30k+ pictures I wouldn't dare just delete them without knowing whether they really have an original file or not.
The tricky part now is to compare that list to another list, which is something like:
everythingList = glob.glob('**/*', recursive=True)
and to evaluate which of the files in parenthesisList have a file with the same name, except for the parenthesis.
Bonus points would be to only delete the file if its size is the same or less, but I don't really need that. Thanks for any help!
EDIT: My post sounds like I want someone to do it for me, but if it wasn't clear, my question is: how do you check whether items from list A contain items from list A minus the '('?
import os
from os import listdir

for filename in listdir(r"C:\Users\...\Pictures"):
    # check if file name contains parenthesis
    if "(" in filename:
        os.remove(r"C:\Users\...\Pictures\\" + filename)
Note that this will also remove folders whose names contain "(".
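For the check the question actually asks for (only delete the parenthesised file when its original exists), a more cautious sketch could look like this; the regex and the print-before-delete step are illustrative assumptions, not part of the answer above:

import glob
import os
import re

for path in glob.glob('**/*(*)*', recursive=True):
    if not os.path.isfile(path):
        continue  # skip directories whose names contain "("
    dir_name, file_name = os.path.split(path)
    # Strip one "(<digits>)" group from the file name to get the candidate original.
    original_name = re.sub(r'\(\d+\)', '', file_name, count=1)
    original_path = os.path.join(dir_name, original_name)
    if original_name != file_name and os.path.isfile(original_path):
        print('duplicate:', path, '-> original:', original_path)
        # os.remove(path)  # uncomment once the printed pairs look correct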

Most efficient way to check if a string contains any file format?

I have a .txt with hundreds of thousands of paths and I simply have to check if each line is a folder or a file. The hard drive is not with me, so I can't use the os module with the os.path.isdir() function. I've tried the code below, but it is just not perfect, since some folder names contain a "." near the end.
for row in files:
    if (row[-6:].find(".") < 0):
        folders_count += 1
It is just not worth testing whether the end of the string contains any known file format (.zip, .pdf, .doc ...), since there are dozens of different file formats inside this HD. When my code reads the .txt, it stores each line as a string inside an array, so my code has to work with the strings themselves.
An example of a folder path:
'path1/path2/truckMV.34'
An example of a file path:
'path1/path2/certificates.pdf'
It's impossible for us to judge whether it's a file or a folder just by the string, since an extension is just an arbitrary, agreed-upon string that programs choose to decode in a certain way.
Having said that, if I had the same problem I would do my best to estimate with the following pseudo code:
Create a hash map (or a dictionary as you are in Python)
For every line of the file, look at the last path segment and see if there's a "." in it
Create a key for it on the hash map with a counter of how many times you have encountered the "possible extensions".
After you go through the whole list you will have a collection of possible extensions and how many times you have encountered each of them. Assume the ones with only 1 occurrence (or any other low, arbitrary number) to be folders and not files with an extension.
The basis of this heuristic is that it's unlikely for a person to have a lot of unique extensions on their desktop - but that's just an assumption I came up with.
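A minimal sketch of that heuristic, with the helper name and the occurrence threshold chosen purely for illustration:

from collections import Counter

def classify_paths(lines, min_occurrences=2):
    """Heuristically split path strings into files and folders."""
    def possible_extension(line):
        # Look only at the last path segment; no "." there means no extension.
        last_segment = line.rstrip('/').rsplit('/', 1)[-1]
        if '.' not in last_segment:
            return None
        return last_segment.rsplit('.', 1)[-1].lower()

    # First pass: count how often each candidate extension occurs.
    extension_counts = Counter()
    for line in lines:
        ext = possible_extension(line)
        if ext is not None:
            extension_counts[ext] += 1

    # Second pass: rare "extensions" are assumed to be part of a folder name.
    files, folders = [], []
    for line in lines:
        ext = possible_extension(line)
        if ext is not None and extension_counts[ext] >= min_occurrences:
            files.append(line)
        else:
            folders.append(line)
    return files, folders

# Example with the paths from the question:
# files, folders = classify_paths(['path1/path2/truckMV.34', 'path1/path2/certificates.pdf'])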

Python: Need help comparing two lists

I recently discovered a huge bug in my script and I need some help figuring it out.
This script searches through multiple sub-directories in our codebase, finds all source files that meet certain criteria, and makes a list of their names. I call this list: all_files
It then runs another program, and this program makes new source files (close to the same names) that get placed in another directory. I go into that directory and make a list of the file names in there. I call this list: covered_files
Now here's the problem, the new source file names differ in that they have leading words. For example:
--> file name in all_files: fileName.cpp
--> file name in covered_files: prefix_fileName.cpp
These two files correspond to each other, and the job of this script is to return a list of names from all_files that show up in covered_files... That is, in the above example the name "fileName.cpp" would be added to the list because it is in both.
One main issue is that there are names that would correspond to "fileName1.cpp", "fileName2.cpp", and so on. And as you'll see in my code below, it only accounts for one of these files being covered when they both need to be.
What I currently have:
def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
    return file_match_list
So overall my question is: Is there a way to search through the two lists, where one of the entries is a substring of the other, that will give me each file name that is covered, regardless of whether that substring appears more than once? Thank you in advance for any help!
Edit: Another example using some actual file names:
There are files Segment.cpp, SegmentAllocation.cpp, and SegmentImpl.cpp. These would be in the all_files list, with the matching prefixed names in the covered_files list.
After the code is run, I would expect all three of them to be in file_match_list. However, only the first file name is in there, repeated 3 times.
So instead of: (desired)
['Segment.cpp', 'SegmentAllocation.cpp', 'SegmentImpl.cpp']
I get:
['Segment.cpp', 'Segment.cpp', 'Segment.cpp']
I answered the question myself after debugging: I needed a break statement in the inner for loop:
def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
                break
    return file_match_list
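As a side note, if the covered names always look like prefix_fileName.cpp (as in the example above), a set-based variant avoids the nested loop and any substring pitfalls entirely; the underscore-splitting below is an assumption based on that example:

def find_covered_files(all_files, covered_files):
    # Assumes every covered name looks like "<prefix>_<original name>",
    # e.g. prefix_fileName.cpp -> fileName.cpp.
    all_files_set = set(all_files)
    file_match_list = []
    for cov_file in covered_files:
        original_name = cov_file.split('_', 1)[-1]  # drop the leading prefix
        if original_name in all_files_set:
            file_match_list.append(original_name)
    return file_match_list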

How can I improve performance of finding all files in a folder created at a certain date?

There are 10,000 files in a folder. Some files were created on 2018-06-01, some on 2018-06-09, and so on.
I need to find all files which were created on 2018-06-09. But it is taking too much time (almost 2 hours) to read each file, get the file creation date, and then collect the files which were created on 2018-06-09.
for file in os.scandir(Path):
    if file.is_file():
        file_ctime = datetime.fromtimestamp(os.path.getctime(file)).strftime('%Y-%m-%d %H:%M:%S')
        if file_ctime[0:10] == '2018-06-09':
            # ...
You could try using os.listdir(path) to get all the files and dirs from the given path.
Once you have all the files and directories you could use filter and a lambda function to create a new list of only the files with the desired timestamp.
You could then iterate through that list to do what work you need to on the correct files.
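A rough sketch of that suggestion, assuming os.path.getctime() for the creation time, 2018-06-09 as the target day, and an illustrative folder path:

import os
from datetime import datetime

path = r"C:\some\folder"   # illustrative path
target_day = datetime(2018, 6, 9).date()

full_paths = (os.path.join(path, name) for name in os.listdir(path))
matching_files = list(filter(
    lambda p: os.path.isfile(p)
              and datetime.fromtimestamp(os.path.getctime(p)).date() == target_day,
    full_paths))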
Let's start with the most basic thing - why are you building a datetime only to re-format it as string and then do a string comparison?
Then there is the whole point of using os.scandir() over os.listdir(): os.scandir() returns os.DirEntry objects which cache file stats through the os.DirEntry.stat() call.
Depending on which checks you need to perform, os.listdir() might even perform better: if you expect to do a lot of filtering on the filename, you won't need to build up a whole os.DirEntry just to discard it.
So, to optimize your loop, if you don't expect a lot of filtering on the name:
for entry in os.scandir(Path):
    if entry.is_file() and 1528495200 <= entry.stat().st_ctime < 1528581600:
        pass  # do whatever you need with it
If you do, then better stick with os.listdir() as:
import stat

for entry in os.listdir(Path):
    # do your filtering on the entry name first...
    path = os.path.join(Path, entry)  # build path to the listed entry...
    stats = os.stat(path)  # cache the file entry statistics
    if stat.S_ISREG(stats.st_mode) and 1528495200 <= stats.st_ctime < 1528581600:
        pass  # do whatever you need with it
If you want to be flexible with the timestamps, use datetime.datetime.timestamp() beforehand to get the POSIX timestamps and then you can compare them against what stat_result.st_ctime returns directly without conversion.
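For example, the two timestamps used above could be computed instead of hard-coded (local-time day boundaries for 2018-06-09, as an assumption):

from datetime import datetime, timedelta

day_start = datetime(2018, 6, 9).timestamp()
day_end = (datetime(2018, 6, 9) + timedelta(days=1)).timestamp()
# then compare: day_start <= stats.st_ctime < day_end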
However, even your original, non-optimized approach should be significantly faster than 2 hours for a mere 10k entries. I'd check the underlying filesystem, too; something seems wrong there.

Get file path of continuously updating file

I have found a few approaches to search for the newest file created by a user in a directory, but I need to determine if an easier approach exists. Most posts on this topic work in some instances or have major hurdles, so I am hoping to unmuddy the water.
I am having difficulty looking through a growing file system, as well as bringing more users in with more potential errors.
I get data from a Superlogics Winview CP 32 for a continuously streaming system. Each time the system is used, I have the operator input a unique identifier for the file name, containing a few of the initial conditions of the system we need to track. I would like to get that file name with no help from the operator/user.
Eventually, the end goal is to pare down a list of files I want to search, filtered based on keys, so my first instinct was to use only matching file types, trim all folders in a pathway into a list, and sort based on max timestamp. I used some pretty common functions from these pages:
def fileWalkIn(path='.', matches=[], filt='*.csv'):  # Useful for walking through a given directory
    """Iterates through all files under the given path using a filter."""
    for root, dirnames, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, filt):
            matches.append(os.path.join(root, filename))
            yield os.path.join(root, filename)

def getRecentFile(path='.', matches=[], filt='*.dat'):
    rr = max(fileWalkIn(path=path, matches=matches, filt=filt), key=os.path.getmtime)
    return rr
This got me far, but is rather bulky and slow, which means I cannot do this repeatedly if I want to explore the files that match, lest I have to carry around a bulky list of the matching files.
Ideally, I will be able to process the data on the fly, executing and printing live while it writes, so this approach is not usable in that instance.
I borrowed from these pages a new approach by alex-martelli, which does not use a filter, gives the option of handling files as opposed to directories, is much slimmer than fileWalkIn, and works more quickly when using the timestamp.
def all_subdirs_of(b='.'):  # Useful for walking through a given directory
    # Create hashable list of files or directories in the parent directory
    results = []
    for d in os.listdir(b):
        bd = os.path.join(b, d)
        if os.path.isfile(bd):
            results.append(bd)
        elif os.path.isdir(bd):
            results.append(bd)
    # return both
    return results

def newest(path='.'):
    rr = max(all_subdirs_of(b=path), key=os.path.getmtime)
    return rr

def getActiveFile(newFile='.'):
    while os.path.exists(newFile):
        newFile = newest(newFile)
        if os.path.isfile(newFile):
            return newFile
        else:
            if newFile:
                continue
            else:
                return newFile
This gets me the active file in a directory much more quickly, but only if no other files have been written since launching my data collection. I can see all kinds of problems here and need some help determining whether I have gone down a rabbit hole and there is a simpler solution, like testing file sizes, or whether a more cohesive solution with fewer potential snags exists.
I found other answers for different languages (Java, how-to-get-the-path-of-a-running-jar-file), but would need something in Python. I have explored packages like watchdog and win32, but both require steep learning curves, and I feel like I am either very close, or need to change my paradigm entirely.
dircache might speed up the second approach a bit. It's a wrapper around listdir that checks the directory timestamp and only re-reads directory contents if there's been a change. (Note that dircache is a Python 2 module; it was removed in Python 3.)
Beyond that you really need something that listens to file system events. A quick google turned up two pip packages, pyinotify for Linux only and watchdog.
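If watchdog turns out to be worth the learning curve, the basic pattern for reacting to newly created files is quite small; a sketch with an illustrative watch directory:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Called by the observer thread whenever something is created.
        if not event.is_directory:
            print("new file:", event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), path=r"C:\data", recursive=True)  # illustrative path
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()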
Hope this helps.
