Find duplicate filenames, and only keep newest file using Python

I have 20,000+ files that look like the examples below, all in the same directory:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames, while ignoring the file extension?
Clarification: I refer to a duplicate as a file with the same filename while ignoring the file extension. I do not care whether the file contents are 100% the same (e.g. hash, size, or anything like that).
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files and check whether each file already exists. If so, then use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory. And wondering if there is a more pythonic way of doing things...
Python 2.6
Windows 7

You can do it with O(n) complexity. The sort-based solutions have O(n log n) complexity.
import os
from collections import namedtuple

directory = '.'  # file directory (set this to your path)
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cached_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
newest_files is a dictionary keyed by file name without extension; each value is a named tuple holding the full file name and the modification date. When a file that is encountered is already in the dictionary, its date is compared to the stored one and the entry is replaced if necessary.
In the end you have a dictionary with the most recent files.
Then you may use this dictionary to perform the second pass. Note that lookup complexity in the dictionary is O(1), so the overall complexity of looking up all n files in the dictionary is O(n).
For example, if you want to leave only the newest files with the same name and delete the other, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file_name = newest_files.get(name).file_name
    if file_name != cached_file_name:  # it's not the newest file with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, just by adding one line of code:
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
            os.remove(cached_file.file_name)  # this line added

First, get a list of file names and sort them. This will put any duplicates next to each other.
Then strip off the file extension and compare neighbors; os.path.splitext() and itertools.groupby() may be useful here.
Once you have grouped the duplicates, pick the one you want to keep using os.stat().
In the end your code might look something like this:
import os, itertools

files = os.listdir(base_directory)
files.sort()
for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(g)
    if len(dups) > 1:
        pass  # figure out which file(s) to remove
You shouldn't have to worry about memory here, you're looking at something on the order of a couple of megabytes.
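For completeness, here is a minimal sketch of that last step, assuming base_directory is the folder in question and that the newest file (by modification time from os.stat) is the one to keep:
import os
import itertools

base_directory = '.'  # assumed path to the folder containing the files

files = sorted(os.listdir(base_directory))
for key, group in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(group)
    if len(dups) > 1:
        # sort the duplicates by modification time so the newest is last
        dups.sort(key=lambda f: os.stat(os.path.join(base_directory, f)).st_mtime)
        for older in dups[:-1]:  # delete everything except the newest
            os.remove(os.path.join(base_directory, older))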

For the filename counter you could use a defaultdict that stores how many times each file appears:
import os
from collections import defaultdict

counter = defaultdict(int)
for file_name in file_names:
    file_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[file_name] += 1
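As a possible follow-up (assuming file_names came from os.listdir), the names that occur more than once are the duplicates:
# names (without extension) seen more than once are duplicates
duplicates = [name for name, count in counter.items() if count > 1]
print(duplicates)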

Related

Find and separate max elements in a directory with different file names in it

I'm trying to find the maximum elements in a directory, but that directory has different file names in it. If I make a list and use the max() function, I only get one result (the maximum one, obviously). How can I separate the files before using max() so I can get the maximum elements separately? Or is there a method or function that does that?
Thanks.
example_directory_list = ['CM_30_00.v01', 'CM_30_00.v02', 'CM_30_00.v03', 'CM_30_00_house.v01','CM_30_00_house.v02', 'CM_30_00_house.v03', 'CM_30_01.v01', 'CM_30_00', 'CM_30_01']
print max(example_directory_list)
Desired result: ['CM_30_00.v03', 'CM_30_00_house.v03', 'CM_30_01.v01']
Your question is a bit unclear. It seems the file extension is a version number, and you want to get the filename of the highest version for each different file.
In the code below, I ignore files with no extensions.
It is by no means robust. I don't even bother extracting the version number as an int. It compares the file extensions as str.
It works by looping through all the filenames, separating the filename and extension, compares the extension with the highest version seen so far (or "v00" if it hasn't yet come across the same filename), and updates the dict if necessary. Finally, it reconstructs the highest filenames and stores them in the results list.
example_directory_list = ['CM_30_00.v01',
                          'CM_30_00.v02',
                          'CM_30_00.v03',
                          'CM_30_00_house.v01',
                          'CM_30_00_house.v02',
                          'CM_30_00_house.v03',
                          'CM_30_01.v01',
                          'CM_30_00',
                          'CM_30_01', ]

files = {}
for filename in example_directory_list:
    dot_index = filename.rfind(".")
    if dot_index == -1:
        continue
    name_without_extension = filename[:dot_index]
    extension = filename[dot_index+1:]
    highest_version = files.get(name_without_extension, "v00")
    if extension > highest_version:
        files[name_without_extension] = extension

results = [f"{filename}.{extension}" for filename, extension in files.items()]
print(results)
Output:
['CM_30_00.v03', 'CM_30_00_house.v03', 'CM_30_01.v01']
I made this little change to make it work in Python 2:
results = []
for filename, extension in files.items():
    result = "{filename}.{extension}".format(**locals())
    results.append(result)

How to find files that have matching pattern with current file and merge?

I have a directory that contains multiple files recorded in one day. I need to combine the files that end with the same id, so the logic I am trying to use is to go over each file in the directory and then look for the files that have the matching id. For example, I have files that are stored as below:
a_1234_d.csv
b_1234_d.csv
c_1234_d.csv
a_1256_d.csv
b_1256_d.csv
c_1256_d.csv
These files are not necessarily stored in a sequence like the above, so I need to find the files whose ids match and combine them. So far I have tried the code below, but I need help correcting the pattern-matching part, as it is not practical to keep changing the pattern for every id across hundreds of files.
f = os.listdir(dat_folder)
for file in f:
    if fnmatch.fnmatch(file, '*1234.csv'):
        print(file)
I slightly modified LordDot's code:
import re

f = ["a_1234_d.csv", "b_1234_d.csv", "c_1234_d.csv", "a_1256_d.csv", "b_1256_d.csv", "c_1256_d.csv"]

file_to_compose = {}
for file in f:
    lead, id_of_file, tail = re.split(r'[_]', file)
    if id_of_file in file_to_compose:
        file_to_compose[id_of_file].append(file)
    else:
        file_to_compose[id_of_file] = [file]

for (k, v) in file_to_compose.items():
    print(f'id {k} contains files: {", ".join(v)}')
Output:
id 1234 contains files: a_1234_d.csv, b_1234_d.csv, c_1234_d.csv
id 1256 contains files: a_1256_d.csv, b_1256_d.csv, c_1256_d.csv
You can then easily combine all files belonging to the same id.
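A minimal sketch of that combining step, assuming the files live in dat_folder and can simply be concatenated; the combined_<id>.csv output name is made up for illustration:
import os

dat_folder = '.'  # assumed directory containing the csv files

for id_of_file, file_names in file_to_compose.items():
    out_path = os.path.join(dat_folder, 'combined_{}.csv'.format(id_of_file))
    with open(out_path, 'w') as out:
        for name in file_names:
            with open(os.path.join(dat_folder, name)) as part:
                out.write(part.read())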
Correct me if I'm wrong, but I understand you have a lot of different ids. If they are always separated by '_', you can get the id with the help of the string type's split() function. Then you just have to go through all the files, check the number, and check whether you have already processed that number.
Maybe something like this:
f = ["a_1234_d.csv", "b_1234_d.csv", "a_1235_d.csv"]

processedFiles = []
for file in f:
    number = file.split("_")[1]
    if number not in processedFiles:
        # do your code. now you know the number
        processedFiles = processedFiles + [number]

print(processedFiles)
For your code it's probably helpful to take a look at Nullman's answer.
the glob module is useful here
from glob import glob
print(glob(dat_folder + '*1234.csv'))
glob returns a list of matches
consider using iglob if you want an iterator instead of a list (great when you have a lot of files)
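For illustration, a small sketch using iglob so the matches are processed one at a time; the *_1234_*.csv pattern is an assumption about the naming scheme:
import os
from glob import iglob

dat_folder = '.'  # assumed directory containing the csv files

for path in iglob(os.path.join(dat_folder, '*_1234_*.csv')):
    print(path)  # each matching file is yielded lazily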

Changing name of file until it is unique

I have a script that downloads files (pdfs, docs, etc) from a predetermined list of web pages. I want to edit my script to alter the names of files with a trailing _x if the file name already exists, since it's possible files from different pages will share the same filename but contain different contents, and urlretrieve() appears to automatically overwrite existing files.
So far, I have:
urlfile = 'https://www.foo.com/foo/foo/foo.pdf'
filename = urlfile.split('/')[-1]
# filename is now 'foo.pdf'
if os.path.exists(filename):
    filename = filename.split('.')[0] + '_' + '1'
That works fine for one occurrence, but it looks like after one foo_1.pdf it will start saving as foo_1_1.pdf, and so on. I would like to save the files as foo_1.pdf, foo_2.pdf, and so on.
Can anybody point me in the right direction on how I can ensure that file names are stored in the correct fashion as the script runs?
Thanks.
So what you want is something like this:
import os

curName = "foo_0.pdf"
while os.path.exists(curName):
    num = int(curName.split('.')[0].split('_')[1])
    curName = "foo_{}.pdf".format(num + 1)
Here's the general scheme:
Assume you start from the first file name (foo_0.pdf)
Check if that name is taken
If it is, iterate the name by 1
Continue looping until you find a name that isn't taken
One alternative: generate a list of file numbers that are in use, and update it as needed. If it's sorted, you can say name = "foo_{}.pdf".format(flist[-1]+1). This has the advantage that you don't have to run through all the files every time (as the above solution does). However, you need to keep the list of numbers in memory. Additionally, this will not fill any gaps in the numbers.
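A minimal sketch of that alternative, assuming the numbers already in use have been collected up front; the foo_<n>.pdf naming is illustrative:
import os
import re

# collect the numbers already in use for this base name
flist = sorted(int(m.group(1))
               for f in os.listdir('.')
               for m in [re.match(r'foo_(\d+)\.pdf$', f)] if m)

next_num = flist[-1] + 1 if flist else 0
name = "foo_{}.pdf".format(next_num)  # an unused number; gaps are not reused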
Why not just use the tempfile module:
import tempfile

fileobj = tempfile.NamedTemporaryFile(suffix='.pdf', prefix='', delete=False)
Now your filename will be available in fileobj.name and you can manipulate to your heart's content. As an added benefit, this is cross-platform.
Since you're dealing with multiple pages, this seems more like a "global archive" than a per-page archive. For a per-page archive, I would go with the answer from @wnnmaw.
For a global archive, I would take a different approach...
Create a directory for each filename.
Store the file in the directory as "1" + extension.
Write the current "number" to the directory as "_files.txt".
Additional files are written as 2, 3, 4, etc., and the value in _files.txt is incremented.
The benefits of this:
The directory is the original filename. If you keep turning "Example-1.pdf" into "Example-2.pdf", you run into a possibility where you download a real "Example-2.pdf" and can't associate it with the original filename.
You can grab the number of like-named files either by reading _files.txt or counting the number of files in the directory.
Personally, I'd also suggest storing the files in a tiered bucketing system, so that you don't have too many files/directories in any one directory (hundreds of files makes it annoying as a user; thousands of files can affect OS performance). A bucketing system might turn a filename into a hexdigest, then drop the file into "%s/%s/%s" % (hex[0:3], hex[3:6], filename). The hexdigest is used to give you a more even distribution of characters.
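A rough sketch of that bucketing idea, hashing the filename with MD5 to build the directory tiers; all names here are illustrative:
import hashlib
import os

def bucketed_path(base_dir, filename):
    # hash the filename to get an evenly distributed hex string
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    # two tiers of directories, e.g. <base_dir>/abc/def/<filename>
    bucket = os.path.join(base_dir, digest[0:3], digest[3:6])
    if not os.path.isdir(bucket):
        os.makedirs(bucket)
    return os.path.join(bucket, filename)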
import os

def uniquify(path, sep=''):
    path = os.path.normpath(path)
    num = 0
    newpath = path
    dirname, basename = os.path.split(path)
    filename, ext = os.path.splitext(basename)
    while os.path.exists(newpath):
        newpath = os.path.join(dirname, '{f}{s}{n:d}{e}'
                               .format(f=filename, s=sep, n=num, e=ext))
        num += 1
    return newpath

filename = uniquify('foo.pdf', sep='_')
Possible problems with this include:
If you call uniquify many thousands of times with the same path, each subsequent call may get a bit slower, since the while-loop starts checking from num=0 each time.
uniquify is vulnerable to race conditions whereby a file may not exist at the time os.path.exists is called, but may exist at the time you use the value returned by uniquify. Use tempfile.NamedTemporaryFile to avoid this problem. You won't get incremental numbering, but you will get files with unique names, guaranteed not to already exist. You could use the prefix parameter to specify the original name of the file. For example,
import tempfile
import os

def uniquify(path, sep='_', mode='w'):
    path = os.path.normpath(path)
    if os.path.exists(path):
        dirname, basename = os.path.split(path)
        filename, ext = os.path.splitext(basename)
        return tempfile.NamedTemporaryFile(prefix=filename+sep, suffix=ext, delete=False,
                                           dir=dirname, mode=mode)
    else:
        return open(path, mode)
Which could be used like this:
In [141]: f = uniquify('/tmp/foo.pdf')
In [142]: f.name
Out[142]: '/tmp/foo_34cvy1.pdf'
Note that to prevent a race-condition, the opened filehandle -- not merely the name of the file -- is returned.

Scanning duplicate file names

Imagine several folders such as
d:\myfolder\abc
d:\myfolder\ard
d:\myfolder\kjes
...
And in each folder, there are files such as
0023.txt, 0025.txt, 9932.txt in d:\myfolder\abc
2763.txt, 1872.txt, 0023.txt, 7623.txt in d:\myfolder\ard
2763.txt, 2873.txt, 0023.txt in d:\myfolder\kjes
So, there are three 0023.txt files, and two 2763.txt files.
I want to create a file (say, d:\myfolder\dup.txt) which contains the following information:
0023 3
0025 1
9932 1
2763 2
1872 1
7623 1
2873 1
How can I implement that in Python? Thanks.
Not extensively tested, but this works:
import os, os.path

dupnames = {}
for root, dirs, files in os.walk('myfolder'):
    for file in files:
        fulpath = os.path.join(root, file)
        if file in dupnames:
            dupnames[file].append(fulpath)
        else:
            dupnames[file] = [fulpath]

for name in sorted(dupnames):
    print name, len(dupnames[name])
This works in the following way:
Creates an empty dict;
Walks the file hierarchy;
Creates an entry in a dict of lists (or appends to an existing list) keyed by the base name: [path to file].
After the os.walk you will have a dict like so:
{'0023.txt': ['d:\myfolder\abc\0023.txt', 'd:\myfolder\ard\0023.txt', 'd:\myfolder\kjes\0023.txt'], '0025.txt': ['d:\myfolder\abc\0025.txt']}
So to get your output, just iterate over the sorted dict and count the entries in the list. You can either redirect the output of this to a file or open your output file directly in Python.
You show your output with the extension stripped -- 0023 vs 0023.txt. What should happen if you have 0023.txt and 0023.py? Same file or different? To the OS they are different files so I kept the extension. It is easily stripped if that is your desired output.
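If you do want the extension stripped, a short continuation of the code above might write the counts out like this (the dup.txt path comes from the question):
import os

with open(r'd:\myfolder\dup.txt', 'w') as out:
    for name in sorted(dupnames):
        stem = os.path.splitext(name)[0]  # strip the extension
        out.write('%s %d\n' % (stem, len(dupnames[name])))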
Step 1: use glob.glob to find all the files
Step 2: create a dictionary with each filename's last portion (after the last divider)
Step 3: go through the list of filepaths and find all duplicates.
import os
import collections

path = "d:\myfolder"
filelist = []
for (path, dirs, files) in os.walk(path):
    filelist.extend(files)

filecount = collections.Counter(filelist)
This isn't precisely what you've asked for, but it might work for you without writing a line of code, albeit at a bit of a performance penalty. As a bonus, it'll group together files that have the same content but different filenames:
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
The latest version is almost always O(n), without sacrificing even a little bit of accuracy.

Python - Efficiently building a dictionary

I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.
/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat
I have a couple problems at the moment:
Switching between directories
I know that I have 86 servers, but the number of files per server may vary. I am using:
for i in range(1,86):
    basedir = '/data/server%02d' % i
    for file in glob.glob(basedir + '*.dat'):
        # do reading and sorting here
but I can't seem to switch between the directories properly. It seems to just sit in the first directory and get stuck when there are no files in it.
Checking if key already exists
I would like to have a function that somehow checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't just assign dict[Key1][Subkey1][Subsubkey1] = value.
BTW, I am using Python 2.6.6.
Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.
The best tool for walking a directory and looking at files is os.walk. You can combine the directory and filename names that you get from it with os.path.join to find the files you are interested in. Something like this:
import os

data_path = '/data'

# option 1 using nested list comprehensions**
data_files = (os.path.join(root, f) for (root, dirs, files) in os.walk(data_path)
              for f in files)  # can use [] instead of ()

# option 2 using nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    pass  # ... process data_file ...
**Docs for list comprehensions.
I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that has a function that is called to generate a value when a requested key does not exist. Using lambda you can nest them:
>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0
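As a sketch of how this might be used for the dict(dict(dict())) in the question; the record fields below are made up for illustration:
from collections import defaultdict

your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# hypothetical records: (server, metric, timestamp, value) tuples parsed from the .dat files
records = [('server01', 'load', '2011-01-01', 5),
           ('server01', 'load', '2011-01-01', 3),
           ('server02', 'mem', '2011-01-02', 7)]

for server, metric, timestamp, value in records:
    your_dict[server][metric][timestamp] += value  # missing keys are created automatically

print(your_dict['server01']['load']['2011-01-01'])  # 8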
I'm assuming these 'directories' are remotely mounted shares?
Couple of things:
I'd use os.path.join instead of 'basedir' + '*.dat'
For FS-related stuff I've had very good results parallelizing the computation using multiprocessing.Pool, to get around those times where a remote fs might be extremely slow and hold up the whole process.
import os
import glob
import multiprocessing as mp

def processDir(path):
    results = {}
    for file in glob.iglob(os.path.join(path, '*.dat')):
        pass  # results.update(...)  -- add to the results here
    return results

dirpaths = ['/data/server%02d' % i for i in range(1, 87)]

_results = mp.Pool(8).map(processDir, dirpaths)
# results = combine _results here...
For your dict-related problems, use defaultdict, as mentioned in the other answers, or even your own dict subclass or a helper function:
def addresult(results, key, subkey, subsubkey, value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value
There are almost certainly more efficient ways to accomplish this, but that's a start.
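For instance, the same behaviour can be written with dict.setdefault, which avoids the repeated membership tests; a sketch, not necessarily faster:
def addresult(results, key, subkey, subsubkey, value):
    # setdefault returns the existing inner dict, or inserts and returns a new one
    results.setdefault(key, {}).setdefault(subkey, {}).setdefault(subsubkey, value)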
