Imagine several folders such as
d:\myfolder\abc
d:\myfolder\ard
d:\myfolder\kjes
...
And in each folder, there are files such as
0023.txt, 0025.txt, 9932.txt in d:\myfolder\abc
2763.txt, 1872.txt, 0023.txt, 7623.txt in d:\myfolder\ard
2763.txt, 2873.txt, 0023.txt in d:\myfolder\kjes
So, there are three 0023.txt files, and two 2763.txt files.
I want to create a file (say, d:\myfolder\dup.txt) which contains the following information:
0023 3
0025 1
9932 1
2763 2
1872 1
7623 1
2873 1
How can I implement that in Python? Thanks.
Not extensively tested, but this works:
import os, os.path

dupnames = {}
for root, dirs, files in os.walk('myfolder'):
    for file in files:
        fullpath = os.path.join(root, file)
        if file in dupnames:
            dupnames[file].append(fullpath)
        else:
            dupnames[file] = [fullpath]

for name in sorted(dupnames):
    print(name, len(dupnames[name]))
This works in the following way:
Creates an empty dict;
Walks the file hierarchy;
Creates entries in a dict of lists (or appends to an existing list), mapping each base name to the full paths where it occurs.
After the os.walk you will have a dict like so:
{'0023.txt': ['d:\\myfolder\\abc\\0023.txt', 'd:\\myfolder\\ard\\0023.txt', 'd:\\myfolder\\kjes\\0023.txt'], '0025.txt': ['d:\\myfolder\\abc\\0025.txt']}
So to get your output, just iterate over the sorted dict and count the entries in the list. You can either redirect the output of this to a file or open your output file directly in Python.
You show your output with the extension stripped -- 0023 vs 0023.txt. What should happen if you have 0023.txt and 0023.py? Same file or different? To the OS they are different files so I kept the extension. It is easily stripped if that is your desired output.
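If you would rather open the output file directly, a minimal sketch (reusing the dupnames dict built above, writing to the OP's d:\myfolder\dup.txt, and stripping the extension as discussed) could be:
import os.path

with open(r'd:\myfolder\dup.txt', 'w') as out:
    for name in sorted(dupnames):
        base = os.path.splitext(name)[0]  # '0023.txt' -> '0023'
        out.write('{} {}\n'.format(base, len(dupnames[name])))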
Step 1: use os.walk to find all the files under the top folder
Step 2: collect each filename's last portion (the base name after the last divider) into a list
Step 3: count the duplicates in that list with collections.Counter.
import os
import collections

path = r"d:\myfolder"
filelist = []
for root, dirs, files in os.walk(path):
    filelist.extend(files)

filecount = collections.Counter(filelist)
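To turn filecount into something like the requested dup.txt, one could then write the counts out (a sketch reusing the imports and filecount from above; the extension is stripped with os.path.splitext to match the desired output):
with open(r'd:\myfolder\dup.txt', 'w') as out:
    for name, count in filecount.items():
        out.write('{} {}\n'.format(os.path.splitext(name)[0], count))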
This isn't precisely what you've asked for, but it might work for you without writing a line of code, albeit at a bit of a performance penalty. As a bonus, it'll group together files that have the same content but different filenames:
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
The latest version is almost always O(n), without sacrificing even a little bit of accuracy.
Related
I'm really new to python and looking to organize hundreds of files and want to use regex to move them to the correct folders.
Example: I would like to move 4 files into different folders.
File A has "USA" in the name
File B has "Europe" in the name
File C has both "USA" and "Europe" in the name
File D has "World" in the name
Here is what I am thinking, but I don't think this is correct:
shutil.move('Z:\local 1\[.*USA.*]', 'Z:\local 1\USA')
shutil.move('Z:\local 1\[.*\(Europe\).*]', 'Z:\local 1\Europe')
shutil.move('Z:\local 1\[.*World.*]', 'Z:\local 1\World')
You can list all the files in a directory and move them to a new folder if their names match a given regular expression, as follows:
import os
import re
import shutil
for filename in os.listdir('path/to/some/directory'):
    # match against the file name itself, not a full-path pattern
    if re.search(r'USA', filename):
        shutil.move(os.path.join('path/to/some/directory', filename), r'Z:\local 1\USA')
    elif re.search(r'\(Europe\)', filename):
        shutil.move(os.path.join('path/to/some/directory', filename), r'Z:\local 1\Europe')
    # and so forth
However, os.listdir shows only the direct subfolders and files, but it does not iterate deeper. If you want to analyze all the files recursively in a given folder use the os.walk method.
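A recursive variant of the same idea might look like this (a sketch; the source directory is the placeholder from above and the destination is the question's USA folder):
import os
import re
import shutil

for root, dirs, files in os.walk('path/to/some/directory'):
    for filename in files:
        # match against the bare file name, then move using its full path
        if re.search(r'USA', filename):
            shutil.move(os.path.join(root, filename), r'Z:\local 1\USA')
        # further patterns as above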
According to the definition of shutil.move, it needs two things:
src, which is the path of the source file;
dst, which is the path to the destination folder.
It says that src and dst should be paths, not regular expressions.
What you have is os.listdir(), which lists the files in a directory.
So what you need to do is to list files, then try to match file names against regular expressions. If you get a match, then you know where the file should go.
That said, you still need to decide what to do with option C that matches both 'USA' and 'Europe'.
For added style points you can put (regex, destination_path) pairs into a list, tuple or dict; in this case you can add any number of rules without changing or duplicating the logic, as sketched below.
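A minimal sketch of that idea, assuming the folder names from the question and a placeholder source directory:
import os
import re
import shutil

# (pattern, destination) rules, checked in order; the first match wins,
# so a file with both "USA" and "(Europe)" in its name goes to the USA folder
rules = [
    (re.compile(r'World'), r'Z:\local 1\World'),
    (re.compile(r'USA'), r'Z:\local 1\USA'),
    (re.compile(r'\(Europe\)'), r'Z:\local 1\Europe'),
]

source = 'path/to/some/directory'  # placeholder
for filename in os.listdir(source):
    for pattern, destination in rules:
        if pattern.search(filename):
            shutil.move(os.path.join(source, filename), destination)
            break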
I am combining two questions here because they are related to each other.
Question 1: I am trying to use glob to open all the files in a folder but it is giving me "Syntax Error". I am using Python 3.xx. Has the syntax changed for Python 3.xx?
Error Message:
File "multiple_files.py", line 29
files = glob.glob(/src/xyz/rte/folder/)
SyntaxError: invalid syntax
Code:
import csv
import os
import glob
from pandas import DataFrame, read_csv
#extracting
files = glob.glob(/src/xyz/rte/folder/)
for fle in files:
    with open(fle) as f:
        print("output" + fle)
        f_read.close()
Question 2: I want to read input files, append "output" to the names and print out the names of the files. How can I do that?
Example: the input file name would be xyz.csv and the code should print output_xyz.csv.
Your help is appreciated.
Your first problem is that strings, including pathnames, need to be in quotes. This:
files = glob.glob(/src/xyz/rte/folder/)
… is trying to divide a bunch of variables together, but the leftmost and rightmost divisions are missing operands, so you've confused the parser. What you want is this:
files = glob.glob('/src/xyz/rte/folder/')
Your next problem is that this glob pattern doesn't have any globs in it, so the only thing it's going to match is the directory itself.
That's perfectly legal, but kind of useless.
And then you try to open each match as a text file. Which you can't do with a directory, hence the IsADirectoryError.
The answer here is less obvious, because it's not clear what you want.
Maybe you just wanted all of the files in that directory? In that case, you don't want glob.glob, you want listdir (or maybe scandir): os.listdir('/src/xyz/rte/folder/').
Maybe you wanted all of the files in that directory or any of its subdirectories? In that case, you could do it with rglob, but os.walk is probably clearer.
Maybe you did want all the files in that directory that match some pattern, so glob.glob is right—but in that case, you need to specify what that pattern is. For example, if you wanted all .csv files, that would be glob.glob('/src/xyz/rte/folder/*.csv').
Finally, you say "I want to read input files, append 'output' to the names and print out the names of the files". Why do you want to read the files if you're not doing anything with the contents? You can do that, of course, but it seems pretty wasteful. If you just want to print out the filenames with output appended, that's easy:
for filename in os.listdir('/src/xyz/rte/folder/'):
    print('output' + filename)
This works in http://pyfiddle.io:
Docs: https://docs.python.org/3/library/glob.html
import os
import glob

# create some files
for n in ["a", "b", "c", "d"]:
    with open('{}.txt'.format(n), "w") as f:
        f.write(n)

print("\nFiles before")
# get all files
files = glob.glob("./*.*")
for fle in files:
    print(fle)  # print file
    path, fileName = os.path.split(fle)  # split name from path
    # open file for read and a second one for write with modified name
    with open(fle) as f, open('{}{}output_{}'.format(path, os.sep, fileName), "w") as w:
        content = f.read()  # read all
        w.write(content.upper())  # write all, uppercased

# check files afterwards
print("\nFiles after")
files = glob.glob("./*.*")  # pattern for all files
for fle in files:
    print(fle)
Output:
Files before
./d.txt
./main.py
./c.txt
./b.txt
./a.txt
Files after
./d.txt
./output_c.txt
./output_d.txt
./main.py
./output_main.py
./c.txt
./b.txt
./output_b.txt
./a.txt
./output_a.txt
I am on Windows and would use os.walk (docs) instead.
for d, subdirs, files in os.walk("./"):  # unpack current dir, list of subdirs, list of files
    print("AktDir:", d)
    print("Subdirs:", subdirs)
    print("Files:", files)
Output:
AktDir: ./
Subdirs: []
Files: ['d.txt', 'output_c.txt', 'output_d.txt', 'main.py', 'output_main.py',
'c.txt', 'b.txt', 'output_b.txt', 'a.txt', 'output_a.txt']
It also recurses into subdirs.
I have been using the os.walk() method in Python to make a list of the paths to all the folders and subfolders where a specific file can be found.
I was tired of using a bunch of loops and elifs, and packed it all into a (quite messy) list comprehension that does exactly what I want:
import os
directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"
list_of_paths = [path for path in (os_tuple[0] for os_tuple in os.walk(directory) if file_name in (item.lower() for item in os_tuple[2]))]
I have two questions. The first, and most important, is: is there a more efficient way to do this? I often expect to find several hundred files in just as many folders, and if it's on a server it can take several minutes.
The second question is: How can I make it more readable? Having two generator comprehensions inside a list comprehension feels pretty messy.
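For comparison, here is the same logic unpacked into a plain loop (a sketch that should behave the same way as the comprehension above):
import os

directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"

list_of_paths = []
for root, dirs, files in os.walk(directory):
    # case-insensitive match against the file names in this folder
    if file_name in (item.lower() for item in files):
        list_of_paths.append(root)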
Update: I was told to use glob, so naturally I had to try it. It seems to work just as well as my list comprehension with os.walk(). My next step will therefore be to test the two versions on a couple of different files and folders.
import glob
directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"
list_of_paths = [path.lower().replace(("\\" + file_name), "") for path in (glob.glob(directory + "/**/*" + file_name, recursive=True))]
Any additional comments are very welcome.
Update 2: After testing both methods, the results I'm getting suggests that the os.walk() method is about twice as fast as the glob.glob() method. The test was performed on about 400 folders with a total of 326 copies of the file I was looking for.
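For anyone wanting to reproduce such a comparison, a minimal timing harness could look like this (a sketch; walk_version and glob_version wrap the two snippets above, and the directory and file name are placeholders):
import glob
import os
import timeit

directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"

def walk_version():
    return [root for root, dirs, files in os.walk(directory)
            if file_name in (item.lower() for item in files)]

def glob_version():
    return [path.lower().replace("\\" + file_name, "")
            for path in glob.glob(directory + "/**/*" + file_name, recursive=True)]

print("os.walk:", timeit.timeit(walk_version, number=5))
print("glob:   ", timeit.timeit(glob_version, number=5))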
The idea is simple: there is a directory with two or more *.txt files. My script should look in the directory and get the filenames in order to copy the files (if they exist) over the network.
As a Python newbie, I am facing problems which I cannot resolve so far.
My code:
files = os.listdir('c:\\Python34\\')
for f in files:
    if f.endswith(".txt"):
        print(f)
This example returns 3 files:
LICENSE.txt
NEWS.txt
README.txt
Now I need to use every filename in order to do a SCP. The problem is that when I try to get the first filename with:
print(f[0])
I am receiving just the first letters from each file in the list:
L
N
R
How do I add the filenames to an array in order to use them later as array elements?
You can also try using the extend method. So you say:
x = []
for f in files:
    if f.endswith(".txt"):
        x.extend([f])
so it "adds" the filename that f currently holds to the end of the list (x.append(f) would do the same for a single item).
If you want a list of matching file names, then instead of using os.listdir and filtering, use glob.glob with a suitable pattern.
import glob
files = glob.glob('C:\\python34\\*.txt')
Then you can access files[0] etc...
The array of files is files. In the loop, f is a single file name (a string) so f[x] gets the xth character of a filename. Do files[0] instead of f[0].
Currently I have a bash script which runs the find command, like so:
find /storage/disk-1/Media/Video/TV -name *.avi -mtime -7
This gets a list of TV shows that were added to my system in the last 7 days. I then go on to create some symbolic links so I can get to my newest TV shows.
I'm looking to re-code this in Python, but I have a few questions I can't seem to find the answers for using Google (maybe I'm not searching for the right thing). I think the best way to sum this up is to ask the question:
How do I perform a search on my file system (should I call find?) which gives me an array of file info objects (containing the modify date, file name, etc) so that I may sort them based on date, and other such things?
import os, time

allfiles = []
now = time.time()

# walk will return triples (current dir, list of subdirs, list of regular files);
# file names are relative to that dir at first
for dirpath, subdirs, files in os.walk("/storage/disk-1/Media/Video/TV"):
    for f in files:
        if not f.endswith(".avi"):
            continue
        # compute the full path name
        f = os.path.join(dirpath, f)
        st = os.stat(f)
        if st.st_mtime < now - 3600 * 24 * 7:
            # too old
            continue
        allfiles.append((f, st))
This will return all files that find also returned, as a list of pairs (filename, stat result).
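From there, sorting by date is straightforward; for instance, a sketch that orders the pairs newest first and prints them:
# sort the (filename, stat) pairs by modification time, newest first
allfiles.sort(key=lambda pair: pair[1].st_mtime, reverse=True)
for name, st in allfiles:
    print(name, time.ctime(st.st_mtime))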
Look into the os module: os.walk is the function which walks the file system, and os.path is the module which gives the file mtime (os.path.getmtime()) and other file information; os.path also defines a lot of functions for parsing and splitting filenames.
Also of interest, the glob module defines functions for "globbing" strings (matching a string using Unix wildcard rules).
From this, building a list of files matching some criterion should be easy.
You can use "find" through the "subprocess" module.
Afterwards, use the "split" string function to dissect each line
For each file, use the OS module (e.g. getmtime etc.) to get file information
or
Use the "walk" and "glob" modules to get the file paths in objects