I am using the code from here:
http://pythoncentral.io/finding-duplicate-files-with-python/
to find duplicated files in a folder.
These are my first steps in Python (I come from VBA for Excel), and my problem is probably very simple, but I have tried several things without success. After running the code I get the message:
-f is not a valid path, please verify
An exception has occurred, use %tb to see the full traceback.
%tb generates:
SystemExit Traceback (most recent call last)
<ipython-input-118-31268a802b4a> in <module>()
11 else:
12 print('%s is not a valid path, please verify' % i)
---> 13 sys.exit()
14 printResults(dups)
15 else:
SystemExit:
The code I am using is:
# dupFinder.py
import os, sys
import hashlib

def findDup(parentFolder):
    # Dups in format {hash:[names]}
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Calculate hash
            file_hash = hashfile(path)
            # Add or append the file path
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
    return dups

# Joins two dictionaries
def joinDicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]

def hashfile(path, blocksize = 65536):
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()

def printResults(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Duplicates Found:')
        print('The following files are identical. The name could differ, but the content is identical')
        print('___________________')
        for result in results:
            for subresult in result:
                print('\t\t%s' % subresult)
            print('___________________')
    else:
        print('No duplicate files found.')

if __name__ == '__main__':
    path='C:/DupTestFolder/' #this is the path to analyze for duplicated files
    if len(sys.argv) > 1:
        dups = {}
        folders = sys.argv[1:]
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to the dups
                joinDicts(dups, findDup(i))
            else:
                print('%s is not a valid path, please verify' % i)
                sys.exit()
        printResults(dups)
    else:
        print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')
I tried ending the path with and without "\" at the end, but the result is the same.
I am running Jupyter with Python 3.
Thanks in advance for your help!
The path variable is not used in your code.
All you do is an iteration over sys.argv[1:], which are the parameters of your script. You consider each parameter as a directory path.
On a Windows console, you can try:
python dupFinder.py C:\DupTestFolder
It should work.
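sys.argv holds the arguments passed to a script on the command line. In a Jupyter notebook it does not contain your folder paths. It contains the arguments the kernel was started with, which is where the "-f" in your error message comes from. Either run the script from a command window, or call the functions directly from the notebook with the folder you want.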
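One way to make the same code usable from both a command window and a notebook is to decide which folders to scan before touching the rest of the script. The following is a minimal sketch, not part of the original dupFinder.py: pick_folders is a hypothetical helper that keeps only the arguments that are existing directories and falls back to a default list when none qualify.

```python
import os
import sys


def pick_folders(argv, default_folders):
    """Return the existing directories in argv, or default_folders if none.

    pick_folders is a hypothetical helper, not part of dupFinder.py.
    """
    candidates = [arg for arg in argv if os.path.isdir(arg)]
    return candidates if candidates else list(default_folders)


# In a notebook, sys.argv[1:] typically holds kernel arguments such as
# "-f" followed by a connection-file path, so the default is used instead.
folders = pick_folders(sys.argv[1:], ['C:/DupTestFolder/'])
print(folders)
```

The returned list could then be passed to the existing loop in place of sys.argv[1:].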
Thanks! I am able to run the code now. I had to do two things:
Save dupFinder.py to the folder my Python installation runs from, in my case C:\Users\Pepe.
Open the cmd window from Anaconda, which starts in the folder where Python is running. I presume I could also open a regular command window and navigate to that folder with the cd command.
Finally, I ran python dupFinder.py C:\DupTestFolder.
Now I need to find out how to save the results to a .txt file for future use. I will search for it before posting. Thanks for your help!
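For the follow-up about saving the results, one possible approach is to write the duplicate groups to a file instead of printing them. This is only a sketch: save_results and the output file name are assumptions for illustration, and the dictionary is expected in the {hash: [names]} format that findDup builds.

```python
import os
import tempfile


def save_results(dups, out_path):
    """Write each group of identical files to out_path, one path per line.

    save_results is a hypothetical helper, not part of dupFinder.py.
    Returns the number of duplicate groups written.
    """
    groups = [paths for paths in dups.values() if len(paths) > 1]
    with open(out_path, 'w', encoding='utf-8') as out:
        for paths in groups:
            for p in paths:
                out.write(p + '\n')
            out.write('___________________\n')
    return len(groups)


# Example data in the {hash: [names]} format used by findDup.
example = {
    'hash1': ['a.txt', 'copy_of_a.txt'],
    'hash2': ['unique.txt'],
}
out_file = os.path.join(tempfile.gettempdir(), 'duplicates.txt')
print(save_results(example, out_file))
```

Without changing the script at all, redirecting the output in the command window should also work: python dupFinder.py C:\DupTestFolder > results.txt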
I have this file here
# --------------------------------------------------------------
# Goal : Remove file base on input match
# Run : curl 45.55.88.57/code/fileModifier.py | python3
import os
import sys

rootdir = os.path.abspath(os.curdir)
print(rootdir)

#Give string to remove from folder names. Ensure that removing a string doesn't make the folder name empty. It won't work
removeStringFromFolderName = input('Remove this from folder names :')
while removeStringFromFolderName == '':
    print('Empty string not allowed')
    removeStringFromFolderName = input('Remove this file if contain : ')

count = 0
subdir = [x for x in os.walk(rootdir)]
toRemove = []
for chunk in subdir:
    folders = chunk[1]
    if len(folders) > 0:
        for aDir in folders:
            if removeStringFromFolderName in aDir:
                toRemove.append((chunk[0], aDir))

toRemove.reverse()
for folders in toRemove:
    oldPath = (os.path.join(folders[0], folders[1]))
    newPath = (os.path.join(folders[0], folders[1].replace(removeStringFromFolderName,'')))
    os.rename(oldPath, newPath)
    count +=1

subdir = [x for x in os.walk(rootdir)]
for chunk in subdir:
    folders = chunk[1]
    if len(folders) > 0:
        for aDir in folders:
            if removeStringFromFolderName in aDir:
                print(os.path.join(chunk[0], aDir))
                oldPath = (os.path.join(chunk[0], aDir))
                newPath = (os.path.join(chunk[0], aDir.replace(removeStringFromFolderName,'')))
                os.rename(oldPath, newPath)
                count +=1
print('Renamed', count, 'files')

count = 0
#Give string to delete files which contain this string
removeThisFileNameIfContain = input('Enter string to delete files which contain this string: ')
while removeThisFileNameIfContain == '':
    print('Empty string not allowed')
    removeThisFileNameIfContain = input('Enter string to delete files which contain this string: ')

for subdir, dirs, files in os.walk(rootdir):
    for aFile in files:
        if '.py' in aFile:
            continue
        if removeThisFileNameIfContain in aFile:
            os.remove(os.path.join(subdir, aFile))
            count += 1
print('Deleted', count, 'files')
It works perfectly on my local machine with python3, but when I uploaded it to my VM and executed it remotely via cURL, I kept getting this:
curl 45.55.88.57/code/fileModifier.py | python3
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2266 100 2266 0 0 43381 0 --:--:-- --:--:-- --:--:-- 43576
/Users/bheng/Desktop/projects/bheng/fileModifier
Remove this from folder names :Traceback (most recent call last):
File "<stdin>", line 16, in <module>
EOFError: EOF when reading a line
What did I miss?
Your pipeline is using up stdin, which input() needs.
Try this if your shell has the ability:
python3 <(curl 45.55.88.57/code/fileModifier.py)
Note: As Amadan said, your syntax (and mine) run a remote script locally, not vice versa.
First of all, you are not executing the script remotely. You are fetching it and then executing it locally by piping it into the Python interpreter.
When you are piping the script into Python, stdin is where the program comes from. You cannot use stdin to also get data from input().
To actually execute it remotely, you would need to either build in a web server in your code, or tell an existing web server to run your Python code (e.g. by registering a CGI handler).
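The stdin conflict can be reproduced without a network connection. The sketch below assumes nothing beyond the standard library: it feeds a two-line stand-in script to a child interpreter over stdin, the same way the curl pipe does, and checks that input() finds nothing left to read.

```python
import subprocess
import sys

# A tiny stand-in for the downloaded script.
script = "answer = input('Remove this from folder names :')\nprint(answer)\n"

# Roughly equivalent to: curl ... | python3
# The child interpreter reads the program itself from stdin, so by the
# time input() runs, stdin is already at end of file.
result = subprocess.run(
    [sys.executable, "-"],
    input=script,
    capture_output=True,
    text=True,
)

print(result.returncode != 0)
print("EOFError" in result.stderr)
```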
I'm trying to loop through some files in a directory. If the filename has two specific strings together, then I'm supposed to open and read those files for information. However, if none of the files have those two strings, I want to print an error message only once.
for filename in os.listdir(directory):
    if filename.find("<string1>") != -1 and filename.find("<string2>") != -1:
        #open file
    else:
        #print error message
I know doing this will print as many error messages as there are files in the directory (i.e. if there's 15 files with no matches, I'll get 15 error messages). But what I want is to only print an error message once after there aren't any matches in any of the N files in directory. I figured I could do something like this:
for filename in os.listdir(directory):
    if filename.find("<string1>") != -1 and filename.find("<string2>") != -1:
        #open file
    else:
        if filename[-1]: #if filename is last in directory
            #print error message
But I've discovered this doesn't work. How would I get an error message to print only after the last filename has been read and doesn't match?
A simple solution would be to initialize a boolean flag before your for loop, e.g. found = False
If you find a file, set found = True. Then you can check the value of found after your for loop finishes and print the appropriate message based on its value.
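A minimal sketch of the flag approach follows. The function name find_matches and its parameters are made up for illustration, and appending to a list stands in for opening the file.

```python
import os


def find_matches(directory, first, second):
    """Return the filenames in directory that contain both substrings."""
    found = False
    matches = []
    for filename in os.listdir(directory):
        if first in filename and second in filename:
            found = True
            matches.append(filename)  # stand-in for "open file"
    if not found:
        # Reached at most once, after every filename has been checked.
        print('No filename contains both %r and %r' % (first, second))
    return matches


print(find_matches('.', 'string1', 'string2'))
```

Because the check sits after the loop, the message is printed once no matter how many files the directory holds.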
Filter the list of files before the for-loop:
filenames = [fname for fname in os.listdir(directory)
             if '<string1>' in fname and '<string2>' in fname]
if filenames:
    for filename in filenames:
        #open file
else:
    #print error message
You can probably also use the glob module to get the filenames:
import glob
filenames = glob.glob(directory + '/*string1*string2*')
Another way is to use a variable to check if all the files have been processed. Checked and found it working in Python 2.7
import os

directory = "E:\\test\\"
files_count = len(os.listdir(directory))
files_processed = 0
for filename in os.listdir(directory):
    if 'string1' in filename and 'string2' in filename:
        #open file
        print ("Opening file")
    else:
        files_processed = files_processed + 1
if (files_processed >= files_count):
    print ("error message")
Not sure if this is extreme, but I'd make it a function and raise IOError.
Plus, I'd always use absolute paths. Try the pathlib module too.
import os

def get_files(directory):
    found = False
    for filename in os.listdir(directory):
        if "string1" in filename and "string2" in filename:
            found = True
            yield filename
    if not found:
        # Raise only when no filename matched
        raise IOError("No such file")

for file in get_files('.'):
    print(file)
    # do stuff with file
I have written a simple test script in Python that reads from a text file and then performs an action if the text file contains the line "on".
My code works fine if I run the script on my hard drive with the text file in the same folder. Example: (C:\Python27\my_file.txt and C:\Python27\my_scipt.py).
However, if I try this code while my text file is located on my flash drive and my script is still on my hard drive, it won't work even though I have the correct path specified. Example: (G:\flashdrive_folder\flashdrive_file.txt and C:\Python27\my_scipt.py).
Here is the code I have written out.
def locatedrive():
    file = open("G:\flashdrive_folder\flashdrive_file.txt", "r")
    flashdrive_file = file.read()
    file.close()
    if flashdrive_file == "on":
        print "working"
    else:
        print"fail"

while True:
    print "trying"
    try:
        locatedrive()
        break
    except:
        pass
    break
The backslash character does double duty. Windows uses it as a path separator, and Python uses it to introduce escape sequences.
You need to escape the backslash (using a backslash!), or use one of the other techniques below:
file = open("G:\\flashdrive_folder\\flashdrive_file.txt", "r")
or
file = open(r"G:\flashdrive_folder\flashdrive_file.txt", "r")
or
file = open("G:/flashdrive_folder/flashdrive_file.txt", "r")
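To see what goes wrong with the original literal, it helps to compare the strings Python actually builds. This is a small illustration, not a fix in itself. It relies on the fact that \f is the form-feed escape, so the unescaped path silently loses both backslashes that precede an "f".

```python
# In a normal string literal, backslash + f is one character (form feed).
broken = "G:\flashdrive_folder\flashdrive_file.txt"
escaped = "G:\\flashdrive_folder\\flashdrive_file.txt"
raw = r"G:\flashdrive_folder\flashdrive_file.txt"

print(repr(broken))            # shows \x0c where the backslashes were expected
print("\x0c" in broken)        # True: the path contains form-feed characters
print(escaped == raw)          # True: both spell the intended path
print(len(raw) - len(broken))  # 2: each of the two escapes collapsed to one character
```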
cd /media/usb0
import os
path = "/media/usb0"
#!/usr/bin/python
import os
path = "/usr/tmp"
# Check current working directory.
retval = os.getcwd()
print "Current working directory %s" % retval
# Now change the directory
os.chdir( path )
# Check current working directory.
retval = os.getcwd()
print "Directory changed successfully %s" % retval
Use:
import os
os.chdir(path_to_flashdrive)
I've created a simple script to rename my media files that have lots of weird periods and stuff in them that I have obtained and want to organize further. My script kinda works, and I will be editing it to edit the filenames further but my os.rename line throws this error:
[Windows Error: Error 2: The system cannot find the file specified.]
import os
for filename in os.listdir(directory):
    fcount = filename.count('.') - 1 #to keep the period for the file extension
    newname = filename.replace('.', ' ', fcount)
    os.rename(filename, newname)
Does anyone know why this might be? I have a feeling that it doesn't like me trying to rename the file without including the file path?
Try giving os.rename the full path for both the old and the new name:
os.rename(os.path.join(directory, filename), os.path.join(directory, newname))
Triton Man has already answered your question. If his answer doesn't work I would try using absolute paths instead of relative paths.
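Putting the path advice together, the loop from the question could look roughly like the sketch below. The function name strip_extra_dots is made up for illustration. Both arguments to os.rename are joined with the directory, so the result should not depend on the current working directory.

```python
import os


def strip_extra_dots(directory):
    """Replace every period except the last one with a space.

    Returns a list of (old_name, new_name) pairs for the renamed entries.
    """
    renamed = []
    for filename in os.listdir(directory):
        fcount = filename.count('.') - 1  # keep the period before the extension
        if fcount < 1:
            continue  # zero or one period: nothing to replace
        newname = filename.replace('.', ' ', fcount)
        os.rename(os.path.join(directory, filename),
                  os.path.join(directory, newname))
        renamed.append((filename, newname))
    return renamed
```

It would be called with the media folder, for example strip_extra_dots(folder), where folder is whatever directory holds the files.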
I've done something similar before, but to keep name clashes from happening I temporarily moved all the files to a subfolder. The entire process happened so fast that I never saw the subfolder get created in Windows Explorer.
Anyhow, if you're interested in looking at my script, it's shown below. You run the script on the command line and pass in, as a command-line argument, the directory of the jpg files you want renamed.
Here's the script I used to rename .jpg files to multiples of 10. It might be useful to look at.
'''renames pictures to multiples of ten'''
import sys, os

debug=False
try:
    path = sys.argv[1]
except IndexError:
    path = os.getcwd()

def toint(string):
    '''changes a string to a numerical representation
    string must only contain characters with an ordinal value between 0 and 899'''
    string = str(string)
    ret=''
    for i in string:
        ret += str(ord(i)+100) #we add 100 to make all the numbers 3 digits, making it easy to separate the numbers back out when we need to undo this operation
    assert len(ret) == 3 * len(string), 'received an invalid character. Characters must have an ordinal value between 0-899'
    return int(ret)

def compare_key(file):
    file = file.lower().replace('.jpg', '').replace('dscf', '')
    try:
        return int(file)
    except ValueError:
        return toint(file)

#files are temporarily placed in a folder
#to prevent clashing filenames
i = 0
files = os.listdir(path)
files = (f for f in files if f.lower().endswith('.jpg'))
files = sorted(files, key=compare_key)
for file in files:
    i += 10
    if debug: print('renaming %s to %s.jpg' % (file, i))
    #join with path so the script does not depend on the current directory
    os.renames(os.path.join(path, file), os.path.join(path, 'renaming', '%s.jpg' % i))
for root, __, files in os.walk(os.path.join(path, 'renaming')):
    for file in files:
        if debug: print('moving %s to %s' % (os.path.join(root, file), os.path.join(path, file)))
        os.renames(os.path.join(root, file), os.path.join(path, file))
Edit: I got rid of all the jpg fluff. You could use this code to rename your files. Just change the rename_file function to get rid of the extra dots. I haven't tested this code so there is a possibility that it might not work.
import sys, os

path = sys.argv[1]

def rename_file(file):
    return file

#files are temporarily placed in a folder
#to prevent clashing filenames
files = os.listdir(path)
for file in files:
    #join with path so the script does not depend on the current directory
    os.renames(os.path.join(path, file), os.path.join(path, 'renaming', rename_file(file)))
for root, __, files in os.walk(os.path.join(path, 'renaming')):
    for file in files:
        os.renames(os.path.join(root, file), os.path.join(path, file))
Looks like I just needed to set the default directory and it worked just fine.
folder = r"blah\blah\blah"
os.chdir(folder)
for filename in os.listdir(folder):
    fcount = filename.count('.') - 1
    newname = filename.replace('.', ' ', fcount)
    os.rename(filename, newname)
I'm trying to delete old SVN files from directory tree. shutil.rmtree and os.unlink raise WindowsErrors, because the script doesn't have permissions to delete them. How can I get around that?
Here is the script:
# Delete all files of a certain type from a directory
import os
import shutil

dir = "c:\\"
verbosity = 0;

def printCleanMsg(dir_path):
    if verbosity:
        print "Cleaning %s\n" % dir_path

def cleandir(dir_path):
    printCleanMsg(dir_path)
    toDelete = []
    dirwalk = os.walk(dir_path)
    for root, dirs, files in dirwalk:
        printCleanMsg(root)
        toDelete.extend([root + os.sep + dir for dir in dirs if '.svn' == dir])
        toDelete.extend([root + os.sep + file for file in files if 'svn' in file])
    print "Items to be deleted:"
    for candidate in toDelete:
        print candidate
    print "Delete all %d items? [y|n]" % len(toDelete)
    choice = raw_input()
    if choice == 'y':
        deleted = 0
        for filedir in toDelete:
            if os.path.exists(filedir): # could have been deleted already by rmtree
                try:
                    if os.path.isdir(filedir):
                        shutil.rmtree(filedir)
                    else:
                        os.unlink(filedir)
                    deleted += 1
                except WindowsError:
                    print "WindowsError: Couldn't delete '%s'" % filedir
        print "\nDeleted %d/%d files." % (deleted, len(toDelete))
    exit()

if __name__ == "__main__":
    cleandir(dir)
Not a single file is able to be deleted. What am I doing wrong?
To recursively remove all .svn directories I use this script. Maybe it will help someone.
import os, shutil, stat

def del_evenReadonly(action, name, exc):
    # Clear the read-only flag, then retry the operation that failed
    os.chmod(name, stat.S_IWRITE)
    action(name)

for root, subFolders, files in os.walk(os.getcwd()):
    if '.svn' in subFolders:
        shutil.rmtree(os.path.join(root, '.svn'), onerror=del_evenReadonly)
Subversion usually makes all the .svn directories (and everything in them) write protected. Probably you have to remove the write protection before you can remove the files.
I'm not really sure how to do this best with Windows, but you should be able to use os.chmod() with the stat.S_IWRITE flag. Probably you have to iterate through all the files in the .svn directories and make them all writable individually.
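The idea of clearing the write protection first can be sketched as follows. This is an untested-on-Windows illustration rather than a definitive fix: remove_svn_dirs is a made-up name, and it assumes the read-only flag on the files is the only thing blocking deletion.

```python
import os
import shutil
import stat


def remove_svn_dirs(top):
    """Delete every .svn directory under top, making its files writable first.

    Returns the list of directories that were removed.
    """
    removed = []
    for root, dirs, files in os.walk(top):
        if '.svn' not in dirs:
            continue
        svn_dir = os.path.join(root, '.svn')
        # Make each file inside the .svn directory writable individually.
        for sub_root, _sub_dirs, sub_files in os.walk(svn_dir):
            for name in sub_files:
                os.chmod(os.path.join(sub_root, name), stat.S_IWRITE)
        shutil.rmtree(svn_dir)
        dirs.remove('.svn')  # do not descend into the directory just deleted
        removed.append(svn_dir)
    return removed
```

Calling remove_svn_dirs on the top of the working copy would then replace the manual list of candidates in the question's script.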