I am working on a data cleanup in a network drive. The drive has 1000+ folders, and those folders have several subfolders. The script that I got from G4G (seen below) prompts me to select a folder. I can click on one of my 1000+ folders, and the data is cleaned up properly (duplicates are deleted). However, I'd like to loop the command through the whole drive to avoid clicking on folders for hours. I cannot select the drive as my folder because duplicate file names between the first folders in the drive should not be considered duplicates.
Example:
Z:/Folder1 and Z:/Folder2 both have several files named text.txt, immediately inside of the folders and within the subdirectories of the folders. Folder1 and Folder2, among all text.txt files immediately inside and within its subdirectories, should each be left with one text.txt. If the current script is applied to Folder1 and Folder2 individually, then the desired result of one text.txt file existing in Folder1 and one existing in Folder2 is accomplished. If the script is applied to the Z: drive, then between Folder1 and Folder2, there would only be one text.txt, and one of the folders would be without a file named text.txt.
How can I apply this script to each first folder in the drive without having to manually click on each folder?
from tkinter.filedialog import askdirectory
# Importing required libraries.
from tkinter import Tk
import os
import hashlib
from pathlib import Path
# We don't want the GUI window of
# tkinter to be appearing on our screen
Tk().withdraw()
# Dialog box for selecting a folder.
file_path = askdirectory(title="Select a folder")
# Listing out all the files
# inside our root folder.
list_of_files = os.walk(file_path)
# In order to detect the duplicate
# files we are going to define an empty dictionary.
unique_files = dict()
for root, folders, files in list_of_files:
    # Running a for loop on all the files
    for file in files:
        # Finding complete file path
        file_path = Path(os.path.join(root, file))
        # Converting all the content of
        # our file into md5 hash.
        Hash_file = hashlib.md5(open(file_path, 'rb').read()).hexdigest()
        # If file hash has already
        # been added we'll simply delete that file
        if Hash_file not in unique_files:
            unique_files[Hash_file] = file_path
        else:
            if file.endswith((".txt",".bmp")):
                os.remove(file_path)
                print(f"{file_path} has been deleted")
You can change the script as shown below.
It collects the absolute path of every directory in the current working directory and feeds them to the cleanup one by one, so run it from the root of the drive.
# Importing required libraries.
# (tkinter is no longer needed because no dialog box is shown.)
import os
import hashlib
from pathlib import Path
# Collect every first-level folder of the current working directory.
folder_paths = [os.path.abspath(i) for i in os.listdir() if os.path.isdir(i)]
for folder_path in folder_paths:
    # Listing out all the files
    # inside this first-level folder.
    list_of_files = os.walk(folder_path)
    # In order to detect the duplicate files we define an empty
    # dictionary. It is reset for every first-level folder.
    unique_files = dict()
    for root, folders, files in list_of_files:
        # Running a for loop on all the files
        for file in files:
            # Finding complete file path
            file_path = Path(os.path.join(root, file))
            # Converting all the content of
            # our file into md5 hash.
            Hash_file = hashlib.md5(open(file_path, 'rb').read()).hexdigest()
            # If file hash has already
            # been added we'll simply delete that file
            if Hash_file not in unique_files:
                unique_files[Hash_file] = file_path
            else:
                if file.endswith((".txt",".bmp")):
                    os.remove(file_path)
                    print(f"{file_path} has been deleted")
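As a sketch of the same per-folder idea, the loop can be wrapped in functions that take the drive root as an argument instead of relying on the current working directory. The names `file_md5`, `dedupe_folder` and `dedupe_each_top_level_folder` are hypothetical, and both the chunked hashing and the `dry_run` switch are assumptions added here for safety; the deletion rule (only `.txt` and `.bmp` duplicates are removed) is taken from the original script.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_folder(folder, extensions=(".txt", ".bmp"), dry_run=True):
    """Find files with duplicate content inside one folder tree.

    Returns the duplicate paths; deletes them only if dry_run is False.
    """
    seen = {}        # hash -> first path seen with that content
    duplicates = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            digest = file_md5(path)
            if digest not in seen:
                seen[digest] = path
            elif name.endswith(extensions):
                duplicates.append(path)
                if not dry_run:
                    os.remove(path)
    return duplicates


def dedupe_each_top_level_folder(drive_root, dry_run=True):
    """Run dedupe_folder separately on every first-level folder."""
    results = {}
    with os.scandir(drive_root) as entries:
        for entry in entries:
            if entry.is_dir():
                results[entry.name] = dedupe_folder(entry.path, dry_run=dry_run)
    return results
```

Calling `dedupe_each_top_level_folder("Z:/")` with the default `dry_run=True` only reports what would be deleted, which may be worth checking before running it for real on a network drive.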
Maybe you could run it on the whole drive and use if/else to skip the first folder (the drive root itself):
list_of_files = os.walk("your drive")
for root, folders, files in list_of_files:
    if root != "your drive":
        for file in files:
            # ... code ...
The same check also lets you skip other (sub)folders.
Or you can use next() to skip the first item, because os.walk() does not return a list of all elements but a generator.
list_of_files = os.walk("your drive")
next(list_of_files) # skip first item
for root, folders, files in list_of_files:
    for file in files:
        # ... code ...
I am trying to move files from one folder to another. I have folders named a to z, and inside each of them there are several subfolders. I can move the files from the subfolders of one folder (for example a) to my destination folder, but I want to do it for a to z at once.
folder structure :
a--ab
 --ac
b--bc
 --bd
.. till z
import glob
import os
import shutil
path = "E:\\download\\images\\a\\*"
move_path = "E:\\download\\images\\final\\"
files = glob.glob(path,recursive = True)
for file in files:
    subfile= os.listdir(file)
    for sub in subfile:
        subpath = file + "\\" + sub
        shutil.move(subpath,move_path +"\\" + sub)
Copy this tiny script into E:\download\images and run it from there, so that Path() resolves to that directory.
The images variable will contain a generator that yields every path matching the glob: anything with a 3-character extension, at any depth below a first-level folder whose name is a single character.
Renaming a file to a path under final/ moves it there. The final folder must already exist.
Keep in mind that the glob matches any file or folder name with a 3-character extension. You'll need additional checks if other files or folders match this pattern.
from pathlib import Path
images = Path().glob('?/**/*.???')
for img in images:
    img.rename('final/' + img.name)
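For comparison, here is a sketch of the same move using `shutil.move` and an explicit base directory, so the script does not have to be run from inside the images folder. The function name `collect_files` and the rule that a name collision is skipped are assumptions for illustration; the answer above does not address collisions.

```python
import shutil
from pathlib import Path


def collect_files(base, final_name="final"):
    """Move every file found under the single-character folders of `base`
    into base/final_name. Returns the list of new paths."""
    base = Path(base)
    final_dir = base / final_name
    final_dir.mkdir(exist_ok=True)
    moved = []
    for letter_dir in sorted(base.glob("?")):
        if not letter_dir.is_dir():
            continue
        # list() so the tree is not modified while it is being iterated
        for item in list(letter_dir.rglob("*")):
            if not item.is_file():
                continue
            target = final_dir / item.name
            if target.exists():
                continue  # assumption: keep the first file, skip collisions
            shutil.move(str(item), str(target))
            moved.append(target)
    return moved
```

Unlike the glob with `*.???`, this version does not filter on the extension, so it moves every file it finds under the a to z folders.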
I want to move .csv files after reading them.
The code I've come up with looks for any .csv files in a folder and moves them to an archive folder.
src1 = "\\xxx\xxx\Source Folder"
dst1 = "\\xxx\xxx\Destination Folder"
for root, dirs, files in os.walk(src1):
    for f in files:
        if f.endswith('.csv'):
            shutil.move(os.path.join(root,f), dst1)
Note: I imported shutil at the beginning of my code.
Note 2: The destination archive folder is within the source folder - will this have implications for the above code?
When I run this, nothing happens. I get no error messages and the file remains in the source folder.
Any insight is appreciated.
Edit (some context on my goal):
My overall code will be used to read .csv files that are moved manually into a source folder by users - I then want to archive these .csv files using Python once the data has been used. Every .csv file placed into the source folder by the users will have a different name - no .csv file name will be the same, which is why I want to search the source folder for .csv files and move them all.
You can use the pathlib module. I'm assuming you have got the same folder structure in the destination directory.
from pathlib import Path
src1 = "<Path to source folder>"
dst1 = "<Path to destination folder>"
for csv_file in Path(src1).glob('**/*.csv'):
    relative_file_path = csv_file.relative_to(src1)
    destination_path = dst1 / relative_file_path
    csv_file.rename(destination_path)
Explanation:
for csv_file in Path(src1).glob('**/*.csv'):
glob (which returns a generator) captures all the CSV files in the directory and its subdirectories, so we can iterate over them one by one.
relative_file_path = csv_file.relative_to(src1)
Each csv_file is a pathlib Path object, so we can use the methods the library provides. One of them is relative_to, which returns the file's path relative to the source folder. For a CSV file at
src_folder/A/B/c.csv it returns A/B/c.csv
destination_path = dst1 / relative_file_path
As the folder structure is the same, the destination path becomes
dst_folder/A/B/c.csv
csv_file.rename(destination_path)
Finally, rename moves the file from the source to the destination.
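The loop above raises an error if the destination subfolders do not exist yet, and the question mentions that the archive folder lives inside the source folder. The sketch below handles both cases; `archive_csv_files` is a hypothetical helper name, and skipping anything already under the destination is an assumption about the desired behaviour.

```python
import shutil
from pathlib import Path


def archive_csv_files(src, dst):
    """Move every .csv under src into dst, keeping relative subfolders."""
    src, dst = Path(src), Path(dst)
    moved = []
    # Materialise the matches first so moving files does not disturb the scan.
    for csv_file in list(src.glob("**/*.csv")):
        # Skip files that are already inside the archive folder.
        if dst in csv_file.parents:
            continue
        target = dst / csv_file.relative_to(src)
        # Create missing destination subfolders before moving.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(csv_file), str(target))
        moved.append(target)
    return moved
```

The `dst in csv_file.parents` check compares paths as written, so `src` and `dst` should be given in the same form (for example, both absolute).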
After a bunch of research I have found a solution:
import os
import shutil
source = r"\\xx\Source"
destination = r"\\xx\Destination"
files = os.listdir(source)
for file in files:
    new_path = shutil.move(f"{source}/{file}", destination)
    print(new_path)
I was making it more complicated than it needed to be: because all files in the folder would be .csv anyway, I just needed to move all files. Thanks, Stack Overflow.
I have a folder "c:\test" that contains many subfolders and files (.xml, .wav). I need to search the test folder and all its subfolders for files whose names start with the number 4 and are 7 characters long, and copy these files to another folder called 'c:\test.copy' using Python. Any other files need to be ignored.
So far I can copy the files starting with a 4, but not the structure, to the new folder using the following:
from glob import glob
import os, shutil
root_src_dir = r'C:/test' #Path of the source directory
root_dst_dir = 'c:/test.copy' #Path to the destination directory
for file in glob('c:/test/**/4*.*'):
    shutil.copy(file, root_dst_dir)
Any help would be most welcome.
You can use os.walk:
import os
import shutil
root_src_dir = r'C:/test' #Path of the source directory
root_dst_dir = 'c:/test.copy' #Path to the destination directory
for root, _, files in os.walk(root_src_dir):
    for file in files:
        if file.startswith("4") and len(file) == 7:
            shutil.copy(os.path.join(root, file), root_dst_dir)
If, by 7 characters, you mean 7 characters without the file extension, then replace len(file) == 7 with len(os.path.splitext(file)[0]) == 7.
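The question also mentions keeping the folder structure, which a flat `shutil.copy` into the destination root does not do. A possible sketch, assuming "structure" means recreating each file's relative subfolder under the destination; `copy_matching` is a hypothetical name, and the length check here counts the name without its extension:

```python
import os
import shutil


def copy_matching(src_root, dst_root, prefix="4", stem_length=7):
    """Copy matching files from src_root to dst_root, keeping subfolders."""
    copied = []
    for root, _dirs, files in os.walk(src_root):
        for name in files:
            stem = os.path.splitext(name)[0]
            if not (name.startswith(prefix) and len(stem) == stem_length):
                continue
            # Rebuild the file's relative folder under the destination.
            rel_dir = os.path.relpath(root, src_root)
            target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
            os.makedirs(target_dir, exist_ok=True)
            copied.append(shutil.copy2(os.path.join(root, name), target_dir))
    return copied
```

Only folders that actually contain a matching file are created in the destination, so no empty directories need to be cleaned up afterwards.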
This can be done using the os and shutil modules:
import os
import shutil
Firstly, we need to establish the source and destination paths. source should be the directory you are copying and destination should be the directory you want to copy into.
source = r"/root/path/to/source"
destination = r"/root/path/to/destination"
Next, we have to check if the destination path exists because shutil.copytree() will raise a FileExistsError if the destination path already exists. If it does already exist, we can remove the tree and duplicate it again. You can think of this block of code as simply refreshing the duplicate directory:
if os.path.exists(destination):
    shutil.rmtree(destination)
shutil.copytree(source, destination)
Then, we can use os.walk to recursively navigate the entire directory, including subdirectories:
for path, _, files in os.walk(destination):
    for file in files:
        # Remove the file unless it starts with "4" AND its name is 7 characters long
        if not file.startswith("4") or len(os.path.splitext(file)[0]) != 7:
            os.remove(os.path.join(path, file))
    if not os.listdir(path):
        os.rmdir(path)
We then can loop through the files in each directory and check if the file does not meet your condition (starts with "4" and has a length of 7). If it does not meet the condition, we simply remove it from the directory using os.remove.
The final if-statement checks if the directory is now empty. If the directory is empty after removing the files, we simply delete that directory using os.rmdir.
I have the following directory layout: the parent directory contains several folders, say A, B, C and D, and each folder contains many zips named as shown, with the letter of the parent folder included in the name along with other info:
-parent--A-xxxAxxxx_timestamp.zip
          -xxxAxxxx_timestamp.zip
          -xxxAxxxx_timestamp.zip
       --B-xxxBxxxx_timestamp.zip
          -xxxBxxxx_timestamp.zip
          -xxxBxxxx_timestamp.zip
       --C-xxxCxxxx_timestamp.zip
          -xxxCxxxx_timestamp.zip
          -xxxCxxxx_timestamp.zip
       --D-xxxDxxxx_timestamp.zip
          -xxxDxxxx_timestamp.zip
          -xxxDxxxx_timestamp.zip
I need to unzip only selected zips in this tree and place them in the same directory with the same name without the .zip extension.
Output:
-parent--A-xxxAxxxx_timestamp
          -xxxAxxxx_timestamp
          -xxxAxxxx_timestamp
       --B-xxxBxxxx_timestamp
          -xxxBxxxx_timestamp
          -xxxBxxxx_timestamp
       --C-xxxCxxxx_timestamp
          -xxxCxxxx_timestamp
          -xxxCxxxx_timestamp
       --D-xxxDxxxx_timestamp
          -xxxDxxxx_timestamp
          -xxxDxxxx_timestamp
My effort:
for path in glob.glob('./*/xxx*xxxx*'): ##walk the dir tree and find the files of interest
    zipfile=os.path.basename(path) #save the zipfile path
    zip_ref=zipfile.ZipFile(path, 'r')
    zip_ref=extractall(zipfile.replace(r'.zip', '')) #unzip to a folder without the .zip extension
The problem is that I don't know how to keep the A, B, C, D etc. so that I can include them in the path where the files will be unzipped. As a result, the unzipped folders are created in the parent directory. Any ideas?
The code that you have seems to be working fine; you just need to make sure that you are not overwriting variable names (such as the zipfile module) and that you are using the correct ones. The following code works perfectly for me:
import os
import zipfile
import glob
for path in glob.glob('./*/xxx*xxxx*'): ##walk the dir tree and find the files of interest
    zf = os.path.basename(path) #save the zipfile path
    zip_ref = zipfile.ZipFile(path, 'r')
    zip_ref.extractall(path.replace(r'.zip', '')) #unzip to a folder without the .zip extension
Instead of trying to do it in a single statement, it would be much easier and more readable to first get the list of all folders and then get the list of files inside each folder. Example:
import glob
import os.path
for folder in glob.glob("./*"):
    # Using *.zip to only get zip files
    for path in glob.glob(os.path.join(folder, "*.zip")):
        filename = os.path.split(path)[1]
        # folder looks like "./A", so compare only its last component
        if os.path.basename(folder) in filename:
            pass  # Do your logic
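Putting the pieces together, one possible sketch uses `pathlib` so that each archive's own parent folder (A, B, C, ...) stays part of the output path. `extract_selected` is a hypothetical name, and the default `pattern` is only a placeholder for whatever selects the zips of interest.

```python
import zipfile
from pathlib import Path


def extract_selected(parent, pattern="*/*.zip"):
    """Extract each matching zip into a sibling folder named after it."""
    extracted = []
    for zip_path in sorted(Path(parent).glob(pattern)):
        # Same folder as the zip, same name without the .zip extension.
        target = zip_path.with_suffix("")
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted
```

Because `target` is derived from the full path of the zip rather than from its base name, the extracted folder ends up next to the archive instead of in the parent directory.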