I have a large repository of image files (~2 million .jpg files with individual ids) spread across multiple sub-dirs, and I'm trying to locate and copy each image on a list containing a subset of ~1,000 of these ids.
I'm still very new to Python, so my first thought was to use os.walk and, for each file, iterate through the 1k subset to see if any id in the subset matched the file. This works, at least theoretically, but it seems incredibly slow, at something like 3-5 images a second. The same seems to be the case for running through all of the files looking for one id at a time.
import shutil
import os
import csv

# Wander to Folder, Identify Files
for root, dirs, files in os.walk(ImgFolder):
    for file in files:
        fileName = ImgFolder + str(file)
        # For each file, check dictionary for match
        with open(DictFolder, 'r') as data1:
            csv_dict_reader = csv.DictReader(data1)
            for row in csv.DictReader(data1):
                img_id_line = row['id_line']
                isIdentified = (img_id_line in fileName) and ('.jpg' in fileName)
                # If id_line == file ID, copy file
                if isIdentified:
                    src = fileName + '.jpg'
                    dst = dstFolder + '.jpg'
                    shutil.copyfile(src, dst)
                else:
                    continue
I've been looking at trying to automate query searches instead, but the data is contained on a NAS and I have no easy way of indexing the files to make querying faster. The machine I'm running the code on is Windows 10, so I can't use the Ubuntu find command, which I gather is considerably better at this task.
Any way to speed up the process would be greatly appreciated!
Here are a couple of scripts that should do what you're looking for.
index.py
This script uses pathlib to walk through directories searching for files with a given extension. It will write a TSV file with two columns, filename and filepath.
import argparse
from pathlib import Path


def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")
    ext = "*." + args.ext
    index = {}
    with open(args.output, "w") as fh:
        for file in Path(args.input).rglob(ext):
            index[file.name] = file.resolve()
            fh.write(f"{file.name}\t{file.resolve()}\n")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument(
        "input",
        help="Top level folder which will be recursively "
        "searched for files ending with the value "
        "provided to `--ext`",
    )
    p.add_argument("output", help="Output file name for the index tsv file")
    p.add_argument(
        "--ext",
        default="jpg",
        help="Extension to search for. Don't include `*` or `.`",
    )
    main(p.parse_args())
search.py
This script will load the index (the output from index.py) into a dictionary, then load the CSV file into another dictionary; for each id_line it will look up the filename in the index and attempt to copy the file to the output folder.
import argparse
import csv
import shutil
from collections import defaultdict
from pathlib import Path


def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")

    if not Path(args.dest).is_dir():
        Path(args.dest).mkdir(parents=True)

    with open(args.index) as fh:
        index = dict(l.strip().split("\t", 1) for l in fh)
    print(f"Loaded {len(index):,} records")

    csv_dict = defaultdict(list)
    with open(args.csv) as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            for (k, v) in row.items():
                csv_dict[k].append(v)
    print(f"Searching for {len(csv_dict['id_line']):,} files")

    copied = 0
    for file in csv_dict["id_line"]:
        if file in index:
            shutil.copy2(index[file], args.dest)
            copied += 1
        else:
            print(f"!! File {file!r} not found in index")

    print(f"Copied {copied} files to {args.dest}")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("index", help="Index file from `index.py`")
    p.add_argument("csv", help="CSV file with target filenames")
    p.add_argument("dest", help="Target folder to copy files to")
    main(p.parse_args())
How to run this:
python index.py --ext "jpg" "C:\path\to\image\folder" "index.tsv"
python search.py "index.tsv" "targets.csv" "C:\path\to\output\folder"
I would try this on one/two folders first to check that it has the expected results.
Under the assumption that file names are unique and the files' locations don't change, it is possible to create a dictionary that allows searching for a file path in O(1) time. Building the dictionary takes some time, but you can pickle it on your computer so you only have to run that step once.
A simple script to create the dictionary:
from pathlib import Path
import pickle
root = Path('path/to/root/folder')
# files extensions to index
extensions = {'.jpg', '.png'}
# iterating over whole `root` directory tree and indexing by file name
image = {file.stem: file for file in root.rglob('*.*') if file.suffix in extensions}
# saving the index on your computer for further use
index_path = Path('path/to/index.pickle')
with index_path.open('wb') as file:
    pickle.dump(image, file, pickle.HIGHEST_PROTOCOL)
An example of loading the dictionary:
from pathlib import Path
import pickle
index_path = Path('path/to/index.pickle')
with index_path.open('rb') as file:
    image = pickle.load(file)
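To then use the loaded index for the original task, a minimal sketch might look like the following (assuming, as in the question, that the target ids are in a CSV column named id_line and that each id matches a file's stem; targets.csv and the output path are placeholders):

import csv
import shutil
from pathlib import Path

dst_folder = Path('path/to/output/folder')
dst_folder.mkdir(parents=True, exist_ok=True)

with open('targets.csv', newline='') as fh:
    for row in csv.DictReader(fh):
        # the index maps file stem -> full Path, so each lookup is a dict access
        file = image.get(row['id_line'])
        if file is not None:
            shutil.copy2(file, dst_folder / file.name)
        else:
            print(f"{row['id_line']} not found in index")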
Related
I am very new to Python and I am looking for help.
I am trying to find duplicate folders and files in a directory, move the duplicates to a folder called Duplicates in the same directory, and retain a single copy of every file in a folder called Single_Copy. I am able to find the duplicates and add their info to a CSV file, but I am unable to create the Duplicates and Single_Copy folders and move the files into them. This piece of code is also not showing the duplicated files properly. Could you please guide?
Please find my piece of code below:
# checkDuplicates.py
# Python 2.7.6

"""
Given a folder, walk through all files within the folder and subfolders
and get list of all files that are duplicates
The md5 checksum for each file will determine the duplicates
"""

import os
import hashlib
from collections import defaultdict
import csv

src_folder = "C://Users//renu//Desktop//SNow work related"


def generate_md5(fname, chunk_size=1024):
    """
    Function which takes a file name and returns md5 checksum of the file
    """
    hash = hashlib.md5()
    with open(fname, "rb") as f:
        # Read the 1st block of the file
        chunk = f.read(chunk_size)
        # Keep reading the file until the end and update hash
        while chunk:
            hash.update(chunk)
            chunk = f.read(chunk_size)
    # Return the hex checksum
    return hash.hexdigest()


if __name__ == "__main__":
    """
    Starting block of script
    """
    # The dict will have a list as values
    md5_dict = defaultdict(list)
    file_types_inscope = ["ppt", "pptx", "pdf", "txt", "html",
                          "mp4", "jpg", "png", "xls", "xlsx", "xml",
                          "vsd", "py", "json"]
    # Walk through all files and folders within directory
    for path, dirs, files in os.walk(src_folder):
        print("Analyzing {}".format(path))
        for each_file in files:
            if each_file.split(".")[-1].lower() in file_types_inscope:
                # The path variable gets updated for each subfolder
                file_path = os.path.join(os.path.abspath(path), each_file)
                # If there are more files with same checksum append to list
                md5_dict[generate_md5(file_path)].append(file_path)
    # Identify keys (checksum) having more than one values (file names)
    duplicate_files = (
        val for key, val in md5_dict.items() if len(val) > 1)
    # Write the list of duplicate files to csv file
    with open("duplicates.csv", "w") as log:
        # Lineterminator added for windows as it inserts blank rows otherwise
        csv_writer = csv.writer(log, quoting=csv.QUOTE_MINIMAL, delimiter=",",
                                lineterminator="\n")
        header = ["File Names"]
        csv_writer.writerow(header)
        for file_name in duplicate_files:
            csv_writer.writerow(file_name)
    print("Done")
As #Grismar said, you can use the modules os or shutil.
import os
import shutil
os.rename("your/current/path/file.txt", "your/new/path/file.txt")
shutil.move("your/current/path/file.txt", "your/new/path/file.txt")
personal preference: shutil, because os.rename behaves differently across platforms: on POSIX it will silently replace an existing file with the same name, while on Windows it raises an error instead.
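To actually produce the Duplicates and Single_Copy folders asked about, here is a rough sketch that could sit at the end of the question's script, reusing its src_folder and md5_dict; the checksum prefix on moved duplicates is my own choice, just one way to avoid clashes when duplicates share a basename:

import os
import shutil

duplicates_dir = os.path.join(src_folder, "Duplicates")
single_copy_dir = os.path.join(src_folder, "Single_Copy")
for d in (duplicates_dir, single_copy_dir):
    if not os.path.exists(d):
        os.mkdir(d)

# md5_dict maps checksum -> list of paths whose contents hash to that value
for checksum, paths in md5_dict.items():
    # keep the first occurrence in Single_Copy...
    shutil.move(paths[0], os.path.join(single_copy_dir, os.path.basename(paths[0])))
    # ...and park every other occurrence in Duplicates, prefixed with the
    # checksum so same-named duplicates don't collide
    for i, p in enumerate(paths[1:], 1):
        new_name = "{}_{}_{}".format(checksum[:8], i, os.path.basename(p))
        shutil.move(p, os.path.join(duplicates_dir, new_name))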
I have a file structure something like this:
/a.zip
/not_a_zip/
contents
/b.zip
contents
and I want to create a directory a, extract a.zip into it, and extract all the nested zip files in place, so I get something like this:
/a/
/not_a_zip/
contents
/b/
contents
I tried this solution, but I was getting errors because inside my main directory I have subdirectories, as well as zip files.
I want to be able to extract the main zip file into a directory of the same name, then be able to extract all nested files within, no matter how deeply nested they are.
EDIT: my current code is this
archive = zipfile.ZipFile(zipped, 'r')
for file in archive.namelist():
    archive.extract(file, resultDirectory)

for f in [filename for filename in archive.NameToInfo if filename.endswith(".zip")]:
    # get file name and path to extract
    fileToExtract = resultDirectory + '/' + f
    # get directory to extract new file to
    directoryToExtractTo = fileToExtract.rsplit('/', 1)
    directoryToExtractTo = directoryToExtractTo[0] + '/'
    # extract nested file
    nestedArchive = zipfile.ZipFile(fileToExtract, 'r')
    for file in nestedArchive.namelist():
        nestedArchive.extract(fileToExtract, directoryToExtractTo)
but I'm getting this error:
KeyError: "There is no item named 'nestedFileToExtract.zip' in the archive"
Even though it exists in the file system
Based on these other solutions: this and this.
import os
import io
import sys
import zipfile


def extract_with_structure(input_file, output):
    with zipfile.ZipFile(input_file) as zip_file:
        print(f"namelist: {zip_file.namelist()}")
        for obj in zip_file.namelist():
            filename = os.path.basename(obj)
            if not filename:
                # Skip folders
                continue
            if 'zip' == filename.split('.')[-1]:
                # extract a zip: read the member by its full archive path
                content = io.BytesIO(zip_file.read(obj))
                f = zipfile.ZipFile(content)
                dirname = os.path.splitext(os.path.join(output, filename))[0]
                for i in f.namelist():
                    f.extract(i, dirname)
            else:
                # extract a file
                zip_file.extract(obj, os.path.join(output))


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("No zipfile specified or output folder.")
        exit(1)
    extract_with_structure(sys.argv[1], sys.argv[2])
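The function above only descends one level. Since the question asks for arbitrarily deep nesting, here is a recursive variant as a sketch (my own extension, not part of the original answer): each zip is extracted, and any .zip member is then extracted into a folder named after it.

import os
import zipfile


def extract_nested(zip_path, dest_dir):
    """Extract zip_path into dest_dir, then recurse into any .zip members."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        members = zf.namelist()
    for member in members:
        if member.lower().endswith('.zip'):
            inner = os.path.join(dest_dir, member)
            # b.zip ends up extracted into .../b (optionally os.remove(inner) after)
            extract_nested(inner, os.path.splitext(inner)[0])


# e.g. extract a.zip into a directory named 'a' next to it:
# extract_nested('a.zip', 'a')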
I have been working on this challenge for about a day or so. I've looked at multiple questions and answers on SO and tried to 'MacGyver' the code I found for my purpose, but I'm still having issues.
I have a directory (let's call it "src\") with hundreds of files (.txt and .xml). Each .txt file has an associated .xml file (let's call it a pair). Example:
src\text-001.txt
src\text-001.xml
src\text-002.txt
src\text-002.xml
src\text-003.txt
src\text-003.xml
Here's an example of how I would like it to turn out so each pair of files are placed into a single unique folder:
src\text-001\text-001.txt
src\text-001\text-001.xml
src\text-002\text-002.txt
src\text-002\text-002.xml
src\text-003\text-003.txt
src\text-003\text-003.xml
What I'd like to do is create an associated folder for each pair and then move each pair of files into its respective folder using Python. I've already tried working from code I found (thanks to a post from Nov '12 by Sethdd), but am having trouble figuring out how to use the move function to grab pairs of files. Here's where I'm at:
import os
import shutil

srcpath = "PATH_TO_SOURCE"
srcfiles = os.listdir(srcpath)
destpath = "PATH_TO_DEST"

# grabs the name of the file before extension and uses as the dest folder name
destdirs = list(set([filename[0:9] for filename in srcfiles]))


def create(dirname, destpath):
    full_path = os.path.join(destpath, dirname)
    os.mkdir(full_path)
    return full_path


def move(filename, dirpath):
    shutil.move(os.path.join(srcpath, filename), dirpath)


# create destination directories and store their names along with full paths
targets = [
    (folder, create(folder, destpath)) for folder in destdirs
]

for dirname, full_path in targets:
    for filename in srcfiles:
        if dirname == filename[0:9]:
            move(filename, full_path)
I feel like it should be easy, but Python isn't something I work with everyday and it's been a while since my scripting days... Any help would be greatly appreciated!
Thanks,
WK2EcoD
Use the glob module to iterate over all of the 'txt' files. From that you can parse the names, create the folders, and move the files; a rough sketch follows.
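Here is a minimal sketch of that idea, assuming the .txt/.xml pairs share the same base name as in the example, and moving (rather than copying) each pair since that is what the question asks for; the src path is a placeholder:

import glob
import os
import shutil

src = "PATH_TO_SOURCE"

# for every .txt file, create a folder named after its base name and
# move the .txt and its matching .xml into it
for txt_path in glob.glob(os.path.join(src, "*.txt")):
    base = os.path.splitext(os.path.basename(txt_path))[0]  # e.g. 'text-001'
    pair_dir = os.path.join(src, base)
    os.makedirs(pair_dir, exist_ok=True)
    shutil.move(txt_path, pair_dir)
    xml_path = os.path.join(src, base + ".xml")
    if os.path.exists(xml_path):
        shutil.move(xml_path, pair_dir)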
The process should be as simple as it appears to you as a human.
for file_name in os.listdir(srcpath):
    dir = file_name[:9]
    # if dir doesn't exist, create it
    # move file_name to dir
You're doing a lot of intermediate work that seems to be confusing you.
Also, insert some simple print statements to track data flow and execution flow. It appears that you have no tracing output so far.
You can do it with the os module. For every file in the directory, check whether the associated folder exists, create it if needed, and then move the file. See the code below:

import os

SRC = 'path-to-src'

for fname in os.listdir(SRC):
    filename, file_extension = os.path.splitext(fname)
    # splitext keeps the leading dot on the extension
    if file_extension not in ['.xml', '.txt']:
        continue
    folder_path = os.path.join(SRC, filename)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)
    os.rename(
        os.path.join(SRC, fname),
        os.path.join(folder_path, fname)
    )
My approach would be:
Find the pairs that I want to move (do nothing with files without a pair)
Create a directory for every pair
Move the pair to the directory
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import os, shutil
import re


def getPairs(files):
    pairs = []
    file_re = re.compile(r'^(.*)\.(.*)$')
    for f in files:
        match = file_re.match(f)
        if match:
            (name, ext) = match.groups()
            if ext == 'txt' and name + '.xml' in files:
                pairs.append(name)
    return pairs


def movePairsToDir(pairs):
    for name in pairs:
        os.mkdir(name)
        shutil.move(name + '.txt', name)
        shutil.move(name + '.xml', name)


files = os.listdir()
pairs = getPairs(files)
movePairsToDir(pairs)
NOTE: This script works when called inside the directory with the pairs.
New to Python...
I'm trying to have Python take a text file of file names (a new name on each row) and store them as strings ...
i.e.
import os, shutil

files_to_find = []
with open('C:\\pathtofile\\lostfiles.txt') as fh:
    for row in fh:
        files_to_find.append(row.strip)
...in order to search for these files in directories and then copy any found files somewhere else...
for root, dirs, files in os.walk('D:\\'):
    for _file in files:
        if _file in files_to_find:
            print ("Found file in: " + str(root))
            shutil.copy(os.path.abspath(root + '/' + _file), 'C:\\destination')
print ("process completed")
Despite knowing these files exist, the script runs without any errors but without finding any files.
I added...
print (files_to_find)
...after the first block of code to see if it was finding anything and saw screeds of "<built-in method strip of str object at 0x00000000037FC730>",
Does this tell me it's not successfully creating strings to compare file names against? I wonder where I'm going wrong?
Use a list to collect the file names, for example with glob:

import os
import glob


def file_names(filepattern, dir):
    os.chdir(dir)
    file_list = []
    for line in sorted(glob.glob(filepattern)):
        # keep only the file name, not any leading path
        file_list.append(line.split("/")[-1])
    return file_list
Then loop over that list to compare against the files you find while walking the directories; a rough sketch is below.
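A minimal sketch of that comparison, reusing the lostfiles.txt and destination paths from the question (note that .strip() is called with parentheses here, which is what the original code was missing):

import os
import shutil

with open('C:\\pathtofile\\lostfiles.txt') as fh:
    files_to_find = [row.strip() for row in fh]

for root, dirs, files in os.walk('D:\\'):
    for _file in files:
        if _file in files_to_find:
            print("Found file in: " + str(root))
            shutil.copy(os.path.join(root, _file), 'C:\\destination')
print("process completed")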
I am using python 2.6
I am inputting n number of files and using loops to process the data in the files and outputting that information to a single output file.
The input files are named inputfile_date_time.h5 where each date/time is different for each input file.
I am looking to name the output file outputfile_firstdate_firsttime_lastdate_lasttime.pkt, where firstdate_firsttime is the date and time from the input file that comes first in the sequence of n files, and lastdate_lasttime is the date and time from the input file that comes last in the sequence.
My code is currently set up as follows:
import os
from glob import glob
from os.path import basename
import numpy
import h5py

# set location/directory of input files
inputdir = "/Location of directory that contains files"

# create output file
outputfilename = 'outputfilename'
outputfile = "/Location to put output file/" + basename(outputfilename)[:-4] + ".pkt"
ofile = open(outputfile, 'wb')

for path, dirs, files in os.walk(inputdir):
    files_list = glob(os.path.join(inputdir, '*.h5'))
    for file in files_list:
        # glob already returns full paths, so pass the name straight to h5py
        f = h5py.File(file, 'r')
        f.close()
        # for loop performing the necessary task to the information in the files

# print that the output file was written
print "Wrote " + outputfile

# close output file
ofile.close()
This code creates an output file called outputfile.pkt
How can I adjust this code to make the changes I previously stated?
time.strptime can parse any time format you want, and time.strftime can generate any time format you want. You should collect all of the filenames (and possibly parse the dates out of them), then use min(...) and max(...) to get the smallest and the largest.
For example, if the filenames look like foo2014-06-16bar.txt and hello2014-06-17world, then here is how to parse them:
import re

files = ['foo2014-06-16bar.txt', 'hello2014-06-17world']
dates = [re.search(r'(?:19|20)\d{2}-\d{2}-\d{2}', f).group() for f in files]
print min(dates)  #: 2014-06-16
print max(dates)  #: 2014-06-17
Here is how to build files using os.walk:
import os

inputdir = "/Location of directory that contains files"

files = []
for dirpath, dirnames, filenames in os.walk(inputdir):
    for filename in filenames:
        if filename.endswith('.h5'):
            pathname = os.path.join(dirpath, filename)
            files.append(pathname)
print files
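Putting the pieces together for the question's inputfile_date_time.h5 naming scheme, a rough sketch of building the output name could look like this; the exact date/time pattern below (YYYYMMDD_HHMM) is an assumption, so adjust the regex to your real filenames:

import os
import re

# collect the date_time stamps out of names like inputfile_20140616_1230.h5
stamps = []
for pathname in files:
    m = re.search(r'_(\d{8}_\d{4})\.h5$', os.path.basename(pathname))
    if m:
        stamps.append(m.group(1))

# lexicographic min/max works because the stamps sort chronologically
first, last = min(stamps), max(stamps)
outputfile = "/Location to put output file/outputfile_%s_%s.pkt" % (first, last)
print "Will write " + outputfile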