How to search the entire HDD for all pdf files? - python

As the title suggests, I would like to get python 3.5 to search my root ('C:\')
for pdf files and then move those files to a specific folder.
This task can easily split into 2:
1. Search my root for files with the pdf extension.
2. Move those to a specific folder.
Now, I know how to search for a specific file name, but not for multiple files that share a specific extension.
import os
print('Welcome to the Walker Module.')
print('find(name, path) or find_all(name, path)')
def find(name, path):
    for root, dirs, files in os.walk(path):
        print('Searching for files...')
        if name in files:
            return os.path.join(root, name)
def find_all(name, path):
    result = []
    for root, dirs, files in os.walk(path):
        print('Searching for files...')
        if name in files:
            result.append(os.path.join(root, name))
    return result
This little program will find me either the 1st or all locations of a specific file.
However, I cannot modify this to search for pdf files, because I lack knowledge of Python and of programming in general.
Would love to have some kind of insight on where to go from here.
To sum it up,
Search the root for all pdf files.
Move those files into a specific location. Let's say 'G:\Books'.
Thanks in advance.

Your find_all function is very close to the final result.
When you loop through the files, you can check their extension with os.path.splitext, and if they are .pdf files you can move them with shutil.move
Here's an example that walks the tree of a source directory, checks the extension of every file and, in case of match, moves the files to a destination directory:
import os
import shutil
def move_all_ext(extension, source_root, dest_dir):
    # Recursively walk source_root
    for (dirpath, dirnames, filenames) in os.walk(source_root):
        # Loop through the files in current dirpath
        for filename in filenames:
            # Check file extension
            if os.path.splitext(filename)[-1] == extension:
                # Move file
                shutil.move(os.path.join(dirpath, filename), os.path.join(dest_dir, filename))
# Move all pdf files from C:\ to G:\Books
move_all_ext(".pdf", "C:\\", "G:\\Books")
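One thing to watch when flattening a whole drive into one folder: two PDFs in different directories can share a name, and the comparison above is case-sensitive, so "REPORT.PDF" would be skipped. Here is a minimal sketch of one way to handle both. The names unique_destination and move_all_ext_safe are made up for this example, and the "_1", "_2" suffix scheme is just an assumption about how you might want clashes resolved.

```python
import os
import shutil


def unique_destination(dest_dir, filename):
    """Return a path inside dest_dir that is not taken yet (hypothetical helper)."""
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(dest_dir, filename)
    counter = 1
    while os.path.exists(candidate):
        # Assumed naming scheme: doc.pdf, doc_1.pdf, doc_2.pdf, ...
        candidate = os.path.join(dest_dir, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate


def move_all_ext_safe(extension, source_root, dest_dir):
    """Move matching files into dest_dir without overwriting existing ones."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_abs = os.path.abspath(dest_dir)
    moved = []
    for dirpath, dirnames, filenames in os.walk(source_root):
        if os.path.abspath(dirpath) == dest_abs:
            # Do not rescan the destination or anything below it
            dirnames[:] = []
            continue
        for filename in filenames:
            # Compare extensions case-insensitively (.pdf, .PDF, .Pdf)
            if os.path.splitext(filename)[1].lower() == extension.lower():
                target = unique_destination(dest_dir, filename)
                shutil.move(os.path.join(dirpath, filename), target)
                moved.append(target)
    return moved
```

You would call it the same way as above, e.g. move_all_ext_safe(".pdf", "C:\\", "G:\\Books"). Expect permission errors on system folders when walking a whole drive; this sketch does not try to handle them.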

You can use glob from python 3.5 onwards. It supports a recursive search.
If recursive is true, the pattern “**” will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep, only directories and subdirectories match.
Therefore you can use it like
import glob
from os import path
import shutil
def searchandmove(wild, srcpath, destpath):
    search = path.join(srcpath,'**', wild)
    for fpath in glob.iglob(search, recursive=True):
        print(fpath)
        dest = path.join(destpath, path.basename(fpath))
        shutil.move(fpath, dest)
searchandmove('*.pdf', 'C:\\', 'G:\\Books')
This needs only a minimum of string wrangling. For large searches, however, such as from the root of a filesystem, it can take a while, but I'm sure any approach would have this issue.
Tested only on linux, but should work fine on windows. Whatever you pass as destpath must already exist.
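For comparison, the same idea can be written with pathlib, whose Path.rglob does the recursive matching for you. This is only a sketch: search_and_move is a name chosen for the example, and collecting the matches into a list before moving anything is my own precaution so the scan is not disturbed by the moves.

```python
from pathlib import Path
import shutil


def search_and_move(pattern, src_path, dest_path):
    """Move every file under src_path matching pattern into dest_path."""
    dest = Path(dest_path)
    if not dest.is_dir():
        # As noted above, the destination must already exist
        raise NotADirectoryError(f"destination does not exist: {dest}")
    dest_resolved = dest.resolve()
    # Collect matches first; skip anything already inside the destination
    matches = [
        p for p in Path(src_path).rglob(pattern)
        if p.is_file() and dest_resolved not in p.resolve().parents
    ]
    moved = []
    for src in matches:
        target = dest / src.name
        shutil.move(str(src), str(target))
        moved.append(target)
    return moved
```

Usage would look like search_and_move('*.pdf', 'C:\\', 'G:\\Books'). Files with the same name will still overwrite each other in the destination, just as in the glob version.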

Related

Python script to move specific filetypes from the all directories to one folder

I'm trying to write a python script to move all music files from my whole pc to one specific folder.
They are scattered everywhere and I want to get them all in one place, so I don't want to copy but completely move them.
I was already able to make a list of all the files with this script:
import os
targetfiles = []
extensions = (".mp3", ".wav", ".flac")
for root, dirs, files in os.walk('/'):
    for file in files:
        if file.endswith(extensions):
            targetfiles.append(os.path.join(root, file))
print(targetfiles)
This prints out a nice list of all the files, but I'm stuck on how to move them.
I made many different attempts with different code, and this was one of them:
import os
import shutil
targetfiles = []
extensions = (".mp3", ".wav", ".flac")
for root, dirs, files in os.walk('/'):
    for file in files:
        if file.endswith(extensions):
            targetfiles.append(os.path.join(root, file))
            new_path = 'C:/Users/Nicolaas/Music/All' + file
            shutil.move(targetfiles, new_path)
But everything I try gives me an error:
TypeError: rename: src should be string, bytes or os.PathLike, not list
I think I've met my limit gathering this all as I'm only starting at Python but I would be very grateful if anyone could point me in the right direction!
You are trying to move a list of files to a new location, but the shutil.move function expects a single file as the first argument. To move all the files in the targetfiles list to the new location, you have to use a loop to move each file individually.
for file in targetfiles:
    shutil.move(file, new_path)
Also, if needed, add a trailing slash to the new path: 'C:/Users/Nicolaas/Music/All/'
On a side note, are you sure that moving all files with those extensions is a good idea? I would suggest copying them or having a backup.
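Putting those two points together, here is a rough sketch of the whole thing: collect the paths first, then call shutil.move once per file, building each destination with os.path.join so the missing slash cannot bite. The function names collect_files and move_files are invented for this example, and matching extensions case-insensitively is my own addition.

```python
import os
import shutil

EXTENSIONS = (".mp3", ".wav", ".flac")


def collect_files(search_root, extensions=EXTENSIONS):
    """Return the full paths of all files under search_root with a wanted extension."""
    targetfiles = []
    for root, dirs, files in os.walk(search_root):
        for file in files:
            # str.endswith accepts a tuple of suffixes
            if file.lower().endswith(extensions):
                targetfiles.append(os.path.join(root, file))
    return targetfiles


def move_files(targetfiles, dest_dir):
    """Move each collected file into dest_dir, one shutil.move call per file."""
    os.makedirs(dest_dir, exist_ok=True)
    for src in targetfiles:
        new_path = os.path.join(dest_dir, os.path.basename(src))
        if os.path.abspath(src) == os.path.abspath(new_path):
            continue  # already in the destination folder
        shutil.move(src, new_path)
```

With the paths from the question this would be move_files(collect_files('/'), 'C:/Users/Nicolaas/Music/All'). Files that share a name will overwrite each other in the destination, so a dry run that only prints the planned moves is worth doing first.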
Edit:
You can use an if statement to exclude certain folders from being searched.
for root, dirs, files in os.walk('/'):
    if any(folder in root for folder in excluded_folders):
        continue
    for file in files:
        if file.endswith(extensions):
            targetfiles.append(os.path.join(root, file))
Where excluded_folders is a list of the unwanted folders, like: excluded_folders = ['Program Files', 'Windows']
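Note that the check above is a substring test on the whole path, and os.walk still descends into the excluded folders before each one is skipped. An alternative, sketched below, is to prune the dirs list in place, which stops os.walk from entering those folders at all (this works with the default topdown=True). The name find_files and the case-insensitive matching are assumptions made for the example.

```python
import os


def find_files(search_root, extensions, excluded_folders):
    """Collect matching files while never descending into excluded folder names."""
    excluded = {name.lower() for name in excluded_folders}
    suffixes = tuple(extensions)
    found = []
    for root, dirs, files in os.walk(search_root):
        # Assigning to dirs[:] edits the list os.walk uses to decide where to go next
        dirs[:] = [d for d in dirs if d.lower() not in excluded]
        for file in files:
            if file.lower().endswith(suffixes):
                found.append(os.path.join(root, file))
    return found
```

For example, find_files('/', (".mp3", ".wav", ".flac"), ['Program Files', 'Windows']) would skip those two folders and everything below them. This matches exact folder names rather than substrings of the path.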
I would suggest using glob for matching (the root_dir parameter used here requires Python 3.10 or later):
import glob
def match(extension, root_dir):
    return glob.glob(f'**\\*.{extension}', root_dir=root_dir, recursive=True)
root_dirs = ['C:\\Path\\to\\Albums', 'C:\\Path\\to\\dir\\with\\music\\files']
excluded_folders = ['Bieber', 'Eminem']
extensions = ("mp3", "wav", "flac")
targetfiles = [
    f'{root_dir}\\{file_name}'
    for root_dir in root_dirs
    for extension in extensions
    for file_name in match(extension, root_dir)
    if not any(excluded_folder in file_name for excluded_folder in excluded_folders)
]
Then you can move these files to new_path

Iterate Through all Folders in a Drive - A Legacy Storage Option Migration to Cloud

I have a folder structure similar to the following:
This structure is used to store images.
New images are appended to the deepest available directory.
A directory can hold a maximum of 100 images.
Examples:
The first 100 images added will have the path:
X:\Images\DB\0\0\0\0\0\0\image_name.jpg
A random image may have the path:
X:\Images\DB\0\2\1\4\2\7\image_name.jpg
The last 100 images added will have the path:
X:\Images\DB\0\9\9\9\9\9\image_name.jpg
N.B. An image is only ever stored at the deepest possible directory.
X:\Images\DB\0\x\x\x\x\x\IMAGES_HERE
E.G. There are no images stored in: X:\Images\DB\0\1\2\3
N.B. The deepest folder path to an image only exists if an image is stored there. Example:
X:\Images\DB\0\9\9\9\9\9
... may not exist (and it doesn't in my case).
What I want to achieve is, beginning at the root directory, navigate through every possible path to the images and run a command.
I'm aware the time complexity for this is in terms of hours, if not days. It's a legacy storage option with the command migrating images to the cloud.
I have already managed to code some functions to allow me to travel to the current deepest directory and execute a command, but visiting all possible paths adds a complexity I'm struggling with - also I'm new to Python.
Here is the code:
# file generator
def files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file
# latest deepest directory
def get_deepest_dir(dir):
    current_dir = dir
    next_dir = os.listdir(current_dir)[-1]
    if len(list(files(current_dir))) == 0:
        next_dir = os.path.join(current_dir, next_dir)
        return get_deepest_dir(next_dir)
    else:
        return current_dir
# perform command
def sync():
    dir = get_deepest_dir(root_dir)
    command = "<command_here>"
    subprocess.Popen(command, shell=True)
I used the following to search for csv / pdf files. I've left an example of what I wrote to search through all folders.
os.listdir -
os.listdir() method in python is used to get the list of all files and directories in the specified directory.
os.walk -
os.walk() method, in python is used to generate the file names in a directory tree by walking the tree either top-down or bottom-up.
#Import Python Modules
import os,time
import pandas as pd
## Search Folder
##src_path ="/Users/folder1/test/"
src_path ="/Users/folder1/"
path = src_path
files = os.listdir(path)
for f in files:
    if f.endswith('.csv'):
        print(f)
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        if name.endswith('.csv'):
            print(os.path.join(root, name))
##    for name in directories:
##        print(os.path.join(root, name))
for root, directories, files in os.walk(path):
    for name in files:
        if name.endswith('.pdf'):
            print(os.path.join(root, name))
##    for name in directories:
##        print(os.path.join(root, name))
Thanks to @NeoTheNerd above for the solution.
The adapted code which worked for me is here.
def all_dirs(path):
    for root, directories, files in os.walk(path, topdown=False):
        if sum(c.isdigit() for c in root) == 6:
            print("Migrating Images From {}".format(root))
all_dirs("X:\\Images\\DB\\0")
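Counting digits in the path works for this layout, but it would break if the part of the path above the numbered folders ever contained a digit. Since the question states that images only ever live in the deepest directories, a possible alternative is to treat any directory that directly contains files as an image directory. The sketch below assumes exactly that; image_dirs and migrate are made-up names, and action stands in for whatever runs the real migration command.

```python
import os


def image_dirs(root_dir):
    """Yield every directory under root_dir that directly contains files."""
    for root, directories, files in os.walk(root_dir):
        directories.sort()  # visit 0, 1, 2, ... in a predictable order
        if files:
            yield root


def migrate(root_dir, action):
    """Call action(directory) for each image directory and return how many were handled."""
    count = 0
    for directory in image_dirs(root_dir):
        action(directory)
        count += 1
    return count
```

For a dry run you could call migrate("X:\\Images\\DB\\0", print) and check the listed directories before plugging in the real command.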

Can't get absolute path in Python

I've tried to use os.path.abspath(file) as well as Path.absolute(file) to get the paths of .png files I'm working on, which are on a separate drive from the project folder that the code is in. The result from the following script is "Project Folder for the code/filename.png", whereas what I need is the path to the folder that the .png is in:
for root, dirs, files in os.walk(newpath):
    for file in files:
        if not file.startswith("."):
            if file.endswith(".png"):
                number, scansize, letter = file.split("-")
                filepath = os.path.abspath(file)
                # replace weird backslash effects
                correctedpath = filepath.replace(os.sep, "/")
                newentry = [number, file, correctedpath]
                textures.append(newentry)
I've read other answers on here that seem to suggest that the project file for the code can't be in the same directory as the folder that is being worked on. But that isn't the case here. Can someone kindly point out what I'm not getting? I need the absolute path because the purpose of the program will be to write the paths for the files into text files.
You could use pathlib.Path.rglob here to recursively get all the pngs:
As a list comprehension:
from pathlib import Path
search_dir = "/path/to/search/dir"
# This creates a list of tuples with `number` and the resolved path
paths = [(p.name.split("-")[0], p.resolve()) for p in Path(search_dir).rglob("*.png")]
Alternatively, you can process them in a loop:
paths = []
for p in Path(search_dir).rglob("*.png"):
    number, scansize, letter = p.name.split("-")
    # more processing ...
    paths.append([number, p.resolve()])
I just recently wrote something like what you're looking for.
This code relies on the assumption that your files are at the end of the path.
It's not suitable for finding a directory or anything like that.
There's no need for a nested loop.
import os
DIR = "your/full/path/to/directory/containing/desired/files"
def get_file_path(name, template):
    """
    :param name: file's name without the extension
    :param template: file's extension (txt, html...)
    :return: The path to the given file.
    :rtype: str
    """
    substring = f'{name}.{template}'
    for path in os.listdir(DIR):
        full_path = os.path.join(DIR, path)
        if full_path.endswith(substring):
            return full_path
The result from
for root, dirs, files in os.walk(newpath):
is that files contains only the filenames, without a directory path. Given a bare filename, Python resolves it against the current working directory, which here is your project folder. os.walk reports the directory that holds each file as root (newpath itself for files at the top level), so use os.path.join to prepend that directory to each filename.
filepath = os.path.join(root, file)
In case you want to find the png files in subdirectories the easiest way is to use glob:
import glob
newpath = r'D:\Images'
file_paths = glob.glob(newpath + "/**/*.png", recursive=True)
for file_path in file_paths:
    print(file_path)

os.walk folder exclusion based on .txt file

I would like to have a Folders_To_Skip.txt file with a list of directories separated by new lines
ex:
A:\\stuff\a\b\
A:\\junk\a\b\
I have files which are breaking my .csv record compiling that this is used for and I want to exclude directories which I have no use for reading anyway.
In the locate function I tried to implement the approach from "Excluding directories in os.walk", but I can't get it to work with directories in a list, let alone with a list read from a text file: when I print the files accessed, the output still includes files in the directories I tried to exclude.
Could you also explain whether the solution would be specific excluded directories (not the end of the world) or if it can be operated to exclude subdirectories (would be more convenient).
Right now the code preceding locate allows for easy lookup of controlling text files and then loading those items in as lists for the rest of the script to run, with the assumption that all control files are in the same location but that location can change based on who is running the script and from where.
Also for testing purposes Drive_Locations.txt is setup as:
A
B
Here is the current script:
import os
from tkinter import filedialog
import fnmatch
input('Press Enter to select any file in writing directory or associated control files...')
fname = filedialog.askopenfilename()
fpath = os.path.split(fname)
# Set location for Drive Locations to scan
Disk_Locations = os.path.join(fpath[0], r'Drive_Locations.txt')
# Set location for Folders to ignore such as program files
Ignore = os.path.join(fpath[0], r'Folders_To_Skip.txt')
# Opens list of Drive Locations to be sampled
with open(Disk_Locations, 'r') as Drives:
    Drive = Drives.readlines()
    Drive = [x.replace('\n', '') for x in Drive]
# Iterable list for directories to be excluded
with open(Ignore, 'r') as SkipF1:
    Skip_Fld = SkipF1.readlines()
    Skip_Fld = [x.replace('\n', '') for x in Skip_Fld]
# Locates file in entire file tree from previously established parent directory.
def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
        dirs[:] = [d for d in dirs if d not in Skip_Fld]
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
for disk in Drive:
    # Formats Drive Location for acceptance
    disk = str.upper(disk)
    if str.find(disk, ':') < 0:
        disk = disk + ':'
    # Changes the current disk drive
    if os.path.exists(disk):
        os.chdir(disk)
    # If disk incorrect skip to next disk
    else:
        continue
    for exist_csv in locate('*.csv'):
        # Skip compiled record output files in search
        print(exist_csv)
The central bug here is that os.walk() returns a list of relative directory names. So for example when you are in the directory A:\stuff\a, the directory you want to skip is simply listed as b, not as A:\stuff\a\b; and so of course your skip logic doesn't find anything to remove from the list of subdirectories in the current directory.
Here's a refactoring which examines the current directory directly instead.
for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
    if path not in Skip_Fld:
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
The abspath call is important to keep; good on you for including that in your attempt.
Your list of directories to skip should have single backslashes, or perhaps forward slashes, and probably no final directory separator (I fortunately have no way to check how these are reported by os.walk() on Windows).
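On the question of whether subdirectories of a skipped folder can be excluded too: one way that should work is to compare full, normalized paths and prune the dirs list in place, so os.walk never enters a listed folder or anything below it. This is a sketch under a few assumptions: normalize is a helper made up for the example, skip_folders is passed in as a parameter rather than read from the global Skip_Fld, and the entries in the text file are taken to be full paths.

```python
import fnmatch
import os


def normalize(path):
    """Absolute path with consistent separators and case, no trailing separator."""
    return os.path.normcase(os.path.abspath(path))


def locate(pattern, root=os.curdir, skip_folders=()):
    """Yield files matching pattern, never entering any folder in skip_folders."""
    skip = {normalize(p) for p in skip_folders}
    for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
        if normalize(path) in skip:
            dirs[:] = []  # the starting directory itself is on the skip list
            continue
        # Pruning dirs in place also excludes everything below a skipped folder
        dirs[:] = [d for d in dirs if normalize(os.path.join(path, d)) not in skip]
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
```

Because of the normalization, a trailing separator or forward slashes in Folders_To_Skip.txt should not matter, and on Windows neither should letter case. Blank lines in the file would need filtering out before being passed in.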

How do you get the absolute path of a file in Python?

I have read quite a few links on the site saying to use "os.path.abspath(#filename)". This method isn't exactly working for me. I am writing a program that will be able to search a given directory for files with certain extensions, save the name and absolute path as keys and values (respectively) into a dictionary, and then use the absolute path to open the files and make the edits that are required. The problem I am having is that when I use os.path.abspath() it isn't returning the full path.
Let's say my program is on the desktop. I have a file stored at "C:\Users\Travis\Desktop\Test1\Test1A\test.c". My program can easily locate this file, but when I use os.path.abspath() it returns "C:\Users\Travis\Desktop\test.c" which is the absolute path of where my source code is stored, but not the file I was searching for.
My exact code is:
import os
Files={}#Dictionary that will hold file names and absolute paths
root=os.getcwd()#Finds starting point
for root, dirs, files in os.walk(root):
    for file in files:
        if file.endswith('.c'):#Look for files that end in .c
            Files[file]=os.path.abspath(file)
Any tips or advice as to why it may be doing this and how I can fix it? Thanks in advance!
os.path.abspath() makes a relative path absolute relative to the current working directory, not to the file's original location. A path is just a string; Python has no way of knowing where the filename came from.
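To make that concrete, here is a tiny sketch contrasting the two behaviours. The helper names naive_path and walked_path are invented for the illustration.

```python
import os


def naive_path(filename):
    """What abspath does with a bare filename: prefix the current working directory."""
    return os.path.abspath(filename)


def walked_path(root, filename):
    """Join the directory reported by os.walk with the filename, then make it absolute."""
    return os.path.abspath(os.path.join(root, filename))
```

Whatever directory the file was actually found in, naive_path("test.c") gives the current working directory plus "test.c", while walked_path(root, "test.c") keeps the directory that os.walk reported.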
You need to supply the directory yourself. When you use os.walk, each iteration lists the directory being listed (root in your code), the list of subdirectories (just their names) and a list of filenames (again, just their names). Use root together with the filename to make an absolute path:
Files={}
cwd = os.path.abspath(os.getcwd())
for root, dirs, files in os.walk(cwd):
    for file in files:
        if file.endswith('.c'):
            Files[file] = os.path.join(root, os.path.abspath(file))
Note that your code only records the one path for each unique filename; if you have foo/bar/baz.c and foo/spam/baz.c, it depends on the order the OS listed the bar and spam subdirectories which one of the two paths wins.
You may want to collect paths into a list instead:
Files={}
cwd = os.path.abspath(os.getcwd())
for root, dirs, files in os.walk(cwd):
    for file in files:
        if file.endswith('.c'):
            full_path = os.path.join(root, os.path.abspath(file))
            Files.setdefault(file, []).append(full_path)
Per the docs for os.path.join,
If any component is an absolute path, all previous components (on
Windows, including the previous drive letter, if there was one) are
thrown away
So, for example, if the second argument is an absolute path, the first path, '/a/b/c' is discarded.
In [14]: os.path.join('/a/b/c', '/d/e/f')
Out[14]: '/d/e/f'
Therefore,
os.path.join(root, os.path.abspath(file))
will discard root no matter what it is, and return os.path.abspath(file) which will tack file on to the current working directory, which will not necessarily be the same as root.
Instead, to form the absolute path to the file:
fullpath = os.path.abspath(os.path.join(root, file))
Actually, I believe the os.path.abspath is unnecessary, since I believe root will always be absolute, but my reasoning for that depends on the source code for os.walk not just the documented (guaranteed) behavior of os.walk. So to be absolutely sure (pun intended), use os.path.abspath.
import os
samefiles = {}
root = os.getcwd()
for root, dirs, files in os.walk(root):
    for file in files:
        if file.endswith('.c'):
            fullpath = os.path.join(root, file)
            samefiles.setdefault(file, []).append(fullpath)
print(samefiles)
Glob is useful in these cases, you can do:
files = {f:os.path.join(os.getcwd(), f) for f in glob.glob("*.c")}
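to get the same result for files directly in the current directory (this pattern does not descend into subdirectories; it also needs import glob and import os).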
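If subdirectories matter, as they do in the question, glob can also search recursively with the "**" pattern. The following is a sketch only: c_files_by_name is a name chosen for the example, and storing a list of paths per filename is borrowed from the earlier answers so that duplicate names are not lost.

```python
import glob
import os


def c_files_by_name(start_dir, pattern="*.c"):
    """Map each matching filename to the list of absolute paths where it occurs."""
    files = {}
    # glob.escape keeps characters such as [ or ] in start_dir from acting as wildcards
    base = glob.escape(os.path.abspath(start_dir))
    search = os.path.join(base, "**", pattern)
    for full_path in glob.iglob(search, recursive=True):
        files.setdefault(os.path.basename(full_path), []).append(full_path)
    return files
```

Calling c_files_by_name(os.getcwd()) would then cover the current directory and everything below it. By default glob does not look inside directories whose names start with a dot.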
