os.walk folder exclusion based on .txt file - python

I would like to have a Folders_To_Skip.txt file with a list of directories separated by new lines
ex:
A:\\stuff\a\b\
A:\\junk\a\b\
Some files are breaking the .csv record compilation this script is used for, and I want to exclude directories that I have no need to read anyway.
In the locate function I tried to implement the approach from "Excluding directories in os.walk", but I can't get it to work with directories in a list, let alone a list read from a text file: when I print the files accessed, the output still includes files in the directories I tried to exclude.
Could you also explain whether the solution would only exclude the specific directories listed (not the end of the world) or whether it can also exclude their subdirectories (which would be more convenient)?
Right now the code preceding locate allows for easy lookup of controlling text files and then loading those items in as lists for the rest of the script to run, with the assumption that all control files are in the same location but that location can change based on who is running the script and from where.
Also for testing purposes Drive_Locations.txt is setup as:
A
B
Here is the current script:
import os
from tkinter import filedialog
import fnmatch
input('Press Enter to select any file in writing directory or associated control files...')
fname = filedialog.askopenfilename()
fpath = os.path.split(fname)
# Set location for Drive Locations to scan
Disk_Locations = os.path.join(fpath[0], r'Drive_Locations.txt')
# Set location for Folders to ignore such as program files
Ignore = os.path.join(fpath[0], r'Folders_To_Skip.txt')
# Opens list of Drive Locations to be sampled
with open(Disk_Locations, 'r') as Drives:
    Drive = Drives.readlines()
    Drive = [x.replace('\n', '') for x in Drive]
# Iterable list for directories to be excluded
with open(Ignore, 'r') as SkipF1:
    Skip_Fld = SkipF1.readlines()
    Skip_Fld = [x.replace('\n', '') for x in Skip_Fld]
# Locates file in entire file tree from previously established parent directory.
def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
        dirs[:] = [d for d in dirs if d not in Skip_Fld]
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
for disk in Drive:
    # Formats Drive Location for acceptance
    disk = str.upper(disk)
    if str.find(disk, ':') < 0:
        disk = disk + ':'
    # Changes the current disk drive
    if os.path.exists(disk):
        os.chdir(disk)
    # If disk incorrect skip to next disk
    else:
        continue
    for exist_csv in locate('*.csv'):
        # Skip compiled record output files in search
        print(exist_csv)

The central bug here is that os.walk() yields bare directory names in dirs, with no path components. So for example when you are in the directory A:\stuff\a, the directory you want to skip is simply listed as b, not as A:\stuff\a\b; and so of course your skip logic doesn't find anything to remove from the list of subdirectories in the current directory.
Here's a refactoring which examines the current directory directly instead.
for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
    if path not in Skip_Fld:
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
The abspath call is important to keep; good on you for including that in your attempt.
Your list of directories to skip should have single backslashes, or perhaps forward slashes, and probably no final directory separator (I fortunately have no way to check how these are reported by os.walk() on Windows).
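On the follow-up about subdirectories: checking only the current path skips files in a listed directory, but os.walk() still descends into its children. A minimal sketch, assuming the skip file holds one full path per line, prunes dirs in place by comparing normalized full paths, so an excluded directory and everything below it is never visited. The names load_skip_list and _norm and the extra skip parameter on locate are illustrative, not part of the original script.

```python
import fnmatch
import os


def _norm(path):
    # Normalize case (on Windows), separators and trailing slashes so that
    # equivalent spellings of the same directory compare equal.
    return os.path.normcase(os.path.normpath(path))


def load_skip_list(filename):
    # One directory per line; blank lines are ignored.
    with open(filename) as fh:
        return {_norm(line.strip()) for line in fh if line.strip()}


def locate(pattern, root=os.curdir, skip=frozenset()):
    for path, dirs, files in os.walk(os.path.abspath(root), topdown=True):
        # Pruning dirs in place stops os.walk from descending into
        # excluded directories, so their subdirectories are skipped too.
        dirs[:] = [d for d in dirs
                   if _norm(os.path.join(path, d)) not in skip]
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
```

With this shape, the script would call load_skip_list(Ignore) once and pass the resulting set to locate('*.csv', skip=...).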

Related

Can't get absolute path in Python

I've tried to use os.path.abspath(file) as well as Path.absolute(file) to get the paths of .png files I'm working on that are on a separate drive from the project folder that the code is in. The result from the following script is "Project Folder for the code/filename.png", whereas obviously what I need is the path to the folder that the .png is in;
for root, dirs, files in os.walk(newpath):
    for file in files:
        if not file.startswith("."):
            if file.endswith(".png"):
                number, scansize, letter = file.split("-")
                filepath = os.path.abspath(file)
                # replace weird backslash effects
                correctedpath = filepath.replace(os.sep, "/")
                newentry = [number, file, correctedpath]
                textures.append(newentry)
I've read other answers on here that seem to suggest that the project file for the code can't be in the same directory as the folder that is being worked on. But that isn't the case here. Can someone kindly point out what I'm not getting? I need the absolute path because the purpose of the program will be to write the paths for the files into text files.
You could use pathlib.Path.rglob here to recursively get all the pngs:
As a list comprehension:
from pathlib import Path
search_dir = "/path/to/search/dir"
# This creates a list of tuples with `number` and the resolved path
paths = [(p.name.split("-")[0], p.resolve()) for p in Path(search_dir).rglob("*.png")]
Alternatively, you can process them in a loop:
paths = []
for p in Path(search_dir).rglob("*.png"):
    number, scansize, letter = p.name.split("-")
    # more processing ...
    paths.append([number, p.resolve()])
I recently wrote something like what you're looking for.
This code assumes that the file name is the last component of the path, so it is not suitable for finding a directory.
There's no need for a nested loop.
import os
DIR = "your/full/path/to/directory/containing/desired/files"
def get_file_path(name, template):
    """
    :param name: file name without its extension
    :param template: file extension (txt, html...)
    :return: The path to the given file, or None if it is not found.
    :rtype: str
    """
    substring = f'{name}.{template}'
    for path in os.listdir(DIR):
        full_path = os.path.join(DIR, path)
        if full_path.endswith(substring):
            return full_path
The result from
for root, dirs, files in os.walk(newpath):
is that files just contains the filenames without a directory path. Using just filenames means that Python resolves them against the current working directory, which here is your project folder. The directory each file was actually found in is root, so use os.path.join to add it to the filename.
filepath = os.path.join(root, file)
In case you want to find the png files in subdirectories the easiest way is to use glob:
import glob
newpath = r'D:\Images'
file_paths = glob.glob(newpath + "/**/*.png", recursive=True)
for file_path in file_paths:
    print(file_path)

How to search the entire HDD for all pdf files?

As the title suggests, I would like to get Python 3.5 to search my root ('C:\') for pdf files and then move those files to a specific folder.
This task can easily split into 2:
1. Search my root for files with the pdf extension.
2. Move those to a specific folder.
Now, I know how to search for a specific file name, but not for multiple files that have a specific extension.
import os
print('Welcome to the Walker Module.')
print('find(name, path) or find_all(name, path)')
def find(name, path):
    for root, dirs, files in os.walk(path):
        print('Searching for files...')
        if name in files:
            return os.path.join(root, name)
def find_all(name, path):
    result = []
    for root, dirs, files in os.walk(path):
        print('Searching for files...')
        if name in files:
            result.append(os.path.join(root, name))
    return result
This little program will find me either the 1st or all locations of a specific file.
I, however, can not modify this to be able to search for pdf files due to the lack of knowledge with python and programming in general.
Would love to have some kind of insight on where to go from here.
To sum it up,
Search the root for all pdf files.
Move those files into a specific location. Lets say 'G:\Books'
Thanks in advance.
Your find_all function is very close to the final result.
When you loop through the files, you can check their extension with os.path.splitext, and if they are .pdf files you can move them with shutil.move
Here's an example that walks the tree of a source directory, checks the extension of every file and, in case of match, moves the files to a destination directory:
import os
import shutil
def move_all_ext(extension, source_root, dest_dir):
    # Recursively walk source_root
    for (dirpath, dirnames, filenames) in os.walk(source_root):
        # Loop through the files in current dirpath
        for filename in filenames:
            # Check file extension
            if os.path.splitext(filename)[-1] == extension:
                # Move file
                shutil.move(os.path.join(dirpath, filename), os.path.join(dest_dir, filename))
# Move all pdf files from C:\ to G:\Books
move_all_ext(".pdf", "C:\\", "G:\\Books")
You can use glob from python 3.5 onwards. It supports a recursive search.
If recursive is true, the pattern “**” will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep, only directories and subdirectories match.
Therefore you can use it like
import glob
from os import path
import shutil
def searchandmove(wild, srcpath, destpath):
    search = path.join(srcpath,'**', wild)
    for fpath in glob.iglob(search, recursive=True):
        print(fpath)
        dest = path.join(destpath, path.basename(fpath))
        shutil.move(fpath, dest)
searchandmove('*.pdf', 'C:\\', 'G:\\Books')
With a minimum of string wrangling. For large searches however such as from the root of a filesystem it can take a while, but I'm sure any approach would have this issue.
Tested only on linux, but should work fine on windows. Whatever you pass as destpath must already exist.
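Since the destination has to exist before anything is moved into it, one option is to create it up front. A minimal sketch, assuming it is acceptable to create the folder when it is missing; ensure_dir is a hypothetical helper name, not part of the code above:

```python
import os


def ensure_dir(destpath):
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(destpath, exist_ok=True)
    return destpath
```

Calling ensure_dir('G:\\Books') before searchandmove(...) would remove the need to create the folder by hand.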

How do you get the absolute path of a file in Python?

I have read quite a few links on the site saying to use "os.path.abspath(#filename)". This method isn't exactly working for me. I am writing a program that will be able to search a given directory for files with certain extensions, save the name and absolute path as keys and values (respectively) into a dictionary, and then use the absolute path to open the files and make the edits that are required. The problem I am having is that when I use os.path.abspath() it isn't returning the full path.
Let's say my program is on the desktop. I have a file stored at "C:\Users\Travis\Desktop\Test1\Test1A\test.c". My program can easily locate this file, but when I use os.path.abspath() it returns "C:\Users\Travis\Desktop\test.c" which is the absolute path of where my source code is stored, but not the file I was searching for.
My exact code is:
import os
Files={}#Dictionary that will hold file names and absolute paths
root=os.getcwd()#Finds starting point
for root, dirs, files in os.walk(root):
    for file in files:
        if file.endswith('.c'):#Look for files that end in .c
            Files[file]=os.path.abspath(file)
Any tips or advice as to why it may be doing this and how I can fix it? Thanks in advance!
os.path.abspath() makes a relative path absolute relative to the current working directory, not to the file's original location. A path is just a string; Python has no way of knowing where the filename came from.
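A small sketch of that behaviour, using the test.c name from the question purely as an example (nothing needs to exist on disk for this to run):

```python
import os

# abspath never looks at the disk: it joins the name onto the current
# working directory and normalizes the result.
name = "test.c"
resolved = os.path.abspath(name)
expected = os.path.normpath(os.path.join(os.getcwd(), name))
print(resolved == expected)
```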
You need to supply the directory yourself. When you use os.walk, each iteration lists the directory being listed (root in your code), the list of subdirectories (just their names) and a list of filenames (again, just their names). Use root together with the filename to make an absolute path:
Files={}
cwd = os.path.abspath(os.getcwd())
for root, dirs, files in os.walk(cwd):
    for file in files:
        if file.endswith('.c'):
            Files[file] = os.path.join(root, os.path.abspath(file))
Note that your code only records the one path for each unique filename; if you have foo/bar/baz.c and foo/spam/baz.c, it depends on the order the OS listed the bar and spam subdirectories which one of the two paths wins.
You may want to collect paths into a list instead:
Files={}
cwd = os.path.abspath(os.getcwd())
for root, dirs, files in os.walk(cwd):
    for file in files:
        if file.endswith('.c'):
            full_path = os.path.join(root, os.path.abspath(file))
            Files.setdefault(file, []).append(full_path)
Per the docs for os.path.join,
If any component is an absolute path, all previous components (on
Windows, including the previous drive letter, if there was one) are
thrown away
So, for example, if the second argument is an absolute path, the first path, '/a/b/c' is discarded.
In [14]: os.path.join('/a/b/c', '/d/e/f')
Out[14]: '/d/e/f'
Therefore,
os.path.join(root, os.path.abspath(file))
will discard root no matter what it is, and return os.path.abspath(file) which will tack file on to the current working directory, which will not necessarily be the same as root.
Instead, to form the absolute path to the file:
fullpath = os.path.abspath(os.path.join(root, file))
Actually, I believe the os.path.abspath is unnecessary, since I believe root will always be absolute, but my reasoning for that depends on the source code for os.walk not just the documented (guaranteed) behavior of os.walk. So to be absolutely sure (pun intended), use os.path.abspath.
import os
samefiles = {}
root = os.getcwd()
for root, dirs, files in os.walk(root):
    for file in files:
        if file.endswith('.c'):
            fullpath = os.path.join(root, file)
            samefiles.setdefault(file, []).append(fullpath)
print(samefiles)
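Glob is useful in these cases. For .c files directly in the current directory, you can do:
files = {f:os.path.join(os.getcwd(), f) for f in glob.glob("*.c")}
Note that, unlike os.walk, this pattern does not descend into subdirectories.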

Python script errors out

I have this script, which I have no doubt is flawed:
import fnmatch, os, sys
def findit (rootdir, find, pattern):
    for folder, dirs, files in os.walk(rootdir):
        print (folder)
    for filename in fnmatch.filter(files,pattern):
        with open(filename) as f:
            s = f.read()
            f.close()
            if find in s :
                print(filename)
findit(sys.argv[1], sys.argv[2], sys.argv[3])
when I run it I get Errno2, no such file or directory. BUT the file exists. For instance if I execute it by going: findit.py c:\python "folder" *.py it will work just fine, listing all the *.py files which contain the word "folder". BUT if I go findit.py c:\php\projects1 "include" *.php
as an example I get [Errno2] no such file or directory: 'About.php' (for example). But About.php exists. I don't understand what it's doing, or what I'm doing wrong.
If you look at any of the examples for os.walk, you'll see that they all do os.path.join(root, name). You need to do that too.
Why? Quoting from the docs:
filenames is a list of the names of the non-directory files in dirpath. Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name).
If you just use the filename as a path, it's going to look for a file of the same name in the current working directory. If there's no such file, you'll get a FileNotFoundError. If there is such a file, you'll open and read the wrong file. Only if you happen to be looking inside the current working directory will it work.
There's also another major problem in your code: os.walk walks a directory tree recursively, finding all files in the given top directory, or any subdirectory of top, or any subdirectory of… and so on, yielding once for each directory. But you're not doing anything useful with that (except printing out the folders). Instead, you wait until it finishes, and then use the files from whichever directory it happened to reach last.
If you just want to get a flat listing of the files directly in a directory, use os.listdir, not os.walk. (Or maybe use glob.glob instead of explicitly listing everything then filtering with fnmatch.)
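For the flat, non-recursive case, a sketch along these lines might do. It assumes the files are text; errors="replace" is an assumption added here so that odd bytes do not raise, and findit_flat is an illustrative name:

```python
import glob
import os


def findit_flat(rootdir, find, pattern):
    # glob returns paths that already include rootdir, so no join is
    # needed afterwards. glob.escape protects special characters in rootdir.
    matches = []
    for pathname in glob.glob(os.path.join(glob.escape(rootdir), pattern)):
        if not os.path.isfile(pathname):
            continue
        with open(pathname, errors="replace") as f:
            if find in f.read():
                matches.append(pathname)
    return matches
```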
On the other hand, if you want to walk the tree, you have to move your second for loop inside the first one.
You've also got a minor problem: You call f.close() inside a with open(…) as f:, which leads to f being closed twice. This is guaranteed to be completely harmless (at least in 2.5+, including 3.x), but it's still a bad idea.
Putting it together, here's a working version of your code:
def findit (rootdir, find, pattern):
    for folder, dirs, files in os.walk(rootdir):
        print (folder)
        for filename in fnmatch.filter(files,pattern):
            pathname = os.path.join(folder, filename)
            with open(pathname) as f:
                s = f.read()
            if find in s:
                print(pathname)
You are using a relative filename, but your current directory does not contain the file, and you don't want to search there anyway. Use os.path.join(folder, filename) to build a path that includes the directory the file was found in.

Directory is not being recognized in Python

I'm uploading a zipped folder that contains a folder of text files, but it's not detecting that the folder that is zipped up is a directory. I think it might have something to do with requiring an absolute path in the os.path.isdir call, but can't seem to figure out how to implement that.
zipped = zipfile.ZipFile(request.FILES['content'])
for libitem in zipped.namelist():
    if libitem.startswith('__MACOSX/'):
        continue
    # If it's a directory, open it
    if os.path.isdir(libitem):
        print "You have hit a directory in the zip folder -- we must open it before continuing"
        for item in os.listdir(libitem):
The file you've uploaded is a single zip file which is simply a container for other files and directories. All of the Python os.path functions operate on files on your local file system which means you must first extract the contents of your zip before you can use os.path or os.listdir.
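The os.path functions cannot tell you whether an archive entry is a file or a directory, because the entry does not exist on disk. The entry name can, though: when the archive contains explicit directory entries, their names in namelist() end with a '/'.
A rewrite of your code which does an extract first may look something like this: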
import os
import shutil
import tempfile
import zipfile
# Create a temporary directory into which we can extract zip contents.
tmpdir = tempfile.mkdtemp()
try:
    zipped = zipfile.ZipFile(request.FILES['content'])
    zipped.extractall(tmpdir)
    # Walk through the extracted directory structure doing what you
    # want with each file.
    for (dirpath, dirnames, filenames) in os.walk(tmpdir):
        # Look into subdirectories?
        for dirname in dirnames:
            full_dir_path = os.path.join(dirpath, dirname)
            # Do stuff in this directory
        for filename in filenames:
            full_file_path = os.path.join(dirpath, filename)
            # Do stuff with this file.
finally:
    # Clean up the temporary directory recursively.
    shutil.rmtree(tmpdir)
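If the goal is only to read the files inside the archive, extraction may not be needed at all. A hedged sketch, assuming Python 3.6 or later (where ZipInfo.is_dir() is available); iter_zip_files is a hypothetical helper, and fileobj stands in for request.FILES['content']:

```python
import zipfile


def iter_zip_files(fileobj):
    # Yield (name, bytes) for every regular file in the archive,
    # skipping directory entries and the macOS metadata folder.
    with zipfile.ZipFile(fileobj) as zipped:
        for info in zipped.infolist():
            if info.filename.startswith('__MACOSX/'):
                continue
            if info.is_dir():
                continue
            yield info.filename, zipped.read(info)
```

This reads each member into memory, so it suits archives of small text files better than very large ones.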
Usually, to handle relative paths and the like when running scripts, you'd want to use os.path.
It seems to me that you're reading item names from a zip file that you have not actually unzipped, so why would you expect the files and directories to exist on disk?
Usually I'd print os.getcwd() to find out where I am, and use os.path.join to join with the root of the data directory; whether that is the same as the directory containing the script I can't tell. You can get the script's directory with something like scriptdir = os.path.dirname(os.path.abspath(__file__)).
I'd expect you would have to do something like
libitempath = os.path.join(scriptdir, libitem)
if os.path.isdir(libitempath):
    ....
But I'm guessing at what you're doing as it's a little unclear for me.
