Copying files recursively with skipping some directories in Python? - python

I want to copy a directory to another directory recursively. I also want to ignore some files (eg. all hidden files; everything starting with ".") and then run a function on all the other files (after copying them). This is simple to do in the shell, but I need a Python script.
I tried using shutil.copytree, which has ignore support, but I don't know how to have it do a function on each file copied. I might also need to check some other condition when copying so I can't just run the function on all the files once they are copied over. I also tried looking at os.walk but I couldn't figure it out.

You can use os.walk to iterate over each file, apply your custom filtering function and copying over only the ones you care.
os.walk(top[, topdown=True[, onerror=None[, followlinks=False]]])
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
You can add all copied files to a list and use it to run you custom script or run the script after each copy operation.
This should get you started:
import os
def filterls(src, filter_func):
for root, dirs, files in os.walk(src):
for f in files:
if filter_func(f):
path = os.path.join(root, f)
yield path[len(src)+1:]
It takes a path to a directory and a function which takes a single parameter. If the function returns False for the passed filename, the file is ignored.
You can try it in a python interpreter like this:
# Get a list of all visible files rooted in the 'path/to/the/dir' directory
print list(filterls('path/to/the/dir', lambda p: not p.startswith('.')))

If you're on Python 2.6 or higher, you can simply use shutil.copytree and its ignore argument. Since it gets passed all the files and directories, you can call your function from there, unless you want it to be called right after the file is copied.
If that is the case, the easiest thing is to copy and modify the copytree code.

Related

os.walk but with directories on top?

I have some simple code to print out the structure of a directory.
My example directory ABC contains subdirectory A containing A.txt, a subdirectory Z containing Z.txt, and a file info.txt. In real use, this will be big collection of many files and nested directories.
import os
topdir = 'ABC/'
for dirpath, dirnames, files in os.walk(topdir):
print(os.path.join(dirpath))
for name in files:
print(os.path.join(dirpath, name))
The output is:
ABC/
ABC/info.txt
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
How can I make it so directories are processed/printed on the top?
I want the output to replicate what I see in Windows Explorer, which displays directories first, and files after.
The output I want:
ABC/
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
ABC/info.txt
Without storing all the files in a list and sorting that list in one way or the other, you could make a recursive function and first recurse to the next level of the directory structure before printing the files on the current level:
def print_dirs(directories):
try:
dirpath, dirnames, files = next(directories)
print(dirpath) # print current path; no need for join here
for _ in dirnames: # once for each sub-directory...
print_dirs(directories) # ... recursively call generator
for name in files: # now, print files in current directory
print(os.path.join(dirpath, name))
except StopIteration:
pass
print_dirs(os.walk(topdir))
The same could also be done with a stack, but I think this way it's a little bit clearer. And yes, this will also store some directories in a list/on a stack, but not all the files but just as many as there are levels of nested directories.
Edit: This had a problem of printing any next directory on the generator, even if that's not a sub-directory but a sibling (or "uncle" or whatever). The for _ in dirnames loop should fix that, making the recursive call once for each of the subdirectories, if any. The directory itself does not have to be passed as a parameter as it will be gotten from the generator.

How to rename sub directory and file names recursively in script python3?

I have a recursive directory. Both subdirectory and files names have illegal characters. I have a function to clean up the names, such as it replaces a space with an underscore in the name. There must be an easier way but I couldn't find a way to both rename folders and files. So, I want to rename the folders first.
for path, subdirs, files in os.walk(root):
for name in subdirs:
new_name=clean_names(name)
name=os.path.join(path,name)
new_name=os.path.join(path,new_name)
os.chdir(path)
os.rename(name,new_name)
When I check my real folder and it contents I see that only the first subfolder name is corrected. I can see the reason because os.chdir(path) changes the cwd then it doesn't change back before for loop starts to second path. I thought after the os.rename I could rechange the cwd but I am sure there is a more elegant way to do this. If I remove the os.chdir line it gives filenotfound error.
I see that renaming subdirectories has been asked about before, but they are in command line.
You should use os.walk(root, topdown=False) instead; otherwise once the top folder gets renamed, os.walk won't have access to the subfolders because it can no longer find their parent folders.
Excerpt from the documentation:
If optional argument topdown is True or not specified, the triple for
a directory is generated before the triples for any of its
subdirectories (directories are generated top-down). If topdown is
False, the triple for a directory is generated after the triples for
all of its subdirectories (directories are generated bottom-up). No
matter the value of topdown, the list of subdirectories is retrieved
before the tuples for the directory and its subdirectories are
generated.
Note that you do not need to call os.chdir at all because all the paths passed to os.rename are absolute.

Recreate input folder tree for output of some analyses

I am new to Python and, although having been reading and enjoying it so far, have ∂ experience, where ∂ → 0.
I have a folder tree and each folder at the bottom of the tree's branches contains many files. For me, this whole tree in the input.
I would to perform several steps of analysis (I believe these are irrelavant to this question), the results of which I would like to have returned in an identical tree to that of the input, called output.
I have two ideas:
Read through each folder recursively using os.walk() and for each file to perform the analysis, and
Use a function such as shutil.copytree() and perform the analysis somewhere along the way. So actually, I do not want to COPY the tree at all, rather replicate it's structure but with new files. I thought this might be a kind of 'hack' as I do actually want to use each input file to create the output file, so instead of a copycommand, I need an analyse command. The rest should remain unchanged as far as my imagination allows me to understand.
I have little experience with option 1 and zero experience with option 2.
For smaller trees up until now I have been hard-coding the paths, which has become too time-consuming at this point.
I have also seen more mundane ways, such as using glob to first find all the files I would like and work on them, but I don't know how this might help find a shortcut in recreating the input tree for my output.
My attempt at option 1 looks like this:
import os
for root, dirs, files in os.walk('/Volumes/Mac OS Drive/Data/input/'):
# I have no actual need to print these, it just helps me see what is happening
print root, "\n"
print dirs, "\n"
# This is my actual work going on
[analysis_function(name) for name in files]
however I fear this is going to be very slow, I would also like to do some kind of filtering on files too - for example the .DS_Store files created in mac trees are included in the results of the above. I would attempt to use the fnmatch module to filter only the files I want.
I have seen in the copytree function that it is possible to ignore files according to a pattern, which would be helpful, however I do not understand from the documentation where I could put my analysis function in on each file.
You can use both options: you could provide your custom copy_function that performs analysis instead of the default shutil.copy2 to shutil.copytree() (it is a more of a hack) or you could use os.walk() to have a finer control over the process.
You don't need to create parent directories manually either way. copytree() creates the parent directories for you and os.makedirs(root) can create parent directories if you use os.walk():
#!/usr/bin/env python2
import fnmatch
import itertools
import os
ignore_dir = lambda d: d in ('.git', '.svn', '.hg')
src_dir = '/Volumes/Mac OS Drive/Data/input/' # source directory
dst_dir = '/path/to/destination/' # destination directory
for root, dirs, files in os.walk(src_dir):
for input_file in fnmatch.filter(files, "*.input"): # for each input file
output_file = os.path.splitext(input_file)[0] + '.output'
output_dir = os.path.join(dst_dir, root[len(src_dir):])
if not os.path.isdir(output_dir):
os.makedirs(output_dir) # create destination directories
analyze(os.path.join(root, input_file), # perform analysis
os.path.join(output_dir, output_file))
# don't visit ignored subtrees
dirs[:] = itertools.ifilterfalse(ignore_dir, dirs)

Unable to use getsize method with os.walk() returned files

I am trying to make a small program that looks through a directory (as I want to find recursively all the files in the sub directories I use os.walk()).
Here is my code:
import os
import os.path
filesList=[]
path = "C:\\Users\Robin\Documents"
for(root,dirs,files) in os.walk(path):
for file in files:
filesList+=file
Then I try to use the os.path.getsize() method to elements of filesList, but it doesn't work.
Indeed, I realize that the this code fills the list filesList with characters. I don't know what to do, I have tried several other things, such as :
for(root,dirs,files) in os.walk(path):
filesList+=[file for file in os.listdir(root) if os.path.isfile(file)]
This does give me files, but only one, which isn't even visible when looking in the directory.
Can someone explain me how to obtain files with which we can work (that is to say, get their size, hash them, or modify them...) on with os.walk ?
I am new to Python, and I don't really understand how to use os.walk().
The issue I suspect you're running into is that file contains only the filename itself, not any directories you have to navigate through from your starting folder. You should use os.path.join to combine the file name with the folder it is in, which is the root value yielded by os.walk:
for(root,dirs,files) in os.walk(path):
for file in files:
filesList.append(os.path.join(root, file))
Now all the filenames in filesList will be acceptable to os.path.getsize and other functions (like open).
I also fixed a secondary issue, which is that your use of += to extend a list wouldn't work the way you intended. You'd need to wrap the new file path in a list for that to work. Using append is more appropriate for adding a single value to the end of a list.
If you want to get a list of files including path use:
for(root, dirs, files) in os.walk(path):
fullpaths = [os.path.join(root, fil) for fil in files]
filesList+=fullpaths

Delete all files with certain properties in Python

How would you delete in Python all files in directory /tmp/dir and all its subdirectories that have extension .txt or .mp3?
You just need to use os.walk to traverse recursively a directory and os.remove when you find a file whose name matches your requirements.
Note that os.walk returns on one hand file names and, on the other hand, a root directory. Hence, for the os.remove to work you'll need to create the full filename with os.path.join.

Categories