I'm trying to write my own backup program. To do that, I need to take a directory and collect every file nested anywhere in its subdirectories so I can copy them. I tried making a script, but it doesn't give me all the files in that directory. Using my Documents folder as a test, my list ends up with 3600 items, but there should be about 17000 files. Why isn't os.walk showing everything?
import os

data = []
for mdir, dirs, files in os.walk('C:/Users/Name/Documents'):
    data.append(files)
print(data)
print(len(data))
Use data.extend(files) instead of data.append(files).
files is the list of filenames in a single directory, so it looks like ["a.txt", "b.html"] and so on. If you use append, each of those lists is added to data as a single item (which is why your count of 3600 matches the number of directories, not the number of files), and you end up with data looking like
[..., ["a.txt", "b.html"]]
whereas I suspect you're after
[..., "a.txt", "b.html"]
Using extend will provide the second behaviour.
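Here is the original loop with that one-word fix applied (the path is the question's own):

import os

data = []
for mdir, dirs, files in os.walk('C:/Users/Name/Documents'):
    data.extend(files)  # add each filename individually, not the whole per-directory list
print(len(data))  # now counts files rather than directories

Since the goal is a backup, you will probably want full paths rather than bare names; data.extend(os.path.join(mdir, f) for f in files) would collect those instead.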
I am trying to save all entries of os.listdir("./oldcsv") separately in a list but I don't know how to manipulate the output before it is processed.
What I am trying to do is generate a list containing the absolute pathnames of all *.csv files in a folder, which can later be used to easily manipulate those files' contents. I don't want to put lots of hardcoded pathnames in the script, as it is annoying and hard to read.
import os

for file in os.listdir("./oldcsv"):
    if file.endswith(".csv"):
        print(os.path.join("./oldcsv", file))
Normally I would use a loop with .append but in this case I cannot do so, since os.listdir just seems to create a "blob" of content. Probably there is an easy solution out there, but my brain won't think of it.
There's a glob module in the standard library that can solve your problem with a single function call:
import glob

csv_files = glob.glob("./oldcsv/*.csv")  # all .csv files in the ./oldcsv folder
assert isinstance(csv_files, list)
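If you need the absolute pathnames the question asks for, you can map os.path.abspath over the result (a small sketch, reusing the same folder):

import glob
import os

csv_files = [os.path.abspath(p) for p in glob.glob("./oldcsv/*.csv")]
print(csv_files)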
I am trying to write a piece of code that will recursively iterate through the subdirectories of a specific directory and stop only when reaching files with a '.nii' extension, appending these files to a list called images, a form of breadth-first search. Whenever I run this code, however, I keep receiving [Errno 20] Not a directory: '/Volumes/ARLO/ADNI/.DS_Store'.
* /Volumes/ARLO/ADNI is the folder I wish to traverse
* I am doing this on a Mac, using the Spyder IDE from Anaconda, because it is the only way I can use the numpy and nibabel libraries, which will become important later
* I have already checked that this folder directly contains only other folders and not files
#preprocessing all the MCIc files
import os
#import nibabel as nib
#import numpy as np

def pwd():
    cmd = 'pwd'
    os.system(cmd)
    print(os.getcwd())

#Part 1
os.chdir('/Volumes/ARLO')
images = []  #creating an empty list to store MRI images
os.chdir('/Volumes/ARLO/ADNI')
list_sample = []  #just an empty list for an earlier version of the program

#Part 2
#function to recursively iterate through folder of raw MRI images
#and extract them into a list
#breadth first search
def extract(dir):
    #dir = dir.replace('.DS_Store', '')
    lyst = os.listdir(dir)  #DS issue
    for item in lyst:
        if 'nii' not in item:  #if item is not a .nii file, i.e. item is another folder
            newpath = dir + '/' + item
            #os.chdir(newpath)  #DS issue
            extract(newpath)
        else:  #if item is the desired file type, append it to the list images
            images.append(item)

#Part 3
adni = os.getcwd()  #big folder I want to traverse
#print(adni)  #adni is a string containing the path to the ADNI folder w/ all the images
#print(os.listdir(adni))  #this also works, prints the actual list
"""adni = adni + '/' + '005_S_0222'
os.chdir(adni)
print(os.listdir(adni))"""  #one iteration of the recursion, works
extract(adni)
print(images)
With every iteration, I wish to traverse further into the nested folders by appending the folder name to the growing path, and part 3 of the code works, i.e. I know that a single iteration works. Why does os keep adding the '.DS_Store' part to my directories in the extract() function? How can I correct my code so that the breadth first traversal can work? This folder contains hundreds of MRI images, I cannot do it without automation.
Thank you.
The .DS_Store files are not being created by the os module, but by the Finder (or, I think, sometimes Spotlight). They're where macOS stores things like the view options and icon layout for each directory on your system.
And they've probably always been there. The reason you didn't see them when you looked is that files that start with a . are "hidden by convention" on Unix, including macOS. Finder won't show them unless you ask it to show hidden files; ls won't show them unless you pass the -a flag; etc.
So, that's your core problem:
I have already checked that this folder directly contains only other folders and not files
… is wrong. The folder does contain at least one regular file: .DS_Store.
So, what can you do about that?
You could add special handling for .DS_Store.
But a better solution is probably to just check each file to see if it's a file or directory, by calling os.path.isdir on it.
Or, even better, use os.scandir instead of listdir, which gives you entries with more information than just the name, so you don't need to make extra calls like isdir.
Or, best of all, just throw out this code and use os.walk to recursively visit every file in every directory underneath your top-level directory.
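A minimal sketch of that os.walk approach, reusing the question's images list and top-level path:

import os

images = []  #full paths to the .nii files
for root, dirs, files in os.walk('/Volumes/ARLO/ADNI'):
    for name in files:
        if name.endswith('.nii'):  #keep only the MRI images
            images.append(os.path.join(root, name))
#.DS_Store and any other stray files simply fail the filter and are skipped
print(images)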
I am new to Python and, although I have been reading about and enjoying it so far, have ∂ experience, where ∂ → 0.
I have a folder tree, and each folder at the bottom of the tree's branches contains many files. For me, this whole tree is the input.
I would like to perform several steps of analysis (I believe these are irrelevant to this question), the results of which I would like returned in a tree identical to that of the input, called output.
I have two ideas:
Read through each folder recursively using os.walk() and perform the analysis on each file, and
Use a function such as shutil.copytree() and perform the analysis somewhere along the way. So actually, I do not want to COPY the tree at all, rather replicate its structure but with new files. I thought this might be a kind of 'hack', as I do actually want to use each input file to create the output file, so instead of a copy command, I need an analyse command. The rest should remain unchanged as far as my imagination allows me to understand.
I have little experience with option 1 and zero experience with option 2.
Until now, for smaller trees, I have been hard-coding the paths, but this has become too time-consuming.
I have also seen more mundane ways, such as using glob to first find all the files I would like and work on them, but I don't know how this might help find a shortcut in recreating the input tree for my output.
My attempt at option 1 looks like this:
import os

for root, dirs, files in os.walk('/Volumes/Mac OS Drive/Data/input/'):
    # I have no actual need to print these, it just helps me see what is happening
    print root, "\n"
    print dirs, "\n"
    # This is my actual work going on
    [analysis_function(name) for name in files]
However, I fear this is going to be very slow, and I would also like to do some kind of filtering on the files; for example, the .DS_Store files created in Mac trees are included in the results of the above. I would attempt to use the fnmatch module to keep only the files I want.
I have seen in the copytree function that it is possible to ignore files according to a pattern, which would be helpful, however I do not understand from the documentation where I could put my analysis function in on each file.
You can use both options: you could provide your own copy_function that performs analysis instead of the default shutil.copy2 to shutil.copytree() (it is more of a hack), or you could use os.walk() to have finer control over the process.
You don't need to create the parent directories manually either way: copytree() creates them for you, and os.makedirs(output_dir) can create them if you use os.walk():
#!/usr/bin/env python2
import fnmatch
import itertools
import os

ignore_dir = lambda d: d in ('.git', '.svn', '.hg')
src_dir = '/Volumes/Mac OS Drive/Data/input/'  # source directory
dst_dir = '/path/to/destination/'  # destination directory

for root, dirs, files in os.walk(src_dir):
    for input_file in fnmatch.filter(files, "*.input"):  # for each input file
        output_file = os.path.splitext(input_file)[0] + '.output'
        output_dir = os.path.join(dst_dir, root[len(src_dir):])
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)  # create destination directories
        analyze(os.path.join(root, input_file),  # perform analysis
                os.path.join(output_dir, output_file))
    # don't visit ignored subtrees
    dirs[:] = itertools.ifilterfalse(ignore_dir, dirs)
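For completeness, a sketch of the copytree hack mentioned above. It assumes Python 3.2+, where copytree() accepts a copy_function argument, and reuses the analyze(), src_dir, and dst_dir names from the snippet above; note that dst_dir must not exist yet:

import shutil

def analyze_as_copy(src, dst):
    # copytree calls this wherever it would normally copy a file;
    # write the analysis result to dst (adjust the extension here if needed)
    analyze(src, dst)

shutil.copytree(src_dir, dst_dir, copy_function=analyze_as_copy,
                ignore=shutil.ignore_patterns('.git', '.svn', '.hg'))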
If I am to read a number of files in Python 3.2, say 30-40, and I want to keep the file references in a list
(all the files are in a common folder)
Is there any way I can open all the files into their respective file handles in the list, without having to open each file individually via the open() function?
This is simple: just use a list comprehension based on your list of file paths. Or, if you only need to access them one at a time, use a generator expression to avoid keeping all forty files open at once.
list_of_filenames = ['/foo/bar', '/baz', '/tmp/foo']
open_files = [open(f) for f in list_of_filenames]
If you want handles on all the files in a certain directory, use the os.listdir function:
import os
open_files = [open(f) for f in os.listdir(some_path)]
I've assumed a simple, flat directory here, but note that os.listdir returns the names of all entries in the given directory, whether they are "real" files or directories. So if you have directories within the directory you're opening, you'll want to filter the results using os.path.isfile:
import os
open_files = [open(f) for f in os.listdir(some_path) if os.path.isfile(f)]
Also, os.listdir only returns the bare filename, rather than the whole path, so if the current working directory is not some_path, you'll want to make absolute paths using os.path.join.
import os
open_files = [open(os.path.join(some_path, f)) for f in os.listdir(some_path)
              if os.path.isfile(os.path.join(some_path, f))]
With a generator expression:
import os
all_files = (open(os.path.join(some_path, f)) for f in os.listdir(some_path))  # note () instead of []
for f in all_files:
    pass  # do something with the open file here.
In all cases, make sure you close the files when you're done with them. If you can upgrade to Python 3.3 or higher, I recommend you use an ExitStack for one more level of convenience.
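A sketch of that ExitStack approach (Python 3.3+, reusing some_path from above); every file entered into the stack is closed automatically when the with block exits:

import os
from contextlib import ExitStack

with ExitStack() as stack:
    open_files = [stack.enter_context(open(os.path.join(some_path, f)))
                  for f in os.listdir(some_path)
                  if os.path.isfile(os.path.join(some_path, f))]
    # work with the open files here; they are all closed on exit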
The os library (and listdir in particular) should provide you with the basic tools you need:
import os
print("\n".join(os.listdir())) # returns all of the files (& directories) in the current directory
Obviously you'll want to call open with them, but this gives you the files in an iterable form (which I think is the crux of the issue you're facing). At this point you can just do a for loop and open them all (or some of them).
Quick caveat: Jon Clements pointed out in the comments of Henry Keiter's answer that you should watch out for directories, which will show up in os.listdir along with files.
Additionally, this is a good time to write in some filtering statements to make sure you only try to open the right kinds of files. You might be thinking you'll only ever have .txt files in a directory now, but someday your operating system (or users) will have a clever idea to put something else in there, and that could throw a wrench in your code.
Fortunately, a quick filter can do that, and you can do it a couple of ways (I'm just going to show a regex filter):
import os
import re

scripts = re.compile(r".*\.py$")
files = [open(x, 'r') for x in os.listdir() if os.path.isfile(x) and scripts.match(x)]
files = map(lambda x: x.read(), files)
print("\n".join(files))
Note that I'm not checking things like whether I have permission to access the file, so if I have the ability to see the file in the directory but not permission to read it then I'll hit an exception.
I use the Python tarfile module.
I have a system backup in a tar.gz file.
I need to get the list of first-level dirs and files without listing ALL the files in the archive, because that takes TOO LONG.
For example: I need to get ['bin/', 'etc/', ... 'var/'] and that's all.
How can I do it? Maybe not even with a tarfile? Then how?
You can't scan the contents of a tar without scanning the entire file; it has no central index. You need something like a ZIP.
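If you are stuck with the tar, here is a small sketch that still reads the whole archive but streams it member by member, collecting only the top-level names instead of holding the full file list in memory (backup.tar.gz is a placeholder name):

import tarfile

top_level = set()
with tarfile.open('backup.tar.gz', 'r:gz') as tar:
    for member in tar:  # iterating streams the archive one member at a time
        top_level.add(member.name.split('/', 1)[0])
print(sorted(top_level))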