Python-2.x: list directory without os.listdir() - python

With os.listdir(some_dir) we can get all the files in some_dir, but sometimes there are around 20M files (no sub-dirs) under some_dir, and it takes a long time for os.listdir() to build and return a list of 20M strings.
(We agree it's not wise to put 20M files in a single directory, but it's really there and out of my control...)
Is there a generator-like way to do the listing: as soon as a file is found, yield it, we process it, and then move on to the next file?
I have tried os.walk(). It is generator-style, but it still calls os.listdir() internally to do the actual listing, and it does not handle mixed-encoding file names well (UTF-8 names alongside GBK names).

If you have Python 3.5+, you can use os.scandir(); see the documentation for os.scandir.
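A minimal sketch of the lazy listing the question asks for, assuming Python 3.5+ (the helper name iter_files is mine; on Python 2 the third-party scandir package on PyPI provides the same API):

```python
import os

def iter_files(path):
    """Yield file names one at a time without building the full list.

    os.scandir() wraps the OS-level readdir and yields DirEntry
    objects lazily, so only one entry is held in memory at a time.
    """
    for entry in os.scandir(path):
        if entry.is_file():  # usually needs no extra stat() call
            yield entry.name

# Consume lazily; processing can start before the scan finishes.
for name in iter_files("."):
    print(name)
```

Note that entry.is_file() also lets you skip directories on the fly, which plain os.listdir() cannot do without a separate os.path.isfile() call per name.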

Related

How to save os.listdir output as list

I am trying to save all entries of os.listdir("./oldcsv") separately in a list but I don't know how to manipulate the output before it is processed.
What I am trying to do is generate a list containing the absolute pathnames of all *.csv files in a folder, which can later be used to easily manipulate those files' contents. I don't want to put lots of hardcoded pathnames in the script, as it is annoying and hard to read.
import os
for file in os.listdir("./oldcsv"):
    if file.endswith(".csv"):
        print(os.path.join("./oldcsv", file))
Normally I would use a loop with .append but in this case I cannot do so, since os.listdir just seems to create a "blob" of content. Probably there is an easy solution out there, but my brain won't think of it.
There's a glob module in the standard library that can solve your problem with a single function call:
import glob
csv_files = glob.glob("./*.csv") # get all .csv files from the working dir
assert isinstance(csv_files, list)
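Since the question asked for absolute pathnames, note that glob returns paths relative to the pattern; wrapping each result in os.path.abspath gives the absolute form (the ./oldcsv directory here is the question's example folder):

```python
import glob
import os

# glob returns paths relative to the pattern, so resolve them explicitly.
csv_files = [os.path.abspath(p) for p in glob.glob("./oldcsv/*.csv")]
for path in csv_files:
    print(path)  # each entry is an absolute path to a .csv file
```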

Iterate over infinite files in a directory in Python

I'm using Python 3.3.
If I'm manipulating potentially infinite files in a directory (bear with me; just pretend I have a filesystem that supports that), how do I do that without encountering a MemoryError? I only want the string name of one file to be in memory at a time. I don't want them all in an iterable as that would cause a memory error when there are too many.
Will os.walk() work just fine, since it returns a generator? Or, do generators not work like that?
Is this possible?
If you have a system for naming the files that can be computed, you can do something like this (it iterates over any number of sequentially numbered txt files, with only one name in memory at a time; you could switch to another calculable naming scheme to get shorter filenames for large numbers):
import os

def infinite_files(path):
    num = 0
    while True:
        if not os.path.exists(os.path.join(path, str(num) + ".txt")):
            break
        # perform operations on the file: str(num) + ".txt"
        num += 1
[My old inapplicable answer is below]
glob.iglob seems to do exactly what the question asks for. [EDIT: It doesn't. It actually seems less efficient than listdir(), but see my alternative solution above.] From the official documentation:
glob.glob(pathname, *, recursive=False)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
glob.iglob(pathname, *, recursive=False)
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
iglob returns an "iterator which yields" or-- more concisely-- a generator.
Since glob.iglob has the same behavior as glob.glob, you can search with wildcard characters:
import glob
for x in glob.iglob("/home/me/Desktop/*.txt"):
    print(x)  # prints all txt files in that directory
I don't see a way for it to differentiate between files and directories without doing it manually. That is certainly possible, however.
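Doing that manually is a one-line filter: chaining a generator expression with os.path.isfile keeps everything lazy while skipping directories (the Desktop path is the answer's example; substitute your own):

```python
import glob
import os

# iglob yields matches lazily; the generator expression filters
# out directories without ever materializing the full list.
txt_files = (p for p in glob.iglob("/home/me/Desktop/*.txt")
             if os.path.isfile(p))
for path in txt_files:
    print(path)
```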

Python - Opening successive Files without physically opening every one

If I want to read a number of files in Python 3.2, say 30-40, and keep the file references in a list
(all the files are in a common folder),
is there any way I can open all the files to their respective file handles in the list, without having to open each one individually via open()?
This is simple, just use a list comprehension based on your list of file paths. Or if you only need to access them one at a time, use a generator expression to avoid keeping all forty files open at once.
list_of_filenames = ['/foo/bar', '/baz', '/tmp/foo']
open_files = [open(f) for f in list_of_filenames]
If you want handles on all the files in a certain directory, use the os.listdir function:
import os
open_files = [open(f) for f in os.listdir(some_path)]
I've assumed a simple, flat directory here, but note that os.listdir returns the names of all entries in the given directory, whether they are regular files or directories. So if you have directories within the directory you're listing, you'll want to filter the results using os.path.isfile:
import os
open_files = [open(f) for f in os.listdir(some_path) if os.path.isfile(f)]
Also, os.listdir only returns the bare filename, rather than the whole path, so if the current working directory is not some_path, you'll want to make absolute paths using os.path.join.
import os
open_files = [open(os.path.join(some_path, f))
              for f in os.listdir(some_path)
              if os.path.isfile(os.path.join(some_path, f))]
With a generator expression:
import os
all_files = (open(f) for f in os.listdir(some_path))  # note () instead of []
for f in all_files:
    pass  # do something with the open file here
In all cases, make sure you close the files when you're done with them. If you can upgrade to Python 3.3 or higher, I recommend you use an ExitStack for one more level of convenience.
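A short sketch of the ExitStack approach: each file is registered as it is opened, and every registered file is closed when the with-block exits, even if an exception is raised partway through (the throwaway temp files exist only to make the sketch runnable):

```python
import os
import tempfile
from contextlib import ExitStack

# Create a couple of throwaway files so the sketch is self-contained.
tmp = tempfile.mkdtemp()
paths = []
for name in ("a.txt", "b.txt"):
    path = os.path.join(tmp, name)
    with open(path, "w") as fh:
        fh.write("hello from " + name)
    paths.append(path)

# enter_context registers each file; all are closed on exit,
# even if one of the open() or read() calls raises.
with ExitStack() as stack:
    handles = [stack.enter_context(open(p)) for p in paths]
    for fh in handles:
        print(fh.read())
```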
The os library (and listdir in particular) should provide you with the basic tools you need:
import os
print("\n".join(os.listdir())) # returns all of the files (& directories) in the current directory
Obviously you'll want to call open with them, but this gives you the files in an iterable form (which I think is the crux of the issue you're facing). At this point you can just do a for loop and open them all (or some of them).
quick caveat: Jon Clements pointed out in the comments of Henry Keiter's answer that you should watch out for directories, which will show up in os.listdir along with files.
Additionally, this is a good time to write in some filtering statements to make sure you only try to open the right kinds of files. You might be thinking you'll only ever have .txt files in a directory now, but someday your operating system (or users) will have a clever idea to put something else in there, and that could throw a wrench in your code.
Fortunately, a quick filter can do that, and you can do it a couple of ways (I'm just going to show a regex filter):
import os, re

scripts = re.compile(r".*\.py$")
files = [open(x, 'r') for x in os.listdir() if os.path.isfile(x) and scripts.match(x)]
contents = [f.read() for f in files]
print("\n".join(contents))
Note that I'm not checking things like whether I have permission to access the file, so if I have the ability to see the file in the directory but not permission to read it then I'll hit an exception.

Over-riding os.walk to return a generator object as the third item

While checking the efficiency of os.walk, I created 600,000 files, each containing the string Hello <number> (where <number> is just the number of the file in the directory). The contents of the files in the directory look like:-
File Name  | Contents
1.txt      | Hello 1
2.txt      | Hello 2
...
600000.txt | Hello 600000
Now, I ran the following code:-
import os

# Here, I am just passing the actual path where those 600,000 txt files are present
a = os.walk(os.path.join(os.getcwd(), 'too_many_same_type_files'))
print a.next()
The problem is that a.next() takes too much time and memory, because the third item of the tuple it returns is the list of all files in the directory (which has 600,000 items). So I am trying to figure out a way to reduce the space complexity (at least) by somehow making a.next() return a generator as the third item of the tuple, instead of a list of file names.
Would that be a good idea for reducing the space complexity?
As folks have mentioned already, 600,000 files in a directory is a bad idea. Initially I thought there was really no way to do this because of how you get access to the file list, but it turns out I was wrong. You could use the following steps to achieve what you want:
Use subprocess or os.system to call ls or dir (whichever your OS provides). Direct the output of that command to a temporary file (say /tmp/myfiles or something; Python's tempfile module can create one for you).
Open that file for reading in Python.
File objects are iterable and will return each line, so as long as you have just the filenames, you'll be fine.
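The steps above can be sketched as follows, assuming a POSIX system with an `ls` binary on the PATH (the variable names are mine):

```python
import subprocess
import tempfile

# Step 1: dump the directory listing into a temp file, so Python
# never holds the whole listing in memory at once.
listing = tempfile.NamedTemporaryFile(mode="w+", delete=False)
subprocess.call(["ls", "."], stdout=listing)  # assumes a POSIX `ls`
listing.flush()

# Steps 2-3: rewind and iterate; file objects yield one line at a time.
listing.seek(0)
for line in listing:
    name = line.rstrip("\n")
    # process one filename at a time here
listing.close()
```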
It's such a good idea, that's the way the underlying C API works!
If you can get access to readdir, you can do it: unfortunately this isn't directly exposed by Python.
This question shows two approaches (both with drawbacks).
A cleaner approach would be to write a module in C to expose the functionality you want.
os.walk calls listdir() under the hood to retrieve the contents of the root directory, then splits the returned list of items into dirs and non-dirs.
To achieve what you want you'll need to dig much lower down and implement not only your own version of walk() but also an alternative listdir() that returns a generator. Note that even then you will not be able to provide independent generators for both dirs and files, unless you make two separate calls to the modified listdir() and filter the results on the fly.
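On Python 3.5+, os.scandir makes such a lazy listdir straightforward. A minimal sketch of a walk variant that yields one (dirpath, name, is_dir) tuple at a time instead of per-directory lists (the lazy_walk name is mine; it still buffers the small list of subdirectory paths for recursion):

```python
import os

def lazy_walk(top):
    """Minimal walk() variant yielding (dirpath, name, is_dir) tuples
    one entry at a time, so no full per-directory list is built.
    Assumes Python 3.5+ for os.scandir."""
    subdirs = []
    for entry in os.scandir(top):
        yield top, entry.name, entry.is_dir()
        if entry.is_dir():
            subdirs.append(entry.path)  # only subdir paths are buffered
    for d in subdirs:
        for item in lazy_walk(d):
            yield item
```

The is_dir flag in each tuple is the "filter on the fly" mentioned above: a single pass can route files and directories differently without two separate scans.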
As suggested by Sven in the comments above, it might be better to address the actual problem (too many files in a dir) rather than over-engineer a solution.

Get big TAR(gz)-file contents by dir levels

I use the Python tarfile module.
I have a system backup in tar.gz file.
I need to get the first-level dirs and files without reading the ENTIRE list of files in the archive, because that takes TOO LONG.
For example: I need to get ['bin/', 'etc/', ... 'var/'] and that's all.
How can I do it? Maybe not even with a tar file? Then how?
You can't list the contents of a tar without scanning the entire file; the format has no central index. For that you would need something like a ZIP.
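You do still have to stream through the whole archive, but tarfile's iterator at least lets you do it without holding every member in memory: collect only the first path component of each entry. One caveat in this sketch is the assignment to tf.members, which discards the TarInfo cache tarfile builds as it iterates; that is a commonly used memory trick, not a documented API, so treat it as an assumption (the first_level name is mine):

```python
import tarfile

def first_level(path):
    """Stream through a .tar.gz, keeping only the set of top-level names.

    The whole archive is still read once (tar has no index), but only
    the small set of first-level names stays in memory.
    """
    top = set()
    with tarfile.open(path, "r:gz") as tf:
        for member in tf:  # members are yielded one at a time
            top.add(member.name.split("/")[0])
            tf.members = []  # drop the cached TarInfo objects (assumption: internal attribute)
    return sorted(top)
```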