I'm writing some code using python/numpy that I will be using for data analysis of some experimental data sets. Certain steps of these analysis routines can take a while. It is not practical to rerun every step of the analysis every time (e.g., when debugging), so it makes sense to save the output from these steps to a file and just reuse it if it's already available.
The data I ultimately want to obtain can be derived from various steps along this analysis process. For example, A can be used to calculate B and C, D can be calculated from B, E can then be calculated using C and D, and so on.
The catch here is that it's not uncommon to make it through a few (or many) datasets only to find that there's some tiny little gotcha in the code somewhere that requires some part of the tree to be recalculated. For example, I discover a bug in B, so now anything that depends on B also needs to be recalculated because it was derived from incorrect data.
The end goal here is to basically protect myself from having sets of data that I forget to reprocess when bugs are found. In other words, I want to be confident that all of my data is calculated using the newest code.
Is there a way to implement this in Python? I have no specific form this solution needs to take so long as it's extensible as I add new steps. I'm also okay with the "recalculation step" only being performed when the dependent quantity is recalculated (rather than at the time one of its parents is changed).
My first thought of how this might be done is to embed information in a header of each saved file (A, B, C, etc) indicating what version of each module it was created with. Then, when loading the saved data the code can check if the version in the file matches the current version of the parent module. (Some sort of parent.getData() which checks if the data has been calculated for that dataset and if it's up to date)
The thing is, at least at first glance, I could see this having problems when the change happens several steps up in the dependency chain, because the derived file may still be up to date with its own module even though its parents are out of date. I suppose I could add some sort of parent.checkIfUpToDate() that checks its own files and then asks each of its parents if they're up to date (which then ask their parents, etc.), and updates the data if not. The version number can just be a static string stored in each module.
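That recursive check might look roughly like this (just a sketch; AnalysisStep and savedVersion are invented names, and savedVersion would read back whatever version string gets stored in the step's output file):

class AnalysisStep:
    version = "1.0.0"                    # static string, bumped when this module's code changes

    def __init__(self, parents=()):
        self.parents = list(parents)

    def savedVersion(self):
        """Return the version recorded in this step's saved file, or None if no file exists."""
        raise NotImplementedError        # depends on the chosen file format

    def checkIfUpToDate(self):
        # Stale if this step's file was written by an older version of its module,
        # or if any parent (recursively) is stale.
        if self.savedVersion() != self.version:
            return False
        return all(parent.checkIfUpToDate() for parent in self.parents)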
My concern with that approach is that it might mean reading potentially large files from disk just to get a version number. If I went with the "file header" approach, does Python actually load the whole file into memory when I do an open(myFile), or can I do that, just read the header lines, and close the file without loading the whole thing into memory?
Last - is there a good way to embed this type of information beyond just having the first line of the file be some variation of # MyFile made with MyModule V x.y.z and writing some piece of code to parse that line?
I'm kind of curious if this approach makes sense, or if I'm reinventing the wheel and there's already something out there to do this.
edit: And something else that occurred to me after I submitted this - does Python have any mechanism to define templates that modules must follow? Just as a means to make sure I keep the format for the data reading steps consistent from module to module.
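On that last point, one built-in way to pin down such a template is an abstract base class from the abc module; a minimal sketch with invented names:

from abc import ABC, abstractmethod

class AnalysisStepBase(ABC):
    @abstractmethod
    def getData(self, dataset):
        """Return this step's result for the dataset, computing it if needed."""

    @abstractmethod
    def checkIfUpToDate(self, dataset):
        """Report whether the saved result matches the current code version."""

Any subclass that fails to implement one of these methods cannot be instantiated.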
I cannot answer all of your questions, but you can read in only a small part of the data from a large file, as you can see here:
How to read specific part of large file in Python
I do not see why you would need a parent.checkIfUpToDate() function. You could just store the version numbers of the parent functions in the file itself as well.
To me your approach sounds reasonable; however, I have never done anything similar. Alternatively, you could create an additional file that holds the version information, but I think storing the information in the actual data file should prevent version mismatches between your "data file" and the "function version file".
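For the header idea specifically, only the first line of the file needs to be read; a minimal sketch (the header format follows the one proposed in the question, and readVersion is an invented helper):

def readVersion(path):
    # open() does not load the file; readline() reads only up to the first newline
    with open(path) as fh:
        header = fh.readline()
    # expects something like: "# MyFile made with MyModule V 1.2.3"
    return header.rstrip().rsplit("V", 1)[-1].strip()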
Let's say we have a test.zip file and we update a file:
import zipfile

zfh = zipfile.ZipFile("test.zip", mode="a")
zfh.write("/home/msala/test.txt")
zfh.close()
Repeating this "update" a few times and using the builtin method printdir(),
I see that the archive stores not only the latest "test.txt" but also all the previous copies of the file.
Ok, I understand the zipfile library doesn't have a delete method.
Questions:
if I call the builtin method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system?
inside the zip archive, is there any flag telling that the old copies are old copies, superseded by the last one?
At the moment I list all the stored files and sort them by filename, last modification time...
The tl;dr is no, you can't do this without building a bit of extra info—but that can be done without sorting, and, even if you did have to sort, the performance cost would be irrelevant.
First, let me explain how zipfiles work. (Even if you understand this, later readers with the same problem may not.)
Unfortunately, the specification is a copyrighted and paywalled ISO document, so I can't link to it or quote it. The original PKZip APPNOTE.TXT, which is the de facto pre-standardization standard, is available, however. And numerous sites like Wikipedia have nice summaries.
A zipfile is 0 or more fragments, followed by a central directory.
Fragments are just treated as if they were all concatenated into one big file.
The body of the file can contain zip entries, in arbitrary order, along with anything you want. (This is how DOS/Windows self-extracting archives work—the unzip executable comes at the start of the first fragment.) Anything that looks like a zip entry, but isn't referenced by the central directory, is not treated as a zip entry (except when repairing a corrupted zipfile.)
Each zip entry starts with a header that gives you the filename, compression format, etc. of the following data.
The directory is a list of directory entries that contain most of the same information, plus a pointer to where to find the zip entry.
It's the order of directory entries that determines the order of the files in the archive.
if I call the builtin method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system ?
The behavior isn't really specified anywhere.
Extracting the whole archive should extract both files, in the order present in the zip directory (the same order given by infolist), with the second one overwriting the first.
But extracting by name doesn't have to give you both—it could give you the last one, or the first, or pick one at random.
Python gives you the last. The way this works is that, when reading the directory, it builds a dict mapping filenames to ZipInfos, just adding them as encountered, so the last one will overwrite the previous ones. (Here's the 3.7 code.) Whenever you try to access something by filename, it just looks up the filename in that dict to get the ZipInfo.
But is that something you want to rely on? I'm not sure. On the one hand, this behavior has been the same from Python 1.6 to 3.7, which is usually a good sign that it's not going to change, even if it's never been documented. On the other hand, there are open issues—including #6818, which is intended to add deletion support to the library one way or another—that could change it.
And it's really not that hard to do the same thing yourself. With the added benefit that you can use a different rule—always keep the first, always keep the one with the latest mod time, etc.
You seem to be worried about the performance cost of sorting the infolist, which is probably not worth worrying about. The time it takes to read and parse the zip directory is going to make the cost of your sort virtually invisible.
But you don't really need to sort here. After all, you don't want to be able to get all of the entries with a given name in some order, you just want to get one particular entry for each name. So, you can just do what ZipFile does internally, which takes only linear time to build, and constant time each time you search it. And you can use any rule you want here.
entries = {}
for entry in zfh.infolist():
    if entry.filename not in entries:
        entries[entry.filename] = entry
This keeps the first entry for any name. If you want to keep the last, just remove the if. If you want to keep the latest by modtime, change the condition to if entry.filename not in entries or entry.date_time > entries[entry.filename].date_time:. And so on.
Now, instead of relying on what happens when you call extract("home/msala/test.txt"), you can call extract(entries["home/msala/test.txt"]) and know that you're getting the first/last/latest/whatever file of that name.
inside the zip archive, is there any flag telling that old copies .. are old copies, superseded by the last one ?
No, not really.
The way to delete a file is to remove it from the central directory. Which you do just by rewriting the central directory. Since it comes at the end of the zipfile, and is almost always more than small enough to fit on even the smallest floppy, this was generally considered fine even back in the DOS days.
(But notice that if you unplug the computer in the middle of it, you've got a zipfile without a central directory, which has to be rebuilt by scanning all of the file entries. So, many newer tools will instead, at least for smaller files, rewrite the whole file to a tempfile then rename it over the original, to guarantee a safe, atomic write.)
At least some early tools would sometimes, especially for gigantic archives, rewrite the entry's pathname's first byte with a NUL. But this doesn't really mark the entry as deleted, it just renames it to "\0ome/msala/test.txt". And many modern tools will in fact treat it as meaning exactly that and give you weird errors telling you they can't find a directory named 'ome' or '' or something else fun. Plus, this means the filename in the directory entry no longer matches the filename in the file entry header, which will cause many modern tools to flag the zipfile as corrupted.
At any rate, Python's zipfile module doesn't do either of these, so you'd need to subclass ZipFile to add the support yourself.
I solved it this way, similar to database record management.
When adding a file to the archive, I look for previously stored copies (same filename).
For each of them, I set their "comment" field to a specific marker, for example "deleted".
We add the new file, with an empty comment.
Whenever we like, we can "vacuum": shrink the zip archive using the usual tools (under the hood a new archive is created, discarding the files whose comment is set to "deleted").
This way, we also have simple "versioning":
we keep all the previous copies of the file until the vacuum.
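A sketch of that scheme, assuming (as CPython's zipfile currently does) that the whole central directory, including per-file comments, is rewritten when an archive opened in append mode is closed:

import zipfile

MARKER = b"deleted"

def add_with_marker(archive_path, file_path, arcname):
    # Mark any previously stored copies as deleted, then append the new one.
    with zipfile.ZipFile(archive_path, mode="a") as zfh:
        for info in zfh.infolist():
            if info.filename == arcname:
                info.comment = MARKER
        zfh.write(file_path, arcname=arcname)

def vacuum(archive_path, new_path):
    # Copy everything not marked as deleted into a fresh archive.
    with zipfile.ZipFile(archive_path) as old, zipfile.ZipFile(new_path, "w") as new:
        for info in old.infolist():
            if info.comment != MARKER:
                new.writestr(info, old.read(info))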
I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted using such a method. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the state of the collection at the beginning, and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if there are filesystem changes (add, remove, rename files within the iterated directory) which modify the collection?
There could potentially be a few cases that could cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lotts example:
filea.txt
fileb.txt
filec.txt
Iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.
If the iterator instead was able to somehow maintain an index dir.get_file(0), then maintaining positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.
This is all theoretical of course, since there appears to be no built-in (python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.
Edit:
The OS of concern is Redhat. My use case is this:
Process A is continuously writing files to a storage location.
Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.
Edit:
Definition of valid:
Adjective
1. Well grounded or justifiable, pertinent.
(Sorry S.Lott, I couldn't resist).
I've edited the paragraph in question above.
tl;dr update: As of Python 3.5 (currently in beta), just use os.scandir.
As I've written earlier, since "iglob" is just a facade for a real iterator, you will have to call low level system functions in order to get one at a time like you want. Fortunately, calling low level functions is doable from Python.
The low level functions are different for Windows and Posix/Linux systems.
If you are on Windows, you should check if win32api has any call to read "the next entry from a dir" or how to proceed otherwise.
If you are on Posix/Linux, you can proceed to call libc functions straight through ctypes and get one file-dir entry (including naming information) at a time.
The documentation on the C functions is here:
http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory
http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory
I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system but this code snippet may not work on your system[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.
Here is the snippet using ctypes and libc I've put together that allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)
#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary("libc.so.6")
dir_ = c_voidp(libc.opendir("/home/jsbueno"))

class Dirent(Structure):
    _fields_ = [("d_ino", c_voidp),
                ("off_t", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
                ]

while True:
    p = libc.readdir64(dir_)
    if not p:
        break
    entry = Dirent.from_address(p)
    print entry.d_name
update: Python 3.5 is now in beta - and in Python 3.5 the new os.scandir function call is available as the materialization of PEP 471 ("a better and faster directory iterator"), which does exactly what is asked for here, besides a lot of other optimizations that can deliver up to a 9-fold speed increase over os.listdir for large-directory listings under Windows (a 2-3 fold increase on Posix systems).
[footnote-1] The dirent64 C struct is determined at C compile time for each system.
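For reference, the scandir version of the loop above looks like this on Python 3.5+ (same example directory; print is now a function):

import os

for entry in os.scandir("/home/jsbueno"):    # lazily yields os.DirEntry objects
    if entry.is_file():
        print(entry.name)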
The glob module, from Python 2.5 onwards, has an iglob method which returns an iterator.
An iterator is exactly for the purposes of not storing huge values in memory.
glob.iglob(pathname)
Return an iterator which yields the same values as glob() without
actually storing them all simultaneously.
For example:
import glob

for eachfile in glob.iglob('*'):
    print(eachfile)  # act upon eachfile
Since you are using Linux, you might want to look at pyinotify.
It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.
Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.
It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.
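A rough sketch of that pattern (the watched path is a placeholder; IN_CLOSE_WRITE fires once a newly written file has been closed):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        print("ready:", event.pathname)       # process the new file here

wm = pyinotify.WatchManager()
wm.add_watch("/path/to/storage", pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, Handler()).loop()      # blocks, dispatching events to Handler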
@jsbueno's post is really useful, but is still kind of slow on slow disks since libc readdir() only reads 32K of directory entries at a time. I am not an expert on making system calls directly in python, but I outlined how to write code in C that will list a directory with millions of files, in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.
The ideal case would be to call getdents() directly in python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so you can specify a read buffer size when loading directory entries from disk, rather than calling readdir(), which as far as I can tell has a buffer size defined at compile time.
What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
No method will reveal a filename which "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?
Let's say you have three files: a.a, b.b, c.c.
Your magical "iterator" starts with a.a. You process it.
The magical "iterator" moves to b.b. You're processing it.
Meanwhile a.a is copied to a1.a1, a.a is deleted. What now? What does your magical iterator do with these? It's already passed a.a. Since a1.a1 is before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?
The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?
Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.
Don't use the naked file system for coordination.
Use a queue.
Process A writes files and enqueues the add/change/delete memento onto a queue.
Process B reads the memento from queue and then does the follow-on processing on the file named in the memento.
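A toy sketch of that arrangement with both sides in one program (in practice A and B may well be separate programs talking through a broker; the directory names and filename below are invented):

import multiprocessing, os, shutil

def process_b(queue, done_dir):
    # B: consume mementos until told to stop.
    while True:
        action, path = queue.get()                 # blocks until a memento arrives
        if action == "stop":
            break
        if action == "add":
            # filename-based processing goes here, then move the file along
            shutil.move(path, os.path.join(done_dir, os.path.basename(path)))

if __name__ == "__main__":
    for d in ("incoming", "done"):                 # invented directory names
        os.makedirs(d, exist_ok=True)
    q = multiprocessing.Queue()
    b = multiprocessing.Process(target=process_b, args=(q, "done"))
    b.start()
    # A's side: write a file, then enqueue the memento describing it.
    path = os.path.join("incoming", "data_001.txt")
    with open(path, "w") as fh:
        fh.write("payload\n")
    q.put(("add", path))
    q.put(("stop", None))
    b.join()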
I think what you are asking is impossible due to the nature of file IO. Once python has retrieved the listing of a directory it cannot maintain a view of the actual directory on disk, nor is there any way for python to insist that the OS inform it of any modifications to the directory.
All python can do is ask for periodic listings and diff the results to see if there have been any changes.
The best you can do is create a semaphore file in the directory which lets other processes know that your python process desires that no other process modify the directory. Of course they will only observe the semaphore if you have explicitly programmed them to.
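If you do go the semaphore-file route, creating the file needs to be atomic so two processes cannot both think they own the directory; one way (the path is a placeholder all processes would have to agree on):

import os

LOCK = "/path/to/storage/.in_use"

fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)   # fails if the file already exists
try:
    pass  # work on the directory here
finally:
    os.close(fd)
    os.remove(LOCK)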
Good evening. I am looking at developing some code that will collect EXIF data from JPEG images and store it in a MySQL database using Python v2.x. The stumbling block lies in the fact that the JPEGs are scattered across a number of subdirectories and further subdirectories in a root folder; for example, 200 JPEGs may be stored in root > subroot > subsubroot1, with a further 100 in root > subroot > subsubroot2. Once all images are identified, they will be scanned and their respective EXIF data extracted before being added to a MySQL table.
At the moment I am just at the planning stage, but I am wondering: what would be the most efficient and pythonic way to carry out the recursive searching? I am looking to scan the root directory, append any newly identified subdirectories to a list, and then scan all subdirectory paths in the list for further subdirectories until I have a total list of all directories. This just seems a clumsy and somewhat repetitive way to do it, though, IMHO, so I assume there may be a more OOP manner of carrying out this function.
Similarly, I am only looking to add new info to my MySQL table, so what would be the most efficient way to establish whether an entry already exists? The filename, both in the table and for the JPEG file itself, will be its MD5 hash value. I was considering scanning through the table at the beginning of the code and placing all filenames in a set, so that before scanning a new JPEG, if an entry already exists in the set, there would be no need to extract the EXIF and I could move on to the next picture. Is this an efficient method, however, or would it be better to scan through the MySQL table when a new image is encountered? I anticipate the set method may be the most efficient; however, the table may potentially contain tens of millions of entries eventually, so adding the filenames for all these entries into a set (volatile memory) may not be the best idea.
Thanks folks.
I would just write a function that scanned a directory for all files; if it's a jpeg, add the full path name of the jpeg to the list of results. If it's a directory, then immediately call the function with the newly discovered directory as an argument. If it's another type of file, do nothing. This is a classic recursive divide-and-conquer strategy. It will break if there are loops in your directory path, for instance with symlinks -- if this is a danger for you, then you'll have to make sure you don't traverse the same directory twice by finding the "real" non-symlinked path of each directory and recording it.
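That scan might look roughly like this (a sketch; find_jpegs is an invented name, and symlinked directories are skipped to avoid loops). The standard library's os.walk performs essentially the same traversal if you would rather not write the recursion yourself.

import os

def find_jpegs(directory, results=None):
    # classic recursive divide-and-conquer over the directory tree
    if results is None:
        results = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            if not os.path.islink(path):           # avoid symlink loops
                find_jpegs(path, results)
        elif name.lower().endswith((".jpg", ".jpeg")):
            results.append(os.path.abspath(path))
    return results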
How to avoid duplicate entries is a trickier problem and you have to consider whether you are tolerant of two differently-named files with the exact same contents (and also consider the edge cases of symlinked or multiply-hard-linked files), how new files appear in the directories you are scanning and whether you have any control over that process. One idea to speed it up would be to use os.path.getmtime(). Record the moment you start the directory traversal process. Next time around, have your recursive traversal process ignore any jpeg files with an mtime older than your recorded time. This can't be your only method of keeping track because files modified between the start and end times of your process may or may not be recorded, so you will still have to check the database for those records (for instance using the full path, a hash on the file info or a hash on the data itself, depending on what kind of duplication you're intolerant of) but used as a heuristic it should speed up the process greatly.
You could theoretically load all filenames (probably paths and not filenames) from the database into memory to speed up comparison, but if there's any danger of the table becoming very large it would be better to leave that info in the database. For instance, you could create a hash from the filename, and then simply add that to the database with a UNIQUE constraint -- the database will reject any duplicate entries, you can catch the exception and go on your way. This won't be slow if you use the aforementioned heuristic checking file mtime.
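The reject-duplicates pattern looks roughly like this, sketched here with the standard library's sqlite3 as a stand-in (the same idea applies with a MySQL driver and its IntegrityError; table and file names are invented):

import hashlib, sqlite3

conn = sqlite3.connect("photos.db")
conn.execute("CREATE TABLE IF NOT EXISTS photos (md5 TEXT UNIQUE, path TEXT)")

def record_photo(path):
    with open(path, "rb") as fh:
        digest = hashlib.md5(fh.read()).hexdigest()
    try:
        conn.execute("INSERT INTO photos (md5, path) VALUES (?, ?)", (digest, path))
        conn.commit()
        return True                                # new image: go extract its EXIF
    except sqlite3.IntegrityError:
        return False                               # already recorded: skip it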
Be sure you account for the possibility of files that may be only modified and not newly created, if that's important for your application.