check if a file is 'complete' (with python)

Is it possible to check whether a file is done copying, or whether it is complete, using Python?
Or even on the command line?
I manipulate files programmatically in a specific folder on Mac OS X, but I need to check that a file is complete before running the code that does the manipulation.

There's no notion of "file completeness" in the Unix/Mac OS X filesystem. You could either try locking the file with flock or, simpler, copy the files to a subdir of the destination directory temporarily, then move them out once they're fully copied (assuming you have control over the program that does the copying). Moving is an atomic operation; you'll know the file is completely copied once it appears at the expected path.
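The staging-directory idea above could look roughly like this in Python. This is a minimal sketch, assuming you control the program that does the copying; the function name publish_copy and the ".incoming" subdirectory are hypothetical choices, not something from the question. The rename is only atomic when the staging directory and the destination are on the same filesystem.

```python
import os
import shutil


def publish_copy(src, dest_dir, staging_name=".incoming"):
    """Copy src into dest_dir so that it only appears there once complete.

    staging_name is a hypothetical subdirectory used for in-progress copies.
    """
    staging_dir = os.path.join(dest_dir, staging_name)
    os.makedirs(staging_dir, exist_ok=True)

    name = os.path.basename(src)
    partial_path = os.path.join(staging_dir, name)
    final_path = os.path.join(dest_dir, name)

    # Slow step: the partially written file lives only in the staging directory.
    shutil.copyfile(src, partial_path)

    # Fast step: a rename within one filesystem is atomic, so a watcher of
    # dest_dir sees either no file or the whole file.
    os.replace(partial_path, final_path)
    return final_path
```

The code that manipulates the files then only needs to look at dest_dir itself and ignore the staging subdirectory.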

Take the MD5 of the file before you copy it, and again whenever you think the copy is done. When the two digests match, you are good to go. Use md5 from the hashlib module for this.
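A minimal sketch of that digest comparison, assuming both the original and the copy are readable from the checking process. The helper names file_md5 and copy_is_complete are made up for illustration. Reading in chunks keeps memory use flat for large files.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_complete(original_path, copy_path):
    """True when the copy has the same MD5 digest as the original."""
    return file_md5(original_path) == file_md5(copy_path)
```

Note that hashing reads both files in full each time, so this is expensive to poll with on large files.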

If you know where the files are being copied from, you can check to see whether the size of the copy has reached the size of the original.
Alternatively, if a file's size doesn't change for a couple of seconds, it is probably done being copied, which may be good enough. (May not work well for slow network connections, however.)

It seems like you have control of the (python?) program doing the copying. What commands are you using to copy? I would think writing your code such that it blocks until the copy operation is complete would be sufficient.
Is this program multi-threaded or multi-process? If so, you could add file paths to a queue when they are complete and then have the other thread act only on items in the queue.
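Here is one way the queue hand-off might look with two threads. It is a sketch under the assumption that the copy and the processing live in the same Python program; copier, processor, run and the DONE sentinel are illustrative names only.

```python
import queue
import shutil
import threading

completed = queue.Queue()   # holds paths of fully copied files
DONE = None                 # sentinel telling the consumer to stop


def copier(pairs):
    """Copy each (src, dst) pair, announcing dst only after the copy returns."""
    for src, dst in pairs:
        shutil.copyfile(src, dst)   # blocks until the copy is finished
        completed.put(dst)
    completed.put(DONE)


def processor(handle):
    """Call handle(path) for every completed file until the sentinel arrives."""
    while True:
        path = completed.get()
        if path is DONE:
            break
        handle(path)


def run(pairs, handle):
    """Process files on a worker thread while the main thread copies them."""
    worker = threading.Thread(target=processor, args=(handle,))
    worker.start()
    copier(pairs)
    worker.join()
```

Because a path is only enqueued after shutil.copyfile returns, the processing thread never sees a half-written file.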

You can use lsof and parse the list of open handles. If some process still has an open handle on the file (i.e. it may still be writing), you will find it there.

You can do this:
import os
import time
# Get the file size twice: now and again after 3 seconds.
size_1 = os.path.getsize(file_path)
time.sleep(3)
size_2 = os.path.getsize(file_path)
# Compare the sizes.
if size_1 == size_2:
    pass  # The size is stable: do something.
else:
    pass  # The file is still growing: do something else.
You can change the delay to whatever suits your needs.

Related

Why does os.scandir() slow down/ how to reorganize huge directory?

I have a directory with 3 million+ files in it (which I should have avoided creating in the first place). Using os.scandir() to simply print out the names,
for f in os.scandir():
    print(f)
takes .004 seconds per item for the first ~200,000 files, but drastically slows down to .3 seconds per item. Upon trying it again, it did the same thing- fast for the first ~200,000, then slowed way down.
After waiting an hour and running it again, this time it was fast for the first ~400,000 files but then slowed down in the same way.
The files all start with a year between 1908 and 1963, so I've tried reorganizing the files using bash commands like
for i in {1908..1963}; do
    mkdir ../test-folders/$i
    mv $i* ../test-folders/$i/
done
But it ends up getting hung up and never making it anywhere...
Any advice on how to reorganize this huge folder or more efficiently list the files in the directory?

It sounds like using an iterator, a function that only returns one item at a time instead of putting everything in memory, would be best.
The glob library has the function iglob:
for infile in glob.iglob( os.path.join(rootdir, '*.*') ):
    …
Documentation: https://docs.python.org/3/library/glob.html#glob.iglob
Related question and answer: https://stackoverflow.com/a/17020892/7838574

Oof. That's a lot of files. I'm not sure why Python starts slowing down; that is interesting. But there are many reasons why you're having problems. For one, a directory can be thought of as a special type of file that just holds the filenames and data pointers of all the files in it (grossly simplified). Reading it, like accessing any file, can be faster at times, when the OS has cached some of that information in memory in order to speed up disk access across the system as a whole.
It seems strange that Python gets slower; maybe you're hitting an internal memory limit or some other mechanism in Python.
But let's fix the root of the problem. Your bash script is problematic, because every time you are using a * character you're forcing the bash script to read the entire directory (and likely sort it alphabetically) too. It might be wiser to get the list once and then operate on sections of the list. Maybe something like:
/bin/ls -1 > /tmp/allfiles
for i in {1908..1963}; do
    echo "moving files starting with $i"
    mkdir ../test-folders/$i
    mv $(egrep "^$i" /tmp/allfiles) ../test-folders/$i/
done
This will read the directory only once (sort of) and will tell you how fast it's going.
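The same "list once, then move" idea can be sketched in Python, which avoids passing very long argument lists to mv. This assumes the year is the first four characters of each filename, as described in the question; bucket_by_prefix and its parameters are hypothetical names.

```python
import os


def bucket_by_prefix(src_dir, dest_root, prefix_len=4):
    """Move every regular file in src_dir into dest_root/<first prefix_len chars>/."""
    # Read the directory exactly once; keep only the names, not DirEntry objects.
    with os.scandir(src_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]

    made = set()
    for count, name in enumerate(names, start=1):
        bucket = os.path.join(dest_root, name[:prefix_len])
        if bucket not in made:
            os.makedirs(bucket, exist_ok=True)
            made.add(bucket)
        # os.rename only updates directory entries, so it is cheap, but it
        # requires src_dir and dest_root to be on the same filesystem.
        os.rename(os.path.join(src_dir, name), os.path.join(bucket, name))
        if count % 100000 == 0:
            print("moved", count, "files")
    return len(names)
```

Collecting the names before moving anything means the directory is not being modified while it is being scanned.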

How the OS handles python and subprocesses of a python script...?

My question is somewhat unique. I am currently working on a project for my computer forensics class. This project is aimed at hiding disk data from investigators. The method by which this is supposed to be achieved is by writing the bytes of a "clean" file over the "bad" file. Once overwritten, the "bad" file is deleted.
This concept sounds simple enough, but what my partner and I have observed is interesting. If we open a file in a Python script, we can easily overwrite the data associated with that file on disk (verified using dd). We can also easily delete a file from inside the script. However, a write followed by a delete results in no write actually taking place, only the file's removal.
This makes sense from an OS optimization standpoint. From that point, we thought it might work if we split the writing and deleting into two separate scripts, and controlled both by a third. However, it seems that even if we run the scripts as a subprocess of another script, the same thing happens. We've tried to use bash scripts for the deletion process instead of pure python, and still, nothing sticks.
This project was supposed to be a whole mess of little anti-forensics tools like this, but this particular one has captured our whole attention because of this issue. Does anyone have an idea as to why this is happening and what we can do to move forward?
We know this can be achieved in C, etc, but we want to solve this using python because of the interesting constraints it's presented.
---EDIT---
This is a snippet from our controller, it calls "ghost.py" with the associated params.
ghost.py prints the edited file names/paths to stdout.
Relevant code follows:
proc = subprocess.Popen(['python', 'ghost.py', '-c', 'good.txt', '-d','/mnt/evil.txt'], stdout=subprocess.PIPE,)
files = proc.communicate()
for i in files:
    if i != None and i != "\n":
        os.system("./del.sh " + i)

Using a subprocess doesn't change any interesting aspect of your design, so don't use one. You probably need os.fsync(). Try this pattern:
myfile.write('all of my good data')
myfile.flush()
os.fsync(myfile.fileno())
myfile.close()
os.remove(myfile.name)  # os.remove takes a path, not a file object
Reference: https://docs.python.org/2/library/os.html#os.fsync
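For completeness, here is the flush-then-fsync pattern as a self-contained function. It is a sketch only: overwrite_then_delete is a hypothetical name, and it assumes the file already exists and is opened without truncation. Whether the new bytes land on the same physical blocks as the old ones depends on the filesystem and the storage device (journaling, copy-on-write and flash storage can all relocate data), so the result still needs to be verified on the actual disk.

```python
import os


def overwrite_then_delete(path, data):
    """Overwrite the start of an existing file in place, force it to disk, then delete it."""
    # "r+b" opens for writing without truncating; mode "wb" would truncate first
    # and the filesystem might then allocate different blocks.
    with open(path, "r+b") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
    os.remove(path)
```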

Performant check whether a directory contains at least n files [duplicate]

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory, and then returns the list. What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted using such a method. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the state of the collection at the beginning, and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if there are filesystem changes (add, remove, rename files within the iterated directory) which modify the collection?
There could potentially be a few cases that cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lott's example:
filea.txt
fileb.txt
filec.txt
The iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file, and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.
If the iterator instead was able to somehow maintain an index dir.get_file(0), then maintaining positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.
This is all theoretical of course, since there appears to be no built-in (python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.
Edit:
The OS of concern is Redhat. My use case is this:
Process A is continuously writing files to a storage location.
Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.
Edit:
Definition of valid:
Adjective
1. Well grounded or justifiable, pertinent.
(Sorry S.Lott, I couldn't resist).
I've edited the paragraph in question above.

tl;dr (update): As of Python 3.5 (currently in beta), just use os.scandir.
As I've written earlier, since "iglob" is just a facade for a real iterator, you will have to call low level system functions in order to get one at a time like you want. Fortunately, calling low level functions is doable from Python.
The low level functions are different for Windows and Posix/Linux systems.
If you are on Windows, you should check if win32api has any call to read "the next entry from a dir" or how to proceed otherwise.
If you are on Posix/Linux, you can call libc functions directly through ctypes and get one directory entry (including naming information) at a time.
The documentation on the C functions is here:
http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory
http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory
I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system but this code snippet may not work on your system[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.
Here is the snippet using ctypes and libc I've put together that allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined on the structure. (I am using the print statement, which implicitly calls Python's str.)
#!/usr/bin/env python2
from ctypes import *
libc = cdll.LoadLibrary("libc.so.6")
# Declare pointer return types so addresses are not truncated to a C int on 64-bit systems.
libc.opendir.restype = c_void_p
libc.readdir64.restype = c_void_p
dir_ = c_voidp(libc.opendir("/home/jsbueno"))
class Dirent(Structure):
    _fields_ = [("d_ino", c_voidp),
                ("off_t", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
                ]
while True:
    p = libc.readdir64(dir_)
    if not p:
        break
    entry = Dirent.from_address(p)
    print entry.d_name
Update: Python 3.5 is now in beta, and its new os.scandir function implements PEP 471 ("a better and faster directory iterator"). It does exactly what is asked for here, along with many other optimizations that can deliver up to a 9-fold speed increase over os.listdir when listing large directories on Windows (2-3 fold on Posix systems).
[footnote-1] The dirent64 C struct is determined at C compile time for each system.
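With os.scandir available, the "at least n files" check from the title can be written without reading the whole directory. This is a minimal sketch for Python 3.6 or later (where os.scandir works as a context manager); has_at_least is a made-up helper name, and it counts every directory entry, not only regular files.

```python
import os
from itertools import islice


def has_at_least(path, n):
    """Return True if the directory at path holds at least n entries."""
    if n <= 0:
        return True
    with os.scandir(path) as entries:
        # islice stops pulling entries after n, so the rest are never read.
        return sum(1 for _ in islice(entries, n)) >= n
```

Because the iterator is abandoned after n entries, the cost depends on n rather than on the size of the directory.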

The glob module in Python 2.5 and later has an iglob function, which returns an iterator.
An iterator exists exactly for the purpose of not storing huge values in memory.
glob.iglob(pathname)
    Return an iterator which yields the same values as glob() without
    actually storing them all simultaneously.
For example:
import glob
for eachfile in glob.iglob('*'):
    pass  # act upon eachfile

Since you are using Linux, you might want to look at pyinotify.
It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.
Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.
It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.

@jsbueno's post is really useful, but it is still kind of slow on slow disks, since libc readdir() only reads 32K of disk entries at a time. I am not an expert on making system calls directly in Python, but I outlined how to write code in C that will list a directory with millions of files in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.
The ideal case would be to call getdents() directly in Python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so you can specify a read buffer size when loading directory entries from disk, rather than calling readdir(), which as far as I can tell has a buffer size defined at compile time.

What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
No method will reveal a filename which "changed". It's not even clear what you mean by this "filenames change, new files are added, and files are deleted"? What is your use case?
Let's say you have three files: a.a, b.b, c.c.
Your magical "iterator" starts with a.a. You process it.
The magical "iterator" moves to b.b. You're processing it.
Meanwhile a.a is copied to a1.a1, a.a is deleted. What now? What does your magical iterator do with these? It's already passed a.a. Since a1.a1 is before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?
The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?
Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.
Don't use the naked file system for coordination.
Use a queue.
Process A writes files and enqueues the add/change/delete memento onto a queue.
Process B reads the memento from queue and then does the follow-on processing on the file named in the memento.

I think what you are asking is impossible due to the nature of file IO. Once python has retrieved the listing of a directory it cannot maintain a view of the actual directory on disk, nor is there any way for python to insist that the OS inform it of any modifications to the directory.
All python can do is ask for periodic listings and diff the results to see if there have been any changes.
The best you can do is create a semaphore file in the directory which lets other processes know that your python process desires that no other process modify the directory. Of course they will only observe the semaphore if you have explicitly programmed them to.
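The "periodic listings and diff" approach mentioned above might be sketched like this. The names snapshot and diff_listings are illustrative, and the sketch assumes Python 3.6 or later and that the set of names fits comfortably in memory.

```python
import os


def snapshot(path):
    """Return the set of entry names currently in the directory."""
    with os.scandir(path) as entries:
        return {entry.name for entry in entries}


def diff_listings(before, after):
    """Return (added, removed) names between two snapshots."""
    return after - before, before - after
```

A polling loop would call snapshot at intervals and pass consecutive results to diff_listings. A rename shows up as one removal plus one addition, and a file that is created and deleted between two polls is never seen at all.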

A way to "listen" for changes to a file system from Python on Linux?

I want to be able to detect whenever new files are created or existing files are modified or deleted within a given directory tree (or set of trees). The brute force way to do this would be to just rescan the tree looking for changes, but I'm looking for a more "interrupt driven" solution where the file system tells my code what changed when it changes, rather than my code having to "poll" by continuously scanning through thousands of files looking for changes.
A way to do this in Python is preferred, but if I have to write a native module in C that's ok as a last resort.

pyinotify is IMHO the only way to get system changes without scanning the directory.

twisted.internet.inotify! It's much more useful to have an event loop attached than just free-floating inotify. Using twisted also gives you filepath for free, which is a nice library for more easily manipulating file paths in python.
