File scan daemon (OS X) in Python

I'm currently using Python's function,
os.walk('/')
to loop through my whole filesystem on OS X. My aim is to make a personal daemon that keeps track of:
Newly made files/dirs
Adjusted/touched files/dirs
Deleted files (maybe)
The idea
This is more of a precautionary feature I wanted to add to my Macs so I can see if anything weird gets placed in my directories without my knowledge; if my Macs ever get infected by some (yet unknown) trojan, I might be able to detect it myself.
I'm also looking into adding features later, such as shutting down my internet connections when something off is detected. That may be an irrational feature, but as it's just a personal script I think it's not that bad :P.
What I want to achieve
So, my main question: after the first run I will save an array of the whole filesystem and its metadata (creation date, modification date). After that I want the daemon to run in the background in a "watching" mode, effectively comparing the last stored snapshot of the filesystem with a newly walked one.
The problem now is that when I run the script to test it, it literally starts to burn my CPU and my MacBook gets hiccups after a while. I want to add sleeps between each directory step os.walk() makes in my for loop.
My question is: what is a sensible sleep time? My MacBook's Disk Utility says I have 183,867 folders and 1,013,320 files, making a total of 1,197,187 entries (as folders are practically files too). So setting my code to:
time.sleep(0.001)
..would add roughly 3 minutes of total sleep if I sleep once per directory (183,867 × 0.001 s), or about 20 minutes if I sleep once per entry. I have no clue whether that is reasonable, and I would prefer to make it dynamic based on the total file/folder count.
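For reference, the throttled walk I have in mind looks roughly like this (record_metadata is just a placeholder for my own bookkeeping; the sleep value is exactly the part I'm unsure about):

import os
import time

for dirpath, dirnames, filenames in os.walk('/'):
    for name in dirnames + filenames:
        record_metadata(os.path.join(dirpath, name))  # placeholder for my own bookkeeping
    time.sleep(0.001)  # one sleep per directory step of os.walk()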
An extra feature
While writing this question I noticed that OS X's Disk Utility already knows my total file and folder counts. Can Python get this data too without doing a full walk of the disk, maybe by calling a command-line tool built into OS X?
That way I could also show a progress indicator if I add a little GUI for my daemon's status.
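One possible shortcut I'm considering, assuming the filesystem's inode counters roughly match what Disk Utility reports, is os.statvfs, which exposes those counters without any walking:

import os

vfs = os.statvfs('/')
used_nodes = vfs.f_files - vfs.f_ffree  # total file nodes minus free ones
print("approx. files + folders:", used_nodes)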
Thanks in advance!

This is not a direct answer, but it addresses the requirement to track:
Newly made files/dirs
Adjusted/touched files/dirs
Deleted files (maybe)
You can use http://pyinotify.sourceforge.net/, which ties into inotify and will send events on file creation, modification and deletion. This avoids walking through large directories.
It is a wrapper over inotify, which is available on Linux. I also see that there are similar libraries and modules for OS X in Fink and MacPorts, so this should be a more elegant solution.
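As a rough illustration of the event-driven approach (a sketch only; pyinotify requires Linux's inotify, and the watch path and handler below are made up):

import pyinotify

# React to creations, modifications and deletions, recursively.
mask = pyinotify.IN_CREATE | pyinotify.IN_MODIFY | pyinotify.IN_DELETE

class Handler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        print("created:", event.pathname)
    def process_IN_MODIFY(self, event):
        print("modified:", event.pathname)
    def process_IN_DELETE(self, event):
        print("deleted:", event.pathname)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
wm.add_watch('/home/me', mask, rec=True, auto_add=True)
notifier.loop()  # blocks; run this inside your daemon process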

Not a complete answer, but a few pointers:
For OS X, fseventsd implements a mechanism similar to inotify. fslogger is an example of how it can be used. pymacadmin seems to allow you to interface with it from Python.
What you want to implement is similar to Tripwire.
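If you do go the snapshot-and-compare route from the question rather than an event API, a minimal baseline/diff sketch (the choice of stat fields here is just one possibility) could look like:

import os

def snapshot(root):
    # Map every path under root to (size, mtime).
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable entries
            state[path] = (st.st_size, st.st_mtime)
    return state

def diff(old, new):
    created = set(new) - set(old)
    deleted = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    return created, changed, deleted

Comparing a stored snapshot against a fresh one then gives you the created, changed and deleted paths, which is essentially what Tripwire-style tools do (they usually add content hashes as well).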

Related

Multiprocessing new terminal window with python

I am working with multiprocessing and I want to ask if there is some option to create a new process with a new terminal window in Ubuntu.
I have 3 processes starting simultaneously, but I want the results from each of them in a separate terminal.
Thanks
No. Sadly there is no* way to do what you want within Python, as Python has no control over the terminal it is running in.
What I think you want though is to separate the messages from your different processes, so you can see what's going on. What I sometimes do for this (in testing only!) is to have each process log to a different file, and then watch those three files in three terminal windows. You can do this with watch or even a simple while loop in bash:
watch -n 3 "cat /my/output/file" # or:
while true; do cat /my/output/file; sleep 3; done
Of course you can replace cat with something more useful, perhaps tail. Or you can just open the output files in a text editor that has an auto-revert facility (e.g. Emacs with M-x auto-revert-mode), which does exactly the same thing internally: it polls the file for changes and updates if need be.
I also really suggest you use logging inside your code, and give each parallelized function its own logger (with the name derived from the parameters of the function; this can be easier with a small class rather than a function). That way you can later send all your output to a file, and if something goes wrong you can easily find out which run failed and extract information from only that run (with grep!). I use this approach in parallelized fuzzy-matching code (actually for matching music libraries) and it's invaluable when you need to dig into how some strange result occurred.
*Okay, I'm sure there's some horrible way to control some particular terminal and output to it, but that's not what you meant.
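To illustrate the per-process logging idea above, a minimal sketch (the log file locations and the worker body are made up for the example):

import logging
import multiprocessing

def worker(task_id):
    # One logger and one log file per process, named after its parameters.
    logger = logging.getLogger("worker.%d" % task_id)
    handler = logging.FileHandler("/tmp/worker_%d.log" % task_id)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("starting")
    # ... the real work goes here ...
    logger.info("done")

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Each log can then be followed in its own terminal with tail -f /tmp/worker_0.log and so on.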

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by building an AST from various source files and then compiling that to bytecode (though the question probably also applies to cases where the bytecode is generated directly).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST, alongside the filename passed to compile, to provide a representation for the debugger's purposes; however, this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project: the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network or retrieved from non-disk storage (e.g. a database).
And so my Y questions, which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised[0]
and just in case, is the assumption of Python only supporting a single contiguous source file correct or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form
Some of this can be handled by the trepan3k debugger. For other things various hooks are in place.
First of all, it can debug based on bytecode alone. But of course stepping by line won't be possible if the line number table doesn't exist, and for that reason, if for no other, I would add a "line number" for each logical stopping point, such as the beginning of each statement. The numbers don't have to be real line numbers; they could just count from 1 or be indexes into some other table. This is more or less how Go's Pos position type works.
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start a Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a point where the other functions you want to break on have already been defined.
The functions can also be specified as methods on existing variables, e.g. self.my_function().
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the context around where you are currently stopped.
Deparsing bytecode, though, is a bit difficult, so depending on which kind of bytecode you have, the results may vary.
As for the virtual source problem, that situation is somewhat tolerated in the debugger, since that kind of thing has to happen when there is no source at all. To facilitate this, and remote debugging (where the local and remote file locations can differ), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it can remap contiguous lines of one file onto lines in another file, and I think you could apply this repeatedly. However, so far there hasn't been a need for that, and the code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger that can step bytecode instructions even when the line number table is empty.
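Separately from trepan3k, one small trick that may help with the "memory buffer" part of the question is to compile the generated AST under a synthetic filename and register the corresponding source text in linecache, so that tracebacks and pdb can display lines that never existed on disk (the filename and source here are invented):

import ast
import linecache

source = "def greet(name):\n    return 'hello ' + name\n"
filename = "<generated:module_42>"  # purely synthetic name

tree = ast.parse(source, filename=filename)
code = compile(tree, filename, "exec")

# Make the virtual source visible to linecache users (tracebacks, pdb, ...).
linecache.cache[filename] = (len(source), None, source.splitlines(True), filename)

namespace = {}
exec(code, namespace)
print(namespace["greet"]("world"))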

Faster way to find large files with Python?

I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories and find files larger than, say, 200 GB on multiple Linux servers, and it has to be Python.
I have tried many things, like calling du -h from the script, but du is just way too slow to go through a directory as large as 1 TB.
I've also tried the find command, e.g. find ./ -size +200G, but that is also going to take forever.
I have also tried os.walk() with .getsize(), but it's the same problem: too slow.
All of these methods take hours and hours, and I need help finding another solution if anyone is able to help me. Not only do I have to do this search for large files on one server, but I will also have to ssh into almost 300 servers and output a giant list of all the files > 200 GB, and the three methods that I have tried will not be able to get that done.
Any help is appreciated, thank you!
It's not true that you cannot do better than os.walk().
scandir is said to be 2 to 20 times faster.
From https://pypi.python.org/pypi/scandir
Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.
In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.
From Python 3.5, thanks to PEP 471, scandir is built in, provided in the os module. Small (untested) example:
for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))
(Of course you need stat at some point, but with os.walk you called stat implicitly when using the function. Also, if the files you care about have specific extensions, you could perform stat only when the extension matches, saving even more.)
And there's more to it:
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.
From my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local disk users.
The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh for instance). It's less convenient, but it's worth it.
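Since os.scandir itself only lists a single directory, a hedged sketch of a recursive search for large files (the root path and threshold are placeholders) might look like:

import os

def find_large_files(root, threshold=200 * 1024**3):
    # Yield (path, size) for regular files larger than threshold bytes.
    for entry in os.scandir(root):
        try:
            if entry.is_dir(follow_symlinks=False):
                yield from find_large_files(entry.path, threshold)
            elif entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                if size > threshold:
                    yield entry.path, size
        except OSError:
            continue  # permission errors, vanished files, ...

for path, size in find_large_files("/data"):
    print(path, size)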
It is hard to imagine that you will find a significantly faster way to traverse a directory than os.walk() and du. Parallelizing the search might help a bit in some setups (e.g. SSD), but it won't make a dramatic difference.
A simple way to make things feel faster is to run the scan automatically in the background every hour or so, and have your actual script just pick up the results. This won't help if the results need to be current, but it might work for many monitoring setups.

Modifying a running script

I'm using Python scripts to execute simple but long measurements. I was wondering if (and how) it's possible to edit a running script.
An example:
Let's assume I made an error in the last lines of a running script. These lines have not yet been executed. Now I'd like to fix it without restarting the script. What should I do?
Edit:
One idea I had was to load each line of the script into a list, then pop the first one, feed it to an interpreter instance, wait for it to complete, and pop the next one. That way I could modify the remaining lines of the list while the script runs.
I guess I can't be the first one to think about this. Someone must have implemented something like it before, and I don't want to reinvent the wheel. If one of you knows about such a project, please let me know.
I am afraid there's no easy way to arbitrarily modify a running Python script.
One approach is to test the script on a small amount of data first. This way you'll reduce the likelihood of discovering bugs when running on the actual, large, dataset.
Another possibility is to make the script periodically save its state to disk, so that it can be restarted from where it left off, rather than from the beginning.
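As a rough sketch of the checkpointing idea (the state file name, the loop, and run_measurement are all placeholders):

import json
import os

STATE_FILE = "measurement_state.json"

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_step": 0, "results": []}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

state = load_state()
for step in range(state["next_step"], 1000):
    state["results"].append(run_measurement(step))  # run_measurement is your own code
    state["next_step"] = step + 1
    save_state(state)  # after a crash (or an edit), rerun the script to resume here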

How can I create a ramdisk in Python?

I want to create a ramdisk in Python. I want to be able to do this in a cross-platform way, so it'll work on Windows XP-to-7, Mac, and Linux. I want to be able to read/write to the ramdisk like it's a normal drive, preferably with a drive letter/path.
The reason I want this is to write tests for a script that creates a directory with a certain structure. I want to create the directory completely in the ramdisk so I'll be sure it is completely deleted after the tests are over. I considered using Python's tempfile, but if the test is stopped in the middle, the directory might not be deleted. I want to be completely sure it's deleted even if someone pulls the plug on the computer in the middle of a test.
How about PyFilesystem?
https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html
https://docs.pyfilesystem.org/en/latest/reference/tempfs.html
The downside is that you have to access the filesystem through the PyFilesystem API, but PyFilesystem can also wrap the real filesystem, so the same code can work against either.
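A minimal sketch of the in-memory filesystem idea, assuming the current PyFilesystem2 package (the fs module); the paths are arbitrary:

from fs.memoryfs import MemoryFS

with MemoryFS() as mem:
    # Build the directory structure entirely in RAM.
    mem.makedirs("project/output")
    mem.writetext("project/output/result.txt", "hello")

    print(mem.readtext("project/output/result.txt"))
    print(mem.listdir("project"))
# Nothing ever touches the disk, so there is nothing left to clean up.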
Because file and directory-handling is so low-level and OS dependent, I doubt anything like what you want exists (or is even possible). Your best bet might be to implement a "virtual" file-system-like set of functions, classes, and methods that keep track of the files and directory-hierarchy created and their content.
Callables in such an emulation would need to have the same signature and return the same value(s) as their counterparts in the various Python standard built-ins and modules your application uses.
I suspect that emulating the standard Python file-system interface might not be as much work as it sounds, depending on how much of it you're actually using, since you wouldn't necessarily have to imitate all of it. And if written in Pure Python™, it would also be portable and easy to maintain and enhance.
One option might be to inject (monkey-patch) modified versions of the methods used in the os module, as well as the built-ins open and file, that write to StringIO objects instead of to disk. Obviously this substitution should only occur for the module being tested.
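A small sketch of that monkey-patching idea, using the standard library's unittest.mock.mock_open rather than a raw StringIO (mymodule and write_report are hypothetical names for the module and function under test):

import unittest.mock

import mymodule  # hypothetical module under test

m = unittest.mock.mock_open()

# Replace open() only as seen from inside mymodule.
with unittest.mock.patch.object(mymodule, "open", m, create=True):
    mymodule.write_report()  # hypothetical function that opens and writes a file

# Inspect what would have been written to disk.
m().write.assert_called_with("expected contents")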
Please read this:
http://docs.python.org/library/tempfile.html#tempfile.TemporaryFile
"Return a file-like object that can be
used as a temporary storage area. The
file is created using mkstemp(). It
will be destroyed as soon as it is
closed (including an implicit close
when the object is garbage
collected)."
It's all handled for you. Do nothing and it already works.
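For completeness, a tiny usage sketch of that behaviour:

import tempfile

with tempfile.TemporaryFile(mode="w+") as f:
    f.write("scratch data")
    f.seek(0)
    print(f.read())
# The file is destroyed as soon as it is closed, here at the end of the with block.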
