Faster way to find large files with Python?

I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 subdirectories and to find files larger than, say, 200 GB, on multiple Linux servers, and it has to be Python.
I have tried several things, like calling du -h from the script, but du is just way too slow to go through a directory as large as 1 TB.
I've also tried the find command, e.g. find ./ -size +200G, but that is also going to take forever.
I have also tried os.walk() with os.path.getsize(), but it's the same problem: too slow.
All of these methods take hours and hours, and I need help finding another solution if anyone is able to help. Not only do I have to run this search for large files on one server, but I will also have to ssh into almost 300 servers and output one giant list of all the files > 200 GB, and the three methods I have tried will not get that done.
Any help is appreciated, thank you!

It's not true that you cannot do better than os.walk().
scandir is said to be 2 to 20 times faster.
From https://pypi.python.org/pypi/scandir
Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.
In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.
From Python 3.5, thanks to PEP 471, scandir is now built in, provided in the os package. Small (untested) example:
import os

max_value = 200 * 1024 ** 3   # e.g. the 200 GB threshold from the question
for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))
(Of course you need stat at some point, but with os.walk you called stat implicitly anyway when using the function. Also, if you only care about files with specific extensions, you could perform stat only when the extension matches, saving even more.)
And there's more to it:
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.
From my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local disk users.
The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh for instance). It's less convenient, but it's worth it.
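To tie this back to the original question, here is a rough sketch of a recursive scan built on os.scandir (untested; the root path is a placeholder and the 200 GB threshold comes from the question). It only stats regular files and skips directories it cannot read:

import os

THRESHOLD = 200 * 1024 ** 3   # 200 GB, the size mentioned in the question

def find_big_files(root):
    # Iterative walk with an explicit stack; os.scandir reports the file type
    # from the directory entry, so we only stat regular files.
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        size = entry.stat(follow_symlinks=False).st_size
                        if size > THRESHOLD:
                            yield entry.path, size
        except OSError:
            continue   # unreadable directory: skip it

for path, size in find_big_files("/path/to/dir"):
    print(path, size)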

It is hard to imagine that you will find a significantly faster way to traverse a directory than os.walk() and du. Parallelizing the search might help a bit in some setups (e.g. SSD), but it won't make a dramatic difference.
A simple way to make things faster is to run the scan automatically in the background every hour or so, and have your actual script just pick up the results. This won't help if the results need to be current, but it might work for many monitoring setups.
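As a sketch of that idea (the cache path is just a placeholder), the hourly background job could dump its findings to a small JSON file that your interactive script then reads back instantly:

import json
import time

CACHE = "/var/tmp/large_files.json"   # placeholder location

def save_scan(results):
    # run this from cron, e.g.:  0 * * * * /usr/bin/python3 scan_large_files.py
    with open(CACHE, "w") as f:
        json.dump({"scanned_at": time.time(), "files": results}, f)

def load_scan():
    # the interactive script only pays the cost of reading one small file
    with open(CACHE) as f:
        return json.load(f)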

Related

Counting number of symbols in Python script

I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16 KB) and the amount of RAM (~1 MB). Refer to pages 113-114 of that document for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap).
I need something similar to the map file gcc generates after linking, which shows each constant/variable and symbol with its address and the size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]), provided it is supported by Telit's libs; I don't think they're using a straight implementation of CPython. Even then, this doesn't always work: it will raise a TypeError when an object's size cannot be determined, unless you specify the default parameter.
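For what it's worth, a minimal sketch of that kind of manual check on a desktop CPython (whether Telit's interpreter supports it is another question):

import sys

# passing a default avoids the TypeError for objects whose size can't be determined
samples = {"small_int": 42, "text": "x" * 500, "listing": list(range(1000))}
for name, obj in samples.items():
    print(name, sys.getsizeof(obj, -1), "bytes")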
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
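A quick sketch of what that looks like on a normal CPython, assuming dis is available at all on the target:

import dis

def blink(times):
    for _ in range(times):
        print("on")
        print("off")

# prints the bytecode instructions CPython generated for the function
dis.dis(blink)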
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post reminds me of my pain with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know that at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter on the module is not a full-fledged implementation. What I did was reuse the same local variable names as much as possible. I also deleted the docstrings of functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The Python interpreter does not support threads or interrupts, so your program must be one super loop. When your application gets bigger, each iteration takes longer. Eventually, you might want to switch to a faster platform.

File scan daemon (OS X) in Python

I'm currently using Python's function,
os.walk('/')
to loop through my whole filesystem on OS X. My aim is to make a personal daemon that keeps track of:
Newly made files/dirs
Adjusted/touched files/dirs
Deleted files (maybe)
The idea
This is more of a precautionary feature I wanted to add to my Macs, to be able to see if weird stuff gets placed in my directories unwanted, so that if my Macs ever get infected by some (yet unknown) trojan I can maybe detect it myself.
Also I'm looking into adding features later to maybe shut down my internet connections etc. when something off is detected. This is maybe an irrational function, but as it's just a personal script I think it's not that bad :P.
What I want to achieve
So my main question is this. After the first run I will save an array of the whole filesystem and its metadata (creation date, modification date). After that I want the daemon to run in the background in a "watching" mode, comparing the last stored array of the filesystem with a freshly walked one.
The problem is that when I run the script to test it, it literally starts to burn my CPU, making my MacBook hiccup after a while. I want to add a sleep between each directory step os.walk() makes in my for loop.
My question is: what is a reasonable sleep time? My MacBook's Disk Utility says I have 183,867 folders and 1,013,320 files, making a total of 1,197,187 entries (as folders are practically files too). So setting my code to:
time.sleep(0.001)
..would add roughly three minutes of sleep to a full walk with ~184,000 directory steps (or about 20 minutes if I slept once per entry instead). I have no clue if this is a good value, and I'd prefer to make it more dynamic, based on the total file/folder count.
An extra feature
While writing this question I noticed that OS X's Disk Utility already knows my total number of files and folders. Can Python get this data too without doing an extreme loop-through? Maybe by calling a terminal command built into OS X.
That way I could also show an indicator in a little GUI for my daemon's status.
Thanks in advance!
This is not a direct answer, but it addresses the requirement to track:
Newly made files/dirs
Adjusted/touched files/dirs
Deleted files (maybe)
You can use http://pyinotify.sourceforge.net/, which ties into inotify and will send events on file changes, deletions and creations. This avoids walking through large directories.
pyinotify is a wrapper over inotify, which is only available on Linux. However, I also see that there are similar libraries and modules for OS X in fink and MacPorts, so this should be a more elegant solution.
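On Linux, a minimal pyinotify sketch might look like this (the watched path is a placeholder):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    # each process_* method is called when the matching event arrives
    def process_IN_CREATE(self, event):
        print("created:", event.pathname)
    def process_IN_DELETE(self, event):
        print("deleted:", event.pathname)
    def process_IN_MODIFY(self, event):
        print("modified:", event.pathname)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_DELETE | pyinotify.IN_MODIFY
wm.add_watch("/path/to/watch", mask, rec=True, auto_add=True)
pyinotify.Notifier(wm, Handler()).loop()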
Not a complete answer, but a few pointers:
For OSX, fseventsd implements a mechanism similar to inotify. fslogger is an example of how it can be used. pymacadmin seems to allow you to interface it from Python.
What you want to implement is similar to Tripwire.

How can I create a ramdisk in Python?

I want to create a ramdisk in Python. I want to be able to do this in a cross-platform way, so it'll work on Windows XP-to-7, Mac, and Linux. I want to be able to read/write to the ramdisk like it's a normal drive, preferably with a drive letter/path.
The reason I want this is to write tests for a script that creates a directory with a certain structure. I want to create the directory completely in the ramdisk so I can be sure it is completely deleted after the tests are over. I considered using Python's tempfile, but if the test is stopped in the middle the directory might not be deleted. I want to be completely sure it's deleted even if someone pulls the plug on the computer in the middle of a test.
How about PyFilesystem?
https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html
https://docs.pyfilesystem.org/en/latest/reference/tempfs.html
The downside is that you have to access the filesystem with PyFilesystem API, but you can also access the real fs with PyFilesystem.
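A minimal sketch with MemoryFS (assuming the current PyFilesystem2 API): nothing below ever touches the real disk, so everything disappears as soon as the filesystem object is closed or the process dies:

from fs.memoryfs import MemoryFS

with MemoryFS() as mem:
    mem.makedirs("project/subdir")
    mem.writetext("project/subdir/notes.txt", "hello from RAM")
    print(mem.readtext("project/subdir/notes.txt"))
    print(list(mem.walk.files()))   # lists every file in the in-memory tree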
Because file and directory-handling is so low-level and OS dependent, I doubt anything like what you want exists (or is even possible). Your best bet might be to implement a "virtual" file-system-like set of functions, classes, and methods that keep track of the files and directory-hierarchy created and their content.
Callables in such an emulation would need to have the same signature and return the same value(s) as their counterparts in the various Python standard built-ins and modules your application uses.
I suspect this might not be as much work as it sounds -- emulating the standard Python file-system interface -- depending on how much of it you're actually using, since you wouldn't necessarily have to imitate all of it. Also, if written in Pure Python™, it would be portable and easy to maintain and enhance.
One option might be to inject (monkey-patch) modified versions of the methods used in the os module, as well as the built-ins open and file, that write to StringIO files instead of to disk. Obviously this substitution should only occur for the module being tested.
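As a hypothetical illustration of that idea (a real test suite would also have to cover the os functions the code under test uses), using io.StringIO and unittest.mock:

import builtins
import io
from unittest import mock

captured = io.StringIO()

def fake_open(path, mode="r", *args, **kwargs):
    # stand-in for the real open(): every call returns the same in-memory buffer
    return captured

with mock.patch.object(builtins, "open", fake_open):
    f = open("/tmp/should_not_exist.txt", "w")   # no file is created on disk
    f.write("written to memory only")

print(captured.getvalue())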
Please read this:
http://docs.python.org/library/tempfile.html#tempfile.TemporaryFile
"Return a file-like object that can be
used as a temporary storage area. The
file is created using mkstemp(). It
will be destroyed as soon as it is
closed (including an implicit close
when the object is garbage
collected)."
It's all handled for you. Do nothing and it already works.
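A small sketch of that behaviour:

import tempfile

# on POSIX the file is unlinked immediately after creation, so nothing is
# left behind even if the process is killed before the block exits
with tempfile.TemporaryFile(mode="w+") as scratch:
    scratch.write("intermediate results")
    scratch.seek(0)
    print(scratch.read())
# the file no longer exists anywhere on disk at this point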

Will Python be faster if I put commonly called code into separate methods or files?

I thought I once read on SO that Python will compile and run slightly more quickly if commonly called code is placed into methods or separate files. Does putting Python code in methods have an advantage over separate files or vice versa? Could someone explain why this is? I'd assume it has to do with memory allocation and garbage collection or something.
It doesn't matter. Don't structure your program around code speed; structure it around coder speed. If you write something in Python and it's too slow, find the bottleneck with cProfile and speed it up. How do you speed it up? You try things and profile them. In general, function call overhead in critical loops is high. Byte compiling your code takes a very small amount of time and only needs to be done once.
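A small sketch of that workflow, profiling a made-up hot function:

import cProfile
import pstats

def hot_loop():
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()
# show the ten most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)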
No. Regardless of where you put your code, it has to be parsed once and compiled if necessary. The distinction between putting code in methods or in different files might make an insignificant performance difference, but you shouldn't worry about it.
About the only language right now where you have to worry about structuring code "right" is JavaScript, because it has to be downloaded from the net to the client's computer. That's why there are so many compressors and obfuscators for it. This kind of thing isn't done with Python because it isn't needed.
Two things:
Code in separate modules is compiled into bytecode the first time it is imported and saved as a precompiled .pyc file, so it doesn't have to be recompiled on the next run as long as the source hasn't been modified since. This can give a small performance advantage, but only at program startup.
Also, Python looks up names a bit more efficiently if they are local to a function rather than at the top level of a file, because locals use a fast indexed lookup while module-level names go through a dictionary. But I don't think that's what you're referring to here, is it?
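For what it's worth, the second point is easy to measure with timeit; a rough sketch (numbers will vary by machine):

import timeit

counter = 0   # module-level name: looked up in the globals dict on each access

def global_loop():
    global counter
    for _ in range(100_000):
        counter += 1

def local_loop():
    total = 0   # local name: fast indexed lookup inside the frame
    for _ in range(100_000):
        total += 1
    return total

print("globals:", timeit.timeit(global_loop, number=100))
print("locals :", timeit.timeit(local_loop, number=100))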

What is the use of the "-O" flag for running Python?

Python can run scripts in optimized mode (python -O), which turns off debug checks and removes assert statements, and IIRC it also removes docstrings.
However, I have not seen it used. Is python -O actually used? If so, what for?
python -O does the following currently:
completely ignores asserts
sets the special builtin name __debug__ to False (which by default is True)
and when called as python -OO
removes docstrings from the code
I don't know why everyone forgets to mention the __debug__ issue; perhaps it is because I'm the only one using it :) An if __debug__ construct creates no bytecode at all when running under -O, and I find that very useful.
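A small demo script (call it whatever you like; debug_demo.py here) shows both effects:

# debug_demo.py
def divide(a, b):
    assert b != 0, "b must be non-zero"    # removed entirely under python -O
    if __debug__:
        # this whole block compiles to no bytecode at all under python -O
        print("debug: dividing", a, "by", b)
    return a / b

print(divide(10, 2))

Run it with python debug_demo.py and again with python -O debug_demo.py: the debug print disappears, and a call like divide(1, 0) raises ZeroDivisionError instead of AssertionError.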
It saves a small amount of memory, and a small amount of disk space if you distribute any archive form containing only the .pyo files. (If you use assert a lot, perhaps with complicated conditions, the savings can be non-trivial and can extend to running time too.)
So, it's definitely not useless -- and of course it's being used (if you deploy a Python-coded server program to a huge number N of server machines, why ever would you want to waste N * X bytes to keep docstrings which nobody, ever, would anyway be able to access?!). Of course it would be better if it saved even more, but, hey -- waste not, want not!-)
So it's pretty much a no-brainer to keep this functionality (which is in any case trivially simple to provide, you know;-) in Python 3 -- why add even "epsilon" to the latter's adoption difficulties?-)
Prepackaged software in different Linux distributions often comes byte-compiled with -O. For example, this is from the Fedora packaging guidelines for Python applications:
In the past it was common practice to %ghost .pyo files in order to save a small amount of space on the users filesystem. However, this has two issues: 1. With SELinux, if a user is running python -O [APP] it will try to write the .pyos when they don't exist. This leads to AVC denial records in the logs. 2. If the system administrator runs python -OO [APP] the .pyos will get created with no docstrings. Some programs require docstrings in order to function. On subsequent runs with python -O [APP] python will use the cached .pyos even though a different optimization level has been requested. The only way to fix this is to find out where the .pyos are and delete them.
The current method of dealing with pyo files is to include them as is, no %ghosting.
Removing assertions gives a small performance benefit, so you could use this for "release" code. In practice hardly anyone uses it, because many Python libraries are open source and the help() function is expected to keep working.
So, as long as there isn't any real optimization in this mode, you can ignore it.
