What happens when disk space runs out? - python

Suppose I do:
with open("temp.txt", "w" as f):
while True:
f.write(1)
What will happen when I get close to using up my disk space completely? It seems like a question that might have been asked before, but unfortunately I didn't find anything. Thanks...
In case it matters, I'm on Ubuntu.

When the disk is full (or when you exhaust your quota, if the filesystem supports quotas), the write will raise an IOError. If that exception is not caught in a try block, it will terminate the script.
But bad things could happen. Most tools expect to have enough disk and memory available, and most systems implement multi-tasking. That means that if you exhaust the system disk, various system components could start to malfunction, especially if you are running under an admin account. Long story short: avoid that unless you are experimenting on a dedicated filesystem...
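For illustration, here is a minimal sketch of catching that error; the data written is arbitrary, and errno.ENOSPC is the "No space left on device" error code. Because writes are buffered, the error may only surface on a later write() or when the file is flushed or closed.

import errno

try:
    with open("temp.txt", "w") as f:
        while True:
            f.write("some data\n")
except IOError as e:  # IOError is an alias of OSError in Python 3
    if e.errno == errno.ENOSPC:
        print("Disk is full, stopping the write loop")
    else:
        raise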

Related

No limit to file generation?

Is there no limit to how many files can be created in Python code, the way there is a recursion limit? I have this bit of code here,
each = 0
while True:
    with open('filenamewhtever' + str(each) + '.txt', 'a') as file:
        file.write(str(each))
    each += 1
which seems to work just fine, although it quickly filled up a lot of space in the folder. Could this, if unchecked, potentially have crashed my PC? Also, shouldn't the interpreter have a failsafe switch to prevent this?
There is typically an operating system defined limit to how many files you can have open at the same time. But because the with statement closes each file after you've written it, you don't run into this limit.
There might also be limits imposed by the file system on how many files can be in a single directory. You might see certain operations (like listing the files) become slow, well before you even come close to that limit.
And finally, you are obviously limited by disk space.
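If you are curious what the open-file limit actually is on your system, the resource module can report it on Linux; a minimal sketch:

import resource

# Soft and hard limits on how many file descriptors this process may hold open at once.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("Open file limit: soft=%d, hard=%d" % (soft, hard))

Because the with statement in the question's loop closes each file before the next one is opened, that loop never holds more than one of these descriptors at a time.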

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am becoming quite confident that once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access the way I have is so that the handle and file contents are flushed. But that can't be the case.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)

no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes only 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better
Make the files much bigger, so that they won't fit in cache
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access and then closing it right away, Windows will have enough doubt about the file content to invalidate the cache.
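One way to observe the cache effect yourself is to time the same read twice in a row; a rough sketch (the path is hypothetical, and the second timing only reflects the cache if nothing evicts the file in between):

import time

def timed_read(path):
    start = time.time()
    with open(path, 'r') as fh:
        data = fh.read()
    return time.time() - start, len(data)

# The first read usually has to hit the disk; the second is typically served from the OS cache.
print(timed_read('E:/docs/sample.txt'))
print(timed_read('E:/docs/sample.txt'))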

Does Python's file write() method guarantee data has been correctly written?

I'm new to Python and I'm writing a script to patch a file with something like:
def getPatchDatas(file):
    f = open(file, "rb")
    datas = f.read()
    f.close()
    return datas

f = open("myfile.bin", "r+b")
f.seek(0xC020)
f.write(getPatchDatas("mypatch.bin"))
f.close()
I would like to be sure the patch has been applied correctly.
So, if no error / exception is raised, does it mean I'm 100% sure the patch has been correctly written?
Or is it better to double check with something like:
f = open("myfile.bin","rb")
f.seek(0xC020)
if not f.read(0x20) == getPatchDatas("mypatch.bin"):
print "Patch not applied correctly!"
f.close()
??
Thanks.
No it doesn't, but roughly it does. It depends how much it matters.
Anything could go wrong - it could be a consumer hard disk which lies to the operating system about when it has finished writing data to disk. It could be corrupted in memory and that corrupt version gets written to disk, or it could be corrupted inside the disk during writing by electrical or physical problems.
It could be intercepted by kernel modules on Linux, filter drivers on Windows or a FUSE filesystem provider which doesn't actually support writing but pretends it does, meaning nothing was written.
It could be screwed up by a corrupted Python install where exceptions don't work or were deliberately hacked out of it, or file objects monkeypatched, or accidentally run in an uncommon implementation of Python which fakes supporting files but is otherwise identical.
These kinds of reasons are why servers have server class hardware with higher tolerances to temperature and electrical variation, error checking and correcting memory (ECC), RAID controller battery backups, ZFS checksumming filesystem, Uninterruptable Power Supplies, and so on.
But, as far as normal people and low risk things go - if it's written without error, it's as good as written. Double-checking makes sense - especially as it's that easy. It's nice to know if something has failed.
In a single process, it is.
With multiple processes (e.g. one process writing and another reading; even if you make sure the read only happens after write() is called, the write needs some time to finish), you may need a file lock.
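If you want to be a bit more defensive, you can push the data out of Python's and the OS's buffers before verifying; a minimal sketch that combines fsync with the read-back check from the question (file names are the ones from the question):

import os

def getPatchDatas(path):
    with open(path, "rb") as f:
        return f.read()

patch = getPatchDatas("mypatch.bin")

with open("myfile.bin", "r+b") as f:
    f.seek(0xC020)
    f.write(patch)
    f.flush()             # push Python's buffer down to the OS
    os.fsync(f.fileno())  # ask the OS to push its buffers to the device

# Read back and compare to confirm the bytes on disk match the patch.
with open("myfile.bin", "rb") as f:
    f.seek(0xC020)
    if f.read(len(patch)) != patch:
        print("Patch not applied correctly!")

Even fsync() is only as trustworthy as the hardware underneath it, for the reasons listed above, but it removes the buffering layers you can actually control.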

File copy completion?

In Linux, how can we know if a file has completed copying before reading it? In Windows, an OSError is raised.
You can use the inotify mechanism (via pyinotify) to catch events like CREATE, WRITE and CLOSE, and based on them you can assume whether the copy has finished or not.
However, since you provided no details on what you are trying to do, I can't tell if inotify would be suitable for you (by the way, inotify is Linux specific, so you can't use it on Windows or other platforms).
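For example, a minimal pyinotify sketch that reacts when a writer closes a file it had open for writing (the watched directory is hypothetical):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # Fired when a file that was open for writing is closed, which is a
        # reasonable signal that the copy into the watched directory has finished.
        print("Finished writing:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/tmp/incoming', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, Handler()).loop()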
In Linux, you can open a file while another process is writing to it without Python throwing an OSError, so in general, you cannot know for sure whether the other side has finished writing into that file. You can try some hacks, though:
You can check the file size regularly to see whether it has increased since the last check. If it hasn't increased in, say, five seconds, you might be safe to assume that the copy has finished. I'm saying might since this is not true in all circumstances: if the other process that is writing the file is blocked for whatever reason, it might temporarily stop writing and resume later. So this is not 100% foolproof, but it might work for local file copies if the system is never under a heavy load that would stall the writing process (a rough sketch of this approach follows below).
You can check the output of fuser (this is a shell command), which will list the IDs of all processes holding a file handle to a given file. If this list includes any process other than yours, you can assume that the copying process hasn't finished yet. However, you will have to make sure that fuser is installed on the target system in order to make this work.
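A rough sketch of the size-polling idea mentioned above; the path, the five-second quiet window and the poll interval are arbitrary choices, not recommendations:

import os
import time

def wait_until_stable(path, quiet_seconds=5, poll_interval=1):
    """Return once the file size has stopped changing for quiet_seconds."""
    last_size = -1
    last_change = time.time()
    while True:
        size = os.path.getsize(path)
        if size != last_size:
            last_size = size
            last_change = time.time()
        elif time.time() - last_change >= quiet_seconds:
            return
        time.sleep(poll_interval)

wait_until_stable('/tmp/incoming/big_copy.dat')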

Abort a slow flush to disk after write?

Is there a way to abort a Python write operation in such a way that the OS doesn't feel it's necessary to flush the unwritten data to the disc?
I'm writing data to a USB device, typically many megabytes. I'm using 4096 bytes as my block size on the write, but it appears that Linux caches up a bunch of data early on and writes it out to the USB device slowly. If at some point during the write my user decides to cancel, I want the app to stop writing immediately. I can see that there's a delay between when the data stops flowing from the application and when the USB activity light stops blinking: several seconds, up to about 10 seconds typically. I find that the app is hanging in the close() method, I'm assuming waiting for the OS to finish writing the buffered data. I call flush() after every write, but that doesn't appear to have any impact on the delay. I've scoured the Python docs for an answer but have found nothing.
It's somewhat filesystem dependent, but in some filesystems, if you delete a file before (all of) it is allocated, the IO to write the blocks will never happen. This might also be true if you truncate it so that the part which is still being written is chopped off.
Not sure that you can really abort a write if you want to still access the data. Also the kinds of filesystems that support this (e.g. xfs, ext4) are not normally used on USB sticks.
If you want to flush data to the disc, use fdatasync(). Merely flushing your IO library's buffer into the OS one will not achieve any physical flushing.
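In Python that call is available through the os module on Unix; a small sketch of forcing one chunk out to the device (the path is hypothetical and the chunk is dummy data):

import os

block = b'\x00' * 4096  # stands in for one 4096-byte chunk of real data

with open('/media/usb/output.bin', 'wb') as f:
    f.write(block)
    f.flush()                 # empty Python's own buffer into the OS
    os.fdatasync(f.fileno())  # ask the kernel to push the data to the device

Doing this per block keeps the amount of unwritten data small, so a cancel takes effect almost immediately, at the cost of lower overall throughput.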
Assuming I am understanding this correctly, you want to be able to 'abort' and NOT flush the data. This IS possible using ctypes and a little pokery. It is very OS dependent, so I'll give you the OS X version and then what you can do to change it for Linux:
f = open('flibble1.txt', 'w')
f.write("hello world")

import ctypes
x = ctypes.cdll.LoadLibrary('/usr/lib/libc.dylib')
x.close(f.fileno())

try:
    del f
except IOError:
    pass
If you change /usr/lib/libc.dylib to the libc.so.6 in /usr/lib for Linux then you should be good to go. Basically by calling close() instead of fclose(), no call to fsync() is done and nothing is flushed.
Hope that's useful.
When you abort the write operation, try doing file.truncate(0) before closing it.
