No limit to file generation? - python

Is there no limit to how many files can be created by Python code, the way there is a recursion limit? I have this bit of code here,
each = 0
while True:
    with open('filenamewhtever'+str(each)+'.txt', 'a') as file:
        file.write(str(each))
    each += 1
which seems to work just fine, although it quickly filled up a lot of disk space in the folder. Could this, if unchecked, potentially have crashed my PC? Also, shouldn't the interpreter have a failsafe to prevent this?

There is typically an operating system defined limit to how many files you can have open at the same time. But because the with statement closes each file after you've written it, you don't run into this limit.
There might also be limits imposed by the file system on how many files can be in a single directory. You might see certain operations (like listing the files) become slow, well before you even come close to that limit.
And finally, you are obviously limited by disk space.
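As a small illustration (my addition, Unix-only), the standard-library resource module can report the per-process open-file limit the answer refers to:

import resource

# Soft and hard per-process limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# The loop in the question never approaches this limit, because each
# 'with open(...)' block closes its file before the next one is opened.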

Related

What happens when disk space runs out?

Suppose I do:
with open("temp.txt", "w" as f):
while True:
f.write(1)
What shall happen when I come close to completely using up my disk space? It seems like a problem that might have been asked before, but unfortunately I didn't find anything. Thanks...
In case it matters, I'm on ubuntu.
When the disk is full (or when you exhaust your quota, if the filesystem supports quotas), the write will raise an IOError. If that exception is not caught in a try block, it will terminate the script.
But bad things could happen. Most tools expect to have enough disk space and memory, and most systems are multi-tasking. That means that if you exhaust the system disk, various system components could start to malfunction, especially if you are running under an admin account. Long story short: avoid that unless you are experimenting on a dedicated filesystem...
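As a hedged sketch (my addition, reusing the filename from the question), the disk-full condition can be caught by checking for errno.ENOSPC:

import errno

try:
    with open("temp.txt", "w") as f:
        while True:
            f.write("1")
except IOError as e:  # in Python 3, IOError is an alias of OSError
    if e.errno == errno.ENOSPC:
        print("Disk full - stopping the write loop")
    else:
        raise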

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am starting to be really confident that once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange, because I was told that one reason I needed to structure my file access the way I have is so that the handle and file contents are flushed. But that can't be the case.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re
test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n',re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path,'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re,string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better (see the sketch after this answer).
Make the files much bigger, so that they won't fit in cache
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
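As a sketch of the Linux-VM option above (my addition, not part of the original answer): on Linux you can drop the page cache between runs, which forces the next run to read the files from disk again. This needs root and is Linux-specific.

import subprocess

def drop_linux_caches():
    # Flush dirty pages, then ask the kernel to drop the page cache,
    # dentries and inodes (requires root).
    subprocess.check_call(["sync"])
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")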

Does python "file write()" method guarantee datas have been correctly written?

I'm new to Python and I'm writing a script to patch a file with something like:
def getPatchDatas(file):
    f = open(file,"rb")
    datas = f.read()
    f.close()
    return datas

f = open("myfile.bin","r+b")
f.seek(0xC020)
f.write(getPatchDatas("mypatch.bin"))
f.close()
I would like to be sure the patch has been applied correctly.
So, if no error / exception is raised, does it mean I'm 100% sure the patch has been correctly written?
Or is it better to double check with something like:
f = open("myfile.bin","rb")
f.seek(0xC020)
if not f.read(0x20) == getPatchDatas("mypatch.bin"):
print "Patch not applied correctly!"
f.close()
??
Thanks.
No it doesn't, but roughly it does. It depends how much it matters.
Anything could go wrong - it could be a consumer hard disk which lies to the operating system about when it has finished writing data to disk. It could be corrupted in memory and that corrupt version gets written to disk, or it could be corrupted inside the disk during writing by electrical or physical problems.
It could be intercepted by kernel modules on Linux, filter drivers on Windows or a FUSE filesystem provider which doesn't actually support writing but pretends it does, meaning nothing was written.
It could be screwed up by a corrupted Python install where exceptions don't work or were deliberately hacked out of it, or file objects monkeypatched, or accidentally run in an uncommon implementation of Python which fakes supporting files but is otherwise identical.
These kinds of reasons are why servers have server-class hardware with higher tolerances to temperature and electrical variation, error-checking and correcting memory (ECC), RAID controller battery backups, the ZFS checksumming filesystem, uninterruptible power supplies, and so on.
But, as far as normal people and low risk things go - if it's written without error, it's as good as written. Double-checking makes sense - especially as it's that easy. It's nice to know if something has failed.
Within a single process, it is.
Across multiple processes (e.g. one process is writing and another is reading; even if you ensure the read only happens after the call to write(), the write still needs some time to finish), you may need a file lock.
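To make the double-check from the question a little stronger, here is a hedged Python 3 sketch (the apply_and_verify helper is my own naming, not from the post) that flushes, fsyncs, and then re-reads the patched region:

import os

def apply_and_verify(target, patch, offset):
    with open(patch, "rb") as pf:
        data = pf.read()
    with open(target, "r+b") as tf:
        tf.seek(offset)
        tf.write(data)
        tf.flush()               # push Python's buffer to the OS
        os.fsync(tf.fileno())    # ask the OS to push it to the disk
    with open(target, "rb") as tf:
        tf.seek(offset)
        if tf.read(len(data)) != data:
            raise IOError("patch verification failed")

apply_and_verify("myfile.bin", "mypatch.bin", 0xC020)

Note that fsync() still cannot defeat a disk that lies about completed writes, as the first answer points out.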

Does python make a copy of opened files in memory?

So I would like to search for filenames with os.walk() and write the resulting list of names to a file. I would like to know which is more efficient: opening the file and then writing each result as I find it, or storing everything in a list and then writing the whole list. That list could be big, so I wonder if the second solution would even work.
See this example:
import os
fil = open('/tmp/stuff', 'w')
fil.write('aaa')
os.system('cat /tmp/stuff')
You may expect to see aaa, but instead you get nothing. This is because Python has an internal buffer. Writing to disk is expensive, as it has to:
Tell the OS to write it.
Actually transfer the data to the disk (on a hard disk it may involve spinning it up, waiting for IO time, etc.).
Wait for the OS to report success on the writing.
If you write lots of small pieces of data, this overhead can add up to quite some time. Instead, what Python does is keep a buffer and only actually write from time to time. You don't have to worry about memory growth, as the buffer is kept small. From the docs:
"0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."
When you are done, make sure you call fil.close(), or call fil.flush() at any point during execution, or open the file with buffering=0 to disable buffering.
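Continuing the example above, flushing makes the data visible to other processes:

import os

fil = open('/tmp/stuff', 'w')
fil.write('aaa')
fil.flush()                  # push Python's buffer to the OS
os.system('cat /tmp/stuff')  # now prints: aaa
fil.close()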
Another thing to consider is what happens if, for some reason, the program exits in the middle of the process. If you store everything in memory, it will be lost. What you have on disk, will remain there (but unless you flush, there is no guarantee of how much was actually saved).
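Tying this back to the original question, here is a minimal sketch of the "write each name as you find it" approach; start_dir and names.txt are placeholder names, not from the post:

import os

start_dir = '.'
with open('names.txt', 'w') as out:
    for dirpath, dirnames, filenames in os.walk(start_dir):
        for name in filenames:
            out.write(os.path.join(dirpath, name) + '\n')
# When the with-block ends, the file is flushed and closed; memory use
# stays small because no list of all names is ever built.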

Write zeros to file blocks

I'm attempting to identify the blocks that are tied to a specific file and write zeros to them. I've found several methods that do this to the free space on a disk, but so far I haven't found any solid suggestions for doing the following:
Identify the blocks for a file.
Write zeros to those blocks.
The purpose of this is for a virtualized system. This system has the ability to dedupe blocks that are identified as being the same. This is used to reduce space used by the guest OSes on the drive.
Currently this is being done using dd to write zeros to the free space on the drive. However, on VMware systems this has the side effect of causing the guest OS drive to use the entire disk space it has been allocated, since from that point on the system thinks all the bytes have been written to.
Writing code that can safely modify even an unmounted filesystem will require significant effort. It is to be avoided unless there is no other option.
You basically have two choices to make modifying the filesystem easy:
Run python in the virtual environment.
Mount the virtualized filesystem on the host. Most UNIX-like systems can do that, e.g. with the help of FUSE (which supports a lot of filesystem types) and loop devices.
This way you can use the (guest or host) OS's filesystem code instead of having to roll your own. :-) If you can use one of those options, the code fragment listed below will fill a file with zeroes:
import os

def overwrite(f):
    """Overwrite a file with zeroes.

    Arguments:
    f -- name of the file
    """
    stat = os.stat(f)
    with open(f, 'r+') as of:
        of.write('\0' * stat.st_size)
        of.flush()
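A purely hypothetical usage example (the paths are made up), after mounting the guest filesystem on the host as described above:

for name in ['/mnt/guest/var/log/old1.log', '/mnt/guest/var/log/old2.log']:
    overwrite(name)  # zero the file contents so the hypervisor can dedupe the blocks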
