Python and memory usage when opening files

When opening and appending to a file in Python, does that file get loaded into memory? I'm asking because I'm writing a program that writes to several files in a round-robin fashion; I have the guarantee that any one file fits into memory, but not all files fit into memory at the same time. Opening and closing files every time I append is not an option since that would be too slow. As such, I would need all the files opened simultaneously.

The answer is NO. According to the documentation, open() wraps a system call and returns a file object (not the contents of the file): https://docs.python.org/2/library/functions.html#open
Open a file, returning an object of the file type described in section
File Objects.
The file contents are not loaded into RAM unless you read the file, e.g. with read() or readlines().
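A minimal sketch of the round-robin pattern described in the question, with hypothetical file names; the open file objects only hold small write buffers, never the full file contents:

# Keep several files open in append mode and write to them round-robin.
filenames = ["out_0.txt", "out_1.txt", "out_2.txt"]  # hypothetical names
files = [open(name, "a") for name in filenames]
try:
    for i, record in enumerate(["a", "b", "c", "d", "e"]):  # stand-in data
        files[i % len(files)].write(record + "\n")
finally:
    for f in files:
        f.close()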

Related

Reading file in "rb" mode while the file is changing

Imagine that you are reading the byte contents of a file in Python, with the goal of writing them to a temporary file or a BytesIO.
What I have not been able to answer is what will happen if the file is large and it is modified while it is open.
Is there a way to ensure that the file is read correctly, without errors?
I would have dealt with that by simply copying the file into memory first, but that does not seem wise for large files.
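One way to approach the copy itself is to stream it in chunks rather than load it all at once. This is only a sketch of that idea, with a hypothetical source path; it does not by itself guarantee a consistent snapshot if the writer changes the file mid-copy:

import shutil
import tempfile

# Copy the source file into a temporary file in fixed-size chunks instead of
# reading it fully into memory.
with open("source.bin", "rb") as src, \
        tempfile.NamedTemporaryFile(delete=False) as tmp:
    shutil.copyfileobj(src, tmp, length=1024 * 1024)  # 1 MiB chunks
    snapshot_path = tmp.name  # work with the snapshot from here on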

How to write and update .txt files with Python?

I've written a script that fetches Bitcoin data and saves it in .txt files, or, where the .txt files already exist, updates them. The .txt files contain nodes and relationships connecting the nodes for Neo4j.
At the beginning of the script:
It checks whether the files exist; if so, it opens them and appends new lines, OR
in case the files do not exist, the script creates them and starts appending lines.
The .txt files stay open while the script writes the new data, and they close when all the data have been written or when I terminate the execution.
My question is:
Should I open, write, close each .txt file for each iteration and for each .txt file?
or
Should I keep it the way it is now: open the .txt files, do all the writing, and close them when the writing is done?
I am saving data from 6013 blocks. Which way would minimize risk of corrupting the data written in the .txt files?
Keeping the files open will be faster. In the comments you mentioned that "loss of data previously written is not an option". The probability of corrupting a file is higher while it is open, so opening and closing the file on each iteration is more reliable.
There is also the option of keeping the data in a buffer and writing/appending the buffer to the file when all data have been received, or on a user/system interrupt or network timeout, as sketched below.
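A rough sketch of that buffering idea (the names and the data source are illustrative, not taken from the original script):

buffer = []
try:
    for block in range(6013):              # stand-in for fetching block data
        buffer.append("block %d\n" % block)
finally:
    # One open/write/close per run, also reached on interruption.
    with open("nodes.txt", "a") as f:      # hypothetical output file
        f.writelines(buffer)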
I think keeping the file open will be more efficient, because Python won't need to look up and open the file every time you want to read from or write to it.
I guess it should look like this:
with open(filename, "a") as f:
    while True:
        data = get_data()  # placeholder for however the data is fetched
        f.write(data)
"Run a benchmark and see for yourself" would be the typical answer to this kind of question.
Nevertheless, opening and closing a file does have a cost. Python needs to allocate memory for the buffer and the data structures associated with the file and call some operating system functions, e.g. the open syscall, which in turn searches for the file in the cache or on disk.
On the other hand, there is a limit on the number of files a program, the user, the whole system, etc. can open at the same time. For example on Linux, the value in /proc/sys/fs/file-max denotes the maximum number of file handles that the kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit (source).
If your program runs in such a restrictive environment, then it would be good to keep the file open only when needed.
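As for the benchmark suggested above, a small sketch with timeit (the file name and iteration counts are arbitrary):

import timeit

def open_each_time(n=1000):
    # Re-open the file for every single append.
    for _ in range(n):
        with open("bench.txt", "a") as f:
            f.write("x\n")

def keep_open(n=1000):
    # Open once and keep writing.
    with open("bench.txt", "a") as f:
        for _ in range(n):
            f.write("x\n")

print(timeit.timeit(open_each_time, number=10))
print(timeit.timeit(keep_open, number=10))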

File updates twice when dumping pickle

I have a python script which will run various other scripts when it sees various files have been updated. It rapidly polls the files to check for updates by looking at the file modified dates.
For the most part this has worked as expected. When one of my scripts updates a file, another script is triggered and the appropriate action(s) are taken. For reference I am using pickles as the file type.
However, adding a new file and corresponding script into the mix just now, I've noticed an issue where the file has its modified date updated twice. Once when I perform the pickle.dump() and again when I exit the "with" statement (when the file closes). This means that the corresponding actions trigger twice rather than once. I guess this makes sense but what's confusing is this behaviour doesn't happen with any of my other files.
I know a simple workaround would be to poll the files slightly less frequently, since the gap between the file updates is extremely small. But I want to understand why this issue is occurring some times but not others.
I think what you observe is two actions: file created and file updated.
To resolve this, create and populate the file outside of the monitored folders, and once the "with" block is over (the file is closed), move it from the temporary location to the proper place.
To do this, look at the tempfile module in the standard library.
If the pickle is big enough (typically somewhere around 4+ KB, though it will vary by OS/file system), this would be expected behavior. The majority of the pickle would be written during the dump call as buffers filled and got written, but whatever fraction doesn't consume the full file buffer would be left in the buffer until the file is closed (which implicitly flushes any outstanding buffered data before closing the handle).
I agree with the other answer that the usual solution is to write the file in a different folder (but on the same file system), then immediately after closing it, use os.replace to perform an atomic rename that moves it from the temporary location to the final location, so there is no gap between file open, file population, and file close; the file is either there in its entirety, or not at all.
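A minimal sketch of that pattern, with hypothetical paths; tempfile.mkstemp keeps the partially written file out of the monitored location, and os.replace swaps it in atomically:

import os
import pickle
import tempfile

def atomic_dump(obj, final_path, tmp_dir):
    # tmp_dir should be on the same file system as final_path (and outside the
    # monitored folder), so os.replace can rename atomically.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(obj, tmp)
    os.replace(tmp_path, final_path)  # watchers see one complete update

# hypothetical paths; both directories must exist and share a file system
atomic_dump({"example": 123}, "monitored/state.pkl", tmp_dir="staging")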

Is InMemoryUploadedFile really "in memory"?

I understand that opening a file just creates a file handler that takes a fixed memory irrespective of the size of the file.
Django has a type called InMemoryUploadedFile that represents files uploaded via forms.
I get the handle to my file object inside the django view like this:
file_object = request.FILES["uploadedfile"]
This file_object has type InMemoryUploadedFile.
Now we can see for ourselves that file_object has the .read() method, which is used to read the file into memory.
bytes = file_object.read()
Wasn't file_object of type InMemoryUploadedFile already "in memory"?
The read() method on a file object is a way to access the content of a file object, irrespective of whether that file is in memory or stored on disk. It is similar to other file-access utility methods like readlines() or seek().
The behavior is similar to the read() built into Python's own file objects, which in turn is built on the C library's fread() function:
Read at most size bytes from the file (less if the read hits EOF
before obtaining size bytes). If the size argument is negative or
omitted, read all data until EOF is reached. The bytes are returned as
a string object. An empty string is returned when EOF is encountered
immediately. (For certain files, like ttys, it makes sense to continue
reading after an EOF is hit.) Note that this method may call the
underlying C function fread() more than once in an effort to acquire
as close to size bytes as possible. Also note that when in
non-blocking mode, less data than was requested may be returned, even
if no size parameter was given.
On the question of where exactly the InMemoryUploadedFile is stored, it is a bit more complicated.
Before you save uploaded files, the data needs to be stored somewhere.
By default, if an uploaded file is smaller than 2.5 megabytes, Django
will hold the entire contents of the upload in memory. This means that
saving the file involves only a read from memory and a write to disk
and thus is very fast.
However, if an uploaded file is too large, Django will write the
uploaded file to a temporary file stored in your system’s temporary
directory. On a Unix-like platform this means you can expect Django to
generate a file called something like /tmp/tmpzfp6I6.upload. If an
upload is large enough, you can watch this file grow in size as Django
streams the data onto disk.
These specifics – 2.5 megabytes; /tmp; etc. – are simply “reasonable
defaults”. Read on for details on how you can customize or completely
replace upload behavior.
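The threshold and the temporary directory are controlled by Django settings. A small settings.py sketch, where the size value is the documented 2.5 MB default and the temp dir is shown only for illustration (Django falls back to the system temp dir if it is unset):

# settings.py
# Uploads at or below this size are kept in memory as InMemoryUploadedFile;
# larger uploads are streamed to disk as TemporaryUploadedFile.
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440  # 2.5 MB, the default
FILE_UPLOAD_TEMP_DIR = "/tmp"          # illustrative value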
One thing to consider is that in Python, file-like objects have an API that is adhered to fairly strictly. This makes code very flexible: file-like objects are abstractions over I/O streams, so your code does not have to worry about where the data is coming from, i.e. memory, the filesystem, the network, etc.
File-like objects usually define a handful of methods, one of which is read, as the short sketch below illustrates.
I am not sure of the actual implementation of InMemoryUploadedFile, or how these objects are generated or where they are stored (I am assuming they are entirely in memory though), but you can rest assured that they are file-like objects and provide a read method, because they adhere to the file API.
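A tiny illustration of that abstraction (the on-disk path is hypothetical): the same helper works on an in-memory stream and on a real file, because both expose read().

import io

def first_bytes(fobj, n=16):
    # Works for any file-like object that implements read(), whether the
    # bytes live in memory or on disk.
    return fobj.read(n)

print(first_bytes(io.BytesIO(b"hello from an in-memory stream")))

with open("example.bin", "rb") as f:   # hypothetical on-disk file
    print(first_bytes(f))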
For the implementation you could start checking out the source:
https://github.com/django/django/blob/master/django/core/files/uploadedfile.py#L90
https://github.com/django/django/blob/master/django/core/files/base.py
https://github.com/django/django/blob/master/django/core/files/uploadhandler.py

Python "r+" requires the file to exist?

I know that it doesn't make sense to open a file for reading if it doesn't exist, unlike for writing. But I need to create a file object, write data to it and then read it later, which is why I want to use the "r+" mode. Of course I could just open the file for writing once and then open the saved file for reading, but the problem is that I don't want the file to be saved to disk. Any ideas?
Maybe you should be using a StringIO then. It imitates file-like operations (such as writing to and reading from it).
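A short sketch of that idea with io.StringIO (use io.BytesIO for binary data): everything stays in memory and nothing is written to disk.

import io

buf = io.StringIO()
buf.write("some data\n")
buf.write("more data\n")
buf.seek(0)            # rewind before reading, just like a real "r+" file
print(buf.read())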
