Does reading an entire file leave the file handle open?


If you read an entire file with content = open('Path/to/file', 'r').read() is the file handle left open until the script exits? Is there a more concise method to read a whole file?

The answer to that question depends somewhat on the particular Python implementation.
To understand what this is all about, pay particular attention to the actual file object. In your code, that object is mentioned only once, in an expression, and becomes inaccessible immediately after the read() call returns.
This means that the file object is garbage. The only remaining question is "When will the garbage collector collect the file object?".
In CPython, which uses reference counting, this kind of garbage is noticed immediately, and so it will be collected immediately. This is not generally true of other Python implementations.
A better solution, to make sure that the file is closed, is this pattern:
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
which will always close the file immediately after the block ends, even if an exception occurs.
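To see that behaviour for yourself, here is a minimal sketch. It assumes a throwaway temporary file, and the RuntimeError is raised only to simulate a failure inside the block:

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("hello\n")

handle = None
try:
    with open(path, "r") as content_file:
        handle = content_file            # keep a reference to inspect later
        content = content_file.read()
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

# The with statement closed the file even though the block raised.
print(handle.closed)

os.remove(path)
```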
Edit: To put a finer point on it:
Other than file.__exit__(), which is called automatically when the file is used as a context manager in a with statement, the only other way that file.close() is called automatically (that is, without you calling it explicitly) is via file.__del__(). This raises the question: when does __del__() get called?
A correctly-written program cannot assume that finalizers will ever run at any point prior to program termination.
-- https://devblogs.microsoft.com/oldnewthing/20100809-00/?p=13203
In particular:
Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.
[...]
CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references.
-- https://docs.python.org/3.5/reference/datamodel.html#objects-values-and-types
(Emphasis mine)
But as it suggests, other implementations may behave differently. As an example, PyPy has 6 different garbage collection implementations!

You can use pathlib.
For Python 3.5 and above:
from pathlib import Path
contents = Path(file_path).read_text()
For older versions of Python use pathlib2:
$ pip install pathlib2
Then:
from pathlib2 import Path
contents = Path(file_path).read_text()
This is the actual read_text implementation:
def read_text(self, encoding=None, errors=None):
    """
    Open the file in text mode, read it, and close the file.
    """
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
        return f.read()

Well, if you need to read the file line by line to work with each line, you can use
with open('Path/to/file', 'r') as f:
    s = f.readline()
    while s:
        # do whatever you want to
        s = f.readline()
Or, even better:
with open('Path/to/file') as f:
    for line in f:
        pass  # do whatever you want to

Instead of retrieving the file content as a single string,
it can be handy to store the content as a list of all lines the file comprises:
with open('Path/to/file', 'r') as content_file:
    content_list = content_file.read().strip().split("\n")
As can be seen, this chains the methods .strip().split("\n") onto the read() call from the main answer in this thread.
Here, .strip() removes whitespace and newline characters from both ends of the entire file string,
and .split("\n") produces the actual list by splitting the file string at every newline character \n.
Moreover, this stores the entire file content in a variable, which might be desired in some cases, instead of looping over the file line by line as pointed out in the previous answer.
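As a side note, str.splitlines() is a closely related alternative. The sketch below compares the two on a made-up string with mixed line endings. When the text comes from a file opened in text mode, Python normally translates "\r\n" to "\n" already, so the difference matters mainly for strings from other sources:

```python
text = "first\nsecond\r\nthird\n"

# Approach from the answer: strip both ends, then split on "\n".
by_split = text.strip().split("\n")

# str.splitlines() recognises "\n", "\r\n" and "\r", and does not
# produce a trailing empty string for a final newline.
by_splitlines = text.splitlines()

print(by_split)
print(by_splitlines)
```

Note that .strip() also removes leading whitespace from the first line and trailing whitespace from the last one, which splitlines() leaves alone.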

Related

Replacing text in a file [duplicate]

Is it possible to parse a file line by line, and edit a line in-place while going through the lines?

It can be simulated using a backup file as stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput

for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print line,  # Python 2 print; this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion, first_file.txt and second_file.txt will contain only the lines that satisfy the some_condition() predicate.
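For Python 3, the same idea could look roughly like the sketch below. The predicate some_condition and the helper filter_in_place are placeholders made up for illustration, and the temporary file stands in for the files given on the command line:

```python
import fileinput
import os
import tempfile


def some_condition(line):
    # Hypothetical predicate: keep lines that contain "keep".
    return "keep" in line


def filter_in_place(paths):
    # With inplace=True, print() output is redirected into the current file.
    with fileinput.input(files=paths, inplace=True, backup=".bak") as lines:
        for line in lines:
            if some_condition(line):
                print(line, end="")  # the line already ends with "\n"


# Throwaway input file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("keep me\ndrop me\nkeep me too\n")

filter_in_place([path])

with open(path) as f:
    result = f.read()
print(result)

# Clean up the file and the backup that fileinput left behind.
os.remove(path)
os.remove(path + ".bak")
```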

The fileinput module has a rather ugly API. I find the in_place module much nicer for this task. An example for Python 3:
import in_place

with in_place.InPlace('data.txt') as file:
    for line in file:
        line = line.replace('test', 'testZ')
        file.write(line)
Main differences from fileinput:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important Notes:
This solution deletes every line in the file if you don't re-write it with the file.write() line.
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.

No. You cannot safely write to a file you are also reading, as any changes you make to the file could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, update any lines as required, and then rewrite the file.
If you're replacing the content in the file byte-for-byte (i.e. the text you are replacing is the same length as the new string you are replacing it with), then you can get away with it, but it's a hornet's nest, so save yourself the hassle and just read the full file, replace the content in memory (or via a temporary file), and write it out again.
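A rough sketch of that read, modify, rewrite approach, using a temporary file that is swapped into place at the end. The helper name replace_in_file is made up for illustration, and the sketch assumes the file is small enough to fit in memory:

```python
import os
import tempfile


def replace_in_file(path, old, new):
    # Read the whole file first so nothing unread can be overwritten.
    with open(path, "r") as src:
        content = src.read()

    # Write the result to a temporary file in the same directory,
    # then swap it into place in one step.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as dst:
        dst.write(content.replace(old, new))
    os.replace(tmp_path, path)


# Demonstration on a throwaway file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("a test line\nanother test\n")

replace_in_file(path, "test", "testZ")

with open(path) as f:
    updated = f.read()
os.remove(path)
```

This sketch does not preserve the original file's permissions or ownership; whether that matters depends on your situation.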

If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
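As an illustration of the mmap route, here is a minimal sketch that lower-cases a whole file in place. It assumes a small, non-empty throwaway file. The key constraint is that the slice assignment must not change the length:

```python
import mmap
import os
import tempfile

# Throwaway file for the demonstration (mmap cannot map an empty file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"Hello World\n")

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:   # 0 means "map the whole file"
        # The replacement has exactly the same length as the original.
        mm[:] = mm[:].lower()
        mm.flush()

with open(path, "rb") as f:
    lowered = f.read()
os.remove(path)
```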

You have to back up by the size of the line in characters. Assuming you used readline, then you can get the length of the line and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR, set offset to -length.
See Python Docs or look at the manpage for seek.
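A minimal sketch of that technique in Python 3, assuming the file is opened in binary mode ('r+b'): text-mode files do not support non-zero seeks relative to the current position. The replacement must have the same length as the original line:

```python
import os
import tempfile

# Throwaway file for the demonstration.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"alpha\nBETA\ngamma\n")

with open(path, "r+b") as f:
    while True:
        line = f.readline()
        if not line:
            break
        if line.isupper():
            f.seek(-len(line), os.SEEK_CUR)  # back up to the start of the line
            f.write(line.lower())            # same length, so nothing shifts
            f.flush()

with open(path, "rb") as f:
    patched = f.read()
os.remove(path)
```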

Is there a more concise way to read csv files in Python?

with open(file, 'rb') as readerfile:
    reader = csv.reader(readerfile)
In the above syntax, can I perform the first and second line together? It seems unnecessary to use 2 variables ('readerfile' and 'reader' above) if I only need to use the latter.
Is the former variable ('readerfile') ever used?
Can I use the same variable name for both, or is that bad form?

You can do:
reader = csv.reader(open(file, 'rb'))
but that would mean you are not closing your file explicitly.

with open(file, 'rb') as readerfile:
The first line opens the file and stores the file object in readerfile. The with statement ensures that the file is closed when you exit the block by any means, including exceptions.
reader = csv.reader(readerfile)
The second line creates a CSV reader object using the file object. It needs the file object (otherwise where would it read the data from?). Of course you could conceivably store it in the same variable
readerfile = csv.reader(readerfile)
if you wanted to (and don't plan on using the file object again), but this will likely lead to confusion for readers of your code.
Note that you haven't read anything yet! You still need to iterate over the reader object in order to get the data that you're interested in, and if you close the file before that happens then the reader object won't work. The file object is used behind the scenes by the reader object, even if you "hide" it by overwriting the readerfile variable.
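The laziness is easy to demonstrate. The sketch below uses Python 3 conventions (text mode with newline='' rather than 'rb') and a throwaway CSV file, and shows that the reader stops working once the with block has closed the file:

```python
import csv
import os
import tempfile

# Throwaway CSV file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as tmp:
    tmp.write("a,b\n1,2\n")

with open(path, newline="") as readerfile:
    reader = csv.reader(readerfile)
    first_row = next(reader)        # works: the file is still open

try:
    next(reader)                    # the file has been closed by now
    error = None
except ValueError as exc:           # "I/O operation on closed file"
    error = exc

os.remove(path)
```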
Lastly, if you really want to do everything on one line, you could conceivably define a function that abstracts the with statement:
def with1(context, func):
    with context as x:
        return func(x)
Now you can write this as one line:
data = with1(open(file, 'rb'), lambda readerfile: list(csv.reader(readerfile)))
It's by no means clearer, however.

This is not recommended at all.
Why is it important to use one line?
Most Python programmers know well the benefits of using the with statement. Keep in mind that readers might be lazy (that is, they read line by line) in some cases. You want to handle the file with the correct statement, ensuring it is closed correctly even if errors arise.
Nevertheless, you can use a one-liner for this, as stated in other answers:
reader = csv.reader(open(file, 'rb'))

So basically you want a one-liner?
reader = csv.reader(open(file, 'rb'))
As said before, the problem with that is that with open() lets you do the following steps in one go:
Open the file
Do what you want with the file (inside your with block)
Close the file (that is implicit and you don't have to specify it)
If you don't use with open but call open directly, your file stays open until the object is garbage collected, and that could lead to unpredictable behaviour in some cases.
Plus, your original code (two lines) is much more readable than a one-liner.

If you put them together, then the file won't be closed automatically -- but that often doesn't really matter, since it will be closed automatically when the script terminates.
It's not common to need to reference the raw file once a csv.reader instance has been created from it (except possibly to explicitly close it if you're not using a with statement).
If you use the same variable name for both, it will probably work because the csv.reader instance will still hold a reference to the file object, so it won't be garbage collected until the program ends. It's not a common idiom, however.
Since csv files are often processed sequentially, the following can be a fairly concise way to do it, since the csv.reader instance frequently doesn't really need to be given a variable name, and it will close the file properly even if an exception occurs:
with open(file, 'rb') as readerfile:
    for row in csv.reader(readerfile):
        pass  # process the data...
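If the goal is mainly to hide the file variable from the calling code, one option is to wrap the pattern in a small generator. This is only a sketch: iter_rows is a made-up helper name, and it uses the Python 3 convention of opening CSV files in text mode with newline=''. The file stays open while rows are being consumed and is closed when the generator finishes:

```python
import csv
import os
import tempfile


def iter_rows(path):
    # Hypothetical helper: yields rows while keeping the file open.
    with open(path, newline="") as readerfile:
        for row in csv.reader(readerfile):
            yield row


# Demonstration on a throwaway CSV file.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as tmp:
    tmp.write("name,qty\napple,3\n")

rows = list(iter_rows(path))
os.remove(path)
```

One caveat: if the caller abandons the generator halfway through, the file is only closed when the generator itself is closed or collected.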

Is there a way to shallow copy an existing file-object?

The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.
Originally I (thought I) had a working implementation using seek() and tell() where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files due to the read-ahead buffer mutilating the offset.
Using readline() or otherwise mocking the real file-object isn't viable as the reason for doing this was the excessively large files prompting a generator expression in the first place. So losing the read-ahead buffer isn't really a good option (as an aside, why was Python implemented this way in the first place? Shouldn't the buffer be like a cache and not actually exposed to the user? Proper encapsulation should have prevented this tell() issue in the first place...)
I then tried to use copy.copy, but that results in something like this: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>. Which appears unusable.
Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?

You are looking for itertools.tee.
from itertools import tee
with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)
Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.
fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]
or, if you already have a file object fh:
fh1, fh2, fh3 = [open(fh.name) for i in range(3)]
This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:
for x in fh1, fh2, fh3:
    x.seek(fh.tell())
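To illustrate how the tee iterators stay independent, here is a small sketch. An io.StringIO object stands in for a real file so the example is self-contained:

```python
import io
from itertools import tee

# io.StringIO stands in for a real file object here.
source = io.StringIO("one\ntwo\nthree\n")
lines_a, lines_b = tee(source, 2)

first_from_a = next(lines_a)    # advances only lines_a
all_from_b = list(lines_b)      # lines_b still starts at the beginning
rest_of_a = list(lines_a)       # lines_a continues where it left off
```

Keep in mind that tee buffers every item that one iterator has consumed and another has not, so with very large files the memory use can grow if the iterators drift far apart. Opening the file several times avoids that.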

opening & closing file without file object in python

Opening and closing a file using a file object:
fp=open("ram.txt","w")
fp.close()
If we want to open and close a file without using a file object, i.e.
open("ram.txt","w")
do we need to write close("poem.txt"), or is writing close() fine?
Neither of them gives any error...
If we only write close(), how would it know which file we are referring to?

For every object in memory, CPython keeps a reference count. Once there are no more references to an object, it will be garbage collected.
The open() function returns a file object.
f = open("myfile.txt", "w")
And in the line above, you keep a reference to the object around in the variable f, and therefore the file object keeps existing. If you do
del f
Then the file object has no references anymore, and will be cleaned up. It'll be closed in the process, but that can take a little while which is why it's better to use the with construct.
However, if you just do:
open("myfile.txt")
Then the file object is created and immediately discarded again, because there are no references to it. It's gone, and closed. You can't close it anymore, because you can't say what exactly you want to close.
open("myfile.txt", "r").readlines()
To evaluate this whole expression, first open is called, which returns a file object, and then the method readlines is called on that. Then the result of that is returned. As there are now no references to the file object, it is immediately discarded again.

I would use with open(...), if I understand the question correctly.
This answer might help you: What is the python keyword "with" used for?

In answer to your actual question... a file object (what you get back when you call open) has the reference to the file in it. So when you do something like:
fp = open(myfile, 'w')
fp.write(...)
fp.close()
Everything in the above, including both write and close, knows it refers to myfile because that's the file that fp is associated with. Calling fp.close(myfile) raises a TypeError, since close() takes no arguments; it certainly doesn't need the filename once the file is open.
Better constructions like
with open(myfile,'w') as fp:
    fp.write(...)
don't change this; in this case, fp is also a context manager, but still contains the pointer to myfile; there's no need to remind it.
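A short sketch that makes this concrete, using a throwaway temporary file: the file object remembers which file it belongs to, and close() takes no arguments at all.

```python
import os
import tempfile

# Throwaway file for the demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)

fp = open(path, "w")
name_known_to_fp = fp.name      # the file object remembers its path

try:
    fp.close(path)              # close() accepts no arguments
    error = None
except TypeError as exc:
    error = exc

fp.close()                      # this is all that is needed
os.remove(path)
```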

Which method should I use for accessing files and why?

I'm using Python and there are a lot of ways to access files.
Method 1:
fp = open("hello.txt", "w")
fp.write("No no no");
fp.close()
fp = open("hello.txt", "r")
print fp.read()
fp.close()
Method 2:
open("hello.txt", "w").write("hello world!")
print open("hello.txt", "r").read()
Method 3:
with open("hello.txt","w") as f:
    f.write("Yes yes yes")
with open("hello.txt") as f:
    print f.read()
Is there a specific advantage in using each of these?
Stuff I know:
Method 2 and Method 3 close the file automatically, but
Method 1 doesn't.
Method 2 doesn't give you a handle to do multiple operations.

You should use the third method.
There is a common pattern in programming where, to use some object, you have to set it up, run your code, and tear it down again. File handles are one example of this: you have to open the file, run your code, and then close the file. This last step is not optional -- it's important for the operating system to know that you are done with it, and for Python to flush all the data out of its IO buffers.
Now, CPython is a reference-counted implementation of Python. That means that it counts how many pieces of code 'know about' a given object, so that when that count becomes zero it can clean up said object and reuse its space in memory. In method 2, the reference count of the file object becomes zero, which allows Python to clean it up. And file objects' cleanup method also closes them. However, you should in general not rely on this -- reference counting is an implementation detail of the standard version of Python, and there's no guarantee that whatever you're using to run the program will do the same. That's why you shouldn't use method 2.
Method 1 is better, because you explicitly close the file -- as long as you reach the .close() function call! If an exception was thrown in the middle of that code block, the close would not be reached, and the file would not be explicitly closed. So you should really wrap the middle code in a try... finally block.
Method 3 is therefore best: you use the with statement -- an idiomatic way of enclosing the .close in a finally block -- to close the file, so you don't have to worry about the extra syntactic fluff of try... finally.
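For comparison, here are the two forms side by side in a small sketch, using a throwaway temporary file instead of hello.txt. For file objects, the explicit version behaves roughly like what the with statement does for you:

```python
import os
import tempfile

# Throwaway file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Method 1 made safe: explicit close in a finally block.
f = open(path, "w")
try:
    f.write("Yes yes yes")
finally:
    f.close()                   # runs whether or not write() raised
closed_by_finally = f.closed

# Method 3: the with statement handles the closing.
with open(path) as f2:
    text = f2.read()
closed_by_with = f2.closed

os.remove(path)
```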

I'd use this, kind of extended version of method 3:
with open("hello.txt","w+") as f:
    f.write("Yes yes yes")
    f.seek(0)  # places the cursor back at the start of the file
    print f.read()  # now read the file
Advantages:
It opens the file only once
w+ mode allows both read and write on the same file object
with takes care of the closing of file

I would think it is best to go with method 1, as it is explicit and you can surround it with a try and except block, or else with method 3.
