Is there a way to shallow copy an existing file-object? - python

The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.
Originally I (thought I) had a working implementation using seek() and tell(), where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files due to the read-ahead buffer mutilating the offset.
Using readline() or otherwise mocking the real file-object isn't viable, since the excessively large files were the reason for using a generator expression in the first place, so losing the read-ahead buffer isn't really a good option. (As an aside, why was Python implemented this way in the first place? Shouldn't the buffer be like a cache and not actually exposed to the user? Proper encapsulation should have prevented this tell() issue in the first place...)
I then tried to use copy.copy, but that results in something like this: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>, which appears unusable.
Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?

You are looking for itertools.tee.
from itertools import tee
with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)
Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
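For instance, a minimal sketch of the tee'd iterators being consumed independently (the file name is just a placeholder):

from itertools import tee

with open("somefile.txt", "r") as fh:
    fh1, fh2, fh3 = tee(fh, 3)
    first = next(fh1)   # advances only fh1
    again = next(fh2)   # fh2 still starts at the first line
    assert first == again
    # Note: tee buffers whatever the fastest copy has read until the
    # slower copies catch up, so wildly diverging copies cost memory.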
For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.
fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]
or, if you already have a file object fh:
fh1, fh2, fh3 = [open(fh.name) for i in range(3)]
This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:
for x in fh1, fh2, fh3:
    x.seek(fh.tell())
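Putting those two steps together, here is a rough sketch of a helper (the name clone_handles is my own, not a standard API; and note the question's caveat that tell() can be off if fh is mid-iteration with a read-ahead buffer):

def clone_handles(fh, n):
    # Open n independent handles on the same file, each positioned where fh is now.
    clones = [open(fh.name) for _ in range(n)]
    for clone in clones:
        clone.seek(fh.tell())
    return clones

fh1, fh2, fh3 = clone_handles(fh, 3)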

Related

Is there a more concise way to read csv files in Python?

with open(file, 'rb') as readerfile:
    reader = csv.reader(readerfile)
In the above syntax, can I perform the first and second line together? It seems unnecessary to use 2 variables ('readerfile' and 'reader' above) if I only need to use the latter.
Is the former variable ('readerfile') ever used?
Can I use the same variable name for both, or is that bad form?
You can do:
reader = csv.reader(open(file, 'rb'))
but that would mean you are not closing your file explicitly.
with open(file, 'rb') as readerfile:
The first line opens the file and stores the file object in readerfile. The with statement ensures that the file is closed when you exit the block by any means, including exceptions.
reader = csv.reader(readerfile)
The second line creates a CSV reader object using the file object. It needs the file object (otherwise where would it read the data from?). Of course you could conceivably store it in the same variable
readerfile = csv.reader(readerfile)
if you wanted to (and don't plan on using the file object again), but this will likely lead to confusion for readers of your code.
Note that you haven't read anything yet! You still need to iterate over the reader object in order to get the data that you're interested in, and if you close the file before that happens then the reader object won't work. The file object is used behind the scenes by the reader object, even if you "hide" it by overwriting the readerfile variable.
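As a quick illustration of that point, a hedged sketch of what happens if the file is closed before the reader is consumed (the file name is an assumption):

import csv

with open('data.csv', 'rb') as readerfile:
    reader = csv.reader(readerfile)

# The with block has ended, so the underlying file is closed;
# the reader can no longer fetch data from it.
next(reader)  # raises ValueError: I/O operation on closed file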
Lastly, if you really want to do everything on one line, you could conceivably define a function that abstracts the with statement:
def with1(context, func):
    with context as x:
        return func(x)
Now you can write this as one line:
data = with1(open(file, 'rb'), lambda readerfile: list(csv.reader(readerfile)))
It's by no means clearer, however.
This is not recommended at all
Why is it important to use one line?
Most Python programmers know well the benefits of using the with statement. Keep in mind that readers might be lazy (that is, read line by line) in some cases. You want to be able to handle the file with the correct statement, ensuring it is closed correctly even if errors arise.
Nevertheless, you can use a one liner for this, as stated in other answers:
reader = csv.reader(open(file, 'rb'))
So basically you want a one-liner?
reader = csv.reader(open(file, 'rb'))
As said before, the problem with that is that with open() lets you do the following steps in one go:
Open the file
Do what you want with the file (inside your open block)
Close the file (that is implicit and you don't have to specify it)
If you don't use with open but open directly, your file stays open until the object is garbage collected, and that could lead to unpredictable behaviour in some cases.
Plus, your original code (two lines) is much more readable than a one-liner.
If you put them together, then the file won't be closed automatically -- but that often doesn't really matter, since it will be closed automatically when the script terminates.
It's not common to need to reference the raw file once a csv.reader instance has been created from it (except possibly to explicitly close it if you're not using a with statement).
If you use the same variable name for both, it will probably work because the csv.reader instance will still hold a reference to the file object, so it won't be garbage collected until the program ends. It's not a common idiom, however.
Since csv files are often processed sequentially, the following can be a fairly concise way to do it, since the csv.reader instance frequently doesn't really need its own variable name and the file will be closed properly even if an exception occurs:
with open(file, 'rb') as readerfile:
    for row in csv.reader(readerfile):
        # process the data...

Python Generator memory benefits for large read-ins?

I'm wondering about the memory benefits of python generators in this use case (if any). I wish to read in a large text file that must be shared between all objects. Because it only needs to be used once and the program finishes once the list is exhausted I was planning on using generators.
The "saved state" of a generator I believe lets it keep track of what is the next value to be passed to whatever object is calling it. I've read that generators also save memory usage by not returning all the values at once, but rather calculating them on the fly. I'm a little confused if I'd get any benefit in this use case though.
Example Code:
def bufferedFetch():
    while True:
        buffer = open("bigfile.txt","r").read().split('\n')
        for i in buffer:
            yield i
Considering that the buffer is going to be reading in the entire "bigfile.txt" anyway, wouldn't this be stored within the generator, for no memory benefit? Is there a better way to return the next value of a list that can be shared between all objects?
Thanks.
In this case no. You are reading the entire file into memory by doing .read().
What you ideally want to do instead is:
def bufferedFetch():
with open("bigfile.txt","r") as f:
for line in f:
yield line
The Python file object takes care of line endings for you (system dependent), and its built-in iterator will yield lines by simply iterating over it one line at a time (not reading the entire file into memory).
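To show how it can be shared, a small hedged sketch of two consumers pulling from the same generator (the take helper is hypothetical):

# Assuming the bufferedFetch generator defined above:
shared = bufferedFetch()

def take(gen, count):
    # hypothetical helper: pull `count` lines from the shared generator
    return [next(gen) for _ in range(count)]

first_batch = take(shared, 10)
second_batch = take(shared, 10)  # continues where the first batch left off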

Delete line after it has been read from file in Python

I have a function that reads lines from a file and processes them. However, I want to delete every line that I have read, but without using readlines(), which reads all of the lines at once and stores them in a list.
If the problem is that you run out of memory, then I suggest you use the for line in file syntax, as this will only load the lines one at a time:
bigFile = open('path/to/file.dat','r')
for line in bigFile:
    processLine(line)
If you can construct your system so that it can process the file line-by-line, then it won't run out of memory trying to read the whole file. The program will discard the copy it has made of the file contents when it moves onto the next line.
Why does this work when readlines doesn't?
In Python there are iterators, which provide an interface to supply one item of a collection at a time, iterating over the whole collection if .next() is called repeatedly. Because you rarely need the whole collection at once, this can allow the program to work with a single item in memory instead, and thus allow large files to be processed.
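A minimal sketch of that protocol applied to a file object (the path is a placeholder):

f = open('path/to/file.dat', 'r')
first = next(f)    # reads and returns only the first line
second = next(f)   # now the second; later lines have not been handed to your code yet
f.close()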
By contrast, the readlines function has to return a whole list, rather than an iterator object, so it cannot delay the processing of later lines like an iterator could. Since Python 2.3, the old xreadlines read iterator was deprecated in favour of using for line in file, because the file object returned by open had been changed to return an iterator rather than a list.
This follows the functional paradigm called 'lazy evaluation', where you avoid doing any actual processing unless and until the result is needed.
More iterators
Iterators can be chained together (process the lines of this file, then that one), or otherwise combined using the excellent itertools module (included in Python). These are very powerful, and can allow you to separate out the way you combine files or inputs from the code that processes them.
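For example, a hedged sketch of chaining the lines of two files with itertools.chain (the file names are placeholders, and processLine stands in for your per-line function from the example above):

from itertools import chain

with open('first.log') as a, open('second.log') as b:
    for line in chain(a, b):    # all lines of first.log, then second.log, lazily
        processLine(line)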
First of all, deleting the first line of a file is a costly process. Actually, you are unlikely to be able to do it without rewriting most of the file.
You have multiple approaches that could solve your issue:
1. In Python, file objects provide an iterator over their lines; maybe you can use this to solve your memory issues:
document_count = 0
with open(filename) as handler:
    for index, line in enumerate(handler):
        if line.rstrip('\n') == '.':  # lines from iteration keep their trailing newline
            document_count += 1
2. Use an index. Reserve a certain part of your file for the index (fixed size; make sure to reserve enough space -- say the first 100 KB of your file, that's about 100K entries), or even use a separate index file. Every time you add a document, put its starting position in the index. Once you know a document's position, just use the seek function to get there and start reading.
3. Read the file once and store every document position. This is very similar to the previous idea, except the index lives in memory, which is better performance-wise, but you will have to repeat the process every time you run the application (no persistence). A rough sketch of this approach is shown below.
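Here is a rough sketch of idea 3: record where each document starts with tell(), then jump back with seek(). It assumes, as in the snippet above, that documents are separated by lines containing a single '.':

positions = [0]                              # first document starts at offset 0
with open(filename) as handler:
    line = handler.readline()
    while line:
        if line.rstrip('\n') == '.':         # separator line
            positions.append(handler.tell()) # next document starts right here
        line = handler.readline()

with open(filename) as handler:
    handler.seek(positions[2])               # jump straight to the third document
    first_line_of_doc = handler.readline()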

Which method should I use for accessing files and why?

I am in Python and there are a lot of ways to access files.
Method 1:
fp = open("hello.txt", "w")
fp.write("No no no");
fp.close()
fp = open("hello.txt", "r")
print fp.read()
fp.close()
Method 2:
open("hello.txt", "w").write("hello world!")
print open("hello.txt", "r").read()
Method 3:
with open("hello.txt","w") as f:
f.write("Yes yes yes")
with open("hello.txt") as f:
print f.read()
Is there a specific advantage in using each of these?
Stuff I know:
Methods 2 and 3 close the file automatically, but Method 1 doesn't.
Method 2 doesn't give you a handle to do multiple operations.
You should use the third method.
There is a common pattern in programming where, to use some object, you have to set it up, run your code, and tear it down again. File handles are one example of this: you have to open the file, run your code, and then close the file. This last step is not optional -- it's important for the operating system to know that you are done with it, and for Python to flush all the data out of its IO buffers.
Now, CPython is a reference counted language. That means that it counts how many pieces of code 'know about' a given object, so that when that count becomes zero it can clean up said object and reuse its space in memory. In method 2, the reference count of the file object becomes zero, which allows Python to clean it up. And file objects' cleanup method also closes them. However, you should in general not rely on this -- reference counting is an implementation detail of the standard version of Python, and there's no guarantee that whatever you're using to run the program will do the same. That's why you shouldn't use method 2.
Method 1 is better, because you explicitly close the file -- as long as you reach the .close() function call! If an exception was thrown in the middle of that code block, the close would not be reached, and the file would not be explicitly closed. So you should really wrap the middle code in a try... finally block.
Method 3 is therefore best: you use the with statement -- an idiomatic way of enclosing the .close in a finally block -- to close the file, so you don't have to worry about the extra syntactic fluff of try... finally.
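Roughly speaking, the with statement in method 3 expands to something like this try... finally pattern:

f = open("hello.txt", "w")
try:
    f.write("Yes yes yes")
finally:
    f.close()   # runs even if write() raised an exception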
I'd use this kind of extended version of method 3:
with open("hello.txt","w+") as f:
f.write("Yes yes yes")
f.seek(0) #places the cursor back to the start of the file
print f.read() #now read the file
Advantages:
It opens the file only once
w+ mode allows both read and write on the same file object
with takes care of closing the file
I would think it is best to go with method 1, as it is explicit and you can surround it with a try/finally block, or with method 3.

Does reading an entire file leave the file handle open?

If you read an entire file with content = open('Path/to/file', 'r').read(), is the file handle left open until the script exits? Is there a more concise method to read a whole file?
The answer to that question depends somewhat on the particular Python implementation.
To understand what this is all about, pay particular attention to the actual file object. In your code, that object is mentioned only once, in an expression, and becomes inaccessible immediately after the read() call returns.
This means that the file object is garbage. The only remaining question is "When will the garbage collector collect the file object?".
In CPython, which uses a reference counter, this kind of garbage is noticed immediately, and so it will be collected immediately. This is not generally true of other Python implementations.
A better solution, to make sure that the file is closed, is this pattern:
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
which will always close the file immediately after the block ends, even if an exception occurs.
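A quick way to convince yourself of that: the file object reports itself as closed as soon as the block is left.

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

assert content_file.closed   # True: the handle was closed when the block ended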
Edit: To put a finer point on it:
Other than file.__exit__(), which is "automatically" called in a with context manager setting, the only other way that file.close() is automatically called (that is, other than explicitly calling it yourself) is via file.__del__(). This leads us to the question of when __del__() gets called.
A correctly-written program cannot assume that finalizers will ever run at any point prior to program termination.
-- https://devblogs.microsoft.com/oldnewthing/20100809-00/?p=13203
In particular:
Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.
[...]
CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references.
-- https://docs.python.org/3.5/reference/datamodel.html#objects-values-and-types
(Emphasis mine)
But as it suggests, other implementations may have other behavior. As an example, PyPy has 6 different garbage collection implementations!
You can use pathlib.
For Python 3.5 and above:
from pathlib import Path
contents = Path(file_path).read_text()
For older versions of Python use pathlib2:
$ pip install pathlib2
Then:
from pathlib2 import Path
contents = Path(file_path).read_text()
This is the actual read_text implementation:
def read_text(self, encoding=None, errors=None):
"""
Open the file in text mode, read it, and close the file.
"""
with self.open(mode='r', encoding=encoding, errors=errors) as f:
return f.read()
Well, if you have to read the file line by line to work with each line, you can use
with open('Path/to/file', 'r') as f:
    s = f.readline()
    while s:
        # do whatever you want to
        s = f.readline()
Or, an even better way:
with open('Path/to/file') as f:
    for line in f:
        # do whatever you want to
Instead of retrieving the file content as a single string, it can be handy to store the content as a list of all the lines the file comprises:
with open('Path/to/file', 'r') as content_file:
    content_list = content_file.read().strip().split("\n")
As can be seen, one needs to add the concatenated methods .strip().split("\n") to the main answer in this thread.
Here, .strip() just removes whitespace and newline characters at the beginning and end of the entire file string,
and .split("\n") produces the actual list via splitting the entire file string at every newline character \n.
Moreover, this way the entire file content can be stored in a variable, which might be desired in some cases, instead of looping over the file line by line as pointed out in this previous answer.
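For instance, a small sketch of what that produces (the file contents here are just an assumption):

# Suppose 'Path/to/file' contains three lines: "first", "second", "third".
with open('Path/to/file', 'r') as content_file:
    content_list = content_file.read().strip().split("\n")

# content_list == ['first', 'second', 'third']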
