I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (e.g. the first line). Also, is there a way to write a line at a specific location (e.g. change the first line in the file, write a couple of other lines, and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on the responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project, and changing the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, you can use f.seek(N) to position the file pointer at the Nth byte (where N is LINESIZE * line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
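For instance, a minimal sketch of that fixed-record approach, where LINESIZE, the file name, and the helper function are assumptions for illustration:

# Assumes every record is exactly LINESIZE bytes, newline included.
LINESIZE = 32

def read_record(f, line_number):
    f.seek(LINESIZE * line_number)   # jump straight to the record
    return f.read(LINESIZE)

with open('records.dat', 'rb') as f:
    third = read_record(f, 2)        # records are 0-indexed here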
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
Unless your text file is really big, you can always store each line in a list:
with open('textfile', 'r') as f:
    lines = [line.rstrip('\n') for line in f]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
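For example (the values are purely illustrative):

lines[0] = 'new first line'    # change the first line
lines.append('one more line')  # add an entry at the end
del lines[2]                   # remove the third entry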
At the end of the day, you can write the list back to your text file:
with open('textfile', 'w') as f:
    f.write('\n'.join(lines))
Here's a little test which works for me on OS X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
# First, I get the length of the first line -- if you already know it, skip this block
f = open('test.dat', 'r')
l = f.readline()
linelen = len(l) - 1   # don't count the trailing newline
f.close()

# apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f = open('test.dat', 'r+')
f.seek(0)
f.write('a' * linelen + '\n')   # 'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping around within files in an optimized fashion is a task for high-performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep/modify/generate it at runtime in memory within appropriate containers (a list or dict, for instance) and then write it out all at once (on every change, or only when your program exits). You could also consider looking at simple databases. There are also nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
with open("gamedata.dat") as f:
s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. At this point, however, you could also have deleted a user, changed Rudolf's score, or changed Peter's name. In any case, after the modification you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.
Related
I have a Python script that is supposed to read a file. The issue is that the file is very large, so for efficiency I decided that my script should only read from line 650000 onward, since the previous lines do not contain relevant information.
Is there any way to modify only lines 650000 through the end of the file, so that, for example, if I read() the file only those specific lines would appear?
Files are not line-oriented; they are blocks of bytes.
There's no way, short of reading the data in, to figure out how many bytes make up those first 650,000 lines, so you'd have to do that just in order to skip them.
Starting to modify a file at a certain offset is possible, but that offset will be in bytes, which is the addressing unit used by files.
Skipping lines can be done easily enough:
with open("myfile.txt", "w+t") as f:
for i in xrange(650000):
f.readline() # Read a line and throw it away
f.write("hello")
The truncate() call cuts the file off so that there is no data after the hello (but the 650,000 skipped lines are still in front of it, of course).
I need to read 4 specific lines of a file in Python. I don't want to read the whole file and then pull four lines out of it (for the sake of memory). Does anyone know how to do that?
Thanks!
P.S. I used the following code, but apparently it reads the whole file and then takes 4 lines out of it.
a=open("file", "r")
b=a.readlines() [c:d]
You have to read at least up to the lines you are interested in. You can use itertools.islice to grab a slice:
interesting_lines = list(itertools.islice(a, c, d))
but it still has to read the file up to those lines.
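Put together as a self-contained snippet (the file name and the c/d bounds are the question's placeholders):

import itertools

c, d = 10, 14   # e.g. lines 10-13; placeholder bounds from the question
with open("file", "r") as a:
    interesting_lines = list(itertools.islice(a, c, d))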
Files, at least on Macs and Windows and Linux and other UNIXy systems, are just streams of bytes; there's no concept of "line" in the file structure, just bytes that happen to represent newline characters. So the only way to find the Nth line in the file is to start at the beginning and read until you've found (N-1) newlines. You don't have to store all the content you scan through, but you do have to read it.
Then you have to read and store from that point until you find 4 more newlines.
You can do this in Python, but it's not clear to me that it's a win compared to using the straightforward approach that reads more than it needs to; feels like premature optimization to me.
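Still, for completeness, a minimal sketch of that scan-and-collect approach (the path and 0-indexed start line are assumptions):

def read_four_lines(path, start):
    """Scan forward to line `start` (0-indexed), then collect 4 lines."""
    wanted = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start:
                wanted.append(line)
                if len(wanted) == 4:
                    break
    return wanted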
I know there are many questions regarding editing lines of a file, but my problem is quite specific and in two days I couldn't find a question/answer here which hits it.
The Problem
How do I replace a few (contiguous) characters s1 of one specific line in a file with another few characters s2, meeting the following conditions?
1. The line number is always the same (number 5).
2. The part of the line in front of s1 is always the same (and therefore has constant length 18).
3. The part of the line in front of s1 won't occur anywhere else in the file.
4. s1 and s2 are both not constant and can even have different lengths.
5. s1 and s2 may both occur anywhere else in the file.
6. The file can be very long, so I don't want to load the whole file into memory.
7. For the same reason as 6, I want to avoid copying the file contents into a new file. I'm just changing a few characters, so rewriting the whole file would be a lot of overhead, wouldn't it?
8. I'm using Python 3.X.
Most similar approaches I found so far fail either condition 6 or 7. I found this (opening the file with r+ and performing a write(s2) right where s1 starts), but it doesn't work for me because of condition 4. Is it even possible in Python to achieve what I want, or do I have to copy the file somehow and modify the line along the way after all?
The Background
I have a text file consisting of a few lines of metadata followed by a potentially large number of datasets. The metadata contains a line saying No. of patterns : n, where n is the number of datasets in the file. Among other things, my script should be able to append additional datasets to an existing file by appending the sets themselves and updating n.
The design of the file my script generates/extends is not my invention, so I mustn't change it. The file will serve as input for another application that is also not mine: the JavaNNS.
The answer you linked states
you can only extend and truncate a file at the end, not at the head
With this limitation, Python just mirrors the restrictions imposed by the data storage abstraction we call a 'file system'. All programs, no matter the programming language, are bound by this when using the file system; some just hide the fact from the user by rewriting complete files in the background.
If the size of the file makes such updates a performance problem, then that's really a shortcoming of this crude file format, even though you aren't the one to blame for it: the format simply isn't suited for in-place updates that change the number of patterns.
How to avoid (re)writing large amounts of data
Pipes
If the program which will consume the updated file (JavaNNS) accepts the file contents on standard input, consider keeping the metadata and the patterns in separate files. That way, you can append to the patterns file and only have to rewrite the (hopefully small) metadata file. Then just pipe both files into JavaNNS in a single call:
cat metadata.txt patterns.txt | JavaNNS
If JavaNNS does not accept the required file content on standard input but insists on opening the file itself, you can probably still use a named pipe and pass that as the file to open. (This might not work if JavaNNS does random access on the file instead of just reading it front to back.)
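A sketch of the named-pipe idea in Python, assuming JavaNNS takes the file path as an argument (the invocation and the paths are hypothetical):

import os
import shutil
import subprocess

fifo = "/tmp/nn_input"
os.mkfifo(fifo)                            # create the named pipe
# hypothetical invocation: start the reader first, since opening a FIFO
# for writing blocks until a reader attaches
proc = subprocess.Popen(["JavaNNS", fifo])
with open(fifo, "w") as pipe:
    for part in ("metadata.txt", "patterns.txt"):
        with open(part) as f:
            shutil.copyfileobj(f, pipe)    # stream both files through the pipe
proc.wait()
os.unlink(fifo)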
Padding
If you'll be appending to the file several times and the file format is flexible enough to allow some padding, just pad the field holding n so there is room for a larger number of digits in future writes. That way, you only have to rewrite the file completely when the padding turns out not to be large enough.
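A minimal sketch of that idea, assuming the count field is padded to a fixed width (the width, the byte offset of the header line, and the helper name are made up for illustration):

FIELD_WIDTH = 22   # room for up to 22 digits of n

def update_pattern_count(path, line_offset, n):
    # line_offset: byte position where the "No. of patterns : " line starts
    with open(path, 'r+b') as f:
        f.seek(line_offset + 18)   # skip the constant 18-byte prefix
        f.write('{:<{w}d}'.format(n, w=FIELD_WIDTH).encode('ascii'))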
You can't edit in place and just change s1 for s2 as they can be different lengths. You will need to write out the rest of the file, and this will be safer with a replacement file.
If s1 and s2 were guaranteed to be the same length, you could do it in place, e.g. with the value padded to the maximum size of s1/s2:
with open('<file>', 'r+') as f:
for line_no, line in enumerate(f):
if line_no == 5: # read 5 lines
f.seek(18, 1) # jump forward 18 characters
f.write("{: 8d}".format(s2)) # overwrite with padded s2 (int)
break
With different lengths you will need a different file:
with open('<file>', 'r') as r:
with open('<file-new>', 'w+') as w:
for line_no, line in enumerate(r):
if line_no == 5:
w.write(line[:18] + str(s2) + line[18+len(s1):])
else:
w.write(line)
I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
Right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
    f.read()   # just skipping the first 1M rows
for line in f:
    do_something(line)
Is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
import gzip
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)   # have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try benchmarking pandas' parsing, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
Also, to be extra clear, when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing it the file object f and a starting point; it will still advance the iterator, but more efficiently than calling next 1000000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
I'm not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls do nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).
Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so it can be useful for skipping a line or two, or a header.
The consume recipe that #wwii suggested is also worth checking out.
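For reference, the consume recipe, roughly as it appears in the itertools documentation:

import collections
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead; if n is None, consume it entirely."""
    if n is None:
        collections.deque(iterator, maxlen=0)   # exhaust the iterator at C speed
    else:
        next(islice(iterator, n, n), None)      # advance to the nth item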
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte counting the newline characters.
The only alternative that comes to my mind is handling a static file that won't change. In that case, you can index it once, i.e. find out and remember the position of each line. If you keep that in e.g. a dictionary that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
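A minimal sketch of that indexing idea (the file names are placeholders):

import pickle

def build_line_index(path, index_path='line_index.pkl'):
    offsets = []                      # offsets[n] = byte position of line n
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    with open(index_path, 'wb') as idx:
        pickle.dump(offsets, idx)     # save the index for later runs

def read_line(path, offsets, n):
    with open(path, 'rb') as f:
        f.seek(offsets[n])            # jump straight to line n
        return f.readline()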
It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm and so it must always be uncompressed from the start until where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can be scanned forward, or scanned backwards from the end of the file, in successive chunks.
You should consider a different storage format for your needs. What are your needs?
I am using Python to make a template updater for HTML. I read a line and compare it with the template file to see if there are any changes that need to be updated. Then I want to write any changes (if there are any) back to the same line I just read from.
Reading the file, my file pointer is now positioned on the next line after a readline(). Is there any way I can write back to the same line without having to open two file handles for reading and writing?
Here is a code snippet of what I want to do:
cLine = fp.readline()
if cLine != templateLine:
    # Here is where I would like to write back to the line I read from
    # in cLine
Updating lines in place in a text file - very difficult
Many questions on SO try to read a file and update it at once.
While this is technically possible, it is very difficult.
(Text) files are not organized on disk by lines, but by bytes.
The problem is that the number of bytes in the old line very often differs from the new one, and this messes up the resulting file.
Update by creating a new file
While it sounds inefficient, it is the most effective way from a programming point of view.
Just read from the file on one side and write to another file on the other side, then close both files and copy the new content over the old one.
Or build the new content in memory and write it over the old file after you have closed it.
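A minimal sketch of the temporary-file pattern (the file name and the replacement rule are placeholders):

import os
import tempfile

with open('data.txt') as src, \
     tempfile.NamedTemporaryFile('w', dir='.', delete=False) as dst:
    for line_no, line in enumerate(src):
        # placeholder rule: swap out the first line, keep everything else
        dst.write('new first line\n' if line_no == 0 else line)
os.replace(dst.name, 'data.txt')   # atomically replace the original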
At the OS level, things are a bit different from how they look from Python: from Python, a file looks almost like a list of strings, each of arbitrary length, so it seems easy to swap one line for something else without affecting the rest of the lines:
l = ["Hello", "world"]
l[0] = "Good bye"
In reality, though, any file is just a stream of bytes, with strings following each other without any "padding". So you can only overwrite data in place if the resulting string has exactly the same length as the source string; otherwise it will simply overwrite the following lines.
If that is the case (your processing guarantees not to change the length of strings), you can "rewind" the file to the start of the line and overwrite it with the new data. The script below converts all lines in a file to uppercase in place:
def eof(f):
    """Return True if the file position is at (or past) the end of the file."""
    cur_loc = f.tell()
    f.seek(0, 2)           # jump to the end of the file
    eof_loc = f.tell()
    f.seek(cur_loc, 0)     # jump back to where we were
    return cur_loc >= eof_loc

with open('testfile.txt', 'r+t') as fp:
    while True:
        last_pos = fp.tell()
        line = fp.readline()
        new_line = line.upper()   # same length as the original line (for ASCII)
        fp.seek(last_pos)         # rewind to the start of the line...
        fp.write(new_line)        # ...and overwrite it in place
        print("Read %s, Wrote %s" % (line, new_line))
        if eof(fp):
            break
Somewhat related: Undo a Python file readline() operation so file pointer is back in original state
This approach is only justified when your output lines are guaranteed to have the same length, and when, say, the file you're working with is really huge so you have to modify it in place.
In all other cases it would be much easier and more performant to just build the output in memory and write it back at once. Another option is to write to a temporary file, then delete the original and rename the temporary file so it replaces the original file.