I'm having some trouble dealing with large text files (about 1 GB) when I want to read them and use them in while loops.
More specifically: first I do some parsing on the lines of the file, in order to find e.g. all lines that start with "x". In doing so, I add the indices of the found lines to a list (say l). This is the pre-processing part.
Now in a while loop, I choose random indices from l and want to read the corresponding line (or, say, the 5 lines around it). Thus I need to keep the file available throughout the while loop, as a priori I do not know which lines I will end up reading (they are picked randomly from l).
The problem is, when I open the file before my main loop, the reading gets done successfully during the first run of the loop, but from the second run onward, the file seems to have vanished from memory. What I have tried:
The preprocess part:
l = []
with open(filename) as f:
    for i, line in enumerate(f):
        prep = ''.join(c for c in line if c.isalnum() or c.isspace())
        if 'x' in prep:
            l.append(i)
Now I have my l list. Loading the file in memory before the main loop:
with open(filename, 'r') as f:
    while (some condition):
        random_index = random.sample(range(0, len(l)), 1)[0]
        output_file = open("out", "w")  # I will write here the read line(s)
        for i, line in enumerate(f):
            # the lines to be read, starting from the given random index
            if (i >= l[random_index]) and (i < l[random_index + 1]):
                output_file.write(line)
        output_file.close()
Only during the first run of the loop, things work properly.
Alternatively I also tried:
f = open(filename)
while (some condition):
    random_index = ...  # rest is same as above
Same issue; only the first run works. One thing that worked was putting the f = open(filename) inside the loop, so the file is reopened on every run. But since it is a large file, this is really not a practical solution.
What am I doing wrong here?
How should such readings be done properly?
What am I doing wrong here?
This answer addresses the same problem: you can't read a file twice.
You open the file f outside of the while loop and read it completely by calling for i, line in enumerate(f): during the first iteration of the while loop. During the second iteration you can't read it again, since it has already been read.
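A quick way to see this behaviour is counting lines twice over the same handle (a minimal demo, reusing the question's filename):
f = open(filename)
print(sum(1 for _ in f))  # first pass: prints the real line count
print(sum(1 for _ in f))  # second pass: prints 0, the cursor is already at EOF
f.close()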
How should such readings be done properly?
As noted in the linked answer:
To answer your question directly, once a file has been read with read() you can use seek(0) to return the read cursor to the start of the file (docs are here).
That means that to solve your problem, you can add f.seek(0) at the end of the while loop to move the pointer back to the start of the file after each iteration. Doing this, you can reread the file from the start again.
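A minimal sketch of the fixed loop, reusing the question's placeholder names (some_condition stands for the question's loop condition, and l is the precomputed index list):
import random

with open(filename, 'r') as f:
    while some_condition:
        # pick an index such that l[random_index + 1] also exists
        random_index = random.randrange(len(l) - 1)
        with open("out", "w") as output_file:
            for i, line in enumerate(f):
                if l[random_index] <= i < l[random_index + 1]:
                    output_file.write(line)
        f.seek(0)  # rewind so the next iteration can read the file again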
The Task
I am writing a Python program that runs SAP2000 by importing a new .s2k file into it on each run; a new file is then generated from the results of the previous run by exporting the data.
The file is about 1,500 lines containing arbitrary words and numbers. (For a better understanding, see this: http://pastebin.com/8ptYacJz, which is the file I am dealing with.)
I'm required to replace one number in the file.
That number is somewhere in the middle of line 800.
The Question
Does anyone know an efficient way to move down to the middle of line 800 in a file, in order to replace one number?
What I've Tried
Regular expressions did not work, because there can be more than one instance of the same number.
So I came up with the solution of templating the file and writing a new file each time with the number to be changed as a template parameter.
This solution does work, but the person insists that I can move the file pointer down to line 800, then over to the middle of the line, to replace the number.
Here is the only code I have for the problem. It takes the file buffer to a line, but then jumps back to the beginning when I try to seek over.
import sys
import os

# open file
f = open("output.$2k")

# this will go to line 883 in text file
count = 0
while count < 883:
    line = f.readline()
    count = count + 1

# this would seek over to middle of file  DOESN'T WORK
f.seek(0, 0)
line = f.readline()
print(line)
f.close()
Yes and no. Consider:
f = open('output.$2k', 'r+')
f.seek(300)
f.write('\n')
f.close()
This script just changes the character at offset 300 in your ASCII file to a newline. Now the tricky part is that there is no way to know the length of a line in an ASCII file short of reading until you get to a newline. So, locating the particular character in the middle of the 800th line is non-trivial. However, if you can make guarantees (due to the way the file was written) about the line length, you can calculate the position without any problem. Also note that replacing 1 with 100 won't work here: you need to replace 1 character with exactly 1 character.
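For instance, here is a hedged sketch under the assumption of a fixed line width; the three constants and new_number below are made-up placeholders you would derive from how the file was written:
LINE_WIDTH = 80  # assumed bytes per line, newline included
COL = 40         # assumed 0-based column where the number starts
NUM_WIDTH = 6    # assumed width of the number field

with open('output.$2k', 'rb+') as f:
    f.seek(799 * LINE_WIDTH + COL)  # line 800 has index 799
    new_text = str(new_number).rjust(NUM_WIDTH).encode('ascii')
    assert len(new_text) == NUM_WIDTH  # overwrite exactly, never shift bytes
    f.write(new_text)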
And just for all the other *NIX users in the world ... please don't put $ in your filename. That's just a nightmare...
OK, I'm not a professional programmer, but my (stupid) approach would be: if it's always line 800, read the file line by line while tracking the line numbers, writing directly to a new file as you go. Read line 800, change it, write it, then write the rest. Dumb and not elegant, but it should work, unless I miss something, which I probably do. And there goes my meager reputation :D
No. Read in the line, manipulate it, then write it out to the new file you've previously opened for writing (and have been writing the other lines to, unmodified).
A first thing:
#this would seek over to middle of file DOESN'T WORK
f.seek(0,0)
this is not true. This seeks to the beginning of the file.
To your actual question:
Does anyone know an efficient way to move down to the middle of line 800 in a file, in order to replace one number?
In general, no. You'd need to rewrite the file. For example like this:
# open the file in read-and-update mode
with open("file", 'r+') as f:
    # read all lines
    lines = f.readlines()
    # update 800th line
    my_line = lines[799].split()
    my_line[5] = "%s" % my_number  # TODO: put in index of number and updated number
    lines[799] = " ".join(my_line) + "\n"
    # rewind, truncate and rewrite file
    f.seek(0)
    f.truncate(0)
    f.writelines(lines)
You can do it, if the starting position of the number in the file is predictable (e.g. number_starting_pos = 1234 from the beginning of the file) and the size of the string representation is also predictable (e.g. 20).
Then you could rewrite the number and make sure you fill up the padding with whitespace again to overwrite any content of the previous entry.
Similar to this:
with open("file", 'r+') as f:
# seek to the number starting position
f.seek(number_starting_pos, 0)
# update number field, assuming width (20), arbitrary space-padding allowed
my_number_string = "%19s " % my_number
# make sure the string is indeed exactly of the specific size (it may be longer)
assert len(my_number_string) == 20, "file writing would fail! aborting!"
f.write(my_number_string)
For this to work, you'd need to have a look at the docs of your SAP tool and see whether whitespace indeed does not matter.
However, both approaches are based on a lot of assumptions. Depending on your use case, they may easily break your code, e.g. if a line, or even a single character, is inserted before the number field.
I'm trying to take a text file and use only the first 30 lines of it in Python.
This is what I wrote:
text = open("myText.txt")
lines = text.readlines(30)
print lines
For some reason I get more than 150 lines when I print?
What am I doing wrong?
Use itertools.islice:
import itertools

for line in itertools.islice(open("myText.txt"), 0, 30):
    print line
If you are going to process your lines individually, an alternative could be to use a loop:
file = open('myText.txt')
for i in range(30):
    line = file.readline()
    # do stuff with line here
EDIT: some of the comments below express concern about this method assuming there are at least 30 lines in the file. If that is an issue for your application, you can check the value of line before processing. readline() will return an empty string '' once EOF has been reached:
for i in range(30):
    line = file.readline()
    if line == '':  # note that an empty line in the file will return '\n', not ''!
        break
    # do stuff with line here
The sizehint argument for readlines isn't what you think it is (bytes, not lines).
If you really want to use readlines, try text.readlines()[:30] instead.
Do note that this is inefficient for large files as it first creates a list containing the whole file before returning a slice of it.
A straight-forward solution would be to use readline within a loop (as shown in mac's answer).
To handle files of various sizes (more or less than 30), Andrew's answer provides a robust solution using itertools.islice(). To achieve similar results without itertools, consider:
output = [line for _, line in zip(range(30), open("yourfile.txt", "r"))]
or as a generator expression (Python 2.4+):
output = (line for _, line in zip(range(30), open("yourfile.txt", "r")))
for line in output:
    pass  # do something with line
The argument to readlines is a size hint: the number of bytes (not lines) you want to read. Apparently your first 150+ lines fit within the buffer that a 30-byte hint gets rounded up to.
Doing it with a for loop instead will give you proper results. Unfortunately, there doesn't seem to be a better built-in function for that.
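A hedged sketch of the loop-based alternative, stopping early if the file has fewer than 30 lines:
text = open("myText.txt")
first_30 = []
for _ in range(30):
    line = text.readline()
    if not line:  # readline() returns '' once the end of file is reached
        break
    first_30.append(line)
print(first_30)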
Are there any alternatives to the code below:
startFromLine = 141978  # or whatever line I need to jump to
urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
if I'm processing a huge text file (~15 MB) with lines of unknown but varying length, and I need to jump to a particular line whose number I know in advance? I feel bad processing the lines one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.
You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
linecache:
The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
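For example (a small sketch; linecache reads and caches the whole file on first access, and line numbers are 1-based):
import linecache
line = linecache.getline(filename, 141978)  # returns '' if the line doesn't exist
print(line)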
You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.
You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something not 0.
0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only goes a bit at a time, discarding each buffered chunk after it's processed.
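A hedged sketch of the question's loop with the buffer size bumped from 0 to 8 kB:
startFromLine = 141978
urlsfile = open(filename, "rb", 8192)  # 8 kB read buffer instead of unbuffered
for linesCounter, line in enumerate(urlsfile, 1):
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)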
I am surprised no one mentioned islice:
import itertools

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line
or if you want the whole rest of the file:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print line
or if you want every other line from the file:
rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print odd_line
I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:
from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r", 0), 141978):
    DoSomethingWithThisLine(line)
Note: the index is zero-based in this approach.
I have had the same problem (needing to retrieve a specific line from a huge file).
Surely, I could run through all the records in the file every time and stop when the counter equals the target line, but that does not work effectively when you want to obtain several specific rows. That raised the main issue to be resolved: how to get directly to the necessary place in the file.
This is the solution I found:
First I filled a dictionary with the start position of each line (the key is the line number, and the value is the cumulated length of the previous lines).
t = open(file, 'r')
dict_pos = {}
kolvo = 0
length = 0

for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1
Ultimately, the lookup function:
def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) is the command that moves the file pointer to the byte where the line begins.
So, if you call readline next, you obtain your target line.
Using this approach I saved a significant amount of time.
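Example lookup once the dictionary has been filled (note that kolvo starts at 0, so line numbers here are zero-based):
print(give_line(141978))  # seeks straight to the line's byte offset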
If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.
Of course it all depends on what you're trying to do and how often you will jump across the file.
For instance, if you're going to be jumping between lines many times in the same file, and you know that the file does not change while working with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (say, every 1000 lines).
Then, if you want line 12005, jump to the position of line 12000 (which you've recorded), read 5 lines, and you'll know you're at line 12005.
And so on. A sketch of this idea follows.
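A hedged sketch of that scheme; STEP, build_sparse_index and jump_to_line are invented names, and line numbers are zero-based:
STEP = 1000  # record a seek position every 1000 lines

def build_sparse_index(f, step=STEP):
    index = {0: 0}  # line number -> byte offset of that line's start
    lineno = 0
    while f.readline():
        lineno += 1
        if lineno % step == 0:
            index[lineno] = f.tell()
    f.seek(0)
    return index

def jump_to_line(f, index, target, step=STEP):
    base = (target // step) * step
    f.seek(index[base])             # jump to the nearest recorded line
    for _ in range(target - base):  # read forward the few remaining lines
        f.readline()
    return f.readline()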
You may use mmap to find the offset of the lines; mmap seems to be the fastest way to process a file.
example:
import mmap

with open('input_file', "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    for line in iter(mapped.readline, ""):
        if i == Line_I_want_to_jump:
            offsets = mapped.tell()
            break
        i += 1
then use f.seek(offsets) to move to the line you need (note that mapped.tell() is taken just after reading line i, so offsets is where line i + 1 begins).
None of the answers are particularly satisfactory, so here's a small snippet to help.
class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list()  # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()
Example usage:
In: !cat /tmp/test.txt
Out:
Line zero.
Line one!
Line three.
End of file, line four.
In:
with open("/tmp/test.txt", 'rt') as fin:
seeker = LineSeekableFile(fin)
print(seeker[1])
Out:
Line one!
This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.
I offer the snippet above under the MIT or Apache license at the discretion of the user.
If you know in advance the position in the file (rather the line number), you can use file.seek() to go to that position.
Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.
What generates the file you want to process? If it is something under your control, you could generate an index (recording which line is at which position) at the time the file is appended to. The index file can be of fixed line size (space-padded or 0-padded numbers) and will definitely be smaller. And thus it can be read and processed quickly.
Which line do you want?
Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
Use seek or whatever to jump directly to that line in the index file.
Parse it to get the byte offset of the corresponding line in the actual file.
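A hedged sketch of such an index file; ENTRY, build_index and get_line are invented names, and records are zero-padded to a fixed width so the n-th record sits exactly at byte n * ENTRY:
ENTRY = 12  # 11 digits + newline per index record

def build_index(data_path, index_path):
    offset = 0
    with open(data_path, 'rb') as data, open(index_path, 'wb') as idx:
        for line in data:
            idx.write(('%011d\n' % offset).encode('ascii'))
            offset += len(line)

def get_line(data_path, index_path, n):
    with open(index_path, 'rb') as idx:
        idx.seek(n * ENTRY)  # constant-time lookup, no scanning
        offset = int(idx.readline())
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline()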
Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
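A hedged sketch of that binary search, assuming lines of the form "<line index>:Data" sorted by index; the file is opened in binary mode so byte offsets are safe to seek to, and find_indexed_line is an invented name:
import os

def find_indexed_line(f, want):
    lo, hi = 0, os.fstat(f.fileno()).st_size
    while hi - lo > 4096:  # narrow down to a small window, then scan
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()  # skip the partial line we landed inside
        probe = f.readline()
        if not probe or int(probe.split(b':', 1)[0]) > want:
            hi = mid
        else:
            lo = mid
    f.seek(lo)
    if lo:
        f.readline()  # realign to the next line boundary
    for line in f:
        idx = int(line.split(b':', 1)[0])
        if idx == want:
            return line
        if idx > want:
            break
    return None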
Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
If you're dealing with a text file on a Linux system, you could use Linux commands.
For me, this worked well!
import commands

def read_line(path, line=1):
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.
def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno - lines_read - 1]
        lines_read += len(lines)

print getlineno("nci_09425001_09450000.smi", 12000)
@george brilliantly suggested mmap, which presumably uses the syscall mmap. Here's another rendition.
import mmap

LINE = 2  # your desired line

with open('data.txt', 'rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
    for i, line in enumerate(iter(data.readline, b'')):
        if i != LINE:
            continue
        pos = data.tell() - len(line)  # start offset of the matched line
        break

# optionally copy data to `chunk`
i_file.seek(pos)
chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')
You can use this function to return line n:
def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n - 1):
            fi.next()
        return fi.next()
I have an application that reads lines from a file and runs its magic on each line as it is read. Once the line is read and properly processed, I would like to delete the line from the file. A backup of the removed line is already being kept. I would like to do something like
file = open('myfile.txt', 'rw+')
for line in file:
    processLine(line)
    file.truncate(line)
This seems like a simple problem, but I would like to do it right rather than a whole lot of complicated seek() and tell() calls.
Maybe all I really want to do is remove a particular line from a file.
After spending far too long on this problem, I decided that everyone was probably right and this is just not a good way to do things. It just seemed like such an elegant solution. What I was looking for was something akin to a FIFO that would just let me pop lines out of a file.
Remove all lines after you've done with them:
with open('myfile.txt', 'r+') as file:
    for line in file:
        processLine(line)
    file.truncate(0)
Remove each line independently:
lines = open('myfile.txt').readlines()
for line in lines[::-1]:  # process lines in reverse order
    processLine(line)
    del lines[-1]  # remove the [last] line
open('myfile.txt', 'w').writelines(lines)
You can leave only those lines that cause exceptions:
import fileinput, sys

for line in fileinput.input(['myfile.txt'], inplace=1):
    try:
        processLine(line)
    except Exception:
        sys.stdout.write(line)  # it prints to 'myfile.txt'
In general, as other people have already said, what you are trying to do is a bad idea.
You can't. It is just not possible with actual text file implementations on current filesystems.
Text files are sequential, because the lines in a text file can be of any length.
Deleting a particular line would mean rewriting the entire file from that point on.
Suppose you have a file with the following 4 lines:
'line1\nline2reallybig\nline3\nlast line'
To delete the second line you'd have to move the third and fourth lines' positions on the disk. The only way would be to store the third and fourth lines somewhere, truncate the file at the second line, and rewrite the missing lines.
If you know the size of every line in the text file, you can truncate the file at any position using .truncate(line_size * line_number), but even then you'd have to rewrite everything after the line.
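A hedged sketch of that store/truncate/rewrite dance for a single line; delete_line is an invented helper and n is zero-based:
def delete_line(path, n):
    with open(path, 'r+') as f:
        for _ in range(n):  # advance to the start of line n
            f.readline()
        start = f.tell()
        f.readline()     # skip the line being deleted
        tail = f.read()  # buffer everything after it
        f.seek(start)
        f.write(tail)    # shift the tail up over the deleted line
        f.truncate()     # cut off the leftover duplicate at the end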
You're better off keeping an index into the file, so that you can start where you stopped last without destroying part of the file. Something like this would work:
try:
    for index, line in enumerate(file):
        processLine(line)
except:
    # Failed, start from this line number next time.
    print(index)
    raise
Truncating the file as you read it seems a bit extreme. What if your script has a bug that doesn't cause an error? In that case you'll want to restart at the beginning of your file.
How about having your script print the line number it breaks on and having it take a line number as a parameter so you can tell it which line to start processing from?
First of all, calling the operation truncate is probably not the best pick. If I understand the problem correctly, you want to delete everything up to the current position in the file. (I would expect truncate to cut everything from the current position up to the end of the file. This is how the standard Python truncate method works, at least if I Googled correctly.)
Second, I am not sure it is wise to modify the file while iterating over it using the for loop. Wouldn't it be better to save the number of lines processed and delete them after the main loop has finished, exception or not? The file iterator supports in-place filtering, which means it should be fairly simple to drop the processed lines afterwards.
P.S. I don’t know Python, take this with a grain of salt.
A related post has what seems like a good strategy to do that; see
How can I run the first process from a list of processes stored in a file and immediately delete the first line as if the file was a queue and I called "pop"?
I have used it as follows:
import os

tasklist_file = open(tasklist_filename, 'r')
first_line = tasklist_file.readline()
temp = os.system("sed -i -e '1d' " + tasklist_filename)  # remove first line from task file
I'm not sure it works on Windows.
Tried it on a mac and it did do the trick.
This is what I use for file based queues. It returns the first line and rewrites the file with the rest. When it's done it returns None:
def pop_a_text_line(filename):
    with open(filename, 'r') as f:
        S = f.readlines()
    if len(S) > 0:
        pop = S[0]
        with open(filename, 'w') as f:
            f.writelines(S[1:])
    else:
        pop = None
    return pop
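For example, draining a file-based queue with it (processLine is the question's placeholder):
while True:
    line = pop_a_text_line('myfile.txt')
    if line is None:  # queue exhausted
        break
    processLine(line)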