Alternatives to `tell()` while iterating over lines of a file in Python3? - python

How can I find out the location of the file cursor when iterating over a file in Python3?
In Python 2.7 it's trivial, use tell(). In Python3 that same call throws an OSError:
Traceback (most recent call last):
  File "foo.py", line 113, in check_file
    pos = infile.tell()
OSError: telling position disabled by next() call
My use case is making a progress bar for reading large CSV files. Computing a total line count is too expensive and requires an extra pass. An approximate value is plenty useful, I don't care about buffers or other sources of noise, I want to know if it'll take 10 seconds or 10 minutes.
Simple code to reproduce the issue. It works as expected on Python 2.7, but throws on Python 3:
file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))
This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. The suggested alternative is to read line by line manually instead, but that code is inside the CSV module so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().
So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?

The csv module just expects the first parameter of the reader call to be an iterator that returns one line on each next call. So you can just use an iterator wrapper that counts the characters. If you want the count to be accurate, you will have to open the file in binary mode. But in fact this is fine, because you then get no end-of-line conversion, which is what the csv module expects anyway.
So a possible wrapper is:
class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding  # specify encoding in constructor, with utf-8 as default
    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)  # returns a decoded line (a true Python 3 string)
    def __iter__(self):
        return self
Your code would then become:
file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))
The good news here is that you keep all the power of the csv module, including newlines in quoted fields...

If you are comfortable working without the csv module in particular, you can do something like:

import os, csv

file_size = os.path.getsize('SampleCSV.csv')
pos = 0
with open('SampleCSV.csv', "r") as infile:
    for line in infile:
        pos += len(line)  # the line already includes its trailing newline
        row = line.rstrip().split(',')
        print("At byte {} of {}".format(pos, file_size))

But this might not work in cases where fields themselves contain quoted commas.
Ex: 1,"Hey, you..",22:04 Though these can also be taken care of using regular expressions.

As your CSV file is too large, there is also another solution, according to the page you mentioned:
Using offset += len(line) instead of file.tell(). For example,
offset = 0
with open(path, mode) as file:
    for line in file:
        offset += len(line)
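For the progress-bar use case, a minimal sketch of turning that offset into a rough percentage readout (the file name and the reporting interval are placeholders; opening in binary mode keeps the offset in real bytes):

import os

path = "large.csv"  # placeholder
file_size = os.stat(path).st_size
offset = 0
with open(path, "rb") as f:
    for i, line in enumerate(f):
        offset += len(line)
        if i % 100000 == 0:  # arbitrary reporting interval
            print("{:.1f}% read".format(100 * offset / file_size))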

Related

How to quickly get the last line of a huge csv file (48M lines)? [duplicate]

I have a csv file that grows until it reaches approximately 48M lines.
Before adding new lines to it, I need to read the last line.
I tried the code below, but it got too slow and I need a faster alternative:
def return_last_line(filepath):
    with open(filepath, 'r') as file:
        for x in file:
            pass
        return x

return_last_line('lala.csv')
Here is my take, in python:
I created a function that lets you choose how many of the last lines you want, because the last lines may be empty.
def get_last_line(file, how_many_last_lines=1):
    # open your file using with: safety first, kids!
    with open(file, 'r') as file:
        # find the position of the end of the file: end of the file stream
        end_of_file = file.seek(0, 2)
        # set your stream at the end: seek the final position of the file
        file.seek(end_of_file)
        # trace back each character of your file in a loop
        n = 0
        for num in range(end_of_file + 1):
            file.seek(end_of_file - num)
            # save the last characters of your file as a string: last_line
            last_line = file.read()
            # count how many '\n' you have in your string:
            # if you have 1, you are in the last line; if you have 2, you have the two last lines
            if last_line.count('\n') == how_many_last_lines:
                return last_line

get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, as in your example. It took me 0 seconds to get the last line.
Here is code for finding the last line of a file using mmap. It should work on Unixen and derivatives and Windows alike (I've tested this on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere it matters. Since it uses memory-mapped I/O it can be expected to be quite performant.
It expects that you can map the entire file into the address space of a process - that should be OK for a 50 MB file everywhere, but for a 5 GB file you'd need a 64-bit system or some extra slicing.
import mmap

def iterate_lines_backwards(filename):
    with open(filename, "rb") as f:
        # memory-map the file, size 0 means whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = len(mm)
            while start > 0:
                start, prev = mm.rfind(b"\n", 0, start), start
                slice = mm[start + 1:prev + 1]
                # if the last character in the file was a '\n',
                # technically the empty string after that is not a line.
                if slice:
                    yield slice.decode()

def get_last_nonempty_line(filename):
    for line in iterate_lines_backwards(filename):
        if stripped := line.rstrip("\r\n"):
            return stripped

print(get_last_nonempty_line("datafile.csv"))
As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
print(l, end="")
This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:
import os

def get_last_lines(path, offset=500):
    """ An efficient way to get the last lines of a file.

    IMPORTANT:
    1. Choose offset to be greater than
       max_line_length * number of lines that you want to recover.
    2. This will throw an os.OSError if the file is shorter than
       the offset.
    """
    with path.open("rb") as f:
        f.seek(-offset, os.SEEK_END)
        while f.read(1) != b"\n":
            f.seek(-2, os.SEEK_CUR)
        return f.readlines()
You need to know the maximum line length though and ensure that the file is at least one offset long!
To use it, do the following:
from pathlib import Path

n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-n_last_lines:]
Now finally you need to decode the binary to strings:
real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]
Probably all of this could be wrapped in one more convenient function.
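For instance, a rough convenience wrapper might look like this (the name and defaults are mine, and it still assumes the file is at least offset bytes long):

import os
from pathlib import Path

def tail_lines(path, n=10, offset=500):
    # Hypothetical wrapper: return the last n lines of 'path' as strings.
    # Assumes the file is longer than 'offset' bytes and that the offset
    # covers at least n full lines.
    with Path(path).open("rb") as f:
        f.seek(-offset, os.SEEK_END)  # jump close to the end
        f.readline()                  # discard the first, probably partial, line
        return [x.decode() for x in f.readlines()[-n:]]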
If you are running your code in a Unix based environment, you can execute tail shell command from Python to read the last line:
import subprocess
subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'])
You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.
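A rough sketch of that sidecar idea (the file names are made up for illustration): every append to the main file also rewrites a tiny one-line file, so reading the last line later is cheap:

def append_line(line, main_path="lala.csv", last_path="lala.last"):
    # append to the big file and mirror the last line into a small sidecar file
    with open(main_path, "a") as main, open(last_path, "w") as last:
        main.write(line + "\n")
        last.write(line + "\n")

def read_last_line(last_path="lala.last"):
    with open(last_path) as f:
        return f.readline().rstrip("\n")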
This works well for me:
https://pypi.org/project/file-read-backwards/
from file_read_backwards import FileReadBackwards

with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:
    # getting lines by lines starting from the last line up
    for l in frb:
        if l:
            print(l)
            break
An easy way to do this is with deque:
from collections import deque

def return_last_line(filepath):
    with open(filepath, 'r') as f:
        q = deque(f, 1)
    return q[0]
Since seek() returns the position that it moved to, you can use it to move backward and position the cursor to the beginning of the last line.

with open("test.txt") as f:
    p = f.seek(0, 2) - 1                # ignore trailing end of line
    while p > 0 and f.read(1) != "\n":  # detect end of line (or start of file)
        p = f.seek(p - 1, 0)            # search backward
    lastLine = f.read().strip()         # read from start of last line
    print(lastLine)
To get the last non-empty line, you can add a while loop around the search:
with open("test.txt") as f:
    p, lastLine = f.seek(0, 2), ""          # start from end of file
    while p and not lastLine:               # want last non-empty line
        while p > 0 and f.read(1) != "\n":  # detect end of line (or start of file)
            p = f.seek(p - 1, 0)            # search backward
        lastLine = f.read().strip()         # read from start of last line
Based on @kuropan
Faster and shorter:

# 60.lastlinefromlargefile.py
# juanfc 2021-03-17

import os

def get_last_lines(fileName, offset=500):
    """ An efficient way to get the last lines of a file.

    IMPORTANT:
    1. Choose offset to be greater than
       max_line_length * number of lines that you want to recover.
    2. This will throw an os.OSError if the file is shorter than
       the offset.
    """
    with open(fileName, "rb") as f:
        f.seek(-offset, os.SEEK_END)
        return f.read().decode('utf-8').rstrip().split('\n')[-1]

print(get_last_lines('60.lastlinefromlargefile.py'))

Problem reading valid last line of a file [duplicate]

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This is expected, since the default behavior of most applications, including print and echo, is to append a newline to every line written; it has no effect on lines missing a trailing newline character.
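A small illustrative snippet of those three whence values on an in-memory file (my example, not part of the answer):

import io, os

f = io.BytesIO(b"hello\nworld\n")
f.seek(0, os.SEEK_END)   # jump to the end of the data
f.seek(-6, os.SEEK_END)  # 6 bytes before the end: the start of "world\n"
f.seek(2, os.SEEK_CUR)   # 2 bytes forward from the current position
f.seek(0, os.SEEK_SET)   # back to the very beginning
print(f.read(5))         # b'hello'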
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:

with open(file, "rb") as f:
    first = f.readline()  # Read and store the first line.
    for last in f: pass   # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)     # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:  # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)    #   and then seek to the block to read next.
    except (OSError, ValueError):  # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ),  # Empty file, empty string.
        ("X"   , "X" ),  # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()     , end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.

with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of average line length you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

You don't need to bother with the binary flag in that case; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (let's say 1 MB). This will help you estimate the value for the full run.
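A possible sketch of that sampling idea (the helper name, the 1 MB shift, and the all_filenames list are assumptions for illustration):

import os, random

def last_line_length(fname, shift=1024 * 1024):
    # read only the final chunk of the file and measure its last line
    with open(fname, 'rb') as fh:
        fh.seek(-min(shift, os.path.getsize(fname)), 2)
        return len(fh.readlines()[-1])

sample = random.sample(all_filenames, 24)  # all_filenames: the full list of files to process
estimate = 2 * max(last_line_length(f) for f in sample)
print("use a position shift of about", estimate, "bytes")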
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
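A hedged sketch of driving those commands from Python (assumes a Unix-like system with head and tail on the PATH, and Python 3.7+ for capture_output):

import subprocess

def first_and_last(path):
    first = subprocess.run(['head', '-1', path], capture_output=True, text=True).stdout
    last = subprocess.run(['tail', '-n', '1', path], capture_output=True, text=True).stdout
    return first.rstrip('\n'), last.rstrip('\n')

print(first_and_last('lala.csv'))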
This is my solution, also compatible with Python 3. It manages border cases too, but it is missing utf-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)

                    if char == b"\n":
                        break

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line. All the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.

a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:
f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second to last line ending, and then readline() the last line.
def get_last_line(filename):
    # wrapped in a function so the return statements are valid
    with open(filename, "rb") as f:  # needs to be in binary mode for the seek from the end to work
        first = f.readline()
        if f.read(1) == b'':       # the file has only one line
            return first
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)          # ...jump back the read byte plus one more.
        last = f.readline()        # Read last line.
        return last

This is a modified version of the answers above which handles the case where there is only one line in the file.

Is there a way to read file in reverse using with open using Python

I'm trying to read a file.out server file, but I need to read only the latest data within a datetime range.
Is it possible to read the file in reverse using with open() with modes (methods)?
The a+ mode gives access to the end of the file:
``a+'' Open for reading and writing. The file is created if it does not
exist. The stream is positioned at the end of the file. Subsequent writes
to the file will always end up at the then current end of the file,
irrespective of any intervening fseek(3) or similar.
Is there a way to use maybe a+ or other modes(methods) to access the end of the file and read a specific range?
Since the regular r mode reads the file from the beginning:

with open('file.out', 'r') as file:

I have tried using reversed():

for line in reversed(list(open('file.out').readlines())):

but it returns no rows for me.
Or are there other ways to read a file in reverse... help
EDIT
What I got so far:
import os
import time
from datetime import datetime as dt

start_0 = dt.strptime('2019-01-27', '%Y-%m-%d')
stop_0 = dt.strptime('2019-01-27', '%Y-%m-%d')
start_1 = dt.strptime('09:34:11.057', '%H:%M:%S.%f')
stop_1 = dt.strptime('09:59:43.534', '%H:%M:%S.%f')

os.system("touch temp_file.txt")

process_start = time.clock()
count = 0
print("reading data...")

for line in reversed(list(open('file.out'))):
    try:
        th = dt.strptime(line.split()[0], '%Y-%m-%d')
        tm = dt.strptime(line.split()[1], '%H:%M:%S.%f')
        if (th == start_0) and (th <= stop_0):
            if (tm > start_1) and (tm < stop_1):
                count += 1
                print("%d occurrences" % (count))
                os.system("echo '" + line.rstrip() + "' >> temp_file.txt")
        if (th == start_0) and (tm < start_1):
            break
    except KeyboardInterrupt:
        print("\nLast line before interrupt:%s" % (str(line)))
        break
    except IndexError as err:
        continue
    except ValueError as err:
        continue

process_finish = time.clock()
print("Done: " + str(process_finish - process_start) + " seconds.")

I'm adding these limitations so that when I find the rows it can at least print that the occurrences appeared and then just stop reading the file.
The problem is that it's reading, but it's way too slow.
EDIT 2
(2019-04-29 9.34am)
All the answers I received work well for reverse reading logs, but in my case (and maybe for other people's), when you have an n GB log, Rocky's answer below suited me best.
The code that works for me:
(I only added the for loop to Rocky's code):

import collections

number_of_rows = 100  # however many lines from the end you need (value assumed)
log_lines = collections.deque()
for line in open("file.out", "r"):
    log_lines.appendleft(line)
    if len(log_lines) > number_of_rows:
        log_lines.pop()
log_lines = list(log_lines)
for line in log_lines:
    print(str(line).split("\n"))

Thanks people, all the answers work.
-lpkej
There's no way to do it with open parameters, but if you want to read the last part of a large file without loading that file into memory (which is what reversed(list(fp)) will do), you can use a two-pass solution.

LINES_FROM_END = 1000
with open(FILEPATH, "r") as fin:
    s = 0
    while fin.readline():  # fixed typo, readlines() will read everything...
        s += 1
    fin.seek(0)
    mylines = []
    for i, e in enumerate(fin):
        if i >= s - LINES_FROM_END:
            mylines.append(e)

This won't keep your file in memory; you can also reduce this to one pass by using collections.deque:

# one pass (a lot faster):
import collections

mylines = collections.deque()
for line in open(FILEPATH, "r"):
    mylines.appendleft(line)
    if len(mylines) > LINES_FROM_END:
        mylines.pop()
mylines = list(mylines)
# mylines will contain #LINES_FROM_END count of lines from the end.
Sure there is:

filename = 'data.txt'
for line in reversed(list(open(filename))):
    print(line.rstrip())
EDIT:
As mentioned in comments this will read the whole file into memory. This solution should not be used with large files.
Another option is to mmap.mmap the file and then use rfind from the end to search for the newlines and then slice out the lines.
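A rough sketch of that mmap idea (this grabs only the last line; walking further back line by line follows the same rfind pattern):

import mmap

with open("data.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        end = len(mm)
        if mm[-1:] == b"\n":                 # ignore a trailing newline
            end -= 1
        start = mm.rfind(b"\n", 0, end) + 1  # rfind returns -1 when no earlier newline exists
        print(mm[start:end].decode())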
Hey m8, I have made this code, it works for me, and I can read my file in reversed order. Hope it helps :)
I start by creating a new text file, so I don't know how important that is for you.

def main():
    f = open("Textfile.txt", "w+")
    for i in range(10):
        f.write("line number %d\r\n" % (i + 1))
    f.close()

def readReversed():
    for line in reversed(list(open("Textfile.txt"))):
        print(line.rstrip())

main()
readReversed()

Loop within a loop not re-looping with reading a file Python3

Trying to write code that will find all of a certain type of character in a text file.
For vowels it'll find the number of a's but won't re-loop through the text to count e's. Help?
def finder_character(file_name, character):
    in_file = open(file_name, "r")
    if character == 'vowel':
        brain_rat = 'aeiou'
    elif character == 'consonant':
        brain_rat = 'bcdfghjklmnpqrstvwxyz'
    elif character == 'space':
        brain_rat = ''
    else:
        brain_rat = '!@#$%^&*()_+=-123456789{}|":?><,./;[]\''
    found = 0
    for line in in_file:
        for i in range(len(brain_rat)):
            found += finder(file_name, brain_rat[i+1, i+2])
    in_file.close()
    return found

def finder(file_name, character):
    in_file = open(file_name, "r")
    line_number = 1
    found = 0
    for line in in_file:
        line = line.lower()
        found += line.count(character)
    return found
If you want to use your original code, you have to pass the filename to the finder() function, and open the file there, for each char you are testing for.
The reason for this is that the file object (in_file) is a generator, not a list. The way a generator works is that it returns the next item each time you call its next() method. When you say

for line in in_file:

the for ... in statement calls in_file.next() as long as the next() method "returns" a value (it actually uses the keyword yield, but don't think about that for now). When the generator doesn't return any values any longer, we say that the generator is exhausted. You can't re-use an exhausted generator. If you want to start over again, you have to make a new generator.
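A tiny illustration of that exhaustion behaviour (my example, not from the answer):

f = open("test.txt")
print(sum(1 for _ in f))  # counts every line
print(sum(1 for _ in f))  # prints 0: the file iterator is already exhausted
f.seek(0)                 # rewinding (or reopening) gives you a fresh start
print(sum(1 for _ in f))  # counts every line again
f.close()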
I allowed myself to rewrite your code. This should give you the desired result. If anything is unclear, please ask!
def finder_character(file_name, character):
    with open(file_name, "r") as ifile:
        if character == 'vowel':
            brain_rat = 'aeiou'
        elif character == 'consonant':
            brain_rat = 'bcdfghjklmnpqrstvwxyz'
        elif character == 'space':
            brain_rat = ' '
        else:
            brain_rat = '!@#$%^&*()_+=-123456789{}|":?><,./;[]\''
        return sum(1 if c.lower() in brain_rat else 0 for c in ifile.read())
test.txt:
eeehhh
iii!#
kk ="k
oo o
Output:
>>>print(finder_character('test.txt', 'vowel'))
9
>>>print(finder_character('test.txt', 'consonant'))
6
>>>print(finder_character('test.txt', 'space'))
2
>>>print(finder_character('test.txt', ''))
4
If you are having problems understanding the return line, it should be read backwards, like this:
Sum this generator:
    Make a generator with values as v in:
        for c in ifile.read():
            if c.lower() in brain_rat:
                v = 1
            else:
                v = 0
If you want to know more about generators, I recommend the Python Wiki page concerning it.
This seems to be what you are trying to do in finder_character. I'm not sure why you need finder at all.
In python you can loop over iterables (like strings), so you don't need to do range(len(string)).
for line in in_file:
    for i in brain_rat:
        if i in line: found += 1
There appear to be a few other oddities in your code too:
You open (and iterate through) the file twice, but only close it once.
line_number is never used
You get the total of a character in a file for each line in the file, so the total will be vastly inflated.
This is probably a much safer version; with open ... is generally better than open() ... file.close(), as you don't need to worry as much about error handling and closing. I've added some comments to help explain what you are trying to do.
def finder_character(file_name, character):
    found = 0                          # Initialise the counter
    with open(file_name, "r") as in_file:
        # Open the file
        opts = { 'vowel': 'aeiou',
                 'consonant': 'bcdfghjklmnpqrstvwxyz',
                 'space': ' ' }
        default = '!@#$%^&*()_+=-123456789{}|":?><,./;[]\''
        for line in in_file:
            # Iterate through each line in the file
            for c in opts.get(character, default):
                # With each line, also iterate through the set of chars to check.
                if c in line.lower():
                    # If the current character is in the line
                    found += 1         # increment the counter
    return found                       # return the counter

Fastest Way to Delete a Line from Large File in Python

I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj
You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
    fro = open(filename, "rb")

    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file
    # one line back
    chars = fro.readline()
    while chars:
        frw.writelines(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()
Modify the file in place: the offending line is replaced with spaces, so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing.
import os
from mmap import mmap

def removeLine(filename, lineno):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(lineno - 1):
        p = m.find('\n', p) + 1
    q = m.find('\n', p)
    m[p:q] = ' ' * (q - p)
    os.close(f)
If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
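A hedged sketch of that variant (my adaptation, using bytes literals for Python 3's mmap API and assuming offset is the byte position at which the offending line starts):

import os
from mmap import mmap

def removeLineAtOffset(filename, offset):
    # blank out the line that starts at byte position 'offset' (assumes it ends with '\n')
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    q = m.find(b"\n", offset)          # end of the offending line
    m[offset:q] = b" " * (q - offset)  # overwrite it with spaces, in place
    os.close(f)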
As far as I know, you can't just open a txt file with python and remove a line. You have to make a new file and move everything but that line to it. If you know the specific line, then you would do something like this:
f = open('in.txt')
fo = open('out.txt', 'w')

ind = 1
for line in f:
    if ind != linenumtoremove:
        fo.write(line)
    ind += 1

f.close()
fo.close()
You could of course check the contents of the line instead to determine if you want to keep it or not. I also recommend that if you have a whole list of lines to be removed/changed, you do all those changes in one pass through the file.
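For example, a rough one-pass sketch for dropping a whole set of line numbers at once (names and values are illustrative):

lines_to_remove = {7, 42, 1337}  # 1-based line numbers to drop

with open('in.txt') as src, open('out.txt', 'w') as dst:
    for lineno, line in enumerate(src, start=1):
        if lineno not in lines_to_remove:
            dst.write(line)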
If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky though.
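If you did happen to be that lucky, a sketch of the fixed-length case might look like this (the record length is an assumption; it shifts every later record back by one record and truncates the file):

RECORD_LEN = 80  # assumed fixed line length in bytes, including the newline

def remove_fixed_record(filename, lineno):
    # 1-based lineno; shift each later record back by one record, then truncate
    with open(filename, "r+b") as f:
        read_pos = lineno * RECORD_LEN
        write_pos = (lineno - 1) * RECORD_LEN
        while True:
            f.seek(read_pos)
            block = f.read(RECORD_LEN)
            if not block:
                break
            f.seek(write_pos)
            f.write(block)
            read_pos += RECORD_LEN
            write_pos += RECORD_LEN
        f.truncate(write_pos)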
Update: solution using sed as requested by poster in comment.
To delete for example the second line of file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.
import os

def removeLine(filename, lineno):
    fin = open(filename)               # 'in' is a reserved keyword, so use fin/fout
    fout = open(filename + ".new", "w")
    for i, l in enumerate(fin, 1):
        if i != lineno:
            fout.write(l)
    fin.close()
    fout.close()
    os.rename(filename + ".new", filename)
I think there was a somewhat similar, if not exactly the same, type of question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through that line by line skipping lines you don't want, then write this as a single chunk to a new file. Repeat until done. Finally replace the original file with the new file.
The thing to watch out for is when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend that into the next chunk you read.
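A rough sketch of that chunked approach (the chunk size and the should_drop test are placeholders for whatever error check you run):

CHUNK = 1024 * 1024  # read roughly 1 MB at a time

def remove_matching_lines(src_path, dst_path, should_drop):
    # Copy src to dst, skipping lines for which should_drop(line) is True.
    leftover = ""
    with open(src_path) as src, open(dst_path, "w") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()  # possibly partial line; prepend it to the next chunk
            for line in lines:
                if not should_drop(line):
                    dst.write(line + "\n")
        if leftover and not should_drop(leftover):
            dst.write(leftover)     # final line had no trailing newline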
@OP, if you can use awk, e.g. assuming the line number is 10:
$ awk 'NR!=10' file > newfile
I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            currentLineNumber = 0
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            inputFile.readline()

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.writelines(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()
String
def removeLine(filename, key):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            seekPosition = 0
            currentLine = inputFile.readline()
            while not currentLine.strip().startswith('"%s"' % key):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            outputFile.seek(seekPosition, 0)

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.writelines(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()
