Delete first line from a large file using Python - python

I recently came across an answer that uses the code below to remove the last line from a big file using Python. It is very fast and efficient, but I cannot make it work to delete the first line from a file.
Can anyone please help?
Here is that answer
https://stackoverflow.com/a/10289740/9311781
Below is the code:
import os
import sys

with open(sys.argv[1], "r+", encoding="utf-8") as file:
    # Move the pointer (similar to a cursor in a text editor) to the end of the file
    file.seek(0, os.SEEK_END)
    # This code means the following code skips the very last character in the file -
    # i.e. in the case the last line is null we delete the last line
    # and the penultimate one
    pos = file.tell() - 1
    # Read each character in the file one at a time from the penultimate
    # character going backwards, searching for a newline character
    # If we find a new line, exit the search
    while pos > 0 and file.read(1) != "\n":
        pos -= 1
        file.seek(pos, os.SEEK_SET)
    # So long as we're not at the start of the file, delete all the characters ahead
    # of this position
    if pos > 0:
        file.seek(pos, os.SEEK_SET)
        file.truncate()

As the comments already mentioned, removing the last line like that is easy because you can "truncate" the file from a given position, and as far as I know that works across multiple operating systems and filesystems.
However, similarly truncating from the start of the file is not a standard operation on most filesystems. Linux does support it on new enough kernels (>= 3.15, afaik), and Macs might have something similar too.
You could try the fallocate package from PyPI[0], or implement something similar yourself by using the underlying syscall[1] if your OS/filesystem is compatible.
[0] - https://pypi.org/project/fallocate/
[1] - https://man7.org/linux/man-pages/man2/fallocate.2.html
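If collapsing a range in place is not available (and note FALLOC_FL_COLLAPSE_RANGE requires the range to be aligned to the filesystem block size, so it cannot drop an arbitrary-length first line on its own), a portable fallback is to shift the remaining bytes toward the start of the file and then truncate. Below is a minimal sketch of that approach; it does not use fallocate, and the function name, file path, and buffer size are only placeholders:
def delete_first_line(path, bufsize=1024 * 1024):
    """Remove everything up to and including the first newline, in place."""
    with open(path, "r+b") as f:
        first = f.readline()          # first line, including its b'\n'
        if not first:
            return                    # empty file, nothing to do
        read_pos = f.tell()           # start of the data we want to keep
        write_pos = 0                 # where that data should end up
        while True:
            f.seek(read_pos)
            chunk = f.read(bufsize)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)

# delete_first_line("big.log")        # hypothetical file name
This costs I/O proportional to the file size, unlike a true collapse-range, but it works on any OS and filesystem.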

Related

What is a Pythonic way to detect that the next read will produce an EOF in Python 3 (and Python 2)

Currently, I am using
def eofapproached(f):
    pos = f.tell()
    near = f.read(1) == ''
    f.seek(pos)
    return near
to detect if a file open in 'r' mode (the default) is "at EOF" in the sense that the next read would produce the EOF condition.
I might use it like so:
f = open('filename.ext') # default 'r' mode
print(eofapproached(f))
FYI, I am working with some existing code that stops when EOF occurs, and I want my code to do some action just before that happens.
I am also interested in any suggestions for a better (e.g., more concise) function name. I thought of eofnear, but that does not necessarily convey as specific a meaning.
Currently, I use Python 3, but I may be forced to use Python 2 (part of a legacy system) in the future.
You can use f.tell() to find out your current position in the file.
The problem is that you need to find out how big the file is.
The naive (and efficient) solution is to use os.path.getsize(filepath) and compare that to the result of tell(), but that returns the size in bytes, which is only relevant if you are reading in binary mode ('rb'), since your file may contain multi-byte characters.
Your best solution is to seek to the end and back to find out the size.
def char_count(f):
    current = f.tell()
    f.seek(0, 2)
    end = f.tell()
    f.seek(current)
    return end

def chars_left(f, length=None):
    if not length:
        length = char_count(f)
    return length - f.tell()
Preferably, run char_count once at the beginning, and then pass that into chars_left. Seeking isn't efficient, but you need to know how long your file is in characters and the only way is by reading it.
If you are reading line by line, and want to know before reading the last line, you also have to know how long your last line is to see if you are at the beginning of the last line.
If you are reading line by line, and only want to know if the next line read will result in an EOF, then when chars_left(f, total) == 0 you know you are there (no more lines left to read), as in the sketch below.
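A minimal sketch of how those two helpers might fit into a read loop (the file name is a placeholder, and this assumes, as above, that the positions reported by tell() behave like character counts for your file):
with open('filename.ext') as f:        # hypothetical file
    total = char_count(f)              # measure once, up front
    for line in iter(f.readline, ''):
        if chars_left(f, total) == 0:
            print('last line:', line.rstrip('\n'))
        else:
            print('line:', line.rstrip('\n'))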
I've formulated this code to avoid the use of tell (perhaps using tell is simpler):
import os

class NearEOFException(Exception): pass

def tellMe_before_EOF(filePath, chunk_size):
    fileSize = os.path.getsize(filePath)
    chunks_num = fileSize // chunk_size  # how many chunks can we read from the file?
    reads = 0  # how many chunks we have read so far
    f = open(filePath)
    if chunks_num == 0:
        raise NearEOFException("File is near EOF")
    for i in range(chunks_num - 1):
        yield f.read(chunk_size)
    else:
        raise NearEOFException("File is near EOF")

if __name__ == "__main__":
    g = tellMe_before_EOF("xyz", 3)  # read in chunks of 3 chars
    while True:
        print(next(g), end='')  # near EOF, raises NearEOFException
The naming of the function is disputed. It's boring to name things; I'm just not good at that.
The function works like this: take the size of the file, see approximately how many N-sized chunks we can read, and store that in chunks_num. This simple division gets us near EOF; the question is where you think "near EOF" is. Near the last char, for example, or near the last n characters? Maybe that's something to keep in mind if it matters.
Trace through this code to see how it works.

How to erase part of a read only file when printing it

Practically, I'm reading a file line by line and then printing onto the screen in pygame.
textbeingread = f.readline()
The code takes 'textbeingread' and uses that to show text on the screen, but because each piece of writing is on a separate line it has the little icon to show that there's a line underneath it (not exactly sure how to show it). I was just wondering if there was a way (because each line is a different length) to omit the last character in the line but use everything else. Thanks in advance :)
textbeingread = f.readline()[:-1]
or
textbeingread = f.readline()[:-2]
Depends on whether you want to get rid of just the newline character or also the character before it.
textbeingread = f.readline().rstrip("\r\n")
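rstrip only removes characters from the set you give it, and only from the end of the string, so it handles both Unix and Windows line endings and leaves a final line without a trailing newline untouched:
print(repr("hello\n".rstrip("\r\n")))    # 'hello'
print(repr("hello\r\n".rstrip("\r\n")))  # 'hello'
print(repr("hello".rstrip("\r\n")))      # 'hello' (last line, no newline)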

How to read line by line from stdin in python

Everyone knows how to count the characters from STDIN in C. However, when I tried to do that in Python 3, I found it puzzling. (counter.py)
import sys

chrCounter = 0
for line in sys.stdin.readline():
    chrCounter += len(line)
print(chrCounter)
Then I tested the program with
python3 counter.py < counter.py
The answer is only the length of the first line, "import sys". In fact, the program ONLY reads the first line from standard input and discards the rest.
It works if I replace sys.stdin.readline with sys.stdin.read():
import sys
print(len(sys.stdin.read()))
However, obviously the program is NOT suitable for large input. Please give me an elegant solution. Thank you!
It's simpler:
for line in sys.stdin:
    chrCounter += len(line)
The file-like object sys.stdin is automatically iterated over line by line; if you call .readline() on it, you only read the first line (and iterate over that character by character); if you call read(), then you'll read the entire input into a single string and iterate over that character by character.
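Put together, a version of counter.py along those lines might look like this (a sketch, identical to the original script except for the loop):
import sys

chrCounter = 0
for line in sys.stdin:        # iterate the stream line by line
    chrCounter += len(line)
print(chrCounter)
Run it the same way: python3 counter.py < counter.py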
The answer from Tim Pietzcker is IMHO the correct one. There are 2 similar ways of doing this. Using:
for line in sys.stdin:
and
for line in sys.stdin.readlines():
The second option is closer to your original code. The difference between the two becomes clear if you use, for example, the following modification of the for-loop body and type the input at the keyboard:
for line in sys.stdin.readlines():
    line_len = len(line)
    print('Last line was', line_len, 'chars long.')
    chrCounter += len(line)
If you use the first option (for line in sys.stdin:), then the lines are processed right after you hit enter.
If you use the second option (for line in sys.stdin.readlines():), then the whole file is first read, split into lines and only then they are processed.
If I just wanted a character count, I'd read in blocks at a time instead of lines at a time:
# 4096 chosen arbitrarily. Pick any other number you want to use.
print(sum(iter(lambda:len(sys.stdin.read(4096)), 0)))

Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20Gb file and outputting lines that meet a certain condition to another file; however, occasionally Python will read in 2 lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0

for line in inputFileHandle:
    row = row + 1

    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ascii char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some python limitation here? The position in the file of the first anomaly is around the 4GB mark.
A quick Google search for "python reading files larger than 4gb" yielded many, many results. See here for such an example and another one which takes over from the first.
It's a bug in Python.
Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around: note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python itself); with 2.7, you may use io.open().
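As a sketch of that 2.7 work-around (the file name is a placeholder): io.open goes through the io stack, reads the underlying file in binary, and does the newline translation in Python rather than in the C runtime, so the CRT lookahead bug never comes into play.
import io

with io.open('big_input.txt', 'r') as inputFileHandle:   # hypothetical file name
    for line in inputFileHandle:
        pass  # process each line here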
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')

for row, line in enumerate(inputFileHandle):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

Moving to an arbitrary position in a file in Python

Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.
Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?
I do not want to traverse the file line by line.
I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.
The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.
Any ideas? I hope this is enough information.
Thanks!
Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.
If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
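For completeness, the linecache route is only a couple of lines (the file name and line number here are placeholders; under the hood it still reads and caches the whole file the first time you ask for a line):
import linecache
import re

line = linecache.getline('huge_numbers.txt', 12345)          # hypothetical file, Nth line (1-based)
numbers = [int(tok) for tok in re.split(r'\D+', line) if tok]
kth = numbers[2]                                              # e.g. the 3rd number on that line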
If you need to read from the same large file multiple times, a large optimization would be to run, once, a script over that large file that builds and saves to disk the line-number-to-byte-offset correspondence (technically an "index" auxiliary file); then all your successive runs (until the large file changes) could use the index file to navigate through the large file with very high performance. Is this your use case...?
Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:
import array
import sys

BLOCKSIZE = 1024 * 1024

def reader(f):
    blockstart = 0
    while True:
        block = f.read(BLOCKSIZE)
        if not block: break
        inblock = 0
        while True:
            nextnl = block.find(b'\n', inblock)
            if nextnl < 0:
                blockstart += len(block)
                break
            yield nextnl + blockstart
            inblock = nextnl + 1

def doindex(fn):
    with open(fn, 'rb') as f:
        # result format: x[0] is tot # of lines,
        # x[N] is byte offset of END of line N (1+)
        result = array.array('L', [0])
        result.extend(reader(f))
        result[0] = len(result) - 1
        return result

def main():
    for fn in sys.argv[1:]:
        index = doindex(fn)
        with open(fn + '.indx', 'wb') as p:
            print('File', fn, 'has', index[0], 'lines')
            index.tofile(p)

main()
and then to use it, for example, the following useindex.py:
import array
import sys

def readline(n, f, findex):
    f.seek(findex[n] + 1)
    bytes = f.read(findex[n+1] - findex[n])
    return bytes.decode('utf8')

def main():
    fn = sys.argv[1]
    with open(fn + '.indx', 'rb') as f:
        findex = array.array('l')
        findex.fromfile(f, 1)
        findex.fromfile(f, findex[0])
        findex[0] = -1
    with open(fn, 'rb') as f:
        for n in sys.argv[2:]:
            print(n, repr(readline(int(n), f, findex)))

main()
Here's an example (on my slow laptop):
$ time py3 makeindex.py kjv10.txt
File kjv10.txt has 100117 lines
real 0m0.235s
user 0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'
real 0m0.049s
user 0m0.028s
sys 0m0.020s
$
The sample file is a plain text file of King James' Bible:
$ wc kjv10.txt
100117 823156 4445260 kjv10.txt
100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine -- even my tiny slow laptop has 2GB after all;-).
You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding -- works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don't bother collapsing \r\n into a single character if that's what the file has as line terminator (it's pretty trivial to do that after reading each line if you want).
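For instance, if you did want the \r\n collapsed, the readline helper above could do it right after decoding (a trivial variation, not part of the original script):
def readline(n, f, findex):
    f.seek(findex[n] + 1)
    data = f.read(findex[n+1] - findex[n])
    # collapse a Windows CRLF terminator into a single '\n' after reading
    return data.decode('utf8').replace('\r\n', '\n')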
The problem is that since your lines are not of fixed length, you have to pay attention to line end markers to do your seeking, and that effectively becomes "traversing the file line by line". Thus, any viable approach is still going to be traversing the file, it's merely a matter of what can traverse it fastest.
Another solution, if the file is potentially going to change a lot, is to go all the way to a proper database. The database engine will create and maintain the indexes for you, so you can do very fast searches/queries.
This may be overkill, though.
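A rough sketch of that idea with the standard library's sqlite3 module (table, column, and file names are placeholders, and splitting on runs of non-digits stands in for whatever delimiters the real data uses): load the lines once with their line numbers, then answer "Kth number on the Nth line" queries directly.
import re
import sqlite3

def build_db(text_path, db_path):
    # One-time load: store each line keyed by its 1-based line number.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS lines (lineno INTEGER PRIMARY KEY, text TEXT)")
    with open(text_path, encoding='utf8') as f:
        conn.executemany(
            "INSERT OR REPLACE INTO lines VALUES (?, ?)",
            ((n, line.rstrip('\n')) for n, line in enumerate(f, start=1)),
        )
    conn.commit()
    return conn

def kth_number_on_line(conn, n, k):
    # Fetch line N, split it on runs of non-digits, return the Kth number (1-based).
    (text,) = conn.execute("SELECT text FROM lines WHERE lineno = ?", (n,)).fetchone()
    numbers = [int(tok) for tok in re.split(r'\D+', text) if tok]
    return numbers[k - 1]

# conn = build_db('numbers.txt', 'numbers.db')   # hypothetical paths, run once
# print(kth_number_on_line(conn, 12345, 3))      # then query as often as needed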
