Read blocks from a file object until x bytes from the end - python

I need to read chunks of 64KB in a loop and process them, but stop 16 bytes before the end of the file: the last 16 bytes are a metadata tag.
The file might be super large, so I can't read it all in RAM.
All the solutions I find are a bit clumsy and/or unpythonic.
with open('myfile', 'rb') as f:
    while True:
        block = f.read(65536)
        if not block:
            break
        process_block(block)
If 16 <= len(block) < 65536, it's easy: it's the last block ever. So useful_data = block[:-16] and tag = block[-16:]
If len(block) == 65536, it could mean three things: that the full block is useful data. Or that this 64KB block is in fact the last block, so useful_data = block[:-16] and tag = block[-16:]. Or that this 64KB block is followed by another block of only a few bytes (let's say 3 bytes), so in this case: useful_data = block[:-13] and tag = block[-13:] + last_block[:3].
How can I deal with this problem in a nicer way than distinguishing all these cases?
Note:
the solution should work for a file opened with open(...), but also for an io.BytesIO() object, or for a remote file opened over SFTP (with pysftp).
I was thinking about getting the file object size, with
f.seek(0,2)
length = f.tell()
f.seek(0)
Then after each
block = f.read(65536)
we can know if we are far from the end with length - f.tell(), but again the full solution does not look very elegant.

You can just read min(65536, L - f.tell() - 16) bytes in every iteration.
Something like this:
from pathlib import Path

L = Path('myfile').stat().st_size
with open('myfile', 'rb') as f:
    while True:
        to_read_length = min(65536, L - f.tell() - 16)
        block = f.read(to_read_length)
        process_block(block)
        if f.tell() == L - 16:
            break
I did not run this, but I hope you get the gist of it.
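For reference, a slightly fuller but still untested sketch along the same lines; process_block is the placeholder from the question, and the final f.read(16) picks up the tag:

from pathlib import Path

BLOCK = 65536
path = 'myfile'
size = Path(path).stat().st_size

with open(path, 'rb') as f:
    remaining = size - 16              # useful bytes before the 16-byte tag
    while remaining > 0:
        block = f.read(min(BLOCK, remaining))
        remaining -= len(block)
        process_block(block)           # placeholder from the question
    tag = f.read(16)                   # the trailing 16-byte metadata tag

For a file object without a filesystem path (io.BytesIO, an SFTP file), the size would instead come from the seek(0, 2)/tell() trick mentioned in the question.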

The following method relies only on the fact that the f.read() method returns an empty bytes object upon end of stream (EOS). It thus could be adapted for sockets simply by replacing f.read() with s.recv().
import random

def read_all_but_last16(f):
    rand = random.Random()  # just for testing
    buf = b''
    while True:
        bytes_read = f.read(rand.randint(1, 40))  # just for testing
        # bytes_read = f.read(65536)
        buf += bytes_read
        if not bytes_read:
            break
        process_block(buf[:-16])
        buf = buf[-16:]
    verify(buf[-16:])
It works by always leaving 16 bytes at the end of buf until EOS, then finally processing the last 16. Note that if there aren't at least 17 bytes in buf then buf[:-16] returns the empty bytes object.
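A quick way to sanity-check it, using stand-in definitions for process_block and verify (both are placeholders here, not part of the answer):

from io import BytesIO

processed = []
def process_block(b):       # stand-in: just collect the data
    processed.append(b)
def verify(tag):            # stand-in: check the tag length
    assert len(tag) == 16

payload = bytes(range(200)) + b'T' * 16   # 200 data bytes plus a fake 16-byte tag
read_all_but_last16(BytesIO(payload))
assert b''.join(processed) == payload[:-16]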

Related

Problem reading valid last line of a file [duplicate]

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump directly to the second-to-last byte to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* As would be expected, since the default behavior of most applications, including print and echo, is to append a newline to every line written; this has no effect on lines missing a trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
first = f.readline() # Read and store the first line.
for last in f: pass # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)      # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:   # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)     #   and then seek to the block to read next.
    except (OSError, ValueError):  # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ),  # Empty file, empty string.
        ("X"   , "X" ),  # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline(),      end="")
            print(readlast(f, "\n"), end="")
See also the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag here; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (let's say 1 MB). This will help you estimate the value for the full run.
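A rough sketch of that sampling idea (the glob pattern, the sample size and the 1 MB shift are all made-up values, and every sampled file is assumed to be larger than 1 MB):

import glob
import random

paths = glob.glob('data/*.log')                    # hypothetical location of the files
sample = random.sample(paths, min(24, len(paths)))

longest_last = 0
for p in sample:
    with open(p, 'rb') as fh:
        fh.seek(-2**20, 2)                         # a priori large shift: ~1 MB back from the end
        longest_last = max(longest_last, len(fh.readlines()[-1]))
print('longest last line in the sample:', longest_last)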
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] for the last, but that may take too much memory.
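If shelling out is acceptable, here is a minimal sketch of calling them from Python (it assumes a Unix-like system with head and tail on the PATH, and Python 3.7+ for capture_output):

import subprocess

def first_and_last(path):
    first = subprocess.run(["head", "-1", path], capture_output=True, text=True).stdout
    last = subprocess.run(["tail", "-n", "1", path], capture_output=True, text=True).stdout
    return first, last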
This is my solution, also compatible with Python 3. It handles border cases too, but it lacks UTF-16 support:
import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

        if start_pos == 0:
            f.seek(start_pos)
        else:
            char = ""

            for pos in range(start_pos, -1, -1):
                f.seek(pos)
                char = f.read(1)
                if char == b"\n":
                    break

        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list slicing to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()           # Read the first line.
        f.seek(-2, 2)                  # Jump to the second last byte.
        while f.read(1) != b"\n":      # Until EOL is found...
            try:
                f.seek(-2, 1)          # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()            # Read last line.
        return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
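A rough sketch of that idea using the os module (the filename and the bound N are assumptions, and the file is assumed to be larger than N bytes):

import os

filename = 'myfile.txt'  # hypothetical path
N = 4096                 # assumed upper bound on line length

fd = os.open(filename, os.O_RDONLY)
try:
    os.lseek(fd, -N, os.SEEK_END)           # jump N bytes back from the end
    last = os.read(fd, N).splitlines()[-1]  # last line, as bytes
finally:
    os.close(fd)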
with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
first = f.readline()
if f.read(1) == '':
return first
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
f.seek(-2, 1) # ...jump back the read byte plus one more.
last = f.readline() # Read last line.
return last
The above is a modified version of the previous answers that also handles the case where there is only one line in the file.

Batching very large text file in python

I am trying to batch a very large text file (approximately 150 gigabytes) into several smaller text files (approximately 10 gigabytes).
My general process will be:
# iterate over file one line at a time
# accumulate batch as string
--> # given a certain count that correlates to the size of my current accumulated batch and when that size is met: (this is where I am unsure)
# write to file
# accumulate size count
I have a rough metric to calculate when to batch (when the desired batch size is reached) but am not so clear how I should calculate how often to write to disk for a given batch. For example, if my batch size is 10 gigabytes, I assume I will need to write iteratively rather than hold the entire 10 gigabyte batch in memory. I obviously do not want to write more than I have to, as this could be quite expensive.
Do y'all have any rough calculations or tricks that you like to use to figure out when to write to disk for a task such as this, e.g. size vs memory or something?
Assuming your large file is simple unstructured text (i.e. this is no good for structured text like JSON), here's an alternative to reading every single line: read large binary bites of the input file until you reach your chunksize, then read a couple of lines, close the current output file and move on to the next.
I compared this with line-by-line using @tdelaney's code adapted with the same chunksize as my code - that code took 250s to split a 12GiB input file into 6x2GiB chunks, whereas this took ~50s, so maybe five times faster, and it looks like it's I/O bound on my SSD running >200MiB/s read and write, where the line-by-line was running 40-50MiB/s read and write.
I turned buffering off because there's not a lot of point. The size of bite and the buffering setting may be tunable to improve performance; I haven't tried other settings as for me it seems to be I/O bound anyway.
import time

outfile_template = "outfile-{}.txt"
infile_name = "large.text"
chunksize = 2_000_000_000
MEB = 2**20           # mebibyte
bitesize = 4_000_000  # the size of the reads (and writes) working up to chunksize

count = 0

starttime = time.perf_counter()

infile = open(infile_name, "rb", buffering=0)
outfile = open(outfile_template.format(count), "wb", buffering=0)

while True:
    byteswritten = 0
    while byteswritten < chunksize:
        bite = infile.read(bitesize)
        # check for EOF
        if not bite:
            break
        outfile.write(bite)
        byteswritten += len(bite)
    # check for EOF
    if not bite:
        break
    for i in range(2):
        l = infile.readline()
        # check for EOF
        if not l:
            break
        outfile.write(l)
    # check for EOF
    if not l:
        break
    outfile.close()
    count += 1
    print(count)
    outfile = open(outfile_template.format(count), "wb", buffering=0)

outfile.close()
infile.close()

endtime = time.perf_counter()
elapsed = endtime - starttime

print(f"Elapsed= {elapsed}")
NOTE: I haven't exhaustively tested that this doesn't lose data; although there is no evidence that it does, you should validate that yourself.
It might be useful to add some robustness by checking, when at the end of a chunk, how much data is left to read, so you don't end up with the last output file being 0-length (or shorter than bitesize); see the sketch below.
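For example (a sketch only; it reuses infile, infile_name and bitesize from the script above), the check could be a small helper called just before opening the next output file:

import os

def remainder_is_small(infile, infile_name, bitesize):
    """Return True when what's left to read is shorter than one bite."""
    return os.path.getsize(infile_name) - infile.tell() < bitesize

If it returns True, you could append the tail to the current chunk instead of creating a tiny final file.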
HTH
barny
I used a slightly modified version of this for parsing a 250GB JSON file. I choose how many smaller files I need (number_of_slices), then I find the positions where to slice the file (I always look for a line end). Finally, I slice the file with file.seek and file.read(chunk):
import os
import mmap

FULL_PATH_TO_FILE = 'full_path_to_a_big_file'
OUTPUT_PATH = 'full_path_to_a_output_dir'  # where sliced files will be generated

def next_newline_finder(mmapf):
    def nl_find(mmapf):
        while 1:
            current = hex(mmapf.read_byte())
            if hex(ord('\n')) == current:  # or whatever line-end symbol
                return mmapf.tell()
    return nl_find(mmapf)

# find positions where to slice a file
file_info = os.stat(FULL_PATH_TO_FILE)
file_size = file_info.st_size
positions_for_file_slice = [0]
number_of_slices = 15  # say you want to slice the big file into 15 smaller files
size_per_slice = file_size // number_of_slices

with open(FULL_PATH_TO_FILE, "r+b") as f:
    mmapf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    slice_counter = 1
    while slice_counter < number_of_slices:
        pos = size_per_slice * slice_counter
        mmapf.seek(pos)
        newline_pos = next_newline_finder(mmapf)
        positions_for_file_slice.append(newline_pos)
        slice_counter += 1

# create ranges for found positions (from, to)
positions_for_file_slice = [
    (pos, positions_for_file_slice[i + 1]) if i < (len(positions_for_file_slice) - 1)
    else (positions_for_file_slice[i], file_size)
    for i, pos in enumerate(positions_for_file_slice)
]

# do actual slice of a file
with open(FULL_PATH_TO_FILE, "rb") as f:
    for i, position_pair in enumerate(positions_for_file_slice):
        read_from, read_to = position_pair
        f.seek(read_from)
        chunk = f.read(read_to - read_from)
        with open(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'), 'wb') as chunk_file:
            chunk_file.write(chunk)
Here is an example of line-by-line writes. It's opened in binary mode to avoid the line decode step, which takes a modest amount of time but can skew character counts. For instance, utf-8 encoding may use multiple bytes on disk for a single python character.
4 Meg is a guess at buffering. The idea is to get the operating system to read more of the file at once, reducing seek times. Whether this works, or what the best number to use is, is debatable - and will be different for different operating systems. I found 4 meg makes a difference... but that was years ago and things change.
outfile_template = "outfile-{}.txt"
infile_name = "infile.txt"
chunksize = 10_000_000_000
MEB = 2**20  # mebibyte

count = 0
byteswritten = 0
infile = open(infile_name, "rb", buffering=4*MEB)
outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)

try:
    for line in infile:
        if byteswritten > chunksize:
            outfile.close()
            byteswritten = 0
            count += 1
            outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)
        outfile.write(line)
        byteswritten += len(line)
finally:
    infile.close()
    outfile.close()

Read a file in byte chunks using python

I am trying to convert a file containing more than 1 billion bytes into integers. Obviously, my machine cannot do this at once, so I need to chunk my code. I was able to decode the first 50,000,000 bytes, but I am wondering how to read the integers in the file that are between 50,000,001 and 100,000,000, 150,000,000 and 200,000,000, etc. The following is what I have now; the range function is not working with this.
import struct

with open(x, "rb") as f:
    this_chunk = range(50000001, 100000000)
    data = f.read(this_chunk)
    ints1 = struct.unpack("I" * (this_chunk // 4), data)
    print(ints1)
You can use f.seek(offset) to set the file pointer to start reading from a certain offset.
In your case, you'd want to skip 50000000 bytes, so you'd call
f.seek(50000000)
At this point, you'd want to read another 50000000 bytes, so you'd call f.read(50000000).
This would be your complete code listing, implementing f.seek and reading the whole file:
with open(x, "rb") as f:
f.seek(50000000) # omit if you don't want to skip this chunk
data = f.read(50000000)
while data:
... # do something
data = f.read(50000000)
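To tie this back to the struct part of the question, here is a rough sketch of unpacking each chunk (it assumes the chunk size and the total file size are both multiples of 4, so every read holds whole 4-byte integers; x is the filename variable from the question):

import struct

CHUNK = 50_000_000  # bytes per read, divisible by 4

with open(x, "rb") as f:
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        ints = struct.unpack("I" * (len(data) // 4), data)
        # ... process ints ...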
Use f.read(50000000) in a loop, as it will read the file in chunks of 50000000 bytes, e.g.:
In []:
from io import StringIO

s = '''hello'''
with StringIO(s) as f:
    while True:
        c = f.read(2)
        if not c:
            break
        print(c)
Out[]:
he
ll
o

writing data into file with binary packed format in python

I am reading some values from a file and want to write modified values back into the file. My file is in .ktx format [a binary packed format].
I am using struct.pack(), but it seems that something is going wrong with that:
bytes = file.read(4)
bytesAsInt = struct.unpack("l",bytes)
number=1+(bytesAsInt[0])
number=hex(number)
no=struct.pack("1",number)
outfile.write(no)
I want to write it both ways, little-endian and big-endian.
no_little =struct.pack(">1",bytesAsInt)
no_big =struct.pack("<1",bytesAsInt) # i think this is default ...
Again, you can check the docs and see the format characters you need:
https://docs.python.org/3/library/struct.html
>>> struct.unpack("l","\x05\x04\x03\03")
(50529285,)
>>> struct.pack("l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack("<l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack(">l",50529285)
'\x03\x03\x04\x05'
Also note that it is a lowercase L, not a one (as also covered in the docs).
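For reference, the same exchange in Python 3, where pack and unpack work on bytes; the standard-size "<l"/">l" forms are used because a bare "l" is native-sized (often 8 bytes on 64-bit platforms):

>>> import struct
>>> struct.unpack("<l", b"\x05\x04\x03\x03")
(50529285,)
>>> struct.pack("<l", 50529285)
b'\x05\x04\x03\x03'
>>> struct.pack(">l", 50529285)
b'\x03\x03\x04\x05'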
I haven't tested this but the following function should solve your problem. At the moment it reads the file contents completely, creates a buffer and then writes out the updated contents. You could also modify the file buffer directly using unpack_from and pack_into but it might be slower (again, not tested). I'm using the struct.Struct class since you seem to want to unpack the same number many times.
import os
import struct
from StringIO import StringIO

def modify_values(in_file, out_file, increment=1, num_code="i", endian="<"):
    with open(in_file, "rb") as file_h:
        content = file_h.read()
    num = struct.Struct(endian + num_code)
    buf = StringIO()
    try:
        while len(content) >= num.size:
            value = num.unpack(content[:num.size])[0]
            value += increment
            buf.write(num.pack(value))
            content = content[num.size:]
    except Exception as err:
        pass  # handle the error here
    else:
        buf.seek(0)
        with open(out_file, "wb") as file_h:
            file_h.write(buf.read())
An alternative is to use the array module, which makes it quite easy. I don't know how to implement endianness with an array.
from array import array

def modify_values(filename, increment=1, num_code="i"):
    with open(filename, "rb") as file_h:
        arr = array(num_code, file_h.read())
    for i in range(len(arr)):
        arr[i] += increment
    with open(filename, "wb") as file_h:
        arr.tofile(file_h)

How can I perform a buffered search and replace?

I have XML files that contain invalid character sequences which cause parsing to fail. They look like &#x10;. To solve the problem, I am escaping them by replacing the whole thing with an escape sequence: &#x10; --> !#~10^. Then, after I am done parsing, I can restore them to what they were.
import re

buffersize = 2**16  # 64 KB buffer

def escape(filename):
    out = file(filename + '_esc', 'w')
    with open(filename, 'r') as f:
        buffer = 'x'  # is there a prettier way to handle the first one?
        while buffer != '':
            buffer = f.read(buffersize)
            out.write(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buffer))
    out.close()
The files are very large, so I have to use buffering (mmap gave me a MemoryError). Because the buffer has a fixed size, I am running into problems when the buffer happens to be small enough to split a sequence. Imagine the buffer size is 8, and the file is like:
12345678
hello!&#x10;
The buffer will only read hello!&#, allowing the split &#x10; to slip through the cracks. How do I solve this? I thought of getting more characters if the last few look like they could belong to a character sequence, but the logic I thought of is very ugly.
First, don't bother to read and write the file; you can create a file-like object that wraps your open file and processes the data before it's handled by the parser. Second, your buffering only has to take care of the tail end of each read. Here's some working code:
import re

class Wrapped(object):
    def __init__(self, f):
        self.f = f
        self.buffer = ""

    def read(self, size=0):
        buf = self.buffer + self.f.read(size)
        buf = buf.replace("!", "!!")
        buf = re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", buf)
        # If there's an ampersand near the end, hold onto that piece until we
        # have more, to be sure we don't miss one.
        last_amp = buf.rfind("&", -10, -1)
        if last_amp > 0:
            self.buffer = buf[last_amp:]
            buf = buf[:last_amp]
        else:
            self.buffer = ""
        return buf
Then in your code, replace this:
it = ET.iterparse(file(xml, "rb"))
with this:
it = ET.iterparse(Wrapped(file(xml, "rb")))
Third, I used a substitution replacing "&" with "!", and "!" with "!!", so you can fix them after parsing, and you aren't counting on obscure sequences. This is Stack Overflow data after all, so lots of strange random punctuation could occur naturally.
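A possible reverse mapping after parsing might look like the sketch below (not part of the original answer; it undoes both substitutions in a single left-to-right pass so that doubled exclamation marks are consumed before a following #x...; can pair with the wrong character):

import re

def unescape(s):
    # "!!" -> "!", "!#x..;" -> "&#x..;", scanning left to right
    return re.sub(r"!(!|#x[0-9a-fA-F]+;)",
                  lambda m: "!" if m.group(1) == "!" else "&" + m.group(1),
                  s)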
If your sequence is 6 characters long, you can use buffers with 5 overlapping characters. That way, you are sure no sequence will ever slip between the buffers.
Here is an example to help you visualize it; the sequence &#x10; is split by a read boundary, but the 5-character overlap carries the split prefix into the next buffer, where the full sequence can be matched:
end of one buffer:           --&#x10
next buffer, after overlap:  &#x10;--
As for the implementation, just prepend the last 5 characters of the previous buffer to the new buffer:
buffer = buffer[-5:] + f.read(buffersize)
The only problem is that the concatenation may require a copy of the whole buffer. Another solution, if you have random access to the file, is to rewind a little bit with:
f.seek(-5, os.SEEK_CUR)
In both cases, you'll have to modify the script slightly to handle the first iteration.
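An untested sketch of that overlap approach applied to the escape() function from the question (it keeps the answers' implicit assumption that a sequence is at most 6 characters, so a 5-character carry is enough):

import re

def escape(filename, buffersize=2**16):
    pattern = re.compile(r'&#x([a-fA-F0-9]+);')
    with open(filename, 'r') as f, open(filename + '_esc', 'w') as out:
        carry = ''
        while True:
            chunk = f.read(buffersize)
            buf = pattern.sub(r'!#~\1^', carry + chunk)
            if not chunk:        # EOF: flush whatever is left and stop
                out.write(buf)
                break
            carry = buf[-5:]     # may still hold the start of a split sequence
            out.write(buf[:-5])

Each character is written exactly once: everything except the last 5 characters of each processed buffer is written immediately, and those 5 become the prefix of the next buffer.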
