How many unread bytes are left in a file? - python

I periodically read 16-byte frames from a file, and at the last frame I need to know whether enough data is left and whether the file is valid for my format.
f.read(16)
returns an empty string if there is no more data, or some data if at least 1 byte remains.
How can I check how many unread bytes are left in a file?

For that, you'd have to know the size of the file. Using the file object, you could do the following:
f.seek(0, 2)
file_size = f.tell()
The variable file_size will then contain the size of your file in bytes. While reading, simply compute file_size - f.tell() to get the number of bytes remaining. So:

Use seek(0, 2) and tell()
BUFF = 16
f = open("someFile", "rb")  # binary mode so byte counts are exact
x = 0
# move to end of file
f.seek(0, 2)
# get current position
eof = f.tell()
# go back to start of file
f.seek(0, 0)
# some arbitrary loop
while x < 128:
    data = f.read(BUFF)
    if not data:
        break  # EOF reached before 128 bytes
    x += len(data)
# print how many unread bytes are left
unread = eof - x
print unread
File Objects - Python Library Reference:
seek(offset[, whence]) Set the file's current position, like stdio's fseek(). The whence argument is optional and defaults to 0
(absolute file positioning); other values are 1 (seek relative to the
current position) and 2 (seek relative to the file's end). There is no
return value. Note that if the file is opened for appending (mode 'a'
or 'a+'), any seek() operations will be undone at the next write. If
the file is only opened for writing in append mode (mode 'a'), this
method is essentially a no-op, but it remains useful for files opened
in append mode with reading enabled (mode 'a+'). If the file is opened
in text mode (without 'b'), only offsets returned by tell() are legal.
Use of other offsets causes undefined behavior. Note that not all file
objects are seekable.
tell() Return the file's current position, like stdio's ftell().
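For ordinary on-disk files there is also a shorter route that skips the seek dance: ask the OS for the size through the file descriptor. A minimal sketch (not from the original answer; assumes a regular file opened in binary mode, since text-mode tell() values are opaque cookies):

import os

def bytes_remaining(f):
    """Bytes left to read in a regular file opened in binary mode."""
    size = os.fstat(f.fileno()).st_size  # total size, straight from the OS
    return size - f.tell()               # minus how far we have already read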

Perhaps a little easier to use:
def LengthOfFile(f):
    """ Get the length of the file for a regular file (not a device file)"""
    currentPos = f.tell()
    f.seek(0, 2)           # move to end of file
    length = f.tell()      # get current position
    f.seek(currentPos, 0)  # go back to where we started
    return length

def BytesRemaining(f, f_len):
    """ Get the number of bytes left to read, where f_len is the length of the file (probably from f_len=LengthOfFile(f) )"""
    currentPos = f.tell()
    return f_len - currentPos

def BytesRemainingAndSize(f):
    """ Get the number of bytes left to read for a regular file (not a device file); returns a tuple of the bytes remaining and the total length of the file.
    If your code is going to be doing this a lot, use LengthOfFile and BytesRemaining instead of this function.
    """
    currentPos = f.tell()
    l = LengthOfFile(f)
    return l - currentPos, l

if __name__ == "__main__":
    f = open("aFile.data", 'r')
    f_len = LengthOfFile(f)
    print "f_len=", f_len
    print "BytesRemaining=", BytesRemaining(f, f_len), "=", BytesRemainingAndSize(f)
    f.read(1000)
    print "BytesRemaining=", BytesRemaining(f, f_len), "=", BytesRemainingAndSize(f)


Problem reading valid last line of a file [duplicate]

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large, about 1-2 million lines each, and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and past the byte to read next.
The whence parameter passed to fseek(offset, whence=0) indicates that fseek should seek to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* As would be expected, since the default behavior of most applications, including print and echo, is to append a newline to every line written; it has no effect on files whose last line is missing a trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
first = f.readline() # Read and store the first line.
for last in f: pass # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)     # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:  # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)    #   and then seek to the block to read next.
    except (OSError, ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()
def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text  , Match
        (""     , ""  ),  # Empty file, empty string.
        ("X"    , "X" ),  # No delimiter, full content.
        ("\n"   , "\n"),
        ("\n\n" , "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline() , end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The tunable value here is 1024: it represents the average line length; I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag here; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.
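A sketch of that sampling idea (the helper name, the sample size, and the 1 MB shift are all illustrative, not from the answer):

import random

def estimate_max_line_length(paths, sample_size=24, shift=2**20):
    """Estimate the longest last-line length from a random sample of files."""
    longest = 0
    for path in random.sample(paths, min(sample_size, len(paths))):
        with open(path, 'rb') as fh:
            fh.seek(0, 2)                       # jump to the end of the file
            fh.seek(-min(shift, fh.tell()), 2)  # back up by at most 1 MB
            lines = fh.readlines()
            if lines:
                longest = max(longest, len(lines[-1]))
    return longest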
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print first
print last
No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
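A sketch of shelling out that way (assumes a Unix-like system with head and tail on the PATH; returns raw bytes):

import subprocess

def first_and_last(path):
    first = subprocess.check_output(['head', '-1', path])
    last = subprocess.check_output(['tail', '-n', '1', path])
    return first, last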
This is my solution, compatible also with Python 3. It also handles border cases, but lacks UTF-16 support:
import os

def tail(filepath):
    """
    #author Marco Sulla (marcosullaroma#gmail.com)
    #date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
        if start_pos == 0:
            f.seek(start_pos)
        else:
            char = ""
            for pos in range(start_pos, -1, -1):
                f.seek(pos)
                char = f.read(1)
                if char == b"\n":
                    break
        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print ('first line is : ', w.readline())
for line in w:
    x = line
print ('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
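A sketch of that approach (the 4096-byte figure is an assumed upper bound on line length, not something from the answer):

import os

def last_line(path, max_len=4096):
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        f.seek(max(0, f.tell() - max_len))  # back up by the assumed bound
        return f.readlines()[-1]            # final line of the tail chunk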
# Needs to be in binary mode for the seek from the end to work.
with open(filename, "rb") as f:
    first = f.readline()
    if f.read(1) == b'':           # only one line in the file
        last = first
    else:
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)          # ...jump back the read byte plus one more.
        last = f.readline()        # Read last line.
This is a modified version of the answers above that handles the case where the file contains only one line.

How to insert text at line and column position in a file?

I would like to insert a string at a specific column of a specific line in a file.
Suppose I have a file file.txt
How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?
I would like to change the last line to say How was the History test? by adding the string History at line 4 column 13.
Currently I read in every line of the file and add the string to the specified position.
with open("file.txt", "r+") as f:
# Read entire file
lines = f.readlines()
# Update line
lino = 4 - 1
colno = 13 -1
lines[lino] = lines[lino][:colno] + "History " + lines[lino][colno:]
# Rewrite file
f.seek(0)
for line in lines:
f.write(line)
f.truncate()
f.close()
But I feel like I should be able to simply add the line to the file without having to read and rewrite the entire file.
This is possibly a duplicate of the SO thread below:
Fastest Way to Delete a Line from Large File in Python
That question is about deleting a line, while yours is a modification, so the code gets updated as below.
def update(filename, lineno, column, text):
    fro = open(filename, "rb")
    current_line = 0
    while current_line < lineno - 1:
        fro.readline()
        current_line += 1
    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)
    # read the line we want to update
    line = fro.readline()
    # encode the inserted text, since both files are in binary mode
    chars = line[0:column-1] + text.encode() + line[column-1:]
    while chars:
        frw.write(chars)
        chars = fro.readline()
    fro.close()
    frw.truncate()
    frw.close()

if __name__ == "__main__":
    update("file.txt", 4, 13, "History ")
For a large file it makes sense not to touch anything before the line where the update needs to happen. Imagine you have a file with 10K lines and the update needs to happen at line 9K: your code would load all 9K preceding lines of data into memory unnecessarily. The code you have would still work, but it is not the optimal way of doing it.
The function readlines() reads the entire file. But it doesn't have to. It actually reads from the current file cursor position to the end, which happens to be 0 right after opening. (To confirm this, try f.tell() right after the with statement.) What if we started closer to the end of the file?
The way your code is written implies some prior knowledge of your file contents and layouts. Can you place any constraints on each line? For example, given your sample data, we might say that lines are guaranteed to be 27 bytes or less. Let's round that to 32 for "power of 2-ness" and try seeking backwards from the end of the file.
# note the "rb+"; need to open in binary mode, else seeking is strictly
# a "forward from 0" operation. We need to be able to seek backwards
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
last = f.readlines()[-1].decode()
At which point the code has only read the last 32 bytes of the file.1 readlines() (at the byte level) looks for the line-end byte (in Unix, \n or 0x0a or byte value 10) and splits the data before and after it. Spelled out:
>>> last = f.readlines()
>>> print( last )
[b'hemistry test?\n', b'How was the test?']
>>> last = last[-1]
>>> print( last )
b'How was the test?'
Crucially, this works robustly under UTF-8 encoding by exploiting the UTF-8 property that ASCII byte values under 128 do not occur when encoding non-ASCII bytes. In other words, the exact byte \n (or 0x0a) only ever occurs as a newline and never as part of a character. If you are using a non-UTF-8 encoding, you will need to check if the code assumptions still hold.
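That property is easy to verify for yourself (a quick check, not part of the original answer):

# In UTF-8, every byte of a multi-byte character is >= 0x80,
# so the single byte b'\n' (0x0a) can only ever be a real newline.
assert b'\n' not in '\u270a\u00e9\u4e2d'.encode('utf-8')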
Another note: 32 bytes is arbitrary given the example data. A more realistic and typical value might be 512, 1024, or 4096. Finally, to put it back to a working example for you:
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
# does *not* read while file, unless file is exactly 32 bytes.
last = f.readlines()[-1]
last_decoded = last.decode()
# Update line
colno = 13 -1
last_decoded = last_decoded[:colno] + "History " + last_decoded[colno:]
last_line_bytes = len( last )
f.seek(-last_line_bytes, 2)
f.write( last_decoded.encode() )
f.truncate()
Note that there is no need for f.close(). The with statement handles that automatically.
1 The pedantic will correctly note that the computer and OS will likely have read at least 512 bytes, if not 4096 bytes, relating to the on-disk or in-memory page size.
You can use this piece of code:
with open("test.txt", 'r+') as f:
    # Read the file
    lines = f.readlines()
    # Get the column
    column = int(input("Column:")) - 1
    # Get the line
    line = int(input("Line:")) - 1
    # Get the word
    word = input("Word:")
    lines[line] = lines[line][0:column] + word + lines[line][column:]
    # Rewind to the start of the file
    f.seek(0)
    for i in lines:
        # Write the lines back
        f.write(i)
This answer will only loop through the file once and only write everything after the insert. In cases where the insert is at the end there is almost no overhead, and where the insert is at the beginning it is no worse than a full read and write.
def insert(file, line, column, text):
    ln, cn = line - 1, column - 1  # offset from human index to Python index
    count = 0                      # initial count of characters
    with open(file, 'r+') as f:    # open file for reading and writing
        for idx, line in enumerate(f):     # for all lines in the file
            if idx < ln:                   # before the given line
                count += len(line)         # read and count characters
            elif idx == ln:                # once at the line
                f.seek(count + cn)         # place cursor at the correct character location
                remainder = f.read()       # store all characters afterwards
                f.seek(count + cn)         # move cursor back to the correct character location
                f.write(text + remainder)  # insert text and rewrite the remainder
                return                     # You're finished!
I'm not sure whether you were having problems changing your file to contain the word "History", or whether you wanted to know how to only rewrite certain parts of a file, without having to rewrite the whole thing.
If you were having problems in general, here is some simple code which should work, so long as you know the line within the file that you want to change. Just change the first and last lines of the program to read and write statements accordingly.
fileData="""How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?""" # So that I don't have to create the file, I'm writing the text directly into a variable.
fileData=fileData.split("\n")
fileData[3]=fileData[3][:11]+" History"+fileData[3][11:] # The 3 referes to the line to add "History" to. (The first line is line 0)
storeData=""
for i in fileData:storeData+=i+"\n"
storeData=storeData[:-1]
print(storeData) # You can change this to a write command.
If you wanted to know how to change specific "parts" to a file, without rewriting the whole thing, then (to my knowledge) that is not possible.
Say you had a file which said Ths is a TEST file., and you wanted to correct it to say This is a TEST file.; you would technically be changing 17 characters and adding one on the end. You are changing the "s" to an "i", the first space to an "s", the "i" (from "is") to a space, etc... as you shift the text forward.
A computer can't actually insert bytes between other bytes. It can only move the data, to make room.
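You can watch that happen with a throwaway file (a small demonstration; the filename is arbitrary): seeking and writing overwrites bytes in place rather than shifting anything.

with open('demo.txt', 'w') as f:
    f.write('Ths is a TEST file.')

with open('demo.txt', 'r+') as f:
    f.seek(2)
    f.write('i')  # overwrites the 's' at offset 2; nothing moves right

print(open('demo.txt').read())  # 'Thi is a TEST file.' -- not 'This is a TEST file.'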

Reading bed file integers from bytes

I've created a tab delimited bed file using the following code
def raw_data_file(sample_name, chrom):
    data = []
    with open('{}_{}_raw_data2.bed'.format(sample_name, chrom), 'w') as text_file:
        for (i, zone) in enumerate(zones):
            select = final_data[i]
            for x in select:
                row = [chrom, int(zone[1][0]), int(zone[1][1]), zone[0], x]
                text_file.write("\t".join(map(str, row)) + "\n")
I then open it using
with open('HG00148_1_raw_data2.bed', 'rb') as f:
    rawdata = [x.decode('utf-8').split('\t') for x in f.read().splitlines()]
The data shows lines with chromosome number, start point, end point, zone name, and a list of associated data (read position, p-value, reads).
When trying to get the read position of line zero using:
rawdata[0][4][1]
my code returns 7 instead of 755255 (it treats each character as a byte). What should I change in either the encoding or decoding of the bed file for the read position to be returned correctly?
Thanks
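A guess at the cause, based only on the code shown: the fifth column was written with str() applied to a whole list, so after splitting on tabs, rawdata[0][4] is the string "[755255, ...]" and indexing into it returns single characters ('7' is just its second character). A sketch of recovering the real list (the sample values are illustrative):

import ast

# rawdata[0][4] holds something like "[755255, 0.001, 12]" -- the str() of a
# list -- so parse it back into a Python list instead of indexing the string.
values = ast.literal_eval(rawdata[0][4])
read_position = values[0]  # e.g. 755255

A cleaner long-term fix would be to write each element of x as its own tab-separated column instead of writing str(x).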

Process the lines in reverse order

How do I process a log file (in my case nginx access.log) in reverse order?
Background
I am developing a log file analyser script, and I just can't get my head around how to process huge log files from the end, so I can sort out the time frames starting with the newest dates I need.
One way could be to access the end of the file using seek and then scan the file in reverse from there. Example:
def Tail(filepath, nol=10, read_size=1024):
    """
    Return the last nol lines of a file.
    Args:
        filepath: path to file
        nol: number of lines to return
        read_size: data is read in chunks of this size (optional, default=1024)
    Raises:
        IOError if file cannot be processed.
    """
    # U opens the file with Universal newline support
    with open(filepath, 'rU') as f:
        offset = read_size
        f.seek(0, 2)
        file_size = f.tell()
        while 1:
            if file_size < offset:
                offset = file_size
            f.seek(-1 * offset, 2)
            read_str = f.read(offset)
            # Remove newline at the end
            if read_str[offset - 1] == '\n':
                read_str = read_str[:-1]
            lines = read_str.split('\n')
            if len(lines) >= nol:  # Got nol lines
                return "\n".join(lines[-nol:])
            if offset == file_size:  # Reached the beginning
                return read_str
            offset += read_size
Then use as
Tail('/etc/httpd/logs/access.log', 100)
This would give you the last 100 lines of your access.log file.
Code referenced from: http://www.manugarg.com/2007/04/real-tailing-in-python.html
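If you need to process every line newest-first, not just grab the last N, the same seek-and-chunk idea extends to a generator (a sketch in binary mode; the chunk size is arbitrary):

def reverse_lines(filepath, chunk_size=8192):
    """Yield the lines of a file from last to first without loading it whole."""
    with open(filepath, 'rb') as f:
        f.seek(0, 2)
        pos = f.tell()
        tail = b''
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            lines = (f.read(step) + tail).split(b'\n')
            tail = lines.pop(0)  # possibly a partial line; prepend next round
            for line in reversed(lines):
                yield line
        yield tail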

Python: write and read blocks of binary data to a file

I am working on a script that breaks down another Python script into blocks and uses pycrypto to encrypt the blocks (all of this I have successfully done so far). Now I am storing the encrypted blocks to a file so that the decrypter can read it and execute each block. The final result of the encryption is a list of binary outputs (something like blocks=[b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3', etc...]).
When writing the output to a file, the blocks all end up on one giant line, so that when reading the file, all the bytes come back as one giant line instead of as the items of the original list. I also tried converting the bytes into a string and adding a '\n' at the end of each one, but the problem there is that I still need the bytes, and I can't figure out how to undo the string to get the original bytes back.
To summarize, I am looking to either write each binary item to a separate line in a file so I can easily read the data and use it in the decryption, or translate the data to a string and, in the decryption, undo the string to get back the original binary data.
Here is the code for writing to the file:
new_file = open('C:/Python34/testfile.txt', 'wb')
for byte_item in byte_list:
    # This, or for the string method I replaced 'wb' with 'w' and
    # byte_item with ascii(byte_item) + '\n'
    new_file.write(byte_item)
new_file.close()
and for reading the file:
# Or 'r' instead of 'rb' if using string method
byte_list = open('C:/Python34/testfile.txt','rb').readlines()
A file is a stream of bytes without any implied structure. If you want to load a list of binary blobs then you should store some additional metadata to restore the structure e.g., you could use the netstring format:
#!/usr/bin/env python
blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']

# save blocks
with open('blocks.netstring', 'wb') as output_file:
    for blob in blocks:
        # [len]":"[string]","
        output_file.write(str(len(blob)).encode())
        output_file.write(b":")
        output_file.write(blob)
        output_file.write(b",")
Read them back:
#!/usr/bin/env python3
import re
from mmap import ACCESS_READ, mmap

blocks = []
match_size = re.compile(br'(\d+):').match
with open('blocks.netstring', 'rb') as file, \
     mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
    position = 0
    for m in iter(lambda: match_size(mm, position), None):
        i, size = m.end(), int(m.group(1))
        blocks.append(mm[i:i + size])
        position = i + size + 1  # shift to the next netstring
print(blocks)
As an alternative, you could consider the BSON format for your data, or an ASCII armor format.
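A sketch of the asker's second idea, armoring each blob with base64 so that one line per block is safe (base64 output never contains a newline; the filename is illustrative):

import base64

blocks = [b'\xa1\r\xa594\x92z\xf8\x16\xaa', b'xfbI\xfdqx|\xcd\xdb\x1b\xb3']

# Write: one base64-encoded line per block.
with open('blocks.b64', 'w') as out:
    for blob in blocks:
        out.write(base64.b64encode(blob).decode('ascii') + '\n')

# Read: decode each line back to the original bytes.
with open('blocks.b64') as f:
    restored = [base64.b64decode(line) for line in f]

assert restored == blocks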
I think what you're looking for is byte_list=open('C:/Python34/testfile.txt','rb').read()
If you know how many bytes each item is, you can use read(number_of_bytes) to process one item at a time.
read() will read the entire file, but then it is up to you to decode that entire list of bytes into their respective items.
In general, since you're using Python 3, you will be working with bytes objects (which are immutable) and/or bytearray objects (which are mutable).
Example:
b1 = bytearray('hello', 'utf-8')
print(b1)
b1 += bytearray(' goodbye', 'utf-8')
print(b1)
open('temp.bin', 'wb').write(b1)
#------
b2 = open('temp.bin', 'rb').read()
print(b2)
Output:
bytearray(b'hello')
bytearray(b'hello goodbye')
b'hello goodbye'
