Fetching address and values of opened file - python

I need to read a particular byte from a big binary file using Python. Using f.seek() takes a long time. Is there any way to fetch the address of the first byte of the file and then add an offset to that address to reach a particular byte in Python?
For example, given a text file containing
asddfrgd
get the address of 'a', add 5, and then fetch the resulting value (which is 'r', assuming 1 byte for each letter).

Your description is not very clear. I assume that you want to fetch all values that are 5 bytes after an "a" in your example, such that "aardvark" gets "a" and "r" and the last "a" is skipped, because adding 5 goes beyond the end of the string.
Here's a solution that returns a list of such values by scanning the file linearly without jumping, byte by byte:
def find_past(fn, which, step):
    """ Read file 'fn' and return all elements 'step' bytes after
    each occurrence of 'which'.
    """
    f = open(fn, "rb")
    n = 0       # current byte address
    res = []    # list of result bytes
    next = []   # list of next byte addresses to consider
    while True:
        c = f.read(1)
        if c == b"":          # end of file reached
            break
        if next and next[0] == n:
            res.append(c)
            next.pop(0)
        if c == which:        # 'which' should be a bytes value, e.g. b'a'
            next.append(n + step)
        n += 1
    f.close()
    return res
Keeping track of the lists and byte offsets should be cheaper than f.seek(), but I haven't tried that on large data.
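For completeness: if you only need a single byte at one known offset, as in the question, a direct seek is the simplest route. A minimal sketch (file offsets start at 0, so the "address of the first byte" is simply offset 0):
with open("data.txt", "rb") as f:
    f.seek(5)           # move 5 bytes past the first byte (offset 0)
    value = f.read(1)   # b'r' for the example content 'asddfrgd'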

Related

Problem reading valid last line of a file [duplicate]

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large, about 1-2 million lines each, and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second-last byte directly to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and to the byte to read next.
The whence parameter passed to seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to... (see the short demo after this list)
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
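A small illustration of the three whence values; this sketch uses an in-memory buffer, since io.BytesIO behaves like a binary file here:
import io, os

f = io.BytesIO(b"abcdef")
f.seek(2, os.SEEK_SET)   # offset from the beginning: position 2
f.seek(2, os.SEEK_CUR)   # offset from the current position: position 4
f.seek(-1, os.SEEK_END)  # offset from the end: position 5
print(f.read())          # b'f'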
* This is usually what you want, since the default behavior of most applications, including print and echo, is to append a newline to every line written; it has no effect on lines missing a trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.
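For reference, a rough harness along these lines could reproduce such a comparison with timeit; the file name and iteration count below are illustrative, not the ones used above:
import timeit

def seek_based(path):
    with open(path, "rb") as f:
        first = f.readline()
        last = readlastline(f)   # the function defined earlier in this answer

def loop_based(path):
    with open(path, "rb") as f:
        first = f.readline()
        for last in f: pass

print(timeit.timeit(lambda: seek_based("data.txt"), number=100))
print(timeit.timeit(lambda: loop_based("data.txt"), number=100))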
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify it to your needs, or do not use it at all; you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)         # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:      # - Until reaching 'sep': read sep-sized block
            o = f.seek(o-step)        #   and then seek to the block to read next.
    except (OSError, ValueError):     # - Beginning of file reached.
        f.seek(0)
    return f.read()
def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = {'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00'}
    assert "\n".encode('utf8') == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        ("", ""),        # Empty file, empty string.
        ("X", "X"),      # No delimiter, full content.
        ("\n", "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline(), end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
For this loop you don't need to bother with the binary flag; you could just use open(fname). (Note that in Python 3, the seek from the end in the first snippet does require binary mode.)
ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run; see the sketch below.
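A hedged sketch of that sampling idea; paths is assumed to be your list of file paths, and the 1 MB shift assumes every sampled file is larger than 1 MB:
import random

sample = random.sample(paths, 24)       # a couple dozen files
longest = 0
for path in sample:
    with open(path, 'rb') as fh:
        fh.seek(-1024 * 1024, 2)        # a priori large shift: 1 MB
        # skip the first, possibly partial, line of the chunk
        longest = max(longest, max(len(l) for l in fh.readlines()[1:]))
print(longest)                          # use roughly 2x this for the full run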
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print first
print last
No need for an upper bound for line length here.
Can you use Unix commands? head -1 and tail -n 1 are probably the most efficient methods (e.g. via subprocess, below). Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
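If shelling out is acceptable, a minimal sketch of that route (assumes a Unix-like system with head and tail on the PATH; fname is the path to your file):
import subprocess

first = subprocess.check_output(["head", "-1", fname])
last = subprocess.check_output(["tail", "-n", "1", fname])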
This is my solution, also compatible with Python 3. It also manages border cases, but it lacks UTF-16 support:
import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)
                    if char == b"\n":
                        break

        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()           # Read the first line.
        f.seek(-2, 2)                  # Jump to the second last byte.
        while f.read(1) != b"\n":      # Until EOL is found...
            try:
                f.seek(-2, 1)          # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()            # Read last line.
    return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, seek back that amount from SEEK_END with os.lseek, find the second-to-last line ending, and then readline() the last line, as sketched below.
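A minimal sketch of that approach, assuming no line is longer than MAX_LINE bytes (an illustrative bound) and fname is the path to your file:
import os

MAX_LINE = 4096                    # assumed upper bound on line length
with open(fname, "rb") as f:
    first = f.readline()
    # seek back MAX_LINE bytes from the end, but never before offset 0
    f.seek(max(-MAX_LINE, -os.path.getsize(fname)), os.SEEK_END)
    last = f.readlines()[-1]       # last line of the final chunk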
with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
first = f.readline()
if f.read(1) == '':
return first
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
f.seek(-2, 1) # ...jump back the read byte plus one more.
last = f.readline() # Read last line.
return last
This is a modified version of the answers above that handles the case where the file contains only one line.

How to loop over every 2 characters in a file in python

I'm trying to loop over every 2 characters in a file, do some tasks on them, and write the resulting characters into another file.
So I tried to open the file and read the first two characters. Then I set the pointer to the 3rd character in the file, but it gives me the following error:
'bytes' object has no attribute 'seek'
This is my code:
the_file = open('E:\\test.txt',"rb").read()
result = open('E:\\result.txt',"w+")
n = 0
s = 2
m = len(the_file)
while n < m :
chars = the_file.seek(n)
chars.read(s)
#do something with chars
result.write(chars)
n =+ 1
m =+ 2
I have to mention that test.txt contains only integers (numbers).
The content of test.txt is a series of binary data (0's and 1's) like this:
01001010101000001000100010001100010110100110001001011100011010000001010001001
Although it's not the point here, I just want to replace every 2 characters with something else and write the result into result.txt.
Use the file with the seek, not its contents.
Use an if statement to break out of the loop, as you do not have the length.
Use n += not n =+.
Finally, we seek +2 and read 2.
Hopefully this will get you close to what you want.
Note: I changed the file names for the example
the_file = open('test.txt',"rb")
result = open('result.txt',"w+")
n = 0
s = 2
while True:
the_file.seek(n)
chars = the_file.read(2)
if not chars:
break
#do something with chars
print chars
result.write(chars)
n +=2
the_file.close()
Note that because, in this case, you are reading the file sequentially in chunks, i.e. read(2) rather than read(), the seek is superfluous.
The seek() would only be required if you wished to alter the position pointer within the file, say for example if you wanted to start reading at the 100th byte (seek(99)).
The above could be written as:
the_file = open('test.txt',"rb")
result = open('result.txt',"w+")
while True:
chars = the_file.read(2)
if not chars:
break
#do something with chars
print chars
result.write(chars)
the_file.close()
You were trying to use the .seek() method on a string because you thought it was a file object, but the .read() method of a file returns its contents as a string.
Here's a general approach I might take to what you were going for:
# open the file and load its contents as a string file_contents
with open('E:\\test.txt', "r") as f:
    file_contents = f.read()

# do the stuff you were doing
n = 0
s = 2
m = len(file_contents)

# initialize a result string
result = ""

# iterate over the file_contents, incrementing by 2, adding to results
for i in xrange(0, m, 2):
    result += file_contents[i]

# write to results.txt
with open('E:\\result.txt', 'wb') as f:
    f.write(result)
Edit: It seems like there was a change to the question. If you want to change every second character, you'll need to make some adjustments, along the lines of the sketch below.
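A hedged sketch of that adjustment; 'X' here is an arbitrary placeholder replacement:
with open('E:\\test.txt') as f:
    chars = list(f.read())

for i in range(1, len(chars), 2):   # every second character
    chars[i] = 'X'                  # arbitrary replacement

with open('E:\\result.txt', 'w') as f:
    f.write(''.join(chars))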

Python doesn't deal with char string as expected

I am dealing with a long char string (elements with values 0..255), read directly from a file. I need to divide the string into chunks of 8 bytes. I'd expect this to work:
rawindex = file.read()
for chunk in rawindex[::8]:
    print sys.stderr, len(chunk)
...but the len() always returns 1. What am I doing wrong?
More info:
* this is not homework
* I could play with range(,,8), but I would really like to know why the above example doesn't work
The 'step' parameter in a slice just takes every 8th element, not 8 elements at once. Your code should look like:
import sys

step = 8
rawindex = file.read()
for index in range(0, len(rawindex), step):
    print >> sys.stderr, len(rawindex[index:index+step])
You could just read chunks of the correct size yourself.
read_chunk = lambda: my_file.read(8)
for chunk in iter(read_chunk, ''):
    print len(chunk)
rawindex = file.read()  # returns the file's contents as a string, e.g. rawindex = 'abcdefghij'
rawindex[::8]           # returns every eighth character of that string, so the result here is 'ai'
So effectively the for loop will be for chunk in 'ai':
In the first iteration chunk will be 'a', and len('a') is 1. So len(chunk) will always return 1.
I think you should use range instead of rawindex[::8].

how many unread bytes left in a file?

I periodically read 16-byte frames from a file. For the last frame, I need to know whether there is enough data left and whether the file is valid for my format.
f.read(16)
returns an empty string if there is no more data, or some data if there is at least 1 byte left.
How can I check how many unread bytes are left in a file?
For that, you'd have to know the size of the file. Using the file object, you could do the following:
f.seek(0, 2)
file_size = f.tell()
The variable file_size will then contain the size of your file in bytes. While reading, simply do file_size - f.tell() to get the number of bytes remaining. So:
Use seek(0, 2) and tell()
BUFF = 16
f = open("someFile", "r")
x = 0

# move to end of file
f.seek(0, 2)
# get current position
eof = f.tell()
# go back to start of file
f.seek(0, 0)

# some arbitrary loop
while x < 128:
    data = f.read(BUFF)
    x += len(data)
    # print how many unread bytes left
    unread = eof - x
    print unread
File Objects - Python Library Reference:
seek(offset[, whence]): Set the file's current position, like stdio's fseek(). The whence argument is optional and defaults to 0 (absolute file positioning); other values are 1 (seek relative to the current position) and 2 (seek relative to the file's end). There is no return value. Note that if the file is opened for appending (mode 'a' or 'a+'), any seek() operations will be undone at the next write. If the file is only opened for writing in append mode (mode 'a'), this method is essentially a no-op, but it remains useful for files opened in append mode with reading enabled (mode 'a+'). If the file is opened in text mode (without 'b'), only offsets returned by tell() are legal. Use of other offsets causes undefined behavior. Note that not all file objects are seekable.
tell(): Return the file's current position, like stdio's ftell().
Perhaps a little easier to use:
def LengthOfFile(f):
    """ Get the length of the file for a regular file (not a device file)"""
    currentPos = f.tell()
    f.seek(0, 2)           # move to end of file
    length = f.tell()      # get current position
    f.seek(currentPos, 0)  # go back to where we started
    return length

def BytesRemaining(f, f_len):
    """ Get number of bytes left to read, where f_len is the length of the file (probably from f_len=LengthOfFile(f) )"""
    currentPos = f.tell()
    return f_len - currentPos

def BytesRemainingAndSize(f):
    """ Get number of bytes left to read for a regular file (not a device file); returns a tuple of the bytes remaining and the total length of the file.
    If your code is going to be doing this a lot, use LengthOfFile and BytesRemaining instead of this function.
    """
    currentPos = f.tell()
    l = LengthOfFile(f)
    return l - currentPos, l

if __name__ == "__main__":
    f = open("aFile.data", 'r')
    f_len = LengthOfFile(f)
    print "f_len=", f_len
    print "BytesRemaining=", BytesRemaining(f, f_len), "=", BytesRemainingAndSize(f)
    f.read(1000)
    print "BytesRemaining=", BytesRemaining(f, f_len), "=", BytesRemainingAndSize(f)

Matching every word in a txt file

I'm working on a Project Euler problem (for fun).
It comes with a 46 KB txt file containing one line with a list of over 5000 names, in a format like this:
"MARIA","SUSAN","ANGELA","JACK"...
My plan is to write a method to extract every name and append them to a Python list. Is a regular expression the best weapon to tackle this problem?
I looked up the Python re docs, but I'm having a hard time figuring out the right regex.
If the format of the file is as you say it is, i.e.
It's a single line
The format is like this: "MARIA","SUSAN","ANGELA","JACK"
Then this should work:
>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = lines.next()
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']
That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.
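That said, since the question asks for the regex: a minimal sketch that grabs everything between double quotes (the file name is assumed from the answer above; csv remains the more robust choice for quoting edge cases):
import re

with open('words.txt') as f:
    names = re.findall(r'"([^"]*)"', f.read())

print(names[:4])   # ['MARIA', 'SUSAN', 'ANGELA', 'JACK']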
If you can do it simpler, then do it simpler. No need to use the csv module. I don't think 5000 names or 46 KB is enough to worry about.
names = []
f = open("names.txt", "r")
# In case there is more than one line...
for line in f.readlines():
    names += [x.strip().replace('"', '') for x in line.split(",")]
print names
# should print ['name1', ..., 'nameN']
A regexp will get the job done, but would be inefficient. Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least it has to load the whole file in and maintain the entire list of names in memory (which might not be a problem for you because that's a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick:
def parse_chunks(iter, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False
    buffer = ''
    for chunk in iter:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''
            else:
                buffer += byte
    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)
    # Yield the last bit in the buffer
    yield buffer

data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""

print list(parse_chunks(data))
# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']

# Use a fixed buffer size if you know the file has only one long line or
# don't care about line parsing
buffer_size = 4096
with open('myfile.txt', 'r', buffer_size) as file:
    for name in parse_chunks(file):
        print name
