Find a string in a binary file - python

I am trying to extract data from a binary file where the data chunks are "tagged" with ASCII text. I need to find the word "tracers" in the binary file so I can read the next 4 bytes (int).
I am trying to simply loop over the lines, decoding them and checking for the text, which works. But I am having trouble seeking to the correct place in the file directly after the text (the seek_to_key function):
from io import BytesIO
import struct
binary = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc \xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n'
binary_data = BytesIO()
binary_data.write(binary)
binary_data.seek(0)
def seek_to_key(f, line_str, key):
    key_start = line_str.find(key)
    offset = len(line_str[key_start+len(key)].encode('utf-8'))
    f.seek(-offset, 1)

for line in binary_data:
    line_str = line.decode('utf-8', errors='replace')
    print(line_str)
    if 'tracers' in line_str:
        seek_to_key(binary_data, line_str, 'tracers')
        nfloats = struct.unpack('<i', binary_data.read(4))
        print(nfloats)
        break
Any recommendations on a better way to do this would be awesome!

It's not completely clear to me what you are trying to achieve. Please explain that in more detail if you want a better answer. What I understand from your current question and code is that you are trying to read the 32-bit number directly after the ASCII text 'tracers'. I'm guessing this is only the first step of your code, since the name `nfloats` suggests that you will be reading a number of floats in the next step ;-) But I'll try to answer this question only.
There are a number of problems with your code:
First of all, a simple typo: Instead of line_str[key_start+len(key)] you probably meant line_str[key_start+len(key):]. (You missed the colon.)
You are mixing binary and text data. Why do you decode the binary data as UTF-8? It clearly isn't UTF-8. You can't just "decode" binary data as UTF-8, slice a piece of it, and then re-encode that as UTF-8. In this case, the part after your marker is 518 bytes, but when encoded as UTF-8 it becomes 920 bytes, which messes up your offset calculation. Tip: you can search for binary data within binary data in Python :-) For example, b'Hello, world!'.find(b'world') returns 7. So you don't have to encode/decode the data at all.
You are reading line by line. Why is that? Lines are a concept of text files and have no real meaning in binary files. It could work, but that depends on the file format (which I don't know). In any case, your current code can only find one 'tracers' marker per line. Is that intentional, or could there be more markers in each line? Anyway, if the file is small enough to fit in memory, it is much easier to process the data in one chunk.
A minor note: you could write binary_data = BytesIO(binary) and avoid the additional write(); the seek(0) then becomes unnecessary as well.
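For example, a minimal one-step equivalent of the setup above:
binary_data = BytesIO(binary)  # the stream starts out positioned at offset 0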
Example code
I think the following code gives the correct result. I hope it will be a useful start to finish your application. Note that this code conforms to the Style Guide for Python Code and that all pylint issues were resolved (except for a too long line and missing docstrings).
import io
import struct
DATA = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc \xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n' # noqa
def find_tracers(data):
    start = 0
    while True:
        pos = data.find(b'tracers', start)
        if pos == -1:
            break
        # len(b'tracers') == 7, so the int32 occupies bytes pos+7 .. pos+11
        num_floats = struct.unpack('<i', data[pos+7:pos+11])[0]  # unpack() returns a tuple
        print(num_floats)
        start = pos + 11

def main():
    with io.BytesIO(DATA) as file:
        data = file.read()
        find_tracers(data)

if __name__ == '__main__':
    main()

Related

Deserializing messages without loading entire file into memory?

I am using Google Protocol Buffers and Python to decode some large data files--200MB each. I have some code below that shows how to decode a delimited stream and it works just fine. However it uses the read() command which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read()  # PROBLEM: loads the entire file into memory
    n = 0
    while n < len(buf):
        msg_len, new_pos = _DecodeVarint32(buf, n)
        n = new_pos
        msg_buf = buf[n:n+msg_len]
        n += msg_len
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(msg_buf)
        # do something with read_row
        print(read_row)
Note that this code comes from another SO post, but I don't remember the exact url. I was wondering if there was a readlines() equivalent with protocol buffers that allows me to read in one delimited message at a time and decode it? I basically want a pipeline that is not limited by the RAM I have to load the file.
Seems like there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question. But I was wondering if there was any new information since then.
python example for reading multiple protobuf messages from a stream
If it is ok to load one full message at a time, this is quite simple to implement by modifying the code you posted:
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read(10)  # maximum length of a varint length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        buf = buf[new_pos:]
        # read the rest of the message (only read when bytes are actually
        # missing; f.read() with a negative count would read to EOF)
        if msg_len > len(buf):
            buf += f.read(msg_len - len(buf))
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(buf[:msg_len])
        buf = buf[msg_len:]
        # do something with read_row
        print(read_row)
        # read the length prefix for the next message
        buf += f.read(10 - len(buf))
This reads 10 bytes, which is enough to parse the length prefix, and then reads the rest of the message once its length is known.
String mutations are not very efficient in Python (they make a lot of copies of the data), so using bytearray can improve performance if your individual messages are also large.
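For instance, here is a minimal sketch of that idea as a hypothetical read_messages generator (it assumes the same feed_pb2 module from the question, and that _DecodeVarint32 accepts a bytearray, which it indexes like bytes):

import feed_pb2 as sfeed
from google.protobuf.internal.decoder import _DecodeVarint32

def read_messages(path):
    buf = bytearray()
    with open(path, 'rb') as f:
        buf += f.read(10)  # enough for any varint length prefix
        while buf:
            msg_len, new_pos = _DecodeVarint32(buf, 0)
            del buf[:new_pos]  # drop the prefix in place instead of copying the tail
            if msg_len > len(buf):
                buf += f.read(msg_len - len(buf))  # read the rest of the message
            msg = sfeed.standard_feed()
            msg.ParseFromString(bytes(buf[:msg_len]))
            del buf[:msg_len]  # drop the consumed message in place
            yield msg
            buf += f.read(10 - len(buf))  # refill for the next prefix

Usage would then be: for msg in read_messages('/home/working/data/feed.pb'): print(msg)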
https://github.com/cartoonist/pystream-protobuf/ was updated 6 months ago. I haven't tested it much so far, but it seems to work fine without any need for an update. It provides optional gzip and async.

What is a Pythonic way to detect that the next read will produce an EOF in Python 3 (and Python 2)

Currently, I am using
def eofapproached(f):
    pos = f.tell()
    near = f.read(1) == ''
    f.seek(pos)
    return near
to detect if a file open in 'r' mode (the default) is "at EOF" in the sense that the next read would produce the EOF condition.
I might use it like so:
f = open('filename.ext') # default 'r' mode
print(eofapproached(f))
FYI, I am working with some existing code that stops when EOF occurs, and I want my code to do some action just before that happens.
I am also interested in any suggestions for a better (e.g., more concise) function name. I thought of eofnear, but that does not necessarily convey as specific a meaning.
Currently, I use Python 3, but I may be forced to use Python 2 (part of a legacy system) in the future.
You can use f.tell() to find out your current position in the file.
The problem is that you need to find out how big the file is.
The naive (and efficient) solution is os.path.getsize(filepath), comparing that to the result of tell(), but getsize() returns the size in bytes, which is only reliable when reading in binary mode ('rb'), as your file may contain multi-byte characters.
Your best option is to seek to the end and back to find out the size:
def char_count(f):
    current = f.tell()
    f.seek(0, 2)  # seek to the end of the file
    end = f.tell()
    f.seek(current)  # restore the original position
    return end

def chars_left(f, length=None):
    if length is None:  # 'if not length' would wrongly recompute when length == 0
        length = char_count(f)
    return length - f.tell()
Preferably, run char_count once at the beginning, and then pass that into chars_left. Seeking isn't efficient, but you need to know how long your file is in characters and the only way is by reading it.
If you are reading line by line, and want to know before reading the last line, you also have to know how long your last line is to see if you are at the beginning of the last line.
If you are reading line by line and only want to know whether the next line read will result in an EOF, then when chars_left(f, total) == 0 you know you are there (no more lines left to read).
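A short usage sketch of those helpers (the filename is hypothetical; note that in Python 3 text mode, tell() returns an opaque cookie, so treating the difference as a character count is only safe for single-byte encodings or binary mode):

with open('filename.ext') as f:
    total = char_count(f)
    line = f.readline()  # readline() keeps tell() usable, unlike iterating with next()
    while line:
        if chars_left(f, total) == 0:
            print('the next read will produce EOF')
        line = f.readline()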
I've formulated this code to avoid the use of tell (perhaps using tell is simpler):
import os

class NearEOFException(Exception):
    pass

def tellMe_before_EOF(filePath, chunk_size):
    fileSize = os.path.getsize(filePath)
    chunks_num = fileSize // chunk_size  # how many chunks can we read from the file?
    f = open(filePath)
    if chunks_num == 0:
        raise NearEOFException("File is near EOF")
    for i in range(chunks_num - 1):
        yield f.read(chunk_size)
    else:
        raise NearEOFException("File is near EOF")

if __name__ == "__main__":
    g = tellMe_before_EOF("xyz", 3)  # read in chunks of 3 chars
    while True:
        print(next(g), end='')  # near EOF, raises NearEOFException
The naming of the function is debatable; naming things is tedious, and I'm just not good at it.
The function works like this: take the size of the file and see approximately how many N-sized chunks we can read from it, storing that in chunks_num. This simple division gets us near EOF; the question is where you consider "near EOF" to be. Near the last char, for example, or near the last n characters? Maybe that's something to keep in mind if it matters.
Trace through this code to see how it works.
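As an illustration, here is such a trace for a hypothetical 10-character file named "xyz" read in chunks of 3 (the file and its size are assumptions for the example):

# fileSize = 10, chunk_size = 3  ->  chunks_num = 10 // 3 = 3
g = tellMe_before_EOF("xyz", 3)
print(next(g))  # chars 0-2: first pass of range(chunks_num - 1) == range(2)
print(next(g))  # chars 3-5: second and last yielded chunk
print(next(g))  # the for/else raises NearEOFException, 4 chars before actual EOF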

python limit must be integer

I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading csv data file
import sys
import csv

maxInt = sys.maxsize
decrement = True

while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)  # <-- the TypeError is raised here
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs in the marked line. I have only added the extra bit above the csv read statement because the file was too large to read normally. Alternatively, I could convert the file from csv to a plain text file, but I'm not sure whether that would corrupt the data further. I can't actually inspect any of the data, as the file is >2GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter much, whether it's 2KB or 2GB.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to debug this are to run print sys.maxsize, or to open a Python interpreter, import sys and csv, and then run csv.field_size_limit(sys.maxsize). If you get a terribly small number or an exception, you may have a bad install of Python. Otherwise, try a simpler version of your file: maybe the first line, or the first several lines and just one column. See if you can reproduce the smallest possible failing case and remove the variability of your system and the file size.
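On a healthy 64-bit Python 3 install, that interactive check looks roughly like this (131072 is the default field size limit I'd expect; field_size_limit() returns the previous limit when given a new one):

>>> import sys, csv
>>> sys.maxsize
9223372036854775807
>>> csv.field_size_limit(sys.maxsize)
131072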
On Windows 7 64-bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application, and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question

Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20GB file and outputting lines that meet a certain condition to another file; however, occasionally Python will read in 2 lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The position in the file of the first anomaly is around the 4GB mark.
A quick Google search for "python reading files larger than 4gb" yielded many, many results. See here for such an example, and another one that takes over from the first.
It's a bug in Python.
Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around: note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python itself); with 2.7, you may use io.open().
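A minimal sketch of that work-around on Python 2.7 (inputFileName is the variable from the question; the encoding is a placeholder; io.open() reads the file in binary at the OS level and does newline translation in Python, bypassing the buggy CRT text-mode layer):

import io

with io.open(inputFileName, 'r', encoding='ascii') as inputFileHandle:
    for line in inputFileHandle:
        pass  # process the line as before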
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')

for row, line in enumerate(inputFileHandle, 1):  # start=1 keeps the original 1-based row numbers
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

Reading non-text files into Python

I want to read in a non text file. It has an extension ".map" but can be opened by notepad. How should I open this file through python?
file = open("path-to-file","r") doesn't work for me; it raises a No such file or directory error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me a No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary, or because I don't have some permission?
Would appreciate help with this!
Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
Look closely at the error message: the filename it reports is not the one you passed, because \b in an ordinary string literal is interpreted as a backspace character (\x08). Either double the backslashes or use a raw string:
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map", "rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map", "rb")
You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying attribute 'b' for binary, as others have already indicated. Unix/Linux are a bit more relaxed about this. Omitting attribute 'b' will still open a binary file.
The same behavior is exhibited in the C library's fopen() call.
If it's a non-text file, you could try opening it in binary mode. Try this:
with open("path-to-file", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
The with statement handles opening and closing the file, including if an exception is raised in the inner block.
Of course, since the format is binary, you need to know what you are going to do with the bytes you read. Also, here I read 1 byte at a time; you can use bigger chunk sizes too.
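For example, a sketch of the same loop reading in larger chunks (the 4096-byte chunk size is an arbitrary choice for the example):

with open("path-to-file", "rb") as f:
    while True:
        chunk = f.read(4096)  # read up to 4 KB at a time
        if not chunk:  # an empty result means EOF
            break
        # Do stuff with chunk.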
UPDATE: Maybe this is not a binary file. You might be having problems with the file encoding: the characters might not be ASCII, or they might belong to another Unicode charset. Try this:
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish, since the terminal might not support this charset. I would advise you to go ahead and process the text, assuming it has been opened properly.
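On Python 3, the equivalent (assuming the encoding really is UTF-8) is just the built-in open():

with open('path-to-file', 'r', encoding='utf-8') as f:
    print(f.read())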
