I am using Google Protocol Buffers and Python to decode some large data files (200 MB each). I have some code below that shows how to decode a delimited stream, and it works just fine. However, it uses read(), which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read()  ## PROBLEM - LOADS ENTIRE FILE INTO MEMORY.
    n = 0
    while n < len(buf):
        msg_len, new_pos = _DecodeVarint32(buf, n)
        n = new_pos
        msg_buf = buf[n:n+msg_len]
        n += msg_len
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(msg_buf)
        # do something with read_row
        print(read_row)
Note that this code comes from another SO post, but I don't remember the exact URL. I was wondering if there is a readlines() equivalent for protocol buffers that would let me read in one delimited message at a time and decode it. I basically want a pipeline that is not limited by the amount of RAM I have available to load the file.
Seems like there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question. But I was wondering if there was any new information since then.
python example for reading multiple protobuf messages from a stream
If it is ok to load one full message at a time, this is quite simple to implement by modifying the code you posted:
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read(10)  # Maximum length of length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        buf = buf[new_pos:]
        # read rest of the message (max() guards against very short messages)
        buf += f.read(max(0, msg_len - len(buf)))
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(buf[:msg_len])
        buf = buf[msg_len:]
        # do something with read_row
        print(read_row)
        # read length prefix for next message
        buf += f.read(10 - len(buf))
This reads 10 bytes, which is enough to parse the length prefix, and then reads the rest of the message once its length is known.
String mutations are not very efficient in Python (they make a lot of copies of the data), so using bytearray can improve performance if your individual messages are also large.
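For example, here is an untested sketch of the same loop using bytearray; it assumes that _DecodeVarint32 accepts a bytearray in your protobuf version (if not, pass it a bytes copy):
import feed_pb2 as sfeed
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = bytearray(f.read(10))                      # maximum length of length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        del buf[:new_pos]                            # drop the varint prefix in place
        buf += f.read(max(0, msg_len - len(buf)))    # read rest of the message
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(bytes(buf[:msg_len]))  # ParseFromString wants bytes
        del buf[:msg_len]
        # do something with read_row
        print(read_row)
        buf += f.read(10 - len(buf))                 # length prefix for next message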
https://github.com/cartoonist/pystream-protobuf/ was updated 6 months ago. I haven't tested it much so far, but it seems to work fine without any need for an update. It provides optional gzip and async.
Related
I am using my own algorithm and loading data in JSON format from S3. Because of the huge size of the data, I need to set up pipe mode. I have followed the instructions given in: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/pipe_bring_your_own/train.py.
As a result, I am able to set up the pipe and read data successfully. The only problem is that the FIFO pipe is not reading the specified number of bytes. For example, given the path to the S3 FIFO channel,
number_of_bytes_to_read = 555444333

with open(fifo_path, "rb", buffering=0) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
The length of data should be 555444333 bytes, but it is always less, around 12,123,123 bytes or so. The data in S3 looks like the following:
s3://s3-bucket/1122/part1.json
s3://s3-bucket/1122/part2.json
s3://s3-bucket/1133/part1.json
s3://s3-bucket/1133/part2.json
and so on.
Is there any way to enforce the number of bytes to be read? Any suggestion will be helpful. Thanks.
We just needed to pass a positive value for buffering and the problem was solved. The code will buffer 555444333 bytes and then process 111222333 bytes at a time. Since our files are JSON, we can easily convert the incoming bytes to strings and then clean the strings by removing incomplete JSON parts. The final code looks like:
number_of_bytes_to_read = 111222333
number_of_bytes_to_buffer = 555444333

with open(fifo_path, "rb", buffering=number_of_bytes_to_buffer) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
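For illustration only, the cleanup step could look roughly like the sketch below; it assumes the records are newline-delimited JSON, which may not match the exact format:
import json

leftover = b""

def clean_and_parse(data):
    # Split a raw chunk into complete JSON lines, keeping any partial tail
    # for the next chunk (assumes one JSON record per line).
    global leftover
    buf = leftover + data
    lines = buf.split(b"\n")
    leftover = lines.pop()              # possibly incomplete last record
    return [json.loads(line) for line in lines if line.strip()]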
I am trying to extract data from a binary file where the data chunks are "tagged" with ASCII text. I need to find the word "tracers" in the binary file so I can read the next 4 bytes (int).
I am trying to simply loop over the lines, decoding them and checking for the text, which works. But I am having trouble seeking to the correct place in the file directly after the text (the seek_to_key function):
from io import BytesIO
import struct
binary = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc \xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n'
binary_data = BytesIO()
binary_data.write(binary)
binary_data.seek(0)
def seek_to_key(f, line_str, key):
    key_start = line_str.find(key)
    offset = len(line_str[key_start+len(key)].encode('utf-8'))
    f.seek(-offset, 1)
for line in binary_data:
    line_str = line.decode('utf-8', errors='replace')
    print(line_str)
    if 'tracers' in line_str:
        seek_to_key(binary_data, line_str, 'tracers')
        nfloats = struct.unpack('<i', binary_data.read(4))
        print(nfloats)
        break
Any recommendations on a better way to do this would be awesome!
It's not completely clear to me what you are trying to achieve. Please explain that in more detail if you want a better answer. What I understand from your current question and code is that you are trying to read the 32-bit number directly after the ASCII text 'tracers'. I'm guessing this is only the first step of your code, since the name nfloats suggests that you will be reading a number of floats in the next step ;-) But I'll try to answer this question only.
There are a number of problems with your code:
First of all, a simple typo: Instead of line_str[key_start+len(key)] you probably meant line_str[key_start+len(key):]. (You missed the colon.)
You are mixing binary and text data. Why do you decode the binary data as UTF-8? It clearly isn't UTF-8. You can't just "decode" binary data as UTF-8, slice a piece of it, and then re-encode that using UTF-8. In this case, the part after your marker is 518 bytes, but when encoded as UTF-8 it becomes 920 bytes. This messes up your offset calculation. Tip: you can search for binary data within binary data in Python :-) For example, b'Hello, world!'.find(b'world') returns 7. So you don't have to encode/decode the data at all.
You are reading line by line. Why is that? Lines are a concept of text files and don't have a real meaning in binary files. It could work, but that depends on the file format (which I don't know). In any case, your current code can only find one tracer per line. Is that intentional, or could there be more markers in each line? Anyway, if the file is small enough to fit in memory, it would be much easier to process the data in one chunk.
A minor note: you could write binary_data = BytesIO(binary) and avoid the additional write(). Also the seek(0) is not necessary.
Example code
I think the following code gives the correct result. I hope it will be a useful start to finish your application. Note that this code conforms to the Style Guide for Python Code and that all pylint issues were resolved (except for a too long line and missing docstrings).
import io
import struct
DATA = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc \xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n' # noqa
def find_tracers(data):
    start = 0
    while True:
        pos = data.find(b'tracers', start)
        if pos == -1:
            break
        num_floats = struct.unpack('<i', data[pos+7: pos+11])
        print(num_floats)
        start = pos + 11
def main():
    with io.BytesIO(DATA) as file:
        data = file.read()
        find_tracers(data)


if __name__ == '__main__':
    main()
I have a series of strings in a list named 'lines' and I compress them as follows:
import bz2
compressor = bz2.BZ2Compressor(compressionLevel)
for l in lines:
    compressor.compress(l)
compressedData = compressor.flush()
decompressedData = bz2.decompress(compressedData)
When compressionLevel is set to 8 or 9, this works fine. When it's any number between 1 and 7 (inclusive), the final line fails with an IOError: invalid data stream. The same occurs if I use the sequential decompressor. However, if I join the strings into one long string and use the one-shot compressor function, it works fine:
import bz2
compressedData = bz2.compress("\n".join(lines))
decompressedData = bz2.decompress(compressedData)
# Works perfectly
Do you know why this would be and how to make it work at lower compression levels?
You are throwing away the compressed data returned by compressor.compress(l). The docs say: "Returns a chunk of compressed data if possible, or an empty byte string otherwise." You need to do something like this:
# setup code goes here
for l in lines:
    chunk = compressor.compress(l)
    if chunk:
        do_something_with(chunk)
chunk = compressor.flush()
if chunk:
    do_something_with(chunk)
# teardown code goes here
Also note that your one-shot code uses "\n".join(); to check it against the chunked result, use "".join().
Also beware of bytes/str issues, e.g. the above should be b"whatever".join().
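Putting it together, a minimal runnable sketch (using bytes literals so it behaves the same in Python 2 and 3; the sample lines are made up):
import bz2

lines = [b"first line\n", b"second line\n"]   # stand-in for your list of strings

compressor = bz2.BZ2Compressor(5)             # any level 1-9 works with this approach
chunks = []
for l in lines:
    chunk = compressor.compress(l)
    if chunk:
        chunks.append(chunk)
chunks.append(compressor.flush())             # flush() returns the final chunk

compressed_data = b"".join(chunks)
assert bz2.decompress(compressed_data) == b"".join(lines)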
What version of Python are you using?
I have been reading the contents of a file which is continuously updated. I'm trying something like this:
from datetime import datetime

offset = 0
now = datetime.now()
FileName = str(now.date())
logfile = open(FileName, "r")
logfile.seek(offset)
data = logfile.read()
try:
    # http post the data here
    pass
except:
    # handle exceptions here
    pass
Now I want to read only a specific number of bytes from the file, because if I lose the Ethernet connection and then get it back, it takes a long time to read the whole file. Can someone help me with this?
You can use .read() with a numeric argument to read only a specific number of bytes, e.g. read(10) will read 10 bytes from the current position in the file.
http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
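For example, a rough sketch that reads and posts the file in fixed-size pieces (the chunk size here is arbitrary; FileName and offset come from your code):
CHUNK_SIZE = 4096

with open(FileName, "r") as logfile:
    logfile.seek(offset)
    while True:
        data = logfile.read(CHUNK_SIZE)   # read at most CHUNK_SIZE bytes
        if not data:
            break
        # http post `data` here, then remember how far we got
        offset += len(data)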
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
import struct

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes // bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
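For example (the dtype string '>i2' means big-endian signed 16-bit integers; 'filename' is the file from your question):
import numpy as np

values = np.fromfile('filename', dtype='>i2')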
I would read directly until EOF (that is, check for an empty string being returned), which then removes the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time:
values[i] = struct.unpack(endian, f.read(bytes))[0]
Why don't you read, say, 1024 bytes at a time?
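For example, a rough sketch (it assumes the file size is a multiple of 2 bytes):
import struct

values = []
with open('filename', 'rb') as f:
    while True:
        chunk = f.read(1024)
        if not chunk:
            break
        # unpack every short in this chunk with a single call
        values.extend(struct.unpack('>%dh' % (len(chunk) // 2), chunk))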
I have had the same kind of problem, although in my particular case I had to convert a very strange binary format file (500 MB) with interlaced blocks of 166 elements that were 3-byte signed integers; so I also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.
I resolved it using NumPy's memmap (it's just a handy way of using Python's memmap) and struct.unpack on large chunks of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
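A rough sketch of the general idea (not the actual code; it assumes 3-byte big-endian signed integers and ignores the interlaced block layout):
import numpy as np

def read_int24(filename, chunk_items=1000000):
    raw = np.memmap(filename, dtype=np.uint8, mode='r')   # lazy, file-backed view
    n_items = raw.size // 3
    out = np.empty(n_items, dtype=np.int32)
    for start in range(0, n_items, chunk_items):
        stop = min(start + chunk_items, n_items)
        block = raw[start * 3:stop * 3].reshape(-1, 3).astype(np.int32)
        # assemble the 24-bit big-endian values...
        vals = (block[:, 0] << 16) | (block[:, 1] << 8) | block[:, 2]
        # ...and sign-extend them to 32 bits
        vals -= (vals & 0x800000) << 1
        out[start:stop] = vals
    return out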
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2) with f being a bigish file are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (with disc sectors usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the one-at-a-time conversions of bytes to a short and the individual element assignments into numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer rather than one short at a time.
Consider:
#!/usr/bin/python

import random
import os
import struct
import numpy
import ctypes

def read_wopper(filename, bytes=2, endian='>h'):
    buf_size = 1024*2
    buf = ctypes.create_string_buffer(buf_size)
    new_buf = []

    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:
                break
            fmt = endian[0] + str(l/bytes) + endian[1]
            new_buf += struct.unpack_from(fmt, st)

    na = numpy.array(new_buf)
    return na

fn = 'bigintfile'

def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0
    try:
        for int in range(0, bytes/2):
            # The first 32,767 values are [0,1,2..0x7FFF]
            # to allow testing the read values with new_buf[value<0x7FFF]
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            f.write(struct.pack(endian, value & 0x7FFF))
    except IOError:
        print "file error"
    finally:
        f.close()

if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"
I created a file of random signed short ints of 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 signed 2-byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64-bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottleneck, I think, is the many individual calls to unpack, not your 2-byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space (and here) is not fragmented.
Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create, since the creation is one short at a time.