aws sagemaker training pipe mode reading random number of bytes - python

I am using my own algorithm and loading data in JSON format from S3. Because of the huge size of the data, I need to set up pipe mode. I have followed the instructions given in: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/pipe_bring_your_own/train.py.
As a result, I am able to set up the pipe and read data successfully. The only problem is that the FIFO pipe does not return the specified number of bytes. For example, given the path to the s3-fifo-channel,
number_of_bytes_to_read = 555444333
with open(fifo_path, "rb", buffering=0) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
The length of data should be 555444333 bytes, but it is always less: around 12,123,123 bytes. The data in S3 looks like this:
s3://s3-bucket/1122/part1.json
s3://s3-bucket/1122/part2.json
s3://s3-bucket/1133/part1.json
s3://s3-bucket/1133/part2.json
and so on.
Is there any way to enforce the number of bytes to be read? Any suggestion will be helpful. Thanks.

We just needed to pass a positive value for buffering and the problem was solved. The code will buffer 555444333 bytes and then process 111222333 bytes each time. Since our files are JSON, we can easily convert the incoming bytes to a string and then clean the string by removing incomplete JSON parts. The final code looks like:
number_of_bytes_to_read = 111222333
number_of_bytes_to_buffer = 555444333
with open(fifo_path, "rb", buffering=number_of_bytes_to_buffer) as fifo:
    while True:
        data = fifo.read(number_of_bytes_to_read)
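A likely reason the buffering change helps: with buffering=0, open() returns a raw stream whose read() issues a single system call and returns whatever the pipe currently holds, while a buffered reader keeps reading until the requested count or EOF. The same effect can be had on a raw stream by looping. The following is only a minimal sketch; read_exactly and ShortReadStream are hypothetical names introduced here for illustration (the latter stands in for the FIFO so the example runs anywhere).

```python
import io


def read_exactly(stream, size):
    """Read until `size` bytes are collected or the stream reaches EOF."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # empty result means EOF (the writer closed the pipe)
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReadStream:
    """Hypothetical stand-in for a FIFO: never returns more than `limit` bytes."""

    def __init__(self, payload, limit):
        self._inner = io.BytesIO(payload)
        self._limit = limit

    def read(self, size):
        return self._inner.read(min(size, self._limit))


demo = ShortReadStream(b"x" * 1000, limit=64)
first_try = demo.read(500)             # one read call: comes back short
full_chunk = read_exactly(demo, 500)   # loops until 500 bytes have arrived
tail = read_exactly(demo, 10**6)       # stops early at EOF
print(len(first_try), len(full_chunk), len(tail))
```

The final call shows that the helper can still return fewer bytes than requested once the stream ends, so callers should check the length of the last chunk.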

Related

Deserializing messages without loading entire file into memory?

I am using Google Protocol Buffers and Python to decode some large data files--200MB each. I have some code below that shows how to decode a delimited stream and it works just fine. However it uses the read() command which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read() ## PROBLEM-LOADS ENTIRE FILE TO MEMORY.
    n = 0
    while n < len(buf):
        msg_len, new_pos = _DecodeVarint32(buf, n)
        n = new_pos
        msg_buf = buf[n:n+msg_len]
        n += msg_len
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(msg_buf)
        # do something with read_metric
        print(read_row)
Note that this code comes from another SO post, but I don't remember the exact url. I was wondering if there was a readlines() equivalent with protocol buffers that allows me to read in one delimited message at a time and decode it? I basically want a pipeline that is not limited by the RAM I have to load the file.
Seems like there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question. But I was wondering if there was any new information since then.
python example for reading multiple protobuf messages from a stream
If it is ok to load one full message at a time, this is quite simple to implement by modifying the code you posted:
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read(10) # Maximum length of length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        buf = buf[new_pos:]
        # read rest of the message (nothing more if it is already in buf)
        buf += f.read(max(0, msg_len - len(buf)))
        read_row = sfeed.standard_feed()
        # parse only this message; buf may already hold the start of the next one
        read_row.ParseFromString(buf[:msg_len])
        buf = buf[msg_len:]
        # do something with read_metric
        print(read_row)
        # read length prefix for next message
        buf += f.read(10 - len(buf))
This reads 10 bytes, which is enough to parse the length prefix, and then reads the rest of the message once its length is known.
Concatenating and slicing bytes objects is not very efficient in Python (each operation copies the data), so using a bytearray can improve performance if your individual messages are also large.
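The same idea can be written as a generator that avoids the rolling buffer by decoding the length prefix one byte at a time. This is a sketch using only the standard library: read_varint, iter_delimited and encode_varint are hypothetical helpers written for this example, not part of the protobuf package, and it assumes the stream is buffered so that read(size) returns the full size except at EOF. Each yielded payload would then be passed to ParseFromString.

```python
import io


def read_varint(stream):
    """Decode one base-128 varint from `stream`; return None at a clean EOF."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # EOF before any byte of a new length prefix
            raise EOFError("stream ended inside a length prefix")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last varint byte
            return result
        shift += 7


def iter_delimited(stream):
    """Yield the raw bytes of each length-prefixed message, one at a time."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a message")
        yield payload


def encode_varint(value):
    """Encode a non-negative int as a varint (only used to build test data)."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


sample = b"".join(encode_varint(len(m)) + m for m in (b"abc", b"", b"x" * 300))
messages = list(iter_delimited(io.BytesIO(sample)))
print([len(m) for m in messages])
```

Only one message is held in memory at a time, so memory use is bounded by the largest single message rather than by the file size.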
https://github.com/cartoonist/pystream-protobuf/ was updated 6 months ago. I haven't tested it much so far, but it seems to work fine without any need for an update. It provides optional gzip and async.

Open() command buffer manual buffer manipulation does not work

I am currently creating a data logging function with the Raspberry Pi, and I am unsure as to whether I have found a slight bug. The code I am using is as follows:
import sys, time, os
_File = 'TemperatureData1.csv'
_newDir = '/home/pi/Documents/Temperature Data Logs'
_directoryList = os.listdir(_newDir)
os.chdir(_newDir)
# Here I am specifying the file, that I want to write to it, and that
# I want to use a buffer of 5kb
output = open(_File, 'w', 5000)
try:
    while (1):
        output.write('hi\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)
What I have found is that when I periodically view the created file properties, the file size increases in accordance with the default buffer setting 8192 bytes, and not the 5kb that I have specified. However, when I run the exact same program in Python 2.7.13, the buffer size changes to 5kb as requested.
I was wondering if anyone else had experienced this and had any ideas on a solution to getting the program working on Python 3.6.3? Thanks in advance. I can work with the problem on python 2.7.13, it is my pure curiosity which has led to me posting this question.
Python's definition of open in version 2 is what you are using:
open(name[, mode[, buffering]])
In Python 3, the signature of open is longer. buffering is still accepted as the third positional argument, though it is usually passed as a keyword argument:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
The docs have the following note:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
That special 8192 number is simply 2^13.
I would suggest trying buffering=5000.
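To see which layers the buffering argument produces in Python 3, the types returned by open() can be inspected. This is a small sketch; the temporary file and variable names are made up for the demonstration.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(b"hello\n")

    # buffering=0: raw, unbuffered file object (binary mode only)
    with open(path, "rb", buffering=0) as raw:
        raw_type = type(raw)
    # buffering > 1 in binary mode: a buffered reader with that buffer size
    with open(path, "rb", buffering=5000) as buffered:
        buffered_type = type(buffered)
    # text mode: a text wrapper sitting on top of a binary buffer
    with open(path, "r", buffering=5000) as text:
        text_type = type(text)
        inner_type = type(text.buffer)

print(raw_type.__name__, buffered_type.__name__,
      text_type.__name__, inner_type.__name__)
print("default buffer size:", io.DEFAULT_BUFFER_SIZE)
```

In text mode the object handed back is the text wrapper, and the buffer sized by the buffering argument is the binary layer underneath it, reachable as text.buffer.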
I have done some more research, and managed to find some reasons why setting buffering to a value greater than 1 does not set the buffer to the desired size (in bytes) in Python 3 or above.
It seems to be because the io library uses two buffers when working with files: a text buffer and a binary buffer. In text mode, the file is flushed according to the text buffer, which does not appear to be adjustable with buffering > 1. Instead, the buffering argument sets the size of the binary buffer that sits underneath the text buffer, so the buffering argument does not work the way the programmer expects. This is further explained in the following link:
https://bugs.python.org/issue30718
There is, however, a workaround: open the file in binary mode rather than text mode, then wrap it in io.TextIOWrapper to write text to a txt or csv file through the binary buffer. The workaround is as follows:
import sys, time, os, io
_File = 'TemperatureData1.csv'
# Open or overwrite the file _File, and buffer writes in RAM (700 bytes here)
# before data is saved to disk.
output = open(_File, mode='wb', buffering=700)
output = io.TextIOWrapper(output, write_through=True)
try:
    while (1):
        output.write('h\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)
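One way to check that the binary buffer size is being honored is to compare the on-disk size before and after closing the file. The sketch below reuses the 700-byte value from the workaround; the temporary path, the loop count and newline="" (used so that byte counts do not depend on platform newline translation) are assumptions made for the demonstration.

```python
import io
import os
import tempfile

BUFFER_BYTES = 700  # same value as in the workaround

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.csv")
    binary = open(path, mode="wb", buffering=BUFFER_BYTES)
    # write_through=True hands each write straight to the binary buffer
    # instead of collecting it in the text layer first.
    output = io.TextIOWrapper(binary, write_through=True, newline="")
    for _ in range(500):              # 500 writes of 2 bytes = 1000 bytes
        output.write("h\n")
    size_before_close = os.path.getsize(path)
    output.close()                    # flushes whatever is still buffered
    size_after_close = os.path.getsize(path)

print(size_before_close, size_after_close)
```

Before the close, only part of the data has reached the disk, because the remainder is still waiting in the 700-byte binary buffer.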

Buffer size VS File size when reading binary file in Python

buffersize=50000
infile = open('in.jpg','rb')
outfile = open('out.jpg','wb')
buffer = infile.read(buffersize)
while len(buffer):
    outfile.write(buffer)
    buffer = infile.read(buffersize)
I am learning the basics of reading / writing binary files in Python, and am trying to understand this code.
I'd appreciate any help on understanding this code.
Thank you!
Q1: Is 50000 in buffersize equivalent to 50kb? (in.jpg is about 150kb)
Q2: How is the next increment of data (i.e. the next 50,000 bytes) read from the input file?
(The first 50,000 bytes are read and stored before the while loop, then written into the output file;
how are the next 50,000 bytes read without incrementing any range?)
Q3: len(buffer) means the size of buffer (file object). When does this turn false in while loop?
The documentation answers all your questions:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
1: yes. The size parameter is interpreted as a number of bytes.
2: infile.read(50000) means "read (at most) 50000 bytes from infile". The second time you invoke this method, it will automatically read the next 50000 bytes from the file.
3: buffer is not the file but what you last read from the file. len(buffer) will be 0, which is falsy, when buffer is empty, i.e. when there's no more data to read from the file.
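The three answers can be observed directly by recording what each read() returns. The sketch below uses an in-memory io.BytesIO as a stand-in for in.jpg; copy_in_chunks and the 130,000-byte payload are made up for the demonstration.

```python
import io


def copy_in_chunks(infile, outfile, chunk_size=50000):
    """Copy infile to outfile chunk by chunk; return the size of each chunk."""
    sizes = []
    while True:
        chunk = infile.read(chunk_size)  # continues from the current position
        if not chunk:                    # empty bytes object: EOF reached
            break
        outfile.write(chunk)
        sizes.append(len(chunk))
    return sizes


source = io.BytesIO(b"\xff" * 130_000)   # stand-in for the input file
target = io.BytesIO()
chunk_sizes = copy_in_chunks(source, target)
print(chunk_sizes, source.tell())
```

The file object keeps track of its own position (visible through tell()), which is why no explicit increment is needed, and the last chunk is shorter than the requested size because the read hit EOF.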

http post, byte size limitation on post - python

I have been reading the contents of a file which is continuously updated. I'm trying something like this:
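from datetime import datetime

offset = 0
now = datetime.now()
FileName = str(now.date())
logfile = open(FileName, "r")  # a file opened in "a" (append) mode cannot be read
logfile.seek(offset)
data = logfile.read()
try:
    pass  # http post
except Exception:
    pass  # Exceptions...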
Now I want to read only a specific number of bytes from the file, because if I lose the Ethernet connection and then reconnect, it takes a long time to read the whole file. Can someone help me with this?
You can use .read() with a numeric argument to read only a specific number of bytes, e.g. read(10) will read 10 bytes from the current position in the file.
http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
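Combined with the offset from the question, a bounded read might look like the following sketch. read_next_chunk is a hypothetical helper, and the temporary file stands in for the log file; in real use the offset would presumably be advanced only after the HTTP post succeeds, so that a failed post retries the same bytes.

```python
import os
import tempfile


def read_next_chunk(path, offset, max_bytes):
    """Return (data, new_offset) for at most max_bytes starting at offset."""
    with open(path, "rb") as logfile:
        logfile.seek(offset)
        data = logfile.read(max_bytes)
    return data, offset + len(data)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log")
    with open(path, "wb") as f:
        f.write(b"0123456789abcdef")

    first, offset = read_next_chunk(path, 0, 10)
    second, offset = read_next_chunk(path, offset, 10)

print(first, second, offset)
```

Opening in binary mode keeps the offset a plain byte count, which makes it safe to store and reuse between reads.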

How can I work with Gzip files which contain extra data?

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
import gzip
import zlib

# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()
# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()
# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])
# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
# ([10:] skips the fixed 10-byte gzip header; assumes no optional header fields)
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)
# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, eg:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, eg. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of these answers resolved my issue. So, here is what I did to solve the problem:
#for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
#for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
#automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
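Building on the wbits values above, a decompression object can also report where the compressed data ends through its unused_data attribute, which makes it possible to separate the padding from the payload. The following is a sketch under stated assumptions: gunzip_with_trailing_bytes is a hypothetical helper name, and it treats anything after the last member that does not start with the gzip magic bytes as padding (padding that happens to begin with those two bytes would be misread as another member and raise zlib.error).

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_with_trailing_bytes(blob):
    """Decompress one or more gzip members; return (payload, trailing_bytes)."""
    if not blob.startswith(GZIP_MAGIC):
        raise ValueError("not a gzip stream")
    parts = []
    rest = blob
    while rest.startswith(GZIP_MAGIC):
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        parts.append(decompressor.decompress(rest))
        parts.append(decompressor.flush())
        if not decompressor.eof:
            raise ValueError("truncated gzip stream")
        rest = decompressor.unused_data  # bytes after this member
    return b"".join(parts), rest


member = gzip.compress(b"testing\n")
padded = member + member + b"\x00\x00padding"
payload, trailing = gunzip_with_trailing_bytes(padded)
print(payload, trailing)
```

This handles both cases discussed in the thread: concatenated gzip members are decompressed in sequence, and extra bytes after the last member are returned separately instead of raising an error.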
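I couldn't get it to work with the techniques mentioned above, so I made a workaround using the zipfile package (which reads ZIP archives, so this applies when the data is actually a ZIP file rather than a gzip stream):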
import zipfile
from io import BytesIO
mock_file = BytesIO(data) #data is the compressed string
z = zipfile.ZipFile(file = mock_file)
neat_data = z.read(z.namelist()[0])
Works perfectly.
