The size parameter for gzip.open().read() - python

When working with the gzip library in Python, I often come across code that uses the .read() function in a pattern that looks like this:
    with gzip.open(filename) as bytestream:
        bytestream.read(16)
        buf = bytestream.read(
            IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
        )
        data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.
This is the documentation for the read() function:
Read at most n characters from stream.
Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.
If that is the case, the role of the first line, bytestream.read(16), must be to read and thereby skip the first 16 characters, presumably because they act as metadata or a header. However, given some images, how would I know to use 16 as the argument for the read call instead of, say, 8, 32, or 64?
I recall often coming across code identical to the above except that the author used bytestream.read(8) instead of bytestream.read(16), or some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.
In other words, how does one determine the parameter to use in the read call? Or how does one know the length of the header in a gzip-compressed file?
My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.
Reproducible details
My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the second stores the remainder in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there are no discernible patterns suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.
The following code reads the data from this website and extracts the first 30 characters. Notice that it is not apparent where the header "ends" (at the 16th character apparently, after the second appearance of \x1c) and the data begins:
    import gzip
    import numpy as np

    train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
    IMAGE_SIZE = 28
    NUM_CHANNELS = 1

    def extract_data(filename, num_images):
        with gzip.open(filename) as bytestream:
            first30 = bytestream.read(30)
        return first30

    first30 = extract_data(train_data_filename, 10)
    print(first30)
    # prints: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
If we modify the code to cast them as np.float32, so that all values are numeric floats, there is again no apparent pattern to distinguish where the header / metadata ends and where the data begins.
Any reference or advice would be very appreciated!

From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.
Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.
That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  60000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
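As a rough sketch of what those 16 bytes contain, the header can be decoded instead of discarded. The four fields are big-endian 32-bit unsigned integers, so struct.unpack(">IIII", ...) reads them in one call. The function name read_idx3 is made up for illustration, and the code assumes the file follows the image-file layout quoted above:

```python
import gzip
import struct


def read_idx3(path_or_file):
    """Return ((magic, num_images, rows, cols), pixel_bytes) from a gzipped
    MNIST image file. Hypothetical helper, not part of any library."""
    with gzip.open(path_or_file, "rb") as bytestream:
        # Four big-endian unsigned 32-bit integers = 16 header bytes.
        header = struct.unpack(">IIII", bytestream.read(16))
        magic, num_images, rows, cols = header
        if magic != 2051:
            raise ValueError("not an MNIST image file: magic=%d" % magic)
        # Everything after the header is one unsigned byte per pixel.
        pixels = bytestream.read(num_images * rows * cols)
    return header, pixels


# The first 16 bytes printed in the question decode to the documented values.
first16 = b"\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c"
print(struct.unpack(">IIII", first16))  # (2051, 60000, 28, 28)
```

Code that calls read(8) instead is most likely reading the label files, whose header in the same specification has only two 32-bit fields (magic number and item count).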

From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The documentation you quoted says read() reads at most n characters from the stream; because gzip.open() opens the file in binary mode by default, each of those "characters" is one byte, so 16 characters are 16 bytes.
See more on characters and bytes: https://pymotw.com/3/gzip/#reading-compressed-data
The code snippet is primarily interested in the contents of buf, and skips the first 16 bytes of the stream. To determine the parameter that goes into the first bytestream.read() call, that is, how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, what file we are reading and what we are trying to accomplish with the numpy(?) library (saving RGB images in a 1D numpy array?).
I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a specific solution to the specific problem of processing one particular compressed image file. Thus, it is hard to tell how many bytes to skip without seeing more code and understanding more of the logic behind the snippet.

Related

Parsing IFF-style data using Python

I have an IFF-style file (see below) whose contents I need to inspect in Python.
https://en.wikipedia.org/wiki/Interchange_File_Format
I can iterate through the file using the following code
    from chunk import Chunk

    def chunks(f):
        while True:
            try:
                c = Chunk(f, align=False, bigendian=False)
                yield c
                c.skip()
            except EOFError:
                break

    if __name__ == "__main__":
        for c in chunks(open("sample.iff", 'rb')):
            name, sz, value = c.getname(), c.getsize(), c.read()
            print(name, sz, value)
Now I need to parse the different values. I have had some success using Python's 'struct' module, unpacking different fields as follows
struct.unpack('<I', value)
or
struct.unpack('BBBB', value)
by experimenting with different formatting characters shown in the struct module documentation
https://docs.python.org/2/library/struct.html
This works with some of the simpler fields but not with the more complex ones. It is all very trial-and-error. What I need is some systematic way of unpacking the different values, some way of knowing or inspecting the type of data they represent. I am not a C datatype expert.
Any ideas ? Many thanks.
SVOXVERS BVER BPM }SPEDTGRDGVOL`NAME2017-02-15 16-38MSCLMZOOMXOFMYOFLMSKCURLTIMESELSLGENPATNPATTPATLPDTAa � 1pQ 10 `q !#QP! 0A �`A PCHNPLIN PYSZ PFLGPICO �m�!�a��Q�1:\<<<<:\�1�Q��a�!�mPFGCPBGC���PFFFPXXXPYYYPENDSFFFCSNAM OutputSFINSRELSXXXDSYYYhSZZZSSCLSVPRSCOL���SMICSMIB����SMIP����SLNK����SENDSFFFISNAM FMSTYPFMSFINSRELSXXX�SYYY8SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNKCVAL�CVAL0CVAL�CVALCVALCVALCVALCVALGCVALnCVAL\CVALCVAL&CVALoCVALDCVALCVALCVALCMID������������������SENDSFFFQSNAM EchoSTYPEchoSFINSRELSXXX�SYYY SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVALCVALCVALCMID0������SENDSFFFQSNAM ReverbSTYPReverbSFINSRELSXXX\SYYY�SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVAL�CVALCVALCVALCVALCVALCMIDH���������SENDSENDSENDSENDSEND
If it's really an IFF file, it needs alignment and big-endian turned on, and the file would contain a single FORM chunk that in turn contains the FORM type such as SVOX and the contents chunks. (Or it could contain a LIST or CAT container chunk.)
An IFF chunk has:
A four-character chunk-type code
A four-byte big-endian integer: length
length number of data bytes
A pad byte for alignment if length is odd
This is documented in "EA IFF 85". See the "EA IFF-85" Repository for the original IFF docs. [I wrote them.]
Some file formats like RIFF and PNG are variations on the IFF design, not conforming applications of the IFF standard. They vary the chunk format details, which is why Python's Chunk reader library lets you pick alignment, endian, and when to recurse into chunks.
By looking at your file in a hex/ascii dump and mapping out the chunk spans, you should be able to deduce whether it uses big-endian or little-endian length fields, whether each odd-length chunk is followed by a pad byte for alignment, and whether there are chunks within chunks.
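To make that mapping exercise concrete, here is a minimal sketch of a chunk walker that uses only the struct module. The function name iter_chunks and its big_endian and align switches are illustrative assumptions; they exist so you can try each combination against your file and see which one produces sensible chunk boundaries:

```python
import struct


def iter_chunks(buf, big_endian=True, align=True):
    """Yield (chunk_type, data) pairs from a bytes object laid out as
    4-byte type code, 4-byte length, data, optional pad byte."""
    fmt = ">4sI" if big_endian else "<4sI"
    pos = 0
    while pos + 8 <= len(buf):
        ctype, length = struct.unpack_from(fmt, buf, pos)
        start = pos + 8
        end = start + length
        if end > len(buf):
            # A length running past the end usually means the wrong
            # endianness or alignment guess.
            raise ValueError("chunk %r overruns the buffer" % ctype)
        yield ctype, buf[start:end]
        # Strict IFF pads odd-length chunks to an even boundary.
        pos = end + (length & 1 if align else 0)


# Tiny hand-built example: one odd-length chunk (padded) and one even one.
sample = (b"NAME" + struct.pack(">I", 3) + b"abc" + b"\x00"
          + b"BPM " + struct.pack(">I", 2) + b"\x00\x7d")
for ctype, data in iter_chunks(sample):
    print(ctype, len(data), data)
```

If a guess is wrong, the walker tends to fail quickly with an overrun or yield type codes that are not printable, which is itself a useful signal.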
Now to the contents. A chunk's type signals the format and semantics of its contents. Those contents could be a simple C struct or could contain variable-length strings. IFF itself does not provide metadata on that level of structure, unlike JSON and TIFF.
So try to find the documentation for the file format (SVOX?).
Otherwise try to reverse engineer the data. If you put sample data into an application that generates these files, you can try special cases, look for the expected values in the file, change just one parameter, then look for what changed in the file.
Finally, your code should call c.close(). c.close() will call c.skip() for you and also handle chunk closing, which includes safety checks for attempts to read the chunk afterwards.

How do I search for a set amount of hex and non hex data in python

I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem I'm having is that \x10\x54 is not always at the end of the data I want. However, I have noticed that the first two bytes give the length of the data, i.e. \x00\x0e = 14, so the data is 14 characters long.
Is there a better way to do this, like matching the first part then cutting the next 14 characters? I should also say that the length will vary, as I'm looking to match several things.
Also, is there a way to output the string entirely in hex so it's easier for me to read when working in a Python shell, i.e. \x10B == \x10\x42?
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:int(newdata[0][0:2].encode('hex'))]
Please note that you have a structured binary file on your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all, the "hex data" you talk about is not "hex data". It is just bytes in your stream outside the printable ASCII range, so Python 2 displays them as \x10 and so on, but internally each is a single byte with the value 16 (when viewed as decimal). The \x42 you write corresponds to the ASCII letter B, and that is why you see B in your representation.
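A small sketch of that point, using the string from the question: the two spellings of the marker are the same two bytes, and formatting every byte as two hex digits gives an unambiguous view for work in the shell. The helper name to_hex is made up for illustration:

```python
data = b"\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21"

# "B" is just how byte 0x42 is displayed; the values are identical.
print(b"\x10B" == b"\x10\x42")  # True


def to_hex(raw):
    """Render every byte as two hex digits, separated by spaces."""
    return " ".join("{:02x}".format(b) for b in bytearray(raw))


print(to_hex(data))
```

Going through bytearray makes iteration yield integers on both Python 2 and Python 3.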
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't get the file spec, then finding the fields of interest is reverse-engineering work, just as you are already doing. But even then, you should write some code with the struct module to get your values, since field lengths (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se; most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or an offset given earlier in the file).
But if you are correctly using this as a marker, you could use regular expressions just to find all offsets of the "\x10\x42" marker, as you are doing, and then interpret the following two bytes as the message length:
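    import struct, re

    def get_data(data, sep=b"\x10B"):
        results = []
        # re.escape keeps marker bytes from being read as regex syntax
        for match in re.finditer(re.escape(sep), data):
            offset = match.start()
            # the two bytes after the marker: big-endian unsigned length
            msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
            print(msglen)
            results.append(data[offset + 4: offset + 4 + msglen])
        return results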

Python struct.unpack binary file

I'm using struct.unpack to read the 11th byte of a file to the 21st byte which represents a field that is supposed to read 'SNA'. The field is 'populated as BCS-A where it is left justified and padded to the right boundary with BCS spaces'. Since the field is 10 bytes long, my format string is '10s'. However, per the output mentioned, the remaining 7 bytes are spaces. To eliminate those spaces I use strip. Unfortunately, this still yields 'SNA\x00'. What am I doing wrong?
    field = struct.unpack('10s', data[start:stop])
    field[0].strip()  # [0] because the output of struct.unpack is a tuple
Your data doesn't conform to the standard you've specified. Either contact your data supplier and have them fix their bug, or be more generous about your definition of "space". If you want to accept that data, you could, for example, do this:
field[0].strip(' \t\n\x00')
or, with more limited acceptance:
field[0].strip().rstrip('\x00')
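The behaviour is easy to reproduce. The sketch below uses a made-up 10-byte field (the letters, a stray NUL byte, then space padding) and Python 3 bytes literals, so the arguments to strip carry a b prefix:

```python
import struct

# Hypothetical field contents: "SNA", one NUL byte, six spaces (10 bytes).
raw = b"SNA\x00" + b" " * 6
(field,) = struct.unpack("10s", raw)

# Plain strip() removes whitespace only; NUL is not whitespace.
print(field.strip())                  # b'SNA\x00'

# Treat NUL as padding too.
print(field.strip(b" \t\n\x00"))      # b'SNA'

# Or strip whitespace first, then only trailing NULs.
print(field.strip().rstrip(b"\x00"))  # b'SNA'
```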

Speeding up (simple) text processing

I need to display data from two files (with equal sizes) to be able to visually compare them. For this, I made a new Tk widget consisting of four Text widgets. The first widget contains characters representing bytes from the first file, the second one contains hexadecimal values of the bytes in the left widget, and the same goes for the third and fourth ones respectively (containing data/hex values for the second file). The input data to be displayed are two bytearrays.
To fill the Text widgets, I have to process the input data (bytearrays), because
I have to get rid of unprintable characters and some characters that caused misalignment of the respective lines in the four widgets,
I have to fill the second/fourth widgets with hex values of the bytes, therefore I have to convert the byte values to hex numbers.
The code I used does the functionality described, and it works quite well for small files (several hundreds of kilobytes). However when I try to load bigger files (several megabytes), the time it takes to process and load the data is unacceptable (tens of seconds).
An example of my widget for displaying the data can be seen here:
To process the input data, I use the following code. _ldata and _rdata are bytearrays with the input data, ldata and rdata are strings to be loaded in the first and third Text widgets, lhexdata and rhexdata are strings with the hexadecimal values to be loaded in the second and fourth Text widget. wrap is an integer determining how many bytes will be represented on one line. The print_chars function replaces all the characters that caused misalignment or couldn't be selected in the Text widgets.
    def print_chars(self, byte):
        if (byte < 0x20 or
                (byte > 0x7E and byte < 0xB1)):
            return 0x07
        else:
            return byte
    ...
    ldata = "\n".join(["".join(map(chr,
                                   map(self.print_chars, _ldata[i:i+wrap])))
                       for i in range(0, len(_ldata), wrap)])
    rdata = "\n".join(["".join(map(chr,
                                   map(self.print_chars, _rdata[i:i+wrap])))
                       for i in range(0, len(_rdata), wrap)])
    lhexdata = "\n".join([" ".join(map("{0:02X}".format, _ldata[i:i+wrap]))
                          for i in range(0, len(_ldata), wrap)])
    rhexdata = "\n".join([" ".join(map("{0:02X}".format, _rdata[i:i+wrap]))
                          for i in range(0, len(_rdata), wrap)])
I think there is a way to speed things up, but I can't figure one out. Before I implemented the list comprehensions, I had used for loops for the data processing, and it was a real pain in the neck even for very short inputs. The list comprehensions were a big improvement in performance, yet not sufficient. Thanks for any advice.
I think your first two lines can be improved by using bytearray.translate with an appropriate translation table rather than using your own escaping and converting system. Then you can turn it into a string with bytearray.decode. You still need an additional step to split the text into lines and recombine it, but I suspect that it will be faster if you've done the translation work already.
    table = bytearray.maketrans(bytes(range(0x20)) + bytes(range(0x7f, 0xb1)),
                                b"\x07" * (0x20 + 0xb1 - 0x7f))
    ldata_string = _ldata.translate(table).decode("latin-1")  # pick some 8-bit encoding
    ldata = "\n".join(ldata_string[i:i+wrap] for i in range(0, len(ldata_string), wrap))
You can do something similar for the hex output, using the b16encode function from the base64 module to convert to hex, then decode to make the bytes output into a string. The splitting and rejoining gets a bit more complicated due to the need for spaces between each pair of hex digits, but I suspect it will still be faster than encoding each byte separately.
    import base64
    lhexdata_string = base64.b16encode(_ldata).decode("ascii")  # hex will always be ASCII
    lhexdata = "\n".join(" ".join(lhexdata_string[i:i+2]
                                  for i in range(j, min(j + 2*wrap, len(lhexdata_string)), 2))
                         for j in range(0, len(lhexdata_string), 2*wrap))
Note that the code above assumes that you're using Python 3. If you're using Python 2 you'll need to change a few things (such as working around the lack of maketrans and not needing to decode).
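For reference, here is the same approach gathered into two small self-contained functions, as a sketch assuming Python 3. The names TABLE, text_view and hex_view are made up for illustration; the replacement ranges and the 0x07 placeholder are taken from print_chars in the question:

```python
import base64

# Bytes to replace: control characters and 0x7F..0xB0, as in print_chars.
_BAD = bytes(range(0x20)) + bytes(range(0x7F, 0xB1))
TABLE = bytes.maketrans(_BAD, b"\x07" * len(_BAD))


def text_view(data, wrap):
    """Character column: translate once, decode once, then split into rows."""
    s = data.translate(TABLE).decode("latin-1")
    return "\n".join(s[i:i + wrap] for i in range(0, len(s), wrap))


def hex_view(data, wrap):
    """Hex column: encode the whole buffer once, then split into rows."""
    h = base64.b16encode(data).decode("ascii")
    rows = (h[j:j + 2 * wrap] for j in range(0, len(h), 2 * wrap))
    return "\n".join(
        " ".join(row[i:i + 2] for i in range(0, len(row), 2)) for row in rows
    )


sample = bytearray(b"Hello,\x00world!\xff")
print(text_view(sample, 8))
print(hex_view(sample, 8))
```

Slicing each row before splitting it into byte pairs keeps a short final row from picking up trailing spaces.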

Using struct.unpack() without knowing anything about the string

I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
    import binascii
    path = "BC2003_lr_m32_chab_Im.ised"
    with open(path, 'rb') as fd:
        line = fd.readline()
        print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
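A short sketch of why the item size matters: the same eight input bytes come out differently depending on whether they are swapped as 2-byte, 4-byte, or 8-byte values. The helper name swap_items is made up for illustration, and it assumes the buffer is a homogeneous run of items of a single size:

```python
def swap_items(raw, size):
    """Reverse the byte order of each consecutive `size`-byte item."""
    if len(raw) % size:
        raise ValueError("buffer length is not a multiple of the item size")
    return b"".join(raw[i:i + size][::-1] for i in range(0, len(raw), size))


raw = bytes(bytearray(range(8)))  # 00 01 02 03 04 05 06 07

for size in (2, 4, 8):
    swapped = swap_items(raw, size)
    print(size, " ".join("{:02x}".format(b) for b in bytearray(swapped)))
# 2 01 00 03 02 05 04 07 06
# 4 03 02 01 00 07 06 05 04
# 8 07 06 05 04 03 02 01 00
```

Only one of those outputs can be the correct little-endian version of a given file, and nothing in the bytes themselves says which.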
