I have an IFF-style file (see below) whose contents I need to inspect in Python.
https://en.wikipedia.org/wiki/Interchange_File_Format
I can iterate through the file using the following code:

from chunk import Chunk

def chunks(f):
    while True:
        try:
            c = Chunk(f, align=False, bigendian=False)
            yield c
            c.skip()
        except EOFError:
            break

if __name__ == "__main__":
    for c in chunks(file("sample.iff", 'rb')):
        name, sz, value = c.getname(), c.getsize(), c.read()
        print(name, sz, value)
Now I need to parse the different values. I have had some success using Python's struct module, unpacking different fields as follows:
struct.unpack('<I', value)
or
struct.unpack('BBBB', value)
by experimenting with different formatting characters shown in the struct module documentation
https://docs.python.org/2/library/struct.html
This works with some of the simpler fields but not with the more complex ones. It is all very trial-and-error. What I need is some systematic way of unpacking the different values, some way of knowing or inspecting the type of data they represent. I am not a C datatype expert.
Any ideas? Many thanks.
[raw paste of the file contents; the binary bytes are mangled, but four-character chunk IDs are visible throughout, including SVOX, VERS, BVER, BPM, SPED, TGRD, GVOL, NAME "2017-02-15 16-38", MSCL, MZOO, MXOF, MYOF, LMSK, CURL, TIME, SELS, LGEN, PATN, PATT, PATL, PDTA, PCHN, PLIN, PYSZ, PFLG, PICO, SFFF, SNAM "Output", STYP, SFIN, SREL, SXXX, SYYY, SZZZ, SSCL, SVPR, SCOL, SMIC, SMIB, SMIP, SLNK, CVAL, CMID, and SEND]
If it's really an IFF file, it needs alignment and big-endian turned on, and the file would contain a single FORM chunk that in turn contains the FORM type such as SVOX and the contents chunks. (Or it could contain a LIST or CAT container chunk.)
An IFF chunk has:
A four-character chunk-type code
A four-byte big-endian integer: length
length number of data bytes
A pad byte for alignment if length is odd
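In code, that layout reads as follows. This is a sketch of mine (not part of the answer or the chunk module) that parses chunk headers by hand with struct; flip the bigendian and align arguments to test which variant yields sensible chunk boundaries against your file:

import struct

def walk_chunks(f, bigendian=True, align=True):
    # Yield (chunk_id, length, data) triples until the file runs out.
    fmt = '>4sI' if bigendian else '<4sI'
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        cid, length = struct.unpack(fmt, header)
        data = f.read(length)
        if align and length % 2:
            f.read(1)            # consume the pad byte after an odd-length chunk
        yield cid, length, data

with open('sample.iff', 'rb') as f:
    for cid, length, _ in walk_chunks(f, bigendian=False, align=False):
        print(cid, length)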
This is documented in "EA IFF 85". See the "EA IFF-85" Repository for the original IFF docs. [I wrote them.]
Some file formats like RIFF and PNG are variations on the IFF design, not conforming applications of the IFF standard. They vary the chunk format details, which is why Python's chunk reader library lets you pick the alignment, the endianness, and when to recurse into chunks.
By looking at your file in a hex/ascii dump and mapping out the chunk spans, you should be able to deduce whether it uses big-endian or little-endian length fields, whether each odd-length chunk is followed by a pad byte for alignment, and whether there are chunks within chunks.
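A quick hex/ascii dump helper for that kind of inspection (a sketch of mine, assuming Python 3):

def hexdump(path, length=256, width=16):
    # Show offset, hex bytes, and printable ASCII side by side.
    with open(path, 'rb') as f:
        data = f.read(length)
    for i in range(0, len(data), width):
        row = data[i:i + width]
        hexs = ' '.join('%02x' % b for b in row)
        text = ''.join(chr(b) if 0x20 <= b < 0x7f else '.' for b in row)
        print('%08x  %-*s  %s' % (i, width * 3 - 1, hexs, text))

hexdump('sample.iff')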
Now to the contents. A chunk's type signals the format and semantics of its contents. Those contents could be a simple C struct or could contain variable-length strings. IFF itself does not provide metadata on that level of structure, unlike JSON and TIFF.
So try to find the documentation for the file format (SVOX?).
Otherwise try to reverse engineer the data. If you put sample data into an application that generates these files, you can try special cases, look for the expected values in the file, change just one parameter, then look for what changed in the file.
Finally, your code should call c.close() rather than c.skip(): close() skips to the end of the chunk for you and also marks the chunk as closed, with safety checks against attempts to read from the chunk afterwards.
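So the question's generator could become something like this (a sketch; the right align/bigendian settings depend on what the hex dump reveals):

from chunk import Chunk

def chunks(f):
    while True:
        try:
            c = Chunk(f, align=True, bigendian=True)   # true-IFF settings
        except EOFError:
            break
        yield c
        c.close()   # skips any unread remainder and blocks further reads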
Related
When working with the gzip library in Python, I very often come across code that uses the .read() function in a pattern like this:
with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
While I'm familiar with the context-manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.
This is the documentation for the read() function:
Read at most n characters from stream.
Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.
If that is the case, the functional role of the first line, bytestream.read(16), would have to be reading and thus skipping the first 16 characters, presumably because they act as metadata or a header. However, given some images, how would I know to use 16 as the argument for the read call, instead of, say, 32 or 8 or 64?
I recall plenty of times coming across code completely identical to the above, except the author used bytestream.read(8) instead of bytestream.read(16), or just as likely some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.
In other words, how does one determine the parameter to be used in the read() call? Or how does one know the length of the header in a gzip-compressed file?
My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.
Reproducible details
My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata, so the first line in that code skips those 16 characters and the remainder is stored in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I have read the bytes in one character at a time, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.
The following code reads the data from this website and extracts the first 30 characters. Notice that it is indiscernible where the header "ends" (apparently at the 16th byte, after the second appearance of \x1c) and the data begins:
import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
If we modify the code to cast them as np.float32, so that all characters become numeric (float), there is again no apparent pattern distinguishing where the header/metadata ends and where the data begins.
Any reference or advice would be very appreciated!
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.
Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.
That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:
0000     32 bit integer  0x00000803(2051)  magic number
0004     32 bit integer  60000             number of images
0008     32 bit integer  28                number of rows
0012     32 bit integer  28                number of columns
Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
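For example, instead of discarding those 16 bytes you can decode them; here is a sketch (the file path comes from the question, the struct usage is my addition; the spec stores the integers most-significant byte first, hence the '>'):

import gzip
import struct

with gzip.open('data_input/train-images-idx3-ubyte.gz') as f:
    magic, n_images, n_rows, n_cols = struct.unpack('>4i', f.read(16))
print(magic, n_images, n_rows, n_cols)   # expect: 2051 60000 28 28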
From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The docstring you quoted says read() reads at most n characters, and since the stream is opened in binary mode each "character" is stored in a single byte, so 16 characters occupy 16 bytes.
See more on chars and bytes at https://pymotw.com/3/gzip/#reading-compressed-data
The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To determine the parameter that goes into the first bytestream.read() call, i.e. how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, what file we are reading and what we are trying to accomplish with the numpy library (saving images in a 1D numpy array?).
I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a specific solution for the specific problem of processing one particular compressed image file. It is hard to tell how to determine how many bytes to skip without seeing more code and understanding more of the logic behind the snippet.
I'm looking for a clean and simple way to read a null-terminated C string from a file or file-like object in Python, ideally one that doesn't consume more input from the file than it needs, or that pushes any excess back onto whatever file/buffer it works with, so that other code can read the data immediately following the null-terminated string.
I've seen a bit of rather ugly code to do it, but not much that I'd like to use.
universal newlines support only works for open()ed files, not StringIO objects etc, and doesn't look like it handles unconventional newlines. Also, if it did work it'd result in strings with \n appended, which is undesirable.
struct doesn't look like it supports reading arbitrary-length C strings at all, requiring a length as part of the format.
ctypes has c_buffer, which can be constructed from a byte string and will return the first null terminated string as its value. Again, this requires determining how much must be read in advance, and it doesn't differentiate between null-terminated and unterminated strings. The same is true of c_char_p. So it doesn't seem to help much, since you already have to know you've read enough of the string and have to handle buffer splits.
The usual way to do this in C is to read chunks into a buffer, copying and resizing the buffer if needed, then check whether the newest chunk read contains a null byte. If it does, return everything up to the null byte and either realign the buffer or, if you're being fancy, keep reading and use it as a ring buffer. (This only works if you can hand the excess data back to the caller, or if your platform's ungetc lets you push a lot back onto the file, of course.)
Is it necessary to spell out similar code in Python? I was surprised not to find anything canned in io, ctypes or struct.
file objects don't seem to have a way to push back onto their buffer, like ungetc, and neither do buffered I/O streams in the io module.
I feel like I must be missing the obvious here. I'd really rather avoid byte-by-byte reading:
def readcstr(f):
    buf = bytearray()
    while True:
        b = f.read(1)
        if not b or b == '\0':   # an empty read means EOF, not None
            return str(buf)
        else:
            buf.append(b)
but right now that's what I'm doing.
Incredibly mild improvement on what you have (mostly in that it uses more built-ins that, in CPython, are implemented in C, which usually runs faster):
import functools
import itertools
def readcstr(f):
    toeof = iter(functools.partial(f.read, 1), '')
    return ''.join(itertools.takewhile('\0'.__ne__, toeof))
This is relatively ugly (and sensitive to the type of the file object; it won't work with file objects that return unicode), but it pushes all the work to the C layer. The two-argument form of iter() ensures you stop if the file is exhausted, while itertools.takewhile looks for (and consumes) the NUL terminator but no more; ''.join then combines the bytes read into a single return value.
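Not part of the answer above, but worth noting: on Python 3, a binary file opened with open(path, 'rb') is an io.BufferedReader, whose peek() lets you look at buffered bytes without consuming them, so you can scan for the NUL in chunks rather than byte by byte. A sketch:

def readcstr_buffered(f):
    # f is a buffered binary stream, e.g. open(path, 'rb')
    out = bytearray()
    while True:
        chunk = f.peek(1)              # buffered bytes, not yet consumed
        if not chunk:                  # EOF before any NUL
            return bytes(out)
        i = chunk.find(b'\0')
        if i >= 0:
            out += f.read(i + 1)[:-1]  # consume through the NUL, drop it
            return bytes(out)
        out += f.read(len(chunk))      # consume everything we peeked at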
I need to display data from two files (of equal size) so they can be compared visually. For this, I made a new Tk widget consisting of four Text widgets. The first widget contains characters representing bytes from the first file, the second one contains the hexadecimal values of the bytes in the first widget, and the same goes for the third and fourth widgets respectively (containing the data/hex values for the second file). The input data to be displayed are two bytearrays.
To fill the Text widgets, I have to process the input data (bytearrays), because
I have to get rid of unprintable characters and some characters that caused misalignment of the respective lines in the four widgets,
I have to fill the second/fourth widgets with hex values of the bytes, therefore I have to convert the byte values to hex numbers.
The code I used does the functionality described, and it works quite well for small files (several hundreds of kilobytes). However when I try to load bigger files (several megabytes), the time it takes to process and load the data is unacceptable (tens of seconds).
An example of my widget for displaying the data can be seen here:
To process the input data, I use the following code. _ldata and _rdata are bytearrays with the input data, ldata and rdata are strings to be loaded in the first and third Text widgets, lhexdata and rhexdata are strings with the hexadecimal values to be loaded in the second and fourth Text widget. wrap is an integer determining how many bytes will be represented on one line. The print_chars function replaces all the characters that caused misalignment or couldn't be selected in the Text widgets.
def print_chars(self, byte):
    if byte < 0x20 or (byte > 0x7E and byte < 0xB1):
        return 0x07
    else:
        return byte
...
ldata = "\n".join(["".join(map(chr, map(self.print_chars, _ldata[i:i+wrap])))
                   for i in range(0, len(_ldata), wrap)])
rdata = "\n".join(["".join(map(chr, map(self.print_chars, _rdata[i:i+wrap])))
                   for i in range(0, len(_rdata), wrap)])
lhexdata = "\n".join([" ".join(map("{0:02X}".format, _ldata[i:i+wrap]))
                      for i in range(0, len(_ldata), wrap)])
rhexdata = "\n".join([" ".join(map("{0:02X}".format, _rdata[i:i+wrap]))
                      for i in range(0, len(_rdata), wrap)])
I think there is a way to speed things up, but I can't figure out any. Before I implemented the list comprehensions, I had used for loops for the data processing, and it was a real pain in the neck even for very short inputs. The list comprehensions were a big improvement in performance, yet not sufficient. Thanks for any advice.
I think your first two lines can be improved by using bytearray.translate with an appropriate translation table rather than using your own escaping and converting system. Then you can turn it into a string with bytearray.decode. You still need an additional step to split the text into lines and recombine it, but I suspect that it will be faster if you've done the translation work already.
table = bytearray.maketrans(bytes(range(0x20)) + bytes(range(0x7f, 0xb1)),
                            b"\x07" * (0x20 + 0xb1 - 0x7f))
ldata_string = _ldata.translate(table).decode("latin-1")  # pick some 8-bit encoding
ldata = "\n".join(ldata_string[i:i+wrap]
                  for i in range(0, len(ldata_string), wrap))
You can do something similar for the hex output, using the b16encode function from the base64 module to convert to hex, then decode to make the bytes output into a string. The splitting and rejoining gets a bit more complicated due to the need for spaces between each pair of hex digits, but I suspect it will still be faster than encoding each byte separately.
import base64
lhexdata_string = base64.b16encode(_ldata).decode("ascii")  # hex will always be ASCII
lhexdata = "\n".join(" ".join(lhexdata_string[j+i:j+i+2] for i in range(0, 2*wrap, 2))
                     for j in range(0, len(lhexdata_string), 2*wrap))
Note that the code above assumes that you're using Python 3. If you're using Python 2 you'll need to change a few things (such as working around the lack of maketrans and not needing to decode).
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii

path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
    print binascii.hexlify(line)
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes; there is little point in attempting it otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
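A quick sketch of mine to make that concrete: reinterpreting a big-endian double as little-endian is the same as reversing each 8-byte group, which you can verify with struct:

import struct

be = struct.pack('>d', 1.5)                         # big-endian double
le = struct.pack('<d', struct.unpack('>d', be)[0])  # same value, little-endian
assert le == be[::-1]                               # identical to reversing the 8 bytes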
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
Suppose I have a binary buffer (or whatever Python stores the result of an "fread" in); I can access the individual bytes in it with buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on Python 2.4 (and can't upgrade; big corporation servers), so that apparently limits my options. Sorry for not mentioning that earlier; I wasn't aware it had so many fewer features.
Secondly, let's make this ultra-simple.
Let's say I have a binary file named 'myfile.binary' with the five-byte contents 4C 53 53 53 53 in hex; this equates to the ASCII representation of the letter "L" followed by four "S" characters, alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.
What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct

f = open('myfile.binary', 'rb')
contents = f.read(5)
f.close()
data = list(contents)
data[1:5] = struct.pack(">i", 0x12345678)   # overwrite the four S's, keep the leading 'L'
print data
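To round that off (my addition, not in the original answer): join the list back into an immutable str and write it out. Note that the with statement is not available in Python 2.4, hence the explicit close():

patched = "".join(data)
out = open('myfile.binary', 'wb')
out.write(patched)
out.close()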
Have a look at the struct module. You need the pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python 2.4.
Your number 12345678910 cannot be packed into 4 bytes, so I shortened it a bit.
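To see why it had to be shortened (a sketch of mine): an unsigned 32-bit field tops out at 2**32 - 1, which is smaller than the requested value:

import struct

print 2 ** 32 - 1                 # 4294967295, the largest unsigned 32-bit value
struct.pack('<I', 12345678910)    # raises struct.error: value out of range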
The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
import array

with open("filename", "rb") as f:   # note: binary mode
    f.seek(0, 2)                    # seek to the end to learn the file size
    size = f.tell()
    f.seek(0)
    data = array.array("i")
    assert data.itemsize == 4
    data.fromfile(f, size // 4)

data[2] = new_value                 # element 2 spans bytes 8-11
# use data.tofile(g) to write the data back to a new file g
You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(f, dtype=numpy.uint32)  # f is an open binary file or a filename
Then access the data at the desired location and make your changes.
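A sketch of that route (my code, not the answer's; it assumes a native little-endian machine and the question's layout with the file size held in bytes 8-11):

import numpy as np

data = np.fromfile('myfile.binary', dtype=np.uint32)
data[2] = 0x12345678          # element 2 covers bytes 8-11
data.tofile('myfile.binary')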
The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e. x86 machines), this number can be represented using a length-4 bytearray as b"\x6d\x3b\x00\x00" (which prints as b"m;\x00\x00" on screen). To convert the four bytes back into an integer, we simply do a bit of base conversion, which in this case is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213
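The standard tools agree with the manual base-256 sum (my addition, assuming Python 3):

import struct

raw = b"\x6d\x3b\x00\x00"
print(struct.unpack('<i', raw)[0])     # 15213
print(int.from_bytes(raw, 'little'))   # 15213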