I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the data types; there is little point in attempting it otherwise.
Even if it were a homogeneous sequence of one data type, you'd still need to know what you are dealing with: flipping the byte order of double values is very different from flipping short integers.
Take a look at the format characters table in the struct documentation; anything with a different byte size there results in a different set of bytes being swapped. For double values, for example, you need to reverse the order of every 8 bytes.
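To see why, here is a small synthetic illustration (made-up bytes, not the OP's data): swapping the byte order of the same 8 bytes as one double versus as four 2-byte shorts yields different results:
data = bytes(range(8))  # b'\x00\x01\x02\x03\x04\x05\x06\x07'

# As one double: reverse all 8 bytes at once.
as_double = data[::-1]

# As four 2-byte shorts: reverse each pair independently.
as_shorts = b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))

print(as_double.hex())   # 0706050403020100
print(as_shorts.hex())   # 0100030205040706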
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the data types contained, then write a byte-order adjustment script. If you don't even have that, best not to start, as the task is impossible to achieve.
Related
I've run into a weird problem and haven't been able to solve it for days. I created a byte array containing the values 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it from Python using np.fromfile(filename, dtype=np.ubyte). However, I noticed the function was adding an arbitrary comma (see the image). Interestingly, it is not visible in the array properties, and if I call numpy.array2string, the comma turns into '\n'. One solution is to replace the comma with nothing, but I have very long sequences, and using a replace function on 100 GB of data would take forever. I also rechecked the files by reading them with .NET Core, and I'm quite sure the comma is not there.
What could I be missing?
Edit:
I was trying to read all the byte values into an array and cast each member, or the entire array, to a string. I found that the most reliable way to do this is:
list(map(str, ubyte_array))
The above code returns a list of strings whose elements contain no arbitrary commas or blank spaces.
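As a sanity check, a minimal sketch (with a generated array standing in for the file contents) shows that the comma is only part of numpy's printed representation, not of the data:
import numpy as np

ubyte_array = np.arange(1, 251, dtype=np.ubyte)  # stand-in for np.fromfile(...)

print(repr(ubyte_array)[:30])     # repr() separates elements with commas
print(str(ubyte_array)[:30])      # str() separates them with spaces
print(ubyte_array.tobytes()[:5])  # the raw bytes contain no separators at all

as_strings = list(map(str, ubyte_array))  # ['1', '2', ..., '250']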
I'd be happy for your help with a problem I have.
Goal: to read a .h264 file (I extracted the raw bitstream to a file using ffmpeg) using Python, and save it in some data structure (probably a list; I'm open to suggestions).
I want to read the data as hex; each byte (2 hex digits) should be fed into a list, or some other data structure.
But any step forward will help me.
My Attempts:
First I tried to read the way I know:
with open(path, 'r') as fp:
    data = fp.read()
Didn't work; I got just ".
After a lot of changes, I tried something else I saw online:
with open(path, 'r') as fp:
    hex_list = ["{:02x}".format(ord(c)) for c in fp.read()]
Still got an empty list.
I'll be happy for your help.
Thanks a lot.
EDIT:
Thanks to the comment below, I tried to open using 'rb', but still with no luck.
If you have an h264 mp4 file, you can open it and get a hexadecimal string representation like this using binascii.hexlify():
import binascii
with open('test.mp4', 'rb') as fin:
    hexa = binascii.hexlify(fin.read())
print(hexa[0:1000])
hexa will be a Python bytes object, and you can easily get back the binary representation with binascii.unhexlify(hexa). This will be much more efficient than storing the hex representation as strings in a list, in terms of both space and time. You can access the bytes object with indices/slices, so whatever you were intending to do with the list will probably work fine with this (it will just be much faster and use a lot less memory).
One thing to keep in mind, though: to get the first hexadecimal digit from a bytes object, you don't do hexa[0], but rather hexa[0:1]. To get the first pair of hexadecimal digits (one byte of the original data), you do hexa[0:2]; the second byte is hexa[2:4], etc. As explained in the docs for bytes objects:
Since bytes objects are sequences of integers (akin to a tuple), for a
bytes object b, b[0] will be an integer, while b[0:1] will be a bytes
object of length 1. (This contrasts with text strings, where both
indexing and slicing will produce a string of length 1)
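A quick illustration of that indexing behavior on a synthetic three-byte input:
import binascii

hexa = binascii.hexlify(b'\x00\x1b\xff')

print(hexa)          # b'001bff'
print(hexa[0])       # 48 -- an integer (the ASCII code of '0')
print(hexa[0:1])     # b'0' -- a bytes object of length 1
print(hexa[0:2])     # b'00' -- both hex digits of the first original byte
print(binascii.unhexlify(hexa))  # b'\x00\x1b\xff' -- round-trips losslessly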
I have an IFF-style file (see below) whose contents I need to inspect in Python.
https://en.wikipedia.org/wiki/Interchange_File_Format
I can iterate through the file using the following code
from chunk import Chunk

def chunks(f):
    while True:
        try:
            c = Chunk(f, align=False, bigendian=False)
            yield c
            c.skip()
        except EOFError:
            break

if __name__ == "__main__":
    with open("sample.iff", 'rb') as f:
        for c in chunks(f):
            name, sz, value = c.getname(), c.getsize(), c.read()
            print(name, sz, value)
Now I need to parse the different values. I have had some success using Python's 'struct' module, unpacking different fields as follows
struct.unpack('<I', value)
or
struct.unpack('BBBB', value)
by experimenting with different formatting characters shown in the struct module documentation
https://docs.python.org/2/library/struct.html
This works with some of the simpler fields but not with the more complex ones. It is all very trial-and-error. What I need is some systematic way of unpacking the different values, some way of knowing or inspecting the type of data they represent. I am not a C datatype expert.
Any ideas? Many thanks.
(raw file contents pasted as text: mostly unprintable binary data interleaved with readable four-character chunk IDs such as SVOX, VERS, BVER, BPM, SPED, TGRD, GVOL, NAME, PATN, PATT, PDTA, SFFF, SNAM, STYP, CVAL, CMID and SEND, plus the string "2017-02-15 16-38")
If it's really an IFF file, it needs alignment and big-endian turned on, and the file would contain a single FORM chunk that in turn contains the FORM type such as SVOX and the contents chunks. (Or it could contain a LIST or CAT container chunk.)
An IFF chunk has (see the sketch after this list):
A four-character chunk-type code
A four-byte big-endian integer: length
length data bytes
A pad byte for alignment if length is odd
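A minimal reader for that layout might look like this (a sketch assuming standard IFF conventions, not whatever this particular file does):
import struct

def read_chunk(f):
    # Read one standard IFF chunk: 4-byte type code, 4-byte big-endian
    # length, `length` data bytes, and a pad byte if the length is odd.
    ctype = f.read(4)
    if len(ctype) < 4:
        return None          # end of file
    (length,) = struct.unpack('>I', f.read(4))
    data = f.read(length)
    if length % 2:
        f.read(1)            # skip the alignment pad byte
    return ctype, length, data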
This is documented in "EA IFF 85". See the "EA IFF-85" Repository for the original IFF docs. [I wrote them.]
Some file formats like RIFF and PNG are variations on the IFF design, not conforming applications of the IFF standard. They vary the chunk format details, which is why Python's Chunk reader library lets you pick alignment, endian, and when to recurse into chunks.
By looking at your file in a hex/ascii dump and mapping out the chunk spans, you should be able to deduce whether it uses big-endian or little-endian length fields, whether each odd-length chunk is followed by a pad byte for alignment, and whether there are chunks within chunks.
Now to the contents. A chunk's type signals the format and semantics of its contents. Those contents could be a simple C struct or could contain variable-length strings. IFF itself does not provide metadata on that level of structure, unlike JSON and TIFF.
So try to find the documentation for the file format (SVOX?).
Otherwise try to reverse engineer the data. If you put sample data into an application that generates these files, you can try special cases, look for the expected values in the file, change just one parameter, then look for what changed in the file.
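That experimentation can be semi-automated; here is a hedged sketch (the candidate format list is just a guess) that tries every struct format whose size matches a payload:
import struct

CANDIDATES = ('<I', '>I', '<i', '>i', '<f', '>f', '<d', '>d', '<H', '>H', 'BBBB')

def plausible_interpretations(value):
    # Report what the bytes would mean under each candidate format
    # whose size matches the payload exactly.
    for fmt in CANDIDATES:
        if struct.calcsize(fmt) == len(value):
            yield fmt, struct.unpack(fmt, value)

for fmt, decoded in plausible_interpretations(b'\x00\x00\x00\x2a'):
    print(fmt, decoded)   # e.g. ('>I', (42,)) among other 4-byte readings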
Finally, your code should call c.close() instead of c.skip(). close() will skip to the end of the chunk for you and also marks the chunk as closed, adding safety checks against attempts to read it afterwards.
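Applied to the generator above, and with the standard IFF settings the answer recommends (align and bigendian both default to True in Python's chunk module):
from chunk import Chunk

def chunks(f):
    while True:
        try:
            c = Chunk(f)    # align=True, bigendian=True are the defaults
            yield c
            c.close()       # skips to the chunk's end; reads afterwards raise an error
        except EOFError:
            break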
I have a request to "Encode the file by adding 5 to every byte in the file". I tried opening the file as read binary, but all that does is add a b to the beginning of the string, and I don't think that is what the statement expects. I tried looking into pickle, but I don't think that is right either.
What else could this mean? Any ideas as to what possible solutions there are?
It doesn't actually add a b to the beginning of the string; b is just a marker Python puts on the value's printed representation so that you know it's a bytes type, not str. Bytes are really just numbers (0-255), so you can walk through the bytes object, take each value, add 5, and so on.
Hint: this task probably gets easier if you use a bytearray to store the bytes.
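A minimal sketch of that approach (the filenames are placeholders, and I'm assuming values wrap around modulo 256, since a byte can't exceed 255):
with open('input.bin', 'rb') as f:       # placeholder filename
    data = bytearray(f.read())           # mutable sequence of ints 0-255

for i in range(len(data)):
    data[i] = (data[i] + 5) % 256        # add 5 to every byte, wrapping at 256

with open('encoded.bin', 'wb') as f:     # placeholder filename
    f.write(data)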
I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I cannot compress them but also cannot replace them with different bytes (because decoders wouldn't be able to read them)? Is there a way? Or do decoders only decode if there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings that occur earlier in the decompression buffer by their lengths and distances from the current position. But how you encode these back-references is left to you, and different implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is reserving some characters as "special" (so-called "escape sequences"). You can do it the way you did, that is, by using < to mark the start of a back-reference. But then you also need a way to output a literal <. You can do that, for example, by establishing that when < is followed by another <, it means a literal, and you just output one <. Or you can establish that if < is immediately followed by >, with nothing in between, that's not a back-reference, and you just output <.
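Here is a toy decoder for the first convention (a hypothetical textual format with '<d,l>' back-references and '<<' as an escaped literal; real LZ77 implementations use binary encodings instead):
def decode(stream):
    out = []
    i = 0
    while i < len(stream):
        if stream[i] == '<':
            if i + 1 < len(stream) and stream[i + 1] == '<':
                out.append('<')              # '<<' is an escaped literal '<'
                i += 2
                continue
            j = stream.index('>', i)         # parse '<distance,length>'
            d, l = map(int, stream[i + 1:j].split(','))
            for _ in range(l):
                out.append(out[-d])          # copy one char at a time (handles overlaps)
            i = j + 1
        else:
            out.append(stream[i])            # ordinary literal
            i += 1
    return ''.join(out)

print(decode('abcab<3,4>'))  # 'abcabcabc'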
It also wouldn't be the most efficient way to encode back-references, because it takes several bytes to encode one, so it only pays off for referencing strings longer than those several bytes. For shorter matches it would inflate the data instead of compressing it, unless you establish that matches shorter than several bytes are left as literals instead of generating back-references. But again, that means lower compression gains.
If you compress only plain old ASCII text, you can employ a better encoding scheme, because ASCII uses just 7 of the 8 bits in a byte. You can use the highest bit to signal a back-reference, the remaining 7 bits as the length, and the next byte (or two) as the back-reference's distance. This way you can always tell whether the next byte is a literal ASCII character or a back-reference by checking its highest bit: if it is 0, just output the character as is; if it is 1, use the following 7 bits as the length and read the next 2 bytes to use as the distance. Every back-reference then takes 3 bytes, so you can efficiently compress text files with repeating sequences longer than 3 characters.
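A sketch of the emitting side of that flag-bit scheme (the function names are mine, just to make the byte layout concrete):
def emit_literal(out, ch):
    out.append(ord(ch))                  # 0xxxxxxx: plain 7-bit ASCII character

def emit_backref(out, length, distance):
    assert 0 < length < 128 and 0 < distance < 65536
    out.append(0x80 | length)            # 1xxxxxxx: high bit set, 7-bit length
    out += distance.to_bytes(2, 'big')   # 2-byte distance

out = bytearray()
emit_literal(out, 'A')
emit_backref(out, length=5, distance=300)
print(out.hex())  # '4185012c': 'A', then length 5 (0x85), distance 300 (0x012c)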
But there's a still better way, which gives even more compression: replace your characters with bit codes of variable lengths, crafted so that characters appearing more often get the shortest codes and rare ones get longer codes. To make that work, the codes have to be so-called "prefix codes": no code may be a prefix of another. With this property you can always distinguish them by reading bits in sequence until you have decoded one of them; you can be sure that reading more bits won't produce some other valid item, and the next bit always starts a new code.
To produce such codes, you build a Huffman tree. You can join all your literal bytes and the different back-reference lengths into one tree and generate distinct bit codes for them according to their frequency. When decoding, you read bits until you reach the code of some element, and then you know whether it is a literal character or a back-reference length; in the second case, you then read additional bits for the back-reference's distance (also encoded with a prefix code). This is what the DEFLATE compression scheme does, but that is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.
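To make the prefix-code idea concrete, here is a minimal Huffman-code construction over a toy frequency table (nothing DEFLATE-specific, just the idea described above):
import heapq

def huffman_codes(freqs):
    # Repeatedly merge the two lowest-weight subtrees, prefixing '0'/'1'
    # onto the codes of their symbols; unique counters break weight ties.
    heap = [[weight, i, {sym: ''}] for i, (sym, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in lo[2].items()}
        merged.update({s: '1' + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], count, merged])
        count += 1
    return heap[0][2]

print(huffman_codes({'a': 5, 'b': 2, 'c': 1, 'd': 1}))
# {'b': '00', 'c': '010', 'd': '011', 'a': '1'} -- no code is a prefix of another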
If I understand your question correctly, the premise makes no sense: there are no "reserved bytes" in the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.