I'm using struct.unpack to read bytes 11 through 20 of a file, a field that is supposed to read 'SNA'. The field is 'populated as BCS-A where it is left justified and padded to the right boundary with BCS spaces'. Since the field is 10 bytes long, my format string is '10s'. Per the spec, the remaining 7 bytes after 'SNA' should be spaces, so I call strip to remove them. Unfortunately, this still yields 'SNA\x00'. What am I doing wrong?
field = struct.unpack('10s', data[start:stop])
field[0].strip()  # struct.unpack returns a tuple, so index into it first
Your data doesn't conform to the standard you've specified. Either contact your data supplier and have them fix their bug, or be more generous about your definition of "space". If you want to accept that data, you could, for example, do this:
field[0].strip(' \t\n\x00')
or, with more limited acceptance:
field[0].strip().rstrip('\x00')
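Putting it together, a minimal sketch (assuming Python 3, where the unpacked field comes back as bytes; the record layout here is made up for illustration):

import struct

# hypothetical record: 10 filler bytes, then 'SNA' + a stray NUL + 6 spaces
data = b'0123456789' + b'SNA\x00      '

(field,) = struct.unpack('10s', data[10:20])
print(field.rstrip(b' \x00'))   # b'SNA'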
The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
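For instance, a minimal sketch using the example record from the question (the layout is assumed):

# the 12-byte example record, written out as a bytes literal
record = bytes([0x01, 0x77, 0x33, 0x9F, 0x41, 0x42, 0x43, 0x44,
                0x00, 0x11, 0x11, 0x11])

record[2]                         # 51 (0x33) as a plain int - no ord() needed
text = str(record[4:8], 'ascii')  # 'ABCD'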
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
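Continuing the sketch above:

b'ABCD' in record      # True - subsequence of bytes
record.find(b'ABCD')   # 4    - offset of the match, or -1 if absent
0x9F in record         # True - single integer value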
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
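Applied to the example record, assuming for illustration that byte 2 is a record length and bytes 8-11 form one big-endian 32-bit word:

reclen = record[2]                          # 51 - a single byte is already an int
word = int.from_bytes(record[8:12], 'big')  # 0x00111111 == 1118481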
When working with the gzip library in Python, I very often come across code that uses the .read() function in a pattern like this:
with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is doing at all.
This is the documentation for the read() function:
Read at most n characters from stream.
Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.
If that is the case, the functional role of the first line, bytestream.read(16), would have to be reading and thus skipping the first 16 characters, presumably because they act as metadata or a header. But given some images, how would I know to use 16 as the argument for the read call, instead of, say, 32, 8, or 64?
I recall plenty of times coming across code identical to the above, except that the author used bytestream.read(8) instead of bytestream.read(16), or just as likely some other value. Digging into the file character by character shows no discernible pattern for determining the length of the header.
In other words, how does one determine the parameter to pass to the read call? Or, how does one know the length of the header in a gzip-compressed file?
My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.
Reproducible details
My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata, so the first line of that code skips them and stores the rest in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I read the bytes in one character at a time, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.
The following code reads the data from this website and extracts the first 30 characters. Notice that it is indiscernible where the header "ends" (after the 16th byte, apparently, following the second appearance of \x1c) and the data begins:
import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
If we modify the code to cast the values as np.float32, so that all characters become numeric (float), there is again no apparent pattern distinguishing where the header / metadata ends and where the data begins.
Any reference or advice would be greatly appreciated!
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.
Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.
That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:
0000  32 bit integer  0x00000803 (2051)  magic number
0004  32 bit integer  60000              number of images
0008  32 bit integer  28                 number of rows
0012  32 bit integer  28                 number of columns
Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
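So a sketch that reads those four header fields explicitly instead of discarding them (the variable names are mine):

import gzip
import struct

with gzip.open('data_input/train-images-idx3-ubyte.gz') as bytestream:
    # four 32-bit big-endian integers make up the 16-byte IDX header
    magic, n_images, n_rows, n_cols = struct.unpack('>iiii', bytestream.read(16))
    # magic == 2051, n_rows == n_cols == 28 for this file
    buf = bytestream.read(n_rows * n_cols * n_images)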
From the code snippet, bytestream.read(16) reads (and thus skips) the first 16 bytes of bytestream. The docstring you quoted says read() reads at most n characters, and that is what it does here; on a binary stream each "character" is a single byte, so 16 characters occupy 16 bytes.
See more on chars and bytes at https://pymotw.com/3/gzip/#reading-compressed-data
The code snippet is primarily interested in the contents of buf and skips the first 16 bytes of the stream. To understand how to determine the parameter passed to the first bytestream.read() call, i.e. how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, what file we are reading and what we are trying to accomplish with the numpy library (saving RGB images in a 1D numpy array?).
I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a unique solution to the unique problem of processing one particular compressed image file. It is hard to tell how to determine the number of bytes to skip without seeing more code and understanding more of the logic behind the snippet.
I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem I'm having is that \x10\x54 is not always at the end of the data I want. However, what I have noticed is that the first two bytes of the match encode the data length, i.e. \x00\x0e = 14, so the data is 14 characters long.
Is there a better way to do this, like matching the first part and then taking the next 14 characters? I should also say that the length will vary, as I'm looking to match several things.
Also, is there a way to output the string entirely in hex so it's easier to read when working in a Python shell, i.e. so that \x10B displays as \x10\x42?
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
length = int(newdata[0][0:2].encode('hex'), 16)  # parse the two length bytes as base 16
newdata[0][2:2 + length]                         # skip the length bytes, then take `length` chars
Please note that you have a structured binary file on your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all, the "hex data" you talk about is not "hex data" - it is just bytes in your stream outside the printable ASCII range - so Python 2 displays such a character as \x10 and the like, but internally it is just a single byte with the value 16 (in decimal). The \x42 you write corresponds to the ASCII letter B, and that is why you see B in your representation.
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't get the file spec, then it is reverse-engineering work to find out the fields of interest - just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lengths (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se - most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or an offset given earlier in the file).
But if you are correctly using it as a marker, you could use regular expressions just to find all offsets of the "\x10\x42" marker, as you are doing, and then interpret the two bytes that follow as the message length:
import re
import struct

def get_data(data, sep=b"\x10B"):
    results = []
    for match in re.finditer(sep, data):
        offset = match.start()
        # the two bytes after the marker hold the big-endian message length
        msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
        # the message itself starts right after those length bytes
        results.append(data[offset + 4: offset + 4 + msglen])
    return results
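Applied to the sample string from the question (as a byte string):

data = b'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
print(get_data(data))   # [b'12102 G1103543']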
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii

path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
    print binascii.hexlify(line)
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
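A minimal sketch of why the datatype matters, using 1.0 stored as a big-endian double (the choice of double vs. shorts is only an illustration, not a claim about your file):

import binascii
import struct

big = struct.pack('>d', 1.0)   # 8 bytes holding a big-endian double

# if the field is a double, all 8 bytes reverse as one unit:
as_double = struct.pack('<d', *struct.unpack('>d', big))

# if you wrongly guess four 16-bit shorts, only each 2-byte pair swaps:
as_shorts = struct.pack('<4h', *struct.unpack('>4h', big))

print(binascii.hexlify(big))        # 3ff0000000000000
print(binascii.hexlify(as_double))  # 000000000000f03f
print(binascii.hexlify(as_shorts))  # f03f000000000000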
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I cannot emit them as literals but also cannot substitute different bytes for them (because decoders wouldn't be able to read that)? Is there a way? Or do decoders only decode when there is an exact <d, l> string? (If so, imagine that by coincidence those bytes occur in a file. What would happen?)
Thanks!
LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references. Many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is to reserve some characters as "special" (so-called "escape sequences"). You can do it the way you did, that is, by using < to mark the start of a back-reference. But then you also need a way to output < as a literal. You can do that, for example, by establishing that a < followed by another < means one literal <; or that a < immediately followed by >, with nothing in between, is not a back-reference, so you just output <.
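For instance, a toy sketch of the doubling convention (purely illustrative, not a real LZ77 encoder):

def emit_literal(token):
    # a literal '<' is doubled so the decoder can tell it apart from
    # the start of a '<distance,length>' back-reference
    return '<<' if token == '<' else token

''.join(emit_literal(c) for c in 'if a < b:')   # 'if a << b:'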
It also wouldn't be the most efficient way to encode back-references, because it spends several bytes on each one, so it only pays off when referencing strings longer than those several bytes. For shorter matches it will inflate the data instead of compressing it, unless you establish that matches shorter than several bytes are left as is instead of becoming back-references. But again, this means lower compression gains.
If you compress only plain old ASCII text, you can employ a better encoding scheme, because ASCII uses just 7 of the 8 bits in a byte. So you can use the highest bit to signal a back-reference, use the remaining 7 bits as the length, and use the next two bytes as the back-reference's distance. This way you can always tell whether the next byte is a literal ASCII character or a back-reference by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as the length, and read the next 2 bytes to use as the distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences longer than 3 characters.
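A minimal decoder sketch for that high-bit scheme (the layout - 7 bits of length, then 2 bytes of big-endian distance - is just the convention described above, assuming Python 3):

def decode(stream):
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b < 0x80:                 # high bit clear: a literal ASCII byte
            out.append(b)
            i += 1
        else:                        # high bit set: a 3-byte back-reference
            length = b & 0x7F
            distance = int.from_bytes(stream[i+1:i+3], 'big')
            for _ in range(length):  # byte by byte, so overlapping copies work
                out.append(out[-distance])
            i += 3
    return bytes(out)

decode(b'abc' + bytes([0x83, 0x00, 0x03]))   # b'abcabc'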
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted so that the characters appearing more often get the shortest codes and the rare ones get longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of another code. When your codes have this property, you can always distinguish them by reading the bits in sequence until you have decoded one; you can then be sure that reading more bits would not yield some other valid item, and the next bit always starts a new code. To produce such codes, you use Huffman trees: join all your byte values and the different back-reference lengths into one tree and generate distinct bit codes for them according to their frequency. When decoding, you read bits until you reach the code of one of these elements, and then you know for sure whether it is a literal character or a back-reference length; in the latter case, you then read additional bits for the back-reference's distance (also encoded with a prefix code).
This is what the DEFLATE compression scheme does, but that is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.
If I understand your question correctly, its premise makes no sense: there are no "reserved bytes" in the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.