How Does One Read Bytes from File in Python


Similar to this question, I am trying to read in an ID3v2 tag header and am having trouble figuring out how to get individual bytes in python.
I first read all ten bytes into a string. I then want to parse out the individual pieces of information.
I can grab the two version number chars in the string, but then I have no idea how to take those two chars and get an integer out of them.
The struct package seems to be what I want, but I can't get it to work.
Here is my code so far (I am very new to Python, by the way, so take it easy on me):
def __init__(self, ten_byte_string):
    self.whole_string = ten_byte_string
    self.file_identifier = self.whole_string[:3]
    self.major_version = struct.pack('x', self.whole_string[3:4]) #this
    self.minor_version = struct.pack('x', self.whole_string[4:5]) # and this
    self.flags = self.whole_string[5:6]
    self.len = self.whole_string[6:10]
Printing out any value except is obviously crap because they are not formatted correctly.

If you have a string, with 2 bytes that you wish to interpret as a 16 bit integer, you can do so by:
>>> s = '\0\x02'
>>> struct.unpack('>H', s)
(2,)
Note that the > is for big-endian (the most significant byte of the integer comes first). This is the byte order ID3 tags use.
For other sizes of integer, you use different format codes, e.g. "i" for a signed 32-bit integer. See help(struct) for details.
You can also unpack several elements at once, e.g. for 2 unsigned shorts followed by a signed 32-bit value:
>>> a,b,c = struct.unpack('>HHi', some_string)
Going by your code, you are looking for (in order):
a 3 char string
2 single byte values (major and minor version)
a 1 byte flags variable
a 32 bit length quantity
The format string for this would be:
ident, major, minor, flags, len = struct.unpack('>3sBBBI', ten_byte_string)
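As a minimal sketch of that format string in action (Python 3, where file data is bytes): the 10-byte sample header below is made up for illustration and is not taken from a real file, and the name sample_header is likewise an assumption.

```python
import struct

# Hypothetical 10-byte header: "ID3", version 3.0, no flags, 4 length bytes.
sample_header = b"ID3\x03\x00\x00\x00\x00\x02\x01"

# '>'  big-endian, no padding
# '3s' 3-byte string, 'B' unsigned byte (x3), 'I' unsigned 32-bit integer
ident, major, minor, flags, length = struct.unpack(">3sBBBI", sample_header)

print(ident, major, minor, flags, length)  # b'ID3' 3 0 0 513
```

struct.calcsize(">3sBBBI") returns 10, which is a quick way to confirm that a format string covers exactly the bytes you read.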

Why write your own? (Assuming you haven't checked out these other options.) There are a couple of options out there for reading ID3 tag info from MP3s in Python. Check out my answer over at this question.

I am trying to read in an ID3v2 tag header
FWIW, there's already a module for this.

I was going to recommend the struct package but then you said you had tried it. Try this:
self.major_version = struct.unpack('H', self.whole_string[3:5])
The pack() function converts Python data types to bytes, and the unpack() function converts bytes back to Python data types.

Related

How to send an IPv6 address with the 'struct' library in python

I'm asking for your help because I've been stuck with the same problem for 3 days.
If I have :
Value1 = 0, Value2 = 3.10 and IPv6 = '2001::1'
I would like to pack all 3 values with this command: package = struct.pack(*format*, value1, value2, IPv6)
My problem is: I don't know what format characters in C type I can use to pack the IPv6 and keep its 16 bytes.
I know that I can use format = 'i f ?' with i for integer / f for float but I need to find with what to replace the '?' which is the format characters in C type for an IPv6 address to pack the three values.
Can someone please help me?

It is not that straightforward to use struct for this, as you'd have to know the proper byte values of the IPv6 address before adding them to a struct.
'2001::1' is a textual representation that is nowhere close to giving you those values: you'd have to split the string on ":", replace the missing groups with "0", and then you'd have a list of eight 16-bit numbers to pack in the struct. And then there are certainly corner cases and special syntax in the IPv6 string representation you'd have to account for.
Fortunately, Python already handles that for you in the ipaddress module of the stdlib.
Just import ipaddress, format the struct for the first part of your package, and concatenate it with the "packed" attribute of the IPv6Address object that Python generates for you:
import struct
import ipaddress
Value1 = 0
Value2 = 3.10
IPv6 = '2001::1'
payload = struct.pack("if", Value1, Value2) + ipaddress.ip_address(IPv6).packed
However, I wonder if it will be productive to simply pack an int and a float along with an IP address in this way - whatever code will be reading this will be super coupled with the code you are writing to that.
If you are simply storing it to a file to be read back by a Python program under your control just use pickle instead. If you intend to send these values to a non Python program over a network, a schemaless textual way of conveying them, like JSON, might be much simpler.
If you really want to store these, and only these, in a compact way in order to save space, and there are tens of thousands of them, and they will be read back by the same program: try numpy arrays. They will take care of the compact binary representation for each object type and can be read and written to binary files, and numpy will take care of record offset for you.
The only use case I could see for that is if you have a program not under your control in a low level protocol that would be expecting exactly this record format. Since you are speculating about how to create the payload, and trying to convey "3.10" as a floating point value, this does not seem to be the case. Talking about that, "3.10" or other numbers might not round trip well as a nicely formed 2-decimal digit value with structs like this, due to how floating points are represented internally. I suggest you review your goals and needs there, and not overcomplicate things.
To unpack, the easiest approach is to use struct to recover just the numeric values and pass the remaining 16 bytes to the ip_address factory function. It automatically creates an IPv6Address object, whose string representation is the human-friendly "2001::1".
I type the "roundtrip" in an interactive prompt:
In [30]: import struct, ipaddress
In [31]: x = ipaddress.ip_address('2001::1')
In [32]: v1 = 2;v2 = 3.10
In [33]: payload = struct.pack(">if",v1, v2) + x.packed
In [34]: print(payload)
b'\x00\x00\x00\x02@Fff \x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
In [35]: v3, v4, nx = struct.unpack(">if", payload[:-16]) + (ipaddress.ip_address(payload[-16:]),)
In [36]: print(v3, v4, nx)
2 3.0999999046325684 2001::1
Two things to be noted: the added > prefix on the struct formatting string, to ensure byte order when encoding its contents,
and that the "3.10" value, as described above, does not roundtrip as a "nice two decimal places" value. Use round(number, 2) when representing the number on the server side, if needed.
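The round trip above can be wrapped into two small helpers. This is only a sketch under the assumptions already made in this answer (a big-endian 32-bit int, a 32-bit float, then the 16 address bytes); the names pack_record and unpack_record are hypothetical.

```python
import ipaddress
import struct

def pack_record(v1, v2, address):
    """Pack an int, a 32-bit float and an IPv6 address into 24 bytes."""
    return struct.pack(">if", v1, v2) + ipaddress.IPv6Address(address).packed

def unpack_record(payload):
    """Reverse pack_record: 8 bytes of numbers, then 16 address bytes."""
    v1, v2 = struct.unpack(">if", payload[:8])
    address = ipaddress.IPv6Address(payload[8:24])
    return v1, v2, address

payload = pack_record(2, 3.10, "2001::1")
v1, v2, address = unpack_record(payload)
print(len(payload), v1, round(v2, 2), address)  # 24 2 3.1 2001::1
```

The same layout can also be written as the single format string ">if16s", passing the packed address as the third value, since "16s" packs a 16-byte string as-is.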

Switching endianness in the middle of a struct.unpack format string

I have a bunch of binary data (the contents of a video game save-file, as it happens) where a part of the data contains both little-endian and big-endian integer values. Naively, without reading much of the docs, I tried to unpack it this way...
struct.unpack(
    '3sB<H<H<H<H4s<I<I32s>IbBbBbBbB12s20sBB4s',
    string_data
)
...and of course I got this cryptic error message:
struct.error: bad char in struct format
The problem is that struct.unpack format strings do not expect individual fields to be marked with endianness. The actually correct format-string here would be something like
struct.unpack(
    '<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s',
    string_data
)
except that this will flip the endianness of the third I field (parsing it as little-endian, when I really want to parse it as big-endian).
Is there an easy and/or "Pythonic" solution to my problem? I have already thought of three possible solutions, but none of them is particularly elegant. In the absence of better ideas I'll probably go with number 3:
1. I could extract a substring and parse it separately:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = struct.unpack('>I', string_data[56:60])[0]
2. I could swap the bytes in the field after the fact:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = swap32(my.f11)
3. I could just change my downstream code to expect this field to be represented differently. It's actually a bitmask, not an arithmetic integer, so it wouldn't be too hard to flip around all the bitmasks I'm using with it; but the big-endian versions of these bitmasks are more mnemonically relevant than the little-endian versions.
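One way to generalise the first option, sketched here with the struct module only: split the format into runs that share a byte order and unpack them one after another with struct.unpack_from. The helper name unpack_mixed and the SEGMENTS list are assumptions made for illustration; the field layout is the one from the format string above.

```python
import struct

# The question's layout, split wherever the byte order changes.
SEGMENTS = ["<3sBHHHH4sII32s", ">I", "<bBbBbBbB12s20sBB4s"]

def unpack_mixed(data, segments=SEGMENTS):
    """Unpack consecutive segments, each with its own byte-order prefix."""
    values = []
    offset = 0
    for fmt in segments:
        values.extend(struct.unpack_from(fmt, data, offset))
        offset += struct.calcsize(fmt)  # '<' and '>' never add padding
    return tuple(values)

# Dummy 106-byte record: all zeros except the big-endian field at bytes 56..59.
record = bytes(56) + b"\x00\x00\x00\x01" + bytes(46)
fields = unpack_mixed(record)
print(fields[10])  # 1
```

Because every segment carries an explicit "<" or ">" prefix, struct uses standard sizes without alignment, so the running offset stays in step with the byte layout.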

A little late to the party, but I just had the same problem. I solved it with a custom numpy dtype, which allows mixing elements with different endianness (see https://numpy.org/doc/stable/reference/generated/numpy.dtype.html):
import numpy as np
t = np.dtype('>u4,<u4')  # Compound type: two 4-byte unsigned ints with different byte order
a = np.zeros(shape=1, dtype=t)  # Create an array of length one with the above type
a[0][0] = 1  # Assign first uint (big-endian)
a[0][1] = 1  # Assign second uint (little-endian)
buf = a.tobytes()  # buf is b'\x00\x00\x00\x01\x01\x00\x00\x00'
b = np.frombuffer(buf, dtype=t)  # yields an array holding (1, 1)
c = np.frombuffer(buf, dtype=np.uint32)  # on a little-endian machine: array([16777216, 1])

How can I convert binary data from a file to readable base-two binary in Python?

In a class I am in, we are assigned to write a simple MIPS simulator. The instructions that my program is supposed to process are given in a binary file. I have no idea how to get anything usable out of that file. Here is my code:
import struct
import argparse
'''open a parser to get command line arguments '''
parser = argparse.ArgumentParser(description='Mips instruction simulator')
'''add two required arguments for the input file and the output file'''
parser.add_argument('-i', action="store", dest='infile_name', help="-i INPUT_FILE", required=True)
parser.add_argument('-o', action="store", dest='outfile_name', help="-o OUTPUT_FILE_NAME", required=True)
'''get the passed arguments'''
args = parser.parse_args()
class Disassembler:
    '''Disassembler for mips code'''
    instruction_buffer = None
    instructions_read = 0
    def __init__(self, filename):
        bin_file = None
        try:
            bin_file = open(filename, 'rb')
        except:
            print("Input file: ", filename, " could not be opened. Check the name, permissions, or path")
            quit()
        while True:
            read_bytes = bin_file.read(4)
            if (read_bytes == b''):
                break
            int_var = struct.unpack('>I', read_bytes)
            print(int_var)
        bin_file.close()
disembler = Disassembler(args.infile_name)
So, at first I just printed the 4 bytes I read to see what was returned.
I was hoping to see plain bits (just 1's and 0's). What I got, from what I've read, were byte strings. So I tried googling to find out what I can do about this, and found I could use struct to convert these byte strings to integers. That outputs them in a format like (4294967295,).
This is still annoying, because I have to trim that to make it a usable integer, and even then I still have to convert it to bits (base 2). It's nice that I can read the bytes with struct as either signed or unsigned, because half of the input file's values are signed 32-bit numbers.
All of this seems way more complicated than it should be to just get the bits out of a binary file. Is there an easier way to do this? Also, can you explain it like you would to someone who is not incredibly familiar with esoteric Python code and is new to binary data?
My overall goal is to get straight 32 bits out of each 4 bytes I've read. The beginning of the file is a list of MIPS opcodes, so I need to be able to see specific parts of these numbers, like the first 5 bits, then the next 6, and so on. The end of the file contains 32-bit signed integer values. The two halves of the file are separated by a break opcode.
Thank you for any help you can give me. It's driving me crazy that I can't find any straightforward answers through searching. If you want to see the binary file, tell me where and I'll post it.

Bear in mind that normal Python integers don't have a fixed bit width: they're as big as they need to be. This can be annoying when you want to convert signed integers to bit strings. I recommend that you stick with what you're currently doing: converting blocks of 4 bytes to unsigned integer using
n = struct.unpack('>I', read_bytes)[0]
and then using either format(n, '032b') or '{0:032b}'.format(n) to convert that to a bit string if you want to print the bits.
To access or modify the bits in an integer, you shouldn't bother with string conversion. Instead, use Python's bitwise operators: &, |, ^, <<, >>, ~. E.g., (n >> 7) & 1 gives you bit 7 of n.
Please see Unary arithmetic and bitwise operations and the following sections in the Python docs for detailed information about these operators.
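To make the shift-and-mask idea concrete, here is a small sketch that splits one 32-bit word into the standard MIPS R-type fields (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code). The helper name decode_r_type and the sample word are assumptions for illustration; your assignment's file may contain other instruction formats as well.

```python
import struct

def decode_r_type(word_bytes):
    """Split one big-endian 32-bit word into MIPS R-type fields."""
    n = struct.unpack(">I", word_bytes)[0]
    return {
        "opcode": (n >> 26) & 0x3F,  # bits 31..26
        "rs": (n >> 21) & 0x1F,      # bits 25..21
        "rt": (n >> 16) & 0x1F,      # bits 20..16
        "rd": (n >> 11) & 0x1F,      # bits 15..11
        "shamt": (n >> 6) & 0x1F,    # bits 10..6
        "funct": n & 0x3F,           # bits 5..0
        "bits": format(n, "032b"),   # printable bit string
    }

# 0x012A4020 encodes "add $t0, $t1, $t2".
fields = decode_r_type(b"\x01\x2a\x40\x20")
print(fields["rs"], fields["rt"], fields["rd"], fields["funct"])  # 9 10 8 32

# For the signed data words at the end of the file, use 'i' instead of 'I'.
signed_value = struct.unpack(">i", b"\xff\xff\xff\xfe")[0]
print(signed_value)  # -2
```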

This way you can access each individual bit in the file.
"".join(format(i, "08b") for i in byte_string)
For example:
>>> "".join(format(i, "08b") for i in b"\x23\x54a")
'001000110101010001100001'

How do I search for a set amount of hex and non hex data in python

I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem I'm having is that \x10\x54 is not always at the end of the data I want. However, I have noticed that the first two bytes give the length of the data, i.e. \x00\x0e = 14, so the data is 14 characters long.
Is there a better way to do this, like matching the first part and then cutting out the next 14 characters? I should also say that the length will vary, as I'm looking to match several things.
Also, is there a way to output the string in all hex so it's easier for me to read when working in a Python shell, i.e. \x10B == \x10\x42?
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
length = int(newdata[0][0:2].encode('hex'), 16)
newdata[0][2:2 + length]

Please note that you have a structured binary file on your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all, the "hex data" you talk about is not "hex data"; it is just bytes in your stream outside the printable ASCII range. Python 2 therefore displays these characters as \x10 and so on, but internally each is just a single byte with the value 16 (when viewed as decimal). The \x42 you write corresponds to the ASCII letter B, and that is why you see B in your representation.
So your best bet would be to get the file specification, and read the data you want using the struct module and byte-string slicing.
If you can't get the file spec, finding the fields of interest is reverse-engineering work, just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lengths (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se; most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or an offset given earlier in the file).
But if you are correctly using this as a marker, you could use regular expressions just to find all offsets of the "\x10\x42" marker, as you are doing, and then interpret the following two bytes as the message length:
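import struct, re
def get_data(data, sep=b"\x10B"):
    results = []
    # re.escape keeps the marker literal even if it contains regex metacharacters
    for match in re.finditer(re.escape(sep), data):
        offset = match.start()
        # the two bytes after the marker are a big-endian length
        msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
        print(msglen)
        results.append(data[offset + 4: offset + 4 + msglen])
    return results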
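For comparison, here is a sketch of the same idea without regular expressions (Python 3), applied to the sample string from the question. It also skips over each payload before searching again, so marker-like bytes inside a payload are not mistaken for a new record. The name find_records is hypothetical, and treating "\x10B" as a reliable marker is the same assumption as above.

```python
import struct

def find_records(data, marker=b"\x10B"):
    """Return the length-prefixed payloads that follow each marker."""
    records = []
    pos = data.find(marker)
    while pos != -1:
        start = pos + len(marker)
        if len(data) < start + 2:
            break  # not enough bytes left for a length field
        (length,) = struct.unpack(">H", data[start:start + 2])
        records.append(data[start + 2:start + 2 + length])
        # resume the search after this payload
        pos = data.find(marker, start + 2 + length)
    return records

sample = b"\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21"
print(find_records(sample))  # [b'12102 G1103543']
print(sample[:6].hex())      # 00031042000e
```

The last line covers the other part of the question: in Python 3.5 and later, bytes.hex() shows every byte as two hex digits, so \x10B is displayed as 1042.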

Python, how to put 32-bit integer into byte array

I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever Python stores the result of an "fread" in), I can access the individual bytes in it with: buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on Python 2.4 (and can't upgrade, big corporation servers), so that apparently limits my options. Sorry for not mentioning that earlier; I wasn't aware it had so many fewer features.
Secondly, let's make this ultra-simple.
Let's say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex. This equates to the ASCII representations for the letters "L and 4xS" being alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.

What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
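import struct
f = open('myfile.binary', 'rb')
contents = f.read(5)
f.close()
data = list(contents)  # list of 1-character strings
data[1:5] = struct.pack(">i", 0x12345678)  # overwrite bytes 1-4, keeping the leading 'L'
print data
contents = "".join(data)  # back to a 5-byte string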
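For readers who are not bound to Python 2.4: in Python 3 the usual mutable buffer is a bytearray, and struct.pack_into can write into it directly. This is a sketch only, with the five bytes from the question hard-coded in place of the file read.

```python
import struct

contents = b"LSSSS"                  # stands in for f.read(5)
buffer = bytearray(contents)         # mutable copy of the immutable bytes
struct.pack_into(">I", buffer, 1, 0x12345678)  # overwrite bytes 1..4 in place
print(bytes(buffer))  # b'L\x124Vx'
```

The value has to fit in 32 bits: struct raises struct.error for 12345678910, which is larger than 2**32 - 1.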

Have a look at the struct module. You need the pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.

The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
import array
with open("filename", "rb") as f:
    f.seek(0, 2)  # jump to the end to find the file size
    size = f.tell()
    f.seek(0)
    data = array.array("i")
    assert data.itemsize == 4
    data.fromfile(f, size // 4)
data[2] = new_value  # item 2 covers bytes 8-11
# use data.tofile(g) to write the data back to a new file g

You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=filename, dtype=numpy.uint32)  # filename: path to the binary file
Then access the data at the desired location and make your changes.

The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (e.g. x86 machines), this number can be represented as a 4-byte string: b"\x6d\x3b\x00\x00", which is shown as b"m;\x00\x00" when you print it on the screen. To convert the four bytes into an integer, we simply do a bit of base conversion, which in this case is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213
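In practice you would not write the sum by hand. As a sketch in Python 3, the built-in int.from_bytes and struct.unpack perform the same conversion, and int.to_bytes goes the other way:

```python
import struct

raw = b"\x6d\x3b\x00\x00"

# The manual base-256 conversion from above.
manual = sum(n * (256 ** i) for i, n in enumerate(raw))

# Built-in equivalents for a little-endian unsigned 32-bit value.
builtin = int.from_bytes(raw, byteorder="little")
(unpacked,) = struct.unpack("<I", raw)

print(manual, builtin, unpacked)  # 15213 15213 15213

# Reading the same bytes as big-endian gives a different number.
big = int.from_bytes(raw, byteorder="big")

# And back from integer to bytes.
round_trip = (15213).to_bytes(4, byteorder="little")
```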
