I am trying to write data (text, floating point data) to a file in binary, which is to be read by another program later. The problem is that this program (in Fort95) is incredibly particular; each byte has to be in exactly the right place in order for the file to be read correctly. I've tried using Bytes objects and .encode() to write, but haven't had much luck (I can tell from the file size that it is writing extra bytes of data). Some code I've tried:
mgcnmbr='42'
bts=bytes(mgcnmbr)
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
I've also tried:
mgcnmbr='42'
bts=mgcnmbr.encode(utf_32_le)
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
To clarify, what I need is the integer value 42, written as a 4 byte binary. Next, I would write the numbers 1 and 0 in 4 byte binary. At that point, I should have exactly 12 bytes. Each is a 4 byte signed integer, written in binary. I'm pretty new to Python, and can't seem to get it to work out. Any suggestions? Soemthing like this? I need complete control over how many bytes each integer (and later, 4 byte floating point ) is.
Thanks
You need the struct module.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>i', 42))
fout.write(struct.pack('>f', 2.71828182846))
fout.close()
The first argument in struct.pack is the format string.
The first character in the format string dictates the byte order or endianness of the data (Is the most significant or least significant byte stored first - big-endian or little-endian). Endianness varies from system to system. If ">" doesn't work try "<".
The second character in the format string is the data type. Unsurprisingly the "i" stands for integer and the "f" stands for float. The number of bytes is determined by the type. Shorts or "h's" for example are two bytes long. There are also codes for unsigned types. "H" corresponds to an unsigned short for instance.
The second argument in struct.pack is of course the value to be packed into the bytes object.
Here's the part where I tell you that I lied about a couple of things. First I said that the number of bytes is determined by the type. This is only partially true. The size of a given type is technically platform dependent as the C/C++ standard (which the struct module is based on) merely specifies minimum sizes. This leads me to the second lie. The first character in the format string also encodes whether the standard (minimum) number of bytes or the native (platform dependent) number of bytes is to be used. (Both ">" and "<" guarantee that the standard, minimum number of bytes is used which is in fact four in the case of an integer "i" or float "f".) It additionally encodes the alignment of the data.
The documentation on the struct module has tables for the format string parameters.
You can also pack multiple primitives into a single bytes object and realize the same result.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>if', 42, 2.71828182846))
fout.close()
And you can of course parse binary data with struct.unpack.
Assuming that you want it in little-endian, you could do something like this to write 42 in a four byte binary.
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(b'\xA2\0\0\0')
test_file.close()
A2 is 42 in hexadecimal, and the bytes '\xA2\0\0\0' makes the first byte equal to 42 followed by three empty bytes. This code writes the byte: 42, 0, 0, 0.
Your code writes the bytes to represent the character '4' in UTF 32 and the bytes to represent 2 in UTF 32. This means it writes the bytes: 52, 0, 0, 0, 50, 0, 0, 0, because each character is four bytes when encoded in UTF 32.
Also having a hex editor for debugging could be useful for you, then you could see the bytes that your program is outputting and not just the size.
In my problem Write binary string in binary file Python 3.4 I do like this:
file.write(bytes(chr(int(mgcnmbr)), 'iso8859-1'))
Related
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: I am correct that the len() function, when operated on a python string returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UFtF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: s.encode('utf-16-le')//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (216):
def utf16len(c):
"""Returns the length of the single character 'c'
in UTF-16 code units."""
return 1 if ord(c) < 65536 else 2
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python checks the number of decoded characters. If a character takes 4 UTF-8 code points, it still counts as 1.
The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
I'm working with an array of bytes extracted from an UDP packet in Python.
The data it's represented like this:
data = [0x00,0x01,0x23,0x84,0xa6]
And when I use bytearray(data) and prints its content the prompt shows me a not hexadecimal digit like x01# or with other data contents the # digit its replace by a \n digit. I don't really know why this happens.
The complete code example
data = [0x00,0x01,0x23,0x84,0xa6]
data1 = bytearray(data)
print(data)
print(data1)
And the print shows
[0, 1, 35, 132, 166]
bytearray(b'\x00\x01#\x84\xa6')
Using bytes(data) the problem is the same.
Your bytearray is represented as a string. When a string is represented for human eyes the characters are displayed according to the current encoding (ASCII, utf-8, etc.). In your current encoding, the character with the value 0x23 is a hash-symbol (#). Only for the bytes which do not have a character representation (0x00, etc.) the hex representation is displayed (e.g. \x00).
So what you see is absolutely correct because you asked (maybe without knowing) for a string representation of your byte array.
If you want to see a hex value for each byte, use data1.hex(). This will create a hex representation for each byte and concatenate all of these. The result will be a string containing only hex digits (0-9 and a-f). This is only useful for printing, in most cases it is not useful for further processing.
In Python3, consider using bytes([0x00, 0x01, ...]) instead. That will produce a bytes object which is more native to the language (e.g. many functions like write(), send(), etc. will accept it as input). It also has a hex() method as described above.
What I am really doing is creating a BMP file from JPEG using python and it's got some header data which contains info like size, height or width of the image, so basically I want to read a JPEG file, gets it width and height, calculate the new size of a BMP file and store it in the header.
Let's say the new size of the BMP file is 40000 bytes whose hex value is 0x9c40, now as there is 4 byte space to save this in the header, we can write it as 0x00009c40. In BMP header data, LSB is written first and then MSB so I have to write, 0x409c0000 in the file.
My Problems:-
I was able to do this in C but I am totally lost how to do so in Python.
For example, if I have i=40000, and by using str=hex(i)[2:] I got the hex value, now by some coding I was able to add the extra zeros and then reverse the code. Now how to write this '409c0000' data in the file as hex?
The header size is 54 bytes for BMP file, so is there is another way to just store the data in a string like str='00ffcf4f...'(upto 54 bytes) and just convert the whole str at once as hex and write it to file?
My friend told me to use unhexlify from binascii,
by doing unhexlify('fffcff') I get '\xff\xfc\xff' which is what I want but when I try unhexlify('3000') I get '0\x00'` which is not what I want. It is same for any value containing 3, 4, 5, 6 or 7. Is it the right way to do this?
You are not writing hex, you are writing binary data. Hexadecimal is a helpful notation when dealing with binary data, but don't confuse the notation with the value.
Use the struct module to pack integer data into binary structures, the same way C would.
binascii.unhexlify also is a good choice, provided you already have the data in a string using hex notation. The output is correct, but the binary representation only uses hex escapes for bytes outside the printable ASCII range.
Thus fffcff does correctly becomes \xff\xfc\xff, representing 3 bytes in hex escape notation, and 3000 is \x30\x00, but \x30 is the '0' character in ASCII, so the Python representation for that byte simply uses that ASCII character, as that is the most common way to interpret bytes.
Packing the integer value 40000 using struct.pack() as an unsigned integer (little endian) then becomes:
>>> import struct
>>> struct.pack('<I', 40000)
'#\x9c\x00\x00'
where the 40 byte is represented by the ASCII character for that byte, the # glyph.
If this is confusing, you can always create a new hex representation by going the other way and use 0binascii.hexlify() function](https://docs.python.org/2/library/binascii.html#binascii.hexlify) to create a hexadecimal representation for yourself, just to debug the output:
>>> import binascii
>>> binascii.hexlify(struct.pack('<I', 40000))
'409c0000'
and you'll see that the # byte is still the right hex value.
I want to append a crc that I calculate to an existing binary file.
For example, the crc is 0x55667788.
I want to append 0x55, 0x66, 0x77 and 0x88 to the end of the file.
For example, if I open the file in HexEdit, the last four bytes of the file
will show 0x55667788.
Here is my code so far:
fileopen = askopenfilename()
filename = open(fileopen, 'rb+')
filedata = filename.read()
filecrc32 = hex(binascii.crc32(filedata))
filename.seek(0,2)
filename.write(filecrc32)
filename.close()
I get the following error:
File "C:\Users\cjackel\openfile.py", line 9, in <module>
filename.write(filecrc32)
TypeError: 'str' does not support the buffer interface
Any suggestions?
The hex function returns a string. In this case, you've got a string of 10 hex characters representing your 4-byte number, like this:
'0x55667788'
In Python 2.x, you would be allowed to write this incorrect data to a binary file (it would show up as the 10 bytes 30 78 35 35 36 36 37 37 38 38 rather than the four bytes you want, 55 66 77 88). Python 3.x is smarter, and only allows you to write bytes (or bytearray or similar) to a binary file, not str.
What you want here is not the hex string, but the actual bytes.
The way you described the bytes you want is called big-endian order. On most computers, the "native" order is the opposite, little-endian, which would give you 0x88776655 instead of 0x55667788.
In Python 3.2+, the simplest way to get that is the int.to_bytes method:
filecrc = binascii.crc32(filedata).to_bytes(4, byteorder='big', signed=False)
(The signed=False isn't really necessary, because it's the default, but it's a good way of making it explicit that you're definitely dealing with an unsigned 32-bit integer.)
If you're stuck with earlier versions, you can use the struct module:
filecrc = struct.pack('>I', binascii.crc32(filedata))
The > means big-endian, and the I means unsigned 4-byte integer. So, this returns the same thing. In either case, what you get is b'\x55\x66\x77\x88' (or, as Python would repr it, b'\Ufw\x88').
The error is a bit cryptic, because no novice is going to have any idea what "the buffer interface" is (especially since the 3.x documentation calls it the Buffer Protocol, and it's only documented as part of CPython's C extension API…), but effectively it means that you need a bytes-like object. Usually, this error will mean that you just forgot to encode your string to UTF-8 or some other encoding. But when you were trying to write actual binary data rather than encoded text, it's the same error.
You need to serialize the data. Serialization is the process of getting
the relevant bytes from the whole number. In your case, your CRC is a
4-byte number. The individual 4 bytes can be retrieved to a list as below:
serialized_crc = [(filecrc32 >> 24) & 0xFF,(filecrc32 >> 16) & 0xFF,
(filecrc32 >> 8) & 0xFF,filecrc32 & 0xFF]
The CRC can be then written to the file by converting to a bytearray as below:
filename.write(bytearray(serialized_crc))