Where are python bytearrays used? - python

I recently came across the dataType called bytearray in python. Could someone provide scenarios where bytearrays are required?

This answer has been shameless ripped off from here
Example 1: Assembling a message from fragments
Suppose you're writing some network code that is receiving a large message on a socket connection. If you know about sockets, you know that the recv() operation doesn't wait for all of the data to arrive. Instead, it merely returns what's currently available in the system buffers. Therefore, to get all of the data, you might write code that looks like this:
# remaining = number of bytes being received (determined already)
msg = b""
while remaining > 0:
chunk = s.recv(remaining) # Get available data
msg += chunk # Add it to the message
remaining -= len(chunk)
The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you're done. Like this:
# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
chunk = s.recv(remaining) # Get available data
msgparts.append(chunk) # Add it to list of chunks
remaining -= len(chunk)
msg = b"".join(msgparts) # Make the final message
Now, here's a third solution using a bytearray:
# remaining = number of bytes being received (determined already)
msg = bytearray()
while remaining > 0:
chunk = s.recv(remaining) # Get available data
msg.extend(chunk) # Add to message
remaining -= len(chunk)
Notice how the bytearray version is really clean. You don't collect parts in a list and you don't perform that cryptic join at the end. Nice.
Of course, the big question is whether or not it performs. To test this out, I first made a list of small byte fragments like this:
chunks = [b"x"*16]*512
I then used the timeit module to compare the following two code fragments:
# Version 1
msgparts = []
for chunk in chunks:
msgparts.append(chunk)
msg = b"".join(msgparts)
#Version 2
msg = bytearray()
for chunk in chunks:
msg.extend(chunk)
When tested, version 1 of the code ran in 99.8s whereas version 2 ran in 116.6s (a version using += concatenation takes 230.3s by comparison). So while performing a join operation is still faster, it's only faster by about 16%. Personally, I think the cleaner programming of the bytearray version might make up for it.
Example 2: Binary record packing
This example is an slight twist on the last example. Suppose you had a large Python list of integer (x,y) coordinates. Something like this:
points = [(1,2),(3,4),(9,10),(23,14),(50,90),...]
Now, suppose you need to write that data out as a binary encoded file consisting of a 32-bit integer length followed by each point packed into a pair of 32-bit integers. One way to do it would be to use the struct module like this:
import struct
f = open("points.bin","wb")
f.write(struct.pack("I",len(points)))
for x,y in points:
f.write(struct.pack("II",x,y))
f.close()
The only problem with this code is that it performs a large number of small write() operations. An alternative approach is to pack everything into a bytearray and only perform one write at the end. For example:
import struct
f = open("points.bin","wb")
msg = bytearray()
msg.extend(struct.pack("I",len(points))
for x,y in points:
msg.extend(struct.pack("II",x,y))
f.write(msg)
f.close()
Sure enough, the version that uses bytearray runs much faster. In a simple timing test involving a list of 100000 points, it runs in about half the time as the version that makes a lot of small writes.
Example 3: Mathematical processing of byte values
The fact that bytearrays present themselves as arrays of integers makes it easier to perform certain kinds of calculations. In a recent embedded systems project, I was using Python to communicate with a device over a serial port. As part of the communications protocol, all messages had to be signed with a Longitudinal Redundancy Check (LRC) byte. An LRC is computed by taking an XOR across all of the byte values.
Bytearrays make such calculations easy. Here's one version:
message = bytearray(...) # Message already created
lrc = 0
for b in message:
lrc ^= b
message.append(lrc) # Add to the end of the message
Here's a version that increases your job security:
message.append(functools.reduce(lambda x,y:x^y,message))
And here's the same calculation in Python 2 without bytearrays:
message = "..." # Message already created
lrc = 0
for b in message:
lrc ^= ord(b)
message += chr(lrc) # Add the LRC byte
Personally, I like the bytearray version. There's no need to use ord() and you can just append the result at the end of the message instead of using concatenation.
Here's another cute example. Suppose you wanted to run a bytearray through a simple XOR-cipher. Here's a one-liner to do it:
>>> key = 37
>>> message = bytearray(b"Hello World")
>>> s = bytearray(x ^ key for x in message)
>>> s
bytearray(b'm#IIJ\x05rJWIA')
>>> bytearray(x ^ key for x in s)
bytearray(b"Hello World")
>>>
Here is a link to the presentation

A bytearray is very similar to a regular python string (str in python2.x, bytes in python3) but with an important difference, whereas strings are immutable, bytearrays are mutable, a bit like a list of single character strings.
This is useful because some applications use byte sequences in ways that perform poorly with immutable strings. When you are making lots of little changes in the middle of large chunks of memory, as in a database engine, or image library, strings perform quite poorly; since you have to make a copy of the whole (possibly large) string. bytearrays have the advantage of making it possible to make that kind of change without making a copy of the memory first.
But this particular case is actually more the exception, rather than the rule. Most uses involve comparing strings, or string formatting. For the latter, there's usually a copy anyway, so a mutable type would offer no advantage, and for the former, since immutable strings cannot change, you can calculate a hash of the string and compare that as a shortcut to comparing each byte in order, which is almost always a big win; and so it's the immutable type (str or bytes) that is the default; and bytearray is the exception when you need it's special features.

If you look at the documentation for bytearray, it says:
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
In contrast, the documentation for bytes says:
Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behaviors.
As you can see, the primary distinction is mutability. str methods that "change" the string actually return a new string with the desired modification. Whereas bytearray methods that change the sequence actually change the sequence.
You would prefer using bytearray, if you are editing a large object (e.g. an image's pixel buffer) through its binary representation and you want the modifications to be done in-place for efficiency.

Wikipedia provides an example of XOR cipher using Python's bytearrays (docstrings reduced):
#!/usr/bin/python2.7
from os import urandom
def vernam_genkey(length):
"""Generating a key"""
return bytearray(urandom(length))
def vernam_encrypt(plaintext, key):
"""Encrypting the message."""
return bytearray([ord(plaintext[i]) ^ key[i] for i in xrange(len(plaintext))])
def vernam_decrypt(ciphertext, key):
"""Decrypting the message"""
return bytearray([ciphertext[i] ^ key[i] for i in xrange(len(ciphertext))])
def main():
myMessage = """This is a topsecret message..."""
print 'message:',myMessage
key = vernam_genkey(len(myMessage))
print 'key:', str(key)
cipherText = vernam_encrypt(myMessage, key)
print 'cipherText:', str(cipherText)
print 'decrypted:', vernam_decrypt(cipherText,key)
if vernam_decrypt(vernam_encrypt(myMessage, key),key)==myMessage:
print ('Unit Test Passed')
else:
print('Unit Test Failed - Check Your Python Distribution')
if __name__ == '__main__':
main()

Related

When to use io.BytesIO() in Python for modifying strings

I am studying the IO python module.
I have the two following code snippets:
buffer = b""
buffer += b"Hello World"
buffer += b"Hello World"
buffer += b"Hello World"
import io
with io.BytesIO() as f:
f.write(b"Hello World")
f.write(b"Hello World")
f.write(b"Hello World")
To me these two block pretty much do the same thing.
The only difference is that the second works in place, while the first do not.
I have heard that the second way is usually faster, but I have no idea why.
Could somebody explain me when the second method is preferred in respect to the first (in string modifications)?
Bytes objects are immutable, so in-place addition creates a new object every time. If the operation is repeated many times then this can cause poor performance. From the Python sequence docs (irrelevant parts omitted)
Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
...
if concatenating bytes objects, you can similarly use bytes.join() or io.BytesIO, or you can do in-place concatenation with a bytearray object. bytearray objects are mutable and have an efficient overallocation mechanism
...

which are python best practices to concatenate different types of data?

I want to generate bytes sequence containing string length and string content.
For example, for string 'hello' I want to get b'\x05hello'
After some docs reading I've wrote a function:
def LenAndStrBytes(strdata):
return bytearray([len(strdata)&0xFF])+strdata if strdata!=[] else 0
question:
I'm new in python programming and I wonder, which are best python practices to concatenate different types of data like int and something iterable like bytearray
Did I write my function optimally?
Well, just larsmans points out, it depends on your usage. If you can get the result w/ clear code and fulfilling context limitation, it is nice practice which is suitable.
No need for &0xFF, bytearray checks to ensure values between 0 and 255.
>>> strdata = 'hello'
>>> bytearray([len(strdata)]) + strdata if strdata else bytearray()
bytearray(b'\x05hello')
And you could also
import struct
bytearray(struct.pack('B%ds' % len(strdata), len(strdata), strdata))
Are you trying to serialise binary data before you write it to a file (or send it over a network?)
Did you perhaps mean to use the pickle module for data serialisation instead?

Python, how to put 32-bit integer into byte array

I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever python stores the result of an "fread" in). I can access the individual bytes in it with: buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on python 2.4 (and can't upgrade, big corporation servers) - so that apparently limits my options. Sorry for not mentioning that earlier, I wasn't aware it had so many less features.
Secondly, let's make this ultra-simple.
Lets say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex - this equates to the ascii representations for letters "L and 4xS" being alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.
What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct
f=open('myfile.binary', 'rb')
contents=f.read(5)
data=list(contents)
data[0:4]=struct.pack(">i", 0x12345678)
print data
Have a look at Struct module. You need pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.
The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
with open("filename") as f:
f.seek(0, 2)
size = f.tell()
f.seek(0)
data = array.array("i")
assert data.itemsize == 4
data.fromfile(f, size // 4)
data[2] = new_value
# use data.tofile(g) to write the data back to a new file g
You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=id, dtype=numpy.uint32)
Then access the data at the desired location and make your changes.
The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e x86 machines), this number can be represented using a length-4 bytearray as: b"\x6d\x3b\x00\x00" or b"m;\x00\x00" when you print it on the screen, to convert the four bytes into an integer, we simply do a bit of base conversion, which in this case, is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213

Converting 2.5 byte comparisons to 3

I'm trying to convert a 2.5 program to 3.
Is there a way in python 3 to change a byte string, such as b'\x01\x02' to a python 2.5 style string, such as '\x01\x02', so that string and byte-by-byte comparisons work similarly to 2.5? I'm reading the string from a binary file.
I have a 2.5 program that reads bytes from a file, then compares or processes each byte or combination of bytes with specified constants. To run the program under 3, I'd like to avoid changing all my constants to bytes and byte strings ('\x01' to b'\x01'), then dealing with issues in 3 such as:
a = b'\x01'
b = b'\x02'
results in
(a+b)[0] != a
even though similar operation work in 2.5. I have to do (a+b)[0] == ord(a), while a+b == b'\x01\x02' works fine. (By the way, what do I do to (a+b)[0] so it equals a?)
Unpacking structures is also an issue.
Am I missing something simple?
Bytes is an immutable sequence of integers (in the range 0<= to <256), therefore when you're accessing (a+b)[0] you're getting back an integer, exactly the same one you'd get by accessing a[0]. so when you're comparing sequence a to an integer (a+b)[0], they're naturally different.
using the slice notation you could however get a sequence back:
>>> (a+b)[:1] == a # 1 == len(a) ;)
True
because slicing returns bytes object.
I would also advised to run 2to3 utility (it needs to be run with py2k) to convert some code automatically. It won't solve all your problems, but it'll help a lot.

How Does One Read Bytes from File in Python

Similar to this question, I am trying to read in an ID3v2 tag header and am having trouble figuring out how to get individual bytes in python.
I first read all ten bytes into a string. I then want to parse out the individual pieces of information.
I can grab the two version number chars in the string, but then I have no idea how to take those two chars and get an integer out of them.
The struct package seems to be what I want, but I can't get it to work.
Here is my code so-far (I am very new to python btw...so take it easy on me):
def __init__(self, ten_byte_string):
self.whole_string = ten_byte_string
self.file_identifier = self.whole_string[:3]
self.major_version = struct.pack('x', self.whole_string[3:4]) #this
self.minor_version = struct.pack('x', self.whole_string[4:5]) # and this
self.flags = self.whole_string[5:6]
self.len = self.whole_string[6:10]
Printing out any value except is obviously crap because they are not formatted correctly.
If you have a string, with 2 bytes that you wish to interpret as a 16 bit integer, you can do so by:
>>> s = '\0\x02'
>>> struct.unpack('>H', s)
(2,)
Note that the > is for big-endian (the largest part of the integer comes first). This is the format id3 tags use.
For other sizes of integer, you use different format codes. eg. "i" for a signed 32 bit integer. See help(struct) for details.
You can also unpack several elements at once. eg for 2 unsigned shorts, followed by a signed 32 bit value:
>>> a,b,c = struct.unpack('>HHi', some_string)
Going by your code, you are looking for (in order):
a 3 char string
2 single byte values (major and minor version)
a 1 byte flags variable
a 32 bit length quantity
The format string for this would be:
ident, major, minor, flags, len = struct.unpack('>3sBBBI', ten_byte_string)
Why write your own? (Assuming you haven't checked out these other options.) There's a couple options out there for reading in ID3 tag info from MP3s in Python. Check out my answer over at this question.
I am trying to read in an ID3v2 tag header
FWIW, there's already a module for this.
I was going to recommend the struct package but then you said you had tried it. Try this:
self.major_version = struct.unpack('H', self.whole_string[3:5])
The pack() function convers Python data types to bits, and the unpack() function converts bits to Python data types.

Categories