I am studying the Python io module.
I have the two following code snippets:
buffer = b""
buffer += b"Hello World"
buffer += b"Hello World"
buffer += b"Hello World"
import io
with io.BytesIO() as f:
    f.write(b"Hello World")
    f.write(b"Hello World")
    f.write(b"Hello World")
To me these two blocks do pretty much the same thing.
The only difference is that the second works in place, while the first does not.
I have heard that the second way is usually faster, but I have no idea why.
Could somebody explain when the second method is preferred over the first (for string modifications)?
Bytes objects are immutable, so in-place addition creates a new object every time. If the operation is repeated many times, this can cause poor performance. From the Python sequence docs (irrelevant parts omitted):
Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
...
if concatenating bytes objects, you can similarly use bytes.join() or io.BytesIO, or you can do in-place concatenation with a bytearray object. bytearray objects are mutable and have an efficient overallocation mechanism
...
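As a rough sketch of the linear-time alternatives the docs mention (my own example, not from the docs; exact speedups depend on Python version and data sizes), all three build the same byte string:

import io

chunks = [b"Hello World"] * 1000

# bytes.join(): collect the pieces, join once at the end
joined = b"".join(chunks)

# io.BytesIO: write into an in-memory buffer, then pull out the result
with io.BytesIO() as f:
    for chunk in chunks:
        f.write(chunk)
    buffered = f.getvalue()

# bytearray: mutable with overallocation, so repeated += is amortized linear
ba = bytearray()
for chunk in chunks:
    ba += chunk

assert joined == buffered == bytes(ba)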
This is my first test code:
import hashlib
md5Hash = hashlib.md5()
md5Hash.update('Coconuts')
print md5Hash.hexdigest()
md5Hash.update('Apples')
print md5Hash.hexdigest()
md5Hash.update('Oranges')
print md5Hash.hexdigest()
And this is my second chunk of code:
import hashlib
md5Hash = hashlib.md5()
md5Hash.update('Coconuts')
print md5Hash.hexdigest()
md5Hash.update('Bananas')
print md5Hash.hexdigest()
md5Hash.update('Oranges')
print md5Hash.hexdigest()
But the output for the first code is:
0e8f7761bb8cd94c83e15ea7e720852a
217f2e2059306ab14286d8808f687abb
4ce7cfed2e8cb204baeba9c471d48f07
And for the second code it is:
0e8f7761bb8cd94c83e15ea7e720852a
a82bf69bf25207f2846c015654ae68d1
47dba619e1f3eaa8e8a01ab93c79781e
I replaced the second string, 'Apples', with 'Bananas', while the third string remains the same. But I am still getting a different result for the third string. Hashing is supposed to give the same result every time.
Am I missing something?
hashlib.md5.update() adds data to the hash. It doesn't replace the existing values; if you want to hash a new value, you need to initialize a new hashlib.md5 object.
The values you're hashing are:
"Coconuts" -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsApples" -> 217f2e2059306ab14286d8808f687abb
"CoconutsApplesOranges" -> 4ce7cfed2e8cb204baeba9c471d48f07
"Coconuts" -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsBananas" -> a82bf69bf25207f2846c015654ae68d1
"CoconutsBananasOranges" -> 47dba619e1f3eaa8e8a01ab93c79781e
Because you're using the update method, the md5Hash object is reused across the 3 strings. So it's basically the hash of the 3 strings concatenated together, which is why changing the second string changes the outcome of the 3rd print as well.
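You can check this directly; hashing the concatenations in one go reproduces the second and third values from your first output (a quick sketch using the values listed above):

import hashlib
print(hashlib.md5(b'CoconutsApples').hexdigest())          # 217f2e2059306ab14286d8808f687abb
print(hashlib.md5(b'CoconutsApplesOranges').hexdigest())   # 4ce7cfed2e8cb204baeba9c471d48f07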
You need to create a separate md5 object for each string. Use a loop (note that Python 3 compliant code needs the b prefix on the literals; this also works in Python 2):
import hashlib
for s in (b'Coconuts', b'Bananas', b'Oranges'):
    md5Hash = hashlib.md5(s)  # no need for update, pass data at construction
    print(md5Hash.hexdigest())
result:
0e8f7761bb8cd94c83e15ea7e720852a
1ee31b77d0697c36914b99d1428f7f32
62f2b77089fea4c595e895901b63c10b
Note that the values are now different, but at least each one is the MD5 of its own string, computed independently.
Expected result
What you are expecting is generally what common cryptographic libraries do. In most cryptographic libraries the hash object is reset after calling a method that finalizes the calculation, such as hexdigest. hashlib.md5 apparently behaves differently.
Result by hashlib.md5
MD5 requires the input to be padded with a 1 bit, zero or more 0 bits and the length of the input in bits. Then the final hash value is calculated. hashlib.md5 internally seems to perform the final calculation using separate variables, keeping the state after hashing each string without this final padding.
So each of your results is the hash of the earlier strings concatenated with the given string, followed by the correct padding, as duskwulf pointed out in his answer.
This is correctly documented by hashlib:
hash.digest()
Return the digest of the strings passed to the update() method so far. This is a string of digest_size bytes which may contain non-ASCII characters, including null bytes.
and
hash.hexdigest()
Like digest() except the digest is returned as a string of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
Solution for hashlib.md5
As there doesn't seem to be a reset() method, you should create a new md5 object for each separate hash value you want to produce. Fortunately the hash objects themselves are relatively lightweight (even if the hashing itself isn't), so this won't consume much CPU or memory.
Discussion of the differences
For hashing itself resetting the hash in the finalizer may not make all that much sense. But it does matter for signature generation: you might want to initialize the same signature instance and then generate multiple signatures with it. The hash function should reset so it can calculate the signature over multiple messages.
Sometimes an application requires a combined hash over multiple inputs, including intermediate hash results. In that case, however, a Merkle tree of hashes is used, where the intermediate hashes themselves are hashed again.
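For illustration only, here is a toy two-leaf Merkle-style combination (my own sketch, not part of the original answer), where the intermediate digests are themselves hashed:

import hashlib

leaf1 = hashlib.md5(b'Coconuts').digest()       # intermediate hash of the first input
leaf2 = hashlib.md5(b'Apples').digest()         # intermediate hash of the second input
root = hashlib.md5(leaf1 + leaf2).hexdigest()   # hash over the intermediate hashes
print(root)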
As indicated, I consider this bad API design by the authors of hashlib. For cryptographers it certainly doesn't follow the rule of least surprise.
I'm looking for a clean and simple way to read a null-terminated C string from a file or file-like object in Python, in a way that doesn't consume more input from the file than it needs, or that pushes the excess back onto whatever file/buffer it works with, so that other code can read the data immediately after a null-terminated string.
I've seen a bit of rather ugly code to do it, but not much that I'd like to use.
universal newlines support only works for open()ed files, not StringIO objects etc, and doesn't look like it handles unconventional newlines. Also, if it did work it'd result in strings with \n appended, which is undesirable.
struct doesn't look like it supports reading arbitrary-length C strings at all, requiring a length as part of the format.
ctypes has c_buffer, which can be constructed from a byte string and will return the first null terminated string as its value. Again, this requires determining how much must be read in advance, and it doesn't differentiate between null-terminated and unterminated strings. The same is true of c_char_p. So it doesn't seem to help much, since you already have to know you've read enough of the string and have to handle buffer splits.
The usual way to do this in C is to read chunks into a buffer, copying and resizing the buffer if needed, then check whether the newest chunk read contains a null byte. If it does, return everything up to the null byte and either realign the buffer or, if you're being fancy, keep on reading and use it as a ring buffer. (This only works if you can hand the excess data back to the caller, or if your platform's ungetc lets you push a lot back onto the file, of course.)
Is it necessary to spell out similar code in Python? I was surprised not to find anything canned in io, ctypes or struct.
file objects don't seem to have a way to push back onto their buffer, like ungetc, and neither do buffered I/O streams in the io module.
I feel like I must be missing the obvious here. I'd really rather avoid byte-by-byte reading:
def readcstr(f):
    buf = bytearray()
    while True:
        b = f.read(1)
        if not b or b == '\0':   # EOF (read returns '') or NUL terminator
            return str(buf)
        else:
            buf.append(b)
but right now that's what I'm doing.
Incredibly mild improvement on what you have (mostly in that it uses more built-ins that, in CPython, are implemented in C, which usually runs faster):
import functools
import itertools
def readcstr(f):
    toeof = iter(functools.partial(f.read, 1), '')
    return ''.join(itertools.takewhile('\0'.__ne__, toeof))
This is relatively ugly (and sensitive to the type of the file object; it won't work with file objects that return unicode), but it pushes all the work to the C layer. The two-argument form of iter ensures you stop if the file is exhausted, itertools.takewhile looks for (and consumes) the NUL terminator but no more, and ''.join then combines the bytes read into a single return value.
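If you are on Python 3 with the file opened in binary mode, the same trick should work with bytes; a minimal sketch (note the b'' sentinel for EOF and the b'\0' terminator):

import functools
import itertools

def readcstr_bytes(f):
    toeof = iter(functools.partial(f.read, 1), b'')
    return b''.join(itertools.takewhile(b'\0'.__ne__, toeof))

# Example with an in-memory binary stream:
import io
buf = io.BytesIO(b'first\0second\0')
print(readcstr_bytes(buf))   # b'first'
print(readcstr_bytes(buf))   # b'second'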
I recently came across the data type called bytearray in Python. Could someone provide scenarios where bytearrays are required?
This answer has been shamelessly ripped off from here
Example 1: Assembling a message from fragments
Suppose you're writing some network code that is receiving a large message on a socket connection. If you know about sockets, you know that the recv() operation doesn't wait for all of the data to arrive. Instead, it merely returns what's currently available in the system buffers. Therefore, to get all of the data, you might write code that looks like this:
# remaining = number of bytes being received (determined already)
msg = b""
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg += chunk                 # Add it to the message
    remaining -= len(chunk)
The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you're done. Like this:
# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msgparts.append(chunk)       # Add it to list of chunks
    remaining -= len(chunk)
msg = b"".join(msgparts) # Make the final message
Now, here's a third solution using a bytearray:
# remaining = number of bytes being received (determined already)
msg = bytearray()
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg.extend(chunk)            # Add to message
    remaining -= len(chunk)
Notice how the bytearray version is really clean. You don't collect parts in a list and you don't perform that cryptic join at the end. Nice.
Of course, the big question is whether or not it performs. To test this out, I first made a list of small byte fragments like this:
chunks = [b"x"*16]*512
I then used the timeit module to compare the following two code fragments:
# Version 1
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)

# Version 2
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)
When tested, version 1 of the code ran in 99.8s whereas version 2 ran in 116.6s (a version using += concatenation takes 230.3s by comparison). So while performing a join operation is still faster, it's only faster by about 16%. Personally, I think the cleaner programming of the bytearray version might make up for it.
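For reference, a sketch of how such a comparison might be set up with timeit (my own example; the absolute numbers will differ from the ones quoted, which depend on the machine, Python version, and iteration count):

import timeit

setup = 'chunks = [b"x" * 16] * 512'

join_version = """
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)
"""

bytearray_version = """
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)
"""

print(timeit.timeit(join_version, setup, number=10000))
print(timeit.timeit(bytearray_version, setup, number=10000))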
Example 2: Binary record packing
This example is a slight twist on the last one. Suppose you had a large Python list of integer (x,y) coordinates. Something like this:
points = [(1,2),(3,4),(9,10),(23,14),(50,90),...]
Now, suppose you need to write that data out as a binary encoded file consisting of a 32-bit integer length followed by each point packed into a pair of 32-bit integers. One way to do it would be to use the struct module like this:
import struct
f = open("points.bin","wb")
f.write(struct.pack("I",len(points)))
for x,y in points:
    f.write(struct.pack("II",x,y))
f.close()
The only problem with this code is that it performs a large number of small write() operations. An alternative approach is to pack everything into a bytearray and only perform one write at the end. For example:
import struct
f = open("points.bin","wb")
msg = bytearray()
msg.extend(struct.pack("I",len(points)))
for x,y in points:
    msg.extend(struct.pack("II",x,y))
f.write(msg)
f.close()
Sure enough, the version that uses bytearray runs much faster. In a simple timing test involving a list of 100000 points, it runs in about half the time of the version that makes a lot of small writes.
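A related variant (my own sketch, not from the original presentation) preallocates the bytearray and fills it with struct.pack_into, since bytearray exposes a writable buffer:

import struct

points = [(1, 2), (3, 4), (9, 10), (23, 14), (50, 90)]

head = struct.calcsize("I")    # bytes for the length prefix
rec = struct.calcsize("II")    # bytes per (x, y) pair
msg = bytearray(head + rec * len(points))

struct.pack_into("I", msg, 0, len(points))
for i, (x, y) in enumerate(points):
    struct.pack_into("II", msg, head + rec * i, x, y)

with open("points.bin", "wb") as f:
    f.write(msg)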
Example 3: Mathematical processing of byte values
The fact that bytearrays present themselves as arrays of integers makes it easier to perform certain kinds of calculations. In a recent embedded systems project, I was using Python to communicate with a device over a serial port. As part of the communications protocol, all messages had to be signed with a Longitudinal Redundancy Check (LRC) byte. An LRC is computed by taking an XOR across all of the byte values.
Bytearrays make such calculations easy. Here's one version:
message = bytearray(...) # Message already created
lrc = 0
for b in message:
    lrc ^= b
message.append(lrc)        # Add to the end of the message
Here's a version that increases your job security:
message.append(functools.reduce(lambda x,y: x^y, message))  # requires import functools
And here's the same calculation in Python 2 without bytearrays:
message = "..." # Message already created
lrc = 0
for b in message:
    lrc ^= ord(b)
message += chr(lrc)        # Add the LRC byte
Personally, I like the bytearray version. There's no need to use ord() and you can just append the result at the end of the message instead of using concatenation.
Here's another cute example. Suppose you wanted to run a bytearray through a simple XOR-cipher. Here's a one-liner to do it:
>>> key = 37
>>> message = bytearray(b"Hello World")
>>> s = bytearray(x ^ key for x in message)
>>> s
bytearray(b'm@IIJ\x05rJWIA')
>>> bytearray(x ^ key for x in s)
bytearray(b"Hello World")
>>>
Here is a link to the presentation
A bytearray is very similar to a regular Python string (str in Python 2.x, bytes in Python 3), but with an important difference: whereas strings are immutable, bytearrays are mutable, a bit like a list of single-character strings.
This is useful because some applications use byte sequences in ways that perform poorly with immutable strings. When you are making lots of little changes in the middle of large chunks of memory, as in a database engine or an image library, strings perform quite poorly, since you have to make a copy of the whole (possibly large) string. bytearrays make it possible to make that kind of change without copying the memory first.
But this particular case is more the exception than the rule. Most uses involve comparing strings or string formatting. For the latter there's usually a copy anyway, so a mutable type would offer no advantage; for the former, since immutable strings cannot change, you can calculate a hash of the string and compare that as a shortcut to comparing each byte in order, which is almost always a big win. That's why the immutable type (str or bytes) is the default, and bytearray is the exception for when you need its special features.
If you look at the documentation for bytearray, it says:
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
In contrast, the documentation for bytes says:
Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behaviors.
As you can see, the primary distinction is mutability. str and bytes methods that "change" the value actually return a new object with the desired modification, whereas bytearray methods that change the sequence really do change it in place.
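A small sketch of that difference in behaviour:

b = b"hello"
print(b.upper())      # b'HELLO' -- a new object; b itself is unchanged
print(b)              # b'hello'

ba = bytearray(b"hello")
ba[0] = ord("H")      # mutate in place
ba.extend(b"!")
print(ba)             # bytearray(b'Hello!')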
You would prefer bytearray if you are editing a large object (e.g. an image's pixel buffer) through its binary representation and you want the modifications to be done in place for efficiency.
Wikipedia provides an example of XOR cipher using Python's bytearrays (docstrings reduced):
#!/usr/bin/python2.7
from os import urandom

def vernam_genkey(length):
    """Generating a key"""
    return bytearray(urandom(length))

def vernam_encrypt(plaintext, key):
    """Encrypting the message."""
    return bytearray([ord(plaintext[i]) ^ key[i] for i in xrange(len(plaintext))])

def vernam_decrypt(ciphertext, key):
    """Decrypting the message"""
    return bytearray([ciphertext[i] ^ key[i] for i in xrange(len(ciphertext))])

def main():
    myMessage = """This is a topsecret message..."""
    print 'message:', myMessage
    key = vernam_genkey(len(myMessage))
    print 'key:', str(key)
    cipherText = vernam_encrypt(myMessage, key)
    print 'cipherText:', str(cipherText)
    print 'decrypted:', vernam_decrypt(cipherText, key)
    if vernam_decrypt(vernam_encrypt(myMessage, key), key) == myMessage:
        print('Unit Test Passed')
    else:
        print('Unit Test Failed - Check Your Python Distribution')

if __name__ == '__main__':
    main()
I want to know whether, when I do something like
a = "This could be a very large string..."
b = a[:10]
a new string is created or a view/iterator is returned
Python does slice-by-copy, meaning every time you slice (except for very trivial slices, such as a[:]), it copies all of the data into a new string object.
According to one of the developers, this choice was made because
The [slice-by-reference] approach is more complicated, harder to implement
and may lead to unexpected behavior.
For example:
a = "a long string with 500,000 chars ..."
b = a[0]
del a
With the slice-as-copy design the string a is immediately freed. The
slice-as-reference design would keep the 500kB string in memory although
you are only interested in the first character.
Apparently, if you absolutely need a view into a byte string, you can use a memoryview object (memoryview works on bytes-like objects such as bytes and bytearray, not on Python 3 text strings).
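A minimal sketch with a byte string:

data = b"This could be a very large byte string..."
view = memoryview(data)
first10 = view[:10]      # a zero-copy view into data
print(bytes(first10))    # b'This could' -- copies only when you ask for bytes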
When you slice a string, a new string instance is returned. Strings are immutable objects.
There is a buffer type in Python, but how can I use it?
In the Python documentation about buffer(), the description is:
buffer(object[, offset[, size]])
The object argument must be an object that supports the buffer call interface (such as strings, arrays, and buffers). A new buffer object will be created which references the object argument. The buffer object will be a slice from the beginning of object (or from the specified offset). The slice will extend to the end of object (or will have a length given by the size argument).
An example usage:
>>> s = 'Hello world'
>>> t = buffer(s, 6, 5)
>>> t
<read-only buffer for 0x10064a4b0, size 5, offset 6 at 0x100634ab0>
>>> print t
world
The buffer in this case is a sub-string, starting at position 6 with length 5, and it doesn't take extra storage space - it references a slice of the string.
This isn't very useful for short strings like this, but it can be necessary when using large amounts of data. This example uses a mutable bytearray:
>>> s = bytearray(1000000) # a million zeroed bytes
>>> t = buffer(s, 1) # slice cuts off the first byte
>>> s[1] = 5 # set the second element in s
>>> t[0] # which is now also the first element in t!
'\x05'
This can be very helpful if you want to have more than one view on the data and don't want to (or can't) hold multiple copies in memory.
Note that buffer has been replaced by the better named memoryview in Python 3, though you can use either in Python 2.7.
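The bytearray example above can be written with memoryview in a way that runs on both Python 2.7 and 3 (a sketch; only the repr of the element differs between versions):

s = bytearray(1000000)   # a million zeroed bytes
t = memoryview(s)[1:]    # view that skips the first byte, no copy made
s[1] = 5                 # set the second element of s
print(t[0])              # 5 in Python 3 (an int); '\x05' in Python 2.7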
Note also that you can't implement a buffer interface for your own objects without delving into the C API, i.e. you can't do it in pure Python.
I think buffers are useful, e.g., when interfacing Python with native libraries (Guido van Rossum explains buffer in this mailing list post).
For example, NumPy seems to use buffer for efficient data storage:
import numpy
a = numpy.ndarray(1000000)
Then a.data is:
<read-write buffer for 0x1d7b410, size 8000000, offset 0 at 0x1e353b0>
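The sharing works in the other direction as well; for instance, numpy.frombuffer builds an array over an existing buffer without copying (a small sketch of my own, not from the original post):

import numpy

raw = bytearray(8)                              # 8 zeroed bytes
a = numpy.frombuffer(raw, dtype=numpy.uint8)    # array view over the same memory
raw[0] = 255                                    # mutate the underlying bytearray
print(a[0])                                     # 255 -- the array sees the change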