more efficient way to pickle a string - python

The pickle module seems to use string escape characters when pickling; this becomes inefficient e.g. on numpy arrays. Consider the following
z = numpy.zeros(1000, numpy.uint8)
The lengths of z.dumps() and cPickle.dumps(z.dumps()) are 1133 characters and 4249 characters respectively.
z.dumps() reveals something like "\x00\x00" (actual zeros in string), but pickle seems to be using the string's repr() function, yielding "'\x00\x00'" (zeros being ascii zeros).
i.e. ("0" not in z.dumps()) but ("0" in cPickle.dumps(z.dumps()))

Try using a later version of the pickle protocol with the protocol parameter to pickle.dumps(). The default is 0, which is an ASCII text format. Use one of the binary protocols instead (I suggest pickle.HIGHEST_PROTOCOL). Protocols 1 and 2 (and 3, but that's for py3k) are binary and should be more space-efficient.
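A minimal sketch of the size difference, assuming Python 3 (where cPickle has been folded into pickle). The 1000-byte payload of zeros is a stand-in chosen for illustration, not the numpy array from the question:

```python
import pickle

# Hypothetical payload for illustration: 1000 zero bytes
payload = bytes(1000)

# Protocol 0 is the ASCII text format; HIGHEST_PROTOCOL is a binary format
text_pickle = pickle.dumps(payload, protocol=0)
binary_pickle = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Both round-trip to the same object; only the encoded size differs
assert pickle.loads(text_pickle) == payload
assert pickle.loads(binary_pickle) == payload

print(len(text_pickle), len(binary_pickle))
```

The exact sizes depend on the Python version, so none are quoted here.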

import zlib, cPickle
def zdumps(obj):
    return zlib.compress(cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL), 9)
def zloads(zstr):
    return cPickle.loads(zlib.decompress(zstr))
>>> len(zdumps(z))

z.dumps() is already a pickled string, i.e., it can be unpickled using pickle.loads():
>>> z = numpy.zeros(1000, numpy.uint8)
>>> s = z.dumps()
>>> a = pickle.loads(s)
>>> all(a == z)

An improvement to vartec's answer, that seems a bit more memory efficient (since it doesn't force everything into a string):
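import cPickle, gzip
def pickle(fname, obj):
    # Stream the pickle straight into a gzip file instead of building one big string
    with gzip.open(fname, "wb", compresslevel=3) as f:
        cPickle.dump(obj, f, protocol=2)
def unpickle(fname):
    with gzip.open(fname, "rb") as f:
        return cPickle.load(f)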
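For Python 3, the same idea might look like the sketch below. The function names pickle_to_file and unpickle_from_file, the sample data and the temporary file path are all hypothetical, chosen only for illustration:

```python
import gzip
import os
import pickle
import tempfile

def pickle_to_file(fname, obj):
    # Write the pickle stream directly into a gzip-compressed file
    with gzip.open(fname, "wb", compresslevel=3) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def unpickle_from_file(fname):
    # Read and decompress the pickle stream from the file
    with gzip.open(fname, "rb") as f:
        return pickle.load(f)

# Round-trip a sample object through a temporary file
data = {"values": list(range(1000))}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.gz")
    pickle_to_file(path, data)
    restored = unpickle_from_file(path)
```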


Divide an extremely long byte stream into smaller bytes

So I need to unpack an extremely long byte stream (from USB) into 4-byte values.
I currently have it working, but I feel there's a better way to do this.
Currently I have:
for i in range(int(len(mybytes)/4)):
This feels very resource-expensive, and I'm doing this for 16k bytes A LOT.
I also feel like this has probably been asked before; I just don't really know how to word it for searching.
You could also try the array module which has the ability to load directly from binary data:
import array
arr = array.array("I",mybytes) # "I" stands for unsigned integer
arr.byteswap() # only if you're reading endian coding different from your platform
l = list(arr)
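One caveat with the array approach is that the width of the "I" type code is platform-dependent. A hedged sketch that checks the item size and swaps bytes only when needed follows. The helper name unpack_u32_big_endian is hypothetical, and it assumes the incoming data is big-endian:

```python
import array
import sys

def unpack_u32_big_endian(data):
    """Decode big-endian unsigned 32-bit integers from a bytes object."""
    arr = array.array("I")
    # "I" is usually 4 bytes wide, but the C standard only guarantees a minimum
    if arr.itemsize != 4:
        raise RuntimeError("array type code 'I' is not 4 bytes on this platform")
    arr.frombytes(data)
    # array reads values in native byte order; swap if the platform is little-endian
    if sys.byteorder == "little":
        arr.byteswap()
    return arr.tolist()

values = unpack_u32_big_endian(bytes([1, 2, 3, 4, 5, 6, 7, 8]))
print(values)
```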
You can specify a repeat count for the integers to unpack:
>>> import struct
>>> mybytes = bytes([1,2,3,4,5,6,7,8])
>>> struct.unpack('>2i',mybytes)
(16909060, 84281096)
>>> n = len(mybytes) // 4
>>> struct.unpack(f'>{n}i',mybytes) # Python 3.6+ f-strings
(16909060, 84281096)
>>> struct.unpack('>{}i'.format(n),mybytes) # Older Pythons
(16909060, 84281096)
>>> [hex(i) for i in _]
['0x1020304', '0x5060708']
Wrap it in a BytesIO object, then use iter to call its read method until it returns an empty bytes value.
>>> import io, struct
>>> bio = io.BytesIO(b'abcdefgh')
>>> int_fmt = struct.Struct(">i")
>>> list(map(int_fmt.unpack, iter(lambda: bio.read(4), b'')))
[(1633837924,), (1701209960,)]
You can tweak this to extract the single int value from each tuple, or switch to the from_bytes class method.
>>> bio = io.BytesIO(b'abcdefgh')
>>> list(map(lambda i: int.from_bytes(i, 'big'), iter(lambda: bio.read(4), b'')))
[1633837924, 1701209960]
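As another way to get plain ints rather than 1-tuples, struct.iter_unpack (available since Python 3.4) walks a buffer in fixed-size steps. This is only a sketch; the helper name unpack_be_int32 is hypothetical:

```python
import struct

def unpack_be_int32(data):
    """Return the big-endian signed 32-bit integers packed in data."""
    # iter_unpack requires len(data) to be a multiple of the 4-byte item size
    return [value for (value,) in struct.iter_unpack(">i", data)]

values = unpack_be_int32(b"abcdefgh")
print(values)
```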

How to generate a random UUID which is reproducible (with a seed) in Python

The uuid4() function of Python's module uuid generates a random UUID, and seems to generate a different one every time:
In [1]: import uuid
In [2]: uuid.uuid4()
Out[2]: UUID('f6c9ad6c-eea0-4049-a7c5-56253bc3e9c0')
In [3]: uuid.uuid4()
Out[3]: UUID('2fc1b6f9-9052-4564-9be0-777e790af58f')
I would like to be able to generate the same random UUID every time I run a script - that is, I'd like to seed the random generator in uuid4(). Is there a way to do this? (Or achieve this by some other means)?
What I've tried so far
I've tried to generate a UUID using the uuid.UUID() method with a random 128-bit integer (from a seeded instance of random.Random()) as input:
import uuid
import random
rd = random.Random()
However, UUID() seems not to accept this as input:
Traceback (most recent call last):
File "", line 6, in <module>
File "/usr/lib/python2.7/", line 133, in __init__
hex = hex.replace('urn:', '').replace('uuid:', '')
AttributeError: 'long' object has no attribute 'replace'
Any other suggestions?
Almost there: pass the integer as the int keyword argument, i.e. uuid.UUID(int=rd.getrandbits(128)).
This was determined with the help of help:
>>> help(uuid.UUID.__init__)
Help on method __init__ in module uuid:
__init__(self, hex=None, bytes=None, bytes_le=None, fields=None, int=None, version=None) unbound uuid.UUID method
Create a UUID from either a string of 32 hexadecimal digits,
a string of 16 bytes as the 'bytes' argument, a string of 16 bytes
in little-endian order as the 'bytes_le' argument, a tuple of six
integers (32-bit time_low, 16-bit time_mid, 16-bit time_hi_version,
8-bit clock_seq_hi_variant, 8-bit clock_seq_low, 48-bit node) as
the 'fields' argument, or a single 128-bit integer as the 'int'
argument. When a string of hex digits is given, curly braces,
hyphens, and a URN prefix are all optional. For example, these
expressions all yield the same UUID:
UUID(bytes_le='\x78\x56\x34\x12\x34\x12\x78\x56' +
'\x12\x34\x56\x78\x12\x34\x56\x78')
UUID(fields=(0x12345678, 0x1234, 0x5678, 0x12, 0x34, 0x567812345678))
Exactly one of 'hex', 'bytes', 'bytes_le', 'fields', or 'int' must
be given. The 'version' argument is optional; if given, the resulting
UUID will have its variant and version set according to RFC 4122,
overriding the given 'hex', 'bytes', 'bytes_le', 'fields', or 'int'.
Faker makes this easy
>>> from faker import Faker
>>> f1 = Faker()
>>> f1.seed(4321)
>>> print(f1.uuid4())
>>> print(f1.uuid4())
>>> f1.seed(4321)
>>> print(f1.uuid4())
This is based on a solution used here:
import hashlib
import uuid
m = hashlib.md5()
new_uuid = uuid.UUID(m.hexdigest())
Since the straight-forward solution hasn't been posted yet to generate consistent version 4 UUIDs:
import random
import uuid
rnd = random.Random()
rnd.seed(123) # NOTE: Of course don't use a static seed in production
random_uuid = uuid.UUID(int=rnd.getrandbits(128), version=4)
where you can see then:
>>> random_uuid.version
This doesn't just "mock" the version information. It creates a proper UUIDv4:
The version argument is optional; if given, the resulting UUID will have its variant and version number set according to RFC 4122, overriding bits in the given hex, bytes, bytes_le, fields, or int.
Python 3.8 docs
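A small sketch of how this could be wrapped for repeated use. The factory name make_uuid4_factory and the seed value are hypothetical; each factory owns its own random.Random instance, so two factories created with the same seed yield the same sequence:

```python
import random
import uuid

def make_uuid4_factory(seed):
    """Return a function producing a reproducible sequence of version 4 UUIDs."""
    rnd = random.Random(seed)  # private generator, independent of the global one

    def next_uuid4():
        # version=4 overrides the version and variant bits as RFC 4122 requires
        return uuid.UUID(int=rnd.getrandbits(128), version=4)

    return next_uuid4

first = make_uuid4_factory(123)
second = make_uuid4_factory(123)
a, b = first(), first()
c, d = second(), second()
```

As noted above, a fixed seed is only appropriate for tests and other cases where predictability is the goal.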
Gonna add this here if anyone needs to monkey patch in a seeded UUID. My code uses uuid.uuid4() but for testing I wanted consistent UUIDs. The following code is how I did that:
import uuid
import random
# -------------------------------------------
# Remove this block to generate different
# UUIDs every time you run this code.
# This block should be right below the uuid
# import.
rd = random.Random()
rd.seed(0)  # any fixed seed works
uuid.uuid4 = lambda: uuid.UUID(int=rd.getrandbits(128), version=4)
# -------------------------------------------
# Then normal code:
Based on alex's solution, the following would provide a proper UUID4:
a = "%032x" % random.getrandbits(128)  # zero-pad so there are always 32 hex digits
rd = a[:12] + '4' + a[13:16] + 'a' + a[17:]
uuid4 = uuid.UUID(rd)
Simple solution based on the answer of @user10229295, with a comment about the seed.
The Edit queue was full, so I opened a new answer:
import hashlib
import uuid
seed = 'Type your seed_string here' #Read comment below
m = hashlib.md5()
m.update(seed.encode('utf-8'))
new_uuid = uuid.UUID(m.hexdigest())
Comment about the string 'seed':
The seed string determines the UUID: the same seed string always produces the same UUID. You can convert a meaningful integer to a string, or concatenate several strings, and use the result as your seed. This gives you control over the generated UUID: knowing the seed you used, you can always reproduce it.
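A hedged sketch of the hash-based approach as a reusable function. The name uuid_from_seed and the sample seed strings are hypothetical. The standard library's uuid.uuid3() is shown alongside it as a related option: it is also MD5-based, but takes a namespace and sets the version bits:

```python
import hashlib
import uuid

def uuid_from_seed(seed):
    """Derive a UUID deterministically from a seed string via its MD5 digest."""
    digest = hashlib.md5(seed.encode("utf-8")).hexdigest()  # 32 hex digits
    return uuid.UUID(digest)

u1 = uuid_from_seed("customer-42")
u2 = uuid_from_seed("customer-42")
u3 = uuid_from_seed("customer-43")

# Standard-library alternative: name-based UUID within a namespace
named = uuid.uuid3(uuid.NAMESPACE_DNS, "customer-42")
```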
if your goal is a reproducible UUID, here's one concise approach
import uuid
seeded_uuid = uuid.UUID(bytes=b"z123456789101112") # 7a313233-3435-3637-3839-313031313132
How does this work internally?
Using byte strings allows almost anything to act as a seed. You could also use a deterministic hash that takes some data and gives you a 16-byte digest (32 hex digits) representing that data. There's a lot more sophistication underneath the uuid call, but at its core here's how a seeded uuid call works:
initial_seed = b"z123456789101112"
# use this function to validate the initial seed
is_valid = lambda x: isinstance(x, bytes) and len(x) == 16
# for each byte, format its integer value as two hex digits and concatenate
hex_rep = "".join([f"{b:02x}" for b in initial_seed]) # 7a313233343536373839313031313132
# for a uuid, storing an int representation unlocks O(1) comparison
int_rep = int(hex_rep, base=16) # 162421256209101963464626711665304482098
# string representation for readability
str_rep = f"{hex_rep[:8]}-{hex_rep[8:12]}-{hex_rep[12:16]}-{hex_rep[16:20]}-{hex_rep[20:]}" # 7a313233-3435-3637-3839-313031313132

Why does "bytes(n)" create a length n byte string instead of converting n to a binary representation?

I was trying to build this bytes object in Python 3:
b'3\r\n'
so I tried the obvious (for me), and found a weird behaviour:
>>> bytes(3) + b'\r\n'
b'\x00\x00\x00\r\n'
>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
I've been unable to see any pointers on why the bytes conversion works this way reading the documentation. However, I did find some surprise messages in this Python issue about adding format to bytes (see also Python 3 bytes formatting):
This interacts even more poorly with oddities like bytes(int) returning zeroes now
It would be much more convenient for me if bytes(int) returned the ASCIIfication of that int; but honestly, even an error would be better than this behavior. (If I wanted this behavior - which I never have - I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)
Can someone explain to me where this behaviour comes from?
From python 3.2 you can use to_bytes:
>>> (1024).to_bytes(2, byteorder='big')
def int_to_bytes(x: int) -> bytes:
    return x.to_bytes((x.bit_length() + 7) // 8, 'big')
def int_from_bytes(xbytes: bytes) -> int:
    return int.from_bytes(xbytes, 'big')
Accordingly, x == int_from_bytes(int_to_bytes(x)).
Note that the above encoding works only for unsigned (non-negative) integers.
For signed integers, the bit length is a bit more tricky to calculate:
def int_to_bytes(number: int) -> bytes:
    return number.to_bytes(length=(8 + (number + (number < 0)).bit_length()) // 8, byteorder='big', signed=True)
def int_from_bytes(binary_data: bytes) -> int:
    return int.from_bytes(binary_data, byteorder='big', signed=True)
That's the way it was designed - and it makes sense because usually, you would call bytes on an iterable instead of a single integer:
>>> bytes([3])
The docs state this, as well as the docstring for bytes:
>>> help(bytes)
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
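To make the distinction concrete, here is a short sketch contrasting the constructor forms discussed in this thread. The variable names are only illustrative:

```python
n = 3

zero_filled = bytes(n)                 # an int argument means "this many null bytes"
single_byte = bytes([n])               # an iterable of ints gives one byte per element
ascii_digits = str(n).encode("ascii")  # the decimal digits of n as ASCII text
line = ascii_digits + b"\r\n"

print(zero_filled, single_byte, line)
```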
You can use the struct's pack:
In [11]: struct.pack(">I", 1)
Out[11]: '\x00\x00\x00\x01'
The ">" is the byte-order (big-endian) and the "I" is the format character. So you can be specific if you want to do something else:
In [12]: struct.pack("<H", 1)
Out[12]: '\x01\x00'
In [13]: struct.pack("B", 1)
Out[13]: '\x01'
This works the same on both python 2 and python 3.
Note: the inverse operation (bytes to int) can be done with unpack.
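A brief sketch of the round trip with struct.unpack, which always returns a tuple even for a single value. The value 258 is an arbitrary example chosen so the byte order is visible:

```python
import struct

packed = struct.pack(">I", 1)            # 4 bytes, big-endian unsigned int
(value,) = struct.unpack(">I", packed)   # unpack returns a 1-tuple here

little = struct.pack("<H", 258)          # 258 == 0x0102, little-endian unsigned short
(roundtrip,) = struct.unpack("<H", little)
```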
Python 3.5+ introduces %-interpolation (printf-style formatting) for bytes:
>>> b'%d\r\n' % 3
See PEP 0461 -- Adding % formatting to bytes and bytearray.
On earlier versions, you could use str and .encode('ascii') the result:
>>> s = '%d\r\n' % 3
>>> s.encode('ascii')
Note: It is different from what int.to_bytes produces:
>>> n = 3
>>> n.to_bytes((n.bit_length() + 7) // 8, 'big') or b'\0'
>>> b'3' == b'\x33' != b'\x03'
The documentation says:
bytes(int) -> bytes object of size given by the parameter
initialized with null bytes
The sequence b'3\r\n' consists of the character '3' (decimal 51), the character '\r' (13) and '\n' (10).
Therefore, you can build it in any of these ways, for example:
>>> bytes([51, 13, 10])
>>> bytes('3', 'utf8') + b'\r\n'
>>> n = 3
>>> bytes(str(n), 'ascii') + b'\r\n'
Tested on IPython 1.1.0 & Python 3.2.3
The ASCIIfication of 3 is "\x33" not "\x03"!
That is what python does for str(3) but it would be totally wrong for bytes, as they should be considered arrays of binary data and not be abused as strings.
The easiest way to achieve what you want is bytes((3,)), which avoids building a list as bytes([3]) does, although the difference is small. You can convert bigger integers with int.to_bytes, for example (3).to_bytes(1, "little"), choosing a length large enough to hold the value.
Initializing bytes with a given length makes sense and is the most useful, as they are often used to create some type of buffer for which you need some memory of given size allocated. I often use this when initializing arrays or expanding some file by writing zeros to it.
I was curious about performance of various methods for a single int in the range [0, 255], so I decided to do some timing tests.
Based on the timings below, and from the general trend I observed from trying many different values and configurations, struct.pack seems to be the fastest, followed by int.to_bytes, bytes, and with str.encode (unsurprisingly) being the slowest. Note that the results show some more variation than is represented, and int.to_bytes and bytes sometimes switched speed ranking during testing, but struct.pack is clearly the fastest.
Results in CPython 3.7 on Windows:
Testing with 63:
bytes_: 100000 loops, best of 5: 3.3 usec per loop
to_bytes: 100000 loops, best of 5: 2.72 usec per loop
struct_pack: 100000 loops, best of 5: 2.32 usec per loop
chr_encode: 50000 loops, best of 5: 3.66 usec per loop
Test module (named int_to_byte.py):
"""Functions for converting a single int to a bytes object with that int's value."""
import random
import shlex
import struct
import timeit
def bytes_(i):
    """From Tim Pietzcker's answer."""
    return bytes([i])
def to_bytes(i):
    """From brunsgaard's answer."""
    return i.to_bytes(1, byteorder='big')
def struct_pack(i):
    """From Andy Hayden's answer."""
    return struct.pack('B', i)
# Originally, jfs's answer was considered for testing,
# but the result is not identical to the other methods
def chr_encode(i):
    """Another method, from Quuxplusone's answer; similar to g10guang's answer."""
    return chr(i).encode('latin1')
converters = [bytes_, to_bytes, struct_pack, chr_encode]
def one_byte_equality_test():
    """Test that results are identical for ints in the range [0, 255]."""
    for i in range(256):
        results = [c(i) for c in converters]
        # Test that all results are equal
        start = results[0]
        if any(start != b for b in results):
            raise ValueError(results)
def timing_tests(value=None):
    """Test each of the functions with a random int."""
    if value is None:
        # random.randint takes more time than int to byte conversion
        # so it can't be a part of the timeit call
        value = random.randint(0, 255)
    print(f'Testing with {value}:')
    for c in converters:
        print(f'{c.__name__}: ', end='')
        # Run the statement through timeit's command-line interface
        timeit.main(args=shlex.split(
            f"-s 'from int_to_byte import {c.__name__}; value = {value}' " +
            f"'{c.__name__}(value)'"
        ))
The behaviour comes from the fact that in Python prior to version 3 bytes was just an alias for str. In Python3.x bytes is an immutable version of bytearray - completely new type, not backwards compatible.
From bytes docs:
Accordingly, constructor arguments are interpreted as for bytearray().
Then, from bytearray docs:
The optional source parameter can be used to initialize the array in a few different ways:
If it is an integer, the array will have that size and will be initialized with null bytes.
Note, that differs from 2.x (where x >= 6) behavior, where bytes is simply str:
>>> bytes is str
PEP 3112:
The 2.6 str differs from 3.0’s bytes type in various ways; most notably, the constructor is completely different.
int (including Python2's long) can be converted to bytes using the following function:
import codecs
def int2bytes(i):
    hex_value = '{0:x}'.format(i)
    # make length of hex_value a multiple of two
    hex_value = '0' * (len(hex_value) % 2) + hex_value
    return codecs.decode(hex_value, 'hex_codec')
The reverse conversion can be done by another one:
import codecs
import six # should be installed via 'pip install six'
long = six.integer_types[-1]
def bytes2int(b):
    return long(codecs.encode(b, 'hex_codec'), 16)
Both functions work on both Python2 and Python3.
Although the prior answer by brunsgaard is an efficient encoding, it works only for unsigned integers. This one builds upon it to work for both signed and unsigned integers.
def int_to_bytes(i: int, *, signed: bool = False) -> bytes:
    length = ((i + ((i * signed) < 0)).bit_length() + 7 + signed) // 8
    return i.to_bytes(length, byteorder='big', signed=signed)
def bytes_to_int(b: bytes, *, signed: bool = False) -> int:
    return int.from_bytes(b, byteorder='big', signed=signed)
# Test unsigned:
for i in range(1025):
    assert i == bytes_to_int(int_to_bytes(i))
# Test signed:
for i in range(-1024, 1025):
    assert i == bytes_to_int(int_to_bytes(i, signed=True), signed=True)
For the encoder, (i + ((i * signed) < 0)).bit_length() is used instead of just i.bit_length() because the latter leads to an inefficient encoding of -128, -32768, etc.
Credit: CervEd for fixing a minor inefficiency.
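To illustrate the point about -128, here is a small sketch comparing the two length calculations for the signed case. The helper names naive_length and adjusted_length are hypothetical and exist only for this comparison:

```python
def naive_length(i):
    # i.bit_length() plus one sign bit, rounded up to whole bytes
    return (i.bit_length() + 8) // 8

def adjusted_length(i):
    # Shift negative values by one first, so that -128, -32768, ... fit
    # in the smallest possible two's-complement width
    return ((i + (i < 0)).bit_length() + 8) // 8

for n in (-128, -32768):
    print(n, naive_length(n), adjusted_length(n))

encoded = (-128).to_bytes(adjusted_length(-128), "big", signed=True)
```

Both lengths are valid for to_bytes; the adjusted one simply avoids a redundant leading byte at the negative power-of-two boundaries.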
As you want to deal with binary representation, the best is to use ctypes.
import ctypes
x = ctypes.c_int(1234)
You must use the specific integer representation (signed/unsigned and the number of bits: c_uint8, c_int8, c_uint16,...).
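A hedged sketch of what the ctypes route can look like. It relies on ctypes objects exposing their raw memory through the buffer protocol, and the bytes come out in the platform's native byte order:

```python
import ctypes
import sys

x = ctypes.c_uint16(1234)

# bytes() copies the object's underlying memory: exactly 2 bytes for c_uint16,
# laid out in native byte order
raw = bytes(x)

# Decode with the same (native) byte order to recover the value
value = int.from_bytes(raw, sys.byteorder)
```

If a fixed byte order is needed regardless of platform, int.to_bytes or struct.pack with an explicit "<" or ">" prefix may be the simpler choice.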
Some answers don't work with large numbers.
Convert integer to the hex representation, then convert it to bytes:
def int_to_bytes(number):
    hrepr = hex(number).replace('0x', '')
    if len(hrepr) % 2 == 1:
        hrepr = '0' + hrepr
    return bytes.fromhex(hrepr)
>>> int_to_bytes(2**256 - 1)
I think you can convert the int to str first, before you convert to byte.
That should produce the format you want.
bytes(str(your_number),'UTF-8') + b'\r\n'
It works for me in py3.8.
If the question is how to convert an integer itself (not its string equivalent) into bytes, I think the robust answer is:
>>> i = 5
>>> i.to_bytes(2, 'big')
>>> int.from_bytes(i.to_bytes(2, 'big'), byteorder='big')
More information on these methods here:
>>> chr(116).encode()

Iterate over individual bytes in Python 3

When iterating over a bytes object in Python 3, one gets the individual bytes as ints:
>>> [b for b in b'123']
[49, 50, 51]
How to get 1-length bytes objects instead?
The following is possible, but not very obvious for the reader, and most likely performs badly:
>>> [bytes([b]) for b in b'123']
[b'1', b'2', b'3']
If you are concerned about the performance of this code and an int is not a suitable interface for a byte in your case, then you should probably reconsider the data structures that you use, e.g., use str objects instead.
You could slice the bytes object to get 1-length bytes objects:
L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
There is PEP 0467 -- Minor API improvements for binary sequences, which proposes a bytes.iterbytes() method. This is a proposal only; if adopted, it would allow:
>>> list(b'123'.iterbytes())
[b'1', b'2', b'3']
int objects have a to_bytes method which can be used to convert an int to its corresponding byte:
>>> import sys
>>> [i.to_bytes(1, sys.byteorder) for i in b'123']
[b'1', b'2', b'3']
As with some of the other answers, it's not clear that this is more readable than the OP's original solution: the length and byteorder arguments make it noisier, I think.
Another approach would be to use struct.unpack, though this might also be considered difficult to read, unless you are familiar with the struct module:
>>> import struct
>>> struct.unpack('3c', b'123')
(b'1', b'2', b'3')
(As jfs observes in the comments, the format string for struct.unpack can be constructed dynamically; in this case we know the number of individual bytes in the result must equal the number of bytes in the original bytestring, so struct.unpack(str(len(bytestring)) + 'c', bytestring) is possible.)
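A minimal sketch of that dynamic format string wrapped in a helper. The name split_bytes is hypothetical:

```python
import struct

def split_bytes(bytestring):
    """Split a bytes object into a list of 1-length bytes objects."""
    # "<n>c" means n single chars; each comes back as a bytes object of length 1
    fmt = f"{len(bytestring)}c"
    return list(struct.unpack(fmt, bytestring))

parts = split_bytes(b"123")
print(parts)
```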
>>> import random, timeit
>>> bs = bytes(random.randint(0, 255) for i in range(100))
>>> # OP's solution
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[bytes([b]) for b in bs]")
>>> # Accepted answer from jfs
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[bs[i:i+1] for i in range(len(bs))]")
>>> # Leon's answer
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="list(map(bytes, zip(bs)))")
>>> # guettli's answer
>>> timeit.timeit(setup="from __main__ import iter_bytes, bs",
...               stmt="list(iter_bytes(bs))")
>>> # user38's answer (with Leon's suggested fix)
>>> timeit.timeit(setup="from __main__ import bs",
...               stmt="[chr(i).encode('latin-1') for i in bs]")
>>> # Using int.to_bytes
>>> timeit.timeit(setup="from __main__ import bs;from sys import byteorder",
...               stmt="[x.to_bytes(1, byteorder) for x in bs]")
>>> # Using struct.unpack, converting the resulting tuple to list
>>> # to be fair to other methods
>>> timeit.timeit(setup="from __main__ import bs;from struct import unpack",
...               stmt="list(unpack('100c', bs))")
struct.unpack seems to be at least an order of magnitude faster than other methods, presumably because it operates at the byte level. int.to_bytes, on the other hand, performs worse than most of the "obvious" approaches.
I thought it might be useful to compare the runtimes of the different approaches so I made a benchmark (using my library simple_benchmark):
Probably unsurprisingly, the NumPy solution is by far the fastest solution for large bytes objects.
But if a resulting list is desired, then both the NumPy solution (with tolist()) and the struct solution are much faster than the other alternatives.
I didn't include guettli's answer because it's almost identical to jfs's solution; it just uses a generator function instead of a comprehension.
import numpy as np
import struct
import sys
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
@b.add_function()
def jfs(bytes_obj):
    return [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
@b.add_function()
def snakecharmerb_tobytes(bytes_obj):
    return [i.to_bytes(1, sys.byteorder) for i in bytes_obj]
@b.add_function()
def snakecharmerb_struct(bytes_obj):
    return struct.unpack(str(len(bytes_obj)) + 'c', bytes_obj)
@b.add_function()
def Leon(bytes_obj):
    return list(map(bytes, zip(bytes_obj)))
@b.add_function()
def rusu_ro1_format(bytes_obj):
    return [b'%c' % i for i in bytes_obj]
@b.add_function()
def rusu_ro1_numpy(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1')
@b.add_function()
def rusu_ro1_numpy_tolist(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1').tolist()
@b.add_function()
def User38(bytes_obj):
    return [chr(i).encode() for i in bytes_obj]
@b.add_arguments('byte object length')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, b'a' * size
r = b.run()
Since Python 3.5 you can use % formatting with bytes and bytearray:
[b'%c' % i for i in b'123']
[b'1', b'2', b'3']
The above solution is 2-3 times faster than your initial approach. If you want an even faster solution, I suggest using numpy.frombuffer:
import numpy as np
np.frombuffer(b'123', dtype='S1')
array([b'1', b'2', b'3'], dtype='|S1')
The second solution is ~10% faster than struct.unpack (I used the same performance test as @snakecharmerb, against 100 random bytes).
A trio of map(), bytes() and zip() does the trick:
>>> list(map(bytes, zip(b'123')))
[b'1', b'2', b'3']
However I don't think that it is any more readable than [bytes([b]) for b in b'123'] or performs better.
I use this helper method:
def iter_bytes(my_bytes):
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]
Works for Python2 and Python3.
A short way to do this:
[bytes([i]) for i in b'123\xaa\xbb\xcc\xff']

Python 3.x: Using string.maketrans() in order to create a unicode-character transformation

I would like to write the following code:
import string
frm = b'acdefhnoprstuw'
to = 'אקדיפהנופרסתאו'
trans_table = string.maketrans(frm, to)
hebrew_phrase = 'fear cuts deeper than swords'.translate(trans_table)
The above code doesn't work because the to parameter of string.maketrans(frm, to) has to be a byte sequence, not a string. The problem is that bytes literals can only contain ASCII characters, so I cannot build a table that translates English strings to Hebrew strings. The underlying reason is that string.maketrans() returns a bytes object.
Is there an elegant way to use the string.maketrans() and translate() functions (or equivalent functions that work with unicode) for my task?
You need to use str.maketrans(), which takes two str as arguments.
>>> frm = 'acdefhnoprstuw'
>>> to = 'אקדיפהנופרסתאו'
>>> trans_table = str.maketrans(frm, to)
>>> hebrew_phrase = 'fear cuts deeper than swords'.translate(trans_table)
>>> hebrew_phrase
'פיאר קאתס דייפיר תהאנ סוורדס'
string.maketrans still existed in Python 3.1, but only because moving it to bytes.maketrans() was missed in 3.0. It was already deprecated in 3.1 and is gone in 3.2.
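A short sketch contrasting the two table builders. str.maketrans produces a dict keyed by code point, which is why it can map to any Unicode character, while bytes.maketrans produces a 256-byte table usable only with bytes. The byte-level example (b"abc" to b"xyz") is made up for illustration:

```python
frm = "acdefhnoprstuw"
to = "אקדיפהנופרסתאו"

# str.maketrans returns a dict mapping code points to code points
table = str.maketrans(frm, to)
alef = table[ord("a")]  # code point that 'a' is translated to

# bytes.maketrans builds a 256-byte lookup table for bytes.translate
byte_table = bytes.maketrans(b"abc", b"xyz")
translated = b"aabbcc".translate(byte_table)
```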
