Concatenate multiple zlib compressed data streams into a single stream efficiently - python

If I have several binary strings with compressed zlib data, is there a way to efficiently combine them into a single compressed string without decompressing everything?
Example of what I have to do now:
c1 = zlib.compress("The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress("We ride at dawn! ")
c = zlib.compress(zlib.decompress(c1)+zlib.decompress(c2)) # Warning: Inefficient!
d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)
assert d1+d2 == d # This will pass!
Example of what I want:
c1 = zlib.compress("The quick brown fox jumped over the lazy dog. ")
c2 = zlib.compress("We ride at dawn! ")
c = magic_zlib_add(c1+c2) # Magical method of combining compressed streams
d1 = zlib.decompress(c1)
d2 = zlib.decompress(c2)
d = zlib.decompress(c)
assert d1+d2 == d # This should pass!
I don't know too much about zlib and the DEFLATE algorithm, so this may be entirely impossible from a theoretical point of view. Also, I must use zlib, so I can't wrap zlib and come up with my own protocol that transparently handles concatenated streams.
NOTE: I don't really mind if the solution is not trivial in Python. I'm willing to write some C code and use ctypes in Python.

Since you don't mind venturing into C, you can start by looking at the code for gzjoin.
Note, the gzjoin code has to decompress to find the parts that have to change when merged, but it doesn't have to recompress. That's not too bad because decompression is typically faster than compression.

In addition to gzjoin which requires decompression of the first deflate stream, you can take a look at gzlog.h and gzlog.c, which efficiently appends short strings to a gzip file without having to decompress the deflate stream each time. (It can be easily modified to operate on zlib-wrapped deflate data instead of gzip-wrapped deflate data.) You would use this approach if you are in control of the creation of the first deflate stream. If you are not creating the first deflate stream, then you would have to use the approach of gzjoin which requires decompression.
None of the approaches require recompression.
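
For orientation (my own illustration, not part of gzjoin or gzlog): a zlib stream is a 2-byte header, the raw deflate data, and a 4-byte big-endian Adler-32 checksum of the uncompressed data, which is exactly the structure the joining approaches manipulate:

import zlib

data = b"The quick brown fox jumped over the lazy dog. "
stream = zlib.compress(data)
header, deflate_body, trailer = stream[:2], stream[2:-4], stream[-4:]

# the trailer is the big-endian Adler-32 checksum of the uncompressed data
assert int.from_bytes(trailer, "big") == zlib.adler32(data)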

I'm just turning @zorlak's comment into an answer and adding some code so I can find it later.
For the first stream, call deflate() with Z_SYNC_FLUSH. For each subsequent stream, call deflate() with Z_SYNC_FLUSH, strip the first two bytes, and concatenate, also collecting the adler32 value and the uncompressed length. Then for the last stream, call deflate() with Z_FINISH, strip off the 4-byte checksum, and replace it with the adler32_combine() of all the checksums.
If you can control the initial compression of your streams, you can store the length of the uncompressed data, its Adler-32 checksum, and the compressed data somewhere. Later you can then concatenate the individual streams in an arbitrary order.
Note that I am not sure if the individual streams can have different compression levels, compression strategies, or window sizes since the concatenate function strips the zlib header of all but the first stream...
from typing import Tuple
import zlib

def prepare(data: bytes) -> Tuple[int, bytes, int]:
    deflate = zlib.compressobj()
    result = deflate.compress(data)
    result += deflate.flush(zlib.Z_SYNC_FLUSH)
    return len(data), result, zlib.adler32(data)

def concatenate(*chunks: Tuple[int, bytes, int]) -> bytes:
    if not chunks:
        return b''
    _, result, final_checksum = chunks[0]
    for length, chunk, checksum in chunks[1:]:
        result += chunk[2:]  # strip the zlib header
        final_checksum = adler32_combine(final_checksum, checksum, length)
    result += b'\x03\x00'  # insert a final empty block
    result += final_checksum.to_bytes(4, byteorder='big')
    return result

def adler32_combine(adler1: int, adler2: int, length2: int) -> int:
    # Python implementation of adler32_combine
    # The original C implementation is Copyright (C) 1995-2011, 2016 Mark Adler
    # see https://github.com/madler/zlib/blob/master/adler32.c#L143
    BASE = 65521
    WORD = 0xffff
    DWORD = 0xffffffff

    if adler1 < 0 or adler1 > DWORD:
        raise ValueError('adler1 must be between 0 and 2^32')
    if adler2 < 0 or adler2 > DWORD:
        raise ValueError('adler2 must be between 0 and 2^32')
    if length2 < 0:
        raise ValueError('length2 must not be negative')

    remainder = length2 % BASE
    sum1 = adler1 & WORD
    sum2 = (remainder * sum1) % BASE
    sum1 += (adler2 & WORD) + BASE - 1
    sum2 += ((adler1 >> 16) & WORD) + ((adler2 >> 16) & WORD) + BASE - remainder

    if sum1 >= BASE:
        sum1 -= BASE
    if sum1 >= BASE:
        sum1 -= BASE
    if sum2 >= (BASE << 1):
        sum2 -= (BASE << 1)
    if sum2 >= BASE:
        sum2 -= BASE

    return sum1 | (sum2 << 16)
A quick example:
hello = prepare(b'Hello World! ')
test = prepare(b'This is a test. ')
fox = prepare(b'The quick brown fox jumped over the lazy dog. ')
dawn = prepare(b'We ride at dawn! ')
# these all print what you would expect
print(zlib.decompress(concatenate(hello, test, fox, dawn)))
print(zlib.decompress(concatenate(dawn, fox, test, hello)))
print(zlib.decompress(concatenate(fox, hello, dawn, test)))
print(zlib.decompress(concatenate(test, dawn, hello, fox)))
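
Regarding the note above about mixing compression levels: a quick, hedged check (prepare_level is my own variant of prepare, not part of the original answer) suggests that streams compressed at different levels concatenate fine as long as they share the default window size:

def prepare_level(data: bytes, level: int) -> Tuple[int, bytes, int]:
    # same as prepare(), but with an explicit compression level
    deflate = zlib.compressobj(level)
    result = deflate.compress(data) + deflate.flush(zlib.Z_SYNC_FLUSH)
    return len(data), result, zlib.adler32(data)

fast = prepare_level(b'Compressed with level 1. ', 1)
best = prepare_level(b'Compressed with level 9. ', 9)
print(zlib.decompress(concatenate(fast, best)))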

Related

Python string comparison doesn't short circuit?

The usual advice is that string comparison should be done in constant time when checking things like passwords or hashes, and thus it is recommended to avoid a == b.
However, I ran the following script and the results don't support the hypothesis that a == b short-circuits on the first non-identical character.
from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differ from the original string
    # by one character, at a different position
    # only do that for the first 50 chars, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Here is the result of a run, re-running the script gives slightly different results, but nothing satisfactory.
# ran with cpython 3.8.3
6 78.051700
1 78.203200
15 78.222700
14 78.384800
11 78.396300
12 78.441800
9 78.476900
13 78.519000
8 78.586200
3 78.631500
---
0 80.691100
1 78.203200
I would've expected that the fastest comparison would be where the first differing character is at the beginning of the string, but it's not what I get.
Any idea what's going on?
There's a difference, you just don't see it on such small strings. Here's a small patch to apply to your code: it uses longer strings and does 10 checks, placing the A at evenly spaced positions in the original string, from the beginning to the end, like this:
A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
and you'll get:
0 122.621000
1 213.465700
2 380.214100
3 460.422000
5 694.278700
4 722.010000
7 894.630300
6 1020.722100
9 1149.473000
8 1341.754500
---
0 122.621000
1 213.465700
Note that with your example, with only 2**8 characters, it's already noticeable; apply this patch:
@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
to only keep the two extreme cases (first letter change vs last letter change) and you'll get:
$ python3 cmp.py
0 124.131800
1 135.566000
Numbers may vary, but most of the time test 0 is a tad faster than test 1.
Isolating more precisely which character is modified is possible as long as memcmp compares character by character, i.e. as long as it does not use integer comparisons; that typically happens on the last characters if they get misaligned, or on really short strings, like the 8-character string I demo here:
from time import perf_counter_ns
from statistics import median
import random

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    diffs = [s[:i] + "A" + s[i + 1 :] for i in range(n)]
    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)
    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Which gives me:
1 221.000000
2 222.000000
3 223.000000
4 223.000000
5 223.000000
6 223.000000
7 223.000000
0 241.000000
The differences are so small that Python and perf_counter_ns may no longer be the right tools here.
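As a side note, timing many comparisons per measurement reduces the per-call timer overhead; here is a small sketch using timeit (not part of the original benchmark):

import timeit

s = "a" * 10_000
first_diff = "A" + s[1:]   # differs at the first character
last_diff = s[:-1] + "A"   # differs at the last character

print(timeit.timeit("s == first_diff", globals=globals(), number=100_000))
print(timeit.timeit("s == last_diff", globals=globals(), number=100_000))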
See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, that it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely, it must be faster!". Not quite.
Let's take a look at cpython, for obvious reasons. Look at the code for unicode_compare_eq function defined in unicodeobject.c
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;

    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;

    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
(Note: this function is actually called after deducing that str1 and str2 are not the same object; if they are, the result is simply True immediately.)
Focus on this line specifically:
cmp = memcmp(data1, data2, len * kind);
Ahh, we're back at another crossroads. Does memcmp short circuit? The C standard does not specify such a requirement, as seen in the opengroup docs and also in Section 7.24.4.1 of the C Standard Draft:
7.24.4.1 The memcmp function
Synopsis
#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
Description
The memcmp function compares the first n characters of the object pointed to by s1 to
the first n characters of the object pointed to by s2.
Returns
The memcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
Some C implementations choose not to short circuit. But why? Are we missing something? Why would you not short circuit?
Because the comparison they use might not be as naive as a byte-by-byte check. The standard does not require the objects to be compared byte by byte. Therein lies the chance for optimization.
What glibc does is compare elements of type unsigned long int instead of single bytes represented by unsigned char. Check out the implementation.
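As a rough Python sketch of that idea (my own illustration, not glibc's actual code): compare fixed-size words instead of single bytes, so far fewer comparison operations are issued for long equal prefixes:

def word_equal(a: bytes, b: bytes, word: int = 8) -> bool:
    # compare 8-byte words instead of individual bytes
    if len(a) != len(b):
        return False
    for i in range(0, len(a), word):
        if int.from_bytes(a[i:i + word], 'big') != int.from_bytes(b[i:i + word], 'big'):
            return False
    return True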
There's a lot more going on under the hood - a discussion far outside the scope of this question; after all, this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come to mind at first glance.
Edit: Fixed wrong function link
Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.

Why doesn't my hash function output a dynamic value?

I'm a newbie in this field and am trying to learn a bit about how to write cryptographic hash functions.
To get some hands-on experience, I tried updating the PySHA2 algorithm for Python 3.6 and up (the original version doesn't work on Python 2.5+ and the author says he won't fix this). I don't intend to use this algorithm for any real work; I'm just coding this for the sake of knowledge.
I've reached this far:
import copy
import struct
_initial_hashes = [0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179]
_round_constants = [0x428a2f98d728ae22, 0x7137449123ef65cd, 0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc,
0x3956c25bf348b538, 0x59f111f1b605d019, 0x923f82a4af194f9b, 0xab1c5ed5da6d8118,
0xd807aa98a3030242, 0x12835b0145706fbe, 0x243185be4ee4b28c, 0x550c7dc3d5ffb4e2,
0x72be5d74f27b896f, 0x80deb1fe3b1696b1, 0x9bdc06a725c71235, 0xc19bf174cf692694,
0xe49b69c19ef14ad2, 0xefbe4786384f25e3, 0x0fc19dc68b8cd5b5, 0x240ca1cc77ac9c65,
0x2de92c6f592b0275, 0x4a7484aa6ea6e483, 0x5cb0a9dcbd41fbd4, 0x76f988da831153b5,
0x983e5152ee66dfab, 0xa831c66d2db43210, 0xb00327c898fb213f, 0xbf597fc7beef0ee4,
0xc6e00bf33da88fc2, 0xd5a79147930aa725, 0x06ca6351e003826f, 0x142929670a0e6e70,
0x27b70a8546d22ffc, 0x2e1b21385c26c926, 0x4d2c6dfc5ac42aed, 0x53380d139d95b3df,
0x650a73548baf63de, 0x766a0abb3c77b2a8, 0x81c2c92e47edaee6, 0x92722c851482353b,
0xa2bfe8a14cf10364, 0xa81a664bbc423001, 0xc24b8b70d0f89791, 0xc76c51a30654be30,
0xd192e819d6ef5218, 0xd69906245565a910, 0xf40e35855771202a, 0x106aa07032bbd1b8,
0x19a4c116b8d2d0c8, 0x1e376c085141ab53, 0x2748774cdf8eeb99, 0x34b0bcb5e19b48a8,
0x391c0cb3c5c95a63, 0x4ed8aa4ae3418acb, 0x5b9cca4f7763e373, 0x682e6ff3d6b2b8a3,
0x748f82ee5defb2fc, 0x78a5636f43172f60, 0x84c87814a1f0ab72, 0x8cc702081a6439ec,
0x90befffa23631e28, 0xa4506cebde82bde9, 0xbef9a3f7b2c67915, 0xc67178f2e372532b,
0xca273eceea26619c, 0xd186b8c721c0c207, 0xeada7dd6cde0eb1e, 0xf57d4f7fee6ed178,
0x06f067aa72176fba, 0x0a637dc5a2c898a6, 0x113f9804bef90dae, 0x1b710b35131c471b,
0x28db77f523047d84, 0x32caab7b40c72493, 0x3c9ebe0a15c9bebc, 0x431d67c49c100d4c,
0x4cc5d4becb3e42b6, 0x597f299cfc657e2a, 0x5fcb6fab3ad6faec, 0x6c44198c4a475817]
def _rit_rot(on: int, by: int) -> int:
    """
    helper function for right rotation as it isn't done by a simple bitwise operation (xor is done by '^')
    :param on: value to be rotated
    :param by: value by which to rotate
    :return: right rotated 'on'
    """
    return ((on >> by) | (on << (64 - by))) & 0xFFFFFFFFFFFFFFFF

def hash_main(chunk):
    global _initial_hashes, _round_constants
    # start the hashing process
    # to begin, create a place to store the 80 words that we'll make
    words = [0] * 80
    # first 16 words will be saved without any changes
    words[:16] = struct.unpack('!16Q', chunk)
    # extend these 16 words into the remaining 64 words of 'message schedule array'
    for i in range(16, 80):
        part_1 = _rit_rot(words[i - 15], 1) ^ _rit_rot(words[i - 15], 8) ^ (words[i - 15] >> 7)
        part_2 = _rit_rot(words[i - 2], 19) ^ _rit_rot(words[i - 2], 61) ^ (words[i - 2] >> 6)
        words[i] = (words[i - 16] + part_1 + words[i - 7] + part_2) & 0xFFFFFFFFFFFFFFFF
    # create the working variables
    a, b, c, d, e, f, g, h = _initial_hashes
    # start the compression function
    for z in range(80):
        var_1 = _rit_rot(a, 28) ^ _rit_rot(a, 34) ^ _rit_rot(a, 39)
        var_2 = _rit_rot(e, 14) ^ _rit_rot(e, 18) ^ _rit_rot(e, 41)
        var_3 = (a & b) ^ (a & c) ^ (b & c)
        var_4 = (e & f) ^ ((~e) & g)
        temp_1 = var_1 + var_3
        temp_2 = h + var_2 + var_4 + _round_constants[z] + words[z]
        # remix the hashes
        h = g
        g = f
        f = e
        e = (d + temp_2) & 0xFFFFFFFFFFFFFFFF
        d = c
        c = b
        b = a
        a = (temp_1 + temp_2) & 0xFFFFFFFFFFFFFFFF
    # add this chunk to initial hashes
    _initial_hashes = [(x + y) & 0xFFFFFFFFFFFFFFFF for x, y in zip(_initial_hashes,
                                                                    [a, b, c, d, e, f, g, h])]

def _sha_backend_update(text_copy, _buffer, _counter):
    """
    backend function that hashes given string
    """
    global _initial_hashes, _round_constants
    # create variables for cycling
    _buffer += text_copy
    _counter += len(text_copy)
    # assert the variables are correct
    if not text_copy:
        return
    if type(text_copy) is not str:
        raise TypeError("Invalid Object! Please enter a valid string for hashing!")
    # break the buffer into 128-byte chunks
    while len(_buffer) >= 128:
        chunk = _buffer[:128].encode()[1:]
        hash_main(chunk)
        _buffer = _buffer[128:]

def sha_backend_digest(text_to_hash: str, _buffer: str, _counter: int,
                       _output_size: int, hex_output: bool = False):
    # initialize variables
    variable_x = _counter & 0x7F
    length = str(struct.pack('!Q', _counter << 3))
    # set the thresholds
    if variable_x < 112:
        padding_len = 111 - variable_x
    else:
        padding_len = 239 - variable_x
    # make a copy of the text_to_hash before starting hashing
    text_copy = copy.deepcopy(text_to_hash)
    m = '\x80' + ('\x00' * (padding_len + 8)) + length
    # run the update function
    _sha_backend_update(text_copy, _buffer, _counter)
    # return the hash value
    return_val = [hex(stuff) for stuff in _initial_hashes[:_output_size]]
    if hex_output is True:
        return_val = [int(stuff, base=16) for stuff in return_val]
        return return_val
    return ''.join(return_val)

def sha_512(text_to_hash: str, hex_digest: bool = False) -> str:
    """
    frontend function for SHA512 hashing
    :return: hashed string
    """
    # before anything, check if the input is correct
    if not text_to_hash:
        return ""
    if type(text_to_hash) is not str:
        raise TypeError("Invalid content! Please provide content in correct format for hashing!")
    # initialize default variables
    _buffer = ''
    _counter = 0
    _output_size = 8
    # start the backend function
    return sha_backend_digest(text_to_hash, _buffer, _counter, _output_size, hex_output=hex_digest)
message = "This is a string to be hashed"
from hashlib import sha512
print("hashlib gives: ", sha512(message.encode()).hexdigest())
print("I give: ", sha_512(message))
As is obvious, I don't understand a lot of things in this algorithm and have literally copied many parts from the original code (also, I know it isn't good practice to write everything in a single function but I find it easier when trying to understand something).
But the biggest problem I have right now is it doesn't work! Whatever input message I provide to my function, it gives the same output:
0x6a09e667f3bcc9080xbb67ae8584caa73b0x3c6ef372fe94f82b0xa54ff53a5f1d36f1
0x510e527fade682d10x9b05688c2b3e6c1f0x1f83d9abfb41bd6b0x5be0cd19137e2179
I wrote a code at the bottom to compare it with python's hashlib module.
Where am I going wrong in this and how do I fix this?
EDIT: As mentioned in the comments, I tried to feed in a longer message string and the code seems to be working (it still gives longer output than hashlib though):
message = "This is a string to be hashed. I'll try to make this string as long as possible by adding" \
"as much information to it as I can, in the hopes that this string would somehow become longer than" \
"128 bits and my code can run properly. Hopefully, this is already longer than 128 bits, so lets see" \
"how it works..."
hash: 0x6fcc0f346f2577800x334bd9b6c1178a970x90964a3f45f7b5bb0xc14033d12f6607e60xb598bea0a8b0ac1e0x116b0e134691ab540x73d88e77e5b862ba0x89181da7462c5574
message = "This is a string to be hashed. I'll try to make this string as long as possible by adding" \
"as much information to it as I can, in the hopes that this string would somehow become longer than"
hash: 0x166e40ab03bc98750xe81fe34168b6994f0xe56b81bd5972b5560x8789265c3a56b30b0x2c810d652ea7b1550xa23ca2704602a8240x12ffb1ec8f3dd6d10x88c29f84cbef8988
You'll always have to pad the message. Padding and adding the length are always required as the last step of the SHA-2 process. Currently you weren't performing that last step (to completion).
Here are my last two comments that pointed you in the right direction:
So generally you take one 128-byte block from the binary message, update the hash state using the information in that block, then move to the next one until you have a partial or 0-byte block. That last block you need to pad, add the size indication (in bits) to, and process. If you don't have enough space for the padding / size indication, then you need yet another block consisting entirely of padding and the size indication. If you read carefully: you always process at least one block.
and
Hmm, it is already in sha_backend_digest (the 0x80 followed by zero bytes and the length, which is the input size * 8, i.e. _counter << 3).
But of course you do need to perform that and not skip any step.
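For reference, here is a minimal sketch (my own illustration, not the asker's code) of what the completed padding step should produce for SHA-512: append 0x80, then zero bytes until the length is 112 mod 128, then the message length in bits as a 16-byte big-endian value (the upper 8 bytes are zero here because the length fits in 64 bits):

import struct

def sha512_pad(message: bytes) -> bytes:
    padding_len = (111 - len(message)) % 128   # zero bytes between 0x80 and the length field
    bit_length = struct.pack('!Q', len(message) * 8)
    return message + b'\x80' + b'\x00' * (padding_len + 8) + bit_length

padded = sha512_pad(b"This is a string to be hashed")
assert len(padded) % 128 == 0   # always a whole number of 128-byte blocks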

Python: How do I convert file to custom base number and back?

I have a file that I want to convert into a custom base (base 86 for example, with a custom alphabet).
I have tried to convert the file with hexlify and then into my custom base, but it's too slow... 8 seconds for 60 KB.
def HexToBase(Hexa, AlphabetList, OccurList, threshold=10):
    number = int(Hexa, 16)  # base 16 to base 10
    alphabet = GetAlphabet(AlphabetList, OccurList, threshold)
    # GetAlphabet returns a list of all chars that occur more than threshold times
    b_nbr = len(alphabet)  # get the base
    out = ''
    while number > 0:
        out = alphabet[(number % b_nbr)] + out
        number = number // b_nbr
    return out

file = open("File.jpg", "rb")
binary_data = file.read()

HexToBase(binascii.hexlify(binary_data), ['a','b'], [23,54])
So, could anyone help me find the right solution?
Sorry for my poor English, I'm French. Thanks for your help!
First you can replace:
int(binascii.hexlify(binary_data), 16) # timeit: 14.349809918712538
By:
int.from_bytes(binary_data, byteorder='little') # timeit: 3.3330371951720164
Second you can use the divmod function to speed up the loop:
out = ""
while number > 0:
number, m = divmod(number, b_nbr)
out = alphabet[m] + out
# timeit: 3.8345545611298126 vs 7.472579440019706
For divmod vs %, // comparison and large numbers, see Is divmod() faster than using the % and // operators?.
(Remark: I expected that building an array and then making a string with "".join would be faster than out = ... + out, but that was not the case with CPython 3.6.)
Everything put together gave me a speed-up factor of 6.
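Putting both suggestions together, a minimal sketch (bytes_to_base and the base-64 alphabet are my own illustration; the asker's GetAlphabet is not reproduced here):

import string

def bytes_to_base(binary_data: bytes, alphabet: str) -> str:
    # read the whole file content as one big integer, then emit digits with divmod
    number = int.from_bytes(binary_data, byteorder='little')
    base = len(alphabet)
    out = ''
    while number > 0:
        number, m = divmod(number, base)
        out = alphabet[m] + out
    return out or alphabet[0]

print(bytes_to_base(b'some file contents', string.ascii_letters + string.digits + '+/'))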

Ascii string of bytes packed into bitmap/bitstring back to string?

I have a string that is packed such that each character was originally an unsigned byte but is stored as 7 bits and then packed into an unsigned byte array. I'm trying to find a quick way to unpack this string in Python. The function I wrote using the bitstring module works, but it is very slow. It seems like something like this should not be so slow, but I'm probably doing it very inefficiently...
This seems like something that is probably trivial but I just don't know what to use, maybe there is already a function that will unpack the string?
from bitstring import BitArray

def unpackString(raw):
    msg = ''
    bits = BitArray(bytes=raw)
    mask = BitArray('0b01111111')
    i = 0
    while 1:
        try:
            iByte = (bits[i:i + 8] & mask).int
            # value of 0 denotes a line break
            if iByte == 0:
                msg += '\n'
            elif iByte >= 32 and iByte <= 126:
                msg += chr(iByte)
            i += 7
        except:
            break
    return msg
This took me a while to figure out, as your solution seems to ignore the first bit of data. Given the input byte of 129 (0b10000001) I would expect to see 64 '1000000' printed by the following, but your code produces 1 '0000001' -- ignoring the first bit.
bs = b'\x81' # one byte string, whose value is 129 (0x81)
arr = BitArray(bs)
mask = BitArray('0b01111111')
byte = (arr[0:8] & mask).int
print(byte, repr("{:07b}".format(byte)))
The simplest solution would be to modify your code to use bitstring.ConstBitStream -- I got an order of magnitude speed increase with the following.
from bitstring import ConstBitStream

def unpack_bitstream(raw):
    num_bytes, remainder = divmod(len(raw) * 8 - 1, 7)
    bitstream = ConstBitStream(bytes=raw, offset=1)  # use offset to ignore leading bit
    msg = b''
    for _ in range(num_bytes):
        byte = bitstream.read("uint:7")
        if not byte:
            msg += b'\n'
        elif 32 <= byte <= 126:
            msg += bytes((byte,))
            # msg += chr(byte) # python 2
    return msg
However, this can be done quite easily using only the standard library. This makes the solution more portable and, in the instances I tried, faster by another order of magnitude (I didn't try the cythonised version of bitstring).
def unpack_bytes(raw, zero_replacement=ord("\n")):
    # use - 1 to ignore leading bit
    num_bytes, remainder = divmod(len(raw) * 8 - 1, 7)
    i = int.from_bytes(raw, byteorder="big")
    # i = int(raw.encode("hex"), 16) # python 2
    if remainder:
        # remainder means there are unused trailing bits, so remove these
        i >>= remainder
    msg = []
    for _ in range(num_bytes):
        byte = i & 127
        if not byte:
            msg.append(zero_replacement)
        elif 32 <= byte <= 126:
            msg.append(byte)
        i >>= 7
    msg.reverse()
    return bytes(msg)
    # return b"".join(chr(c) for c in msg) # python 2
I've used python 3 to create these methods. If you're using python 2 then there are a number of adjustments you'll need to make. I've added these as comments after the line they are intended to replace and marked them python 2.
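For a quick round-trip test of unpack_bytes, here is a hypothetical pack_bytes of my own that follows the assumed layout (one unused leading bit, then 7 bits per character, zero-padded at the end to a byte boundary):

def pack_bytes(msg: bytes) -> bytes:
    # one unused leading bit, then 7 bits per character, zero-padded to a byte boundary
    i = 0
    for byte in msg:
        i = (i << 7) | (byte & 0x7F)
    total_bits = 1 + 7 * len(msg)
    padded_bits = (total_bits + 7) // 8 * 8
    i <<= padded_bits - total_bits
    return i.to_bytes(padded_bits // 8, byteorder="big")

packed = pack_bytes(b"Hi there!")
assert unpack_bytes(packed) == b"Hi there!"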

Is there a faster way to convert an arbitrary large integer to a big endian sequence of bytes?

I have this Python code to do this:
from struct import pack as _pack

def packl(lnum, pad = 1):
    if lnum < 0:
        raise RangeError("Cannot use packl to convert a negative integer "
                         "to a string.")
    count = 0
    l = []
    while lnum > 0:
        l.append(lnum & 0xffffffffffffffffL)
        count += 1
        lnum >>= 64
    if count <= 0:
        return '\0' * pad
    elif pad >= 8:
        lens = 8 * count % pad
        pad = ((lens != 0) and (pad - lens)) or 0
        l.append('>' + 'x' * pad + 'Q' * count)
        l.reverse()
        return _pack(*l)
    else:
        l.append('>' + 'Q' * count)
        l.reverse()
        s = _pack(*l).lstrip('\0')
        lens = len(s)
        if (lens % pad) != 0:
            return '\0' * (pad - lens % pad) + s
        else:
            return s
This takes approximately 174 usec to convert 2**9700 - 1 to a string of bytes on my machine. If I'm willing to use the Python 2.7 and Python 3.x specific bit_length method, I can shorten that to 159 usecs by pre-allocating the l array to be the exact right size at the very beginning and using l[something] = syntax instead of l.append.
Is there anything I can do that will make this faster? This will be used to convert large prime numbers used in cryptography as well as some (but not many) smaller numbers.
Edit
This is currently the fastest option in Python < 3.2; it takes about half the time in either direction compared to the accepted answer:
import binascii

def packl(lnum, padmultiple=1):
    """Packs the lnum (which must be convertable to a long) into a
    byte string 0 padded to a multiple of padmultiple bytes in size. 0
    means no padding whatsoever, so that packing 0 result in an empty
    string. The resulting byte string is the big-endian two's
    complement representation of the passed in long."""
    if lnum == 0:
        return b'\0' * padmultiple
    elif lnum < 0:
        raise ValueError("Can only convert non-negative numbers.")
    s = hex(lnum)[2:]
    s = s.rstrip('L')
    if len(s) & 1:
        s = '0' + s
    s = binascii.unhexlify(s)
    if (padmultiple != 1) and (padmultiple != 0):
        filled_so_far = len(s) % padmultiple
        if filled_so_far != 0:
            s = b'\0' * (padmultiple - filled_so_far) + s
    return s

def unpackl(bytestr):
    """Treats a byte string as a sequence of base 256 digits
    representing an unsigned integer in big-endian format and converts
    that representation into a Python integer."""
    return int(binascii.hexlify(bytestr), 16) if len(bytestr) > 0 else 0
In Python 3.2 the int class has to_bytes and from_bytes functions that can accomplish this much more quickly than the method given above.
Here is a solution calling the Python/C API via ctypes. Currently, it uses NumPy, but if NumPy is not an option, it could be done purely with ctypes.
import numpy
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               numpy.ctypeslib.ndpointer(numpy.uint8),
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes_numpy(lnum):
    a = numpy.zeros(lnum.bit_length()//8 + 1, dtype=numpy.uint8)
    PyLong_AsByteArray(lnum, a, a.size, 0, 1)
    return a
On my machine, this is 15 times faster than your approach.
Edit: Here is the same code using ctypes only and returning a string instead of a NumPy array:
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               ctypes.c_char_p,
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes(lnum):
    a = ctypes.create_string_buffer(lnum.bit_length()//8 + 1)
    PyLong_AsByteArray(lnum, a, len(a), 0, 1)
    return a.raw
This is another two times faster, totalling to a speed-up factor of 30 on my machine.
For completeness and for future readers of this question:
Starting in Python 3.2, there are functions int.from_bytes() and int.to_bytes() that perform the conversion between bytes and int objects in a choice of byte orders.
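For example (a quick illustration of that API, not taken from the answers above):

n = 2**9700 - 1
data = n.to_bytes((n.bit_length() + 7) // 8, byteorder='big')  # big-endian byte sequence
assert int.from_bytes(data, byteorder='big') == n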
I suppose you really should just be using numpy, which I'm sure has something or other built in for this. It might also be faster to hack around with the array module. But I'll take a stab at it anyway.
IMX, creating a generator and using a list comprehension and/or built-in summation is faster than a loop that appends to a list, because the appending can be done internally. Oh, and 'lstrip' on a large string has got to be costly.
Also, some style points: special cases aren't special enough; and you appear not to have gotten the memo about the new x if y else z construct. :) Although we don't need it anyway. ;)
from struct import pack as _pack

Q_size = 64
Q_bitmask = (1L << Q_size) - 1L

def quads_gen(a_long):
    while a_long:
        yield a_long & Q_bitmask
        a_long >>= Q_size

def pack_long_big_endian(a_long, pad = 1):
    if a_long < 0:
        raise RangeError("Cannot use packl to convert a negative integer "
                         "to a string.")
    # most significant quad first
    qs = list(quads_gen(a_long))[::-1]
    # Pack the first one separately so we can lstrip nicely.
    first = _pack('>Q', qs[0]).lstrip('\x00')
    rest = _pack('>%sQ' % (len(qs) - 1), *qs[1:])
    count = len(first) + len(rest)
    # A little math trick that depends on Python's behaviour of modulus
    # for negative numbers - but it's well-defined and documented
    return '\x00' * (-count % pad) + first + rest
Just wanted to post a follow-up to Sven's answer (which works great). The opposite operation - going from arbitrarily long bytes object to Python Integer object requires the following (because there is no PyLong_FromByteArray() C API function that I can find):
import binascii

def unpack_bytes(stringbytes):
    # binascii.hexlify will be obsolete in python3 soon
    # They will add a .tohex() method to bytes class
    # Issue 3532 bugs.python.org
    return int(binascii.hexlify(stringbytes), 16)
