Processing a byte string in chunks


I have a really long byte string such as this (the actual value may be random):
in_var = b'\x01\x02\x03\x04\x05\x06...\xff'
I also have a function that performs an operation on a chunk of bytes and returns the same number of bytes (let's say 10 bytes for this example):
def foo(chunk):
    # do smth with chunk
    # ......
    return chunk
I want to process in_var with foo() for all chunks of 10 bytes (sending the last chunk as is if less than 10 bytes remain at the end) and create a new variable out_var with the outputs.
The way I'm currently doing it is taking way too long:
out_var = b''
for chunk in range(0, len(in_var), 10):
    out_var += foo(in_var[chunk: chunk + 10])
The function foo() only takes a fraction of a second per run, so processing all the 10-byte chunks should be very fast. However, the total run time is an order of magnitude longer than I expect.
I also tried this with similar results:
import numpy as np
import math
in_var= np.array_split(np.frombuffer(in_var, dtype=np.uint8), math.ceil(len(in_var)/10))
out_var= b"".join(map(lambda x: foo(x), in_var))
foo() can only process 10 bytes for this example (ex: it's an encryption function with a fixed block size) and if a smaller chunk is given to it, it just pads to make the chunk 10 bytes. Let's say I have no control over it, and foo() can only process in chunks of 10 bytes.
Is there a much faster way to do this? As a last resort, I may have to parallelize my code so all chunks get processed in parallel...
Thank you!
UPDATE:
Apparently I had not correctly measured the time foo() takes. It turns out, foo() is taking the majority of the time, hence the order of magnitude comment above. Thank you again for your comments and suggestions, I did make some improvements nevertheless. Parallelizing the code seems to be the correct path forward.

The problem with your for loop is that it creates a new string, slightly longer each time, and copies the old data to the new. You can speed it up by pre-allocating the bytes and just copying over them directly:
out_var = bytearray(len(in_var))
for chunk in range(0, len(in_var), 10):
    out_var[chunk: chunk + 10] = foo(in_var[chunk: chunk + 10])
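A minimal sketch of the same idea as a reusable function, assuming the per-chunk function returns exactly as many bytes as it receives. The names process_in_chunks and xor_ff are hypothetical and used only for illustration; xor_ff stands in for the real foo().

```python
def process_in_chunks(data, func, chunk_size=10):
    """Apply func to consecutive chunk_size-byte slices of data.

    Assumes func returns as many bytes as it is given.
    """
    out = bytearray(len(data))  # pre-allocate the output once
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]  # the last chunk may be shorter
        out[start:start + len(chunk)] = func(chunk)
    return bytes(out)


def xor_ff(chunk):
    """Hypothetical stand-in for foo(): flip every bit of the chunk."""
    return bytes(b ^ 0xFF for b in chunk)


in_var = bytes(range(25))  # 25 bytes: two full chunks and one short chunk
out_var = process_in_chunks(in_var, xor_ff)
```

An alternative that also avoids repeated concatenation is b"".join(foo(in_var[i:i + 10]) for i in range(0, len(in_var), 10)).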

Related

Is there a way to estimate the size of file to be written based on the pandas dataframe that holds the data?

I am extracting data from a table in a database and writing it to a CSV file in a windows file directory using pandas and python.
I want to partition the data and split into multiple files if the file size exceeds a certain amount of memory.
So for an example that threshold is 32 MB, if my CSV data file is going to be less than 32 MB, I will write the data in a single CSV file.
But if the file size may exceed 32 MB, say 50 MB, I would split the data and write to two files one of 32 MB and other of (50-32)=18 MB.
The only thing I found is how to find the memory a dataframe accommodates using memory_usage method or python's getsizeof function. But I am not able to relate that memory with actual size of the data file. The in-process memory is generally 5-10 times greater than the file size.
Appreciate any suggestions.

Do some checks in your code. Write a portion of the DataFrame as CSV to an io.StringIO() object and examine the length of that object; use the percentage it is over or under your goal to redefine the DataFrame slice; repeat. When you are satisfied, write to disk, then use that slice size to write the rest.
Something like...
from io import StringIO
g = StringIO()
n = 100
limit = 3000
tolerance = .90
while True:
    data[:n].to_csv(g)
    p = g.tell()/limit
    print(n,g.tell(),p)
    if tolerance < p <= 1:
        break
    else:
        n = int(n/p)
        g = StringIO()
    if n >= nrows: break
    _ = input('?')
# with open(somefilename, 'w') as f:
#     g.seek(0)
#     f.write(g.read())
# some type of loop where successive slices of size n are written to a new file.
# [0n:1n], [1n:2n], [2n:3n] ...
Caveat, the docs for .tell() say:
Return the current stream position as an opaque number. The number does not usually represent a number of bytes in the underlying binary storage
My experience is that .tell() at the end of the stream is the number of bytes for an io.StringIO object - I must be missing something. Maybe if it contains multibyte Unicode characters it is different.
Maybe it is safer to use the length of the csv string for testing, in which case the io.StringIO object is not needed. This is probably better/simpler. If I had thoroughly read the docs first I would not have proposed the io.StringIO version - ##$%##.
n = 100
limit = 3000
tolerance = .90
while True:
    q = data[:n].to_csv()
    p = len(q)/limit
    print(f'n:{n}, len(q):{len(q)}, p:{p}')
    if tolerance < p <= 1:
        break
    else:
        n = int(n/p)
    if n >= nrows: break
    _ = input('?')
Another caveat: if the number of characters in each row for the first n rows varies significantly from the other n-sized slices, it is possible to overshoot or undershoot your limit if you don't test and adjust each slice before you write it.
setup for example:
import numpy as np
import pandas as pd
nrows = 1000
data = pd.DataFrame(np.random.randint(0,100,size=(nrows, 4)), columns=list('ABCD'))
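The multi-byte caveat mentioned in this answer can be checked without pandas: the length of a CSV string counts characters, while the file size depends on the encoded bytes. The sketch below assumes the file is written as UTF-8; csv_size_in_bytes is a hypothetical helper name, and the sample strings are made up for illustration.

```python
def csv_size_in_bytes(csv_text, encoding="utf-8"):
    """Return how many bytes csv_text occupies once encoded."""
    return len(csv_text.encode(encoding))


ascii_csv = "A,B\n1,2\n"                  # ASCII only: 1 byte per character
accented_csv = "name,city\nJosé,Zürich\n"  # é and ü take 2 bytes each in UTF-8

ascii_chars, ascii_bytes = len(ascii_csv), csv_size_in_bytes(ascii_csv)
accented_chars, accented_bytes = len(accented_csv), csv_size_in_bytes(accented_csv)
```

For data that is purely ASCII, such as the integer DataFrame in the setup, the two numbers agree, so len(q) is a reasonable proxy for the size limit; with non-ASCII text, comparing the encoded length against the limit is likely the safer test.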

Very fast rolling hash in Python?

I'm writing a toy rsync-like tool in Python. Like many similar tools, it will first use a very fast hash as the rolling hash, and then a SHA256 once a match has been found (but the latter is off topic here: SHA256, MD5, etc. are too slow as a rolling hash).
I'm currently testing various fast hash methods:
import os, random, time
block_size = 1024 # 1 KB blocks
total_size = 10*1024*1024 # 10 MB random bytes
s = os.urandom(total_size)
t0 = time.time()
for i in range(len(s)-block_size):
    h = hash(s[i:i+block_size])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
I get: 0.8 MB/s ... so the Python built-in hash(...) function is too slow here.
Which solution would allow a faster hash of at least 10 MB/s on a standard machine?
I tried with
import zlib
...
h = zlib.adler32(s[i:i+block_size])
but it's not much better (1.1 MB/s)
I tried with sum(s[i:i+block_size]) % modulo and it's slow too
Interesting fact: even without any hash function, the loop itself is slow!
t0 = time.time()
for i in range(len(s)-block_size):
    s[i:i+block_size]
I get: 3.0 MB/s only! So the simple fact of having a loop accessing a rolling block on s is already slow.
Instead of reinventing the wheel and write my own hash / or use custom Rabin-Karp algorithms, what would you suggest, first to speed up this loop, and then as a hash?
Edit: (Partial) solution for the "Interesting fact" slow loop above:
import os, random, time, zlib
from numba import jit
@jit()
def main(s):
    for i in range(len(s)-block_size):
        block = s[i:i+block_size]
total_size = 10*1024*1024 # 10 MB random bytes
block_size = 1024 # 1 KB blocks
s = os.urandom(total_size)
t0 = time.time()
main(s)
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
With Numba, there is a massive improvement: 40.0 MB/s, but still no hash done here. At least we're not blocked at 3 MB/s.

Instead of reinventing the wheel and write my own hash / or use custom
Rabin-Karp algorithms, what would you suggest, first to speed up this
loop, and then as a hash?
It's always great to start with this mentality, but it seems that you didn't get the idea of rolling hashes.
What makes a hash function great for rolling is its ability to reuse the previous computation.
A few hash functions allow a rolling hash to be computed very
quickly—the new hash value is rapidly calculated given only the old
hash value, the old value removed from the window, and the new value
added to the window.
From the same wikipedia page
It's hard to compare performance across different machines without timeit, but I changed your script to use a simple polynomial hash with a prime modulus (it would be even faster with a Mersenne prime, because the modulo operation could be done with bitwise operations):
import os, random, time
block_size = 1024 # 1 KB blocks
total_size = 10*1024*1024 # 10 MB random bytes
s = os.urandom(total_size)
base = 256
mod = int(1e9)+7
# On Python 3, indexing or iterating over bytes yields ints, so no ord() is needed.
def extend(previous_mod, byte):
    return ((previous_mod * base) + byte) % mod
most_significant = pow(base, block_size-1, mod)
def remove_left(previous_mod, byte):
    return (previous_mod - (most_significant * byte) % mod) % mod
def start_hash(data):
    h = 0
    for b in data:
        h = extend(h, b)
    return h
t0 = time.time()
h = start_hash(s[:block_size])
for i in range(block_size, len(s)):
    h = remove_left(h, s[i - block_size])
    h = extend(h, s[i])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))
Apparently you achieved quite an improvement with Numba, and it may speed up this code as well.
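One way to gain confidence in a rolling update is to compare it against recomputing every window from scratch on a small input. The sketch below does that for a polynomial hash with the same base and modulus; it assumes Python 3 (bytes index to ints), and the names full_hash and rolling_hashes are hypothetical.

```python
import os

BASE = 256
MOD = 10**9 + 7


def full_hash(block):
    """Polynomial hash of a whole block, computed from scratch."""
    h = 0
    for byte in block:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(data, block_size):
    """Yield the hash of every block_size-byte window using an O(1) update."""
    if len(data) < block_size:
        return
    most_significant = pow(BASE, block_size - 1, MOD)
    h = full_hash(data[:block_size])
    yield h
    for i in range(block_size, len(data)):
        h = (h - most_significant * data[i - block_size]) % MOD  # drop oldest byte
        h = (h * BASE + data[i]) % MOD                           # append newest byte
        yield h


data = os.urandom(2048)
block_size = 64
rolled = list(rolling_hashes(data, block_size))
direct = [full_hash(data[i:i + block_size])
          for i in range(len(data) - block_size + 1)]
```

Here rolled and direct should be identical lists; the rolling version does constant work per window, whereas the direct version does work proportional to block_size.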
To extract more performance you may want to write a function in C (or another low-level language such as Rust) that processes a big slice of the data at a time and returns an array with the hashes.
I'm creating an rsync-like tool as well, but as I'm writing it in Rust, performance at this level isn't a concern of mine. Instead, I'm following the tips of the creator of rsync and trying to parallelize everything I can, a painful task to do in Python (probably impossible without Jython).
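The Mersenne-prime remark earlier in this answer can be sketched as follows: for a modulus of the form 2**k - 1, the reduction can use shifts and masks instead of a division. This assumes the modulus 2**61 - 1 (a known Mersenne prime) and non-negative inputs; mod_mersenne is a hypothetical name. In pure Python this is unlikely to beat the built-in % operator, so treat it as an illustration of what a C or Numba version could do.

```python
MERSENNE_EXP = 61
MERSENNE = (1 << MERSENNE_EXP) - 1  # 2**61 - 1


def mod_mersenne(x):
    """Reduce a non-negative integer modulo 2**61 - 1 with shifts and masks.

    Uses the identity 2**61 == 1 (mod 2**61 - 1), so the high bits can be
    folded back onto the low bits.
    """
    while x > MERSENNE:
        x = (x & MERSENNE) + (x >> MERSENNE_EXP)
    return 0 if x == MERSENNE else x


samples = [0, 1, MERSENNE - 1, MERSENNE, MERSENNE + 1, 2 * MERSENNE,
           MERSENNE * 256 + 255, 12345678901234567890123]
reduced = [mod_mersenne(x) for x in samples]
```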

what would you suggest, first to speed up this loop, and then as a hash?
Increase the blocksize. The smaller your blocksize the more python you'll be executing per byte, and the slower it will be.
edit: your range has the default step of 1 and you don't multiply i by block_size, so instead of iterating on 10*1024 non-overlapping blocks of 1k, you're iterating on 10 million - 1024 mostly overlapping blocks

First, your slow loop. As has been mentioned you are slicing a new block for every byte (less blocksize) in the stream. This is a lot of work on both cpu and memory.
A faster loop would be to pre-chunk the data into blocks.
chunksize = 4096 # suggestion
# roll the window over the previous chunk's last block into the new chunk
lastblock = None
for readchunk in read_file_chunks(chunksize):
    for i in range(0, len(readchunk), blocksize):
        # slice a block only once
        newblock = readchunk[i:i+blocksize]
        if lastblock:
            for bi in range(len(newblock)):
                outbyte = lastblock[bi]
                inbyte = newblock[bi]
                # update rolling hash with inbyte and outbyte
                # check rolling hash for "hit"
        else:
            pass # calculate initial weak hash, check for "hit"
        lastblock = newblock
Chunksize should be a multiple of blocksize
Next, you were calculating a "rolling hash" over the entirety of each block in turn, instead of updating the hash byte by byte in "rolling" fashion. That is immensely slower. The above loop forces you to deal with the bytes as they go in and out of the window. Still, my trials show pretty poor throughput (about 3 MiB/s) even with a modest number of arithmetic operations on each byte. Edit: I initially had a zip() and that appears rather slow. I got more than double the throughput for the loop alone without the zip (current code above).
CPython runs Python bytecode on one thread at a time and is interpreted. I see one CPU pegged, and that is the bottleneck. To get faster you'll want multiple processes (subprocess), or to break into C, or both. Simply running the math in C would probably be enough I think. (Haha, "simply")
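The "update rolling hash with inbyte and outbyte" step in the loop above could look like the following sketch, written in the style of the two-part weak checksum usually described for rsync (a plain byte sum plus a position-weighted sum, both modulo 2**16). The function names and the modulus are assumptions for illustration, not something taken from the answer.

```python
M = 1 << 16  # assumed modulus for both checksum halves


def weak_checksum(block):
    """Compute (a, b) for a whole block from scratch."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a += byte
        b += (n - i) * byte  # earlier bytes carry a larger weight
    return a % M, b % M


def roll(a, b, outbyte, inbyte, block_len):
    """Slide the window by one byte: remove outbyte, append inbyte."""
    a = (a - outbyte + inbyte) % M
    b = (b - block_len * outbyte + a) % M
    return a, b


data = bytes(range(256)) * 4
block_len = 32
a, b = weak_checksum(data[:block_len])
rolled = [(a, b)]
for i in range(1, len(data) - block_len + 1):
    a, b = roll(a, b, data[i - 1], data[i + block_len - 1], block_len)
    rolled.append((a, b))
direct = [weak_checksum(data[i:i + block_len])
          for i in range(len(data) - block_len + 1)]
```

The lists rolled and direct should match, which shows that the per-byte update gives the same result as recomputing each window.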

Write (large amount of) zeros into a binary file

This might be a stupid question but I'm unable to find a proper answer to it. I want to store (don't ask why) a binary representation of a (2000, 2000, 2000) array of zeros into disk, binary format. The traditional approach to achieve so would be:
with open('myfile', 'wb') as f:
    f.write(b'\0' * 4 * 2000 * 2000 * 2000) # 4 bytes = float32
But that would imply creating a very large string which is not necessary at all. I know of two other options:
Iterate over the elements and store one byte at a time (extremely slow)
Create a numpy array and flush it to disk (as memory expensive as the string creation in the example above)
I was hoping to find something like write(char, ntimes) (as it exists in C and other languages) to copy char to disk ntimes at C speed, and not at Python-loop speed, without having to create such a big array in memory.

I don't know why you are making such a fuss about "Python's loop speed", but writing in the way of
for i in range(2000 * 2000):
    f.write(b'\0' * 4 * 2000) # 4 bytes = float32
will tell the OS to write 8000 0-bytes. After write returns, in the next loop run it is called again.
It might be that the loop is executed slightly slower than it would be in C, but that definitely won't make a difference.
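A small variation on the loop above, sketched under the assumption that a 1 MiB buffer is a reasonable write size: allocate one block of zeros up front and reuse it, so each iteration only issues a write call. write_zeros is a hypothetical helper name; io.BytesIO stands in for a real file so the example runs without touching the disk.

```python
import io


def write_zeros(f, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes zero bytes to the binary file object f in chunks."""
    zeros = bytes(chunk_size)  # one reusable buffer of zero bytes
    remaining = total_bytes
    while remaining > 0:
        n = min(remaining, chunk_size)
        f.write(zeros if n == chunk_size else zeros[:n])
        remaining -= n


buf = io.BytesIO()
write_zeros(buf, 2500, chunk_size=1024)  # two full chunks and a 452-byte tail
```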
If it is OK to have a sparse file, you can also seek to the position of the desired file size and then truncate the file there.
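A minimal sketch of the truncate approach, using a temporary directory so it cleans up after itself. create_zero_file is a hypothetical name. Whether the result is actually sparse, and whether the extended area reads back as zeros, depends on the platform and file system; on most systems the new bytes are zero-filled.

```python
import os
import tempfile


def create_zero_file(path, size):
    """Create (or overwrite) path and extend it to size bytes."""
    with open(path, "wb") as f:
        f.truncate(size)  # grows the empty file to the requested length


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zeros.bin")
    create_zero_file(path, 100000)
    size_on_disk = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(16)
```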

This would be a valid answer to fill a file from Python using numpy's memmap:
import numpy as np
filename = 'myfile' # output path, as in the question
shape = (2000, 2000, 2000) # or just (2000 * 2000 * 2000,)
fp = np.memmap(filename, dtype='float32', mode='w+', shape=shape)
fp[...] = 0

To write zeroes, there's a nice hack: open the file in read-write, seek to the offset minus one, and write another zero.
This pads the start of the file with zeroes as well:
mega_size = 100000
with open("zeros.bin","wb+") as f:
    f.seek(mega_size-1)
    f.write(bytearray(1))

The original poster does not make clear why this really needs to be done in Python. So here is a little shell command which does the same thing, but probably slightly faster, assuming the OP is on a Unix-like system (Linux / Mac).
Definitions: bs (blocksize) = 2000 * 2000, count = 4 * 2000
The if (input file) is a special 'zero producing' device. The of (output file) has to be specified as well.
dd bs=4000000 count=8000 if=/dev/zero of=/Users/me/thirtygig
On my computer (ssd) this takes about 110 seconds:
32000000000 bytes transferred in 108.435354 secs (295106705 bytes/sec)
You can always call this little shell command from python.
For comparison, @glglgl's little Python script runs in 303 seconds on the same computer, so only 3 times slower, which is probably fast enough.

Python: Read and write binary data

I am aware that there are a lot of almost identical questions, but none seems to really target the general case.
So assume I want to open a file, read it in memory, possibly do some operations on the respective bitstring and write the result back to file.
The following is what seems straightforward to me, but it results in completely different output. Note that for simplicity I only copy the file here:
file = open('INPUT','rb')
data = file.read()
data_16 = data.encode('hex')
data_2 = bin(int(data_16,16))
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
I looked at data, data_16 and data_2. The transformations make sense as far as I can see.
As expected, the output file has exactly the same size in bits as the input file.
EDIT: I considered the possibility that the leading '0b' has to be cut. See the following:
>>> data[:100]
'BMFU"\x00\x00\x00\x00\x006\x00\x00\x00(\x00\x00\x00\xe8\x03\x00\x00\xee\x02\x00\x00\x01\x00\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x0b\x00\x00\x12\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05=o\xce\xf4^\x16\xe0\x80\x92\x00\x00\x00\x01I\x02\x1d\xb5\x81\xcaN\xcb\xb8\x91\xc3\xc6T\xef\xcb\xe1j\x06\xc3;\x0c*\xb9Q\xbc\xff\xf6\xff\xff\xf7\xed\xdf'
>>> data_16[:100]
'424d46552200000000003600000028000000e8030000ee020000010018000000000000000000120b0000120b000000000000'
>>> data_2[:100]
'0b10000100100110101000110010101010010001000000000000000000000000000000000000000000011011000000000000'
>>> data_2[1]
'b'
Maybe the BMFU" part should be cut from data?

>>> bin(25)
'0b11001'
Note two things:
The "0b" at the beginning. This means that your slicing will be off by 2 bits.
The lack of padding to 8 bits. This will corrupt your data every time unless it happens to mesh up with point 1.
Process the file byte by byte instead of attempting to process it in one big gulp like this. If you find your code too slow then you need to find a faster way of working byte by byte, not switch to an irreparably flawed method such as this one.

You could simply write the data variable back out and you'd have a successful round trip.
But it looks like you intend to work on the file as a string of 0 and 1 characters. Nothing wrong with that (though it's rarely necessary), but your code takes a very roundabout way of converting the data to that form. Instead of building a monster integer and converting it to a bit string, just do so for one byte at a time:
data = file.read()
data_2 = "".join( bin(ord(c))[2:] for c in data )
data_2 is now a sequence of zeros and ones. (In a single string, same as you have it; but if you'll be making changes, I'd keep the bitstrings in a list). The reverse conversion is also best done byte by byte:
newdata = "".join(chr(int(byte, 8)) for byte in grouper(long_bitstring, 8, "0"))
This uses the grouper recipe from the itertools documentation.
from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

You can use the struct module to read and write binary data. (Link to the doc here.)
EDIT
Sorry, I was misled by your title. I've just understood that you write binary data in a text file instead of writing binary data directly.

Ok, thanks to alexis and being aware of Ignacio's warning about the padding, I found a way to do what I wanted to do, that is read data into a binary representation and write a binary representation to file:
def padd(bitstring):
    padding = ''
    for i in range(8-len(bitstring)):
        padding += '0'
    bitstring = padding + bitstring
    return bitstring
file = open('INPUT','rb')
data = file.read()
data_2 = "".join( padd(bin(ord(c))[2:]) for c in data )
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
If I did not do it exactly the way proposed by alexis then that is because it did not work. Of course this is terribly slow but now that I can do the simplest thing, I can optimize it further.
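For readers on Python 3, where iterating over bytes yields integers, the same round trip can be sketched more compactly. This assumes each byte is written as exactly 8 zero-padded bits, which is the padding issue discussed in this thread; bytes_to_bits and bits_to_bytes are hypothetical names.

```python
def bytes_to_bits(data):
    """Return a string of '0'/'1' characters, 8 per input byte."""
    return "".join(format(byte, "08b") for byte in data)


def bits_to_bytes(bits):
    """Inverse of bytes_to_bits; the length of bits must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


original = b"BMFU\x00\x01\x19\xff"
bits = bytes_to_bits(original)
restored = bits_to_bytes(bits)
```

With files, original would come from open('INPUT', 'rb').read() and restored would be written with open('OUTPUT', 'wb').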

Fast way to read interleaved data?

I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it seems to always be a power of 2, though I don't think that's important.
So, if I have channels a, b, and c, sampled at dividers of 1, 2, and 4, my stream will look like:
a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...
For added fun, the channels can independently be floats or ints (though I know for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big and sometimes little-endian, though I know what I'm dealing with up-front.
I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):
for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ... # build format string from channels_to_sample
    data = struct.unpack( my_file.read( ... ) ) # read and unpack the data
    # iterate over data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num / ch.divider] = val
And it's slow -- a few seconds to read a 20MB file on my laptop. Profiler tells me I'm spending a bunch of time in Channel#samples_for() -- which makes sense; there's a bit of conditional logic there.
My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.
Update
Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 seconds to about 0.2 seconds, for a speedup of 50x.

The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.
First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then
N = lcm*sum(b_1/d_1+...+b_k/d_k)
is the number of bytes after which the pattern of streams repeats. If you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
You can now build the array of stream indices for the first N bytes by something similar to
stream_index = []
for sample_num in range(lcm):
    stream_index += [i for i, ch in enumerate(all_channels)
                     if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)
Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.
Now you can do
data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]
You possibly need to pad the data a bit at the end to make the reshape() work.
Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be interpreted as big-endian integers, simply write
streams[0].dtype = ">i"
This won't change the data in the array in any way, just the way it is interpreted.
This may look a bit cryptic, but should be much better performance-wise.
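To make the repeating pattern concrete, here is a pure-Python sketch that builds the per-byte stream index for one period. It assumes a channel is sampled whenever sample_num is a multiple of its divider (which matches the a, b, c example in the question) and uses made-up sample sizes of 4, 4 and 2 bytes; stream_pattern is a hypothetical name.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def stream_pattern(dividers, sample_sizes):
    """Return, for each byte in one period, the index of the stream owning it."""
    period = reduce(lcm, dividers)
    pattern = []
    for sample_num in range(period):
        for i, (divider, size) in enumerate(zip(dividers, sample_sizes)):
            if sample_num % divider == 0:  # assumed sampling rule
                pattern.extend([i] * size)
    return pattern


# Channels a, b, c with dividers 1, 2, 4 and sizes 4, 4, 2 bytes (assumed).
pattern = stream_pattern([1, 2, 4], [4, 4, 2])
n_bytes = len(pattern)  # N = lcm * (4/1 + 4/2 + 2/4) = 4 * 6.5 = 26
```

Converting pattern to a NumPy array gives the stream_index used for the boolean column selection described above.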

Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:
for (chan, sample_data) in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)
To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.
import itertools
def iter_channels(channels):
    for i in itertools.count():
        for chan in channels:
            if i % chan.period == 0:
                yield chan
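A self-contained Python 3 sketch of this generator, using the dividers 1, 2 and 4 from the question as periods. The Channel namedtuple is a hypothetical stand-in for whatever channel objects the real code uses.

```python
import itertools
from collections import namedtuple

# Hypothetical channel record: a name and an integer sampling period.
Channel = namedtuple("Channel", ["name", "period"])


def iter_channels(channels):
    """Yield channels in the order their samples appear in the stream."""
    for tick in itertools.count():
        for chan in channels:
            if tick % chan.period == 0:
                yield chan


channels = [Channel("a", 1), Channel("b", 2), Channel("c", 4)]
order = [chan.name for chan in itertools.islice(iter_channels(channels), 10)]
```

The first ten names come out as a b c a a b a a b c, which lines up with the a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 layout in the question. Because the generator is infinite, it has to be bounded with islice or zipped against the finite data.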

The grouper() recipe along with itertools.izip() should be of some help here.
