Fastest method to modify bytestring over loop in Python?

I have an image stored as a bytestring b'' and am performing per-pixel operations. Right now the fastest way I've found is to use the struct module to pack and unpack the bytes during modification, then save the pixels to a bytearray:
# retrieve image data. Stored as bytestring
pixels = buff.get(rect, 1.0, "CIE LCH(ab) alpha double",
                  Gegl.AbyssPolicy.CLAMP)
# iterator split into 32-byte chunks for each pixel's 8-byte LCHA channels
pixels_iter = (pixels[x:x + 32] for x in range(0, len(pixels), 32))
new_pixels = bytearray()
# when using `pool.map`, the loop was placed in its own function.
for pixel in pixels_iter:
    l, c, h, a = struct.unpack('dddd', pixel)
    # simple operation for now: lower chroma if bright and saturated
    c = c - (l * c) / 100
    new_pixels += struct.pack('dddd', l, c, h, a)
# save new data. everything hereout handled by GEGL instead of myself.
shadow.set(rect, "CIE LCH(ab) alpha double", bytes(new_pixels))
Problem is this takes about 3 1/2 seconds for a 7MP image on my workstation. Fair but not ideal if updates are frequently requested. From what I've gathered, it seems the constant array modification and possibly struct [un]packing are the main culprits. I've refactored this probably a dozen times and I think I'm out of ideas for optimizing this.
I've tried:
struct.unpacking the whole bytestring once instead of each pixel as-needed. Lost about 20% efficiency.
collections.deque: admittedly not familiar with its technicalities. Lost 10-30% depending on implementation.
Similar results with other iterator helpers like map/join.
numpy.array: also admittedly know basically nothing about general numpy. Similar results to deque.
multiprocessing seemed to be bottlenecked when I appended the pool.map results to new_pixels. Actually lost about 10% which seems wild, as usually I can just lazily throw threads at problems. The pixels_iter was grouped again into equally sized sublists for each thread, so new_pixels concatenated 8 large lists instead of a few million small lists, which I thought would be faster. Tempted to retry this one as I might've botched it somehow with my 4 am implementation.
In theory it could also work by saving multiple small sections of the image buffer to avoid concatenating to new_pixels entirely, but that would vastly increase code complexity elsewhere.
Converting pixels itself into a bytearray and modifying it in-place using slice ranges. Lost ~30% but also halved memory usage.
Completely separate interpreters like PyPy are off the table, as I'm not the one bundling the Python version.

NumPy should produce much faster results than a manual loop, if you use it properly. Using it properly means using NumPy operations over whole arrays, not just looping manually over a NumPy array.
For example,
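import numpy
# writable copy of the image data; the array below is a view onto it
new_pixels = bytearray(pixels)
as_numpy = numpy.frombuffer(new_pixels, dtype=float)
# scale every chroma value (index 1 of each pixel) by 1 - lightness / 100
as_numpy[1::4] *= 1 - as_numpy[::4] / 100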
Now new_pixels contains the adjusted values.
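As a sanity check, the whole-array version can be compared against the per-pixel struct loop from the question. The sketch below is only an illustration: the sample data is synthetic (random doubles rather than a real GEGL buffer), and the helper names adjust_loop and adjust_numpy are made up for this example.

```python
import struct
import numpy as np

def adjust_loop(pixels: bytes) -> bytes:
    """Reference implementation: unpack, modify and repack one pixel at a time."""
    out = bytearray()
    for offset in range(0, len(pixels), 32):
        l, c, h, a = struct.unpack_from('dddd', pixels, offset)
        c = c - (l * c) / 100
        out += struct.pack('dddd', l, c, h, a)
    return bytes(out)

def adjust_numpy(pixels: bytes) -> bytes:
    """Same operation expressed as one NumPy operation over the whole buffer."""
    buf = bytearray(pixels)                      # writable copy of the data
    arr = np.frombuffer(buf, dtype=np.float64)   # view onto buf, no extra copy
    arr[1::4] *= 1 - arr[::4] / 100              # c *= 1 - l / 100 for every pixel
    return bytes(buf)

# Synthetic stand-in for the image: 1000 pixels of 4 doubles each (hypothetical data).
rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 100.0, size=4 * 1000).tobytes()

loop_result = np.frombuffer(adjust_loop(sample), dtype=np.float64)
numpy_result = np.frombuffer(adjust_numpy(sample), dtype=np.float64)
print(np.allclose(loop_result, numpy_result))
```

The two results agree up to floating-point rounding, because c - l*c/100 and c*(1 - l/100) are the same expression algebraically.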

Related

Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselves range from 0 to 15 million.
I have considered using:
Pandas, storing the batches as python lists
Numpy, where the batch is stored as its own numpy array (since numpy doesn't allow variable length rows in its 2D data structures)
Python List of Lists.
I also looked at Tensorflow tfrecords but not too sure about this one.
I only have about 12 gbs of RAM. I will also be using to train over a machine learning algorithm so

If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.
Numpy should have no memory issues when handling data of this size but another thing to consider, and this depends on how you will be using this data, is to use a generator to read from a file that has each pair on a new line. This would reduce memory usage significantly but would be slower than numpy for processing aggregate functions like sum() or max() and is more suitable if each value pair would be processed independently.
with open(file, 'r') as f:
    data = (l for l in f)  # generator
    for line in data:
        # process each record here
        pass

I would do the following:
import numpy as np
# create example data
A = np.random.randint(0,15000000,100)
B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]
int32 is sufficient
A32 = A.astype(np.int32)
We want to glue all the batches together.
First, write down the batch sizes so we can separate them later.
from itertools import chain
sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()
# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
After glueing resplit.
B32 = np.split(B_all, boundaries[1:-1])
Finally, make an array of pairs for convenience:
pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
What was the point of glueing and then splitting again?
First, note that the resplit arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather some of its elements) in place, the other one will be automatically updated as well.
The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:] (sizes carries a leading 0, so it has to be dropped), which is faster than looping through pairs['second'].
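The flatten-and-reduceat idea can be tried in isolation. This is a minimal sketch under the same assumptions as the example data above (100 random batches of 2 to 100 ints); the names batches, starts and flat are chosen for this illustration and are not part of the answer's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 variable-length batches of int32 values, lengths between 2 and 100
batches = [rng.integers(0, 15_000_000, size=k, dtype=np.int32)
           for k in rng.integers(2, 101, size=100)]

sizes = np.array([len(b) for b in batches], dtype=np.int64)
# start offset of every batch inside the flat array
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
flat = np.concatenate(batches)

# one vectorized pass: sum each segment [starts[i], starts[i+1]) of flat
sums = np.add.reduceat(flat, starts, dtype=np.int64)
means = sums / sizes

# reference result computed with a plain Python loop over the batches
expected = np.array([b.mean() for b in batches])
print(np.allclose(means, expected))
```

Accumulating in int64 avoids any risk of overflow when many large int32 values are summed.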

Use numpy. It is the most efficient option and you can use it easily with a machine learning model.

numpy.dot -> MemoryError, my_dot -> very slow, but works. Why?

I am trying to compute the dot product of two numpy arrays sized respectively (162225, 10000) and (10000, 100). However, if I call numpy.dot(A, B) a MemoryError happens.
I, then, tried to write my implementation:
def slower_dot(A, B):
    """Low-memory implementation of dot product"""
    # Assuming A and B are of the right type and size
    R = np.empty([A.shape[0], B.shape[1]])
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            R[i,j] = np.dot(A[i,:], B[:,j])
    return R
and it works just fine, but is of course very slow. Any idea of 1) what is the reason behind this behaviour and 2) how I could circumvent / solve the problem?
I am using Python 3.4.2 (64bit) and Numpy 1.9.1 on a 64bit equipped computer with 16GB of ram running Ubuntu 14.10.

The reason you're getting a memory error is probably because numpy is trying to copy one or both arrays inside the call to dot. For small to medium arrays this is often the most efficient option, but for large arrays you'll need to micro-manage numpy in order to avoid the memory error. Your slower_dot function is slow largely because of the python function call overhead, which you suffer 162225 x 100 times. Here is one common way of dealing with this kind of situation when you want to balance memory and performance limitations.
import numpy as np
def chunking_dot(big_matrix, small_matrix, chunk_size=100):
    # Make a copy if the array is not already contiguous
    small_matrix = np.ascontiguousarray(small_matrix)
    R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
    for i in range(0, R.shape[0], chunk_size):
        end = i + chunk_size
        R[i:end] = np.dot(big_matrix[i:end], small_matrix)
    return R
You'll want to pick the chunk_size that works best for your specific array sizes. Typically larger chunk sizes will be faster as long as everything fits in memory.
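A quick way to gain confidence in the chunked approach is to run it on small arrays and compare it with a direct product. The sketch below restates the function under the hypothetical name chunked_dot so that it runs on its own; the array sizes are arbitrary and deliberately not a multiple of the chunk size.

```python
import numpy as np

def chunked_dot(big, small, chunk_size=100):
    """Multiply big @ small one block of rows at a time."""
    small = np.ascontiguousarray(small)
    out = np.empty((big.shape[0], small.shape[1]),
                   dtype=np.result_type(big, small))
    for start in range(0, big.shape[0], chunk_size):
        stop = start + chunk_size
        # slicing past the end is clamped, so the last block may be shorter
        out[start:stop] = np.dot(big[start:stop], small)
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((1050, 40))   # 1050 rows: the last chunk has only 50
B = rng.standard_normal((40, 7))

R = chunked_dot(A, B, chunk_size=100)
print(np.allclose(R, A @ B))
```

Only one block of rows is handed to np.dot at a time, so any temporary copy made inside the call is limited to chunk_size rows rather than the whole matrix.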

I think the problem starts with the matrix A itself: a 162225 * 10000 matrix already occupies about 12 GB of memory if each element is a double-precision floating-point number. That, together with the temporary copies numpy creates for the dot operation, will cause the error. The extra copies are made because numpy uses the underlying BLAS operations for dot, which need the matrices to be stored in contiguous C order.
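The size estimate can be reproduced with plain arithmetic. The sketch assumes 8-byte float64 elements and uses the shapes given in the question; the variable names are illustrative only.

```python
rows, inner, cols = 162225, 10000, 100   # shapes from the question
bytes_per_float64 = 8
gib = 1024 ** 3

a_bytes = rows * inner * bytes_per_float64   # the large input matrix A
b_bytes = inner * cols * bytes_per_float64   # the small input matrix B
r_bytes = rows * cols * bytes_per_float64    # the result matrix

print(f"A:      {a_bytes / gib:.2f} GiB")
print(f"B:      {b_bytes / gib:.4f} GiB")
print(f"result: {r_bytes / gib:.2f} GiB")
# A single temporary copy of A would already exceed 16 GiB in total
print(f"A plus one copy of A: {2 * a_bytes / gib:.2f} GiB")
```

A alone takes roughly three quarters of the 16 GB machine, so even one full temporary copy cannot fit.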
Check out these links if you want more discussions about improving dot performance
http://wiki.scipy.org/PerformanceTips
Speeding up numpy.dot
https://github.com/numpy/numpy/pull/2730

Generating very large 2D-array in Python?

I'd like to generate very large 2D-array (or, in other terms, a matrix) using list of lists. Each element should be a float.
So, just to give an example, let's assume to have the following code:
import numpy as np
N = 32000
def largeMat():
    m = []
    for i in range(N):
        l = list(np.ones(N))
        m.append(l)
        if i % 1000 == 0:
            print i
    return m
m = largeMat()
I have 12GB of RAM, but as the code reaches the 10000-th line of the matrix, my RAM is already full. Now, if I'm not wrong, each float is 64-bit large (or 8 byte), so the total occupied RAM should be:
32000 * 32000 * 8 / 1 MB = 8192 MB
Why does python fill my whole RAM and even start to allocate into swap?

Python does not necessarily store list items in the most compact form, as lists require pointers to the next item, etc. This is a side effect of having a data type which allows deletes, inserts, etc. For a simple two-way linked list the usage would be two pointers plus the value, in a 64-bit machine that would be 24 octets per float item in the list. In practice the implementation is not that stupid, but there is still some overhead.
If you want to have a concise format, I'd suggest using a numpy.array as it will take exactly as many bytes as you think it'd take (plus a small overhead).
Edit Oops. Not necessarily. Explanation wrong, suggestion valid. numpy is the right tool as numpy.array exists for this reason. However, the problem is most probably something else. My computer will run the procedure even though it takes a lot of time (appr. 2 minutes). Also, quitting python after this takes a long time (actually, it hung). Memory use of the python process (as reported by top) peaks at 10 000 MB and then falls down to slightly below 9 000 MB. Probably the allocated numpy arrays are not garbage collected very fast.
But about the raw data size in my machine:
>>> import sys
>>> l = [0.0] * 1000000
>>> sys.getsizeof(l)
8000072
So there seems to be a fixed overhead of 72 octets per list.
>>> listoflists = [ [1.0*i] * 1000000 for i in range(1000)]
>>> sys.getsizeof(listoflists)
9032
>>> sum([sys.getsizeof(l) for l in listoflists])
8000072000
So, this is as expected.
On the other hand, reserving and filling the long list of lists takes a while (about 10 s). Also, quitting python takes a while. The same for numpy:
>>> a = numpy.empty((1000,1000000))
>>> a[:] = 1.0
>>> a.nbytes
8000000000
(The byte count is not entirely reliable, as the object itself takes some space for its metadata, etc. There has to be the pointer to the start of the memory block, data type, array shape, etc.)
This takes much less time. The creation of the array is almost instantaneous, inserting the numbers takes maybe a second or two. Allocating and freeing a lot of small memory chunks is time consuming and while it does not cause fragmentation problems in a 64-bit machine, it is still much easier to allocate a big chunk of data.
If you have a lot of data which can be put into an array, you need a good reason for not using numpy.
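The difference between the two representations can be measured on a smaller scale. The following is a rough sketch, assuming CPython on a 64-bit machine: a list stores one pointer per element and every element is a separate object, and list(np.ones(N)) as used in the question produces NumPy scalar objects rather than raw 8-byte values, which is probably a large part of the memory growth seen there. The helper list_footprint is hypothetical and only adds the list's own size to the sizes of its elements.

```python
import sys
import numpy as np

n = 100_000

as_list = [float(i) for i in range(n)]   # list of Python float objects
as_np_scalars = list(np.ones(n))         # list of NumPy scalar objects, as in the question
as_array = np.ones(n)                    # one contiguous block of float64

def list_footprint(lst):
    """Size of the pointer array plus the size of every element object.

    Assumes every element is a distinct object, which holds for the lists above.
    """
    return sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

list_bytes = list_footprint(as_list)
np_scalar_list_bytes = list_footprint(as_np_scalars)
array_bytes = as_array.nbytes

print("list of floats:        ", list_bytes, "bytes")
print("list of numpy scalars: ", np_scalar_list_bytes, "bytes")
print("numpy array data:      ", array_bytes, "bytes")
```

Note that sys.getsizeof on a list reports only the pointer array, which is why the interactive session above shows about 8 bytes per element; the float objects themselves have to be counted separately.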

Repeatedly appending to a large list (Python 2.6.6)

I have a project where I am reading in ASCII values from a microcontroller through a serial port (looks like this : AA FF BA 11 43 CF etc)
The input is coming in quickly (38 two character sets / second).
I'm taking this input and appending it to a running list of all measurements.
After about 5 hours, my list has grown to ~ 855000 entries.
I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.
Is there a more efficient, faster way to append to a list than list.append()?
Thanks Everyone.

I'm given to understand that the larger a list becomes, the slower list operations become.
That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.
As you will see when you inspect the source code, lists are overallocating, i.e. when they are resized, they allocate more than needed for one item so the next n items can be appended without needing another resize (which is O(n)). The growth isn't constant, it is proportional to the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:
/* This over-allocates proportional to the list size, making room
 * for additional growth. The over-allocation is mild, but is
 * enough to give linear-time amortized behavior over a long
 * sequence of appends() in the presence of a poorly-performing
 * system realloc().
 * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 */
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that makes the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't because you generate too much garbage (that slips refcounting), you're out of luck though.
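The over-allocation can be observed from Python without reading the C source. This is a small sketch, assuming CPython: sys.getsizeof changes only when the list's internal array is actually resized, so counting those changes shows how rarely an append triggers a reallocation. The exact growth pattern differs between Python versions, so no specific numbers are assumed here.

```python
import sys

items = []
resizes = 0
last_size = sys.getsizeof(items)
total_appends = 100_000

for i in range(total_appends):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:      # the internal array was reallocated
        resizes += 1
        last_size = size

print(f"{total_appends} appends triggered {resizes} resizes")
```

Because each resize adds room proportional to the current length, the number of resizes grows only logarithmically with the number of appends.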

One thing you might want to consider is writing your data to a file as it's collected. I don't know (or really care) if it will affect performance, but it will help ensure that you don't lose all your data if power blips. Once you've got all the data, you can suck it out of the file and jam it in a list or an array or a numpy matrix or whatever for processing.

Appending to a python list has a constant cost. It is not affected by the number of items in the list (in theory). In practice appending to a list will get slower once you run out of memory and the system starts swapping.
http://wiki.python.org/moin/TimeComplexity
It would be helpful to understand why you actually append things to a list. What are you planning to do with the items? If you don't need all of them you could build a ring buffer, if you don't need to do computation you could write the list to a file, etc.
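If only the most recent readings are needed, one way to get a ring buffer is collections.deque with a maxlen. The sketch below is an illustration only: the loop stands in for the serial port, and the buffer size of 1000 is an arbitrary choice.

```python
from collections import deque

# keep only the 1000 most recent readings; older ones are dropped automatically
recent = deque(maxlen=1000)

for i in range(5000):
    reading = f"{i % 256:02X}"   # stand-in for a two-character hex value from the port
    recent.append(reading)

print(len(recent), recent[0], recent[-1])
```

Appending to a full deque discards the oldest item in constant time, so memory use stays bounded however long the test runs.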

First of all, 38 two-character sets per second, 1 stop bit, 8 data bits, and no parity, is only 760 baud, not fast at all.
But anyway, my suggestion, if you're worried about having overly large lists or don't want to use one huge list, is just to store a list on disk once it reaches a certain size and start a new list, repeating until you've gotten all the data, then combining all the lists into one once you're done receiving the data.
Though you may skip the sublists completely and just go with nmichaels' suggestion, writing the data to a file as you get it and using a small circular buffer to hold the received data that has not yet been written.

It might be faster to use numpy if you know how long the array is going to be and you can convert your hex codes to ints:
import numpy
a = numpy.zeros(3000000, numpy.int32)
for i in range(3000000):
    a[i] = int(scanHexFromSerial(),16)
This will leave you with an array of integers (which you could convert back to hex with hex()), but depending on your application maybe that will work just as well for you.

Why do dicts of defaultdict(int)'s use so much memory? (and other simple python performance questions)

I do understand that querying a non-existent key in a defaultdict the way I do will add items to the defaultdict. That is why it is fair to compare my 2nd code snippet to my first one in terms of performance.
import numpy as num
from collections import defaultdict
topKeys = range(16384)
keys = range(8192)
table = dict((k,defaultdict(int)) for k in topKeys)
dat = num.zeros((16384,8192), dtype="int32")
print "looping begins"
#how much memory should this use? I think it shouldn't use more than a few
#times the memory required to hold (16384*8192) int32's (512 mb), but
#it uses 11 GB!
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
print "done"
What is going on here? Furthermore, this similar script takes eons to run compared to the first one, and also uses an absurd quantity of memory.
topKeys = range(16384)
keys = range(8192)
table = [(j,0) for k in topKeys for j in keys]
I guess python ints might be 64 bit ints, which would account for some of this, but do these relatively natural and simple constructions really produce such a massive overhead?
I guess these scripts show that they do, so my question is: what exactly is causing the high memory usage in the first script and the long runtime and high memory usage of the second script and is there any way to avoid these costs?
Edit:
Python 2.6.4 on 64 bit machine.
Edit 2: I can see why, to a first approximation, my table should take up 3 GB
16384*8192*(12+12) bytes
and 6GB with a defaultdict load factor that forces it to reserve double the space.
Then inefficiencies in memory allocation eat up another factor of 2.
So here are my remaining questions:
Is there a way for me to tell it to use 32 bit ints somehow?
And why does my second code snippet take FOREVER to run compared to the first one? The first one takes about a minute and I killed the second one after 80 minutes.

Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.
The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)
A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer serving as the end of a linked list. That's 6x as many pointers as numbers, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, handwavey look at this, 16384 occurrences * (8192 items/occurrence) * 7 (pointers/item) * 8 (bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique number of which is itself a Python object), the outer dict, that numpy array, or the stuff Python's keeping track of to try to optimize some.
Your overhead sounds a little higher than I'd expect and I would be interested in knowing whether that 11GB was for a whole process or whether you calculated it for just table. In any event, I do expect the size of this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.
As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.

Well, look at what your code is actually doing:
topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)
This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build.) That's just the object itself; unless the dict is less than a couple of items, the data is a separate block of memory, between 12 and 24 bytes, and it's always between 1/2 and 2/3rds filled. And defaultdicts are 4 to 8 bytes bigger because they have this extra thing to store. And ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them. So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. So that's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
Then you create a numpy array, which is actually pretty conservative with memory:
dat = num.zeros((16384,8192), dtype="int32")
This will have some overhead for the array itself, the usual Python object overhead plus the dimensions and type of the array and such, but it wouldn't be much more than 100 bytes, and only for the one array. It does store 16384*8192 int32's in your 512Mb though.
And then you have this rather peculiar way of filling this numpy array:
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
The two loops themselves don't use much memory, and they re-use it each iteration. However, table[k][j] creates a new Python integer for each value you request, and stores it in the defaultdict. The integer created is always 0, and it so happens that that always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2.) That lands you close to 3Gb of actual data right there, and 6Gb in a 64-bit build.
On top of that the defaultdicts, because you keep adding data, have to keep growing, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc) and how it allocates smaller objects in blocks of its own, and how process memory works on most operating systems, this means your process will allocate more and not be able to free it; it won't actually use all of the 11Gb, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11Gb.

Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.
The way that the defaultdict (DD) is set-up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 for your case) but also the DD now stores a key that previously wasn't in the DD with the default value of 0. I personally don't like this, but that's how it goes. However, it means that for every iteration of the inner loop, new memory is being allocated which is why it is taking forever. If you change the lines
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
to
for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0
then default values aren't being assigned to keys in the DDs and so the memory stays around 540 MB for me which is mostly just the memory allocated for dat. DDs are decent for sparse matrices though you probably should just use the sparse matrices in Scipy if that's what you want.
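The insert-on-lookup behaviour is easy to see on a small defaultdict. This is a minimal sketch with made-up variable names; it also shows that dict.get, like the membership test above, reads a value without storing the key.

```python
from collections import defaultdict

# Indexing a missing key calls the default factory and stores the result.
table = defaultdict(int)
for j in range(1000):
    _ = table[j]
size_after_indexing = len(table)

# .get() returns the fallback value without touching the dict.
table2 = defaultdict(int)
for j in range(1000):
    _ = table2.get(j, 0)
size_after_get = len(table2)

print(size_after_indexing, size_after_get)
```

In the question's loop every one of the 16384 * 8192 lookups therefore adds an entry, which is where the memory goes.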
