Short version:
How can I get output from input below without an O(n) memory copy?
start = 2  # points at input[start] == 1, the logical first element
input = np.array([4, 5, 1, 2, 3])
output = np.array([1, 2, 3, 4, 5])  # output doesn't need to be contiguous in memory; a view is just fine
Long version:
I'm trying to implement a fixed-length FIFO buffer based on numpy.ndarray for hardware simulation.
My goal is O(1) (or at least something below O(n)) enqueue/dequeue, and, most importantly, to get a numpy array with the elements in the correct order.
I ended up using a pointer start that points at the first element of the FIFO, so enqueue/dequeue is just a single assignment/indexing operation, which is fast.
But the problem is that when I want to use the whole FIFO as a numpy.ndarray and do some math (like np.dot(signal, fifo)), I cannot find a way to get the corresponding array without an O(n) operation.
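For concreteness, here is a minimal sketch of the pointer-based buffer described above (the class and method names are just for illustration):
import numpy as np

class RingBuffer:
    """Fixed-length FIFO backed by a preallocated numpy array."""

    def __init__(self, size, dtype=np.float64):
        self.buf = np.zeros(size, dtype=dtype)
        self.start = 0  # index of the logical first element

    def push(self, value):
        # Overwrite the oldest element and advance the pointer: O(1).
        self.buf[self.start] = value
        self.start = (self.start + 1) % self.buf.size

    def as_array(self):
        # This is the expensive part the question is about: an O(n) copy.
        return np.concatenate([self.buf[self.start:], self.buf[:self.start]])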
This sounds like a wheel that has already been invented, so I searched online and tried a few things, but none of them suits my needs:
input[1:] = input[:-1]; input[0] = new_val: a naive but so far the most efficient way, still O(n)
collections.deque: good enqueue/dequeue performance, but it takes a long time to work with other numpy arrays (it seems numpy converts the deque to a numpy.ndarray first and then does the math, which takes a lot of time)
indexing: input[np.arange(start, start + size) % size]: this is advanced indexing, so it returns a new array, and building the index array makes it even slower
np.r_[input[start:], input[:start]]: better than advanced indexing when the buffer is large, but still O(n)
np.roll(input, -start): O(n) and slower
np.concatenate([input[start:], input[:start]]): O(n), slower than the naive way but faster than the others
Since the original array is contiguous, I also tried the strides/shape trick to create a view of the original array, but that doesn't work either: a wrapped-around ordering cannot be expressed as a constant-stride view of the same buffer.
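For reference, a rough benchmark sketch of the approaches above (sizes and timings are illustrative):
import timeit
import numpy as np

size = 10000
buf = np.arange(size, dtype=np.float64)
start = size // 2
idx = np.arange(start, start + size) % size  # precomputed fancy-index array

def naive_shift():
    buf[1:] = buf[:-1]
    buf[0] = 0.0

candidates = {
    "naive shift": naive_shift,
    "fancy indexing": lambda: buf[idx],
    "np.r_": lambda: np.r_[buf[start:], buf[:start]],
    "np.roll": lambda: np.roll(buf, -start),
    "np.concatenate": lambda: np.concatenate([buf[start:], buf[:start]]),
}
for name, fn in candidates.items():
    print(f"{name:16s} {timeit.timeit(fn, number=1000):.4f} s")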
So I came here for help. Or does it really have to be O(n)?
Related
What's the .append time complexity of array.array and np.array?
I see the time complexity for list, collections.deque, set, and dict in python_wiki, but I can't find the time complexity of array.array and np.array. Where can I find them?
So, to the link you provided (also a TL;DR): lists are internally "represented as an array" (link). Appending is supposed to be O(1), with a note at the bottom saying:
"These operations rely on the "Amortized" part of "Amortized Worst Case". Individual actions may take surprisingly long, depending on the history of the container."
link
More details
It doesn't go into detail in the docs, but if you look at the source code you'll see what's actually going on. Python arrays have an internal buffer that lets them resize quickly, and they realloc as the array grows/shrinks.
array.append uses arraymodule.array_array_append, which calls arraymodule.ins, which in turn calls arraymodule.ins1, the meat and potatoes of the operation. Incidentally, array.extend uses this as well, but it just supplies Py_SIZE(self) as the insertion index.
So if we read the notes in arraymodule.ins1 it starts off with:
Bypass realloc() when a previous overallocation is large enough
to accommodate the newsize. If the newsize is 16 smaller than the
current size, then proceed with the realloc() to shrink the array.
link
...
This over-allocates proportional to the array size, making room
for additional growth. The over-allocation is mild, but is
enough to give linear-time amortized behavior over a long
sequence of appends() in the presence of a poorly-performing
system realloc().
The growth pattern is: 0, 4, 8, 16, 25, 34, 46, 56, 67, 79, ...
Note, the pattern starts out the same as for lists but then
grows at a smaller rate so that larger arrays only overallocate
by about 1/16th -- this is done because arrays are presumed to be more
memory critical.
link
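You can watch this over-allocation happen with a quick sketch; on CPython, sys.getsizeof of an array.array should reflect the allocated buffer, not just the slots in use:
import sys
from array import array

a = array('i')
last = None
for i in range(64):
    size = sys.getsizeof(a)  # includes the over-allocated buffer on CPython
    if size != last:
        print(f"len={len(a):2d}  sizeof={size} bytes")
        last = size
    a.append(i)
The printed sizes should stay flat for several appends and then jump, following roughly the growth pattern quoted above.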
It is important to understand the array data structure to answer your question. Both array objects store their elements in a contiguous C array, so they share a lot of behavior, but they differ on append: array.array over-allocates, while numpy.append always builds a brand-new array and copies everything, so it is O(n) every time.
For a dynamic array like array.array, appending an item is amortized O(1). Most of the time the underlying buffer still has spare capacity, so the append is just a write into the next free slot, which is O(1). Occasionally, though, the buffer is full and the whole array has to be copied into a larger allocation with the new item added; copying n elements makes that particular append O(n). Because the buffer grows geometrically, these expensive copies are rare enough that the average cost per append stays constant.
An interesting example from this post:
To make this clearer, consider the case where the growth factor is 2 and the
initial array size is 1. Then consider the copy costs to grow the array
from size 1 until it is large enough to hold 2^k + 1 elements, for any
k >= 0. This size is 2^(k+1). The total copy cost includes all the
copying done to become that big in factor-of-2 steps:
1 + 2 + 4 + ... + 2^k = 2^(k+1) - 1 = 2n - 1
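A quick sanity check of that geometric-series argument (plain arithmetic, nothing specific to Python's allocator):
# Total elements copied while doubling capacity 1 -> 2 -> 4 -> ... -> 2**(k+1).
for k in range(6):
    final_capacity = 2 ** (k + 1)
    total_copied = sum(2 ** i for i in range(k + 1))  # 1 + 2 + 4 + ... + 2**k
    assert total_copied == final_capacity - 1
    print(f"final capacity {final_capacity:3d}: elements copied {total_copied}")
The total copying work stays proportional to the final size, which is exactly what makes the per-append cost amortized O(1).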
I have a number of objects (roughly 530,000). These objects are randomly assigned to a set of lists (not actually random but let's assume it is). These lists are indexed consecutively and assigned to a dictionary, called groups, according to their index. I know the total number of objects but I do not know the length of each list ahead of time (which in this particular case happens to vary between 1 and 36000).
Next I have to process each object contained in the lists. To speed this up I am using MPI to send them to different processes. The naive way to do this is to simply assign each process len(groups)/size lists (where size is the number of processes used), assign any possible remainder, have it process the contained objects, return the results, and wait. This obviously means, however, that if one process gets, say, a lot of very short lists and another gets all the very long lists, the first process will sit idle most of the time and the performance gain will not be very large.
What would be the most efficient way to assign the lists? One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible. But I am not sure how to best implement this. Does anybody have any suggestions?
One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible.
Assuming that processing time scales exactly with the sum of list lengths, and your processors are homogeneous, this is in fact what you want. This is called the multiprocessor scheduling problem. It is very close to the bin packing problem, except that the number of bins is fixed and you minimize the maximum load instead of the number of bins.
Generally this is an NP-hard problem, so you will not get a perfect solution. The simplest reasonable approach is to greedily assign the largest remaining chunk of work to the processor that currently has the least work assigned to it.
It is trivial to implement this in Python (the example uses a list of lists and assumes nprocs processors):
import numpy as np

greedy = [[] for _ in range(nprocs)]
for group in sorted(groups, key=len, reverse=True):
    # pick the processor with the smallest total assigned length so far
    smallest_index = np.argmin([sum(map(len, assignment)) for assignment in greedy])
    greedy[smallest_index].append(group)
If you have a large number of processors, you may want to optimize the smallest_index computation by using a priority queue (a sketch follows below the link). This will produce significantly better results than the naive sorted split recommended by Attersson:
(https://gist.github.com/Zulan/cef67fa436acd8edc5e5636482a239f8)
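Here is what that priority-queue variant could look like; a sketch using heapq, with the function name purely illustrative:
import heapq

def greedy_assign(groups, nprocs):
    # Min-heap of (total assigned length, processor index).
    heap = [(0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(nprocs)]
    # Largest groups first, each to the currently least-loaded processor.
    for group in sorted(groups, key=len, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(group)
        heapq.heappush(heap, (load + len(group), p))
    return assignment
This replaces the linear scan over all processors with an O(log nprocs) heap update per group.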
On the assumption that a longer list has a larger memory size, your_list has a memory size retrievable by the following code:
import sys
sys.getsizeof(your_list)
(Note: it depends on Python implementation. Please read How many bytes per element are there in a Python list (tuple)?)
There are several ways you can proceed from there. If your original "pipeline" of lists can be sorted by key=sys.getsizeof, you can then slice and assign to process N every Nth element (Pythonic way to return list of every nth item in a larger list).
Example:
sorted_pipeline = [list1,list2,list3,.......]
sorted_pipeline[0::10] # every 10th item, assign to the first sub-process of 10
This will balance loads in a fair manner, while keeping the complexity at O(N log N) for the initial sort plus constant (or linear, if the lists are copied) work to assign the lists.
Illustration (as requested) of splitting 10 elements into 3 groups:
>>> my_list = [0,1,2,3,4,5,6,7,8,9]
>>> my_list[0::3]
[0, 3, 6, 9]
>>> my_list[1::3]
[1, 4, 7]
>>> my_list[2::3]
[2, 5, 8]
And the final solution:
assigned_groups = {}
for i in xrange(size):
    assigned_groups[i] = sorted_pipeline[i::size]
If this is not possible, you can always keep a counter of total queue size, per sub-process pipeline, and tweak probability or selection logic to take that into account.
I am trying to compute the dot product of two numpy arrays sized respectively (162225, 10000) and (10000, 100). However, if I call numpy.dot(A, B) a MemoryError happens.
I then tried to write my own implementation:
def slower_dot(A, B):
    """Low-memory implementation of dot product"""
    # Assuming A and B are of the right type and size
    R = np.empty([A.shape[0], B.shape[1]])
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            R[i, j] = np.dot(A[i, :], B[:, j])
    return R
and it works just fine, but is of course very slow. Any idea of 1) what is the reason behind this behaviour and 2) how I could circumvent / solve the problem?
I am using Python 3.4.2 (64bit) and Numpy 1.9.1 on a 64bit equipped computer with 16GB of ram running Ubuntu 14.10.
The reason you're getting a memory error is probably because numpy is trying to copy one or both arrays inside the call to dot. For small to medium arrays this is often the most efficient option, but for large arrays you'll need to micro-manage numpy in order to avoid the memory error. Your slower_dot function is slow largely because of the python function call overhead, which you suffer 162225 x 100 times. Here is one common way of dealing with this kind of situation when you want to balance memory and performance limitations.
import numpy as np

def chunking_dot(big_matrix, small_matrix, chunk_size=100):
    # Make a copy if the array is not already contiguous
    small_matrix = np.ascontiguousarray(small_matrix)
    R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
    for i in range(0, R.shape[0], chunk_size):
        end = i + chunk_size
        R[i:end] = np.dot(big_matrix[i:end], small_matrix)
    return R
You'll want to pick the chunk_size that works best for your specific array sizes. Typically larger chunk sizes will be faster as long as everything fits in memory.
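For instance, a quick sanity check on small, made-up shapes (just to show the call pattern; the real arrays would be (162225, 10000) and (10000, 100)):
A = np.random.rand(1000, 50)
B = np.random.rand(50, 20)
assert np.allclose(chunking_dot(A, B, chunk_size=128), np.dot(A, B))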
I think the problem starts with the matrix A itself: a 162225 x 10000 matrix already occupies about 12 GiB of memory if each element is a double-precision float (162225 x 10000 x 8 bytes). That, together with the temporary copies numpy creates for the dot operation, causes the error. The extra copies happen because numpy uses the underlying BLAS routines for dot, which need the matrices to be stored in contiguous C order.
Check out these links if you want more discussions about improving dot performance
http://wiki.scipy.org/PerformanceTips
Speeding up numpy.dot
https://github.com/numpy/numpy/pull/2730
This must be easy, but I'm very new to pytables. My application has dataset sizes so large they cannot be held in memory, thus I use PyTable CArrays. However, I need to find the maximum element in an array that is not infinity. Naively in numpy I'd do this:
max_element = numpy.max(array[array != numpy.inf])
Obviously that won't work in PyTables without reading the whole array into memory. I could loop through the CArray in windows that fit in memory, but it would surprise me if there weren't a max/min reduction operation. Is there an elegant mechanism for getting the conditional maximum element of the array?
If your CArray is one dimensional, it is probably easier to stick it in a single-column Table. Then you have access to the where() method and can easily evaluate expressions like the following.
from itertools import imap
max(imap(lambda r: r['col'], tab.where('col != np.inf')))
This works because where() never reads in all the data at once; it returns an iterator, which is handed off to imap(), which is handed off to max(). Note that in Python 3 you don't need to import imap(); it is just the builtin map().
Not using a table means that you need to use the Expr class and do more of the wiring yourself.
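If you do stay with a bare CArray, the windowed scan you describe is straightforward; here is a minimal sketch (carr and the chunk size are placeholders):
import numpy as np

def conditional_max(carr, chunk=2**20):
    # Scan the CArray in windows that fit in memory, ignoring inf values.
    best = -np.inf
    for start in range(0, carr.shape[0], chunk):
        block = carr[start:start + chunk]  # only this window is read from disk
        block = block[block != np.inf]
        if block.size:
            best = max(best, block.max())
    return best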
I have a project where I am reading ASCII values from a microcontroller through a serial port (it looks like this: AA FF BA 11 43 CF and so on).
The input is coming in quickly (38 two character sets / second).
I'm taking this input and appending it to a running list of all measurements.
After about 5 hours, my list has grown to ~ 855000 entries.
I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.
Is there a more efficient, faster way to append to a list than list.append()?
Thanks Everyone.
I'm given to understand that the larger a list becomes, the slower list operations become.
That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.
As you will see when you inspect the source code, lists over-allocate: when they are resized, they allocate more than is needed for one item, so the next n items can be appended without another resize (which is O(n)). The growth isn't constant, it is proportional to the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that makes the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't, because you generate too much garbage that slips refcounting, you're out of luck though.
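If you do go the disable-the-collector route, it is only a couple of lines (reference counting keeps working; only cyclic garbage collection is paused):
import gc

gc.disable()  # pause cyclic garbage collection during the append-heavy phase
try:
    collect_data()  # placeholder for your append loop
finally:
    gc.enable()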
One thing you might want to consider is writing your data to a file as it's collected. I don't know (or really care) if it will affect performance, but it will help ensure that you don't lose all your data if power blips. Once you've got all the data, you can suck it out of the file and jam it in a list or an array or a numpy matrix or whatever for processing.
Appending to a Python list has amortized constant cost. It is not affected by the number of items in the list (in theory). In practice appending to a list will get slower once you run out of memory and the system starts swapping.
http://wiki.python.org/moin/TimeComplexity
It would be helpful to understand why you actually append things into a list. What are you planning to do with the items. If you don't need all of them you could build a ring buffer, if you don't need to do computation you could write the list to a file, etc.
First of all, 38 two-character sets per second is 76 characters per second; with 8 data bits, 1 start bit, 1 stop bit, and no parity that is only 760 baud, not fast at all.
But anyway, my suggestion, if you're worried about having overly large lists / don't want to use one huge list, is just to store a list on disk once it reaches a certain size and start a new list, repeating until you've received all the data, then combining all the lists into one when you're done.
Though you may skip the sublists completely and just go with nmichaels' suggestion: write the data to a file as you get it, and use a small circular buffer to hold the received data that has not yet been written.
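A minimal sketch of that write-as-you-go idea (the file name, the read function, and the loop condition are all placeholders):
from collections import deque

pending = deque()  # small buffer of samples not yet written to disk
with open("capture.log", "a") as log:
    while still_receiving():  # placeholder for your acquisition loop
        pending.append(read_two_chars_from_serial())  # placeholder serial read
        if len(pending) >= 64:  # flush in small batches
            log.write(" ".join(pending) + "\n")
            pending.clear()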
It might be faster to use numpy if you know how long the array is going to be and you can convert your hex codes to ints:
import numpy
a = numpy.zeros(3000000, numpy.int32)
for i in range(3000000):
    a[i] = int(scanHexFromSerial(), 16)
This will leave you with an array of integers (which you could convert back to hex with hex()), but depending on your application maybe that will work just as well for you.