I have some code that uses the result of an unravel_index command to access elements of some matrices. I did some line-profiling and just accessing these elements and storing them in new arrays is taking somewhere around 25-30% of my run time. Is there a way to speed up this indexing at all?
Related
I would like to store a non-rectangular array in Python. The array has millions of elements and I will be applying a function to each element in the array, so I am concerned about performance. What data structure should I use? Should I use a Python list or a numpy array of type object? Is there another data structure that would work even better?
You can use a dictionary to store everything. If you have ample memory, dictionaries are a good option: lookups are hash-based, so access by key takes constant time on average.
I'd suggest using scipy sparse matrices.
UPD. Some elaboration goes below.
I assume that "non-rectangular" implies there will be empty elements in a plain 2D array. With millions of elements, these 'holes' will tax memory usage. Sparse matrices provide a familiar array interface while occupying only the necessary amount of memory.
Though if array-like indexing is not required, a dictionary is a perfectly fine storage option.
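As a rough illustration of the dictionary approach, the sketch below stores only the cells that actually exist, keyed by (row, column). The names `ragged`, `cells`, and `apply_to_all` are made up for this example and do not come from the question.

```python
# Hypothetical ragged data: rows of different lengths.
ragged = [[1, 2, 3], [4], [5, 6]]

# Store only the cells that exist, keyed by (row, column).
cells = {(r, c): value
         for r, row in enumerate(ragged)
         for c, value in enumerate(row)}

def apply_to_all(func, store):
    """Return a new dict with func applied to every stored value."""
    return {key: func(value) for key, value in store.items()}

squared = apply_to_all(lambda x: x * x, cells)
print(squared[(2, 1)])   # value at row 2, column 1
print((1, 2) in cells)   # a 'hole' is simply a missing key
```

Whether this beats a sparse matrix depends on how the data is accessed; the dictionary only helps when array-style slicing is not needed.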
Let's say I create 2 numpy arrays, one of which is an empty array and one which is of size 1000x1000 made up of zeros:
import numpy as np
A1 = np.array([])
A2 = np.zeros([1000,1000])
When I want to change a value in A2, this seems to work fine:
A2[n,m] = 17
The above code would change the value of position [n][m] in A2 to 17.
When I try the above with A1 I get this error:
A1[n,m] = 17
IndexError: index n is out of bounds for axis 0 with size 0
I know why this happens, because there is no defined position [n,m] in A1 and that makes sense, but my question is as follows:
Is there a way to define a dynamic array that grows with new rows and columns when A[n,m] = somevalue is entered and n or m (or both) are beyond the bounds of the array A?
It doesn't have to be in numpy; any library or method that can update the array size would be awesome. If it is a method, I can imagine there being an if statement that checks whether [n][m] is out of bounds and does something about it.
I am coming from a MATLAB background where it's easy to do this. I tried to find something about this in the documentation in numpy.array but I've been unsuccessful.
EDIT:
I want to know whether there is any way to create a dynamic list in Python at all, not just in the numpy library. It appears from the question "Creating a dynamic array using numpy in python" that it doesn't work with numpy.
This can't be done in numpy, and it technically can't be done in MATLAB either. What MATLAB is doing behind the scenes is creating an entirely new matrix, then copying all the data to the new matrix, then deleting the old matrix. It is not dynamically resizing; that isn't actually possible because of how arrays/matrices work. This is extremely slow, especially for large arrays, which is why MATLAB nowadays warns you not to do it.
Numpy, like MATLAB, cannot resize arrays (actually, unlike MATLAB it technically can, but only if you are lucky so I would advise against trying). But in order to avoid the sort of confusion and slow code this causes in MATLAB, numpy requires that you explicitly make the new array (using np.zeros) then copy the data over.
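The explicit "make a new array, then copy" step described above might look like the following sketch. The helper name `set_with_growth` is hypothetical; it is not a numpy function, just one way to wrap the allocation and copy.

```python
import numpy as np

def set_with_growth(arr, n, m, value):
    """Return an array holding arr's data with element [n, m] set to value.

    If (n, m) is out of bounds, a larger zero-filled array is allocated
    and the old data is copied into it first.
    """
    rows = max(arr.shape[0], n + 1)
    cols = max(arr.shape[1], m + 1)
    if (rows, cols) != arr.shape:
        grown = np.zeros((rows, cols), dtype=arr.dtype)
        grown[:arr.shape[0], :arr.shape[1]] = arr  # copy old data over
        arr = grown
    arr[n, m] = value
    return arr

A = np.zeros((2, 2))
A = set_with_growth(A, 3, 4, 17)
print(A.shape)
```

Every out-of-bounds assignment pays for a full copy, which is the cost the answer warns about, so this is only reasonable when growth is rare.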
Python, unlike MATLAB, actually does have a truly resizable data structure: the list. Assigning to an index still requires that element to exist already, which avoids the silent indexing errors that are hard to catch in MATLAB, but you can grow a list with very good performance. You can make an effectively n-dimensional list by using nested lists of lists. Then, once the list is done, you can convert it to a numpy array.
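A minimal sketch of that workflow, assuming every row ends up with the same length (the variable names are illustrative only):

```python
import numpy as np

rows = []                            # grows cheaply, one row at a time
for i in range(4):
    rows.append([i, i * i, i + 10])  # every row has the same length

data = np.array(rows)                # one conversion at the end
print(data.shape)
```

If the rows have different lengths, the conversion will not produce a regular 2D array, so this only fits rectangular data.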
I wrote a program using normal Python, and I now think it would be a lot better to use numpy instead of standard lists. The problem is there are a number of things where I'm confused how to use numpy, or whether I can use it at all.
In general, how do np.arrays work? Are they dynamic in size like a C++ vector, or do I have to declare their length and type beforehand like a standard C++ array? In my program I've got a lot of cases where I create a list
ex_list = [] and then cycle through something and append to it with ex_list.append(some_lst). Can I do something like that with a numpy array? What if I knew the size of ex_list, could I declare an empty one and then add to it?
If I can't, let's say I only call this list, would it be worth it to convert it to numpy afterwards, i.e. is calling a numpy list faster?
Can I do more complicated operations for each element using a numpy array (not just adding 5 to each etc), example below.
full_pallete = [(int(1+i*(255/127.5)),0,0) for i in range(0,128)]
full_pallete += [col for col in right_palette if col[1]!=0 or col[2]!=0 or col==(0,0,0)]
In other words, does it make sense to convert to a numpy array and then cycle through it using something other than a for loop?
Numpy arrays can be appended to (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html), although in general calling the append function many times in a loop has a heavy performance cost - it is generally better to pre-allocate a large array and then fill it as necessary. This is because the arrays themselves do have fixed size under the hood, but this is hidden from you in python.
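To make the difference concrete, here is a small sketch that builds the same array both ways. The size `n` and the fill rule `i * 2` are arbitrary choices for the example.

```python
import numpy as np

n = 1000

# Repeated np.append: each call allocates a new array and copies the old one.
slow = np.array([], dtype=np.int64)
for i in range(n):
    slow = np.append(slow, i * 2)

# Pre-allocation: one allocation, then fill in place.
fast = np.empty(n, dtype=np.int64)
for i in range(n):
    fast[i] = i * 2

print(np.array_equal(slow, fast))
```

Both loops produce the same values; the first does a growing amount of copying on every iteration, which is where the performance cost comes from.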
Yes, Numpy is well designed for many operations similar to these. In general, however, you don't want to be looping through numpy arrays (or arrays in general in python) if they are very large. By using inbuilt numpy functions, you basically make use of all sorts of compiled speed up benefits. As an example, rather than looping through and checking each element for a condition, you would use numpy.where().
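For instance, replacing every non-positive element with zero could be written with numpy.where() instead of a Python loop. The array `values` is made-up sample data.

```python
import numpy as np

values = np.array([3, -1, 7, 0, -5, 2])

# Loop version: each element is checked in Python.
looped = [v if v > 0 else 0 for v in values]

# Vectorized version: the condition is evaluated in compiled code.
vectorized = np.where(values > 0, values, 0)

# With a single argument, np.where returns the indices where the condition holds.
positive_idx = np.where(values > 0)[0]
print(vectorized, positive_idx)
```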
The real reason to use numpy is to benefit from pre-compiled mathematical functions and data processing utilities on large arrays - both those in the core numpy library as well as many other packages that use them.
I will need to create an array of integer arrays like [[0,1,2],[4,4,5,7]...[4,5]]. The sizes of the internal arrays can change. The maximum number of internal arrays is 2^26. What do you recommend as the fastest way to update this array?
When I use list=[[]] * 2**26, initialization is very fast but update is very slow. Instead I use
list=[] , for i in range(2**26): list.append([]) .
Now initialization is slow and update is fast. For example, with 16777216 internal arrays and an average of 0.213827311993 elements in each array, for a 2^26-element array, it takes 1.67728900909 sec. That is good, but I will work with much larger data, hence I need the best way. Initialization time is not important.
Thank you.
What you ask is quite a problem. Different data structures have different properties. Python lists give constant-time access by index, but searching for a value or inserting and deleting anywhere except at the end takes linear time, which means the more you put in them, the longer those operations take on average.
You could perhaps use numpy? That library has arrays that can be accessed quite fast and can be reshaped on the fly. However, if you want to add or delete rows, it might be a bit slow because it generally reallocates (and thus copies) the entire data. So it is a trade-off.
If you are gonna have so many internal arrays of different sizes, perhaps you could have a dictionary that contains the internal arrays. I think if it is indexed by integers it will be much faster than a list. Then, the internal arrays could be created with numpy.
I'm still confused whether to use list or numpy array.
I started with the latter, but since I have to do a lot of append
I ended up with many vstacks slowing my code down.
Using list would solve this problem, but I also need to delete elements
which again works well with delete on numpy array.
As it looks now I'll have to write my own data type (in a compiled language, and wrap).
I'm just curious if there isn't a way to get the job done using a python type.
To summarize, these are the criteria my data type would have to fulfil:
2D: n (variable) rows, each row with k (fixed) elements
in memory in one piece (would be nice for efficient operation)
append a row (in constant time on average, like a C++ vector, just always k elements at a time)
delete a set of elements (best: in place, keeping the free space at the end for later appends)
access an element given the row and column index (O(1), like data[row*k + column])
It appears generally useful to me to have a data type like this and not impossible to implement in C/Fortran.
What would be the closest I could get with python?
(Or maybe: do you think it would work to write a Python class for the data type? What performance should I expect in this case?)
As I see it, if you were doing this in C or Fortran, you'd have to have an idea of the size of the array so that you can allocate the correct amount of memory (ignoring realloc!). So assuming you do know this, why do you need to append to the array?
In any case, numpy arrays have the resize method, which you can use to extend the size of the array.
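As a rough sketch of what a pure-Python class for the requirements listed in the question could look like, the example below keeps one contiguous numpy buffer and doubles its capacity when it fills up. The class name `GrowableRows` and its methods are hypothetical, and the doubling strategy is an assumption borrowed from how C++ vectors are commonly implemented, not something the answer above prescribes.

```python
import numpy as np

class GrowableRows:
    """Hypothetical container: n (variable) rows of k (fixed) elements."""

    def __init__(self, k, capacity=4, dtype=float):
        self._buf = np.empty((capacity, k), dtype=dtype)  # one contiguous block
        self._n = 0                                       # rows in use

    def append(self, row):
        if self._n == self._buf.shape[0]:
            # Double the capacity so appends are amortized O(1).
            grown = np.empty((2 * self._buf.shape[0], self._buf.shape[1]),
                             dtype=self._buf.dtype)
            grown[:self._n] = self._buf[:self._n]
            self._buf = grown
        self._buf[self._n] = row
        self._n += 1

    def delete_rows(self, indices):
        """Remove the given rows, keeping the freed space at the end."""
        keep = np.ones(self._n, dtype=bool)
        keep[list(indices)] = False
        kept = self._buf[:self._n][keep]   # copy of the surviving rows
        self._buf[:len(kept)] = kept
        self._n = len(kept)

    @property
    def data(self):
        return self._buf[:self._n]         # view of the rows in use

rows = GrowableRows(k=3)
for i in range(10):
    rows.append([i, i, i])
rows.delete_rows([0, 2])
print(rows.data.shape, rows.data[1, 2])
```

Element access through `rows.data[row, column]` is O(1). Deletion here makes a temporary copy of the surviving rows rather than compacting strictly in place, and each `append` still carries Python call overhead, so a compiled implementation would likely be faster for very frequent appends.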