I have a numpy array with shape (n, m):
import numpy as np
foo = np.zeros((5,5))
I make some calculations, getting results in a (n, 2) shape:
bar = np.zeros((8,2))
I want to store the calculation results within the array, since I might have to extend them after another calculation. I can do it like this:
foo = np.zeros((5,5), object)
# one calculation result for index (1, 1)
bar1 = np.zeros((8,2))
foo[1, 1] = bar1
# another calculation result for index (1, 1)
bar2 = np.zeros((5,2))
foo[1, 1] = np.concatenate((foo[1, 1], bar2))
However, this seems quite odd to me, since I have to do a lot of checking whether the array already has a value at a given place or not. Additionally, I don't know if using object as the datatype is a good idea, since I only want to store NumPy-specific data, not arbitrary Python objects.
Is there a more NumPy-specific way to approach this?
defaultdict streamlines the task of adding values to dict elements incrementally:
In [644]: from collections import defaultdict
Start with a dict whose default value is the empty list, [].
In [645]: dd = defaultdict(list)
In [646]: dd[(1,1)].append(np.zeros((1,2),int))
In [647]: dd[(1,1)].append(np.ones((3,2),int))
In [648]: dd
Out[648]:
defaultdict(list,
            {(1, 1): [array([[0, 0]]), array([[1, 1],
                                              [1, 1],
                                              [1, 1]])]})
Once we've collected all values, we can convert the nested lists into an array:
In [649]: dd[(1,1)] = np.concatenate(dd[(1,1)])
In [650]: dd
Out[650]:
defaultdict(list,
            {(1, 1): array([[0, 0],
                            [1, 1],
                            [1, 1],
                            [1, 1]])})
In [652]: dict(dd)
Out[652]:
{(1, 1): array([[0, 0],
                [1, 1],
                [1, 1],
                [1, 1]])}
In doing the conversion we have to take care with keys whose value is still [], since np.concatenate raises a ValueError on an empty list.
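To convert every key at the end, a minimal sketch (continuing with dd from above) that skips entries that are still []:
result = {k: np.concatenate(v) for k, v in dd.items() if v}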
Related
Assume I have an array like [2,3,4]. I am looking for a way in NumPy (or TensorFlow) to convert it to [0,0,1,1,1,2,2,2,2], so that I can apply tf.math.segment_sum() on a tensor of size 2+3+4.
No elegant idea comes to mind, only loops and list comprehensions.
Would something like this work for you?
import numpy
arr = numpy.array([2, 3, 4])
numpy.repeat(numpy.arange(arr.size), arr)
# array([0, 0, 1, 1, 1, 2, 2, 2, 2])
You don't need to use numpy. You can use nothing but list comprehensions:
>>> foo = [2,3,4]
>>> sum([[i]*foo[i] for i in range(len(foo))], [])
[0, 0, 1, 1, 1, 2, 2, 2, 2]
It works like this:
You can create repeated lists by multiplying a single-element list by a constant, so [0] * 2 == [0, 0]. So for each index i in the list, we expand it with [i] * foo[i]. In other words:
>>> [[i]*foo[i] for i in range(len(foo))]
[[0, 0], [1, 1, 1], [2, 2, 2, 2]]
Then we use sum to reduce the lists into a single list:
>>> sum([[i]*foo[i] for i in range(len(foo))], [])
[0, 0, 1, 1, 1, 2, 2, 2, 2]
Because we are "summing" lists, not integers, we pass [] to sum to make an empty list the starting value of the sum.
(Note that this will likely be slower than numpy, though I have not personally compared it to something like @Patol75's answer.)
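If the repeated concatenation done by sum ever becomes a bottleneck, itertools.chain flattens without it; a small sketch:
from itertools import chain
foo = [2, 3, 4]
list(chain.from_iterable([i] * n for i, n in enumerate(foo)))
# [0, 0, 1, 1, 1, 2, 2, 2, 2]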
I really like the answer from @Patol75 since it's neat. However, there is no pure tensorflow solution yet, so I provide one, which is maybe kinda complex. Just for reference and fun!
BTW, I didn't see the tf.repeat API in tf master. Please check this PR, which adds tf.repeat support equivalent to numpy.repeat.
import tensorflow as tf
repeats = tf.constant([2,3,4])
values = tf.range(tf.size(repeats)) # [0,1,2]
max_repeats = tf.reduce_max(repeats) # max repeat is 4
tiled = tf.tile(tf.reshape(values, [-1,1]), [1,max_repeats]) # [[0,0,0,0],[1,1,1,1],[2,2,2,2]]
mask = tf.sequence_mask(repeats, max_repeats) # [[1,1,0,0],[1,1,1,0],[1,1,1,1]]
res = tf.boolean_mask(tiled, mask) # [0,0,1,1,1,2,2,2,2]
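Once tf.repeat is available (it shipped in later TensorFlow releases; treat the exact version requirement as an assumption), the whole dance above collapses to a single call:
import tensorflow as tf
repeats = tf.constant([2, 3, 4])
values = tf.range(tf.size(repeats))  # [0, 1, 2]
res = tf.repeat(values, repeats)     # [0, 0, 1, 1, 1, 2, 2, 2, 2]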
Patol75's answer uses Numpy but Gort the Robot's answer is actually faster (on your example list at least).
I'll keep this answer up as another solution, but it's slower than both.
Given that a = [2,3,4] this could be done using a loop like so:
b = []
for i in range(len(a)):
    for j in range(a[i]):
        b.append(range(len(a))[i])
Which, as a list comprehension one-liner, is this diabolical thing:
b = [range(len(a))[i] for i in range(len(a)) for j in range(a[i])]
Both end up with b = [0,0,1,1,1,2,2,2,2].
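If you want to check these speed claims on your own machine, here is a quick harness (numbers will vary with machine and input size, so treat it as a sketch):
import timeit
import numpy as np

a = [2, 3, 4]
arr = np.array(a)

print(timeit.timeit(lambda: np.repeat(np.arange(arr.size), arr), number=100000))
print(timeit.timeit(lambda: sum([[i] * a[i] for i in range(len(a))], []), number=100000))
print(timeit.timeit(lambda: [range(len(a))[i] for i in range(len(a)) for j in range(a[i])], number=100000))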
Is there any way to get the indices of several elements in a NumPy array at once?
E.g.
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
I would like to find the index of each element of a in b, namely: [0,1,4].
I find the solution I am using a bit verbose:
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
c = np.zeros_like(a)
for i, aa in np.ndenumerate(a):
    c[i] = np.where(b == aa)[0]
print('c: {0}'.format(c))
Output:
c: [0 1 4]
You could use in1d and nonzero (or where for that matter):
>>> np.in1d(b, a).nonzero()[0]
array([0, 1, 4])
This works fine for your example arrays, but in general the array of returned indices does not honour the order of the values in a. This may be a problem depending on what you want to do next.
In that case, a much better answer is the one @Jaime gives here, using searchsorted:
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([0, 1, 4])
This returns the indices for values as they appear in a. For instance:
a = np.array([1, 2, 4])
b = np.array([4, 2, 3, 1])
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([3, 1, 0]) # the other method would return [0, 1, 3]
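To see why this works, here are the intermediate values for that second example:
sorter = np.argsort(b)                      # [3, 1, 2, 0]; b[sorter] is the sorted [1, 2, 3, 4]
pos = np.searchsorted(b, a, sorter=sorter)  # [0, 1, 3]; where each value of a sits in the sorted b
sorter[pos]                                 # [3, 1, 0]; mapped back to positions in the original b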
This is a simple one-liner using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
idx = npi.indices(b, a)
The implementation is fully vectorized, and it gives you control over the handling of missing values. Moreover, it works for nd-arrays as well (for instance, finding the indices of rows of a in b).
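A minimal sketch with the arrays from the question (as far as I recall the docs, missing values raise by default, and that behavior is configurable):
import numpy as np
import numpy_indexed as npi

a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
npi.indices(b, a)  # array([0, 1, 4]): the index in b of each element of a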
All of the solutions here recommend using a linear search. You can use np.argsort and np.searchsorted to speed things up dramatically for large arrays:
sorter = b.argsort()  # sort b once: O(n log n)
i = sorter[np.searchsorted(b, a, sorter=sorter)]  # then binary-search each value of a: O(m log n)
For an order-agnostic solution, you can use np.flatnonzero with np.isin (v 1.13+).
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
res = np.flatnonzero(np.isin(a, b)) # NumPy v1.13+
res = np.flatnonzero(np.in1d(a, b)) # earlier versions
# array([0, 1, 2], dtype=int64)
There are a bunch of approaches for getting the index of multiple items at once mentioned in passing in answers to this related question: Is there a NumPy function to return the first index of something in an array?. The wide variety and creativity of the answers suggests there is no single best practice, so if your code above works and is easy to understand, I'd say keep it.
I personally found this approach to be both performant and easy to read: https://stackoverflow.com/a/23994923/3823857
Adapting it for your example:
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
indices = [b_list.index(x) for x in a]
vals_at_indices = b_array[indices]
I personally like adding a little bit of error handling in case a value in a does not exist in b.
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
b_set = set(b_list)
indices = [b_list.index(x) if x in b_set else np.nan for x in a]
# note: the fancy indexing below only works when every value of a was found;
# an np.nan placeholder would raise an IndexError, so filter those out first if needed
vals_at_indices = b_array[indices]
For my use case, it's pretty fast, since it relies on parts of Python that are fast (list comprehensions, .index(), sets, numpy indexing). Would still love to see something that's a NumPy equivalent to VLOOKUP, or even a Pandas merge. But this seems to work for now.
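For reference, the pandas-style lookup I mean can be sketched with a Series as the lookup table (assuming the values in b are unique; missing values come back as NaN and the result dtype becomes float):
import numpy as np
import pandas as pd

a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])

lookup = pd.Series(np.arange(len(b)), index=b)  # value -> index in b, VLOOKUP-style
indices = lookup.reindex(a).to_numpy()          # array([0, 1, 4])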
I have a pickled file that contains a hash with key -> list entries, like this:
h = { 'two': [1,2], 'three': [3,4,5]}
I want to convert the lists into an array of arrays, making them all the same length (just filling up the shorter ones with zeros).
So for the example above, I would like to have a result like this:
>>> np.asarray([[1,2,0],[3,4,5]])
array([[1, 2, 0],
[3, 4, 5]])
(I don't care about the keys in the hash, and I also don't care about the order of the arrays).
The first step would be to find the longest array, which I think this code will do for me
m = max(map(len, h.values()))
But how do I create the arrays after that?
I thought numpy.copyto() would be a possibility, copying each original array into a new zero-filled array of the new length, but it demands arrays of the same shape.
Since you start with a dictionary, it's very unlikely this can benefit from numpy vectorization, so the solution will involve bare loops. You can pad zeros either the numpy way with np.pad:
np.array([np.pad(v, (0, m - len(v)), 'constant') for v in h.values()])
#array([[1, 2, 0],
# [3, 4, 5]])
Or the vanilla list way:
np.array([v + [0] * (m - len(v)) for v in h.values()])
#array([[1, 2, 0],
# [3, 4, 5]])
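Putting it together with the data from the question, a complete runnable version:
import numpy as np

h = {'two': [1, 2], 'three': [3, 4, 5]}
m = max(map(len, h.values()))  # length of the longest list

out = np.array([np.pad(v, (0, m - len(v)), 'constant') for v in h.values()])
# array([[1, 2, 0],
#        [3, 4, 5]])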
I have a quite involved nested list: each element is a tuple with two elements: one is an object, the other is a 3x2xn array. Here is a toy model:
from numpy import array
toy = [('mol1', array([[[1,1,1],[2,2,2]],[[1,1,1],[2,2,2]]])),
       ('mol2', array([[[1,1,1],[2,2,2]],[[1,1,1],[2,2,2]]]))]
How can I get a single column from that?
I am looking for
('mol1', 'mol2')
and for the 2D arrays, something like:
array([[1,1,1],[1,1,1],[1,1,1],[1,1,1]])
I have a solution but I think it is pretty inefficient:
zip(*toy)[0]
it returns
('mol1', 'mol2')
then
zip(*toy)[1][0][:,0]
which returns
array([[1, 1, 1],
[1, 1, 1]])
A for loop like this:
for i in range(len(toy)):
    zip(*toy)[1][i][:,0]
gives all the elements of the column, and I can build the result with vstack. (In Python 3, zip returns an iterator, so you would need list(zip(*toy)).)
This should be reasonably efficient:
>>> tuple(t[0] for t in toy)
('mol1', 'mol2')
For the 2D array, with the help of numpy's vstack function:
>>> from numpy import vstack
>>> vstack([t[1][:, 0] for t in toy])
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
You can store your data in a numpy array (or convert yours to one), then use the built-in column slicing. In general, numpy slicing is very fast.
import numpy as np
np.asarray(toy)[::, 0] # first column
# output
array(['mol1', 'mol2'],
dtype='|S4')
The type of matrix I am dealing with was created from a vector as shown below:
Start with a 1-d vector V of length L.
To create a matrix A from V with N rows, make the i'th column of A the first N entries of V, starting from the i'th entry of V, so long as there are enough entries left in V to fill up the column. This means A has L - N + 1 columns.
Here is an example:
V = [0, 1, 2, 3, 4, 5]
N = 3
A = [0 1 2 3
     1 2 3 4
     2 3 4 5]
Representing the matrix this way requires more memory than my machine has. Is there any reasonable way of storing this matrix sparsely? I am currently storing N * (L - N + 1) values, when I only need to store L values.
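For concreteness, here is the naive, memory-hungry way of building A, which is exactly what I want to avoid:
import numpy as np

V = np.array([0, 1, 2, 3, 4, 5])
N = 3
A = np.array([V[i:i + len(V) - N + 1] for i in range(N)])  # N rows, L - N + 1 columns
# array([[0, 1, 2, 3],
#        [1, 2, 3, 4],
#        [2, 3, 4, 5]])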
You can take a view of your original vector as follows:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> v = np.array([0, 1, 2, 3, 4, 5])
>>> n = 3
>>>
>>> a = as_strided(v, shape=(n, len(v)-n+1), strides=v.strides*2)
>>> a
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5]])
This is a view, not a copy of your original data, e.g.
>>> v[3] = 0
>>> v
array([0, 1, 2, 0, 4, 5])
>>> a
array([[0, 1, 2, 0],
[1, 2, 0, 4],
[2, 0, 4, 5]])
But you have to be careful not to do any operation on a that triggers a copy, since that would send your memory use through the roof.
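If your NumPy is recent enough (1.20+, which I'm assuming here), the same read-only view is available without raw stride arithmetic via sliding_window_view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

v = np.array([0, 1, 2, 3, 4, 5])
n = 3
a = sliding_window_view(v, len(v) - n + 1)  # the same (n, L - n + 1) view as above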
If you're already using numpy, use its strided or sparse arrays, as Jaime explained.
If you're not already using numpy, you may want to strongly consider using it.
If you need to stick with pure Python, there are three obvious ways to do this, depending on your use case.
For strided or sparse-but-clustered arrays, you could do effectively the same thing as numpy.
Or you could use a simple run-length-encoding scheme, plus maybe a higher-level list of runs, or a list of pointers to every Nth element, or even a whole stack of such lists (one for every 100 elements, one for every 10000, etc.).
But for mostly-uniformly-dense arrays, the easiest thing is to simply store a dict or defaultdict mapping indices to values. Random-access lookups and updates are still O(1), albeit with a higher constant factor, and the storage you waste keeping (in effect) a hash, key, and value instead of just a value for each non-default element is more than made up for by not storing the default elements at all, as long as the density stays below roughly one third.
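A minimal sketch of that last option, assuming a default value of 0:
from collections import defaultdict

sparse = defaultdict(int)  # unset indices implicitly hold the default value 0
sparse[2] = 7              # O(1) update
value = sparse[1000]       # O(1) lookup; returns 0 here
# beware: indexing a defaultdict inserts the default; use sparse.get(i, 0) for read-only lookups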