How do I make arrays the same length in Python?

I have a pickled file that contains a dict mapping each key to a list, like this:
h = {'two': [1, 2], 'three': [3, 4, 5]}
I want to convert the lists into an array of arrays, padding the shorter lists with zeros so they all have the same length.
So for the example above, I would like a result like this:
>>> np.asarray([[1, 2, 0], [3, 4, 5]])
array([[1, 2, 0],
       [3, 4, 5]])
(I don't care about the keys in the dict, and I also don't care about the order of the arrays.)
The first step would be to find the longest array, which I think this code will do for me
m = max(map(len, h.values()))
But how do I create the arrays after that?
I thought numpy.copyto() might work, copying each original array into a new zero-filled array of the right length, but it demands arrays of the same shape.

Since you start with a dictionary, it's very unlikely this can benefit from numpy vectorization, so the solution will involve plain loops. You can pad the zeros either the numpy way with np.pad:
np.array([np.pad(v, (0, m - len(v)), 'constant') for v in h.values()])
# array([[1, 2, 0],
#        [3, 4, 5]])
Or the vanilla list way:
np.array([v + [0] * (m - len(v)) for v in h.values()])
# array([[1, 2, 0],
#        [3, 4, 5]])
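For completeness, here's a minimal end-to-end sketch; the file name data.pkl is just a placeholder for wherever your pickled dict lives:
import pickle
import numpy as np

# Load the pickled dict (hypothetical file name).
with open('data.pkl', 'rb') as f:
    h = pickle.load(f)

m = max(map(len, h.values()))  # length of the longest list
result = np.array([v + [0] * (m - len(v)) for v in h.values()])
print(result)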

Related

How to optimize array storage within a numpy array?

I have a numpy array with shape (n, m):
import numpy as np
foo = np.zeros((5,5))
I make some calculations, getting results in a (n, 2) shape:
bar = np.zeros((8,2))
I want to store the calculation results within the array, since I might have to extend them after another calculation. I can do it like this:
foo = np.zeros((5,5), object)
# one calculation result for index (1, 1)
bar1 = np.zeros((8,2))
foo[1, 1] = bar1
# another calculation result for index (1, 1)
bar2 = np.zeros((5,2))
foo[1, 1] = np.concatenate((foo[1, 1], bar2))
However, this seems quite odd to me, since I have to do a lot of checking whether the array already has a value at a given place. Additionally, I don't know if using object as the datatype is a good idea, since I only want to store numpy-specific data and not arbitrary Python objects.
Is there a more numpy-specific way to approach this?
defaultdict streamlines the task of adding values to dict elements incrementally:
In [644]: from collections import defaultdict
Start with a dict whose default value is an empty list, [].
In [645]: dd = defaultdict(list)
In [646]: dd[(1,1)].append(np.zeros((1,2),int))
In [647]: dd[(1,1)].append(np.ones((3,2),int))
In [648]: dd
Out[648]:
defaultdict(list,
            {(1, 1): [array([[0, 0]]),
                      array([[1, 1],
                             [1, 1],
                             [1, 1]])]})
Once we've collected all values, we can convert the nested lists into an array:
In [649]: dd[(1,1)] = np.concatenate(dd[(1,1)])
In [650]: dd
Out[650]:
defaultdict(list,
            {(1, 1): array([[0, 0],
                            [1, 1],
                            [1, 1],
                            [1, 1]])})
In [652]: dict(dd)
Out[652]:
{(1, 1): array([[0, 0],
                [1, 1],
                [1, 1],
                [1, 1]])}
In doing the conversion we have to take care with keys whose value is still [], since we can't concatenate an empty list.
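A minimal sketch of that final conversion, guarding against empty lists (the result name is arbitrary):
import numpy as np
from collections import defaultdict

dd = defaultdict(list)
dd[(1, 1)].append(np.zeros((1, 2), int))
dd[(1, 1)].append(np.ones((3, 2), int))
dd[(2, 2)]  # merely touching a key leaves an empty list behind

# np.concatenate([]) raises ValueError, so skip keys that stayed empty.
result = {k: np.concatenate(v) for k, v in dd.items() if v}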

Efficiently slicing an array using indexes from another array

(I apologize in advance if this is a duplicate; I looked at many similar questions on SO but didn't find a matching solution.)
Suppose you have an array
A = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
and another array
I = np.array([1, 1, 2])
For each row in A, I want to get the i-th element of it, where i is that row's entry in I.
In this case, the output I'd like to have is array([1, 4, 8]).
My most intuitive attempt to do so is:
A[:, I]
Then I figured that the desired output is actually the diagonal of that, so A[:, I].diagonal() would do the trick.
But it feels like there's some waste of space and time in doing it this way, because it requires an intermediate "big" matrix whose diagonal is then extracted.
Is there a more efficient way to perform this slicing?
This would do the trick:
res = A[np.arange(A.shape[0]), I]
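As a quick check with the arrays from the question (integer fancy indexing pairs row k with column I[k], so no intermediate (3, 3) matrix is built):
import numpy as np

A = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
I = np.array([1, 1, 2])

# Row indices [0, 1, 2] are paired elementwise with column indices I.
res = A[np.arange(A.shape[0]), I]
print(res)  # [1 4 8]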

Numpy array indexing syntax

I am new to numpy and confused by the syntax used in indexing arrays. For example:
arr[2, 3]
This means the element at the intersection of the 3rd row and 4th column. What confuses me is the separation of different indices by a comma inside the square brackets (like function arguments). Doing so with Python lists is not valid:
l = [[1, 2], [3, 4]]
l[1, 1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not tuple
So, if this is not valid Python syntax, how do numpy arrays make it work?
With plain lists, you index one level at a time; the colon ':' appears inside the brackets only for slicing.
In your example,
l = [[1, 2], [3, 4]]
-> l[0] is [1, 2] and -> l[1] is [3, 4]
so the element at row 1, column 1 is l[1][1], not l[1, 1].
Read the documentation further for better understanding.
In your given example, you're comparing a numpy array to a list of lists. The main difference between the two is that a numpy array is predictable in terms of shape, data type of its elements, and so on, while a list can contain an arbitrary combination of any other python objects (lists, tuples, strings, etc.)
Take this as an example, say you create a numpy array like so:
arr = np.array([[0, 1], [2, 3], [4, 5]])
Here, the shape of arr is known right after instantiation (arr.shape returns (3, 2)), so you can index the array with a single pair of square brackets and comma-separated indices. On the other hand, take the list example:
l = [[0, 1], [2, 3], [4, 5]]
l[0] # This returns the list [0, 1]
l[0].append("HELLO")
l[0] # This returns the list [0, 1, "HELLO"]
A list is very unpredictable, as there's no way to know in advance what each list element will return to you. So the way we index a specific element in a list of lists is with two sets of square brackets, e.g. l[0][0].
What if we created a non-uniform numpy array? Well, you get a similar behaviour to a list of lists:
arr = np.array([[0, 1], [2, 3], [4]])  # Here, you get a warning!
arr  # array([list([0, 1]), list([2, 3]), list([4])], dtype=object)
In this case, you can't index the numpy array using [0, 0]. Instead, you have to use two sets of square brackets, just like with a list of lists.
You can also check the documentation of ndarray for more info.
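The underlying mechanics are plain Python: comma-separated indices inside brackets are packed into a tuple and passed to the object's __getitem__ method; ndarray accepts tuples, while list does not. A tiny probe class (hypothetical, purely for illustration) makes this visible:
class Probe:
    def __getitem__(self, key):
        # Whatever appears between the brackets arrives as one object.
        return key

p = Probe()
print(p[2, 3])     # (2, 3) -- the same tuple an ndarray receives
print(p[1:, ::2])  # (slice(1, None, None), slice(None, None, 2))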

numpy array TypeError: only integer scalar arrays can be converted to a scalar index

i = np.arange(1, 4, dtype=int)
a = np.arange(9).reshape(3, 3)
and
a
>>> array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
a[:, 0:1]
>>> array([[0],
           [3],
           [6]])
a[:, 0:2]
>>> array([[0, 1],
           [3, 4],
           [6, 7]])
a[:, 0:3]
>>> array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
Now I want to vectorize this slicing, getting all the slices together. I try
a[:,0:i]
or
a[:,0:i[:,None]]
It gives TypeError: only integer scalar arrays can be converted to a scalar index
Short answer:
[a[:,:j] for j in i]
What you are trying to do is not a vectorizable operation. Wikipedia defines vectorization as a batch operation on a single array, instead of on individual scalars:
In computer science, array programming languages (also known as vector or multidimensional languages) generalize operations on scalars to apply transparently to vectors, matrices, and higher-dimensional arrays.
...
... an operation that operates on entire arrays can be called a vectorized operation...
In terms of CPU-level optimization, the definition of vectorization is:
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
The problem in your case is that the result of each individual operation has a different shape: (3, 1), (3, 2) and (3, 3). They cannot form the output of a single vectorized operation, because the output has to be one contiguous array. Of course, it can contain (3, 1), (3, 2) and (3, 3) arrays inside of it (as views), but that's what your original array a already does.
What you're really looking for is just a single expression that computes all of them:
[a[:,:j] for j in i]
... but it's not vectorized in the performance-optimization sense. Under the hood it's a plain old for loop that computes each item one by one.
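A quick demonstration of that expression with the arrays from the question (each entry is a view into a, not a copy):
import numpy as np

i = np.arange(1, 4)
a = np.arange(9).reshape(3, 3)

# A plain Python loop over the widths; numpy cannot batch this
# because the results have different shapes.
slices = [a[:, :j] for j in i]
for s in slices:
    print(s.shape)  # (3, 1), then (3, 2), then (3, 3)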
I ran into this problem when venturing to use numpy.concatenate to emulate a C++-style push_back for 2-D vectors: if A and B are two 2-D numpy arrays, then numpy.concatenate(A, B) yields this error.
The fix was simply to add the missing parentheses, numpy.concatenate((A, B)), which are required because the arrays to be concatenated must be passed as a single sequence argument.
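A short sketch of the failing and the working call:
import numpy as np

A = np.zeros((2, 2))
B = np.ones((2, 2))

# np.concatenate(A, B) fails because B is parsed as the `axis`
# argument, which must be a scalar integer.
C = np.concatenate((A, B))  # correct: one tuple of arrays
print(C.shape)  # (4, 2)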
This could be unrelated to this specific problem, but I ran into a similar issue where I used NumPy indexing on a Python list and got the same exact error message:
# incorrect
weights = list(range(1, 129)) + list(range(128, 0, -1))
mapped_image = weights[image[:, :, band]] # image.shape = [800, 600, 3]
# TypeError: only integer scalar arrays can be converted to a scalar index
It turns out I needed to turn weights, a 1D Python list, into a NumPy array before I could apply multi-dimensional NumPy indexing. The code below works:
# correct
weights = np.array(list(range(1, 129)) + list(range(128, 0, -1)))
mapped_image = weights[image[:, :, band]] # image.shape = [800, 600, 3]
Try the following to flatten your array to 1-D:
a.reshape(-1)
(Note that a.reshape((1, -1)) would instead give a 2-D array with a single row.)
You can use numpy.ravel to return a flattened array from n-dimensional array:
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> a.ravel()
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
I had a similar problem and solved it by converting to a list; not sure if this will help or not:
classes = list(unique_labels(y_true, y_pred))
(Here unique_labels comes from sklearn.utils.multiclass.)
This problem arises when we use a vector in place of a scalar. For example, in a for loop the argument to range() should be a scalar; if you pass a vector there, you get this error. To avoid the problem, use the length of the vector instead.
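A sketch of that failure mode (the values are arbitrary):
import numpy as np

v = np.array([10, 20, 30])

# range(v) raises the TypeError, since range() needs a scalar integer;
# iterate over the vector's length instead.
for k in range(len(v)):
    print(v[k])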
I ran across this error while trying to access elements of a list using a 1-D array. I was pointed to this page, but I didn't find the answer I was looking for.
Let l be the list and myarray be my 1-D array. The correct way to access list l using the elements of myarray is
np.take(l,myarray)
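For example:
import numpy as np

l = ['a', 'b', 'c', 'd']
myarray = np.array([0, 2, 3])

# np.take accepts a plain list plus an array of indices.
print(np.take(l, myarray))  # ['a' 'c' 'd']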

Is there any easy way to sparsely store a matrix with a redundant pattern in python?

The type of matrix I am dealing with was created from a vector as shown below:
Start with a 1-d vector V of length L.
To create a matrix A from V with N rows, make the i'th column of A the first N entries of V, starting from the i'th entry of V, so long as there are enough entries left in V to fill up the column. This means A has L - N + 1 columns.
Here is an example:
V = [0, 1, 2, 3, 4, 5]
N = 3
A =
[0 1 2 3
 1 2 3 4
 2 3 4 5]
Representing the matrix this way requires more memory than my machine has. Is there any reasonable way of storing this matrix sparsely? I am currently storing N * (L - N + 1) values, when I only need to store L values.
You can take a view of your original vector as follows:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> v = np.array([0, 1, 2, 3, 4, 5])
>>> n = 3
>>>
>>> a = as_strided(v, shape=(n, len(v)-n+1), strides=v.strides*2)
>>> a
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5]])
This is a view, not a copy of your original data, e.g.
>>> v[3] = 0
>>> v
array([0, 1, 2, 0, 4, 5])
>>> a
array([[0, 1, 2, 0],
       [1, 2, 0, 4],
       [2, 0, 4, 5]])
But you have to be careful not to do any operation on a that triggers a copy, since that would send your memory use through the roof.
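On newer numpy (1.20+), the same view can be built with numpy.lib.stride_tricks.sliding_window_view, which is safer because it returns a read-only view by default:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

v = np.array([0, 1, 2, 3, 4, 5])
n = 3

# Each row is one window of length len(v) - n + 1; same values as `a` above.
a = sliding_window_view(v, len(v) - n + 1)
print(a)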
If you're already using numpy, use its strided or sparse arrays, as Jaime explained.
If you're not already using numpy, you may want to strongly consider using it.
If you need to stick with pure Python, there are three obvious ways to do this, depending on your use case.
For strided or sparse-but-clustered arrays, you could do effectively the same thing as numpy.
Or you could use a simple run-length-encoding scheme, plus maybe a higher-level index: a list of the runs, or a list of pointers to every Nth element, or even a whole stack of such lists (one for every 100 elements, one for every 10000, etc.).
But for values that are mostly a single default (sparse but not clustered), the easiest thing is to simply store a dict or defaultdict mapping indices to values, as sketched below. Random-access lookups and updates are still O(1), albeit with a higher constant factor, and the storage you waste keeping (in effect) a hash, key, and value instead of just a value for each non-default element is more than made up for by not storing the default elements at all, as long as you're below about 0.33 density.
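A minimal sketch of that dict-based approach (all names here are illustrative):
# Sparse 2-D store: map (row, col) -> value; default elements are omitted.
sparse = {}
sparse[(0, 3)] = 7.0
sparse[(2, 1)] = -1.5

def get(m, r, c, default=0.0):
    # O(1) lookup; unset entries fall back to the default value.
    return m.get((r, c), default)

print(get(sparse, 0, 3))  # 7.0
print(get(sparse, 5, 5))  # 0.0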
