I want to create a numpy array and fill it with other numpy arrays. For example:
a = []  (a plain list or a numpy array)
b = np.array([[5,3],[7,9],[3,8],[2,1]])
a = np.concatenate([a,b])
c = np.array([[1,2],[2,9],[3,0]])
a = np.concatenate([a,c])
I would like to do this because I have wav files from which I extract some features, so I can't read from 2 files concurrently, only iteratively.
How can I create an empty ndarray with the second dimension fixed, e.g. a.shape = (x, 2), or how can I concatenate the arrays without creating a "storage" array?
Actually there are 2 options.
The first one is:
a = np.empty((0, 2)), which creates an empty np array whose first dimension can grow while the second stays fixed at 2.
The second is to create an empty list
a = [], append the np arrays to it, and then use np.vstack to concatenate them all together at the end. The latter is the more efficient option.
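A minimal sketch contrasting the two options, reusing the b and c arrays from the question:
import numpy as np

b = np.array([[5, 3], [7, 9], [3, 8], [2, 1]])
c = np.array([[1, 2], [2, 9], [3, 0]])

# Option 1: start with an empty (0, 2) array and concatenate each new chunk.
# Note that np.empty((0, 2)) is float64, so the result is float unless a dtype is given.
a = np.empty((0, 2))
for chunk in (b, c):
    a = np.concatenate([a, chunk])   # a.shape stays (x, 2) at every step

# Option 2: collect the chunks in a list and stack once at the end
parts = []
for chunk in (b, c):
    parts.append(chunk)
a2 = np.vstack(parts)                # usually faster: only one copy at the end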
You have to add brackets in the concatenate function (i.e. pass the arrays as one list):
b = np.array([[5,3],[7,9],[3,8],[2,1]])
c = np.array([[1,2],[2,9],[3,0]])
a = np.concatenate([b,c])
Output:
[[5 3]
[7 9]
[3 8]
[2 1]
[1 2]
[2 9]
[3 0]]
I have a list
list_num = [4 , 5 , 6]
How can I convert it into a numpy array of shape ( , 3)? When using the function
res = np.array(list_num).shape
the output is ( ,3 )
If you print the shape of the res variable, the result is (3,) and not (,3), so I will assume you mistyped that.
As the user imM4TT mentioned, just remove the .shape from res and the variable will be what you need: res = [4 5 6].
If you print res.shape, you will obtain (3,).
If you really want to reshape it to an array of shape (1, 3), it's necessary to use what the user executable mentioned, the reshape function:
res = res.reshape(1,3)
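A short sketch putting the pieces together, just to show the shapes involved:
import numpy as np

list_num = [4, 5, 6]
res = np.array(list_num)      # array([4, 5, 6])
print(res.shape)              # (3,)  -- a 1-D array of length 3

res = res.reshape(1, 3)       # make it an explicit row vector
print(res.shape)              # (1, 3)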
I am very new to python. I want to clearly understand the below code, if there's anyone who can help me.
Code:
import numpy as np
arr = np.array([[1, 2, 3, 4,99,11,22], [5, 6, 7, 8,43,54,22]])
for x in np.nditer(arr[0:,::4]):
    print(x)
My understanding:
This 2D array has two 1D arrays.
np.nditer(arr[0:,::4]) will give all values from the array at index 0 up to the last array, and ::4 means the gap between the printed values will be 4.
Question:
Is my understanding for no 2 above correct?
How can I get the index for the print(x)? Because of the step of 4, e.g. [0:,::4], or any gap [0:,::x], I want to find out the exact index that is being printed. But how?
Addressing your questions below
Yes, I think your understanding is correct. It might help to first print what arr[0:,::4] returns though:
iter_array = arr[0:,::4]
print(iter_array)
>>> [[ 1 99]
>>> [ 5 43]]
The slicing takes every 4th column of the original array. All nditer does is iterate through these values in order. (Quick FYI: arr[0:] and arr[:] are equivalent, since the starting point is 0 by default.)
As you pointed out, to get the index for these you need to keep track of the slicing that you did, i.e. arr[0:, ::x]. Remember, nditer has nothing to do with how you sliced your array. I'm not sure how to best get the indices of your slicing, but this is what I came up with:
import numpy as np
ls = [
[1, 2, 3, 4,99,11,22],
[5, 6, 7, 8,43,54,22]
]
arr = np.array(ls)
inds = np.array([
    [(ctr1, ctr2) for ctr2, _ in enumerate(l)] for ctr1, l in enumerate(ls)
])  # same layout as arr, but each element is its own (row, col) index pair
step = 4
iter_array = arr[0:,::step]
iter_inds = inds[0:,::step]
print(iter_array)
>>> [[ 1 99]
>>> [ 5 43]]
print(iter_inds)
>>> [[[0 0]
>>> [0 4]]
>>>
>>> [[1 0]
>>> [1 4]]]
All that I added here was an inds array. This array has elements equal to their own index. Then, when you slice both arrays in the same way, you get your indices. Hopefully this helps!
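Not part of the original answer, but np.indices can build the same index array more compactly; a minimal sketch assuming the same arr and step as above:
import numpy as np

arr = np.array([[1, 2, 3, 4, 99, 11, 22],
                [5, 6, 7, 8, 43, 54, 22]])
step = 4

# np.indices(arr.shape) returns the row and column index grids;
# stacking them along the last axis gives an array where element [i, j] == (i, j)
inds = np.stack(np.indices(arr.shape), axis=-1)

print(arr[0:, ::step])    # [[ 1 99], [ 5 43]]
print(inds[0:, ::step])   # [[[0 0], [0 4]], [[1 0], [1 4]]]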
Given a list of numpy arrays, each of different length, as that obtained by doing lst = np.array_split(arr, indices), how do I get the sum of every array in the list? (I know how to do it using list-comprehension but I was hoping there was a pure-numpy way to do it).
I thought that this would work:
np.apply_along_axis(lambda arr: arr.sum(), axis=0, arr=lst)
But it doesn't, instead it gives me this error which I don't understand:
ValueError: operands could not be broadcast together with shapes (0,) (12,)
NB: It's an array of sympy objects.
There's a faster way which avoids np.split and uses np.add.reduceat. We create an ascending array of the indices at which each group starts with np.append([0], np.cumsum(indices)[:-1]). For proper indexing we need to put a zero in front (and discard the last element if it covers the full range of the original array; otherwise just delete the [:-1] indexing). Then we use the np.add ufunc's reduceat method:
import numpy as np
arr = np.arange(1, 11)
indices = np.array([2, 4, 4])
# this should split like this
# [1 2 | 3 4 5 6 | 7 8 9 10]
np.add.reduceat(arr, np.append([0], np.cumsum(indices)[:-1]))
# array([ 3, 18, 34])
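For comparison, here are the same sums via np.split and a list comprehension (the approach the question wanted to avoid), useful as a sanity check:
np.split(arr, np.cumsum(indices)[:-1])
# [array([1, 2]), array([3, 4, 5, 6]), array([ 7,  8,  9, 10])]
[a.sum() for a in np.split(arr, np.cumsum(indices)[:-1])]
# sums: 3, 18, 34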
I have a 2D numpy array, let's say it has shape 4x10 (4 rows and 10 columns). I have 2 1D arrays that hold the initial and final indexes, one entry per row. For an example, let's say
initial = [1, 2, 4, 5]
final = [3, 6, 8, 6]
then I'd like to get
data[0,1:3]
data[1,2:6]
data[2,4:8]
data[3,5:6]
Of course, each of those arrays would have different size, so I'd like to store them in a list.
If I were to do it with a for loop, it'd look like this:
arrays = []
for i in range(4):
    slice = data[i, initial[i]:final[i]]
    arrays.append(slice)
Is there a more efficient way to do this? I'd rather avoid using a for loop, because my actual data is huge.
You can use numpy.split on the flattened data (using numpy.ndarray.flatten), after shifting the slice boundaries into the flat index space:
sections = np.column_stack([initial, final]).flatten()
sections[::2] += np.arange(len(initial)) * data.shape[1]
sections[1::2] += sections[::2] - np.array(initial)
np.split(data.flatten(), sections)[1::2]
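A runnable sketch of this answer using the example indices from the question; data here is a hypothetical 4x10 array, just to show the mechanics:
import numpy as np

data = np.arange(40).reshape(4, 10)   # stand-in for the real data
initial = [1, 2, 4, 5]
final = [3, 6, 8, 6]

sections = np.column_stack([initial, final]).flatten()
sections[::2] += np.arange(len(initial)) * data.shape[1]   # shift each start into the flat array
sections[1::2] += sections[::2] - np.array(initial)        # shift each end by the same row offset
arrays = np.split(data.flatten(), sections)[1::2]
# [array([1, 2]), array([12, 13, 14, 15]), array([24, 25, 26, 27]), array([35])]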
I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common task when you're doing things like normalization/standardization, but I can't seem to find the proper way to do this efficiently.
Here's an example to demonstrate:
# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
#vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T
"""
I want to subtract `vec` from each of the columns in `mat` yielding...
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]
"""
One way to accomplish what I want is to hstack vec to itself 3 times, yielding a 3x3 matrix where each column is vec and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.
Thanks!
EDIT: Removed the 'in-place' constraint, because sparsity structure would be constantly changing in an in-place assignment scenario.
For a start what would we do with dense arrays?
mat-vec.A # taking advantage of broadcasting
mat-vec.A[:,[0]*3] # explicit broadcasting
mat-vec[:,[0,0,0]] # that also works with csr matrix
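A quick check of the dense variants with the mat and vec from the question (subtracting a dense array from the sparse matrix gives a dense result):
dense = mat - vec.A   # vec.A is the dense (3, 1) column; broadcasting does the rest
print(dense)
# [[0 1 2]
#  [0 1 2]
#  [0 1 2]]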
In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566
we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows, x.cols of an lil_matrix are nearly as good; getrow is slow.) This function implements such an iteration.
from numpy.lib.stride_tricks import as_strided

def sub_rows(X, v):
    # assumes X is a csr_matrix, so X.indptr delimits the rows of X.data
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]
        data -= v[row]

mat_csr = mat.tocsr()
sub_rows(mat_csr, vec.A)
print(mat_csr.A)
I'm using vec.A for simplicity. If we kept vec sparse we'd have to add a test for a nonzero value at each row. Also, this type of iteration only modifies the nonzero elements of the matrix; 0's are unchanged.
I suspect the time advantage will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat - vec.A.
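A minimal sketch of that sparse-aware variant, skipping rows where vec is zero; the function name is mine, not part of the original answer:
from numpy.lib.stride_tricks import as_strided

def sub_nonzero_rows(X, v):
    # X: csr_matrix; v: dense 1-D vector with many zeros
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        if v[row] == 0:               # nothing to subtract from this row
            continue
        X.data[start:stop] -= v[row]

mat_csr = mat.tocsr()
sub_nonzero_rows(mat_csr, vec.A.ravel())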
Summary
So in short, if you use CSR instead of CSC, it's a one-liner (here mat is the CSR matrix and vec has been flattened to a plain 1-D dense array):
mat.data -= numpy.repeat(vec.toarray().ravel(), numpy.diff(mat.indptr))
Explanation
If you think about it, this is better done row-wise, since we subtract the same number from each row. In your example: subtract 1 from the first row, 2 from the second row, and 3 from the third row.
I actually encountered this in a real-life application where I wanted to classify documents, each represented as a row in the matrix, while the columns represent words. Each document has a score which should be multiplied with the score of each word in that document. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):
import numpy
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
#vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T
# Use the row version
mat_row = mat.tocsr()
vec_row = vec.T
# mat_row.data contains the values in a 1d array, one-by-one from top left to bottom right in row-wise traversal.
# mat_row.indptr (an n+1 element array) contains the pointer to the start of each row in the data array, plus a final pointer to the end of mat_row.data.
# By taking the difference of consecutive pointers we get the number of non-zero elements in each row, and we repeat each element of the row vector that many times.
mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
print(mat_row.todense())
Which results in:
[[0 1 2]
[0 1 2]
[0 1 2]]
The visualization is something like this:
>>> mat_row.data
[1 2 3 2 3 4 3 4 5]
>>> mat_row.indptr
[0 3 6 9]
>>> numpy.diff(mat_row.indptr)
[3 3 3]
>>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[1 1 1 2 2 2 3 3 3]
>>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
>>> mat_row.data
[0 1 2 0 1 2 0 1 2]
>>> mat_row.todense()
[[0 1 2]
[0 1 2]
[0 1 2]]
You can introduce fake dimensions by altering the strides of your vector. You can, without additional allocation, "convert" your vector to a 3x3 matrix using np.lib.stride_tricks.as_strided. This page has an example and a bit of discussion about it, along with some discussion of related topics (like views). Search the page for "Example: fake dimensions with strides."
There are also quite a few examples on SO about this... but my searching skills are failing me now.
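A minimal sketch of that stride trick, assuming the mat and vec from the question; note that np.broadcast_to gives the same zero-copy view with less room for error:
import numpy as np
from numpy.lib.stride_tricks import as_strided

v = vec.toarray()                     # dense (3, 1) column vector
# view v as a 3x3 array whose columns are all v, without copying:
# a stride of 0 along the second axis means every column reuses the same memory
v_fake = as_strided(v, shape=(3, 3), strides=(v.strides[0], 0))

# equivalent, and safer:
v_fake2 = np.broadcast_to(v, (3, 3))

result = mat.toarray() - v_fake
# [[0 1 2]
#  [0 1 2]
#  [0 1 2]]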