I'm confused about how np.zeros() dimensions are handled.
I have two pandas DataFrames with some toy data:
# A toy 3col x 4row Dataframe
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['colA','colB','colC'])
b = pd.DataFrame([[40,41,42],[43,44,45],[46,47,48],[49,50,51]],columns=['colA','colB','colC'])
   colA  colB  colC
0     1     2     3
1     4     5     6
2     7     8     9
3    10    11    12
I want to get these two DataFrames into a 3D NumPy array of dimensions (4 rows, 3 cols, 2 channels), so I can calculate statistics between the two DataFrames (e.g. average, max values, etc.).
So I basically create a 3D array of zeros and populate each channel with the values from the dataframes.
But it looks like the dimensions are not correctly arranged.
c = np.zeros((4, 3, 2))
c[:,:,0] = a.values
c[:,:,1] = b.values
array([[[ 1., 40.],
        [ 2., 41.],
        [ 3., 42.]],

       [[ 4., 43.],
        [ 5., 44.],
        [ 6., 45.]],

       [[ 7., 46.],
        [ 8., 47.],
        [ 9., 48.]],

       [[10., 49.],
        [11., 50.],
        [12., 51.]]])
If I put the number of channels as the first index, then it is correctly arranged.
However, this is very counterintuitive; usually in 3-dimensional data the channel is the third index, not the first.
c = np.zeros((2,4,3))
c[0,:,:] = a.values
c[1,:,:] = b.values
array([[[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.],
        [10., 11., 12.]],

       [[40., 41., 42.],
        [43., 44., 45.],
        [46., 47., 48.],
        [49., 50., 51.]]])
I don't understand this logic. Why is the third dimension (the channel) the first index instead of the last one?
When I calculate the average over the two channels I have to do it using axis=0, which is very confusing. Anybody looking at the code will think it's a column-wise average instead of an average between channels.
Am I doing anything wrong?
Regarding intuition, this is just the typical way of accessing arrays in most (all?) programming languages. Typically, when you do something like:
my_array[a][b][c][d]
which, considering all languages, is more common than the NumPy-style indexing, what you typically mean is:
From my_array, get block a. This is an inner block.
From the previous block, get block b. This is an inner block.
From the previous block, get block c. This is an inner block.
From the previous block, get item d (which happens not to be a block because it's the last dimension).
The order is always from the outermost dimension to the innermost dimension. This has nothing to do with images or channels. So in your example, if you expect c[0] to return a channel rather than what you call a row, then that is the intuition to follow: you always put your outer dimension first, just like when you store an image as an array the first dimension is rows (height) and then columns (width).
This entire conversation ignores Fortran-based array orderings (MATLAB uses them, for example), where the column is "outer" to the row by definition. If you came from one of those languages to Python's C-based row-to-column ordering, that is a common source of misunderstanding. In this case, intuition just equals what you are used to working with, which is subjective and somewhat arbitrary.
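That said, nothing forces you to put the channel first. If you prefer keeping it as the last axis, here is a minimal sketch (assuming the same a and b DataFrames from the question):

import numpy as np

# Stack along a new last axis: the result has shape (4 rows, 3 cols, 2 channels)
c = np.stack([a.values, b.values], axis=-1)

# The channel axis is now the last one, so the reduction reads as intended
channel_mean = c.mean(axis=-1)

Averaging with axis=-1 (or axis=2) makes it explicit that you are reducing over channels rather than over rows.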
Usually, I think, in 3-dimensional data the channel is the first index, as shown in your second code snippet. That is how it is arranged, so just use it that way.
This would be my approach
>>> x = a.values.reshape((a.shape[0], a.shape[1], 1)) # Convert 2D to 3D - One layer
>>> y = b.values.reshape((b.shape[0], b.shape[1], 1)) # Convert 2D to 3D - Second layer
>>> z = np.concatenate((x, y), axis=2) # Concatenate along the 3rd axis (axis=2, counting from zero)
This would yield something similar to your array, which is correct.
array([[[ 1, 40],
        [ 2, 41],
        [ 3, 42]],

       [[ 4, 43],
        [ 5, 44],
        [ 6, 45]],

       [[ 7, 46],
        [ 8, 47],
        [ 9, 48]],

       [[10, 49],
        [11, 50],
        [12, 51]]])
Also, if you want to visually inspect the array as a DataFrame (just for checking):
>>> pd.DataFrame(z.tolist())
0 1 2
0 [1, 40] [2, 41] [3, 42]
1 [4, 43] [5, 44] [6, 45]
2 [7, 46] [8, 47] [9, 48]
3 [10, 49] [11, 50] [12, 51]
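As a side note, np.dstack should produce the same (4, 3, 2) result in a single call; a small sketch, again assuming the same a and b:

>>> z = np.dstack((a.values, b.values))  # equivalent to the reshape + concatenate above
>>> z.shape
(4, 3, 2)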
Let's say I have data for pairs of 3 variables, A, B, and C (in my actual application the number of variables is anywhere from 1000-3000, but could be even higher).
Let's also say that there are pieces of the data that come in arrays.
For example:
Array X:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
Where:
X[0,0] corresponds to data for variables A and A
X[0,1] corresponds to data for variables A and B
X[0,2] corresponds to data for variables A and C
X[1,0] corresponds to data for variables B and A
X[1,1] corresponds to data for variables B and B
X[1,2] corresponds to data for variables B and C
X[2,0] corresponds to data for variables C and A
X[2,1] corresponds to data for variables C and B
X[2,2] corresponds to data for variables C and C
Array Y:
np.array([[2,12],
[-12, 2]])
Y[0,0] corresponds to data for variables A and C
Y[0,1] corresponds to data for variables A and B
Y[1,0] corresponds to data for variables B and A
Y[1,1] corresponds to data for variables C and A
Array Z:
np.array([[ 99, 77],
[-77, -99]])
Z[0,0] corresponds to data for variables A and C
Z[0,1] corresponds to data for variables B and C
Z[1,0] corresponds to data for variables C and B
Z[1,1] corresponds to data for variables C and A
I want to concatenate the above arrays keeping the variable position fixed as follows:
END_RESULT_ARRAY index 0 corresponds to variable A
END_RESULT_ARRAY index 1 corresponds to variable B
END_RESULT_ARRAY index 2 corresponds to variable C
Basically, there are N variables in the universe, but the universe can change every month (new ones can be introduced and existing ones can drop out and then return, or never return). Within the N variables in the universe I compute permutation pairs, and the position of each variable is fixed, i.e. index 0 corresponds to variable A, index 1 corresponds to variable B (as described above).
Given the above requirement, END_RESULT_ARRAY should look like the following:
array([[[  0.,   2.,   3.],
        [ -2.,   0.,   4.],
        [ -3.,  -4.,   0.]],

       [[ nan,  12.,   2.],
        [-12.,  nan,  nan],
        [  2.,  nan,  nan]],

       [[ nan,  nan,  99.],
        [ nan,  nan,  77.],
        [-99., -77.,  nan]]])
Keep in mind that the above is an illustration.
In my actual application, I have about 125 arrays and a new one is generated every month. Each monthly array may have different sizes and may only have data for a portion of the variables defined in my universe. Also, as new arrays are created each month there is no way of knowing what its size will be or which variables will have data (or which ones will be missing).
So up until the most recent monthly array, we can determine the max size from the available historical data. Each month we will have to re-check the max size of all the arrays as a new array comes available. Once we have the max size we can then re-stitch/concatenate all the arrays together IF THIS IS SOMETHING THAT IS DOABLE in numpy. This will be an on-going operation done every month.
I want a general mechanism to be able to stitch these arrays together keeping the requirements I describe regarding the index position for the variables fixed.
I actually want to use H5PY arrays, as my data set will grow exponentially in the not too distant future. However, I would like to get this working with numpy as a first step.
Based on the comment made by #user3483203, the next step is to concatenate the arrays.
a = np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
b = np.array([[0,12], [-12, 0]])
out = np.full_like(a, np.nan)  # NaN array with the same shape and dtype as a
i, j = b.shape
out[:i, :j] = b                # place b in the top-left corner
res = np.array([a, out])
print(res)
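For the fixed-index requirement in the updated question, here is a rough sketch of a more general placement step. It assumes each monthly value can be labelled with its row/column variable names; the universe mapping and the place helper are made-up names for illustration:

import numpy as np

# Hypothetical fixed variable -> index mapping for the whole universe
universe = {'A': 0, 'B': 1, 'C': 2}

def place(cells, universe):
    # cells is a list of (row_var, col_var, value) triples describing one monthly array
    out = np.full((len(universe), len(universe)), np.nan)
    for row_var, col_var, value in cells:
        out[universe[row_var], universe[col_var]] = value
    return out

# e.g. the Y array from the question, expressed as labelled cells
layer_y = place([('A', 'C', 2), ('A', 'B', 12), ('B', 'A', -12), ('C', 'A', 2)], universe)

The resulting layers all share the same shape, so they can be stacked with np.array or np.stack as above.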
This answers the original question which has since been changed:
Let's say I have the following arrays:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
np.array([[0,12],
[-12, 0]])
I want to concatenate the above 2 arrays such that the end result is
as follows:
array([[[0, 2, 3],
[-2, 0, 4],
[-3,-4, 0]],
[[0,12, np.nan],
[-12, 0, np.nan],
[np.nan, np.nan, np.nan]]])
Find out how much smaller each array is than the max size in each dimension, then use np.pad to pad at the end of each dimension, and finally np.stack to stack them together:
import numpy as np
a = np.arange(12).reshape(4,3).astype(float)
b = np.arange(4).reshape(1,4).astype(float)
arrs = (a, b)
dims = len(arrs[0].shape)
maxshape = tuple(max(x.shape[i] for x in arrs) for i in range(dims))
paddedarrs = [np.pad(x, tuple((0, maxshape[i] - x.shape[i]) for i in range(dims)),
                     'constant', constant_values=(np.nan,)) for x in arrs]
c = np.stack(paddedarrs, 0)
print (a)
print(b,"\n======================")
print(c)
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]
[ 9. 10. 11.]]
[[0. 1. 2. 3.]]
======================
[[[ 0. 1. 2. nan]
[ 3. 4. 5. nan]
[ 6. 7. 8. nan]
[ 9. 10. 11. nan]]
[[ 0. 1. 2. 3.]
[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]]]
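Since this re-stitching has to be repeated every month as a new array arrives, the same idea can be wrapped in a small helper; this is only a sketch (pad_and_stack is a made-up name), assuming all arrays have the same number of dimensions:

import numpy as np

def pad_and_stack(arrs, fill=np.nan):
    # Pad every array up to the element-wise maximum shape, then stack along a new first axis
    dims = arrs[0].ndim
    maxshape = tuple(max(x.shape[i] for x in arrs) for i in range(dims))
    padded = [np.pad(x.astype(float),
                     tuple((0, maxshape[i] - x.shape[i]) for i in range(dims)),
                     'constant', constant_values=fill)
              for x in arrs]
    return np.stack(padded, axis=0)

Calling pad_and_stack((a, b)) on the arrays above reproduces c.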
I have two 3D matrices (A and B), both containing NaN in random elements. I am comparing these two matrices and in places where at least one of them contains NaN I want both of them to contain NaN. In other words, if the both of them don't already contain NaN at that index, I want to replace that index value with NaN. Is there an efficient way to do this with a python function?
import numpy as np
# Create the fake variables A and B. Here is what A and B look like initially:
A = np.array([[1, 2, np.nan], [4, np.nan, 6], [np.nan, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 8, 9]])
# What I want A and B to look like in the end:
A
array([[ 1., 2., nan],
[ 4., nan, nan],
[ nan, 8., 9.]])
B
array([[ 1., 2., nan],
[ 4., nan, nan],
[ nan, 8., 9.]])
You need numpy.isnan() and boolean indexing.
>>> A[np.isnan(B)] = np.nan
>>> B[np.isnan(A)] = np.nan
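Equivalently, you can build the combined mask once, so the result does not depend on the order of the two assignments; a minimal sketch:

>>> mask = np.isnan(A) | np.isnan(B)  # True wherever either array has a NaN
>>> A[mask] = np.nan
>>> B[mask] = np.nan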
I'm trying to solve a problem using multidimensional arrays, rather than resorting to for loops, in order to gain a performance boost, but am having trouble with the indexing.
I've tried various permutations using np.newaxis, but can't seem to achieve the following functionality.
Problem:
Part 1) Take an M x N x N array called a, and for each of the M square matrices, set the upper triangular matrix elements as their negative values.
Part 2) Sum all elements in each of the M matrices (of shape N X N), returning a 1D array with M elements. Let's call this array b.
Attempted Solution
Here is my MWE / attempt using loops (which does work, but I'd rather find a fully array/matrix-based approach):
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
Part 1):
triangular_upper_idx = np.triu_indices_from(a[0])
for i in range(len(a)):
    a[i][triangular_upper_idx] *= -1
a
result:
array([[[ 0, -1],
[ 5, 0]],
[[ 0, -3],
[ 2, 0]]])
Part 2):
b = np.zeros(len(a))
for i in range(len(a)):
    b[i] = np.sum(a[i])
b
result:
array([ 4., -1.])
Note:
I have seen a similar question on this topic (Triangular indices for multidimensional arrays in numpy) but the solution there was nested for loops... I feel like numpy may offer a more efficient, clever array-based solution?
Any guidance would be much appreciated.
Thanks
Yes, NumPy has the tools:
r = 2
neg_uppr = np.triu(-np.ones((r,r)),1) + np.tril(np.ones((r,r)))
I can't tell from your numerical example whether you want the diagonal to be inverted too; if so, use np.triu(-np.ones((r,r))) + np.tril(np.ones((r,r)), -1)
neg_uppr
Out[23]:
array([[ 1., -1.],
[ 1., 1.]])
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
It's fast to use the built-in element-wise arithmetic:
a = a * neg_uppr
a
Out[26]:
array([[[ 0., -1.],
[ 5., 0.]],
[[ 0., -3.],
[ 2., 0.]]])
you can specify axes to sum over:
np.sum(a, (1,2))
Out[27]: array([ 4., -1.])
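Putting both parts together for arbitrary M and N, a sketch along the same lines (negate_upper_and_sum is just an illustrative name):

import numpy as np

def negate_upper_and_sum(a):
    # a has shape (M, N, N); flip the sign of the strictly upper triangle of
    # every matrix, then sum each N x N matrix down to a scalar
    n = a.shape[-1]
    sign = np.tril(np.ones((n, n))) - np.triu(np.ones((n, n)), 1)
    return (a * sign).sum(axis=(1, 2))

For the example a above this returns array([ 4., -1.]), matching the loop version.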
I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix.
For example, for a 4x4 sparse matrix A I would like to make a 2x2 matrix B in which B[i,j] = sum(A[2*i:2*i+2, 2*j:2*j+2]).
Currently, I just go block by block to recreate the condensed matrix but this is slow. Any ideas on how to optimize this?
Update: Here is an example code that works fine, but is slow for a sparse matrix of 50,000x50,000 that I want to condense into a 10,000x10,000 one:
>>> import scipy.sparse
>>> from numpy.random import rand
>>> A = (rand(4,4) < 0.3) * rand(4,4)
>>> A = scipy.sparse.lil_matrix(A + A.T)  # make the matrix symmetric
>>> B = scipy.sparse.lil_matrix((2,2))
>>> for i in range(B.shape[0]):
...     for j in range(B.shape[1]):
...         B[i,j] = A[2*i:2*i+2, 2*j:2*j+2].sum()
First of all, a LIL matrix for the one you are summing up is probably a really bad choice; I would try COO or maybe CSR/CSC (I don't know which will be better, but LIL is probably inherently slower for many of these operations; even the slicing might be much slower, though I did not test). (Unless you know that, for example, DIA fits perfectly.)
Based on COO I could imagine doing some trickery. Since COO has row and col arrays giving the exact positions:
matrix = A.tocoo()
new_row = matrix.row // 5
new_col = matrix.col // 5
bins = (matrix.shape[0] // 5) * new_col + new_row
# Now do a little dance because this is sparse,
# and most of the possible bins will not appear in new_row/new_col;
# we also need to group the bins:
unique, bins = np.unique(bins, return_inverse=True)
sums = np.bincount(bins, weights=matrix.data)
new_col = unique // (matrix.shape[0] // 5)
new_row = unique - new_col * (matrix.shape[0] // 5)
result = scipy.sparse.coo_matrix((sums, (new_row, new_col)))
(I won't guarantee that I didn't confuse row and column somewhere and this only works for square matrices...)
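Another option I would consider, sketched under the assumption that N is divisible by the block size d: build a sparse aggregation matrix S whose row i marks the original rows belonging to block i; then S @ A @ S.T gives the matrix of block sums while staying sparse throughout.

import numpy as np
import scipy.sparse

def condense(A, d):
    # A is an N x N sparse matrix, d is the block size (N assumed divisible by d)
    n = A.shape[0]
    blocks = np.arange(n) // d  # block index of every original row/column
    S = scipy.sparse.csr_matrix((np.ones(n), (blocks, np.arange(n))), shape=(n // d, n))
    return S @ A @ S.T          # (N/d) x (N/d) matrix of block sums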
Given a square matrix of size N and a split size of d (so the matrix will be partitioned into N/d * N/d sub-matrices of size d), could you use numpy.split a couple of times to build a collection of those sub-matrices, sum each of them, and put them back together? (Note that numpy.split takes the number of sections, N/d, not the block size d.)
This should be treated more as pseudocode than an efficient implementation, but it expresses my idea:
import numpy as np

def chunk(matrix, size):
    row_wise = []
    for hchunk in np.split(matrix, size):
        row_wise.append(np.split(hchunk, size, 1))
    return row_wise

def sum_chunks(chunks):
    sum_rows = []
    for row in chunks:
        sum_rows.append([np.sum(col) for col in row])
    return np.array(sum_rows)
Or more compactly as
def sum_in_place(matrix, size):
    return np.array([[np.sum(vchunk) for vchunk in np.split(hchunk, size, 1)]
                     for hchunk in np.split(matrix, size)])
This gives you something like the following:
In [16]: a
Out[16]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [17]: chunk.sum_in_place(a, 2)
Out[17]:
array([[10, 18],
[42, 50]])
For a 4x4 example you can do the following:
In [43]: a = np.arange(16.).reshape((4, 4))
In [44]: a
Out[44]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[ 12., 13., 14., 15.]])
In [45]: u = np.array([a[:2, :2], a[:2, 2:], a[2:,:2], a[2:, 2:]])
In [46]: u
Out[46]:
array([[[ 0., 1.],
[ 4., 5.]],
[[ 2., 3.],
[ 6., 7.]],
[[ 8., 9.],
[ 12., 13.]],
[[ 10., 11.],
[ 14., 15.]]])
In [47]: u.sum(1).sum(1).reshape(2, 2)
Out[47]:
array([[ 10., 18.],
[ 42., 50.]])
Using something like itertools it should be possible to automate and generalise an expression for u.
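For the dense case this can also be generalised without itertools: a reshape exposes the blocks directly. A sketch, assuming the matrix size is divisible by the block size:

import numpy as np

a = np.arange(16.).reshape(4, 4)
d = 2                                     # block size
n = a.shape[0]
blocks = a.reshape(n // d, d, n // d, d)  # axes: block row, row in block, block column, column in block
condensed = blocks.sum(axis=(1, 3))       # array([[10., 18.], [42., 50.]])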