Subtracting minimum of row from the row - python

I know that
a - a.min(axis=0)
will subtract the minimum of each column from every element in the column. I want to subtract the minimum in each row from every element in the row. I know that
a.min(axis=1)
specifies the minimum within a row, but how do I tell the subtraction to go by rows instead of columns? (How do I specify the axis of the subtraction?)
edit: For my question, a is a 2d array in NumPy.

Assuming a is a numpy array, you can use this:
new_a = a - np.min(a, axis=1)[:,None]
Try it out:
import numpy as np
a = np.arange(24).reshape((4,6))
print (a)
new_a = a - np.min(a, axis=1)[:,None]
print (new_a)
Result:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
[[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]]
Note that np.min(a, axis=1) returns a 1d array of row-wise minimum values.
We then add an extra dimension to it using [:,None], which turns it into this 2d array of shape (4,1):
array([[ 0],
[ 6],
[12],
[18]])
When this 2d array participates in the subtraction, it is broadcast to shape (4,6), which looks like this:
array([[ 0, 0, 0, 0, 0, 0],
[ 6, 6, 6, 6, 6, 6],
[12, 12, 12, 12, 12, 12],
[18, 18, 18, 18, 18, 18]])
Now, element-wise subtraction happens between the two (4,6) arrays.
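To make the shapes in this explanation concrete, here is a small sketch. The names row_min, col_vec and expanded are made up for illustration; np.broadcast_to is used only to display the conceptual (4,6) array, since NumPy does not need to materialize it for the subtraction.

```python
import numpy as np

a = np.arange(24).reshape((4, 6))

row_min = np.min(a, axis=1)   # 1d array of row-wise minima, shape (4,)
col_vec = row_min[:, None]    # extra axis added, shape (4, 1)

# Read-only view with the shape that broadcasting conceptually produces.
expanded = np.broadcast_to(col_vec, a.shape)

print(row_min.shape, col_vec.shape, expanded.shape)
# Subtracting the (4, 1) column gives the same result as the explicit (4, 6) array.
print(np.array_equal(a - col_vec, a - expanded))
```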

Specify keepdims=True to preserve a length-1 dimension in place of the dimension that min collapses, allowing broadcasting to work out naturally:
a - a.min(axis=1, keepdims=True)
This is especially convenient when axis is determined at runtime, but it is probably still clearer than manually reintroducing the collapsed dimension even when the axis value (1 here) is fixed.
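As a sketch of the runtime-axis case, the helper below (subtract_min is a hypothetical name, not a NumPy function) works unchanged for either axis because keepdims=True keeps the reduced axis as length 1:

```python
import numpy as np

def subtract_min(a, axis):
    """Subtract the minimum along `axis` from every element (hypothetical helper)."""
    return a - a.min(axis=axis, keepdims=True)

a = np.array([[3, 1, 2],
              [9, 7, 8]])

print(subtract_min(a, axis=1))  # each row minus its own minimum
print(subtract_min(a, axis=0))  # each column minus its own minimum
```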

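If you want to use only pandas, you can apply a lambda to every row (axis=1) that subtracts min(row) from the element in each column:
new_df = pd.DataFrame()
for i, col in enumerate(df.columns):
    # axis=1 passes each row to the lambda; iloc[i] picks the i-th column's value
    new_df[col] = df.apply(lambda row: row.iloc[i] - min(row), axis=1)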
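A vectorized pandas alternative is probably simpler. This is a sketch on a made-up DataFrame (the column names x, y, z are assumptions); DataFrame.sub with axis=0 aligns the per-row minima with the row index:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"x": [3, 9], "y": [1, 7], "z": [2, 8]})

# df.min(axis=1) is a Series of row minima; axis=0 matches it against the row index.
new_df = df.sub(df.min(axis=1), axis=0)
print(new_df)
```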

Numpy array add inplace, values not summed when select same row multiple times

Suppose I have a 5x3 matrix. I want to select a few rows and add another array of the correct shape to them in place. The problem is that when a row is selected multiple times, the values from the other array are not summed:
Example:
I have a 5x3 matrix:
>>> import numpy as np
>>> x = np.arange(15).reshape((5,3))
>>> print(x)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]]
I want to select a few rows, and add values:
>>> x[np.array([[1,1],[2,3]])] # row 1 is selected twice
[[[ 3 4 5]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
>>> add_value = np.random.randint(0,10,(2,2,3))
>>> print(add_value)
[[[6 1 2] # add to row 1
[9 8 5]] # add to row 1 again!
[[5 0 5] # add to row 2
[1 9 3]]] # add to row 3
>>> x[np.array([[1,1],[2,3]])] += add_value
>>> print(x)
[[ 0 1 2]
[12 12 10] # [12,12,10]=[3,4,5]+[9,8,5]
[11 7 13]
[10 19 14]
[12 13 14]]
As shown above, row 1 is [12,12,10], which means [6,1,2] and [9,8,5] were not both added to row 1; only one of the two additions took effect. Are there any solutions? Thanks!
This behavior is described in the numpy documentation, near the bottom of this page, under "assigning values to indexed arrays":
https://numpy.org/doc/stable/user/basics.indexing.html#basics-indexing
Quoting:
Unlike some of the references (such as array and mask indices) assignments are always made to the original data in the array (indeed, nothing else would make sense!). Note though, that some actions may not work as one may naively expect. This particular example is often surprising to people:
>>> x = np.arange(0, 50, 10)
>>> x
array([ 0, 10, 20, 30, 40])
>>> x[np.array([1, 1, 3, 1])] += 1
>>> x
array([ 0, 11, 20, 31, 40])
Where people expect that the 1st location will be incremented by 3. In fact, it will only be incremented by 1. The reason is that a new array is extracted from the original (as a temporary) containing the values at 1, 1, 3, 1, then the value 1 is added to the temporary, and then the temporary is assigned back to the original array. Thus the value of the array at x[1] + 1 is assigned to x[1] three times, rather than being incremented 3 times.
Just want to share what @hpaulj suggests, which uses np.add.at:
>>> import numpy as np
>>> x = np.arange(15).reshape((5,3))
>>> select = np.array([[1,1],[2,3]])
>>> add_value = np.array([[[6,1,2],[9,8,5]],[[5,0,5],[1,9,3]]])
>>> np.add.at(x, select.flatten(), add_value.reshape(-1, add_value.shape[-1]))
>>> print(x)
[[ 0 1 2]
[18 13 12]
[11 7 13]
[10 19 14]
[12 13 14]]
Now row 1 is [18,13,12], which is the sum of [3,4,5], [6,1,2] and [9,8,5].
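As far as I can tell, np.add.at also accepts the 2d index array directly, so the flatten/reshape step may not be needed. The sketch below compares the buffered += with np.add.at on the same data; the names buffered and unbuffered are made up for illustration.

```python
import numpy as np

select = np.array([[1, 1], [2, 3]])
add_value = np.array([[[6, 1, 2], [9, 8, 5]],
                      [[5, 0, 5], [1, 9, 3]]])

buffered = np.arange(15).reshape((5, 3))
buffered[select] += add_value             # row 1 is indexed twice, additions are not accumulated

unbuffered = np.arange(15).reshape((5, 3))
np.add.at(unbuffered, select, add_value)  # unbuffered: every addition is applied

print(buffered[1])
print(unbuffered[1])
```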

Convert c-order index into f-order index in Python

I am trying to find a solution to the following problem. I have an index in C-order and I need to convert it into F-order.
To explain simply my problem, here is an example:
Let's say we have a matrix x as:
x = np.arange(1,5).reshape(2,2)
print(x)
array([[1, 2],
[3, 4]])
Then the flattened matrix in C order is:
flat_c = x.ravel()
print(flat_c)
array([1, 2, 3, 4])
Now, the value 3 is at index 2 of the flat_c vector, i.e. flat_c[2] is 3.
If I would flatten the matrix x using the F-order, I would have:
flat_f = x.ravel(order='f')
print(flat_f)
array([1, 3, 2, 4])
Now, the value 3 is at index 1 of the flat_f vector, i.e. flat_f[1] is 3.
I am trying to find a way to get the F-order index knowing the dimension of the matrix and the corresponding index in C-order.
I tried using np.unravel_index but this function returns the matrix positions...
We can use a combination of np.ravel_multi_index and np.unravel_index for a solution that works for n-dimensional arrays. Given the shape s of the input array a and the C-order index c_idx, it would be -
s = a.shape
f_idx = np.ravel_multi_index(np.unravel_index(c_idx,s)[::-1],s[::-1])
So, the idea is pretty simple: use np.unravel_index to turn the C-order index into n-dimensional indices, then get the flattened linear index in Fortran order by calling np.ravel_multi_index with those indices reversed and the shape reversed.
Sample runs on 2D -
In [321]: a
Out[321]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
In [322]: s = a.shape
In [323]: c_idx = 6
In [324]: np.ravel_multi_index(np.unravel_index(c_idx,s)[::-1],s[::-1])
Out[324]: 4
In [325]: c_idx = 12
In [326]: np.ravel_multi_index(np.unravel_index(c_idx,s)[::-1],s[::-1])
Out[326]: 8
Sample run on 3D array -
In [336]: a
Out[336]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
In [337]: s = a.shape
In [338]: c_idx = 21
In [339]: np.ravel_multi_index(np.unravel_index(c_idx,s)[::-1],s[::-1])
Out[339]: 9
In [340]: a.ravel('F')[9]
Out[340]: 21
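The same conversion can be wrapped in a small function and checked exhaustively. This is a sketch: c_to_f_index is a hypothetical name, and instead of reversing indices and shape it passes order="F" to np.ravel_multi_index, which should give the same result.

```python
import numpy as np

def c_to_f_index(c_idx, shape):
    """Convert a flat C-order index to the flat F-order index (hypothetical helper)."""
    multi = np.unravel_index(c_idx, shape)                # n-dim indices, C order by default
    return np.ravel_multi_index(multi, shape, order="F")  # flatten again in Fortran order

a = np.arange(30).reshape(2, 3, 5)
flat_c = a.ravel()
flat_f = a.ravel(order="F")

# Every C-order position must map to the F-order position holding the same value.
ok = all(flat_f[c_to_f_index(c, a.shape)] == flat_c[c] for c in range(a.size))
print(ok)
```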
Suppose your matrix is of shape (nrow,ncol). Then the 1D index in C order for the (irow,icol) entry is given by
idxc = ncol*irow + icol
In the above equation, you know idxc. Then,
icol = idxc % ncol
Now you can find irow
irow = (idxc - icol) // ncol
Now you know both irow and icol. You can use them to get the F index. I think the F index will be given by
idxf = nrow*icol + irow
Please double-check my math, I might have got something wrong...
For the 3D case, if your array has dimensions [n1][n2][n3], then the raveled C-index for [i1][i2][i3] is
idxc = n2*n3*i1 + n3*i2 + i3
Using modulo operations similar to the 2D case, we can recover i1, i2, i3 and then convert to the raveled F index, i.e.
n3*i2 + i3 = idxc % (n2*n3)
i3 = (n3*i2 + i3) % n3
i2 = ((n3*i2 + i3) - i3) // n3
i1 = (idxc - (n3*i2 + i3)) // (n2*n3)
F index would be:
idxf = i1 + n1*i2 +n1*n2*i3
Please check my math.
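The arithmetic above can be checked with a short pure-Python sketch. The function names are made up for illustration; divmod does the integer division and the modulo in one step.

```python
def c_to_f_2d(idxc, nrow, ncol):
    """2D case: recover (irow, icol) from the C index, then build the F index."""
    irow, icol = divmod(idxc, ncol)
    return nrow * icol + irow

def c_to_f_3d(idxc, n1, n2, n3):
    """3D case for an array with dimensions [n1][n2][n3]."""
    i1, rest = divmod(idxc, n2 * n3)  # rest = n3*i2 + i3
    i2, i3 = divmod(rest, n3)
    return i1 + n1 * i2 + n1 * n2 * i3

print(c_to_f_2d(2, 2, 2))      # the 2x2 example from the question
print(c_to_f_3d(21, 2, 3, 5))  # a 3D example with shape (2, 3, 5)
```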
In simple cases you may also get away with transposing and ravelling the array:
import numpy as np
x = np.arange(2 * 2).reshape(2, 2)
print(x)
# [[0 1]
# [2 3]]
print(x.ravel())
# [0 1 2 3]
print(x.transpose().ravel())
# [0 2 1 3]
x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(x)
# [[[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
# [[12 13 14 15]
# [16 17 18 19]
# [20 21 22 23]]]
print(x.ravel())
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
print(x.transpose().ravel())
# [ 0 12 4 16 8 20 1 13 5 17 9 21 2 14 6 18 10 22 3 15 7 19 11 23]
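If you need the mapping for every index at once, a lookup table built by reshaping may be enough. This is a sketch; f_of_c is a made-up name for the table.

```python
import numpy as np

shape = (2, 3, 4)
n = int(np.prod(shape))

# Fill an array with F-order positions, then read it back in C order:
# entry c of the result is the F-order index of the element at C-order index c.
f_of_c = np.arange(n).reshape(shape, order="F").ravel()

x = np.arange(n).reshape(shape)
# Looking up the F-order flattening through the table reproduces the C-order flattening.
print(np.array_equal(x.ravel(order="F")[f_of_c], x.ravel()))
```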

Slicing a 2D NumPy Array by all zero rows

This is essentially the 2D array equivalent of slicing a python list into smaller lists at indexes that store a particular value. I'm running a program that extracts a large amount of data out of a CSV file and copies it into a 2D NumPy array. The basic format of these arrays is something like this:
[[0 8 9 10]
[9 9 1 4]
[0 0 0 0]
[1 2 1 4]
[0 0 0 0]
[1 1 1 2]
[39 23 10 1]]
I want to separate my NumPy array along rows that contain all zero values to create a set of smaller 2D arrays. The successful result for the above starting array would be the arrays:
[[0 8 9 10]
[9 9 1 4]]
[[1 2 1 4]]
[[1 1 1 2]
[39 23 10 1]]
I've thought about simply iterating down the array and checking if the row has all zeros but the data I'm handling is substantially large. I have potentially millions of rows of data in the text file and I'm trying to find the most efficient approach as opposed to a loop that could waste computation time. What are your thoughts on what I should do? Is there a better way?
a is your array. You can use any to find the all-zero rows, remove them, and then use split to split the remaining rows at the (shifted) indices of the removed rows:
#not_all_zero rows indices
idx = np.flatnonzero(a.any(1))
#all_zero rows indices
idx_zero = np.delete(np.arange(a.shape[0]),idx)
#select not_all_zero rows and split by all_zero row indices
output = np.split(a[idx],idx_zero-np.arange(idx_zero.size))
output:
[array([[ 0, 8, 9, 10],
[ 9, 9, 1, 4]]),
array([[1, 2, 1, 4]]),
array([[ 1, 1, 1, 2],
[39, 23, 10, 1]])]
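One caveat with this approach: adjacent all-zero rows (or a zero row at the very start or end) make np.split return empty pieces. A sketch of filtering them out, on made-up data:

```python
import numpy as np

# Hypothetical data with two adjacent all-zero rows
a = np.array([[1, 1],
              [2, 2],
              [0, 0],
              [0, 0],
              [3, 3]])

idx = np.flatnonzero(a.any(axis=1))               # rows that are not all zero
idx_zero = np.delete(np.arange(a.shape[0]), idx)  # rows that are all zero
parts = np.split(a[idx], idx_zero - np.arange(idx_zero.size))

# Drop the empty pieces produced by adjacent (or leading/trailing) zero rows.
parts = [p for p in parts if p.shape[0] > 0]
print(parts)
```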
You can use the np.all function to check for rows which are all zeros, and then index appropriately.
# assume `x` is your data
indices = np.all(x == 0, axis=1)
zeros = x[indices]
nonzeros = x[np.logical_not(indices)]
The all function accepts an axis argument (as do many NumPy functions), which indicates the axis along which to operate. 1 here means to do the reduction along rows, so you get back a boolean array of shape (x.shape[0],), which can be used to directly index x.
Note that this will be much faster than a for-loop over the rows, especially for large arrays.
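Applied to the array from the question, the mask looks like this (a sketch). Note that the mask only separates zero rows from non-zero rows; to get the groups, the positions of the zero rows still have to be fed to something like np.split.

```python
import numpy as np

x = np.array([[0, 8, 9, 10],
              [9, 9, 1, 4],
              [0, 0, 0, 0],
              [1, 2, 1, 4],
              [0, 0, 0, 0],
              [1, 1, 1, 2],
              [39, 23, 10, 1]])

indices = np.all(x == 0, axis=1)  # one boolean per row
print(indices)
print(np.flatnonzero(indices))    # row numbers of the all-zero rows
print(x[~indices].shape)          # the non-zero rows that remain
```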

What does numpy.ix_() function do and what is the output used for?

Below is the output of the numpy.ix_() function. What is the output used for? Its structure is quite unique.
>>> import numpy as np
>>> gfg = np.ix_([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16], [21, 22, 23, 24, 25, 26], [31, 32, 33, 34, 35, 36] )
>>> gfg
(array([[[[1]]],
[[[2]]],
[[[3]]],
[[[4]]],
[[[5]]],
[[[6]]]]),
array([[[[11]],
[[12]],
[[13]],
[[14]],
[[15]],
[[16]]]]),
array([[[[21],
[22],
[23],
[24],
[25],
[26]]]]),
array([[[[31, 32, 33, 34, 35, 36]]]]))
According to numpy doc:
Construct an open mesh from multiple sequences.
This function takes N 1-D sequences and returns N outputs with N dimensions each, such that the shape is 1 in all but one dimension and the dimension with the non-unit shape value cycles through all N dimensions.
Using ix_ one can quickly construct index arrays that will index the cross product. a[np.ix_([1,3],[2,5])] returns the array [[a[1,2] a[1,5]], [a[3,2] a[3,5]]].
numpy.ix_()'s main use is to create an open mesh so that we can use it to select specific indices from an array (specific sub-array). An easy example to understand it is:
Say you have a 2D array of shape (5,5), and you would like to select a sub-array that is constructed by selecting the rows 1 and 3 and columns 0 and 3. You can use np.ix_ to create a (index) mesh so as to be able to select the sub-array as follows in the example below:
a = np.arange(5*5).reshape(5,5)
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
sub_indices = np.ix_([1,3],[0,3])
(array([[1],
[3]]), array([[0, 3]]))
a[sub_indices]
[[ 5 8]
[15 18]]
which is basically the selected sub-array from a that is in rows array([[1],[3]]) and columns array([[0, 3]]):
 col 0    col 3
   |        |
   v        v
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]   <- row 1
 [10 11 12 13 14]
 [15 16 17 18 19]   <- row 3
 [20 21 22 23 24]]
Please note that the N arrays np.ix_ returns for the N 1-D inputs are ordered by axis: the first one is for rows, the second one is for columns, the third one is for depth, and so on. That is why in the above example, array([[1],[3]]) is for rows and array([[0, 3]]) is for columns. The same goes for the example OP provided in the question. The reason behind it is the way numpy applies advanced indexing to multi-dimensional arrays: the index arrays are broadcast against each other.
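To see what the open mesh buys you, the sketch below compares np.ix_ with building the broadcastable index arrays by hand, and with passing the plain lists, which pairs the indices element by element instead of forming the cross product. The variable names are made up.

```python
import numpy as np

a = np.arange(5 * 5).reshape(5, 5)
rows, cols = [1, 3], [0, 3]

via_ix = a[np.ix_(rows, cols)]
via_broadcast = a[np.array(rows)[:, None], np.array(cols)]  # the same index arrays, built by hand
via_two_steps = a[rows][:, cols]                            # select rows first, then columns

paired = a[rows, cols]  # without ix_: only the elements (1,0) and (3,3)

print(via_ix)
print(paired)
```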
It's basically used to create N index arrays, each one referring to a different dimension.
For example, if I have a 3d np.ndarray and I want to get only some entries of it, I can use numpy.ix_ to create 3 arrays with shapes like (N,1,1), (1,N,1) and (1,1,N), each containing the indices for one of the 3 axes.
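A small 3D sketch of this, on a made-up array of shape (2,3,4); with two indices per axis, the three index arrays have shapes (2,1,1), (1,2,1) and (1,1,2):

```python
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Pick planes 0 and 1, rows 0 and 2, columns 1 and 3.
idx = np.ix_([0, 1], [0, 2], [1, 3])
print([i.shape for i in idx])

sub = a[idx]  # cross product of the three index lists
print(sub.shape)
print(sub)
```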
Take a look at the examples at numpy documentation page. They're self explanatory.
This function isn't commonly used.
I think it's used in some algebra operations like the cross product and its generalisations.

Add column for squares/cubes/etc for each column in numpy/pandas

I’m trying to take a set of data that consists of N rows, and expand each row to include the squares/cube/etc of each column in that row (what power to go up to is determined by a variable j). The data starts out as a pandas DataFrame but can be turned into a numpy array.
For example:
If the row is [3,2] and j is 3, the row should be transformed to [3, 2, 9, 4, 27, 8]
I currently have a semi working version that consists of a bunch of nested for loops and is pretty ugly. I’m hoping for a cleaner way to make this transformation so things will be a bit easier for me to debug.
The behavior I’m looking for is basically the same as sklearn's PolynomialFeatures, but I’m trying to do it in numpy and/or pandas only.
Thanks!
Use NumPy broadcasting for a vectorized solution -
In [66]: a = np.array([3,2])
In [67]: j = 3
In [68]: a**np.arange(1,j+1)[:,None]
Out[68]:
array([[ 3, 2],
[ 9, 4],
[27, 8]])
There's also a NumPy builtin, np.vander -
In [142]: np.vander(a,j+1).T[::-1][1:]
Out[142]:
array([[ 3, 2],
[ 9, 4],
[27, 8]])
Or with the increasing flag set to True -
In [180]: np.vander(a,j+1,increasing=True).T[1:]
Out[180]:
array([[ 3, 2],
[ 9, 4],
[27, 8]])
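The outputs above stack the powers as rows for a single input row. To get the layout from the question ([3, 2, 9, 4, 27, 8]) for every row of a 2D array, the same broadcasting idea can be followed by a reshape. This is a sketch; X and its second row are made-up data.

```python
import numpy as np

X = np.array([[3, 2],
              [1, 4]])
j = 3

powers = np.arange(1, j + 1)
# (N, 1, d) ** (1, j, 1) broadcasts to (N, j, d); flatten the last two axes per row.
expanded = (X[:, None, :] ** powers[None, :, None]).reshape(X.shape[0], -1)
print(expanded)
```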
Try concat with the ignore_index option to avoid duplicate column names:
df = pd.DataFrame(np.arange(9).reshape(3,3))
j = 3
pd.concat([df**i for i in range(1,j+1)], axis=1,ignore_index=True)
Output:
   0  1  2   3   4   5    6    7    8
0  0  1  2   0   1   4    0    1    8
1  3  4  5   9  16  25   27   64  125
2  6  7  8  36  49  64  216  343  512
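If you would rather keep track of which column holds which power than renumber the columns, one option is to label them with add_suffix. This is a sketch; the column names a and b and the ^p naming scheme are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [3, 1], "b": [2, 4]})
j = 3

# One block of columns per power p, each labelled like "a^2".
out = pd.concat([df.pow(p).add_suffix(f"^{p}") for p in range(1, j + 1)], axis=1)
print(out)
```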
