Let's suppose we have a two-dimensional array arr1 with shape (10, 4):
import numpy as np
np.set_printoptions(suppress=True)
np.random.seed(0)
a = np.random.rand(10, 3)
np.random.seed(0)
b = np.random.randint(24, size=10)
b = b.reshape(len(b),1)
arr1 = np.hstack((a, b))
print(arr1)
arr1 looks like this:
[[ 0.5488135 0.71518937 0.60276338 12. ]
[ 0.54488318 0.4236548 0.64589411 15. ]
[ 0.43758721 0.891773 0.96366276 21. ]
[ 0.38344152 0.79172504 0.52889492 0. ]
[ 0.56804456 0.92559664 0.07103606 3. ]
[ 0.0871293 0.0202184 0.83261985 3. ]
[ 0.77815675 0.87001215 0.97861834 7. ]
[ 0.79915856 0.46147936 0.78052918 9. ]
[ 0.11827443 0.63992102 0.14335329 19. ]
[ 0.94466892 0.52184832 0.41466194 21. ]]
Now, some external process is performed, so we lose the information in the last column and some rows are filtered out:
arr2 = arr1[:,:3]
np.random.seed(0)
filter_arr = np.random.choice(10, size=6, replace=False)
arr2 = arr2[filter_arr]
print(arr2)
As a result, we got the following array arr2:
[[0.43758721 0.891773 0.96366276]
[0.11827443 0.63992102 0.14335329]
[0.56804456 0.92559664 0.07103606]
[0.94466892 0.52184832 0.41466194]
[0.54488318 0.4236548 0.64589411]
[0.77815675 0.87001215 0.97861834]]
The goal is to efficiently determine, based on the values in the first three columns, which rows of arr1 remain in arr2, and to append the corresponding values from the fourth column of arr1 to arr2. Of course, filter_arr from the previous step is completely unknown.
The expected result would be this:
[[0.43758721 0.891773 0.96366276 21. ]
[0.11827443 0.63992102 0.14335329 19. ]
[0.56804456 0.92559664 0.07103606 3. ]
[0.94466892 0.52184832 0.41466194 21. ]
[0.54488318 0.4236548 0.64589411 15. ]
[0.77815675 0.87001215 0.97861834 7. ]]
Thanks.
P.S. If you come up with a better title for this question, please, just let me know to change it in order to be more useful to other users.
If you can fit the broadcasting in memory, you can compare the first three columns of arr1 with arr2, then use numpy.argmax after numpy.all (along the right axis) to retrieve the index of each row of arr2 within arr1, and finally use those indices to get the last column of arr1 and stack it with arr2.
As Jérôme Richard pointed out, it is not safe to use == to compare floats; you can use the numpy.isclose function and adjust the tolerance to your needs.
Regarding the possibility of having np.nan values, np.isclose also accepts the parameter equal_nan, which is False by default but can be set to True so that np.isclose(np.nan, np.nan, equal_nan=True) returns True (if that's the behaviour you want).
import numpy as np
# Broadcast arr1's first 3 columns, shape (10, 3, 1), against arr2.T, shape (1, 3, 6),
# compare element-wise (tolerances adjustable via np.isclose's rtol/atol), and take the
# index of the matching row of arr1 for each row of arr2.
idx = np.argmax(np.all(np.isclose(arr1[:, :3, None], arr2.T[None, :, :]), axis=1), axis=0)
# Pick the last column of arr1 for those rows (kept 2D with None) and append it to arr2.
filtered_last_col = arr1[idx, -1, None]
np.hstack([arr2, filtered_last_col])
array([[ 0.43758721, 0.891773 , 0.96366276, 21. ],
[ 0.11827443, 0.63992102, 0.14335329, 19. ],
[ 0.56804456, 0.92559664, 0.07103606, 3. ],
[ 0.94466892, 0.52184832, 0.41466194, 21. ],
[ 0.54488318, 0.4236548 , 0.64589411, 15. ],
[ 0.77815675, 0.87001215, 0.97861834, 7. ]])
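If the full broadcast does not fit in memory, a possible fallback (not part of the answer above, just a sketch of the same matching idea applied row by row) is:
import numpy as np

# Less memory-hungry: match one row of arr2 at a time instead of building the
# full (len(arr1), 3, len(arr2)) comparison array at once.
idx = np.array([np.argmax(np.all(np.isclose(arr1[:, :3], row), axis=1)) for row in arr2])
result = np.hstack([arr2, arr1[idx, -1, None]])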
I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0. 10. ]
[ 0.13216 12.11837]
[ 0.25379 42.05027]
[ 0.30874 13.11784]]
How can I normalize the Y component of this array, so that it gives me something like:
[[ 0. 0. ]
[ 0.13216 0.06 ]
[ 0.25379 1 ]
[ 0.30874 0.097]]
Referring to this Cross Validated Link, How to normalize data to 0-1 range?, it looks like you can perform min-max normalisation on the last column of foo.
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0. , 0. ],
[ 0.13216 , 0.06609523],
[ 0.25379 , 1. ],
[ 0.30874 , 0.09727968]])
Another option for performing normalisation (as suggested by OP) is using sklearn.preprocessing.normalize, which yields slightly different results -
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0. , 0.2378106 ],
[ 0.13216 , 0.28818769],
[ 0.25379 , 1. ],
[ 0.30874 , 0.31195614]])
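The results differ because, as far as I know, norm='max' divides each value by the column's maximum absolute value rather than rescaling the [min, max] range to [0, 1]; a quick check of that assumption:
import numpy as np

v = np.array([10.0, 12.11837, 42.05027, 13.11784])  # the original second column of foo
print(v / np.abs(v).max())  # should reproduce the normalize(..., norm='max') output above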
sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:, 1].reshape(-1, 1)      # the scaler expects a 2D array of shape (n_samples, n_features)
v_scaled = min_max_scaler.fit_transform(v)
foo[:, 1] = v_scaled.ravel()      # flatten back to 1D before assigning to the column
print(foo)
Output:
[[ 0. 0. ]
[ 0.13216 0.06609523]
[ 0.25379 1. ]
[ 0.30874 0.09727968]]
The advantage is that scaling to any range can be done.
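For example, a different target range can be requested through feature_range (a small sketch, assuming foo still holds its original values):
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
foo[:, 1] = scaler.fit_transform(foo[:, 1].reshape(-1, 1)).ravel()
# The second column is now scaled to [-1, 1] instead of [0, 1].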
I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())
You are trying to min-max scale only the second column between 0 and 1.
Using sklearn.preprocessing.minmax_scale should easily solve your problem.
e.g.:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0. , 0. ],
[0.13216 , 0.06609523],
[0.25379 , 1. ],
[0.30874 , 0.09727968]])
Maybe you want to min-max scale between 0 and 1 both columns. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0. , 0. ],
[0.42806245, 0.06609523],
[0.82201853, 1. ],
[1. , 0.09727968]])
Note: not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
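For illustration, that other operation, scaling a vector to unit length, would look roughly like this (a sketch on the original second column of foo):
import numpy as np

v = foo[:, 1]
unit = v / np.linalg.norm(v)   # divide by the L2 norm so the resulting vector has length 1
print(np.linalg.norm(unit))    # approximately 1.0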
I have the following numpy array:
import numpy as np
np.random.seed(20)
np.random.rand(20).reshape(5, 4)
array([[ 0.5881308 , 0.89771373, 0.89153073, 0.81583748],
[ 0.03588959, 0.69175758, 0.37868094, 0.51851095],
[ 0.65795147, 0.19385022, 0.2723164 , 0.71860593],
[ 0.78300361, 0.85032764, 0.77524489, 0.03666431],
[ 0.11669374, 0.7512807 , 0.23921822, 0.25480601]])
For each column, I would like to slice it starting at the following positions:
position_for_slicing=[0, 3, 4, 4]
So I will get following array:
array([[ 0.5881308 , 0.85032764, 0.23921822, 0.81583748],
       [ 0.03588959, 0.7512807 , 0, 0],
       [ 0.65795147, 0, 0, 0],
       [ 0.78300361, 0, 0, 0],
       [ 0.11669374, 0, 0, 0]])
Is there a fast way to do this? I know I can use a for loop over each column, but I was wondering if there is a more elegant way.
If "elegant" means "no loop" the following would qualify, but probably not under many other definitions (arr is your input array):
m, n = arr.shape
arrf = np.asanyarray(arr, order='F')       # Fortran order: each column is contiguous in memory
padded = np.r_[arrf, np.zeros_like(arrf)]  # append m rows of zeros below the data
assert padded.flags['F_CONTIGUOUS']
# Strided view of shape (m, m+1, n) in which expnd[i, j, k] == padded[i + j, k],
# i.e. axis 1 selects how far each column is shifted upwards into the zero padding.
expnd = np.lib.stride_tricks.as_strided(padded, (m, m+1, n), padded.strides[:1] + padded.strides)
expnd[:, [0,3,4,4], range(4)]
# array([[ 0.5881308 , 0.85032764, 0.23921822, 0.25480601],
# [ 0.03588959, 0.7512807 , 0. , 0. ],
# [ 0.65795147, 0. , 0. , 0. ],
# [ 0.78300361, 0. , 0. , 0. ],
# [ 0.11669374, 0. , 0. , 0. ]])
Please note that order='C' and then 'C_CONTIGUOUS' in the assertion also works. My hunch is that 'F' could be a bit faster because the indexing then operates on contiguous slices.
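For comparison, the straightforward per-column loop the question alludes to might look like this (a minimal sketch, assuming positions = [0, 3, 4, 4] and arr as above):
import numpy as np

positions = [0, 3, 4, 4]
out = np.zeros_like(arr)
for k, p in enumerate(positions):
    col = arr[p:, k]           # slice column k starting at row p
    out[:len(col), k] = col    # shift the slice to the top; the remainder stays zero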
I have a matrix A with 1.6M rows and 400 columns.
One of the columns in A (call it the output column) has binary values (0,1) with a predominance of 0's.
I want to create a new matrix B (same shape as A) by sampling rows of A with replacement, such that the distribution of 0's and 1's in the output column of B becomes 50/50.
What is an efficient way to do this using Python/NumPy?
You could do this by:
Creating a list of all rows with 0 in the "output column" (called outputZeros), and a list of all rows with 1 in the output column (called outputOnes); then,
Sampling with replacement from outputZeros and outputOnes 1.6M times.
Here's a small example. It's not clear to me whether you want the rows in B to be in any particular order, so here the rows sampled from the 0s come first, followed by the rows sampled from the 1s.
In [1]: import numpy as np, random
In [2]: A = np.random.rand(10, 2)
In [3]: A
In [4]: A[:7, 1] = 0
In [5]: A[7:, 1] = 1
In [6]: A
Out[6]:
array([[ 0.70126052, 0. ],
[ 0.51161067, 0. ],
[ 0.76511966, 0. ],
[ 0.91257144, 0. ],
[ 0.97024895, 0. ],
[ 0.55817776, 0. ],
[ 0.55963466, 0. ],
[ 0.6318139 , 1. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ]])
In [7]: outputZeros = np.where(A[:, 1] == 0)[0]
In [8]: outputZeros
Out[8]: array([0, 1, 2, 3, 4, 5, 6])
In [9]: outputOnes = np.where(A[:, 1] == 1)[0]
In [10]: outputOnes
Out[10]: array([7, 8, 9])
In [11]: B = np.zeros((10, 2))
In [12]: for i in range(10):
if i < 5:
B[i, :] = A[random.choice(outputZeros), :]
else:
B[i, :] = A[random.choice(outputOnes), :]
....:
In [13]: B
Out[13]:
array([[ 0.97024895, 0. ],
[ 0.97024895, 0. ],
[ 0.76511966, 0. ],
[ 0.76511966, 0. ],
[ 0.51161067, 0. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ],
[ 0.6318139 , 1. ],
[ 0.6318139 , 1. ],
[ 0.76033151, 1. ]])
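For the full 1.6M-row case, the same idea can be vectorized by sampling index arrays instead of looping over rows (a sketch, assuming A and the binary output column as above):
import numpy as np

out_col = 1                                    # index of the binary "output" column
zeros = np.where(A[:, out_col] == 0)[0]
ones = np.where(A[:, out_col] == 1)[0]
n = len(A)
# Draw half of the rows from each group, with replacement.
idx = np.concatenate([np.random.choice(zeros, n // 2, replace=True),
                      np.random.choice(ones, n - n // 2, replace=True)])
B = A[idx]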
I would create a new 1D numpy array filled with values from numpy.random.random_integers(low, high=None, size=None) (since deprecated in favour of numpy.random.randint) and swap that new array with the old one.
I have a ndarray that looks like this:
array([[ -2.1e+00, -9.89644000e-03],
[ -2.2e+00, 0.00000000e+00],
[ -2.3e+00, 2.33447000e-02],
[ -2.4e+00, 5.22411000e-02]])
What's the most pythonic way to add the integer 2 to the first column, to give:
array([[ -0.1e+00, -9.89644000e-03],
[ -0.2e+00, 0.00000000e+00],
[ -0.3e+00, 2.33447000e-02],
[ -0.4e+00, 5.22411000e-02]])
Edit:
To add 2 to the first column only, do
>>> import numpy as np
>>> x = np.array([[ -2.1e+00, -9.89644000e-03],
[ -2.2e+00, 0.00000000e+00],
[ -2.3e+00, 2.33447000e-02],
[ -2.4e+00, 5.22411000e-02]])
>>> x[:,0] += 2 # : selects all rows, 0 selects first column
>>> x
array([[-0.1, -0.00989644],
[-0.2, 0. ],
[-0.3, 0.0233447 ],
[-0.4, 0.0522411 ]])
>>> import numpy as np
>>> x = np.array([[ -2.1e+00, -9.89644000e-03],
[ -2.2e+00, 0.00000000e+00],
[ -2.3e+00, 2.33447000e-02],
[ -2.4e+00, 5.22411000e-02]])
>>> x + 2 # note: this adds 2 to every element, not just the first column
array([[-0.1, 1.99010356],
[-0.2, 2. ],
[-0.3, 2.0233447 ],
[-0.4, 2.0522411 ]])
Perhaps the Numpy Tutorial may help you.