Sample with replacement from existing array - python

I have a matrix A with shape 1.6M rows and 400 columns.
One of the columns in A (call it the output column) has binary values (0,1) with a predominance of 0's.
I want to create a new matrix B (same shape as A) by sampling rows in A with replacement such, that the distribution of 0's & 1's in the output column in B becomes 50/50.
What is the efficient way to do this using python/numpy?

You could do this by:
Creating a list of all rows with 0 in the "output column" (called outputZeros), and a list of all rows with 1 in the output column (called outputOnes); then,
Sampling with replacement from outputZeros and outputOnes 1.6M times.
Here's a small example. It's not clear to me if you want the rows in B to be in any particular order, so here they first include 0s, then include 1s.
In [1]: import numpy as np, random
In [2]: A = np.random.rand(10, 2)
In [3]: A
In [4]: A[:7, 1] = 0
In [5]: A[7:, 1] = 1
In [6]: A
Out[6]:
array([[ 0.70126052, 0. ],
[ 0.51161067, 0. ],
[ 0.76511966, 0. ],
[ 0.91257144, 0. ],
[ 0.97024895, 0. ],
[ 0.55817776, 0. ],
[ 0.55963466, 0. ],
[ 0.6318139 , 1. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ]])
In [7]: outputZeros = np.where(A[:, 1] == 0)[0]
In [8]: outputZeros
Out[8]: array([0, 1, 2, 3, 4, 5, 6])
In [9]: outputOnes
Out[9]: array([7, 8, 9])
In [10]: outputOnes = np.where(A[:, 1] == 1)[0]
In [11]: B = np.zeros((10, 2))
In [12]: for i in range(10):
if i < 5:
B[i, :] = A[random.choice(outputZeros), :]
else:
B[i, :] = A[random.choice(outputOnes), :]
....:
In [13]: B
Out[13]:
array([[ 0.97024895, 0. ],
[ 0.97024895, 0. ],
[ 0.76511966, 0. ],
[ 0.76511966, 0. ],
[ 0.51161067, 0. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ],
[ 0.6318139 , 1. ],
[ 0.6318139 , 1. ],
[ 0.76033151, 1. ]])

I would create a new 1D numpy array filled with values from numpy.random.random_integers(low, high=None, size=None) and swap that new array with the old one.

Related

Numpy: How to stack a single array into each row of a bigger array and turn it into a 2D array?

I have a numpy array named heartbeats with 100 rows. Each row has 5 elements.
I also have a single array named time_index with 5 elements.
I need to prepend the time index to each row of heartbeats.
heartbeats = np.array([
[-0.58, -0.57, -0.55, -0.39, -0.40],
[-0.31, -0.31, -0.32, -0.46, -0.46]
])
time_index = np.array([-2, -1, 0, 1, 2])
What I need:
array([-2, -0.58],
[-1, -0.57],
[0, -0.55],
[1, -0.39],
[2, -0.40],
[-2, -0.31],
[-1, -0.31],
[0, -0.32],
[1, -0.46],
[2, -0.46])
I only wrote two rows of heartbeats to illustrate.
Assuming you are using numpy, the exact output array you are looking for can be made by stacking a repeated version of time_index with the raveled version of heartbeats:
np.stack((np.tile(time_index, len(heartbeats)), heartbeats.ravel()), axis=-1)
Another approach, using broadcasting
In [13]: heartbeats = np.array([
...: [-0.58, -0.57, -0.55, -0.39, -0.40],
...: [-0.31, -0.31, -0.32, -0.46, -0.46]
...: ])
...: time_index = np.array([-2, -1, 0, 1, 2])
Make a target array:
In [14]: res = np.zeros(heartbeats.shape + (2,), heartbeats.dtype)
In [15]: res[:,:,1] = heartbeats # insert a (2,5) into a (2,5) slot
In [17]: res[:,:,0] = time_index[None] # insert a (5,) into a (2,5) slot
In [18]: res
Out[18]:
array([[[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ]],
[[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]]])
and then reshape to 2d:
In [19]: res.reshape(-1,2)
Out[19]:
array([[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ],
[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]])
[17] takes a (5,), expands it to (1,5), and then to (2,5) for the insert. Read up on broadcasting.
As an alternative way, you can repeat time_index by np.concatenate based on the specified times:
concatenated = np.concatenate([time_index] * heartbeats.shape[0])
# [-2 -1 0 1 2 -2 -1 0 1 2]
# result = np.dstack((concatenated, heartbeats.reshape(-1))).squeeze()
result = np.array([concatenated, heartbeats.reshape(-1)]).T
Using np.concatenate may be faster than np.tile. This solution is faster than Mad Physicist, but the fastest is using broadcasting as hpaulj's answer.

Numpy array minmax normalization python [duplicate]

I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0. 10. ]
[ 0.13216 12.11837]
[ 0.25379 42.05027]
[ 0.30874 13.11784]]
How can I normalize the Y component of this array. So it gives me something like:
[[ 0. 0. ]
[ 0.13216 0.06 ]
[ 0.25379 1 ]
[ 0.30874 0.097]]
Referring to this Cross Validated Link, How to normalize data to 0-1 range?, it looks like you can perform min-max normalisation on the last column of foo.
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0. , 0. ],
[ 0.13216 , 0.06609523],
[ 0.25379 , 1. ],
[ 0.30874 , 0.09727968]])
Another option for performing normalisation (as suggested by OP) is using sklearn.preprocessing.normalize, which yields slightly different results -
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0. , 0.2378106 ],
[ 0.13216 , 0.28818769],
[ 0.25379 , 1. ],
[ 0.30874 , 0.31195614]])
sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:,1]
v_scaled = min_max_scaler.fit_transform(v)
foo[:,1] = v_scaled
print(foo)
Output:
[[ 0. 0. ]
[ 0.13216 0.06609523]
[ 0.25379 1. ]
[ 0.30874 0.09727968]]
Advantage is that scaling to any range can be done.
I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())
You are trying to min-max scale between 0 and 1 only the second column.
Using sklearn.preprocessing.minmax_scale, should easily solve your problem.
e.g.:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0. , 0. ],
[0.13216 , 0.06609523],
[0.25379 , 1. ],
[0.30874 , 0.09727968]])
Maybe you want to min-max scale between 0 and 1 both columns. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0. , 0. ],
[0.42806245, 0.06609523],
[0.82201853, 1. ],
[1. , 0.09727968]])
note: Not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.

Printing numpy with different position in the column

I have following numpy array
import numpy as np
np.random.seed(20)
np.random.rand(20).reshape(5, 4)
array([[ 0.5881308 , 0.89771373, 0.89153073, 0.81583748],
[ 0.03588959, 0.69175758, 0.37868094, 0.51851095],
[ 0.65795147, 0.19385022, 0.2723164 , 0.71860593],
[ 0.78300361, 0.85032764, 0.77524489, 0.03666431],
[ 0.11669374, 0.7512807 , 0.23921822, 0.25480601]])
For each column I would like to slice it in positions:
position_for_slicing=[0, 3, 4, 4]
So I will get following array:
array([[ 0.5881308 , 0.85032764, 0.23921822, 0.81583748],
[ 0.03588959, 0.7512807 , 0 0],
[ 0.65795147, 0, 0 0],
[ 0.78300361, 0, 0 0],
[ 0.11669374, 0, 0 0]])
Is there fast way to do this ? I know I can use to do for loop for each column, but I was wondering if there is more elegant way to do this.
If "elegant" means "no loop" the following would qualify, but probably not under many other definitions (arr is your input array):
m, n = arr.shape
arrf = np.asanyarray(arr, order='F')
padded = np.r_[arrf, np.zeros_like(arrf)]
assert padded.flags['F_CONTIGUOUS']
expnd = np.lib.stride_tricks.as_strided(padded, (m, m+1, n), padded.strides[:1] + padded.strides)
expnd[:, [0,3,4,4], range(4)]
# array([[ 0.5881308 , 0.85032764, 0.23921822, 0.25480601],
# [ 0.03588959, 0.7512807 , 0. , 0. ],
# [ 0.65795147, 0. , 0. , 0. ],
# [ 0.78300361, 0. , 0. , 0. ],
# [ 0.11669374, 0. , 0. , 0. ]])
Please note that order='C' and then 'C_CONTIGUOUS' in the assertion also works. My hunch is that 'F' could be a bit faster because the indexing then operates on contiguous slices.

Python - add 1D-array as column of 2D

I want to add a vector as the first column of my 2D array which looks like :
[[ 1. 0. 0. nan]
[ 4. 4. 9.97 1. ]
[ 4. 4. 27.94 1. ]
[ 2. 1. 4.17 1. ]
[ 3. 2. 38.22 1. ]
[ 4. 4. 31.83 1. ]
[ 3. 4. 41.87 1. ]
[ 2. 1. 18.33 1. ]
[ 4. 4. 33.96 1. ]
[ 2. 1. 5.65 1. ]
[ 3. 3. 40.74 1. ]
[ 2. 1. 10.04 1. ]
[ 2. 2. 53.15 1. ]]
I want to add an aray [] of 13 elements as the first column of the matrix. I tried with np.stack_column, np.append but it is for 1D vector or doesn't work because I can't chose axis=1 and only do np.append(peak_values, results)
I have a very simple option for you using numpy -
x = np.array( [[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.942777 , -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427767 ,-4.297677 ],
[ 3.9427767, -4.297677 ],
[ 3.9427772 ,-4.297677 ]])
b = np.arange(10).reshape(-1,1)
np.concatenate((b.T, x), axis=1)
Output-
array([[ 0. , 3.9427767, -4.297677 ],
[ 1. , 3.9427767, -4.297677 ],
[ 2. , 3.9427767, -4.297677 ],
[ 3. , 3.9427767, -4.297677 ],
[ 4. , 3.942777 , -4.297677 ],
[ 5. , 3.9427767, -4.297677 ],
[ 6. , 3.9427767, -4.297677 ],
[ 7. , 3.9427767, -4.297677 ],
[ 8. , 3.9427767, -4.297677 ],
[ 9. , 3.9427772, -4.297677 ]])
Improving on this answer by removing the unnecessary transposition, you can indeed use reshape(-1, 1) to transform the 1d array you'd like to prepend along axis 1 to the 2d array to a 2d array with a single column. At this point, the arrays only differ in shape along the second axis and np.concatenate accepts the arguments:
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> b = np.arange(3)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> b
array([0, 1, 2])
>>> b.reshape(-1, 1) # preview the reshaping...
array([[0],
[1],
[2]])
>>> np.concatenate((b.reshape(-1, 1), a), axis=1)
array([[ 0, 0, 1, 2, 3],
[ 1, 4, 5, 6, 7],
[ 2, 8, 9, 10, 11]])
For the simplest answer, you probably don't even need numpy.
Try the following:
new_array = []
new_array.append(your_array)
That's it.
I would suggest using Numpy. It will allow you to easily do what you want.
Here is an example of squaring the entire set. you can use something like nums[0].
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print even_squares # Prints "[0, 4, 16]"

numpy: get indices value in a separate column

How to get indices value in a separate column as effienctly as possible? i know how to do this in a loop, but i wonder what other ways there are?
from this ndarray
[[ 0.71587892 0.72278279 ]
[ 0.72225173 0.73340414 ]
[ 0.7259692 0.72862454 ]]
to this
[[0 0.71587892 0.72278279 ]
[1 0.72225173 0.73340414 ]
[2 0.7259692 0.72862454 ]]
How about
np.column_stack((np.arange(len(a)), a))
where a is your initial array?
Take a look at the following IPython-session:
In [1]: import numpy as np
In [2]: a = np.array([[0.71587892, 0.72278279],
...: [ 0.72225173, 0.73340414],
...: [ 0.7259692, 0.72862454]])
In [3]: np.column_stack((np.arange(len(a)), a))
Out[3]:
array([[ 0. , 0.71587892, 0.72278279],
[ 1. , 0.72225173, 0.73340414],
[ 2. , 0.7259692 , 0.72862454]])

Categories