Numpy array minmax normalization python [duplicate]

I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0.      10.     ]
 [ 0.13216 12.11837]
 [ 0.25379 42.05027]
 [ 0.30874 13.11784]]
How can I normalize the Y component of this array so that it gives me something like:
[[ 0. 0. ]
[ 0.13216 0.06 ]
[ 0.25379 1 ]
[ 0.30874 0.097]]

Referring to this Cross Validated question, How to normalize data to 0-1 range?, you can perform min-max normalisation on the last column of foo:
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0. , 0. ],
[ 0.13216 , 0.06609523],
[ 0.25379 , 1. ],
[ 0.30874 , 0.09727968]])
Another option for performing normalisation (as suggested by the OP) is sklearn.preprocessing.normalize. With norm='max' it divides the column by its maximum rather than shifting and scaling by the range, so it yields slightly different results -
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0. , 0.2378106 ],
[ 0.13216 , 0.28818769],
[ 0.25379 , 1. ],
[ 0.30874 , 0.31195614]])

sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:, 1].reshape(-1, 1)  # sklearn expects a 2D array of shape (n_samples, n_features)
v_scaled = min_max_scaler.fit_transform(v)
foo[:, 1] = v_scaled.ravel()
print(foo)
Output:
[[ 0. 0. ]
[ 0.13216 0.06609523]
[ 0.25379 1. ]
[ 0.30874 0.09727968]]
The advantage is that scaling to any range can be done via the feature_range argument.
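For example, a minimal sketch reusing foo from above (the (-1, 1) target range here is just for illustration):
import numpy as np
from sklearn import preprocessing
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
# reshape to 2D because sklearn scalers expect (n_samples, n_features)
foo[:, 1] = scaler.fit_transform(foo[:, 1].reshape(-1, 1)).ravel()
print(foo)  # second column now spans [-1, 1]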

I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())

You are trying to min-max scale between 0 and 1 only the second column.
Using sklearn.preprocessing.minmax_scale should easily solve your problem.
e.g.:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0. , 0. ],
[0.13216 , 0.06609523],
[0.25379 , 1. ],
[0.30874 , 0.09727968]])
Maybe you want to min-max scale between 0 and 1 both columns. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0. , 0. ],
[0.42806245, 0.06609523],
[0.82201853, 1. ],
[1. , 0.09727968]])
Note: not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
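A small sketch of that other operation, for contrast (values taken from foo's second column):
import numpy as np
v = np.array([10.0, 12.11837, 42.05027, 13.11784])
unit = v / np.linalg.norm(v)  # scales v so its L2 norm (length) is 1
print(np.linalg.norm(unit))   # 1.0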


How to extract an array of same dimension as the original array meeting a condition? [duplicate]

The question sounds very basic. But when I try to use where or boolean conditions on numpy arrays, it always returns a flattened array.
I have the NumPy array
P = array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
I want to extract the array of only negative values, but when I try
P[P<0]
array([-0.19012371, -0.20315014, -0.56598029, -0.21166188, -0.08773241])
P[np.where(P<0)]
array([-0.19012371, -0.20315014, -0.56598029, -0.21166188, -0.08773241])
I get a flattened array. How can I extract the array of the form
array([[ 0,  0, -0.19012371],
       [ 0,  0, -0.20315014],
       [ 0,  0, -0.56598029],
       [ 0, -0.21166188, -0.08773241]])
I do not wish to create a temp array and then use something like Temp[Temp>=0] = 0
Since your need is:
I want to "extract" the array of only negative values
You can use numpy.where() with your condition (checking for negative values), which preserves the dimensions of the array, as in the example below:
In [61]: np.where(P<0, P, 0)
Out[61]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
where P is your input array.
Another idea could be to use numpy.zeros_like() for initializing an array of the same shape, and numpy.where() to gather the indices at which our condition is satisfied.
# initialize our result array with zeros
In [106]: non_positives = np.zeros_like(P)
# gather the indices where our condition is obeyed
In [107]: idxs = np.where(P < 0)
# copy the negative values to correct indices
In [108]: non_positives[idxs] = P[idxs]
In [109]: non_positives
Out[109]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
Yet another idea would be to simply use the barebones numpy.clip() API, which returns a new array if we omit the out= kwarg.
In [22]: np.clip(P, -np.inf, 0) # P.clip(-np.inf, 0)
Out[22]:
array([[ 0. , 0. , -0.19012371],
[ 0. , 0. , -0.20315014],
[ 0. , 0. , -0.56598029],
[ 0. , -0.21166188, -0.08773241]])
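For this particular threshold of 0, an equivalent spelling is np.minimum, which also returns a new array:
np.minimum(P, 0)  # keeps negative entries, replaces the rest with 0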
This should work: get the indexes of all elements that are greater than or equal to 0 and set them to 0; this preserves the dimensions! I got the idea from here: Replace all elements of Python NumPy Array that are greater than some value
Also note that this modifies the original array; no temp array is used here.
import numpy as np
P = np.array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
P[P >= 0] = 0
print(P)
The output will be
[[ 0. 0. -0.19012371]
[ 0. 0. -0.20315014]
[ 0. 0. -0.56598029]
[ 0. -0.21166188 -0.08773241]]
As noted below, this will modify the array, so to preserve the original we should use np.where(P<0, P, 0) as follows (thanks #kmario123):
import numpy as np
P = np.array([[ 0.49530662, 0.07901 , -0.19012371],
[ 0.1421513 , 0.48607405, -0.20315014],
[ 0.76467375, 0.16479826, -0.56598029],
[ 0.53530718, -0.21166188, -0.08773241]])
print( np.where(P<0, P, 0))
print(P)
The output will be
[[ 0. 0. -0.19012371]
[ 0. 0. -0.20315014]
[ 0. 0. -0.56598029]
[ 0. -0.21166188 -0.08773241]]
[[ 0.49530662 0.07901 -0.19012371]
[ 0.1421513 0.48607405 -0.20315014]
[ 0.76467375 0.16479826 -0.56598029]
[ 0.53530718 -0.21166188 -0.08773241]]

Cartesian product from 2 series

I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a (t x t) matrix where each value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to perform it better?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1/p0 - 1
A vectorized solution uses np.meshgrid with prices and 1/prices as arguments (note that prices must be an array), then multiplies the result grids and subtracts 1 in order to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells:
(i,j) prices[j]/prices[i] - 1
--------------------------------
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
This p[None, :] can itself be spelled as a reshape, p.reshape((1, len(p))), but the broadcasting form reads better.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
   ...:     for j in range(len(p)):
   ...:         o[i, j] = p[i] / p[j]
   ...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
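Note that np.divide.outer(p, p)[i, j] equals p[i] / p[j], so for the question's matrix[i][j] = prices[j]/prices[i] - 1 you would flip the broadcast axes and subtract 1; a quick sketch:
matrix = p[None, :] / p[:, None] - 1  # matrix[i, j] = p[j] / p[i] - 1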
I guess it can be done in this way
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. The matrix is the product of a column vector of element-wise reciprocal price values and a row vector of the original price values; then 1 is subtracted from every entry of the result.
First of all we create row-vector from prices list
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here. Otherwise your vector will have shape (len(prices),), not the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.

Printing numpy with different position in the column

I have following numpy array
import numpy as np
np.random.seed(20)
np.random.rand(20).reshape(5, 4)
array([[ 0.5881308 , 0.89771373, 0.89153073, 0.81583748],
[ 0.03588959, 0.69175758, 0.37868094, 0.51851095],
[ 0.65795147, 0.19385022, 0.2723164 , 0.71860593],
[ 0.78300361, 0.85032764, 0.77524489, 0.03666431],
[ 0.11669374, 0.7512807 , 0.23921822, 0.25480601]])
For each column I would like to slice it starting at these positions:
position_for_slicing=[0, 3, 4, 4]
So I will get following array:
array([[ 0.5881308 ,  0.85032764,  0.23921822,  0.81583748],
       [ 0.03588959,  0.7512807 ,  0         ,  0         ],
       [ 0.65795147,  0         ,  0         ,  0         ],
       [ 0.78300361,  0         ,  0         ,  0         ],
       [ 0.11669374,  0         ,  0         ,  0         ]])
Is there a fast way to do this? I know I can use a for loop for each column, but I was wondering if there is a more elegant way to do this.
If "elegant" means "no loop" the following would qualify, but probably not under many other definitions (arr is your input array):
m, n = arr.shape
arrf = np.asanyarray(arr, order='F')
padded = np.r_[arrf, np.zeros_like(arrf)]
assert padded.flags['F_CONTIGUOUS']
expnd = np.lib.stride_tricks.as_strided(padded, (m, m+1, n), padded.strides[:1] + padded.strides)
expnd[:, [0,3,4,4], range(4)]
# array([[ 0.5881308 , 0.85032764, 0.23921822, 0.25480601],
# [ 0.03588959, 0.7512807 , 0. , 0. ],
# [ 0.65795147, 0. , 0. , 0. ],
# [ 0.78300361, 0. , 0. , 0. ],
# [ 0.11669374, 0. , 0. , 0. ]])
Please note that order='C' and then 'C_CONTIGUOUS' in the assertion also works. My hunch is that 'F' could be a bit faster because the indexing then operates on contiguous slices.
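Under the same assumptions (shift each column up by its slice position and zero-fill below), a fancy-indexing sketch without stride tricks might look like this; pos, rows, and out are illustrative names:
import numpy as np
np.random.seed(20)
arr = np.random.rand(20).reshape(5, 4)
pos = np.array([0, 3, 4, 4])
m, n = arr.shape
rows = np.arange(m)[:, None] + pos[None, :]  # source row for each output cell
valid = rows < m                             # cells whose source falls inside arr
out = np.zeros_like(arr)
i, j = np.nonzero(valid)
out[i, j] = arr[rows[i, j], j]               # copy shifted values, leave zeros elsewhere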

Meshgrid a N-columned matrix in Numpy (or smth else)

Python version: 2.7
I have the following numpy 2d array:
array([[ -5.05000000e+01, -1.05000000e+01],
[ -4.04000000e+01, -8.40000000e+00],
[ -3.03000000e+01, -6.30000000e+00],
[ -2.02000000e+01, -4.20000000e+00],
[ -1.01000000e+01, -2.10000000e+00],
[ 7.10542736e-15, -1.77635684e-15],
[ 1.01000000e+01, 2.10000000e+00],
[ 2.02000000e+01, 4.20000000e+00],
[ 3.03000000e+01, 6.30000000e+00],
[ 4.04000000e+01, 8.40000000e+00]])
If I wanted to find all the combinations of the first and the second columns, I would use np.array(np.meshgrid(first_column, second_column)).T.reshape(-1,2). As a result, I would get a 100*2 matrix with 10*10 = 100 data points. However, my matrix can have 3, 4, or more columns, so I have a problem using this numpy function.
Question: how can I make an automatically meshgridded matrix with 3+ columns?
UPD: for example, I have the initial array:
[[-50.5 -10.5]
[ 0. 0. ]]
As a result, I want to have the output array like this:
array([[-10.5, -50.5],
[-10.5, 0. ],
[ 0. , -50.5],
[ 0. , 0. ]])
or this:
array([[-50.5, -10.5],
[-50.5, 0. ],
[ 0. , -10.5],
[ 0. , 0. ]])
You could use the * operator on the transposed array, which unpacks those columns sequentially into np.meshgrid. Finally, a swapaxes operation is needed to merge the output grid arrays into one array.
Thus, one generic solution would be -
np.swapaxes(np.meshgrid(*arr.T),0,2)
Sample run -
In [44]: arr
Out[44]:
array([[-50.5, -10.5],
[ 0. , 0. ]])
In [45]: np.swapaxes(np.meshgrid(*arr.T),0,2)
Out[45]:
array([[[-50.5, -10.5],
[-50.5, 0. ]],
[[ 0. , -10.5],
[ 0. , 0. ]]])
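If you want the flat list of combinations from your question rather than the grid, the same unpacking idea works with the well-known cartesian-product recipe; a sketch:
import numpy as np
arr = np.array([[-50.5, -10.5],
                [  0. ,   0. ]])
n = arr.shape[1]
combos = np.array(np.meshgrid(*arr.T)).T.reshape(-1, n)
# array([[-50.5, -10.5],
#        [-50.5,   0. ],
#        [  0. , -10.5],
#        [  0. ,   0. ]])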

Sample with replacement from existing array

I have a matrix A with shape 1.6M rows and 400 columns.
One of the columns in A (call it the output column) has binary values (0,1) with a predominance of 0's.
I want to create a new matrix B (same shape as A) by sampling rows in A with replacement such that the distribution of 0's and 1's in the output column of B becomes 50/50.
What is the efficient way to do this using python/numpy?
You could do this by:
Creating a list of all rows with 0 in the "output column" (called outputZeros), and a list of all rows with 1 in the output column (called outputOnes); then,
Sampling with replacement from outputZeros and outputOnes 1.6M times.
Here's a small example. It's not clear to me if you want the rows in B to be in any particular order, so here they include the 0s first, then the 1s.
In [1]: import numpy as np, random
In [2]: A = np.random.rand(10, 2)
In [3]: A
In [4]: A[:7, 1] = 0
In [5]: A[7:, 1] = 1
In [6]: A
Out[6]:
array([[ 0.70126052, 0. ],
[ 0.51161067, 0. ],
[ 0.76511966, 0. ],
[ 0.91257144, 0. ],
[ 0.97024895, 0. ],
[ 0.55817776, 0. ],
[ 0.55963466, 0. ],
[ 0.6318139 , 1. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ]])
In [7]: outputZeros = np.where(A[:, 1] == 0)[0]
In [8]: outputZeros
Out[8]: array([0, 1, 2, 3, 4, 5, 6])
In [9]: outputOnes = np.where(A[:, 1] == 1)[0]
In [10]: outputOnes
Out[10]: array([7, 8, 9])
In [11]: B = np.zeros((10, 2))
In [12]: for i in range(10):
   ....:     if i < 5:
   ....:         B[i, :] = A[random.choice(outputZeros), :]
   ....:     else:
   ....:         B[i, :] = A[random.choice(outputOnes), :]
   ....:
In [13]: B
Out[13]:
array([[ 0.97024895, 0. ],
[ 0.97024895, 0. ],
[ 0.76511966, 0. ],
[ 0.76511966, 0. ],
[ 0.51161067, 0. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ],
[ 0.6318139 , 1. ],
[ 0.6318139 , 1. ],
[ 0.76033151, 1. ]])
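A vectorized sketch of the same idea for the full-size problem, replacing the Python-level loop with np.random.choice (which samples with replacement by default); the output column is assumed to be column 1 as in the toy example, and the exact half/half split is illustrative:
import numpy as np
zeros = np.flatnonzero(A[:, 1] == 0)   # row indices with 0 in the output column
ones = np.flatnonzero(A[:, 1] == 1)    # row indices with 1 in the output column
n = A.shape[0]
idx = np.concatenate([np.random.choice(zeros, n // 2),
                      np.random.choice(ones, n - n // 2)])
B = A[idx]                             # ~50/50 mix of 0-rows and 1-rows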
I would create a new 1D numpy array of row indices filled with values from numpy.random.random_integers(low, high=None, size=None) (deprecated; numpy.random.randint is the modern equivalent) and use it to select rows from the old array.
