How to normalize 2D array with sklearn? - python

Given a 2D array, I would like to normalize it to the range 0-1.
I know this can be achieved as below:
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler
np.random.seed(0)
t_feat=4
t_epoch=3
t_wind=2
result = [np.random.rand(t_epoch, t_feat) for _ in range(t_wind)]
wdw_epoch_feat=np.array(result)
matrix=wdw_epoch_feat[:,:,0]
xmax, xmin = matrix.max(), matrix.min()
x_norm = (matrix - xmin)/(xmax - xmin)
which produces:
[[0.55153917 0.42094786 0.98439526]
 [0.57160496 0.         1.        ]]
However, I cannot get the same result using sklearn's MinMaxScaler:
scaler = MinMaxScaler()
x_norm = scaler.fit_transform(matrix)
which produces:
[[0. 1. 0.]
 [1. 0. 1.]]
I'd appreciate any thoughts.

You are min-max scaling across the entire matrix at once. MinMaxScaler is designed for machine learning, so it scales each feature (column) independently. To get the same results as your manual version, you would need to flatten the 2D array into a single column. I show this below and reproduce your values in the first column:
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler
np.random.seed(0)
t_feat=4
t_epoch=3
t_wind=2
result = [np.random.rand(t_epoch, t_feat) for _ in range(t_wind)]
wdw_epoch_feat=np.array(result)
matrix=wdw_epoch_feat[:,:,0]
xmax, xmin = matrix.max(), matrix.min()
x_norm = (matrix - xmin)/(xmax - xmin)
matrix = np.array([matrix.flatten(), np.random.rand(len(matrix.flatten()))]).T  # column 0: your flattened values; column 1: random filler
scaler = MinMaxScaler()
test = scaler.fit_transform(matrix)
print(test)
-------------------------------------------
[[0.55153917 0. ]
[0.42094786 0.63123194]
[0.98439526 0.03034732]
[0.57160496 1. ]
[0. 0.48835502]
[1. 0.35865137]]
When you use MinMaxScaler for machine learning, you generally want to scale each column independently.
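As a minimal self-contained check (the small array m here is hypothetical), MinMaxScaler's per-column behaviour matches applying the min-max formula down each column (axis=0):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 3x2 example: each column is scaled independently.
m = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
manual = (m - m.min(axis=0)) / (m.max(axis=0) - m.min(axis=0))
print(np.allclose(manual, MinMaxScaler().fit_transform(m)))  # True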

A clever way to do this is to reshape your data to 1D, apply the transform, and reshape it back to the original shape:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([[-1, 2], [-0.5, 6]])
scaler = MinMaxScaler()
X_one_column = X.reshape([-1,1])
result_one_column = scaler.fit_transform(X_one_column)
result = result_one_column.reshape(X.shape)
print(result)
[[ 0. 0.42857143]
[ 0.07142857 1. ]]

Related

scipy.stats.norm on an array of values gives different accuracy with different methods

Generate two arrays:
np.random.seed(1)
x = np.random.rand(30, 2)
np.random.seed(2)
x_test = np.random.rand(5,2)
Calculate scipy.stats.norm axis by axis:
gx0 = scipy.stats.norm(np.mean(x[:,0]), np.std(x[:,0])).pdf(x_test[:,0])
gx1 = scipy.stats.norm(np.mean(x[:,1]), np.std(x[:,1])).pdf(x_test[:,1])
and get:
gx0 = array([1.29928091, 1.1344507 , 1.30920536, 1.10709298, 1.26903949])
gx1 = array([0.29941644, 1.36808598, 1.13817727, 1.34149231, 0.95054596])
Calculate using NumPy broadcasting:
gx = scipy.stats.norm(np.mean(x, axis = 0), np.std(x, axis = 0)).pdf(x_test)
and get:
gx = array([[1.29928091, 0.29941644],
[1.1344507 , 1.36808598],
[1.30920536, 1.13817727],
[1.10709298, 1.34149231],
[1.26903949, 0.95054596]])
gx[:,0] and gx0 look the same, but subtracting one from the other, gx[:,0] - gx0, gives:
array([-4.44089210e-16, -2.22044605e-16, -4.44089210e-16, 0.00000000e+00,
0.00000000e+00])
Why is that?
The two code paths evidently accumulate float64 rounding error in a different order of operations; the differences are on the order of machine epsilon (~2.2e-16). Converting the input arrays to 128-bit floats makes the discrepancy vanish at this precision:
np.random.seed(1)
x = np.random.rand(30, 2).astype(np.float128)
np.random.seed(2)
x_test = np.random.rand(5,2).astype(np.float128)
...
print(gx[:,0] - gx0)
results in:
[0. 0. 0. 0. 0.]
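Whether or not you upcast, the original discrepancy was harmless: it sits at float64 machine-epsilon scale, so np.allclose reports the arrays as equal. A quick check (a sketch, assuming gx and gx0 from the session above are still in scope):
import numpy as np
print(np.finfo(np.float64).eps)    # ~2.22e-16, the same scale as the differences
print(np.allclose(gx[:, 0], gx0))  # True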

How can I randomly set elements to zero in TF?

The pure numpy solution is:
import numpy as np
data = np.random.rand(5,5) #data is of shape (5,5) with floats
masking_prob = 0.5 #probability of an element to get masked
indices = np.random.choice(np.prod(data.shape), replace=False, size=int(np.prod(data.shape)*masking_prob))
data[np.unravel_index(indices, data.shape)] = 0. # set the selected elements to zero
How can I achieve this in TensorFlow?
Use tf.nn.dropout:
import tensorflow as tf
import numpy as np
data = np.random.rand(5,5)
array([[0.38658212, 0.6896139 , 0.92139911, 0.45646086, 0.23185075],
[0.03461688, 0.22073962, 0.21254995, 0.20046708, 0.43419155],
[0.49012903, 0.45495968, 0.83753471, 0.58815975, 0.90212244],
[0.04071416, 0.44375078, 0.55758641, 0.31893155, 0.67403431],
[0.52348073, 0.69354454, 0.2808658 , 0.6628248 , 0.82305081]])
prob = 0.5 # probability of an element being dropped
tf.nn.dropout(data, rate=prob).numpy()*(1-prob)
array([[0.38658212, 0.6896139 , 0.92139911, 0. , 0. ],
[0.03461688, 0. , 0. , 0.20046708, 0. ],
[0.49012903, 0.45495968, 0. , 0. , 0. ],
[0. , 0.44375078, 0.55758641, 0.31893155, 0. ],
[0.52348073, 0.69354454, 0.2808658 , 0.6628248 , 0. ]])
tf.nn.dropout scales the surviving values by 1/(1-rate), so I counter this by multiplying by (1-prob) to restore the original values.
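If you'd rather avoid undoing dropout's rescaling, here is a TF-native sketch (assumes TF 2.x eager mode) that builds a Bernoulli mask directly. Note it zeroes each element independently with probability prob, rather than zeroing exactly a fixed fraction as the NumPy version does:
import tensorflow as tf

prob = 0.5  # probability of zeroing an element
x = tf.random.uniform((5, 5))
mask = tf.cast(tf.random.uniform(tf.shape(x)) >= prob, x.dtype)
masked = x * mask  # each element zeroed with probability prob, otherwise unchanged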
For future users looking for a TF 2.x compatible answer, this is what I came up with:
import tensorflow as tf
import numpy as np
input_tensor = np.random.rand(5,5).astype(np.float32)
def my_numpy_func(x):
    # x will be a numpy array with the contents of the input to the
    # tf.function
    p = 0.5
    indices = np.random.choice(np.prod(x.shape), replace=False, size=int(np.prod(x.shape)*p))
    x[np.unravel_index(indices, x.shape)] = 0.
    return x

@tf.function(input_signature=[tf.TensorSpec((None, None), tf.float32)])
def tf_function(input):
    y = tf.numpy_function(my_numpy_func, [input], tf.float32)
    return y
tf_function(tf.constant(input_tensor))
You can also use this code in the context of a Dataset by using the map() operation, as sketched below.
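For example, a hypothetical pipeline (assuming the tf_function defined above) could look like this:
ds = tf.data.Dataset.from_tensor_slices(np.random.rand(10, 5, 5).astype(np.float32))
ds = ds.map(tf_function)
for batch in ds.take(1):
    print(batch)  # a (5, 5) tensor with roughly half its entries zeroed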

Min-max scaling along rows in numpy array

I have a numpy array and I want to rescale values along each row to values between 0 and 1 using the following procedure:
If the maximum value along a given row is X_max and the minimum value along that row is X_min, then the rescaled value (X_rescaled) of a given entry (X) in that row should become:
X_rescaled = (X - X_min)/(X_max - X_min)
As an example, let's consider the following array (arr):
arr = np.array([[1.0,2.0,3.0],[0.1, 5.1, 100.1],[0.01, 20.1, 1000.1]])
print(arr)
array([[ 1.00000000e+00, 2.00000000e+00, 3.00000000e+00],
[ 1.00000000e-01, 5.10000000e+00, 1.00100000e+02],
[ 1.00000000e-02, 2.01000000e+01, 1.00010000e+03]])
Presently, I am trying to use MinMaxscaler from scikit-learn in the following way:
from sklearn.preprocessing import MinMaxScaler
result = MinMaxScaler(arr)
But, I keep getting my initial array, i.e. result turns out to be the same as arr in the aforementioned method. What am I doing wrong?
How can I scale the array arr in the manner that I require (min-max scaling along each row)? Thanks in advance.
MinMaxScaler is a bit clunky to use; sklearn.preprocessing.minmax_scale is more convenient. This operates along columns, so use the transpose:
>>> import numpy as np
>>> from sklearn import preprocessing
>>>
>>> a = np.random.random((3,5))
>>> a
array([[0.80161048, 0.99572497, 0.45944366, 0.17338664, 0.07627295],
[0.54467986, 0.8059851 , 0.72999058, 0.08819178, 0.31421126],
[0.51774372, 0.6958269 , 0.62931078, 0.58075685, 0.57161181]])
>>> preprocessing.minmax_scale(a.T).T
array([[0.78888024, 1. , 0.41673812, 0.10562126, 0. ],
[0.63596033, 1. , 0.89412757, 0. , 0.314881 ],
[0. , 1. , 0.62648851, 0.35384099, 0.30248836]])
>>>
>>> b = np.array([(4, 1, 5, 3), (0, 1.5, 1, 3)])
>>> preprocessing.minmax_scale(b.T).T
array([[0.75 , 0. , 1. , 0.5 ],
[0. , 0.5 , 0.33333333, 1. ]])
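If you'd rather skip sklearn entirely, here is a pure-NumPy sketch of the same row-wise scaling (using the arr from the question): keep the reductions on axis=1 with keepdims so broadcasting lines up. As a side note, minmax_scale also accepts an axis argument, so minmax_scale(a, axis=1) should give the row-wise result without the double transpose.
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [0.1, 5.1, 100.1], [0.01, 20.1, 1000.1]])
row_min = arr.min(axis=1, keepdims=True)  # per-row minimum, shape (3, 1)
row_max = arr.max(axis=1, keepdims=True)  # per-row maximum, shape (3, 1)
arr_rescaled = (arr - row_min) / (row_max - row_min)
print(arr_rescaled)  # each row now spans exactly 0 to 1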

sklearn linear regression coefficients equal 0

I've run into a problem with the simplest example of linear regression: the output coefficients are all zero. What am I doing wrong? Thanks for the help.
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = [25,50,75,100]
y = [10.5,17,23.25,29]
pred = [27,41,22,33]
df = pd.DataFrame({'x':x, 'y':y, 'pred':pred})
x = df['x'].values.reshape(1,-1)
y = df['y'].values.reshape(1,-1)
pred = df['pred'].values.reshape(1,-1)
plt.scatter(x,y,color='black')
clf = lm.LinearRegression(fit_intercept =True)
clf.fit(x,y)
m=clf.coef_[0]
b=clf.intercept_
print("slope=",m, "intercept=",b)
Output:
slope= [ 0. 0. 0. 0.] intercept= [ 10.5 17. 23.25 29. ]
Think it through for a second: the fact that multiple coefficients are returned suggests the model saw multiple features. Since it's a simple regression with a single variable, the problem lies in the shape of your input data. Your original reshaping made the class think you had four variables and only one observation per variable.
Try something like this:
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.array([25,99,75,100, 3, 4, 6, 80])[..., np.newaxis]
y = np.array([10.5,17,23.25,29, 1, 2, 33, 4])[..., np.newaxis]
clf = lm.LinearRegression()
clf.fit(x,y)
clf.coef_
Output:
array([[ 0.09399429]])
As @jrjames83 has already explained in his answer, after reshaping (.reshape(1,-1)) you were feeding a data set containing one sample (row) and four features (columns):
In [103]: x.shape
Out[103]: (1, 4)
most probably you wanted to reshape it this way:
In [104]: x = df['x'].values.reshape(-1, 1)
In [105]: x.shape
Out[105]: (4, 1)
so that you would have four samples and one feature...
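A minimal fit with that corrected reshape (a sketch, assuming the df from the question) would be:
x = df['x'].values.reshape(-1, 1)  # four samples, one feature
clf = lm.LinearRegression().fit(x, df['y'].values)
print(clf.coef_, clf.intercept_)   # [0.247] 4.5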
alternatively you could pass DataFrame columns to your model as follows (no need to pollute your memory with additional variables):
In [98]: clf = lm.LinearRegression(fit_intercept =True)
In [99]: clf.fit(df[['x']],df['y'])
Out[99]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [100]: clf.coef_
Out[100]: array([0.247])
In [101]: clf.intercept_
Out[101]: 4.5

How do I use scipy.interpolate.splrep to interpolate a curve?

Using some experimental data, I cannot for the life of me work out how to use splrep to create a B-spline. The data are here: http://ubuntuone.com/4ZFyFCEgyGsAjWNkxMBKWD
Here is an excerpt:
#Depth Temperature
1 14.7036
-0.02 14.6842
-1.01 14.7317
-2.01 14.3844
-3 14.847
-4.05 14.9585
-5.03 15.9707
-5.99 16.0166
-7.05 16.0147
and here's a plot of it with depth on y and temperature on x (figure not reproduced here).
Here is my code:
import numpy as np
from scipy.interpolate import splrep, splev
tdata = np.genfromtxt('t-data.txt', skip_header=1, delimiter='\t')
depth = tdata[:, 0]
temp = tdata[:, 1]
# Find the B-spline representation of 1-D curve:
tck = splrep(depth, temp)
### fails here with "Error on input data" returned. ###
I know I am doing something bleedingly stupid, but I just can't see it.
You just need to have your values sorted from smallest to largest :). It shouldn't be a problem for you, @a different ben, but beware, readers from the future: depth[indices] will throw a TypeError if depth is a list instead of a numpy array!
>>> indices = np.argsort(depth)
>>> depth = depth[indices]
>>> temp = temp[indices]
>>> splrep(depth, temp)
(array([-7.05, -7.05, -7.05, -7.05, -5.03, -4.05, -3. , -2.01, -1.01,
1. , 1. , 1. , 1. ]), array([ 16.0147 , 15.54473241, 16.90606794, 14.55343229,
15.12525673, 14.0717599 , 15.19657895, 14.40437622,
14.7036 , 0. , 0. , 0. , 0. ]), 3)
Hat tip to @FerdinandBeyer for the suggestion of argsort instead of my ugly "zip the values, sort the zip, re-assign the values" method.
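To actually use the spline, a short follow-up sketch (assuming the sorted depth and temp arrays above) evaluates it on a fine grid with splev:
tck = splrep(depth, temp)
depth_fine = np.linspace(depth.min(), depth.max(), 200)
temp_smooth = splev(depth_fine, tck)  # spline-interpolated temperatures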
