I have a numpy array and I want to rescale values along each row to values between 0 and 1 using the following procedure:
If the maximum value along a given row is X_max and the minimum value along that row is X_min, then the rescaled value (X_rescaled) of a given entry (X) in that row should become:
X_rescaled = (X - X_min)/(X_max - X_min)
As an example, let's consider the following array (arr):
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [0.1, 5.1, 100.1], [0.01, 20.1, 1000.1]])
print(arr)
array([[ 1.00000000e+00, 2.00000000e+00, 3.00000000e+00],
[ 1.00000000e-01, 5.10000000e+00, 1.00100000e+02],
[ 1.00000000e-02, 2.01000000e+01, 1.00010000e+03]])
Presently, I am trying to use MinMaxScaler from scikit-learn in the following way:
from sklearn.preprocessing import MinMaxScaler
result = MinMaxScaler(arr)
But I keep getting my initial array back, i.e. result turns out to be the same as arr. What am I doing wrong?
How can I scale the array arr in the manner that I require (min-max scaling along each row)? Thanks in advance.
MinMaxScaler is a class: calling MinMaxScaler(arr) only constructs a scaler object and never fits or transforms your data; you would need something like MinMaxScaler().fit_transform(arr.T).T. But MinMaxScaler is a bit clunky to use here; sklearn.preprocessing.minmax_scale is more convenient. It operates along columns, so use the transpose:
>>> import numpy as np
>>> from sklearn import preprocessing
>>>
>>> a = np.random.random((3,5))
>>> a
array([[0.80161048, 0.99572497, 0.45944366, 0.17338664, 0.07627295],
[0.54467986, 0.8059851 , 0.72999058, 0.08819178, 0.31421126],
[0.51774372, 0.6958269 , 0.62931078, 0.58075685, 0.57161181]])
>>> preprocessing.minmax_scale(a.T).T
array([[0.78888024, 1. , 0.41673812, 0.10562126, 0. ],
[0.63596033, 1. , 0.89412757, 0. , 0.314881 ],
[0. , 1. , 0.62648851, 0.35384099, 0.30248836]])
>>>
>>> b = np.array([(4, 1, 5, 3), (0, 1.5, 1, 3)])
>>> preprocessing.minmax_scale(b.T).T
array([[0.75 , 0. , 1. , 0.5 ],
[0. , 0.5 , 0.33333333, 1. ]])
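If you want to avoid scikit-learn entirely, the same row-wise formula from the question can be written directly in NumPy (a minimal sketch):

import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [0.1, 5.1, 100.1], [0.01, 20.1, 1000.1]])
row_min = arr.min(axis=1, keepdims=True)  # X_min per row
row_max = arr.max(axis=1, keepdims=True)  # X_max per row
rescaled = (arr - row_min) / (row_max - row_min)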
I want to randomly set a fraction of the entries of an array to zero. The pure numpy solution is:
import numpy as np
data = np.random.rand(5,5) #data is of shape (5,5) with floats
masking_prob = 0.5 #probability of an element to get masked
indices = np.random.choice(np.prod(data.shape), replace=False, size=int(np.prod(data.shape)*masking_prob))
data[np.unravel_index(indices, data.shape)] = 0. #set the chosen entries to zero
How can I achieve this in TensorFlow?
Use tf.nn.dropout:
import tensorflow as tf
import numpy as np
prob = 0.5  # fraction of entries to mask
data = np.random.rand(5, 5)
data
array([[0.38658212, 0.6896139 , 0.92139911, 0.45646086, 0.23185075],
[0.03461688, 0.22073962, 0.21254995, 0.20046708, 0.43419155],
[0.49012903, 0.45495968, 0.83753471, 0.58815975, 0.90212244],
[0.04071416, 0.44375078, 0.55758641, 0.31893155, 0.67403431],
[0.52348073, 0.69354454, 0.2808658 , 0.6628248 , 0.82305081]])
tf.nn.dropout(data, rate=prob).numpy() * (1 - prob)
array([[0.38658212, 0.6896139 , 0.92139911, 0. , 0. ],
[0.03461688, 0. , 0. , 0.20046708, 0. ],
[0.49012903, 0.45495968, 0. , 0. , 0. ],
[0. , 0.44375078, 0.55758641, 0.31893155, 0. ],
[0.52348073, 0.69354454, 0.2808658 , 0.6628248 , 0. ]])
tf.nn.dropout scales the values it keeps by 1/(1 - rate), so I counter this by multiplying the result by (1 - prob) to recover the original magnitudes. Note that dropout zeroes each entry independently with probability rate, so the exact number of masked entries varies from call to call, unlike the NumPy version, which masks an exact fraction.
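A quick sanity check of that scaling (a sketch):

import tensorflow as tf

x = tf.ones((1000,))
y = tf.nn.dropout(x, rate=0.5)
# surviving entries are scaled to 1 / (1 - 0.5) = 2.0, the rest are zero
print(sorted(set(y.numpy().tolist())))  # approximately [0.0, 2.0]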
For further users looking for a TF 2.x compatible answer, this is what I came up with:
import tensorflow as tf
import numpy as np
input_tensor = np.random.rand(5,5).astype(np.float32)
def my_numpy_func(x):
    # x will be a numpy array with the contents of the input to the tf.function
    p = 0.5
    indices = np.random.choice(np.prod(x.shape), replace=False, size=int(np.prod(x.shape) * p))
    x[np.unravel_index(indices, x.shape)] = 0.
    return x
@tf.function(input_signature=[tf.TensorSpec((None, None), tf.float32)])
def tf_function(input):
    y = tf.numpy_function(my_numpy_func, [input], tf.float32)
    return y
tf_function(tf.constant(input_tensor))
You can also use this code in the context of a Dataset via the map() operation.
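For instance, a minimal sketch of that Dataset usage (the shapes here are illustrative):

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((10, 5, 5)))
ds = ds.map(tf_function)  # applies the numpy masking to each (5, 5) element
for masked in ds.take(1):
    print(masked.numpy())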
I would like to get a matrix of values given two ndarray's from a ufunc, for example:
import numpy
import scipy.special

degs = numpy.array(range(5))
pnts = numpy.array([0.0, 0.1, 0.2])
values = scipy.special.eval_chebyt(degs, pnts)
The above code doesn't work: it raises a ValueError because it tries to broadcast two arrays with incompatible shapes, (5,) and (3,). I would like to get a matrix of values with rows corresponding to degrees and columns to the points at which the polynomials are evaluated (or vice versa; it doesn't matter).
Currently my workaround is simply to use for-loop:
values = numpy.zeros((5, 3))
for j in range(5):
    values[j] = scipy.special.eval_chebyt(j, pnts)
Is there a way to do that? In general, how would you let a ufunc know you want an n-dimensional array if you have n array_like arguments?
I know about numpy.vectorize, but that seems neither faster nor more elegant than a simple for-loop (and I'm not even sure you can apply it to an existing ufunc).
UPDATE: What about ufuncs that take 3 or more parameters? Trying the outer method raises ValueError: outer product only supported for binary functions; scipy.special.eval_jacobi, for example.
What you need is exactly the outer method of ufuncs:
ufunc.outer(A, B, **kwargs)
Apply the ufunc op to all pairs (a, b) with a in A and b in B.
values = scipy.special.eval_chebyt.outer(degs, pnts)
#array([[ 1. , 1. , 1. ],
# [ 0. , 0.1 , 0.2 ],
# [-1. , -0.98 , -0.92 ],
# [-0. , -0.296 , -0.568 ],
# [ 1. , 0.9208, 0.6928]])
UPDATE
For more parameters, you must broadcast by hand. numpy.meshgrid often helps with that, spanning each parameter along its own dimension. For example:
import numpy
import scipy.special

n = 3
alpha = numpy.array(range(5))
beta = numpy.array(range(3))
x = numpy.array(range(2))
data = numpy.meshgrid(n, alpha, beta, x)
values = scipy.special.eval_jacobi(*data)  # one array axis per parameter
Reshape the input arguments for broadcasting. In this case, change the shape of degs to be (5, 1) instead of just (5,). The shape (5, 1) broadcast with the shape (3,) results in the shape (5, 3):
In [185]: import numpy as np
In [186]: import scipy.special
In [187]: degs = np.arange(5).reshape(-1, 1) # degs has shape (5, 1)
In [188]: pnts = np.array([0.0, 0.1, 0.2])
In [189]: values = scipy.special.eval_chebyt(degs, pnts)
In [190]: values
Out[190]:
array([[ 1. , 1. , 1. ],
[ 0. , 0.1 , 0.2 ],
[-1. , -0.98 , -0.92 ],
[-0. , -0.296 , -0.568 ],
[ 1. , 0.9208, 0.6928]])
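The same trick generalizes to ufuncs with more than two parameters: give each argument its own axis and let broadcasting do the rest. A sketch with eval_jacobi (the values here are chosen only for illustration):

In [191]: n = np.arange(4).reshape(-1, 1, 1)              # shape (4, 1, 1)
In [192]: alpha = np.array([0.0, 0.5]).reshape(1, -1, 1)  # shape (1, 2, 1)
In [193]: x = np.linspace(-1, 1, 5)                       # shape (5,)
In [194]: scipy.special.eval_jacobi(n, alpha, 0.5, x).shape  # beta held fixed at 0.5
Out[194]: (4, 2, 5)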
How do I use this set of data and plot it using pcolormesh? The data is as follows:
array([[ 0. , 0. , 0. , ...,
0. , 0. , 0. ],
[ 34.19227552, 34.19246389, 34.19265956, ...,
34.19284295, 34.19253446, 34.1923012 ],
[ 68.46819899, 68.46861825, 68.46892983, ...,
68.46895204, 68.46856004, 68.46812476],
...,
[ 3937.42832088, 3937.42522049, 3937.43673897, ...,
3937.43603929, 3937.44434961, 3937.43535423],
[ 3987.08591207, 3987.082997 , 3987.09487184, ...,
3987.09300137, 3987.10157045, 3987.09271431],
[ 4037.00035477, 4036.9977684 , 4037.01006508, ...,
4037.00674248, 4037.01561165, 4037.00689316]])
I need to plot this data as a matplotlib pcolormesh. How do I do this? Any help would be greatly appreciated.
Maybe you can follow these tutorials:
http://matplotlib.org/examples/pylab_examples/pcolor_demo.html
or
http://mlpy.sourceforge.net/docs/3.3/tutorial.html
Load the modules:
>>> import numpy as np
>>> import mlpy
>>> import matplotlib.pyplot as plt # required for plotting
Load the Iris dataset:
>>> iris = np.loadtxt('iris.csv', delimiter=',')
>>> x, y = iris[:, :4], iris[:, 4].astype(int) # x: (observations x attributes) matrix, y: classes (1: setosa, 2: versicolor, 3: virginica)
>>> x.shape
(150, 4)
>>> y.shape
(150,)
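For the actual pcolormesh call, a minimal matplotlib sketch (using random data as a stand-in for the array in the question):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(20, 30)  # stand-in for the array in the question
fig, ax = plt.subplots()
mesh = ax.pcolormesh(data)     # the array values become the color dimension
fig.colorbar(mesh, ax=ax, label='value')
plt.show()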
I am trying to implement non-negative matrix factorization so as to find the missing values of a matrix for a recommendation engine project. I am using the nimfa library to implement matrix factorization, but I can't seem to figure out how to predict the missing values.
The missing values in this matrix are represented by 0.
a=[[ 1. 0.45643546 0. 0.1 0.10327956 0.0225877 ]
[ 0.15214515 1. 0.04811252 0.07607258 0.23570226 0.38271325]
[ 0. 0.14433757 1. 0.07905694 0. 0.42857143]
[ 0.1 0.22821773 0.07905694 1. 0. 0.27105237]
[ 0.06885304 0.47140452 0. 0. 1. 0.13608276]
[ 0.00903508 0.4592559 0.17142857 0.10842095 0.08164966 1. ]]
import nimfa
import numpy

model = nimfa.Lsnmf(a, max_iter=100000, rank=4)
# fit the model
fit = model()
# get U and V matrices from fit
U = fit.basis()
V = fit.coef()
print(numpy.dot(U, V))
But the answer given is nearly the same as a, and I can't predict the zero values.
Please tell me which method to use, or point me to any other possible implementations or resources.
I want to minimize the following error when predicting the values:
error = ||a - UV||_F + c*||U||_F + c*||V||_F
where ||.||_F denotes the Frobenius norm.
I have not used nimfa before, so I cannot say exactly how to do it there, but with sklearn you can use a preprocessing step to impute the missing values, like this:
In [28]: import numpy as np
In [29]: from sklearn.preprocessing import Imputer
# prepare a numpy array
In [30]: a = np.array(a)
In [31]: a
Out[31]:
array([[ 1. , 0.45643546, 0. , 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0. , 0.14433757, 1. , 0.07905694, 0. ,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0. ,
0.27105237],
[ 0.06885304, 0.47140452, 0. , 0. , 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
In [32]: pre = Imputer(missing_values=0, strategy='mean')
# transform missing_values as "0" using mean strategy
In [33]: pre.fit_transform(a)
Out[33]:
array([[ 1. , 0.45643546, 0.32464951, 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0.26600665, 0.14433757, 1. , 0.07905694, 0.35515787,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0.35515787,
0.27105237],
[ 0.06885304, 0.47140452, 0.32464951, 0.27271009, 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
You can read more in the scikit-learn preprocessing documentation.
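Note that Imputer has since been removed from scikit-learn; in newer versions the equivalent is sklearn.impute.SimpleImputer. A sketch of the same imputation (assuming a is the array from above):

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=0, strategy='mean')
filled = imp.fit_transform(np.array(a))  # column means replace the zeros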
I have a 5x17511 2D array (named 'da'), created by pandas.read_csv(...).
I also added one column for indexing like this: da.index = pd.date_range(...)
So my 2D array now has size 6x17511.
I'd like to insert/append one more row to this 2D array. How can I do this?
I already tried: np.insert(da, 1, np.array((1,2,3,4,5,6)), 0) but it raises:
ValueError: Shape of passed values is (6, 17512), indices imply (6, 17511)
Thanks in advance!
I have assumed this is a numpy question rather than a pandas question ...
You could try vstack ...
import numpy as np
da = np.random.rand(17511, 6)
newrow = np.array((1,2,3,4,5,6))
da = np.vstack([da, newrow])
Which yields ...
In [5]: da
Out[5]:
array([[ 0.50203777, 0.55102172, 0.74798053, 0.57291239, 0.38977322,
0.40878739],
[ 0.9960413 , 0.22293403, 0.34136638, 0.12845067, 0.20262593,
0.50798698],
[ 0.05298782, 0.09129754, 0.40833606, 0.67150583, 0.19569471,
0.75176924],
...,
[ 0.97927055, 0.44649323, 0.84851791, 0.05370892, 0.94375771,
0.24508979],
[ 0.85952039, 0.2852414 , 0.85662827, 0.97665465, 0.65528357,
0.71483845],
[ 1. , 2. , 3. , 4. , 5. ,
6. ]])
In [6]: len(da)
Out[6]: 17512
And (albeit with different random numbers), I can access the top and bottom of the numpy array as follows ...
In [9]: da[:5]
Out[9]:
array([[ 0.76697236, 0.96475768, 0.09145486, 0.27159858, 0.05160006,
0.66495098],
[ 0.62635043, 0.1316334 , 0.66257157, 0.99141318, 0.77212699,
0.17016979],
[ 0.86705298, 0.11120927, 0.29585339, 0.44128326, 0.32290492,
0.99298705],
[ 0.74053894, 0.90743885, 0.99838398, 0.40713677, 0.17337202,
0.56982539],
[ 0.99136919, 0.13045787, 0.67881652, 0.03814385, 0.98036307,
0.53594215]])
In [10]: da[-5:]
Out[10]:
array([[ 0.8793664 , 0.0392912 , 0.8106504 , 0.17920025, 0.26767578,
0.98386519],
[ 0.41231276, 0.02633723, 0.7872108 , 0.60894162, 0.5358851 ,
0.65758067],
[ 0.10341791, 0.48079533, 0.1638601 , 0.5470736 , 0.7339205 ,
0.60609949],
[ 0.55320512, 0.12962241, 0.84443947, 0.81012583, 0.22057856,
0.33495709],
[ 1. , 2. , 3. , 4. , 5. ,
6. ]])
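Since the original object came from pandas.read_csv, a DataFrame-level alternative may also fit (a sketch, assuming da is a DataFrame):

import numpy as np
import pandas as pd

da = pd.DataFrame(np.random.rand(17511, 6))
new_row = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=da.columns)
da = pd.concat([da, new_row], ignore_index=True)  # appends one row
len(da)  # 17512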