Right now I have a 2 by 2 numpy array. When I use RobustScaler, it normalizes each column one at a time, whereas I wish to normalize everything all at once. Is there any way to do that?
From the documentation, RobustScaler:
removes the median and scales the data according to the quantile range
So you need to compute the median and the quantile range for the whole array; for this you can use the np.median and np.percentile functions, which is what sklearn does under the hood. The code:
import numpy as np
from sklearn.preprocessing import robust_scale

data = np.array([[3, 6],
                 [9, 12]], dtype=np.float64)

# column-wise scaling (the default behaviour)
result = robust_scale(data, axis=0)
print(result)

# scaling the whole array at once: reshape to a single row and scale along axis=1
reshape = data.reshape((1, 4))
result = robust_scale(reshape, axis=1)

# the same thing done manually
me = np.median(data.flat)                        # 7.5
percentiles = np.percentile(data, (25.0, 75.0))  # 5.25, 9.75
data -= me
data /= (percentiles[1] - percentiles[0])
print(data)
Output
[[-1. -1.]
 [ 1.  1.]]
[[-1.         -0.33333333]
 [ 0.33333333  1.        ]]
In the example I used (25.0, 75.0) because these are the default values for the quantile range; also, the robust_scale function is equivalent to the RobustScaler class (see the See Also section of the documentation).
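If you want the globally scaled values back in the original 2×2 shape, a minimal sketch building on the reshape idea above is to flatten, scale along axis=1, and reshape back:

import numpy as np
from sklearn.preprocessing import robust_scale

data = np.array([[3, 6],
                 [9, 12]], dtype=np.float64)

# scale the flattened array as a single feature, then restore the original shape
scaled = robust_scale(data.reshape(1, -1), axis=1).reshape(data.shape)
print(scaled)
# [[-1.         -0.33333333]
#  [ 0.33333333  1.        ]]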
While working my way through learning ML, I am confused by sklearn's MinMaxScaler. The goal is to normalize numerical data into the range [0, 1].
Example code:
from sklearn.preprocessing import MinMaxScaler
data = [[1, 2], [3, 4], [4, 5]]
scaler = MinMaxScaler(feature_range=(0, 1))
scaledData = scaler.fit_transform(data)
Giving output:
[[0.         0.        ]
 [0.66666667 0.66666667]
 [1.         1.        ]]
The first row [1, 2] got transformed into [0, 0], which in my eyes means:
The ratio between the numbers is gone.
Neither value has any importance anymore, as both got set to the min value (0).
Example of what I have expected:
[[0.1, 0.2]
[0.3, 0.4]
[0.4, 0.5]]
This would have preserved the ratios and put the numbers into the range 0 to 1.
What am I doing wrong or misunderstanding about MinMaxScaler here? Thinking of use cases like training on time series, it makes no sense to transform meaningful numbers like prices or temperatures into distorted values like the above.
According to the documentation, MinMaxScaler scales and translates each feature individually to a given range using the formula below, so the behaviour you are seeing comes directly from that formula.
Formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Let us try and see what happens when you use it on your data.
You need to use numpy for this.
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler()  # equivalent to the manual computation below
data = [[1, 2], [3, 4], [4, 5]]

# the target range comes from the feature_range you specify (default (0, 1))
feature_min, feature_max = 0, 1

X_std = (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
X_scaled = X_std * (feature_max - feature_min) + feature_min
This returns as expected:
array([[0.        , 0.        ],
       [0.66666667, 0.66666667],
       [1.        , 1.        ]])
As for your doubts about using MinMaxScaler: you could use StandardScaler if you have outliers that are quite different from most of the values but are still valid data.
StandardScaler is used the same way as MinMaxScaler, but it scales your values so that they have mean 0 and standard deviation 1. Since the mean and standard deviation are computed from all the values in the series, rather than just the two extremes, it is less sensitive to a single outlier than min-max scaling.
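For instance, a minimal sketch of that swap on the same data; the commented output is approximate, worked out by hand from the mean/std formula rather than copied from a particular run:

from sklearn.preprocessing import StandardScaler

data = [[1, 2], [3, 4], [4, 5]]

# fit on the data and transform it in one step, column by column
scaler = StandardScaler()
scaledData = scaler.fit_transform(data)
print(scaledData)
# roughly:
# [[-1.336 -1.336]
#  [ 0.267  0.267]
#  [ 1.069  1.069]]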
I have a numpy array of floats which I have plotted: the red circles are the original values, the blue crosses are a linear interpolation using numpy.interp.
I would like to find the abscissa of the zero crossing of this numpy array (red circle) using scipy.optimize.bisect (for example). Since this is a numpy array (and not a function) I cannot pass it directly to scipy.optimize.bisect. So I was thinking to pass a function that interpolates the numpy array to bisect. Here is the code I am using for the moment:
import numpy as np
import scipy.optimize

def Inter_F(x, xp, fp):
    return np.interp(x, xp, fp)

Numpyroot = scipy.optimize.bisect(Inter_F, 0, 9, args=(XNumpy, YNumpy))
I find a value that seems correct, Numpyroot = 3.376425289196618.
I am wondering:
whether this is the technically correct way to use scipy.optimize.bisect on a numpy array, especially since I am going to do this 10^6 times on different sets of numpy values;
whether enforcing a linear interpolation influences the result that bisect is going to find, and if so, whether there are better choices.
Here are the two numpy arrays:
XNumpy = array([ 0. , 1.125, 2.25 , 3.375, 4.5 , 5.625, 6.75 , 7.875, 9. ])
YNumpy = array([ -2.70584242e+04, -2.46925289e+04, -1.53211676e+04,
                 -2.30000000e+01,  1.81312104e+04,  3.41662461e+04,
                  4.80466863e+04,  5.75113178e+04,  6.41718009e+04])
I think what you do is correct. However, there is a more concise way.
import numpy as np
from scipy.interpolate import interp1d
XNumpy = np.array([0., 1.125, 2.25, 3.375, 4.5, 5.625, 6.75, 7.875, 9.])
YNumpy = np.array([
-2.70584242e+04, -2.46925289e+04, -1.53211676e+04,
-2.30000000e+01, 1.81312104e+04, 3.41662461e+04,
4.80466863e+04, 5.75113178e+04, 6.41718009e+04
])
invf = interp1d(YNumpy, XNumpy)
print(invf(0))
Result:
array(3.376425289199028)
Here I use scipy.interpolate.interp1d to return a function. I also interpolate the inverse function (x as a function of y) so that the abscissa is readily calculated. Of course you can do the same trick with np.interp; I just like scipy.interpolate.interp1d because it returns a function, so I can calculate the x value for any given y value.
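For completeness, a minimal sketch of the np.interp variant mentioned above, reusing XNumpy and YNumpy from the previous snippet; it works because YNumpy is monotonically increasing, which np.interp requires of its x-coordinates:

# inverse interpolation: at which x does y pass through 0?
root = np.interp(0.0, YNumpy, XNumpy)
print(root)  # ~3.3764, consistent with the bisect and interp1d results

Since both np.interp and interp1d invert the same piecewise-linear interpolant, they agree with the bisect result up to floating-point tolerance.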
There are N distributions which take on integer values 0, 1, ... with associated probabilities. Further, assume three variables of the form [value, prob]:
import numpy as np
x = np.array([ [0,0.3],[1,0.2],[3,0.5] ])
y = np.array([ [10,0.2],[11,0.4],[13,0.1],[14,0.3] ])
z = np.array([ [21,0.3],[23,0.7] ])
As there are N variables, I convolve first x+y, then add z, and so on.
Unfortunately numpy.convolve() takes 1-d arrays as input, so it does not suit this case directly. I have played with padding the variables so that they all take the values 0, 1, 2, ..., 23 (with Pr=0 where a value does not occur), but I feel like there is a much better solution.
Does anyone have a suggestion for making it more efficient? Thanks in advance.
I don't see a built-in method for this in SciPy; there is a way to define a custom discrete random variable, but those don't support addition. Here is an approach using pandas, assuming import pandas as pd and x, y, z as in your example:
# all pairwise sums of values and products of probabilities
values = np.add.outer(x[:,0], y[:,0]).flatten()
probs = np.multiply.outer(x[:,1], y[:,1]).flatten()
# group equal sums and add up their probabilities
df = pd.DataFrame({'values': values, 'probs': probs})
conv = df.groupby('values').sum()
result = conv.reset_index().values
The output is
array([[ 10.  ,   0.06],
       [ 11.  ,   0.16],
       [ 12.  ,   0.08],
       [ 13.  ,   0.13],
       [ 14.  ,   0.31],
       [ 15.  ,   0.06],
       [ 16.  ,   0.05],
       [ 17.  ,   0.15]])
With more than two variables, you don't have to go back and forth between numpy and pandas: the additional variables can be included at the beginning.
values = np.add.outer(np.add.outer(x[:,0], y[:,0]), z[:,0]).flatten()
probs = np.multiply.outer(np.multiply.outer(x[:,1], y[:,1]), z[:,1]).flatten()
Aside: it would be better to keep values and probabilities in separate numpy arrays, if they have different intrinsic data types (integers vs reals).
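Following up on that aside, here is a minimal numpy-only sketch (no pandas) that keeps values and probabilities in separate arrays and does the grouping with np.unique and np.bincount:

import numpy as np

x = np.array([[0, 0.3], [1, 0.2], [3, 0.5]])
y = np.array([[10, 0.2], [11, 0.4], [13, 0.1], [14, 0.3]])

# all pairwise sums of values and products of probabilities
values = np.add.outer(x[:, 0], y[:, 0]).ravel()
probs = np.multiply.outer(x[:, 1], y[:, 1]).ravel()

# group equal sums and add their probabilities
uniq_values, inverse = np.unique(values, return_inverse=True)
summed_probs = np.bincount(inverse, weights=probs)
print(uniq_values)   # the support of x + y
print(summed_probs)  # the corresponding probabilities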
Given an array such as:
import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
What's the quickest way to calculate the growth rates of each row, so that my results would be 0.52083333333333326 and 0.13640873015873009, respectively?
I tried using:
>>> np.nanmean(np.rate(1,0,-a[:-1],a[1:]), axis=0)
array([ 5. , 2.5 , 1.66666667, 1.25 , 1. ])
but of course it doesn't yield the right result and I don't know how to get the axis right for the numpy.rate function.
Compute the element-to-element ratios along each row, average them, and subtract 1:
In [262]: a = np.array([[1,2,3,4,5],[6,7,8,9,10]]).astype(float)
In [263]: np.nanmean((a[:, 1:]/a[:, :-1]), axis=1) - 1
Out[263]: array([ 0.52083333,  0.13640873])
To take your approach using numpy.rate, you need to index into your a array properly (consider all rows separately) and use axis=1:
In [6]: np.nanmean(np.rate(1,0,-a[:,:-1],a[:,1:]), axis=1)
Out[6]: array([ 0.52083333, 0.13640873])
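As a side note, NumPy's financial functions, including np.rate, were deprecated and later removed from NumPy itself; on recent versions the equivalent lives in the separate numpy_financial package. A minimal sketch of the same computation, assuming that package is installed:

import numpy as np
import numpy_financial as npf  # pip install numpy-financial

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=float)
# same row-wise rate computation as above, via numpy_financial.rate
print(np.nanmean(npf.rate(1, 0, -a[:, :-1], a[:, 1:]), axis=1))
# expected: [0.52083333 0.13640873]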
I have a dataset array A. A is n×2. It can be plotted on the x and y axes.
A[:,1] gets me all of the y values and A[:,0] gets me all the x values.
Now, I have a few other dataset arrays that are similar to A. X values are the same for these similar arrays. How do I calculate the standard deviation of the datasets? There should be a std value for each X. In the end my result std should have a length of n.
I can do this the manual way with loops but I'm not sure how to do this using NumPy in a pythonic and simple manner.
Here is some sample data:
A = [[0, 2.54], [1, 254.5], [2, -43]]
B = [[0, 3.34], [1, 154.5], [2, -93]]
std_Array = [std(2.54, 3.34), std(254.5, 154.5), std(-43, -93)]
Suppose your arrays are all the same shape and they are in a list. Then to get the standard deviation of the first column of each you can do
arrays = [np.random.rand(10, 2) for _ in range(8)]
np.dstack(arrays).std(axis=0)[0]
This stacks the 2-D arrays into a 3-D array and then takes the std along the first axis, giving a 2 × 8 result (8 being the number of arrays). The first row of that result holds the std devs of the 8 sets of x values.
If you post some sample data perhaps we could help more.
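If instead you want, for each x, the standard deviation of the y values across the arrays (one value per row, which is what the question's sample std_Array describes), a variant of the same dstack idea is to take the std along the last axis and keep only the y column:

# std across the stacked arrays (the last axis), keeping only the y column:
# one value per x, i.e. a length-10 result for the example arrays above
per_x_std = np.dstack(arrays).std(axis=2)[:, 1]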
Is this pythonic enough?
std_Array = np.std((A, B), axis=0)[:, 1]
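A small self-contained sketch of that one-liner with the sample A and B from the question; the expected result follows from the std of two values being half their absolute difference:

import numpy as np

A = [[0, 2.54], [1, 254.5], [2, -43]]
B = [[0, 3.34], [1, 154.5], [2, -93]]

# stack A and B along a new first axis, take the std across the two datasets,
# then keep only the y column
std_Array = np.std((A, B), axis=0)[:, 1]
print(std_Array)  # [ 0.4 50.  25. ]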
li_arr = [np.array(x)[:, 1] for x in [A, B]]
This will produce numpy arrays containing the specific column you want; the result will be
[array([ 2.54, 254.5 , -43. ]), array([ 3.34, 154.5 , -93. ])]
Then you stack the values using column_stack:
arr = np.column_stack(li_arr)
This is the result of the stacking:
array([[   2.54,    3.34],
       [ 254.5 ,  154.5 ],
       [ -43.  ,  -93.  ]])
And finally:
np.std(arr, axis=1)