fit_transform and inverse_transform in two different scripts - Python

How to fit_transform & inverse_transform in separate scripts?
I first normalize numerical targets (integers) in one script.
Then I use another script to predict these numerical targets in real time (regression).
The fit_transform and inverse_transform wrapper functions are defined in a third script:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0., 1.))

def normalize(array):
    return scaler.fit_transform(array).flatten()

def inverse_norm(array):
    return scaler.inverse_transform(array).flatten()
Naively, I just "inverse_transformed" the predicted values within my real-time script.
But the predicted values were not in the range of the original numerical targets: they come out as small floats.
Thank you for your help.

In general I think you don't want to normalize your target variable, but if you want to do so, you can use a label encoder instead of a MinMaxScaler, which is rather used to normalize features.

I fixed the problem myself, thanks to mattOrNothing.
Just writing down the answer.
First script:
myNormalizedArray = (myArray - myArrayMin) / (myArrayMax - myArrayMin)
Second script:
myDenormalizedArray = myPredictedArray * (myArrayMax - myArrayMin) + myArrayMin
Where myArray and myPredictedArray are NumPy arrays.
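For reference, the reason the naive approach fails is that MinMaxScaler stores the data min/max on the fitted object itself, so a scaler freshly created in the real-time script knows nothing about the training data. An alternative to carrying the min/max around manually is to persist the fitted scaler, e.g. with joblib; a minimal sketch, where targets and preds are placeholder 1-D NumPy arrays and the file name is just an example:
# First script: fit the scaler and save it to disk
import joblib
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0., 1.))
normalized = scaler.fit_transform(targets.reshape(-1, 1)).flatten()
joblib.dump(scaler, 'fitted_scaler.joblib')

# Real-time script: load the same fitted scaler and invert the predictions
import joblib

scaler = joblib.load('fitted_scaler.joblib')
original_scale = scaler.inverse_transform(preds.reshape(-1, 1)).flatten()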

Related

ValueError: Data must be 1-Dimensional error while creating a dataframe

I am trying to solve a classification problem with a neural network, and after I get the predictions I want to create a pandas DataFrame with a column from the test dataset and my predictions as the second column. But I keep getting an error. My code and the error traceback were posted as screenshots (not reproduced here).
Important sidenote: Please take some time to look into How to make good reproducible pandas examples; there are great suggestions there on how to ask your question better.
Now for your error:
Data must be 1-dimensional
That means pandas wants a 1-dimensional array, i.e. of the form [0,0,1,1,...,1]. But your preds array is 2-dimensional, i.e. of the form [[0],[0],[1],[1],...,[1]].
So you need to flatten the preds array here:
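preds = preds.flatten()  # assuming preds has shape (n, 1); preds.ravel() also works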
Instead of for-loops consider using list comprehensions to change your code to something like this:
predictions = [1 if p>0.5 else 0 for p in preds]
df = pd.DataFrame({'PassengerId': test['PassengerId'].values,
'Survived': predictions})
Also, in the meantime, look into the ndarray.round method - maybe it will fit your use case better:
predictions = preds.round()
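Putting the two points together, a sketch assuming preds has shape (n, 1) of floats:
predictions = preds.round().astype(int).ravel()
df = pd.DataFrame({'PassengerId': test['PassengerId'].values,
                   'Survived': predictions})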

How should I modify the test data for SVM method to be able to use the `precomputed` kernel function without error?

I am using sklearn.svm.SVR for a regression task in which I want to use my own customized kernel. Here are some dataset samples and the code:
index  density  speed      label
0      14       58.844020  77.179139
1      29       67.624946  78.367394
2      44       77.679100  79.143744
3      59       79.361877  70.048869
4      74       72.529289  74.499239
... and so on
from sklearn import svm
import pandas as pd
import numpy as np

density = np.random.randint(0, 100, size=(3000, 1))
speed = np.random.randint(20, 80, size=(3000, 1)) + np.random.random(size=(3000, 1))
label = np.random.randint(20, 80, size=(3000, 1)) + np.random.random(size=(3000, 1))
d = np.hstack((density, speed, label))
data = pd.DataFrame(d, columns=['density', 'speed', 'label'])
data.density = data.density.astype(dtype=np.int32)

def my_kernel(X, Y):
    return np.dot(X, X.T)

svr = svm.SVR(kernel=my_kernel)
x = data[['density', 'speed']].iloc[:2000]
y = data['label'].iloc[:2000]
x_t = data[['density', 'speed']].iloc[2000:3000]
y_t = data['label'].iloc[2000:3000]
svr.fit(x, y)
y_preds = svr.predict(x_t)
The problem happens in the last line, svr.predict, which raises:
X.shape[1] = 1000 should be equal to 2000, the number of samples at training time
I searched the web for a way to deal with this, but many similar questions (like {1}, {2}, {3}) were left unanswered.
I had used SVM methods with the rbf, sigmoid, ... kernels before and the code worked just fine, but this was my first time using a customized kernel, and I suspected that must be why the error happened.
After a little research and reading the documentation, I found out that when using precomputed kernels, the matrix passed to SVR.predict() must have shape [n_samples_test, n_samples_train].
I wonder how to modify x_t so that the prediction works just as it does when not using customized kernels?
If possible, please also explain why the input to svm.predict differs between the precomputed kernel and the other kernels.
I really hope the related unanswered questions can be answered as well.
The problem is in your kernel function: it doesn't do the job.
As the documentation (https://scikit-learn.org/stable/modules/svm.html#using-python-functions-as-kernels) says, "Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2, n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2)." The sample kernel on the same page satisfies this criterion:
def my_kernel(X, Y):
return np.dot(X, Y.T)
In your function, the second argument of dot is X.T, so the output has shape (n_samples_1, n_samples_1), which is not what is expected.
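Putting it together, a minimal sketch of the corrected kernel with the question's own data (x, y, x_t as defined in the question):
import numpy as np
from sklearn import svm

def my_kernel(X, Y):
    # (n_samples_1, n_features) . (n_features, n_samples_2) -> (n_samples_1, n_samples_2)
    return np.dot(X, Y.T)

svr = svm.SVR(kernel=my_kernel)
svr.fit(x, y)                # builds a (2000, 2000) kernel matrix internally
y_preds = svr.predict(x_t)   # builds a (1000, 2000) matrix, so shapes now line up
Note that kernel='precomputed' is a different mode: there you compute the Gram matrix yourself, so predict expects a matrix of shape [n_samples_test, n_samples_train], whereas with a callable kernel, as here, you pass the raw features and scikit-learn builds that matrix for you.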
A shape mismatch means the test data and training data do not have compatible shapes; always think in terms of matrices or arrays in NumPy. Any arithmetic operation needs compatible shapes, which is why we check array.shape.
You can modify shapes to get the [n_samples_test, n_samples_train] matrix, but it's not the best idea; array.shape, reshape, and resize are the tools used for that.

Manual normalization function taking too long to execute

I am trying to implement a normalization function manually rather than using the one from scikit-learn. The reason is that I need to define the maximum and minimum parameters manually, and scikit-learn doesn't allow that alteration.
I successfully implemented this to normalize the values between 0 and 1, but it takes a very long time to run.
Question: Is there a more efficient way to do this? How can I make it execute faster?
Shown below is my code:
def scale(data):
    for index, row in data.iterrows():
        X_std = (data.loc[index, "Close"] - 10) / (2000 - 10)
        data.loc[index, "Close"] = X_std
    return data

scaled_train_data = scale(train_data)
2000 and 10 are the values that I defined manually rather than taking the minimum and maximum of the dataset.
Thank you in advance.
Use NumPy arrays; you can also set your min and max manually.
import numpy as np

data = np.array(df)                 # df: the original DataFrame
_min = np.min(data, axis=0)         # per-column minimum
_max = np.max(data, axis=0)         # per-column maximum
normed_data = (data - _min) / (_max - _min)
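With the question's fixed bounds, the computed min and max would simply be replaced by the manual values, e.g.:
normed_data = (data - 10) / (2000 - 10)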
Why loop? You can just use
train_data['Close'] = (train_data['Close'] - 10) / (2000 - 10)
to make use of vectorized NumPy operations. Of course, you could also put this in a function if you prefer.
Alternatively, if you want to rescale to a linear range, you could use http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html. The advantage of this is that you can save it and then rescale the test data in the same manner.
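A minimal sketch of that approach, assuming train_data and test_data are DataFrames with a 'Close' column (note that, as the question points out, this learns the min/max from the data rather than using the manual 10 and 2000):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_data[['Close']] = scaler.fit_transform(train_data[['Close']])  # learns min/max from train
test_data[['Close']] = scaler.transform(test_data[['Close']])        # reuses the same min/max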

Using Machine Learning in Python to load custom datasets?

Here's the problem:
It takes two variable inputs and predicts a result.
For example: price and volume as inputs, and a decision to buy/sell as the result.
I tried implementing this using K-nearest neighbors with no success. How would you go about this?
from sklearn.neighbors import KNeighborsClassifier

X = cleanedData['ES1 End Price']  # only accounts for 1 variable; don't know how to input another
y = cleanedData["Result"]
print(X.shape, y.shape)

kmm = KNeighborsClassifier(n_neighbors=5)
kmm.fit(X, y)  # ValueError for size inconsistency, but both are the same size
Thanks!
X needs to be a matrix/2D array where each column stands for a feature, which doesn't seem to be the case in your code; try reshaping X to 2D with X[:,None]:
kmm.fit(X[:,None], y)
Or, without resorting to reshape, it's better to always use a list to extract features from a DataFrame:
X = cleanedData[['ES1 End Price']]
Or, with more than one column:
X = cleanedData[['ES1 End Price', 'volume']]
Then X would be a 2d array, and can be used directly in fit:
kmm.fit(X, y)
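To see the difference, a quick check (column name as in the question):
X1 = cleanedData['ES1 End Price']    # pandas Series -> X1.shape == (n,)
X2 = cleanedData[['ES1 End Price']]  # pandas DataFrame -> X2.shape == (n, 1)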

How to fit multidimensional output using scikit-learn?

I am trying to fit one-vs-all classification output on training data; each row of the output sums to 1.
One possible way is to read every row, find which column has the highest value, and prepare the data for training.
E.g.: y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], considering a, b, c as the classes corresponding to columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np

y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])

def transform(y, labels):
    # map each row to the label of its highest-scoring column
    f = np.vectorize(lambda i: labels[i])
    return f(y.argmax(axis=1))

y = transform(y, 'abc')
EDIT: Using the comment by alko, I made it more general by letting the user supply the labels to the transform function.
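For reference, the same mapping can also be written without np.vectorize, by indexing a label array with the argmax result (starting again from the original float array y):
labels = np.array(list('abc'))
y_labels = labels[y.argmax(axis=1)]  # array(['b', 'b', 'c'], dtype='<U1')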
