Scikit-learn - order of fit and predict inputs, does it matter? - python

Just getting started with this library... having some issues (I've read the docs but didn't get clarity) with RandomForestClassifier
My question is pretty simple. Say I have a training data set like
A B C
1 2 3
where A is the dependent variable (y) and B-C are the independent variables (x). Let's say the test set looks the same, except that the order is
B A C
1 2 3
When I call forest.fit(train_data[0:,1:],train_data[0:,0])
do I then need to reorder the test set to match this order before predicting? (Ignoring the fact that I need to remove the y value (A) being predicted, so let's just say B and C are out of order...)

Yes, you need to reorder them. Imagine a simpler case, linear regression. The algorithm calculates a weight for each feature, so for example if feature 1 is unimportant, it will be assigned a weight close to 0.
If the order is different at prediction time, an important feature will be multiplied by this near-zero weight, and the prediction will be totally off.
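To make that argument concrete, here's a minimal sketch (the numbers and the two-column layout are made up purely for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression

# column 0 drives the target, column 1 is irrelevant noise
X_train = np.array([[1, 9], [2, 5], [3, 7], [4, 1]])
y_train = np.array([10, 20, 30, 40])

reg = LinearRegression().fit(X_train, y_train)  # learns roughly coef = [10, 0]

print(reg.predict([[5, 3]]))  # columns in training order -> roughly 50
print(reg.predict([[3, 5]]))  # columns swapped -> roughly 30, way off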

elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.
Here's a simple illustrative example:
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
    'feature_1': 0,
    'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features)  # this returns 0, as expected

# negative example
http_request_payload = {
    'feature_2': 1,  # notice that the order is jumbled up
    'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features)  # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
This is how I cache the column order during training so that I can use it during prediction time.
# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y)  # train model

# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
import joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.joblib')

# load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.joblib')

# imaginary http request payload
request_payload = {'feature_1': ..., 'feature_2': ...}

# build a one-row dataframe and force it into the cached column order
input_features = pd.DataFrame([request_payload], columns=column_order)
input_features = input_features.fillna(0)  # handle any missing data however you like
model.predict(input_features.values.tolist())

Related

sklearn.transform and sklearn.fit_transform give different results

I am trying to plot my data in a 2-dimensional space using sklearn PCA. I want to re-use the same PCA representation to plot several datasets afterwards, but let us focus on one set first.
When I run a sklearn.fit_transform on my data I get the following result:
from sklearn.decomposition import PCA as sklearnPCA  # assuming PCA was imported under this alias
import matplotlib.pyplot as plt

sklearn_pca = sklearnPCA(n_components=2, random_state=55)
X_train_proj = sklearn_pca.fit_transform(X_train)
plt.scatter(X_train_proj[:, 0],
            X_train_proj[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 1: https://i.ibb.co/B4FcV08/capture-1.png
When I run a sklearn.transform on the same data, using the PCA object created before thanks to the fit_transform, here is what I get:
X_train_proj_2 = sklearn_pca.transform(X_train)
plt.scatter(X_train_proj_2[:, 0],
            X_train_proj_2[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 2: https://i.ibb.co/0MS3Jhy/capture-2.png
My data contains absolutely no NAs and is already scaled. The size, however, is quite big, as I have ~11,000 rows and ~20 columns.
I have also quickly checked that my columns are not correlated by computing the correlation matrix.

LSTM with more features / classes

How can I use more than one feature/class as input/output on an LSTM using Sequential from Keras models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorical class with 100 different possible values indicating the sensor which collected the data;
FeatureB is an on/off indicator, being 0 or 1;
FeatureC is a categorical class too, having 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' in model.compile, the loss is over 10.0.
When normalizing the data to values between 0 and 1 and using mean_squared_error as the loss, it averages 0.27. But when testing the predictions, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression, and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but for a regression your model is going to be predicting values like 3.24, which doesn't make sense.
In effect you're converting the values of FeatureC to 5 columns of binary values. Since it seems like the classes are exclusive, there should be a single 1 and the rest of the columns in a row should be 0s. So if the first row is 'Red' it would be [1, 0, 0, 0, 0].
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x more than sensor 20; it is a different entity.
The last layer of your model should be a softmax layer with 5 neurons. Basically, your model ends up predicting the probability of each class. A sketch of such a model is shown after the data-preparation code below.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd  # assume that this has probably already been done though

feature_a = data.loc[:, "FeatureA"].values  # pull out feature A
labels = data.loc[:, "FeatureC"].values     # the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values  # pull out B to make things more clear later

# make sure that feature_b.shape = (rows, 1), otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)

labels -= 1  # subtract 1 from all label values to zero-base them (0-4)
y = keras.utils.to_categorical(labels)
# y.shape should be (rows, 5)

# convert 1-100 to binary columns
feature_a -= 1  # zero-base again
# Before: feature_a.shape = (rows,)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape = (rows, 100)

data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
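As a rough sketch of the model described above (a plain dense network rather than an LSTM, just to show the 5-neuron softmax output and the categorical_crossentropy loss; the hidden-layer size, optimizer and epoch count are placeholder choices, not recommendations):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(101,)))  # 100 one-hot sensor columns + FeatureB
model.add(Dense(5, activation='softmax'))                    # one probability per FeatureC class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(data, y, epochs=10, batch_size=32)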
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Python SKlearn contamination must be in (0, 0.5] error

I'm new to machine learning and working on a project using Python (3.6), pandas, NumPy and scikit-learn. I have done classification and reshaping, but during prediction it throws the error contamination must be in (0, 0.5].
Here's what I have tried:
# Determine the number of fraud cases in the dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# calculate the ratio of fraud to valid cases
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)
print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))

# Get all the columns from the dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"]]

# store the variables we want to predict on
target = "Class"
X = data.drop(target, axis=1)
Y = data[target]

# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

# define a random state
state = 1

# define the outlier detection methods
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        contamination=outlier_fraction)
}

# fit the models
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # run classification metrics
    print('{}:{}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Here's what it returns:
ValueError: contamination must be in (0, 0.5]
and it throws this error at the y_pred = clf.predict(X) line, as pointed out in the traceback.
I'm new to machine learning and don't have much idea about contamination, so where did I go wrong?
Help me, please!
Thanks in advance!
ValueError: contamination must be in (0, 0.5]
This means that contamination must be strictly greater than 0.0 and less than or equal to 0.5. (What does this square bracket and parenthesis bracket notation mean [first1,last1)? is a good question on the bracket notation.) As you have commented, print(outlier_fraction) outputs 0.0, so the problem lies in the first 6 lines of the code you posted.
LocalOutlierFactor is an unsupervised outlier detection algorithm, introduced in this paper. Each algorithm has its own parameters, which really change its behavior. You should always study those parameters and their effect on the algorithm before applying the method, or else you may get lost in the land of massive parameter options.
In the case of LocalOutlierFactor, it assumes your outliers make up no more than half of your dataset. In practice, I'd say that even if the outliers make up 30% of your dataset, they're not outliers anymore; they're simply a different type, or class, of data.
On the other hand, you cannot expect the outlier detection algorithm to work if you tell it that you have 0 outliers, which is effectively what happens when outlier_fraction is 0.
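If you still want to run the detectors on data that happens to contain no labelled fraud cases, one crude workaround (an assumption on my part, not a general recommendation) is to clamp the value to a small positive number before passing it as contamination:
outlier_fraction = len(Fraud) / float(len(Valid))
if outlier_fraction == 0:
    outlier_fraction = 0.001  # small positive placeholder; 0 is not a valid contamination value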

Scikit-learn, GroupKFold with shuffling groups?

I was using StratifiedKFold from scikit-learn, but now I also need to watch for "groups". There is a nice function GroupKFold, but my data are very time dependent. Similarly to the example in the help, the week number is the grouping index, and each week should be in only one fold.
Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.
Shuffling is in a group sense - so whole groups should be shuffled among each other.
Is there a way to do this elegantly with scikit-learn? It seems to me that GroupKFold is robust to shuffling the data first.
If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.
matrix, label, groups as inputs
EDIT: This solution does not work.
I think using sklearn.utils.shuffle is an elegant solution!
For data in X, y and groups:
from sklearn.utils import shuffle
X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=0)
Then use X_shuffled, y_shuffled and groups_shuffled with GroupKFold:
from sklearn.model_selection import GroupKFold
group_k_fold = GroupKFold(n_splits=10)
splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
Of course, you probably want to shuffle multiple times and do the cross-validation with each shuffle. You could put the entire thing in a loop - here's a complete example with 5 shuffles (and only 3 splits instead of your required 10):
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

n_shuffles = 5
group_k_fold = GroupKFold(n_splits=3)

for i in range(n_shuffles):
    X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=i)
    splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
    # do something with splits here, I'm just printing them out
    print('Shuffle', i)
    print('groups_shuffled:', groups_shuffled)
    for train_idx, val_idx in splits:
        print('Train:', train_idx)
        print('Val:', val_idx)
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds)
In GroupKFold, the groups array has the same length as the data.
For data in X, y and groups:
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.array([[1, 2, 1, 1], [3, 4, 7, 8], [5, 6, 1, 3], [7, 8, 4, 7]])
y = np.array([0, 2, 1, 2])
groups = np.array([2, 1, 0, 1])

group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)

param_grid = {
    'min_child_weight': [50, 100],
    'subsample': [0.1, 0.2],
    'colsample_bytree': [0.1, 0.2],
    'max_depth': [2, 3],
    'learning_rate': [0.01],
    'n_estimators': [100, 500],
    'reg_lambda': [0.1, 0.2]
}

xgb = XGBClassifier()
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)
result = grid_search.fit(X, y)
Here is a performant solution that essentially reassigns the values of the keys in a way that respects the original groups.
Code is shown below, but the 4 steps are:
Shuffle the grouping-key vector. The key goal here is to rearrange the order in which each grouping key first appears.
Use np.unique() to return the first_index values for each unique key and the inverse_index values that could be used to reconstruct the grouping-key vector.
Use fancy indexing of the inverse indexes operating on the first_index values to construct a new array of grouping keys, where each grouping key has been replaced by the index at which it first appears in the shuffled grouping vector.
This new vector of grouping keys can be used in the standard GroupKFold splitter to get a different set of splits than the original because you have reordered the grouping indexes.
To give a quick example, imagine your original grouping-key vector was [3, 1, 1, 5, 3, 5]; this procedure would create the new grouping-key vector [0, 1, 1, 3, 0, 3]. The 3's become 0's because 3 first appears at index 0, the 1's become 1's because 1 first appears at index 1, and the 5's become 3's because 5 first appears at index 3. As long as you shuffle the keys, you will get a transformation of grouping keys, leading to a different set of splits by GroupKFold.
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Say that A is the official grouping key
A = list(range(10)) + list(range(10))
B = list(range(20))
y = np.zeros(20)

X = pd.DataFrame({
    'group': A,
    'var': B
})

X = X.sample(frac=1)  # shuffle the rows (and hence the grouping keys)

original_grouping_vector = X['group']
unique_values, indexes, inverse = np.unique(original_grouping_vector, return_inverse=True, return_index=True)
new_grouping_vector = indexes[inverse]  # This is where the magic happens!

splitter = GroupKFold()
for train, test in splitter.split(X, y, groups=new_grouping_vector):
    print(X.iloc[test, :])
The above will print out different splits upon shuffling because the grouping-keys are being reordered, causing the value of new_grouping_vector to change.

Using prepared data for Sci-kit classification

I am trying to use the scikit-learn Python library to classify a bunch of URLs for the presence of certain keywords matching a user profile. A user has a name, email address ... and a URL assigned to them. I have created a txt file with the result of each profile-data match on each link, so it is in the format:
Name Email Address
0 1 0 => Relevant
1 1 0 => Relevant
0 1 1 => Relevant
0 0 0 => Not Relevant
where 0 or 1 signifies whether the attribute was found on the page (each row is a webpage).
How do I give this data to scikit-learn so it can use it to run a classifier? The examples I have seen all have data coming from a predefined scikit-learn dataset such as digits or iris, or are being generated in the right format already. I just don't know how to provide the data format I have to the library.
The above is a toy example and I have many more features than 3.
The data needed is a numpy array (in this case a "matrix") with the shape (n_samples, n_features).
A simple way to read the CSV file into the right format is to use numpy.genfromtxt. Also refer to this thread.
Let the contents of a csv file (say file.csv in the current working directory) be:
a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0
To load it we do
import numpy as np
data = np.genfromtxt('file.csv', delimiter=',', skip_header=1)
Here skip_header is set to 1 to skip the header row (the a,b,c,target line), and delimiter=',' tells NumPy that the fields are comma-separated. Refer to NumPy's documentation for more details.
Once you load the data, you need to do some pre-processing based on your input data format. The preprocessing could be something like splitting the input and the targets (classification) or splitting the whole dataset into a training and validation set (for cross-validation).
To split the input (feature matrix) from the output (target vector) we do
features = data[:, :3]
targets = data[:, 3] # The last column is identified as the target
For the CSV data given above, the arrays will look like:
features = array([[1., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.],
                  [0., 0., 1.],
                  [0., 1., 1.]])  # shape = (5, 3)
targets = array([0., 0., 1., 1., 0.])  # shape = (5,)
Now these arrays are passed to the estimator object's fit function. If you are using the popular SVM classifier, then:
>>> from sklearn.svm import LinearSVC
>>> linear_svc_model = LinearSVC()
>>> linear_svc_model.fit(X=features, y=targets)
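The pre-processing mentioned above also includes splitting the dataset into a training and validation set; here is a minimal sketch using train_test_split, reusing the features and targets arrays (the test_size value is just an example):
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(features, targets, test_size=0.2, random_state=0)
linear_svc_model = LinearSVC()
linear_svc_model.fit(X=X_train, y=y_train)
print(linear_svc_model.score(X_val, y_val))  # accuracy on the held-out validation set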
