Hello, I have a multiclass classification model trained on this dataset:
Label Feat1 Feat2 Feat3 Feat4
Class1 10 21 12 2
Class2 3 6 7 9
Class3 14 8 8 10
Class4 1 5 5 9
I can currently use the predict function in Scikit-Learn to apply the best model and predict a single class, which gives me the column Predicted_Label. How should I approach the problem in order to get a list of predictions, i.e. the 2nd or 3rd best prediction?
Test_Data_Set
Feat1 Feat2 Feat3 Feat4 Predicted_Label Predicted_Label_2nd_Best_Prediction
1 3 10 7 Class1 [Class1,Class4]
Please refer to this question: Understanding predict_proba from MultiOutputClassifier
You need to use predict_proba() on your model to get probabilities for every class, for every row you predict on. In your case you will get an array of length 4 per row, since you have 4 classes.
You can then take the class with the second-largest probability as the second-best prediction.
Multiclass MultiOutput Classification example on sklearn documentation
Note: Every array of length 4 from predict_proba() will add up to 1.
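As a minimal sketch (assuming your fitted estimator is called model and your test features X_test; both names are placeholders):
import numpy as np

# probabilities, shape (n_rows, n_classes); each row sums to 1
proba = model.predict_proba(X_test)

# class indices sorted by descending probability, per row
order = np.argsort(proba, axis=1)[:, ::-1]

# map indices back to class labels and keep the top 2 per row
top2 = model.classes_[order[:, :2]]
# top2[i] is e.g. ['Class1', 'Class4']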
I'm building a predictive model for whether a car is a sports car or not. The model works fine; however, I would like to join the predicted values back to the unique IDs and visualize the proportions, etc. Basically I have two dataframes:
Testing with labelled data - test_cars
CarId Feature1 Feature2 IsSportCar
1 90 150 True
2 60 200 False
3 560 500 True
Unlabelled data to be predicted - cars_new
CarId Feature1 Feature2
4 88 666
5 55 458
6 150 125
from sklearn.neighbors import KNeighborsClassifier
# Create arrays for the features and the response variable
y = test_cars['IsSportCar'].values
X = test_cars.drop(['IsSportCar','CarId'], axis=1).values
X_new = cars_new.drop(['CarId'], axis=1).values
# Create a k-NN classifier with 10 neighbors
knn = KNeighborsClassifier(n_neighbors=10)
# Fit the classifier to the data
knn.fit(X,y)
y_pred = knn.predict(X_new)
The model works fine, but I would like to join the predicted values back to each car (CarId), so that the cars_new dataframe is output with the predicted column "IsSportCar":
CarId Feature1 Feature2 IsSportCar
4 88 666 False
5 55 458 True
6 150 125 True
Any ideas how to join the predicted values back to the unique IDs?
cars_new['IsSportCar'] = y_pred
I assume y_pred is the variable you want to put into cars_new?
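For completeness, the same idea written defensively (assign is standard pandas; the assignment is positional, so it is only safe as long as cars_new is still in the row order that X_new was built from):
cars_new = cars_new.assign(IsSportCar=y_pred)
print(cars_new[['CarId', 'IsSportCar']])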
I've been following a tutorial trying to understand machine learning while trying out what he's doing at the same time.
My array is:
0 44 72000
2 27 48000
1 30 54000
2 38 61000
1 40 63777.77777777778101
0 35 58000
2 38.77777777777777857 52000
0 48 79000
1 50 83000
0 37 67000
The first column used to contain country names, but he used LabelEncoder to transform it into 0s, 1s and 2s.
He also wanted to use OneHotEncoder to turn that column into more features, but since his videos are a bit outdated he passed the categorical_features parameter to OneHotEncoder; in my sklearn version OneHotEncoder has changed and no longer has that parameter.
So how can I use OneHotEncoder now on that specific feature?
What he tried was:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Assuming that your data X has shape (n_rows, n_features): if you'd like to apply one-hot encoding to, say, the first column, a quick approach would be
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
one_hot = onehotencoder.fit_transform(X[:, 0:1]).toarray()
A better approach to apply one-hot encoding to only a specific column is to use ColumnTransformer:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
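Put together on a toy array like yours, this might look as follows (the values are made up for illustration; depending on your sklearn version and the sparse_threshold of ColumnTransformer, the result can be a SciPy sparse matrix, in which case call .toarray()):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# toy data: label-encoded country in column 0, then age and salary
X = np.array([[0, 44, 72000],
              [2, 27, 48000],
              [1, 30, 54000]], dtype=float)

ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)  # one-hot country columns first, then the untouched features
print(X)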
One-hot encoding represents your data based on categories, using one-hot vectors. For instance, if you have 2 classes your vector has length 2:
[_,_]
Each class is represented here using only 0s and 1s: the index of the represented class takes 1 and the others take 0. For instance class0 will be:
[1,0]
Class1 will be:
[0,1]
In your example you have 3 classes, so your one-hot vector will have length 3. Each class is represented like this:
Class0 -> [1,0,0]
Class1 -> [0,1,0]
Class2 -> [0,0,1]
Then your array will look like:
[1,0,0] 44 72000
[0,0,1] 27 48000
[0,1,0] 30 54000
[0,0,1] 38 61000
[0,1,0] 40 63777.77777777778101
[1,0,0] 35 58000
[0,0,1] 38.77777777777777857 52000
[1,0,0] 48 79000
[0,1,0] 50 83000
[1,0,0] 37 67000
I hope this clarifies your question. You can also write your own function to do it, as sketched below.
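A hand-rolled version might look like this (a sketch, assuming the encoded column holds integer class indices 0..n_classes-1):
import numpy as np

def one_hot(labels, n_classes):
    # one row per label, with a 1 at the label's index and 0s elsewhere
    vectors = np.zeros((len(labels), n_classes))
    vectors[np.arange(len(labels)), labels] = 1
    return vectors

# the country column from the example above
print(one_hot(np.array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0]), 3))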
I'm currently exploring the use of random forests to predict future values of occurrences (my ARIMA model gave me really bad forecasts, so I'm evaluating other options). I'm fully aware that the bad results might be due to the fact that I don't have a lot of data and its quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns representing the day, month, year and day of the week (which was later one-hot encoded), and two columns with lagged values (one with the value observed the day before, the other with the value observed two days before). The final data is like this:
Count Year Month Day Count-1 Count-2 Friday Monday Saturday Sunday Thursday Tuesday Wednesday
196.0 2017.0 7.0 10.0 196.0 196.0 0 1 0 0 0 0 0
264.0 2017.0 7.0 11.0 196.0 196.0 0 0 0 0 0 1 0
274.0 2017.0 7.0 12.0 264.0 196.0 0 0 0 0 0 0 1
286.0 2017.0 7.0 13.0 274.0 264.0 0 0 0 0 1 0 0
502.0 2017.0 7.0 14.0 286.0 274.0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
I then trained a random forest with the count as the label (what I'm trying to predict) and all the rest as features. I also made a 70/30 train/test split, trained on the train data and used the test set to evaluate the model (code below):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.
First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.
Second question: with the model trained, how do I use RF to predict future values? My goal was to give weekly forecasts for the number occurrences but I'm kind of stuck on how to do that.
If some who's a bit better and more experienced than me at this could help, I'd be very much appreciated! Thanks
Addressing your first question: random forest can tend to overfit, but that should be checked by comparing the MAE, MSE and RMSE of your train and test sets. What do you mean by accuracy? Your R squared? However, the usual way to work with models is to let them overfit at first, so you get a decent accuracy/MSE/RMSE, and later apply regularization techniques to deal with the overfitting, e.g. by setting a high min_samples_leaf or a low max_depth; a high n_estimators is also good.
Secondly, to predict future values you use the exact same model you trained, applied to the dataset you want to make predictions on. Of course, the features given during training must match the inputs given when forecasting. Furthermore, keep in mind that as time passes the new observations will be very valuable for improving your model: add them to your training dataset.
forecasting = rf.predict(dataset_to_be_forecasted)
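One caveat: because your features include the lagged counts Count-1 and Count-2, forecasting a whole week ahead means feeding each prediction back in as the lag for the next day. A rough sketch of that loop (build_features is a hypothetical helper that assembles a feature row in the same column order as your training set; all names here are illustrative):
import numpy as np

def forecast_week(rf, last_date, last_count, prev_count, build_features):
    # predict 7 days ahead, shifting predictions into the lag features
    preds = []
    count_1, count_2 = last_count, prev_count
    date = last_date
    for _ in range(7):
        date = date + np.timedelta64(1, 'D')
        row = build_features(date, count_1, count_2)  # year/month/day, weekday one-hot, lags
        y = rf.predict([row])[0]
        preds.append(y)
        count_2, count_1 = count_1, y  # shift the lags forward one day
    return preds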
I'm working on a model which will predict a number from others' opinions. For this I will use LinearRegression from sklearn.
For example, I have 5 agents from which I collect data over time; each iteration records their last change, and until an agent's first change the data contains NaN. The data looks something like this:
a1 a2 a3 a4 a5 target
1 nan nan nan nan 3 4.5
2 4 nan nan nan 3 4.5
3 4 5 nan nan 3 4.5
4 4 5 5 nan 3 4.5
5 4 5 5 4 3 4.5
6 5 5 5 4 3 4.5
So in each iteration/change I want to predict the end number. As we know, linear regression doesn't allow NaNs in the data, so I replace them with 0, which doesn't ruin the answer, because the formula of linear regression is: result = a1*w1 + a2*w2 + ... + an*wn + c.
Current questions I have at the moment:
Does my solution somehow affect the fit? Is there any better solution for my problem? Should I train my model only on complete data and then use it with my current solution?
Setting NaNs to 0 and training a linear regression to find coefficients for each of the variables is fine, depending on the use case.
Why?
You are essentially training the model and telling it that for many rows the variables a1, a2, etc. contribute nothing (wherever the value is NaN and set to 0).
If the NaNs are there because data has not been filled in yet, then setting them to 0 and training your model is wrong. It's better to train your model after all the data has been entered (at least for all the agents who have entered some data); this model can later be used to predict for new agents. Otherwise, your coefficients will be overfit to the 0s (NaNs) if many agents have not yet entered their data.
Given the end target (which is a continuous variable), linear regression is a good approach to go by.
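As a concrete sketch of the zero-fill approach (assuming your data sits in a pandas dataframe shaped like the table above; df is a placeholder name):
from sklearn.linear_model import LinearRegression

# replace NaNs with 0 so missing agents add nothing to a1*w1 + ... + an*wn + c
X = df[['a1', 'a2', 'a3', 'a4', 'a5']].fillna(0)
y = df['target']

model = LinearRegression().fit(X, y)
print(model.predict(X))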
I have a predictive modelling problem and hopefully someone has time to help me. The starting position is shown below: s1-s3 are sensor measurements and RUL is my target value.
DataStructure:
id period s1 s2 s3 RUL
1 1 510.23 643.43 1585.29 6
1 2 512.34 644.89 1586.12 5
1 3 514.65 645.11 1587.99 4
1 4 512.98 647.59 1588.45 3
1 5 516.34 649.04 1590.65 2
1 6 518.12 652.62 1593.09 1
2 1 509.77 640.61 1584.91 9
2 2 510.26 642.06 1586.00 8
2 3 511.95 643.62 1588.09 7
2 4 513.51 646.51 1589.45 6
2 5 512.17 648.06 1589.54 5
2 6 515.56 646.11 1586.22 4
2 7 518.78 649.34 1586.96 3
2 8 519.90 650.30 1588.95 2
2 9 521.05 651.39 1591.34 1
3 1 501.11 653.99 1580.45 8
3 2 511.45 643.23 1584.09 7
3 3 505.45 643.78 1586.11 6
3 4 504.45 643.43 1588.34 5
3 5 506.45 643.71 1589.89 4
3 6 511.45 643.33 1591.21 3
3 7 516.45 643.61 1592.42 2
3 8 518.45 643.05 1596.77 1
Target:
My target is to predict the remaining useful life (RUL) of unseen data. In this case I have only 1 type of machine with different ids (that means 1 type and 3 different physical systems). For the prediction the id doesn't matter, because it's the same machine. Furthermore, I want to add new features: the moving averages of s1, s2 and s3. So I have to add three new columns with the names a1, a2 and a3.
For instance, a1 should look like:
a1
NaN
NaN
512.41
513.32
514.66
515.81
NaN
NaN
510.66
511.91
512.54
513.75
515.50
518.08
519.91
NaN
NaN
506.00
507.12
505.45
507.45
511.45
515.45
The next problem is that I can't work with NaN, because it's a string. How can I ignore/work with it for a1, a2 and a3?
The next question is: how can I use regression models like random forest and bagged decision trees with a train_test_split to predict the RUL of unseen new data? (Of course I need more data; this example just shows the structure.) [s1], [s2], [s3] are my inputs and RUL is the output.
Furthermore, I want to evaluate the model with mean absolute error, mean squared error and R².
Finally, I want to use the grid search method for tuning.
Thank you in advance. I know what I want to do, but I'm not able to realize it with Python. A complete code example would be perfect.
The standard way of solving this problem is through imputation. scikit-learn has a built-in class for this. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
There are 3 strategies for replacing the NaNs:
1) replacing it with the mean of the column
2) replacing it with the most frequent value (the mode) of the column
3) replacing it with the median of the column
An example of usage would look like this:
from sklearn.preprocessing import Imputer

# axis=0 imputes column-wise; the input must be 2-D, e.g. df[['a1']]
imp = Imputer(strategy='mean', axis=0)
a1 = imp.fit_transform(a1)
There are also usage examples available here: http://scikit-learn.org/stable/modules/preprocessing.html#imputation
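The imputer covers the NaNs; for the rest of your list (moving averages per id, train/test split, model, metrics and grid search), a compact sketch could be (column names taken from your table, everything else illustrative; df is assumed to be a pandas dataframe holding your data):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# moving averages per machine id (window of 3, matching your a1 example)
for s, a in [('s1', 'a1'), ('s2', 'a2'), ('s3', 'a3')]:
    df[a] = df.groupby('id')[s].transform(lambda x: x.rolling(3).mean())
df = df.dropna()  # or impute the leading NaNs as above

X = df[['s1', 's2', 's3', 'a1', 'a2', 'a3']]
y = df['RUL']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# grid search over a small illustrative parameter grid
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    {'n_estimators': [100, 500], 'max_depth': [None, 5]}, cv=3)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print(mean_absolute_error(y_test, pred))
print(mean_squared_error(y_test, pred))
print(r2_score(y_test, pred))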