What exactly does fit() do here? - python

Basically, I want to know what the fit() function does in general, but especially in the pieces of code below.
I'm taking the Machine Learning A-Z course because I'm pretty new to machine learning (I just started). I know some basic conceptual terms, but not the technical part.
CODE 1:
import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', verbose = 0)
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
Another example where I have the same doubt:
CODE 2:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
print(sc_X)
X_train = sc_X.fit_transform(X_train)
print(X_train)
X_test = sc_X.transform(X_test)
I think that if I know the general use of this function and what exactly it does, I'll be good to go. But I'd certainly like to know what it is doing in that code.

Here is also a good reference to check: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
In machine learning, the fit method is always the step where something is learned from the data.
You normally have the following steps:
Separate your data into two or three datasets
Pick one part of your data to learn/train something (normally X_train) with fit
Use the learned algorithm to predict something on unseen data (normally X_test) with predict
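The three steps above can be sketched roughly as follows. The iris dataset, the LogisticRegression model and the 25% test size are my own choices for illustration; any estimator with fit and predict would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 1. Separate the data into a training part and a held-out test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. fit() learns the model parameters from the training part only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 3. predict() applies what was learned to data the model has not seen.
y_pred = clf.predict(X_test)
print(y_pred.shape)
```

After fit, the learned parameters live on the object itself (here in attributes such as clf.coef_), which is why predict needs no access to the training data.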
In your first example: missingvalues.fit(X[:, 1:3])
You are training SimpleImputer on your data X, using only the columns at index 1 and 2 (the slice 1:3 stops before index 3). With strategy = 'mean', fit computes the mean of each of those columns, and transform then uses these means to overwrite the missing values in that data.
In your second example, you are training StandardScaler with X_train and using what it learned for both datasets, X_train and X_test. The StandardScaler learns from X_train only, which means that if it learned that 10 has to be converted to 2, it will convert 10 to 2 in both X_train and X_test.
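To make "what was learned" visible, here is a small sketch on made-up arrays (the numbers are mine, not from the question). After fit, SimpleImputer keeps the column means in statistics_, and StandardScaler keeps the training mean and standard deviation in mean_ and scale_.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration; one value is missing.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X)                    # learns one mean per column
print(imputer.statistics_)        # the learned column means: 2.0 and 20.0
X_filled = imputer.transform(X)   # replaces NaN with the learned mean

X_train = np.array([[0.0], [10.0], [20.0]])
X_test = np.array([[10.0], [30.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std
print(scaler.mean_, scaler.scale_)
```

Note that X_test is scaled with the mean and standard deviation of X_train, not its own, so the value 10 maps to the same scaled value in both sets.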

Sklearn uses classes. See the Python documentation for more info about classes in Python. For more info about sklearn in particular, take a look at the sklearn documentation.
Here's a short description of how you are using classes in sklearn.
First you instantiate your sklearn classes with sc_X = StandardScaler() or missingvalues = SimpleImputer(...).
The objects, sc_X and missingvalues, each have methods. You can use the methods by typing object_name.method_name(...). For example, you used the fit_transform() method of the sc_X instance when you typed sc_X.fit_transform(...). This method will take your data and return a scaled version of it. It both fits (determines the scaling parameters) and transforms (applies the scaling to) your data. The transform() method will transform new data, using the same scaling parameters it learned from your previous data.
In the first example, you have separated the fit and transform methods into two separate lines, but the idea is similar -- you first learn the imputation parameters with the fit method, and then you transform your data.
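By the way, I think missingvalues = missingvalues.fit(X[:, 1:3]) could be changed to missingvalues.fit(X[:, 1:3]), because fit() updates the object in place and returns that same object, so the reassignment is redundant.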
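A quick way to convince yourself of this, using a made-up two-row array (my own toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [3.0, 4.0]])  # toy data, made up

imputer = SimpleImputer(strategy="mean")
returned = imputer.fit(X)

# fit() returns the same object it was called on, so reassigning is optional.
print(returned is imputer)

# Because fit() returns the estimator, calls can also be chained.
X_filled = SimpleImputer(strategy="mean").fit(X).transform(X)
print(X_filled)
```

The chained form does the same work as fit_transform(X) for this transformer.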

Related

Non overlapping data in train test validation split python

I'm trying to create a function for a deep learning problem involving satellite image classification. I have searched through a lot of libraries and haven't found what I need. I tried scikit-learn, but I feel that it is not what I need.
Any hint for a specialised function that I may have missed?
The sklearn train_test_split seems to fit all your needs.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
This should do the trick. You can use the permutation array on the X and y data separately if you like.
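import numpy as np
# data is assumed to be a NumPy array of samples: 50% train, 20% validation, the remaining 30% test
num_tr, num_va = int(len(data)*0.5), int(len(data)*0.2)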
perm = np.random.permutation(len(data))
tr_data = data[perm[:num_tr]]
va_data = data[perm[num_tr:num_tr+num_va]]
te_data = data[perm[num_tr+num_va:]]
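To check that a split like this leaves no sample in more than one subset, and to show the same permutation applied to X and y, here is a small self-contained sketch. The array sizes, the seeded generator and the use of np.split are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features each
y = np.arange(10)                  # one made-up label per sample

num_tr, num_va = int(len(X) * 0.5), int(len(X) * 0.2)
perm = rng.permutation(len(X))

# Cut the shuffled indices into three consecutive, disjoint pieces.
tr_idx, va_idx, te_idx = np.split(perm, [num_tr, num_tr + num_va])

# Index X and y with the same pieces so rows and labels stay aligned.
X_tr, y_tr = X[tr_idx], y[tr_idx]
X_va, y_va = X[va_idx], y[va_idx]
X_te, y_te = X[te_idx], y[te_idx]

all_idx = np.concatenate([tr_idx, va_idx, te_idx])
print(len(tr_idx), len(va_idx), len(te_idx))   # 5 2 3
```

Since perm contains every index exactly once and the three slices do not overlap, every sample lands in exactly one subset.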

Remove some features from sklearn PolynomialFeatures

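I am using the sklearn module PolynomialFeatures to fit my model with polynomials over my data.
To this end, I am doing the following:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
P = PolynomialFeatures(3, interaction_only=False, include_bias=False)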
model = make_pipeline(P, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)
It works well, and now I would like to be able to remove some of these features to refine my model. Say I would like to remove every feature that contains one of the first two variables, x_1 and x_2, without the other.
I have tried to modify my PolynomialFeatures attributes (powers_, n_input_features_...) before fitting, but scikit-learn raises a sklearn.exceptions.NotFittedError.
How should I proceed?
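One possible direction, offered as a sketch under assumptions rather than a verified answer: instead of editing the attributes, fit PolynomialFeatures first (powers_ only exists after fit), build a boolean mask from powers_, and add a column-selection step to the pipeline. The random data, the FunctionTransformer step and the mask rule below are my own illustration of that idea.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # made-up inputs: x_1, x_2, x_3
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2  # made-up target

poly = PolynomialFeatures(3, interaction_only=False, include_bias=False)
poly.fit(X)                           # powers_ is available only after fit()

# powers_[i, j] is the exponent of input j in output feature i.
has_x1 = poly.powers_[:, 0] > 0
has_x2 = poly.powers_[:, 1] > 0
keep = has_x1 == has_x2               # drop terms with exactly one of x_1, x_2

# Extra pipeline step that keeps only the selected polynomial columns.
select = FunctionTransformer(lambda Z: Z[:, keep])
model = make_pipeline(
    poly, select, Ridge(tol=0.001, alpha=1, fit_intercept=False)
)
model.fit(X, y)
print(keep.sum(), "of", len(keep), "features kept")
```

A lambda inside FunctionTransformer cannot be pickled, so a named function or a small custom transformer would be the safer choice if the model has to be saved.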

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model, can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on data frame columns).
I have trained an XGBoost model using the preprocessed data (centre and scale using MinMaxScaler). Thereby, I am in a similar situation where feature names are lost.
For instance:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier()
my_model_name.fit(X,Y)
where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, thereby discarding feature names from pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it leads to the plot of feature importance, but only with feature names such as f0, f1, f2 etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the feature names with:
model.get_booster().feature_names
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In such a case, calling model.get_booster().feature_names is not useful because the returned names are in the form [f0, f1, ..., fn], and these names are shown in the output of the plot_importance method as well.
But there should be several ways to achieve what you want, supposing you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'] or directly orig_feature_names = X.columns if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib ax, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to put your features in the correct order)
or you can take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
similarly, you can also use the model.get_booster().get_score() method and combine it with your feature names
or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training, since I usually use the Scikit-Learn API)
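As a sketch of the "combine it with your feature names" options, the snippet below maps keys of the form f0, f1, ... back to the original names. It works on a plain dict shaped like the output of get_score(), so it runs without xgboost; the scores, the column names and the helper rename_scores are hypothetical.

```python
def rename_scores(scores, orig_feature_names):
    """Map keys like 'f0', 'f3' to the original column names."""
    return {
        orig_feature_names[int(key[1:])]: value
        for key, value in scores.items()
    }


# Hypothetical values, shaped like the dict returned by get_score()
# for a model trained on a NumPy array with three columns.
scores = {"f0": 12.0, "f2": 5.0}
orig_feature_names = ["age", "income", "height"]

named = rename_scores(scores, orig_feature_names)
ranked = sorted(named.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('age', 12.0), ('height', 5.0)]
```

Going through the index in the key, rather than zipping names with values, matters because, as far as I know, get_score() leaves out features that were never used in a split, so the dict can be shorter than the list of names.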
EDIT:
Thanks to #Noob Programmer (see comments below): there might be some "inconsistencies" depending on which feature importance method is used. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)
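A tiny numeric illustration of the normalization point, using made-up split counts (not output from a real model): dividing by the total changes the scale but not the ranking, as long as both numbers come from the same importance type.

```python
import numpy as np

# Hypothetical "weight" values (split counts) for three features.
raw_weight = np.array([12.0, 0.0, 5.0])

# The same values rescaled so that they sum to 1.
normalized = raw_weight / raw_weight.sum()

print(normalized.sum())
print(np.argsort(-raw_weight), np.argsort(-normalized))
```

If the two methods are configured with different importance types (for example "weight" versus "gain"), the rankings themselves can differ, which is the more serious source of inconsistency.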
For more info on this topic, look at How to get feature importance.
I tried the above answers, and they didn't work when loading the model after training.
So, the working code for me is:
model.feature_names
It returns a list of the feature names.
I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier, plot_importance
Y = label  # label: the training labels, assumed to be defined elsewhere
X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
# Wrap the scaled array in a DataFrame so the column names are kept
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)
my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df, Y)
plot_importance(my_model_name)
plt.show()
This will show the original names.

Up-/downsampling with One vs. rest classifier

I have a data set (tf-idf weighted words) with multiple classes that I try to predict. My classes are imbalanced. I would like to use the One vs. rest classification approach with some classifiers (eg. Multinomial Naive Bayes) using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of the combinations of up- and downsampling) to enhance my data. The normal approach of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I then would use the classifier on the resampled data.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But: now there is a huge imbalance for every label when it's fitted, because I have in total more than 50 labels. Right? I imagine that I need to apply the up-/downsampling method for every label instead of doing it once at the beginning. How can I use the resampling for every label?
As per the discussion in comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
('clf', MultinomialNB())])
# OVR will transform the `y` as you know and
# then pass single label data to different copies of pipe
# multiple times (as many labels in data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass the individual data to it.
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to SMOTEENN inside it and so will do a different sampling to balance the classes there.
b. The second part of pipe (clf) will get that balanced dataset for each label as you wanted.
Step 4: At prediction time, the sampling part will be turned off, so the data will reach the clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
Hope this helps.
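To see why resampling once up front is not enough, here is a small sketch of Step 1 on made-up labels. It uses LabelBinarizer to produce one binary column per class, in the same spirit as the one-vs-rest split; the four classes and their sizes are my own example.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up labels: four classes with 25 samples each, i.e. balanced overall.
y = np.repeat(["a", "b", "c", "d"], 25)

# One column per class: 1 where the sample has that class, 0 elsewhere.
Y = LabelBinarizer().fit_transform(y)

# Fraction of positives in each one-vs-rest problem.
positive_rate = Y.mean(axis=0)
print(positive_rate)   # [0.25 0.25 0.25 0.25]
```

Even with perfectly balanced classes, each binary problem has only 1 positive in 4 samples here, and with 50 labels it would be about 1 in 50. That is the imbalance the per-label resampling inside the pipeline is meant to address.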

How can we predict using RandomForestClassifier obtained from pyspark.ml

I am doing a text classification and I have built a model using the pipeline method. I have created the RF classifier object and have set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting my training data which I have created using a dataframe and it has the columns "labels" and "sentences". The labels are different question types. The DF looks like,
training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data that is coming from Kinesis and it will be only sentences (plain strings). For each sentence, I have to predict the label. But now I am not finding any predict() function when I do dir(model). How come there is no predict() method for the RandomForestClassifier obtained from pyspark.ml? If not, how can I perform my use case successfully? I need the predict() method to satisfy the requirement. What ML algorithm should I use if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
So I figured out that there is no predict() method that can be used. Instead, we need to use the transform() method to make predictions. Just remove the label column and create a new dataframe. For example, in my case, I did:
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can find the prediction using the select() method. At least for now, this solution worked successfully for me. Please do let me know if there is any correction or a better approach than this.
I am also working on the same problem. Can you tell me what "pos" (part of speech) is in the pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code:
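from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")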
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.
