I've used sklearn for machine learning modelling over the last couple of years and grew accustomed to what seems like a very logical and cohesive framework:
from sklearn.ensemble import RandomForestClassifier

# define a model
clf = RandomForestClassifier()
# fit the model to data
clf.fit(X, y)
# make predictions on a test set
preds = clf.predict_proba(X_test)[:, 1]
I'm now trying to learn some R and want to start doing some of the same things I was doing in sklearn. The first thing you notice coming from the sklearn world is the diverse syntax across packages, which is understandable but kind of inconvenient.
caret seems like a nice solution to that problem, creating cohesion across all the different R packages (e.g. randomForest, gbm, ...).
Though I'm still puzzled by some of the default choices (e.g. the train() method seems to default to some sort of grid search). Also, caret seems to use plyr behind the scenes, which masks some dplyr functions like summarise. Since I do lots of data manipulation with dplyr, that's kind of a problem.
Can you help me figure out what caret's equivalent of sklearn's model/fit/predict_proba is? Also, is there a way to deal with the plyr/dplyr issue?
The caret equivalent of predict_proba is to set the type argument of predict.train (see ?predict.train):
predict(model, data, type="prob")
If you want to mix dplyr and plyr, the easiest fix is to call the function you need explicitly by namespace:
dplyr::summarise
or
plyr::summarise
If you have already tried predict(..., type="prob"), run into a weird error you didn't understand, and given up, I would recommend reading this thread: Predicting Probabilities for GBM with caret library
Related
I've found a post similar to my question: XGBoost - Country Feature should be labeld or one hot encoded?
I have two columns (colour, dayofweek) which are encoded as integers 1, 2, 3, ..., 6, 7.
In theory, if I don't one-hot encode (OHE) them, the algorithm will assume a numeric ranking, e.g. 1 > 2 > 3. To avoid this, I should OHE.
So I created two sets of pipelines, with and without OHE, and ran several algorithms through each:
from sklearn.pipeline import Pipeline

# pipelines with one-hot encoding (ct['ohe'] is a ColumnTransformer, sketched below)
for k, v in model_dict.items():
    pipeline_dict[k] = Pipeline([('preprocessor', ct['ohe']), ('model', v)])
# pipelines without one-hot encoding
for k, v in model_dict.items():
    pipeline_dict_no_ohe[k] = Pipeline([('model', v)])
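For reference, the ct['ohe'] preprocessor above is presumably a ColumnTransformer along these lines; this is just a sketch, and the column names are assumptions:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the two integer-coded categorical columns,
# pass any remaining columns through unchanged
ct = {
    'ohe': ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), ['colour', 'dayofweek'])],
        remainder='passthrough',
    )
}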
The results for KNN, Gaussian Naive Bayes, XGBoost, Random Forest, and Decision Tree:
With OHE:    knn = 0.73622, gnb = 0.65814, xgb = 0.78996, rf = 0.79015, dt = 0.79041
Without OHE: knn = 0.77133, gnb = 0.70049, xgb = 0.94987, rf = 0.94138, dt = 0.83169
This is very surprising to me. Going by the results I would pick the version without OHE, but that feels wrong, since it doesn't seem like the right thing to do.
Questions:
What is the reason for the better results without OHE?
Does the algorithm really think there is a ranking to colour and dayofweek?
Would it cause problems in the future if the model assumes a ranking that isn't there?
The reason I'm concerned is that recently, in a test, I corrupted my DataFrame and it still gave fantastic results! That's why I'm asking: I want better confidence in the model I'm creating.
Thanks very much!
The models will assume there is an ordering, and even arithmetic relationships (e.g. Monday + Wednesday = Thursday), with this encoding, unless they are programmed to recognize the integers as representatives of categories (not the case with any of your sklearn models). Tree models don't use arithmetic at all, only ordering, and sometimes they can perform well even with the artificial ordering imposed; for that, have a search on datascience.SE or stats.SE, e.g. my answer here and the Linked questions from there.
Your dayofweek actually is ordinal, although a cyclic representation (sketched below) might be even better; colour, though, is almost surely not.
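As a rough illustration of the cyclic idea: map each day onto a point on the unit circle with sine/cosine features, so day 7 sits next to day 1 instead of being the "largest" value. A minimal sketch, with made-up column names:

import numpy as np
import pandas as pd

# assume dayofweek is coded 1..7, as in the question
df = pd.DataFrame({'dayofweek': [1, 2, 3, 4, 5, 6, 7]})

# two features place the days evenly around a circle
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)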
To the questions:
It's nearly impossible to say without the data. Tree models doing better makes some sense to me (not as a rule, but as a possibility), but kNN and GNB are surprising.
Yes.
If you're properly scoring on unseen test data, this should be fine. Data drift might cause more issues with this "improper" encoding, but it's really very hard to say.
Is there a way to perform probabilistic PCA using Python and scikit-learn? I am trying to perform PPCA but I can't find a library that does it.
https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
There's an example there that touches on it and I think it will help you. It looks like you have to do your own scoring to get the exact probabilistic PCA implementation you're after for your data. Playing around with the results of an implementation along those lines will probably help you figure out your issues.
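Concretely, scikit-learn's PCA already exposes likelihood-based quantities from the Tipping & Bishop probabilistic PCA model, which may get you most of the way there. A minimal sketch on random placeholder data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)  # placeholder data; substitute your own

pca = PCA(n_components=3)
pca.fit(X)

# PPCA quantities: the estimated isotropic noise variance sigma^2 and
# the average log-likelihood of the data under the fitted model
print(pca.noise_variance_)
print(pca.score(X))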
I'm using python's XGBRegressor and R's xgb.train with the same parameters on the same dataset and I'm getting different predictions.
I know that XGBRegressor uses 'gbtree' and I've made the appropriate comparison in R, however, I'm still getting different results.
Can anyone lead me in the right direction on how to differentiate the 2 and/or find R's equivalence to python's XGBRegressor?
Sorry if this is a stupid question, thank you.
Since XGBoost uses decision trees under the hood, it can give you slightly different results between fits if you do not fix the random seed to make the fitting procedure deterministic.
You can do this via set.seed() in R and numpy.random.seed() in Python.
Noting Gregor's comment, you might also want to set the nthread parameter to 1 to achieve full determinism.
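On the Python side that might look like the sketch below; the parameter values are placeholders, and n_jobs is the newer name for nthread in the sklearn wrapper:

import numpy as np
import xgboost as xgb

np.random.seed(42)  # fix NumPy's global seed

# random_state fixes XGBoost's own RNG; a single thread removes
# nondeterminism from parallel tree construction
model = xgb.XGBRegressor(n_estimators=100, random_state=42, n_jobs=1)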
I have a bunch of contact data listing what members were contacted by what offer, which summarizes something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey I wanted to get some input on whether this is a sensible way to approach the problem (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure how that would work in this case.
What I'm hoping to get are interpretable coefficients, so I can see that mailing the 50% off offer in the 3rd mailing is not as impactful as the $25 gift card, etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of the timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I'm unsure what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it, so I'm hoping I can create something useful out of this. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this, or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0), as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also use a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move. You could also convert the discounts to numerical values if you want to keep them in a single column, but this may not work so well for a linear model like logistic regression, since the relationship will probably not be linear.
If you wanted to model the first chart directly, you could use a linear regression to predict the conversion rate. I'm not sure what the difference between the two approaches amounts to; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.
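As a starting point, here is a minimal sketch of the dummy-variable plus logistic-regression approach; the column names and values are invented, since the original charts aren't reproduced here:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# one row per member/mailing, with a binary conversion outcome
df = pd.DataFrame({
    'offer': ['50_pct_off', '25_giftcard', '50_pct_off', '25_giftcard'],
    'mailing_num': [1, 3, 3, 1],
    'converted': [0, 1, 0, 1],
})

# dummy-encode the offer and the mailing number
X = pd.get_dummies(df[['offer', 'mailing_num']], columns=['offer', 'mailing_num'])
y = df['converted']

clf = LogisticRegression().fit(X, y)

# each coefficient is the log-odds contribution of one dummy column
print(dict(zip(X.columns, clf.coef_[0])))

The coefficients give you the interpretability you're after: a larger positive log-odds value for one offer dummy than another suggests that offer is more impactful, all else equal.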
Does anyone know if the data type of a variable plays a (negative) role when running a machine learning algorithm in scikit-learn?
Here's a little background that may influence responses to this question: I have a 299-variable dataset where the output variable is a dummy variable. This will be a classification problem, and I would like to try different options like logistic regression and tree-based models. When I imported my dataset with pandas, I noticed that some of the variables were assigned a data type of int64 when, in fact, they are categorical variables. Is this going to be a problem for the machine learning algorithm? Please forgive me if this is a silly question... I am still relatively new to the machine learning world, and while I have not seen anything in the literature on this topic, I want to make sure I don't go off track before I even start.
It will be a problem for scikit-learn, as scikit-learn does not support categorical features. It will end up treating those integer values as a numeric feature and will not behave as you might hope. It does support re-encoding them in a numeric form (see here), though that is sub-optimal compared to using a library and algorithms that natively support both numeric and categorical features.
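For example, with pandas you can re-encode the integer-coded categoricals before fitting; a sketch with invented column names:

import pandas as pd

# suppose 'colour' arrived as int64 codes but is really categorical
df = pd.DataFrame({'colour': [1, 2, 3, 1], 'income': [40, 55, 62, 38]})

# expand it into one dummy column per category; 'income' stays numeric
X = pd.get_dummies(df, columns=['colour'])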