I am trying to learn XGBRanker on my own, and I have encountered some questions:
I am passing ranking data for five teams on multiple dates as my training set, and I am using the date as the group in the ranker. Do I need to sort the training set by date (group) and then by rank? And should the rank be ascending or descending?
The prediction is a list of numbers indicating the predicted ranks of the teams. I learned that ndcg_score is a good evaluation metric for ranking, but y_true and y_score in sklearn.metrics.ndcg_score are defined to have a shape of (n_samples, n_labels). Does that mean I need to one-hot encode my prediction, since I have 5 teams and 5 ranks (1, 2, 3, 4, 5)? Or can I just do something like ndcg_score(y_test, y_predict)?
What counts as a good NDCG score? When I run a completely reversed scenario, ndcg_score([[1,2,3,4,5,6,7,8,9]], [[9,8,7,6,5,4,3,2,1]]), it gives 0.67. Given that, how can I compare models?
Any help is greatly appreciated.
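For reference, here is a minimal sketch (an illustration rather than an answer from this thread) of calling ndcg_score on one group at a time without any one-hot encoding; it assumes y_true holds relevance values where larger means better, so a team ranked 1st out of 5 gets relevance 5:

    import numpy as np
    from sklearn.metrics import ndcg_score

    # One row per group (date), one column per team.
    # True relevance: rank 1 -> 5, rank 5 -> 1 (higher = better).
    y_true = np.array([[5, 4, 3, 2, 1]])
    # Model scores for the same five teams on that date (hypothetical values).
    y_score = np.array([[4.8, 4.1, 2.9, 2.5, 1.2]])

    print(ndcg_score(y_true, y_score))  # close to 1.0 when the ordering matches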
I want to train a logistic regression model on a dataset that has a categorical HomePlanet column containing 3 distinct values: Earth, Europa, Mars.
When I do:
pd.get_dummies(train['HomePlanet'])
it separates the categories into one column each. Then I train the model on that dataset.
I can also make the categories numerical by doing:
train['HomePlanet'] = train['HomePlanet'].replace({'Earth':1 , 'Europa':2 , 'Mars':3 })
Is it logical to use the second way to convert the categorical data and then train the model?
The first approach is called 'One Hot Encoding' and the second is called 'Label Encoding'. Generally OHE is preferred over LE because LE can introduce the properties of similarity and ranking, when in fact these don't exist in the data.
Similarity - the idea that categories encoded with numbers closer to each other are more similar. In your example, the encoding would imply that Earth is more similar to Europa than to Mars.
Ranking - labels are assigned based on a specific order that is relevant to your problem, e.g. size, distance, importance, etc. In your case, you would be saying that Mars is bigger than Europa, and Europa is bigger than Earth.
I would say that in your example one hot encoding will work better, but there are cases where label encoding makes more sense, for example converting product reviews from "very bad, bad, neutral, good, very good" to "0, 1, 2, 3, 4" respectively. In that case, very good is the best option, so it is assigned the largest number. Also, very good is more similar to good than it is to very bad, so the number for very good (4) is closer to the number for good (3) than to the number for very bad (0).
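As an illustration (a minimal sketch; only the column and category names from the question are assumed), the two encodings can be produced with pandas like this:

    import pandas as pd

    train = pd.DataFrame({'HomePlanet': ['Earth', 'Europa', 'Mars', 'Earth']})

    # One Hot Encoding: one binary column per category, no implied order
    ohe = pd.get_dummies(train['HomePlanet'])
    print(ohe)

    # Label Encoding: a single integer column, which implies an order and
    # a notion of distance between the categories
    le = train['HomePlanet'].replace({'Earth': 1, 'Europa': 2, 'Mars': 3})
    print(le)

For a genuinely ordered category like the review example, sklearn's OrdinalEncoder with the category order passed explicitly is a common choice.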
My output when using model.predict with my test set is shown below, whereas the accuracy during my training epochs was in the 0.8 range.
[[1.63658711e-04 9.99836326e-01]
[2.59015225e-02 9.74098504e-01]
[9.78065059e-02 9.02193546e-01]
[1.09802298e-02 9.89019811e-01]
[3.25678848e-04 9.99674320e-01]
[3.48023442e-03 9.96519804e-01]
[1.56172812e-02 9.84382689e-01]
[4.83522518e-03 9.95164752e-01]
[6.11863611e-03 9.93881345e-01]
[3.42085288e-04 9.99657869e-01]
[5.51505107e-03 9.94484961e-01]...]
My aim is to predict whether or not a person has heart disease. How do I compare the predictions on my test set with the true values and measure the performance of my model?
Without any code, I am assuming you are classifying between two classes. The two outputs are the probabilities that the input belongs to one class or the other.
[1.63658711e-04 9.99836326e-01]
This shows us that the model thinks the probability of class 1 is 0.000163658 (0.0163658%) and the probability of class 2 is 0.999836 (99.9836%).
I'm guessing that class 1 is heart disease and class 2 is no heart disease, based on how many of each appear.
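A minimal sketch of the usual way to turn these probabilities into class labels and score them against the true labels (model, X_test, and y_test are assumed to be the objects from your own setup, with y_test holding integer labels 0/1):

    import numpy as np
    from sklearn.metrics import accuracy_score, classification_report

    predictions = model.predict(X_test)                  # shape (n_samples, 2)
    predicted_classes = np.argmax(predictions, axis=1)   # index of the larger probability

    print(accuracy_score(y_test, predicted_classes))
    print(classification_report(y_test, predicted_classes))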
Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?
Below is an example.
I am training my model with:
model.fit(X_train,y)
where X_train = data[['var1', 'var2']]
But then during prediction, when I use:
model.predict(X_test)
X_test is defined as: X_test = data[['var1', 'var3']]
where var3 could be a completely different variable than var2.
Does predict assume that var3 is the same as var2 because it is the second column in X_test?
What if:
X_live was defined as: X_live = data[['var2', 'var1']]
Would predict know to re-order X to line them up correctly?
The names of your columns don't matter, but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those features in that order.
Just a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.
Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.
Regardless of what you pass in, you'll get the first column minus the second column. You can name the two columns whatever you want, but at the end of the day you'll still get that subtraction.
To get a little more technical, what's going on inside your model is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by with the intention of maximizing how close the output of these multiplications is to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
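To make the thought experiment concrete, here is a small sketch (hypothetical data, scikit-learn LinearRegression) showing that a model trained on (n_1, n_2) -> n_1 - n_2 applies its learned weights purely by position, so swapping the columns gives the wrong answer:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-10, 10, size=(1000, 2))   # columns: n_1, n_2
    y = X_train[:, 0] - X_train[:, 1]                # target: n_1 - n_2

    model = LinearRegression().fit(X_train, y)

    print(model.predict([[7.0, 2.0]]))   # ~ 5.0, as intended
    print(model.predict([[2.0, 7.0]]))   # ~ -5.0: same numbers, wrong order, wrong answer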
First, to answer your question "Does predict assume that var3 is the same as var2 because it is the second column in X_test?":
No. No machine learning model makes any such assumption about the data you pass into the fit function or the predict function. All the model sees is an array of numbers, possibly a multidimensional array of higher order. It is entirely up to the user to keep track of the features.
Let's take a simple classification problem, where you have 2 groups:
The first is a group of kids, with short height and therefore lower weight.
The second is a group of mature adults, with higher age, height and weight.
Now you want to classify the individual below into one of the two classes.
Age   Height   Weight
 10      120       34
Any well-trained classifier can easily assign this data point to the group of kids, since the age and weight are small. The vector the model sees is [10, 120, 34].
But now let us reorder the feature columns as [120, 10, 34]. You know that the number 120 refers to the height of the individual and not the age, but the model has no way of knowing what you know or expect, and it is bound to classify the point into the group of adults.
Hope that answers both your questions.
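A tiny sketch of that scenario (hypothetical toy data, scikit-learn's DecisionTreeClassifier) showing that the model applies what it learned purely by column position:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Columns are [age, height, weight]; labels: 0 = kid, 1 = adult (toy data).
    X = np.array([[8, 110, 25], [10, 120, 34], [12, 140, 40],
                  [30, 170, 70], [45, 180, 85], [60, 165, 75]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    print(clf.predict([[10, 120, 34]]))   # columns in training order -> kid (0)
    print(clf.predict([[120, 10, 34]]))   # same numbers reordered: the model now
                                          # reads 120 as an age, so the prediction
                                          # no longer means what you intended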
I am trying to predict medications given to patients. For each medication I have a column in the predictions (through softmax) indicating the probability that the patient will get this medication.
But obviously people can get several meds at once, so I have another model that tries to predict the number of different medications given.
I would like to evaluate them in a single TensorFlow call (I currently have a bunch of slow NumPy hacks), but I can't pass tensorflow.nn.top_k an array of ks (one for each patient, i.e. row), only a fixed integer - which doesn't work because different patients will get different numbers of meds.
Ultimately I'm trying to use tensorflow.list_diff between the actually prescribed medication indices and the predicted ones, and then maybe take the tensorflow.size of that.
tensorflow.list_diff(
tensorflow.where( # get indices of medications
tensorflow.equal(medication_correct_answers, 1) # convert 1 to True
),
tensorflow.nn.top_k( # get most likely medications
medication_soft_max, # medication model
tensorflow.argmax(count_soft_max, 1) # predicted count
)[1] # second element are the indices
)[:, 0] # get unmatched medications elements
Bonus question: Would it be possible to train a model directly on this instead of two separate cross entropies? It doesn't really look differentiable to me - or do only the underlying softmaxes need to be differentiable?
The length of the predicted list is indeed not differentiable. You need to add an extra softmax output to the model predicting the length of the list, or add many sigmoid outputs predicting which entries should be included.
I wrote a paper about transcribing variable-length text sequences from images, and the appendix goes into a lot of detail with a worked example for how the math works:
http://arxiv.org/abs/1312.6082
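A minimal sketch of the second suggestion (one sigmoid output per medication, trained with an elementwise cross-entropy); the shapes and tensors here are hypothetical stand-ins for your own logits and labels:

    import tensorflow as tf

    # Hypothetical example: 2 patients, 4 possible medications.
    # logits would come from your model's final dense layer (no softmax);
    # labels are 0/1 indicators of which medications were actually prescribed.
    logits = tf.constant([[2.0, -1.0, 0.5, -3.0],
                          [-0.5, 1.5, 2.5, 0.0]])
    labels = tf.constant([[1.0, 0.0, 1.0, 0.0],
                          [0.0, 1.0, 1.0, 0.0]])

    # One sigmoid cross-entropy per medication, averaged: fully differentiable,
    # and each patient can have any number of positive labels.
    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

    # At inference time, threshold the per-medication probabilities instead of
    # using top_k, so the predicted count falls out of the thresholding itself.
    predicted = tf.cast(tf.sigmoid(logits) > 0.5, tf.int32)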
I'm using RandomForestClassifier for a probability prediction task. I have a feature set of around 50 features and two possible labels: first team wins and second team wins.
The feature set contains features for both teams. Since I know which team won, I built it so that 50% of the set is labeled 'first team wins' and 50% 'second team wins', with each team's features placed accordingly: the training data initially lists the winning team first, so for every other match (using a counter modulo 2) I swap the two teams' features and change the label to 'second team wins'.
The problem I see is that whether I start the counter from 0 or 1 makes a huge difference in the final predictions, meaning the dataset is asymmetrical. To tackle this I tried adding every match twice: once in the normal order labeled 'first team wins', and once reversed labeled 'second team wins'. The question is: how does this affect the behavior of the model? I see some negative effect after making this change, although not enough to be statistically significant. It does, of course, increase the running time for building the feature set and fitting the model.
Would randomizing the team order (and the label with it) be a more solid approach? What are my options?
Since you're comparing corresponding team features to each other, an alternative would be to reduce:
TeamA: featureA1, featureA2, featureA3 ... featureAN
TeamB: featureB1, featureB2, featureB3 ... featureBN
Output: which team wins
to:
Input: featureA1-featureB1, featureA2-featureB2, featureA3-featureB3, ..., featureAN - featureBN
Output: positive if team A wins, negative if team B wins
and train your classifier on that. The benefit of this approach is that you now have half the number of features to compare, and no longer have to worry about the order of the teams.
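A minimal sketch of this transformation (hypothetical NumPy arrays team_a and team_b, each of shape (n_matches, n_features), with a random-forest fit just for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(42)
    n_matches, n_features = 500, 50

    # Hypothetical per-team feature matrices and outcomes.
    team_a = rng.normal(size=(n_matches, n_features))
    team_b = rng.normal(size=(n_matches, n_features))
    team_a_wins = rng.integers(0, 2, size=n_matches)   # 1 if team A won, else 0

    X = team_a - team_b                                 # featureAi - featureBi
    y = np.where(team_a_wins == 1, 1, -1)               # +1: team A wins, -1: team B wins

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

With this encoding, presenting a match as (team B, team A) simply negates X and flips y, which is exactly the symmetry the original feature set lacked.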