Building a feature set for scikit-learn - Python

I'm using RandomForestClassifier for a probability prediction task. I have a feature set of around 50 features and two possible labels: first team wins and second team wins.
The feature set contains features for both teams. Since I know which team won, I built it so that 50% of the rows are labeled "first team wins" and 50% are labeled "second team wins", with each team's features placed in the corresponding positions: for each match in the training data, which initially lists the winning team first, I swap the two teams' features and change the label to "second team wins", alternating with a counter modulo 2.
The problem I see is that whether the counter starts from 0 or 1 makes a huge difference in the final predictions, which means the data set is asymmetrical. To tackle this I tried adding every match twice: once in the normal order with the label "first team wins", and once reversed with the label "second team wins". The question is: how does this affect the behavior of the model? I see some negative effect after the change, although not enough to be statistically significant. It does, however, obviously increase the running time for building the feature set and fitting the model.
Would randomizing the team order (and the label with it) be a more solid approach? What are my options?

Since you're comparing corresponding team features to each other, an alternative would be to reduce:
TeamA: featureA1, featureA2, featureA3 ... featureAN
TeamB: featureB1, featureB2, featureB3 ... featureBN
Output: which team wins
to:
Input: featureA1 - featureB1, featureA2 - featureB2, featureA3 - featureB3, ..., featureAN - featureBN
Output: positive if team A wins, negative if team B wins
and train your classifier on that. The benefit of this approach is that you now have half the number of features to compare, and no longer have to worry about the order of the teams.
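For concreteness, here is a minimal sketch of the difference-feature idea; the arrays team_a, team_b and a_wins (one row/entry per match) and all sizes are purely illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: one row per match, 25 features per team.
rng = np.random.default_rng(0)
team_a = rng.random((1000, 25))          # features of the first listed team
team_b = rng.random((1000, 25))          # features of the second listed team
a_wins = rng.integers(0, 2, 1000)        # 1 if the first team won, 0 otherwise

X = team_a - team_b                      # difference features: half the original width
y = a_wins

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
p_first_team_wins = clf.predict_proba(X)[:, 1]

Swapping the two teams now just negates a row of X and flips its label, so there is no longer an arbitrary ordering decision baked into the feature layout.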

Related

XGBRanker Evaluation

I am trying to learn XGBRanker on my own, and I have encountered some questions:
I am passing ranking data for five teams on multiple dates as my training set, and I am using the date as my group in the ranker. Do I need to sort the training set by date (group) and then by rank (ascending or descending here?)
The prediction is a list of numbers that indicates the predicted ranks of the teams. I learned that ndcg_score is a good evaluation tool for ranking, but y_true and y_score in sklearn.metrics.ndcg_score are defined to have a shape of (n_samples, n_labels). Does that mean I need to one-hot encode my prediction, since I have 5 teams and 5 ranks (1, 2, 3, 4, 5)? Or can I just do something like ndcg_score(y_test, y_predict)?
What is a good NDCG score? When I run a completely reversed scenario, ndcg_score([[1,2,3,4,5,6,7,8,9]], [[9,8,7,6,5,4,3,2,1]]), it gives 0.67. Given that, how can I compare the models?
Any help is greatly appreciated.
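For reference, a minimal sketch of how sklearn.metrics.ndcg_score can be called for one group of five teams without one-hot encoding; the ranks and scores below are made up, and converting ranks to relevances this way is just one common choice:

import numpy as np
from sklearn.metrics import ndcg_score

# One group (one date) of 5 teams; true ranks 1..5, where 1 is the best team.
true_rank = np.array([1, 2, 3, 4, 5])
pred_score = np.array([0.9, 0.7, 0.8, 0.2, 0.1])   # e.g. raw ranker outputs (illustrative)

# ndcg_score expects relevances where larger means better, with shape
# (n_samples, n_labels): one "sample" per group, one "label" per team.
true_relevance = (true_rank.max() - true_rank + 1).reshape(1, -1)
pred_score = pred_score.reshape(1, -1)

print(ndcg_score(true_relevance, pred_score))      # 1.0 only for a perfect ordering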

Is it reasonable to use 2 feature selection steps?

I'm building a model to identify a subset of features that classifies which group an object belongs to. In detail, I have a dataset of 11 objects, of which 5 belong to group A and 6 to group B. Each object has been characterized by the mutation status of 19,000 genes, and the values are binary: mutation or no mutation. My aim is to identify a group of genes among those 19,000 so I can predict whether an object belongs to group A or B. For example, if the object has mutations in genes A, B and C and no mutation in genes D and E, it belongs to group A; otherwise it belongs to group B.
Since I have a large number of features (19,000), I will need to perform feature selection. I'm thinking maybe I can remove features with low variance first as a primary step and then apply recursive feature elimination with cross-validation to select the optimal features. I also don't know yet which model I should use for the classification, SVM or random forest.
Can you give me some advice? Thank you so much.
As an obvious first step you can delete all features with zero variance. Also, with 11 observations against that many remaining features you will not be able to "find the truth", but you may "find some good candidates". Whether you want to set a variance threshold above zero depends on whether you have additional information or theory; if not, why not leave feature selection in the hands of the algorithm?
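A rough sketch of that two-step pipeline in scikit-learn; the data here is a small random stand-in for the real 11 x 19,000 mutation matrix, and a linear SVM is used only because RFE needs coefficients or feature importances:

import numpy as np
from sklearn.feature_selection import VarianceThreshold, RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small random stand-in for the 11 x 19,000 binary mutation matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(11, 2000))
y = np.array([0] * 5 + [1] * 6)                      # group A = 0, group B = 1

pipe = Pipeline([
    ("zero_var", VarianceThreshold(threshold=0.0)),  # step 1: drop constant genes
    ("rfe", RFECV(SVC(kernel="linear"),              # linear SVM exposes coef_ for RFE
                  step=0.1,                          # drop 10% of features per round
                  cv=StratifiedKFold(n_splits=3),    # only 11 samples, keep folds small
                  scoring="accuracy")),
])
pipe.fit(X, y)
print("selected genes:", pipe.named_steps["rfe"].n_features_)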

Runtime error on my code for CodeChef: Fit To Play (PLAYFIT)

I submitted code for this CodeChef problem:
Rayne Wooney has been one of the top players for his football club for
the last few years. But unfortunately, he got injured during a game a
few months back and has been out of play ever since.
He's got proper treatment and is eager to go out and play for his team
again. Before doing that, he has to prove his fitness to the coach
and manager of the team. Rayne has been playing practice matches for
the past few days. He's played N practice matches in all.
He wants to convince the coach and the manager that he's improved over
time and that his injury no longer affects his game. To increase his
chances of getting back into the team, he's decided to show them stats
of any 2 of his practice games. The coach and manager will look into
the goals scored in both the games and see how much he's improved. If
the number of goals scored in the 2nd game (the game which took place
later) is greater than that in 1st, then he has a chance of getting
in. Tell Rayne what is the maximum improvement in terms of goal
difference that he can show to maximize his chances of getting into
the team. If he hasn't improved over time, he's not fit to play.
Scoring equal number of goals in 2 matches will not be considered an
improvement. Also, he will be declared unfit if he doesn't have enough
matches to show an improvement.
Input:
The first line of the input contains a single integer T, the number of test cases. Each test case begins with a single integer
N, the number of practice matches Rayne has played.
The next line contains N integers. The i-th integer, g_i, on this line represents the number of goals Rayne scored in his i-th practice match. The matches are given in chronological order, i.e. j > i means match number j took place after match number i.
Output:
For each test case output a single line containing the maximum goal difference that Rayne can show to his coach and manager.
If he's not fit yet, print "UNFIT".
Constraints:
1 ≤ T ≤ 10
1 ≤ N ≤ 100000
0 ≤ g_i ≤ 1000000 (Well, Rayne's a legend! You can expect him to score so many goals!)
My code:
for _ in range(int(input())):
    num = int(input())
    goals = list(map(int, input().split()))
    list1 = []
    for i in range(num-1):
        diff = goals[i+1]-goals[i]
        list1.append(diff)
    if max(list1)>0:
        print(max(list1))
    else:
        print('UNFIT')
CodeChef is giving me a Runtime Error. Why is that?
The constraints allow N to be just 1. In that case your list1 is empty, and max([]) will produce the following error:
ValueError: max() arg is an empty sequence
The challenge mentions what you should return in such a case:
Also, he will be declared unfit if he doesn't have enough matches to show an improvement.
To solve this issue, change this line:
if max(list1)>0:
to:
if list1 and max(list1)>0:
Other remarks
Use meaningful variable names. list1 tells us what the data type is, but not what it is for.
Building list1 gives some overhead and will use O(N) extra space. Try to do it without creating a list.
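For example, a sketch of the same adjacent-difference logic that tracks the best difference on the fly instead of building a list (and handles the N = 1 case at the same time):

for _ in range(int(input())):
    num = int(input())
    goals = list(map(int, input().split()))

    best = None                        # best difference between consecutive matches so far
    for i in range(num - 1):
        diff = goals[i + 1] - goals[i]
        if best is None or diff > best:
            best = diff

    if best is not None and best > 0:
        print(best)
    else:
        print('UNFIT')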

How to evaluate lists with predicted lengths? (`tensorflow.nn.top_k` with array of `k`s from another model)

I am trying to predict medications given to patients. For each medication I have a column in the predictions (through softmax) indicating the probability that the patient will get this medication.
But obviously people can get several meds at once, so I have another model that tries to predict the number of different medications given.
I would like to evaluate them in a single TensorFlow call (I currently have a bunch of slow NumPy hacks), but I can't pass tensorflow.nn.top_k an array of ks (one for each patient, i.e. row), only a fixed integer - which doesn't work because different patients will get different numbers of meds.
Ultimately I'm trying to take the tensorflow.list_diff between the actually prescribed medication indices and the predicted ones, and then maybe the tensorflow.size of that.
tensorflow.list_diff(
    tensorflow.where(  # get indices of medications
        tensorflow.equal(medication_correct_answers, 1)  # convert 1 to True
    ),
    tensorflow.nn.top_k(  # get most likely medications
        medication_soft_max,  # medication model
        tensorflow.argmax(count_soft_max, 1)  # predicted count
    )[1]  # second element are the indices
)[:, 0]  # get unmatched medications elements
Bonus question: Would it be possible to train a model directly on this instead of two separate cross-entropies? It doesn't really look differentiable to me - or do only the underlying softmaxes need to be differentiable?
The length of the predicted list is indeed not differentiable. You need to add an extra softmax output to the model predicting the length of the list, or add many sigmoid outputs predicting which entries should be included.
I wrote a paper about transcribing variable-length text sequences from images, and the appendix goes into a lot of detail with a worked example for how the math works:
http://arxiv.org/abs/1312.6082
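To illustrate the "many sigmoid outputs" option, here is a small sketch in current TensorFlow; all sizes and tensors are made up, and this is only one way to set it up. Each medication gets its own independent sigmoid, so no per-patient k (and no top_k) is needed at prediction time:

import tensorflow as tf

# Made-up sizes: a batch of patients, 40 input features, 20 possible medications.
N_PATIENTS, N_FEATURES, N_MEDS = 32, 40, 20
features = tf.random.normal([N_PATIENTS, N_FEATURES])
targets = tf.cast(tf.random.uniform([N_PATIENTS, N_MEDS]) > 0.8, tf.float32)  # multi-hot labels

# One independent sigmoid per medication instead of a softmax plus a separate count model.
logits = tf.keras.layers.Dense(N_MEDS)(features)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))

# Each medication is predicted independently by thresholding its probability.
predicted = tf.cast(tf.sigmoid(logits) > 0.5, tf.int32)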

How to generate statistically probable locations for ships in Battleship

I made the original Battleship game and now I'm looking to upgrade my AI from random guessing to guessing statistically probable locations. I'm having trouble finding algorithms online, so my question is: what kinds of algorithms already exist for this application, and how would I implement one?
Ships: 5, 4, 3, 3, 2
Field: 10X10
Board:
OCEAN = "O"
FIRE = "X"
HIT = "*"
SIZE = 10
SEA = [] # Blank Board
for x in range(SIZE):
    SEA.append([OCEAN] * SIZE)
If you'd like to see the rest of the code, I posted it here: (https://github.com/Dbz/Battleship/blob/master/BattleShip.py); I didn't want to clutter the question with a lot of irrelevant code.
The ultimate naive solution would be to go through every possible placement of the ships (legal given what is known) and count the number of times each square is occupied.
Obviously, on a relatively empty board this will not work, as there are too many permutations, but a good start might be:
For each square on the board, go through all the ships and count in how many different ways each fits through that square, i.e. for each square along the ship's length check whether it fits horizontally and vertically (a sketch of this is at the end of this answer).
An improvement might be to also check, for each possible ship placement, whether the rest of the ships can be placed legally while covering all known 'hits' (places known to contain a ship).
To improve performance, if only one ship can be placed in a given spot, you no longer need to test it in other spots. Also, when there are many 'hits', it might be quicker to first cover all known 'hits' and, for each possible cover, go through the rest.
Edit: you might want to look into DFS.
Edit: Elaboration on OP's (#Dbz) suggestion in the comments:
Hold a set of dismissed ship placements ('dismissed'; a placement can be represented as a string, say "4V5x3" for a length-4 ship placed vertically at 5x3, covering 5x3, 5x4, 5x5, 5x6). After a guess, add all the placements the guess dismisses. Then, for each square, hold the set of placements that intersect it ('placements[x,y]'), and the probability would be:
(34 - |intersection(placements[x,y], dismissed)|) / (3400 - |dismissed|)
To add to the dismissed list:
if the guess at (X,Y) is a miss, add placements[X,Y]
if the guess at (X,Y) is a hit:
add neighboring placements (assuming that ships cannot be placed adjacently), i.e. add:
<(2,3a,3b,4,5)>H<X+1>x<Y>, <(2,3a,3b,4,5)>V<X>x<Y+1>
<(2,3a,3b,4,5)>H<X-(2,3,3,4,5)>x<Y>, <(2,3a,3b,4,5)>V<X>x<Y-(2,3,3,4,5)>
2H<X+-1>x<Y+(-2 to 1)>, 3aH<X+-1>x<Y+(-3 to 1)> ...
2V<X+(-2 to 1)>x<Y+-1>, 3aV<X+(-3 to 1)>x<Y+-1> ...
if |intersection(placements[X,Y], dismissed)| == 33, i.e. only one placement is possible, add the ship (see later)
check whether any of the previous hits has only one possible placement left; if so, add the ship
check whether any of the ships has only one possible placement left; if so, add the ship
adding a ship:
add all other placements of that ship to dismissed
for each (X,Y) of the ship's placement, add placements[X,Y] without the actual placement
for each (X,Y) of the ship's placement, mark it as a hit guess (if not already known) and run stage 2
for each (X,Y) neighboring the ship's placement, mark it as a miss guess (if not already known) and run stage 1
run stages 3 and 4.
I might have overcomplicated this, and there may be some redundant actions, but you get the point.
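A minimal sketch of the naive per-square counting described at the top of this answer (board size and ship list taken from the question; interactions between ships are ignored, and the shot sets are just examples):

from collections import defaultdict

SIZE = 10
SHIPS = [5, 4, 3, 3, 2]

def placement_counts(misses):
    # For every square, count the legal single-ship placements that cover it.
    counts = defaultdict(int)
    for length in SHIPS:
        for r in range(SIZE):
            for c in range(SIZE):
                for dr, dc in ((0, 1), (1, 0)):             # horizontal, vertical
                    squares = [(r + dr * i, c + dc * i) for i in range(length)]
                    if any(x >= SIZE or y >= SIZE for x, y in squares):
                        continue                             # placement falls off the board
                    if any(sq in misses for sq in squares):
                        continue                             # placement crosses a known miss
                    for sq in squares:
                        counts[sq] += 1
    return counts

misses, hits = {(0, 0), (5, 5)}, {(3, 4)}                    # example shots so far
counts = placement_counts(misses)
target = max((sq for sq in counts if sq not in misses | hits), key=lambda sq: counts[sq])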
Nice question, and I like your idea for a statistical approach.
I think I would have tried a machine learning approach for this problem as follows:
First model your problem as a classification problem.
The classification problem is: Given a square (x,y) - you want to tell the likelihood of having a ship in this square. Let this likelihood be p.
Next, you need to develop some 'features'. You can take the surroundings of (x,y) [on which you might have partial knowledge] as your features.
For example, the features of the middle of the following mini-board (+ indicates the square you want to determine whether or not there is a ship in):
OO*
O+*
?O?
can be something like:
f1 = (0,0) = false
f2 = (0,1) = false
f3 = (0,2) = true
f4 = (1,0) = false
(note: (1,1), the square itself, is skipped)
f5 = (1,2) = true
f6 = (2,0) = unknown
f7 = (2,1) = false
f8 = (2,2) = unknown
I'd implement the features relative to the point of origin (in this case (1,1)) and not as absolute locations on the board (so the square directly above (3,3) would also be f2).
Now, create a training set. The training set is a 'labeled' set of features - based on some real boards. You can create it manually (create a lot of boards), automatically by a random generator of placements, or by some other data you can gather.
Feed the training set to a learning algorithm. The algorithm should be able to handle 'unknowns' and be able to give probability of "true" and not only a boolean answer. I think a variation of Naive Bayes can fit well here.
After you have got a classifier - exploit it with your AI.
When it's your turn, choose to fire upon the square which has the maximal value of p. At first the shots will be fairly random, but the more shots you fire, the more information you have about the board, and the AI will exploit it for better predictions.
Note that I based the features on a neighborhood of size 1 around the square. You can of course choose any k and compute features on that bigger neighborhood; it will give you more features, but each might be less informative. There is no rule of thumb for which will work better, so it should be tested.
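A rough sketch of this feature scheme, paired with a Naive Bayes variant that treats 'unknown' as just another category (scikit-learn's CategoricalNB here; the training data is a random placeholder for features extracted from real labeled boards):

import numpy as np
from sklearn.naive_bayes import CategoricalNB

FALSE, TRUE, UNKNOWN = 0, 1, 2               # possible values of a neighbouring square

def neighbourhood_features(board, x, y):
    # The 8 squares around (x, y), in a fixed relative order, as one feature vector.
    feats = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue                     # skip the centre square itself, as in the example
            nx, ny = x + dx, y + dy
            if 0 <= nx < len(board) and 0 <= ny < len(board):
                feats.append(board[nx][ny])
            else:
                feats.append(FALSE)          # treat off-board squares as "no ship"
    return feats

# Labeled training set built from example boards (random stand-in here).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(500, 8))
y_train = rng.integers(0, 2, size=500)       # 1 if the centre square held a ship

clf = CategoricalNB()
clf.fit(X_train, y_train)
p = clf.predict_proba(X_train[:1])[0, 1]     # probability of a ship for one candidate square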
The main question is how you are going to find statistically probable locations. Are they already known, or do you want to figure them out?
In either case, I'd just make the grid weighted. In your case, the initial weight for each slot would be 1.0/(SIZE^2); the sum of the weights must equal 1.
You can then adjust weights based on the statistics gathered from the N last played games.
Now, when your AI makes a choice, it chooses a coordinate to hit based on the weighted probabilities. The quick and simple way to do that would be:
Generate a random number R in range [0..1]
Start from slot (0, 0) adding the weights, i.e. S = W(0, 0) + W(0, 1) + .... where W(n, m) is the weight of the corresponding slot. Once S >= R, you've got the coordinate to hit.
This can be optimised by pre-calculating cumulative weights for each row (see the sketch below). Have fun :)
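A small sketch of that weighted pick using a cumulative sum and binary search (the weights array here is just the uniform starting point):

import numpy as np

SIZE = 10
weights = np.full((SIZE, SIZE), 1.0 / SIZE ** 2)   # initial uniform weights, summing to 1

def pick_target(weights):
    # Choose a slot with probability proportional to its weight.
    cumulative = np.cumsum(weights.ravel())
    r = np.random.rand() * cumulative[-1]          # scale in case the weights don't sum to exactly 1
    idx = int(np.searchsorted(cumulative, r))
    return divmod(idx, SIZE)                       # (row, column)

row, col = pick_target(weights)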
Find out which ships are still alive:
alive = [2,2,3,4] # length of alive ships
Find out spots where you have not shot, for example with a numpy.where()
Loop over spots where you can shoot.
Check the sides of the given position: how many free spaces are there going left and right? Going up and down? If you can fit a boat in that many spaces, you can fit any smaller boat, so I'd run this loop from the largest ship downwards and add +1 to this position's count for every alive ship no larger than the biggest one that fits (see the sketch below).
Once you have done all of this, the position with the most points should be the most probable one to attack and hit something.
Of course, it can get as complicated as you want. Instead of asking which square to hit next, you can also ask which combination of hits will give you victory in the fewest shots, or any other parametrization of the problem. Good luck!
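A sketch of that directional counting under the same assumptions (10x10 board; 'shots' is a boolean array of squares already fired at, and all names are illustrative):

import numpy as np

SIZE = 10
alive = [2, 2, 3, 4]                         # lengths of the ships still alive
shots = np.zeros((SIZE, SIZE), dtype=bool)   # True where we have already fired

def score_board(shots, alive):
    # Score each unshot square by how many alive ships could lie through it.
    scores = np.zeros((SIZE, SIZE), dtype=int)
    for x, y in zip(*np.where(~shots)):
        left = right = up = down = 0         # contiguous free squares in each direction
        while y - left - 1 >= 0 and not shots[x, y - left - 1]:
            left += 1
        while y + right + 1 < SIZE and not shots[x, y + right + 1]:
            right += 1
        while x - up - 1 >= 0 and not shots[x - up - 1, y]:
            up += 1
        while x + down + 1 < SIZE and not shots[x + down + 1, y]:
            down += 1
        for length in alive:
            if left + right + 1 >= length:
                scores[x, y] += 1            # this ship fits horizontally through (x, y)
            if up + down + 1 >= length:
                scores[x, y] += 1            # this ship fits vertically through (x, y)
    return scores

best = np.unravel_index(np.argmax(score_board(shots, alive)), (SIZE, SIZE))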
