Is it reasonable to use 2 feature selection steps? - python

I'm building a model to identify a subset of features for classifying which group an object belongs to. In detail, I have a dataset of 11 objects, of which 5 belong to group A and 6 to group B. Each object has been characterized by the mutation status of 19,000 genes, and the values are binary: mutation or no mutation. My aim is to identify a group of genes among those 19,000 so I can predict whether an object belongs to group A or B. For example, if an object has mutations in genes A, B, C and no mutations in genes D, E, it belongs to group A; otherwise it belongs to group B.
Since I have a large number of features (19,000), I will need to perform feature selection. I'm thinking I could first remove features with low variance as a preliminary step and then apply recursive feature elimination with cross-validation to select the optimal features. I also don't know yet which model I should use for the classification, SVM or random forest.
Can you give me some advice? Thank you so much.

Obviously, as a first step you can delete all features with zero variance. Also, with 11 observations against the remaining features you will not be able to "find the truth", but you may "find some good candidates". Whether you want to set a variance cutoff above zero depends on whether you have additional information or theory; if not, why not leave feature selection in the hands of the algorithm?
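A minimal sketch of this two-step pipeline in scikit-learn (the estimator choice, step size, and CV scheme here are illustrative assumptions, not recommendations):

import numpy as np
from sklearn.feature_selection import VarianceThreshold, RFECV
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

# Hypothetical data: 11 samples x 19,000 binary mutation features.
X = np.random.randint(0, 2, size=(11, 19000))
y = np.array([0] * 5 + [1] * 6)  # 5 in group A, 6 in group B

# Step 1: drop constant (zero-variance) features.
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: recursive feature elimination with cross-validation on a
# linear SVM; with 11 samples, leave-one-out CV is about the only
# option, and the selected genes will be highly unstable.
selector = RFECV(SVC(kernel="linear"), step=1000, cv=LeaveOneOut())
selector.fit(X_reduced, y)
print("number of selected features:", selector.n_features_)

Keep in mind that with 11 observations, any such selection yields candidates, not "the truth".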

Related

Is there any way to ensure that your CNN model predicts outputs following certain thresholds?

I have a network which takes in an input image and outputs 37 values that are essentially the features. For example, the output is a series of questions whose values are the percentage of people who agreed on the said feature: 0.60 for class1 and 0.4 for class12.
Now, there are some conditions such that one output of the model can't be higher than another. E.g. class1.1 must be higher than class3.2, as it is a higher question in the decision tree.
Is there any way we can implement this?
Instead of directly outputting a, b from your neural network, you can output a, a + ReLU(b), which ensures the second output is higher than or equal to the first.
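As a minimal sketch of this trick (assuming a Keras-style model; the layer sizes are arbitrary):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32,))
a = tf.keras.layers.Dense(1)(inputs)   # unconstrained first output
b = tf.keras.layers.Dense(1)(inputs)   # raw second output
# ReLU(b) is non-negative, so a + ReLU(b) is always >= a.
second = tf.keras.layers.Add()([a, tf.keras.layers.ReLU()(b)])
outputs = tf.keras.layers.Concatenate()([a, second])
model = tf.keras.Model(inputs, outputs)

The model still trains end to end, since ReLU is differentiable almost everywhere; the ordering constraint is built into the architecture rather than enforced by a penalty.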

Do the name and order of features matter for a prediction algorithm?

Do the names/order of the columns of my X_test dataframe have to be the same as the X_train I use for fitting?
Below is an example
I am training my model with:
model.fit(X_train,y)
where X_train = data[['var1', 'var2']]
But then during prediction, when I use:
model.predict(X_test)
X_test is defined as: X_test = data[['var1', 'var3']]
where var3 could be a completely different variable than var2.
Does predict assume that var3 is the same as var2 because it is the second column in X_test?
What if:
X_live was defined as: X_live = data[['var2', 'var1']]
Would predict know to re-order X to line them up correctly?
The names of your columns don't matter, but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those features in that order.
Just a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.
Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.
Regardless of what you pass in, you'll get the result of the first thing you passed in minus the second. You can name the first and second inputs whatever you want, but at the end of the day you'll still get the result of the subtraction.
To get a little more technical, what's going on inside your model is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by with the intention of maximizing how close the output of these multiplications is to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
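Here is a small runnable version of that thought experiment (LinearRegression is an illustrative choice; any regressor would behave the same way):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 2))  # columns: n_1, n_2
y = X_train[:, 0] - X_train[:, 1]            # label: n_1 - n_2

model = LinearRegression().fit(X_train, y)

x = np.array([[7.0, 3.0]])
print(model.predict(x))           # ~4.0: first column minus second
print(model.predict(x[:, ::-1]))  # ~-4.0: columns swapped, answer flips

No intelligent feature rearranging happens: swap the columns and the prediction flips sign.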
First, to answer your question "Does predict assume that var3 is the same as var2 because it is the second column in X_test?":
No. No machine learning model makes any such assumption about the data you pass into the fit or predict function. What the model sees is simply an array of numbers, possibly a multidimensional array of higher order. It is entirely up to the user to keep track of the features.
Let's take a simple classification problem where you have 2 groups:
The first is a group of kids, with short height and thereby lower weight.
The second is a group of mature adults, with higher age, height and weight.
Now you want to classify the individual below into one of the classes.

Age   Height   Weight
10    120      34
Any well trained classifier can easily classify this data point into the group of kids, since the age and weight are small. The vector the model will consider is [10, 120, 34].
But now let us reorder the feature columns as [120, 10, 34]. You know that the number 120 is meant to refer to the individual's height and not the age, but the model cannot know what you know or expect, and it is bound to classify the point into the group of adults.
Hope that answers both your questions.

How to find values below (or above) average

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.
from io import StringIO  # Python 3 (was `from StringIO import StringIO` in Python 2)
import pandas as pd

myst = """01/01/2016 8781262
01/02/2016 8958598
01/03/2016 8787628
01/04/2016 9770861
01/05/2016 8409410
01/06/2016 8924784
01/07/2016 8597500
01/08/2016 6436862
01/09/2016 1542677
"""
u_cols = ['month', 'count']
df = pd.read_csv(StringIO(myst), sep='\t', names=u_cols)
Is there a mathematical formula that can define this ambiguous "way below or too high" concept?
This would be easy if I defined a fixed limit (e.g. 9 or 10%). But I want the script to decide that for me and return the values, for example, if the difference between the lowest and second lowest value is more than 5% overall. In this case the September count should be returned.
A very common approach to filtering outliers is to use the standard deviation. In this case, we will calculate a z-score, which quickly identifies how many standard deviations away from the mean each observation is. We can then filter out observations that are more than 2 standard deviations away. For normally distributed random variables, this should happen only about 5% of the time.
Define a zscore function
import numpy as np

def zscore(s):
    return (s - np.mean(s)) / np.std(s)
Apply it to the count column
zscore(df['count'])
0 0.414005
1 0.488906
2 0.416694
3 0.831981
4 0.256946
5 0.474624
6 0.336390
7 -0.576197
8 -2.643349
Name: count, dtype: float64
Notice that the September observation is 2.6 standard deviations away.
Use abs and gt to identify outliers
zscore(df['count']).abs().gt(2)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
Name: count, dtype: bool
Again, September comes back true.
Tie it all together to filter your original dataframe
df[zscore(df['count']).abs().gt(2)]
filter the other way
df[zscore(df['count']).abs().le(2)]
First of all, the "way below or too high" concept you refer to is known as an outlier and, quoting Wikipedia (not the best source):
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
But on the other side:
In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.
So in my opinion this boils down to the question of whether it is possible to make assumptions about the nature of your data, so that such decisions can be automated.
STRAIGHTFORWARD APPROACH
If you are lucky enough to have a relatively big sample size and your samples aren't correlated, you can apply the central limit theorem, which states that your values will follow a normal distribution (see this for a Python-related explanation).
In this context, you can quickly get the mean value and standard deviation of the given dataset. By applying the corresponding density function (with these two parameters) to each given value, you can calculate its probability of belonging to the "cluster" (see this Stack Overflow post for a possible Python solution).
Then you do have to set a lower bound, since this distribution returns a 0% probability only when a point is infinitely far away from the mean value. But the good thing is that (if the assumptions hold) this bound adapts nicely to each dataset, because of its exponential, normalized nature. This bound is typically expressed in sigma units and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, dedicated to the discovery of the Higgs boson, was granted only after a 5-sigma significance was reached; quoting the link:
High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
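As a minimal sketch of this approach (assuming the data is roughly normal; scipy is used for the tail probability, and the 2-sigma bound is an arbitrary choice):

import numpy as np
from scipy import stats

counts = df["count"].to_numpy()  # the dataframe from the question
mu, sigma = counts.mean(), counts.std()

# Two-sided tail probability of each observation under N(mu, sigma).
z = (counts - mu) / sigma
p = 2 * stats.norm.sf(np.abs(z))

# Flag anything beyond the chosen sigma bound:
print(df[np.abs(z) > 2])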
ALTERNATIVES
If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This is a core feature of most machine learning algorithms, which can adapt nicely to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose that can even fit in a small script (the one I know best is TensorFlow from Google).
In this case I would consider two different approaches, depending again on how your data looks:
Supervised learning: In case you have a training set at your disposal that states which samples belong and which ones don't (known as labeled data), there are algorithms like the support vector machine that, although lightweight, can adapt to highly non-linear boundaries amazingly well.
Unsupervised learning: This is probably what I would try first, when you simply have an unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and can be highly tweaked and customized to also account for correlations in an even infinite number of dimensions, thanks to the kernel trick. To understand the motivations and approach of an ML-based anomaly detector, I would suggest taking a look at Andrew Ng's videos on the matter.
I hope it helps!
Cheers
One way to filter outliers is the interquartile range (IQR, wikipedia), which is the difference between 75% (Q3) and 25% quartile (Q1).
Outliers are defined as values falling below Q1 - k * IQR or above Q3 + k * IQR.
You can select the constant k based on your domain knowledge (a common choice is 1.5).
Given the data, a filter in pandas could look like this:
q1, q3 = df["count"].quantile([0.25, 0.75])
iqr = q3 - q1
lo = q1 - 1.5 * iqr   # lower fence
up = q3 + 1.5 * iqr   # upper fence
df_filtered = df.loc[(df["count"] > lo) & (df["count"] < up), :]

How to evaluate lists with predicted lengths? (`tensorflow.nn.top_k` with array of `k`s from another model)

I am trying to predict medications given to patients. For each medication I have a column in the predictions (through softmax) indicating the probability that the patient will get this medication.
But obviously people can get several meds at once, so I have another model that tries to predict the number of different medications given.
I would like to evaluate them in a single TensorFlow call (I currently have a bunch of slow NumPy hacks), but I can't pass tensorflow.nn.top_k an array of ks (one for each patient, i.e. row), only a fixed integer - which doesn't work because different patients will get different numbers of meds.
Ultimately I'm trying to compute tensorflow.list_diff between the actually prescribed medication indices and the predicted ones, and then maybe the tensorflow.size of it.
tensorflow.list_diff(
    tensorflow.where(                         # get indices of medications
        tensorflow.equal(medication_correct_answers, 1)  # convert 1 to True
    ),
    tensorflow.nn.top_k(                      # get most likely medications
        medication_soft_max,                  # medication model
        tensorflow.argmax(count_soft_max, 1)  # predicted count
    )[1]                                      # second element are the indices
)[:, 0]                                       # get unmatched medication elements
Bonus question: Would it be possible to train a model directly on this instead of two separate cross entropies? It doesn't really look differentiable to me - or do only the underlying softmaxes need to be differentiable?
The length of the predicted list is indeed not differentiable. You need to add an extra softmax output to the model predicting the length of the list, or add many sigmoid outputs predicting which entries should be included.
I wrote a paper about transcribing variable-length text sequences from images, and the appendix goes into a lot of detail with a worked example for how the math works:
http://arxiv.org/abs/1312.6082
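As a minimal sketch of the second suggestion (independent sigmoid outputs instead of a softmax plus a count model), assuming a Keras-style API; all sizes and names here are hypothetical:

import tensorflow as tf

n_meds = 37       # hypothetical number of medications
n_features = 100  # hypothetical input size

inputs = tf.keras.Input(shape=(n_features,))
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
logits = tf.keras.layers.Dense(n_meds)(hidden)  # one logit per medication

model = tf.keras.Model(inputs, logits)
model.compile(
    optimizer="adam",
    # Sigmoid cross entropy treats each medication as its own binary
    # decision, so it is differentiable end to end and no top_k with a
    # per-row k is needed: just threshold each probability at e.g. 0.5.
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)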

building a feature set for scikit learn

I'm using RandomForestClassifier for a probability prediction task. I have a feature set of around 50 features and two possible labels: first team wins and second team wins.
The feature set contains features for both teams. The way I built it, since I know which team won, was to have 50% of the set labeled "first team wins" and 50% labeled "second team wins", with the respective features placed in the corresponding positions in the feature set: for each match in the training data, which initially has the winning team first, I swap the features per team and change the label to "second team wins", using a counter modulo 2.
The problem I see is that if I change the counter to start from 1 instead of 0, it makes a huge change in the final predictions, meaning that the dataset is asymmetrical. To tackle this problem I tried adding every match twice: once in normal order with the label "first team wins", and once reversed with the label "second team wins". The question is: how does this affect the behavior of the model? I see some negative effect after making this change, although not enough to be statistically significant. It does, however, obviously increase the running time for building the feature set and fitting the model.
Would randomizing the label and team order be a more solid approach? What are my options?
Since you're comparing corresponding team features to each other, an alternative would be to reduce:
TeamA: featureA1, featureA2, featureA3 ... featureAN
TeamB: featureB1, featureB2, featureB3 ... featureBN
Output: which team wins
to:
Input: featureA1-featureB1, featureA2-featureB2, featureA3-featureB3, ..., featureAN - featureBN
Output: positive if team A wins, negative if team B wins
and train your classifier on that. The benefit of this approach is that you now have half the number of features to compare, and no longer have to worry about the order of the teams.
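A minimal sketch of that reduction (the arrays and the random forest here are illustrative assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-team feature matrices: one row per match.
team_a = np.random.rand(200, 25)
team_b = np.random.rand(200, 25)
a_wins = np.random.randint(0, 2, size=200)  # 1 if team A won

# Difference features are antisymmetric: swapping the teams flips the
# sign of every feature and the label, so team order no longer matters.
X = team_a - team_b
y = a_wins

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
proba = clf.predict_proba(X)  # P(team A wins) per match

Note that this assumes each feature is directly comparable between teams; features that don't subtract meaningfully would need a different symmetric encoding.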
