LibSVM model interpretation - python

I need to classify some values between two classes.
I have about 30 values that I can use as a training set and each value has 10 different dimensions.
I am using libSVM (in Python) and it seems that it works quite well.
I am trying also to give an interpretation to the model computed by libSVM, because I think that some dimensions are more "important" than others in the classification process.
For instance, consider the following example:
y, x = [1,1,1,-1,-1,-1],[[1,-1],[1,0],[1,1],[-1,-1],[-1,0],[-1,1]]
prob = svm_problem(y, x)
param = svm_parameter()
param.kernel_type = LINEAR
param.C = 10
m = svm_train(prob, param)
svm_save_model('model_file', m)
It is clear that the second dimension of the elements of x is useless for classifying this data set.
My question is:
is there any systematic way to detect this kind of situation by analyzing the model generated by libSVM?

A little bit late, but:
It is your responsibility to check whether a feature is important or not – you have to choose your features manually so that they meet your application's requirements. The SVM tries to get the best result with the features you put in – it wouldn't make much sense for it to ignore given data just because that would make the choice clearer (but maybe more wrong).
Only you can know which features are good and which are not. You have to find them by hand/brain.
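For what it's worth, with a linear kernel the decision function is w·x + b, and w can be reconstructed from the support vectors and their dual coefficients; a dimension whose weight is near zero contributes little to the decision. A rough sketch, assuming the libsvm Python bindings that expose get_sv_coef() and get_SV(), applied to the trained binary model m from the question:
import numpy as np

def linear_svm_weights(model, dim):
    # Reconstruct w = sum_i (alpha_i * y_i) * SV_i for a binary *linear* model.
    coefs = model.get_sv_coef()    # dual coefficients, one tuple per support vector
    svs = model.get_SV()           # support vectors as {index: value} dicts
    w = np.zeros(dim)
    for coef, sv in zip(coefs, svs):
        for idx, value in sv.items():
            if idx > 0:            # libsvm keeps a -1 terminator index in each dict
                w[idx - 1] += coef[0] * value
    return w

print(linear_svm_weights(m, dim=2))  # a near-zero weight suggests that dimension barely matters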

Related

Should you shuffle the input for a word2vec model before or after assigning negative context pairs for each target word?

I'm working on a word2vec with negative sampling implementation using python and tensorflow+keras. The initial input for the script is a list of positive target-word pairs, which is processed via looping through them and assigning a number of negative examples to each, then the positive + k negative samples are appended in the corresponding order to a new list. That list is later (after a few adjustments) passed to a keras model.fit():
model.fit([data[:, 0], data[:, 1]], data[:, 2],
          batch_size=numof_positives * (numof_negatives + 1))
I looked through some examples, and from what I understand, the batches passed to the neural network should contain the negative context words of those positives that are present in the batch, meaning that shuffling of the data should take place before assigning the negatives. On the other hand, I did not realize that keras' model.fit() has its shuffle argument on True by default, so first it was run with the data being shuffled after the assignment as well. Now that I've added shuffle=False, it seems like it affected the quality of the resulting embedding vectors negatively. Can that be the case? Where should the input be shuffled? What are the implications of passing completely randomly ordered data vs ordered batches?
I may have a few trust issues with the shuffle argument of keras' model.fit(), after experiencing this bug regarding shuffle='batch' first hand.
The value of shuffling with respect to word2vec training (that I'm familiar with) is to avoid cases where all examples of a word, or all similar senses, are clumped together in one range of the training data. (You don't want the model to get really good at those examples where a word/sense is overrepresented, only to have that overly-specialized performance lost over a long run of samples where those same words/senses are completely unrepresented.)
That's something you can achieve with a corpus-shuffle before any batch-shuffle – which might separately bring its own benefits, for similar reasons, by achieving better interleaving of contrasting microexamples.
The word2vec implementations I know tend to do the backprop updates for a positive example closely alongside the corresponding N synthetic negative examples from the same context – that is, they don't apply any extra shuffling that would move them further from each other.
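As a rough illustration of that ordering (a sketch only – positive_pairs, sample_negative and k are made-up stand-ins, not the asker's code): shuffle at the level of positive pairs first, then keep each positive adjacent to its own negatives, and let the batch size cover whole positive-plus-negatives groups.
import random

# Hypothetical inputs: positive (target, context) pairs and a stand-in negative sampler.
positive_pairs = [("cat", "sat"), ("sat", "mat"), ("dog", "ran")]
k = 2  # negatives per positive

def sample_negative(target):
    # stand-in for real frequency-based negative sampling
    return random.choice(["the", "a", "of", "xylophone"])

random.shuffle(positive_pairs)              # pair-level shuffle happens before assigning negatives

data = []
for target, context in positive_pairs:
    data.append((target, context, 1))       # the positive example, label 1
    for _ in range(k):
        data.append((target, sample_negative(target), 0))  # its k negatives stay adjacent

# Later: model.fit(..., batch_size=n_positives_per_batch * (k + 1), shuffle=False)
# so each batch keeps every positive together with its own negatives.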
But it's not impossible that further shuffling could help! So I wouldn't put any theoretical opinion about what "should" happen ahead of empirical observations of the tradeoffs seen in real runs. (Do what works best!)
(Alternatively, if the actual goal is perfectly-reproducing some other implementation's choices, then you'd mainly want to mimic its actual code, and verify by comparing quantitative results on same-data/same-evaluation tasks.)

very different values from normed_vector cosine similarity and most_similar

I have a certain Doc2Vec model built on website data. I am trying to use the embeddings to find websites that are most similar to each other. To do so, I am computing cosine similarities over the matrix of document vectors. I am also comparing this to the output of most_similar().
The problem is that they provide substantively different matches (not just slightly different ones).
To make this concrete, for a firm at index value 791, with its text stored in text, I compare
text = self.website_info.iloc[791].text
tokens = text.split()
vec = self.word2vec_model.infer_vector(tokens,negative=0)
most_similar = self.word2vec_model.docvecs.most_similar([vec])
to
self.word2vec_model.init_sims()
mat = self.word2vec_model.docvecs.get_normed_vectors()
w2v_sim = np.dot(mat, mat.T)
sims = pd.DataFrame(pd.Series(w2v_sim[791]))
sims.rename(columns={0:'sim'}, inplace = True)
sims.sort_values(by='sim',ascending=False,inplace=True)
most_similar = sims.head(20)
I also see that the real and inferred embedding vectors are substantively different – not just in normalization or magnitude, but with big differences in the signs of the components.
There's a bunch in your code that doesn't quite make sense.
If you're using Gensim-4.0 or higher – where .get_normed_vectors() exists – there's never a need to call .init_sims(). (In fact, it should be showing a deprecation warning.)
Only a Doc2Vec model will support .infer_vector() – so it's odd to name your variable word2vec_model.
The .infer_vector() method doesn't take a negative=0 argument - so that code would generate an error in the Gensim library that you otherwise appear to be using. (And, if it did somehow take a negative argument changing the inference to use 0 negative-examples, that'd break inference - which should use the same negative value as during model training.)
I'm also not sure about your alternate calculation – in particular, driving it through Pandas instead of native Numpy operations seems unnecessary, and an all-to-all comparison would often be very expensive in any model with a sufficiently large number of documents.
But also: the inferred vector will essentially never be identical to the vector for the same text during training. It'll just be 'close', if everything about the model is working well, with sufficient training data & parameters (especially enough epochs & not too many vector_size dimensions). (See this FAQ item for more details on how there's an inherent 'jitter' between runs/inferences.)
So I'd suggest:
1st, check how similar the inferred vector for your item #791 is to the vector created by bulk training. (You could compare them directly to each other, or compare the list of top-N .most_similar() items for each.) If they're very different, there may be other problems with the model training (data/parameters) that make the model underpowered. (In some cases, more epochs or fewer vector_size dimensions can help a little to make a model more consistent run-to-run, but if your data is thin there will be limits to how well Doc2Vec can work.)
2nd, check that your alternate calculation of the nearest neighbors exactly matches what's returned by .most_similar() when using the exact same (not inferred) origin vector; see the sketch after this list. If it doesn't, that's a separate issue from any looseness/variance between the vector from bulk training and the one from later re-inference.
3rd, try to evaluate the actual quality of the .most_similar() results – either by ad hoc eyeballing, or some sort of rigorous domain-expert gold standard of which docs 'should' be judged alike. The calculation done by the .most_similar() method is a typical approach, and usually what people want – so knowing whether it's helpful for your data/model/goals may be more interesting than whether you can match it with a separate external calculation.
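A rough sketch of the cross-check in point 2nd (assumes Gensim 4.x, an already-trained Doc2Vec model named model, and documents tagged with plain integers, as the question's indexing suggests):
import numpy as np

doc_vec = model.dv[791]                      # the bulk-trained vector, not a re-inferred one

# Ranking computed by the library itself, seeded with that exact vector
top_gensim = model.dv.most_similar([doc_vec], topn=20)

# Manual ranking from unit-normed vectors
mat = model.dv.get_normed_vectors()
sims = mat @ (doc_vec / np.linalg.norm(doc_vec))
top_manual = np.argsort(-sims)[:21]          # both lists will include doc 791 itself at similarity ~1.0

print(top_gensim)
print(top_manual)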
If you're still having problems, be sure in any followup comments, question edits, or new questions to say a bit more about:
the size of your training set, in documents/words-per-doc/unique-vocabulary;
the model-parameters you've chosen; and…
the code process you used to train the model (to be sure it doesn't mimic some serious errors common in poor-quality online guides).
Those can help determine if something else more foundational is wrong/weak with your model.

Question regarding DecisionTreeClassifier

I am making an explainable model from past data, and I am not going to use it for future prediction at all.
In the data, there are a hundred X variables and one binary Y class, and I am trying to explain how the Xs affect the binary Y (0 or 1).
I came up with the DecisionTree classifier, as it clearly shows us how decisions are made by the value criterion of each variable.
Here are my questions:
Is it necessary to split the X data into X_train and X_test even though I am not going to predict with this model? (I do not want to waste data on the test set since I am only interpreting.)
After I split the data and train the model, only a few variables get non-zero feature importance values (like 3 out of 100 X variables) and the rest are zero. Therefore, there are only a few branches. I do not know why this happens.
If here is not the right place to ask such question, please let me know.
Thanks.
No, it is not necessary, but it is a way to check whether your decision tree is overfitting and just remembering the input values and classes, or actually learning the pattern behind them. I would suggest you look into cross-validation, since it doesn't 'waste' any data and trains and tests on all of it (a short sketch is given after the example below). If you need me to explain this further, leave a comment.
Getting any particular number of important features is not an issue, since it depends entirely on your data.
Example:
Let's say I want to make a model to tell if a number will be divisible by 69 (my Y class).
I have my X variables as divisibility by 2,3,5,7,9,13,17,19 and 23.
If I train the model correctly, I will get feature importance of only 3 and 23 as very high and everything else should have very low feature importance.
Consequently, my decision tree (or trees, if using ensemble models like Random Forest / XGBoost) will have fewer splits.
So, having only a few important features is normal and does not cause any problems.
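For completeness, a minimal sketch of that cross-validation idea with scikit-learn (the synthetic 100-feature dataset below is just a stand-in for your data):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 100 features, only a handful actually informative.
X, y = make_classification(n_samples=500, n_features=100, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)

# Cross-validation trains and tests on all the data in folds, so nothing is "wasted".
print(cross_val_score(tree, X, y, cv=5).mean())

# For interpretation, fit on the full data and see which few features carry the importance.
tree.fit(X, y)
print(sorted(enumerate(tree.feature_importances_), key=lambda p: -p[1])[:5])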
No, it isn't. However, I would still do a train-test split and measure performance separately. While an explainable model is nice, it is significantly less nice if it's a crap model. I'd make sure it had at least reasonable performance before considering interpretation, at which point the splitting is unnecessary.
The number of important features is data-dependent. Random forests do a good job providing this as well. In any case, fewer branches is better. You want a simpler tree, which is easier to explain.

Subsampling an unbalanced dataset in tensorflow

Tensorflow beginner here. This is my first project and I am working with pre-defined estimators.
I have an extremely unbalanced dataset where positive outcomes represent roughly 0.1% of the total data, and I suspect this imbalance considerably affects the performance of my model. As a first attempt to solve the issue, since I have tons of data, I would like to throw away most of my negatives in order to create a balanced dataset. I can see two ways of doing it: preprocessing the data to keep only a thousandth of the negatives and saving it to a new file before passing it to tensorflow (for example with pyspark), or asking tensorflow to use only one negative out of every thousand it finds.
I tried to code this last idea but didn't manage to. I modified my input function to look like this:
def train_input_fn(data_file="../data/train_input.csv", shuffle_size=100_000, batch_size=128):
    """Generate an input function for the Estimator."""
    dataset = tf.data.TextLineDataset(data_file)  # Extract lines from input files using the Dataset API.
    dataset = dataset.map(parse_csv, num_parallel_calls=3)
    dataset = dataset.shuffle(shuffle_size).repeat().batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    # TRY TO IMPLEMENT THE SELECTION OF NEGATIVES
    thrown = 0
    flag = np.random.randint(1000)
    while labels == 0 and flag != 0:
        features, labels = iterator.get_next()
        thrown += 1
        flag = np.random.randint(1000)
    print("I've thrown away {} negative examples before going for label {}!".format(thrown, labels))
    return features, labels
This, of course, doesn't work because iterators don't know what's inside them, so the labels==0 condition is never satisfied. Also, there is only one print in the stdout, meaning that this function is only called once (and meaning that I still don't understand how tensorflow really works). Anyways, is there a way to implement what I want?
PS: I suspect that the previous code, even if it worked as intended, would return less than a thousandth of the initial negatives due to the count restarting every time it finds a positive. This is a minor issue, and so far I could even find a magic number inside the flag that gives me the expected result without worrying too much about the mathematical beauty of it.
You will probably get better results by oversampling your under-represented class rather than throwing away data in your over-represented class. This way you keep the variance in the over-represented class. You might as well use the data you have.
The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
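A rough sketch of that two-dataset idea, assuming the parsed dataset from the question yields (features, label) pairs with a scalar 0/1 label (the filter predicates and the 50/50 weights are illustrative):
import tensorflow as tf

# Split the parsed dataset by class (assumes `dataset` has already gone through parse_csv).
pos_ds = dataset.filter(lambda features, label: tf.equal(label, 1)).repeat()
neg_ds = dataset.filter(lambda features, label: tf.equal(label, 0)).repeat()

# Draw from each class with equal probability, then shuffle and batch as usual.
balanced = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
balanced = balanced.shuffle(100_000).batch(128)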
Oversampling can easily be achieved with the following code:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.7, 0.3])
Tensorflow has a good guide on dealing with unbalanced data; you can find more ideas here:
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversampling

Limit neural network output to subset of trained classes

Is it possible to pass a vector to a trained neural network so it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers). Then I pass it a vector telling it not to guess any lowercase letters. Since the classes are exclusive, the network ends in a softmax function. The following are just examples of what I'd thought of trying, but none really work.
import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

# Stand-ins for previous layer/NN output and vector of allowed answers.
output = np.array([ 0.15885351,  0.94527385,  0.33977026, -0.27237907,  0.32012873,
                    0.44839673, -0.52375875, -0.99423903, -0.06391236,  0.82529586])
restrictions = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])

# Ideas -----

'''First: Multiply restricted before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output*restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results*restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output*restrictions
result[result != 0] = softmax(result[result != 0])
All of these have issues. The first one causes invalid choices to default to:
1/np.exp(arr).sum()
Since inputs to the softmax can be negative, this can raise the probability given to an invalid choice and make the answer worse. (I should've looked into it before trying it.)
The second and third both have similar issues in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with .80 probability, but then this option is removed, it seems the remaining options will redistribute and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous.
An example of what I'm trying to say:
output
Out[75]: array([ 5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
softmax(output)
Out[76]: array([ 0.70454877, 0.14516581, 0.13660832, 0.00894051, 0.00473658])
softmax(output[1:])
Out[77]: array([ 0.49133596, 0.46237183, 0.03026052, 0.01603169])
(Arrays were ordered to make it easier.)
In the original output the softmax gives .70 probability that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution now assigns each of the 4 remaining options under 50% probability, which could easily be ignored as too low to use.
I've considered passing a vector into the network earlier as another input but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase time required to train.
EDIT: I was writing way too much in the comments so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot-encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories, the network relied heavily on it and it interfered with learning features from the image. When the key had a lot of enabled categories, it seemingly ignored the key completely. This could have been a problem that needed optimizing, an issue with my network architecture, or something that just needed a tweak to training, but I never got around to the solution.
I did find out that removing answers and zeroing them were almost the same when I eventually subtracted np.inf instead of multiplying by 0. I was aware of ensembles, but as mentioned in a comment on the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood you correctly), though I may look into it later.
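For reference, a minimal sketch of that np.inf masking (reusing the output and restrictions arrays from the snippet above): setting disallowed logits to -inf makes them contribute exactly zero after the softmax, which matches dropping them and renormalizing.
import numpy as np

masked = np.where(restrictions == 1, output, -np.inf)  # disallowed logits become -inf
probs = np.exp(masked - masked.max())                  # subtract the max for numerical stability
probs /= probs.sum()
# probs now equals a softmax over only the allowed entries, with exact zeros elsewhere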
First of all, I will loosely go through available options you have listed and add some viable alternatives with the pros and cons. It's kinda hard to structure this answer but I hope you'll get what I'm trying to put out:
1. Multiply restricted before sending it through softmax.
Obviously this may give a higher chance to the zeroed-out entries, as you have written, so it seems like a flawed approach from the start.
Alternative: replace the impossible values with the smallest logit value. This is similar to softmax(output[1:]), though the network will be even more uncertain about the results. Example PyTorch implementation:
import torch
logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits))
which yields:
tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
Discussion
Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed the redistribution how assigns the 4 remaining options with under 50% probability which could easily be ignored as too low to use."
Yes, and you would be right to do so. Even more so, the actual probability for this class is far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering some part of your computations pointless. This points to another problem, stated in the bounty this time:
2. NN are known to be overconfident for classification problems
I can imagine this being solved in multiple ways:
2.1 Ensemble
Create multiple neural networks and ensemble them by summing the logits and taking the argmax at the end (or applying softmax and then argmax). Hypothetical situation with 3 different models with different predictions:
import torch
predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])
combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits))
This would give us the following probabilities after softmax:
[0.11291057 0.7576356 0.1293983 0.00005554 0.]
(notice the first class is now the most probable)
You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classifying decision surface smoother and fixes mutual errors between classifiers (given their predictions vary quite a lot). It would take many posts to describe in any greater detail (or separate question with specific problem would be needed), here or here are some which might get you started.
Still I would not mix this approach with manual selection of outputs.
2.2 Transform the problem into binary
This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.
Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells you whether the network thinks this example belongs to its class or not.
If you are sure a certain class won't be the outcome, you simply do not run the network responsible for detecting that class.
After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).
An additional benefit is the simplicity of those networks (easier training and fine-tuning) and easy switch-like behavior if needed.
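As a toy illustration of this one-logit-per-class setup (a sketch only – the tiny Linear "networks", feature size, and allowed list are made-up stand-ins):
import torch

n_classes, n_features = 10, 32
# Hypothetical stand-ins: one tiny "network" per class, each emitting a single logit.
models = [torch.nn.Linear(n_features, 1) for _ in range(n_classes)]

x = torch.randn(1, n_features)               # one example
allowed = [0, 1, 4, 5, 6, 8, 9]              # externally-known subset of possible classes

with torch.no_grad():
    logits = {k: models[k](x).item() for k in allowed}  # only run the networks for allowed classes
prediction = max(logits, key=logits.get)
print(prediction)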
Conclusions
If I were you I would go with the approach outlined in 2.2, as you could easily save yourself some inference time and it would allow you to "choose outputs" in a sensible manner.
If this approach is not enough, you may consider N ensembles of networks, i.e. a mix of 2.2 and 2.1, with bootstrapping or other ensembling techniques. This should improve your accuracy as well.
First ask yourself: what is the benefit of excluding certain outputs based on external data? In your post, I don't see exactly why you want to exclude them.
Excluding them won't save computation, as one connection (or one neuron) affects multiple outputs: you can't disable connections/neurons.
Is it really necessary to exclude certain classes? If your network is trained well enough, it will know whether it's a capital or not.
So my answer: I don't think you should fiddle with any operation before the softmax. This will give you false conclusions. So you have the following options:
Multiply the results of the softmax by the restrictions.
Don't multiply; if the highest class is 'a', convert it to 'A' as the output (map lowercase outputs to their uppercase counterparts) – see the sketch below.
Train a network that sees no difference between capital and non-capital letters
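As a small illustration of the second option (a sketch only – classes and predicted_index are hypothetical stand-ins for the model's label set and its argmax output):
# Post-process the prediction instead of touching the softmax.
classes = ['A', 'a', 'B', 'b', '1', '2']  # hypothetical label set
predicted_index = 1                       # suppose the network picked 'a'
label = classes[predicted_index]
if label.islower():                       # lowercase is known to be impossible here
    label = label.upper()
print(label)                              # prints 'A'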
