I'm predicting roughly one of 100K possible outputs with an MXNet model, using a fairly standard softmax output. I want to compare the probability assigned to the true label with the top predictions under the model. For the former I'm using the pick operator; for the latter I've tried both the cheap version (the topk operator) and the expensive version (sort/argsort + slice).
In both cases I'm getting contradictory results. Specifically, there are numerous cases where the probability of the true label (retrieved with pick) is significantly higher than the highest probability output (retrieved with topk/sort). I think this means I'm doing something wrong but don't understand what. It does not happen for all predictions, but it does for a significant fraction.
Can anybody give me a hint as to what is going on?
Code follows:
import numpy as np
import mxnet as mx

for batch in data_iter:
    model.forward(batch, is_train=False)
    predictions = model.get_outputs()[0]
    labels = batch.label[0].as_in_context(predictions.context)
    # cheap version:
    # scores = mx.nd.topk(predictions, axis=1, k=6, ret_typ='value')
    # expensive version:
    scores = mx.nd.sort(predictions, axis=1, is_ascend=0)
    scores = mx.nd.slice_axis(scores, axis=1, begin=0, end=6)
    # probability assigned to the true label
    label_score = mx.nd.pick(predictions, labels, axis=1)
    # the true-label probability should never exceed the top sorted score
    equal = label_score.asnumpy() <= scores.asnumpy()[:, 0]
    if not np.all(equal):
        # I think this should never happen, but it does frequently
        pass
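For what it's worth, here is a minimal debugging sketch that could be dropped into that if branch to inspect the offending rows; it just compares the value returned by pick with the probability obtained by indexing the row directly (it reuses the variables from the loop above and is inspection only, not a fix):

# inspect the rows where the sanity check fails
bad = np.where(label_score.asnumpy() > scores.asnumpy()[:, 0])[0]
for i in bad[:5]:
    lbl = int(labels[i].asscalar())
    row = predictions[i].asnumpy()
    print("row", i,
          "label", lbl,
          "pick value", label_score[i].asscalar(),
          "prob[label] by direct indexing", float(row[lbl]),
          "row max", float(row.max()))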
Testing with MXNet 1.1.0, the following code does not reproduce the problem:
import numpy as np
import mxnet as mx
from mxnet import nd

for _ in range(10):
    predictions = nd.random.uniform(shape=(100, 100000))
    labels = nd.array(np.random.randint(0, 99999, size=(100, 1)))
    scores = mx.nd.sort(predictions, axis=1, is_ascend=0)
    scores = mx.nd.slice_axis(scores, axis=1, begin=0, end=6)
    label_score = mx.nd.pick(predictions, labels, axis=1)
    equal = label_score.asnumpy() <= scores.asnumpy()[:, 0]
    if not np.all(equal):
        print("ERROR")
I'm unit acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield coefficients of 0 in this case (right?), which is fine and what we would want -- but for some reason I'm getting some wild results when I try to fit the model on this use case.
I have tried 3 models and get different weird results every time -- or no results in some cases.
For this use case all of the dependent observations are set at 100, all the freq_weights are set at 1, and the independent variables are a binary coded dummy set of 20 features.
In total there are 150 observations.
Again, this data is unlikely in the real world, but I need my code to be able to handle this ugly case. I don't know why I'm getting such erroneous and inconsistent results.
As I understand it, with no variance in the dependent variable I should be getting 0 for all my coefficients (apart from the intercept, which should just equal the constant value of 100).
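For reference, here is a minimal sanity check with synthetic data (not my actual df/df1/freq) sketching the behaviour I expect; my actual code and its results follow below.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 20)).astype(float)  # 20 binary dummy features
y = np.full(150, 100.0)                                # constant dependent variable

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.params)  # expect const ~100 and every slope ~0,
                   # provided the dummies are not collinear with the constant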
import statsmodels.api as sm

freq = freq['Freq']              # frequency weights (all 1 in this test case)
Indies = sm.add_constant(df)     # independent variables plus an intercept
model = sm.OLS(df1, Indies)      # df1 is the constant dependent variable (all 100)
res = model.fit()
res.params
yields:
const 65.990203
x1 17.214836
reg = sm.GLM(df1, Indies, freq_weights=freq)
results = reg.fit(method='lbfgs', max_start_irls=0)
results.params
yields:
const 83.205034
x1 82.575228
reg = sm.GLM(df1, Indies, freq_weights=freq)
result2 = reg.fit()
result2.params
yields:
PerfectSeparationError: Perfect separation detected, results not available
I'm new to Machine Learning and working on a project using Python (3.6), Pandas, NumPy and scikit-learn. I have done the classification and reshaping steps, but during prediction it throws an error saying contamination must be in (0, 0.5].
Here's what I have tried:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, classification_report

# Determine the number of fraud cases in the dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# calculate the fraction of fraud cases
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)
print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))

# Get all the columns from the dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"]]

# store the variables we want to predict on
target = "Class"
X = data.drop(target, axis=1)
Y = data[target]

# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

# define a random state
state = 1

# define the outlier detection methods
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        contamination=outlier_fraction)
}

# fit the models
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()

    # run classification metrics
    print('{}:{}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Here's what it returns:
ValueError: contamination must be in (0, 0.5]
The traceback points to the y_pred = clf.predict(X) line.
I'm new to machine learning and don't have much of an idea about **contamination**, so where did I go wrong?
Help me, please!
Thanks in advance!
ValueError: contamination must be in (0, 0.5]
This means that contamination must be strictly greater than 0.0 and less than or equal to 0.5 (the question "What does this square bracket and parenthesis bracket notation mean [first1,last1)?" is a good reference on the bracket notation). Since, as you have commented, print(outlier_fraction) outputs 0.0, the problem lies in the first 6 lines of the code you posted.
LocalOutlierFactor is an unsupervised outlier detection algorithm, introduced in this paper. Each algorithm has its own parameters, which really change its behavior. You should always study those parameters and their effect before applying the method, or you may be lost in the land of massive parameter options.
In the case of LocalOutlierFactor, the assumption is that your outliers make up no more than half of your dataset. In practice, I'd say that even if outliers make up 30% of your dataset, they're not outliers anymore; they're simply a different type, or class, of data.
On the other hand, you cannot expect an outlier detection algorithm to work if you tell it that you have 0 outliers, which is effectively what happens here if outlier_fraction is 0.
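As a minimal sketch of that point (reusing the variable names from the question, and only as a guard rather than a fix), the estimated contamination can be validated right after it is computed:

# contamination must be strictly greater than 0 and at most 0.5,
# so an empty Fraud slice (len(Fraud) == 0) should be caught here,
# not later inside predict()
outlier_fraction = len(Fraud) / float(len(Valid))
if not 0.0 < outlier_fraction <= 0.5:
    raise ValueError("invalid contamination estimate: {}".format(outlier_fraction))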
I want to apply a filter to a tensor and remove values that do not meet my criteria. For example, let's say I have a tensor that looks like this:
softmax_tensor = [[ 0.05 , 0.05, 0.2, 0.7], [ 0.25 , 0.25, 0.3, 0.2 ]]
Right now, the classifier picks the argmax of the tensors to predict:
predictions = [[3],[2]]
But this isn't exactly what I want because I lose information about the confidence of that prediction. I would rather make no prediction than make an incorrect one. So what I would like to do is return filtered tensors like so:
new_softmax_tensor = [[ 0.05 , 0.05, 0.2, 0.7]]
new_predictions = [[3]]
If this were straight-up python, I'd have no trouble:
new_softmax_tensor = []
new_predictions = []

for idx, listItem in enumerate(softmax_tensor):
    # get the two highest values and see if they are far enough apart
    M = max(listItem)
    M2 = max(n for n in listItem if n != M)
    if M - M2 > 0.3:  # just making up a criterion here
        new_softmax_tensor.append(listItem)
        new_predictions.append(predictions[idx])
but given that TensorFlow works on tensors, I'm not sure how to do this -- and if I did, would it break the computation graph?
A previous SO post suggested using tf.gather_nd, but in that scenario they already had a tensor that they wanted to filter on. I've also looked at tf.cond but still don't understand it. I would imagine many other people would benefit from this exact same solution.
Thanks all.
Two things I would do to solve your problem:
First, I would return the value of the softmax tensor. Get a reference to it somewhere (either keep a reference when you create it, or find it in the appropriate tensor collection), evaluate it with sess.run([softmax_tensor, prediction], feed_dict=...), and then play with it in plain Python as much as you like.
Second, if you want to stay within the graph, I would use the built-in tf.where(), which works much like NumPy's np.where (see the tf.where documentation).
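As a rough sketch of that in-graph route (my own illustration, using a 0.3 gap between the top two scores as the criterion; the answer below uses tf.boolean_mask instead):

import tensorflow as tf

softmax_tensor = tf.constant([[0.05, 0.05, 0.2, 0.7],
                              [0.25, 0.25, 0.3, 0.2]])

top2, _ = tf.nn.top_k(softmax_tensor, k=2)          # two highest scores per row
gap = top2[:, 0] - top2[:, 1]                       # confidence margin
keep_idx = tf.reshape(tf.where(gap > 0.3), [-1])    # row indices that pass

new_softmax = tf.gather(softmax_tensor, keep_idx)
new_predictions = tf.gather(tf.argmax(softmax_tensor, axis=1), keep_idx)

with tf.Session() as sess:
    print(sess.run([new_softmax, new_predictions]))

tf.where gives the indices of the rows that pass the test, and tf.gather then pulls only those rows out of the original tensors.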
Ok. I've got it sorted out now. Here is a working example.
import tensorflow as tf

# Set up a dummy example tensor
original_softmax_tensor = tf.Variable([
    [0.4, 0.2, 0.2, 0.9, 0.1],
    [0.5, 0.2, 0.2, 0.9, 0.1],
    [0.6, 0.2, 0.2, 0.1, 0.99],
    [0.1, 0.8, 0.2, 0.09, 0.99]
], name='original_softmax_tensor')

# Set up a dummy prediction tensor
original_predictions = tf.Variable([3, 3, 4, 4], name='original_predictions')

# Now create a place to store my new variables
new_softmax_tensor = original_softmax_tensor
new_predictions = original_predictions

# set my cutoff variable
min_diff = tf.constant(0.3)

# initialize
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)  # execute init_op
    # There's probably a better way to do this, but I had to do this hack to get
    # the difference between the top 2 scores
    tmp_diff1, _ = tf.nn.top_k(original_softmax_tensor, k=2, sorted=True)
    tmp_diff2, _ = tf.nn.top_k(original_softmax_tensor, k=1, sorted=True)
    # subtracting the max score from both makes the largest one '0'
    actual_diff = tf.subtract(tmp_diff2, tmp_diff1)
    # the max value of each row is then the gap between the top 2 scores
    actual_diff = tf.reduce_max(actual_diff, reduction_indices=[1])
    # Create a boolean tensor that says whether to keep each row
    cond_result = actual_diff > min_diff
    # Keep only the rows I want
    new_predictions = tf.boolean_mask(original_predictions, cond_result)
    new_softmax_tensor = tf.boolean_mask(new_softmax_tensor, cond_result)
    new_predictions.eval()
    new_softmax_tensor.eval()
    # return these if this is in a function
I am trying to do a mean operation given the actual lengths of the sequences (masking the zero vectors).
My input sequence_outputs is of shape (batch_size, max_len, dimensions).
I have a tensor that stores the actual length of each sequence in the batch. I'm using the function from https://danijar.com/variable-sequence-lengths-in-tensorflow/:
def length(sequence):
    used = tf.sign(tf.reduce_max(tf.abs(sequence), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int64)
    return length
I do this:
lengths = length(sequence_outputs)
lengths = tf.cast(lengths, tf.float32)
lengths = tf.expand_dims(lengths, 1)
sequence_outputs = tf.reduce_sum(sequence_outputs, 1) / lengths
The graph compiles, but I am getting NaN loss values. Furthermore, my lengths become negative when debugging with eval().
This seems like a simple problem, but I've been stuck on it for some time and would appreciate some help!
Thanks!
I see no issue. Your code is slightly over-complicated. The following code
import numpy as np
import tensorflow as tf

# creating data
B = 15
MAX_LEN = 4
data = np.zeros([B, MAX_LEN], dtype=np.float32)

for b in range(B):
    current_len = np.random.randint(2, MAX_LEN)
    current_vector = np.concatenate([np.random.randn(current_len),
                                     np.zeros(MAX_LEN - current_len)], axis=-1)
    print("{}\t\t{}".format(current_vector, current_vector.shape))
    data[b, ...] = current_vector

data_op = tf.convert_to_tensor(data)

def tf_length(x):
    assert len(x.get_shape().as_list()) == 2
    # float32 (rather than the int64 default) and no keepdims,
    # so the division below is an element-wise per-row division
    length = tf.count_nonzero(x, axis=1, dtype=tf.float32)
    return length

x = tf.reduce_sum(data_op, axis=1) / tf_length(data_op)

# test gradients
grads = tf.gradients(tf.reduce_mean(x), [data_op])

with tf.Session() as sess:
    print(sess.run(grads))
runs perfectly fine here without any NaNs. Are you sure you are really running this code? If I had to guess, I would bet you forgot the tf.abs somewhere in your sequence-length computation.
Be aware: your length function, as well as tf_length in this post, assumes that the real (non-padding) entries in the sequence are non-zero! Calculating the sequence length should really be the job of the data producer, with the lengths fed into the computation graph; everything else I consider a hacky solution.
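For reference, a minimal sketch of that suggestion (the placeholder names and the fixed feature dimension of 128 are my own assumptions, not from the original post): feed the true lengths from the data pipeline and mask the padded timesteps explicitly.

import tensorflow as tf

# (batch, max_len, dim) padded sequence outputs and their true lengths,
# both fed from the data producer
sequence_outputs = tf.placeholder(tf.float32, [None, None, 128])
seq_lengths = tf.placeholder(tf.int32, [None])

# 1.0 for real timesteps, 0.0 for padding
mask = tf.sequence_mask(seq_lengths,
                        maxlen=tf.shape(sequence_outputs)[1],
                        dtype=tf.float32)                      # (batch, max_len)

# sum over the real timesteps, then divide by the true length
masked_sum = tf.reduce_sum(sequence_outputs * tf.expand_dims(mask, -1), axis=1)
mean_outputs = masked_sum / tf.expand_dims(tf.cast(seq_lengths, tf.float32), -1)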
I am running a feature reduction (from 500 to around 30) for a random forest classifier. I can reduce the number of features, but I want to see which features are left at every point in the reduction. As you can see below, I have made an attempt, but it does not work.
X does not contain the column names. Ideally it would be possible to keep the column names in X but fit only on the rows; then printing X would show which features remain, I think.
I am sure there is a much better way though...
Anybody know how to do this?
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# read the list of feature names, one per line
FEATURES = []
readThisFile = r'C:\ManyFeatures.txt'
featuresFile = open(readThisFile)
AllFeatures = featuresFile.read()
FEATURES = AllFeatures.split('\n')
featuresFile.close()

# load the data
Location = r'C:\MASSIVE.xlsx'
data = pd.read_excel(Location)

X = np.array(data[FEATURES])
y = data['_MiniTARGET'].values

for x in range(533, 10, -100):
    X = SelectKBest(f_classif, k=x).fit_transform(X, y)
    # U = pd.DataFrame(X)
    # print(U.feature_importances_)
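One hedged sketch of how the surviving column names could be tracked (reusing the names from the snippet above, with the same countdown of k values): keep X as a DataFrame and read the selector's get_support() mask at every step.

from sklearn.feature_selection import SelectKBest, f_classif

X_df = data[FEATURES]              # keep the column names by staying in pandas
y = data['_MiniTARGET'].values

for k in range(533, 10, -100):
    k = min(k, X_df.shape[1])      # SelectKBest requires k <= number of features
    selector = SelectKBest(f_classif, k=k).fit(X_df, y)
    kept = X_df.columns[selector.get_support()]
    print(k, list(kept))           # the features still alive at this step
    X_df = X_df[kept]              # carry only the surviving columns forward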