Tensorflow Beginner, Basic Question On Linear Model

Tensorflow Beginner, Basic Question On Linear Model - python

https://www.tensorflow.org/tutorials/estimator/linear
I am following the Tensorflow documentation to implement a Linear Classifier but I like to use my own data instead of the tutorial set. I just have a few general questions.
My dataset is as follows. It's not a time series.
row[0] - float (changed to binary, 0 = negative, 1 = positive) VALUE TO ESTIMATE
row[1] - string (categorical, changed to vocabulary, ints 1,2,3,4,5,6,7,8,9)
row[2-19] - float (positive and negative)
row[20-60] - ints (percentile ranks, ints 10,20,30,40,50,60,70,80,90)
row[61-95] - ints (binary 1, 0)
I started by using 50k (45k training) rows of data and num_epochs=100, batch_size=256.
{'accuracy': 0.8912, 'accuracy_baseline': 0.8932, 'auc': 0.7101819, 'auc_precision_recall': 0.2830853, 'average_loss': 0.30982444, 'label/mean': 0.1068, 'loss': 0.31013006, 'precision': 0.4537037, 'prediction/mean': 0.11840516, 'recall': 0.0917603, 'global_step': 17600}
Does the column I want to estimate need to be a column of binaries for this model?
Is it a bad idea to mix data types like this? Would it be necessary to normalize the data using something like preprocessing.Normalization ?
Should I alter the epochs/batch if I want to use more data?
The accuracy seems high but the loss also seems quite high, why is that?
Any other suggestions?
Thanks for any help or advice.

Here is the answer to your questions.
By default tf.estimator.LinearClassifier considers as binary classification with n_classes=2, but you can have more than 2 classes as well.
For a linear classification normalizing data won't affect much in terms of accuracy compared to non linear classifier accuracy change after normalizing on the same data.
You can observe the change in accuracy and loss, if it does not change much for about 5-10 epochs, you can restrict the number of epochs there only. Again you can repeat the same step by changing the batch size.
Accuracy and loss are not dependent on each other, consider an example of your case to classify 0 and 1. A model with 2 classes that always predicts 0.51 for the true class would have the same accuracy as one that predicts 0.99. Best model would be with high accuracy and with less loss, if your model is giving good accuracy and high loss that means your model made huge errors on few data.
Try to tune your model hyper parameters based on several observations and to feed quality data with some preprocessing is always best way to reach high accuracy and less loss, with some additional data would be good to have always.

Related

why did i got 2 different losses for sparse_categorical_crossentropy and categorical_crossentropy?

I trained a model for multiclass classification. There were three classes. In the first approach, I trained a model by converting the classes into one-hot vectors and training a model with loss function, categorical crossentropy, I achieved a loss of 0.07 in 1000 epochs. But when I used the same approach, but this time I did not converted the classes into one-hot vectors and used sparse_categorical_crossentropy, as the loss function, this time i achieved a loss of 0.05 in 1000 epochs.. Does this mean that sparse_categorical_crossentropy is better than categorical_crossentropy?
Thank You!

You can't compare two loss functions in term of losses since the definition of loss itself changed. you can compare the performance on the same test dataset.
In general use sparse_categorical_crossentropy when your classes are mutually exclusive (e.g. when each sample belongs exactly to one class) and categorical_crossentropy when one sample can have multiple classes or labels are soft probabilities (like [0.5, 0.3, 0.2]).
You got different losses because the representation of the labels changes, actually in keras the sparse_categorical_crossentropy is defined as categorical crossentropy with integer targets

Using sample and class weights in sklearn

I am trying to run a random forest on a highly unbalanced sample. There are issues both with the sample weights and the class weights. However, when I use the sklearn documentation to include the appropriate weights, I still get highly unbalanced predictions. For example, I have class weights of
{'A': 0.05555555555555555, 'B': 1.0, 'C': 1.0}
This should reweight the data to be about 60% A, 25% B, 15% C. However, when I run the model with weights, I get more or less the same results on the fitted class probabilities. I also tried doing the "balanced" option just to test, and I still got highly skewed results (predicting probabilities close to 1 for every observation of A). And I've tried this with and without the sample weights and with and without the class weights and I get more or less the same results. Am I implementing this incorrectly?
clf=RandomForestClassifier(n_estimators=1000,class_weight=class_weights)
clf=RandomForestClassifier(n_estimators=1000)
clf.fit(x,y,sample_weight=weights)
print("Accuracy: ",metrics.accuracy_score(y, clf.predict(x)))
new_arts = pd.DataFrame(data=clf.predict_proba(full_data_scaled),
columns=clf.classes_,
index=full_data_scaled.index.values)

First thing that you could check, is the actual dimensionality of your classifier in relation to your dataset. You use in both cases 1000 estimators. This can highly overfit if you are using a small dataset.
Second, I am assuming you use the Gini criterion to split. Maybe you can check if the criterion 'entropy' results in the same output.

Improving accuracy on a multi-class image classifiier

I am building a classifier using the Food-101 dataset. The dataset has predefined training and test sets, both labeled. It has a total of 101,000 images. I’m trying to build a classifier model with >=90% accuracy for top-1. I’m currently sitting at 75%. The training set was provided unclean. But now, I would like to know some of the ways I can improve my model and what are some of the things I’m doing wrong.
I’ve partitioned the train and test images into their respective folders. Here, I am using 0.2 of the training dataset to validate the learner by running 5 epochs.
np.random.seed(42)
data = ImageList.from_folder(path).split_by_rand_pct(valid_pct=0.2).label_from_re(pat=file_parse).transform(size=224).databunch()
top_1 = partial(top_k_accuracy, k=1)
learn = cnn_learner(data, models.resnet50, metrics=[accuracy, top_1], callback_fns=ShowGraph)
learn.fit_one_cycle(5)
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.153797 1.710803 0.563498 0.563498 19:26
1 1.677590 1.388702 0.637096 0.637096 18:29
2 1.385577 1.227448 0.678746 0.678746 18:36
3 1.154080 1.141590 0.700924 0.700924 18:34
4 1.003366 1.124750 0.707063 0.707063 18:25
And here, I’m trying to find the learning rate. Pretty standard to how it was in the lectures:
learn.lr_find()
learn.recorder.plot(suggestion=True)
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Min numerical gradient: 1.32E-06
Min loss divided by 10: 6.31E-08
Using the learning rate of 1e-06 to run another 5 epochs. Saving it as stage-2
learn.fit_one_cycle(5, max_lr=slice(1.e-06))
learn.save('stage-2')
epoch train_loss valid_loss accuracy top_k_accuracy time
0 0.940980 1.124032 0.705809 0.705809 18:18
1 0.989123 1.122873 0.706337 0.706337 18:24
2 0.963596 1.121615 0.706733 0.706733 18:38
3 0.975916 1.121084 0.707195 0.707195 18:27
4 0.978523 1.123260 0.706403 0.706403 17:04
Previously I ran 3 stages altogether but the model wasn’t improving beyond 0.706403 so I didn’t want to repeat it. Below is my confusion matrix. I apologize for the terrible resolution. Its the doing of Colab.
Since I’ve created an additional validation set, I decided to use the test set to validate the saved model of stage-2 to see how well it was performing:
path = '/content/food-101/images'
data_test = ImageList.from_folder(path).split_by_folder(train='train', valid='test').label_from_re(file_parse).transform(size=224).databunch()
learn.load('stage-2')
learn.validate(data_test.valid_dl)
This is the result:
[0.87199837, tensor(0.7584), tensor(0.7584)]

Try augmentations like RandomHorizontalFlip, RandomResizedCrop,
RandomRotate, Normalize etc from torchvision transforms. These always help a lot in classification problems.
Label smoothing and/or Mixup precision training.
Simply try using a more optimized architecture, like EfficientNet.
Instead of OneCycle, a longer, more manual training approach may help. Try Stochastic Gradient Descent with a weight decay of 5e-4 and a Nesterov Momentum of 0.9. Use Warmup Training of around 1-3 epochs, and then regular training of around 200 epochs. You could set a manual learning rate schedule or cosine annealing or some other scheme. This entire method will consume a lot more time and effort than the usual onecycle training, and should be explored only if other methods don't show considerable gains.

Which optimization metric for differenced data when doing a multi-step forecast?

I've written an LSTM in Keras for univariate time series forecasting. I'm using an input window of size 48 and an output window of size 12, i.e. I'm predicting 12 steps at once. This is working generally well with an optimization metric such as RMSE.
For non-stationary time series I'm differencing the data before feeding the data to the LSTM. Then after predicting, I take the inverse difference of the predictions.
When differencing, RMSE is not suitable as an optimization metric as the earlier prediction steps are a lot more important than later steps. When we do the inverse difference after creating a 12-step forecast, then the earlier (differenced) prediction steps are going to affect the inverse difference of later steps.
So what I think I need is an optimization metric that gives the early prediction steps more weight, preferably exponentially.
Does such a metric exist already or should I write my own? Am I overlooking something?

Just wrote my own optimization metric, it seems to work well, certainly better than RMSE.
Still curious what's the best practice here. I'm relatively new to forecasting.
from tensorflow.keras import backend as K
def weighted_rmse(y_true, y_pred):
weights = K.arange(start=y_pred.get_shape()[1], stop=0, step=-1, dtype='float32')
y_true_w = y_true * weights
y_pred_w = y_pred * weights
return K.sqrt(K.mean(K.square(y_true_w - y_pred_w), axis=-1))

Set "training=False" of "tf.layers.batch_normalization" when training will get a better validation result

I use TensorFlow to train DNN. I learned that Batch Normalization is very helpful for DNN , so I used it in DNN.
I use "tf.layers.batch_normalization" and follow the instructions of the API document to build the network: when training, set its parameter "training=True", and when validate, set "training=False". And add tf.get_collection(tf.GraphKeys.UPDATE_OPS).
Here is my code:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
input_node_num=257*7
output_node_num=257
tf_X = tf.placeholder(tf.float32,[None,input_node_num])
tf_Y = tf.placeholder(tf.float32,[None,output_node_num])
dropout_rate=tf.placeholder(tf.float32)
flag_training=tf.placeholder(tf.bool)
hid_node_num=2048
h1=tf.contrib.layers.fully_connected(tf_X, hid_node_num, activation_fn=None)
h1_2=tf.nn.relu(tf.layers.batch_normalization(h1,training=flag_training))
h1_3=tf.nn.dropout(h1_2,dropout_rate)
h2=tf.contrib.layers.fully_connected(h1_3, hid_node_num, activation_fn=None)
h2_2=tf.nn.relu(tf.layers.batch_normalization(h2,training=flag_training))
h2_3=tf.nn.dropout(h2_2,dropout_rate)
h3=tf.contrib.layers.fully_connected(h2_3, hid_node_num, activation_fn=None)
h3_2=tf.nn.relu(tf.layers.batch_normalization(h3,training=flag_training))
h3_3=tf.nn.dropout(h3_2,dropout_rate)
tf_Y_pre=tf.contrib.layers.fully_connected(h3_3, output_node_num, activation_fn=None)
loss=tf.reduce_mean(tf.square(tf_Y-tf_Y_pre))
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i1 in range(3000*num_batch):
train_feature=... # Some processing
train_label=... # Some processing
sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1}) # when train , set "training=True" , when validate ,set "training=False" , get a bad result . However when train , set "training=False" ,when validate ,set "training=False" , get a better result .
if((i1+1)%277200==0):# print validate loss every 0.1 epoch
validate_feature=... # Some processing
validate_label=... # Some processing
validate_loss = sess.run(loss,feed_dict={tf_X:validate_feature,tf_Y:validate_label,flag_training:False,dropout_rate:1})
print(validate_loss)
Is there any error in my code ?
if my code is right , I think I get a strange result:
when training, I set "training = True", when validate, set "training = False", the result is not good . I print validate loss every 0.1 epoch , the validate loss in 1st to 3st epoch is
0.929624
0.992692
0.814033
0.858562
1.042705
0.665418
0.753507
0.700503
0.508338
0.761886
0.787044
0.817034
0.726586
0.901634
0.633383
0.783920
0.528140
0.847496
0.804937
0.828761
0.802314
0.855557
0.702335
0.764318
0.776465
0.719034
0.678497
0.596230
0.739280
0.970555
However , when I change the code "sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1})" , that : set "training=False" when training, set "training=False" when validate . The result is good . The validate loss in 1st epoch is
0.474313
0.391002
0.369357
0.366732
0.383477
0.346027
0.336518
0.368153
0.330749
0.322070
0.335551
Why does this result appear ? Is it necessary to set "training=True" when training, set "training=False" when validate ?

TL;DR: Use smaller than the default momentum for the normalization layers like this:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
TS;WM:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance. When training = False, those internal variables also don't get updated. Since they are initialized to mean = 0 and variance = 1 it means that batch normalization is effectively turned off - the layer subtracts zero and divides the result by 1.
So if you train with training = False and evaluate like that, that just means you're training your network without any batch normalization whatsoever. It will still yield reasonable results, because hey, there was life before batch normalization, albeit admittedly not that glamorous...
If you turn on batch normalization with training = True that will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0, the variance at 1 again. But since each update is applied with a weight of ( 1 - momentum ), it will asymptotically reach the actual mean and variance in infinity. For example in 100 steps it will reach about 73.4% of the real value, because 0.99100 is 0.366. If you have numerically large values, the difference can be enormous.
So if you have a relatively small number of batches you processed, then the internally stored mean and variance can still be significantly off by the time you're running the test. Then your network is trained on properly normalized data and is tested on mis-normalized data.
In order to speed up the convergence of the internal batch normalization values, you can apply a smaller momentum, like 0.9:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
(repeat for all batch normalization layers.) Please note that there is a downside to this, however. Random fluctuations in your data will "tug" on your stored mean and variance a lot more with a small momentum like this and the resulting values (later used in inference) can be greatly influenced by where you exactly stop the training, which is clearly not optimal. It is useful to have as large a momentum as possible. Depending on the number of training steps, we generally use 0.9, 0.99, 0.999 for 100, 1,000, 10,000 training steps respectively. No point in going over 0.999.
Another important thing is proper randomization of the training data. If you're training first with let's say the smaller numeric values of your whole data set, then the normalization will converge even slower. Best to completely randomize the order of training data and making sure you use a batch size of at least 14 (rule of thumb.)
Side note: it is known that zero debiasing the values can speed up convergence significantly, and the ExponentialMovingAverage class has this feature. But the batch normalization layers don't have this feature, save for tf.slim's batch_norm, if you're willing to restructure your code for slim.

The reason that you set Training = False improves performance is that Batch normalization has four variables (beta, gamma, mean, variance). It is true that mean and variance don't get updated when Training = False. However, gamma and beta still get updated. So your model has two extra variables and thus has a better performance.
Also, I guess that your model has a relatively good performance without batch normalization.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.