My model produces the learning curves shown below. Are these fine? I am a beginner, and everywhere on the internet I read that as the number of training examples increases, the training score should decrease and then converge. But here the training score increases and then converges. Does this indicate a bug in my code or something wrong with my input?
Okay I figured out what was wrong with my code.
train_sizes, train_accuracy, cv_accuracy = lc(linear_model.LogisticRegression(solver='lbfgs', penalty='l2', multi_class='ovr'), trainData, multiclass_response_train, train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5)
I had not set the regularization parameter C for logistic regression, so it used the default (C=1).
But now,
train_sizes, train_accuracy, cv_accuracy = lc(linear_model.LogisticRegression(C=1000, solver='lbfgs', penalty='l2', multi_class='ovr'), trainData, multiclass_response_train, train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5)
The learning curve looks alright.
Can anybody tell me why this is so? That is, why does the training score increase with the default regularization and decrease with weaker regularization (larger C)?
Data details: 10 classes. Images of varying sizes. (Digit Classification - street view digits)
You need to be more precise about your metrics. Which metric is used here?
Loss in general means lower is better, while score usually means higher is better.
This also means that the interpretation of your plot depends on the metric used during training and cross-validation.
Have a look at the related page in the scikit-learn documentation:
http://scikit-learn.org/stable/modules/learning_curve.html
The score is typically some measure that needs to be maximized (ROC AUC, accuracy, ...). Intuitively you could expect that the more training examples the model sees, the better it gets, and hence the higher the score. There are, however, some subtleties regarding overfitting and underfitting that you should keep in mind.
Building off of Alex's answer, it looks like the default regularization parameter makes your model underfit the data a bit, because when you relaxed the regularization you got 'more appropriate' learning curves. It doesn't matter how many examples you throw at a model that underfits.
As for your concern about why the training score increases in the first case rather than decreases: it's probably a consequence of the multiclass data you're using. With fewer training examples you have fewer images of each class (because lc tries to keep the same class distribution in each fold of the cv), so with stronger regularization (if you call C=1 regularization, that is) it may be harder for your model to accurately guess some of the classes.
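To see the effect yourself, here is a rough sketch comparing the two settings. It uses scikit-learn's built-in digits set in place of your street-view digits, so treat it as an illustration under that assumption, not your exact setup:

# Sketch: compare learning curves for the default C=1 and weaker regularization
# (C=1000), using scikit-learn's built-in digits set in place of the real data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve as lc

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, C in zip(axes, (1, 1000)):
    sizes, train_scores, cv_scores = lc(
        linear_model.LogisticRegression(C=C, solver='lbfgs', penalty='l2',
                                        multi_class='ovr', max_iter=1000),
        X, y, train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5)
    ax.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
    ax.plot(sizes, cv_scores.mean(axis=1), 'o-', label='cross-validation score')
    ax.set_title('C = %d' % C)
    ax.set_xlabel('training examples')
    ax.legend()
axes[0].set_ylabel('accuracy')
plt.show()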
After finalizing the architecture of my model, I decided to train it on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results, based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn't in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just by chance that my model happened to perform better on the fixed test data when a part of the training data was excluded (and used for validation)?
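For reference, the two fit() configurations I am comparing look roughly like this; the tiny model and random data below are placeholders, not my actual network:

# Placeholder model and data; only the fit() calls matter here.
import numpy as np
from tensorflow import keras

x_train = np.random.rand(1000, 20)
y_train = keras.utils.to_categorical(np.random.randint(0, 3, 1000), 3)

def build_model():
    model = keras.Sequential([
        keras.layers.Dense(32, activation='relu', input_shape=(20,)),
        keras.layers.Dense(3, activation='softmax')])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Run A: hold out 20% of the training data for validation
model_a = build_model()
model_a.fit(x_train, y_train, epochs=10, validation_split=0.2, verbose=0)

# Run B: train on the full training set, no validation split
model_b = build_model()
model_b.fit(x_train, y_train, epochs=10, validation_split=0.0, verbose=0)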
Well, that's really a very good question, covering a lot of machine-learning concepts, especially the bias-variance tradeoff.
As @CrazyBarzillian hinted in the comments, more data might be leading to overfitting, and yes, we would need more information about your data to come to a firm conclusion. But in a broader way, I would like to explain a few points that might help you understand why it happened.
EXPLANATION
Whenever your data has a larger number of features, your model learns a very complex function to fit it; in short, the model is too complicated for the amount of data you have. This situation, known as high variance, leads to overfitting. You know you are facing a high-variance issue when the training error is much lower than the test error. High-variance problems can be addressed by reducing the number of features (by applying PCA, outlier removal, etc.) or by increasing the number of data points, that is, by adding more data.
Sometimes your data has fewer features, and hence the model learns a very simple function to fit it. This is known as high bias. In this case, adding more data won't help; less data will do the job, or adding more features will help.
MY ASSUMPTION
I guess your model is suffering from high bias if it performs poorly when you add more data. But to check whether the statement "adding more data leads to poorer results" really holds in your case, you can do the following things:
play with some hyperparameters
try other machine learning models
instead of accuracy scores, look at R2 scores or mean absolute error in the case of regression, or F1, precision, and recall in the case of classification
If after doing these things you are still getting the same result, that more data leads to poorer results, then you can be fairly sure of high bias and can either increase the number of features or reduce the data.
SOLUTION
By reducing the data, I mean using less data but better data. Better data means, for example, that if you are doing a classification problem with three classes (A, B, and C), a better dataset is one where the data points are balanced across the three classes. Your data should be balanced. If it is unbalanced, say class A has a large number of samples while classes B and C have only 3-4 samples each, you can apply random sampling techniques to overcome it (a minimal sketch follows the list below).
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
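As promised above, here is a minimal over-sampling sketch with scikit-learn; the DataFrame and the 'label' column name are placeholders for your own data:

# Sketch: balance a DataFrame by randomly over-sampling the smaller classes
# up to the size of the largest class. 'label' is a placeholder column name.
import pandas as pd
from sklearn.utils import resample

def balance_by_oversampling(df, label_col='label', random_state=0):
    counts = df[label_col].value_counts()
    target_size = counts.max()
    parts = []
    for cls, n in counts.items():
        subset = df[df[label_col] == cls]
        if n < target_size:
            subset = resample(subset, replace=True, n_samples=target_size,
                              random_state=random_state)
        parts.append(subset)
    # shuffle so the over-sampled rows are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Example: three classes A, B, C with very different counts
df = pd.DataFrame({'x': range(20),
                   'label': ['A'] * 14 + ['B'] * 3 + ['C'] * 3})
print(balance_by_oversampling(df)['label'].value_counts())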
CONCLUSION
It is a myth that more data always leads to a good model. More than quantity, the quality of the data matters; data should have both quantity and quality. This balancing act between model complexity and the amount (and quality) of data you have is what the bias-variance tradeoff is about.
I am training a GAN, and the accuracy doesn't change over the epochs while the loss is decreasing.
Is there something wrong, or is this normal because it's a GAN?
Thank you in advance.
In order to fully answer this question (specific to your case) we'd need to know what loss function you are using and how you measure accuracy.
In general, this can certainly happen for a variety of reasons. The easiest reason to illustrate is with a simple classifier. Suppose you have a 2-class classification problem (for simplicity) and an input $x$ and label (1, 0), i.e. the label says it belongs to class 1 and not class 2. When you feed $x$ through your network you get an output: $y=(p_1, p_2)$. If $p_1 > p_2$ then the prediction is correct (i.e. you chose the right class). The loss function can continue to go down until $p_1=1$ and $p_2=0$ (the target). So, you can have lots of correct predictions (high accuracy) but still have room to improve the output to better match the labels (room for improved loss).
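As a tiny numeric illustration of this point (a generic two-class output, not tied to your GAN):

# Both outputs predict class 1 correctly (same accuracy), but the second one
# is closer to the target (1, 0), so its cross-entropy loss is lower.
import numpy as np

target = np.array([1.0, 0.0])

def cross_entropy(p, t):
    return -np.sum(t * np.log(p))

early = np.array([0.6, 0.4])   # correct, but not confident
later = np.array([0.9, 0.1])   # still correct, much closer to the target

print(np.argmax(early) == np.argmax(target), cross_entropy(early, target))  # True, ~0.51
print(np.argmax(later) == np.argmax(target), cross_entropy(later, target))  # True, ~0.11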
I'm trying to build a convolution-based model. I trained two different structures, as follows. As you can see, for the single-layer model there isn't any obvious change across epochs. The bi-layer Conv2D model shows improving accuracy and loss on the training set, but the validation curves are a tragedy.
Given that I can't increase my dataset, what should I do to improve the validation behaviour?
I've tried the L1 and L2 regularizers, but they didn't affect my model.
1) You can use an adaptive learning rate (exponential decay or a step schedule may work for you). Furthermore, you can try extremely high learning rates when your model gets stuck in a local minimum.
2) If you are training with images, you can flip, rotate, or otherwise transform them to increase your dataset size, and other augmentation techniques might work for your case (see the sketch at the end of this answer).
3) Try changing the model: deeper, shallower, wider, narrower.
4) If you are doing a classification model, please ensure that you are not using sigmoid as the final activation function unless you are doing binary classification.
5) Always check the state of your dataset before a training session.
Your train-test split may not be suitable for your case.
There might be extreme noises in your data.
Some amount of your data might be corrupted.
Note: I will update this answer whenever a new idea comes to mind. Also, I didn't want to repeat the comments and the other answers; both contain valuable information for your case.
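As mentioned in point 2, here is a small sketch of image augmentation with Keras; the shapes and parameter values are illustrative, not tuned for your data:

# Sketch of simple image augmentation with Keras; shapes and ranges are
# illustrative only.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.rand(100, 64, 64, 1)   # placeholder images
y_train = np.random.randint(0, 2, 100)     # placeholder labels

datagen = ImageDataGenerator(
    rotation_range=15,        # random rotations up to 15 degrees
    width_shift_range=0.1,    # small horizontal shifts
    height_shift_range=0.1,   # small vertical shifts
    horizontal_flip=True,     # random left-right flips
    zoom_range=0.1)

# then train on the augmented stream, e.g.:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50)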
The validation becomes a tragedy because the model is overfitting on the training data. You can try whether any of this works:
1) Batch normalization would be a good option to go with.
2) Try reducing the batch size.
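A rough sketch of both suggestions in Keras; the layer sizes and input shape below are placeholders, not your actual architecture:

# Sketch: BatchNormalization after a Conv2D block, plus a smaller batch size
# in fit(). Layer sizes and input shape are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=16, epochs=50)  # smaller batch size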
I tried a variety of models known to work well on small datasets, but as I suspected, and as is my ultimate verdict - it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like an SVM; matters are exacerbated by having eight separate classes. Your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification I'd say you need anywhere from 50-200x your current data to get reasonable results. Mind also that your validation performance is bound to be much worse on a bigger validation set, which is accounted for in this estimate.
For readers, OP shared his dataset with me; shapes are: X = (1152, 1024, 1), y = (1152, 8)
I have data of dimension (13961, 48) initially; after one-hot encoding and some basic massaging of the data, the dimension is around (13961, 862). The data is imbalanced, with two categories: 'Retained' around 6% and 'Not Retained' around 94%.
When running algorithms such as logistic regression, kNN, decision tree, or random forest, the data gives very high accuracy even without any feature selection, and the accuracy crosses 94% in most cases, except for the naive Bayes classifier.
This seems odd: even picking any two features at random gives accuracy of more than 94%, which does not seem realistic.
Applying SMOTE also gives more than 94% accuracy, even for the baseline model of any of the algorithms mentioned above (logistic regression, kNN, decision tree, random forest).
Even after removing the top 20 features, the accuracy remains above 94% (checked to see how genuine the result is).
import pandas as pd

g = data[Target_col_Y_name]
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))
print('The % distribution between the retention and non-retention flag\n')
print(df)
# The code o/p to show the imbalance is
The % distribution between the retention and non-retention flag

              counts  percentage
Non Retained   13105   93.868634
Retained         856    6.131366
My data has 7 numerical variables, such as month, amount, and interest rate, and all the others (around 855) are one-hot-encoded categorical variables.
Is there any methodology for handling this kind of data in terms of a baseline model, feature selection, or imbalance-optimization techniques? Please advise, keeping in mind the dimensionality and the imbalance count for each level.
I would like to add something to Elias's answer.
Firstly, you have to understand that even if you created a "dumb classifier" that always predicts "not retained", you would still be correct 94% of the time. So accuracy is clearly a weak metric in this case.
You should definitely learn about the confusion matrix and the metrics that come along with it (like AUC).
One of these metrics is the F1 score, which is the harmonic mean of precision and recall. It is better than accuracy in an imbalanced-class setting, but... it doesn't have to be the best. F1 favors classifiers that have similar precision and recall, and that is not necessarily what is important for you.
For instance, if you were building an SFW content filter, you would be OK with labeling some SFW content as NSFW (the negative class), which would increase the false-negative rate (and decrease recall), but you would want to be sure that you kept only safe content (high precision).
In your case you can reason about which is worse, retaining something bad or abandoning something good, and pick the metric accordingly.
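As a small illustration with scikit-learn, here is roughly how the "dumb classifier" above looks under different metrics; the arrays are toy stand-ins for your validation labels and predictions:

# A "dumb" always-majority classifier on data with a 94/6 split still scores
# 94% accuracy, but the other metrics expose how useless it is.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = np.array([0] * 94 + [1] * 6)    # 94% majority class, like your data
y_pred = np.zeros(100, dtype=int)        # always predict the majority class

print(accuracy_score(y_true, y_pred))                    # 0.94
print(confusion_matrix(y_true, y_pred))                  # all positives missed
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0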
As far as strategy is concerned, there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling and up-sampling besides SMOTE or ROSE); check whether your validation score (training metrics alone are almost useless) improves. Just remember to apply sampling/augmentation techniques after the train-validation split.
Moreover, some models have special hyperparameters to focus more on the rare class (for instance, xgboost has the scale_pos_weight parameter). In my experience, tuning this hyperparameter is way more effective than SMOTE.
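A minimal sketch of the scale_pos_weight idea with xgboost; X_train and y_train below are random placeholders with roughly your 94/6 class split:

# Weight the rare positive class by the negative/positive ratio
# (roughly 13105/856 in your data).
import numpy as np
from xgboost import XGBClassifier

X_train = np.random.rand(200, 10)           # placeholder features
y_train = np.array([0] * 188 + [1] * 12)    # ~6% positives

ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = XGBClassifier(scale_pos_weight=ratio)
clf.fit(X_train, y_train)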
Good luck
Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other Stack Overflow answer, which explains when to use the F1 score and when to use AUROC; both are far better measures than accuracy, and in this case F1 is the better choice.
A few points just to clear up:
For models such as random forests, you should not have to remove features to improve the accuracy, as the model will simply regard them as insignificant. I recommend random forests, as they tend to be very accurate (except in some cases) and can show significant features just by using clf.feature_importances_ (if using the scikit-learn random forest).
Decision trees will almost always perform worse than random forests, as a random forest is an aggregation of many decision trees.
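For reference, a short sketch of reading feature importances from a scikit-learn random forest; X and y below are random placeholders for your encoded feature matrix and target:

# Fit a random forest and list features by importance; X and y are random
# placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame(np.random.rand(500, 8),
                 columns=['feat_%d' % i for i in range(8)])
y = np.random.randint(0, 2, 500)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))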
I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best for me to use. At first I tried an exponential of the form f(x)=A+Bexp(-Cx), which is shown in blue. Obviously this doesn't work too well. The exponential dies off way too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of the training shows a very sharp drop off of the error but then transitions to something much more gradual for higher iterations. But maybe someone with experience in neural network training and/or experience in fitting unknown functions might have some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy but with no success. I've varied the number of iterations used in the fit as well (including throwing out the small iteration numbers).
Any thoughts?
I think an exponential fit may work. For your f(x)=A+B*exp(-C*x), if I choose A = 0.005, B = 0.045, and C = 1/250, I get the fit shown below.
It's just a matter of parameter tuning. Still, I am trying to understand your motivation for fitting the learning curve: interpolation methods typically include an 'extrapolation' option that you could use to predict the error after more iterations. If you want to learn the curve precisely, you could also use another neural network with a linear hidden layer and output to 'learn' the curve, though I haven't tried whether that works.
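If you do want to fit the exponential form directly, something like scipy's curve_fit should work; the synthetic 'error' curve below is just a stand-in for your real training-error data:

# Fit f(x) = A + B*exp(-C*x) to the first part of the curve, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def f(x, A, B, C):
    return A + B * np.exp(-C * x)

iterations = np.arange(1, 1001)
# synthetic stand-in for the real training error over the first 1000 iterations
error = f(iterations, 0.005, 0.045, 1 / 250) + 0.001 * np.random.randn(iterations.size)

params, _ = curve_fit(f, iterations, error, p0=(0.01, 0.05, 0.01))
A, B, C = params
print(A, B, C)

# extrapolate out to 3000 iterations with the fitted parameters
extrapolated = f(np.arange(1, 3001), A, B, C)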
Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing it describes for your situation is how to diagnose whether your algorithm suffers from high bias or high variance on your data set, and it offers specific directions to go in for either case.