Wrong ML prediction on Japanese Crossword puzzle - Python

I’m trying to study machine learning in a hands-on way. The exercise I set myself is to create a neural network that solves “Japanese crosswords” (nonograms) for fixed-size images (128×128).
A very simple example (4×4) demonstrates the concept: a black-and-white picture is encoded by a top and a left matrix. Each number in a matrix is the length of a contiguous run of black cells. It is easy to prove that the left and top matrices have dimensions of at most N×(N/2) and (N/2)×N respectively.
I have a Python generator that creates random black-and-white images and the two reduced matrices. The top and left matrices are fed as input (the left one is transposed to match the top) and the black-and-white image is the expected output. The input is treated as 3-dimensional (128 × 64 × 2), where the last dimension indexes the top and left matrices respectively.
Below is my current topology, which tries to learn the function (128 × 64 × 2) -> (128 × 128 × 1):
Model: "model"
Layer (type) Output Shape Param #
interlaced_reduce (InputLaye [(None, 128, 64, 2)] 0
small_conv (Conv2D) (None, 128, 64, 32) 288
leaky_re_lu (LeakyReLU) (None, 128, 64, 32) 0
medium_conv (Conv2D) (None, 128, 64, 64) 8256
leaky_re_lu_1 (LeakyReLU) (None, 128, 64, 64) 0
large_conv (Conv2D) (None, 128, 64, 128) 32896
leaky_re_lu_2 (LeakyReLU) (None, 128, 64, 128) 0
up_sampling2d (UpSampling2D) (None, 128, 128, 128) 0
dropout (Dropout) (None, 128, 128, 128) 0
dense (Dense) (None, 128, 128, 1) 129
Total params: 41,569
Trainable params: 41,569
Non-trainable params: 0
After training on 50 images I got the following statistics (note: I tried normalizing the input matrices to [0, 1] without any success; the statistics below are for the non-normalized case):
...
Epoch 50/50 2/2 [==============================] - 1s 687ms/step - loss: 18427.2871 - mae: 124.9277
The prediction then produces the following:
On the left is the expected (randomly generated) image, and on the right the result of the prediction. For the prediction I intentionally use a grey-scale image to see how close the result is to the target. As you can see, the prediction is far from the expected output and is close to the source form of the top/left reduced matrices.
So my questions:
1) What layers am I missing?
2) What should be improved in the existing topology?
P.S. This is a cross-post from Cross Validated Stack Exchange, because nobody even viewed the question on that site.

It's hard to say what model would work best without training and testing the actual model, but based on the results you've gotten so far, here are a few options you could try.
Try adding a fully connected hidden layer
From the model you posted, it seems that you have a few convolution layers, followed by an up-sampling and a dropout layer, and finally a single dense layer for your output nodes. Adding one or more additional dense layers (with e.g. 128 nodes, more or fewer) before your final output layer might help. While the multiple convolution layers help the neural net build up a sort of hierarchical understanding of the image, the hypothesis class might not be complex enough; extra dense layers might address this.
Try using a multilayer perceptron
Convolution layers are often used to process images because they help build up a hierarchical understanding of the image that is somewhat scale/shift/rotation invariant. However, considering the problem that you're solving, a global understanding of the input might be more beneficial than identifying shift-invariant features.
As such, one possible option would be to remove the convolution layers and to use a multilayer perceptron (MLP).
Think of the input as two matrices of numbers, and of the output as a matrix of 1s and 0s corresponding to 'black' and 'white'. You could then try a model with the following layers:
A Flatten layer that takes in your two reduced matrices as inputs and flattens them
A hidden dense layer, maybe with something like 128 nodes and ReLU activation. You should experiment with the number of layers, nodes, and activations.
An output dense layer with 16384 (128x128) nodes. Applying a sigmoid activation to this layer keeps each output in [0, 1] and can help the optimiser during training (a softmax would force all 16384 outputs to sum to 1, which is not what you want for independent pixels). Then, when creating your final image, set values < 0.5 to 0 and values >= 0.5 to 1, and reshape the vector into a square image.
Of course, there are no guarantees that an MLP will work well, but it often does, especially when given sufficient data (perhaps thousands of training examples or more).
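As a concrete starting point, the MLP described above could be sketched in Keras roughly like this (layer sizes are illustrative placeholders, not tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 64, 2)),        # the two stacked reduced matrices
    layers.Flatten(),                        # -> 16384 values
    layers.Dense(128, activation='relu'),    # hidden layer; experiment with size/count
    layers.Dense(128 * 128, activation='sigmoid'),  # one unit per output pixel
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```

After predicting, threshold the outputs at 0.5 and reshape to (128, 128) to recover the image.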
Try using a deterministic algorithm
Looking at the structure of this problem, it seems that it could be solved more appropriately with a deterministic algorithm, which falls more under the branch of traditional artificial intelligence than deep learning. This is another potential route to explore.

The model you built is a convolutional model (seen by the use of Conv2D). These layers are good at analyzing a pixel given its neighbors, which makes them very powerful for image classification or segmentation.
In your case, however, the value of a pixel depends on its whole row and column.
Neural networks seem ill-suited to your problem, but if you want to continue, look into replacing the conv layers with Conv(1xN) and Conv(Nx1) kernels, so that each position sees its entire row and column. It will still be very hard to make it work.
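To illustrate, a pair of such line-spanning kernels might look like this in Keras (filter counts are arbitrary; this is a sketch of the idea, not a tested architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each position first sees its entire row, then its entire column.
model = keras.Sequential([
    keras.Input(shape=(128, 64, 2)),
    layers.Conv2D(32, (1, 64), padding='same', activation='relu'),   # full-width kernel
    layers.Conv2D(32, (128, 1), padding='same', activation='relu'),  # full-height kernel
])
print(model.output_shape)  # (None, 128, 64, 32)
```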
The hard way: these puzzles are solved by a strongly recurrent process. At each step the cells that can be determined get filled in with a zero or a one, and based on those, the next cells get filled in. So a recurrent neural network would make the most sense to me, where convolution is used to let the predictions of the neighbors influence the current prediction.

Related

Imbalanced aspect ratio of input in CNN

Consider the following model using Keras in TensorFlow.
Conv2D(
    filters = 2**(5 + i),  # i = number of times Conv2D has been called so far
    kernel_size = (3, 3),
    strides = (1, 1),
    padding = 'valid')
MaxPooling2D(
    pool_size = (2, 2))
Layer Output Shape Param
-----------------------------------------------
L0: Shape (50, 250, 1 ) 0
L1: Conv2D (48, 248, 32 ) 320
L2: MaxPooling2D (24, 124, 32 ) 0
L3: Conv2D_1 (22, 122, 64 ) 18496
L4: MaxPooling2D_1 (11, 61, 64 ) 0
L5: Conv2D_2 (9, 59, 128) 73856
L6: MaxPooling2D_2 (4, 29, 128) 0
L7: Conv2D_3 (2, 27, 256) 295168 !!
L8: MaxPooling2D_3 (1, 13, 256) 0
L9: Flatten (3328) 0
L10: Dense (512) 1704448 !!!
L11: ...
Here, an input shape with a ratio of 1:5 is used. After L8 there cannot be any more convolutional layers, because one side is already 1. In fact, whenever input_side < kernel_size, no further convolutional layers are possible; the network is forced to flatten into a vector with a high number of units, resulting from the large shape [1][3] and the large number of filters [2] deep into the network. The Dense layer [4] that follows then has a high number of parameters, which requires a lot of computation time.
To reduce the number of parameters specific to the problems highlighted in [x] above, I am considering these methods:
Add a (1, 2) stride to early Conv2D layers. (Refer to this thread)
Reduce the number of filters, say, from [32, 64, 128, 256, ...] to [16, 24, 32, 48, ...].
Resize the input data to a square shape so that more Conv2D layers can be applied.
Further reduce the number of units in the first Dense layer, say, from 512 to 128.
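For point 4 the saving is easy to quantify, since a Dense layer's parameter count is just flattened_size * units + units:

```python
flat = 1 * 13 * 256           # size of the flattened L9 output (3328)

dense_512 = flat * 512 + 512  # weights + biases with 512 units
dense_128 = flat * 128 + 128  # the same layer with 128 units

print(dense_512)  # 1704448, matching the summary above
print(dense_128)  # 426112, about a 4x reduction
```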
My question is: will these methods work, and how much will they affect the performance of the CNN? Is there a better approach to the problem? Thanks.
First of all, you can try 'same' padding instead of 'valid'. It will somewhat save you from the diminishing dimensions you are getting.
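The effect of the padding choice on one spatial axis can be checked with a little arithmetic (using the usual Keras conventions for 'valid' and 'same'):

```python
import math

def conv_out(n, kernel, stride=1, padding='valid'):
    """Output length along one axis of a Conv/Pool layer."""
    if padding == 'same':
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1  # 'valid'

# With 'valid' padding, the short side of the 50x250 input collapses to 1
# after four Conv(3x3) + MaxPool(2x2) blocks, exactly as in the summary:
side = 50
for _ in range(4):
    side = conv_out(side, 3)             # Conv2D, 'valid'
    side = conv_out(side, 2, stride=2)   # MaxPooling2D
print(side)  # 1

# With 'same' padding the convs no longer shrink it; only the pools do:
side = 50
for _ in range(4):
    side = conv_out(side, 3, padding='same')
    side = conv_out(side, 2, stride=2)
print(side)  # 3
```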
For point 1:
Adding a non-uniform stride is only good if your data has more variation in a certain direction, in this case horizontal.
For point 2:
The number of filters neither helps nor hurts the way your dimensions change. Reducing them would hurt your performance if your model was not overfitting.
For point 3:
Resizing the input to a square shape may seem like a good idea, but it would lead to unnecessary dead neurons because of all the extra padding you are adding. I would advise against it; it may hurt performance and lead to overfitting.
For point 4:
Here, again, the number of units doesn't change the dimensions. Reducing them would hurt your performance if your model was not overfitting.
Lastly, your network is deep enough to get good results. Rather than trying to go smaller and smaller, try adding more Conv2D layers between the MaxPools; that would work much better.

LSTM predicting constant value throughout

I understand that it is a long post, but help with any of the sections is appreciated.
I have some queries about the prediction method of my LSTM model. Here is a general summary of my approach:
I used a dataset of 50 time series for training. They start at a value of 1.09 and decay down to 0.82, with each time series having between 570 and 2000 data points (i.e., each time series has a different length but a similar trend).
I converted them to the dataset format accepted by Keras' LSTM/Bi-LSTM layers:
[1, 0.99, 0.98, 0.97] ==Output==> [0.96]
[0.99, 0.98, 0.97, 0.96] ==Output==> [0.95]
and so on..
Shapes of the input and output containers (arrays): input (39832, 5, 1) and output (39832,)
Error-free training
Prediction starts from an initial window of data with shape (1, 5, 1), taken from the actual data.
The predicted output is a single value, which is appended to a separate list (for plotting) as well as to the window, while the first value of the window is dropped. This window is then fed as input to the model to generate the next prediction.
This continues until I get the whole curve, for both models (LSTM and Bi-LSTM).
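The windowing in step 2 can be reproduced with a few lines of NumPy (make_windows is a hypothetical helper; names and the toy series are illustrative):

```python
import numpy as np

def make_windows(series, timesteps=5):
    """Split a 1-D series into (window, next value) training pairs."""
    X, y = [], []
    for i in range(len(series) - timesteps):
        X.append(series[i:i + timesteps])
        y.append(series[i + timesteps])
    # (samples, timesteps, features), as expected by Keras LSTM layers
    return np.array(X).reshape(-1, timesteps, 1), np.array(y)

X, y = make_windows(np.linspace(1.09, 0.82, 100))
print(X.shape, y.shape)  # (95, 5, 1) (95,)
```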
However, the prediction is not even close to the actual data: it flatlines to a fixed value, whereas it should look somewhat like the black curve (which is the actual data).
Figure:https://i.stack.imgur.com/Ofw7m.png
Model (similar code goes for Bi-LSTM model):
model_lstm = Sequential()
model_lstm.add(LSTM(128, input_shape=(timesteps, 1), return_sequences=True))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(128, return_sequences=False))
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mean_squared_error', optimizer=optimizers.Adam(0.001))
Curve prediction initialize:
start = cell_to_test[0:timesteps].reshape(1, timesteps, 1)
y_curve_lstm = list(start.flatten())
y_window = start
Curve prediction:
while len(y_curve_lstm) <= len(cell_to_test):
    yhat = model_lstm.predict(y_window)
    yhat = float(yhat)
    y_curve_lstm.append(yhat)
    y_window = list(y_window.flatten())
    y_window.append(yhat)
    y_window.pop(0)  # drop the oldest value (remove() would delete by value, not position)
    y_window = np.array(y_window).reshape(1, timesteps, 1)
    #print(yhat)
Model summary:
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_5 (LSTM) (None, 5, 128) 66560
_________________________________________________________________
dropout_5 (Dropout) (None, 5, 128) 0
_________________________________________________________________
lstm_6 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_6 (Dropout) (None, 128) 0
_________________________________________________________________
dense_5 (Dense) (None, 1) 129
=================================================================
Total params: 198,273
Trainable params: 198,273
Non-trainable params: 0
_________________________________________________________________
And in addition to diagnosing the problem, I am really trying to find the answers to the following questions (I looked up other sources, but in vain):
Is my data enough to train the LSTM model? I have been told that it requires thousands of data points, so I feel that my current dataset more than satisfies that condition.
Is my model less/more complex than it needs to be?
Does increasing the number of epochs, layers, and neurons per layer always lead to a 'better' model, or are there optimal values? If the latter, is there a method to find the optimum, or is trial and error the only way?
I trained with 25 epochs, which gave me a loss of 1.25 * 10e-4. Should the loss be lower for the model to predict the trend? (I am focused on getting the shape right first and accuracy later, because training takes too long with more epochs.)
Following on from the previous question, does the loss have the same units as the data? I ask because the data has a resolution of up to 10e-7.
Once again, I understand that it has been a long post, but help in any of the sections is appreciated.

LSTM having a systematic offset between predictions and ground truth

I think I'm currently experiencing a systematic offset in an LSTM model between the predictions and the ground-truth values. What's the best approach to proceed from here?
The model architecture, along with the predictions and ground-truth values, is shown below. This is a regression problem where the historical data of the target plus 5 other correlated features X are used to predict the target y. Currently the input sequence length n_input is 256 and the output sequence length n_out is one. Put simply, the previous 256 points are used to predict the next target value.
X is normalized. Mean squared error is used as the loss function. Adam with a cosine-annealing learning rate schedule is used as the optimizer (min_lr=1e-7, max_lr=6e-2).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnnlstm_8 (CuDNNLSTM) (None, 256) 270336
_________________________________________________________________
batch_normalization_11 (Batc (None, 256) 1024
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 256) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 271,617
Trainable params: 271,105
Non-trainable params: 512
_________________________________________________________________
Increasing the number of units in the LSTM layer, adding more LSTM layers (with return_sequences=True), or adding dense layers after the LSTM layer(s) only seems to lower the accuracy. Any advice would be appreciated.
Additional information on the image: the y-axis is a value, the x-axis is time (in days). NaNs have been replaced with zero, because the ground-truth value in this case can never reach zero. That's why the odd outliers are in the data.
Edit:
I made some changes to the model, which increased accuracy. The architecture is the same, but the features used have changed: currently only the historical data of the target sequence itself is used as a feature. Along with this, n_input was changed to 128, Adam was swapped for SGD, mean squared error for mean absolute error, and finally the NaNs have been interpolated instead of being replaced with 0.
One step ahead predictions on the validation set look fine:
However, the offset on the validation set remains:
It might be worth noting that this offset also appears on the train set for x < ~430:
It looks like your model is overfitting and is simply always returning the value from the last timestep as its prediction. Your dataset is probably too small for a model with this number of parameters to converge. You'll need to resort to techniques that combat overfitting: aggressive dropout, adding more data, or trying simpler, less over-parameterized methods.
This phenomenon (LSTMs returning a shifted version of the input) has been a recurring theme in many Stack Overflow questions. The answers there might contain some useful information:
LSTM Sequence Prediction in Keras just outputs last step in the input
LSTM model just repeats the past in forecasting time series
LSTM NN produces “shifted” forecast (low quality result)
Keras network producing inverse predictions
Stock price predictions of keras multilayer LSTM model converge to a constant value
Keras LSTM predicted timeseries squashed and shifted
LSTM Time series shifted predictions on stock market close price
Interesting results from LSTM RNN : lagged results for train and validation data
Finally, be aware that, depending on the nature of your dataset, there might simply be no pattern to discover in your data at all. You see this a lot with people trying to predict the stock market with LSTMs (there is even a question on Stack Overflow about how to predict lottery numbers).
The answer is much simpler than we thought...
I saw multiple people saying this is due to overfitting and dataset size; some others stated it is due to rescaling.
After several tries, I found the solution: detrend the data before feeding it to the RNN.
For example, you can do a simple degree-2 polynomial fit of the data, which gives you a polynomial formula, and subtract the corresponding formula value from each data point. That gives a new dataset which can be fed to the LSTM; after prediction, just add the trend back to the result, and the results should look better.
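A minimal version of this detrending round-trip, using NumPy's polyfit on a toy series (the data itself is made up for illustration):

```python
import numpy as np

t = np.arange(200, dtype=float)
series = 1.09 - 1.3e-3 * t + 1e-6 * t**2 + 0.01 * np.sin(t / 5)  # toy decaying series

coeffs = np.polyfit(t, series, deg=2)  # degree-2 polynomial fit
trend = np.polyval(coeffs, t)
detrended = series - trend             # train the LSTM on this instead

# After predicting on the detrended scale, add the trend back:
restored = detrended + trend
print(np.allclose(restored, series))   # True
```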

CNN Keras: How many weights will be trained?

I have a small comprehension problem with CNNs, and I'm not quite sure how many filters, and thus weights, are trained.
Example: I have an input layer of 32x32 pixels with 3 channels (i.e. shape (32, 32, 3)). Now I use a 2D-convolution layer with 10 filters of shape (4, 4). So I end up with 10 channels, each of shape (29, 29), but do I now train a separate filter for each input channel, or are they shared? Do I train 3x10x4x4 weights or 10x4x4 weights?
You can find out the number of (non-)trainable parameters of a model in Keras using the summary function:
from keras import models, layers
model = models.Sequential()
model.add(layers.Conv2D(10, (4,4), input_shape=(32, 32, 3)))
model.summary()
Here is the output:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 29, 29, 10) 490
=================================================================
Total params: 490
Trainable params: 490
Non-trainable params: 0
In general, for a 2D-convolution layer with k filters of size w*w applied on an input with c channels, the number of trainable parameters (considering one bias parameter per filter, the default case) is equal to k*w*w*c+k, i.e. k*(w*w*c+1). In the example above we have k=10, w=4, c=3, therefore 10*(4*4*3+1) = 490 trainable parameters. As you can infer, each channel has its own weights; they are not shared. Furthermore, the number of parameters of a 2D-convolution layer does not depend on the width or height of the previous layer.
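The formula is easy to check directly:

```python
def conv2d_params(k, w, c):
    """Trainable parameters of a Conv2D layer: k filters of size w*w over c
    channels, plus one bias per filter."""
    return k * (w * w * c + 1)

print(conv2d_params(10, 4, 3))  # 490, matching the summary above
```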
Update:
A convolution layer with depth-wise shared weights: I am not aware of such a layer and could not find a built-in implementation in Keras or Tensorflow either. But after thinking about it, you realize that it is essentially equivalent to summing all the channels together and then applying a 2D-convolution to the result. For example, in the case of a 32*32*3 image, first all three channels are summed together, resulting in a 32*32*1 tensor, and then a 2D-convolution can be applied to that tensor. Therefore, at least one way of achieving a 2D-convolution with shared weights across channels in Keras could look like this (which may or may not be efficient):
from keras import models, layers
from keras import backend as K
model = models.Sequential()
model.add(layers.Lambda(lambda x: K.expand_dims(K.sum(x, axis=-1)), input_shape=(32, 32, 3)))
model.add(layers.Conv2D(10, (4,4)))
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
lambda_1 (Lambda) (None, 32, 32, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 29, 29, 10) 170
=================================================================
Total params: 170
Trainable params: 170
Non-trainable params: 0
One good thing about that Lambda layer is that it can be added anywhere (e.g. after the convolution layer). But I think the most important question to ask here is: "Why would using a 2D-conv layer with depth-wise shared weights be beneficial?" One obvious answer is that the network size (i.e. the total number of trainable parameters) is reduced, and therefore there might be a decrease in training time, which I suspect would be negligible. Further, using shared weights across channels implies that the patterns present in different channels are more or less similar. But this is not always the case, for example in RGB images, so by using shared weights across channels I guess you might observe a (noticeable) decrease in network accuracy. So, at least, you should keep this trade-off in mind and experiment with it.
However, there is another kind of convolution layer which you might be interested in, called "depth-wise separable convolution", which is implemented in Tensorflow, and Keras supports it as well. The idea is that a separate 2D-conv filter is applied to each channel, and afterwards the resulting feature maps are aggregated using k 1*1 convolutions (k here is the number of output channels). It essentially separates the learning of spatial features from depth-wise features. In his paper "Xception: Deep Learning with Depthwise Separable Convolutions", Francois Chollet (the creator of Keras) shows that using depth-wise separable convolutions improves both the performance and the accuracy of the network. And here you can read more about the different kinds of convolution layers used in deep learning.
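Assuming Keras defaults for SeparableConv2D (depth_multiplier=1, bias only on the pointwise stage — worth confirming against model.summary()), its parameter count can be estimated the same way:

```python
def separable_conv2d_params(k, w, c):
    """Depthwise stage (one w*w filter per channel) followed by a pointwise
    1x1 stage mapping c channels to k outputs, with k biases."""
    depthwise = w * w * c
    pointwise = c * k + k
    return depthwise + pointwise

print(separable_conv2d_params(10, 4, 3))  # 88, versus 490 for the regular Conv2D above
```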

How to train a network in Keras for varying output size

I have a basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements, representing a coordinate (x, y). So in goes an array, out comes an array.
The problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I want a vector of coordinates as output, without prior knowledge of the number of coordinates.
Network is created by
model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length-2 vector as output, and Dense layers do not accept None as their size. How do I modify the network so that it can train on, and output, a vector of vectors (a list of coordinates)?
A recurrent neural network (RNN) would be much more appropriate; these models are typically called seq2seq, that is, sequence-to-sequence. Recurrent nets use layers like LSTM and GRU and can take variable-length sequences as both input and output. Just look at how machine translation is done with RNNs.
This can be done directly with Keras, and there are many examples around the internet, for example this one.
An RNN is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. You would then have, say, 100 coordinate predictions for every forward pass through your model, each with an associated prediction that determines whether it is correct or not. Only predictions above a certain confidence threshold count. That threshold is what allows the model to choose how many points to use each time (with a maximum set by the number of outputs, 100 in this example).
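The thresholding step could look like this (pure NumPy; the network outputs are faked with random numbers for illustration):

```python
import numpy as np

def select_points(coords, confidences, threshold=0.5):
    """Keep only the predicted coordinates whose confidence clears the threshold."""
    return coords[confidences >= threshold]

# Pretend the network emitted 100 (x, y) pairs plus 100 confidence scores:
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(100, 2))
confidences = rng.uniform(0, 1, size=100)

kept = select_points(coords, confidences)
print(kept.shape[0], "points kept out of 100")
```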
R-CNN is a family of models that does just that. Here is the first Keras implementation I found on GitHub: https://github.com/yhenon/keras-frcnn
