I have a dataset with three labels.
First, I'm loading my data into a Dataset with the ImageFolder class and the CenterCrop transform.
So every picture now has three channels (r, g, b) with 224x224 values in the range 0..1. I thought my NN would therefore need 224×224×3 input nodes.
I get this error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (21504x224 and 150528x200)
Assuming that mat1 is the image batch (I have no clue where the 21504 is coming from) and the second one is the first nn.Linear(224*224*3, 200) layer, I can see why they cannot be multiplied. So I changed nn.Linear(224*224*3, 200) to nn.Linear(224, 200). Now it works, but..
transform = transforms.Compose([transforms.Resize((300, 200)),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
dataset = datasets.ImageFolder('/content/drive/MyDrive/Colab Notebooks/data', transform=transform)
# ... BatchSize = 32
model = nn.Sequential(
    nn.Linear(224, 200),  # <- The Size of one Row!
    nn.Sigmoid(),
    nn.Linear(200, 40),
    nn.Sigmoid(),
    nn.Dropout(p=0.2),
    nn.Linear(40, 40),
    nn.Sigmoid(),
    nn.Linear(40, 20),
    nn.Sigmoid(),
    nn.Linear(20, 3)
).to(device)
# ...
prediction = model(x)
For some reason my prediction now has this form:
prediction[32][3][224]
I would expect 32 items in a list for a batch, where every item from the batch contains 3 values with the probability of each label. But why do the height/width come up here?
I think I have to change the format of the data from 224x224x3 (3D) to 150528 (1D), but operations like view() did not work, probably because of the lazy loading in ImageFolder.
The NN works fine with the MNIST dataset: (28x28) in and (10) out. So my guess is that the dataset has to be transformed, but I can't figure out how.
It will work if you apply transforms.Lambda and change the transforms in the following way:
transform = transforms.Compose([transforms.Resize((300, 200)),
                                transforms.CenterCrop(224),
                                transforms.ToTensor(),
                                transforms.Lambda(lambda x: torch.flatten(x))])
As a result each batch will have shape (batch_size, 150528), which means you need to specify an input size of 150528 for the first linear layer. (This also explains the original error and the extra dimensions in your prediction: nn.Linear only operates on the last dimension, so an un-flattened (32, 3, 224, 224) batch is treated as 32*3*224 = 21504 rows of length 224.) That said, it's not really reasonable to classify images with fully-connected linear layers in this case: it will be slow and is unlikely to converge to good predictions. Consider using convolutional layers instead.
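For completeness, a minimal sketch of a matching model, assuming the flattening transform above is in place (an nn.Flatten() as the first layer would be an equivalent alternative to the transforms.Lambda):

model = nn.Sequential(
    # the input is now a flat vector of 224*224*3 = 150528 values per image
    nn.Linear(224 * 224 * 3, 200),
    nn.Sigmoid(),
    nn.Linear(200, 40),
    nn.Sigmoid(),
    nn.Dropout(p=0.2),
    nn.Linear(40, 40),
    nn.Sigmoid(),
    nn.Linear(40, 20),
    nn.Sigmoid(),
    nn.Linear(20, 3)   # one logit per label
).to(device)

# prediction now has shape (batch_size, 3)
prediction = model(x)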
Related
Recently, I've been trying to implement a simple neural network for my extracted features.
So what I'd like the neural network to do for me is as below:
Once an image passes through the encoder, the corresponding embedding of that image is obtained.
The embedding's shape and a visualization of it are as follows:
[Visualization of the image's embedding]
The 13*13 gray area stands for the small patches that make up the whole image, and 64 is the flattened pixel values for each small patch. In other words, an image is divided into 13*13 (169) small patches, and each patch's pixels are flattened to 64 values, forming the third dimension of this block.
So my goal for this task is to construct a neural network that calculates a scalar score for each patch, meaning that for the entire image the NN computes a score for each little gray patch, so there would be 169 scalar scores per image in total.
In order to do so, I've tried to do convolution over the third dimension:
This is my training function:
for epoch in range(epochs):
    model.train()
    for i, sample in enumerate(train_loader):
        fe, target = sample['fe'], sample['label']  # fe shape: (64, 13, 13, 64)
        fe = np.transpose(fe, (0, 3, 1, 2))
        fe = fe.cuda()  # (64, 64, 13, 13)
        target = target.cuda()
        output = model(fe)
        ...
This is my model:
class ADModel(nn.Module):
    def __init__(self):
        super(ADModel, self).__init__()
        self.conv = nn.Conv2d(in_channels=64, out_channels=1, kernel_size=1, padding=0)

    def forward(self, feature):
        scores = self.conv(feature)
        return scores
I've also tried placing a fully connected layer with input size 64 and output size 1. Yet both of these attempts seemed unable to operate over the right dimension, and for the fully connected approach I barely have any idea how to do it. :(
So I'm currently wondering whether there are better ways to achieve what I'm trying to do, or whether my initial assumption is wrong.
Thank you for reading this far, I really appreciate it!
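For reference, a minimal sketch of both variants under the shapes described above (the tensor here is a random placeholder): a 1x1 convolution over a channels-first tensor and an nn.Linear over the last dimension both yield one scalar per patch.

import torch
import torch.nn as nn

fe = torch.randn(64, 13, 13, 64)              # (batch, 13, 13, 64) as described above

# 1x1 convolution: channels-first layout, one output channel = one score per patch
conv = nn.Conv2d(in_channels=64, out_channels=1, kernel_size=1)
scores_conv = conv(fe.permute(0, 3, 1, 2))    # -> (64, 1, 13, 13)

# Fully connected variant: nn.Linear acts on the last dimension
fc = nn.Linear(64, 1)
scores_fc = fc(fe)                            # -> (64, 13, 13, 1)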
I have grayscale images whose pixel arrays I have in x_train and x_test.
x_train is of size (2500, 21, 512) and x_test of size (500, 21, 512).
I want to build a CNN whose outputs are also images: y_train of size (2500, 21, 512) and y_test of size (500, 21, 512), which are the pixel arrays of other images that I want the network to predict.
In the MNIST examples this is done by taking y_train and y_test as vectors of label values, with an output of shape (3000, 1). How can I do the same, but with images as targets?
Hmmmm, I don't fully understand your question, but I will take a stab. Please let me know if I've misinterpreted it.
Your model takes the following input:
x_train: the image.
And outputs:
x_hat = an image with the same dimensions as `x_train`
Judging by the described architecture, it seems like you are building a convolutional autoencoder. Am I correct?
If so, you have to do the following:
You need to add a channel dimension of one so that the CNN can receive the input, which can be done by reshaping the tensor. Convolutional neural network inputs (in channels-first frameworks such as PyTorch) have the form (batch_size, channels, height, width). If you don't want to add a channel, you can use a simple feed-forward neural network (or MLP) instead. In that case you will still have to flatten the inputs into the shape (batch_size, pixels). For a more concrete example, given the MNIST dataset, if the batch_size is 32 your input dimension will be (32, 784), since MNIST images are 28 x 28; flattening gives an input size of 784.
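As a quick sketch of both options, assuming the arrays described in the question and a channels-first convention:

import numpy as np

# CNN route: add a channel dimension of one
x_cnn = x_train.reshape(-1, 1, 21, 512)          # (2500, 1, 21, 512)

# MLP route: flatten each image into a single vector
x_mlp = x_train.reshape(x_train.shape[0], -1)    # (2500, 21 * 512) = (2500, 10752)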
You can create a convolutional autoencoder by doing strided convolutions to downsample the images in the encoder layers. Afterwards, you can take the intermediate representation and do an upsampling operation via transposed convolutions. If you want to train a model that can actually generate samples instead of reconstructing, I recommend looking up variational autoencoders and generative adversarial networks.
The implementation will vary depending on the framework (e.g. PyTorch, TensorFlow, etc.).
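For instance, a minimal PyTorch sketch of such an encoder/decoder, assuming 1-channel inputs of size 21x512 (the layer sizes here are arbitrary choices, not a prescribed architecture):

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Downsample with strided convolutions
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # -> (16, 11, 256)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> (32, 6, 128)
            nn.ReLU(),
        )
        # Upsample back to the input size with transposed convolutions
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=(0, 1)),  # -> (16, 11, 256)
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=(0, 1)),   # -> (1, 21, 512)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.randn(8, 1, 21, 512)   # a dummy batch
print(model(x).shape)            # torch.Size([8, 1, 21, 512])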
When loading a model and using it for inference, I feed in an array of images, image_tile_tensor, whose shape is (total_tile, tile_height, tile_width, 3). image_tile_tensor is a numpy array.
I input this into my model for inference using the code below:
image_tile_tensor = tf.image.convert_image_dtype(image_tile_tensor, tf.float32)
image_tile_prediction = model.predict(image_tile_tensor, verbose=1)
During inference, all the predictions come out fine except for the last few image tiles in image_tile_tensor. For example, if I have 112 image tiles in total, the last 12 tiles have predictions with all values equal to NaN, while the first 100 tiles have the expected prediction values.
Any idea what the problem might be? I am a bit lost and don't know where to start debugging this at the moment. If it helps, tile_height and tile_width are (192, 192) and the training batch size = 16.
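One place to start (a hedged suggestion rather than a diagnosis) is to check whether the NaNs are already present in the input tiles or only appear in the predictions, and whether a suspect tile also fails when predicted on its own:

import numpy as np

inputs = np.asarray(image_tile_tensor, dtype=np.float32)

# Which tiles contain NaN/Inf before they even reach the model?
print("bad input tiles:", np.where(~np.isfinite(inputs).all(axis=(1, 2, 3)))[0])

# Which tiles produce NaN predictions?
bad = np.where(np.isnan(image_tile_prediction).any(axis=tuple(range(1, image_tile_prediction.ndim))))[0]
print("bad output tiles:", bad)

# Does the first bad tile also fail when predicted alone?
if len(bad) > 0:
    single = model.predict(inputs[bad[0]:bad[0] + 1], verbose=0)
    print("NaN when predicted alone:", np.isnan(single).any())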
I want to create a neural network that, simply speaking, creates an image out of an image (grayscale).
I have successfully created a dataset of 3200 examples of input and output (label) images.
(I know the dataset should be larger but that is not the problem right now)
The input [Xin] has the size (3200, 50, 30), since it is 50*30 pixels
The output [yout] has the size of (3200, 30, 20) since it is 30*20 pixels
I want to try out a fully connected network (later on a CNN)
The fully connected model is built like this:
# 5 Create Model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(256, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(30*20, activation=tf.nn.relu))
#compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# 6 Train the model
model.fit(Xin, yout, epochs=1) #train the model
After that I get the following error:
ValueError: Shape mismatch: The shape of labels (received (19200,)) should equal the shape of logits except for the last dimension (received (32, 600)).
I already tried to flatten yout:
youtflat = yout.transpose(1,0,2).reshape(-1,yout.shape[1]*yout.shape[2])
but this resulted in the same error
It appears you're flattening your labels (yout) completely, i.e., you're losing the batch dimension. If your original yout has a shape of (3200, 30, 20), you should reshape it to have a shape of (3200, 30*20), which equals (3200, 600):
yout = yout.reshape((3200, 600))
Then it should work
NOTE
The suggested fix, however, only removes the error. There are other problems with this approach. For the task you're trying to perform (producing an image as output), you cannot use sparse_categorical_crossentropy as the loss or accuracy as the metric. You should use 'mse' or 'mae' instead.
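Putting that together, a minimal sketch of the adjusted setup, assuming Xin and yout as described in the question (the layer sizes are the ones from the question; the last activation is changed to linear here as one reasonable choice for a regression-style output):

import tensorflow as tf

# each label becomes a flat 600-value vector
yout_flat = yout.reshape((yout.shape[0], 30 * 20))   # (3200, 600)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(50, 30)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(30 * 20, activation='linear'),  # one output per pixel
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.fit(Xin, yout_flat, epochs=1)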
I have a dataset of colored images in the form of an ndarray (100, 20, 20, 3) and 100 corresponding labels. When passing them as input to a fully connected neural network (not a CNN), what should I do with the 3 RGB values? Averaging them would perhaps lose some information, but if I don't manipulate them, my main issue is the batch size, as demonstrated below in PyTorch.
for epoch in range(n_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # because of rgb values, images is now 3 times the length of labels
        images = Variable(images.view(-1, 400))
        labels = Variable(labels)
        optimizer.zero_grad()
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
This returns 'ValueError: Expected input batch_size (300) to match target batch_size (100).' Should I have reshaped images into (1, 1200) dimension tensors? Thanks in advance for answers.
Since the size of labels is (100,), your batch data should have shape (100, H, W, C). I'm assuming your data loader returns a tensor of shape (100, 20, 20, 3). The error happens because you reshape the tensor to (300, 400).
Check your network architecture to see whether the expected input shape is (20, 20, 3).
If your network can only accept single-channel images, you can first convert your RGB images to grayscale.
Or, modify your network architecture so that it accepts 3-channel images. One convenient way is to add an extra layer that reduces 3 channels to 1 channel, so you do not need to change the other parts of the network.
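If you keep the RGB channels and stay with a fully connected network, a minimal sketch of the flattening fix inside the loop above (assuming the loader yields (100, 20, 20, 3) float tensors and net's first layer is sized to match):

# flatten each image to 20*20*3 = 1200 values, keeping the batch dimension
images = images.view(images.size(0), -1)   # (100, 1200)
outputs = net(images)                      # net's first layer: nn.Linear(1200, ...)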
Use grayscale images to reduce the input to a single channel, so the reshaped batch matches the 100 labels.