I am trying to use the following link to understand how a CVAE works. Although I can see how this works for something like a 28x28x1 input image, I'm not sure how to modify it for an input image of size 64x64x3.
I have tried looking at other sources for information, but all of them use the MNIST dataset, as in the example above. None of them really explain why they chose the numbers for filters, kernels, or strides. I need help understanding this and how to modify the network to work for a 64x64x3 input.
"None of them really explain why they chose the numbers for filters, kernels, or strides."
I'm new to CNNs too, but from what I understand it's really more about experimentation; there is no exact formula that tells you how many filters to use or what size they should be. It depends on your problem. If you are trying to recognize an object and the features that make it "recognizable" to the network are small, then small filters may work best; if you think those features are bigger, then larger filters may work better. Again, from what I've learned these are just rules of thumb, and you may end up with a CNN that has a completely different configuration.
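That said, to make the 64x64x3 case concrete, here is one possible way the convolutional stack could be laid out (just a sketch, assuming Keras; the filter counts, the kernel size of 3, and the latent size are arbitrary choices rather than the "right" ones). Each stride-2 convolution halves the spatial size, so three of them take 64 down to 8, and the decoder mirrors that back up to 64x64x3:

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32  # arbitrary choice

# Encoder: 64x64x3 -> 32x32 -> 16x16 -> 8x8
encoder_inputs = tf.keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(encoder_inputs)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

# Decoder: mirror the encoder, ending with 3 channels instead of 1
decoder_inputs = tf.keras.Input(shape=(latent_dim,))
x = layers.Dense(8 * 8 * 128, activation="relu")(decoder_inputs)
x = layers.Reshape((8, 8, 128))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
decoded = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(x)

The main changes versus the 28x28x1 version are the input shape, one extra downsampling step (64 needs three halvings to reach a small grid, where 28 only needs two), and 3 output channels in the last transposed convolution.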
I have an image that consists of only black and white pixels, 500x233 px. I want to use this image as an input (or maybe all the pixels in the image as individual inputs) and then receive 3 floating-point values as output, using machine learning.
I have spent all day on this and have come up with nothing. The best I can find are image classification libraries, but I am not trying to classify an image; I'm trying to get 3 values that range from -180.0 to 180.0.
I just want to know where to start. Tensorflow seems like it could probably do what I want, but I have no idea where to start with it.
I think my main issue is that I don’t have one output for each input. I’ve been trying to use each pixel’s value (0 or 1) as an input, but my output doesn’t apply to each pixel, only to the image as a whole. I’ve tried creating a string of each pixel’s value and using that as one input, but that didn’t seem to work either.
Should I be using neural networks? Or genetic algorithms? Or something else? Or would I be better off receiving only one of the three outputs I need, and training three separate models, one per output? Even then, I'm not sure how to get a floating-point value out of these things. Maybe machine learning isn't even the correct approach.
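For what it's worth, the closest thing I can picture is a small convolutional network with a 3-unit linear output trained as a regression; this is just a sketch on my part with guessed layer sizes (and an assumed 233-row by 500-column orientation), and I don't know whether it is even the right direction:

import tensorflow as tf
from tensorflow.keras import layers

# Input: the black-and-white image as a 233x500x1 array of 0/1 values.
# Output: 3 unconstrained floats, trained with a regression loss (MSE)
# instead of a classification loss.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(233, 500, 1)),
    layers.Conv2D(16, 3, strides=2, activation="relu"),
    layers.Conv2D(32, 3, strides=2, activation="relu"),
    layers.Conv2D(64, 3, strides=2, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3),  # linear output, one unit per target value
])
model.compile(optimizer="adam", loss="mse")

# images: (num_samples, 233, 500, 1) array of 0/1
# targets: (num_samples, 3) array of floats in [-180.0, 180.0]
# model.fit(images, targets, epochs=10, batch_size=32)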
Any help is greatly appreciated!
I am currently trying to get the Faster R-CNN network from here to work on Windows with TensorFlow. For that, I wanted to re-implement the ROI pooling layer, since it is not working on Windows (at least not for me; if you have any tips on porting to Windows with TensorFlow, I would highly appreciate your comments!). According to this website, you take your proposed ROI from your feature map and max-pool its content to a fixed output size. This fixed output is needed for the following fully connected layers, since they only accept a fixed-size input.
The problem now is the following:
After conv5_3, the last convolutional layer before ROI pooling, the box that results from the region proposal network is mostly about 5x5 pixels in size. This is totally fine, since the objects I want to detect usually have dimensions of 80x80 pixels in the original image (the downsampling factor due to pooling is 16). However, I now have to max-pool an area of 5x5 pixels and ENLARGE it to 7x7, the target size of the ROI pooling. My first try, simply doing interpolation, did not work. Padding with zeros did not work either. I always seem to get the same scores for my classes.
Is there anything I am doing wrong? I do not want to change the dimensions of any layer, and I know that my trained network works in general, because I have the reference implementation running on Linux on my dataset.
Thank you very much for your time and effort :)
There is now an official TF implementation of Faster R-CNN (and other object detection algorithms) in their Object Detection API; you should probably check it out.
If you still want to code it yourself, I wondered exactly the same thing as you and could not find an answer on how you're supposed to do it. My three guesses would be:
1) Interpolation; but it changes the feature values, so it destroys some information...
2) Resizing to 35x35 just by copying each cell 7 times and then max-pooling back to 7x7. (You don't have to actually do the resizing and then the pooling; for instance, in 1D it basically reduces to output[i] = max(input[floor(i*5/7)], input[ceil(i*5/7)]), with a similar max over 4 elements in 2D; be careful, I might have forgotten some +1/-1 somewhere.) I see at least two problems: some values are over-represented, being copied more often than others; but worse, some (small) values may not appear in the output at all! (Which you should avoid, given that you can store more information in the output than in the input.)
3) Making sure every input feature value is copied, unchanged, at least once into the output, at the best possible place (basically copy input[i] to output[j] with j = floor((i+1)*7/5) - 1). For the remaining spots, either leave a 0 or do interpolation. I would think this solution is the best, maybe combined with interpolation, but I'm really not sure at all.
It looks like smallcorgi's implementation uses my 2nd solution (without actually resizing, just using max pooling), since it's the same implementation as for the case where the input is bigger than the output.
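For concreteness, here is a rough NumPy sketch of that 2nd idea: each output bin takes the max over whichever input cells it overlaps, which works whether the region is smaller or larger than the 7x7 output. Treat it as pseudocode that happens to run; the exact rounding may differ from the reference implementation.

import numpy as np

def roi_pool(feature, out_h=7, out_w=7):
    # Max-pool a (h, w) region into a fixed (out_h, out_w) grid.
    # When h or w is smaller than the output size, some cells are
    # simply reused by several bins (the "enlarging" case).
    h, w = feature.shape
    out = np.zeros((out_h, out_w), dtype=feature.dtype)
    for i in range(out_h):
        r0 = int(np.floor(i * h / out_h))
        r1 = max(int(np.ceil((i + 1) * h / out_h)), r0 + 1)
        for j in range(out_w):
            c0 = int(np.floor(j * w / out_w))
            c1 = max(int(np.ceil((j + 1) * w / out_w)), c0 + 1)
            out[i, j] = feature[r0:r1, c0:c1].max()
    return out

# A 5x5 region is "enlarged" to 7x7 without interpolation.
print(roi_pool(np.arange(25, dtype=float).reshape(5, 5)).shape)  # (7, 7)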
I know it's late, but I'm posting this answer because it might help others. I have written code that shows how ROI pooling works for different height and width conditions of both the pooling grid and the region.
You can find the code on GitHub:
https://github.com/Parsa33033/RoiPooling
I would like to understand machine learning techniques better. I have read and watched a bunch of things on Python, sklearn and supervised feed-forward nets, but I am still struggling to see how I can apply all of this to my project and where to start. Maybe it is a little bit too ambitious yet.
I have the following algorithm, which generates nice patterns as binary inputs in a CSV file. The goal is to predict the next row.
The simplified logic of this algorithm is that the prediction for the next line (the top line being the most recent one) would be 0,0,1,1,1,0, and the one after that would either become 0,0,0,1,1,0 or go back to its previous step, 0,1,1,1,0. However, as you can see, the model is slightly more complex and noisy, which is why I would like to introduce some machine learning here. I am aware that, to get a reliable prediction, I will need to introduce other relevant inputs afterwards.
Would someone please help me get started and find my feet here?
I don't like throwing this out here without being able to provide a single piece of code, but I am slightly confused as to where to start.
Should I pass each previous line (line-1) as an input vector, with the associated output being the top line? Should I build the array manually from my whole dataset?
I guess I have to use the sigmoid function, and Python seems the most common way to do this; but for the synapses (or weights), I understand I also need to provide a constant. Should this be 1?
Finally, assuming you want this to run continuously, what would be required?
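To make the question more concrete, here is roughly the shape I imagine the data and model taking; this is purely a sketch on my side, and the choice of MLPClassifier, the hidden layer size, and the patterns.csv filename are guesses:

import numpy as np
from sklearn.neural_network import MLPClassifier

# The CSV loaded as a 2D array of 0/1, with the most recent row first,
# so rows[0] is the newest line and rows[-1] the oldest.
rows = np.loadtxt("patterns.csv", delimiter=",", dtype=int)

# Each input is one line, each target is the line that followed it
# (i.e. the line directly above it in the file).
X = rows[1:]
y = rows[:-1]

# A small feed-forward net with sigmoid (logistic) units; it accepts a
# multi-label 0/1 target, one output per column of the pattern.
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic", max_iter=2000)
clf.fit(X, y)

next_row = clf.predict(rows[:1])  # predicted continuation of the newest line

Is that roughly the right way to frame it, or am I off track?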
Please could you share with me some readings or simpler exercises that would help me increase my knowledge of all this.
Many thanks.
This is a fairly straightforward question, but I am new to the field. Using this tutorial, I have a great way of detecting certain patterns or features. However, the images I'm testing are large, and often the feature I'm looking for occupies only a small fraction of the image. When I run it on the entire picture the classification is bad, though when the image is zoomed in and cropped the classification is good.
I've considered writing a script that breaks an image into many smaller images and runs the test on all of them (time isn't a huge concern). However, this still seems inefficient and less than ideal. I'm wondering about suggestions for the best, but also easiest to implement, solution for this.
I'm using Python.
This may seem to be a simple question, and it is, but the answer is not so simple. Localisation is a difficult task and requires much more legwork than classifying an entire image. There are a number of different tools and models that people have experimented with. Some models include R-CNN, which looks at many regions in a manner not too dissimilar to what you suggested. Alternatively, you could look at a model such as YOLO or TensorBox.
There is no one answer to this, and this gets asked a lot! For example: Does Convolutional Neural Network possess localization abilities on images?
The term you want to be looking for in research papers is "localization". I hope that this gets you going in your project and you can progress from there. If you are looking for a quick-and-dirty solution (and one that is not time sensitive), then sliding windows is definitely a first step.
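As a rough starting point, the sliding-window pass can be as simple as the sketch below. It assumes you already have some classify(patch) scoring function from the tutorial you followed (that name is a placeholder), and the window size and stride are values you would tune:

# Naive sliding window: crop overlapping patches and score each one.
def sliding_window(image, classify, win=64, stride=32):
    h, w = image.shape[:2]
    detections = []
    for top in range(0, max(h - win, 0) + 1, stride):
        for left in range(0, max(w - win, 0) + 1, stride):
            patch = image[top:top + win, left:left + win]
            detections.append((classify(patch), top, left))
    # Highest-scoring window first; its coordinates localise the feature.
    return sorted(detections, reverse=True)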
I found an implementation for BEGAN using CNTK.
(https://github.com/2wins/BEGAN-cntk)
This uses the MNIST dataset instead of CelebA, which was used in the original paper.
However, I don't understand the resulting images, which look quite deterministic:
Output images of the trained generator (iter: 30000)
For different noise samples, I expect different outputs from the generator, but it doesn't do that regardless of the hyper-parameters. Which part of the code causes this problem?
Please explain it.
Use a higher gamma (for example gamma = 1 or 1.3, i.e. actually more than 1). That will certainly improve things, but it will not make them perfect. Run enough iterations, for example around 200k.
Please look at the paper carefully. It says the parameter gamma controls diversity.
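For reference, gamma enters BEGAN through the proportional control term that balances the discriminator and generator losses. Roughly, in pseudo-Python (L_real and L_fake stand for the autoencoder reconstruction losses of real and generated batches, and lambda_k is the small gain of the control term, as in the paper):

# gamma is the target ratio E[L(G(z))] / E[L(x)]: how hard the
# discriminator is asked to work on generated images relative to real
# ones, which is why the paper calls it the diversity ratio.
k = 0.0           # k_0
lambda_k = 0.001  # proportional gain for the control term

def began_losses(L_real, L_fake, gamma=1.0):
    global k
    loss_D = L_real - k * L_fake
    loss_G = L_fake
    k = min(max(k + lambda_k * (gamma * L_real - L_fake), 0.0), 1.0)
    return loss_D, loss_G

The paper reports that lower gamma values lead to lower image diversity, which is consistent with raising gamma as suggested above.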
One of the results that I obtained is shown in the attached image.
I'm also looking for the best parameters and best results, but I haven't found them yet.
Looks like your model might be getting stuck in a particular mode. One idea would be to add an additional condition on the class labels. Conditional GANs have been proposed to overcome such limitations.
http://www.foldl.me/uploads/2015/conditional-gans-face-generation/paper.pdf
This is an idea that would be worth exploring.
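As a minimal sketch of the conditioning idea (framework-agnostic NumPy, not tied to the CNTK code above; num_classes = 10 assumes MNIST-style labels), the generator input simply becomes the noise vector concatenated with a one-hot label, and the discriminator receives the same label alongside the image:

import numpy as np

def conditional_generator_input(batch_size, noise_dim=64, num_classes=10):
    # Draw noise and a random class label for each sample, then feed the
    # concatenation [noise, one_hot(label)] to the generator so its output
    # is conditioned on the label rather than collapsing to a single mode.
    z = np.random.randn(batch_size, noise_dim).astype(np.float32)
    labels = np.random.randint(0, num_classes, size=batch_size)
    one_hot = np.eye(num_classes, dtype=np.float32)[labels]
    return np.concatenate([z, one_hot], axis=1), labels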