I have been attempting to create a model that given an image, can read the text from it. I am attempting to do this by implementing a cnn, rnn, and ctc. I am doing this with TensorFlow and Keras. There are a couple of things I am confused about. For reading single digits, I understand that your last layer in the model should have 9 nodes, since those are the options. However, for reading words, aren't there infinitely many options, so how many nodes should I have in my last layer. Also, I am confused as to how I should add my ctc to my Keras model. Is it as a loss function?
I see two options here:
You can construct your model to recognize separate letters of those words, then there are as many nodes in the last layer as there are letters and symbols in the alphabet that your model will read.
You can make output of your model as a vector and then "decode" this vector using some other tool that can encode/decode words as vectors. One such tool I can think of is word2vec. Or there's an option to download some database of possible words and create such a tool yourself.
Description of your model is very vague. If you want to get more specific help, then you should provide more info, e.g. some model architecture.
Related
I'm new to machine learning, and I've been given a task where I'm asked to extract features from a data set with continuous data using representation learning (for example a stacked autoencoder).
Then I'm to combine these extracted features with the original features of the dataset and then use a feature selection technique to determine my final set of features that goes into my prediction model.
Could anyone point me to some resources or demos or sample code of how I could get started on this? I'm very confused on where to begin on this and would love some advice!
Okay, say you have an input of (1000 instances and 30 features). What I would do based on what you told us is:
Train an autoencoder, a neural network that compresses the input and then decompresses it, which has as a target your original input. The compressed representation lies in the latent space and encapsulates information about the input which is not directly accessible by humans. Now you may find such networks in tensorflow or pytorch. Tensorflow is easier and more straightforward so it could be better for you. I will provide this link (https://keras.io/examples/generative/vae/) for a variational autoencoder that may do the job for you. This has Conv2D layers so it performs really well for image data, but you can play around with the architecture. I cannot tell u more because you did not provide more info for your dataset. However, the important thing is the following:
After your autoencoder is trained properly and you need to make sure of it, (it adequately reconstructs the input) then you need to extract the aforementioned latent inputs (you will find more in the link). Now, that will be let's say 16 numbers but you can play with it. These 16 numbers were built to preserve info regarding your input. You said you wanted to combine these numbers with your input so might as well do that and end up with 46 input features. Now the feature selection part has to do with selecting the input features that are more useful for your model. That is not very interesting, you may find more information (https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e) and one way to select features is by training many models with different feature subsets. Remember, techniques such as PCA are for feature extraction not selection. I cannot provide any demo that does the whole thing but there are sources that can help. Remember, your autoencoder is supposed to return 16 numbers for each training example. Your autoencoder is trained only on your train data, with your train data as targets.
I have training data as two columns
1.'Sentences'
2.'Relevant_text' (text in this column is a subset of text in the column 'Sentences')
I tried training a RNN with LSTM directly treating 'Sentences' as input and 'Relevant_text' and output but the results were disappointing.
I want to know how to approach this type of problem? Does this kind of problem have a name? Which models should I explore?
If the target text is the subset of the input text, then, I believe, this problem can be solved as a tagging problem: make your neural network for each word predict whether it is "relevant" or not.
On the one hand, the problem of taking a text and selecting its subset that best reflects its meaning is called extractive summarization, and has lots of solutions, from the well known unsupervised textRank algorithm to complex BERT-based neural models.
On the other hand, technically your problem is just binary token-wise classification: you label each token (word or other symbol) of your input text as "relevant" or not, and train any neural network architecture which is good for tagging on this data. Specifically, I would look into architectures for POS tagging, because they are very well studied. Typically, it is BiLSTM, maybe with a CRF head. More modern models are based on pretrained contextual word embeddings, such as BERT (maybe, you won't even need to fine tune them - just use it as a feature extractor, and add a BiLSTM on top). If you want a more lightweight model, you can consider a CNN over pretrained and fixed word embeddings.
One final parameter you should time playing with is the threshold for classifying the word as relevant - maybe, the default one, 0.5, is not the best choice. Maybe, instead of keeping all the tokens with probability-of-being-important higher than 0.5, you would like to keep the top k tokens, where k is fixed or is some percentage of the whole text.
Of course, more specific recommendations would be dataset-specific, so if you could share your dataset, it would be a great help.
I Am an absolute beginner with Tensorflow. I have searched, but did not found how to do this:
If I have a list of strings like this:
["sentace1", "...", "sentance5000"]
How do I train a neural network to create similar sentences? What is the logic of generating data, text, images? Can someone explain to me using code, through this relatively basic example?
Also, If I'd add more layers and different types of data, could it create for example pictures or music?
A thousand thanks :)
Generation of music and text differs form generation of images. Text and music generation can be done with sequence models (LSTM, RNN, GRU etc), while image generation can be done with GAN - Generative Adversarial Network
Text generation:
For text generation, first step is to create embeddings form your sentence either from pre-trained embedding models (word2vec, GloVe etc) then apply this embedding to your sentences. There are several other embedding techniques that you can explore.
Next step is to fit your embedded features to a sequence model. Probably this one you can refer as starting point.
Music generation:
Music generation can be done with sequence models, difference being instead of sequence of words you have sound waveform, spectrogram, note, chord etc.
Image generation:
This one is different then above two.
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. https://arxiv.org/abs/1406.2661
Coming to your questions:
If I'd add more layers and different types of data, could it create for example pictures or music?
As stated, music and text generation can be done with similar type of network architecture (as both follows a sequence), while images needs to be treated differently.
Popular model for generating data in natural language processing is word2vec. It encodes words to vector space, then similar words to given can be generated. Basicly you can vectorise almost anything so variation you are looking for is sentence2vec which works similar but you give sentence as input and it encodes sentences to vectors.
Here is tutorial for tensorflow: tensorflow.org/tutorials/representation/word2vec
You can also try gensim implementation of w2v:
https://radimrehurek.com/gensim/models/word2vec.html
About:
Also, If I'd add more layers and different types of data, could it
create for example pictures or music?
For generating images you should read about Auto Encoders (eg. VAE) and Generative Adversarial Networks (GANs). These architectures works different from NLP architectures.
I am new in Tensorflow and I am trying to build model which will be able to perform OCR on my images. I have to read 9 characters (fixed in all images), numbers and letters. My model would be similar to this
https://matthewearl.github.io/2016/05/06/cnn-anpr/
My questions would be, should I train my model against each character firstly and after combine characters to get full label represented. Or I should train on full label straight ?
I know that I need to pass to model, images + labels for corresponding image, what is the format of those labels, is it textual file, I am bit confused about that part, so any explanation about format of labels which are passed to model would be helpful ? I appreciate, thanks.
There are a couple of ways to deal with this (the following list is not exhaustive).
1) The first one is word classification directly from your image. If your vocabulary of 9 characters is limited you can train a word specific classifier. You can then convolve this classifier with your image and select the word with the highest probability.
2) The second option is to train a character classifier, find all characters in your image, and find the most likely line that has the 9 character you are looking for.
3) The third option is to train a text detector, find all possible text boxes. Then read all text boxes with a sequence-based model, and select the most likely solution that follows your constraints. A simple sequence-based model is introduced in the following paper: http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf. Other sequence-based models could be based on HMMs, Connectionist Temporal Classification, Attention based models, etc.
4) The fourth option are attention-based models that work end-to-end to first find the text and then output the characters one-by-one.
Note that this list is not exhaustive, there can be many different ways to solve this problem. Other options can even use third party solutions like Abbyy or Tesseract to help solve your problem.
I'd recommend to train an end-to-end OCR model with attention. You can try the Attention OCR which we used to transcribe street names https://github.com/tensorflow/models/tree/master/research/attention_ocr
My guess it should work pretty well for your case. Refer to the answer https://stackoverflow.com/a/44461910 for instructions on how to prepare the data for it.
I followed the tutorial given at this site, which detailed how to perform text classification on the movie dataset using CNN. It utilized the movie review dataset to find predict positive and negative reviews.
My question is, is there any way to find the most important learned features from the model? Does Tensorflow/Theano has any support for this?
Thanks !
A word of warning: if you can trace the classification back to specific input features, it's quite possible that CNN is the wrong ML paradigm for your application. Most text processing uses RNN, bag-of-words, bi-grams, and other simple linear combinations.
The structure of a CNN is generally antithetical to identifying the importance of individual features. Because of the various non-linear layers, it is rarely possible to pick out any one feature as important; rather, the combinations of inputs form small structures of inference, which then convolve to form more complex structures, until the final output is driven by a series of neighbor relationships, cut-offs, poolings, and other items.
This is why back-propagation is so important to running CNNs: the causation chain does not reverse cleanly. Otherwise, we'd reduce the process to a simple linear NN with one hidden layer.
If you want to analyze what's happening, try visualizing your intermediate layers. There are various modules to help with that; for instance, try a search for "+theano +visualize +CNN -news" (the last is to remove the high-traffic references to Cable News Network). There are plenty of examples in image processing; we won't know how much it might help your text processing, until you try it.