Does anybody know a way to access the outputs of the intermediate layers from BERT's hosted models on TensorFlow Hub?
The model is hosted here. I have explored the meta graph and found that the only signatures available are "tokens", "tokenization_info", and "mlm". The first two are illustrated in the examples on GitHub, and the masked language model signature doesn't help much. Some models, like Inception, let you access all of the intermediate layers, but not this one.
Right now, all I can think of to do is:
Run [i.values() for i in tf.get_default_graph().get_operations()] to get the names of the tensors, find the ones I want (out of thousands), and then
call tf.get_default_graph().get_tensor_by_name(name_of_the_tensor) to access the values, stitch them together, and connect them to my downstream layers.
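In code, that workaround looks roughly like this (a sketch using the TF1-era hub.Module API; the tensor name at the end is purely illustrative and has to be found in the dump):

import tensorflow as tf
import tensorflow_hub as hub

input_ids = tf.placeholder(tf.int32, [None, 128])
input_mask = tf.placeholder(tf.int32, [None, 128])
segment_ids = tf.placeholder(tf.int32, [None, 128])

bert = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")
bert_outputs = bert(
    dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
    signature="tokens", as_dict=True)

# Dump all tensor names to find the intermediate-layer outputs
for op in tf.get_default_graph().get_operations():
    print(op.values())

# Illustrative name only; the real name has to come from the dump above
layer_11 = tf.get_default_graph().get_tensor_by_name(
    "module/bert/encoder/layer_11/output/LayerNorm/batchnorm/add_1:0")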
Anybody know a cleaner solution with TensorFlow?
BERT is one of the latest achievements in the area of transformer language models. Unlike its predecessors, BERT achieves a bidirectional architecture using MLM (Masked Language Modeling). This provides better contextualized word/sentence embeddings for a variety of NLP tasks. For general usage, BERT provides SOTA embeddings with its last layer. For research purposes, however, it is also advisable to consider intermediate layers for text representations. The picture below shows the effect of different ways of using intermediate layers.
As can be seen from the picture, it is mostly advised to sum the last four layers. Summing the last four layers gives a smaller embedding dimension than concatenation and results in only about a 0.2% performance difference. Intermediate layers can be extracted with the scripts provided on the original BERT GitHub page, but wiring those scripts into a downstream NLP task requires custom Keras layers. Instead, TensorFlow Hub provides a one-line BERT Keras layer. The BERT TensorFlow Hub solutions are updated on a regular basis. The first two versions only provided sentence (pooled_output) or word (sequence_output) embeddings. With v3, BERT now provides intermediate layer information. The link to BERT v3 is provided below.
BERT-LARGE v3 TF-HUB
On the linked page, the section named "Advanced topics" states the following:
The intermediate activations of all L=24 Transformer blocks (hidden layers) are returned as a Python list: outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, 1024] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to sequence_output.
In v3, one can access the intermediate layer information via encoder_outputs. The intermediate layers are returned as a Python list so you can carry out concatenation, summation, or other operations on them. Another extension in v3 is that the BERT TensorFlow Hub model now provides a pre-processor. BERT takes three inputs: input_word_ids, input_mask, and input_type_ids. The pre-processor takes a string as input and returns the required BERT inputs.
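A minimal sketch of the v3 workflow, combining the pre-processor with encoder_outputs (the exact TF-Hub handles below are my assumption of the current ones; check the model page):

import tensorflow as tf
import tensorflow_hub as hub

# Pre-processor turns raw strings into input_word_ids, input_mask, input_type_ids
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3",
    trainable=False)

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
outputs = encoder(encoder_inputs)

# Sum of the last four Transformer blocks, shape [batch_size, seq_length, 1024]
sum_last_four = tf.keras.layers.Add()(outputs["encoder_outputs"][-4:])
model = tf.keras.Model(text_input, sum_last_four)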
I haven't had a chance to test this approach, but if it is not very crucial I recommend using the last-layer information. BERT is a very powerful, though GPU-dependent, text representation method compared to old lookup-table approaches. A common issue researchers face with BERT is OOM problems on a single GPU. To mitigate this, use TF2 with memory growth, as in the snippet below. I will try to test BERT Hub v3 and give more feedback.
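The memory-growth setting is a standard TF2 API call; run it before any GPU operation:

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)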
Related
Now, I want image features so I can compute image similarity. We can easily get features using a pre-trained VGG19 model in TensorFlow. But the VGG19 model has many layers, and I don't know which layer I should use to get features. Which layer's output is appropriate for this problem?
import tensorflow as tf

# I think this is the correct way to extract features
model = tf.keras.applications.VGG19(include_top=True,
                                    weights='imagenet')
inputs = model.input
# Second-to-last layer (fc2): a 4096-dim vector per image
outputs = model.layers[-2].output
extract_model = tf.keras.Model(inputs, outputs)
My intuition is that the closer you get to the last output, the more powerful the features the model produces. But some tutorials say to use include_top=False to extract features (e.g. Image Captioning with Attention, TensorFlow).
So I don't know which layer I should use. Please help me out here in this thread.
include_top=False may be used because the last 3 layers (for that specific model) are fully connected layers, which are not typically good feature vectors. If the model directly outputs a feature vector, then you don't need it.
Most people use the last layer for transfer learning, but it may depend on your application. For example, Gatys et al. show that the first few layers of VGG are sensitive to the style of an image while later layers are sensitive to its content.
I would probably try all of them in a hyperparameter search and see which gives the best performance. If by image similarity you mean the similarity of the objects contained inside, I would probably start with the last layer.
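As a starting point, here is a hedged sketch of the include_top=False route with global average pooling, which yields one 512-dim vector per image (the dummy batch is only for illustration):

import numpy as np
import tensorflow as tf

# Convolutional features only, averaged into a single vector per image
base = tf.keras.applications.VGG19(include_top=False,
                                   weights='imagenet',
                                   pooling='avg')

# Dummy batch just to illustrate the call; use your real images instead
images = np.random.rand(2, 224, 224, 3).astype('float32')
images = tf.keras.applications.vgg19.preprocess_input(images * 255.0)
features = base.predict(images)  # shape (2, 512)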
How can I use a Long Short-Term Memory (LSTM) network to predict a future value x(t+1) (an out-of-sample prediction) based on a historical dataset? I have read and tried many web tutorials on forecasting and prediction with LSTMs, but I am still far from the goal. What is the exact procedure for this prediction? Is it as simple as shifting the target array n steps, where n is the number of future predictions, and running the prediction operation? Or are there other techniques?
Please help or leave a suggestion.
Can you provide the framework you are using? TensorFlow? PyTorch? Which web tutorials specifically?
Assuming you are going with TensorFlow, you can copy and paste code from one of these, test that it works on the provided dataset, then modify the input encoding functions to fit your dataset, and run it on your data.
https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo (best)
https://github.com/sebastianheinz/stockprediction
https://github.com/talolard/MarketVectors/blob/master/preparedata.ipynb (you will have to replace the fc layers with an LSTM, and fiddle with the inputs)
In general, the procedure is something like the following (assuming TensorFlow; a minimal sketch follows the list):
Download Dataset
Create a function to load batches of data
Create a function to encode batch of data (normalization, other transforms)
Create an LSTM layer to receive the series of inputs.
Create an output layer (usually fully connected) that takes the last LSTM state and predicts an output of your desired size.
Create a TF session to wire everything together, and hit run.
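Here is the promised minimal sketch of those steps in tf.keras, on a toy sine series (the window size and layer sizes are untuned assumptions):

import numpy as np
import tensorflow as tf

# Toy series standing in for your historical data
series = np.sin(np.arange(1000) * 0.1).astype('float32')

window = 20  # look back at {t-20 ... t-1} to predict x(t)
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]  # LSTM expects (batch, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),  # fully connected head predicting one value
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32)

# Out-of-sample prediction: feed the most recent window of observations
next_value = model.predict(series[-window:].reshape(1, window, 1))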
Some questions to ask conceptually about which network to use:
How many inputs map to how many outputs? See this excellent lecture by Karpathy: http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf
How far back do you consider the stock prices, e.g. {t-100 ... t} or {t-10 ... t}? This may dictate the size of the hidden layers.
What other information do you think is relevant to the model? Does stock A influence stock B? In that case you may have two LSTMs outputting a state to your fully connected layer...
I'm trying to train my LSTM model in TensorFlow, and my model has to calculate a parameter inside another parameter. I want to train both parameters together.
More details are in the picture below.
I think that the TensorFlow LSTM module's input must be a complete sequence, with parameters fed in via something like tf.placeholder.
How can I do this in TensorFlow? Or can you recommend another framework better suited to this task than TensorFlow?
Sorry for my poor English.
First of all, your usage of the word "parameter" is quite confusing. Normally, parameters are referred to as trainable parameters, meaning every variable that is trained by the optimizer. There are also so-called hyper-parameters, which have to be set by hand, e.g. the model topology.
TensorFlow works with tensors, which are representations of the data used to build the workflow; they are filled with data at run time via placeholders, which act as entry points for the data.
Also, if you have trouble building your model in TensorFlow, there is also Keras. Keras can run with TensorFlow as its backend, but model building is much easier, and Keras is also available in the TensorFlow API as tf.keras. In Keras, one or multiple LSTMs are simplified as a layer that can be added to your model, as in the sketch below.
If you would like a more specific answer to your question, please provide code describing your problem.
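For illustration, a minimal sketch of stacked LSTM layers in tf.keras (all shapes are placeholders, not recommendations):

import tensorflow as tf

model = tf.keras.Sequential([
    # return_sequences=True passes the full sequence on to the next LSTM
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 10)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')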
I am searching for a way to forward a lower layer's output to a higher layer in a loaded VGG16 model using CNTK.
The background of my problem is:
I reimplemented some parts of Fully Convolutional Networks for Semantic Segmentation, but then I ran into some problems. Starting with this example, I first replaced the fully connected layers with fully convolutional ones and split the sequence in the model-definition part into chunks, where I could simply access pool3 and pool4 for later usage, e.g. in Convolution2D((1,1), num_classes, name='score_pool4')(pool4). This works fine, but after building the model I noticed that I need to implement my own way of reading batches, because the built-in reader does not support 2D labels right now. So I simply read the images using OpenCV and replaced training_session(...).train() with a for loop and trainer.train_minibatch({model['features']: my_loaded_features, model['labels']: my_2D_labels}). This works well, but because of the removed training_session part, I don't know where I could apply the existing VGG16 weights.
My problem is:
I searched for transfer learning examples where people load models using C.load_model(...) and then clone the needed layers, but now I am wondering how I could access cloned_layers->pool4 (in the middle of the loaded model) if I also want to use it in deeper layers.
I tried Convolution2D((1,1), num_classes, name='score_pool4')(cloned_layers.find_by_name('pool4')), but I ended up with some error messages during learner initialization because of "unknown shape information" in the weight variables used.
So how can I access those layers within the loaded model for later (deeper) usage?
Thanks for reading (and maybe helping)!
If you are looking to read custom data, there are two tutorials on building your own readers: https://cntk.ai/pythondocs/manuals.html
Regarding cloning parts of a network: here is a link to another post on Stack Overflow that has example code. A hedged sketch of the pattern follows below.
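This sketch adapts the standard CNTK transfer-learning cloning pattern to the question; the model path, the node names 'data' and 'pool4', the input shape, and num_classes are all assumptions to replace with your own:

import cntk as C
from cntk.layers import Convolution2D

base_model = C.load_model('VGG16_ImageNet_CNTK.model')

# Locate the nodes to cut at (names are assumptions; inspect your model)
feature_node = C.logging.find_by_name(base_model, 'data')
pool4_node = C.logging.find_by_name(base_model, 'pool4')

# Clone input -> pool4 with frozen VGG weights, leaving the input open
frozen_to_pool4 = C.combine([pool4_node.owner]).clone(
    C.CloneMethod.freeze, {feature_node: C.placeholder(name='features')})

# Apply the clone to a fresh input variable; pool4_out can now feed
# deeper, trainable layers such as the 1x1 score convolution
features = C.input_variable((3, 224, 224), name='features')
pool4_out = frozen_to_pool4(features)
num_classes = 21  # assumption
score_pool4 = Convolution2D((1, 1), num_classes, name='score_pool4')(pool4_out)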
Is there a way to learn unsupervised features from a set of images? Similar to word2vec or doc2vec, where a neural network is trained and, given a new document, we get its feature vector.
I am expecting something similar to this example, which shows that one can load a trained NN model and predict features for new images.
Any simple example of how to implement a CNN over images and get their features back would help!
Suppose in this example
If I want to get CNN features for all of X_train and X_test ... is there any way to do that?
Also, if we can get the per-layer activations for each image, we can stack them and use them as features. In that case, is there a way to get those?
Using those features for an unsupervised task would be easier if we treat them as vectors.
If I understood your question correctly, this task is quite common in the deep learning field. In the case of images, what I consider best is a convolutional autoencoder. You may read about this architecture, e.g., here:
http://people.idsia.ch/~ciresan/data/icann2011.pdf
A previous version of Keras supported this architecture as one of its core layers, though from version 1.0 I noticed that it disappeared from the documentation. But it's still quite easy to build from scratch :)
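For example, a minimal sketch in the modern tf.keras API (the 28x28 grayscale input is an assumption, e.g. MNIST):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D(2)(x)                    # 14x14
x = layers.Conv2D(8, 3, activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D(2)(x)              # 7x7x8 feature map

x = layers.Conv2D(8, 3, activation='relu', padding='same')(encoded)
x = layers.UpSampling2D(2)(x)                    # 14x14
x = layers.Conv2D(16, 3, activation='relu', padding='same')(x)
x = layers.UpSampling2D(2)(x)                    # 28x28
decoded = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# After training on images alone (unsupervised), reuse the encoder half
# to extract a feature vector for any new image
encoder = tf.keras.Model(inputs, layers.Flatten()(encoded))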
In non-image cases there are also other approaches, e.g. Restricted Boltzmann Machines.
UPDATE:
When it comes to which activations are best for obtaining new features from a neural network, in my personal experience it depends on the size of the net you use. If you use a network whose last layer is wide (has a lot of nodes), it might be useful to take only the last layer (because of the number of parameters, considering previous layers as well may harm learning performance). But if (as with some MNIST networks) your last layer is not sufficient for the task, you may try also using the previous layers' activations, or even the whole network's activity. To be honest, I'm not expecting much improvement in that case, but you may try. I think you should use both approaches: start by taking only the last layer's activations, and then check the behaviour of your new classifier when you add activations from previous layers.
What I would also strongly advise is getting some insight into what sort of features the network is learning, by using t-SNE embeddings of its activations. In many cases I have found this useful, e.g. for checking whether the size of a layer is sufficient. Using t-SNE you may check whether the features obtained from the last layer are good discriminators of your classes. It may also give you good insights about your data and what neural networks are really learning (alongside amazing visualizations :) )
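A minimal sketch with scikit-learn, where the random features and labels are stand-ins for your extracted activations and class ids:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))    # stand-in for network activations
labels = rng.integers(0, 10, size=500)   # stand-in for class ids

# Project activations to 2D and colour by class to inspect separability
embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5)
plt.show()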