SpaCy TextCategorizer Pipeline detailed - python

I'm currently working on an NLP project. While researching how to approach NLP, I found some articles about spaCy. But because I'm still new to Python, I don't understand how the spaCy TextCategorizer pipeline works.
Is there a detailed explanation of how this pipeline works? Does the TextCategorizer pipeline also use text feature extraction such as Bag of Words, TF-IDF, Word2Vec, or anything else? And what model architecture does the spaCy TextCategorizer use? Could someone explain this to me?

There's a lot of info in the docs:
https://spacy.io/usage/examples#textcat shows a code example
https://spacy.io/api/textcategorizer provides details on the architecture:
The model supports classification with multiple, non-mutually exclusive labels. You can change the model architecture rather easily, but by default, the TextCategorizer class uses a convolutional neural network to assign position-sensitive vectors to each word in the document. The TextCategorizer uses its own CNN model, to avoid sharing weights with the other pipeline components. The document tensor is then summarized by concatenating max and mean pooling, and a multilayer perceptron is used to predict an output vector of length nr_class, before a logistic activation is applied elementwise. The value of each output neuron is the probability that some class is present.
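To make the last part of that description more concrete, here is a minimal pure-Python sketch of the pooling and output steps. It is not spaCy's code: the helper names, the toy sizes, and the random untrained weights are all assumptions for illustration, and the CNN output is replaced by a random stand-in matrix.

```python
import math
import random


def max_mean_pool(token_vectors):
    """Summarize a (tokens x width) matrix as [max-pooled ; mean-pooled]."""
    columns = list(zip(*token_vectors))
    max_pooled = [max(col) for col in columns]
    mean_pooled = [sum(col) / len(col) for col in columns]
    return max_pooled + mean_pooled


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    """Logistic activation applied elementwise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def random_matrix(rng, n_rows, n_cols):
    return [[rng.uniform(-1, 1) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
width, hidden, nr_class = 4, 3, 2  # toy sizes, not spaCy's defaults

# Stand-in for the CNN output: one position-sensitive vector per token.
doc_tensor = random_matrix(rng, 5, width)

# Untrained random weights for a hidden layer and an output layer.
w_hidden, b_hidden = random_matrix(rng, hidden, 2 * width), [0.0] * hidden
w_out, b_out = random_matrix(rng, nr_class, hidden), [0.0] * nr_class

summary = max_mean_pool(doc_tensor)  # length 2 * width
scores = sigmoid(dense(relu(dense(summary, w_hidden, b_hidden)), w_out, b_out))
print(scores)  # one independent value per label
```

Because the logistic function is applied to each output separately, the scores do not have to sum to 1, which is what allows several non-mutually exclusive labels to be active at once.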

Related

Using tensorflow classification for feature extraction

I am currently working on a system that extracts certain features from 3D objects (voxel grids, to be precise), and I would like to compare those features to automatically learned features in terms of classification performance in a TensorFlow CNN with some other data, but that is not the point here, just background.
My idea was to take a dataset (ModelNet10), train a TensorFlow CNN to classify it, and then use what it learned on my own dataset, not to classify, but to extract features.
So I want to throw away everything the CNN does, except for what it extracts from the objects.
Is there any way to get these features? And how do I do that? I have no idea where to start.
Yes, it is possible to use a trained model purely for feature extraction. This is a form of transfer learning: you can either train your own model and then extract the features, or extract features from a pre-trained model and use them in your task, provided your task is similar in nature to the one the pre-trained model was trained for. There is a lot of material online on these topics; the links below give details on how to go about it:
https://keras.io/api/applications/
https://keras.io/guides/transfer_learning/
https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/
https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
https://www.kaggle.com/angqx95/feature-extractor-fine-tuning-with-keras
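As a rough illustration of the idea, here is a toy pure-Python sketch (not Keras code). The `dense` helper and the hard-coded weights are made up and merely stand in for a classifier that has already been trained; the point is only that the feature extractor is the same network with the classification head removed.

```python
def dense(weights, biases, use_relu=True):
    """Return a layer function computing W x + b, optionally followed by ReLU."""
    def layer(inputs):
        outputs = [sum(w * x for w, x in zip(row, inputs)) + b
                   for row, b in zip(weights, biases)]
        return [max(0.0, v) for v in outputs] if use_relu else outputs
    return layer


def run(layers, inputs):
    """Feed the inputs through the given layers in order."""
    for layer in layers:
        inputs = layer(inputs)
    return inputs


# Hypothetical weights, standing in for what training on ModelNet10 would produce.
trained_classifier = [
    # Body: 3 inputs -> 2 learned features.
    dense([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    # Head: 2 features -> 3 class scores.
    dense([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0], use_relu=False),
]

# Drop the classification head and keep everything before it.
feature_extractor = trained_classifier[:-1]

sample = [1.0, 2.0, 3.0]
features = run(feature_extractor, sample)      # what the network "sees" in the sample
class_scores = run(trained_classifier, sample)  # what the full classifier outputs
print(features, class_scores)
```

In Keras the same thing is usually done by building a second model whose output is an intermediate layer of the trained one, as the transfer learning guide linked above describes.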

Fine-tune a BERT model for context specific embeddings

I'm trying to find information on how to train a BERT model, possibly from the Huggingface Transformers library, so that the embeddings it outputs are more closely related to the context of the text I'm using.
However, all the examples that I'm able to find are about fine-tuning the model for another task, such as classification.
Would anyone happen to have an example of fine-tuning a BERT model on masked-token or next-sentence prediction that outputs another raw BERT model fine-tuned to the context?
Thanks!
Here is an example from the Transformers library on fine-tuning a language model for masked token prediction.
The model used is one of the BERTForLM family. The idea is to create a dataset using TextDataset, which tokenizes the text and breaks it into chunks. Then use a DataCollatorForLanguageModeling to randomly mask tokens in the chunks during training, and pass the model, the data, and the collator to the Trainer to train and evaluate the results.
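To show what the random masking step amounts to, here is a small pure-Python sketch of BERT's masking recipe on plain string tokens. It is not the Transformers implementation: `mask_tokens` and its arguments are hypothetical names, it works on strings rather than token ids, and it ignores special tokens.

```python
import random

MASK = "[MASK]"
IGNORE = None  # Hugging Face's collator marks such positions with the label -100


def mask_tokens(tokens, vocab, rng, mlm_probability=0.15):
    """Return (inputs, labels) for one chunk, following BERT's masking recipe."""
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # not selected: no loss is computed at this position
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80% of selected tokens: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # Remaining 10%: leave the original token in place.
    return inputs, labels


rng = random.Random(0)
chunk = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(chunk, vocab=sorted(set(chunk)), rng=rng)
print(inputs)
print(labels)
```

Since the masking is redone every time a batch is drawn, the model sees different masked positions for the same chunk across epochs, and the fine-tuned weights can afterwards be loaded as an ordinary BERT model to produce embeddings.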

Training a Bert word embedding model in tensorflow

I have my own corpus of plain text. I want to train a BERT model in TensorFlow, similar to gensim's word2vec, to get the embedding vector for each word.
What I have found is that all the examples are related to downstream NLP tasks like classification. But I want to train a BERT model on my custom corpus, after which I can get the embedding vector for a given word.
Any leads would be helpful.
If you have access to the required hardware, you can dig into NVIDIA's training scripts for BERT using TensorFlow. The repo is here. From the Medium article:
BERT-large can be pre-trained in 3.3 days on four DGX-2H nodes (a total of 64 Volta GPUs).
If you don't have an enormous corpus, you will probably have better results fine-tuning an available model. If you would like to do so, you can look into huggingface's transformers.

Keras for finding sentence similarities from pre-trained word2vec

I have pre-trained word2vec vectors from gensim, and using gensim to find similarities between words works as expected. But I am having problems finding the similarity between two different sentences. Cosine similarity is not a good option for sentences and does not give good results. Soft cosine similarity in gensim gives slightly better results, but they are still not good.
I found WmdSimilarity in gensim. This is a bit better than soft cosine and cosine.
I am wondering whether there are more options, such as using deep learning with Keras and TensorFlow, to find sentence similarities from pre-trained word2vec. I know classification can be done using word embeddings, and this seems like a somewhat better option, but then I would need to find training data and label it from scratch.
So I am wondering whether there is any other option that uses pre-trained word2vec in Keras to get sentence similarities. Is there a way? I am open to any suggestions and advice.
Before reinventing the wheel, I'd suggest trying the doc2vec method from gensim; it works quite well and is easy to use.
To implement it in Keras, reusing the embeddings you have computed with gensim:
1. Store the word embeddings in a file, one word per line with the corresponding embedding. Alternatively, you can do as @Paul suggested: skip step 2 and reuse the layer in step 3.
2. Load the word embeddings into a Keras Embedding layer. You can check out this Keras tutorial for more details (see how the embedding_layer variable is initialized).
3. Then a sequence-to-sequence model can be used to compute the embedding of the text. It has an encoder that embeds the string and a decoder that converts the embedding back into a string. Here is a Keras tutorial that translates from English to French. You can use a similar process to transform your text into your text, and pick the internal embedding for your similarity metric.
You can also have a look at how the paragraph-to-vector model works; you can implement it in Keras as well, loading the word embedding weights that you have computed.
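The file-based part of this (storing one word per line and turning it back into a weight matrix) can be sketched in plain Python. The file format, the `load_embedding_matrix` helper, and the toy vectors below are assumptions for illustration, not gensim or Keras APIs.

```python
import io


def load_embedding_matrix(lines):
    """Parse 'word v1 v2 ...' lines into (word_index, matrix).

    Row 0 is left as zeros so that index 0 can be used for padding.
    """
    word_index, rows = {}, []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, vector = parts[0], [float(x) for x in parts[1:]]
        word_index[word] = len(rows) + 1  # indices start at 1
        rows.append(vector)
    dim = len(rows[0]) if rows else 0
    matrix = [[0.0] * dim] + rows
    return word_index, matrix


# Hypothetical toy file contents standing in for vectors exported from gensim.
toy_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")
word_index, matrix = load_embedding_matrix(toy_file)

# A sentence becomes a sequence of row indices into the matrix.
sentence_ids = [word_index[w] for w in "cat dog".split()]
print(sentence_ids, matrix)
```

A matrix laid out like this, with one row per word index, is what an embedding layer is typically initialized with; keeping the layer frozen preserves the word2vec vectors while the rest of the model trains.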

TensorFlow implementing Seq2seq Sentiment analysis

I'm currently playing with the TensorFlow seq2seq model, trying to implement sentiment analysis. My idea is to feed the encoder an IMDB comment, the decoder [Pad] or [Go], and the target [neg]/[pos]. Most of my code is quite similar to the seq2seq translation example. But the result I get is quite strange: for each batch, the results are either all [neg] or all [pos].
"encoder input : I was hooked almost immediately.[pad][pad][pad]"
"decoder input : [pad]"
"target : [pos]"
Since this result is so peculiar, I was wondering whether anyone knows what could lead to this?
I would recommend trying a simpler architecture: an RNN or CNN encoder that feeds into a logistic classifier. These architectures have shown very good results on sentiment analysis (Amazon reviews, Yelp reviews, etc.).
For examples of such models, see here: various encoders (LSTM or convolution) on words and characters.
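To give a feel for how small such a model is compared with a full seq2seq setup, here is a toy forward pass of a convolutional encoder followed by a logistic classifier, in plain Python. The function names, filters, and weights are invented for illustration and nothing here is trained.

```python
import math


def conv1d_max_pool(token_vectors, filters, window=2):
    """Slide each filter over windows of consecutive tokens; keep the max response."""
    features = []
    for filt in filters:
        responses = []
        for start in range(len(token_vectors) - window + 1):
            # Flatten the word vectors in this window into one list.
            flat = [x for vec in token_vectors[start:start + window] for x in vec]
            # ReLU of the dot product between the filter and the window.
            responses.append(max(0.0, sum(w * x for w, x in zip(filt, flat))))
        features.append(max(responses))  # max over all positions
    return features


def logistic(features, weights, bias):
    """Single sigmoid output: the predicted probability of the positive class."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Made-up 2-dimensional word vectors for a three-word review.
review = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
# Two made-up filters, each covering a window of 2 tokens (2 * 2 = 4 weights).
filters = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]

encoded = conv1d_max_pool(review, filters)
p_positive = logistic(encoded, weights=[0.7, -0.3], bias=-0.5)
print(encoded, p_positive)
```

The encoder turns a review of any length (at least `window` tokens) into a fixed-size vector, and the classifier outputs one probability, so there is no decoder and no [Pad]/[Go] input to get wrong.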
