So I'd like to use some of this training data in spaCy when I use the similarity() method.
I'd also like to maybe use the pre-trained vectors from that page.
But the spaCy docs seem lacking here; does anyone know how to do this?
Unfortunately the docs for this still aren't linked on the site! We're reworking the docs. But does this answer your question: https://spacy.io/tutorials/load-new-word-vectors
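In the meantime, here's a minimal sketch of adding your own vectors to the vocab and then calling similarity(). It assumes a word2vec-style text file with one word followed by its values per line; the file name is just an illustration.
import numpy as np
import spacy

nlp = spacy.blank("en")  # or start from an existing pipeline

# custom_vectors.txt is a placeholder: each line is "word v1 v2 ... vn"
with open("custom_vectors.txt") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        nlp.vocab.set_vector(word, np.asarray(values, dtype="float32"))

doc1 = nlp("machine learning")
doc2 = nlp("deep learning")
print(doc1.similarity(doc2))  # similarity is computed from the averaged token vectors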
I'm a complete beginner. I went through the official tutorial Transfer learning with a pretrained ConvNet, and I'd like to make predictions with the tutorial's trained model. Is the following code wrong? I'd also like to know how to get the “image” and the “class name” from the prediction. How should I do this? Please give me some advice.
predictions = model.predict(test_batches)[1]
print(predictions)
#[-0.11642772]
First of all, TensorFlow is horrible: incredibly error-prone and difficult to install and use. I would recommend the PyTorch tutorials instead. Second, the link you posted has a Colab button. Did you try clicking it? Colab makes setup easier because everything runs online rather than on your computer.
Also, transfer learning is NOT a beginner topic. Maybe try an easier set of notebooks if that works better for you: https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch/blob/master/README.md
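That said, if you do stay with the TensorFlow tutorial, here is a rough sketch of turning its output into a class prediction. It assumes the tutorial's binary cats-vs-dogs setup, where the final Dense(1) layer returns a single logit per image; the class order below is an assumption you should verify against your dataset.
import tensorflow as tf

logits = model.predict(test_batches)            # model and test_batches come from the tutorial
probs = tf.nn.sigmoid(logits)                   # convert logits to probabilities
pred_classes = tf.cast(probs > 0.5, tf.int32)   # threshold at 0.5

class_names = ["cat", "dog"]                    # assumed order; check your dataset
for p, c in zip(probs.numpy()[:5], pred_classes.numpy()[:5]):
    print(f"probability={p[0]:.3f} -> {class_names[int(c[0])]}")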
Does anyone know whether I can open a pickled sklearn model in R? Or whether I can save a trained sklearn model in a different format that can be opened and used in R? Specifically, I am working with a gradient boosting model. Thanks!
I don't recommend doing what you are doing; it's a lot of extra work that you don't need.
However, if you find yourself obliged to do it, I would save the model in a binary format. That is your best option.
This is possible for XGBoost; see the link here.
Read this answer on how to save an XGBoost model as a binary file: link
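For example, a minimal sketch in Python (this assumes you can retrain with xgboost rather than sklearn's GradientBoostingClassifier; the toy data and file name are placeholders):
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

clf = xgb.XGBClassifier(n_estimators=50)
clf.fit(X, y)
clf.get_booster().save_model("model.bin")  # binary file that R's xgboost package can read with xgb.load()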
You may want to take a look at the reticulate package; it allows you to call Python code from R.
https://rstudio.github.io/reticulate/
This question is for those who are familiar with OpenAI's GPT or GPT-2 models, in particular with the encoding step (Byte-Pair Encoding). This is my problem:
I would like to know how I could create my own vocab.bpe file.
I have a Spanish text corpus that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file.
I have reviewed the code in gpt-2/src/encoder.py but have not been able to find any hints. Any help or ideas?
Thank you so much in advance.
Check out here; you can easily create the same vocab.bpe with the following command:
python learn_bpe -o ./vocab.bpe -i dataset.txt --symbols 50000
I haven't worked with GPT-2, but bpemb is a very good place to start for subword embeddings. According to the README:
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.
I've used the pretrained embeddings for one of my projects along with sentencepiece and it turned out to be very useful.
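For example, a minimal sketch of using the pre-trained Spanish subword embeddings (the vocabulary size and dimension are just illustrative choices):
from bpemb import BPEmb

bpemb_es = BPEmb(lang="es", vs=50000, dim=100)
print(bpemb_es.encode("Hola mundo"))        # BPE subword tokens
print(bpemb_es.embed("Hola mundo").shape)   # one 100-dimensional vector per subword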
I'm very new to machine learning! My problem concerns a model created with LightGBM. I'm not the creator of this model, and I want to see the trees it generates. The model is saved in .joblib format, and I want to extract as much information from it as possible. In the LGBMClassifier documentation I can't find anything that solves my problem. With the code below, I only get the number of classes.
model = joblib.load("*.joblib.dat")
model.classes_
Output:
array([0, 1])
I want to know the number of rows, the split rules and, if possible, even a plot of the tree. Thank you all!
When you can't find something in the LightGBM documentation, look for it in the XGBoost documentation; the LightGBM documentation is missing a lot of features that the framework actually has.
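That said, a LightGBM model exposes quite a lot of its internals directly. A rough sketch, assuming the .joblib file holds an LGBMClassifier (the file name is a placeholder):
import joblib
import lightgbm as lgb

model = joblib.load("model.joblib.dat")
booster = model.booster_                        # the underlying Booster object

print(booster.num_trees())                      # number of trees in the ensemble
print(model.n_features_)                        # number of input features

tree_info = booster.dump_model()["tree_info"]   # split features, thresholds, leaf values
print(tree_info[0])                             # structure of the first tree

lgb.plot_tree(booster, tree_index=0)            # plot the first tree (needs matplotlib and graphviz)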
I am an NLP novice trying to learn, and would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.
I understand the basic concept behind it, but I suspect I am missing some details.
From the documentation, it is not clear to me for example how much preprocessing is done on the text and annotation data; and what statistical model is used.
Do you know if:
For the model to work, does the text have to go through chunking before training? Otherwise it wouldn't be able to do anything useful, right?
Are the text and annotations typically normalized prior to training, so that a named entity at the beginning or in the middle of a sentence can still be recognized?
Specifically, how is this implemented in spaCy? Is it an HMM, a CRF or something else that is used to build the model?
Apologies if this is all trivial, I am having some trouble finding easy to read documentation on NER implementations.
At https://spacy.io/models/en#en_core_web_md they say "English multi-task CNN trained on OntoNotes", so I imagine that's how they obtain the named entities. You can see that the pipeline is
tagger, parser, ner
and read more here: https://spacy.io/usage/processing-pipelines. I would try removing the different components and seeing what happens; that way you can see what depends on what. I'm pretty sure NER depends on the tagger, but I'm not sure whether it requires the parser. All of them, of course, require the tokenizer.
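For instance, a quick sketch of poking at the pipeline yourself (assuming en_core_web_md is installed):
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)                       # e.g. ['tagger', 'parser', 'ner']

text = "Apple is looking at buying a U.K. startup for $1 billion."
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# temporarily disable components and see whether NER still finds entities
with nlp.disable_pipes("tagger", "parser"):
    print([(ent.text, ent.label_) for ent in nlp(text).ents])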
I don't understand your second point. If an entity is at the beginning or in the middle of a sentence, that's fine; the NER system should be able to catch it. I don't see how the word "normalize" applies to an entity's position in the text.
Regarding the model, they mention a multi-task CNN, so I guess the CNN is the model for NER. Sometimes people use a CRF on top, but they don't mention it, so it's probably just the CNN. According to their performance figures, it's good enough.