How does word2vec parse a text file? - python

When I use word2vec.word2vec(train="corpus.txt"), how does it parse words out of the file?
Could somebody give me an example or related resources? Thanks in advance.

There are several resources on how to do this. One possible way to use the word2vec technique with gensim is described here or on git.
The main reason to use word2vec is that it lets you handle words as vectors, which is very convenient for computation.
Suppose you have a text containing many words. If you build your vectors from only those words, their positions in the multi-dimensional space will be poorly estimated and you will get misleading results later. If you instead use vectors from pre-trained word2vec models (from Google, etc.), you will get a better distribution of words in the space.
Once you have a model, you can easily compute similarities and so on to extract meaning from the text. That part is the application logic and will depend on your intention.
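To make the parsing question concrete: if you use gensim, a minimal sketch looks like the following (gensim 4.x names; in older versions the vector_size parameter was called size). gensim's LineSentence reader treats each line of the file as one sentence and splits it into tokens on whitespace; the original C word2vec tool also splits on whitespace, so there is no smarter parsing unless you tokenize the text yourself first.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# each line of corpus.txt = one sentence; tokens are split on whitespace
sentences = LineSentence('corpus.txt')
model = Word2Vec(sentences, vector_size=100, min_count=5)

# 'word' is a placeholder; use any token that occurs in your corpus
print(model.wv.most_similar('word'))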

Related

Huggingface NER with custom data

I have CSV data as below.
**token**   **label**
0.45"       length
1-12        size
2.6"        length
8-9-78      size
6mm         length
Whenever I get the text as below
6mm 8-9-78 silver head
I should be able to say length = 6mm and size = 8-9-78. I'm new to the NLP world and am trying to solve this using Huggingface NER. I have gone through various articles, but I'm not getting how to train with my own data. Which model/tokeniser should I make use of? Or should I build my own? Any help would be appreciated.
I would maybe look at spaCy's pattern matching + NER to start. The pattern-matching rules spaCy provides are really powerful, especially when combined with its statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else like Huggingface, etc.
If you are willing to pay, you can also leverage Prodigy, which provides a nice UI with human-in-the-loop interactions.
Adding REGEX entities to SpaCy's Matcher
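To illustrate the pattern-matching idea on the sample rows above, here is a minimal plain-Python sketch using regular expressions (the two patterns are assumptions inferred from the five example tokens, not a tested rule set; spaCy's Matcher supports REGEX token patterns in the same spirit):
import re

# hypothetical patterns: lengths look like 6mm, 0.45" or 2.6"; sizes like 1-12 or 8-9-78
PATTERNS = {
    'length': re.compile(r'^\d+(\.\d+)?(mm|")$'),
    'size': re.compile(r'^\d+(-\d+)+$'),
}

def tag(text):
    found = {}
    for token in text.split():
        for label, pattern in PATTERNS.items():
            if pattern.match(token):
                found[label] = token
    return found

print(tag('6mm 8-9-78 silver head'))  # {'length': '6mm', 'size': '8-9-78'}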
I had two options: one was spaCy (as suggested by #scarpacci) and the other was Spark NLP. I opted for Spark NLP and found a solution. I formatted the data in CoNLL format and trained using Spark NLP's NerDLApproach with GloVe word embeddings.

Documents in training data belong to a particular topic in LDA

I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all the related questions asked earlier.
So if anyone has any suggestions on how to do this, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has suggestions for algorithms other than LDA, please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in similar contexts end up in close proximity in the vector space. You can use pre-trained word embeddings; learn more about word2vec here. Doc2vec extends this to map each document to a vector of dimension N. After this, you can use any distance measure to find the most similar documents to an input document (see the sketch after this list).
Word Mover's Distance - this is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find out more about it here.
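To make the Doc2vec option concrete, here is a minimal gensim sketch (gensim 4.x API, where document vectors live in model.dv; the three toy comments are a stand-in for your 10,000 documents):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ['the screen is too dim',
        'display brightness is very low',
        'battery drains too fast']
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# map a new comment into the same space and rank the training docs by similarity
new_vec = model.infer_vector('my screen is very dark'.split())
for doc_id, score in model.dv.most_similar([new_vec], topn=2):
    print(round(score, 3), docs[doc_id])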
Also, make sure you apply appropriate text cleaning before applying the algorithms: steps like case normalization, stopword removal, and punctuation removal. What is appropriate really depends on your dataset. Find out more here.
I hope this was helpful...

How to get similar words from a custom input dictionary of word to vectors in gensim

I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word-embedding model) and average them to get the document vector. I end up with a dictionary (say, my_dict) that maps each document in my collection to its vector.
I want to feed this dictionary to gensim so that, for each document, I can get the other documents in my_dict that are closest to it. How can I do that?
You might want to consider rephrasing your question (from the title, you are looking for word similarity; from the description, I gather you want document similarity) and adding a little more detail to the description. Without more information about what you want and what you have tried, it is difficult to help, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without knowing what you want gensim to do. gensim is quite powerful and offers lots of different functionality.
Assuming your dictionary is already in gensim format, you can load it like this:
from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')
There - now you can use it with gensim, and run analyses and models to your heart's desire. For similarities between words you can play around with pre-made functions such as model.wv.most_similar('word_one') or model.wv.similarity('word_one', 'word_two').
For document similarity with a trained LDA model, see this stackoverflow question.
For a more detailed explanation, see this gensim tutorial, which uses cosine similarity as a measure of similarity between documents.
gensim has a bunch of pre-made functionality that does not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim; I would recommend looking at the documentation and examples.
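That said, for the exact setup in the question (a plain dict mapping each document to an averaged vector), you do not strictly need gensim at all. A minimal numpy sketch of a cosine-similarity lookup, with my_dict as a tiny made-up stand-in:
import numpy as np

my_dict = {
    'doc_a': np.array([0.10, 0.30, 0.50]),
    'doc_b': np.array([0.12, 0.28, 0.55]),
    'doc_c': np.array([0.90, 0.10, 0.05]),
}

names = list(my_dict)
mat = np.vstack([my_dict[n] for n in names])
mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-length rows

def most_similar(doc, topn=2):
    sims = mat @ mat[names.index(doc)]  # cosine similarity to every document
    ranked = sorted(zip(names, sims), key=lambda pair: -pair[1])
    return [(n, float(s)) for n, s in ranked if n != doc][:topn]

print(most_similar('doc_a'))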
Also, to avoid a bunch of pitfalls: is there a specific reason to average the vectors yourself (or even to average them at all)? You do not need to do this (gensim has more sophisticated methods that map documents to vectors for you, like models.doc2vec), and by averaging you might lose valuable information.

String similarity (semantic meaning) in python

How can I calculate the string similarity (semantic meaning) between 2 strings?
For example, if I have 2 strings like "Display" and "Screen", the string similarity must be close to 100%.
If I have "Display" and "Color", the string similarity must be close to 0%.
I'm writing my script in Python... My question is whether there exists some library or framework to do this kind of thing... Alternatively, can someone suggest a good approach?
Based on your examples, I think you are looking for semantic similarity. You can do this, for instance, by using WordNet, but you will have to specify, for instance, that you are working with nouns, and possibly iterate over the different meanings of each word. The link shows two examples that calculate the similarity according to various implementations.
Most implementations are, however, computationally expensive: they make use of a large amount of text to calculate how often two words appear close to each other, etc.
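For instance, a minimal NLTK/WordNet sketch (it naively takes the first noun synset of each word, which is exactly the "iterate over the different meanings" caveat above, and it assumes you have run nltk.download('wordnet') once):
from nltk.corpus import wordnet as wn

display = wn.synsets('display', pos=wn.NOUN)[0]
screen = wn.synsets('screen', pos=wn.NOUN)[0]
color = wn.synsets('color', pos=wn.NOUN)[0]

# Wu-Palmer similarity: a score in (0, 1] based on the WordNet taxonomy
print(display.wup_similarity(screen))
print(display.wup_similarity(color))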
What you're looking to solve is an NLP problem, which, if you're not familiar with the field, can be a hassle. The most popular library out there is NLTK, which has a lot of AI tools. A quick google of what you're looking for yields a chapter on the logic of semantics: http://www.nltk.org/book/ch10.html
This is a computationally heavy process, since it involves loading a dictionary of the entire English language. If you have a small subset of examples, you might be better off creating a mapping yourself.
I am not good at NLP, but I think the Levenshtein distance algorithm can help you solve this problem, because I use this algorithm to calculate the similarity between two strings and the performance is not bad. (Note that it measures spelling similarity rather than semantic similarity.)
The following is my C++ code; click the link below. Maybe you can translate the code to Python; I will post the Python code later.
If you understand dynamic programming, I think you can understand it.
enter link description here
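In the meantime, here is a minimal dynamic-programming version in Python (keep in mind that edit distance measures spelling overlap, not meaning, so it will not rate "Display" and "Screen" as similar):
def levenshtein(a, b):
    # prev[j] = edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

print(levenshtein('Display', 'Screen'))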
Have a look at the following libraries:
https://simlibrary.wordpress.com/
http://sourceforge.net/projects/semantics/
http://radimrehurek.com/gensim/
Check out word2vec as implemented in the Gensim library. One of its features is to compute word similarity.
https://radimrehurek.com/gensim/models/word2vec.html
More details and demos can be found here.
I believe this is the state of the art right now.
As another user suggested, the Gensim library can do this using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download word vectors pre-trained on Google News, and load them up:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. From your examples:
word_vectors.similarity('display', 'color')
>>> 0.3068566
word_vectors.similarity('display', 'screen')
>>> 0.32314363
Compare the resulting numbers and you will see that the words display and screen are more similar than display and color are.

Is stemming used when gensim creates a dictionary for tf-idf model?

I am using the Gensim Python toolkit to build tf-idf models for documents, so I need to create a dictionary for all the documents first. However, I found that Gensim does not apply stemming before creating the dictionary and corpus. Am I right?
You are correct. Gensim doesn't do anything special other than convert what you give it into different models.
Here is the relevant quote and the link that it is from:
The ways to process documents are so varied and application- and
language-dependent that I decided to not constrain them by any
interface. Instead, a document is represented by the features
extracted from it, not by its “surface” string form: how you get to
the features is up to you.
From Strings to Vectors
I was also struggling with the same case. To overcome it, I first stemmed the documents using NLTK and then processed them with gensim. This is probably an easy and handy way to perform your task.
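For example, a minimal sketch of that pipeline, with two toy documents standing in for a real corpus (NLTK's PorterStemmer, then the usual gensim dictionary/corpus/tf-idf steps):
from nltk.stem import PorterStemmer
from gensim import corpora, models

docs = ['Human machine interfaces', 'A survey of user opinions about interfaces']
stemmer = PorterStemmer()
texts = [[stemmer.stem(word) for word in doc.lower().split()] for doc in docs]

dictionary = corpora.Dictionary(texts)            # built from stemmed tokens
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

for doc in tfidf[corpus]:
    print(doc)                                    # (token_id, tf-idf weight) pairs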
