I have a dataframe text column (in French) and I want to split each text into sentences by their meaning (break the text down into units of sense). Any idea how to do this with Python libraries and NLP techniques?
P.S. I tried NLTK's sent_tokenize and word_tokenize, but the splits don't respect the meaning well.
For example:
“text discussing sports, then the economy, then school systems”
=> I want to break the text down into sentences like this:
sport-related text
economy-related text
school-system-related text
Or at least extract tags from the whole text, so for this example I'd get the following tags:
sports / economy / school.
If I can achieve either of these two cases, that would be great.
Unfortunately, to my knowledge there is no straightforward answer to this.
However, there are some workarounds. What I suggest is to apply a sentence-transformer model to the list of phrases you have, something like this:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentences = ['This framework generates embeddings for each input sentence',
             'This is another phrase']

# Sentences are encoded by calling model.encode() with a list of strings
embeddings = model.encode(sentences)
The result is a list of dense vectors, one for each sentence.
There is a list of different sentence-transformer models at https://huggingface.co/sentence-transformers; they differ in supported languages, number of parameters and performance.
After that, I would define a list of words to be used as tags. These may be chosen arbitrarily, or taken from the corpus itself based on the top N most frequent words (some pre-processing of the text might be necessary in that case).
With the NLTK library it could look like this:
import nltk

words = nltk.tokenize.word_tokenize(corpus)
stopwords = nltk.corpus.stopwords.words('english')  # remove stopwords

common_words_without_stopwords = nltk.FreqDist(w.lower() for w in words if w.lower() not in stopwords)

# top 10 most common words, which you can use as tags
mostCommon = [w for w, count in common_words_without_stopwords.most_common(10)]
Then I would encode those tags into vectors as well, and loop over the phrases to identify the top N tags closest to each phrase.
As for the distance itself, there are different ways to compare two dense vectors, e.g. Euclidean distance, cosine similarity, Lp norms... It's a matter of choice.
For example, with Euclidean distance:
import numpy as np

# calculating Euclidean distance between the two sentence embeddings using linalg.norm()
dist = np.linalg.norm(embeddings[0] - embeddings[1])
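Putting the pieces together, a minimal sketch might encode both the sentences and the candidate tags and pick the closest tag per sentence via cosine similarity. The multilingual model, the French example sentences and the tag list below are illustrative assumptions, not fixed choices:

from sentence_transformers import SentenceTransformer, util

# a multilingual model may suit French text better than an English-only one
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = ["Le PSG a remporté le match hier soir.",
             "L'inflation pèse sur le budget des ménages.",
             "La réforme du lycée change les programmes scolaires."]
candidate_tags = ["sport", "économie", "école"]  # arbitrary example tags

sent_emb = model.encode(sentences, convert_to_tensor=True)
tag_emb = model.encode(candidate_tags, convert_to_tensor=True)

# cosine similarity matrix: one row per sentence, one column per tag
cos_scores = util.cos_sim(sent_emb, tag_emb)

for sentence, scores in zip(sentences, cos_scores):
    best_tag = candidate_tags[int(scores.argmax())]
    print(f"{sentence} -> {best_tag}")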
I am working on a problem where I need to find exact or similar sentences in two or more documents. I read a lot about cosine similarity and how it can be used to detect similar text.
Here is the code that I tried:
my_file = open("test.txt", "r")
content = my_file.read()
content_list = content.split(".")
my_file.close()
print("test:"content_list)
my_file = open("original.txt", "r")
og = my_file.read()
print("og:"og)
Output
test:['As machines become increasingly capable', ' tasks considered to require "intelligence" are often removed from the definition of AI,']
og:AI applications include advanced web search engines (e.g., Google), recommendation
systems (used by YouTube, Amazon and Netflix), understanding human
speech (such as Siri and Alexa), self-driving cars (e.g., Tesla),
automated decision-making and competing at the highest level in
strategic game systems (such as chess and Go).[2][citation needed] As
machines become increasingly capable, tasks considered to require
"intelligence" are often removed from the definition of AI, a
phenomenon known as the AI effect.[3] For instance, optical character
recognition is frequently excluded from things considered to be AI,[4]
having become a routine technology.
But when I compute the cosine similarity using this code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_cosine_similarity(text1, text2):
    # stores text in a list
    list_text = [text1, text2]
    # converts text into vectors with TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english')
    vectorizer.fit_transform(list_text)
    tfidf_text1 = vectorizer.transform([list_text[0]])
    tfidf_text2 = vectorizer.transform([list_text[1]])
    # computes the cosine similarity
    cs_score = cosine_similarity(tfidf_text1, tfidf_text2)
    return np.round(cs_score[0][0], 2)

for i in content_list:
    cosine_similarity12 = compute_cosine_similarity(i, og)
    print('The cosine similarity of sentence 1 and 2 is {}.'.format(cosine_similarity12))
the output I am getting is:
The cosine similarity of sentence and og is 0.14.
The cosine similarity of sentence and og is 0.4.
I tried splitting the test text on '.' and then comparing each sentence with the original document, but the cosine similarity results are not what I expected. I need to know what I am doing wrong and how I can get similar sentences from the original document for plagiarism checking; the condition is that I want to point out similar (or exact) sentences from the original document.
I even thought of comparing each line of the two documents (test, og), but that would really increase the complexity.
I am worried because cosine similarity isn't giving a good score even when I use exact sentences taken from the big paragraph. I really need help with this and would like to know what I am doing wrong.
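One way to sanity-check the sentence-by-sentence idea mentioned above (a sketch only, not a verified fix; the helper name and threshold are mine) is to split both documents into sentences and score each test sentence against each original sentence, keeping the best match:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_matches(test_sentences, original_sentences, threshold=0.8):
    """Return (test_sentence, best_original_sentence, score) for scores above threshold."""
    # fit one vectorizer on all sentences so both sides share a vocabulary
    vectorizer = TfidfVectorizer(stop_words='english')
    all_vecs = vectorizer.fit_transform(test_sentences + original_sentences)
    test_vecs = all_vecs[:len(test_sentences)]
    orig_vecs = all_vecs[len(test_sentences):]

    scores = cosine_similarity(test_vecs, orig_vecs)  # rows: test sentences, cols: original sentences
    results = []
    for i, row in enumerate(scores):
        j = row.argmax()
        if row[j] >= threshold:
            results.append((test_sentences[i], original_sentences[j], round(float(row[j]), 2)))
    return results

# usage with the variables from the question
matches = best_matches(content_list, og.split("."))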
I'm trying to classify a dataset of files. In this dataset I have a column of texts and a column of labels. I want to build a model based on a Wikipedia corpus, but I'm a little lost in the middle.
What I did so far...
I preprocessed my Text column (removing stopwords, whitespace, deaccenting, etc.) and saved it to a new csv file. Then I tagged the documents using the gensim library:
import gensim
import smart_open

def apply_preprocessing(fname, tokens_only=False):
    with smart_open.open(fname) as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

preprocessed_dataset = '/content/drive/MyDrive/dataset/preprocessed_dataset.csv'
preprocessed_dataset_train = list(apply_preprocessing(preprocessed_dataset))
This gives me, for each document in my preprocessed_dataset, an array of the words contained in its Text column.
I know that with this loop I can get each document's array of words:
for doc_id in range(len(preprocessed_dataset_train)):
    preprocessed_dataset_train[doc_id].words
My goal is to take those words and ask: based on these pretrained Wikipedia embeddings (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), how similar is one doc to another, given what was learned from the Wikipedia corpus?
How do I use these pretrained Wikipedia vectors? It is already a file of word vectors, right? If so, how can I use it to analyse my preprocessed_dataset_train?
What is the next step I should take/understand to reach my goal?
I'm sorry for so many questions; whenever I think I'm understanding the road, I get lost again.
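As a rough sketch of one possible next step (not an answer given in the thread): load the pretrained vectors with gensim, average each document's word vectors, and compare documents by cosine similarity. The file name below is a placeholder for whichever wikipedia2vec .txt download you picked; if the file lacks the usual header line, gensim >= 4.0 accepts a no_header=True option.

import numpy as np
from gensim.models import KeyedVectors

# placeholder path; use the pretrained text file you downloaded
wv = KeyedVectors.load_word2vec_format('enwiki_20180420_300d.txt', binary=False)

def doc_vector(tokens):
    """Average the pretrained vectors of the tokens that are in the vocabulary."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def doc_similarity(tokens_a, tokens_b):
    a, b = doc_vector(tokens_a), doc_vector(tokens_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# usage with the tagged documents from the question
sim = doc_similarity(preprocessed_dataset_train[0].words,
                     preprocessed_dataset_train[1].words)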
I wrote the code below. I used spaCy to restrict the words in the tweets to content words, i.e. nouns, verbs and adjectives, transformed the words to lower case and appended the POS tag with an underscore. E.g.:
love_VERB old-fashioneds_NOUN
Now I want to train 4 more Word2Vec models and average the resulting embedding matrices, but I don't have any idea how to do it. Can you help me please?
# Tokenization of each document
from gensim.models.word2vec import FAST_VERSION
from gensim.models import Word2Vec
import spacy
import pandas as pd
from zipfile import ZipFile
import wget

url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

# nrows: max number of rows to read
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.values.tolist()

nlp = spacy.load('en_core_web_sm')  # you can use other models

# POS tags to keep
included_tags = {"NOUN", "VERB", "ADJ"}

sentences = documents[:103]  # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text.lower() + '_' + token.pos_)
    new_sentences.append(new_sentence)

vocab = new_sentences  # the training corpus: lists of lower-cased word_POS tokens
# initialize model
w2v_model = Word2Vec(
    size=100,
    window=15,
    sample=0.0001,
    iter=200,
    negative=5,
    min_count=1,
    workers=4,  # a negative worker count disables training; use a positive number
    hs=0,
)

w2v_model.build_vocab(vocab)
w2v_model.train(vocab,
                total_examples=w2v_model.corpus_count,
                epochs=w2v_model.epochs)

w2v_model.wv['car_NOUN']
There's no reason to average together vectors from multiple training runs; it is more likely to destroy any value from the individual runs than provide any benefit.
No one run creates the 'right' final positions, nor do they all approach some idealized positions. Rather, each just creates a set-of-vectors that is internally comparable to others in that same co-trained set. Comparisons or combinations with vectors from other, non-interleaved training runs are usually going to be nonsense.
Instead, aim for one adequate run. If vectors move around a lot, in repeated runs, that's normal. But each reconfiguration should be about as useful, if used for word-to-word comparisons, or analysis of word neighborhoods/directions, or as input to downstream algorithms. If they vary wildly in usefulness, there are likely other inadequacies in the data or model parameters. (For example: too little data – word2vec requires lots to give meaningful results – or a model that's too large for the data – making it prone to overfitting.)
Other observations about your setup:
Just 103 sentences/texts is tiny for word2vec; you shouldn't expect the vectors from such a run to have any of the value that the algorithm would usually provide. (Running such a tiny dataset might be helpful for verifying there are no halting errors in the process, or for familiarizing yourself with the steps/APIs, but the results will tell you nothing.)
min_count=1 is almost always a bad idea in word2vec and similar algorithms. Words that only appear once (or a few times) don't have the variety of subtly-different uses that are needed to train it into a balanced position against other words – so they wind up with weak/strange final positions, and the sheer number of such words dilutes the training effectiveness for other more-frequent words. The common practice of discarding rare words usually gets better results.
iter=200 is an extreme choice which is typically only valuable to try to squeeze results out of inadequate data. (In such a case, you might also have to reduce the vector-size from normal 100-plus dimensions.) So if you seem to need that, getting more data should be a top priority. (Using 20x more data is far, far better than using 20x more iterations on smaller data – but involves the same amount of training time.)
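For illustration, a single "adequate" run under those constraints might look like the sketch below; the parameter values are only indicative (not prescribed by this answer) and assume a much larger new_sentences corpus than the 103-document slice:

from gensim.models import Word2Vec

# new_sentences should be the full preprocessed corpus (many thousands of texts),
# not a small slice
w2v_model = Word2Vec(
    sentences=new_sentences,
    size=100,        # vector_size in gensim >= 4.0
    window=15,
    sample=0.0001,
    negative=5,
    min_count=5,     # discard very rare words instead of keeping min_count=1
    iter=5,          # epochs in gensim >= 4.0; rely on more data, not more passes
    workers=4,
)

print(w2v_model.wv.most_similar('car_NOUN', topn=5))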
I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors via the gensim library, but my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar GoogleNews words. First I load GoogleNews and check whether the word "to" is in it:
# Load GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])
I receive an error: KeyError: "word 'to' not in vocabulary"
Is it possible that such a large dataset doesn't have this word? The same happens for some other common words like "a"!
To add the missing words to the word2vec model, first I want to get the indices of the words that are in GoogleNews; for missing words I have used index 0.
# obtain index of words
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict({w: model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
Then I calculate the mean of the embedding vectors of the most similar words to each missing word:
missing_embd = {}
for key, value in word_to_idx.items():
    if value == 0:
        similar_words = model.wv.most_similar(key)
        similar_embeddings = [model.wv[a[0]] for a in similar_words]
        missing_embd[key] = mean(similar_embeddings)
And then I add these new embeddings to the word2vec model with:
for word, embd in missing_embd.items():
    # model.wv.build_vocab(word, update=True)
    model.wv.syn0[model.wv.vocab[word].index] = embd
There is an inconsistency: when I print missing_embd, it's empty, as if there were no missing words.
But when I check with this:
for w in tokens_lower:
    if (w in model.wv.vocab) == False:
        print(w)
print("***********")
I find a lot of missing words.
Now, I have 3 questions:
1- Why is missing_embd empty while there are missing words?
2- Is it possible that GoogleNews doesn't have words like "to"?
3- How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.
Here is a scenario where we are adding a missing lower case word.
from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)
'Quoran' in embedding.vocab
Output : True
'quoran' in embedding.vocab
Output : False
Here 'Quoran' is present, but 'quoran' in lower case is missing.
# add quoran in lower case
embedding.add('quoran', embedding.get_vector('Quoran'), replace=False)
'quoran' in embedding.vocab
Output : True
It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupted, and checking other words after load shows they are present, it'd be reasonable to assume Google discarded the overly common words as having such diffuse meaning as to be of low value.
It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.
(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
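If the intent was "index when the word is known, a sentinel when it is missing", a single comprehension (shown here only as a hypothetical reworking, not code from the question) would avoid the overwrite:

from collections import OrderedDict

# None marks a word missing from the pretrained vocabulary;
# using 0 would collide with whatever word legitimately sits at index 0
word_to_idx = OrderedDict(
    (w, model.wv.vocab[w].index if w in model.wv.vocab else None)
    for w in corpus_words
)
missing_words = [w for w, idx in word_to_idx.items() if idx is None]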
You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for words that are already known; it will raise an error if tried on a completely unknown word, so that approach can't work.
And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
I'm trying to solve a sentence-comparison problem using the naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it completely: removing stop words, tokenizing and lemmatizing. I use a pre-trained model for word vectors, which returns adequate results when finding similarities for some test words. I also tried summing up the sentence words to find similarities in the original model; the matches do make sense and roughly capture the general sense of the phrase.
For sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build vocab out of names (or person id's), no training
# first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index] * 0

# populate vocab from an array
b.build_vocab([personIds], update=True)
Sum each sentence's word vectors and store the result in the model under the corresponding id
# sentences are pulled from the pandas dataframe df; 'a' is the pre-trained model used to get vectors for each word
def summ(phrase, start_model):
    '''
    vector addition function
    '''
    # starting with a vector of 0's
    sum_vec = start_model.word_vec("cat_NOUN") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personId = row["ID"]
        summVec = summ(df.iloc[i, 1], a)
        # updating syn0 for each name/id in vocabulary
        b.wv.syn0[b.wv.vocab[personId].index] = summVec
    except:
        pass
I understand that I shouldn't expect much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever, and the similarity lookup also fails to find matches (a similarity coefficient below 0.2 for basically everything).
Wondering if anyone has an idea of where I went wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be in their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property - those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures.
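A minimal sketch of that last suggestion, keeping the summed vectors in a plain dict outside any gensim model (df, 'a' and the column layout follow the question above; the helper function is my own):

import numpy as np

# personId -> summed sentence vector, kept in a plain dict outside any gensim model
person_vectors = {}

for i, row in df.iterrows():
    tokens = df.iloc[i, 1]                              # the phrase column, as in the question
    vecs = [a.word_vec(w) for w in tokens if w in a]    # 'a' is the pre-trained vector set from the question
    if vecs:
        person_vectors[row["ID"]] = np.sum(vecs, axis=0)

def most_similar_people(person_id, top_n=5):
    """Rank other ids by cosine similarity of their summed vectors."""
    query = person_vectors[person_id]
    scores = []
    for other_id, vec in person_vectors.items():
        if other_id == person_id:
            continue
        denom = np.linalg.norm(query) * np.linalg.norm(vec)
        if denom:
            scores.append((other_id, float(query @ vec / denom)))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]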