I'm trying to classify a dataset of files. The dataset has a column of texts and a column of labels. I want to build a model based on a Wikipedia corpus, but I'm a little lost in the middle.
What I did so far...
I preprocessed my Text column (removing stopwords and extra whitespace, de-accenting, etc.) and saved the result to a new CSV file. Then I tagged it using the gensim library:
import gensim
import smart_open

def apply_preprocessing(fname, tokens_only=False):
    with smart_open.open(fname) as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # tag each document (line) with its line number
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])
preprocessed_dataset = '/content/drive/MyDrive/dataset/preprocessed_dataset.csv'
preprocessed_dataset_train = list(apply_preprocessing(preprocessed_dataset))
This gives me an array of arrays of words, one for each document I have in my preprocessed_dataset.
I know that with this loop I can get each array of words:
for doc_id in range(len(preprocessed_dataset_train)):
    preprocessed_dataset_train[doc_id].words
My goal is to take those words and ask: based on these pretrained Wikipedia embeddings (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), how similar is one document to another, given what was learned from the Wikipedia corpus?
How do I use these pretrained Wikipedia vectors? The download is already a file of word vectors, right? If so, how can I use it to analyse my preprocessed_dataset_train?
What's the next step I should take (or concept I should understand) to get to my goal?
I'm sorry for so many questions; every time I think I understand the road, I get lost again.
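For context, here is roughly what I imagine the next step to be, though I'm not sure it's right. This is only a sketch: it assumes the wikipedia2vec .txt download is in the standard word2vec text format that gensim's KeyedVectors can read, and the file name is just an example.
import numpy as np
from gensim.models import KeyedVectors

# hypothetical file name taken from the wikipedia2vec pretrained page
wiki_vectors = KeyedVectors.load_word2vec_format('enwiki_20180420_300d.txt', binary=False)

def doc_vector(words, kv):
    # average the pretrained vectors of the words the model knows
    known = [kv[w] for w in words if w in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

doc_vecs = [doc_vector(doc.words, wiki_vectors) for doc in preprocessed_dataset_train]
print(cosine(doc_vecs[0], doc_vecs[1]))  # how similar are doc 0 and doc 1?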
I have a dataframe text column (in French) and I want to split each text into sentences by their meaning (break the text down into units of sense). Any idea how to do it with Python libraries and NLP techniques?
P.S. I tried NLTK's sent_tokenize and word_tokenize, but the splits don't respect the meaning.
For example:
“a text discussing sports, then the economy, then school systems”
=> I want to break down the text into sentences like this:
sports-related text
economy-related text
school-system-related text
Or at least extract tags from the whole text, so for this example I'd get the following tags:
sports/economy/school.
If I could achieve either of these two cases, that would be great.
Unfortunately, to my knowledge there is no straightforward answer to that.
However, there are some workarounds. What I suggest is to apply a sentence-transformer model to the list of phrases you have, something like this:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Sentences we want to encode. Example:
sentence1 = ['This framework generates embeddings for each input sentence']
sentence2 = ['This is another phrase']

# Sentences are encoded by calling model.encode() on a single list
embeddings = model.encode(sentence1 + sentence2)
The result is a list of dense vectors, one for each input sentence.
There is a list of different sentence-transformers models at https://huggingface.co/sentence-transformers; they differ in target language, number of parameters, and performance.
After that, I would define a list of words to be used as tags. These may be chosen arbitrarily, or taken from the corpus itself based on the top N most frequent words (some pre-processing of the text might be necessary in that case).
With the NLTK library it would look something like this:
import nltk

words = nltk.tokenize.word_tokenize(corpus)
stopwords = nltk.corpus.stopwords.words('english')  # remove stopwords
common_words_without_stopwords = nltk.FreqDist(w.lower() for w in words if w.lower() not in stopwords)
most_common = [w for w, _ in common_words_without_stopwords.most_common(10)]  # top 10 most common words, usable as tags
Then I would convert those tags into vectors as well, and iterate over the phrases in a loop that identifies the top N tags closest to each single phrase.
For the latter, there are different ways to compute the distance between two dense vectors, e.g. Euclidean distance, cosine similarity, or another norm; it's a matter of choice.
For example, with Euclidean distance:
import numpy as np

# calculating Euclidean distance using linalg.norm()
dist = np.linalg.norm(embeddings[0] - embeddings[1])
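Putting it all together, the tag-matching loop could look roughly like this (a sketch only; the tags, sentences, and the choice of cosine similarity are arbitrary examples):
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

tags = ['sports', 'economy', 'school']                 # arbitrary example tags
sentences = ['The match ended two to one',
             'Inflation rose again this year']

tag_vecs = model.encode(tags)
sent_vecs = model.encode(sentences)

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

for sentence, s_vec in zip(sentences, sent_vecs):
    # rank the tags by similarity to the sentence and keep the best one
    scores = [(tag, cosine(s_vec, t_vec)) for tag, t_vec in zip(tags, tag_vecs)]
    best_tag, best_score = max(scores, key=lambda s: s[1])
    print(sentence, '->', best_tag, round(best_score, 3))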
I'm starting to get familiar with Word2Vec, but I'm struggling with a problem and couldn't find anything similar...
I want to use gensim's Word2Vec on an imported PDF document (a book). For the import I used PyPDF2 and stored the whole book in a list. I then used gensim's simple_preprocess to preprocess the data. That worked so far; I got the following output:
text=['schottky','diode','semiconductors',...]
So then I tried to use Word2Vec:
from gensim.models import Word2Vec
model=Word2Vec(text, size=100, window=5, min_count=5, workers=4)
words=list(model.wv.vocab)
but the output was like this:
print(words)
['c','h','t','k','d',...]
I expected the same words as in the text list, not just single characters. When I tried to find relations between words (e.g. 'schottky' and 'diode'), I got an error message saying that none of these words is in the vocabulary.
My first thought was that the import was wrong, but I got the same result with textract instead of PyPDF2.
Does someone know what the problem is? Thanks for your help!
Appendix:
Importing the book
import os
import PyPDF2

content_text = []
number_of_inputs = len(os.listdir(path))
file_to_open = path
open_file = open(file_to_open, 'rb')
read_pdf = PyPDF2.PdfFileReader(open_file)
number_of_pages = read_pdf.getNumPages()
page_content = ""
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()
content_text.append(page_content)  # the whole book ends up as one string in the list
Word2Vec requires as its sentences parameter a training corpus that is:
an iterable sequence (such as a list)
where each item is itself a list of string-tokens
If you supply just a list-of-strings, each string is seen as a list-of-one-character-strings, resulting in all the one-letter words you're seeing.
So, use a list-of-lists-of-words, more like:
[
['schottky','diode','semiconductors'],
]
(Note also that you generally won't get interesting Word2Vec results on tiny toy-sized data sets of just a few texts and just dozens to hundreds of words. You need many thousands of unique words, across many dozens of contrasting examples of each word, to induce the useful word-vector arrangements that Word2Vec is known for.)
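Applied to the code in your appendix, that could look roughly like this (a sketch reusing your read_pdf and number_of_pages variables, and the pre-4.0 gensim parameter names you already use):
import gensim
from gensim.models import Word2Vec

# one list of string tokens per page, instead of one flat list of words
sentences = []
for page_number in range(number_of_pages):
    page_text = read_pdf.getPage(page_number).extractText()
    sentences.append(gensim.utils.simple_preprocess(page_text))

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
print(list(model.wv.vocab)[:20])  # whole words now, not single characters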
Instead of
text=['schottky','diode','semiconductors']
Use this
text=[['schottky','diode','semiconductors']]
More info: Gensim word2vec
I have 50,000 files with a combined total of 162 million words. I wanted to do topic modelling using Gensim, similar to this tutorial here.
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
So, I have these files read into a pandas dataframe (the 'content' column has the text) and do the following to create a list of the texts (image of the dataframe attached here):
texts = [[word for word in row[1]['content'].lower().split() if word not in stopwords] for row in df.iterrows()]
However, I have been running into a memory error, because of the large word count.
I also tried the TfidfVectorizer from scikit-learn (below). I got a memory error for this too.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I handle tokenizing these really long documents in a way it can be processed for LDA Analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store such large lists, I rewrote the code to read each file (originally stored as HTML), convert it to text, create a text vector, append it to a list, and then send the list to the LDA code. It worked!
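Roughly, the rewritten flow looks like this (a simplified sketch; html_dir is a placeholder for the folder of HTML files, and the number of topics is arbitrary):
import os
import re
from bs4 import BeautifulSoup
from gensim import corpora, models

def simple_tokenizer(str_input):
    return re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()

dictionary = corpora.Dictionary()
bow_corpus = []

for fname in os.listdir(html_dir):
    with open(os.path.join(html_dir, fname), 'r') as f:
        text = BeautifulSoup(f.read(), 'html.parser').get_text()  # HTML -> plain text
    tokens = simple_tokenizer(text)
    dictionary.add_documents([tokens])             # ids, once assigned, stay stable
    bow_corpus.append(dictionary.doc2bow(tokens))  # keep only the compact BoW vector

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10)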
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process files one by one in a loop. This way you store only one file in memory. Process it, then move to the next one:
# for all files in your directory/directories:
with open(current_file, 'r') as f:
    for line in f:
        # your logic to update the dictionary with the word count
# here the file is closed and the loop moves to the next one
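For example, a minimal word-count version of that loop with collections.Counter (data_dir is a placeholder for your directory):
import os
from collections import Counter

word_counts = Counter()

for fname in os.listdir(data_dir):
    with open(os.path.join(data_dir, fname), 'r') as f:
        for line in f:
            # only the current line and the running counts live in memory
            word_counts.update(line.lower().split())

print(word_counts.most_common(20))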
EDIT: When it comes to keeping a really large dictionary in memory, remember that Python reserves a lot of memory to keep the dict at low density, the price for fast lookups. Therefore, you may have to look for another way of storing the key-value pairs, e.g. a list of tuples, at the cost of much slower lookup. This question is about that and has some nice alternatives described there.
I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors via the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar words that are in the GoogleNews vocabulary. First I load GoogleNews and check whether the word "to" is in it:
# Load the GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative33.bin", binary=True)
print(model["to"])
I receive an error: KeyError: "word 'to' not in vocabulary".
Is it possible that such a large dataset doesn't have this word? This is also true for some other common words like "a"!
To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.
#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
Then I calculate the mean of the embedding vectors of the most similar words for each missing word.
import numpy as np

missing_embd = {}
for key, value in word_to_idx.items():
    if value == 0:
        similar_words = model.wv.most_similar(key)
        similar_embeddings = [model.wv[a[0]] for a in similar_words]
        missing_embd[key] = np.mean(similar_embeddings, axis=0)
And then I add these new embeddings to the word2vec model with:
for word, embd in missing_embd.items():
    # model.wv.build_vocab(word, update=True)
    model.wv.syn0[model.wv.vocab[word].index] = embd
There is an inconsistency: when I print missing_embd, it's empty, as if there were no missing words at all.
But when I check with this:
for w in tokens_lower:
    if (w in model.wv.vocab) == False:
        print(w)
print("***********")
I found a lot of missing words.
Now, I have 3 questions:
1- Why is missing_embd empty while there are some missing words?
2- Is it possible that GoogleNews doesn't have words like "to"?
3- How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.
Here is a scenario where we are adding a missing lower case word.
from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)
'Quoran' in embedding.vocab
Output : True
'quoran' in embedding.vocab
Output : False
Here Quoran is present but quoran in lower case is missing
# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)
'quoran' in embedding.vocab
Output : True
It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupted, and checking other words after load() shows that they are present, it'd be reasonable to assume Google discarded the overly common words as having meaning too diffuse to be of value.
It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.
(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words; it will raise an error if tried on a completely unknown word. So that approach can't work.
And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
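(For question 3 specifically: on a recent gensim, 4.x, I believe KeyedVectors does expose add_vector / add_vectors for appending entries, so something like the sketch below should work, assuming you already have a vector for the new word from some other source. The word and vector here are placeholders.)
import numpy as np

# sketch: assumes gensim >= 4.0, and that new_vector was computed elsewhere
new_vector = np.zeros(model.vector_size, dtype=np.float32)  # placeholder value
if 'myword' not in model.key_to_index:
    model.add_vector('myword', new_vector)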
I'm trying to solve a sentence-comparison problem using the naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it completely: removing stop words, tokenizing, and lemmatizing. I use a pre-trained model for word vectors, which returns adequate results when finding similarities for some test words. I also tried summing up the sentence words to find similarities in the original model, and the matches do make sense; the similarities reflect the general sense of the phrase.
For sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build a vocab out of names (or person ids), with no training:
#first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index]*0
#populate vocab from an array
b.build_vocab([personIds], update=True)
Sum each sentence's word vectors and store the result in the model under the corresponding id:
# sentences are pulled from pandas dataset df. 'a' is the pre-trained model I use to get vectors for each word
def summ(phrase, start_model):
    '''
    vector addition function
    '''
    # starting with a vector of 0's
    sum_vec = start_model.word_vec("cat_NOUN") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personId = row["ID"]
        summVec = summ(df.iloc[i, 1], a)
        # updating syn0 for each name/id in vocabulary
        b.wv.syn0[b.wv.vocab[personId].index] = summVec
    except:
        pass
I understand that I shouldn't expect much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever, and the find-similarities method also fails to find matches (a similarity coefficient below 0.2 for basically everything).
Wondering if anyone has an idea of where I went wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be in their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property - those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures.
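For example, a bare-bones version of that last suggestion, kept entirely outside any Word2Vec model (a sketch: it assumes 'a' is your pre-trained vector set, that the second dataframe column holds the token list, and that the two ids at the end are placeholders):
import numpy as np

def sentence_vector(tokens, kv):
    # average the vectors of the words the pre-trained model actually knows
    known = [kv.word_vec(w) for w in tokens if w in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

# one summary vector per person, stored in a plain dict rather than a model
person_vecs = {row["ID"]: sentence_vector(row[df.columns[1]], a)
               for _, row in df.iterrows()}

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine(person_vecs[some_id], person_vecs[other_id]))  # placeholder ids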