OK, so the title might sound a bit confusing, but here's an example of what I'm trying to achieve. Let's imagine we have the following dataset:
Brand name   | Product type | Product_Description
Nike         | Shoes        | These black shoes are wonderful. They are elegant and really comfortable.
BMW          | Car          | This car goes fast. If you like speed, you'll like it.
Suzuki       | Car          | A family car, elegant and made for long journeys.
Call of Duty | VideoGame    | A fast-paced shooter, putting you in the shoes of a desperate soldier who has nothing left to lose.
Adidas       | Shoes        | Sneakers made for men and women who always want to go out with style.
This is just a made-up sample, but let's imagine this list goes on for many other products.
What I'm trying to achieve is to cluster the elements (whether they are shoes, cars, or video games) based on the words used in their respective descriptions. I would then obtain brands that are grouped together according to their descriptions, possibly across product types (e.g. Suzuki + Adidas), and I want to retrieve the names of the brands that end up in the same cluster.
To do so, I relied on a word embedding method. After cleaning the descriptions (stop words, non-alphanumeric characters) and tokenizing them, I used a FastText model (the pretrained Wikipedia one) to compute the embeddings of the product descriptions.
import re
import string

import spacy

nlp_model = spacy.load("en_core_web_sm")  # spaCy pipeline used for lemmatisation (adjust to the model you use)

def clean_text(text, tokenizer, stopwords):
    """Lowercase the text, strip markup and punctuation, lemmatise, then drop stopwords, digits and short tokens."""
    text = str(text).lower()                          # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)             # Remove [+XYZ chars] in content
    text = re.sub(r"\s+", " ", text)                  # Remove multiple spaces in content
    text = re.sub(r"\w+…|…", "", text)                # Remove ellipsis (and last word)
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text)  # Remove HTML anchor tags, keep their text
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)       # Replace dash between words (before punctuation removal)
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation

    doc = nlp_model(text)
    tokens = [token.lemma_ for token in doc]               # Lemmatise with spaCy
    # tokens = tokenizer(text)                             # (alternative: plain word tokenizer)
    tokens = [t for t in tokens if t not in stopwords]     # Remove stopwords
    tokens = ["" if t.isdigit() else t for t in tokens]    # Remove digits
    tokens = [t for t in tokens if len(t) > 1]             # Remove short tokens
    return tokens
import numpy as np

def sent_vectorizer(sent, model):
    """Average the embedding vectors of the tokens that exist in the model's vocabulary."""
    sent_vec = np.zeros(model.vector_size)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:  # word not in the embedding vocabulary
            continue
    if numw == 0:         # avoid dividing by zero for descriptions with no known words
        return sent_vec
    return sent_vec / numw
import pandas as pd
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

df = pd.read_csv("./mockup.csv")
custom_stopwords = set(stopwords.words("english"))
df["Product_Description"] = df["Product_Description"].fillna("")
df["tokens"] = df["Product_Description"].map(lambda x: clean_text(x, word_tokenize, custom_stopwords))
model = KeyedVectors.load_word2vec_format('./wiki-news-300d-1M.vec')
The problem is that I'm a bit of a beginner with word embeddings and clustering. As I said, my goal is to cluster brands according to the words used in their descriptions (the hypothesis being that some brands are linked together through the vocabulary of their descriptions), thus forgoing the old classification (shoes, cars, video games...). I would also like to get the key brands of each cluster (so cluster 1 = Suzuki + Adidas, cluster 2 = Call of Duty + Nike, cluster 3 = BMW + ..., etc.).
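Here is, roughly, what I was planning to try next: average the token vectors of each description with sent_vectorizer, run a clustering algorithm on those vectors (KMeans below, but I'm open to anything), and then read off which brands fall into the same cluster. A rough sketch of that idea (the "Brand name" column matches my table above, and the number of clusters is a pure guess):

import numpy as np
from sklearn.cluster import KMeans

# one averaged FastText vector per product description
sentence_vectors = np.array([sent_vectorizer(tokens, model) for tokens in df["tokens"]])

# the number of clusters is a guess; I don't know how to choose it properly
kmeans = KMeans(n_clusters=3, random_state=42)
df["cluster"] = kmeans.fit_predict(sentence_vectors)

# brand names grouped by the cluster their description fell into
print(df.groupby("cluster")["Brand name"].apply(list))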
Does anyone have any ideas on how to tackle this problem? I read several tutorials online on word embeddings and clustering, and to be completely honest, I am a bit lost.
Thank you for your help.
Related
I have a script that clusters keywords, using pandas and PolyFuzz. With English it works as expected. When I run the script on German keywords, it recognizes many of them wrongly.
What does "wrongly recognized" mean? The clustering extracts the first and second word of each cluster name, and as you can see on the screenshot, columns G and H (First Word and Second Word) contain different words than the corresponding keywords in column B (Keyword).
The script does not always fail with German: many keywords are clustered correctly. But the share of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, ideally, improve the script so that it also works with German?
Here is the part of the script that does the clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1
def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found'] = df_matched.apply(ismatch, axis=1)
df_matched = df_matched.fillna('')
def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# TODO: document this algorithm. Essentially, if it matches on the second word only,
# it renames the cluster to the second word.
# Clean up code and variable names.
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"
# count cluster_size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here, the working Colab with the whole script.
I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 140 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German data the problem is more visible because the language uses a lot of compound words and many frequent suffixes (e.g., "ung"), so keywords will fuzzy-match each other a lot.
Example of df_matched for German: the "From" words are not present in "To"... but there are large overlaps.
This is df_matched for English: some words of "From" are not even close to the words in "To"... and the similarity scores can be lower than in the German dataset.
Possible improvements
I think the part where you could improve the clustering is this (from the Colab notebook):
df_1_list = df_1.Keyword.tolist() # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()

model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of its compound words.
You could try to normalize the text with (language-specific) lemmatization before / instead of checking for substrings and creating clusters. This should help: it reduces each word to its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for lemmatization, embeddings and other language operations.
You can select the correct model for each language and use its lemmatizer to replace each word of df_1_list with its "base form" before trying to cluster (see the sketch after the example below).
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
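For instance, here is a rough sketch of how the German keyword list could be lemmatized before building the cluster tags (the model name de_core_news_sm is the small German pipeline from the link above; pick the size you prefer):

import spacy

# German pipeline from https://spacy.io/models/de
nlp_de = spacy.load("de_core_news_sm")

def lemmatize_keywords(keywords):
    # replace every word of each keyword with its lemma ("base form")
    return [" ".join(token.lemma_ for token in doc) for doc in nlp_de.pipe(keywords)]

df_1_list = lemmatize_keywords(df_1_list)
# ...then build cluster_tags / shortest_word_list and call model.match() as before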
I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
As a small example, if
infrequent_words = ['dog', 'cat']
sentence = 'My dog likes to chase after cars'
then after applying the transformation it should become
sentence = 'My unk likes to chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text, infrequent_words):
    for word in infrequent_words:
        text = text.replace(word, 'unk')
    return text
Thank you!
infrequent_words = {'dog','cat'}
sentence = 'My dog likes to chase after cars'
def replace_infrequent_words(text, infrequent_words):
    words = text.split()
    for i in range(len(words)):
        if words[i] in infrequent_words:
            words[i] = 'unk'
    return ' '.join(words)
print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation, but it should be a clear performance improvement over what you posted.
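If punctuation does matter, one option (just a sketch, not benchmarked) is a single compiled regex over the whole set, matched on word boundaries so that e.g. 'cat' does not hit 'category':

import re

def replace_infrequent_words_regex(text, infrequent_words):
    # one alternation over all infrequent words, anchored on word boundaries
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, infrequent_words)) + r')\b')
    return pattern.sub('unk', text)

print(replace_infrequent_words_regex('My dog likes to chase cats.', {'dog', 'cat'}))
# 'My unk likes to chase cats.'  (plural/inflected forms would need their own entries)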
I have a use case where I want to match a list of words against a list of sentences and bring back the most relevant sentences.
I am working in Python. What I have already tried is KMeans, where we cluster the set of documents and then predict which cluster a new sentence falls into. But in my case I already have the list of words available.
def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
    words = ["cricket","sports","team","play","match"]
    # TODO: this should return the 2nd and last item from the Sentences list, as the words list mostly matches with them
So from the above code I want to return the sentences which closely match the provided words. I don't want to use supervised machine learning here. Any help will be appreciated.
So finally I have used this super library called gensim to compute the similarity.
import gensim
from nltk.tokenize import word_tokenize
def getSimilarityScore(raw_documents, words):
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    return sims[query_doc_tf_idf]
You can use this method as:
Sentences = ["This is the most beautiful place in the world.",
"This man has more skills to show in cricket than any other game.",
"Hi there! how was your ladakh trip last month?",
"Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
words = ["cricket","sports","team","play","match"]
words_lower = [w.lower() for w in words]
getSimilarityScore(Sentences,words_lower)
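getSimilarityScore returns one similarity score per sentence, so to actually get the most relevant sentences (here the 2nd and the last one) you can, for example, keep the sentences above a threshold or take the top-k. A small sketch (the threshold of 0 is arbitrary):

import numpy as np

scores = getSimilarityScore(Sentences, words_lower)

# keep every sentence whose TF-IDF similarity to the word list is non-zero...
relevant = [sent for sent, score in zip(Sentences, scores) if score > 0]

# ...or simply take the two best-scoring sentences
top2 = [Sentences[i] for i in np.argsort(scores)[::-1][:2]]
print(top2)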
I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet, but the replacement is currently random. I want to loop over the synonyms and replace the word with each synonym, one at a time, to increase the data size.
sentences = []
for index, r in pos_df.iterrows():
    text = normalize(r['text'])
    words = tokenize(text)
    output = ""
    # Identify the parts of speech
    tagged = nltk.pos_tag(words)

    for i in range(0, len(words)):
        replacements = []

        # Only replace nouns with nouns, verbs with verbs etc.
        for syn in wordnet.synsets(words[i]):
            # Do not attempt to replace proper nouns or determiners
            if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
                break

            # The tagger returns strings like NNP, VBP etc.,
            # but the wordnet synonyms have tags like .n.
            # So we extract the first character from NNP, i.e. n,
            # and check whether the dictionary word has a .n. tag or not
            word_type = tagged[i][1][0]
            if syn.name().find("." + word_type + ".") != -1:
                # extract the word only
                r = syn.name()[0:syn.name().find(".")]
                replacements.append(r)

        if len(replacements) > 0:
            # Choose a random replacement
            replacement = replacements[randint(0, len(replacements) - 1)]
            print(replacement)
            output = output + " " + replacement
        else:
            # If no replacement could be found, then just use the original word
            output = output + " " + words[i]

    sentences.append([output, 'positive'])
I'm working on a similar kind of project: generating new sentences from a given input without changing the context of the input text.
While looking into this, I came across a data augmentation technique that seems to work well for the augmentation part: EDA (Easy Data Augmentation). It is a paper with an accompanying implementation: https://github.com/jasonwei20/eda_nlp.
Hope this helps you.
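For the "replace with every synonym, one at a time" part of your question, here is a minimal sketch (it ignores the POS filtering from your code and assumes the NLTK WordNet data is already downloaded):

from nltk.corpus import wordnet

def augment_with_synonyms(words):
    """Return new sentences, each with exactly one word replaced by one of its WordNet synonyms."""
    augmented = []
    for i, word in enumerate(words):
        synonyms = set()
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                name = lemma.name().replace("_", " ")
                if name.lower() != word.lower():
                    synonyms.add(name)
        for synonym in sorted(synonyms):
            augmented.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return augmented

print(augment_with_synonyms(["the", "movie", "was", "great"]))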
I've built a web crawler which fetches data for me. The data is typically structured, but here and there are a few anomalies. Now, to do analysis on top of the data, I am searching for a few words, i.e. searched_words = ['word1', 'word2', 'word3', ...], and I want the sentences in which these words are present. So I coded this:
searched_words = ['word1','word2','word3'......]
fsa = re.compile('|'.join(re.escape(w.lower()) for w in searched_words))
str_df['context'] = str_df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                                       if any(True for w in word_tokenize(sent) if w.lower() in searched_words)])
It is working, but the problem I am facing is that if whitespace is missing after a full stop in the text, I get the whole run-on sentence as one item.
Example :
searched_words = ['snakes','venomous']
text = "I am afraid of snakes.I hate them."
output : ['I am afraid of snakes.I hate them.']
Desired output : ['I am afraid of snakes.']
If all tokenizers (including nltk's) fail you, you can take matters into your own hands and try:
import re

s = 'I am afraid of snakes.I hate venomous them. Theyre venomous.'

def findall(s, p):
    return [m.start() for m in re.finditer(p, s)]

def find(sent, word):
    res = []
    indexes = findall(sent, word)
    for index in indexes:
        i = index
        while i > 0:
            if sent[i] != '.':
                i -= 1
            else:
                break
        end = index + len(word)
        nextFullStop = end + sent[end:].find('.')
        res.append(sent[i:nextFullStop])
        i = 0
    return res
Play with it here. There are some dots left in the results, as I do not know what exactly you want to do with them.
What it does is find all occurrences of the given word and give you the sentence all the way back to the previous dot. It is built for this edge case only, but you can tune it easily to your specific needs.
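For the sample string s above, it returns the text between the surrounding dots for each occurrence (the leading dot and space can be stripped afterwards):

print(find(s, 'venomous'))
# ['.I hate venomous them', '. Theyre venomous']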