Clustering script fails with German, but works as expected with English - python

I have a script to cluster keywords, using pandas and PolyFuzz. With English, it works as expected. When I use the script with German keywords, it recognizes multiple keywords wrongly.
What "wrongly recognized" means: the clustering extracts the first and second word of the keyword. As you can see in the screenshot, columns G and H (First Word and Second Word) contain words other than the corresponding keywords in column B (Keyword):
The script does not always fail with German - many keywords are clustered correctly. But the proportion of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, ideally, improve the script so that it works with German?
Here is the part of the script that does the clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1

def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A
df_matched['Found'] = df_matched.apply(ismatch, axis=1)

df_matched = df_matched.fillna('')

def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A
df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# todo - document this algo. Essentially if it matches on the second word only, it renames the cluster to the second word
# clean up code and variable names
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"

# count cluster_size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here is the working Colab with the whole script.

I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 140 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German data the problem is more visible because the language uses a lot of compound words and many frequent suffixes (e.g., "-ung"), so they will fuzzy-match a lot.
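To make that concrete, here is a small sketch using the same PolyFuzz TF-IDF matcher as the notebook; the word lists are made up and the exact scores will depend on your data. Because the matcher compares character n-grams rather than meanings, a compound that contains a shorter keyword matches it almost perfectly, and even unrelated words pick up some similarity from shared endings.
from polyfuzz import PolyFuzz

# a few made-up German keywords vs. candidate cluster tags
model = PolyFuzz("TF-IDF")
model.match(["krankenversicherung", "unfallversicherung", "werbung"],
            ["versicherung", "zeitung"])
print(model.get_matches())
# Both compounds map to "versicherung" with high scores because they literally
# contain it; "werbung" only shares "-ung" character n-grams, so its best match
# (if any survives the similarity threshold) is much weaker.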
Example of df_matched for German: the "From" words are not present in "To"... but there are large overlaps.
This is df_matched for English: some words of "From" are not even close to the words in "To"... and the similarity scores can be worse than in the German dataset.
Possible improvements
I think that the part where you could improve the clustering is this (from the Colab notebook):
df_1_list = df_1.Keyword.tolist()  # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()
model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of compound words.
You could try to normalize the text with (language-specific) stemming or lemmatization before / instead of checking for substrings and creating clusters. This should help: it transforms each word into its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for lemmatization, embeddings and other language operations.
You can select the correct model for each language and use its lemmatization to replace each word of df_1_list with its "base form" before trying to cluster.
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
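If it helps, here is a minimal sketch of how the lemmatization could be applied to df_1_list before the matching step from the notebook. It assumes the de_core_news_sm model is installed, and the lemmatize helper is just an illustrative name, not part of your script.
import spacy

# assumes the German model is installed: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def lemmatize(keyword):
    # replace every word in the keyword with its base form
    return " ".join(token.lemma_.lower() for token in nlp(keyword))

# lemmatize the keywords once (nlp.pipe would be faster for large lists)
df_1_list = [lemmatize(kw) for kw in df_1_list]
The lemmatized df_1_list can then go through the same substring clean-up and model.match(...) steps as before.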

Related

Keyphrase extraction in Python - How to preprocess the text to get better performances

I'm trying to extract keyphrases from some English texts but I think that the quality of my results is affected by how the sentences are formulated. For example:
Sentence 1
import pke
text = "Manufacture of equipment for the production and use of hydrogen."
# define the valid Part-of-Speeches to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}
# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"
extractor = pke.unsupervised.PositionRank()
extractor.load_document(input=text, language='en')
extractor.grammar_selection(grammar = grammar)
extractor.candidate_selection(maximum_word_number = 5)
extractor.candidate_weighting(window = 5, pos = pos)
keyphrases = extractor.get_n_best(n = 10, redundancy_removal = True)
keyphrases
returns this:
[('equipment', 0.2712123844387682),
('production', 0.24805759926043025),
('manufacture', 0.20214941371717332),
('use', 0.14005307983173715),
('hydrogen', 0.1385275227518909)]
While:
Sentence 2
text = "Equipment manufacture for hydrogen production and hydrogen use"
with the same piece of code returns this:
[('hydrogen production', 0.5110246649313613),
('hydrogen use', 0.4067693357279659),
('equipment manufacture', 0.3619113634611547)]
which, in my opinion, is a better result since it allows me to understand what we're talking about.
I wonder if there's a way to preprocess Sentence 1 to make it more similar to Sentence 2. I've already tried Neuralcoref but, in this particular case, it doesn't help me.
Thank you in advance for any suggestion.
Francesca

Clustering brands using words embeddings

Ok, so the title might sound a bit confusing, but here's an analogy of what I'm trying to achieve. Let's imagine that we have the following dataset:
Brand name   | Product type | Product_Description
Nike         | Shoes        | These black shoes are wonderful. They are elegant, and really comfortable
BMW          | Car          | This car goes fast. If you like speed, you'll like it.
Suzuki       | Car          | A family car, elegant and made for long journeys.
Call of Duty | VideoGame    | A nervous shooter, putting you in the shoes of a desperate soldier, who has nothing left to lose.
Adidas       | Shoes        | Sneakers made for men, and women, who always want to go out with style.
This is just a made-up sample, but let's imagine this list goes on for a lot of other products.
What I'm trying to achieve here is to cluster the elements (whether they are shoes, cars, or videogames) based on the words used in their respective descriptions. I would thus obtain brands that are clustered together according to their description, but perhaps not belonging to the same type (e.g. Suzuki + Adidas), and I would get the names of the brands that are clustered together.
To do so, I relied on a word embedding method. After cleaning the descriptions (stop words, non-alphanumeric characters) and tokenizing them, I used a FastText model (the Wikipedia one) to compute embeddings for the product descriptions.
import re
import string
from string import punctuation

import numpy as np
import pandas as pd
import spacy
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# assumption: an English spaCy model used for lemmatization (not shown in the question)
nlp_model = spacy.load("en_core_web_sm")

def clean_text(text, tokenizer, stopwords):
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)  # Remove [+XYZ chars] in content
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces in content
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text)  # Remove html tags
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Replace dash between words
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", "", text
    )  # Remove punctuation
    doc = nlp_model(text)
    tokens = [token.lemma_ for token in doc]
    # tokens = tokenizer(text)  # Get tokens from text
    tokens = [t for t in tokens if not t in stopwords]  # Remove stopwords
    tokens = ["" if t.isdigit() else t for t in tokens]  # Remove digits
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens  # Clean the text

def sent_vectorizer(sent, model):
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:  # skip out-of-vocabulary words
            pass
    return np.asarray(sent_vec) / numw

df = pd.read_csv("./mockup.csv")
custom_stopwords = set(stopwords.words("english"))
df["Product_Description"] = df["Product_Description"].fillna("")
df["tokens"] = df["Product_Description"].map(lambda x: clean_text(x, word_tokenize, custom_stopwords))
model = KeyedVectors.load_word2vec_format('./wiki-news-300d-1M.vec')
The problem is that I'm a bit of a beginner with word embeddings and clustering. As I said, my goal is to cluster brands according to the words used in their descriptions (the hypothesis being that some brands are linked together through the words used in their descriptions), thus forgoing the old classification (shoes, cars, videogames...). I would also like to get the key brands of each cluster (so cluster 1 = Suzuki + Adidas, cluster 2 = Call of Duty + Nike, cluster 3 = BMW + ..., etc.).
Does anyone have any ideas on how to tackle this problem? I read several tutorials online on word embeddings and clustering, and to be completely honest, I am a bit lost.
Thank you for your help.
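As an illustrative aside, here is a minimal sketch of one common way to cluster the averaged description vectors produced by the code above. The df, sent_vectorizer and model names come from the snippet in the question; the choice of scikit-learn's KMeans and the number of clusters are assumptions, not the only option.
import numpy as np
from sklearn.cluster import KMeans

# one vector per brand: the average of the word vectors of its description
X = np.vstack([sent_vectorizer(tokens, model) for tokens in df["tokens"]])

# the number of clusters is a free parameter; 3 is only for illustration
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(X)

# list the brands that ended up in each cluster
print(df.groupby("cluster")["Brand name"].apply(list))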

How to improve a German text classification model in spaCy

I am working on a text classification project and using spaCy for this. Right now I have an accuracy of almost 70%, but that is not enough. I've been trying to improve the model for the past two weeks with no successful results so far, so I am looking for advice about what I should do or try. Any help would be highly appreciated!
So, here is what I do so far:
1) Preparing the data:
I have an unbalanced dataset of German news with 21 categories (like POLITICS, ECONOMY, SPORT, CELEBRITIES, etc.). In order to make the categories equal in size I duplicate the small classes. As a result I have 21 files with almost 700,000 lines of text. I then normalize this data using the following code:
import re
import spacy
from charsplit import Splitter

# stop_words and german (a set of valid German words) are assumed to be defined elsewhere in the script
POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech
nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

def normalizer(texts):
    arr = []  # list of normalized texts (will be returned from the function as a result of normalization)
    docs = nlp_helper.pipe(texts)  # creating doc objects for multiple lines
    for doc in docs:  # iterating through each doc object
        text = []  # list of words in normalized text
        for token in doc:  # for each word in text
            word = token.lemma_.lower()
            if word not in stop_words and token.pos_ in POS:  # deleting stop words and some parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns can be split
                    _, word1, word2 = splitter.split_compound(word)[0]  # checking only the division with the highest probability
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] - checking for the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        arr.append(re.sub(r'[.,;:?!"()-=_+*&^#/\']', ' ', ' '.join(text)))  # delete punctuation
    return arr
Some explanations to the above code:
POS - a list of allowed parts of speech. If the word I'm working with at the moment is a part of speech that is not in this list -> I delete it.
stop_words - just a list of words I delete.
splitter.split_compound(word)[0] - returns a tuple with the most likely division of the compound word (I use it to divide long German words into shorter and more widely used). Here is the link to the repository with this functionality.
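For reference, a tiny sketch of that call, consistent with how it is used in the code above; the compound word is just an example and the scores in the returned tuples will vary:
from charsplit import Splitter

splitter = Splitter()
# returns candidate splits as (score, first_part, second_part) tuples, best candidate first
print(splitter.split_compound("Autobahnraststätte")[0])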
To sum up: I find the lemma of the word, make it lower case, delete stop words and some parts of speech, divide compound words, delete punctuation. I then join all the words and return an array of normalized lines.
2) Training the model
I train my model using de_core_news_sm (to make it possible in the future to use this model not only for classification but also for normalization). Here is the code for training:
from random import shuffle

nlp = spacy.load('de_core_news_sm')
textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)
Some explanations to the above code:
data - list of lists, where each list includes a line of text and a dictionary with categories (just like in the docs)
'categories' - list of categories
'n_iter' - number of iterations for training
3) At the end I just save the model with to_disk method.
With the above code I managed to train a model with 70% accuracy. Here is a list of what I've tried so far to improve this score:
1) Using another architecture (ensemble) - didn't give any improvements
2) Training on non-normalized data - the result was much worse
3) Using a pretrained BERT model - couldn't do it (here is my unanswered question about it)
4) Training de_core_news_md instead of de_core_news_sm - didn't give any improvements (I tried it because, according to the docs, there could be an improvement thanks to the vectors, if I understood correctly. Correct me if I'm wrong)
5) Training on data, normalized in a slightly different way (without lower casing and punctuation deletion) - didn't give any improvements
6) Changing dropout - didn't help
So right now I am a little stuck about what to do next. I would be very grateful for any hint or advice.
Thanks in advance for your help!
The first thing I would suggest is increasing your batch size. After that, look at your optimizer (Adam if possible) and the learning rate, for which I don't see the code here. Finally, you can try changing your dropout.
Also, if you are experimenting with neural networks and plan on changing a lot, it would be better to switch to PyTorch or TensorFlow. With PyTorch you can use the HuggingFace library, which has BERT built in.
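For illustration only, here is a minimal sketch of how the batch-size and dropout suggestions could be dropped into the spaCy v2 training loop from the question; the concrete numbers (4 to 64, drop=0.2) are placeholders to experiment with, not recommendations.
from random import shuffle
from spacy.util import minibatch, compounding

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        shuffle(data)
        # grow the batch size from 4 up to 64 over the course of training
        batches = minibatch(data, size=compounding(4.0, 64.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            # try a different dropout than the original 0.25
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2)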
Hope this helps you!

Extract sentence based on regex conditions in python

I have a dataset containing 9000 sentences from which I need 20/20 statements based upon some conditions. However, when I try to match those conditions, either the sentence is output or the conditions are not met. The first 20 sentences should contain one verb.
For the second part I would like to have sentences that contain more than 2 verbs.
Right now I have the following code for checking whether the number of verbs is less than 2:
import re
import spacy
import en_core_web_md

nlp = en_core_web_md.load()

test = "This sentence has just 1 verb"
test2 = "I have put multiple verbs in this sentence because it is possible and I want it"

doc1 = nlp(test)
doc2 = nlp(test2)

empt = []
for item in doc1.sents:
    verbs = 0
    for token in item:
        if token.pos_ == "VERB":
            verbs += 1
            if verbs < 2:
                empt.append(item)
However, I end up with an empty list.
Can someone tell me what I am doing wrong so I can adjust this code for every additional condition?
You just need to pull the last two lines back two indentation levels. You only want to check the number of verbs in the entire sentence after all the tokens have been considered.
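In other words, a minimal sketch of the corrected loop, using the same variables as the question:
empt = []
for item in doc1.sents:
    verbs = 0
    for token in item:
        if token.pos_ == "VERB":
            verbs += 1
    # only check the total after every token in the sentence has been counted
    if verbs < 2:
        empt.append(item)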

How to replace words with their synonyms from WordNet?

I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet. But the replacement is random; I want to loop over the synonyms and replace the word with each synonym, one at a time, to increase the data size.
import nltk
from nltk.corpus import wordnet
from random import randint

# pos_df, normalize and tokenize are defined elsewhere in the script
sentences = []
for index, r in pos_df.iterrows():
    text = normalize(r['text'])
    words = tokenize(text)
    output = ""
    # Identify the parts of speech
    tagged = nltk.pos_tag(words)
    for i in range(0, len(words)):
        replacements = []
        # Only replace nouns with nouns, verbs with verbs etc.
        for syn in wordnet.synsets(words[i]):
            # Do not attempt to replace proper nouns or determiners
            if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
                break
            # The tagger returns strings like NNP, VBP etc
            # but the wordnet synonyms have tags like .n.
            # So we extract the first character from NNP, i.e. n,
            # then we check if the dictionary word has a .n. or not
            word_type = tagged[i][1][0]
            if syn.name().find("." + word_type + "."):
                # extract the word only
                r = syn.name()[0:syn.name().find(".")]
                replacements.append(r)
        if len(replacements) > 0:
            # Choose a random replacement
            replacement = replacements[randint(0, len(replacements) - 1)]
            print(replacement)
            output = output + " " + replacement
        else:
            # If no replacement could be found, then just use the
            # original word
            output = output + " " + words[i]
    sentences.append([output, 'positive'])
I'm working on a similar kind of project, generating new sentences from a given input without changing the context of the input text.
While looking into this, I found a data augmentation technique that seems to work well for the augmentation part: EDA (Easy Data Augmentation), which comes with a paper and an implementation [https://github.com/jasonwei20/eda_nlp].
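If the goal is specifically to loop over every synonym rather than picking one at random, here is a minimal sketch using NLTK's WordNet lemmas; the helper name and the example sentence are only illustrative, and it replaces a single word at a time.
from nltk.corpus import wordnet  # requires nltk.download('wordnet')

def sentences_with_synonyms(words, i):
    # build one new sentence per distinct synonym of words[i]
    synonyms = {
        lemma.name().replace("_", " ")
        for syn in wordnet.synsets(words[i])
        for lemma in syn.lemmas()
        if lemma.name().lower() != words[i].lower()
    }
    return [" ".join(words[:i] + [s] + words[i + 1:]) for s in sorted(synonyms)]

print(sentences_with_synonyms(["the", "movie", "was", "great"], 3))
# e.g. ['the movie was bang-up', 'the movie was bully', ...]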
Hope this helps you.
