How to improve a German text classification model in spaCy - python

I am working on a text classification project and using spaCy for it. Right now I have an accuracy of almost 70%, but that is not enough. I've been trying to improve the model for the past two weeks without any successful results, so I'm looking for advice about what I should do or try next. Any help would be highly appreciated!
So, here is what I have done so far:
1) Preparing the data:
I have an unbalanced dataset of German news with 21 categories (like POLITICS, ECONOMY, SPORT, CELEBRITIES etc). To balance the categories I duplicate the small classes. As a result I have 21 files with almost 700,000 lines of text in total. I then normalize this data using the following code:
import re
import spacy
from charsplit import Splitter

POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech

nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

def normalizer(texts):
    arr = []  # list of normalized texts (returned from the function as the result of normalization)
    docs = nlp_helper.pipe(texts)  # create Doc objects for multiple lines
    for doc in docs:  # iterate over each Doc object
        text = []  # list of words in the normalized text
        for token in doc:  # for each word in the text
            word = token.lemma_.lower()
            if word not in stop_words and token.pos_ in POS:  # drop stop words and disallowed parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns are split
                    _, word1, word2 = splitter.split_compound(word)[0]  # keep only the split with the highest probability
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] - check for the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        arr.append(re.sub(r'[.,;:?!"()\-=_+*&^#/\']', ' ', ' '.join(text)))  # delete punctuation
    return arr
Some explanations to the above code:
POS - a list of allowed parts of speech. If the token I'm currently working with has a part of speech that is not in this list, I delete it.
stop_words - just a list of words I delete.
splitter.split_compound(word)[0] - returns a tuple with the most likely division of the compound word (I use it to divide long German words into shorter and more widely used ones). Here is the link to the repository with this functionality.
To sum up: I find the lemma of each word, lower-case it, delete stop words and disallowed parts of speech, split compound words, and delete punctuation. I then join all the words and return an array of normalized lines.
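For reference, this is roughly how the splitter is used (the return format is inferred from the unpacking in the code above; the exact candidates and scores depend on the charsplit model):
from charsplit import Splitter

splitter = Splitter()
candidates = splitter.split_compound('bundesfinanzministerium')
# candidates is a list of (score, first_part, second_part) tuples,
# sorted so that the most probable split comes first
score, word1, word2 = candidates[0]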
2) Training the model
I train my model using de_core_news_sm (to make it possible in the future to use this model not only for classification but also for normalization). Here is the code for training:
import spacy
from random import shuffle

nlp = spacy.load('de_core_news_sm')
textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)
Some explanations to the above code:
data - a list of (text, annotations) pairs, where each entry consists of a line of text and a dictionary with the category scores (just like in the docs; a hypothetical example is sketched after this list)
'categories' - list of categories
'n_iter' - number of iterations for training
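For clarity, a hypothetical entry of data could look like this (only three of the 21 categories are shown; the real dictionaries contain a score for every category):
data = [
    ('bundesregierung beschließen neu maßnahme ...',  # a normalized line of text
     {'cats': {'POLITICS': 1.0, 'ECONOMY': 0.0, 'SPORT': 0.0}}),
    # ... one entry per line of text
]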
3) At the end I just save the model with the to_disk method.
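Saving and reloading looks roughly like this (the path is just a placeholder):
nlp.to_disk('./textcat_model')              # save the trained pipeline
nlp_loaded = spacy.load('./textcat_model')  # reload it later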
With the above code I managed to train a model with 70% accuracy. Here is a list of what I've tried so far to improve this score:
1) Using another architecture (ensemble) - didn't give any improvement
2) Training on non-normalized data - the result was much worse
3) Using a pretrained BERT model - couldn't get it to work (here is my unanswered question about it)
4) Training de_core_news_md instead of de_core_news_sm - didn't give any improvement (I tried it because, according to the docs, the word vectors could bring an improvement, if I understood it correctly. Correct me if I'm wrong.)
5) Training on data normalized in a slightly different way (without lower-casing and punctuation removal) - didn't give any improvement
6) Changing the dropout - didn't help
So right now I am a little stuck on what to do next. I would be very grateful for any hint or advice.
Thanks in advance for your help!

The first thing I would suggest is increasing your batch size. After that, look at your optimizer (Adam, if possible) and the learning rate, for which I don't see the code here. Finally, you can also try changing your dropout.
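A minimal sketch of what a larger, compounding batch size could look like in the spaCy v2 training loop from the question (the 4/32/1.001 values are common defaults, not tuned for this dataset):
from spacy.util import minibatch, compounding

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        shuffle(data)
        # grow the batch size from 4 up to 32 over the course of training
        batches = minibatch(data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2)  # dropout is also worth tuning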
Also, if you are experimenting with neural networks and plan on changing a lot, it would be better to switch to PyTorch or TensorFlow. With PyTorch you can use the Hugging Face Transformers library, which has BERT built in.
Hope this helps you!

Related

Difference between 2 kinds of text corpus vocabulary creations with spacy

I am trying to retrieve the vocabulary of a text corpus with spacy.
The corpus is a list of strings with each string being a document from the corpus.
I came up with 2 different methods to create the vocabulary. Both work but yield slightly different results, and I don't know why.
The first approach results in a vocabulary size of 5000:
words = nlp(" ".join(docs))
vocab2 = []
for word in words:
    if word.lemma_ not in vocab2 and word.is_alpha:
        vocab2.append(word.lemma_)
The second approach results in a vocabulary size of 5001 -> a single word more:
vocab = set()
for doc in docs:
    doc = nlp(doc)
    for token in doc:
        if token.is_alpha:
            vocab.add(token.lemma_)
Why do the 2 results differ?
My best guess would be that the model behind nlp() somehow tokenizes the text differently when it gets the whole corpus as one input vs. one document at a time.
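A quick way to narrow this down (just a debugging sketch, not an answer) is to diff the two vocabularies and inspect the extra lemma:
extra = set(vocab2) ^ vocab  # symmetric difference between the two vocabularies
print(extra)                 # should contain exactly the one differing lemma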

Clustering script fails with German, but works like expected with English

I have a script to cluster keywords, utilizing pandas and PolyFuzz. With English it works as expected. When I try to use the script with keywords in German, it recognizes multiple keywords wrongly.
What "wrongly recognized" means: the clustering picks up the first and second word of the keyword, but as you can see on the screenshot, columns G and H (First Word and Second Word) contain different words than the corresponding keywords in column B (Keyword):
The script does not always fail with German - many keywords are clustered correctly. But the share of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, in the best case, improve the script so that it works with German?
Here is the part of the script that does the clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1

def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found'] = df_matched.apply(ismatch, axis=1)
df_matched = df_matched.fillna('')

def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# todo - document this algo. Essentially if it matches on the second word only, it renames the cluster to the second word
# clean up code and variable names
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"

# count cluster size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here, the working Colab with the whole script.
I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 140 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German data the problem is more visible because the language uses a lot of compound words and many frequent suffixes (e.g., "ung"), so the keywords fuzzy-match each other a lot.
Example of df_matched for German: the "From" words are not present in "To"... but there are large overlaps.
This is df_matched for English: some words of "From" are not even close to the words in "To"... and similarity score can be worse than in the German dataset.
Possible improvements
I think that the part where you could improve the clustering is this (from the Colab notebook):
df_1_list = df_1.Keyword.tolist()  # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()
model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of its compound words.
You could try to normalize the text with (language-specific) stemming or lemmatization before / instead of checking for substrings and creating clusters. This should help by transforming each word into its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for lemmatization, embeddings and other language operations.
You can select the correct model for each language and use its lemmatization to replace each word of df_1_list with its "base form" before trying to cluster.
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
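Applied to your notebook, a minimal sketch could look like this (it assumes the de_core_news_sm model is installed and that df_1_list is the keyword list from the Colab code above):
import spacy

nlp_de = spacy.load("de_core_news_sm")

def lemmatize_keyword(keyword):
    # replace every word of the keyword with its lemma, lower-cased
    return " ".join(token.lemma_.lower() for token in nlp_de(keyword))

df_1_list_lemmatized = [lemmatize_keyword(kw) for kw in df_1_list]
# then build cluster_tags / shortest_word_list and run model.match()
# on df_1_list_lemmatized instead of df_1_list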

Parsing a List of Tweets in Order to Utilize Gensim Word2Vec

I'm working on an NLP problem and my goal is to be able to pass my data into sklearn's algos after having used Word2Vec via Python's Gensim Library. The underlying problem I am trying to solve is binary classification of a series of tweets. To do so I am modifying the code in this git repo.
Here is part of the code relating to tokenization:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
input_file["tokens"] = input_file["text"].apply(tokenizer.tokenize)
all_words = [word for tokens in input_file["tokens"] for word in tokens]
sentence_lengths = [len(tokens) for tokens in input_file["tokens"]]
vocabulary = sorted(set(all_words))
Now here is the part where I use Gensim's sklearn-api to try to vectorize my tweets:
from sklearn.model_selection import train_test_split
from gensim.test.utils import common_texts
from gensim.sklearn_api import W2VTransformer
text = input_file["text"].tolist()
labels = input_file["label"].tolist()
X_train, X_test, y_train, y_test = train_test_split(text, labels, test_size=0.2,random_state=40)
model = W2VTransformer(size=10, min_count=1, seed=1)
X_train_w2v = model.fit(common_texts).transform(X_train)
This results in the following error:
KeyError: "word 'Great seeing you again, don't be a stranger!' not in vocabulary"
It seems that part of the issue is that Gensim is expecting to be fed one word at a time and instead it is getting entire tweets.
X_train is of type list, here are the first three elements of the list:
["Great seeing you again, don't be a stranger!",
"Beautiful day here in sunny Prague. Not a cloud in the sky",
" pfft! i wish I had a laptop like that"]
Update
In order to remedy this, I have tried the following:
X_train_list = []
for sentence in X_train:
    word_list = sentence.split(' ')
    while "" in word_list:
        word_list.remove("")
    X_train_list.append(word_list)

model = W2VTransformer(size=10, min_count=1, seed=1)
X_train_tfidf = model.fit(common_texts).transform(X_train_list)
This produces the following error:
KeyError: "word 'here' not in vocabulary"
To be honest, this one blows my mind! How a common word like 'here' is not in the vocabulary is beyond me. I'm also wondering whether tweets with stray letters will throw errors; I imagine the weird jumbles of letters that often pass for words will cause similar issues.
The Gensim model indeed expects a list of word-lists as input instead of just a list of sentences. Your X_train should look like this:
[["Great", "seeing", "you", "again", "..."],
["Beautiful", "day", "..."],
...
]
Update: As for the new part of your question, the problem is that common_texts is a tiny dataset consisting of only 9 sentences, so it is not surprising that its vocabulary is very small. Try training on a bigger dataset before calling transform; in your case you can probably find a dataset of tweets to train on. You could also consider using FastText if you want to get vectors for out-of-vocabulary words.
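A minimal sketch of that idea, assuming gensim 3.x and the tokenized X_train_list from your update (note that W2VTransformer.transform returns one vector per word, so the word vectors of each tweet are averaged here to get a single fixed-length feature vector for sklearn):
import numpy as np
from gensim.sklearn_api import W2VTransformer

model = W2VTransformer(size=10, min_count=1, seed=1)
model.fit(X_train_list)  # train on your own tweets so their words are in the vocabulary

def tweet_vector(tokens):
    vectors = model.transform(tokens)  # shape: (len(tokens), size)
    return vectors.mean(axis=0)

X_train_w2v = np.array([tweet_vector(tokens) for tokens in X_train_list])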

Understanding LDA / topic modelling -- too much topic overlap

I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach).
I have a small number of literary texts (novels) and would like to extract some general topics using LDA.
I'm using the gensim module in Python along with some nltk features. For a test I've split up my original texts (just 6) into 30 chunks of 1000 words each. Then I converted the chunks into document-term matrices and ran the algorithm. This is the code (although I think it doesn't matter for the question):
import gensim

# chunks is a 30x1000 words matrix
dictionary = gensim.corpora.dictionary.Dictionary(chunks)
corpus = [dictionary.doc2bow(chunk) for chunk in chunks]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10)
topics = lda.show_topics(5, 5)
However, the result is completely different from any example I've seen, in that the topics are full of meaningless words that can be found in all source documents, e.g. "I", "he", "said", "like", ... For example:
[(2, '0.009*"I" + 0.007*"\'s" + 0.007*"The" + 0.005*"would" + 0.004*"He"'),
(8, '0.012*"I" + 0.010*"He" + 0.008*"\'s" + 0.006*"n\'t" + 0.005*"The"'),
(9, '0.022*"I" + 0.014*"\'s" + 0.009*"``" + 0.007*"\'\'" + 0.007*"like"'),
(7, '0.010*"\'s" + 0.009*"I" + 0.006*"He" + 0.005*"The" + 0.005*"said"'),
(1, '0.009*"I" + 0.009*"\'s" + 0.007*"n\'t" + 0.007*"The" + 0.006*"He"')]
I don't quite understand why that happens, or why it doesn't happen with the examples I've seen. How do I get the LDA model to find more distinctive topics with less overlap? Is it a matter of filtering out the more common words first? How can I adjust how many iterations/passes the model runs? Is the number of original texts too small?
LDA is extremely dependent on the words used in a corpus and how frequently they show up. The words you are seeing are all stopwords - meaningless words that are the most frequent words in a language, e.g. "the", "I", "a", "if", "for", "said", etc. Since these words are the most frequent, they will negatively impact the model.
I would use the nltk stopword corpus to filter out these words:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Then make sure your text does not contain any of the words in the stop_words list (by whatever pre-processing method you are using) - an example is below:
text = text.split() # split words by space and convert to list
text = [word for word in text if word not in stop_words]
text = ' '.join(text) # join the words in the text to make it a continuous string again
You may also want to remove punctuation and other characters ("/", "-", etc.); for that you can use regular expressions:
import re
remove_punctuation_regex = re.compile(r"[^A-Za-z ]") # regex for all characters that are NOT A-Z, a-z and space " "
text = re.sub(remove_punctuation_regex, "", text) # sub all non alphabetical characters with empty string ""
Finally, you may also want to filter out the most frequent or least frequent words in your corpus, which you can do using nltk:
from nltk import FreqDist
all_words = text.split() # list of all the words in your corpus
fdist = FreqDist(all_words) # a frequency distribution of words (word count over the corpus)
k = 10000 # say you want to see the top 10,000 words
top_k_words, _ = zip(*fdist.most_common(k)) # unzip the words and word count tuples
print(top_k_words) # print the words and inspect them to see which ones you want to keep and which ones you want to disregard
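Since you are already building a gensim Dictionary, an alternative to the manual FreqDist filtering is gensim's built-in frequency filter (a sketch using the variables from your question; the thresholds are arbitrary and worth tuning):
# drop words that appear in fewer than 2 chunks or in more than 50% of the chunks
dictionary.filter_extremes(no_below=2, no_above=0.5)
corpus = [dictionary.doc2bow(chunk) for chunk in chunks]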
That should get rid of the stopwords and extra characters, but it still leaves the vast problem of topic modelling (which I won't try to explain here but will leave some tips and links for).
Assuming you know a little bit about topic modelling, let's start. LDA is a bag-of-words model, meaning word order doesn't matter. The model assigns a topic distribution (over a predetermined number of topics K) to each document, and a word distribution to each topic. A very insightful high-level video explains this here. If you want to see more of the mathematics, but still at an accessible level, check out this video. The more documents the better, and usually longer documents (with more words) also fare better with LDA - this paper shows that LDA doesn't perform well with short texts (less than ~20 words). K is up to you to choose, and really depends on your corpus of documents (how large it is, what different topics it covers, etc.). Usually a good value of K is between 100-300, but again this really depends on your corpus.
LDA has two hyperparameters, alpha and beta (alpha and eta in gensim): a higher alpha means each text will be represented by more topics (and naturally a lower alpha means each text will be represented by fewer topics). A high eta means each topic is represented by more words, and a low eta means each topic is represented by fewer words - so with a low eta you would get less "overlap" between topics.
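In gensim these can be passed directly to the model; a sketch using the variables from your question ('auto' lets gensim learn the priors from the data, and passes controls how many times the model iterates over the corpus):
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10, alpha='auto', eta='auto',
                                      passes=10)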
There are many insights you could gain using LDA:
What are the topics in a corpus (naming topics may not matter to your application, but if it does this can be done by inspecting the words in a topic as you have done above)
What words contribute most to a topic
What documents in the corpus are most similar (using a similarity metric)
Hope this has helped. I was new to LDA a few months ago but I've quickly gotten up to speed using Stack Overflow and YouTube!

Latent Dirichlet Allocation (LDA) performance by limiting word size for corpus documents

I have been generating topics with the Yelp dataset of customer reviews by using Latent Dirichlet Allocation (LDA) in Python (gensim package). While generating tokens, I am selecting only words with length >= 3 from the reviews (by using RegexpTokenizer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)
This will allow us to filter out the noisy words of length less than 3 while creating the corpus documents.
How will filtering out these words affect the performance of the LDA algorithm?
Generally speaking, for the English language, one- and two-letter words don't add information about the topic. If they don't add value they should be removed during the pre-processing step. Like with most algorithms, less data in will speed up the execution time.
Words of length less than 3 are generally treated as stop words. LDA builds topics, so imagine you generate this topic:
[I, him, her, they, we, and, or, to]
compared to:
[shark, bull, greatwhite, hammerhead, whaleshark]
Which is more telling? This is why it is important to remove stopwords. This is how I do that:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words that are 3 letters long or shorter
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens to a list
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stopword2']  # any extra stopwords you want to remove
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
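For completeness, a sketch of how preprocess would feed into the rest of the LDA pipeline (it assumes reviews is your list of review strings; num_topics and passes are arbitrary):
processed_docs = [preprocess(review) for review in reviews]
dictionary = gensim.corpora.Dictionary(processed_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10,
                                   id2word=dictionary, passes=2)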
