I am working on a project for building a high precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence into the corresponding words in the target sentence, transparently and accurately given these restrictions:
the two languages do not have the same word order, and the order keeps changing
some words in the source sentence do not have corresponding words in the target sentence, and vice versa
sometimes a word in the source correspond to multiple words in the target, and vice versa, and there can be many-to-many mapping
there can be sentences where the same word is used multiple times in the sentence, so the alignment needs to be done with the words and their indexes, not only words
Here is what I did:
Start with a list of sentence pairs, say English-German, with each sentence tokenized to words
Index all words in each sentence, and create an inverted index for each word (e.g. the word "world" occurred in sentences # 5, 16, 19, 26 ... etc), for both source and target words
Now this inverted index can predict the correlation between any source word and any target word, as the intersection between the two words divided by their union. For example, if the tagret word "Welt" occurs in sentences 5, 16, 26,32, The correlation between (world, Welt) is the number of indexes in the intersection (3) divided by the number of indexes in the union (5), and hence the correlation is 0.6. Using the union gives lower correlation with high frequency words, such as "the", and the corresponding words in other languages
Iterate over all sentence pairs again, and use the indexes for the source and target words for a given sentence pairs to create a correlation matrix
Here is an example of a correlation matrix between an English and a German sentence. We can see the challenges discussed above.
In the image, there is an example of the alignment between an English and German sentence, showing the correlations between words, and the green cells are the correct alignment points that should be identified by the word-alignment algorithm.
Here is some of what I tried:
It is possible in some cases that the intended alignment is simply the word pair with the highest correlation in its respective column and row, but in many cases it's not.
I have tried things like Dijkstra's algorithm to draw a path connecting the alignment points, but it doesn't seem to work this way, because it seems you can jump back and forth to earlier words in the sentence because of the word order, and there is no sensible way to skip words for which there is no alignment.
I think the optimum solution will involve something
like expanding rectangles which start from the most likely
correspondences, and span many-to-many correspondences, and skip
words with no alignment, but I'm not exactly sure what would be a
good way to implement this
Here is the code I am using:
import random
src_words=["I","know","this"]
trg_words=["Ich","kenne","das"]
def match_indexes(word1,word2):
return random.random() #adjust this to get the actual correlation value
all_pairs_vals=[] #list for all the source (src) and taget (trg) indexes and the corresponding correlation values
for i in range(len(src_words)): #iterate over src indexes
src_word=src_words[i] #identify the correponding src word
for j in range(len(trg_words)): #iterate over trg indexes
trg_word=trg_words[j] #identify the correponding trg word
val=match_indexes(src_word,trg_word) #get the matching value from the inverted indexes of each word (or from the data provided in the speadsheet)
all_pairs_vals.append((i,j,val)) #add the sentence indexes for scr and trg, and the corresponding val
all_pairs_vals.sort(key=lambda x:-x[-1]) #sort the list in descending order, to get the pairs with the highest correlation first
selected_alignments=[]
used_i,used_j=[],[] #exclude the used rows and column indexes
for i0,j0,val0 in all_pairs_vals:
if i0 in used_i: continue #if the current column index i0 has been used before, exclude current pair-value
if j0 in used_j: continue #same if the current row was used before
selected_alignments.append((i0,j0)) #otherwise, add the current pair to the final alignment point selection
used_i.append(i0) #and include it in the used row and column indexes so that it will not be used again
used_j.append(j0)
for a in all_pairs_vals: #list all pairs and indicate which ones were selected
i0,j0,val0=a
if (i0,j0) in selected_alignments: print(a, "<<<<")
else: print(a)
It's problematic because it doesn't accomodate the many-to-many, or even the one to many alignments, and can err easily in the beginning by selecting a wrong pair with highest correlation, excluding its row and column from future selection. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider the proximity to other pairs with high correlations.
Here is some data to try if you like, it's in Google sheets:
https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing
Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf
There is a lot of existing approaches you could take, such as:
implement the "IBM models" used by Giza++ yourself (or if you're brave, try the NLTK implementation)
implement the (much much simpler) algorithm behind fast_align https://www.aclweb.org/anthology/N13-1073/
implement some form of HMM-based alignment https://www.aclweb.org/anthology/C96-2141/
use deep learning, there are multiple possibilities there; this paper seems to contain a nice overview of approaches https://www.aclweb.org/anthology/P19-1124.pdf (typically people try to leverage the attention mechanism of neural MT models to do this)
This is a very difficult machine learning problem and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs from surprisingly simple techniques in this field so who knows :-)
I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT) and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example since Arabic is a morphology-rich language, and I believe it would be more challenging than a Latin-based language such as German.
As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.
There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN
Good luck!
Recently, there were also two papers using bi-/multilingual word/contextual embeddings to do the word alignment. Both of them construct a bipartite graph where the words are weighted with their embedding distances and use graph algorithms to get the alignment.
One paper does a maximum matching between the graph parts. Because the matching is not symmetrical, they do it from both sides and use similar symmetrization heuristics as FastAlign.
The other one mentions the alignment only briefly uses minimum-weighted edge cover on the graph and uses it as the alignment.
Both of them claim to be better than FastAlign.
As the question is specifically addressing Python implementations, and Giza++ and FastAlign still seem to represent SOTA, one might look into
https://pypi.org/project/systran-align/: replicates FastAlign. Seems to be relatively mature. Also note that the original FastAlign code contains a Python wrapper (https://github.com/clab/fast_align/blob/master/src/force_align.py).
https://www.nltk.org/api/nltk.align.html: replicates most GIZA models (a good compromise between performance and quality is IBM4). However, it is rather unclear how thoroughly tested and how well maintained that is, as people generally prefer to work with GIZA++ directly.
Most research code on the topic will nowadays come in Python and be based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether they outperform the older models and if so, for which applications. In the end, you need to go for a compromise between context awareness (reordering!), precision, recall and runtime. Neural models have great potential on being more context aware, statistical models have more predictable behavior.
NLTK package of Python has a function dispersion plot, which shows location of chosen words in text. If there any numeric measure of such dispersion that can be calculated in python? E.g. I want to measure weather the word "money" is spread among the text or rather concentrated in one chapter?
I believe there are multiple metrics that can be used to give a quantitative measure of what you are defining as informativeness of a word over a body of text.
Methodology
Since you mention chapter and text as the levels you wish to evaluate, the basic methodology would be the same:
Break a given text into chapters
Evaluate model on chapter and text level
Compare evaluation on chapter and text level
If the comparison is over a threshold you could claim it is meaningful or informative. Other metrics on the two levels could be used depending on the model.
Models
There are a few models that can be used.
Raw counts
Raw counts of words could be used on chapter and text levels. A threshold of percentage could be used to determine a topic as representative of the text.
For example, if num_word_per_chapter/num_all_words_per_chapter > threshold and/or num_word_per_text/num_all_words_text > threshold then you could claim it is representative. This might be a good baseline. It is essentially a bag-of-words like technique.
Vector Space Models
Vector space models are used in Information Retrieval and Distributional Semantics. They usually used sparse vectors of counts or TF-IDF. Two vectors are compared with cosine similarity. Closer vectors have smaller angles and are considered "more alike".
You could create chapter-term matrices and average cosine similarity metrics for a text body. If the average_cos_sim > threshold you could claim it is more informative of the topic.
Examples and Difficulties
Here is a good example of VSM with NLTK. This may be a good place to start for a few tests.
The difficulties I foresee are:
Chapter Splitting
Finding Informative Threshold
I can't give you a more practical code based answer at this time, but I hope this gives you some options to start with.
I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents with varied length and use currently tf-idf for feature selection.
I believe when use_idf=true the algo normalises the bias against the inherent issue (with TF) where a term that is X times more frequent shouldn't be X times as important.
Utilising the tf*idf formula. Then the sublinear_tf = true instills 1+log(tf) such that it normalises the bias against lengthy documents vs short documents.
I am dealing with an inherently bias towards lengthy documents (most belong to one class), does this normalisation really diminish the bias?
How can I make sure the length of the documents in the corpus are not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalizated vectors of the corpora, so I assumed I could just sum up each row of the Tfidfvectorizer matrix. However the sum are greater than 1, I thought a normalized copora would transform all documents to a range between 0-1.
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
stop_words=stopwords,analyzer='word', use_idf=True, tokenizer=tokenizer, ngram_range=(1,2),sublinear_tf= True , norm='l2')
tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)
Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf "where a term that is X times more frequent shouldn't be X times as important" is more fitting as a description to sublinear_tf as sublinear_tf causes logarithmic increase in Tfidf score compared to the term frequency.
use_idf means to use Inverse Document Frequency, so that terms that appear very frequently to the extent they appear in most document (i.e., a bad indicator) get weighted less compared to terms that appear less frequently but they appear in specific documents only (i.e., a good indicator).
To reduce document length bias, you use normalization (norm in TfidfVectorizer parameter) as you proportionally scale each term's Tfidf score based on total score of that document (simple average for norm=l1, squared average for norm=l2)
By default, TfidfVectorizer already use norm=l2, though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents indeed contain similar words also? Also classification often depend a lot on the data, so I can't say much here to solve your problem.
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book
use_idf=true (by default) introduces a global component to the term frequency component (local component: individual article). When looking after the similarity of two texts, instead of counting the number of terms that each of them has and compare them, introducing the idf helps categorizing these terms into relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear twice as many times as the second most common word, three times as the third most common word etc. Even after removing stop words, all words are subjected to Zipf's law.
In this sense, imagine you have 5 articles describing a topic of automobiles. In this example the word "auto" will likely to appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only an article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be a unique identifier of each texts. By using the idf, words that appear less common in a texts ("mechanics" and "insurance" for example) will receive a higher weight. Therefore using an idf does not tackle the bias generated by the length of an article, since is again, a measure of a global component. If you want to reduce the bias generated by length then as you said, using sublinear_tf=True will be a way to solve it since you are transforming the local component (each article).
Hope it helps.
I aim to find the similarities between words for about ~10,000 words. I'm using the "word.path_similarity(otherword)" method of the wordnet library but the results I'm getting for the path_similarity are in the range 0-0.1 as opposed to being distributed over 0-1. How is it possible that similarities between 10,000 random words all end up in that narrow range?
Is there a better way to use WordNet for finding similarity between two words?
For context, here's how this is calculated:
Claculate the length of the shortest path between the two synsets/words (inclusive).
Return the score as 1/pathlen
Therefore a score <.2 is indicative of a pathlength > 5 steps. Inclusive of the two input synsets, that means there are at least 4 synsets between them.
With that said: you're complaint seems to be "according to this metric, two words chosen at random are pretty consistently unrelated! What's going on?" Well, your similarity metric is telling you that random words are generally not closely related. This shouldn't be that surprising. Why are you calculating similarities between random words to begin with?
So I'm making a python class which calculates the tfidf weight of each word in a document. Now in my dataset I have 50 documents. In these documents many words intersect, thus having multiple same word features but with different tfidf weight. So the question is how do I sum up all the weights into one singular weight?
First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".
I think you are a little bit confused. TF-IDF style weighting functions specify how to make a per term score out of the term's token frequency in a document and the background token document frequency in the corpus for each term in a document. TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the corresponding weight for the term, but there will only be one weight per term. There is no separate score for tokens sharing a term inside the doc.