I am working on a project for building a high precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence into the corresponding words in the target sentence, transparently and accurately given these restrictions:
the two languages do not have the same word order, and the order keeps changing
some words in the source sentence do not have corresponding words in the target sentence, and vice versa
sometimes a word in the source correspond to multiple words in the target, and vice versa, and there can be many-to-many mapping
there can be sentences where the same word is used multiple times in the sentence, so the alignment needs to be done with the words and their indexes, not only words
Here is what I did:
Start with a list of sentence pairs, say English-German, with each sentence tokenized to words
Index all words in each sentence, and create an inverted index for each word (e.g. the word "world" occurred in sentences # 5, 16, 19, 26 ... etc), for both source and target words
Now this inverted index can predict the correlation between any source word and any target word, as the intersection between the two words divided by their union. For example, if the tagret word "Welt" occurs in sentences 5, 16, 26,32, The correlation between (world, Welt) is the number of indexes in the intersection (3) divided by the number of indexes in the union (5), and hence the correlation is 0.6. Using the union gives lower correlation with high frequency words, such as "the", and the corresponding words in other languages
Iterate over all sentence pairs again, and use the indexes for the source and target words for a given sentence pairs to create a correlation matrix
Here is an example of a correlation matrix between an English and a German sentence. We can see the challenges discussed above.
In the image, there is an example of the alignment between an English and German sentence, showing the correlations between words, and the green cells are the correct alignment points that should be identified by the word-alignment algorithm.
Here is some of what I tried:
It is possible in some cases that the intended alignment is simply the word pair with the highest correlation in its respective column and row, but in many cases it's not.
I have tried things like Dijkstra's algorithm to draw a path connecting the alignment points, but it doesn't seem to work this way, because it seems you can jump back and forth to earlier words in the sentence because of the word order, and there is no sensible way to skip words for which there is no alignment.
I think the optimum solution will involve something
like expanding rectangles which start from the most likely
correspondences, and span many-to-many correspondences, and skip
words with no alignment, but I'm not exactly sure what would be a
good way to implement this
Here is the code I am using:
import random
src_words=["I","know","this"]
trg_words=["Ich","kenne","das"]
def match_indexes(word1,word2):
return random.random() #adjust this to get the actual correlation value
all_pairs_vals=[] #list for all the source (src) and taget (trg) indexes and the corresponding correlation values
for i in range(len(src_words)): #iterate over src indexes
src_word=src_words[i] #identify the correponding src word
for j in range(len(trg_words)): #iterate over trg indexes
trg_word=trg_words[j] #identify the correponding trg word
val=match_indexes(src_word,trg_word) #get the matching value from the inverted indexes of each word (or from the data provided in the speadsheet)
all_pairs_vals.append((i,j,val)) #add the sentence indexes for scr and trg, and the corresponding val
all_pairs_vals.sort(key=lambda x:-x[-1]) #sort the list in descending order, to get the pairs with the highest correlation first
selected_alignments=[]
used_i,used_j=[],[] #exclude the used rows and column indexes
for i0,j0,val0 in all_pairs_vals:
if i0 in used_i: continue #if the current column index i0 has been used before, exclude current pair-value
if j0 in used_j: continue #same if the current row was used before
selected_alignments.append((i0,j0)) #otherwise, add the current pair to the final alignment point selection
used_i.append(i0) #and include it in the used row and column indexes so that it will not be used again
used_j.append(j0)
for a in all_pairs_vals: #list all pairs and indicate which ones were selected
i0,j0,val0=a
if (i0,j0) in selected_alignments: print(a, "<<<<")
else: print(a)
It's problematic because it doesn't accomodate the many-to-many, or even the one to many alignments, and can err easily in the beginning by selecting a wrong pair with highest correlation, excluding its row and column from future selection. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider the proximity to other pairs with high correlations.
Here is some data to try if you like, it's in Google sheets:
https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing
Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf
There is a lot of existing approaches you could take, such as:
implement the "IBM models" used by Giza++ yourself (or if you're brave, try the NLTK implementation)
implement the (much much simpler) algorithm behind fast_align https://www.aclweb.org/anthology/N13-1073/
implement some form of HMM-based alignment https://www.aclweb.org/anthology/C96-2141/
use deep learning, there are multiple possibilities there; this paper seems to contain a nice overview of approaches https://www.aclweb.org/anthology/P19-1124.pdf (typically people try to leverage the attention mechanism of neural MT models to do this)
This is a very difficult machine learning problem and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs from surprisingly simple techniques in this field so who knows :-)
I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT) and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example since Arabic is a morphology-rich language, and I believe it would be more challenging than a Latin-based language such as German.
As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.
There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN
Good luck!
Recently, there were also two papers using bi-/multilingual word/contextual embeddings to do the word alignment. Both of them construct a bipartite graph where the words are weighted with their embedding distances and use graph algorithms to get the alignment.
One paper does a maximum matching between the graph parts. Because the matching is not symmetrical, they do it from both sides and use similar symmetrization heuristics as FastAlign.
The other one mentions the alignment only briefly uses minimum-weighted edge cover on the graph and uses it as the alignment.
Both of them claim to be better than FastAlign.
As the question is specifically addressing Python implementations, and Giza++ and FastAlign still seem to represent SOTA, one might look into
https://pypi.org/project/systran-align/: replicates FastAlign. Seems to be relatively mature. Also note that the original FastAlign code contains a Python wrapper (https://github.com/clab/fast_align/blob/master/src/force_align.py).
https://www.nltk.org/api/nltk.align.html: replicates most GIZA models (a good compromise between performance and quality is IBM4). However, it is rather unclear how thoroughly tested and how well maintained that is, as people generally prefer to work with GIZA++ directly.
Most research code on the topic will nowadays come in Python and be based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether they outperform the older models and if so, for which applications. In the end, you need to go for a compromise between context awareness (reordering!), precision, recall and runtime. Neural models have great potential on being more context aware, statistical models have more predictable behavior.
Related
I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets including billions of words, but I have not a such number in my dataset(I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
from gensim.models import Word2Vec
vocab = df['Sentences']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
df['Sentences'].apply(model.vectorize)
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use Principal Component Analysis algorithm to visualise embeddings in 3-dimensional space. I know how to do for each individual word, but I do not know how to do it in case of sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
Once quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is note very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having receiving comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVector classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance
If you are interested in comparing sentences, Word2Vec is not the best choice. It was shown that using it to create sentence embedding produces inferior results than a dedicated sentence embedding algorithm. If your dataset is not huge, you can't create (train a new) embedding space using your own data. This forces you to use a pre trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representation for you sentences you can go 2 ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close are the sentences and you want to check you hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that will preserve relative distances between them. It presents data much better than PCA. Than you can easily find neighbors of the certain sentence:
You can learn more from this and this
Interesting take on the word2vec model, You can use T-SNE embeddings of the vectors and reduce the dimensionality to 3 and visualise them using any plotting library such matplotlib or dash. I also find this tools helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different context is the premise of ELMO(https://allennlp.org/elmo) but you will require a huge training set to train it. Luckily, if your application is not very specific you can use pre-trained models.
I have tried one e.g,
'Positive' and 'Negative' they are not similar words instead they are opposite but still spaCy gives me 81% similarity ratio for them.
here is my code,
import spacy
nlp = spacy.load('en_core_web_lg')
word1 = nlp(u'negative')
word2 = nlp(u'positive')
word1_word2 = word1.similarity(word2)
print(word1_word2)
Typically, word similarities like this are computed using cosine similarity between their corresponding word vectors. Words often used in the same contexts end up in similar locations in the vector space, on the assumption that words that get used similarly mean similar things. E.g., King and Queen might be similar, and King and Man might be similar, but Queen and Man should be a bit less similar (though they still both refer to "people", and they're both nouns, so they'll probably still be more similar than, say, Man and Combusted).
You want these words ('Positive' and 'Negative') to be negatives of each other (cosine similarity of -1), but they're similar because they're almost exactly the same word besides one being the negation of the other. The global semantic vector space incorporates many more ideas than just negation, and so these two words end up being very similar in other ways. What you can do is compute their average vector, then Positive -> average = - (Negative -> average), and that difference vector Positive -> average (or, more precisely, "Positive" - ("Positive" - "Negative") / 2) would approximate the idea of negation that you're particularly interested in. That is, you could then add that vector to other cases to negate them too, e.g. "Yes" + ("Negative" - "Positive") ~= "No"
All that just to say, the effect you're observing is not a fault of Spacy, and you won't avoid it by using Gensim or Sklearn, it's due the nature of what "similarity" means in this context. If you want more comprehensible, human-designed semantic relationships between words, consider looking at WordNet, which is manually created and would be more likely to explicitly have some "negation" relation between your two words.
NLTK package of Python has a function dispersion plot, which shows location of chosen words in text. If there any numeric measure of such dispersion that can be calculated in python? E.g. I want to measure weather the word "money" is spread among the text or rather concentrated in one chapter?
I believe there are multiple metrics that can be used to give a quantitative measure of what you are defining as informativeness of a word over a body of text.
Methodology
Since you mention chapter and text as the levels you wish to evaluate, the basic methodology would be the same:
Break a given text into chapters
Evaluate model on chapter and text level
Compare evaluation on chapter and text level
If the comparison is over a threshold you could claim it is meaningful or informative. Other metrics on the two levels could be used depending on the model.
Models
There are a few models that can be used.
Raw counts
Raw counts of words could be used on chapter and text levels. A threshold of percentage could be used to determine a topic as representative of the text.
For example, if num_word_per_chapter/num_all_words_per_chapter > threshold and/or num_word_per_text/num_all_words_text > threshold then you could claim it is representative. This might be a good baseline. It is essentially a bag-of-words like technique.
Vector Space Models
Vector space models are used in Information Retrieval and Distributional Semantics. They usually used sparse vectors of counts or TF-IDF. Two vectors are compared with cosine similarity. Closer vectors have smaller angles and are considered "more alike".
You could create chapter-term matrices and average cosine similarity metrics for a text body. If the average_cos_sim > threshold you could claim it is more informative of the topic.
Examples and Difficulties
Here is a good example of VSM with NLTK. This may be a good place to start for a few tests.
The difficulties I foresee are:
Chapter Splitting
Finding Informative Threshold
I can't give you a more practical code based answer at this time, but I hope this gives you some options to start with.
I am trying to classify (cluster) our companies Curriculum Vitae (CVs). There is about 100 CV's in total. The idea is to find similar people based on their CV content. I have already transformed the word docs into text files and read all of the candidates into a python dictionary with the format:
cvdict = { 'name1' : "cv text", 'name2', : 'cv text', ... }
I have also remove most punctuation, lowercased it, removed numbers etc., and removed words with a length less than x (4)
My questions:
Is clustering the correct approach? If not, which Machine Learning algorithm would be a suitable initial focus for this task.
Any pointers as to some python code i can use to transverse this dictionary and 'cluster' the content. Based on the clustering of the content it should output the ‘keys’=candidate names as clustered groups.
So from what I understood you want to see potential groups/clusters in the set of CVs.
the idea of cvdict is great, but you also need to convert all texts to numbers ! you are half way through. so think about matrix/excel sheet/table. where you have the profile of each employee in each line.
name1,cv_text1
name2,cv_text2
name3,cv_text3 ...
Yes, as you can guess, the length of cv_text can vary. Some people have a lengthy resume some other not ! which words can categorize the company employee. Some how we need to make them all equal size; Also, not all words are informative, you need to think which words can capture your idea; In Machine Learning they call it "Feature" vector or matrix. So my suggestion would be drive a set of words and mark if the person has mentioned that word in his skill.
managment marketing customers statistics programming
name1 1 1 0 0 0
name2 0 0 0 1 1
name3 0 0 1 1 0
or instead of a 0/1 matrix you can put how many times that word was mentioned in the resume.
again you can just extract all possible words from all resumes. NLTK is an awesome module for doing text analysis and it has some built-in function for you to polish you text. have a look at the first half of this slide.
Then you can use any kind of clustering method, for example hierarchical https://code.activestate.com/recipes/578834-hierarchical-clustering-heatmap-python/
there are already packages for doing such analysis; either in scipy or scikit and I am sure for each you can find a tons of examples. The key step is the one you are already working on; representing your data as a matrix.
Couple more hints to earlier comment:
I would not throw away words less than 4 characters long. Instead I would use a stop list of common words. You don't want to throw away things like C++ or C#
One good technique of building a matrix above is to use TF-IDF metric. What it is is essentially a measure of how frequently some word occurs in a particular document vs. how frequently it occurs in the entire collection. So things like 'the' are very common so they will be downgraded very quickly. If only 5 people in your company now C++ this will boost up the metric for this word a lot.
You might want to consider to use a stemmer like a 'porter algorithm'. This algorithm will combine words like 'statistics' and 'statistical'.
Most machine learning algorithms have a problem with very wide matrices. Unfortunately, your resume base is only 100 documents which is considered quite low vs how many potential terms you will have. The reason these techniques work for google and NSA is because human languages tend to have tens of thousands words in active use vs billions of documents they have to index. For your task I would try to shrink you dataset to no more than 30-40 columns. Be very aggressive on throwing away the common words.
Unfortunately the biggest weakness of most of the clustering techniques is that you have to set a number of clusters in advance. A common approach that people use is to set up some type of measure of how good your clusters are and start running the clustering algorithm first with very few clusters and keep increasing until your metrics starts to drop off. Look up Andrew Ng machine learning course on the interwebs. He explains these technique very well.
Of course hierarchical clustering is not affected by the point 5.
Instead of clustering you can try building decision tree. Although not super accurate, decision trees have a great advantage to visualize the built model. By looking at the three you can easily see the reason where built the way they are.
Besides scipy and scikit, which are very good. Take a look at Orange Toolbox. It has a lot of good algorithms with good visualization tools. They way you program it is just by connecting boxes with arrows. After you got satisfied with your model you can easily dump it out to the run as a script.
Hope this helps.
I'm not sure how exactly to word this question, so here's an example:
string1 = "THEQUICKBROWNFOX"
string2 = "KLJHQKJBKJBHJBJLSDFD"
I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.
In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Beyse() python library and trained it with some all-caps-strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov Chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...
Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.
Thanks for the help!
I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.
I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.
Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary with not only single words, but short phrases with the spaces removed as well. Then the scoring would not only score based on how english the seperate words are, but how english series of words are. It's not a perfect system, but it would provide consistent scoring.
I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).
The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.
For a start just compute the difference of the IC to the expected value of 1.73 (see Wikipedia above). For an advanced usage you might want to calculate the expected value yourself using some example language corpus.
I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."
The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.
Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it just was a highly ranked search result, and I don't really have expertise in this area.)