So my data set currently looks like the following:
['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],
....
etc.
Now what I would love to do is cluster these with regard to one another, using the Levenshtein distance to calculate word scores.
I would then iterate through all of the lists and compare the distances between them, so that they end up grouped like the following:
microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....
The question is how to do this. Should I calculate each Levenshtein distance on a word-by-word basis, i.e. for ['amazon', 'zappos'] and ['microsoft','bizspark'] I would first get the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance of each pair?
Or should I really just create strings from these and then calculate the distance?
What I should then end up with is an N×N matrix of distances:
                          | ['microsoft','bizspark'] | ['amazon', 'zappos'] | ...
['microsoft','bizspark']  |            1             |          ?           | ...
['amazon', 'zappos']      |            ?             |          1           | ...
...
Then how do I apply clustering to this to determine a cut-off threshold?
One such suggestion using single words is discussed here
But I'm not sure how to go about it with regard to word lists?
Please note, regarding implementation I am using Python libraries such as NumPy, SciPy and Pandas as needed.
What you match against probably depends primarily on what your goals are. If you want to match either word, you probably should match against both words separately. If you want to match against phrases, then ' '.join()'ing them is probably a good idea.
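For the matrix-plus-threshold part of the question, here is a minimal sketch of the joined-string approach. It uses difflib from the standard library for the distance (any Levenshtein implementation could be swapped in) and SciPy's hierarchical clustering to apply a cut-off; the 0.6 threshold is just an illustrative value.
from difflib import SequenceMatcher
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

data = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]
docs = [' '.join(words) for words in data]

# N x N symmetric distance matrix (1 - similarity ratio)
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1 - SequenceMatcher(None, docs[i], docs[j]).ratio()
        dist[i, j] = dist[j, i] = d

# hierarchical clustering, then cut the tree at a chosen distance threshold
Z = linkage(squareform(dist), method='average')
labels = fcluster(Z, t=0.6, criterion='distance')
for doc, label in zip(docs, labels):
    print(label, doc)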
BTW, I recently did some fuzzy matching using difflib.get_close_matches(). It's in the Python Standard Library. I don't have anything against whatever Levenshtein distance library you may use; I just wanted to point out this option as one that worked for me.
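For reference, a quick sketch of how that stdlib function behaves:
from difflib import get_close_matches

# returns up to n candidates scoring above the cutoff ratio, best first
print(get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], n=2, cutoff=0.6))
# ['apple', 'ape']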
Maybe "frequent itemset mining" is more what you looking for than clustering.
It will find frequent word combinations, such that each document may be part of multiple patterns.
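If you want to try that route, here is a minimal sketch using the mlxtend library (an assumption on my part; any frequent itemset miner would do) on transactions shaped like the question's lists:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
                ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

# one-hot encode the transactions, then mine itemsets above a support threshold
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(apriori(onehot, min_support=0.3, use_colnames=True))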
I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and averaged the vectors for every definition. The result, as an example with 3-dimensional vectors:
L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}
What I want to do is cluster (using K-means probably, but I'm open to other ideas) the keys of the dict by their numpy-array values.
I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data structure I can convert this to? I'm inclined to write it to a CSV / make it into a pandas DataFrame and use Pandas or R to work on it like that, but I'm told that floats are a problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to HDF5, but dictionaries are not supported.
Thanks in advance!
If I understand your question correctly, you want to cluster words according to their W2V representation, but you are saving them as a dictionary. If that's the case, I don't think it is a unique situation at all. All you have to do is convert the dictionary into a matrix and then perform the clustering on the matrix. If you represent each row of the matrix as one word in your dictionary, you should be able to reference the words back after clustering.
I couldn't test the code below, so it may not be completely functional, but the idea is the following:
from nltk.cluster import KMeansClusterer
import nltk

# make the matrix with the words (list() so the keys can be indexed later)
words = list(L1_words.keys())
X = [L1_words[w] for w in words]

# perform the clustering on the matrix
NUM_CLUSTERS = 3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

# print the cluster each word belongs to
for i in range(len(words)):
    print(words[i], assigned_clusters[i])
You can read about it in more detail at this link.
I have 12 million company names in my db. I want to match them with a list offline.
I want to know the best algorithm to do so. I have done that with Levenshtein distance but it is not giving the expected results. Could you please suggest some algorithms for this? The problem is matching companies like:
G corp. ----this need to be mapped to G corporation
water Inc -----Water Incorporated
You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
\s+inc\.?$ -> Incorporated
\s+corp\.?$ -> Corporation
You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
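A rough sketch of that normalization step (the suffix rules here are just illustrative; you would build out the mapping for your own data):
import re

SUFFIX_RULES = [
    (re.compile(r'\s+inc\.?$', re.IGNORECASE), ' incorporated'),
    (re.compile(r'\s+corp\.?$', re.IGNORECASE), ' corporation'),
]

def normalize(name):
    name = name.strip().lower()
    for pattern, replacement in SUFFIX_RULES:
        name = pattern.sub(replacement, name)
    name = re.sub(r'[^\w\s]', ' ', name)       # drop punctuation
    return re.sub(r'\s+', ' ', name).strip()   # collapse whitespace

print(normalize('G corp.'))    # -> 'g corporation'
print(normalize('water Inc'))  # -> 'water incorporated'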
You can then use Levenshtein distance or another fuzzy matching algorithm.
You can use fuzzyset: put all your company names in the fuzzy set and then match a new term to get matching scores. An example:
import fuzzyset

fz = fuzzyset.FuzzySet()
# Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

# Now see if our sample term fuzzy matches any of those specified terms
sample_term = 'Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
Also, if you want to work with semantics instead of just the string (which works better in such scenarios), then have a look at spaCy similarity. An example from the spaCy docs:
import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Interzoid's Company Name Match Advanced API generates similarity keys to help solve this. You call the API to generate a similarity key that eliminates the noise using known synonyms, Soundex, ML, etc., and you then match on the similarity key rather than the data itself for much higher match rates (commercial API; disclaimer: I work for Interzoid).
https://interzoid.com/services/getcompanymatchadvanced
Use MatchKraft to fuzzy match company names on two lists.
http://www.matchkraft.com/
Levenshtein distance is not enough to solve this problem. You also need the following:
Heuristics to improve execution time
Information retrieval (Lucene) and SQL
Company names database
It is better to use an existing tool rather than creating your program in Python.
I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
"car",
"red-car",
"lorry",
"broken lorry",
"truck owner",
"train",
...
]
I am trying to find a way to determine the most similar word from a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
I have tried a number of approaches to this to no avail, including:
Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.
The problem is, this only really works if the input contains the exact word(s) that are in the dataset - so for example, in the case where the input = "trai" then it would have a cosine value of 0, whereas I am trying to get it to map to the value "train" in the dataset.
The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result, even when the words are slightly different, i.e.:
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest other potential approaches I could try, that would be much appreciated.
As #Aaalok has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates include
Levenshtein distance (measures the number of changes to transform one string into the other)
N-gram similarity (measures the number of shared n-grams between both strings)
Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into
red-car red car
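As a concrete illustration of the n-gram idea, here is a minimal sketch using character n-grams with scikit-learn's TfidfVectorizer on the toy dataset above; misspellings like "redcar" or "lorri" still share most of their character n-grams with the intended entry:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]

# character n-grams make the match robust to typos and missing separators
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
dataset_vectors = vectorizer.fit_transform(dataset)

def most_similar(sentence):
    scores = cosine_similarity(vectorizer.transform([sentence]), dataset_vectors)[0]
    return dataset[scores.argmax()]

print(most_similar("I hate my redcar"))           # likely "red-car"
print(most_similar("Look at that yellow lorri"))  # likely "lorry"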
Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you'll have to do a lot of tuning to get your results right. You could try gensim or deeplearning4j. But you may have to use some other methods to manage spelling mistakes.
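For what it's worth, a minimal doc2vec sketch with gensim (assuming gensim 4.x; the toy dataset above is far too small for the vectors to be meaningful in practice, so treat this purely as the shape of the code):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
docs = [TaggedDocument(words=text.replace('-', ' ').split(), tags=[i])
        for i, text in enumerate(dataset)]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# infer a vector for the query sentence and look up the nearest document tag
query = model.infer_vector("I love my car that is red".lower().split())
best_id, score = model.dv.most_similar([query], topn=1)[0]
print(dataset[best_id], score)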
I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories, into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entity in List A - I'd like to establish the levenshtein distance for each entity in List B. The record in List B with the highest score will be the group to which I'll assign that data point.
I'm very rusty in python - and am playing around with FuzzyWuzzy to get the distance between two string values. However - I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop for each - but like I said I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not Fuzzy) - I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like
from fuzzywuzzy import process
from collections import defaultdict

complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

group = defaultdict(list)
for name in complicated_names:
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that supplies a default value for missing keys (here, an empty list).
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
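With the toy lists above, group would presumably end up as something like {'couch': ['leather couch'], 'screwdriver': ['left-handed screwdriver'], 'peeler': ['tomato peeler']}, i.e. each complicated name filed under its closest generic name.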
I am working on text compression and I want to use knowledge of mining closed frequent sequences. The existing algorithms like GSP, CloSpan, ClaSP and BIDE mine all frequent sequences, both continuous and non-continuous. Can you help me find such an algorithm?
For example if the sequence database is
SID Sequence
1 CAABC
2 ABCB
3 CABC
4 ABBCA
and minimum support is 2
The existing algorithms consider 'CB' a subsequence of the sequence with id 1, but I don't want that.
Modern sequential pattern mining algorithms try to prune the search space to reduce running time. The search space is exponentially large because a "non-continuous" subsequence can be any combination of items from the input sequences. In your case the search space is far smaller, given that the sequences are continuous, i.e. we already know the combinations. So you can probably write an algorithm for this on your own, and it would still be reasonably fast.
Here's what a crude recursive example could look like:
def f(patt, db, minSupp):
    if len(patt) == 0:
        return ""
    # count frequency
    freq = 0
    for trans in db:
        if patt in trans:
            freq += 1
    if freq >= minSupp:
        return patt
    else:  # break it down
        r = []
        r.append(f(patt[1:], db, minSupp))   # all but the first element
        r.append(f(patt[:-1], db, minSupp))  # all but the last element
        return r
This just demonstrates one way of doing it. Of course, it's crappy.
To do this much faster, you can probably add some conditions so that a recursive call is not made when the pattern is already known.
An even faster way would be to maintain an inverted index of all patterns and then incrementally update it to create super-patterns using the Apriori condition. For this, you can refer to a slide show explaining how the Apriori algorithm works (using your own candidate generation method, one example of which is in the algorithm above).
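As a simpler baseline (not the inverted-index approach, just a brute-force support count over contiguous substrings, using the toy database from the question):
from collections import Counter

def contiguous_frequent(db, min_supp):
    counts = Counter()
    for seq in db:
        # every contiguous substring, counted at most once per sequence
        subs = {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}
        counts.update(subs)
    return {patt: supp for patt, supp in counts.items() if supp >= min_supp}

db = ['CAABC', 'ABCB', 'CABC', 'ABBCA']
print(contiguous_frequent(db, 2))   # 'CB' is excluded: its contiguous support is only 1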
As such, no algorithm exists specifically for finding continuous sequences for compression. You can just modify the existing algorithms to mine only continuous sequences. I suggest you modify the BIDE algorithm to find continuous subsequences only.