I have 12 million company names in my database and I want to match them against an offline list.
I want to know the best algorithm to do so. I have tried Levenshtein distance, but it is not giving the expected results. Could you please suggest some algorithms for this? The problem is matching companies like:
G corp.   -> G corporation
water Inc -> Water Incorporated
You should probably start by expanding the known suffixes in both sources (the database and the offline list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
\s+inc\.?$ -> Incorporated
\s+corp\.?$ -> Corporation
You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
You can then use Levenshtein distance or another fuzzy matching algorithm.
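For example, a minimal normalization pass in Python might look like this (the suffix table and the normalize helper name are only illustrative; a real mapping needs to be curated by hand for your data):

import re

# Illustrative suffix map; a real one needs manual curation.
SUFFIX_RULES = [
    (re.compile(r"\s+inc\.?$"), " incorporated"),
    (re.compile(r"\s+corp\.?$"), " corporation"),
]

def normalize(name):
    name = name.lower().strip()
    for pattern, replacement in SUFFIX_RULES:
        name = pattern.sub(replacement, name)
    name = re.sub(r"[^\w\s]", " ", name)      # drop punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse whitespace

print(normalize("G corp."))    # -> "g corporation"
print(normalize("water Inc"))  # -> "water incorporated"

Normalizing both the database names and the offline list the same way before fuzzy matching usually removes most of the noise.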
You can use fuzzyset: put all your company names in the fuzzy set and then match a new term to get matching scores. An example:
import fuzzyset

fz = fuzzyset.FuzzySet()
# Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

# Now see if our sample term fuzzy-matches any of those specified terms
sample_term = 'Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
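Note that fz.get() should return a list of (score, string) pairs ordered by score, or None if nothing clears the internal cutoff, so each call above gives you the closest stored name together with its similarity score.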
Also, if you want to work with semantics instead of just the raw string (which tends to work better in such scenarios), have a look at spaCy's similarity. An example from the spaCy docs:
import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use the larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
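Note that meaningful similarity scores require word vectors, which only ship with the medium and large models; you may need to run python -m spacy download en_core_web_md first.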
Interzoid's Company Name Match Advanced API generates similarity keys to help solve this. You call the API to generate a similarity key that eliminates the noise using known synonyms, Soundex, ML, etc., and then match on the similarity key rather than the data itself for much higher match rates. (Commercial API; disclaimer: I work for Interzoid.)
https://interzoid.com/services/getcompanymatchadvanced
Use MatchKraft to fuzzy match company names on two lists.
http://www.matchkraft.com/
Levenshtein distance is not enough to solve this problem. You also need the following:
Heuristics to improve execution time
Information retrieval (Lucene) and SQL
Company names database
It is better to use an existing tool rather than creating your program in Python.
I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
    "car",
    "red-car",
    "lorry",
    "broken lorry",
    "truck owner",
    "train",
    ...
]
I am trying to find a way to determine the most similar word from a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
I have tried a number of approaches to this to no avail, including:
Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.
The problem is, this only really works if the input contains the exact word(s) that are in the dataset; for example, if the input is "trai", it would have a cosine value of 0, whereas I am trying to get it to map to the value "train" in the dataset.
The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result, even when the words are slightly different, i.e.:
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest other potential approaches I could try, that would be much appreciated.
As @Aaalok has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates include:
Levenshtein distance (measures the number of changes to transform one string into the other)
N-gram similarity (measures the number of shared n-grams between both strings; see the sketch below)
Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into
red-car red car
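If you want to stay close to your existing TF-IDF approach, one way to get the n-gram idea in is to vectorize on character n-grams instead of whole words. This is only a sketch under that assumption, not a tuned solution:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]

# Character n-grams let partial or misspelled words ("trai", "lorri")
# still overlap with the correct entries.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vectorizer.fit_transform(dataset)

def best_match(text):
    scores = cosine_similarity(vectorizer.transform([text]), matrix)[0]
    return dataset[scores.argmax()]

print(best_match("I purchased a new lorry"))  # likely "lorry"
print(best_match("trai"))                     # likely "train"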
Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you'll have to do a lot of tuning to get your results right. You could try gensim or deeplearning4j. But you may have to use some other method to handle spelling mistakes.
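A rough gensim sketch (gensim 4.x API assumed; on a toy corpus like this the output is not meaningful, it only shows the mechanics):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
documents = [TaggedDocument(words=text.replace("-", " ").split(), tags=[i])
             for i, text in enumerate(dataset)]

# Hyperparameters here are placeholders; real results need far more data and tuning.
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=100)

inferred = model.infer_vector("I love my car that is red".split())
best_tag, score = model.dv.most_similar([inferred], topn=1)[0]
print(dataset[best_tag], score)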
I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories, into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entity in List A, I'd like to establish the Levenshtein distance to each entity in List B. The record in List B with the highest score will be the group to which I'll assign that data point.
I'm very rusty in Python and am playing around with FuzzyWuzzy to get the distance between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop for each - but like I said I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not Fuzzy) - I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like
from fuzzywuzzy import process
from collections import defaultdict

complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

group = defaultdict(list)
for name in complicated_names:
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that has default values for all keys.
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
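With the toy lists above, group should end up looking roughly like {'couch': ['leather couch'], 'screwdriver': ['left-handed screwdriver'], 'peeler': ['tomato peeler']}, and the same pattern scales directly to your equipment lists.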
I am working on text compression and I want to use knowledge of mined closed frequent sequences. Existing algorithms like GSP, CloSpan, ClaSP, and BIDE mine all frequent sequences, both contiguous and non-contiguous. Can you help me find an algorithm that mines only contiguous ones?
For example if the sequence database is
SID  Sequence
1    CAABC
2    ABCB
3    CABC
4    ABBCA
and minimum support is 2
the existing algorithms would consider 'CB' a subsequence of sequence 1, but I don't want that, because it is not contiguous there.
Modern sequential pattern mining algorithms try to prune the search space to reduce running time. The search space is exponentially large because a "non-contiguous" subsequence can be any combination of items from the input sequences. In your case the search space is far smaller, since the sequences are contiguous, i.e. we already know the combinations. So you can probably write an algorithm for this yourself, and it would still be reasonably fast.
Here's how a crude recursive example might look:
def f(patt, db, minSupp):
    # Base case: empty pattern
    if len(patt) == 0:
        return ""
    # Count frequency: number of sequences that contain patt as a contiguous substring
    freq = 0
    for trans in db:
        if patt in trans:
            freq += 1
    if freq >= minSupp:
        return patt
    else:  # break it down into shorter candidates
        r = []
        r.append(f(patt[1:], db, minSupp))   # all but the first element
        r.append(f(patt[:-1], db, minSupp))  # all but the last element
        return r
This just demonstrates one way of doing it. Of course, it is crude.
To make this much faster, you can probably add some conditions so that a recursive call is not made when the pattern has already been checked.
An even faster way would be to maintain an inverted index of all patterns and then incrementally extend them into super-patterns using the Apriori condition. For this, you can refer to a slide show explaining how the Apriori algorithm works (using your candidate generation method, one example of which is the algorithm above).
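As a sketch of that idea for contiguous patterns only (this mines all frequent contiguous subsequences, not just the closed ones, which would need an extra filtering step; the function name is just illustrative):

def frequent_contiguous(db, min_supp):
    # Support = number of sequences containing patt as a contiguous substring.
    support = lambda patt: sum(1 for seq in db if patt in seq)
    items = {c for seq in db for c in seq}
    current = {c for c in items if support(c) >= min_supp}
    result = {}
    while current:
        for patt in current:
            result[patt] = support(patt)
        # Apriori-style extension: a length-(k+1) contiguous pattern can only be
        # frequent if its length-k prefix is, so extend current patterns by one item.
        current = {patt + c for patt in current for c in items
                   if support(patt + c) >= min_supp}
    return result

db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(frequent_contiguous(db, 2))
# 'CB' is excluded: it appears contiguously only in sequence 2, so its support is 1.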
To my knowledge, no off-the-shelf algorithm exists for mining only contiguous sequences for compression. You can modify the existing algorithms to mine only contiguous sequences; I suggest modifying the BIDE algorithm to find contiguous subsequences only.
So my data set currently looks like the following:
['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],
....
etc.
Now what I would love to do is cluster these with regard to one another, using the Levenshtein distance to calculate word scores.
I would iterate through all of the lists and compare the distances, ending up with groupings like the following:
microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....
The question is how to do this. Should I calculate each Levenshtein distance on a word-by-word basis, i.e. for ['amazon', 'zappos'] and ['microsoft', 'bizspark'] I would first get the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance of each pair?
Or should I really just create strings from these and then calculate the distance?
What I should then end up with is an N x N matrix of distances:

                           ['microsoft','bizspark']   ['amazon','zappos']   ...
['microsoft','bizspark']              1                        ?
['amazon','zappos']                   ?                        1
...
Then how do I apply clustering to this to determine a cut-off threshold?
One such suggestion using single words is discussed here
But I'm not sure how to go about it with regard to word lists.
Please note, for the implementation I am using Python libraries such as NumPy, SciPy, and Pandas as needed.
What you match against probably depends primarily on what your goals are. If you want to match either word, you probably should match against both words separately. If you want to match against phrases, then ' '.join()'ing them is probably a good idea.
BTW, I recently did some fuzzy matching using difflib.get_close_matches(). It's in the Python Standard Library. I don't have anything against whatever Levenshtein distance library you may use; I just wanted to point out this option as one that worked for me.
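For reference, a minimal sketch of that option against the joined phrases from this question (the cutoff and n values are just defaults you would tune):

import difflib

phrases = ['microsoft bizspark', 'microsoft', 'microsoft skype',
           'amazon s3', 'amazon zappos', 'amazon']

# Returns up to n close matches, best first, with a similarity ratio >= cutoff.
print(difflib.get_close_matches('amazon sappos', phrases, n=3, cutoff=0.6))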
Maybe "frequent itemset mining" is more what you looking for than clustering.
It will find frequent word combinations, such that each document may be part of multiple patterns.
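A brute-force sketch of the idea on the data from the question (fine here because each record only has a couple of items; for real data you would reach for a proper Apriori or FP-growth implementation, e.g. the one in mlxtend):

from collections import Counter
from itertools import combinations

data = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
        ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

min_support = 2
counts = Counter()
for record in data:
    # Count every non-empty subset of each (tiny) record.
    for size in range(1, len(record) + 1):
        for itemset in combinations(sorted(record), size):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)  # {('microsoft',): 3, ('amazon',): 3}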
The following code that I wrote takes a set of 68,000 items and tries to find similar items based on text location in the strings. The process takes a while on this i3 4130 I'm temporarily using to code on. Is there any way to speed this up? I'm making a type of "did you mean?" function, so I need to sort on the spot based on what the user enters.
I'm not trying to compare by similarity in a dictionary that's already created using keywords, I'm trying to compare the similar between the user's input on the fly and all existing keys. The user may mistype a key, so that's why it would say "did you mean?", like Google search does.
Sorting does not affect the time, according to averaged tests.
def similar_movies(movie):
    start = time.clock()
    movie = capitalize(movie)
    similarmovies = {}
    allmovies = all_movies()  # returns set of all 68000 movies
    for item in allmovies:
        '''if similar(movie.lower(), item.lower()) > .5 or movie in item:  # older algorithm
            similarmovies[item] = similar(movie.lower(), item.lower())'''
        if movie in item:  # newer algorithm
            similarmovies[item] = 1.0
            print item
        else:
            similarmovies[item] = similar(movie.lower(), item.lower())
    similarmovieshigh = sorted(similarmovies, key=similarmovies.get, reverse=True)[:10]
    print time.clock() - start
    return similarmovieshigh
Other functions used:
from difflib import SequenceMatcher

def similar(a, b):
    output = SequenceMatcher(None, a, b).ratio()
    return output

def all_movies():  # returns set of all keys in sub-dicts (movies)
    people = list(ratings.keys())
    allmovies = []
    for item in people:
        for i in ratings[item]:
            allmovies.append(i)
    allmovies = set(allmovies)
    return allmovies
The dictionary is in this format, except with thousands of names:
ratings={'Shane': {'Avatar': 4.2, '127 Hours': 4.7}, 'Joe': {'Into The Wild': 4.5, 'Unstoppable': 3.0}}
Your algorithm is going to be O(n²), since within every title, the in operator has to check every substring of the title to determine if the entered text is within it. So yeah, I can understand why you would want this to run faster.
An i3 doesn't provide much compute power, so pre-computing as much as possible is the only solution, and running extra software such as a database is probably going to provide poor results, again due to the capability.
You might consider using a dictionary of title words (possibly with pre-computed phonetic changes to eliminate most common misspellings - the Porter Stemmer algorithm should provide some helpful reduction rules, e.g. to allow "unstop" to match "unstoppable").
So, for example, one key in your dictionary would be "wild" (or a phonetic adjustment), and the value associated with that key would be a list of all titles that contain "wild"; you would have the same for "the", "into", "avatar", "hours", "127", and all other words in your list of 68,000 titles. Just as an example, your dictionary's "wild" entry might look like:
"wild": ["Into The Wild", "Wild Wild West", "Wild Things"]
(Yes, I searched for "wild" on IMDB just so this list could have more entries - probably not the best choice, but not many titles have "avatar", "unstoppable", or "hours" in them).
Common words such as "the" might have enough entries that you would want to exclude them, so a persistent copy of the dictionary might be helpful to allow you to make specific adjustments, although it isn't necessary, and the compute time should be relatively quick at start-up.
When the user types in some text, you split the text into words, apply any phonetic reductions if you choose to use them, and then concatenate all of the title lists for all of the words from the user, including duplicates.
Then, count the duplicates and sort by how many times a title was matched. If a user types "The Wild", you'd have two matches on "Into The Wild" ("the" and "wild"), so it should sort higher than titles with only "the" or "wild" but not both in them.
Your list of ratings can be searched after the final sorted list is built, with ratings appended to each entry; this operation should be quick, since your ratings are already within a dictionary, keyed by name.
This turns an O(n²) search into an O(log(n)) search for each word entered, which should make a big difference in performance, if it suits your needs.
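A hypothetical sketch of that word-index idea, without the phonetic reduction step:

from collections import defaultdict

titles = ["Into The Wild", "Wild Wild West", "Unstoppable", "127 Hours", "Avatar"]

word_index = defaultdict(set)
for title in titles:
    for word in title.lower().split():
        word_index[word].add(title)

def did_you_mean(text, limit=10):
    # Count how many of the query words each title shares; most matches first.
    scores = defaultdict(int)
    for word in text.lower().split():
        for title in word_index.get(word, ()):
            scores[title] += 1
    return sorted(scores, key=scores.get, reverse=True)[:limit]

print(did_you_mean("The Wild"))  # "Into The Wild" should rank first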
In all_movies(): instead of appending to a list you could add to a set and not cast keys() to a list:
def all_movies():
    allmovies = set()
    for item in ratings.keys():
        for i in ratings[item]:
            allmovies.add(i)
    return allmovies
EDIT: or only using one for-loop:
def all_movies():
    result = []
    for rating_dict in ratings.values():
        result += rating_dict.keys()
    return result
Nothing I could spot in similar_movies.
Also have a look at celery: http://docs.celeryproject.org/en/latest/ for multi-processing,
especially the chunks concept: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks
If you're developing for a production system, I'd suggest using a full text search engine like Whoosh (Python), Elasticsearch (Java), or Apache Solr (Java). A full text search engine is a server that builds an index to implement full text search, including fuzzy or proximity searches, efficiently. Many popular database systems also feature a full text search engine, like PostgreSQL FTS and MySQL FTS, which may be an acceptable alternative if you are already using these database engines.
If this code is developed mostly for self-learning and you want to learn how to implement fuzzy searches, you may want to look at normalizing the movie titles in the index and the search terms. There are methods like Soundex and Metaphone that normalize terms based on how they likely sound in English, and the normalized terms can be used to build the search index. PostgreSQL has implementations of these algorithms. Note that these algorithms are very basic building blocks; a proper full text search engine will take into account misspellings, synonyms, stop words, language-specific quirks, and optimizations like parallel/distributed processing.
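If you want to experiment with that normalization idea, here is a simplified Soundex-style sketch (it skips some rules of the official algorithm, e.g. the special handling of 'h' and 'w', and is only meant to show the shape of such a normalizer):

def simple_soundex(word, length=4):
    # Map consonants to digit groups; vowels, h, w, y are dropped.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # skip non-coded letters, collapse adjacent duplicates
            result += code
        prev = code
    return (result + "0" * length)[:length]

print(simple_soundex("Unstoppable"))  # U523
print(simple_soundex("Unstopable"))   # U523 -- the misspelling maps to the same key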