For lemmatization spacy has a lists of words: adjectives, adverbs, verbs... and also lists for exceptions: adverbs_irreg... for the regular ones there is a set of rules
Let's take as example the word "wider"
As it is an adjective the rule for lemmatization should be take from this list:
ADJECTIVE_RULES = [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
As I understand the process will be like this:
1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases is replaced directly if not one of the rules is applied.
Now, how is decided to use "er" -> "e" instead of "er"-> "" to get "wide" and not "wid"?
Here it can be tested.
Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
Class
It starts off with initializing 3 variables:
class Lemmatizer(object):
#classmethod
def load(cls, path, index=None, exc=None, rules=None):
return cls(index or {}, exc or {}, rules or {})
def __init__(self, index, exceptions, rules):
self.index = index
self.exc = exceptions
self.rules = rules
Now, looking at the self.exc for english, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it's loading files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer
Why don't Spacy just read a file?
Most probably because declaring the string in-code is faster that streaming strings through I/O.
Where does these index, exceptions and rules come from?
Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html
Rules
Looking at it even closer, the rules on https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py is similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749
And these rules originally comes from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html
Additionally, spacy had included some punctuation rules that isn't from Princeton Morphy:
PUNCT_RULES = [
["“", "\""],
["”", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
Exceptions
As for the exceptions, they were stored in the *_irreg.py files in spacy, and they look like they also come from the Princeton Wordnet.
It is evident if we look at some mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and if you download the wordnet package from nltk, we see that it's the same list:
alvas#ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
alvas#ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc
Index
If we look at the spacy lemmatizer's index, we see that it also comes from Wordnet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the re-distributed copy of wordnet in nltk:
alvas#ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
On the basis that the dictionary, exceptions and rules that spacy lemmatizer uses is largely from Princeton WordNet and their Morphy software, we can move on to see the actual implementation of how spacy applies the rules using the index and exceptions.
We go back to the https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
The main action comes from the function rather than the Lemmatizer class:
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
# TODO: Is this correct? See discussion in Issue #435.
#if string in index:
# forms.append(string)
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
else:
oov_forms.append(form)
if not forms:
forms.extend(oov_forms)
if not forms:
forms.append(string)
return set(forms)
Why is the lemmatize method outside of the Lemmatizer class?
That I'm not exactly sure but perhaps, it's to ensure that the lemmatization function can be called outside of a class instance but given that #staticmethod and #classmethod exist perhaps there are other considerations as to why the function and class has been decoupled
Morphy vs Spacy
Comparing spacy lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/ created more than a decade ago), morphy(), the main processes in Oliver Steele's Python port of the WordNet morphy are:
Check the exception lists
Apply rules once to the input to get y1, y2, y3, etc.
Return all that are in the database (and check the original too)
If there are no matches, keep applying rules until we find a match
Return an empty list if we can't find anything
For spacy, possibly, it's still under development, given the TODO at line https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76
But the general process seems to be:
Look for the exceptions, get them if the lemma from the exception list if the word is in it.
Apply the rules
Save the ones that are in the index lists
If there are no lemma from step 1-3, then just keep track of the Out-of-vocabulary words (OOV) and also append the original string to the lemma forms
Return the lemma forms
In terms of OOV handling, spacy returns the original string if no lemmatized form is found, in that respect, the nltk implementation of morphy does the same,e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Checking for infinitive before lemmatization
Possibly another point of difference is how morphy and spacy decides what POS to assign to the word. In that respect, spacy puts some linguistics rule in the Lemmatizer() to decide whether a word is the base form and skips the lemmatization entirely if the word is already in the infinitive form (is_base_form()), this will save quite a bit if lemmatization was to be done for all words in the corpus and quite a chunk of it are infinitives (already the lemma form).
But that's possible in spacy because it allowed the lemmatizer to access the POS that's tied closely to some morphological rules. While for morphy although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.
Generalment, the 3 primary signals of morphology features needs to be teased out in the POS tag:
person
number
gender
Updated
SpaCy did make changes to their lemmatizer after the initial answer (12 May 17). I think the purpose was to make the lemmatization faster without look-ups and rules processing.
So they pre-lemmatize words and leave them in a lookup hash-table to make the retrieval O(1) for words that they have pre-lemmatized https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
Also, in efforts to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92
But the underlying lemmatization steps discussed above is still relevant to the current spacy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)
Epilogue
I guess now that we know it works with linguistics rules and all, the other question is "are there any non rule-based methods for lemmatization?"
But before even answering the question before, "What exactly is a lemma?" might the better question to ask.
TLDR: spaCy checks whether the lemma it's trying to generate is in the known list of words or exceptions for that part of speech.
Long Answer:
Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
else:
oov_forms.append(form)
if not forms:
forms.extend(oov_forms)
if not forms:
forms.append(string)
return set(forms)
For English adjectives, for instance, it takes in the string we're evaluating, the index of known adjectives, the exceptions, and the rules, as you've referenced, from this directory (for English model).
The first thing we do in lemmatize after making the string lower case is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".
Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules:
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
and we would output the following forms: ["wid", "wide"].
Then, we check if this form is in our index of known adjectives. If it is, we append it to the forms. Otherwise, we add it to oov_forms, which I'm guessing is short for out of vocabulary. wide is in the index, so it gets added. wid gets added to oov_forms.
Lastly, we return a set of either the lemmas found, or any lemmas that matched rules but weren't in our index, or just the word itself.
The word-lemmatize link you posted above works for wider, because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (word I made up) as an adjective, but it's not in the index, so it will just return blandier as the lemma.
There is a set of rules and a set of words known for each word type(adjective, noun, verb, adverb). The mapping happens here:
INDEX = {
"adj": ADJECTIVES,
"adv": ADVERBS,
"noun": NOUNS,
"verb": VERBS
}
EXC = {
"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG
}
RULES = {
"adj": ADJECTIVE_RULES,
"noun": NOUN_RULES,
"verb": VERB_RULES,
"punct": PUNCT_RULES
}
Then on this line in lemmatizer.py the correct index, rules and exc (excl I believe stands for exceptions e.g. irregular examples) get loaded:
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
self.exc.get(univ_pos, {}),
self.rules.get(univ_pos, []))
All the remaining logic is in the function lemmatize and is surprisingly short. We perform the following operations:
If there is an exception(i.e. the word is irregular) including the provided string, use it and add it to the lemmatized forms
For each rule in the order they are given for the selected word type check if it matches the given word. If it does try to apply it.
2a. If after applying the rule the word is in the list of known words(i.e. index), add it to the lemmatized forms of the word
2b. Otherwise add the word to a separate list called oov_forms(here I believe oov stands for "out of vocabulary")
In case we've found at least one form using the rules above we return the list of forms found, otherwise we return the oov_forms list.
Related
I'm preparing text for a word cloud, but I get stuck.
I need to remove all digits, all signs like . , - ? = / ! # etc., but I don't know how. I don't want to replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific signs and divide the text into words (list)
calculate freq of words
next do the stopwords script...
abstracts_list = open('new','r')
abstracts = []
allab = ''
for ab in abstracts_list:
abstracts.append(ab)
for ab in abstracts:
allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
So to set upper case characters to lower case characters you could do the following:
so just store your text to a string variable, for example STRING and next use the command
STRING=re.sub('([A-Z]{1})', r'\1',STRING).lower()
now your string will be free of capital letters.
To remove the special characters again module re can help you with the sub command :
STRING = re.sub('[^a-zA-Z0-9-_*.]', ' ', STRING )
with these command your string will be free of special characters
And to determine the word frequency you could use the module collections from where you have to import Counter.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
I'd probably try to use string.isalpha():
abstracts = []
with open('new','r') as abstracts_list:
for ab in abstracts_list: # this gives one line of text.
if not ab.isalpha():
ab = ''.join(c for c in ab if c.isalpha()
abstracts.append(ab.lower())
# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
I have a list of strings and I want to find popular prefixes. The prefixes are special in that they occur as strings in the input list.
I found a similar question here but the answers are geared to find the one most common prefix:
Find *most* common prefix of strings - a better way?
While my problem is similar, it differs in that I need to find all popular prefixes. Or to maybe state it a little simplistically, rank prefixes from most common to least.
As an example, consider the following list of strings:
in, india, indian, indian flag, bull, bully, bullshit
Prefixes rank:
in - 4 times
india - 3 times
bull - 3 times
...and so on. Please note - in, bull, india are all present in the input list.
The following are not valid prefixes:
ind
bu
bul
...since they do not occur in the input list.
What data structure should I be looking at to model my solution? I'm inclined to use a "trie" with a counter on each node that tracks how many times has that node been touched during the creation of the trie.
All suggestions are welcome.
Thanks.
p.s. - I love python and would love if someone could post a quick snippet that could get me started.
words = [ "in", "india", "indian", "indian", "flag", "bull", "bully", "bullshit"]
Result = sorted([ (sum([ w.startswith(prefix) for w in words ]) , prefix ) for prefix in words])[::-1]
it goes through every word as a prefix and checks how many of the other words start with it and then sorts the result. the[::-1] simply reverses that order
If we know the length of the prefix (say 3)
from nltk import FreqDist
suffixDist=FreqDist()
for word in vocabulary:
suffixDist[word[-3:]] +=1
commonSuffix=[suffix for (suffix,count) in suffixDist.most_common(150) ]
print(commonSuffix)
I have a line of text that I want to make into N3 format so i can eventually change them to RDF. Each line of the text file has an entry like this:
09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 # 09646208 n 0000 #m 08404938 n 0000 + 01594891 a 0306 + 01594891 a 0102 ~ 09860027 n 0000 ~ 09892248 n 0000 ~ 10103592 n 0000 ~ 10194721 n 0000 ~ 10304832 n 0000 ~ 10492384 n 0000 ~ 10493649 n 0000 ~ 10525325 n 0000 ~ 10526235 n 0000 | a member of the aristocracy
I am trying to make triples out of the above statement so they will look like the table below.
Subject Predicate Object
(synset_offset)
09807754 lex_filenum 18
09807754 ss_type n
09807754 lexical_entry aristocrat
09807754 lexical_entry blue_blood
09807754 lexical_entry patrician
09807754 has_pointer 09623038
09623038 ss_type n
09623038 source_target 0000
09807754 description a member of aristocracy
I have been able to read most of the variables from each line of the text using this:
f = open("wordnetSample.txt", "r")
for line in f:
L = line.split()
L2 = line.split('|')
synset_offset = L[0]
lex_filenum = L[1]
ss_type = L[2]
word = (L[4:4 + 2 * int(L[3]):2])
gloss = (L2[1].split('\n')[0])
The problem I am having is that I don't know what namespaces to use or anything like that. I am new to this style of formatting and to python in general. I have been researching and feel it should be something like this:
'''<http://example.org/#'''+synset_offset+'''> <http://xmlns.com/foaf/0.1/lex_filenum> '''+lex_filenum+''' .
I have also been told that Turtle notation may be a better option, but i just cant get my head around it.
In RDF, resources and properties are identified by IRIs. The choice of how you select resource and property IRIs is really up to you. If you have own a domain name, you might choose to use IRIs based on that. If you are pulling data from someplace else, and it makes sense to use names based on that, you might choose to use IRIs based on that. If some of the resources or properties are already identified somewhere by IRIs, it's always good to try to reuse those, but it's not always easy to find those.
In your case, where the data is coming from WordNet, you should probably be very interested in the W3C Working Draft, RDF/OWL Representation of WordNet. I don't know whether the approaches and namespaces therein have been widely adopted or not, but the approach is surely something that you can learn something from. For instance
Each instance of Synset, WordSense and Word has its own URI. There is a pattern for the URIs so that (a) it is easy to determine from the URI the class to which the instance belongs; and (b) the URI provides some information on the meaning of the entity it represents. For example, the following URI
http://www.w3.org/2006/03/wn/wn20/instances/synset-bank-noun-2
is a NounSynset. This NounSynset contains a WordSense which is the first sense of the word "bank". The pattern for instances of Synset is: wn20instances: + synset- + %lexform%- + %type%- + %sensenr%. The %lexform% is the lexical form of the first WordSense of the Synset (the first WordSense in the Princeton source as signified by its "wordnumber", see Overview of the WordNet Prolog distribution). The %type% is one of noun, verb, adjective, adjective satellite and adverb. The %sensenr% is the number of the WordSense that is contained in the synset. This pattern produces a unique URI because the WordSense uniquely identifies the synset (a WordSense belongs to exactly one Synset).
The schema also defines lots of properties for the WordNet schema. You should probably reuse these IRIs where possible.
I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like it's OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
def __call__(self, match):
group = match.group(0)
if group[1:].lower().endswith('_nm'):
return '(?:' + Matcher(group).regex[1:]
else:
return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
year = r'(1[89][0-9]{2}|20[0-9]{2})'
year_upper = year
age = r'([1-9][0-9]|1[01][0-9])'
age_upper = age
ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
ordinal_upper = ordinal
date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
date_upper = date
matchers = [
'name_component',
'year',
'age',
'ordinal',
'date',
]
def __init__(self, match=''):
capitalized = '_upper' if match.isupper() else ''
match = match.lower()[1:]
if match.endswith('_instant'):
match = match[:-8]
if match in self.matchers:
self.regex = getattr(self, match + capitalized)
elif len(match) == 1:
elif 'year' in match:
self.regex = getattr(self, 'year')
else:
self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
class Pattern(object):
def __init__(self, text='', escape=None):
self.text = text
self.matchers = []
escape = not self.text.startswith('!') if escape is None else False
if escape:
self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
else:
self.regex = self.text[1:]
self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
self.regex = re.sub(r'\s+', r'\\s+', self.regex)
def search(self, text):
return re.search(self.regex, text)
def findall(self, text, max_depth=1.0):
results = []
length = float(len(text))
for result in re.finditer(self.regex, text):
if result.start() / length < max_depth:
results.extend(result.groups())
return results
def match(self, text):
result = map(lambda x: (x.groupdict(), x.start()), re.finditer(self.regex, text))
if result:
return result
else:
return []
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
I have similar problem, mainly because of the problem with exporting data from Microsoft Office 2010 and the result is a join between two consecutive words at somewhat regular interval. The domain area is morhological operation like a spelling-checker. You can jump to machine learning solution or create a heuristics solution like I did.
The easy solution is to assume that the the newly-formed word is a combination of proper names (with first character capitalized).
The Second additional solution is to have a dictionary of valid words, and try a set of partition locations which generate two (or at least one) valid words. Another problem may arise when one of them is proper name which by definition is out of vocabulary in the previous dictionary. perhaps one way we can use word length statistic which can be used to identify whether a word is a mistakenly-formed word or actually a legitimate one.
In my case, this is part of manual correction of large corpora of text (a human-in-the-loop verification) but the only thing which can be automated is selection of probably-malformed words and its corrected recommendation.
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If the doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words .
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id Word Lemma Char begin Char end POS NER Normalized NER
1 Johnson Johnson 0 7 NNP PERSON
2 John John 8 12 NNP PERSON
3 Doe Doe 13 16 NNP PERSON
4 Maybe maybe 17 22 RB O
5 a a 23 24 DT O
6 Nickname nickname 25 33 NN MISC
7 Why why 34 37 WRB MISC
8 is be 38 40 VBZ O
9 this this 41 45 DT O
10 text text 46 50 NN O
11 here here 51 55 RB O
12 January January 56 63 NNP DATE 2012-01-25
13 25 25 64 66 CD DATE 2012-01-25
14 2012 2012 67 71 CD DATE 2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If as appears to be the case you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a now much simpler job on the remainder of the input string.
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...
I'm working with a large database of businesses.
I'd like to be able to compare two business names for similarity to see if they possibly might be duplicates.
Below is a list of business names that should test as having a high probability of being duplicates, what is a good way to go about this?
George Washington Middle Schl
George Washington School
Santa Fe East Inc
Santa Fe East
Chop't Creative Salad Co
Chop't Creative Salad Company
Manny and Olga's Pizza
Manny's & Olga's Pizza
Ray's Hell Burger Too
Ray's Hell Burgers
El Sol
El Sol de America
Olney Theatre Center for the Arts
Olney Theatre
21 M Lounge
21M Lounge
Holiday Inn Hotel Washington
Holiday Inn Washington-Georgetown
Residence Inn Washington,DC/Dupont Circle
Residence Inn Marriott Dupont Circle
Jimmy John's Gourmet Sandwiches
Jimmy John's
Omni Shoreham Hotel at Washington D.C.
Omni Shoreham Hotel
I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.
First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.
Parsing consists of applying various cleaning techniques. Some examples:
Standardizing lettercase (e.g., all lowercase)
Standardizing punctuation (e.g., commas must be followed by spaces)
Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
Standardizing legal control terms (e.g., converting "Co." to "Company")
In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list.
Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.
In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.
Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.
I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was a manual checking planned, anyway.
I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.
Parsing
Let's use this odd name as an example.
name = "THE | big,- Pharma: LLC" # example of a company name
We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:
from cleanco import cleanco
name = cleanco(name).clean_name() # 'THE | big,- Pharma'
Remove all punctuation:
name = name.translate(None, string.punctuation) # 'THE big Pharma'
(for unicode strings, the following code works instead (source, regex):
import regex
name = regex.sub(ur"[[:punct:]]+", "", name) # u'THE big Pharma'
Split the name into tokens using NLTK:
import nltk
tokens = nltk.word_tokenize(name) # ['THE', 'big', 'Pharma']
Lowercase all tokens:
tokens = [t.lower() for t in tokens] # ['the', 'big', 'pharma']
Remove stop words. Note that it might cause problems with companies like On Mars will be incorrectly matched to Mars, because On is a stopword.
from nltk.corpus import stopwords
tokens = [t for t in tokens if t not in stopwords.words('english')] # ['big', 'pharma']
I don't cover accented and special characters here (improvements welcome).
Matching
Now, when we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenstein for this task, but is still not good enough. The reason is that it does not take into account the importance of words in the name (like TF-IDF does). So common words like "Company" influence the score just as much as words that might uniquely identify company name.
To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency(t)**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a.split())
b_tokens = set(b.split())
a_uniq = sequence_uniqueness(a_tokens)
b_uniq = sequence_uniqueness(b_tokens)
return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
Using that, you can match names with similarity exceeding certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard and Jaro-Winkler) and train a binary classification model using some labeled data, which will, given a number of scores, output if the candidate pair is a match or not. More on this can be found in the same blog post.
You could use the Levenshtein distance, which could be used to measure the difference between two sequences (basically an edit distance).
Levenshtein Distance in Python
def levenshtein_distance(a,b):
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
if __name__=="__main__":
from sys import argv
print levenshtein_distance(argv[1],argv[2])
There is great library for searching for similar/fuzzy strings for python: fuzzywuzzy. It's a nice wrapper library upon mentioned Levenshtein distance measuring.
Here how your names could be analysed:
#!/usr/bin/env python
from fuzzywuzzy import fuzz
names = [
("George Washington Middle Schl",
"George Washington School"),
("Santa Fe East Inc",
"Santa Fe East"),
("Chop't Creative Salad Co",
"Chop't Creative Salad Company"),
("Manny and Olga's Pizza",
"Manny's & Olga's Pizza"),
("Ray's Hell Burger Too",
"Ray's Hell Burgers"),
("El Sol",
"El Sol de America"),
("Olney Theatre Center for the Arts",
"Olney Theatre"),
("21 M Lounge",
"21M Lounge"),
("Holiday Inn Hotel Washington",
"Holiday Inn Washington-Georgetown"),
("Residence Inn Washington,DC/Dupont Circle",
"Residence Inn Marriott Dupont Circle"),
("Jimmy John's Gourmet Sandwiches",
"Jimmy John's"),
("Omni Shoreham Hotel at Washington D.C.",
"Omni Shoreham Hotel"),
]
if __name__ == '__main__':
for pair in names:
print "{:>3} :: {}".format(fuzz.partial_ratio(*pair), pair)
>>> 79 :: ('George Washington Middle Schl', 'George Washington School')
>>> 100 :: ('Santa Fe East Inc', 'Santa Fe East')
>>> 100 :: ("Chop't Creative Salad Co", "Chop't Creative Salad Company")
>>> 86 :: ("Manny and Olga's Pizza", "Manny's & Olga's Pizza")
>>> 94 :: ("Ray's Hell Burger Too", "Ray's Hell Burgers")
>>> 100 :: ('El Sol', 'El Sol de America')
>>> 100 :: ('Olney Theatre Center for the Arts', 'Olney Theatre')
>>> 90 :: ('21 M Lounge', '21M Lounge')
>>> 79 :: ('Holiday Inn Hotel Washington', 'Holiday Inn Washington-Georgetown')
>>> 69 :: ('Residence Inn Washington,DC/Dupont Circle', 'Residence Inn Marriott Dupont Circle')
>>> 100 :: ("Jimmy John's Gourmet Sandwiches", "Jimmy John's")
>>> 100 :: ('Omni Shoreham Hotel at Washington D.C.', 'Omni Shoreham Hotel')
Another way of solving such kind of problems could be Elasticsearch, which also supports fuzzy searches.
I searched for "python edit distance" and this library came as the first result: http://www.mindrot.org/projects/py-editdist/
Another Python library that does the same job is here: http://pypi.python.org/pypi/python-Levenshtein/
An edit distance represents the amount of work you need to carry out to convert one string to another by following only simple -- usually, character-based -- edit operations. Every operation (substition, deletion, insertion; sometimes transpose) has an associated cost and the minimum edit distance between two strings is a measure of how dissimilar the two are.
In your particular case you may want to order the strings so that you find the distance to go from the longer to the shorter and penalize character deletions less (because I see that in many cases one of the strings is almost a substring of the other). So deletion shouldn't be penalized a lot.
You could also make use of this sample code: http://norvig.com/spell-correct.html
This a bit of an update to Dennis comment. That answer was really helpful as was the links he posted but I couldn't get them to work right off. After trying the Fuzzy Wuzzy search I found this gave me a bunch better set of answers. I have a large list of merchants and I just want to group them together. Eventually I'll have a table I can use to try some machine learning to play around with but for now this takes a lot of the effort out of it.
I only had to update his code a little bit and add a function to create the tokens2frequency dictionary. The original article didn't have that either and then the functions didn't reference it correctly.
import pandas as pd
from collections import Counter
from cleanco import cleanco
import regex
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# token2frequency is just a Counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency[t]**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a)
b_tokens = set(b)
a_uniq = sequence_uniqueness(a, token2frequency)
b_uniq = sequence_uniqueness(b, token2frequency)
if a_uniq==0 or b_uniq == 0:
return 0
else:
return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency)/(a_uniq * b_uniq) ** 0.5
def parse_name(name):
name = cleanco(name).clean_name()
#name = name.translate(None, string.punctuation)
name = regex.sub(r"[[:punct:]]+", "", name)
tokens = nltk.word_tokenize(name)
tokens = [t.lower() for t in tokens]
tokens = [t for t in tokens if t not in stopwords.words('english')]
return tokens
def build_token2frequency(names):
alltokens = []
for tokens in names.values():
alltokens += tokens
return Counter(alltokens)
with open('marchants.json') as merchantfile:
merchants = pd.read_json(merchantfile)
merchants = merchants.unique()
parsed_names = {merchant:parse_name(merchant) for merchant in merchants}
token2frequency = build_token2frequency(parsed_names)
grouping = {}
for merchant, tokens in parsed_names.items():
grouping[merchant] = {merchant2: name_similarity(tokens, tokens2, token2frequency) for merchant2, tokens2 in parsed_names.items()}
filtered_matches = {}
for merchant in pcard_merchants:
filtered_matches[merchant] = {merchant1: ratio for merchant1, ratio in grouping[merchant].items() if ratio >0.3 }
This will give you a final filtered list of names and the other names they match up to. It's the same basic code as the other post just with a couple of missing pieces filled in. This also is run in Python 3.8
Consider using the Diff-Match-Patch library. You'd be interested in the Diff process - applying a diff on your text can give you a good idea of the differences, along with a programmatic representation of them.
What you can do is separate the words by whitespaces, commas, etc. and then you you count the number of words it have in common with another name and you add a number of words thresold before it is considered "similar".
The other way is to do the same thing, but take the words and splice them for each caracters. Then for each words you need to compare if letters are found in the same order (from both sides) for an x amount of caracters (or percentage) then you can say that the word is similar too.
Ex: You have sqre and square
Then you check by caracters and find that sqre are all in square and in the same order, then it's a similar word.
The algorithms that are based on the Levenshtein distance are good (not perfect) but their main disadvantage is that they are very slow for each comparison and concerning the fact that you would have to compare every possible combination.
Another way of working out the problem would be, to use embedding or bag of words to transform each company name (after some cleaning and prepossessing ) into a vector of numbers. And after that you apply an unsupervised or supervised ML method depending on what is available.
I created matchkraft (https://github.com/MatchKraft/matchkraft-python). It works on top of fuzzy-wuzzy and you can fuzzy match company names in one list.
It is very easy to use. Here is an example in python:
from matchkraft import MatchKraft
mk = MatchKraft('<YOUR API TOKEN HERE>')
job_id = mk.highlight_duplicates(name='Stackoverflow Job',
primary_list=[
'George Washington Middle Schl',
'George Washington School',
'Santa Fe East Inc',
'Santa Fe East',
'Rays Hell Burger Too',
'El Sol de America',
'microsoft',
'Olney Theatre',
'El Sol'
]
)
print (job_id)
mk.execute_job(job_id=job_id)
job = mk.get_job_information(job_id=job_id)
print (job.status)
while (job.status!='Completed'):
print (job.status)
time.sleep(10)
job = mk.get_job_information(job_id=job_id)
results = mk.get_results_information(job_id=job_id)
if isinstance(results, list):
for r in results:
print(r.master_record + ' --> ' + r.match_record)
else:
print("No Results Found")