I define a vocabulary and a text, and I want to delete the tokens of my text that are not in the vocabulary.
import re
import nltk

vocabulary = ['Get', 'owns', 'about', 'for', 'to', 'by', 'person', 'movie', 'school', 'Movie', 'Person']
texte = "good morning i oll know most terrible things you've done in your life up to this point but clearly a commons ot of balance to get a sign to my class i professor annile's keeting and this is criminal law one hundred or as i prefered to college ow to get away with murder\nunlike many of my colleagues i will not be teaching you how to study the law a theorize about it but rather how to practise what was the men"

for word in nltk.word_tokenize(texte):
    if word not in vocabulary:
        x = re.sub(word, " ", texte)
This is what x returns:
'good morning i oll know most terrible things you've done in your life up to this point but clearly a commons ot of balance to get a sign to my class i professor annile's keeting and this is criminal law one hundred or as i prefered to college ow to get away with murder\nunlike many of my colleagues i will not be teaching you how to study the law a theorize about it but rather how to practise what was the men'
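For comparison, a minimal sketch of one way to get the intended filtering, assuming the goal is to keep only in-vocabulary tokens and rebuild the text from them (rather than calling re.sub on the whole text once per token):

# Sketch under one assumption about the goal: keep only tokens that are in the
# vocabulary and join them back into a string.
import nltk

vocabulary = ['Get', 'owns', 'about', 'for', 'to', 'by', 'person', 'movie',
              'school', 'Movie', 'Person']
texte = "good morning i oll know most terrible things you've done ..."

kept = [word for word in nltk.word_tokenize(texte) if word in vocabulary]
x = " ".join(kept)
print(x)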
I have a list of properly-formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa or American Airlines Group Inc may appear as American Airlines.
How would I go about iterating over the entire contents of the document and then return the properly formatted company name when a close match is found?
I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that they look at each individual word rather than clusters of words:
from fuzzywuzzy import process
from difflib import get_close_matches
company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']
text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'
#using fuzzywuzzy
for word in text.split():
    print('- ' + word + ', ', ', '.join(map(str, process.extractOne(word, company_name))))

#using get_close_matches
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)
I was working on a similar problem. Fuzzywuzzy internally uses difflib and both of them perform slowly on large datasets.
Chris van den Berg's pipeline converts company names into vectors of 3-grams using a TF-IDF matrix and then compares the vectors using cosine similarity.
The pipeline is quick and gives accurate results for partially matched strings too.
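As a rough sketch of that pipeline (not Chris van den Berg's exact code; it assumes scikit-learn is available, and the candidates list below stands in for spans you would pull out of the document yourself):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']
# hypothetical candidate spans extracted from the document
candidates = ['American Tower', 'American Airlines', 'Atlantic American Corp',
              'American International Group Inc']

# character 3-grams make partial names like 'American Airlines' vs
# 'American Airlines Group Inc' look similar
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(company_name + candidates)

sims = cosine_similarity(tfidf[len(company_name):], tfidf[:len(company_name)])
for cand, row in zip(candidates, sims):
    best = row.argmax()
    print(f"{cand!r} -> {company_name[best]!r} (cosine {row[best]:.2f})")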
For that type of task I use a record linkage algorithm; it will find those clusters for you with the help of ML. You will have to provide some actual examples so the algorithm can learn to label the rest of your dataset properly.
Here is some info:
https://pypi.org/project/pandas-dedupe/
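A rough sketch of what that workflow looks like with pandas-dedupe (the call below follows the project page above, but treat the exact signature and output columns as assumptions and check the current docs):

import pandas as pd
import pandas_dedupe

df = pd.DataFrame({'company': ['Visa Inc', 'Visa', 'American Airlines Group Inc',
                               'American Airlines', 'Atlantic American Corp']})

# Starts an interactive console session where you label a few pairs as
# match / distinct; those are the "actual examples" the model learns from.
clustered = pandas_dedupe.dedupe_dataframe(df, ['company'])
print(clustered)  # rows that refer to the same company should share a cluster id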
Cheers,
I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it out on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text

_lookup_words(
    "The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
In your case, the lookup dict has the abbreviations EC and ECSC among the words found in your input sentence. Calling split() splits the input on whitespace, but your sentence contains the words "ECSC." and "ECSC:" (i.e. those are the tokens obtained after splitting, as opposed to "ECSC"), so you are not able to map the input. I would suggest doing some depunctuation and running it again.
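A minimal sketch of that fix; it also uppercases the stripped token before the lookup (my own addition beyond the depunctuation suggested above, since the posted code lowercases each word while the dictionary keys are upper-case abbreviations):

import string

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def _lookup_words(input_text):
    new_words = []
    for word in input_text.split():
        key = word.strip(string.punctuation)   # 'ECSC.' / 'ECSC:' -> 'ECSC'
        # case-insensitive lookup; note the replacement drops the token's punctuation
        new_words.append(lookup_dict.get(key.upper(), word))
    return " ".join(new_words)

print(_lookup_words("alongside the ECSC: the eec and the European Atomic Energy Community (Euratom)."))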
What is the fastest way to remove items in the list that match substrings in the set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements), and I'd like to remove any elements that contain the strings in the set, for example:
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What would be the fastest way? Is looping through the fastest?
The Aho-Corasick algorithm was designed for exactly this task. It has the distinct advantage of a much lower time complexity, roughly O(n + m), than the O(n*m) of nested loops, where n is the combined size of the strings to find and m is the size of the text to be searched.
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
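As one sketch using the pyahocorasick package from PyPI (one of the implementations referred to above; the_list below is a short stand-in for the list from the question):

import ahocorasick

the_list = ['He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
            'casinos',
            'golf courses']
substrings = {"Donald Trump", "Trump Organization", "Donald J. Trump",
              "D.J. Trump", "dump", "dd"}

# build one automaton over all search strings, then scan each line once
automaton = ahocorasick.Automaton()
for s in substrings:
    automaton.add_word(s, s)
automaton.make_automaton()

def contains_any(line):
    # automaton.iter() yields every substring hit in the line; one is enough to drop it
    return next(automaton.iter(line), None) is not None

filtered = [line for line in the_list if not contains_any(line)]
print(filtered)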
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If you don't have them in memory, then as a more optimized approach in terms of memory use you can use a generator expression:
new = (line for line in the_list if not any(item in line for item in set_of_words))
So I have a list of words describing a particular group. For example, one group is based around pets.
The words for the example group, pets, are as follows:
[pets, pet, kitten, cat, cats, kitten, puppies, puppy, dog, dogs, dog walking, begging, catnip, lol, catshit, thug life, poop, lead, leads, bones, garden, mouse, bird, hamster, hamsters, rabbits, rabbit, german shepherd, moggie, mongrel, tomcat, lolcatz, bitch, icanhazcheeseburger, bichon frise, toy dog, poodle, terrier, russell, collie, lab, labrador, persian, siamese, rescue, Celia Hammond, RSPCA, battersea dogs home, rescue home, battersea cats home, animal rescue, vets, vet, supervet, Steve Irwin, pugs, collar, worming, fleas, ginger, maine coon, smelly cat, cat people, dog person, Calvin and Hobbes, Calvin & Hobbes, cat litter, catflap, cat flap, scratching post, chew toy, squeaky toy, pets at home, cruft's, crufts, corgi, best in show, animals, Manchester dogs' home, manchester dogs home, cocker spaniel, labradoodle, spaniel, sheepdog, Himalayan, chinchilla, tabby, bobcat, ragdoll, short hair, long hair, tabby cat, calico, tabbies, looking for a good home, neutring, missing, spayed, neutered, declawing, deworming, declawed, pet insurance, pet plan, guinea pig, guinea pigs, ferret, hedgehogs, minipigs, mastiff, leonburger, great dane, four-legged friend, walkies, goldfish, terrapin, whiskas, mr dog, sheba, iams]
Now I plan on enriching this list using NLTK.
So as a start I can get the synsets of each word. If we take cats as an example, we obtain:
Synset('cat.n.01')
Synset('guy.n.01')
Synset('cat.n.03')
Synset('kat.n.01')
Synset('cat-o'-nine-tails.n.01')
Synset('caterpillar.n.02')
Synset('big_cat.n.01')
Synset('computerized_tomography.n.01')
Synset('cat.v.01')
Synset('vomit.v.01')
For this we use nltk's WordNet: from nltk.corpus import wordnet as wn.
We can then obtain the lemmas for each synset. By simply adding these lemmas I in turn add quite a bit of noise; however, I also add some interesting words.
But what I would like to look at is noise reduction, and would appreciate any suggestions or alternate methods to the above.
One such idea I am trying is to see whether the word 'cats' appears in the synset name or definition, and to include or exclude those lemmas on that basis.
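A minimal sketch of that heuristic (the include/exclude rule is just the idea described above, so expect it to be a rough filter rather than an established method):

from nltk.corpus import wordnet as wn

seed = 'cat'
kept_lemmas = set()
for synset in wn.synsets('cats'):
    # keep a synset's lemmas only if the seed word appears in its name or definition
    if seed in synset.name() or seed in synset.definition():
        kept_lemmas.update(lemma.replace('_', ' ') for lemma in synset.lemma_names())

print(kept_lemmas)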
I'd propose to use semantic similarity here with a variant of kNN: for each candidate word, compute the pairwise semantic similarity to all gold-standard words, keep only the k (try different k from 5 to 100) most similar gold-standard words, compute the average (or sum) of the similarities to these k words, and then use this value to discard noise candidates, either by sorting and keeping only the n best, or by a cut-off at an experimentally defined threshold.
Semantic similarity can be computed on the basis of WordNet, see related question, or on the basis of vector models learned by word2vec or similar techniques, see related question again.
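A rough sketch of the WordNet-based variant (the gold and candidate lists below are short illustrative subsets, and k and the final threshold are parameters to tune as described above):

from nltk.corpus import wordnet as wn

gold = ['pet', 'kitten', 'cat', 'dog', 'puppy', 'hamster', 'rabbit']
candidates = ['moggie', 'terrier', 'caterpillar', 'vomit', 'guy']

def best_similarity(w1, w2):
    # maximum path similarity over all noun synset pairs; 0.0 if a word is unknown
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1, pos=wn.NOUN)
              for s2 in wn.synsets(w2, pos=wn.NOUN)]
    return max(scores, default=0.0)

k = 5
for cand in candidates:
    top_k = sorted((best_similarity(cand, g) for g in gold), reverse=True)[:k]
    score = sum(top_k) / len(top_k)
    print(f"{cand}: {score:.2f}")  # keep candidates whose score clears a threshold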
Actually, you can try to use this technique with all words as candidates, or with all/some words occurring in domain-specific texts. In the latter case the task is called automatic term recognition, and its methods can be used for your problem directly or as a source of candidates; search for them on Google Scholar. As an example with a short description of existing approaches and links to surveys, see this paper:
Fedorenko, D., Astrakhantsev, N., & Turdakov, D. (2013). Automatic recognition of domain-specific terms: an experimental evaluation. In SYRCoDIS (pp. 15-23).