NLTK tokenize with collocations - python

I'm using NLTK and would like to tokenize a text with respect to collocations: for instance, "New York" should be a single token, whereas naïve tokenization would split it into "New" and "York".
I know how to find collocations and how to tokenize, but I can't figure out how to combine the two...
Thanks.

The approach that seems right for you is called Named Entity Recognition (NER). There are many resources on using NLTK for Named Entity Recognition; here is just one example:
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk

def extract_entities(text):
    entities = []
    for sentence in sent_tokenize(text):
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
        # Named-entity chunks are Tree objects; plain tokens are (word, tag) tuples.
        # (Older NLTK versions used the attribute 'node' instead of 'label'.)
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

if __name__ == '__main__':
    text = """
A multi-agency manhunt is under way across several states and Mexico after
police say the former Los Angeles police officer suspected in the murders of a
college basketball coach and her fiancé last weekend is following through on
his vow to kill police officers after he opened fire Wednesday night on three
police officers, killing one.
"In this case, we're his target," Sgt. Rudy Lopez from the Corona Police
Department said at a press conference.
The suspect has been identified as Christopher Jordan Dorner, 33, and he is
considered extremely dangerous and armed with multiple weapons, authorities
say. The killings appear to be retribution for his 2009 termination from the
Los Angeles Police Department for making false statements, authorities say.
Dorner posted an online manifesto that warned, "I will bring unconventional
and asymmetrical warfare to those in LAPD uniform whether on or off duty."
"""
print extract_entities(text)
Output:
[Tree('GPE', [('Mexico', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Rudy', 'NNP')]), Tree('ORGANIZATION', [('Lopez', 'NNP')]), Tree('ORGANIZATION', [('Corona', 'NNP')]), Tree('PERSON', [('Christopher', 'NNP'), ('Jordan', 'NNP'), ('Dorner', 'NNP')]), Tree('GPE', [('Los', 'NNP'), ('Angeles', 'NNP')]), Tree('PERSON', [('Dorner', 'NNP')]), Tree('GPE', [('LAPD', 'NNP')])]
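To connect this back to the original question, here is a small sketch (my own adaptation, not part of the quoted answer) that merges each recognized multi-word entity into a single underscore-joined token:

from nltk import ne_chunk, pos_tag, word_tokenize

def tokenize_with_entities(sentence):
    # Merge multi-word named entities (e.g. "New York") into single tokens.
    tokens = []
    for chunk in ne_chunk(pos_tag(word_tokenize(sentence))):
        if hasattr(chunk, 'label'):  # a named-entity subtree
            tokens.append('_'.join(word for word, tag in chunk.leaves()))
        else:                        # an ordinary (word, tag) pair
            tokens.append(chunk[0])
    return tokens

print(tokenize_with_entities("I moved from New York to Los Angeles."))
# e.g. ['I', 'moved', 'from', 'New_York', 'to', 'Los_Angeles', '.']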
Another approach is to use measures of the information overlap between two random variables, such as mutual information, pointwise mutual information, or the t-test. There is a good introduction in Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze; Chapter 5, "Collocations", is available for download. A sketch of extracting collocations with NLTK follows.
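This is a rough, untested sketch: bigrams are scored with PMI via nltk.collocations and then merged into single tokens with nltk.tokenize.MWETokenizer (the MWETokenizer step is my addition, not something from the quoted answer):

from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import MWETokenizer

text = "I love New York. New York has great pizza."  # toy example
tokens = word_tokenize(text)

# Score bigrams by pointwise mutual information and keep the 10 best.
finder = BigramCollocationFinder.from_words(tokens)
collocations = finder.nbest(BigramAssocMeasures().pmi, 10)

# Re-tokenize so each collocation becomes a single token such as 'New_York'.
mwe_tokenizer = MWETokenizer(collocations, separator='_')
print(mwe_tokenizer.tokenize(tokens))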

Related

How to delete the tokens of my text if it is not in vocabulary?

I have defined a vocabulary and a text, and I want to delete the tokens of my text that are not in the vocabulary.
vocbulary=['Get','owns', 'about', 'for', 'to', 'by', 'person', 'movie', 'school', 'Movie', 'Person']
texte=["good morning i oll know most terrible things you've done in your life up to this point but clearly a commons ot of balance to get a sign to my class i professor annile's keeting and this is criminal law one hundred or as i prefered to college ow to get away with murder\nunlike many of my colleagues i will not be teaching you how to study the law a theorize about it but rather how to practise what was the men"]
for word in nltk.word_tokenize(texte):
    if word not in vocabulary:
        x = re.sub(word, " ", texte)
This is what x returns:
x
'good morning i oll know most terrible things you've done in your life up to this point but clearly a commons ot of balance to get a sign to my class i professor annile's keeting and this is criminal law one hundred or as i prefered to college ow to get away with murder\nunlike many of my colleagues i will not be teaching you how to study the law a theorize about it but rather how to practise what was the men'
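For what it's worth, here is a minimal sketch of one possible fix (my own, with the question's text shortened): word_tokenize expects a string rather than a list, and it is simpler to rebuild the text from the kept tokens than to call re.sub repeatedly:

import nltk

vocabulary = ['Get', 'owns', 'about', 'for', 'to', 'by', 'person',
              'movie', 'school', 'Movie', 'Person']
texte = ["good morning i oll know most terrible things you've done in your life up to this point"]

tokens = nltk.word_tokenize(texte[0])  # tokenize the string, not the list
kept = [word for word in tokens if word in vocabulary]
x = " ".join(kept)
print(x)  # only the tokens that appear in the vocabulary remain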

How to Find Company Names in Text Using Python

I have a list of properly-formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa or American Airlines Group Inc may appear as American Airlines.
How would I go about iterating over the entire contents of the document and then return the properly formatted company name when a close match is found?
I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that they look at each individual word rather than at clusters of words:
from fuzzywuzzy import process
from difflib import get_close_matches
company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']
text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'
# using fuzzywuzzy
for word in text.split():
    print('- ' + word + ', ', ', '.join(map(str, process.extractOne(word, company_name))))

# using get_close_matches
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)
I was working on a similar problem. Fuzzywuzzy internally uses difflib and both of them perform slowly on large datasets.
Chris van den Berg's pipeline converts company names into vectors of 3-grams using a TF-IDF matrix and then compares the vectors using cosine similarity.
The pipeline is quick and gives accurate results for partially matched strings too.
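For illustration, here is a minimal sketch of that idea using scikit-learn (my own code, not the linked pipeline): character 3-gram TF-IDF vectors compared with cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']
mentions = ['American Tower', 'American Airlines',
            'Atlantic American Corp', 'American International Group Inc']

# Fit the vectorizer on the canonical names, then project the mentions into the same space.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
name_vectors = vectorizer.fit_transform(company_name)
mention_vectors = vectorizer.transform(mentions)

# For each mention, pick the canonical name with the highest cosine similarity.
similarities = cosine_similarity(mention_vectors, name_vectors)
for mention, row in zip(mentions, similarities):
    print(mention, '->', company_name[row.argmax()], round(float(row.max()), 2))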
For that type of task I use a record linkage algorithm; it will find those clusters for you with the help of ML. You will have to provide some labelled examples so the algorithm can learn to label the rest of your dataset properly.
Here is some info:
https://pypi.org/project/pandas-dedupe/
Cheers,

Object Standardization Using NLTK

I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it out on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text

_lookup_words(
"The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
In your case, the lookup dict has the abbreviations EC and ECSC among the words found in your input sentence. Calling split() splits the input on whitespace, but your sentence contains the tokens 'ECSC.' and 'ECSC:', i.e. these are the tokens obtained after splitting, as opposed to 'ECSC', so you are not able to map them. I would suggest doing some de-punctuation and running it again.
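A minimal sketch of that suggestion (my own adaptation of the code above): strip punctuation from each token before the lookup, and compare case-insensitively, since the dictionary keys are upper-case while the original code lowercases each word:

import string

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def _lookup_words(input_text):
    new_words = []
    for word in input_text.split():
        # Drop surrounding punctuation such as '.' or ':' and normalize case before the lookup.
        key = word.strip(string.punctuation).upper()
        new_words.append(lookup_dict.get(key, word))
    return " ".join(new_words)

print(_lookup_words("In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and Euratom."))
# Note: punctuation stripped from a replaced token (e.g. the ':' after 'ECSC') is not re-attached in this sketch.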

The fastest way to remove items that match a substring from a list - Python

What is the fastest way to remove items from the list that match substrings in the set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements), and I'd like to remove any elements that contain the strings in the set, for example:
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What would be the fastest way? Is looping through the list the fastest?
The Aho-Corasick algorithm was specifically designed for exactly this task. It has the distinct advantage of having a much lower time complexity O(n+m) than nested loops O(n*m) where n is the number of strings to find and m is the number of strings to be searched.
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
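As an illustration only (this assumes the pyahocorasick package from PyPI, which is not necessarily the implementation linked above), the automaton is built once and every line is then scanned in a single pass:

import ahocorasick  # pip install pyahocorasick

needles = {"Donald Trump", "Trump Organization", "Donald J. Trump",
           "D.J. Trump", "dump", "dd"}
the_list = ['He is the chairman and president of The Trump Organization.',
            'Since then he has built hotels, casinos and golf courses.']  # sample lines from the question

# Build the automaton once; searching each line is then roughly linear in its length.
automaton = ahocorasick.Automaton()
for needle in needles:
    automaton.add_word(needle, needle)
automaton.make_automaton()

def contains_needle(line):
    # automaton.iter() yields (end_index, value) for every match found in the line.
    return any(True for _ in automaton.iter(line))

filtered = [line for line in the_list if not contains_needle(line)]
print(filtered)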
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If you don't have the strings in memory, or if you want a more memory-efficient approach, you can use a generator expression instead:
new = (line for line in the_list if not any(item in line for item in set_of_words))

Extracting more similar words from a list of words

So I have a list of words describing a particular group. For example, one group is based around pets.
The words for the example group, pets, are as follows:
[pets, pet, kitten, cat, cats, kitten, puppies, puppy, dog, dogs, dog walking, begging, catnip, lol, catshit, thug life, poop, lead, leads, bones, garden, mouse, bird, hamster, hamsters, rabbits, rabbit, german shepherd, moggie, mongrel, tomcat, lolcatz, bitch, icanhazcheeseburger, bichon frise, toy dog, poodle, terrier, russell, collie, lab, labrador, persian, siamese, rescue, Celia Hammond, RSPCA, battersea dogs home, rescue home, battersea cats home, animal rescue, vets, vet, supervet, Steve Irwin, pugs, collar, worming, fleas, ginger, maine coon, smelly cat, cat people, dog person, Calvin and Hobbes, Calvin & Hobbes, cat litter, catflap, cat flap, scratching post, chew toy, squeaky toy, pets at home, cruft's, crufts, corgi, best in show, animals, Manchester dogs' home, manchester dogs home, cocker spaniel, labradoodle, spaniel, sheepdog, Himalayan, chinchilla, tabby, bobcat, ragdoll, short hair, long hair, tabby cat, calico, tabbies, looking for a good home, neutring, missing, spayed, neutered, declawing, deworming, declawed, pet insurance, pet plan, guinea pig, guinea pigs, ferret, hedgehogs, minipigs, mastiff, leonburger, great dane, four-legged friend, walkies, goldfish, terrapin, whiskas, mr dog, sheba, iams]
Now I plan on enriching this list using NLTK.
So as a start I can get the synsets of each word. If we take cats as an example, we obtain:
Synset('cat.n.01')
Synset('guy.n.01')
Synset('cat.n.03')
Synset('kat.n.01')
Synset('cat-o'-nine-tails.n.01')
Synset('caterpillar.n.02')
Synset('big_cat.n.01')
Synset('computerized_tomography.n.01')
Synset('cat.v.01')
Synset('vomit.v.01')
For this we use NLTK's WordNet: from nltk.corpus import wordnet as wn.
We can then obtain the lemmas for each synset (a short sketch follows below). By simply adding these lemmas I in turn add quite a bit of noise; however, I also add some interesting words.
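For reference, a small sketch of how those synsets and lemma names can be collected (assuming the WordNet corpus has been downloaded):

from nltk.corpus import wordnet as wn

candidates = set()
for synset in wn.synsets('cats'):
    print(synset)                     # Synset('cat.n.01'), Synset('guy.n.01'), ...
    for lemma in synset.lemmas():
        candidates.add(lemma.name())  # lemma names such as 'cat', 'guy', 'kat', ...
print(candidates)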
But what I would like to look at is noise reduction, and would appreciate any suggestions or alternate methods to the above.
One idea I am trying is to check whether the word 'cats' appears in the synset name or definition, in order to include or exclude those lemmas.
I'd propose to use semantic similarity here with a variant of kNN: for each candidate word, compute its pairwise semantic similarity to all gold-standard words, keep only the k most similar gold-standard words (try different k from 5 to 100), compute the average (or sum) of the similarities to these k words, and then use this value to discard noise candidates, either by sorting and keeping only the n best or by cutting off at an experimentally defined threshold.
Semantic similarity can be computed on the basis of WordNet (see the related question) or on the basis of vector models learned by word2vec or similar techniques (again, see the related question).
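A rough sketch of that kNN filtering using WordNet path similarity (one possible choice of similarity measure; vector-model similarities would plug in the same way):

from nltk.corpus import wordnet as wn

def max_path_similarity(word_a, word_b):
    # Best path similarity over all noun-synset pairs of the two words (0 if there are none).
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a, pos=wn.NOUN)
              for s2 in wn.synsets(word_b, pos=wn.NOUN)]
    return max(scores, default=0.0)

def knn_score(candidate, gold_words, k=5):
    # Average similarity of the candidate to its k most similar gold-standard words.
    sims = sorted((max_path_similarity(candidate, g) for g in gold_words), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0

gold = ['pet', 'cat', 'dog', 'kitten', 'puppy', 'hamster', 'rabbit']
candidates = ['terrier', 'caterpillar', 'vomit', 'goldfish']
for word in sorted(candidates, key=lambda w: knn_score(w, gold), reverse=True):
    print(word, round(knn_score(word, gold), 3))
# Keep only the n best candidates, or those above an experimentally chosen threshold.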
You can try to use this technique with all words as candidates, or with all/some words occurring in domain-specific texts; in the latter case the task is called automatic term recognition, and its methods can be used for your problem directly or as a source of candidates. Search for them on Google Scholar; as an example with a short description of existing approaches and links to surveys, see this paper:
Fedorenko, D., Astrakhantsev, N., & Turdakov, D. (2013). Automatic recognition of domain-specific terms: an experimental evaluation. In SYRCoDIS (pp. 15-23).
