I would like to know if there is a way to analyze the nouns in a list, for example an algorithm that discerns different categories, so that I can tell whether a noun belongs to the category "animal", "plants", "nature" and so on.
I thought it was possible to achieve this with WordNet but, if I am not wrong, all the nouns in WordNet are categorized as "entity". Here is the script of my WordNet analysis:
from nltk.corpus import wordnet as wn

lemmas = ['dog', 'cat', 'garden', 'ocean', 'death', 'joy']
hypernyms = []
for i in lemmas:
    dog = wn.synsets(i)[0]
    temp_list = []
    hypernyms_list = [lemma.name() for synset in dog.root_hypernyms() for lemma in synset.lemmas()]
    temp_list.append(hypernyms_list)
    flat = list(set([item for sublist in temp_list for item in sublist]))
    hypernyms.append(flat)
hypernyms
And the result is: [['entity'], ['entity'], ['entity'], ['entity'], ['entity'], ['entity']].
Can anybody suggest some techniques to retrieve the category a noun belongs to, if anything like this is available?
Thanks in advance.
One approach I can suggest is using Google's NLP API. This API can identify parts of speech as part of its syntax analysis. Please refer to the documentation here:
Google's NLP API - Syntax Analysis
Another option is Stanford's NLP API. Here are the reference docs: Stanford's NLP API
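If you would rather stay within NLTK's WordNet, another rough option (not part of the answer above) is to pick a handful of category synsets yourself and check whether a noun's hypernym paths pass through one of them. The category synsets below are assumptions chosen purely for illustration:

from nltk.corpus import wordnet as wn

# Hand-picked category synsets (assumptions for illustration only)
categories = {
    'animal': wn.synset('animal.n.01'),
    'plant': wn.synset('plant.n.02'),
    'feeling': wn.synset('feeling.n.01'),
}

def category_of(word):
    # Look only at the first noun sense, as in the script above
    synset = wn.synsets(word, pos=wn.NOUN)[0]
    # Collect every synset on every path from this sense up to the root
    ancestors = {s for path in synset.hypernym_paths() for s in path}
    for label, cat in categories.items():
        if cat in ancestors:
            return label
    return 'unknown'

for lemma in ['dog', 'cat', 'garden', 'ocean', 'death', 'joy']:
    print(lemma, category_of(lemma))

With WordNet 3.0, 'dog' and 'cat' should come out as 'animal' and 'joy' as 'feeling'; words whose paths miss all of the chosen synsets stay 'unknown', so the usefulness depends entirely on which category synsets you pick.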
I have a list of words and I'm trying to turn the plural words into singular in Python, and then remove the duplicates. This is how I do it:
import spacy
nlp = spacy.load('fr_core_news_md')
words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = []
for word in words:
    doc = nlp(word)
    for token in doc:
        clean_words.append(token.lemma_)
clean_words = list(set(clean_words))
This is the output :
['animal', 'janvier', 'poule', 'adresse']
It works well, but my problem is that 'fr_core_news_md' takes a little too long to load, so I was wondering if there was another way to do this?
The task you're trying to do is called lemmatization, and it does more than convert plurals to singulars: it strips inflections and returns the canonical form of a word (the infinitive of a verb, for example).
If you want to use spaCy, you can make it load quicker by using the disable parameter.
For example: spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger']).
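As a minimal sketch of that suggestion (which components can be disabled safely varies with the spaCy version and model, so the list below is an assumption), the lemmatization loop from the question could look like this:

import spacy

# Load the French model with some pipeline components disabled to speed things up
# (only 'parser' and 'ner' are disabled here; adjust to your spaCy version/model)
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])

words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = list({token.lemma_ for word in words for token in nlp(word)})
print(clean_words)

For longer lists, nlp.pipe(words) would also batch the processing instead of calling nlp on each word separately.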
Alternatively, you can use TreeTagger, which is somewhat hard to install but works great.
Or the FrenchLefffLemmatizer.
I am using Gensim's Phraser model to find bigrams in some reviews, to be used later in an LDA topic-modelling scenario. My issue is that the reviews mention the word "service" quite often, and so Phraser finds lots of different bigrams with "service" as one half of the pair (e.g. "helpful_service", "good_service", "service_price").
These are then present across multiple topics in the final result*. I'm thinking that I could prevent this from occurring if I were able to tell Phraser not to include "service" when making bigrams. Is this possible?
(*) I am aware that "service"-related bigrams being present across multiple topics might indeed be the optimal result, but I just want to experiment with leaving them out.
Sample code:
# import gensim models
from gensim.models import Phrases
from gensim.models.phrases import Phraser
# sample data
data = [
    "Very quick service left a big tip",
    "Very bad service left a complaint to the manager"
]
data_words = [doc.split(" ") for doc in data]
# build the bigram model
bigram_phrases = Phrases(data_words, min_count=2, threshold=0, scoring='npmi')
# note I used the arguments above to force "service" based bigrams to be created for this example
bigram_phraser = Phraser(bigram_phrases)
# print the result
for word in data_words:
    tokens_ = bigram_phraser[word]
    print(tokens_)
The above prints:
['Very', 'quick', 'service_left', 'a', 'big', 'tip']
['Very', 'bad', 'service_left', 'a', 'complaint', 'to', 'the', 'manager']
Caution: The following behavior seems to change with version 4.0.0!
If you are indeed only working with bigrams, you can utilize the common_terms={} parameter of the function, which is (according to the docs)
[a] list of “stop words” that won’t affect frequency count of expressions containing them. Allow to detect expressions like “bank_of_america” or “eye_of_the_beholder”.
If I add a simple common_terms={"service"} to your sample code, I am left with the following result:
['Very', 'quick', 'service', 'left_a', 'big', 'tip']
['Very', 'bad', 'service', 'left_a', 'complaint', 'to', 'the', 'manager']
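For reference, the only change to the sample code is the Phrases call; a minimal sketch against gensim 3.x:

# gensim 3.x: list "service" as a common term so it no longer anchors bigrams
bigram_phrases = Phrases(data_words, min_count=2, threshold=0,
                         scoring='npmi', common_terms={"service"})
bigram_phraser = Phraser(bigram_phrases)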
Starting with version 4.0.0, gensim seemingly dropped this parameter but replaced it with connector_words (see here). The results should largely be the same, though!
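A sketch of the same idea against gensim 4.x, where connector_words takes the place of common_terms (parameter and method names below are taken from the 4.x API as I understand it):

from gensim.models.phrases import Phrases

# data_words as defined in the question's sample code
# gensim 4.x: pass "service" via connector_words instead of common_terms
bigram_phrases = Phrases(
    data_words,
    min_count=2,
    threshold=0,
    scoring='npmi',
    connector_words=frozenset({"service"}),
)
bigram_phraser = bigram_phrases.freeze()  # frozen replacement for Phraser

for sentence in data_words:
    print(bigram_phraser[sentence])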
I've been searching for this for a long time, and most of the materials I've found were about named entity recognition. I'm running topic modelling, but in my data there were too many names in the texts.
Is there any Python library which contains (English) names of people? Or, if not, what would be a good way to remove the names of people from each document in the corpus?
Here's a simple example:
texts=['Melissa\'s home was clean and spacious. I would love to visit again soon.','Kevin was nice and Kevin\'s home had a huge parking spaces.']
I would suggest using a tokenizer with some capability to recognize and differentiate proper nouns. spacy is quite versatile and its default tokenizer does a decent job of this.
There are hazards to using a list of names as if they're stop words - let me illustrate:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
"Kevin was nice and Kevin's home had a huge parking spaces."
"Bill sold a work of art to Art and gave him a bill"]
tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])
tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
So the first two sentences are easy, and spacy identifies the proper nouns as "PROPN".
Now, the third sentence has been constructed to show the issue: lots of people have names that are also ordinary words. spacy's default tokenizer isn't perfect, but it does a respectable job with the two sides of the task: don't remove names when they are being used as regular words (e.g. bill of goods, work of art), and do identify them when they are being used as names. (You can see that it messed up one of the references to Art, the person.)
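If the goal is simply to drop the names before topic modelling, a minimal sketch building on the tagging above could filter on pos_ (this filtering step is my addition, not part of the original answer):

# Continuing from the snippet above (nlp and texts are already defined)
cleaned = []
for sentence in texts:
    doc = nlp(sentence)
    # Keep everything except tokens tagged as proper nouns
    cleaned.append(" ".join(token.text for token in doc if token.pos_ != "PROPN"))
print(cleaned)

The same caveat applies: a genuinely ambiguous token like the second "Art" may slip through, or a capitalized common noun may be dropped.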
Not sure if this solution is efficient and robust but it's simple to understand (to me at the very least):
import re
# load a set of existing names (over 18,000) from the file
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())
# your list of texts
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
"Kevin was nice and Kevin's home had a huge parking spaces."]
# join the texts into one string
texts = ' | '.join(texts)
# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)
# strip the possessive 's and remove duplicates
found_names = set([name[0].replace("'s","") for name in found_names])
# remove all the words that look like names but are not included in the NAMES
found_names = [name for name in found_names if name in NAMES]
# loop through the found names and remove each one from the texts
for name in found_names:
    texts = re.sub(name + "('s)?", "", texts)  # also remove possessive forms
# split the texts back to the list
texts = texts.split(' | ')
print(texts)
Output:
[' home was clean and spacious. I would love to visit again soon.',
' was nice and home had a huge parking spaces.']
The list of names was obtained here: https://www.usna.edu/Users/cs/roche/courses/s15si335/proj1/files.php%3Ff=names.txt.html
And I completely endorse the recommendation of #James_SO to use smarter tools.
While trying to recover a given WordNet synset's hypernyms through NLTK's WordNet interface, I am getting what I think are different results from WordNet's web search interface. For example:
from nltk.corpus import wordnet as wn
bank6ss = wn.synsets("bank")[5] # 'bank' as gambling house funds
bank6ss.hypernyms()
# returns [Synset('funds.n.01')]
That is, only one hypernym found (no others are found with, for instance, instance_hypernyms()). However, when looking at WN's web interface, this sense of 'bank' lists several other hypernyms under "Direct hypernym":
funds, finances, monetary resource, cash in hand, pecuniary resource
What would explain this difference, and how could I get that longer list of hypernyms in NLTK's WordNet?
The WordNet version used in my NLTK installation is 3.0.
I just realized that I'm looking at two different types of output: what is returned in NLTK's WordNet is a hypernym synset (Synset('funds.n.01')), while the list of hypernyms in the web interface is composed of the lemmas belonging to that one synset.
To fully answer the question, this list of lemmas can be recovered in NLTK as follows:
from nltk.corpus import wordnet as wn
bank6ss = wn.synsets("bank")[5] # 'bank' as gambling house funds
hn1ss = bank6ss.hypernyms()[0]
hn1ss.lemmas()
# returns [Lemma('funds.n.01.funds'),
# Lemma('funds.n.01.finances'),
# Lemma('funds.n.01.monetary_resource'),
# Lemma('funds.n.01.cash_in_hand'),
# Lemma('funds.n.01.pecuniary_resource')]
Or, if only lemma names are of interest:
hn1ss.lemma_names()
# returns [u'funds',
# u'finances',
# u'monetary_resource',
# u'cash_in_hand',
# u'pecuniary_resource']
I want to extract relevant information about a few topics, for example:
product information
purchase experience of customer
recommendation of family or friend
In the first step, I extract information from one of the websites. For instance:
i think AIA does a more better life insurance as my comparison and
the companies comparisonand most important is also medical insurance
in my opinionyes there are some agents that will sell u plans that
their commission is high...dun worry u buy insurance from a company
anything happens u can contact back the company also can ...better
find a agent that is reliable and not just working for the commission
for now , they might not service u in the future...thanksregardsdiana
""
Then, using NLTK in VS2015, I tried to split the text into words:
toks = nltk.word_tokenize(text)
Using pos_tag, I can tag my toks:
postoks = nltk.tag.pos_tag(toks)
From this part on, I am not sure what I should do.
Previously, I used IBM Text Analytics. In that software, I used to create a dictionary, then create some patterns, and then analyze the data. For instance:
Sample dictionary: insurance_cmp: {AIA, IMG, SABB}
Sample patterns:
insurance_cmp + Good_Feeling_Pattern
insurance_cmp + ['purchase|Buy'] + Bad_Feeling_Pattern
Good_Feeling_Pattern = [good, like it, nice]
Bad_Feeling_Pattern = [bad, worse, not good, regret]
I would like to know whether I can simulate the same thing in NLTK. Can a chunker and a grammar help me extract what I am looking for? I would appreciate your ideas on how to improve this.
grammar = r"""
NBAR:
{<NN.*|JJ>*<NN.*>} # Nouns and Adjectives, terminated with Nouns
NP:
{<NBAR>}
{<NBAR><IN><NBAR>} # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)
Please help me figure out what my next step could be to reach my goal.
You just need to follow these videos or read this blog.
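If you want a rough starting point in plain NLTK, a dictionary-plus-pattern check in the spirit of your IBM example could be sketched like this (the dictionaries, the window size and the sample sentence are assumptions for illustration, not a ready-made solution):

import nltk

# Toy dictionaries, modelled on the IBM-style patterns in the question
insurance_cmp = {'aia', 'img', 'sabb'}
good_feeling_pattern = {'good', 'nice', 'better', 'reliable'}
bad_feeling_pattern = {'bad', 'worse', 'regret'}

def match_patterns(text, window=6):
    """Flag sentences where an insurer name occurs near a feeling word."""
    hits = []
    for sent in nltk.sent_tokenize(text):
        toks = [tok.lower() for tok in nltk.word_tokenize(sent)]
        for i, tok in enumerate(toks):
            if tok in insurance_cmp:
                context = set(toks[max(0, i - window): i + window + 1])
                if context & good_feeling_pattern:
                    hits.append(('positive', sent))
                elif context & bad_feeling_pattern:
                    hits.append(('negative', sent))
    return hits

sample = "I think AIA does a better life insurance in my comparison."
print(match_patterns(sample))

The NP chunks produced by your grammar could then be matched against the same dictionaries instead of raw tokens, which gets you closer to the insurance_cmp + pattern rules you had in IBM Text Analytics.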