For example, given a sentence like:
sentence = 'An old lady lives in a small red house. She has three cute cats, and their names match with their colors: White, Cinnamon, and Chocolate. They are poor but happy.'
So I hope to get 2 lists like these:
adj = ['old','small','red','cute','White','poor','happy']
noun = ['lady','house','cats','names','colors','Cinnamon','Chocolate']
I saw someone mention NLTK, but I haven't used the package, so I would appreciate some instructions.
What you need is called part-of-speech (POS) tagging. You could check:
NLTK: https://www.nltk.org/book/ch05.html, https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/
Spacy: https://spacy.io/usage/linguistic-features#pos-tagging
If that's not enough, there are plenty of additional tutorials for beginners out there if you google 'POS tagging + python'.
By the way, I would recommend spaCy, as it is more modern.
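For instance, a minimal sketch of the spaCy route could look like this (assuming the en_core_web_sm model is installed; the exact tag assignments depend on the model, so the lists may not match your expected output word for word):

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentence = ("An old lady lives in a small red house. She has three cute cats, "
            "and their names match with their colors: White, Cinnamon, and Chocolate. "
            "They are poor but happy.")
doc = nlp(sentence)

# Collect tokens by their coarse-grained POS tags.
adj = [token.text for token in doc if token.pos_ == "ADJ"]
noun = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]

print(adj)
print(noun)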
I have strings that are spelled correctly but are in all lower case (except for the first character), and I would like to correct their capitalisation (in English, so basically just names of things...). I tried pyspellcheck, autocorrect and symspellpy, which do not consider capitalisation as far as I know.
So for example the string 'And then we went to see frank from england to have a beer with him.' should be corrected to 'And then we went to see Frank from England to have a beer with him.'.
Do you know any library that can do that?
You can do it with spaCy:
import spacy

nlp = spacy.load('en_core_web_md')

def capitalize_ent(text):
    # Title-case the text so the NER model can recognise the names.
    title_text = text.title()
    doc = nlp(title_text)
    # Collect every token that the model recognises as (part of) an entity.
    words = []
    for x in doc:
        if nlp(x.text).ents:
            words.append(x.text)
    # Replace the lower-case occurrences with their capitalised forms.
    for word in words:
        text = text.replace(word.lower(), word)
    return text
Don't forget to download the spaCy language model:
python -m spacy download en_core_web_md
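For example (hedged, since the exact result depends on the model's entity predictions):

print(capitalize_ent('And then we went to see frank from england to have a beer with him.'))
# Ideally prints: 'And then we went to see Frank from England to have a beer with him.'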
I am using nltk PunktSentenceTokenizer for splitting paragraphs into sentences. I have paragraphs as follows:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
Output:
['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']
I tried to add sentence starters using the code below, but that didn't work either.
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')
I would really appreciate it if anybody could point me in the right direction.
Thanks in advance :)
The use of regular expressions can provide a solution to this type of problem, as illustrated by the code below:
import re

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

reSentenceEnd = re.compile(r"\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")

previousMatch = 0
sentenceStart = 0
end = len(paragraphs)

while True:
    candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
    # A sentence must contain at least two consecutive letters:
    if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]):
        print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
        sentenceStart = candidateSentenceEnd.end()
    if candidateSentenceEnd.end() == end:
        break
    previousMatch = candidateSentenceEnd.start() + 1
The output is:
Candidate is very poor in mathematics.
Interpersonal skills are good.
Very enthusiastic about social work
Many tokenizers (including nltk and spaCy) can handle regular expressions. Adapting this code to their frameworks might not be trivial, though.
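If you prefer to stay within nltk, one rough alternative (a sketch, not part of the original answer) is to keep Punkt's output and simply drop the bare numbering fragments it splits off:

import re
from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

# Keep only the fragments that contain at least two consecutive letters,
# which filters out the '1.', '2.', '3.' pieces.
sentences = [s for s in sent_tokenize(paragraphs) if re.search("[a-zA-Z]{2}", s)]
print(sentences)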
I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:
doc2 = nlp("what is the capital of Bangladesh?")
for chunk in doc2.noun_chunks:
    print(chunk)
Output:
1. what
2. the capital
3. bangladesh
Expected:
the capital of Bangladesh
I have tried answers from the spaCy docs and Stack Overflow. Nothing worked. It seems only cTAKES and Stanford CoreNLP can give such complex NPs.
Any help is appreciated.
spaCy clearly defines a noun chunk as:
"A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses." (https://spacy.io/api/doc#noun_chunks)
If you process the dependency parse differently, allowing prepositional modifiers and nested phrases/chunks, then you can end up with what you're looking for.
I bet you could modify the existing spacy code fairly easily to do what you want:
https://github.com/explosion/spaCy/blob/06c6dc6fbcb8fbb78a61a2e42c1b782974bd43bd/spacy/lang/en/syntax_iterators.py
For those who are still looking for this answer:
noun_phrases = set()
for nc in doc.noun_chunks:
    for np in [nc, doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]]:
        noun_phrases.add(np)
This is how I get all the complex noun phrases.
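A self-contained sketch of the same idea applied to the question's sentence (assuming the en_core_web_sm model is installed; storing the span text rather than the Span objects keeps the set easy to read):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("what is the capital of Bangladesh?")

noun_phrases = set()
for nc in doc.noun_chunks:
    # Expanding from the chunk root's left edge to its right edge pulls in
    # nested material such as the prepositional phrase "of Bangladesh".
    noun_phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)

print(noun_phrases)  # expected to include "the capital of Bangladesh"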
I'm starting to program with NLTK in Python for natural Italian language processing. I've seen some simple examples of the WordNet library, which has a nice set of synsets that lets you navigate from a word (for example, "dog") to its synonyms and antonyms, its hyponyms and hypernyms, and so on...
My question is:
If I start with an Italian word (for example, "cane", which means "dog"), is there a way to navigate between synonyms, antonyms, hyponyms... for the Italian word as you do for the English one? Or is there an equivalent of WordNet for the Italian language?
Thanks in advance
You are in luck. The nltk provides an interface to the Open Multilingual Wordnet, which does indeed include Italian among the languages it describes. Just add an argument specifying the desired language to the usual wordnet functions, e.g.:
>>> from nltk.corpus import wordnet as wn
>>> cane_lemmas = wn.lemmas("cane", lang="ita")
>>> print(cane_lemmas)
[Lemma('dog.n.01.cane'), Lemma('cramp.n.02.cane'), Lemma('hammer.n.01.cane'),
Lemma('bad_person.n.01.cane'), Lemma('incompetent.n.01.cane')]
The synsets have English names, because they are integrated with the English wordnet. But you can navigate the web of meanings and extract the Italian lemmas for any synset you want:
>>> hypernyms = cane_lemmas[0].synset().hypernyms()
>>> print(hypernyms)
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> print(hypernyms[1].lemmas(lang="ita"))
[Lemma('domestic_animal.n.01.animale_addomesticato'),
Lemma('domestic_animal.n.01.animale_domestico')]
Or since you mentioned "cattiva_persona" in the comments:
>>> wn.lemmas("bad_person")[0].synset().lemmas(lang="ita")
[Lemma('bad_person.n.01.cane'), Lemma('bad_person.n.01.cattivo')]
I went from the English lemma to the language-independent synset to the Italian lemmas.
Since I found myself wondering how to actually use the wordnet resources after reading this question and its answer, I'm going to leave here some useful information:
Here is a link to the nltk guide.
The two necessary commands to download wordnet data and thus proceed with the usage explained in the other answer are:
import nltk
nltk.download('wordnet')
nltk.download('omw')
I want to find and count specific bigrams such as "red apple" in a text file.
I have already turned the text file into a list of words, so I can't use a regex to count the whole phrase (i.e. the bigram). Or can I?
How can I count a specific bigram in the text file without using nltk or other modules? Can regex be a solution?
Why have you turned the text file into a list? It's also not memory efficient.
Instead, you can read the text with the file.read() method directly.
import re

text = 'I like red apples and green apples but I like red apples more.'
bigram = ['red apples', 'green apples']
for i in bigram:
    print('Found', i, len(re.findall(i, text)))
Output:
Found red apples 2
Found green apples 1
Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams that are common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without using NLTK or other modules, but in practice that's a very bad idea: you'll miss what you are looking for when the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating all kinds of statistics, and more.
And think about this: why and how have you turned the lines into a list of words? Not only is this inefficient, but depending on exactly how you did it, you may have lost word-order information, mishandled punctuation, messed up upper/lower case, or made any of a million other mistakes. Which, again, is why NLTK is what you need.
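For illustration, a hedged sketch of the NLTK route mentioned above (assumes the punkt tokenizer data is downloaded): count one specific bigram directly, then let the collocations module surface frequent bigrams for you.

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = 'I like red apples and green apples but I like red apples more.'
words = nltk.word_tokenize(text.lower())  # requires nltk.download('punkt')

# Count one specific bigram from the token sequence.
bigrams = list(nltk.bigrams(words))
print(bigrams.count(('red', 'apples')))

# Or let the collocations module rank the most frequent bigrams.
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(BigramAssocMeasures.raw_freq, 3))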