I'm using this code to determine the subject and the time/location of a sentence.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import nltk
text = input('Please enter a sentence: ')
words = text.split()
sentence = pos_tag(words)
grammar = '''
Action: {<NN.*>?<VB.*><RB.*>?}
Location: {<IN><NN.*>+}
Subject: {<DT>?<JJ>*<NN.*>}
'''
cp = nltk.RegexpParser(grammar, "Input")
result = cp.parse(sentence)
result.draw()
How can I print just the subject of the sentence?
You can use the label() method of the Tree objects. Here I set up a loop that checks each element of the result to see whether it is an nltk.tree.Tree instance. Then, if the label is "Subject", it gets appended to the subject list.
subject = []
for word in result:
    print(word)
    if isinstance(word, nltk.tree.Tree):
        if word.label() == 'Subject':
            subject.append(word)
# Sentence returned multiple subjects, including direct object, so draw first one
subject[0].draw()
Of course, this assumes that whatever is labeled "Subject" is what you want to draw.
Related
I am using NLP with Python to find names in a string. I can find a name when the string contains a full name (first name and last name), but when the string contains only a first name, my code does not recognize it as a PERSON. Below is my code.
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
string = """
Sriram is working as a python developer
"""
def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences
def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    #print(sentences)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            #print("Out Side ",chunk)
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    print("In Side ",chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names
if __name__ == '__main__':
    names = extract_names(string)
    print(names)
My advice is to use the StanfordNLP or spaCy NER; NLTK's ne_chunk is a little janky. StanfordNLP is more commonly used by researchers, but spaCy is easier to work with. Here is an example using spaCy to print the text of each named entity and its type:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
...
Sriram ORG
>>>
Note that it classifies Sriram as an organization, which may be because it is not a common English name and spaCy's model is trained on English corpora. Good luck!
I am trying to rewrite an algorithm that takes an input text file, compares it with different documents, and reports the similarities.
Now I want to print the unmatched words and write them to a new text file.
In this code, "hello force" is the input. It is checked against raw_documents, and the code prints a similarity score between 0 and 1 for each document. The word "force" matches the second document, so the output gives the second document a higher score, but "hello" is not in any of the raw_documents. What I want is to print every input word, such as "hello", that was not matched by any of the raw_documents.
import gensim
import nltk
from nltk.tokenize import word_tokenize
raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
s = 0
for i in corpus:
    s += len(i)
sims = gensim.similarities.Similarity('/usr/workdir/',tf_idf[corpus],
                                      num_features=len(dictionary))
query_doc = [w.lower() for w in word_tokenize("hello force")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
result = sims[query_doc_tf_idf]
print(result)
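One possible way to find the unmatched words, sketched with plain Python sets rather than gensim: collect the vocabulary of all documents, then report the query tokens that are missing from it. The helper names `tokenize` and `unmatched_words`, the regex tokenizer standing in for word_tokenize, and the output file name are all assumptions made for illustration. With gensim itself, the keys of `dictionary.token2id` should serve as the same vocabulary.

```python
import os
import re
import tempfile

raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]


def tokenize(text):
    # Hypothetical stand-in for word_tokenize: lowercase runs of letters/apostrophes
    return re.findall(r"[a-z']+", text.lower())


def unmatched_words(query, documents):
    # Vocabulary = every token that occurs in at least one document
    vocabulary = {token for doc in documents for token in tokenize(doc)}
    # Keep the query tokens that no document contains, in query order
    return [token for token in tokenize(query) if token not in vocabulary]


missing = unmatched_words("hello force", raw_documents)
print(missing)

# Write the unmatched words to a new text file (path chosen for illustration)
out_path = os.path.join(tempfile.gettempdir(), "unmatched_words.txt")
with open(out_path, "w", encoding="utf-8") as out:
    out.write("\n".join(missing))
```

The similarity ranking itself is untouched by this; the check only looks at which query tokens the documents have never seen.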
I am trying to preprocess a string using lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any error but the text is not preprocessed appropriately. Only the stop words are removed but the lemmatizing does not work and punctuation and digits also remain.
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
And expected output should look like:
This beautiful day I work exercise text
No, your current approach does not work: you must pass one word at a time to the lemmatizer or stemmer. These functions expect single words and will not split a whole sentence into words for you.
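import re
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))
def clean(tweet):
    # Remove punctuation and digits, and lowercase the text
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb, skipping stop words
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])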
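Alternatively, you can use a PorterStemmer, which serves a similar purpose to lemmatisation but strips suffixes by rule, without a dictionary or part-of-speech context, so the result is not always a real word.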
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
And, call the stemmer like this:
stemmer.stem(i)
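Separately from the lemmatizer issue, the punctuation and digits in the question's output appear to survive because str.translate returns a new string and the question's code discards the result. A minimal sketch of the missing assignment, using only the standard library (the variable name `cleaned` is an assumption):

```python
import string

corpus = "This beautiful day16~ . I ; working exercise45.^^^45 text34 ."

# Third argument of maketrans lists the characters to delete outright
table = str.maketrans("", "", string.punctuation + string.digits)

# Strings are immutable: translate() returns a new string, so keep the result
cleaned = corpus.translate(table)

# Collapse the runs of spaces left behind by the deleted characters
cleaned = " ".join(cleaned.split())
print(cleaned)  # This beautiful day I working exercise text
```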
I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.
>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> print(s)
This is a beautiful day I am working on an exercise text
To process a tweet properly, you can use the following code:
import re
import string
import nltk
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd=[]
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1= text.lower()
    text1= re.sub(pattern,"", text1)
    text1= text1.replace("'s "," ")
    text1= text1.replace("'","")
    text1= text1.replace("—", " ")
    # Replace every punctuation character with a space
    table= str.maketrans(string.punctuation,len(string.punctuation)*" ")
    text1= text1.translate(table)
    geek= nltk.word_tokenize(text1)
    abc=nltk.pos_tag(geek)
    output = []
    # Convert Penn Treebank tags to WordNet POS letters
    for value in abc:
        value = list(value)
        if value[1][0] =="N":
            value[1] = 'n'
        elif value[1][0] =="V":
            value[1] = 'v'
        elif value[1][0] =="J":
            value[1] = 'a'
        elif value[1][0] =="R":
            value[1] = 'r'
        else:
            value[1]='n'
        output.append(value)
    abc=output
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0],pos=value[1]))
    return bcd
Here I have used pos_tag, keeping only the N, V, J, and R tag families and treating everything else as a noun. The function returns a tokenized and lemmatized list of words.
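The tag-conversion chain in that function could also be written as a small lookup helper. A sketch, where `penn_to_wordnet` is a hypothetical name that does not appear in the original answer and the tagged pairs are made-up sample data:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    mapping = {"N": "n", "V": "v", "J": "a", "R": "r"}
    # Look at the first letter only; anything else falls back to noun,
    # mirroring the else branch of the if/elif chain
    return mapping.get(tag[:1], "n")


# Sample (word, tag) pairs in the shape nltk.pos_tag returns
tagged = [("running", "VBG"), ("dogs", "NNS"), ("quickly", "RB"),
          ("better", "JJR"), ("the", "DT")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting pair could then be passed to lemmatizer.lemmatize(word, pos=letter) as in the loop at the end of the function.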
I have imported all the books from the NLTK Book library, and I am just trying to figure out how to define a corpus then sentence to be printed.
For example, if I wanted to print sentence 1 of text 3, then sentence 2 of text 4
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to python, so it is likely a v. basic question, but I cannot seem to find the solution elsewhere.
Many thanks, in advance!
A simple example:
from nltk.tokenize import sent_tokenize
# Text containing several sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"
# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are = ", sentence_tokenize_list)
    return sentence_tokenize_list
# call function
tokenized_sentences = sentence_tokenizer(sentences)
# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
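If you already have text3 and text4 as plain strings (sent_tokenize expects a string; the Text objects from nltk.book are sequences of tokens, not strings):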
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(text3)
print(sents[0]) # the first sentence in the list is at position 0
sents = sent_tokenize(text4)
print(sents[1]) # the second sentence in the list is at position 1
print(text3[0]) # prints the first word of text3
You seem to need both a NLTK tutorial and a python tutorial. Luckily, the NLTK book is both.
I want to find the frequency of all words in my text file so that I can find the most frequently occurring words.
Can someone please help me with the command to use for that?
import nltk
text1 = "hello he heloo hello hi " # example text
fdist1 = nltk.FreqDist(text1)
I have used the above code, but the problem is that it does not give word frequencies; rather, it displays the frequency of every character.
I would also like to know how to read the input text from a file.
I tried your example and saw the same thing. For it to work properly, you have to split the string into words first. If you do not, FreqDist iterates over the string character by character and counts each character, which is what you were seeing. The code below returns the count of each word instead.
import nltk
text1 = 'hello he heloo hello hi '
text1 = text1.split()  # no argument: splits on any whitespace and drops empty strings
fdist1 = nltk.FreqDist(text1)
print(fdist1.most_common(50))
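One detail worth checking when splitting on spaces: str.split(' ') keeps an empty string for every leading, trailing, or repeated space, and that empty string would then be counted as a word, while str.split() with no argument drops them. A minimal sketch of the difference (variable names are illustrative):

```python
text1 = 'hello he heloo hello hi '

# Explicit separator: the trailing space produces an empty last element
with_sep = text1.split(' ')
print(with_sep)     # ['hello', 'he', 'heloo', 'hello', 'hi', '']

# No separator: runs of whitespace are collapsed and the ends are trimmed
without_sep = text1.split()
print(without_sep)  # ['hello', 'he', 'heloo', 'hello', 'hi']
```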
If you want to read from a file and get the word count, you can do it like so:
input.txt
hello he heloo hello hi
my username is heinst
your username is frooty
python code
import nltk
with open("input.txt", "r") as myfile:
    data = myfile.read().replace('\n', ' ')
data = data.split()  # splits on any whitespace and drops empty strings
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))
For what it's worth, NLTK seems like overkill for this task. The following will give you word frequencies, in order from highest to lowest.
from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
print(word_freqs.most_common())  # (word, count) pairs, highest count first
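A possible way to fill in the file-reading step with the standard library only. This is a sketch: the helper name `word_frequencies` is hypothetical, and a temporary file holding the sample text from the earlier answer stands in for the real input file.

```python
import os
import tempfile
from collections import Counter


def word_frequencies(path):
    # Read the whole file, lowercase it, and split on any whitespace
    with open(path, encoding="utf-8") as handle:
        return Counter(handle.read().lower().split())


# Create a throwaway demo file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("hello he heloo hello hi\n"
              "my username is heinst\n"
              "your username is frooty\n")
    demo_path = tmp.name

freqs = word_frequencies(demo_path)
os.remove(demo_path)

# most_common(n) returns the n highest-count (word, count) pairs
print(freqs.most_common(3))
```

Lowercasing is a design choice here so that "Hello" and "hello" count as the same word; drop the .lower() call if case matters.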
text1 in the nltk book is a collection of tokens (words, punctuation) unlike in your code example where text1 is a string (collection of Unicode codepoints):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
If your input is indeed space-separated words, then to find the frequencies, use @Boa's answer:
from collections import Counter
freq = Counter(text_with_space_separated_words.split())
Note: FreqDist is a Counter but it also defines additional methods such as .plot().
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk
with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
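To get the words and their frequencies as a dictionary, the following code may help:
import nltk
from nltk.tokenize import word_tokenize
inputSentence = "hello he heloo hello hi"  # example input (assumed; not in the original)
tokens = word_tokenize(inputSentence)
fre = nltk.FreqDist(tokens)  # assumed definition; the original left fre undefined
freq = {}  # renamed from dict to avoid shadowing the built-in
for f in tokens:
    freq[f] = fre[f]
print(freq)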
I think the code below is useful for getting the frequency of each word in the file in dictionary form:
with open('greet.txt') as myfile:
    temp=myfile.read()
x=temp.split("\n")
y=list()
for item in x:
    z=item.split(" ")
    y.append(z)
count=dict()
for name in y:
    for items in name:
        if items not in count:
            count[items]=1
        else:
            count[items]=count[items]+1
print(count)
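To get from such a dictionary to the most frequently occurring words, the (word, count) pairs can be sorted by count. A short sketch, with a hand-written `count` dictionary standing in for the one built from the file:

```python
# Stand-in for the dictionary built from the file
count = {"hello": 2, "he": 1, "heloo": 1, "hi": 1}

# Sort (word, frequency) pairs by frequency, highest first;
# sorted() is stable, so ties keep their original order
ranked = sorted(count.items(), key=lambda pair: pair[1], reverse=True)

top_two = ranked[:2]
print(top_two)  # [('hello', 2), ('he', 1)]
```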