I was trying to implement a regex on a list of grammar tags in python, for finding the tense form of the list of grammar. And I wrote the following code to implement it.
Data preprocessing:
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tags = []
for i in range(len(tagged)):
t = tagged[i]
tags.append(t[1])
print(tags)
regex formula i.e. to be implemented
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}
Function to implement the regex on the list tags
def check_grammar(grammar, tags):
cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)
print(result)
result.draw()
check_grammar(grammar, tags)
But it returned an error as:
Traceback (most recent call last):
File "/home/samar/Desktop/twitter_tense/main.py", line 35, in <module>
check_grammar(grammar, tags)
File "/home/samar/Desktop/twitter_tense/main.py", line 31, in check_grammar
result = cp.parse(tags)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1276, in parse
chunk_struct = parser.parse(chunk_struct, trace=trace)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1083, in parse
chunkstr = ChunkString(chunk_struct)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in __init__
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in <listcomp>
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 105, in _tag
raise ValueError("chunk structures must contain tagged " "tokens or trees")
ValueError: chunk structures must contain tagged tokens or trees
Your call to the cp.parse() function expects each of the tokens in your sentence to be tagged, however, the tags list you created only contains the tags but not the tokens as well, hence your ValueError. The solution is to instead pass the output from the pos_tag() call (i.e. tagged) to your check_grammar call (see below).
Solution
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
print(tagged)
# Output
>>> [('He', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('doing', 'VBG'), ('his', 'PRP$'), ('homework', 'NN'), ('.', '.')]
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)
print(result)
result.draw()
check_grammar(my_grammar, tagged)
Output
>>> (S
>>> He/PRP
>>> (Future_Perfect_Continuous will/MD have/VB been/VBN doing/VBG)
>>> his/PRP$
>>> homework/NN
>>> ./.)
Related
I am writing a program to process a set of answers from a csv. The csv is constructed like so:
I have written a program that loads the data:
positive results,resources,priorities,team focus,benefits,help,action steps,today,tomorrow,yesterday
[studied],[schaums outlines], [Continue working on proof of concept for commitment process],[Continue working on proof of concept for commitment process],[By completing the commitment process demonstration I will have something to show stakeholders and I will feel good],[I need resources advice support and financial assistance so that I can continue to develop my project.],[Schedule time to study],[11:00AM review the Art of Discipline
12:00pm lunch
- contact BOA
1:00pm (break and catch up)
1:15pm continue developing the commitment process.
- follow up with XXX
3:00pm break (snack)
3:15pm prepare for meeting with Dr. XXX
3:30pm meeting with Dr. XXX
4:00pm PDMP
5:00pm Run
6:00pm Dinner
6:30pm work on communication skills
7:00pm reading
8:00pm Journaling
8:30pm reading],[6:00-7:30am study
7:30am PDMP
8:00am review email
9:00am review priorities
9:30am meeting with XXXX
12:00pm lunch
5:00pm run
6:00pm dinner
6:30pm PDMP
8:00pm journaling
8:30pm reading]
,[
5:00AM woke up
6:00am reading/ developing communication skills
7:30-12:00pm
1:00pm lunch, dropped off suit to get fitted
2:00pm weekly planning
3:00pm leaving for XXXX
4:00pm dinner
4:30pm budgeting review
5:00pm drove home
5:30-6:30pm planning
7:30pm pack up belongings
8:00pm journaling
8:30pm read]
import spacy
nlp = spacy.load('en_core_web_sm')
punctuations = string.punctuation
def cleanup_text(docs, logging=False):
texts = []
counter = 1
for doc in docs:
if counter % 1000 == 0 and logging:
print("Processed %d out of %d documents." % (counter, len(docs)))
counter += 1
doc = nlp(doc, disable=['parser', 'ner'])
tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
tokens = ' '.join(tokens)
texts.append(tokens)
return pd.Series(texts)
positive_text = [text for text in train[train['benefits'] == 'good']['action steps']]
negative_text = [text for text in train[train['benefits'] == 'bad']['action steps']]
positive_clean = cleanup_text(positive_text)
positive_clean = ' '.join(positive_text).split()
negative_clean = cleanup_text(negative_text)
negative_clean = ' '.join(negative_clean).split()
# 3. Calculate total positive words and negative words
positive_counts = Counter(positive_clean)
negative_counts = Counter(negative_clean)
positive_common_words = [word[0] for word in positive_counts.most_common(20)]
negative_common_counts = [word[1] for word in negative_counts.most_common(20)]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re
import spacy
spacy.load('en_core_web_sm')
from spacy.lang.en import English
parser = English()
STOPLIST = set(stopwords.words('english') + list(sklearn_stop_words))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "”", "”"]
class CleanTextTransformer(TransformerMixin):
def transform(self, X, **transform_params):
return [cleanText(text) for text in X]
def fit(self, X, y=None, **fit_params):
return self
def get_params(self, deep=True):
return {}
def cleanText(text):
text = text.strip().replace("\n", " ").replace("\r", " ")
text = text.lower()
return text
def tokenizeText(sample):
tokens = parser(sample)
lemmas = []
for tok in tokens:
lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
tokens = lemmas
tokens = [tok for tok in tokens if tok not in STOPLIST]
tokens = [tok for tok in tokens if tok not in SYMBOLS]
return tokens
def printNMostInformative(vectorizer, clf, N):
feature_names = vectorizer.get_feature_names_out()
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
topClass1 = coefs_with_fns[:N]
topClass2 = coefs_with_fns[:-(N + 1):-1]
print("Class 1 best: ")
for feat in topClass1:
print(feat)
print("Class 2 best: ")
for feat in topClass2:
print(feat)
vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
clf = LinearSVC()
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
# data
train1 = train['benefits'].tolist()
labelsTrain1 = train['action steps'].tolist()
test1 = test['benefits'].tolist()
labelsTest1 = test['action steps'].tolist()
# train
pipe.fit(train1, labelsTrain1)
# test
preds = pipe.predict(test1)
print("accuracy:", accuracy_score(labelsTest1, preds))
print("Top 10 features used to predict: ")
printNMostInformative(vectorizer, clf, 10)
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
transform = pipe.fit_transform(train1, labelsTrain1)
vocab = vectorizer.get_feature_names_out()
for i in range(len(train1)):
s = ""
indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
for idx, num in zip(indexIntoVocab, numOccurences):
s += str((vocab[idx], num))
from sklearn import metrics
print(metrics.classification_report(labelsTest1, preds,
target_names=df['benefits'].unique()))
I would like to use spacy after loading the data to process the positive and negative sentiment from the text content.
Expected:
Data showing the text from the benefits and action steps columns.
Actual:
File "/Users/evangertis/development/PythonAutomation/IGTS/TwilioMessaging/accountability.py", line 214, in <module>
pipe.fit(train1, labelsTrain1)
File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 390, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 348, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/usr/local/lib/python3.9/site-packages/joblib/memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 891, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 847, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "/Users/evangertis/development/PythonAutomation/IGTS/TwilioMessaging/accountability.py", line 170, in transform
return [cleanText(text) for text in X]
File "/Users/evangertis/development/PythonAutomation/IGTS/TwilioMessaging/accountability.py", line 170, in <listcomp>
return [cleanText(text) for text in X]
File "/Users/evangertis/development/PythonAutomation/IGTS/TwilioMessaging/accountability.py", line 177, in cleanText
text = text.strip().replace("\n", " ").replace("\r", " ")
AttributeError: 'float' object has no attribute 'strip'
I have an array of strings in different languages and I would like to remove stop words from these strings.
example of string :
["mai fostul președinte egiptean mohamed morsi ", "em bon jovi lançou o álbum have a nice day a ", " otok škulj är en ö i kroatien den ligger i län"...]
this is the list of languages I'm willing to use :
['French',
'Spanish',
'Thai',
'Russian',
'Persian',
'Indonesian',
'Arabic',
'Pushto',
'Kannada',
'Danish',
'Japanese',
'Malayalam',
'Latin',
'Romanian',
'Swedish',
'Portugese',
'English',
'Turkish',
'Tamil',
'Urdu',
'Korean',
'German',
'Greek',
'Italian',
'Chinese',
'Dutch',
'Estonian',
'Hindi']
I am using Spacy library, but I'm looking for something that support multiple languages.
what I have tried already:
import pandas as pd
import nltk
nltk.download('punkt')
import spacy
nlp = spacy.load("xx_ent_wiki_sm")
from spacy.tokenizer import Tokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
doc = nlp("This is a sentence about Facebook.")
print([(ent.text, ent.label) for ent in doc.ents])
all_stopwords = nlp.Defaults.stop_words
all_stopwords = nlp.Defaults.stop_words
data_text=df1['Text'] #here where i store my strings
for x in data_text:
text_tokens = word_tokenize(x)
tokens_without_sw=[word for word in text_tokens if not word inall_stopwords]
print(tokens_without_sw)
I want to extract all german nouns from a german text in lemmatized form with NLTK.
I also checked spacy but NLTK is much more preferred because in english it already works with the needed performance and requested data structure.
I have the following working code for english:
import nltk
from nltk.stem import WordNetLemmatizer
#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
I get the print as expected:
['year', 'sender', 'key', 'recipient']
Now I tried to do this for German:
import nltk
from nltk.stem import WordNetLemmatizer
germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
And I get a wrong result:
['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']
The lemmatization did not work and the noun extraction did not work.
How is the proper way to apply different languages to this code?
I also checked other solutions like:
from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer("german") # Choose a language
tokenGer=stemmer.stem(tokens)
But this would make me start from the beginning.
I have found a way with the HanoverTagger:
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(text)
print(tagger.tag_sent(words) )
tokens=[word for (word,x,pos) in tagger.tag_sent(words,taglevel= 1) if pos == 'NN']
I get the outcome as expected: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
Let's say that I have a document, like so:
import spacy
nlp = spacy.load('en')
doc = nlp('My name is John Smith')
[t for t in doc]
> [My, name, is, John, Smith]
Spacy is intelligent enough to realize that 'John Smith' is a multi-token named entity:
[e for e in doc.ents]
> [John Smith]
How can I make it chunk named entities into discrete tokens, like so:
> [My, name, is, John Smith]
Spacy documentation on NER says that you can access token entity annotations using the token.ent_iob_ and token.ent_type_ attributes.
https://spacy.io/usage/linguistic-features#accessing
Example:
import spacy
nlp = spacy.load('en')
doc = nlp('My name is John Smith')
ne = []
merged = []
for t in doc:
# "O" -> current token is not part of the NE
if t.ent_iob_ == "O":
if len(ne) > 0:
merged.append(" ".join(ne))
ne = []
merged.append(t.text)
else:
ne.append(t.text)
if len(ne) > 0:
merged.append(" ".join(ne))
print(merged)
This will print:
['My', 'name', 'is', 'John Smith']
I'm trying to lemmatize a string according to the part of speech but at the final stage, i'm getting an error. My code:
import nltk
from nltk.stem import *
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import wordnet
wordnet_lemmatizer = WordNetLemmatizer()
text = word_tokenize('People who help the blinging lights are the way of the future and are heading properly to their goals')
tagged = nltk.pos_tag(text)
def get_wordnet_pos(treebank_tag):
if treebank_tag.startswith('J'):
return wordnet.ADJ
elif treebank_tag.startswith('V'):
return wordnet.VERB
elif treebank_tag.startswith('N'):
return wordnet.NOUN
elif treebank_tag.startswith('R'):
return wordnet.ADV
else:
return ''
for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-40-afb22c78f770> in <module>()
----> 1 for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ")
E:\Miniconda3\envs\uol1\lib\site-packages\nltk\stem\wordnet.py in lemmatize(self, word, pos)
38
39 def lemmatize(self, word, pos=NOUN):
---> 40 lemmas = wordnet._morphy(word, pos)
41 return min(lemmas, key=len) if lemmas else word
42
E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in _morphy(self, form, pos)
1710
1711 # 1. Apply rules once to the input to get y1, y2, y3, etc.
-> 1712 forms = apply_rules([form])
1713
1714 # 2. Return all that are in the database (and check the original too)
E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in apply_rules(forms)
1690 def apply_rules(forms):
1691 return [form[:-len(old)] + new
-> 1692 for form in forms
1693 for old, new in substitutions
1694 if form.endswith(old)]
E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in <listcomp>(.0)
1692 for form in forms
1693 for old, new in substitutions
-> 1694 if form.endswith(old)]
1695
1696 def filter_forms(forms):
I want to be able to lemmatize that string based on each word's part of speech all at once. Please help.
Firstly, try not to mix top-level, absolute and relative imports like these:
import nltk
from nltk.stem import *
from nltk import pos_tag, word_tokenize
This would be better:
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
(See Absolute vs. explicit relative import of Python module)
The error you're getting is most probably because you are feeding in the outputs of pos_tag as the input to the WordNetLemmatizer.lemmatize(), i.e. :
>>> from nltk import pos_tag
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals'.split()
>>> pos_tag(sent)
[('People', 'NNS'), ('who', 'WP'), ('help', 'VBP'), ('the', 'DT'), ('blinging', 'NN'), ('lights', 'NNS'), ('are', 'VBP'), ('the', 'DT'), ('way', 'NN'), ('of', 'IN'), ('the', 'DT'), ('future', 'NN'), ('and', 'CC'), ('are', 'VBP'), ('heading', 'VBG'), ('properly', 'RB'), ('to', 'TO'), ('their', 'PRP$'), ('goals', 'NNS')]
>>> pos_tag(sent)[0]
('People', 'NNS')
>>> first_word = pos_tag(sent)[0]
>>> wnl.lemmatize(first_word)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1712, in _morphy
forms = apply_rules([form])
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1694, in apply_rules
if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
The input to WordNetLemmatizer.lemmatize() should be str not a tuple, so if you do:
>>> tagged_sent = pos_tag(sent)
>>> def penn2morphy(penntag, returnNone=False):
... morphy_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
... 'VB':wn.VERB, 'RB':wn.ADV}
... try:
... return morphy_tag[penntag[:2]]
... except:
... return None if returnNone else ''
...
>>> for word, tag in tagged_sent:
... wntag = penn2morphy(tag)
... if wntag:
... print wnl.lemmatize(word, pos=wntag)
... else:
... print word
...
People
who
help
the
blinging
light
be
the
way
of
the
future
and
be
head
properly
to
their
goal
Or if you like an easy way out:
pip install pywsd
Then:
>>> from pywsd.utils import lemmatize, lemmatize_sentence
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals'
>>> lemmatize_sentence(sent)
['people', 'who', 'help', 'the', u'bling', u'light', u'be', 'the', 'way', 'of', 'the', 'future', 'and', u'be', u'head', 'properly', 'to', 'their', u'goal']