I'm attempting to remove all the stop words from text input. The code below removes all the stop words except the ones that begin a sentence.
How do I remove those words?
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

stopwords_nltk_en = set(stopwords.words('english'))
exclude_punctuation = set(punctuation)
stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)

def normalized_text(text):
    lemma = WordNetLemmatizer()
    stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
    normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
    return normalized

sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
    print(normalized_text(str(item)))
OUTPUT:
the bird always house
in hill bird nest
The culprit is this line of code:
print (normalized_text(str(item)))
If you try to print str(item) for the first element of your sentence list, you'll get:
['The birds are always in their house.']
which, then, lowered and split becomes:
["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]
As you can see, the first element is "['the", which does not match the stop word "the".
Solution: use ''.join(item) to convert item to a str.
Edit after comment
Inside the text string there are still some apostrophes ('). To solve this, call normalized_text as:
for item in sentence:
    print(normalized_text(item))
Then, import the regex module with import re and change:
text.lower().split()
with:
re.split('\'| ', ''.join(text).lower())
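Putting the pieces together, here is a minimal sketch of the corrected normalized_text, assuming the stoplist_combined and sentence defined in the question:

import re
from nltk.stem.wordnet import WordNetLemmatizer

def normalized_text(text):
    # 'text' may be a list holding a single sentence, so join it into one string first
    flat = ''.join(text).lower()
    # split on apostrophes as well as spaces so tokens like "['the" cannot appear
    tokens = re.split(r"'| ", flat)
    lemma = WordNetLemmatizer()
    # drop stop words and standalone punctuation tokens, then lemmatize what remains
    kept = [t for t in tokens if t and t not in stoplist_combined]
    return ' '.join(lemma.lemmatize(word) for word in kept)

for item in sentence:
    print(normalized_text(item))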
I am using the huggingface pipeline to extract embeddings of words in a sentence. As far as I know, the sentence is first turned into a tokenized string, and I think the length of the tokenized string might not be equal to the number of words in the original sentence. I need to retrieve the embedding of a particular word in a sentence.
For example, here is my code:
#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6
from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re
model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model_pipeline = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)
def find_wordNo_sentence(word, sentence):
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)
    index = splitted_sen.index(word)
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i
    print("not found")  # 0 based
def return_xlnet_embedding(word, sentence):
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())
    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())
    id_word = find_wordNo_sentence(word, sentence)
    try:
        data = model_pipeline(sentence)
        n_words = len(sentence.split(" "))
        #print(sentence_emb.shape)
        n_embs = len(data[0])
        print(n_embs, n_words)
        print(len(data[0]))
        if (n_words != n_embs):
            "There is extra tokenized word"
        results = data[0][id_word]
        return np.array(results)
    except:
        return "word not found"
return_xlnet_embedding('your', "what is your name?")
Then the output is:
what is your name
['what', 'is', 'your', 'name']
6 4
6
So the length of tokenized string that is fed to the pipeline is two more than number of my words.
How can I find which one (among these 6 values) are the embedding of my word?
As you may know, the huggingface tokenizer contains frequent subwords as well as complete words, so if you want to extract the embedding of a word you should keep in mind that it may be represented by more than one vector! In addition, huggingface pipelines encode the input sentence as a first step, and this adds special tokens to the beginning and end of the actual sentence.
string = 'This is a test for clarification'
print(pipeline.tokenizer.tokenize(string))
print(pipeline.tokenizer.encode(string))
output:
['this', 'is', 'a', 'test', 'for', 'cl', '##ari', '##fication']
[101, 2023, 2003, 1037, 3231, 2005, 18856, 8486, 10803, 102]
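One way to recover which vectors belong to your word is to ask the tokenizer for its word alignment instead of splitting on spaces yourself. Below is a minimal sketch with a hypothetical word_vectors helper, assuming the xlnet-base-cased feature-extraction pipeline from the question and a fast tokenizer (which exposes word_ids()):

import numpy as np
from transformers import AutoTokenizer, pipeline

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # fast tokenizer by default
model_pipeline = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)

def word_vectors(word, sentence):
    # encode once to get the token -> word alignment that fast tokenizers expose
    encoding = tokenizer(sentence)
    # assumes the tokenizer's notion of a "word" lines up with whitespace splitting
    target = sentence.split().index(word)
    # indices of all tokens (possibly several subwords, never the special tokens) mapped to that word
    token_ids = [i for i, w in enumerate(encoding.word_ids()) if w == target]
    embeddings = np.array(model_pipeline(sentence)[0])  # shape: (n_tokens, hidden_size)
    return embeddings[token_ids]

vecs = word_vectors('your', 'what is your name?')
print(vecs.shape)  # one row per subword token of 'your'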
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

paragraph = '''State-run Bharat Sanchar Nigam Ltd (BSNL) is readying to pay November salary in another two days, which will be raised from internal accruals and bank loans.'''

sentence = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()

for i in range(len(sentence)):
    words = nltk.word_tokenize(i)
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentence[i] = ' '.join(words)
I am getting an error on this part
words = nltk.word_tokenize(i)
range() produces an iterable of integers, so when you feed i into nltk.word_tokenize(), you're feeding it an integer rather than a string.
I don't know the details of nltk.word_tokenize(), but based on context it seems you want to pass the sentence at index i instead of just the index i:
words = nltk.word_tokenize(sentence[i])
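For completeness, a minimal sketch of the corrected loop, using the same variables as in the question:

for i in range(len(sentence)):
    # tokenize the i-th sentence, not the integer index
    words = nltk.word_tokenize(sentence[i])
    # drop stop words and stem the rest
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentence[i] = ' '.join(words)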
I would like the program to detect whether a certain word appears before the search word and, if it does not, add the search word to a list.
This is what I have come up with myself:
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
all = ["take", "take"]
all2= [w for w in all if not(re.search(r'not' + w + r'\b', sentence))]
print(all2)
The expected output is ['take'], but it remains the same: ['take', 'take'].
Here is how it should be formulated: gather all occurrences of take that are not preceded by the word not (the (?<!\bnot) negative lookbehind fails the match when not appears right before the whitespace preceding take):
import re
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
search_word = 'take'
all_takes_without_not = re.findall(fr'(?<!\bnot)\s+({search_word})\b', sentence)
print(all_takes_without_not)
The output:
['take']
It may be simpler to first convert your sentence to a list of words.
from itertools import chain
# Get individual words from the string
words = sentence.split()
# Create an iterator which yields the previous word at each position
previous = chain([None], words)
output = [word for prev, word in zip(previous, words) if word=='take' and prev != 'not']
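A quick check, assuming the same sentence variable from the question:

print(output)  # ['take'] - the second take is dropped because the word before it is 'not'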
I am relatively new to NLP so please be gentle. I have a complete list of the text from Trump's tweets since taking office and I am tokenizing the text to analyze the content.
I am using the TweetTokenizer from the nltk library in python and I'm trying to get everything tokenized except for numbers and punctuation. The problem is that my code removes all the tokens except one.
I have tried using the .isalpha() method, but this did not work; I thought it would, since it should only be True for strings composed of letters.
# Create a content from the tweets
text = non_re['text']

# Make all text in lowercase
low_txt = [l.lower() for l in text]

# Iteratively tokenize the tweets
TokTweet = TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt if t.isalpha()]
My output from this is just one token.
If I remove the if t.isalpha() statement then I get all of the tokens, including numbers and punctuation, suggesting that isalpha() is to blame for the over-trimming.
What I would like, is a way to get the tokens from the tweet text without punctuation and numbers.
Thanks for your help!
Try something like below:
import string
import re
import nltk
from nltk.tokenize import TweetTokenizer

tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"

def clean_text(text):
    # remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # substitute multiple whitespace with a single whitespace
    # also removes leading and trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace

cleaned_tweet = clean_text(tweet)
tt = TweetTokenizer()
print(tt.tokenize(cleaned_tweet))
output:
['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
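For what it's worth, the root cause in the question is that t.isalpha() is evaluated on each whole tweet, so any tweet containing a space, digit, or punctuation mark is dropped entirely. An alternative sketch, assuming the same low_txt list from the question, is to tokenize first and then filter individual tokens:

from nltk.tokenize import TweetTokenizer

TokTweet = TweetTokenizer()
# tokenize every tweet, then keep only purely alphabetic tokens
# note: tokens with internal punctuation such as can't are dropped too
tokens = [[tok for tok in TokTweet.tokenize(t) if tok.isalpha()] for t in low_txt]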
# Function for removing punctuation from text; it also returns the total number of punctuation marks removed
# Input: the function takes an existing file name and a new file name as strings, i.e. 'existingFileName.txt' and 'newFileName.txt'
# Return: it returns two things: the punctuation-free file opened in read mode and a punctuation count variable
def removePunctuation(tokenizeSampleText, newFileName):
    from nltk.tokenize import word_tokenize
    import string

    existingFile = open(tokenizeSampleText, 'r')
    read_existingFile = existingFile.read()
    tokenize_existingFile = word_tokenize(read_existingFile)

    puncRemovedFile = open(newFileName, 'w+')

    stringPun = list(string.punctuation)
    count_pun = 0
    for word in tokenize_existingFile:
        if word in stringPun:
            count_pun += 1
        else:
            word = word + ' '
            puncRemovedFile.write(''.join(word))

    existingFile.close()
    puncRemovedFile.close()
    return open(newFileName, 'r'), count_pun

punRemoved, punCount = removePunctuation('Macbeth.txt', 'Macbeth-punctuationRemoved.txt')
print(f'Total Punctuation : {punCount}')
punRemoved.read()
I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then clean it from stopwords. A tokenizer usually recognizes punctuation.
import nltk

text = 'Son, if you really want something in this life,\
 you have to work for it. Now quiet! They are about\
 to announce the lottery numbers.'

stopwords = ['in', 'to', 'for', 'the']

sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This pulls out everything except lowercase and uppercase ASCII letters and matches the result against the words in your stopword set or list. It also assumes that all of the words in stopwords are lowercase, which is why I converted the word to lowercase; take that out if I made too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
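One caveat with that one-liner: if a word starts with a non-letter character, re.match() returns None and .group() raises an AttributeError. A slightly more defensive sketch with a hypothetical leading_letters helper, assuming the same text and stopwords as above:

import re

def leading_letters(word):
    # keep only the leading run of ASCII letters; return '' if there is none
    m = re.match(r'[A-Za-z]+', word)
    return m.group().lower() if m else ''

clean = ' '.join(word for word in text.split() if leading_letters(word) not in stopwords)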