I am working on a data extraction use case. To preprocess and tokenize my data, I am using both the spaCy English and German tokenizers, because the sentences are in both languages. Here's my code:
import spacy
from spacy.lang.de import German
from spacy.lang.en import English
from spacy.lang.de import STOP_WORDS as stp_wrds_de
from spacy.lang.en.stop_words import STOP_WORDS as stp_wrds_en
import string
punctuations = string.punctuation
# German Parser
parser_de = German()
# English Parser
parser_en = English()
def spacy_tokenizer_de(document):
    # Token object for splitting text into 'units'
    tokens = parser_de(document)
    # Lemmatization: Grammatical conversion of words
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    # Remove punctuations
    tokens = [word for word in tokens if word not in punctuations]
    tokens_de_str = converttostr(tokens,' ')
    tokens_en = spacy_tokenizer_en(tokens_de_str)
    print("Tokens EN: {}".format(tokens_en))
    tokens_en = converttostr(tokens_en,' ')
    return tokens_en
def converttostr(input_seq, separator):
    # Join all the strings in list
    final_str = separator.join(input_seq)
    return final_str
def spacy_tokenizer_en(document):
    tokens = parser_en(document)
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    return tokens
Here's a further elucidation of the above code:
1. spacy_tokenizer_de(): Method to parse and tokenize document in German
2. spacy_tokenizer_en(): Method to parse and tokenize document in English
3. converttostr(): Converts list of tokens to a string, so that the English spacy tokenizer can read the input (only accepts document/string format) and tokenize the data.
However, some sentences, when parsed, lead to the following error:
Why does a spaCy Token object come up in such cases, while other sentences are processed successfully? Can anyone please help here?
You're supposed to get a list of strings here, right? Instead, the comprehension sometimes yields a string (when the lemma is not '-PRON-') and other times the Token object itself, which is not a string.
You can get a string from token.text:
token.lemma_.strip() if token.lemma_ != '-PRON-' else token.text for token in tokens
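A minimal sketch of why a mixed list breaks a later join, and how token.text avoids it. FakeToken is a hypothetical stand-in for a spaCy Token (so the example runs without spaCy); it models only the text and lemma_ attributes, and the sample lemmas are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token; models only text and lemma_."""
    text: str
    lemma_: str


tokens = [FakeToken("Sie", "-PRON-"), FakeToken("liest", "lesen")]

# Original pattern: the else branch yields the token object, not a string.
mixed = [t.lemma_.strip() if t.lemma_ != "-PRON-" else t for t in tokens]

try:
    " ".join(mixed)
    join_failed = False
except TypeError:
    # str.join only accepts strings, so a non-string element raises TypeError.
    join_failed = True

# Fixed pattern: both branches yield strings.
fixed = [t.lemma_.strip() if t.lemma_ != "-PRON-" else t.text for t in tokens]
joined = " ".join(fixed)
```

This would also explain why only some sentences fail: presumably only those containing a token whose lemma is '-PRON-' ever reach the else branch.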
When I use spaCy for cleaning data, I run the following line:
df['text'] = df.sentence.progress_apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha))
This lemmatizes each word in the text row if the word is not a stop word. The problem is that the stop-word check (token.is_stop) is applied to the original token, not to its lemma. Therefore, if a token's lemma is a stop word but the token itself is not, the lemma still ends up in the output. For example, if I add "friend" to the list of stop words, the output will still contain "friend" if the original token was "friends". The easy solution is to run this line twice, but that sounds silly. Can anyone suggest a way to remove, in the first run, stop words that only match in their lemmatized form?
Thanks!
You can simply check if the token.lemma_ is present in the nlp.Defaults.stop_words:
if token.lemma_.lower() not in nlp.Defaults.stop_words
For example:
df['text'] = df.sentence.progress_apply(
    lambda text:
        " ".join(
            token.lemma_ for token in nlp(text)
            if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
        )
)
See a quick test:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.Defaults.stop_words.add("friend") # Adding "friend" to stopword list
>>> text = "I have a lot of friends"
>>> " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha)
'lot friend'
>>> " ".join(token.lemma_ for token in nlp(text) if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha)
'lot'
If you add words in uppercase to the stopword list, you will need to use if token.lemma_.lower() not in map(str.lower, nlp.Defaults.stop_words).
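The map(str.lower, ...) expression above lowercases the whole stop-word list again for every token. A sketch of an alternative that lowercases the set once up front; filter_lemmas and the (lemma, is_alpha) pairs are hypothetical stand-ins for token.lemma_ and token.is_alpha, so the example runs without a spaCy model.

```python
def filter_lemmas(lemma_pairs, stop_words):
    """Keep alphabetic lemmas that are not stop words, case-insensitively.

    lemma_pairs: iterable of (lemma, is_alpha) tuples, a hypothetical
    stand-in for token.lemma_ and token.is_alpha.
    """
    lowered_stops = {word.lower() for word in stop_words}  # built once
    return " ".join(
        lemma for lemma, is_alpha in lemma_pairs
        if is_alpha and lemma.lower() not in lowered_stops
    )


pairs = [("I", True), ("have", True), ("a", True), ("lot", True),
         ("of", True), ("friend", True)]
stops = {"i", "have", "a", "of", "Friend"}  # note the capitalised entry
result = filter_lemmas(pairs, stops)
```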
I would like to know if the spacy tokenizer could tokenize words only using the "space" rule.
For example:
sentence= "(c/o Oxford University )"
Normally, using the following configuration of spacy:
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token)
the result would be:
(
c
/
o
Oxford
University
)
Instead, I would like an output like the following (using spacy):
(c/o
Oxford
University
)
Is it possible to obtain a result like this using spacy?
Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]
You can further adjust Tokenizer by adding custom suffix, prefix, and infix rules.
An alternative, more fine-grained way would be to find out why the it's token is split the way it is, using nlp.tokenizer.explain():
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)
You'll find out that the split is due to SPECIAL rules:
[('TOKEN', 'This'),
('TOKEN', 'is'),
('SPECIAL-1', 'it'),
('SPECIAL-2', "'s"),
('SUFFIX', '.'),
('SPECIAL-1', 'I'),
('SPECIAL-2', "'m"),
('TOKEN', 'fine')]
that could be updated to remove "it's" from exceptions like:
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]
or remove split on apostrophe altogether:
filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]
Note the dot attached to the token, which is there because no suffix rules were specified.
You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a function that takes a string text and returns a Doc object, and then assign that callable function to nlp.tokenizer:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab
    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
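The tokenizer above marks every token as followed by a space, and text.split(' ') yields empty strings when the input has consecutive spaces. A sketch of one way to compute the words and spaces lists more carefully, using only the standard library; words_and_spaces is a hypothetical helper, not part of spaCy.

```python
import re


def words_and_spaces(text):
    """Split on whitespace and record whether each word is followed by a space.

    Hypothetical helper: returns two parallel lists in the shape that
    Doc(vocab, words=..., spaces=...) expects.
    """
    words, spaces = [], []
    for match in re.finditer(r"\S+", text):
        words.append(match.group())
        end = match.end()
        # True only when the character right after the word is a space.
        spaces.append(end < len(text) and text[end] == " ")
    return words, spaces


words, spaces = words_and_spaces("(c/o Oxford University )")
```

Inside __call__, these lists could then be passed to Doc(self.vocab, words=words, spaces=spaces). Runs of several spaces, tabs, or newlines are collapsed by this sketch, so the original text would not round-trip exactly in those cases.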
I have a list of sentences. Each sentence has to be converted to a json. There is a unique 'name' for each sentence that is also specified in that json. The problem is that the number of sentences is large so it's monotonous to manually give a name. The name should be similar to the meaning of the sentence e.g., if the sentence is "do you like cake?" then the name should be like "likeCake". I want to automate the process of creation of name for each sentence. I googled text summarization but the results were not for sentence summarization but paragraph summarization. How to go about this?
This sort of task falls under natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit to deal with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# load data
with open('yourFileWithSentences.txt', 'rt') as file:
    lines = file.readlines()
stop_words = set(stopwords.words('english'))
# translation table that deletes punctuation characters
table = str.maketrans('', '', string.punctuation)
for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words and tokens left empty after stripping
    words = [w for w in stripped if w and w.lower() not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
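To get closer to the "likeCake" style asked for in the question, the remaining words can be joined in camelCase. A minimal sketch using only the standard library; make_name is a hypothetical helper and STOP_WORDS is a tiny illustrative set (an assumption; NLTK's English list is much larger).

```python
import string

# Tiny illustrative stop-word set (an assumption, not NLTK's list).
STOP_WORDS = {"do", "you", "a", "the", "is", "are", "to", "of", "i"}


def make_name(sentence, stop_words=STOP_WORDS):
    """Build a camelCase name from the non-stop words of a sentence."""
    table = str.maketrans("", "", string.punctuation)
    words = [
        word for word in sentence.lower().translate(table).split()
        if word not in stop_words
    ]
    if not words:
        return ""
    # First word stays lowercase, the rest are capitalised.
    return words[0] + "".join(word.capitalize() for word in words[1:])


name = make_name("do you like cake?")
```

Different sentences can collapse to the same name once stop words are removed, so a uniqueness check (for example, appending a counter) may still be needed.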
I have the following code. I have to add more words to the NLTK stopword list. After I run this, it does not add the words to the list.
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r")
new_stopwords = stop.union(new_words)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in emails_body_text]
Don't do things blindly. Read in your new list of stopwords, inspect it to see that it's right, then add it to the other stopword list. Start with the code suggested by @greg_data, but you'll need to strip newlines and maybe do other things; who knows what your stopwords file looks like?
This might do it, for example:
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
PS. Don't keep splitting and joining your document; tokenize once and work with the list of tokens.
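A sketch of the "read, inspect, then add" advice, assuming the file holds one stopword per line (the real file layout is unknown). load_stopwords is a hypothetical helper, and the temporary file only stands in for stopwords_en.txt so the example is self-contained.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line, stripping whitespace and blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Temporary file standing in for stopwords_en.txt (illustrative content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("foo\nBar \n\nbaz\n")
    tmp_path = tmp.name

stop = {"the", "a"}               # stand-in for the NLTK stopword set
extra = load_stopwords(tmp_path)
print(sorted(extra))              # inspect before merging
new_stopwords = stop | extra
os.remove(tmp_path)
```

Iterating over the open file object directly, as in the question, yields lines that still end in a newline, so entries like "foo\n" never match a token.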
I am trying to preprocess a string using lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any error but the text is not preprocessed appropriately. Only the stop words are removed but the lemmatizing does not work and punctuation and digits also remain.
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
And expected output should look like:
This beautiful day I work exercise text
No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer; those functions expect single words, so they will not interpret your whole string as a sentence.
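import re
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))
def clean(tweet):
    # lowercase, then drop punctuation and digits
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # lemmatize one word at a time, treating each word as a verb
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])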
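Alternatively, you can use a PorterStemmer, which serves a similar purpose to lemmatisation but works by stripping suffixes with heuristic rules, without a dictionary or part-of-speech context, so the result is not always a real word.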
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
And, call the stemmer like this:
stemmer.stem(i)
I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.
>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> s
'This is a beautiful day I am working on an exercise text'
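A related detail in the question's code: str.translate returns a new string rather than modifying corpus in place, so its result has to be assigned. A sketch of that variant using only the standard library, starting from the output shown in the question:

```python
import string

corpus = "This beautiful day16~ . I ; working exercise45.^^^45 text34 ."

# One table that deletes every punctuation character and digit.
table = str.maketrans("", "", string.punctuation + string.digits)

# Strings are immutable: calling translate without assigning changes nothing.
discarded = corpus.translate(table)
still_original = corpus

# Assign the result, then collapse the leftover runs of spaces.
cleaned = " ".join(corpus.translate(table).split())
```

Lemmatizing each word would still be a separate step after this cleanup.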
To process a tweet properly, you can use the following code:
import re
import string
import nltk
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd=[]
    # pattern matching URLs, which are removed before tokenizing
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1= text.lower()
    text1= re.sub(pattern,"", text1)
    text1= text1.replace("'s "," ")
    text1= text1.replace("'","")
    text1= text1.replace("—", " ")
    # replace every punctuation character with a space
    table= str.maketrans(string.punctuation, len(string.punctuation)*" ")
    text1= text1.translate(table)
    geek= nltk.word_tokenize(text1)
    abc=nltk.pos_tag(geek)
    output = []
    # map Penn Treebank tags to WordNet POS letters (default: noun)
    for value in abc:
        value = list(value)
        if value[1][0] =="N":
            value[1] = 'n'
        elif value[1][0] =="V":
            value[1] = 'v'
        elif value[1][0] =="J":
            value[1] = 'a'
        elif value[1][0] =="R":
            value[1] = 'r'
        else:
            value[1]='n'
        output.append(value)
    abc=output
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0],pos=value[1]))
    return bcd
Here I have used pos_tag (mapping only the N, V, J and R tags, and treating everything else as a noun). This will return a tokenized and lemmatized list of words.
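The tag-mapping chain can also be written as a small lookup. A sketch, where penn_to_wordnet is a hypothetical helper name; it keeps the same rule of defaulting to noun for any tag that does not start with N, V, J or R.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    mapping = {"N": "n", "V": "v", "J": "a", "R": "r"}
    # Only the first character of the tag decides the category.
    return mapping.get(tag[:1], "n")


examples = {tag: penn_to_wordnet(tag) for tag in ("NNS", "VBD", "JJ", "RB", "DT")}
```

Inside process, the lemmatizer call could then become lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for each (word, tag) pair returned by pos_tag.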