I'm currently doing a TensorFlow transformer tutorial for sequence-to-sequence translation. At the beginning of the tutorial, the class tfds.features.text.SubwordTextEncoder is used. This class can be used to convert a string to a list of integers, each representing a word or part of a word.
After using the class SubwordTextEncoder to train an English tokenizer as follows:
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
the tutorial shows how this tokenizer can now be used to convert strings to lists of integers. This code snippet
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
gives the following result:
[7915, 1248, 7946, 7194, 13, 2799]
where the integer-to-word mapping can be shown as follows:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
returns
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
This all makes sense to me. The tokenizer recognises the words 'is' and 'awesome' from its training set and assigns the corresponding integers. The word 'Transformer', which was not in its training set, is split up into parts, as mentioned in the documentation.
After some experimenting with the tokenizer, however, I got confused. Please consider the following code snippets:
sample_string2 = 'the best there is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
which returns
[3, 332, 64, 156]
and
for ts in tokenized_string2:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
which returns
3 ----> the
332 ----> best
64 ----> there
156 ----> is
Question: Why does the tokenizer return different integers for the same word when it appears in a different part of the sentence? The word 'is' maps to 156 in the second example, whereas in the first example it maps to 13, using the same tokenizer.
I added one more expression, len(tokenizer_en.decode([ts])), to the print statement to see the length of each decoded token, and tried the example below:
Example:
sample_string2 = 'is is is is is is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
for ts in tokenized_string2:
    print('{} ----> {} ----> {}'.format(ts, tokenizer_en.decode([ts]), len(tokenizer_en.decode([ts]))))
Output -
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
156 ----> is ----> 2
As per the documentation of the arguments, it states:

vocab_list - list<str>, list of subwords for the vocabulary. Note that an underscore at the end of a subword indicates the end of the word (i.e. a space will be inserted afterwards when decoding). Underscores in the interior of subwords are disallowed and should use the underscore escape sequence.

So 13 and 156 are two different entries in the vocabulary: 13 is the word-final subword 'is_', which decodes to 'is ' with a trailing space (hence length 3), while 156 is the interior subword 'is' with no trailing space (length 2). Whenever 'is' is followed by a space it encodes as 13; the final 'is' in the sample strings, with nothing after it, encodes as 156.
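A quick way to see this, assuming the tokenizer_en built above, is to print the repr() of each decoded id so the trailing space becomes visible:

# 13 should decode to 'is ' (the word-final subword 'is_'),
# 156 to 'is' (no trailing space), matching the lengths above.
for token_id in [13, 156]:
    print(token_id, '---->', repr(tokenizer_en.decode([token_id])))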
The following code yields this error for both the GPT2 and XLNet bases. The download of the bases occurs, but the same error is displayed at the end every time.
I am using Google Colab, by the way.
ValueError: 50256 is not in list
import nlpaug
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
text = "The quick brown fox jumped over the lazy dog"
aug_cs_gpt2 = nas.ContextualWordEmbsForSentenceAug(model_type = 'gpt2')
temp = aug_cs_gpt2.augment(text)
print(temp)
aug_cs_xlnet = nas.ContextualWordEmbsForSentenceAug(model_type = 'xlnet')
temp = aug_cs_xlnet.augment(text)
print(temp)
Expecting the augmented text to be printed.
AttributeError Traceback (most recent call last)
<ipython-input-41-6810452650ff> in <module>
9
10 aug_cs_xlnet = nas.ContextualWordEmbsForSentenceAug(model_type = 'xlnet')
---> 11 temp = aug_cs_xlnet.augment(text)
12 print(temp)
2 frames
/usr/local/lib/python3.8/dist-packages/nlpaug/augmenter/sentence/context_word_embs_sentence.py in _custom_insert(self, all_data)
148 # Mask token is needed for xlnet. No mask token for gpt2
149 if self.model_type in ['xlnet']:
--> 150 text += ' ' + self.model.MASK_TOKEN
151
152 texts.append(text)
AttributeError: 'Gpt2' object has no attribute 'MASK_TOKEN'
This is a known bug in nlpaug in versions prior to 0.0.16. Upgrade nlpaug to 0.0.16 or newer and the issue should be gone.
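Since you are on Google Colab, upgrading in place and then restarting the runtime should be enough; the version pin below simply reflects the fix described above:

!pip install --upgrade "nlpaug>=0.0.16"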
This is the code I am using to replace text in PowerPoint. First I extract the text from the PowerPoint file and then store the translated and original sentences as a dictionary.
from pptx import Presentation

prs = Presentation('/content/drive/MyDrive/presentation1.pptx')

# To get shapes in your slides
slides = [slide for slide in prs.slides]
shapes = []
for slide in slides:
    for shape in slide.shapes:
        shapes.append(shape)
def replace_text(self, replacements: dict, shapes: List):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        for run in paragraph.runs:
                            cur_text = run.text
                            new_text = cur_text.replace(str(match), str(replacement))
                            run.text = new_text
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            new_text = cell.text.replace(match, replacement)
                            cell.text = new_text
replace_text(translation, shapes)
I get an error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-97-181cdd92ff8c> in <module>()
9 shapes.append(shape)
10
---> 11 def replace_text(self, replacements: dict, shapes: List):
12 """Takes dict of {match: replacement, ... } and replaces all matches.
13 Currently not implemented for charts or graphics.
NameError: name 'List' is not defined
translation is a dictionary
translation = {' Architecture': 'आर्किटेक्चर',
' Conclusion': 'निष्कर्ष',
' Motivation / Entity Extraction': 'प्रेरणा / इकाई निष्कर्षण',
' Recurrent Deep Neural Networks': 'आवर्तक गहरे तंत्रिका नेटवर्क',
' Results': 'परिणाम',
' Word Embeddings': 'शब्द एम्बेडिंग',
'Agenda': 'कार्यसूची',
'Goals': 'लक्ष्य'}
May I know why I am getting this error, and what changes should be made to resolve it? Also, can I save the presentation with the replaced text using prs.save('output.pptx')?
New Error
TypeError Traceback (most recent call last)
<ipython-input-104-957db45f970e> in <module>()
32 cell.text = new_text
33
---> 34 replace_text(translation, shapes)
35
36 prs.save('output.pptx')
TypeError: replace_text() missing 1 required positional argument: 'shapes'
The error you are getting, NameError: name 'List' is not defined, occurs because List is not a built-in name: it lives in the typing module and was never imported. Since Python 3.9, you can simply use the built-in list in the annotation.
For instance:
def replace_text(self, replacements: dict, shapes: list):
Alternatively, you can import List from Python's typing module. This still works, though the typing.List alias is deprecated in favour of the built-in list in newer versions:
from typing import List
def replace_text(self, replacements: dict, shapes: List):
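The follow-up TypeError is a separate problem: replace_text is defined as a plain function but takes self as its first parameter, so the call replace_text(translation, shapes) binds translation to self and shapes to replacements, leaving shapes with no value. Dropping self fixes it. A minimal sketch of the corrected function and call, reusing prs, shapes and translation from above (and yes, prs.save('output.pptx') then writes the deck with the replaced text):

from typing import List

def replace_text(replacements: dict, shapes: List):
    """Takes dict of {match: replacement, ...} and replaces all matches."""
    for shape in shapes:
        for match, replacement in replacements.items():
            # Ordinary text frames: replace run by run to keep formatting.
            if shape.has_text_frame and shape.text.find(match) != -1:
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        run.text = run.text.replace(str(match), str(replacement))
            # Tables: replace the text cell by cell.
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            cell.text = cell.text.replace(match, replacement)

replace_text(translation, shapes)
prs.save('output.pptx')  # writes the deck with the replaced text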
I want to input a string, tokenize it, and compare each word with a specific word (in this code, the word is 'play'). I have this code:
from nltk.tokenize import word_tokenize

txt = "bat ball cocaine golf football cake leg hand me you her she he dog cat drug"
x = word_tokenize(txt)

from nltk.corpus import wordnet

for i in range(10):
    syn = wordnet.synsets(x[i])[0]
    print("Synset name : ", syn.name())
    w1 = wordnet.synset('play.n.01')
    w2 = wordnet.synset(syn)
    print(w1.wup_similarity(w2))
    i = i + 1
This gives an error:
AttributeError Traceback (most recent call last)
<ipython-input-127-a30645977ba6> in <module>()
13
14 w1 = wordnet.synset('play.n.01')
---> 15 w2 = wordnet.synset(syn)
16 print(w1.wup_similarity(w2))
17 i = i +1
Any help would be appreciated.
You passed the wrong argument to the function: wordnet.synset() expects a synset name string, not a Synset object. Instead, you should have passed syn.name():
w2 = wordnet.synset(syn.name())
With this correction, an IndexError: list index out of range is raised on the 10th iteration.
Try this to solve the problem:
syn = wordnet.synsets(x[i])
if syn:
    syn = syn[0]
else:
    continue  # no synsets for this word, so skip it
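Putting both fixes together, a minimal version of the loop (a sketch: w1 is hoisted out of the loop, and words with no synsets in WordNet are skipped, which is what caused the IndexError on the 10th word, 'you'):

from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

txt = "bat ball cocaine golf football cake leg hand me you her she he dog cat drug"
w1 = wordnet.synset('play.n.01')

for word in word_tokenize(txt):
    synsets = wordnet.synsets(word)
    if not synsets:
        continue  # e.g. 'you' has no synsets
    w2 = synsets[0]  # wup_similarity accepts the Synset object directly
    print(word, '---->', w1.wup_similarity(w2))  # may be None across parts of speech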
I'm working on some NLP with textual data from doctors, just trying to do some basic preprocessing and text cleaning: removing stop words and punctuation. I have already given the program a list of punctuation characters and stop words.
My text data looks something like this:
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells"
Then my code looks like:
import string

# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Create a function to remove stop words
def remove_stop_words(x):
    x = ' '.join([i for i in x.split(' ') if i not in stop])
    return x

# Create a function to lowercase the words
def to_lower(x):
    return x.lower()
So then I try to apply the functions to the Text column:
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(lower)
And I get an error message like:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 train['Text'] = train['Text'].apply(remove_punctuation)
      2 train['Text'] = train['Text'].apply(remove_stop_words)
      3 train['Text'] = train['Text'].apply(lower)

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194             mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-...> in remove_punctuation(sentence)
      3 # Create a function to remove punctuations
      4 def remove_punctuation(sentence: str) -> str:
----> 5     return sentence.translate(str.maketrans('', '', string.punctuation))
      6
      7 # Create a function to remove stop words

AttributeError: 'float' object has no attribute 'translate'
Why am I getting this error? I'm guessing it's because digits appear in the text?
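Most likely it is not the digits: a digit inside a string is still a string. The usual culprit is missing values, because pandas stores NaN as a float, and a float has no translate (or lower) method. A minimal guard, assuming an empty string is an acceptable stand-in for missing text (note also that the last apply calls lower, which is undefined; to_lower is presumably what was meant):

train['Text'] = train['Text'].fillna('').astype(str)  # NaN cells are floats, coerce first
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(to_lower)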
I'm facing this error and I'm really not able to find the reason for it.
Can somebody please point out the reason for it?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second, similar error that I am facing is this:
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert #username to AT_USER
---> 13 tweet = re.sub('#[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Should I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error, I don't even know which line it's failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting the value to a string before applying lower() has always worked for me.
My answer is broader than shalini's answer. If you want to check whether an object is of type str, I suggest checking its type using isinstance(), as shown below. This is the more Pythonic way.
tweet = "stackoverflow"
## best way of doing it
if isinstance(tweet,(str,)):
print tweet
## other way of doing it
if type(tweet) is str:
print tweet
## This is one more way to do it
if type(tweet) == str:
print tweet
All of the above work fine for checking whether or not an object is a string.
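Applied to the loop from the question, a minimal sketch (assuming non-string entries, such as NaN in the comments column, should simply be skipped):

tweet_proc = []
for i in tweet_raw.comments:
    if isinstance(i, str):  # floats such as NaN would fail inside processtwt
        tweet_proc.append(processtwt(i))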