I'm working on some NLP with textual data from doctors, just trying to do some basic preprocessing / text cleaning: removing stop words and punctuation. I have already given the program a list of punctuation characters and stop words.
My text data looks something like this:
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells"
Then my code looks like:
import string

# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Create a function to remove stop words
def remove_stop_words(x):
    x = ' '.join([i for i in x.split(' ') if i not in stop])
    return x

# Create a function to lowercase the words
def to_lower(x):
    return x.lower()
So then I try to apply the functions to the Text column
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(lower)
And I get an error message like:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 train['Text'] = train['Text'].apply(remove_punctuation)
      2 train['Text'] = train['Text'].apply(remove_stop_words)
      3 train['Text'] = train['Text'].apply(lower)

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   3192         else:
   3193             values = self.astype(object).values
-> 3194         mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-...> in remove_punctuation(sentence)
      3 # Create a function to remove punctuations
      4 def remove_punctuation(sentence: str) -> str:
----> 5     return sentence.translate(str.maketrans('', '', string.punctuation))
      6
      7 # Create a function to remove stop words

AttributeError: 'float' object has no attribute 'translate'
Why am I getting this error? I'm guessing it's because digits appear in the text?
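For what it's worth, the likely culprit is not digits but missing values: pandas stores NaN as a float, so any NaN in the Text column reaches remove_punctuation as a float rather than a string. A minimal guard, assuming the missing entries can simply become empty strings:

train['Text'] = train['Text'].fillna('').astype(str)
train['Text'] = train['Text'].apply(remove_punctuation)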
This is the code I am using to replace text in PowerPoint. First I extract the text from the presentation, and then I store the translated and original sentences in a dictionary.
from pptx import Presentation

prs = Presentation('/content/drive/MyDrive/presentation1.pptx')

# To get shapes in your slides
slides = [slide for slide in prs.slides]
shapes = []
for slide in slides:
    for shape in slide.shapes:
        shapes.append(shape)
def replace_text(self, replacements: dict, shapes: List):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        for run in paragraph.runs:
                            cur_text = run.text
                            new_text = cur_text.replace(str(match), str(replacement))
                            run.text = new_text
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            new_text = cell.text.replace(match, replacement)
                            cell.text = new_text
replace_text(translation, shapes)
I get an error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-97-181cdd92ff8c> in <module>()
9 shapes.append(shape)
10
---> 11 def replace_text(self, replacements: dict, shapes: List):
12 """Takes dict of {match: replacement, ... } and replaces all matches.
13 Currently not implemented for charts or graphics.
NameError: name 'List' is not defined
translation is a dictionary
translation = {' Architecture': 'आर्किटेक्चर',
' Conclusion': 'निष्कर्ष',
' Motivation / Entity Extraction': 'प्रेरणा / इकाई निष्कर्षण',
' Recurrent Deep Neural Networks': 'आवर्तक गहरे तंत्रिका नेटवर्क',
' Results': 'परिणाम',
' Word Embeddings': 'शब्द एम्बेडिंग',
'Agenda': 'कार्यसूची',
'Goals': 'लक्ष्य'}
May I know why I am getting this error? What changes should be made to resolve it? Also, can I save the replaced text using prs.save('output.pptx')?
New Error
TypeError Traceback (most recent call last)
<ipython-input-104-957db45f970e> in <module>()
32 cell.text = new_text
33
---> 34 replace_text(translation, shapes)
35
36 prs.save('output.pptx')
TypeError: replace_text() missing 1 required positional argument: 'shapes'
The error you are getting, 'NameError: name 'List' is not defined', occurs because List is not a built-in name; it comes from Python's typing module and was never imported. Since Python 3.9, you'll want to use the built-in 'list[type]' instead.
For instance:
def replace_text(self, replacements: dict, shapes: list[str]):
Alternatively, you can import List from Python's typing module. However, this is deprecated in newer versions:
from typing import List
def replace_text(self, replacements: dict, shapes: List[str]):
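The second traceback, TypeError: replace_text() missing 1 required positional argument: 'shapes', has a different cause: replace_text is defined with a self parameter as if it were a class method, but it is called as a plain function, so translation gets bound to self and shapes to replacements, leaving shapes unfilled. A minimal sketch of the fix is to drop self; and yes, prs.save('output.pptx') will then write the replaced text out to a new file:

def replace_text(replacements: dict, shapes: list):
    ...  # body unchanged from the question

replace_text(translation, shapes)
prs.save('output.pptx')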
I have my 'cost_money' column like this,
0 According to different hospitals, the charging...
1 According to different hospitals, the charging...
2 According to different conditions, different h...
3 According to different hospitals, the charging...
Name: cost_money, dtype: object
Each string has some important data in brackets, which I need to extract.
"According to different hospitals, the charging standard is inconsistent, the city's three hospitals is about (1000-4000 yuan)"
My attempt for this is:
import regex as re
full_df['cost_money'] = full_df.cost_money.str.extract('\((.*?)\')
full_df
But this gives an error about string and int conversion, I guess. It is a whole string, and if I print any single character it is going to be of char type.
Apart from that, I don't need the word 'yuan' from the brackets, so my method to extract the numbers directly was:
import regex as re
df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
full_df['cost_money']
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <module>
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1082, in SeriesApply.apply(self)
1078 if isinstance(self.f, str):
1079 # if we are a string, try to dispatch
1080 return self.apply_str()
-> 1082 return self.apply_standard()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1137, in SeriesApply.apply_standard(self)
1131 values = obj.astype(object)._values
1132 # error: Argument 2 to "map_infer" has incompatible type
1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
1135 # List[Union[Callable[..., Any], str]]]]]"; expected
1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
1138 values,
1139 f, # type: ignore[arg-type]
1140 convert=self.convert_dtype,
1141 )
1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
1144 # GH#43986 Need to do list(mapped) in order to get treated as nested
1145 # See also GH#25959 regarding EA support
1146 return obj._constructor_expanddim(list(mapped), index=obj.index)
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <lambda>(x)
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\regex\regex.py:338, in findall(pattern, string, flags, pos, endpos, overlapped, concurrent, timeout, ignore_unused, **kwargs)
333 """Return a list of all matches in the string. The matches may be overlapped
334 if overlapped is True. If one or more groups are present in the pattern,
335 return a list of groups; this will be a list of tuples if the pattern has
336 more than one group. Empty matches are included in the result."""
337 pat = _compile(pattern, flags, ignore_unused, kwargs, True)
--> 338 return pat.findall(string, pos, endpos, overlapped, concurrent, timeout)
TypeError: expected string or buffer
I tried the same thing using findall, but most posts mentioned using extract, so I stuck to that.
My requested output:
[5000, 8000]
[6000, 7990]
... and so on
Can somebody please help me out? Thanks
I believe your regex was incorrect. Here are alternatives.
Example input:
df = pd.DataFrame({'cost_money': ['random text (123-456 yuans)',
'other example (789 yuans)']})
Option A:
df['cost_money'].str.extract('\((\d+-\d+)', expand=False)
Option B (allow single cost):
df['cost_money'].str.extract('\((\d+(?:-\d+)?)', expand=False)
Option C (all numbers after the first '(', as a list):
df['cost_money'].str.split('[()]').str[1].str.findall('(\d+)')
Output (assigned as new columns):
cost_money A B C
0 random text (123-456 yuans) 123-456 123-456 [123, 456]
1 other example (789 yuans) NaN 789 [789]
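If you need lists of integers rather than strings, to match the requested output, a small follow-up on Option C (a sketch, assuming every row contains a bracketed part) is to map int over each list:

df['cost_money'] = (df['cost_money'].str.split('[()]').str[1]
                    .str.findall(r'\d+')
                    .apply(lambda nums: [int(n) for n in nums]))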
You can use (\d*-\d*) to match the number part and then split on -.
df['money'] = df['cost_money'].str.extract('\((\d*-\d*).*\)')
df['money'] = df['money'].str.split('-')
Or use (\d*)[^\d]*(\d*) to match the two number parts separately:
df['money'] = df['cost_money'].str.extract('\((\d*)[^\d]*(\d*).*\)').values.tolist()
I'm currently doing a TensorFlow Transformer tutorial for sequence-to-sequence translation. At the beginning of the tutorial the class tfds.features.text.SubwordTextEncoder is used. This class can be used to convert a string to a list of integers, each representing a word.
After using the class SubwordTextEncoder to train an english tokenizer as follows:
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
(en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
the tutorial shows how this tokenizer can now be used to convert strings to lists with integers. This code snippet
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
gives the following result:
[7915, 1248, 7946, 7194, 13, 2799]
where the integer to word mapping can be shown as follows:
for ts in tokenized_string:
print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
returns
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
This all makes sense to me. The tokenizer recognises the words 'is' and 'awesome' from its training set and assigns the corresponding integers. The word 'Transformer', which was not in its training set, is split up into parts, as mentioned in the documentation.
After some experimenting with the tokenizer however, I got confused. Please consider the following code snippets
sample_string2 = 'the best there is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
which returns
[3, 332, 64, 156]
and
for ts in tokenized_string2:
print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
which returns
3 ----> the
332 ----> best
64 ----> there
156 ----> is
Question: Why does the tokenizer return different integers for the same word if they appear in different parts of the sentence? The word 'is' maps to 156 in the second example, whereas in the first example it was mapped to the integer 13, using the same tokenizer.
I added one more expression, len(tokenizer_en.decode([ts])), to the print statement to see the length, and tried the example below.
Example:
sample_string2 = 'is is is is is is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
for ts in tokenized_string2:
print ('{} ----> {} ----> {}'.format(ts, tokenizer_en.decode([ts]),len(tokenizer_en.decode([ts]))))
Output -
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
156 ----> is ----> 2
As per the documentation of the arguments, it states:
vocab_list - list<str>, list of subwords for the vocabulary. Note that
an underscore at the end of a subword indicates the end of the word
(i.e. a space will be inserted afterwards when decoding). Underscores
in the interior of subwords are disallowed and should use the
underscore escape sequence.
So ID 13 is the subword 'is' with an end-of-word underscore (it decodes to 'is ' with a trailing space, hence length 3), while ID 156 is the bare subword 'is' with no end-of-word marker, which is why it only appears for the final 'is' at the very end of the string and decodes with length 2.
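To see the difference directly, one can print the repr of the decoded tokens; a quick check, assuming the same trained tokenizer_en as above:

print(repr(tokenizer_en.decode([13])))   # 'is ' -- end-of-word subword, trailing space added
print(repr(tokenizer_en.decode([156])))  # 'is'  -- bare subword, no trailing space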
I am working on preprocessing the data in the "Job Description" column, which contains text data. I have created a dataframe and am trying to apply a function to preprocess the data, but I get the error "expected string or bytes-like object" when applying the function to the column in the data frame. Please refer to my code below and help.
####################################################
import re

#Function to pre process the data
def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """

    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)
    text = re.sub(r"\"", "", text)

    # convert text to lowercase
    text = text.strip().lower()

    # replace all non-letters (digits included) with spaces
    text = re.sub("[^a-zA-Z]",  # Search for all non-letters
                  " ",          # Replace all non-letters with spaces
                  str(text))

    # replace punctuation characters with spaces
    filters = '!"\'#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text
#############################################################
#To apply the clean_text function to the job_description column in the data frame
df['jobnew']=df['job_description'].apply(clean_text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-c15402ac31ba> in <module>()
----> 1 df['jobnew']=df['job_description'].apply(clean_text)
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-30-5f24dbf9d559> in clean_text(text)
10
11 # remove HTML tags
---> 12 text = re.sub(r'<.*?>', '', text)
13
14 # remove the characters [\], ['] and ["]
~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], it is clear that the problem must be in how you created this data frame... and you don't show that part of your code.
Construct your dataframe so that this column only contains strings, and your program will run without error for at least a few more lines.
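A minimal sketch of that cleanup, assuming the offending values are NaNs (which pandas stores as floats) that can simply be blanked out:

df['job_description'] = df['job_description'].fillna('').astype(str)
df['jobnew'] = df['job_description'].apply(clean_text)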
I'm facing this error and I'm really not able to find the reason for it.
Can somebody please point out the reason for it?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second, similar error that I am facing is this:
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert #username to AT_USER
---> 13 tweet = re.sub('#[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Shall I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error I don't even know which line it's failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting them to a string before applying lower() always worked for me.
My answer will be broader than shalini's answer. If you want to check whether the object is of type str, then I suggest you check the type of the object using isinstance(), as shown below. This is the more Pythonic way.
tweet = "stackoverflow"

## best way of doing it
if isinstance(tweet, str):
    print(tweet)

## other way of doing it
if type(tweet) is str:
    print(tweet)

## This is one more way to do it
if type(tweet) == str:
    print(tweet)
All of the above work fine to check whether the type of the object is string or not.
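Applied to the original loop, a sketch that simply skips non-string comments instead of crashing on them:

for i in tweet_raw.comments:
    if isinstance(i, str):
        tweet_proc.append(processtwt(i))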