Convert Float Obj in Dataframe for CountVectorizer & bow_transformer - python

I am trying to feed a dataframe into a bag-of-words model with CountVectorizer, but I get TypeError: 'float' object is not iterable when I switch mess from a single test sentence to the dataframe I need to use.
The example corpus in the scikit-learn docs and the online course both load from a plain list of sentences rather than a dataframe.
I tried removing the integers, since I had previously hit AttributeError: 'int' object has no attribute 'lower' with TF-IDF and CountVectorizer:
mess1 = [item for item in mess if not isinstance(item, int)]
but then I get a different error:
TypeError: list indices must be integers or slices, not str
This is what works:
mess = 'Sample message! Notice: it has punctuation.'
This is the dataframe I need to use instead:
mess.head()
    | bios                                              | artistName
----+---------------------------------------------------+----------------------
  0 | Chris Cosentino Biography Chris Cosentino gre...  | Chris Cosentino
  1 | Magda Biography The DJ known as Magda was bor...  | Magda
  2 | Jean-Michel Cousteau Biography Since first be...  | jean michel cousteau
  3 | Kyle Busch Biography The American stock car r...  | Kyle Busch
  4 | Naughty by Nature Biography Naughty by Nature...  | Naughty by Nature
def text_process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
mess['bios'].head(5).apply(text_process)
Output
0 [Chris, Cosentino, Biography, Chris, Cosentino...
1 [Magda, Biography, DJ, known, Magda, born, rai...
2 [JeanMichel, Cousteau, Biography, Since, first...
3 [Kyle, Busch, Biography, American, stock, car,...
4 [Naughty, Nature, Biography, Naughty, Nature, ...
Name: bios, dtype: object
mess.dtypes
bios object
artistName object
dtype: object
from sklearn.feature_extraction.text import CountVectorizer
Then I run either:
bow_transformer = CountVectorizer(analyzer=text_process)
bow_transformer.fit(mess['bios'])
print(len(bow_transformer.vocabulary_))
or this:
bow_transformer = CountVectorizer(analyzer=text_process).fit(mess['bios'])
print(len(bow_transformer.vocabulary_))
I get the error
TypeError: 'float' object is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-148-74d381110eec> in <module>
1 bow_transformer = CountVectorizer(analyzer=text_process)
----> 2 bow_transformer.fit(mess['bios'])
3 print(len(bow_transformer.vocabulary_))
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
996 self
997 """
--> 998 self.fit_transform(raw_documents)
999 return self
1000
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1030
1031 vocabulary, X = self._count_vocab(raw_documents,
-> 1032 self.fixed_vocabulary_)
1033
1034 if self.binary:
~\anaconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
<ipython-input-134-ad1781692b41> in text_process(mess)
1 def text_process(mess):
2
----> 3 nopunc = [char for char in mess if char not in string.punctuation]
4
5 nopunc = ''.join(nopunc)
TypeError: 'float' object is not iterable

Based on Ben Reiniger's comment:
I looked for missing values in the dataframe. Even though the data looked complete, thousands of fully blank rows had been added.
I counted the NaNs:
count_nan = len(mess) - mess.count()
count_nan
bios 9682
artistName 9768
dtype: int64
I ran dropna to remove them:
mess.dropna(inplace=True)
The NaN count is now:
bios 0
artistName 0
dtype: int64
Error received after the NaNs were successfully dropped:
Dropping them fixed my original TypeError: 'float' object is not iterable, but when I rerun the vectorizer I get a new error, so I am one step closer.
bow_transformer = CountVectorizer(analyzer=text_process)
bow_transformer.fit(mess['bios'])
print(len(bow_transformer.vocabulary_))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-74d381110eec> in <module>
1 bow_transformer = CountVectorizer(analyzer=text_process)
----> 2 bow_transformer.fit(mess['bios'])
3 print(len(bow_transformer.vocabulary_))
TypeError: list indices must be integers or slices, not str
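For reference, here is a self-contained sketch of the dropna fix, with toy rows standing in for the real dataframe and a hard-coded stopword set standing in for NLTK's stopwords.words('english'):

```python
import string
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real dataframe; the blank row is read as NaN,
# which is the float behind "'float' object is not iterable".
mess = pd.DataFrame({
    'bios': ['Chris Cosentino Biography!', None, 'Magda Biography.'],
    'artistName': ['Chris Cosentino', None, 'Magda'],
})

STOPWORDS = {'the', 'a', 'is'}  # stand-in for stopwords.words('english')

def text_process(text):
    no_punc = ''.join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in STOPWORDS]

mess.dropna(inplace=True)  # remove the blank rows before vectorizing
bow_transformer = CountVectorizer(analyzer=text_process).fit(mess['bios'])
print(len(bow_transformer.vocabulary_))  # prints 4
```

One likely cause of the follow-up TypeError: list indices must be integers or slices, not str is that mess was reassigned to a plain list (such as mess1) earlier in the session, so mess['bios'] then indexes a list with a string key.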

Related

Get both numbers in the brackets of a string with regex in Python

I have my 'cost_money' column like this,
0 According to different hospitals, the charging...
1 According to different hospitals, the charging...
2 According to different conditions, different h...
3 According to different hospitals, the charging...
Name: cost_money, dtype: object
Out of which each string has some important data in brackets, which I need to extract.
"According to different hospitals, the charging standard is inconsistent, the city's three hospitals is about (1000-4000 yuan)"
My attempt is:
import regex as re
full_df['cost_money'] = full_df.cost_money.str.extract('\((.*?)\')
full_df
But this raises an error, something about string-to-int conversion, I guess. The cell value is one whole string, and any character I print from it is itself a string.
Also, I don't need the word 'yuan' from the brackets, so my attempt to extract the numbers directly was:
import regex as re
df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
full_df['cost_money']
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <module>
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\series.py:4433, in Series.apply(self, func, convert_dtype, args, **kwargs)
4323 def apply(
4324 self,
4325 func: AggFuncType,
(...)
4328 **kwargs,
4329 ) -> DataFrame | Series:
4330 """
4331 Invoke function on values of Series.
4332
(...)
4431 dtype: float64
4432 """
-> 4433 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1082, in SeriesApply.apply(self)
1078 if isinstance(self.f, str):
1079 # if we are a string, try to dispatch
1080 return self.apply_str()
-> 1082 return self.apply_standard()
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\core\apply.py:1137, in SeriesApply.apply_standard(self)
1131 values = obj.astype(object)._values
1132 # error: Argument 2 to "map_infer" has incompatible type
1133 # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
1134 # Dict[Hashable, Union[Union[Callable[..., Any], str],
1135 # List[Union[Callable[..., Any], str]]]]]"; expected
1136 # "Callable[[Any], Any]"
-> 1137 mapped = lib.map_infer(
1138 values,
1139 f, # type: ignore[arg-type]
1140 convert=self.convert_dtype,
1141 )
1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
1144 # GH#43986 Need to do list(mapped) in order to get treated as nested
1145 # See also GH#25959 regarding EA support
1146 return obj._constructor_expanddim(list(mapped), index=obj.index)
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\pandas\_libs\lib.pyx:2870, in pandas._libs.lib.map_infer()
c:\Users\Siddhi\HealthcareChatbot\eda.ipynb Cell 11' in <lambda>(x)
1 import regex as re
----> 2 df['cost_money'].apply(lambda x: re.findall(r"[-+]?\d*\.\d+|\d+", x)).tolist()
3 full_df['cost_money']
File c:\Users\Siddhi\HealthcareChatbot\venv\lib\site-packages\regex\regex.py:338, in findall(pattern, string, flags, pos, endpos, overlapped, concurrent, timeout, ignore_unused, **kwargs)
333 """Return a list of all matches in the string. The matches may be overlapped
334 if overlapped is True. If one or more groups are present in the pattern,
335 return a list of groups; this will be a list of tuples if the pattern has
336 more than one group. Empty matches are included in the result."""
337 pat = _compile(pattern, flags, ignore_unused, kwargs, True)
--> 338 return pat.findall(string, pos, endpos, overlapped, concurrent, timeout)
TypeError: expected string or buffer
I tried the same thing using findall but most posts mentioned using extract so I stuck to that.
My requested output:
[5000, 8000]
[6000, 7990]
...and so on.
Can somebody please help me out? Thanks.
I believe your regex was incorrect. Here are alternatives.
Example input:
df = pd.DataFrame({'cost_money': ['random text (123-456 yuans)',
                                  'other example (789 yuans)']})
Option A:
df['cost_money'].str.extract('\((\d+-\d+)', expand=False)
Option B (allow single cost):
df['cost_money'].str.extract('\((\d+(?:-\d+)?)', expand=False)
Option C (all numbers after the first '(' as a list):
df['cost_money'].str.split('[()]').str[1].str.findall('(\d+)')
Output (assigned as new columns):
                    cost_money        A        B           C
0  random text (123-456 yuans)  123-456  123-456  [123, 456]
1    other example (789 yuans)      NaN      789       [789]
You can use (\d*-\d*) to match the number part and then split on -.
df['money'] = df['cost_money'].str.extract('\((\d*-\d*).*\)')
df['money'] = df['money'].str.split('-')
Or use (\d*)[^\d]*(\d*) to match the two number parts separately:
df['money'] = df['cost_money'].str.extract('\((\d*)[^\d]*(\d*).*\)').values.tolist()
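Putting it together, here is a self-contained sketch (the sample rows are invented stand-ins for the real column, including a missing value) that produces the requested lists of integers; the NaN-aware .str.findall accessor skips missing cells instead of raising the 'expected string or buffer' error:

```python
import pandas as pd

# Invented rows standing in for the real 'cost_money' column
df = pd.DataFrame({'cost_money': [
    "the city's three hospitals is about (5000-8000 yuan)",
    'the charging standard is inconsistent (6000-7990 yuan)',
    None,  # missing cell, which plain re.findall would choke on
]})

# .str.findall skips NaN; then convert each matched string to int
df['money'] = df['cost_money'].str.findall(r'\d+').apply(
    lambda nums: [int(n) for n in nums] if isinstance(nums, list) else nums)
print(df['money'].tolist())  # [[5000, 8000], [6000, 7990], nan]
```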

How does the `tfds.features.text.SubwordTextEncoder` create word encoding?

I'm currently doing a tensorflow transformer tutorial for sequence to sequence translation. At the beginning of the tutorial the class tfds.features.text.SubwordTextEncoder is called. This class can be used to convert a string to a list with integers, each representing a word.
After using the class SubwordTextEncoder to train an english tokenizer as follows:
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
the tutorial shows how this tokenizer can now be used to convert strings to lists with integers. This code snippet
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
gives the following result:
[7915, 1248, 7946, 7194, 13, 2799]
where the integer to word mapping can be shown as follows:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
returns
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
This all makes sense to me. The tokenizer recognises the words 'is' and 'awesome' from its training set and assigns the corresponding integers. The word 'Transformer' which was not in its training set is being split up into parts as is mentioned in the documentation.
After some experimenting with the tokenizer however, I got confused. Please consider the following code snippets
sample_string2 = 'the best there is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
which returns
[3, 332, 64, 156]
and
for ts in tokenized_string2:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
which returns
3 ----> the
332 ----> best
64 ----> there
156 ----> is
Question: Why does the tokenizer return different integers for the same word if they are in a different part of the sentence? The word 'is' maps to 156 in the second example, where in the first example it is mapped to the integer 13, using the same tokenizer.
I added len(tokenizer_en.decode([ts])) to the print statement to see the decoded length, and tried the example below:
Example:
sample_string2 = 'is is is is is is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
for ts in tokenized_string2:
    print('{} ----> {} ----> {}'.format(ts, tokenizer_en.decode([ts]), len(tokenizer_en.decode([ts]))))
Output -
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
156 ----> is ----> 2
As per the documentation of the arguments, it states:
vocab_list - list<str>, list of subwords for the vocabulary. Note that
an underscore at the end of a subword indicates the end of the word
(i.e. a space will be inserted afterwards when decoding). Underscores
in the interior of subwords are disallowed and should use the
underscore escape sequence.
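In other words, the vocabulary holds two distinct subwords: 'is_' (token 13), a word-final 'is' that decodes with a trailing space (hence length 3), and 'is' (token 156), used when no space follows, such as at the end of the string (length 2). A toy sketch of that underscore convention, using an invented two-entry vocabulary rather than a learned one:

```python
# Toy illustration of the trailing-underscore convention (invented vocab;
# real SubwordTextEncoder vocabularies are learned from a corpus).
vocab = {13: 'is_', 156: 'is'}

def decode(token_id):
    # An underscore at the end of a subword marks the end of a word,
    # so it decodes to the subword plus a space.
    subword = vocab[token_id]
    return subword[:-1] + ' ' if subword.endswith('_') else subword

print(repr(decode(13)))   # 'is ' (length 3: word-final, space appended)
print(repr(decode(156)))  # 'is'  (length 2: no trailing space)
```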

AttributeError: 'float' object has no attribute 'translate' Python

I'm working on some NLP with textual data from doctors, just trying to do some basic preprocessing and text cleaning: removing stop words and punctuation. I have already given the program a list of punctuation characters and stop words.
My text data looks something like this:
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells"
Then my code looks like:
import string

# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Create a function to remove stop words
def remove_stop_words(x):
    x = ' '.join([i for i in x.split(' ') if i not in stop])
    return x

# Create a function to lowercase the words
def to_lower(x):
    return x.lower()
So then I try to apply the functions to the Text column
train['Text'] = train['Text'].apply(remove_punctuation)
train['Text'] = train['Text'].apply(remove_stop_words)
train['Text'] = train['Text'].apply(lower)
And I get an error message like:
--------------------------------------------------------------------------- AttributeError Traceback (most recent call
last) in
----> 1 train['Text'] = train['Text'].apply(remove_punctuation)
2 train['Text'] = train['Text'].apply(remove_stop_words)
3 train['Text'] = train['Text'].apply(lower)
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in
apply(self, func, convert_dtype, args, **kwds) 3192
else: 3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype) 3195 3196 if len(mapped) and
isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
in remove_punctuation(sentence)
3 # Create a function to remove punctuations
4 def remove_punctuation(sentence: str) -> str:
----> 5 return sentence.translate(str.maketrans('', '', string.punctuation))
6
7 # Create a function to remove stop words
AttributeError: 'float' object has no attribute 'translate'
Why am I getting this error? I'm guessing it's because digits appear in the text?
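Most likely it is not the digits. The usual culprit is missing values: pandas reads empty cells as NaN, which is a float, and floats have no .translate method. A minimal sketch (toy rows, not the real medical corpus) of the failure mode and one fix:

```python
import string
import pandas as pd

# Toy stand-in for train['Text']; the empty cell is read as NaN, a float,
# which is what raises "'float' object has no attribute 'translate'".
train = pd.DataFrame({'Text': ['Cyclin-dependent kinases (CDKs)!', None]})

def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Either drop the missing rows or replace them with empty strings first:
train['Text'] = train['Text'].fillna('')
train['Text'] = train['Text'].apply(remove_punctuation)
print(train['Text'].tolist())  # ['Cyclindependent kinases CDKs', '']
```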

NLTK gives error expected string or bytes-like object

I imported a dataset (.csv) with pandas. The first column is the one with tweets; I rename it and transform it to a numpy array as usual with .values. Then I start the pre-processing with NLTK. It works pretty much every time, except for this dataset: it gives me the error TypeError: expected string or bytes-like object and I can't figure out why. The text contains some weird stuff, but far from the worst I've seen. Can someone help out?
data = pd.read_csv("facebook.csv")
text = data["Anonymized Message"].values
X = []
for i in range(0, len(text)):
    tweet = re.sub("[^a-zA-Z]", " ", text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    X.append(tweet)
gives me this error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-a08c1779c787> in <module>()
1 text_train = []
2 for i in range(0, len(text)):
----> 3 tweet = re.sub("[^a-zA-Z]", " ", text[i])
4 tweet = tweet.lower()
5 tweet = tweet.split()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
189 a callable, it's passed the match object and must return
190 a replacement string to be used."""
--> 191 return _compile(pattern, flags).sub(repl, string, count)
192
193 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
Here's the dataset
http://wwbp.org/downloads/public_data/dataset-fb-valence-arousal-anon.csv
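A likely explanation: the 'Anonymized Message' column contains empty cells that pandas loads as NaN floats, and re.sub then raises expected string or bytes-like object. A minimal sketch (toy rows standing in for the CSV; the stemming and stopword steps are omitted for self-containment) of the failure mode and a fix:

```python
import re
import pandas as pd

# Toy stand-in for the Facebook dataset: one normal message, one empty
# cell that pandas reads as NaN (a float), which breaks re.sub.
data = pd.DataFrame({'Anonymized Message': ['Feeling great today!!', None]})

# Drop missing rows (or use .fillna('')) before the regex cleaning loop
text = data['Anonymized Message'].dropna().values

X = []
for message in text:
    tweet = re.sub('[^a-zA-Z]', ' ', message).lower().split()
    X.append(' '.join(tweet))
print(X)  # ['feeling great today']
```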

'float' object has no attribute 'lower'

I'm facing this error and I'm really not able to find the reason for it.
Can somebody please point out the reason for it ?
for i in tweet_raw.comments:
    mns_proc.append(processComUni(i))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-416-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_processed.append(processtwt(i))
3
<ipython-input-414-4e1b8a8fb285> in processtwt(tweet)
4 #Convert to lower case
5 #tweet = re.sub('RT[\s]+','',tweet)
----> 6 tweet = tweet.lower()
7 #Convert www.* or https?://* to URL
8 #tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','',tweet)
AttributeError: 'float' object has no attribute 'lower'
A second similar error that facing is this :
for i in tweet_raw.comments:
    tweet_proc.append(processtwt(i))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-423-439073b420d1> in <module>()
1 for i in tweet_raw.comments:
----> 2 tweet_proc.append(processtwt(i))
3
<ipython-input-421-38fab2ef704e> in processComUni(tweet)
11 tweet=re.sub(('[http]+s?://[^\s<>"]+|www\.[^\s<>"]+'),'', tweet)
12 #Convert #username to AT_USER
---> 13 tweet = re.sub('#[^\s]+',' ',tweet)
14 #Remove additional white spaces
15 tweet = re.sub('[\s]+', ' ', tweet)
C:\Users\m1027201\AppData\Local\Continuum\Anaconda\lib\re.pyc in sub(pattern, repl, string, count, flags)
149 a callable, it's passed the match object and must return
150 a replacement string to be used."""
--> 151 return _compile(pattern, flags).sub(repl, string, count)
152
153 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or buffer
Shall I check whether or not a particular tweet is a string before passing it to the processtwt() function? For this error I don't even know which line it's failing at.
Just try using this:
tweet = str(tweet).lower()
Lately, I've been facing many of these errors, and converting them to a string before applying lower() always worked for me.
My answer will be broader than shalini's. If you want to check whether the object is of type str, I suggest you check the type using isinstance() as shown below. This is the more Pythonic way.
tweet = "stackoverflow"

## best way of doing it
if isinstance(tweet, (str,)):
    print(tweet)

## other way of doing it
if type(tweet) is str:
    print(tweet)

## This is one more way to do it
if type(tweet) == str:
    print(tweet)
All the above works fine to check the type of object is string or not.
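Combining the two answers with the original loop: skip non-string values (the NaN floats pandas inserts for empty cells) before calling the cleaning function. A minimal sketch with invented data and a stub processtwt:

```python
import pandas as pd

# Toy stand-in for tweet_raw; the empty cell becomes NaN, a float, which
# is what raises "'float' object has no attribute 'lower'".
tweet_raw = pd.DataFrame({'comments': ['Great MATCH!', None, 'so COOL']})

def processtwt(tweet):
    # Hypothetical minimal version of the real cleaning function
    return tweet.lower()

tweet_processed = []
for i in tweet_raw.comments:
    if isinstance(i, str):  # skip NaN floats instead of crashing
        tweet_processed.append(processtwt(i))
print(tweet_processed)  # ['great match!', 'so cool']
```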
