How to remove stopwords in gensim? - python

df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x))
I tried this on a dataframe's column 'message' but I get the error:
TypeError: decoding to str: need a bytes-like object, list found

Apparently, the df_clean["message"] column contains lists of words, not strings, hence the error saying it needs a bytes-like object but found a list.
To fix this, convert each list back to a string with the join() method, like so:
df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(" ".join(x)))
Notice that the df_clean["message"] will contain string objects after applying the previous code.
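If you would rather keep the column as lists of tokens instead of joining them back into strings, here is a minimal sketch using gensim's STOPWORDS frozenset (this assumes every row really is a list of tokens):
from gensim.parsing.preprocessing import STOPWORDS
# keep only the tokens that are not gensim stopwords
df_clean['message'] = df_clean['message'].apply(lambda tokens: [t for t in tokens if t not in STOPWORDS])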

This is not a gensim-specific problem: there is a value in your message column that is of type list instead of str. Here's a minimal pandas example that reproduces the error:
import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords
df = pd.DataFrame([['one', 'two'], ['three', ['four']]], columns=['A', 'B'])
df.A.apply(remove_stopwords) # works fine
df.B.apply(remove_stopwords)
TypeError: decoding to str: need a bytes-like object, list found

What the error is saying is that remove_stopwords needs a string and you are passing a list. So before removing stop words, check that all the values in the column are of string type. See the Docs.
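A minimal sketch of that check, assuming the DataFrame is named df_clean as in the question:
# see which Python types actually occur in the column
print(df_clean['message'].map(type).value_counts())
# convert any list entries back into space-separated strings
df_clean['message'] = df_clean['message'].apply(lambda x: " ".join(x) if isinstance(x, list) else x)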

Related

IBM Data Science: float() argument must be a string or a number, not 'method'

I'm trying to run the following code:
# calculate the mean value for the "stroke" column
avg_stroke=df['stroke'].astype('float').mean(axis=0)
print("Average of stroke:", avg_stroke)
However, I keep getting the following error:
TypeError: float() argument must be a string or a number, not 'method'
I have used the same code structure in different parts of my script and achieve a nice clean mean:
#Write your code below and press Shift+Enter to execute
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
I've already ruled out any suggestions/answers from these SE answers:
TypeError: float() argument must be a string or a number, not 'method' - Multiple variable regression
TypeError: float() argument must be a string or a number, not 'method'
Are you aware that the column contains strings with the value "?"?
You have to remove it first. Other than that the code works for me.
import pandas as pd
url = r"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
col_names = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style", "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type", "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower", "peak-rpm","city-mpg","highway-mpg","price"]
df = pd.read_csv(url, names=col_names)
df = df[df['stroke']!="?"]
avg_stroke = df['stroke'].astype('float').mean(axis=0)
print("Average of stroke:", avg_stroke)
#out: Average of stroke: 3.255422885572139
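An alternative sketch that keeps all rows instead of dropping them: replace the "?" placeholders with NaN and let mean() skip them (this reuses the url and col_names defined above; treating "?" as a missing value is an assumption about the data, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.read_csv(url, names=col_names)
# "?" marks a missing value here; turn it into NaN so the column can be numeric
df['stroke'] = pd.to_numeric(df['stroke'].replace('?', np.nan))
# mean() ignores NaN by default
print("Average of stroke:", df['stroke'].mean())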

How do I get the isalnum code to work for the specific problem?

I've already applied the following code:
my1name = 'Alicia'
my2name = 'Shifflett'
yearP = '2019'
my1nameSpace = my1name + ''
print(my1name.isalnum())
print(my1name.isalnum())
print(yearP.isalnum())
The task wants me to apply isalnum() to the string my1name+my2name. I tried the following:
print(my1name+my2name.isalnum())
But I keep getting the error: TypeError: can only concatenate str (not "bool") to str
How do I type this out correctly to get it to work?
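The error happens because .isalnum() binds more tightly than +, so my2name.isalnum() evaluates to a bool before the concatenation. A minimal sketch of a fix is to parenthesise the concatenation so the whole combined string is tested:
print((my1name + my2name).isalnum())
# True, since 'AliciaShifflett' contains only letters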

Sort a list by specific location in string

I've got a list of strings and I want to sort this by a specific part of the string only, not the full string.
I would like to sort the entire list, focusing only on the second-last part.
When I use the regular sort() function, it sorts using the full string value.
I've tried using the 'key=' option with split('_') but somehow I am not able to get it working.
# Key to sort profile files
def sortprofiles(item):
    item.split('_')[-2]
# Input
local_hostname = 'ma-tsp-a01'
profile_files = ['/path/to/file/TSP_D01_ma-tsp-a01\n',
                 '/path/to/file/TSP_D02_ma-tsp-a02\n',
                 '/path/to/file/TSP_ASCS00_ma-tsp-a01\n',
                 '/path/ato/file/TSP_DVEBMGS03_ma-tsp-a03\n',
                 '/path/to/file/TSP_DVEBMGS01_ma-tsp-a01\n']
# Do stuff
profile_files = [i.split()[0] for i in profile_files]
profile_files.sort(key=sortprofiles)
print(profile_files)
I currently get the following error message:
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'
I would like to get the list sorted as: ['/path/to/file/TSP_ASCS00_ma-tsp-a01', '/path/to/file/TSP_D01_ma-tsp-a01', '/path/to/file/TSP_D02_ma-tsp-a02', '/path/to/file/TSP_DVEBMGS01_ma-tsp-a01', '/path/ato/file/TSP_DVEBMGS03_ma-tsp-a03']
You could use a lambda expression and try
profile_files = sorted(profile_files, key=lambda x: x.split('_')[1])
Each string in the list is split on '_', and the second part is used as the sort key.
But this may not work if the strings are not in the format that you expect.
You are not returning the key you want to sort on. Return it from the sortprofiles function and your sort will work as expected.
Before, you were not returning anything, which is equivalent to returning None, and when the sort runs a comparison operator like < on None, you get the exception TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'.
So the below will work
def sortprofiles(item):
    # You need to return the key you want to sort on
    return item.split('_')[-2]
local_hostname = 'ma-tsp-a01'
profile_files = ['/path/to/file/TSP_D01_ma-tsp-a01\n',
                 '/path/to/file/TSP_D02_ma-tsp-a02\n',
                 '/path/to/file/TSP_ASCS00_ma-tsp-a01\n',
                 '/path/ato/file/TSP_DVEBMGS03_ma-tsp-a03\n',
                 '/path/to/file/TSP_DVEBMGS01_ma-tsp-a01\n']
print(sorted(profile_files, key=sortprofiles))
The output will then be
['/path/to/file/TSP_ASCS00_ma-tsp-a01\n', '/path/to/file/TSP_D01_ma-tsp-a01\n', '/path/to/file/TSP_D02_ma-tsp-a02\n', '/path/to/file/TSP_DVEBMGS01_ma-tsp-a01\n', '/path/ato/file/TSP_DVEBMGS03_ma-tsp-a03\n']
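If you also want to drop the trailing newlines so the result matches the list the question asks for, a small sketch reusing the same sortprofiles key:
# strip the trailing '\n' first, then sort on the second-last '_' part
profile_files = [line.strip() for line in profile_files]
print(sorted(profile_files, key=sortprofiles))
# ['/path/to/file/TSP_ASCS00_ma-tsp-a01', '/path/to/file/TSP_D01_ma-tsp-a01', ...]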

AttributeError: 'NoneType' object has no attribute 'lstrip'

I have a dataframe in Pandas that is giving me the error below when I try to strip it of certain characters:
AttributeError: 'NoneType' object has no attribute 'lstrip'
I began by removing any missing or null values:
df_sample1['counties'].fillna('missing')
Inspecting it, I see a lot of unclean data: a mix of actual data (County 1, County 2, ..., County n) as well as gibberish ($%ZYC 2).
To clean this further, I ran the following code:
df_sample1['counties'] = df_sample1['counties'].map(lambda x: x.lstrip('+%=/-#$;!\(!\&=&:%;').rstrip('1234567890+%=/-#$;!\(!\&=&:%;'))
df_sample1[:10]
This generates the 'NoneType' error.
I dug around a little, and the pandas documentation has some hints about skipping missing values.
if df_sample1['counties'] is None:
    pass
else:
    df_sample1['counties'].map(lambda x: x.lstrip('+%=/-#$;!\(!\&=&:%;').rstrip('1234567890+%=/-#$;!\(!\&=&:%;'))
This still generates the NoneType error mentioned above. Could someone point out what I'm doing wrong?
You can "skip" the None by checking if x is truthy before doing the stripping...
df_sample1['counties'].map(lambda x: x and x.lstrip('+%=/-#$;!\(!\&=&:%;').rstrip('1234567890+%=/-#$;!\(!\&=&:%;'))
This will probably leave some None in the dataframe (in the same places that they were before), but the transform should still work on the strings.
If you are working with text data, why not simply fill the None values with an empty string first?
df_sample1['counties'].fillna("", inplace=True)
I suspect that your issue is that when you filled your missing values, you didn't do it inplace. This could be addressed by:
df_sample1['counties'].fillna('missing', inplace=True)
Or, when applying pandas.Series.map, you could use the argument na_action to leave these entries as None.
df_sample1['counties'] = df_sample1['counties'].map(lambda x: ..., na_action='ignore')
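Another sketch, using pandas' string accessor, which skips missing values by itself (this assumes every non-null entry in counties is a string):
# raw string avoids invalid escape-sequence warnings for the backslashes
chars = r'+%=/-#$;!\(!\&=&:%;'
df_sample1['counties'] = df_sample1['counties'].str.lstrip(chars).str.rstrip('1234567890' + chars)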

Append string to a numpy ndarray

I'm working with a numpy array called "C_ClfGtLabels" in which 374 artist/creator names are stored. I want to append a 375th artist class with a string "other artists". I thought I could just do that as follows:
C_ClfGtLabels.append('other artists')
However, this results in the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'append'
I found this problem a few times on Stack Overflow, and in one case the answer was to use concatenate instead of append. When I tried that, I got the following error:
TypeError: don't know how to convert scalar number to int
It seems the problem is that the array's datatype does not match the datatype of what I am trying to append/concatenate, which would be a string. However, I don't know what I should do to make them match. The data inside the C_ClfGtLabels array is as follows:
[u"admiral, jan l'" u'aldegrever, heinrich' u'allard, abraham'
u'allard, carel' u'almeloveen, jan van' u'altdorfer, albrecht'
u'andriessen, jurriaan' u'anthonisz., cornelis' u'asser, eduard isaac' ..]
Any advice on how I can setup the "other artists" string so that I can append it to C_ClfGtLabels?
A quick workaround is to convert your C_ClfGtLabels into a list first, append, and convert it back into an ndarray
import numpy as np

lst = list(C_ClfGtLabels)
lst.append('other artists')
C_ClfGtLabels = np.asarray(lst)
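If you prefer to stay in NumPy, a sketch with np.concatenate should also work, assuming C_ClfGtLabels is a one-dimensional array: wrap the new label in a one-element array so both operands are arrays.
import numpy as np

# concatenating two arrays avoids the scalar-conversion error
C_ClfGtLabels = np.concatenate([C_ClfGtLabels, np.array(['other artists'], dtype=object)])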
