AttributeError: 'Series' object has no attribute 'label' - python

I'm trying to follow a tutorial on sound classification with neural networks, and I've found three different versions of the same tutorial. They all work up to this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't figure it out myself.
def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')
    # handle exception to check if there isn't a file which is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract mfcc feature from data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print("Error encountered while parsing file: ", file)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]
temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']
from sklearn.preprocessing import LabelEncoder
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
As mentioned, I've seen three different tutorials on the same subject, all of which end with the same "temp = train.apply(parser, axis=1) temp.columns = ['feature', 'label']" fragment, so I'm assuming this is assigning correctly, but I don't know where it's going wrong otherwise. Help appreciated!
Edit: Traceback as requested; turns out I'd added the wrong traceback. Also, I've since found out that this is a case of converting the Series object to a DataFrame, so any help with that would be great.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1613f53e2d98> in <module>()
1 from sklearn.preprocessing import LabelEncoder
2
----> 3 X = np.array(temp.feature.tolist())
4 y = np.array(temp.label.tolist())
5
/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'feature'

Your current implementation of the parser(row) method returns a list for each row of the train DataFrame, but the result of apply is then collected as a pandas.Series object.
So your temp is actually a Series object, and the following line doesn't have any effect:
temp.columns = ['feature', 'label']
Since temp is a Series, it does not have any columns, so temp.feature and temp.label don't exist, hence the error.
Change your parser() method as follows:
def parser(row):
    ...
    ...
    ...
    # Return pandas.Series instead of a list
    return pd.Series([feature, label])
By doing this, the apply method from temp = train.apply(parser, axis=1) will return a DataFrame, so your other code will work.
I cannot speak for the tutorials you are following. Maybe they were written for an older version of pandas which automatically converted the returned lists into a DataFrame.
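Alternatively, if you prefer to keep parser() returning a plain list, here is a minimal sketch of converting the resulting Series of [feature, label] pairs into a DataFrame afterwards (assuming train, parser and the setup from the question):
import numpy as np
import pandas as pd

temp = train.apply(parser, axis=1)                                # Series of [feature, label] lists
temp = pd.DataFrame(temp.tolist(), columns=['feature', 'label'])  # convert to a two-column DataFrame

X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())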

Related

preprocessing class error, "AttributeError: 'function' object has no attribute 'str'"

I did an NLP project earlier; now I have pickled the model and am trying to apply it to a new dataset, which I scraped from Twitter. Of course the new dataframe doesn't have the same columns as the old dataset, so I am making a class to preprocess the data and bring it closer to the old dataframe that was used for the NLP project. This is what I did:
class Preprocesser():
    def __init__(self):
        pass
    def fit(self, text_column):
        df = pd.DataFrame(text_column)
        df.text_length = self.text_length(text_column)
        df.num_capital_letters = self.num_capital_letters(text_column)
        df.percentage_of_capital_letters = self.percentage_of_capital_letters(text_column)
        df.greater_than_50_percent = self.greater_than_50_percent(text_column)
        df.reading_level = self.reading_level(text_column)
        #df =pd.DataFrame(Text.df_user_tweets
        return df
    def text_length(self,column):
        return column.apply(lambda x: len(x))
    def num_capital_letters(self,column):
        return column.apply.str.findall(r"[A-Z]").str.len()
    def percentage_of_capital_letters(self,column):
        return column.apply.str.findall(r"[A-Z]").str.len()/column.apply(lambda x: len(x))
    def greater_than_50_percent(self,column):
        return column.apply(lambda x: x>= .5 )
    def reading_level(self,column):
        return column.apply(lambda x :textstat.flesch_reading_ease(x))

pre = Preprocesser()
pre.fit(text_column = df_user_tweets.Text)
This is the error that I got
<ipython-input-136-3b74ba5d2425> in num_capital_letters(self, column)
17 return column.apply(lambda x: len(x))
18 def num_capital_letters(self,column):
---> 19 return column.apply.str.findall(r"[A-Z]").len()
20 def percentage_of_capital_letters(self,column):
21 return column.apply.str.findall(r"[A-Z]").str.len()/column.apply(lambda x: len(x))
AttributeError: 'function' object has no attribute 'str'
It sounds like my error is in line 19, but I'm not sure what I need to do to fix it. Any help appreciated.
df_user_tweets.Text is of type pd.Series, and apply is a method on that Series: it takes a lambda function to do some work on the values of the Series (which is a column). The method object itself does not have an str attribute, which is why column.apply.str.findall(...) fails.
So instead of column.apply.str.findall use column.str.findall.
You can find the pandas documentation here: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html?highlight=str#pandas.Series.str
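As a quick illustration (a standalone sketch, not the original class), the .str accessor handles both counts directly:
import pandas as pd

s = pd.Series(["Hello World", "PANDAS", "no caps"])

# Count capital letters per row via the Series .str accessor
num_caps = s.str.findall(r"[A-Z]").str.len()

# Share of capital letters relative to each string's length
pct_caps = num_caps / s.str.len()

print(num_caps.tolist())   # [2, 6, 0]
print(pct_caps.tolist())   # [0.18..., 1.0, 0.0]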

Estimate log similarity across two pd df columns with nltk

My python is a little rusty and I feel at this point I have given this a pretty solid try before reaching out.
I have a data set of two columns, each containing n rows of words. I would like to create a new column within this same df that displays the Leacock-Chodorow Similarity of each word combo.
Here is how I have attacked the problem. I think I am writing in the way I would do this in R, which might be leading to the final problem.
Thanks in advance!
#import libraries
import pandas as pd
from nltk.corpus import wordnet as wn
Create dataframe
df = {'A':["cat", "dog", "human"],'B':['bell','leash','clothes']}
df = pd.DataFrame(df)
For single words, this is how I would calculate the LCS estimate:
cat =wn.synset('cat.n.01')
bell =wn.synset('bell.n.01')
wn.lch_similarity(cat, bell)
In an effort to get these estimates for a new column, I followed these steps.
First append ".n.01" to each word and then create the synset object:
df["A2"] = df["A"] + ".n.01"
df["A3"] = df["A2"].apply(wn.synset)
df["B2"] = df["B"] + ".n.01"
df["B3"] = df["B2"].apply(wn.synset)
Now that columns A3 and B3 hold the synset representations needed for the analysis, I run the following:
df["lch"] = wn.lch_similarity(df["A3"],df["B3"])
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-27-e5518c48104e> in <module>
----> 1 df["lch"] = wn.lch_similarity(df["A3"], df["B3"])
~\anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in lch_similarity(self, synset1, synset2, verbose, simulate_root)
1772
1773 def lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True):
-> 1774 return synset1.lch_similarity(synset2, verbose, simulate_root)
1775
1776 lch_similarity.__doc__ = Synset.lch_similarity.__doc__
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'lch_similarity'
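As the traceback shows, wn.lch_similarity() expects two individual Synset objects, not whole Series, so it has to be applied one row at a time. A minimal sketch under that assumption, reusing the A3/B3 columns built above:
# Apply the similarity pairwise, row by row
df["lch"] = [wn.lch_similarity(a, b) for a, b in zip(df["A3"], df["B3"])]

# Equivalently, with DataFrame.apply:
# df["lch"] = df.apply(lambda row: wn.lch_similarity(row["A3"], row["B3"]), axis=1)

print(df[["A", "B", "lch"]])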

Problem about printing certain rows without using Pandas

I want to print out the first 5 rows of the data from sklearn.datasets.load_diabetes. I tried head() and iloc, but neither seems to work. What should I do?
Here is my work
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
# 2. Load the data (use .load_diabetes() function )
df = datasets.load_diabetes()
df
# 3. Print out feature names and target names
# Features Names
x = df.feature_names
x
# Target Names
y = df.target
y
# 4. Print out the first 5 rows of the data
df.head(5)
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
113 try:
--> 114 return self[key]
115 except KeyError:
KeyError: 'head'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
114 return self[key]
115 except KeyError:
--> 116 raise AttributeError(key)
117
118 def __setstate__(self, state):
AttributeError: head
According to the documentation for load_diabetes(), it doesn't return a pandas DataFrame by default, so no wonder it doesn't work.
You can apparently do
df = datasets.load_diabetes(as_frame=True).data
if you want a dataframe.
If you don't want a dataframe, you need to read up on how Numpy array slicing works, since that's what you get by default.
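For completeness, a minimal sketch of the default (non-DataFrame) route, slicing the NumPy arrays directly:
from sklearn import datasets

data = datasets.load_diabetes()
print(data.feature_names)   # list of feature column names
print(data.data[:5])        # first 5 rows of the feature array
print(data.target[:5])      # first 5 target values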
Well, I thank Mr. AKX for giving me a useful hint. I found my answer:
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
import pandas as pd
# 2. Load the data (use .load_diabetes() function )
data = datasets.load_diabetes()
# 3. Print out feature names and target names
# Features Names
x = data.feature_names
x
# Target Names
y = data.target
y
# 4. Print out the first 5 rows of the data
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head(5)
The method load_diabetes() doesn't return a DataFrame by default, but if you are using sklearn 0.23 or higher you can set the as_frame parameter to True and access the combined data as a pd.DataFrame via the frame attribute.
df = datasets.load_diabetes(as_frame=True).frame
Then you can call the head method and it will show you the first 5 rows; no need to specify 5.
print(df.head())

'Series' object has no attribute 'values_counts'

When I try to apply the values_counts() method to a Series within a function, I am told that 'Series' object has no attribute 'values_counts'.
def replace_1_occ_feat(col_list, df):
    for col in col_list:
        feat_1_occ = df[col].values_counts()[df[col].values_counts() == 1].index
        feat_means = df[col].groupby(col)['SalePrice'].mean()
        feat_means_no_1_occ = feat_means.iloc[feat_means.difference(feat_1_occ),:]
        for feat in feat_1_occ:
            # Find the closest mean SalePrice
            replacement = (feat_means_no_1_occ - feat_means.iloc[feat,:]).idxmin()
            df.col.replace(feat, replacement, inplace = True)
However when running df.column.values_count() outside a function it works.
The problem occurs on the first line, where the values_counts() method is used.
I checked the pandas version; it's 0.23.0.
The method is value_counts(): value is singular, counts is plural.
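For illustration, a small standalone sketch of value_counts(), including the pattern from the question of picking out values that occur exactly once:
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

# value_counts() returns a Series of counts indexed by value
counts = s.value_counts()
print(counts)                    # a: 3, b: 1, c: 1

# Values that occur exactly once, as in the question's feat_1_occ
singletons = counts[counts == 1].index
print(list(singletons))          # ['b', 'c'] (order of ties may vary)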

Saving each new dataframe created inside a for-loop in Python

I wrote a function that iterates over the files in a folder and selects certain data. The .csv files look like this:
Timestamp Value Result
00-00-10 34567 1.0
00-00-20 45425
00-00-30 46773 0.0
00-00-40 64567
00-00-50 25665 1.0
00-01-00 25678
00-01-10 84358
00-01-20 76869 0.0
00-01-30 95830
00-01-40 87890
00-01-50 99537
00-02-00 85957 1.0
00-02-10 58840
They are saved in the path C:/Users/me/Desktop/myfolder/data and I wrote the code in C:/Users/me/Desktop/myfolder. The function (after @Daniel R's suggestion):
PATH = os.getcwd()+'\DATA\\'

def my_function(SourceFolder):
    for i, file_path in enumerate(os.listdir(PATH)):
        df = pd.read_csv(PATH+file_path)
        mask = (
            (df.Result == 1)
            | (df.Result.ffill() == 1)
            | ((df.Result.ffill() == 0)
               & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100))
        )
        df = mask[df]
        df = df.to_csv(PATH+'df_{}.csv'.format(i))
My initial question was: how do I save each df[mask] to NewFolder without overwriting the data? The code above throws AttributeError: 'str' object has no attribute 'Result'.
AttributeError Traceback (most recent call last)
<ipython-input-3-14c0dbaf5ace> in <module>()
----> 1 retrieve_data('C:/Users/me/Desktop/myfolder/DATA/*.csv')
<ipython-input-2-ba68702431ca> in my_function(SourceFolder)
6 (df.Result == 1)
7 | (df.Result.ffill() == 1)
----> 8 | ((df.Result.ffill() == 0)
9 & (df.groupby((df.Result.ffill() != df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100)))
10 df = df[mask]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Result'
If your csv files have a structure that can be loaded into a pandas DataFrame:
import pandas as pd
import os

# Let '\DATA\\' be the directory where you keep your csv files, as a subdirectory of os.getcwd()
PATH = os.getcwd()+'\DATA\\'

def my_function(source_folder):
    for i, file_path in enumerate(os.listdir(PATH)):
        df = pd.read_csv(PATH+file_path)  # Use read_csv here, not DataFrame.
                                          # You are still working with a filepath, not a dictionary.
        mask = ( (df.Result == 1) | (df.Result.ffill() == 1) |
                 ((df.Result.ffill() == 0) &
                  (df.groupby((df.Result.ffill() !=
                   df.Result.ffill().shift()).cumsum()).Result.transform('size') <= 100))
               )
        df = df[mask]
        df = df.to_csv(PATH+'df_{}.csv'.format(i))
As a general rule, you should provide a sample of the data you are working on when asking questions like this one; the answers you receive may not work for you otherwise. Please update the question with a sample of a dataframe/csv file and mock contents of the directory, so I can update this answer.
If srcPath is different from os.getcwd(), you may have to compute the full path, or the path relative to os.getcwd(), before iterating over the files.
Also, the call to list() above may not be necessary; test the code with and without it.
Lastly, why does my_function() require two input variables?
As far as I can see, only one variable is required, which is srcPath, used in .glob(); it is not passed to the function, so it must be a global variable.
EDIT: I have updated the code above on the basis of the modifications to the original question and the comments to this post down below.
EDIT 2: Turns out that your call to glob.glob() did not produce what you wanted. See the updated code.
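To address the original question of writing each filtered result into a separate folder without overwriting anything, here is a minimal sketch; the destination folder name 'NewFolder' and the simplified mask are placeholders for illustration, not the code from the answer above:
import os
import pandas as pd

SRC = os.path.join(os.getcwd(), 'DATA')
DST = os.path.join(os.getcwd(), 'NewFolder')
os.makedirs(DST, exist_ok=True)                      # create the output folder if needed

for name in os.listdir(SRC):
    if not name.endswith('.csv'):
        continue
    df = pd.read_csv(os.path.join(SRC, name))
    filtered = df[df.Result.ffill() == 1]            # stand-in for the full mask
    # Reuse the source file name so each output is unique and nothing is overwritten
    filtered.to_csv(os.path.join(DST, 'filtered_' + name), index=False)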
