Estimate log similarity across two pd df columns with nltk - python

My Python is a little rusty, and I feel I have given this a pretty solid try before reaching out.
I have a dataset of two columns, each containing n rows of words. I would like to create a new column within this same df that displays the Leacock-Chodorow similarity of each word pair.
Here is how I have attacked the problem. I think I am writing it the way I would in R, which might be the cause of the final problem.
Thanks in advance!
#import libraries
import pandas as pd
from nltk.corpus import wordnet as wn
Create the dataframe:
df = {'A':["cat", "dog", "human"],'B':['bell','leash','clothes']}
df = pd.DataFrame(df)
For single words, this is how I would calculate the LCH estimate:
cat = wn.synset('cat.n.01')
bell = wn.synset('bell.n.01')
wn.lch_similarity(cat, bell)
In an effort to get these estimates for a new column, I followed these steps.
First I appended ".n.01" to each word and then created the synset objects:
df["A2"] = df["A"] + ".n.01"
df["A3"] = df["A2"].apply(wn.synset)
df["B2"] = df["B"] + ".n.01"
df["B3"] = df["B2"].apply(wn.synset)
Now that columns A3 and B3 hold the synset representations needed for the analysis, I run the following:
df["lch"] = wn.lch_similarity(df["A3"],df["B3"])
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-27-e5518c48104e> in <module>
----> 1 df["lch"] = wn.lch_similarity(df["A3"], df["B3"])
~\anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in lch_similarity(self, synset1, synset2, verbose, simulate_root)
1772
1773 def lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True):
-> 1774 return synset1.lch_similarity(synset2, verbose, simulate_root)
1775
1776 lch_similarity.__doc__ = Synset.lch_similarity.__doc__
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'lch_similarity'


Problem about printing certain rows without using Pandas

I want to print out the first 5 rows of the data from sklearn.datasets.load_diabetes. I tried head() and iloc, but neither worked. What should I do?
Here is my work
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
# 2. Load the data (use .load_diabetes() function )
df = datasets.load_diabetes()
df
# 3. Print out feature names and target names
# Features Names
x = df.feature_names
x
# Target Names
y = df.target
y
# 4. Print out the first 5 rows of the data
df.head(5)
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
113 try:
--> 114 return self[key]
115 except KeyError:
KeyError: 'head'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
114 return self[key]
115 except KeyError:
--> 116 raise AttributeError(key)
117
118 def __setstate__(self, state):
AttributeError: head
According to the documentation for load_diabetes() it doesn't return a Pandas dataframe by default, so no wonder it doesn't work.
You can apparently do
df = datasets.load_diabetes(as_frame=True).data
if you want a dataframe.
If you don't want a dataframe, you need to read up on how Numpy array slicing works, since that's what you get by default.
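For instance, a minimal sketch of the default (NumPy) route, which slices the raw feature matrix directly:

```python
from sklearn import datasets

data = datasets.load_diabetes()   # a Bunch holding NumPy arrays by default
print(data.data[:5])              # first 5 rows of the feature matrix
print(data.feature_names)
```

`data.data[:5]` is plain NumPy slicing; no DataFrame (and hence no head()) is involved.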
Well, I thank Mr. AKX for giving me a useful hint. I found my answer:
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
import pandas as pd
# 2. Load the data (use .load_diabetes() function )
data = datasets.load_diabetes()
# 3. Print out feature names and target names
# Features Names
x = data.feature_names
x
# Target Names
y = data.target
y
# 4. Print out the first 5 rows of the data
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head(5)
The method load_diabetes() doesn't return a DataFrame by default, but if you are using sklearn 0.23 or higher you can set the as_frame parameter to True; the returned bunch then carries the data as a pd.DataFrame:
df = datasets.load_diabetes(as_frame=True).frame
Then you can call the head method and it will show you the first 5 rows; there is no need to pass 5, since that is the default.
print(df.head())

Why are some of my columns of my data not recognized on my data frame after importing a csv file to python

Here is my code
import pandas as pd
finance=pd.read_csv("C:/Users/hp/Desktop/Financial Sample.csv")
finance.Profit.describe()
And the error
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in <module>
----> 1 finance.Profit.describe()
~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180
   5181     def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Profit'
According to your submitted error, here is the correct syntax to describe the Profit column:
finance['Profit'].describe()
This syntax will work if the column name has been parsed as how it was saved (case-sensitive):
finance['Profit'].describe()
However, sometimes extra characters end up in the column name when the file is read, so calling it by the saved name can still fail. To avoid this, you can also select the column by position with .iloc:
finance.iloc[:, 0].describe()  # 0 = column number, starts from 0
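A quick way to check what the column was actually parsed as, and to repair it, is to inspect finance.columns and strip stray whitespace. A minimal sketch using an in-memory stand-in for the real CSV (note the stray space before "Profit"):

```python
import io
import pandas as pd

# Stand-in for the real file: the header contains " Profit" with a space.
csv = io.StringIO("Segment, Profit\nA,100\nB,200\n")
finance = pd.read_csv(csv)
print(list(finance.columns))   # the leading space in ' Profit' is visible here

# Normalize the names, after which attribute/bracket access works as expected.
finance.columns = finance.columns.str.strip()
print(finance['Profit'].describe())
```

Printing the columns list (rather than the frame itself) makes invisible whitespace easy to spot.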

AttributeError: 'Series' object has no attribute 'label'

I'm trying to follow a tutorial on sound classification in neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all reach a snag at this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.
def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')
    # handle exception to check if there isn't a file which is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract mfcc feature from data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None, None
    feature = mfccs
    label = row.Class
    return [feature, label]

temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']

from sklearn.preprocessing import LabelEncoder
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
As mentioned, I've seen three different tutorials on the same subject, all of which end with the same "temp = train.apply(parser, axis=1) temp.columns = ['feature', 'label']" fragment, so I'm assuming this is assigning correctly, but I don't know where it's going wrong otherwise. Help appreciated!
Edit: Traceback as requested, turns out I'd added the wrong traceback. Also I've since found out that this is a case of converting the series object to a dataframe, so any help with that would be great.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1613f53e2d98> in <module>()
1 from sklearn.preprocessing import LabelEncoder
2
----> 3 X = np.array(temp.feature.tolist())
4 y = np.array(temp.label.tolist())
5
/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'feature'
Your current implementation of the parser(row) method returns a list for each row of data from the train DataFrame, but apply collects these as a pandas.Series object.
So your temp is actually a Series. The following line then has no effect:
temp.columns = ['feature', 'label']
Since temp is a Series, it has no columns, so temp.feature and temp.label don't exist; hence the error.
Change your parser() method as following:
def parser(row):
    ...
    ...
    ...
    # Return pandas.Series instead of a list
    return pd.Series([feature, label])
By doing this, the apply in temp = train.apply(parser, axis=1) will return a DataFrame, and the rest of your code will work.
I cannot speak to the tutorials you are following. Maybe they were written for an older version of pandas that automatically converted the returned lists to a DataFrame.
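Since the edit also asked about converting the Series itself to a DataFrame, here is a minimal sketch of that route, using a stand-in Series in place of the real train.apply(parser, axis=1):

```python
import pandas as pd

# Stand-in for temp = train.apply(parser, axis=1) when parser returns lists:
temp = pd.Series([[0.1, 'dog'], [0.2, 'cat']])

# Expand each 2-element list into its own column.
temp = pd.DataFrame(temp.tolist(), columns=['feature', 'label'])
print(temp.label.tolist())  # ['dog', 'cat']
```

This leaves parser() untouched and does the reshaping after the fact, at the cost of one extra copy of the data.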

Pandas Error Matching String

I have data like the SampleDf data below. I'm trying to check the values in one column of my dataframe to see if they contain 'sum', 'count', or 'Avg', and then create a new column with the value 'sum', 'count', or 'Avg'. When I run the code below on my real dataframe I get the error below; running dtypes on the real dataframe says all the columns are objects. The code below is related to the post below. Unfortunately I don't get the same errors when I run the code on the SampleDf I've provided, but I couldn't post my whole dataframe.
post:
Pandas and apply function to match a string
Code:
SampleDf = pd.DataFrame([['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
                         ['bob', "isnull(Avg(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then LOS end),0)"]],
                        columns=['ReportField', 'OtherField'])

search1 = 'Sum'
search2 = 'Count'
search3 = 'Avg'

def Agg_type(x):
    if search1 in x:
        return 'sum'
    elif search2 in x:
        return 'count'
    elif search3 in x:
        return 'Avg'
    else:
        return 'Other'

SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
SampleDf.head()
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2b4920246a7> in <module>()
17 return 'Other'
18
---> 19 SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
20
21 #SampleDf.head()
C:\Users\Name\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:66124)()
<ipython-input-17-a2b4920246a7> in Agg_type(x)
8
9 def Agg_type(x):
---> 10 if search1 in x:
11 return 'sum'
12 elif search2 in x:
TypeError: argument of type 'float' is not iterable
You can try this (it assumes numpy is imported as np):
SampleDf['new_col'] = np.where(SampleDf.OtherField.str.contains("Avg"), "Avg",
                      np.where(SampleDf.OtherField.str.contains("Count"), "Count",
                      np.where(SampleDf.OtherField.str.contains("Sum"), "Sum", "Nothing")))
Please notice that this will work properly only if you don't have both Avg and Count (or Sum) in the same string.
If you do, let me know and I'll look for a better approach.
Of course, if something else doesn't suit your needs, also report it back.
Hope this was helpful.
Explanation:
What's happening is that you look for the indexes where "Avg" appears in the string inside the OtherField column and fill new_col with "Avg" at those indexes. For the remaining fields (where there isn't "Avg") you look for "Count" and do the same, and finally you do the same for "Sum".
Documentation:
np.where
pandas.Series.str.contains
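One caveat worth adding to the answer above (my note, not from the thread): the asker's TypeError suggests the real column contains non-string values such as NaN. str.contains returns NaN for those by default, and np.where treats NaN as truthy, so passing na=False keeps missing values out of the matched branches. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Sum(x)", "Count(y)", np.nan])

# Without na=False the NaN row would yield NaN from str.contains,
# which np.where would then treat as a match for the first branch.
out = np.where(s.str.contains("Sum", na=False), "sum",
      np.where(s.str.contains("Count", na=False), "count", "Other"))
print(out)
```

The same na=False flag slots directly into the nested np.where answer above.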

failing simple groupby example from "Python for Data Analysis" text

I just started learning python (mostly as an open-source replacement for matlab, using "ipython --pylab"), going through the examples from the "Python for Data Analysis" text. On page 253, a simple example is shown using 'groupby' (passing a list of arrays). I repeat it exactly as in the text, but I get this error:
"TypeError: 'Series' objects are mutable, thus they cannot be hashed"
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df['data1'].groupby(df['key1'])
means = df['data1'].groupby(df['key1'],df['key2']).mean()
-----DETAILS OF TYPEERROR-------
TypeError Traceback (most recent call last)
<ipython-input-7-0412f2897849> in <module>()
----> 1 means = df['data1'].groupby(df['key1'],df['key2']).mean()
/home/joeblow/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
2725
2726 from pandas.core.groupby import groupby
-> 2727 axis = self._get_axis_number(axis)
2728 return groupby(self, by, axis=axis, level=level, as_index=as_index,
2729 sort=sort, group_keys=group_keys, squeeze=squeeze)
/home/joeblow/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_axis_number(self, axis)
283
284 def _get_axis_number(self, axis):
--> 285 axis = self._AXIS_ALIASES.get(axis, axis)
286 if com.is_integer(axis):
287 if axis in self._AXIS_NAMES:
/home/joeblow/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.pyc in __hash__(self)
639 def __hash__(self):
640 raise TypeError('{0!r} objects are mutable, thus they cannot be'
--> 641 ' hashed'.format(self.__class__.__name__))
642
643 def __iter__(self):
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What simple thing am I missing here?
You didn't do it exactly as in the text. :^)
>>> means = df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> means
key1 key2
a one 1.127536
two 1.220386
b one 0.402765
two -0.058255
dtype: float64
If you're grouping by two arrays, you need to pass a list of the arrays. You instead passed two arguments: (df['key1'],df['key2']), which are being interpreted as by and axis.
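As a side note (my addition, not from the answer above), an equivalent and arguably more common spelling groups the frame by column names, which avoids building the list of Series entirely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Group by column names, then select the column to aggregate.
means = df.groupby(['key1', 'key2'])['data1'].mean()
print(means)  # one mean per (key1, key2) pair
```

Here the list ['key1', 'key2'] is unambiguously the `by` argument, so the axis-misinterpretation pitfall cannot arise.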
